Intuitively understand Huffman trees and Huffman coding, and prove that Huffman coding is the optimal variable-length coding.
Course Content#
Prerequisite Knowledge#
- What is coding?
- Where did you first hear about coding? ASCII code
- 'a' = (97)₁₀ = (0110 0001)₂
- '0' = (48)₁₀ = (0011 0000)₂
- 8-bit binary encoding, one byte
- Note: In computers, any information is stored in binary.
- The role of coding: mapping human-readable characters to the binary representations the computer stores.
- Example: The impact of coding length
- There is a piece of information in the computer: "aa00", its ASCII encoding is: 01100001 01100001 00110000 00110000
- Now suppose this information is transmitted from one computer to another.
- In the network, the smallest transmission unit is the bit, so "aa00" requires transmitting 32 bits.
- Assume a computer's network speed is 32 bit/s; the transmission then takes 1 s.
- 👇
- Consider a special scenario: only four characters a, b, 0, 1 need to be transmitted.
- A new coding method can be designed [Pirate Class Coding - Fixed-Length Coding]
- Use 2-bit binary to distinguish them — a: 00, b: 01, 0: 10, 1: 11
- At this time, the representation of "aa00" only requires 8 bits.
- Therefore, under the same network speed, the current transmission time only needs: 0.25s.
- [PS] Compared to Tencent Meeting, Zoom's encoding and decoding are more refined, so the Zoom experience is better under the same network speed.
Fixed-Length and Variable-Length Coding#
- Fixed-length coding: the encoding length is the same for every character.
- ASCII code and the Pirate Class coding in specific scenarios are both fixed-length coding.
- [PS] UTF-8 is variable-length coding; UTF-16 is fixed-length within the BMP but becomes variable-length once surrogate pairs are needed [self-study].
- Variable-length coding: encoding lengths may differ across characters.
- Variable-length coding is never worse than fixed-length coding.
- Fixed-length coding is a special case of variable-length coding.
- 【Extended Understanding】
- A variable-length code may turn out to be fixed-length.
- Variable-length coding must be tailored to the specific scenario [the occurrence probabilities of the characters must be gathered in advance].
- A coding scheme can be seen as a protocol: agreed upon in advance and not changed dynamically.
- Encoding and decoding must remain consistent.
Application Scenarios of Variable-Length Coding#
- Specific scenarios
- Only four characters need to be transmitted: a, b, 0, 1.
- The probability of each character appearing differs: a: 0.8, b: 0.05, 0: 0.1, 1: 0.05.
- Average coding length
- avg(l) = ∑ li × pi
- li: the encoding length of the i-th character.
- pi: the occurrence probability of the i-th character.
- In fact, this is the expected value of the coding length.
- Example
- Assume the average coding length is 1.16; then transmitting 100 characters takes about 116 bits [an estimate].
- The average coding length of the Pirate Class coding [see above]: avg(l) = 2 × ∑ pi = 2 [since ∑ pi = 1].
- For any fixed-length code, the average coding length is simply that fixed length.
- New · Pirate Class Coding
- a: 1
- b: 01 [Note: b's code cannot start with 1, or it would conflict with a during decoding, ❌ prefix overlap].
- 0: 000
- 1: 001
- At this time, the average coding length: 1 * 0.8 + 2 * 0.05 + 3 * 0.1 + 3 * 0.05 = 1.35.
- Therefore, transmitting 100 characters only requires transmitting 135 bits [estimation].
- [PS] Characters with higher probability get shorter codes; the sketch after this list verifies both averages.
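A minimal sketch that verifies both averages. The probabilities and code lengths are taken from the scenario above; everything else (names, output format) is illustrative:

```cpp
#include <cstdio>

int main() {
    // Occurrence probabilities of a, b, 0, 1 (from the scenario above)
    double p[4] = {0.8, 0.05, 0.1, 0.05};
    int fixed_len[4] = {2, 2, 2, 2}; // Pirate Class fixed-length coding: 2 bits each
    int var_len[4]   = {1, 2, 3, 3}; // New Pirate Class coding: a:1, b:01, 0:000, 1:001
    double avg_fixed = 0, avg_var = 0;
    for (int i = 0; i < 4; i++) {
        avg_fixed += fixed_len[i] * p[i]; // avg(l) = sum of li * pi
        avg_var   += var_len[i]   * p[i];
    }
    printf("fixed: %.2f, variable: %.2f\n", avg_fixed, avg_var); // fixed: 2.00, variable: 1.35
    return 0;
}
```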
Huffman Coding#
- First, count the probability of each character.
- Build a Huffman tree with n characters.
- Each character falls on a leaf node [no prefix overlap].
- Read off each character's code from root to leaf: left edge = 0, right edge = 1.
- Tree Building Example
- [Figure: Huffman tree built from a: 0.8, b: 0.05, 0: 0.1, 1: 0.05; resulting code lengths a: 1, 0: 2, b: 3, 1: 3]
- Tree building mnemonic: each time take the two nodes with the smallest probabilities and merge them into a subtree [producing a new node whose probability is their sum]; see the worked trace after this list.
- ⭐Because all characters fall on leaf nodes, no encoding is a prefix of another character's encoding.
- No conflicts.
- At this time, the average coding length is 1 * 0.8 + 3 * 0.05 + 2 * 0.1 + 3 * 0.05 = 1.3.
- Conclusion: Huffman coding is the optimal variable-length coding [proof provided below].
- [PS]
- Mainly focus on length, not too concerned about the left and right of 01.
- If concerned, consider which has a higher transmission cost, 0 or 1.
- Huffman trees are not unique; the order is arbitrary when probabilities are equal.
- Huffman coding can also degenerate into fixed-length coding [similarly, for example, coding 4 characters with equal probability (0.25)].
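A worked trace of the mnemonic above for a: 0.8, b: 0.05, 0: 0.1, 1: 0.05, where each arrow merges the two smallest probabilities:

```latex
\{0.8,\ 0.05,\ 0.1,\ 0.05\}
\xrightarrow{0.05 + 0.05}
\{0.8,\ 0.1,\ 0.1\}
\xrightarrow{0.1 + 0.1}
\{0.8,\ 0.2\}
\xrightarrow{0.8 + 0.2}
\{1.0\}
```

The merges place a at depth 1, 0 at depth 2, and b and 1 at depth 3, giving exactly the average length 1 × 0.8 + 3 × 0.05 + 2 × 0.1 + 3 × 0.05 = 1.3 computed above.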
Proving Huffman's Optimality#
[The heavyweight among variable-length codings: optimal means the average coding length is minimized.]
Steps: ① Transform the representation of average coding length; ② Solve for the optimal solution of the formula.
Thought Transformation#
- [Figure: a full binary tree with one internal node marked red; four leaf nodes sit beneath it]
- For the Huffman coding process, if the red node corresponds to a character, then the four leaf nodes below must be cut off.
- Otherwise, conflicts will occur.
- [Intuitively, this is also a fact] Huffman coding will cover all leaf nodes.
- Assume there are 4 characters, their encoding results are as follows [red nodes].
- [Figure: four red nodes on the tree, marking each character's code]
- It can be seen that for a tree of height H [levels counted from 0 at the root], a node at level l covers 2 ^ (H - l) leaf nodes.
- The level l is exactly the coding length of a node at that level.
- The shallower the node, the more leaves it covers; the deeper the node, the fewer.
- 👇
- There is a mapping between the coding length of the i-th character [left] and the number of leaf nodes covered by the corresponding node [right], as follows:
- li → 2 ^ (H - li).
Implicit Constraint Conditions#
- From the covering view, the leaves covered by the n code nodes cannot exceed the 2 ^ H leaves of the height-H tree: 2 ^ (H - l1) + 2 ^ (H - l2) + ... + 2 ^ (H - ln) ≤ 2 ^ H.
- Divide both sides of this inequality by 2 ^ H: 2 ^ (-l1) + 2 ^ (-l2) + ... + 2 ^ (-ln) ≤ 1 [restated after this list].
- [PS] For a Huffman tree the inequality actually holds with equality [every leaf is covered], but don't rush; this falls out of the proof below.
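Written in standard notation, this is the Kraft inequality:

```latex
\sum_{i=1}^{n} 2^{\,H - l_i} \le 2^{H}
\quad\Longrightarrow\quad
\sum_{i=1}^{n} 2^{-l_i} \le 1
```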
Transforming the Proof Objective#
- Minimize the average coding length ∑ pi × li; substitute li = -li' [so li' = -li].
- The constraint comes from the tree structure above [i.e., implicit constraint conditions].
- The objective then becomes ∑ pi × li = -∑ pi × li'.
- Let I_i = 2 ^ li'.
- That is, transform both the objective and the constraint.
- 【Transformation Result】
- Objective: minimize -(p1·log2(I1) + p2·log2(I2) + ... + pn·log2(In)).
- Constraint: I1 + I2 + ... + In ≤ 1.
- [PS]
- Does it look a bit like entropy? In fact, the proof works from the perspective of entropy, and it also introduces the concept of cross-entropy.
- Why transform?
- To derive the relationship between the equality of the constraint condition and the minimum of the objective.
- There is also a limiting condition: the sum of the probabilities is 1 [the full problem is restated after this list].
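Stated compactly, the transformed optimization problem is:

```latex
\min_{I_1,\dots,I_n}\; -\sum_{i=1}^{n} p_i \log_2 I_i
\qquad\text{s.t.}\quad
\sum_{i=1}^{n} I_i \le 1,\quad
\sum_{i=1}^{n} p_i = 1,\quad
I_i = 2^{-l_i}
```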
Further Narrowing the Constraint Conditions#
- Prove: When the objective function reaches its minimum, the constraint condition must equal 1.
- Proof by contradiction: assume the constraint is strict, i.e. I1 + I2 + ... + In < 1.
- Let I_x' = 1 - ∑ I_i, which is then positive [look at the constraint condition].
- I_x' can be added to any single term I_j: since log2 is increasing, -pj × log2(I_j + I_x') < -pj × log2(I_j).
- So as long as I_x' is not 0, the objective is not minimal; it can still be made smaller.
- Therefore, the constraint condition must equal 1.
Finding the Extreme Value by Setting the Partial Derivative to Zero#
- Use the constraint to eliminate one unknown: I_n = 1 - (I_1 + I_2 + ... + I_{n-1}).
- To find where the objective is minimal, set each partial derivative to zero.
- For every j < n, this yields pj / I_j = pn / I_n [the computation is written out after this list].
- These equations hold exactly when the objective reaches its minimum.
- Let p1/I1 = p2/I2 = ... = pn/In = k. Summing gives ∑ pi = k × ∑ I_i; since both sums equal 1 [the constraint equality and the probability sum], k = 1.
- ⭐ So pi = I_i = 2 ^ (-li) for every i.
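Written out, the computation referenced above (substitute the constraint first, then differentiate):

```latex
f(I_1,\dots,I_{n-1})
= -\sum_{i=1}^{n-1} p_i \log_2 I_i
  \;-\; p_n \log_2\Bigl(1 - \sum_{i=1}^{n-1} I_i\Bigr)
```

```latex
\frac{\partial f}{\partial I_j}
= \frac{1}{\ln 2}\left(-\frac{p_j}{I_j} + \frac{p_n}{I_n}\right) = 0
\;\Longrightarrow\;
\frac{p_j}{I_j} = \frac{p_n}{I_n}
\quad (j = 1,\dots,n-1)
```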
❗ Reflection#
The relationship between average coding length, entropy, and cross-entropy.
- pi = I_i.
- I_i is related to coding length l_i, which means probability and length are closely related.
- The greater the probability, the shorter the coding length.
- Substituting pi = I_i into the objective gives the formula for 【entropy】: -∑ pi × log2(pi) [written out after this list].
- In fact, entropy represents the minimum average representation unit required to represent some information in a system.
- Similarly, it can also be understood as average coding length [⭐thought connection].
- The longer the average coding length → the more characters in the system → the more states of the system → the more chaotic the system.
- Greater entropy indicates more chaotic coding in the system.
- 【Cross-entropy】
- Treat both pi and I_i as a set of probabilities.
- It describes the similarity between two probability vectors.
- Cross-entropy -∑ pi × log2(I_i) is the average coding length when the code lengths are derived from the distribution I_i but the characters actually appear with probabilities pi.
- In fact, the proof is not very rigorous.
- Because the Huffman process is actually discrete, while this proof process is continuous.
- For example, even when pi = I_i, the length li = -log2(pi) computed from I_i may be fractional, but an actual coding length must be an integer.
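For reference, the two formulas in standard form: entropy, the minimum achievable average coding length, and cross-entropy, the average coding length when the code lengths follow I_i while the characters follow pi:

```latex
H(p) = -\sum_{i} p_i \log_2 p_i,
\qquad
H(p, I) = -\sum_{i} p_i \log_2 I_i \;\ge\; H(p)
```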
Code Demonstration#
Huffman Tree#
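In place of the original screenshots, here is a minimal, self-contained sketch in C-style C++; the function names (`build`, `extract`) and array sizes are illustrative, not the course's exact code. It follows the key operations listed below:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define MAX_N 32

// A Huffman tree node: leaves carry a character, internal nodes only a probability.
typedef struct Node {
    char ch;                      // valid only on leaf nodes
    double p;                     // probability (weight)
    struct Node *lchild, *rchild;
} Node;

Node *new_node(char ch, double p, Node *l, Node *r) {
    Node *node = (Node *)malloc(sizeof(Node));
    node->ch = ch; node->p = p;
    node->lchild = l; node->rchild = r;
    return node;
}

// Build a Huffman tree from n (character, probability) pairs.
// Each round: find the two nodes with the smallest probabilities and merge
// them into a subtree (a priority queue would make this O(n log n)).
Node *build(char *chs, double *ps, int n) {
    Node *nodes[MAX_N];
    for (int i = 0; i < n; i++) nodes[i] = new_node(chs[i], ps[i], NULL, NULL);
    for (int cnt = n; cnt > 1; cnt--) {
        int a = 0, b = 1;         // a: smallest probability, b: second smallest
        if (nodes[a]->p > nodes[b]->p) { int t = a; a = b; b = t; }
        for (int i = 2; i < cnt; i++) {
            if (nodes[i]->p < nodes[a]->p) { b = a; a = i; }
            else if (nodes[i]->p < nodes[b]->p) b = i;
        }
        Node *merged = new_node(0, nodes[a]->p + nodes[b]->p, nodes[a], nodes[b]);
        nodes[a] = merged;        // merged node takes slot a,
        nodes[b] = nodes[cnt - 1]; // the last node fills slot b
    }
    return nodes[0];
}

// Recursively extract the codes: left edge = 0, right edge = 1.
// Only leaf nodes correspond to character codes (no prefix overlap).
void extract(Node *root, char *buf, int depth, char code[][MAX_N]) {
    if (root->lchild == NULL && root->rchild == NULL) {
        buf[depth] = '\0';
        strcpy(code[(int)root->ch], buf); // index by ASCII value of the character
        return;
    }
    buf[depth] = '0'; extract(root->lchild, buf, depth + 1, code);
    buf[depth] = '1'; extract(root->rchild, buf, depth + 1, code);
}

int main() {
    char chs[] = {'a', 'b', '0', '1'};
    double ps[] = {0.8, 0.05, 0.1, 0.05};
    char code[128][MAX_N] = {{0}}, buf[MAX_N];
    Node *root = build(chs, ps, 4);
    extract(root, buf, 0, code);
    // Prints a : 1, b : 000, 0 : 01, 1 : 001 (average length 1.3)
    for (int i = 0; i < 4; i++) printf("%c : %s\n", chs[i], code[i]);
    return 0;
}
```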
- Structure based on trees.
- 【Key Operations】
- Merging nodes establishes the parent-child relationships [converting between the array of nodes and the tree structure].
- Find the two smallest probabilities [can optimize using a priority queue].
- Recursively extract coding [only leaf nodes correspond to character coding].
- It is recommended to use character arrays to store characters when reading data to reduce error rates.
Additional Knowledge Points#
- In computers, the most basic [smallest] storage unit is a byte, and the most basic [smallest] data representation unit is a bit.
Points for Reflection#
- Other proof methods for Huffman being the optimal coding?
Tips#
- Mathematical knowledge determines your ceiling in computer science.