Intuitively understand Huffman trees and Huffman coding, and prove that Huffman coding is the optimal variable-length coding.
Course Content#
Prerequisite Knowledge#
- What is coding?
- Where did you first hear about coding? ASCII code
- 'a' = (97)₁₀ = (0110 0001)₂
- '0' = (48)₁₀ = (0011 0000)₂
- 8-bit binary encoding, one byte
- Note: In computers, any information is stored in binary.
- The role of coding: mapping human-readable characters to the binary representations the computer stores.
- Example: The impact of coding length
- There is a piece of information in the computer: "aa00", its ASCII encoding is: 01100001 01100001 00110000 00110000
- Now suppose this information is transmitted from one computer to another.
- In the network, the smallest transmission unit is the bit, so "aa00" requires transmitting 32 bits.
- Assume a computer's network speed is 32 bit/s; the transmission then takes 1 s.
- 👇
- Consider a special scenario: only four characters a, b, 0, 1 need to be transmitted.
- A new coding method can be designed [Pirate Class Coding - Fixed-Length Coding]
- Use 2-bit binary to distinguish them — a: 00, b: 01, 0: 10, 1: 11
- At this time, the representation of "aa00" only requires 8 bits.
- Therefore, under the same network speed, the current transmission time only needs: 0.25s.
- [PS] Compared to Tencent Meeting, Zoom's encoding and decoding are more refined, so the Zoom experience is better under the same network speed.
Fixed-Length and Variable-Length Coding#
- Fixed-length coding: the encoding length is the same for every character.
- ASCII code and the Pirate Class coding in specific scenarios are both fixed-length coding.
- [PS] UTF-8 is variable-length coding; UTF-16 is fixed-length within the BMP but becomes variable-length once surrogate pairs are needed [self-study].
- Variable-length coding: encoding lengths may differ across characters.
- Variable-length coding is never worse than fixed-length coding.
- Fixed-length coding is a special case of variable-length coding.
- 【Extended Understanding】
- A variable-length code may turn out to be fixed-length.
- Variable-length coding must be tailored to the specific scenario [the occurrence probabilities of the characters must be gathered in advance].
- A coding scheme can be seen as a protocol: agreed upon in advance and not changed dynamically.
- Encoding and decoding must remain consistent.
Application Scenarios of Variable-Length Coding#
- Specific scenarios
- Only four characters need to be transmitted: a, b, 0, 1.
- The probability of each character appearing differs: a: 0.8, b: 0.05, 0: 0.1, 1: 0.05.
- Average coding length
- avg(l) = ∑ li × pi
- li: the encoding length of the i-th character.
- pi: the occurrence probability of the i-th character.
- In fact, this is the expected value of the coding length.
- Example
- Assume the average coding length is 1.16; then transmitting 100 characters takes about 116 bits [an estimate].
- The average coding length of the Pirate Class coding [see above]: avg(l) = 2 × ∑ pi = 2 [since ∑ pi = 1].
- For any fixed-length code, the average coding length is simply that fixed length.
- New · Pirate Class Coding
- a: 1
- b: 01 [Note: b's code cannot start with 1, or it would conflict with a during decoding, ❌ prefix overlap].
- 0: 000
- 1: 001
- At this time, the average coding length: 1 * 0.8 + 2 * 0.05 + 3 * 0.1 + 3 * 0.05 = 1.35.
- Therefore, transmitting 100 characters only requires transmitting 135 bits [estimation].
- [PS] Characters with higher probability get shorter codes; the sketch after this list verifies both averages.
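A minimal sketch that verifies both averages. The probabilities and code lengths are taken from the scenario above; everything else (names, output format) is illustrative:

```cpp
#include <cstdio>

int main() {
    // Occurrence probabilities of a, b, 0, 1 (from the scenario above)
    double p[4] = {0.8, 0.05, 0.1, 0.05};
    int fixed_len[4] = {2, 2, 2, 2}; // Pirate Class fixed-length coding: 2 bits each
    int var_len[4]   = {1, 2, 3, 3}; // New Pirate Class coding: a:1, b:01, 0:000, 1:001
    double avg_fixed = 0, avg_var = 0;
    for (int i = 0; i < 4; i++) {
        avg_fixed += fixed_len[i] * p[i]; // avg(l) = sum of li * pi
        avg_var   += var_len[i]   * p[i];
    }
    printf("fixed: %.2f, variable: %.2f\n", avg_fixed, avg_var); // fixed: 2.00, variable: 1.35
    return 0;
}
```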
Huffman Coding#
- First, count the probability of each character.
- Build a Huffman tree with n characters.
- Each character falls on a leaf node [no prefix overlap].
- Read off each character's code from root to leaf: left edge = 0, right edge = 1.
- Tree Building Example
- [Figure: Huffman tree built from a: 0.8, b: 0.05, 0: 0.1, 1: 0.05; resulting code lengths a: 1, 0: 2, b: 3, 1: 3]
- Tree building mnemonic: each time take the two nodes with the smallest probabilities and merge them into a subtree [producing a new node whose probability is their sum]; see the worked trace after this list.
- ⭐Because all characters fall on leaf nodes, no encoding is a prefix of another character's encoding.
- No conflicts.
- At this time, the average coding length is 1 * 0.8 + 3 * 0.05 + 2 * 0.1 + 3 * 0.05 = 1.3.
- Conclusion: Huffman coding is the optimal variable-length coding [proof provided below].
- [PS]
- Mainly focus on length, not too concerned about the left and right of 01.
- If concerned, consider which has a higher transmission cost, 0 or 1.
- Huffman trees are not unique; the order is arbitrary when probabilities are equal.
- Huffman coding can also degenerate into fixed-length coding [similarly, for example, coding 4 characters with equal probability (0.25)].
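A worked trace of the mnemonic above for a: 0.8, b: 0.05, 0: 0.1, 1: 0.05, where each arrow merges the two smallest probabilities:

```latex
\{0.8,\ 0.05,\ 0.1,\ 0.05\}
\xrightarrow{0.05 + 0.05}
\{0.8,\ 0.1,\ 0.1\}
\xrightarrow{0.1 + 0.1}
\{0.8,\ 0.2\}
\xrightarrow{0.8 + 0.2}
\{1.0\}
```

The merges place a at depth 1, 0 at depth 2, and b and 1 at depth 3, giving exactly the average length 1 × 0.8 + 3 × 0.05 + 2 × 0.1 + 3 × 0.05 = 1.3 computed above.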
Proving Huffman's Optimality#
[The heavyweight among variable-length codings: optimal means the average coding length is minimized.]
Steps: ① Transform the representation of average coding length; ② Solve for the optimal solution of the formula.
Thought Transformation#
- [Figure: a full binary tree with one internal node marked red; four leaf nodes sit beneath it]
- For the Huffman coding process, if the red node corresponds to a character, then the four leaf nodes below must be cut off.
- Otherwise, conflicts will occur.
- [Intuitively, this is also a fact] Huffman coding will cover all leaf nodes.
- Assume there are 4 characters, their encoding results are as follows [red nodes].
- [Figure: four red nodes on the tree, marking each character's code]
- It can be seen that for a tree of height H [levels counted from 0 at the root], a node at level l covers 2 ^ (H - l) leaf nodes.
- The level l is exactly the coding length of a node at that level.
- The shallower the node, the more leaves it covers; the deeper the node, the fewer.
- 👇
- There is a mapping between the coding length of the i-th character [left] and the number of leaf nodes covered by the corresponding node [right], as follows:
- li → 2 ^ (H - li).
Implicit Constraint Conditions#
- From the covering view, the leaves covered by the n code nodes cannot exceed the 2 ^ H leaves of the height-H tree: 2 ^ (H - l1) + 2 ^ (H - l2) + ... + 2 ^ (H - ln) ≤ 2 ^ H.
- Divide both sides of this inequality by 2 ^ H: 2 ^ (-l1) + 2 ^ (-l2) + ... + 2 ^ (-ln) ≤ 1 [restated after this list].
- [PS] For a Huffman tree the inequality actually holds with equality [every leaf is covered], but don't rush; this falls out of the proof below.
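Written in standard notation, this is the Kraft inequality:

```latex
\sum_{i=1}^{n} 2^{\,H - l_i} \le 2^{H}
\quad\Longrightarrow\quad
\sum_{i=1}^{n} 2^{-l_i} \le 1
```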
Transforming the Proof Objective#
- Minimize the average coding length ∑ pi × li; substitute li = -li' [so li' = -li].
- The constraint comes from the tree structure above [i.e., implicit constraint conditions].
- The objective then becomes ∑ pi × li = -∑ pi × li'.
- Let I_i = 2 ^ li'.
- That is, transform both the objective and the constraint.
- 【Transformation Result】
- Objective: minimize -(p1·log2(I1) + p2·log2(I2) + ... + pn·log2(In)).
- Constraint: I1 + I2 + ... + In ≤ 1.
- [PS]
- Does it look a bit like entropy? In fact, the proof works from the perspective of entropy, and it also introduces the concept of cross-entropy.
- Why transform?
- To derive the relationship between the equality of the constraint condition and the minimum of the objective.
- There is also a limiting condition: the sum of the probabilities is 1 [the full problem is restated after this list].
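Stated compactly, the transformed optimization problem is:

```latex
\min_{I_1,\dots,I_n}\; -\sum_{i=1}^{n} p_i \log_2 I_i
\qquad\text{s.t.}\quad
\sum_{i=1}^{n} I_i \le 1,\quad
\sum_{i=1}^{n} p_i = 1,\quad
I_i = 2^{-l_i}
```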
Further Narrowing the Constraint Conditions#
- Prove: When the objective function reaches its minimum, the constraint condition must equal 1.
- Proof by contradiction: assume the constraint is strict, i.e. I1 + I2 + ... + In < 1.
- Let I_x' = 1 - ∑ I_i, which is then positive [look at the constraint condition].
- I_x' can be added to any single term I_j: since log2 is increasing, -pj × log2(I_j + I_x') < -pj × log2(I_j).
- So as long as I_x' is not 0, the objective is not minimal; it can still be made smaller.
- Therefore, the constraint condition must equal 1.
Finding the Extreme Value by Setting the Partial Derivative to Zero#
- Use the constraint to eliminate one unknown: I_n = 1 - (I_1 + I_2 + ... + I_{n-1}).
- To find where the objective is minimal, set each partial derivative to zero.
- For every j < n, this yields pj / I_j = pn / I_n [the computation is written out after this list].
- These equations hold exactly when the objective reaches its minimum.
- Let p1/I1 = p2/I2 = ... = pn/In = k. Summing gives ∑ pi = k × ∑ I_i; since both sums equal 1 [the constraint equality and the probability sum], k = 1.
- ⭐ So pi = I_i = 2 ^ (-li) for every i.
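Written out, the computation referenced above (substitute the constraint first, then differentiate):

```latex
f(I_1,\dots,I_{n-1})
= -\sum_{i=1}^{n-1} p_i \log_2 I_i
  \;-\; p_n \log_2\Bigl(1 - \sum_{i=1}^{n-1} I_i\Bigr)
```

```latex
\frac{\partial f}{\partial I_j}
= \frac{1}{\ln 2}\left(-\frac{p_j}{I_j} + \frac{p_n}{I_n}\right) = 0
\;\Longrightarrow\;
\frac{p_j}{I_j} = \frac{p_n}{I_n}
\quad (j = 1,\dots,n-1)
```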
❗ Reflection#
The relationship between average coding length, entropy, and cross-entropy.
- pi = I_i.
- I_i is related to coding length l_i, which means probability and length are closely related.
- The greater the probability, the shorter the coding length.
- Substituting pi = I_i into the objective gives the formula for 【entropy】: -∑ pi × log2(pi) [written out after this list].
- In fact, entropy represents the minimum average representation unit required to represent some information in a system.
- Similarly, it can also be understood as average coding length [⭐thought connection].
- The longer the average coding length → the more characters in the system → the more states of the system → the more chaotic the system.
- Greater entropy indicates more chaotic coding in the system.
- 【Cross-entropy】
- Treat both pi and I_i as a set of probabilities.
- It describes the similarity between two probability vectors.
- Cross-entropy -∑ pi × log2(I_i) is the average coding length when the code lengths are derived from the distribution I_i but the characters actually appear with probabilities pi.
- In fact, the proof is not very rigorous.
- Because the Huffman process is actually discrete, while this proof process is continuous.
- For example, even when pi = I_i, the length li = -log2(pi) computed from I_i may be fractional, but an actual coding length must be an integer.
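For reference, the two formulas in standard form: entropy, the minimum achievable average coding length, and cross-entropy, the average coding length when the code lengths follow I_i while the characters follow pi:

```latex
H(p) = -\sum_{i} p_i \log_2 p_i,
\qquad
H(p, I) = -\sum_{i} p_i \log_2 I_i \;\ge\; H(p)
```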
Code Demonstration#
Huffman Tree#
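In place of the original screenshots, here is a minimal, self-contained sketch in C-style C++; the function names (`build`, `extract`) and array sizes are illustrative, not the course's exact code. It follows the key operations listed below:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define MAX_N 32

// A Huffman tree node: leaves carry a character, internal nodes only a probability.
typedef struct Node {
    char ch;                      // valid only on leaf nodes
    double p;                     // probability (weight)
    struct Node *lchild, *rchild;
} Node;

Node *new_node(char ch, double p, Node *l, Node *r) {
    Node *node = (Node *)malloc(sizeof(Node));
    node->ch = ch; node->p = p;
    node->lchild = l; node->rchild = r;
    return node;
}

// Build a Huffman tree from n (character, probability) pairs.
// Each round: find the two nodes with the smallest probabilities and merge
// them into a subtree (a priority queue would make this O(n log n)).
Node *build(char *chs, double *ps, int n) {
    Node *nodes[MAX_N];
    for (int i = 0; i < n; i++) nodes[i] = new_node(chs[i], ps[i], NULL, NULL);
    for (int cnt = n; cnt > 1; cnt--) {
        int a = 0, b = 1;         // a: smallest probability, b: second smallest
        if (nodes[a]->p > nodes[b]->p) { int t = a; a = b; b = t; }
        for (int i = 2; i < cnt; i++) {
            if (nodes[i]->p < nodes[a]->p) { b = a; a = i; }
            else if (nodes[i]->p < nodes[b]->p) b = i;
        }
        Node *merged = new_node(0, nodes[a]->p + nodes[b]->p, nodes[a], nodes[b]);
        nodes[a] = merged;        // merged node takes slot a,
        nodes[b] = nodes[cnt - 1]; // the last node fills slot b
    }
    return nodes[0];
}

// Recursively extract the codes: left edge = 0, right edge = 1.
// Only leaf nodes correspond to character codes (no prefix overlap).
void extract(Node *root, char *buf, int depth, char code[][MAX_N]) {
    if (root->lchild == NULL && root->rchild == NULL) {
        buf[depth] = '\0';
        strcpy(code[(int)root->ch], buf); // index by ASCII value of the character
        return;
    }
    buf[depth] = '0'; extract(root->lchild, buf, depth + 1, code);
    buf[depth] = '1'; extract(root->rchild, buf, depth + 1, code);
}

int main() {
    char chs[] = {'a', 'b', '0', '1'};
    double ps[] = {0.8, 0.05, 0.1, 0.05};
    char code[128][MAX_N] = {{0}}, buf[MAX_N];
    Node *root = build(chs, ps, 4);
    extract(root, buf, 0, code);
    // Prints a : 1, b : 000, 0 : 01, 1 : 001 (average length 1.3)
    for (int i = 0; i < 4; i++) printf("%c : %s\n", chs[i], code[i]);
    return 0;
}
```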
- Structure based on trees.
- 【Key Operations】
- Merging nodes establishes the parent-child relationships [converting between the array of nodes and the tree structure].
- Find the two smallest probabilities [can optimize using a priority queue].
- Recursively extract coding [only leaf nodes correspond to character coding].
- It is recommended to use character arrays to store characters when reading data to reduce error rates.
Additional Knowledge Points#
- In computers, the most basic [smallest] storage unit is a byte, and the most basic [smallest] data representation unit is a bit.
Points for Reflection#
- Other proof methods for Huffman being the optimal coding?
Tips#
- Mathematical knowledge determines your ceiling in computer science.