3 Huffman Trees and Huffman Coding

Intuitively understand Huffman trees and Huffman coding, and prove that Huffman coding is the optimal variable-length coding.

Course Content#

Prerequisite Knowledge#

  • What is coding?
    • Where did you first hear about coding? ASCII code
      • 'a' = (97)₁₀ = (0110 0001)₂
      • '0' = (48)₁₀ = (0011 0000)₂
      • An 8-bit binary encoding: one byte per character.
    • Note: in a computer, all information is stored in binary.
    • The role of coding: mapping human-readable characters to the computer's binary representation.
  • Example: The impact of coding length
    • There is a piece of information in the computer: "aa00", its ASCII encoding is: 01100001 01100001 00110000 00110000
    • Now transmit this information from one computer to another.
      • On the network, the smallest transmission unit is the bit, so "aa00" requires transmitting 32 bits.
    • Assume the computer's network speed is 32 bit/s; the transmission then takes 1 s.
    • 👇
    • Now narrow to a special scenario: only the four characters a, b, 0, 1 need to be transmitted.
    • A new coding method can be designed [Pirate Class Coding - Fixed-Length Coding]
      • Use 2-bit binary to distinguish them — a: 00, b: 01, 0: 10, 1: 11
    • Now the representation of "aa00" requires only 8 bits.
    • Therefore, at the same network speed, the transmission now takes only 0.25 s.
    • [PS] Compared with Tencent Meeting, Zoom's encoding and decoding are more refined, so at the same network speed the Zoom experience is better.
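
To make the example concrete, here is a minimal C sketch (not from the original lesson; `fixed_code` is an illustrative name) that encodes "aa00" with the 2-bit Pirate Class scheme and compares the bit count with ASCII:

```c
#include <stdio.h>
#include <string.h>

/* 2-bit fixed-length codes for the 4-character alphabet: a:00, b:01, 0:10, 1:11 */
static const char *fixed_code(char c) {
    switch (c) {
        case 'a': return "00";
        case 'b': return "01";
        case '0': return "10";
        case '1': return "11";
        default:  return NULL;  /* character outside the alphabet */
    }
}

int main(void) {
    const char *msg = "aa00";
    int bits = 0;
    printf("encoded: ");
    for (size_t i = 0; i < strlen(msg); i++) {
        const char *code = fixed_code(msg[i]);
        printf("%s", code);
        bits += (int)strlen(code);
    }
    /* 8 bits vs. 32 bits for 8-bit ASCII; at 32 bit/s that is 0.25 s vs. 1 s */
    printf("\ntotal: %d bits (ASCII needs %zu)\n", bits, 8 * strlen(msg));
    return 0;
}
```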

Fixed-Length and Variable-Length Coding#

  • Fixed-length coding — the encoding length is the same for each character.
    • ASCII code and the Pirate Class coding in specific scenarios are both fixed-length coding.
    • [PS] UTF-8 is variable-length coding; UTF-16 is often described as fixed-length (2 bytes), but it is actually variable-length as well, since some characters take 4 bytes (surrogate pairs) [self-study].
  • Variable-length coding — the encoding length may differ from character to character.
  • Variable-length coding is never worse than fixed-length coding.
    • Fixed-length coding is a special case of variable-length coding.
  • 【Extended Understanding】
    • Variable-length coding can degenerate into fixed-length coding.
    • Variable-length coding must be tailored to the specific scenario [the occurrence probabilities of the characters to be transmitted must be gathered in advance].
    • A coding can be seen as a protocol: agreed upon in advance, it cannot be changed dynamically.
    • Encoding and decoding must remain consistent.

Application Scenarios of Variable-Length Coding#

  • Specific scenarios
    • Only four characters: a, b, 0, 1.
    • The probability of each character appearing is different — a: 0.8, b: 0.05, 0: 0.1, 1: 0.05.
  • Average coding length
    • avg(l) = ∑ li × pi
    • li: the encoding length of the i-th character.
    • pi: the occurrence probability of the i-th character.
    • In fact, this is the expected value of the coding length.
    • Example
      • Assume the average coding length is 1.16; then transmitting 100 characters requires about 116 bits [an estimate].
      • The average coding length of Pirate Class coding [see above]: avg(l) = 2 × ∑ pi = 2.
        • For fixed-length coding, the average coding length is simply the fixed length.
  • New · Pirate Class Coding
    • a: 1
    • b: 01 [note: b's code cannot start with 1, or it would conflict with a during decoding; prefix overlap ❌].
    • 0: 000
    • 1: 001
    • At this time, the average coding length: 1 * 0.8 + 2 * 0.05 + 3 * 0.1 + 3 * 0.05 = 1.35.
    • Therefore, transmitting 100 characters only requires transmitting 135 bits [estimation].
  • [PS] Characters with higher probabilities correspond to shorter coding lengths.
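
As a worked check of the two schemes above, here is a minimal C sketch (illustrative, not from the lesson) that computes avg(l) = ∑ li × pi:

```c
#include <stdio.h>

/* Average coding length: avg(l) = sum(l_i * p_i), the expected code length. */
static double avg_len(const int len[], const double p[], int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += len[i] * p[i];
    return sum;
}

int main(void) {
    double p[] = {0.8, 0.05, 0.1, 0.05};    /* probabilities of a, b, 0, 1 */
    int fixed[]    = {2, 2, 2, 2};          /* Pirate Class fixed-length codes */
    int variable[] = {1, 2, 3, 3};          /* a: 1, b: 01, 0: 000, 1: 001 */
    printf("fixed:    %.2f bits/char\n", avg_len(fixed, p, 4));     /* 2.00 */
    printf("variable: %.2f bits/char\n", avg_len(variable, p, 4));  /* 1.35 */
    return 0;
}
```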

Huffman Coding#

  1. First, count the probability of each character.
  2. Build a Huffman tree with n characters.
  3. Each character falls on a leaf node [no prefix overlap].
  4. Read off each code from the tree path: left edge = 0, right edge = 1.
  • Tree Building Example
    • [Figure: building the Huffman tree for a: 0.8, b: 0.05, 0: 0.1, 1: 0.05]
    • Tree-building mnemonic: each time, take the two nodes with the smallest probabilities and merge them into a subtree [producing a new node whose probability is their sum].
    • ⭐Because all characters fall on leaf nodes, no encoding is a prefix of another character's encoding.
      • No conflicts.
    • At this time, the average coding length is 1 * 0.8 + 3 * 0.05 + 2 * 0.1 + 3 * 0.05 = 1.3.
  • Conclusion: Huffman coding is the optimal variable-length coding [proof provided below].
  • [PS]
    • Focus mainly on the lengths; which of 0/1 goes left or right matters little.
      • If it does matter, consider whether 0 or 1 has the higher transmission cost.
    • Huffman trees are not unique; the order is arbitrary when probabilities are equal.
    • Huffman coding can also degenerate into fixed-length coding [e.g., 4 characters of equal probability (0.25) all get 2-bit codes].
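
A worked trace of the example above (one possible result; since the tree is not unique, the exact 0/1 assignment may differ):

  • Merge b (0.05) + 1 (0.05) → 0.1; merge that with 0 (0.1) → 0.2; merge that with a (0.8) → 1.0.
  • One resulting code table [left 0, right 1]: a: 0, 0: 10, b: 110, 1: 111.
  • avg(l) = 1 × 0.8 + 2 × 0.1 + 3 × 0.05 + 3 × 0.05 = 1.3, matching the computation above.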

Proving Huffman's Optimality#

[Huffman coding is the heavyweight among variable-length codings; "optimal" means the average coding length is minimized.]

Steps: ① transform the expression for the average coding length; ② solve for the optimum of the resulting formula.

Thought Transformation#

  • [Figure: a binary tree in which a red (character) node cuts off the four leaves beneath it]
  • In the Huffman coding process, if a red node corresponds to a character, then the leaf nodes below it [four, in the figure] must be cut off.
  • Otherwise, prefix conflicts would occur.
  • [Intuitively, this is also a fact] Huffman coding covers all of the leaf nodes.
  • Assume there are 4 characters, their encoding results are as follows [red nodes].
    • [Figure: 4 characters encoded as red nodes in a tree of height H]
    • It can be seen that in a tree of height H [counting levels from 0], a node at level l covers 2^(H − l) leaf nodes.
      • The level l of a node is exactly its coding length.
      • The higher the node, the more leaves it covers; the lower, the fewer.
  • 👇
  • There is a mapping between the coding length of the i-th character [left] and the number of leaf nodes covered by the corresponding node [right], as follows:
    • li → 2 ^ (H - li).
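
A quick sanity check of this mapping, using the height-3 tree from the figure:

```latex
% With H = 3, a character node at level l = 1 covers
2^{\,H-l} = 2^{3-1} = 4 \ \text{leaves},
% exactly the four leaves that had to be cut off in the figure above.
```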

Implicit Constraint Conditions#

  • [Figure: the implicit constraint relations; written out in the block after this list]
  • The second relation [from the figure] is ∑ 2^(H − li) ≤ 2^H; dividing both sides by 2^H gives ∑ 2^(−li) ≤ 1.
  • [PS] For a Huffman tree itself, this relation must hold with equality, but don't rush; that is shown below.
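
Written out in full [reconstructed from the text above; the result is the well-known Kraft inequality]:

```latex
% Each character i sits at depth l_i and covers 2^{H-l_i} of the 2^H leaves.
% The covered leaf sets are disjoint, so
\sum_{i=1}^{n} 2^{\,H-l_i} \le 2^{H}
\qquad\Longrightarrow\qquad
\sum_{i=1}^{n} 2^{-l_i} \le 1 .
```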

Transforming the Proof Objective#

  • Minimize the average coding length ∑ pi × li, with the substitution li' = −li.
  • The constraint comes from the tree structure above [i.e., implicit constraint conditions].
  • [Figure: the transformation steps; the result is written out in the block after this list]
  • Let I_i = 2^(li') = 2^(−li), so that li = −log2(I_i).
  • That is, transform both the objective and the constraint.
  • 【Transformation Result】
    • Objective: minimize −(p1·log I1 + p2·log I2 + p3·log I3 + ... + pn·log In) [log base 2].
    • Constraint: I1 + I2 + I3 + ... + In <= 1.
  • [PS]
    • Does this look like entropy? In fact, the proof essentially works from the perspective of entropy, and it can also introduce the concept of cross-entropy.
    • Why transform?
      • To derive the relationship between the equality of the constraint condition and the minimum of the objective.
      • There is also a limiting condition: the sum of probabilities is 1.
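
The transformed problem, written out [reconstructed from the steps above]:

```latex
% Substituting l_i' = -l_i and I_i = 2^{l_i'} = 2^{-l_i}, i.e. l_i = -\log_2 I_i:
\text{minimize } -\sum_{i=1}^{n} p_i \log_2 I_i
\qquad \text{subject to } \sum_{i=1}^{n} I_i \le 1 ,\quad \sum_{i=1}^{n} p_i = 1 .
```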

Further Narrowing the Constraint Conditions#

  • Claim: when the objective function reaches its minimum, the constraint must hold with equality (the sum equals 1).
  • Proof by contradiction: suppose the constraint is strict, ∑ I_i < 1.
    • Let I_x' = 1 − ∑ I_i, which is then positive [look at the constraint condition].
    • [Figure: adding I_x' to one term of the objective; see the block after this list]
    • I_x' can be added to any single term I_j.
    • As long as I_x' is not 0, the objective is not yet minimal: it can still be made smaller.
    • Therefore, at the minimum, the constraint must equal 1.
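
The contradiction step, written out:

```latex
% Suppose \sum_i I_i < 1 and let I_x' = 1 - \sum_i I_i > 0.
% Adding I_x' to any single I_j keeps the constraint satisfied, and since
% -p_j \log_2(\cdot) is strictly decreasing,
-\,p_j \log_2\!\left(I_j + I_x'\right) \;<\; -\,p_j \log_2 I_j ,
% so the objective strictly decreases, a contradiction. Hence \sum_i I_i = 1.
```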

Finding the Extreme Value by Setting the Partial Derivative to Zero#

  • [Figure: substituting I_n = 1 − (I_1 + ... + I_{n−1}) into the objective]
  • Eliminate one unknown: I_n = 1 − (I_1 + I_2 + ... + I_{n−1}).
  • When does the objective reach its minimum? Take the partial derivatives.
    • Setting each partial derivative to zero yields one equation per remaining variable.
    • [Figure: the partial-derivative equations; spelled out in the block after this list]
    • This yields the condition p1/I1 = p2/I2 = ... = pn/In, which must hold when the objective reaches its minimum.
    • Let this common ratio be k; combining the constraint ∑ I_i = 1 with the probabilities summing to 1 gives k = 1.
    • ⭐So pi = I_i.
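
The derivative computation, written out:

```latex
% Substitute I_n = 1 - (I_1 + \cdots + I_{n-1}) into F = -\sum_i p_i \log_2 I_i.
% For each j < n:
\frac{\partial F}{\partial I_j}
  = \frac{1}{\ln 2}\left(\frac{p_n}{I_n} - \frac{p_j}{I_j}\right) = 0
\quad\Longrightarrow\quad
\frac{p_1}{I_1} = \cdots = \frac{p_n}{I_n} = k .
% Summing: \sum_i p_i = k \sum_i I_i; both sums equal 1, so k = 1 and p_i = I_i.
```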

❗ Reflection#

The relationship between average coding length, entropy, and cross-entropy.

  • pi = I_i.
    • I_i is determined by the coding length l_i [I_i = 2^(−l_i)], so probability and length are tightly linked.
      • The greater the probability, the shorter the coding length.
    • Substituting p_i = I_i back into the average coding length gives the formula for 【entropy】 [written out after this list].
      • In fact, entropy represents the minimum average number of units needed to represent the information in a system.
      • Equivalently, it can be understood as the minimum average coding length [⭐the connection between the two ideas].
      • A longer average coding length → more characters in the system → more possible states → a more chaotic system.
      • Greater entropy indicates a more chaotic system.
    • 【Cross-entropy】
      • Treat pi and I_i as two probability distributions.
      • Cross-entropy describes how close the two distributions are [it is minimized exactly when they coincide].
  • In fact, the proof is not very rigorous.
    • Because the Huffman process is actually discrete, while this proof process is continuous.
    • For example, at pi = I_i the coding length l_i = −log2(pi) computed from I_i may be fractional, yet an actual coding length must be an integer.
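
The entropy and cross-entropy formulas referred to above, written out:

```latex
% At the optimum p_i = I_i, i.e. l_i = -\log_2 p_i, the average length becomes
\min\,\mathrm{avg}(l) = -\sum_{i=1}^{n} p_i \log_2 p_i = H(p) \quad \text{(entropy)} .
% For a general code with I_i = 2^{-l_i}, the average length is the cross-entropy
H(p, I) = -\sum_{i=1}^{n} p_i \log_2 I_i \;\ge\; H(p) ,
% with equality exactly when I_i = p_i for every i.
```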

Code Demonstration#

Huffman Tree#

  • [Code figures from the lesson; a runnable sketch follows the list below]
  • The implementation is built on a tree structure.
  • 【Key Operations】
    • Merging nodes to establish the links between them [converting between the array and the node relationships].
    • Finding the two smallest probabilities [can be optimized with a priority queue].
    • Recursively extracting the codes [only leaf nodes correspond to character codes].
    • When reading input, it is recommended to store characters in character arrays to reduce the error rate.
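
Since the lesson's implementation appears only as screenshots, here is a minimal C sketch of the key operations listed above (names such as `build_huffman` and `print_codes` are illustrative): it repeatedly merges the two smallest-probability nodes with an O(n²) scan [the priority-queue optimization is omitted] and recursively reads codes off the leaves.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char ch;               /* meaningful only at leaves */
    double p;              /* probability mass of this subtree */
    struct Node *l, *r;
} Node;

static Node *new_node(char ch, double p, Node *l, Node *r) {
    Node *n = malloc(sizeof(Node));
    n->ch = ch; n->p = p; n->l = l; n->r = r;
    return n;
}

/* Index of the smallest-probability node among the live (non-NULL) slots. */
static int find_min(Node **a, int n) {
    int m = -1;
    for (int i = 0; i < n; i++)
        if (a[i] && (m < 0 || a[i]->p < a[m]->p)) m = i;
    return m;
}

/* n - 1 merges: each round combines the two smallest into a new internal node. */
static Node *build_huffman(Node **a, int n) {
    for (int round = 0; round < n - 1; round++) {
        int i = find_min(a, n);
        Node *x = a[i]; a[i] = NULL;
        int j = find_min(a, n);
        Node *y = a[j];
        a[j] = new_node(0, x->p + y->p, x, y);
    }
    return a[find_min(a, n)];
}

/* Read codes recursively: left edge = 0, right edge = 1.
   Only leaves carry characters, so no code is a prefix of another. */
static void print_codes(const Node *t, char *buf, int depth) {
    if (!t->l && !t->r) {
        buf[depth] = '\0';
        printf("%c: %s\n", t->ch, buf);
        return;
    }
    buf[depth] = '0'; print_codes(t->l, buf, depth + 1);
    buf[depth] = '1'; print_codes(t->r, buf, depth + 1);
}

int main(void) {
    char ch[] = {'a', 'b', '0', '1'};
    double p[] = {0.8, 0.05, 0.1, 0.05};
    Node *a[4];
    for (int i = 0; i < 4; i++) a[i] = new_node(ch[i], p[i], NULL, NULL);
    char buf[64];
    print_codes(build_huffman(a, 4), buf, 0);  /* prints one valid prefix code */
    return 0;
}
```

On the example distribution this prints a 1-bit code for a and 3-bit codes for b and 1 (the exact 0/1 pattern depends on tie-breaking), giving avg(l) = 1.3 as computed above.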

Additional Knowledge Points#

  • In computers, the most basic [smallest] storage unit is the byte, and the most basic [smallest] unit of data representation is the bit.

Points for Reflection#

Tips#

  • Mathematical knowledge determines the upper limit of what you can achieve in computing.
