In 1952 David A. Huffman published the paper A Method for the Construction of Minimum-Redundancy Codes

Here we will see an informal explanation and example of this algorithm to compress data.

Let's think of an ordinary text, something like this (what you are reading right now). In "human texts" there tend to be characters that appear more times than others, for example character 'a' or 'e' appears more times here than character 'x'. As you know these characters are stored using the same amount of bits for each one.

But what if we use fewer bits for the most frequent characters, and larger bits codes for those characters that appear the less? The idea sounds good but here we face a problem, suppose that we assign:

'a'=0

'b'=1

'c'=01

Then if we have the stream: "101" how do we know if it's "bab" or "bc"?

Okay the first thing you will probably think is to use some delimiter to know when a character ends and another begins. This will work but we will need extra space and could happen that due to this, we end up consuming more space that what the original stream takes.

Huffman did something else, you may want to stop reading here and take one or two days to think about it, or if you can't wait let's see how to solve this problem with an example:

Given the stream: "a method for the construction of minimum-redundancy codes", first we create the frequency table, this is, a table with the frequency for each character (or the times it char appears):

Char | Occurrences |
---|---|

'a' | 2 |

' ' | 7 |

'm' | 4 |

'e' | 4 |

't' | 4 |

'h' | 2 |

'o' | 6 |

'd' | 4 |

'f' | 2 |

'r' | 3 |

'c' | 4 |

'n' | 5 |

's' | 2 |

'u' | 3 |

'i' | 3 |

'-' | 1 |

'y' | 1 |

Total stream length | 57 chars |

Now the idea is to assign the shortest codes to the characters with more occurrences, but also we want to use only codes that are uniquely identifiable, something like self-identifiable codes.

We could say that if the code starts with 0 then it will have exactly 3 bits and if it starts with 1, it will have 4 or more bits. Which data structure can help us to find this king of codes? What about a tree, since we are working with bits, then a binary tree? If each of the characters of the stream to compress is a leaf of this tree and you want to describe the path from the root to a character you would say something like: move to the left child, then to the right, then to the left, then to the left again and so forth until you reach the leaf, this is, until you reach the character.

So there you are! A binary tree helps us to find the kind of codes we are looking for! And since we need shortest codes for the most frequent characters, we must place these characters in the less deep levels of the tree (but always leafs), so their paths (codes) will be the shortest!

To get this done, we start building the tree from the bottom, this is, from the leafs:

Consider or replace each character of the frequency table by a node of the tree we are building. (These nodes will have the same frequency attribute as the character and will be leafs) While ( more than one node is in the frequency table ) { Pick (remove from the table) the two less frequent nodes (if more than two, choose any) Set these nodes as the children of a new a node with frequency equal to the sum of it's children. Add this parent node the the table. }Now the only node in the table is the tree (and it will have a frequency equal to the total stream length)!

The deflated stream will be the tree plus the original stream encoded. If a given tree was agreed to be used (because frequencies of characters stay more or less constant among all streams) then the tree is not added to the deflated stream making it smaller. Of course, this should be done when working with the same type of streams. For example: books written in the same dialect. But not with a book as a stream, an executable file as another and an image or MP3 as another.

To apply the seudocode to our table we start with '-' and 'y':

Then we could combine this new node with the node of 's', getting: Now we could take 'f' and 'h' to get: Several steps later we end up with (note that there are other possible trees):So now we add to out table the codes from the tree:

For example, to get 't' code we just describe the path from the root to 't' leaf:

- move to left child (0)
- move to right child (1)
- move to left child (0)
- move to right child (1)

Char | Occurrences | Code | Bits | Total Bits |
---|---|---|---|---|

'a' | 2 | 01111 | 5 | 10 |

' ' | 7 | 101 | 3 | 21 |

'm' | 4 | 0110 | 4 | 16 |

'e' | 4 | 0100 | 4 | 16 |

't' | 4 | 0101 | 4 | 16 |

'h' | 2 | 00111 | 5 | 10 |

'o' | 6 | 100 | 3 | 18 |

'd' | 4 | 0010 | 4 | 16 |

'f' | 2 | 00110 | 5 | 10 |

'r' | 3 | 1101 | 4 | 12 |

'c' | 4 | 0000 | 4 | 16 |

'n' | 5 | 111 | 3 | 15 |

's' | 2 | 00011 | 5 | 10 |

'u' | 3 | 1100 | 4 | 12 |

'i' | 3 | 01110 | 5 | 15 |

'-' | 1 | 000101 | 6 | 6 |

'y' | 1 | 000100 | 6 | 6 |

Total stream length | 57 chars | 225 bits |

So, how much space did we save?

Since we are working with 17 different characters, using a fixed code it's necessary at least 5 bits per character. 5 bits/character multiplied by the amount of characters of the stream, this is: 5 bits/character x 57 characters= 285 bits.

With huffman coding, from the above table, we have 255 characters, this is 20% less.

If we were working with 8 bits/character then: 8 bits/character x 57 characters= 456 bits. And so huffman coding would give us a 51% smaller stream.

But bear in mind that we might need to send the codes as part of the stream and also this is just an example intended to explain the overall idea. Depending of the stream nature and how large it is, these numbers are going to change.