Write Your Own Compression Tool
This challenge is to build your own command like tool to compress text files. This is a challenge I first did in 1998 when my employer didn’t have much work for me to do between projects so suggested I pick a skill and polish it.
I picked C and decided to implement a few of the data structures and algorithms from my university course again in C (we did Pascal on the course). At the same time I wanted to start factoring in some of the ‘mechanical sympathy’ that I ’d been reading about in Michael Abrash's excellent Graphics Programming Black Book.
Despite the name the book covers a lot more than just graphics programming - it really digs into understanding the architecture of a computer and how that has an impact on the performance of your code. For example when I re-wrote of part the compression tool that this challenge is about to take into account the hard disk buffer size (back before SSDs) the performance on a one megabyte file jumped 100x!
These days the equivalent would be understand the cache hierarchies in a modern CPU and how programming with an awareness of them can radically change the performance of your code. Scott Meyers did a great talk that digs into this. You can find the slides on his website here.
The Challenge - Building A Huffman Encoder/Decoder
In the early 1950s David Huffman developed an algorithm to find the optimal prefix code that can be used for lossless data compression.
Given there is usually an unequal distribution of character occurrences in text this can then be used to compress data by giving the most commonly occurring characters the shortest prefix.
For example if we have the string aaabbc, it would normally take up 6 bytes, but if we assign each character a variable length code, with the most frequently occurring character has the shortest code we might give them the following codes:
a: 1 b: 01 c: 10
and we could reduce the string aaabbc (six bytes) to 111010110 (nine bits). It’s not quite that simple though as we need to ensure that the codes are prefix-free, that is the bit string representing one character is not a prefix of a bit string representing another.
These prefix-free codes are generated by creating a binary tree. Before we get into that let’s consider the steps involved in compression using Huffman Codes:
- Read the text and determine the frequency of each character occurring.
- Build the binary tree from the frequencies.
- Generate the prefix-code table from the tree.
- Encode the text using the code table.
- Encode the tree - we’ll need to include this in the output file so we can decode it.
- Write the encoded tree and text to an output field
In most (Fortran and COBOL are exceptions) programming languages we index arrays from the element zero onwards because arrays are represented as a pointer to the beginning and an offset from that beginning - i.e. if you’re familiar with C then array[index] is equivalent to *(array + index).
As usually, I’ll leave you to setup your IDE / editor of choice and programming language of choice. After that here’s what I’d like you to do to be ready to test your solution.
After that, please download Les Misérables, by Victor Hugo from Project Gutenberg here: https://www.gutenberg.org/files/135/135-0.txt which will be our test data for this challenge.
In this step your goal is to build your tool to accept a filename as input. It should return an error if the file is not valid, otherwise your program should open it, read the text and determine the frequency of each character occurring within the text.
As a debugging aid you might like to optionally log out the table. Here’s some test values to check. There are 333 occurrences of ‘X’ and 223000 of ‘t’.
An alternate to logging the frequency table is to write some unit tests that take in some prepared data and then check that the counting code returns the correct values. Check your chosen programming language’s documentation for it’s unit testing library and if you’ve never tried it before consider giving test driven development a go.
In this step your goal is to use the frequencies that you calculated in step 1 to build the binary tree. There’s a good explanation of how to do this complete with a visualisation here: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html
The examples used for the visualisation would be a good starting point for unit tests.
In this step your goal is to use the tree to generate the prefix-code table. Once again there’s an explanation and visualisation at: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html
In this step your goal is to write a header section to the output file (you’ll want a command line option to specify the filename). The header section will include enough information for your program to be decode the compressed file.
One way of doing this is to write out the tree, another option is to write out the character frequency table. Don’t forget you’ll need some way of knowing when the header ends and when the compressed data starts.
This is the moment you’ve been building up to! In this step your goal is to encode the text using the code table and write the compressed content of the source text to the output file file, appending it after the header. Don’t forget translate the prefixes into bit strings and pack them into bytes to achieve the compression.
In this step your goal is to read in the header of the encoded file and rebuild the prefix table ready to decode the text. In essence you’re going to do the reverse of step 4.
In this step you will read in the remainder of the file and decode it using the prefix table before writing the decoded text back to the specified output file - the reverse of step 5.
At this point you should be able to compress the file with your tool, compare the file size of the input and output see that the output is smaller and then decompress the output file to create an identical copy of the input.
If so, congratulations you’ve written a working Huffman encode/decode tool!
Help Others by Sharing Your Solutions!
If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message via Twitter or LinkedIn or just post about it there and tag me.
Get The Challenges By Email
If you would like to recieve the coding challenges by email, you can subscribe to the weekly newsletter on SubStack here: