Refactor Huffman CODEC to use Canonical Huffman code #36
The main advantage of Canonical Huffman is that it may reduce the size of the codebook (the encoding tree, etc.) and thus improve overall compression. It does not affect the size of the actual encoded data. The current Gridfour implementation requires a fixed 10 bits for each unique symbol included in a source data set, so the goal is to use Canonical Huffman to build a codebook that requires less space than that.

Typically, the codebook for Canonical Huffman is stored (in some compact form) as a series of code lengths. These give the number of bits needed to represent each unique symbol in the Huffman code. Canonical Huffman uses these code lengths to reconstruct the Huffman tree (in some efficient form) that was used to encode the source data; a sketch of that reconstruction appears at the end of this issue. So the goal here is to see whether the Canonical Huffman code lengths can be stored in less space than the existing Gridfour method requires.

To test this idea, I used the Canonical Huffman algorithm to produce a set of integer code lengths for a data set organized in bytes. While a byte can take on 256 unique values, this particular data set contained only 128 unique symbols. The test stored code lengths for all possible byte states in an array of 256 integers; for byte states that did not occur in the source data, the code length was recorded as zero. For this particular data set, the maximum code length was 14 (other data sets might produce larger code lengths).

Naturally, storing 256*8 bits to represent the integer code-length values would take more space than the 10 bits per unique symbol currently implemented. So, for this test, we used the existing implementation to Huffman-compress the 256 code lengths. Since a large share of those code lengths were zero, the code-length set compressed quite readily.

Findings:
Indications: It appears that further investigation is worth pursuing.

Next step: I encoded the code lengths using the existing implementation because it was available and convenient, which makes this a hybrid approach to Canonical Huffman. The next step is to see whether a purely Canonical implementation improves on the results from this test.
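For reference, here is a minimal sketch (illustrative names, not the Gridfour API) of how canonical codes can be reconstructed from nothing but the stored code lengths, using the standard counting recurrence. This is the step that lets a decoder rebuild exactly the codes the encoder used.

```java
// Sketch of canonical code assignment from a code-length table.
// A length of zero means the symbol does not occur in the source data.
public class CanonicalCodeBuilder {

  /**
   * @param codeLengths one entry per possible symbol (e.g. 256 for byte data);
   *                    zero means the symbol is absent from the source.
   * @return canonical codes, right-aligned in the low-order bits of each int.
   */
  static int[] buildCanonicalCodes(int[] codeLengths) {
    int maxLen = 0;
    for (int len : codeLengths) {
      maxLen = Math.max(maxLen, len);
    }

    // Count how many symbols use each code length.
    int[] lengthCount = new int[maxLen + 1];
    for (int len : codeLengths) {
      if (len > 0) {
        lengthCount[len]++;
      }
    }

    // Compute the first canonical code for each length (standard recurrence).
    int[] nextCode = new int[maxLen + 1];
    int code = 0;
    for (int len = 1; len <= maxLen; len++) {
      code = (code + lengthCount[len - 1]) << 1;
      nextCode[len] = code;
    }

    // Assign codes to symbols in increasing symbol order within each length.
    int[] codes = new int[codeLengths.length];
    for (int symbol = 0; symbol < codeLengths.length; symbol++) {
      int len = codeLengths[symbol];
      if (len > 0) {
        codes[symbol] = nextCode[len]++;
      }
    }
    return codes;
  }
}
```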
There is evidence that we can use the Canonical Huffman code to store the code table for Huffman. Our current approach requires a fixed 10 bits per symbol. The Canonical Huffman scheme, enhanced with run-length encoding for non-included symbols, has the potential to reduce that (some testing yielded an average of fewer than 6 bits per code).
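As a rough illustration of what that run-length step might look like, here is a hypothetical sketch that collapses runs of zero code lengths (absent symbols) into (zero, runCount) pairs while passing non-zero lengths through individually. The names and the pair representation are assumptions for illustration, not the actual Gridfour format.

```java
import java.util.ArrayList;
import java.util.List;

public class CodeLengthRle {

  // Emits (value, runCount) pairs: runs of zeros are collapsed into a single
  // pair, while each non-zero code length is emitted as (length, 1).
  static List<int[]> encode(int[] codeLengths) {
    List<int[]> out = new ArrayList<>();
    int i = 0;
    while (i < codeLengths.length) {
      if (codeLengths[i] == 0) {
        int runStart = i;
        while (i < codeLengths.length && codeLengths[i] == 0) {
          i++;
        }
        out.add(new int[]{0, i - runStart});   // run of absent symbols
      } else {
        out.add(new int[]{codeLengths[i], 1}); // single present symbol
        i++;
      }
    }
    return out;
  }
}
```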
This idea was inspired by a discussion with Mark Adler. See the Stack Overflow discussion "When would Huffman coding outperform Deflate".
There is a chance (perhaps a small one) that the Canonical approach could also improve performance over the existing approach, because it would eliminate the need to build a Huffman tree on the decoding side. We might also be able to exploit the structure of the Canonical form to expedite decoding the body of the compressed data.
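For what it's worth, here is a sketch of table-driven canonical decoding (assumed names, not the existing Gridfour decoder). Because canonical codes of a given length form a contiguous range of integers, decoding reduces to accumulating bits and comparing against the first code of each length, with no explicit tree.

```java
// Sketch of decoding one symbol from canonical-code structure alone.
// lengthCount[len] is the number of symbols with each code length, and
// sortedSymbols lists symbols ordered by (code length, symbol value);
// both are derivable from the stored code lengths.
public class CanonicalDecoder {
  private final int[] lengthCount;   // symbols per code length, index 0 unused
  private final int[] sortedSymbols; // symbols ordered by (length, value)

  CanonicalDecoder(int[] lengthCount, int[] sortedSymbols) {
    this.lengthCount = lengthCount;
    this.sortedSymbols = sortedSymbols;
  }

  /** Reads one symbol, pulling bits one at a time from the supplied source. */
  int decodeSymbol(BitSource in) {
    int code = 0;   // code accumulated so far
    int first = 0;  // first canonical code of the current length
    int index = 0;  // index of the first symbol of the current length
    for (int len = 1; len < lengthCount.length; len++) {
      code |= in.nextBit();
      int count = lengthCount[len];
      if (code - first < count) {
        return sortedSymbols[index + (code - first)];
      }
      index += count;
      first = (first + count) << 1;
      code <<= 1;
    }
    throw new IllegalStateException("invalid Huffman code in input");
  }

  /** Minimal bit-source abstraction assumed for this sketch. */
  interface BitSource {
    int nextBit(); // returns 0 or 1
  }
}
```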
The downside to this approach is that it would break compatibility with the older format (unless we want to preserve both).