
Refactor Huffman CODEC to use Canonical Huffman code #36

Open

gwlucastrig opened this issue Nov 29, 2024 · 1 comment

gwlucastrig (Owner) commented Nov 29, 2024

There is evidence that we could use a Canonical Huffman code to store the code table for Huffman more compactly. Our current approach requires a fixed 10 bits per symbol. A Canonical Huffman scheme, enhanced with run-length encoding for non-included symbols, has the potential to reduce that (some testing yielded an average of fewer than 6 bits per code).

This idea was inspired by a discussion with Mark Adler. See the Stack Overflow discussion "When would Huffman coding outperform Deflate".

There is a chance (perhaps a small one) that the Canonical approach could also improve performance over the existing approach, because it would eliminate the need to build a Huffman tree on the decoding side. We might also be able to exploit the structure of the Canonical form to expedite decoding the body of the compressed data.
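
As a sketch of how tree-free decoding could work (all names here are hypothetical, not part of the current Gridfour API): because canonical codes of a given length are consecutive integers, a decoder only needs the first code, the code count, and the first symbol index for each length.

```java
import java.io.IOException;

/**
 * Minimal sketch of canonical Huffman decoding without an explicit tree.
 * All class and method names here are hypothetical, not Gridfour API.
 */
public final class CanonicalDecodeSketch {

  /** Supplies one bit at a time from the compressed stream (hypothetical). */
  public interface BitSource {
    int nextBit() throws IOException; // returns 0 or 1
  }

  /**
   * Decodes one symbol. Because canonical codes of a given length are
   * consecutive integers, a code is recognized by comparing the bits
   * accumulated so far against the first code of each length.
   *
   * @param bits       source of compressed bits
   * @param firstCode  smallest canonical code of each length
   * @param count      number of codes of each length
   * @param firstIndex index into symbols[] of the first symbol of each length
   * @param symbols    symbols sorted by (code length, symbol value)
   * @param maxLength  longest code length in the table
   */
  static int decodeSymbol(BitSource bits, int[] firstCode, int[] count,
      int[] firstIndex, int[] symbols, int maxLength) throws IOException {
    int code = 0;
    for (int len = 1; len <= maxLength; len++) {
      code = (code << 1) | bits.nextBit();
      int offset = code - firstCode[len];
      if (offset >= 0 && offset < count[len]) {
        return symbols[firstIndex[len] + offset];
      }
    }
    throw new IOException("invalid Huffman code in stream");
  }
}
```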

The downside to this approach is that it would break compatibility with the older format (unless we want to preserve both).

gwlucastrig (Owner) commented Dec 4, 2024

The main advantage of Canonical Huffman is that it may provide a way of reducing the size of the codebook (encoding tree, etc.) and thus improve overall compression. It does not affect the size of the actual encoded data. The current Gridfour implementation requires a fixed 10 bits for each unique symbol included in a source data set. So the goal is to use Canonical Huffman to create a codebook that requires less space than that.

Typically, the codebook for Canonical Huffman is stored (in some compact form) as a series of code lengths. These give the number of bits needed to represent each unique symbol in the Huffman code. Canonical Huffman uses these code lengths to reconstruct the Huffman tree (in some efficient form) that was used to encode the source data. So the goal here is to see whether the Canonical Huffman code lengths can be stored in less space than the existing Gridfour method requires.
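
For reference, here is a sketch of the standard reconstruction from code lengths (the same recurrence Deflate uses, per RFC 1951); the method name is hypothetical:

```java
/**
 * Sketch (hypothetical method, not Gridfour API): rebuild canonical codes
 * from a code-length table. lengths[i] is the code length in bits for
 * symbol i, or zero if the symbol does not occur in the data.
 */
static int[] assignCanonicalCodes(int[] lengths) {
  int maxLen = 0;
  for (int len : lengths) {
    maxLen = Math.max(maxLen, len);
  }
  // Count how many codes there are of each length (zero lengths excluded).
  int[] countByLength = new int[maxLen + 1];
  for (int len : lengths) {
    if (len > 0) {
      countByLength[len]++;
    }
  }
  // Compute the first (smallest) code of each length using the
  // standard canonical recurrence.
  int[] nextCode = new int[maxLen + 1];
  int code = 0;
  for (int len = 1; len <= maxLen; len++) {
    code = (code + countByLength[len - 1]) << 1;
    nextCode[len] = code;
  }
  // Hand out consecutive codes to symbols in ascending symbol order.
  int[] codes = new int[lengths.length];
  for (int symbol = 0; symbol < lengths.length; symbol++) {
    if (lengths[symbol] > 0) {
      codes[symbol] = nextCode[lengths[symbol]]++;
    }
  }
  return codes;
}
```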

To test this idea, I used the Canonical Huffman algorithm to produce a set of integer code lengths for a data set that was organized as bytes. While a byte can take on 256 unique values, this particular data set had only 128 unique symbols. The test stored code lengths for all possible byte states in an array of 256 integers. For those byte states that were not included in the source data, the code length was recorded as zero. For this particular data set, the maximum code length was 14 (other data sets might have larger code lengths).
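
As an illustration of how such a length table can be produced, the sketch below walks a Huffman tree and records the depth of each leaf. The Node type is a hypothetical stand-in for whatever tree representation the encoder actually uses:

```java
/** Hypothetical minimal tree node; a stand-in for the encoder's own type. */
static final class Node {
  final int symbol;       // valid only for leaves
  final Node left, right; // null for leaves
  Node(int symbol) { this.symbol = symbol; this.left = null; this.right = null; }
  Node(Node left, Node right) { this.symbol = -1; this.left = left; this.right = right; }
  boolean isLeaf() { return left == null; }
}

/**
 * Sketch: fill lengths[] (one entry per possible byte value, pre-zeroed)
 * with the depth of each leaf in the Huffman tree. Symbols that never
 * appear in the data keep a length of zero.
 */
static void collectCodeLengths(Node node, int depth, int[] lengths) {
  if (node.isLeaf()) {
    lengths[node.symbol] = Math.max(depth, 1); // guard the one-symbol case
  } else {
    collectCodeLengths(node.left, depth + 1, lengths);
    collectCodeLengths(node.right, depth + 1, lengths);
  }
}
```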

Naturally, storing 256×8 bits to represent the integer code-length values would take more space than the 10 bits per unique symbol currently implemented. So, for this test, we used the existing implementation to Huffman-compress the 256 code lengths. Again, many of these code lengths were zero, so the overall code-length set compressed quite readily.

Findings:

  • Bits to encode Huffman tree using existing implementation: 1287
  • Bits to encode set of code lengths for Canonical Huffman: 784

Indications:
From these values, we get the idea that Canonical Huffman may require only about 60 percent as much storage as the existing implementation (784 / 1287 ≈ 0.61). Our current implementation averages about 790.7 bits to encode the Huffman tree for each of the 7774 tiles in the ETOPO1 data set, for a total of roughly 6.1 megabits (about 0.77 megabytes) of storage overhead. So it is possible that Canonical Huffman might reduce that to about 3.7 megabits.

So it appears that further investigation is worth pursuing.

Next Step:

I encoded the code lengths using the existing implementation because it was available and convenient. That constitutes a hybrid approach to Canonical Huffman. The next step is to see whether a purely Canonical implementation improves on these results.
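
One candidate for the purely Canonical encoding is the run-length idea mentioned at the top of this issue: since most of the 256 entries are zero, runs of zeros can be collapsed before the lengths are written. The sketch below uses a simple, assumed pairing scheme (it is not the scheme Deflate uses, which has dedicated length codes 16-18 for this purpose):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of a simple (assumed) run-length scheme for the code-length
 * table: each zero run is emitted as the pair (0, runLength), collapsing
 * the long stretches of absent symbols; nonzero lengths pass through
 * as (length, 1). Illustrative only, not a committed format.
 */
static List<int[]> runLengthEncodeLengths(int[] lengths) {
  List<int[]> out = new ArrayList<>();
  int i = 0;
  while (i < lengths.length) {
    if (lengths[i] == 0) {
      int runStart = i;
      while (i < lengths.length && lengths[i] == 0) {
        i++;
      }
      out.add(new int[]{0, i - runStart}); // zero marker plus run length
    } else {
      out.add(new int[]{lengths[i], 1});
      i++;
    }
  }
  return out;
}
```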
