Consider encoding str/seq lengths as varints #75
Comments
(note: when I talk about varints, I'm not necessarily talking about the protobuf implementation, just the idea of variable-length-encoded integers in general). I've actually been thinking about this for a while. Not only would varints be useful for the lengths of sequences, they could also come in handy to compress the contents of items that are compressed. Think about this data structure:
With the current bincode scheme, it would be

This encoding scheme could also be applied to enum variant tags which are practically never larger than 2^7.

All of that said, I'm not sure that it's worth it to build into bincode proper. One of the reasons that bincode is as fast as it is, is that most of the operations are compilable to memcopies. It is my hunch that adding varints in the general case would hurt perf. (varints for sequences probably wouldn't be that bad though).

The other inconvenience is that bincode's serialization format has been backwards compatible for a long time. I'm not sure how many people care about that, but it could be an issue. I know that I'm storing "bincode-serialized" font data for a game that I'm working on, which would suddenly break if we were to update.

Maybe it would be nice to have a "sister-crate" that exposed the same API as bincode, but provided number compression by default? That way, someone could switch out their |
Yes, I agree with everything you said. I didn't consider backwards compatibility, and I guess most people wouldn't be happy to re-encode their existing data just to save a few bytes. Frankly, if I were the maintainer of this crate, I wouldn't change the current encoding either. Having a new crate that offers small encodings as its main selling point sounds like a good idea, though I don't know if many people would be interested in such a library. But it sounds tempting nonetheless. Maybe one could even try to come up with a compact and compression-friendly encoding. By compression-friendly I mean storing data in a way that helps compression algorithms improve their compression ratio: saving integer arrays delta-encoded, saving arrays of floats grouped by their bytes, and maybe even applying a Burrows–Wheeler transform to strings... stuff like that. Btw, how did you calculate
where the first byte is the varint-encoded length and the following bytes are the varint-encoded vector elements (1:1 mapping as each value fits into 7 bits)? |
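For readers unfamiliar with the idea, here is a minimal sketch of the layout described above: a varint length followed by varint elements. It assumes the common unsigned LEB128 scheme (7 payload bits per byte, high bit as a continuation flag) and `u64` elements. The function names are made up for illustration and are not part of bincode.

```rust
/// Append `value` to `out` as an unsigned LEB128 varint:
/// 7 payload bits per byte, high bit set on every byte except the last.
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // last byte: continuation bit clear
            return;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

/// Read one varint from the front of `input`.
/// Returns the value and the number of bytes consumed,
/// or `None` if the input is truncated or does not fit in a u64.
fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A u64 needs at most 10 varint bytes (9 * 7 = 63 bits, plus 1 bit).
    for (i, &byte) in input.iter().enumerate().take(10) {
        let payload = (byte & 0x7f) as u64;
        if i == 9 && payload > 1 {
            return None; // the 10th byte may only carry the top bit
        }
        value |= payload << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Hypothetical sequence layout: varint length, then each element as a varint.
fn encode_seq(items: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(items.len() as u64, &mut out);
    for &item in items {
        write_varint(item, &mut out);
    }
    out
}
```

With this sketch, a three-element sequence whose values all fit into 7 bits takes 4 bytes: one for the length and one per element.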
I'll ask around and see if any of the people currently using bincode would like to see a "bincode-small" crate created. If there's interest, I'll make one with this improvement! The "naive" scheme that I came up with was encoding [tag:u8, value: u8*] where the tag always took up a full byte. Not the best encoding by far, but still better than current compression. |
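One possible reading of the `[tag:u8, value: u8*]` scheme, sketched below, is that the tag byte holds the number of value bytes that follow, with the value stored little-endian and its high zero bytes dropped. That interpretation is an assumption, and so are the function names; the comment above does not spell out the exact layout.

```rust
/// Write `value` as a one-byte tag (count of value bytes, 0..=8)
/// followed by that many little-endian bytes.
/// Assumed layout, for illustration only.
fn write_tagged(value: u64, out: &mut Vec<u8>) {
    let bytes = value.to_le_bytes();
    // Bytes left after dropping the all-zero most significant bytes.
    let len = 8 - (value.leading_zeros() / 8) as usize;
    out.push(len as u8);
    out.extend_from_slice(&bytes[..len]);
}

/// Read one tagged integer from the front of `input`.
/// Returns the value and the number of bytes consumed,
/// or `None` if the tag is invalid or the input is truncated.
fn read_tagged(input: &[u8]) -> Option<(u64, usize)> {
    let (&tag, rest) = input.split_first()?;
    let len = tag as usize;
    if len > 8 || rest.len() < len {
        return None;
    }
    let mut bytes = [0u8; 8];
    bytes[..len].copy_from_slice(&rest[..len]);
    Some((u64::from_le_bytes(bytes), 1 + len))
}
```

Under this reading, a value below 256 costs 2 bytes (tag plus one value byte) instead of 8, while a value that needs all 8 bytes costs 9. That trade-off is why a full-byte tag is simple to decode but not the tightest encoding.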
I might also be able to make this a cargo feature instead of a separate crate. |
Forgive my naïve question, but presumably there's a version number stored in the header of the resulting file that could be switched upon? If this new varint scheme was all-around-better, you could still read old and new files and only write new ones. |
@shepmaster There is no version specifier inserted by bincode at the moment. |
I guess this and #45 are worth a major version bump, no need to introduce version numbers. |
This issue is for str/seq length so I filed #157 to apply this to enum discriminants, which I would like to see in 1.0. |
Minor consideration but it should factor into a decision on this: variable length encoding for seq/map length would make it harder to support seq/map with a size that is not known up front. #167 |
How about passing the len size as an option to serialize/deserialize. The default can remain as is without breaking compatibility, and an option for VarInt can be added as well. |
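A rough sketch of what such an option could look like is below. The `LengthEncoding` enum and the `write_len` / `write_str` helpers are hypothetical names, not bincode's API, and the fixed variant assumes an 8-byte little-endian length.

```rust
/// How sequence and string lengths are written (hypothetical option).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LengthEncoding {
    /// Fixed 8-byte length; little-endian byte order is assumed here.
    Fixed8,
    /// Variable-length (LEB128-style) length.
    Varint,
}

impl Default for LengthEncoding {
    /// Keep the existing fixed-width format as the default
    /// so that previously written data stays readable.
    fn default() -> Self {
        LengthEncoding::Fixed8
    }
}

/// Write a length prefix using the chosen encoding.
fn write_len(len: usize, enc: LengthEncoding, out: &mut Vec<u8>) {
    match enc {
        LengthEncoding::Fixed8 => out.extend_from_slice(&(len as u64).to_le_bytes()),
        LengthEncoding::Varint => {
            let mut v = len as u64;
            while v >= 0x80 {
                out.push(((v & 0x7f) as u8) | 0x80); // continuation bit set
                v >>= 7;
            }
            out.push(v as u8); // final byte, continuation bit clear
        }
    }
}

/// Write a string as a length prefix followed by its UTF-8 bytes.
fn write_str(s: &str, enc: LengthEncoding, out: &mut Vec<u8>) {
    write_len(s.len(), enc, out);
    out.extend_from_slice(s.as_bytes());
}
```

In this sketch the two-character string `"hi"` takes 10 bytes with `Fixed8` and 3 bytes with `Varint`, which is the overhead for short strings that the issue is about. The reader has to be configured with the same option as the writer, since nothing in the output records which encoding was used.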
Closing in favor of #319 |
Have you considered storing the lengths of strings and sequences as Varints? The fixed 8-byte encoding of the length adds quite some space overhead when you store many shorter strings.
Just an idea. I could create a pull request if you like it.