Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider encoding str/seq lengths as varints #75

Closed
slyrz opened this issue Jun 5, 2016 · 11 comments
Closed

Consider encoding str/seq lengths as varints #75

slyrz opened this issue Jun 5, 2016 · 11 comments
Milestone

Comments

@slyrz
Copy link
Contributor

slyrz commented Jun 5, 2016

Have you considered storing the lengths of strings and sequences as Varints? The fixed 8-byte encoding of the length adds quite some space overhead when you store many shorter strings.

Just an idea. I could create a pull request if you like it.

@TyOverby
Copy link
Collaborator

TyOverby commented Jun 6, 2016

(note: when I talk about varints, I'm not necessarily talking about the protobuf implementation, just the idea of variable-length-encoded integers in general).

I've actually been thinking about this for a while. Not only would varints be useful for the lengths of sequences, they could also come in handy to compress the contents of items that are compressed. Think about this data structure:

let v: Vec<u64> = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

With the current bincode scheme, it would be 704 = 64 (length) + 64 * 10 (content) bytes long. By using even a naive varint, this could be compressed down to 176 = 16 (length) + 16 * 11 (content) bytes. A slightly more intelligent varint implementation that merges the tag and number for encoding numbers that don't use anything past 2^7 bits could compress down to 88 = 8 (length) + 8 * 10 (content) bits, an 8x improvement.

This encoding scheme could also be applied to enum variant tags which are practically never larger than 2^7.

All of that said, I'm not sure that it's worth it to build into bincode proper. One of the reasons that bincode is as fast as it is, is that most of the operations are compilable to memcopies. It is my hunch that adding varints in the general case would hurt perf. (varints for sequences probably wouldn't be that bad though).

The other inconvenience is that bincode's serialization format has been backwards compatible for a long time. I'm not sure how many people care about that, but it could be an issue. I know that I'm storing "bincode-serialized" font data for a game that I'm working on, which would suddenly break if we were to update.

Maybe it would be nice to have a "sister-crate" that exposed the same API as bincode, but provided number compression by default? That way, someone could switch out their extern crate bincode for extern crate bincode_small as bincode and everything just works?

@slyrz
Copy link
Contributor Author

slyrz commented Jun 9, 2016

Yes, I agree with everything you said. I didn't consider backwards compability, and I guess most people wouldn't be happy to re-encode their existing data just to save a few bytes. Frankly, if I were the maintainer of this crate, I wouldn't change the current encoding either.

Having a new crate that offers small encodings as its main selling point sounds like a good idea, though I don't know if many people would be interested in such a library. But it sounds tempting nonetheless. Maybe one could even try to come up with a compact and compression-friendly encoding. And with compression-friendly I mean saving data in a way that might help compression algorithms to improve their compression ratio: saving integer arrays delta-encoded, saving arrays of floats grouped by their bytes and maybe even applying Burrows–Wheeler transform on strings... stuff like that.

Btw. how did you calculate 16 bits in your example? Couldn't you serialize the vector as

[u8; 11] = [11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

where the first byte is the varint-encoded length and the following bytes are the varint-encoded vector elements (1:1 mapping as each value fits into 7 bits)?

@TyOverby
Copy link
Collaborator

TyOverby commented Jun 9, 2016

I'll ask around and see if any of the people currently using bincode would like to see a "bincode-small" crate created. If there's interest, I'll make one with this improvement!

The "naive" scheme that I came up with was encoding [tag:u8, value: u8*] where the tag always took up a full byte. Not the best encoding by far, but still better than current compression.

@TyOverby
Copy link
Collaborator

TyOverby commented Jun 9, 2016

I might also be able to make this a cargo feature instead of a separate crate.

@shepmaster
Copy link

The other inconvenience is that bincode's serialization format has been backwards compatible for a long time.

Forgive my naïve question, but presumably there's a version number stored in the header of the resulting file that could be switched upon? If this new varint scheme was all-around-better, you could still read old and new files and only write new ones.

@TyOverby
Copy link
Collaborator

@shepmaster There is no version specifier inserted by bincode at the moment.

@arthurprs
Copy link

I guess this and #45 are worth a major version bump, no need to introduce version numbers.

@TyOverby TyOverby modified the milestone: unknown Feb 23, 2017
@dtolnay
Copy link
Contributor

dtolnay commented Apr 28, 2017

This issue is for str/seq length so I filed #157 to apply this to enum discriminants, which I would like to see in 1.0.

@dtolnay
Copy link
Contributor

dtolnay commented Apr 28, 2017

Minor consideration but it should factor into a decision on this: variable length encoding for seq/map length would make it harder to support seq/map with a size that is not known up front. #167

@aeyakovenko
Copy link

How about passing the len size as an option to serialize/deserialize. The default can remain as is without breaking compatibility, and an option for VarInt can be added as well.

@ZoeyR
Copy link
Collaborator

ZoeyR commented Apr 16, 2020

Closing in favor of #319

@ZoeyR ZoeyR closed this as completed Apr 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants