Consider encoding str/seq lengths as varints #75

slyrz · 2016-06-05T16:00:17Z

Have you considered storing the lengths of strings and sequences as Varints? The fixed 8-byte encoding of the length adds quite some space overhead when you store many shorter strings.

Just an idea. I could create a pull request if you like it.

TyOverby · 2016-06-06T16:26:08Z

(note: when I talk about varints, I'm not necessarily talking about the protobuf implementation, just the idea of variable-length-encoded integers in general).

I've actually been thinking about this for a while. Not only would varints be useful for the lengths of sequences, they could also come in handy for compressing the contents of the sequences themselves. Think about this data structure:

let v: Vec<u64> = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

With the current bincode scheme, it would be 704 = 64 (length) + 64 * 10 (content) bits long. By using even a naive varint, this could be compressed down to 176 = 16 (length) + 16 * 10 (content) bits. A slightly more intelligent varint implementation that merges the tag and the number when encoding values below 2^7 could compress it down to 88 = 8 (length) + 8 * 10 (content) bits, an 8x improvement.
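The size arithmetic can be checked with a rough sketch like the one below. It is not bincode's code: it assumes a LEB128-style varint (seven payload bits per byte plus a continuation bit) as one possible "merged tag" scheme, and the function names are made up for illustration. Sizes are counted in bytes.

```rust
/// Bytes needed for `n` in a LEB128-style varint (7 payload bits per byte).
/// Assumed scheme for illustration only, not bincode's wire format.
fn varint_len(n: u64) -> usize {
    let mut n = n;
    let mut len = 1;
    while n >= 0x80 {
        n >>= 7;
        len += 1;
    }
    len
}

/// Size with a fixed 8-byte length prefix and 8 bytes per u64 element.
fn fixed_size(v: &[u64]) -> usize {
    8 + 8 * v.len()
}

/// Size when the length prefix and every element are written as varints.
fn varint_size(v: &[u64]) -> usize {
    varint_len(v.len() as u64) + v.iter().map(|&x| varint_len(x)).sum::<usize>()
}
```

For the ten-element vector, `fixed_size` gives 88 bytes (704 bits) and `varint_size` gives 11 bytes (88 bits), which is the 8x ratio.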

This encoding scheme could also be applied to enum variant tags which are practically never larger than 2^7.

All of that said, I'm not sure that it's worth it to build into bincode proper. One of the reasons that bincode is as fast as it is, is that most of the operations are compilable to memcopies. It is my hunch that adding varints in the general case would hurt perf. (varints for sequences probably wouldn't be that bad though).

The other inconvenience is that bincode's serialization format has been backwards compatible for a long time. I'm not sure how many people care about that, but it could be an issue. I know that I'm storing "bincode-serialized" font data for a game that I'm working on, which would suddenly break if we were to update.

Maybe it would be nice to have a "sister-crate" that exposed the same API as bincode, but provided number compression by default? That way, someone could switch out their extern crate bincode for extern crate bincode_small as bincode and everything just works?

slyrz · 2016-06-09T20:58:58Z

Yes, I agree with everything you said. I didn't consider backwards compatibility, and I guess most people wouldn't be happy to re-encode their existing data just to save a few bytes. Frankly, if I were the maintainer of this crate, I wouldn't change the current encoding either.

Having a new crate that offers small encodings as its main selling point sounds like a good idea, though I don't know if many people would be interested in such a library. But it sounds tempting nonetheless. Maybe one could even try to come up with a compact and compression-friendly encoding. By compression-friendly I mean saving data in a way that might help compression algorithms improve their compression ratio: saving integer arrays delta-encoded, saving arrays of floats grouped by their bytes and maybe even applying the Burrows–Wheeler transform to strings... stuff like that.
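As an illustration of the delta-encoding idea only, here is a minimal sketch with hypothetical function names. Each element is replaced by its difference from the previous one, so slowly changing sequences turn into small numbers that varints and general-purpose compressors handle well.

```rust
/// Replace each element with its difference from the previous element
/// (the first element is kept relative to 0). Wrapping arithmetic keeps
/// the transform reversible even when values decrease.
fn delta_encode(values: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Inverse of `delta_encode`: a running wrapping sum of the deltas.
fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut acc = 0u64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

Note that decreasing values produce very large wrapped deltas, so a real design would probably pair this with a signed (zigzag-style) mapping; that part is left out here.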

Btw. how did you calculate 16 bits in your example? Couldn't you serialize the vector as

[u8; 11] = [10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

where the first byte is the varint-encoded length and the following bytes are the varint-encoded vector elements (1:1 mapping as each value fits into 7 bits)?
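A sketch of that layout, assuming a LEB128-style varint (low seven bits first, high bit set on every byte but the last). This scheme and the function names are assumptions for illustration, not anything bincode defines.

```rust
/// Append `n` as a LEB128-style varint. Assumed scheme, not bincode's format.
fn write_varint(out: &mut Vec<u8>, mut n: u64) {
    while n >= 0x80 {
        out.push((n as u8 & 0x7f) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

/// Read one varint starting at `*pos` and advance `*pos` past it.
/// Returns None if the input is truncated or runs past ten bytes.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Varint length prefix followed by varint-encoded elements.
fn encode_seq(v: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(&mut out, v.len() as u64);
    for &x in v {
        write_varint(&mut out, x);
    }
    out
}

/// Inverse of `encode_seq`.
fn decode_seq(buf: &[u8]) -> Option<Vec<u64>> {
    let mut pos = 0;
    let len = read_varint(buf, &mut pos)?;
    let mut v = Vec::new();
    for _ in 0..len {
        v.push(read_varint(buf, &mut pos)?);
    }
    Some(v)
}
```

Encoding the vector 1..=10 this way yields 11 bytes: the length 10 followed by the ten values, each in a single byte.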

TyOverby · 2016-06-09T21:35:21Z

I'll ask around and see if any of the people currently using bincode would like to see a "bincode-small" crate created. If there's interest, I'll make one with this improvement!

The "naive" scheme that I came up with was encoding [tag: u8, value: u8*] where the tag always took up a full byte. Not the best encoding by far, but still better than the current encoding.
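One possible reading of that naive scheme, sketched with hypothetical names: the tag byte holds the number of value bytes that follow, and the value is stored little-endian with leading zero bytes trimmed. The exact layout is an assumption; the comment above only says the tag takes a full byte.

```rust
/// Encode `n` as [tag, value bytes...], where `tag` is the count of
/// little-endian value bytes that follow (always at least one).
fn encode_tagged(n: u64) -> Vec<u8> {
    let bytes = n.to_le_bytes();
    // Index of the last non-zero byte + 1; zero still uses one value byte.
    let used = bytes.iter().rposition(|&b| b != 0).map_or(1, |i| i + 1);
    let mut out = vec![used as u8];
    out.extend_from_slice(&bytes[..used]);
    out
}

/// Decode one tagged value; returns the value and the bytes consumed.
fn decode_tagged(buf: &[u8]) -> Option<(u64, usize)> {
    let used = *buf.first()? as usize;
    if used == 0 || used > 8 {
        return None;
    }
    let payload = buf.get(1..1 + used)?;
    let mut bytes = [0u8; 8];
    bytes[..used].copy_from_slice(payload);
    Some((u64::from_le_bytes(bytes), 1 + used))
}
```

Under this reading every value below 256 costs two bytes (16 bits), so the length plus ten small elements comes to 22 bytes, or 176 bits.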

TyOverby · 2016-06-09T21:41:31Z

I might also be able to make this a cargo feature instead of a separate crate.

shepmaster · 2016-07-17T19:00:23Z

> The other inconvenience is that bincode's serialization format has been backwards compatible for a long time.

Forgive my naïve question, but presumably there's a version number stored in the header of the resulting file that could be switched upon? If this new varint scheme was all-around-better, you could still read old and new files and only write new ones.

TyOverby · 2016-07-18T16:19:21Z

@shepmaster There is no version specifier inserted by bincode at the moment.

arthurprs · 2016-08-01T18:23:42Z

I guess this and #45 are worth a major version bump, no need to introduce version numbers.

dtolnay · 2017-04-28T17:15:16Z

This issue is for str/seq length so I filed #157 to apply this to enum discriminants, which I would like to see in 1.0.

dtolnay · 2017-04-28T18:42:55Z

Minor consideration but it should factor into a decision on this: variable length encoding for seq/map length would make it harder to support seq/map with a size that is not known up front. #167
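One way to read this concern, offered as an interpretation rather than something spelled out in the thread: with a fixed-width prefix, a serializer writing into a Vec<u8> can reserve eight placeholder bytes, stream the elements, and patch the real count in afterwards. A rough sketch, with a hypothetical function name and little-endian chosen for illustration:

```rust
/// Write items from an iterator whose length is not known up front,
/// using a fixed 8-byte length prefix that is patched in at the end.
fn write_seq_backpatched<I: Iterator<Item = u8>>(out: &mut Vec<u8>, items: I) {
    let len_pos = out.len();
    out.extend_from_slice(&[0u8; 8]); // placeholder for the length
    let mut count: u64 = 0;
    for item in items {
        out.push(item);
        count += 1;
    }
    // Overwrite the placeholder now that the element count is known.
    out[len_pos..len_pos + 8].copy_from_slice(&count.to_le_bytes());
}
```

With a varint prefix the placeholder width depends on the final count, so the same trick would need either buffering the elements or shifting the payload after the fact.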

aeyakovenko · 2018-07-29T02:18:26Z

How about passing the len size as an option to serialize/deserialize? The default can remain as is without breaking compatibility, and an option for VarInt can be added as well.
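A sketch of what such an option could look like. `LengthEncoding`, `write_len` and `encode_str` are hypothetical names, not part of bincode's API, and the varint variant assumes a LEB128-style encoding.

```rust
/// Hypothetical option selecting how lengths are written.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LengthEncoding {
    /// Fixed 8-byte little-endian length (the compatible default).
    Fixed8,
    /// LEB128-style variable-length integer (assumed scheme).
    Varint,
}

impl Default for LengthEncoding {
    fn default() -> Self {
        LengthEncoding::Fixed8
    }
}

/// Write a length prefix using the selected encoding.
fn write_len(out: &mut Vec<u8>, len: u64, enc: LengthEncoding) {
    match enc {
        LengthEncoding::Fixed8 => out.extend_from_slice(&len.to_le_bytes()),
        LengthEncoding::Varint => {
            let mut n = len;
            while n >= 0x80 {
                out.push((n as u8 & 0x7f) | 0x80);
                n >>= 7;
            }
            out.push(n as u8);
        }
    }
}

/// Length-prefixed string bytes, with the prefix style chosen by the caller.
fn encode_str(s: &str, enc: LengthEncoding) -> Vec<u8> {
    let mut out = Vec::new();
    write_len(&mut out, s.len() as u64, enc);
    out.extend_from_slice(s.as_bytes());
    out
}
```

Callers that pass `LengthEncoding::default()` keep the fixed-width output, so existing data stays readable; the reader has to be told which option was used, since nothing in the bytes records it.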

ZoeyR · 2020-04-16T04:09:00Z

Closing in favor of #319

TyOverby modified the milestone: unknown Feb 23, 2017

dtolnay mentioned this issue Apr 28, 2017

Support serializing to Vec<u8> with unknown seq/map length #167

Closed

maciejhirsz mentioned this issue Jun 8, 2019

Varint enum tags and lengths behind a feature #271

Closed

tkaitchuck mentioned this issue Jan 14, 2020

Support variable length encoding pravega/bincode2#5

Open

ZoeyR mentioned this issue Mar 14, 2020

Varint enum tags and lengths #306

Merged

ZoeyR mentioned this issue Apr 16, 2020

Varint encoding for discriminants/lengths #319

Closed

ZoeyR closed this as completed Apr 16, 2020
