Zlib #396

Open
KOLANICH opened this issue Jan 8, 2021 · 3 comments

@KOLANICH
Contributor

KOLANICH commented Jan 8, 2021

meta:
  id: deflate_stream
  title: ZLib-compressed blocks
  application: zlib
  xref:
    justsolve: Zlib
    mime: application/zlib
    rfc: 1951
    wikidata: Q2712
  endian: le
  bit-endian: le

doc: |
  Blocks compressed with zlib - a compression format designed by Mark Adler

doc-ref:
  - https://github.com/madler/zlib
  - https://github.com/golang/go/blob/f90e89e/src/compress/flate/inflate.go#L301
WiP: "https://gist.github.com/generalmimon/0f202457ebe8f1556293d611a949c358"
@generalmimon
Member

generalmimon commented Jan 8, 2021

FWIW, I started working on this format. It is commonly called zlib compression, but that is a rather unfortunate designation, because it gives a false impression: zlib is primarily a library for Deflate compression and decompression (see https://zlib.net/).

See this good summary on Stack Overflow by Mark Adler (https://stackoverflow.com/a/20765054):

The zlib library supports Deflate compression and decompression, and three kinds of wrapping around the deflate streams. Those are: no wrapping at all ("raw" deflate), zlib wrapping, which is used in the PNG format data blocks, and gzip wrapping, to provide gzip routines for the programmer. The main difference between zlib and gzip wrapping is that the zlib wrapping is more compact, six bytes vs. a minimum of 18 bytes for gzip, and the integrity check, Adler-32, runs faster than the CRC-32 that gzip uses. Raw deflate is used by programs that read and write the .zip format, which is another format that wraps around deflate compressed data.

It's important to understand that the compression method behind zlib, "raw" deflate and gzip is one and the same; the only difference is the wrapping (envelope).

  • "Raw" deflate has no envelope (it is just a stream of self-terminating chunks with compressed data; used in the ZIP format for example),
  • the zlib envelope has a characteristic 2-byte header (commonly 78 01, 78 9C or 78 DA, but other values are possible; these 2 bytes have their own internal structure) and an Adler-32 checksum at the end,
  • gzip has a 10-byte header at the beginning and a CRC-32 checksum and the size of uncompressed data at the end.

(Note: gzip and zlib headers might have some optional fields after the mandatory part, but that's out of scope of this basic intro.)
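
The three envelopes listed above can be observed with Python's standard zlib module, which selects the wrapping through its wbits parameter. This is only an illustrative sketch: the helper name wrap and the sample data are made up for the example, and the fixed slice offsets assume that no optional header fields are present (which the gzip flags check below confirms for this particular output).

```python
import struct
import zlib


def wrap(data: bytes, wbits: int, level: int = 6) -> bytes:
    """Compress `data`; `wbits` selects the envelope (hypothetical helper)."""
    co = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return co.compress(data) + co.flush()


data = b"the same DEFLATE stream, three envelopes " * 20

raw = wrap(data, -15)  # negative wbits: raw deflate, no envelope
zl = wrap(data, 15)    # 9..15: zlib envelope
gz = wrap(data, 31)    # 16 + 15: gzip envelope

# zlib: 2-byte header (CMF, FLG), deflate data, big-endian Adler-32 trailer.
assert zl[0] == 0x78                      # CM = 8 (deflate), 32 KiB window
assert ((zl[0] << 8) | zl[1]) % 31 == 0   # header check bits (FCHECK)
assert int.from_bytes(zl[-4:], "big") == zlib.adler32(data)

# gzip: 10-byte header, deflate data, little-endian CRC-32 and input size.
assert gz[:3] == b"\x1f\x8b\x08"          # magic bytes + CM = 8 (deflate)
assert gz[3] == 0                         # FLG = 0: no optional header fields
crc, isize = struct.unpack("<II", gz[-8:])
assert (crc, isize) == (zlib.crc32(data), len(data))

# With the envelope stripped, each payload is accepted by a raw inflater.
for payload in (raw, zl[2:-4], gz[10:-8]):
    assert zlib.decompress(payload, wbits=-15) == data
```

The envelope sizes seen here (2 + 4 bytes for zlib, 10 + 8 bytes for gzip) match the "six bytes vs. a minimum of 18 bytes" figure from the quoted answer.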

Out of the box, Kaitai Struct supports only process: zlib decompression, which requires the zlib header to be present at the beginning of the compressed data. That's quite unfortunate, because it prevents you from decompressing raw deflate and gzip. It would be better to have a process: deflate that decompresses the raw deflate data, and to parse the zlib and gzip header and "footer" fields with a KSY spec.
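
To make that split concrete, here is a rough Python sketch of the same division of labour: the zlib envelope is parsed "by hand" (the part a KSY spec would describe), and only the payload is handed to a raw inflater (the part a process: deflate would cover). The names ZlibHeader and parse_zlib_envelope are invented for this illustration and are not part of Kaitai Struct or kaitai_compress; the header layout follows RFC 1950, and preset dictionaries are deliberately not handled.

```python
import zlib
from typing import NamedTuple, Tuple


class ZlibHeader(NamedTuple):
    """Fields of the 2-byte zlib header (hypothetical container type)."""
    cm: int       # compression method, 8 = deflate
    cinfo: int    # log2(window size) - 8
    fdict: bool   # preset dictionary present
    flevel: int   # compression level hint, 0..3


def parse_zlib_envelope(buf: bytes) -> Tuple[ZlibHeader, bytes]:
    """Parse the zlib header and trailer manually; inflate the payload raw."""
    if len(buf) < 6:
        raise ValueError("too short for a zlib envelope")
    cmf, flg = buf[0], buf[1]
    if ((cmf << 8) | flg) % 31 != 0:
        raise ValueError("FCHECK failed: not a zlib header")
    header = ZlibHeader(
        cm=cmf & 0x0F,
        cinfo=cmf >> 4,
        fdict=bool(flg & 0x20),
        flevel=flg >> 6,
    )
    if header.cm != 8:
        raise ValueError("compression method is not deflate")
    if header.fdict:
        raise ValueError("preset dictionaries are not handled in this sketch")

    # Raw inflate: negative wbits tells zlib not to expect any envelope.
    inflater = zlib.decompressobj(wbits=-15)
    data = inflater.decompress(buf[2:]) + inflater.flush()
    if not inflater.eof:
        raise ValueError("deflate stream is truncated")

    # Whatever follows the deflate stream is the trailer: big-endian Adler-32.
    trailer = inflater.unused_data
    if len(trailer) < 4 or int.from_bytes(trailer[:4], "big") != zlib.adler32(data):
        raise ValueError("Adler-32 mismatch")
    return header, data


header, data = parse_zlib_envelope(zlib.compress(b"hello, deflate " * 10))
assert data == b"hello, deflate " * 10
assert header.cm == 8
```

A gzip envelope could be handled the same way with a different header and trailer parser, while the raw inflate step stays unchanged.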

As for the references you linked: RFC 1950 only describes the zlib header and footer (which tell you nothing useful about the compressed data); the actual compression method is documented in RFC 1951 ("DEFLATE Compressed Data Format Specification version 1.3"). Also, the Wikidata item Q207240 refers to the zlib library, not to the compression format; "DEFLATE - data decompression algorithm" (Q2712) would be more appropriate.

One terminological note, in case someone isn't aware: deflating means "letting air or gas out of a balloon", and DEFLATE means compression (reducing the size); inflating is the opposite, "filling a balloon with air or gas", and refers to decompression. It's quite a funny analogy, I think 😃


My WIP .ksy spec for the deflate stream is here: https://gist.github.com/generalmimon/0f202457ebe8f1556293d611a949c358

I find the RFCs and docs pretty much incomprehensible and impractical (you are often left to devise the specific algorithms yourself), and I don't fancy C code at all (the zlib library), but the Go language has a pretty good and legible DEFLATE implementation. The parsing of the deflated stream starts here: golang/go > src/compress/flate/inflate.go:301

So most of the deflate_stream.ksy spec is actually based on stepping through the Go flate implementation in the VS Code debugger (provided by the Go for Visual Studio Code extension). The debugger is really helpful, because you can follow the algorithm step by step, reimplement a part of it in the KSY and check whether the intermediate values shown in the debugger match those from the KSY. FWIW, this is the application code that I was stepping through: https://play.golang.org/p/E5dLnJRZ4Im

The beginning is pretty simple, but it gets ugly fast. You often need to implement various counters and collector variables. The Go implementation, for example, uses several mutable byte arrays, which have to be converted to an immutable paradigm for use in KSY. I'm not sure the KSY spec can even be finished.
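
For a feel of the "simple beginning", here is a small Python sketch (not taken from the gist or from the Go code) that walks a raw deflate stream consisting only of stored blocks. Each block starts with a 3-bit header read LSB-first, which is what bit-endian: le expresses in the KSY: 1 bit BFINAL, then 2 bits BTYPE. A stored block (BTYPE = 0) then skips to the next byte boundary and carries LEN, its one's complement NLEN, and LEN literal bytes, as described in RFC 1951. The function name inflate_stored_only is made up, and the sketch assumes every block header starts on a byte boundary, which holds only while all preceding blocks are stored. The Huffman-coded block types, where the counters and collectors come in, are not attempted.

```python
import zlib


def inflate_stored_only(raw: bytes) -> bytes:
    """Decode a raw deflate stream made only of stored blocks (sketch)."""
    out = bytearray()
    pos = 0
    while True:
        header = raw[pos]
        bfinal = header & 1            # bit 0: last block flag
        btype = (header >> 1) & 0b11   # bits 1-2: block type
        if btype != 0:
            raise NotImplementedError("only stored blocks (BTYPE=0) are handled")
        pos += 1  # the remaining 5 bits are padding up to the byte boundary

        length = int.from_bytes(raw[pos:pos + 2], "little")
        nlength = int.from_bytes(raw[pos + 2:pos + 4], "little")
        if length ^ nlength != 0xFFFF:
            raise ValueError("LEN and NLEN are not one's complements")
        pos += 4

        out += raw[pos:pos + length]   # stored data is copied verbatim
        pos += length
        if bfinal:
            return bytes(out)


# Compression level 0 makes zlib emit stored blocks only; wbits=-15 drops the envelope.
data = b"no compression, just framing " * 4000
co = zlib.compressobj(0, zlib.DEFLATED, -15)
raw = co.compress(data) + co.flush()
assert inflate_stored_only(raw) == data
```

The sample input is longer than 65535 bytes, the maximum a single stored block can hold, so the loop has to handle more than one block.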

I don't think I'm going to actively work on the spec in the near future, so if anyone feels like taking it over, please let me know.

@KOLANICH
Contributor Author

KOLANICH commented Jan 8, 2021

Out of the box, Kaitai Struct supports only process: zlib decompression, which requires the zlib header to be present at the beginning of the compressed data. That's quite unfortunate, because it prevents you from decompressing raw deflate and gzip. It would be better to have a process: deflate that decompresses the raw deflate data, and to parse the zlib and gzip header and "footer" fields with a KSY spec.

kaitai_compress has a PR fixing that for Python.

As for the references you linked: RFC 1950 only describes the zlib header and footer (which tell you nothing useful about the compressed data); the actual compression method is documented in RFC 1951 ("DEFLATE Compressed Data Format Specification version 1.3"). Also, the Wikidata item Q207240 refers to the zlib library, not to the compression format; "DEFLATE - data decompression algorithm" (Q2712) would be more appropriate.

Fixed, thanks.

@generalmimon
Member

generalmimon commented Jan 8, 2021

kaitai_compress has a PR fixing that for Python.

Thanks, I wasn't aware of it. I'll look into it.
