
Parallel PNG encoding #530

Open
Shnatsel opened this issue Nov 9, 2024 · 3 comments

@Shnatsel
Contributor

Shnatsel commented Nov 9, 2024

Every part of the decoding process for PNGs is inherently sequential, but there is some performance to be gained by parallelizing the encoding process.

Zlib compression

We have two Zlib compressors that can be used: fdeflate and flate2.

#478 implemented this for the fdeflate mode.

https://github.com/sstadick/crabz/ provides a parallelizing wrapper around flate2.
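To make the chunked approach concrete, here is a minimal sketch of the idea behind crabz/gzp: each worker compresses an independent chunk of input into deflate blocks with the BFINAL bit clear, and only the very last block sets it, so the per-chunk streams can simply be concatenated into one valid raw deflate stream. For illustration the "compressor" emits stored (uncompressed) deflate blocks, which need no real encoder; a real implementation would emit Huffman-coded blocks. The function name is hypothetical and this is not the API of any of the crates mentioned.

```rust
use std::thread;

/// Hypothetical sketch of chunked parallel deflate. Each worker turns its
/// chunk into a stored deflate block; concatenating the results yields one
/// valid raw deflate stream. Assumes `chunk_size <= 65535` (the
/// stored-block length limit) and `chunk_size > 0`.
fn parallel_deflate_stored(data: &[u8], chunk_size: usize) -> Vec<u8> {
    let chunks: Vec<&[u8]> = data.chunks(chunk_size).collect();
    let n = chunks.len();
    let mut parts: Vec<Vec<u8>> = vec![Vec::new(); n];
    thread::scope(|s| {
        for (i, (chunk, out)) in chunks.iter().zip(parts.iter_mut()).enumerate() {
            s.spawn(move || {
                let len = chunk.len() as u16;
                // First byte: BFINAL in bit 0, BTYPE=00 (stored) in bits 1-2.
                // Only the last chunk's block is marked final.
                out.push(if i == n - 1 { 1 } else { 0 });
                // Stored-block header: LEN, then its one's complement NLEN.
                out.extend_from_slice(&len.to_le_bytes());
                out.extend_from_slice(&(!len).to_le_bytes());
                out.extend_from_slice(chunk);
            });
        }
    });
    // Because no chunk except the last sets BFINAL, simple concatenation
    // produces a single well-formed deflate stream.
    parts.concat()
}
```

A real parallel zlib wrapper additionally computes the Adler-32 checksum over the whole input and prepends the zlib header, but the block-level concatenation above is the part that makes per-chunk parallelism possible.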

Filtering

Filtering can be parallelized fairly easily, since it operates on each row individually. However, the Up and Paeth filters each need not one but two rows: the previous row and the current one.

If we had the entire image in memory at the same time this would be trivial, but implementing it in a streaming fashion may require somewhat more complex logic.

I don't expect huge gains here because filtering is extremely fast already, but it might become relevant once the rest of the process is parallelized.
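The key observation is that PNG filters read the *raw* bytes of the previous row, not its filtered output, so once the whole image is in memory every row can be filtered independently. A minimal sketch of this for the Up filter, using only the standard library (the function name is hypothetical; a real implementation would batch rows rather than spawn one thread per row):

```rust
use std::thread;

/// Apply the PNG "Up" filter to every row in parallel. Each output row
/// depends only on the raw bytes of the current and previous rows, so the
/// rows are independent once the full image is available. One thread per
/// row is spawned purely for clarity.
fn filter_up_parallel(raw: &[u8], width: usize) -> Vec<Vec<u8>> {
    let rows: Vec<&[u8]> = raw.chunks(width).collect();
    let mut out = vec![Vec::new(); rows.len()];
    thread::scope(|s| {
        for (i, slot) in out.iter_mut().enumerate() {
            let rows = &rows;
            s.spawn(move || {
                // The first row has no predecessor; the spec treats it as zeros.
                let prev = if i == 0 { None } else { Some(rows[i - 1]) };
                let mut filtered = Vec::with_capacity(rows[i].len());
                for (j, &b) in rows[i].iter().enumerate() {
                    let p = prev.map_or(0, |r| r[j]);
                    // Up filter: current byte minus the byte directly above.
                    filtered.push(b.wrapping_sub(p));
                }
                *slot = filtered;
            });
        }
    });
    out
}
```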

@fintelia
Contributor

fintelia commented Nov 10, 2024

I think it is worth taking a moment to think about precisely what use case we want to address.

The fdeflate parallelism was able to increase encoding speed from a couple hundred million pixels per second to about one billion pixels per second. It achieves the same compression ratio as the single-threaded version, which is roughly on par with QOI. Helpful if you have lots of image data to encode, but only if you don't care much about the compression ratio.

The gzp approach gets a compression ratio almost as good as the single-threaded one, but each core operates on 128 KB chunks of uncompressed image data. On small images there'd be no speedup at all, and even big images are probably going to be much slower than single-threaded fdeflate.

Streaming encode is another question. It is probably doable to implement, but would be more effort. Is it worth trying to design for it at the same time, or only if there's interest later?

@Shnatsel
Contributor Author

The use case I had in mind was AI image generators. They typically run on beefy machines and need to encode the generated images losslessly, both to store later and to transmit over a network via a web UI. So both encoding speed and compression ratio matter.

Admittedly I'm not building one of those, but I think it is applicable to anything that needs a trade-off between encoding speed and compression ratio. That is, everything that doesn't crank compression up to 9.

Regarding streaming: it only introduces issues for filtering, and I'm not convinced that parallelizing filtering is even profitable. We can implement parallel Gzip compression first, and then evaluate whether parallel filtering helps at all. If we find that it does, we can initially implement it for full-image encoding only, and support streaming later if/when someone asks.

@fintelia
Contributor

That's helpful framing. It probably means images in the single-digit megapixel range: enough to keep a modest number of cores fed with 128 KB chunks of raw data, but not so much that there's any concern about keeping one (or even several) copies of the full image in memory.

In this scenario, there might be some control over which image format to use. If so, it may make more sense to focus on the WebP encoder, which already does much better on the encoding speed vs. compression ratio tradeoff.

The reason to parallelize filtering is probably mostly to make better use of memory bandwidth. Right now, the code filters a row into a temporary buffer and then immediately compresses it, so the filtered image data never exists in memory all at once. In #478, the parallelization applies to both steps, with each core allocating one row's worth of temporary buffer to filter into and then compress.
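That fused filter-then-compress pattern can be sketched as follows. This is a hedged illustration, not #478's actual code: the function name is hypothetical, rows are distributed round-robin among workers, the "Sub" filter stands in for real filter selection, and compression is modeled as a running byte count so the example stays self-contained.

```rust
use std::thread;

/// Hypothetical sketch of fused per-row filter + compress: each worker owns
/// one row-sized scratch buffer, filters a row into it, and immediately
/// consumes it (here, just counting bytes as a stand-in for compression),
/// so the filtered image never exists in memory all at once.
/// Assumes `workers > 0` and non-empty rows.
fn fused_filter_compress(raw: &[u8], width: usize, workers: usize) -> usize {
    let rows: Vec<&[u8]> = raw.chunks(width).collect();
    let mut totals = vec![0usize; workers];
    thread::scope(|s| {
        for (w, total) in totals.iter_mut().enumerate() {
            let rows = &rows;
            s.spawn(move || {
                // One scratch buffer per worker, reused across its rows.
                let mut scratch = vec![0u8; width];
                // Worker `w` takes every `workers`-th row.
                for i in (w..rows.len()).step_by(workers) {
                    // "Sub" filter: each byte minus its left neighbor.
                    scratch[0] = rows[i][0];
                    for j in 1..rows[i].len() {
                        scratch[j] = rows[i][j].wrapping_sub(rows[i][j - 1]);
                    }
                    // Stand-in for compressing `scratch` right away.
                    *total += rows[i].len();
                }
            });
        }
    });
    totals.iter().sum()
}
```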
