Analysis & Discussion: Jpeg & Resize processing pipelines, improvement opportunities #1064
Comments
Thanks for taking the time writing all of the above, it's very informative. 🤯 Focusing on Jpeg decoding optimization for now I would advocate for being as radical as possible. I propose redefining the entire JpegPostProcessor pipeline as three separate implementations based on an integer pipeline. This includes Dequantization, IDCT, Subsampling, and Colorspace transforms.
I would focus on 1 and 3 as a priority and refactor our current floating point implementation to fit via scaling at the beginning as you describe.
The upgrade path from .NET Core 2 to 3 is surprisingly simple for the most part. I recently ported 5 quite complex libraries in a matter of days with very little refactoring required, so while 2.1 has LTS support until August 2021, I believe many customers will have moved on by then. The benefits I see from cleanly sliced implementations are the following:
I appreciate that this is a lot of work but together I think we can do it. I'm also thinking V1 not RC as a milestone since all the APIs are internal.
Now the bitter pill: the total amount of work for replacing the entire pipeline is huge. Think of at least 2-3 man-weeks of full time work, assuming that we know exactly what we are doing (I wouldn't dare to say so about myself). I make these estimations based on my own experiences and by lurking in
Because of the amount of work, I don't think it makes sense to talk about a full scale rewrite within the V1 timeframe. We want to get the library released before 2021. Also: everything being internal doesn't mean that a significant rewrite is an option during a pre-release bugfix cycle. (Regressions, behavioral breaking changes.) There are other problems with aiming for a full-integer pipeline everywhere:
Expect a regression of an order of magnitude. This is basically the stuff we started with in 2016. And it's a big amount of extra work to do it properly.
This is not possible with integers because of missing intrinsics as fundamental as division. Even if we started ImageSharp in 2019, I would say that FP pipelines are valuable and worth implementing:
Summary/TLDR While doing the optimizations, we can improve the understandability of the code by cleaning it up, adding comments, and introducing simplifications in the pipeline where the perf impacts are limited. Eg: at a certain point, we can consider dropping all the EDIT
If you truly believe this then I'm with you all the way and we'll do it your way. You are, by far and away, the performance expert here.
Introduction
Apart from the API simplification, the main intent of #907 was to enable new optimizations: it's possible to eliminate a bunch of unnecessary processing steps from the most common YCbCr Jpeg thumbnail making use-case. As it turned out in #1062, simply changing the pixel type to `Rgb24` is not sufficient: we need to implement the processing pipeline optimizations enabled by the .NET Core 3.0 Hardware Intrinsic API, especially by the shuffle and permutation intrinsics, which allow fast conversion between different pixel type representations and component orders (eg. `Rgba32` <--> `Rgb24`), as well as fast conversion between Planar/SOA and Packed/AOS pixel representations. The latter is important because raw Jpeg data consists of 3 planes representing the YCbCr data, while an ImageSharp `Image` is always packed.
This analysis:
- `Rgb24` slowdown in Add La16 and La32 IPixel formats. #1062
Please let me know if some pieces are still hard to follow. It's worth checking out all the URLs while reading.
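
To make the two layout conversions mentioned above concrete, here is a minimal scalar sketch of an `Rgba32` -> `Rgb24` repack and a Planar/SOA -> Packed/AOS interleave. The class and method names are hypothetical and are not ImageSharp APIs; the loops only show which bytes move where, which is the work the shuffle and permutation intrinsics would do several pixels at a time.

```csharp
using System;

// Hypothetical helpers for illustration only; these are not ImageSharp APIs.
public static class PixelLayoutSketch
{
    // Packed RGBA (4 bytes per pixel) -> packed RGB (3 bytes per pixel).
    // A SIMD version would perform the same repacking with byte shuffles.
    public static void Rgba32ToRgb24(ReadOnlySpan<byte> rgba, Span<byte> rgb)
    {
        int pixels = rgba.Length / 4;
        if (rgb.Length < pixels * 3)
        {
            throw new ArgumentException("Destination is too small.", nameof(rgb));
        }

        for (int i = 0; i < pixels; i++)
        {
            rgb[(3 * i) + 0] = rgba[(4 * i) + 0]; // R
            rgb[(3 * i) + 1] = rgba[(4 * i) + 1]; // G
            rgb[(3 * i) + 2] = rgba[(4 * i) + 2]; // B, the alpha byte is dropped
        }
    }

    // Planar/SOA (three separate component planes) -> packed/AOS (interleaved).
    public static void PlanarToPacked(
        ReadOnlySpan<byte> c0,
        ReadOnlySpan<byte> c1,
        ReadOnlySpan<byte> c2,
        Span<byte> packed)
    {
        if (c1.Length != c0.Length || c2.Length != c0.Length || packed.Length < c0.Length * 3)
        {
            throw new ArgumentException("Plane and destination lengths do not match.");
        }

        for (int i = 0; i < c0.Length; i++)
        {
            packed[(3 * i) + 0] = c0[i];
            packed[(3 * i) + 1] = c1[i];
            packed[(3 * i) + 2] = c2[i];
        }
    }
}
```

Both loops touch every byte individually; the point of the intrinsics is to replace them with a handful of register-wide shuffles.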
TLDR
If you want to hear some good news before reading through the whole thing, jump to the Conclusion part 😄
Why is `Rgb24` post processing slow in our current code?
`YCbCr` -> `TPixel` conversions, the generic case
`JpegImagePostprocessor` is processing the YCbCr data in two steps:
1. Color conversion and packing into `Vector4` RGBA buffers. The two operations are carried out together by the matching `JpegColorConverter`. With the YCbCr colorspace, which has only 3 components, this is already sub-optimal, since the 4th alpha component (`Vector4.W`) is redundant. `Vector4` packing is done with non-vectorized code.
2. Conversion of the `Vector4` buffer to pixel buffer, using the pixel specific implementation.
`Rgba32` vs `Rgb24`
`PostProcessIntoImage<Rgba32>`
`PostProcessIntoImage<Rgb24>`
The difference is that `PixelOperations<Rgba32>.FromVector4()` does not need to do any component shuffling, only narrowing `float` values to `byte`-s, while in `PixelOperations<Rgb24>.FromVector4()`, we first convert the float buffers to `Rgba32` buffers (fast), which is followed by an `Rgba32` -> `Rgb24` conversion using the sub-optimal default conversion implementation. This operation:
If we extended `JpegColorConverter` with a method to pack data into `Vector3` buffers, we could convert `Vector3` data into `Rgb24` data exactly the same way we do the `Vector4` -> `Rgba32` conversion.
Definition of Processing Pipelines
Personally, my memory is terrible and I always need to reverse engineer my own code when we want to understand what's happening and make decisions. The lack of comments and the confusing terminology are also misleading. To get a good overview, it's really important to step back and abstract away implementation details, by thinking about our algorithms as PIPELINES composed of Data States and Transformations, where
This representation is only good for analyzing data flow for a specific configuration, eg. a well defined input image format + decoder configuration + output pixel type. To visualize the junctions, we need DAG-s 🤓.
Current floating point YCbCr Jpeg Color Processing & Resize pipelines, improvement opportunities
Presumptions:
- `netcoreapp2.1` (enables `Vector.Widen`)
- `Vector<T>`-s are in fact AVX2 registers and `Vector<T>` intrinsics are JIT-ed to AVX2 instructions
- `Vector4` operations are JIT-ed to SSE2 instructions
(I.) Converting raw jpeg spectral data to YCbCr planes
`CopyBlocksToColorBuffer`
- `Int16` jpeg components (3 x `Buffer2D<Block8x8>`, Y+Cb+Cr)
- `Int16` -> `Int32` widening and `Int32` -> `float` conversion, both using `Vector<T>`, implemented in `Block8x8F.LoadFrom(Block8x8)`
- `float` jpeg components (3 x `Buffer2D<Block8x8>`, Y+Cb+Cr)
- `Block8x8F.MultiplyInplace(DequantiazationTable)`
- `float` jpeg components (3 x `Buffer2D<Block8x8>`, Y+Cb+Cr)
- `float` jpeg color channels (3 x `Buffer2D<Block8x8>`, Y+Cb+Cr)
- `Vector<T>`. Rounding is needed for better libjpeg compatibility
- `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<Block8x8>`, Y+Cb+Cr)
- `Block8x8.CopyTo()` (super misleading name!)
- `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)
(II. a) Converting the Y+Cb+Cr planes to an `Rgba32` buffer
`Rgba32` buffer, done by `ConvertColorsInto`
- `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)
- `Vector4` buffer
- `Memory<Vector4>`
- `Vector4` buffer to an `Rgba32` buffer. In the `Rgba32` case, the input buffer could be handled as a homogeneous `float` buffer, where all individual `float` values should be converted to `byte`-s. The conversion is implemented in `BulkConvertNormalizedFloatToByteClampOverflows`, utilizing AVX2 conversion and narrowing operations through `Vector<T>`
- `Rgba32` buffer
(II. b) Converting the Y+Cb+Cr planes to an `Rgb24` buffer, current sub-optimal pipeline
`Rgb24` buffer, done by `ConvertColorsInto`
- `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)
- `Vector4` buffer
- `Memory<Vector4>`
- `Vector4` buffer to an `Rgba32` buffer, utilizing `BulkConvertNormalizedFloatToByteClampOverflows`, which uses AVX2 conversion and narrowing operations through `Vector<T>`
- `Rgba32` buffer
- `PixelOperations<Rgb24>.FromRgba32()` (sub-optimal, extra transformation!)
- `Rgb24` buffer
(II. b++) Converting the Y+Cb+Cr planes to an `Rgb24` buffer, IMPROVEMENT PROPOSAL
See #1121
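
As a rough illustration of what skipping the intermediate `Rgba32` buffer of (II. b) means, here is a scalar sketch that goes from three `float` planes normalized to 0-255 straight to packed `Rgb24` bytes. It is not the code from #1121. The class and method names are hypothetical, and the coefficients are the usual JFIF YCbCr -> RGB ones, which I assume match what `JpegColorConverter` uses.

```csharp
using System;

// Hypothetical sketch; not the ImageSharp implementation and not the code from #1121.
public static class YCbCrToRgb24Sketch
{
    // Inputs: three float planes of equal length, each normalized to 0-255.
    // Output: packed R,G,B bytes (3 per pixel), with no alpha and no Rgba32 step.
    public static void Convert(
        ReadOnlySpan<float> y,
        ReadOnlySpan<float> cb,
        ReadOnlySpan<float> cr,
        Span<byte> rgb)
    {
        if (cb.Length != y.Length || cr.Length != y.Length || rgb.Length < y.Length * 3)
        {
            throw new ArgumentException("Plane and destination lengths do not match.");
        }

        for (int i = 0; i < y.Length; i++)
        {
            float cbc = cb[i] - 128f; // center the chroma components around zero
            float crc = cr[i] - 128f;

            rgb[(3 * i) + 0] = ToByte(y[i] + (1.402f * crc));
            rgb[(3 * i) + 1] = ToByte(y[i] - (0.344136f * cbc) - (0.714136f * crc));
            rgb[(3 * i) + 2] = ToByte(y[i] + (1.772f * cbc));
        }
    }

    // Clamp to 0-255, then round half up before narrowing to a byte.
    private static byte ToByte(float v)
    {
        float clamped = Math.Min(255f, Math.Max(0f, v));
        return (byte)(clamped + 0.5f);
    }
}
```

Compared to (II. b), every pixel is written once as 3 bytes instead of being written as 4 bytes and then repacked.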
(III. a) Resize `Image<Rgba32>`, current pipeline
TODO
(III. b) Resize `Image<Rgb24>`, current pipeline
TODO.
Without any change, the current code shall run faster than for `Image<Rgba32>`.
(III. b++) Resize `Image<Rgb24>`, IMPROVEMENT PROPOSAL
TODO
Integer-based SIMD pipelines
Although the Hardware Intrinsic API removes all theoretical barriers to matching other high performance imaging libraries 1:1, for both Jpeg Decoder and Resize, by utilizing AVX2 and SSE2 integer algorithms, there is a big practical challenge: it's very hard to introduce these improvements in an iterative manner.
It's not possible to exchange the elements of the Jpeg pipeline at arbitrary points, because it would lead to insertion of extra `float` <-> `Int16/32` conversions. To overcome this, we should start introducing integer transformations and data states at the beginning and/or at the end of the pipeline. This could be done by replacing the transformations and the data states in subsequent PR-s while moving the `Int16` -> `float` conversion towards the bottom (when starting from the beginning), and the `float` -> `byte` conversion towards the top (when starting from the end). EG:
- `YCbCr24` -> `Rgb24` SIMD conversion first
Conclusion
If we aim for low-hanging fruit, I would start by implementing (II. b++) and (III. b++). After that, we can continue by introducing integer SIMD operations starting at the beginning or at the end of the Jpeg pipeline.
I would also suggest keeping the current floating point pipeline in the codebase as is, to avoid perf regressions for pre-3.0 users. I believe those platforms will still be relevant for many customers for a couple more years.
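
For reference, this is roughly what an integer transformation at the end of the pipeline could compute, written as a scalar sketch of a `YCbCr24` -> `Rgb24` conversion in 16-bit fixed point (the same general approach libjpeg takes). Everything here is an assumption for illustration: the names are hypothetical, `YCbCr24` is taken to mean packed Y,Cb,Cr byte triples, and a real implementation would run the same arithmetic on AVX2/SSE2 integer lanes.

```csharp
using System;

// Hypothetical sketch of an integer-only conversion; not an ImageSharp API.
public static class IntegerYCbCrSketch
{
    private const int ScaleBits = 16;
    private const int Half = 1 << (ScaleBits - 1);

    // Fixed-point versions of the JFIF coefficients.
    private static readonly int CrToR = Fix(1.402);
    private static readonly int CbToG = Fix(0.344136);
    private static readonly int CrToG = Fix(0.714136);
    private static readonly int CbToB = Fix(1.772);

    private static int Fix(double x) => (int)((x * (1 << ScaleBits)) + 0.5);

    private static byte Clamp(int v) => (byte)Math.Min(255, Math.Max(0, v));

    // Input: packed Y,Cb,Cr byte triples. Output: packed R,G,B byte triples.
    public static void YCbCr24ToRgb24(ReadOnlySpan<byte> ycbcr, Span<byte> rgb)
    {
        if (ycbcr.Length % 3 != 0 || rgb.Length < ycbcr.Length)
        {
            throw new ArgumentException("Buffers must hold whole 3-byte pixels.");
        }

        for (int i = 0; i < ycbcr.Length; i += 3)
        {
            int y = ycbcr[i];
            int cb = ycbcr[i + 1] - 128;
            int cr = ycbcr[i + 2] - 128;

            // Adding Half before the arithmetic shift rounds to the nearest integer.
            rgb[i + 0] = Clamp(y + (((CrToR * cr) + Half) >> ScaleBits));
            rgb[i + 1] = Clamp(y + (((-CbToG * cb) - (CrToG * cr) + Half) >> ScaleBits));
            rgb[i + 2] = Clamp(y + (((CbToB * cb) + Half) >> ScaleBits));
        }
    }
}
```

No `float` appears anywhere in this step, which is why it can only be slotted in once the data state feeding it is integer as well.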