-
-
Notifications
You must be signed in to change notification settings - Fork 889
Fix decoding tiff image with BigEndian + 64 bit / pixel + associated alpha #3032
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| return TPixel.FromRgba64(new Rgba64((ushort)(r / a), (ushort)(g / a), (ushort)(b / a), a)); | ||
| float scale = 65535f / a; | ||
| ushort ur = (ushort)Math.Min(r * scale, 65535); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure min is required here. It's certainly safer but if I can avoid it, I'd like to as it adds significant overhead per-pixel.
I didn't write this original code but I'm actually a little concerned about performance. It looks like we can massively optimize in the calling code also. In all the callers we do a per-pixel conditional check against hasAssociatedAlpha instead of separate loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to find a way to improve performance for this, I suppose I can allocate a buffer and re-order all the bytes first with Ssse3.Shuffle or similar and then just use the same code for both BigEndian an LittleEndian. Doing this will remove the need of ColorFromRgba64Premultiplied here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the .Min() in theory it should not be useful, only for kind of corrupted image the result might differ.
So I tried to just split the loop into 2, I'm getting a ~8-9% speed gain with it
I also tried to convert the endianess first with SIMD operations (like with Avx.Shuffle / Sse3.Shuffle) then applying the same code as the one for little endian, but I'm getting worst result. I also did some performance test of little endian case, and the big endian scenario is actually faster than the code handling the little endian case.
| Method | Mean | Error | StdDev | Median | Allocated |
|------------------------------------------ |---------:|--------:|--------:|---------:|----------:|
| Rgba16161616TiffColorSplit | 124.3 us | 2.19 us | 5.33 us | 122.5 us | 1.4 KB |
| Rgba16161616TiffColorOriginal | 136.1 us | 2.42 us | 5.10 us | 133.8 us | 1.4 KB |
| Rgba16161616TiffColorOriginalLittleEndian | 374.0 us | 7.24 us | 7.75 us | 373.8 us | 1.4 KB |
| Rgba16161616TiffColorSimd | 393.0 us | 7.71 us | 9.76 us | 394.4 us | 1.49 KB |
So it seems the performance of the current code is not that bad, I pushed the benchmark here if you are interested https://github.com/Socolin/ImageSharp/commits/benchmark-tiff/
To compare BigEndian vs LittleEndian code I did another benchmark:
| Method | AssociatedAlpha | IsBigEndian | Mean | Error | StdDev | Median | Allocated |
|--------------------------- |---------------- |------------ |---------:|--------:|---------:|---------:|----------:|
| Rgba16161616TiffColorSplit | False | False | 376.6 us | 7.26 us | 10.86 us | 373.6 us | 1.4 KB |
| Rgba16161616TiffColorSplit | False | True | 121.9 us | 2.43 us | 4.96 us | 119.5 us | 1.4 KB |
| Rgba16161616TiffColorSplit | True | False | 727.4 us | 9.43 us | 8.36 us | 727.9 us | 1.49 KB |
| Rgba16161616TiffColorSplit | True | True | 194.0 us | 3.78 us | 4.78 us | 193.1 us | 1.49 KB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this might work...
Here:
(uint)component * 65535Uavoids overflow in the multiply.+ (a >> 1)gives a simple integer rounding to nearest instead of truncation.- The cast to
ushortis safe because the numerator is bounded so the division result is in[0, 65535]under the premultiplied invariant.
// Inputs are premultiplied, so 0 <= r,g,b <= a <= 65535.
// (component * 65535) / a therefore always fits in [0, 65535],
// so no explicit clamp is required.
ushort ur = (ushort)(((uint)r * 65535U + (a >> 1)) / a);
ushort ug = (ushort)(((uint)g * 65535U + (a >> 1)) / a);
ushort ub = (ushort)(((uint)b * 65535U + (a >> 1)) / a);
return TPixel.FromRgba64(new Rgba64(ur, ug, ub, a));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saddly this give the same problem as removing the Math.Min
This is a nice trick to avoid the conversion to float, but I'm not sure this is needed, I check the performance to see if this would help, and even if there is we can find a trick for the overflow it's already slower than Math.Min(), and readability is lost.
| Method | AssociatedAlpha | IsBigEndian | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|---|
| Rgba16161616TiffColorSplit | True | True | 196.8 us | 3.90 us | 5.33 us | 195.5 us | 1.49 KB |
| Rgba16161616TiffColorBitTwidding | True | True | 205.5 us | 3.61 us | 5.29 us | 203.0 us | 1.49 KB |
And a benchmark targeting only the method (1000 executions)
(Edit: I did a typo and the first result were wrong, but I still have similar result)
| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| ColorFromRgba64PremultipliedMathMin | 3.020 us | 0.0139 us | 0.0123 us | - |
| ColorFromRgba64PremultipliedBitTwidding | 3.967 us | 0.0185 us | 0.0173 us | - |
With COMPlus_JitDisasm=*ColorFromRgba64Premultiplied* I can see that Math.Min is replaced at runtime with a single instruction: vminss so it seems to be already well optimized by the JIT
vmulss xmm0, xmm0, dword ptr [rbp-0x14] ;; Multiply alpha with r and store in xmm0
vmovups xmm1, xmmword ptr [reloc @RWD16] ;; Load the value 65535 in xmm1
vminss xmm0, xmm1, xmm0 ;; xmm0 = Math.Min(xmm0, xmm1)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's keep the min for safety and just use separate calling loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed the changes for the separate loops, I think this is ready to be merged now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, looks good. Can you please do me a favour and remove the macos-13 matrix entries from build-and-test.yml? The images have been retired by GitHub so the entries will block publishing. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
6174279 to
1a22af3
Compare
…void a condtion for each pixel
JimBobSquarePants
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant. Thanks!


Prerequisites
Description
This fix #3031 the method
ColorFromRgba64Premultipliedseems to do the wrong things, I'm not 100% sure about the new code but it seems to works and kind of make sense. I'm trying to find another implementation as a reference to confirm this is the right things to do (But so far all the C library I checked are so hard to read I was not able to find confirmation...)