Image decoding speed (was Re: Cooperative preemption of image decoding)


Tim Ansell

Oct 13, 2015, 9:32:47 PM
to Matt Sarett, no...@chromium.org, Graphics-dev, scheduler-dev
Hi Matt,

As I mentioned, I'm no expert on image decoding. 

Noel (no...@chromium.org) is the expert in image decoding speed that I've been talking to (CCed on this email). I'll ask him to take a look at your benchmark and give feedback on it.

A bunch of questions I had from looking at your document:
 * How are you doing the memcpy test?
 * SkJpegCodec.cpp seems to be reasonably complicated. While your profile claims that 90% of the time is spent in libjpeg-turbo, I'm a little worried that more is going on than just decoding.
 * I have no idea if you are using the "optimal" settings when calling libjpeg-turbo or not.
 * I have no idea how your libjpeg-turbo compares to the Blink version.

On 14 October 2015 at 06:37, Matt Sarett <msa...@google.com> wrote:
I did some work to run performance benchmarks for jpeg decoding as compared to memcpy.  I'm seeing that even for the largest images, the jpeg decodes appear to be about 5x slower on z620 and at least 10x slower on Nexus 6.
 
 Intuitively, this makes sense to me, given that a jpeg decode is a pretty complex, multi-step process (entropy decoding, IDCT, upsampling, color conversion).

From a purely theoretical point of view, a reasonably complicated decompression algorithm can be faster than memcpy. See the reasoning below:

A memcpy has to read and then write every byte of an image. This is a purely linear operation which doesn't benefit from cache, so it is bound by the raw speed of your memory IO. The total DDR memory IO speed is shared between write and read. This means that the top memcpy speed should be roughly half the memory IO speed.

A decompression algorithm has smaller input data but should produce the same output data. The access to the input data is non-linear but generally follows a pattern of a very "hot" section and a "cold" section. The "hot" section benefits strongly from the CPU cache. This means (at least in theory) that with the reduced input memory IO there is more IO available for writing the output. Therefore you can actually have a higher speed!
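The bandwidth argument above can be written as a rough bound. This is an idealized sketch, not a measurement: the symbols $N$ (decoded image size in bytes), $B$ (total memory bandwidth, shared by reads and writes) and $r$ (compression ratio, so the compressed input is $N/r$ bytes) are introduced here for illustration, and the model assumes the decoder's working set stays in cache and that computation costs nothing.

```latex
\begin{align*}
t_{\text{memcpy}}^{\min} &= \frac{N + N}{B} = \frac{2N}{B}, \\
t_{\text{decode}}^{\min} &= \frac{N/r + N}{B} = \frac{N}{B}\left(1 + \frac{1}{r}\right), \\
\frac{t_{\text{memcpy}}^{\min}}{t_{\text{decode}}^{\min}} &= \frac{2}{1 + 1/r} = \frac{2r}{r + 1}.
\end{align*}
```

For $r = 10$ this gives $20/11 \approx 1.8$, approaching 2 as $r$ grows. So under these assumptions a decode could be at best a little under twice as fast as memcpy, and only if it were purely memory bound.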

Zeroing the memory could almost be thought of as a very efficient compression algorithm and should be roughly double the speed of a memcpy as it doesn't have to read any data at all. This should give you an absolute *top speed* of the memory IO. Maybe we should add a "memset(0)" test to your benchmark and see how it compares to the memcpy benchmark?
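A minimal sketch of what such a memset-versus-memcpy comparison could look like in plain C. This is not the nanobench harness used for the benchmark in this thread; `time_op`, `bench_op` and `g_sink` are hypothetical names, and `clock()` (CPU time) stands in for a proper timer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Volatile sink: reading the destination into it after every pass is a
   best-effort way to stop the compiler from eliding the copies. */
static volatile unsigned char g_sink;

typedef enum { OP_MEMCPY, OP_MEMSET } bench_op;

/* Runs `iters` passes of the chosen operation over a `bytes`-sized buffer.
   Returns the elapsed CPU time in seconds, or a negative value if the
   buffers could not be set up. */
double time_op(bench_op op, size_t bytes, int iters)
{
    assert(iters > 0);
    if (bytes == 0)
        return -1.0;

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page up front so page faults are not timed. */
    memset(src, 0xAB, bytes);
    memset(dst, 0xCD, bytes);

    clock_t start = clock();
    for (int i = 0; i < iters; ++i) {
        if (op == OP_MEMCPY)
            memcpy(dst, src, bytes);   /* reads N bytes, writes N bytes */
        else
            memset(dst, 0, bytes);     /* writes N bytes, reads nothing */
        g_sink = dst[(size_t)i % bytes];
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To compare the two, call `time_op(OP_MEMCPY, bytes, iters)` and `time_op(OP_MEMSET, bytes, iters)` with the same arguments, using a buffer much larger than the last-level cache (the size of a large decoded image, say) so that the result reflects memory IO rather than cache speed.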

I have no idea if this ever happens in practice because of things like multiple memory controllers, memory banking and other issues.

(BTW, because memory full of zeros is needed so often, many OSes "cheat" in this area. They often keep pre-zeroed pages ready, or use a "generate on read" type of system. There is even hardware acceleration for this!)

Just food for thought.

 I'm sharing a doc with my procedure and results in case anyone wants to look closer.
Jpeg Decode vs memcpy Comparison

If anyone has any suggestions on the design of the benchmark or has any performance results that show a different conclusion, I would definitely be interested to hear their thoughts!

Matt

On Tuesday, October 13, 2015 at 2:50:58 AM UTC-4, Stojiljkovic, Aleksandar wrote:
Hello,
>For JPEG (and for WebP) decoding is approximately the same speed as doing a memcpy of the image!
Is some more data available for this - regarding behavior on different platforms and with different images?
Which decoder libraries are used?
It would be especially interesting to get numbers for decoding tiles and downscaled versions of images.

Thanks.
Kind Regards,
Aleksandar 


Tim 'mithro' Ansell 

Tim Ansell

Oct 13, 2015, 10:06:48 PM
to Matt Sarett, no...@chromium.org, Graphics-dev, scheduler-dev
Hi Matt,

I found a bunch of references which should be useful.

The important WebKit bug to look at is https://bugs.webkit.org/show_bug.cgi?id=59670

Removing the extra conversion from RGB to RGBA (which is an effective memcpy) made a 50% reduction in encoding speed.

It does look like Skia has something similar? (I see an SkSwizzler::kRGBX).

Tim 'mithro' Ansell

Matt Sarett

Oct 14, 2015, 9:43:34 AM
to Tim Ansell, no...@chromium.org, Graphics-dev, scheduler-dev
Thanks for your thoughts Tim!  In theory, I agree with most of the points you have made.  I'll do my best to address some of your concerns with the benchmark.

"How are you doing the memcpy test?"

It is a similar set-up to the decoding bench, except the loop in onDraw() makes a single function call:
memcpy(fDstPtr, fSrcPtr, totalBytesInImage)
I've looked at the disassembly to ensure that the work is not being optimized out.  Meaning, I saw that memcpy() was in fact being called in a loop.

"SkJpegCodec.cpp seems to be reasonable complicated, while your profile claims that 90% of the time is spent in libjpeg-turbo I'm a little worried that more is going on than just decoding."

I won't deny that our abstraction may make it a little difficult to identify where we are making calls into libjpeg-turbo.  I am adding a few of the profiling results that I have obtained on z620.  I'll add this to the doc as well.
 26.92%  nanobench  nanobench                   [.] decode_mcu [libjpeg-turbo function for entropy (Huffman) decoding]
 21.29%  nanobench  nanobench                   [.] jsimd_idct_islow_sse2 [libjpeg-turbo function for IDCT, don't be confused by "islow", this is the most optimized version for high-quality decodes, chromium uses this version as well]
 18.91%  nanobench  nanobench                   [.] jsimd_ycc_extbgrx_convert_sse2 [libjpeg-turbo internal color conversion YCC->RGBA]
  7.33%  nanobench  nanobench                   [.] decode_mcu_AC_refine [libjpeg-turbo helper function called by decode_mcu()]
  5.64%  nanobench  nanobench                   [.] jsimd_h2v2_fancy_upsample_sse2 [libjpeg-turbo upsampling step]
  3.85%  nanobench  nanobench                   [.] decompress_onepass [libjpeg-turbo function prepares for IDCT]
  3.56%  nanobench  [kernel.kallsyms]           [k] 0xffffffff8104f45a [?]
  3.51%  nanobench  nanobench                   [.] chromium_jpeg_fill_bit_buffer [libjpeg-turbo accepts input stream, could be affected by SkJpegCodec implementation]
  1.42%  nanobench  libc-2.19.so                [.] memset [?]
  1.35%  nanobench  nanobench                   [.] Sample_RGBx_D8888(void*, unsigned char const*, int, int, int, unsigned int const*) [Irrelevant, called exclusively in benchmark set-up]
  1.17%  nanobench  nanobench                   [.] decode_mcu_AC_first [libjpeg-turbo helper function called by decode_mcu()]
  0.75%  nanobench  nanobench                   [.] SkJpegCodec::onGetPixels(SkImageInfo const&, void*, unsigned long, SkCodec::Options const&, unsigned int*, int*, int*) [SkJpegCodec]


"I have no idea if you are using the "optimal" settings when calling libjpeg-turbo or not."

Note that the sse2 functions from the profile above indicate that we are using the optimized setting for z620.  We have paid similar attention to make sure that we are using the Arm NEON optimizations on Nexus 6.

"I have no idea how your libjpeg-turbo compares to the Blink version."

Blink uses libjpeg-turbo and needs to make all of the same API calls that we make.  I know of two significant differences:
(1) Blink supports "suspending data sources", meaning that they can call into the library with "some" of the encoded data, and, if there is not enough data to fulfill the request, they can repeat the same call later with more data.
(2) In some cases, Blink will decode to YUV instead of RGBA, which would save the color conversion step (third step) in the profile.  But then YUV will need to be converted to RGBA when they draw.  I'm not sure how expensive this is (or if it is free?).
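For a sense of what the YUV to RGBA step involves per pixel, here is a scalar sketch using the full-range JFIF (BT.601) YCbCr equations. It is neither libjpeg-turbo's SIMD routine nor Blink's draw-time conversion; `ycbcr_to_rgba` and `clamp_u8` are hypothetical names, and a real implementation would use fixed-point or SIMD arithmetic rather than doubles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to [0, 255] and round to the nearest integer. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0)
        return 0;
    if (v > 255.0)
        return 255;
    return (uint8_t)(v + 0.5);
}

/* Convert one full-range YCbCr sample triple to an opaque RGBA pixel. */
void ycbcr_to_rgba(uint8_t y, uint8_t cb, uint8_t cr, uint8_t out[4])
{
    assert(out != NULL);
    double c_b = cb - 128.0;  /* chroma is stored with a +128 offset */
    double c_r = cr - 128.0;

    out[0] = clamp_u8(y + 1.402 * c_r);                     /* R */
    out[1] = clamp_u8(y - 0.344136 * c_b - 0.714136 * c_r); /* G */
    out[2] = clamp_u8(y + 1.772 * c_b);                     /* B */
    out[3] = 255;                                           /* A */
}
```

Each output pixel costs a handful of multiply-adds plus clamping, which is the work that either the decoder or the draw path has to pay for.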

"The important WebKit bug to look at is https://bugs.webkit.org/show_bug.cgi?id=59670
Removing the extra conversion from RGB to RGBA (which is an effective memcpy) made a 50% reduction in encoding speed.
It does look like Skia has something similar? (I see an SkSwizzler::kRGBX)."

You'll notice from the profile that we convert directly to RGBA inside libjpeg-turbo.  We use SkSwizzler for some non-JPEG cases and some exceptional cases (sampling for Android, subset decoding), but not in the decode described in this benchmark.  It is worth noting that we could save some decode time if we requested the output in YUV.

"Maybe we should add a "memset(0)" test to your benchmark and see how it compares to the memcpy benchmark?"

I think this is a good idea - to get a theoretical peak performance for an image decode.

Please continue to follow-up with your thoughts!

Matt

Noel Gordon

Oct 16, 2015, 11:28:08 AM
to Tim Ansell, Matt Sarett, Graphics-dev, scheduler-dev
On 14 October 2015 at 13:06, Tim Ansell <mit...@mithis.com> wrote:
Hi Matt,

I found a bunch of references which should be useful.

The important WebKit bug to look at is https://bugs.webkit.org/show_bug.cgi?id=59670

Removing the extra conversion from RGB to RGBA (which is an effective memcpy) made a 50% reduction in encoding speed.

Back in the day, libjpeg6b and libjpeg-turbo would decode to RGB and the Blink decoders would need to swizzle that decoded pixel data to BGRA (or RGBA on Android) when writing to the decoded image buffer.  libjpeg-turbo was 3x faster than libjpeg6b for a decode.

We changed libjpeg-turbo, adding the JCS_EXTENSIONS for BGRX and RGBX upstream, which moved the swizzle step into libjpeg-turbo, and this allowed us to decode directly into Blink's decoded image buffers, and to ditch the memory copy RGB->BGRA|RGBA step.  libjpeg-turbo was 6x faster than libjpeg6b with that change.

So the advice is don't copy memory in and around a turbo image decode, and use the JCS_EXTENSIONS to swizzle decode directly into your target buffer, to maximize decode performance.
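To make the avoided step concrete, below is a sketch of the kind of separate RGB to BGRA expansion pass a caller needs when the decoder emits packed 3-byte RGB. `expand_rgb_to_bgra` is a hypothetical name, not Blink's or Skia's code. With libjpeg-turbo's extension color spaces (for example, setting `out_color_space` to `JCS_EXT_BGRX` before starting the decompress), the decoder writes 4-byte pixels itself and a pass like this is no longer needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand packed RGB (3 bytes per pixel) into BGRA (4 bytes per pixel).
   On top of the decode itself, this reads 3 bytes and writes 4 bytes
   for every pixel: effectively one more copy of the whole image. */
void expand_rgb_to_bgra(const uint8_t *rgb, uint8_t *bgra, size_t pixel_count)
{
    assert(rgb != NULL && bgra != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        bgra[4 * i + 0] = rgb[3 * i + 2]; /* B */
        bgra[4 * i + 1] = rgb[3 * i + 1]; /* G */
        bgra[4 * i + 2] = rgb[3 * i + 0]; /* R */
        bgra[4 * i + 3] = 0xFF;           /* opaque alpha */
    }
}
```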

If some other part of our system copies the decoded pixel rows elsewhere, then that copy cost will be significant relative to the image decode cost. Consider the image on the right here:


a progressive JPEG.  Each draw copies the wxh image frame SkBitmap, which might not be marked immutable until at least the first frame is fully decoded, into a SkImage these days. The first image frame of that image draws at least 10 times for me - that's 10 wxh decoded image frame copies by the looks.

I wonder how much that costs, and how it impacts folks in EM where bandwidth is not so fast, or where progressive JPEG is recommended for image delivery.

The advice is not about memcpy() per se: it's about not copying memory at all.

~noel

Matt Sarett

Oct 16, 2015, 11:52:26 AM
to Noel Gordon, Tim Ansell, Graphics-dev, scheduler-dev
"libjpeg-turbo was faster than libjpeg6b for a decode."

+1 for libjpeg-turbo instead of libjpeg

"Ditch the memory copy RGB->BGRA|RGBA step"

+1 for converting YUV directly to RGBA or BGRA using turbo SIMD

"The advice is not about memcpy() per say: it's about not copying memory at all."

+1 for avoiding unnecessary memory copies

Dongseong Hwang

Oct 21, 2015, 7:58:03 AM
to Graphics-dev, no...@chromium.org, mit...@mithis.com, schedu...@chromium.org


On Friday, October 16, 2015 at 6:52:26 PM UTC+3, Matt Sarett wrote:
"libjpeg-turbo was faster than libjpeg6b for a decode."

+1 for libjpeg-turbo instead of libjpeg

"Ditch the memory copy RGB->BGRA|RGBA step"

+1 for converting YUV directly to RGBA or BGRA using turbo SIMD

A few months ago, vangelis mentioned that "We also have some ongoing work to do the YUV -> RGB colorspace conversion in a shader which will speed up decoding some." for GPU rasterization.
It is worth noting that video decoding for some formats works like this: the video is decoded to YUV planes, which are then converted in a GLSL shader.
Can someone summarize how Blink converts YUV to RGBA in various configurations?

Noel Gordon

Nov 2, 2015, 9:03:24 PM
to Dongseong Hwang, Graphics-dev, Tim Ansell, scheduler-dev
On 21 October 2015 at 22:58, Dongseong Hwang <dongseo...@intel.com> wrote:


On Friday, October 16, 2015 at 6:52:26 PM UTC+3, Matt Sarett wrote:
"libjpeg-turbo was faster than libjpeg6b for a decode."

+1 for libjpeg-turbo instead of libjpeg

"Ditch the memory copy RGB->BGRA|RGBA step"

+1 for converting YUV directly to RGBA or BGRA using turbo SIMD

A few months ago, vangelis mentioned that "We also have some ongoing work to do the YUV -> RGB colorspace conversion in a shader which will speed up decoding some." for GPU rasterization.

Yes, the Blink JPEG decoder has accelerated decode support when GPU rasterization is enabled - refer to https://crbug.com/413001#c31 onwards
 
It is worth noting that video decoding for some formats works like this: the video is decoded to YUV planes, which are then converted in a GLSL shader.
Can someone summarize how Blink converts YUV to RGBA in various configurations?

The same shader is used for the accelerated video path [1] and when drawing video into an accelerated HTML <canvas> [2], and (I think) libyuv is used for the software video path [3].


~noel