OpenJPEG Performance

rodri...@gmail.com

Nov 13, 2007, 10:10:02 AM

to OpenJPEG

Hi,
We are working on a video streaming application based on JPEG 2000, and
we are thinking about using OpenJPEG for it. The problem is that we
are not getting good decompression performance: 250 ms for a 768x576
3-band image, compared to 70 ms with j2k-codec and 150 ms with
Jasper.
Is there any way to improve this performance? We are just
decompressing the same way as in the j2k_to_image example. We
have tried compile-time optimizations, but the results were not much
better.
Thanks
Rodrigo Cilla Ugarte
Carlos III University, Madrid

Dzonatas

Nov 13, 2007, 10:40:34 AM

to open...@googlegroups.com

Hi!

I posted a patch to another group to help speed up decompression across
multiple processors. It is a very preliminary patch to test Intel's
Threading Building Blocks.

I'll post an update soon to this list. It should speed up the
encode process also. I still need to build a test suite before I
consider it no longer preliminary.

On a multi-core machine, it is expected to decode in 75% of the normal
time *at least*. For example, if it normally takes 10 seconds on a single
core, then it should only take 1/16 of that on a 16-core CPU (about .06
seconds or 10FPS). Intel's Core 2 Duo can present 4 main physical
threads, so that means 1/4 of the time (equivalent to quad-core). That
should bring it down to about 60 ms (from the 250 ms on the 768x576).
These expectations are without SSE vectorization, so there is potential
for even further optimizations with SSE vectorization beyond the
multi-processor scalability.

Again, this is preliminary, but the work is embarrassingly parallel and
the scaling is practical.

Cheers!

--
Power to Change the Void

François-Olivier Devaux

Nov 13, 2007, 11:09:58 AM

to open...@googlegroups.com

Hi,

Great, Dzonatas, can't wait to see that new patch.
By the way, I've tested the DWT patch you worked on with Callum, and on my machine the DWT time is reduced by up to a factor of 4 for images compressed with the 9x7 kernel. That's really impressive. I should have committed it by the end of the day.

Regarding Rodrigo's question, I would like to know if you're working with release 1.2 or with the SVN version. You should really try the SVN version (see "Download the current working version from svn" in the download section) which is quite stable and much faster.
With the patch that I'll commit today, performance should be comparable to or better than Jasper's, depending on your machine.
We should release version 1.3 in the next few weeks.

By the beginning of next year we are also expecting some very important contributions to the library, both in terms of performance and in the way users can take advantage of JPEG 2000 scalability.

So as you can see, the OpenJPEG project is quite active thanks to several contributors, and you can expect a significant increase in performance.

The indexing functionality now available in the encoder and decoder will be quite useful for your streaming application. Also, did you check the RTP payload format, which is quite interesting for streaming?
http://www.ietf.org/internet-drafts/draft-ietf-avt-rtp-jpeg2000-18.txt

François

Callum Lerwick

Nov 13, 2007, 12:38:06 PM

to open...@googlegroups.com

On Tue, 2007-11-13 at 07:40 -0800, Dzonatas wrote:
> Hi!
>
> I posted a patch to another group to help speed up decompression across
> multiple processors. It is a very preliminary patch to test Intel's
> Threading Building Blocks.
>
> I'll post an update soon to this list. It should speed up the
> encode process also. I still need to build a test suite before I
> consider it no longer preliminary.

I hope you realize that our recent patch drops the DWT time completely
off the charts, at least at non-megaimage sizes. :) Given Amdahl's law
there's relatively little to be gained by just threading the DWT. The
T1/MQC is by far the top CPU user. You might notice in the patch I
submitted I moved the outer T1 loop into tcd_decode_tile(), which allows
the possibility of interleaving the T1 and DWT decode. I'm still
exploring the benefits of this, if any, but you could then very easily
decode each component in parallel.
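
The Amdahl's law point above can be checked with a few lines of arithmetic. This is only a sketch: the 20% DWT share below is an assumed figure chosen for illustration, not a measurement from this thread, and `amdahl_speedup` is a name made up here.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the run time scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed for illustration only: the DWT is 20% of total decode time and
# everything else (T1/MQC included) stays single-threaded.
DWT_SHARE = 0.20

for cores in (2, 4, 16):
    speedup = amdahl_speedup(DWT_SHARE, cores)
    print(f"{cores:2d} cores, DWT threaded only: {speedup:.2f}x overall")
```

Under that assumption the overall gain can never exceed 1 / (1 - 0.20) = 1.25x, however many cores are used, which is why threading the larger T1/MQC stage matters more.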

I pulled out my old dual PII workstation so I could experiment with
OpenMP, but haven't quite got around to it yet. Note that TBB requires
you to convert everything to C++, and OpenMP does not. ;P

And as I mentioned on sldev, given how many zillions of textures Second
Life decodes, just running multiple texture decode threads would be the
cleanest way to use multiple cores in slviewer, rather than having the
OpenJPEG library competing for threads. Given that spawning threads has
a non-trivial fixed cost, threading work really should go from the top
down and not bottom up.

Dzonatas

Nov 13, 2007, 1:33:57 PM

to open...@googlegroups.com

Callum Lerwick wrote:
> I hope you realize that our recent patch drops the DWT time completely
> off the charts, at least at non-megaimage sizes. :) Given Amdahl's law
> there's relatively little to be gained by just threading the DWT. The
> T1/MQC is by far the top CPU user. You might notice in the patch I
> submitted I moved the outer T1 loop into tcd_decode_tile(), which allows
> the possibility of interleaving the T1 and DWT decode. I'm still
> exploring the benefits of this, if any, but you could then very easily
> decode each component in parallel.
>
> I pulled out my old dual PII workstation so I could experiment with
> OpenMP, but haven't quite got around to it yet. Note that TBB requires
> you to convert everything to C++, and OpenMP does not. ;P
>
> And as I mentioned on sldev, given how many zillions of textures Second
> Life decodes, just running multiple texture decode threads would be the
> cleanest way to use multiple cores in slviewer, rather than having the
> OpenJPEG library competing for threads. Given that spawning threads has
> a non-trivial fixed cost, threading work really should go from the top
> down and not bottom up.
>

There is an assumption made with Amdahl's law about what makes up the
100% sequential part(s). If the equation includes parts that are outside
the scope of the algorithm being made scalable, then the result is not
specific to the part being parallelized and can give false predictions.

OpenMP is not supported by all compilers. C++ also shows better general
optimization than plain C, which was not true before. The choice between
OpenMP and C++ is moot because I'll publish the patch under a different
branch, separate from libopenjpeg, so there is no requirement to make
wholesale changes for any parallel scheme. It also means that
libopenjpeg can be used even where TBB has not been ported.

The top-down approach to parallelization means that a new synchronization
scheme needs to be implemented. In the case of Second Life, that would be
a wholesale change throughout Second Life, as it depends heavily on its
sequential flow to create implicit synchronization.

I do realize the changes I made to the DWT in the previous patch (not the
TBB-related patch) speed it up by a variable factor of 2x to 20x, but
that previous patch relied on SSE vectorization to
achieve such performance. TBB does not rely on SSE to achieve similar
performance. TBB also does not require the autovectorization feature
only available in GCC. The effort you put into changing libopenjpeg to
use GCC autovectorization is probably similar to the effort needed to
change it to C++.

Cheers!

Callum Lerwick

Nov 14, 2007, 4:50:05 AM

to open...@googlegroups.com

On Tue, 2007-11-13 at 07:10 -0800, rodri...@gmail.com wrote:
> Hi,
> We are working on a video streaming application based on JPEG 2000, and
> we are thinking about using OpenJPEG for it. The problem is that we
> are not getting good decompression performance: 250 ms for a 768x576
> 3-band image, compared to 70 ms with j2k-codec and 150 ms with
> Jasper.
> Is there any way to improve this performance? We are just
> decompressing the same way as in the j2k_to_image example. We
> have tried compile-time optimizations, but the results were not much
> better.

OpenJPEG is used in "open source" builds of the Second Life viewer, the
official builds use KDU and the performance difference is (was?) very
noticeable. So I've been putting a lot of effort into optimization. As
mentioned, I just submitted a patch originally by Dzonatas and mangled
by me that speeds up the 9-7 DWT decode quite a bit, which is probably
the biggest single performance increase so far. Although Dzonatas only
patched the decoder, doing a similar fp-ization to the encoder should
speed that up nicely as well. :)

I also submitted some patches that decrease memory usage and cut down
the size of the tcd_cblk structure, which had some big static buffers in
it. This measurably speeds things up, especially on machines with smaller
caches.

I should probably do some benchmarks. :) For performance testing I
mostly use the mj2 codec. For the test file I encoded "Speedway.yuv"
with "-I 1".

Decodes are run on an Athlon 64 3000+, a Celeron D 2.1 GHz (P4
architecture), a Mobile Celeron 1.3 GHz (P3 architecture), and a G3
266 MHz. Here's stock svn470, compiled using Fedora 7 gcc 4.1, with -O3:

amd64: Total decoding time: 11.97 seconds (16.7 fps)
p4: Total decoding time: 21.60 seconds ( 9.3 fps)
p3: Total decoding time: 26.79 seconds ( 7.5 fps)
ppc: Total decoding time: 67.10 seconds ( 3.0 fps)

Here's after the fp/vectorization patch, with and without SSE on the
i386 machines (percentages are the reduction in decoding time versus
stock svn470):

amd64: Total decoding time: 9.33 seconds (21.4 fps) 22.1%
p4: Total decoding time: 18.08 seconds (11.1 fps) 16.3%
p4sse: Total decoding time: 17.46 seconds (11.5 fps) 19.2%
p3: Total decoding time: 21.16 seconds ( 9.5 fps) 21.0%
p3sse: Total decoding time: 20.74 seconds ( 9.6 fps) 22.6%
ppc: Total decoding time: 44.14 seconds ( 4.5 fps) 34.2%

Big wins across the board. Biggest win percentage-wise is on PPC, I'm
not entirely sure why. :) As you can see, most of the speedup doesn't
actually come from SSE.
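
For anyone who wants to recompute the percentage column, it is consistent with the reduction in total decoding time relative to the stock build. A small sketch using the non-SSE timings quoted above (the function name is made up for illustration):

```python
def time_reduction_percent(before_s: float, after_s: float) -> float:
    """Percentage of decoding time saved going from before_s to after_s."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return (before_s - after_s) / before_s * 100.0


# Total decoding times in seconds, copied from the two tables above.
stock_svn470 = {"amd64": 11.97, "p4": 21.60, "p3": 26.79, "ppc": 67.10}
fp_patch = {"amd64": 9.33, "p4": 18.08, "p3": 21.16, "ppc": 44.14}

for machine, before in stock_svn470.items():
    saved = time_reduction_percent(before, fp_patch[machine])
    print(f"{machine:>5}: {saved:.1f}% less decoding time")
```

The SSE rows are computed the same way, against the stock time for the same machine.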

And after the late-alloc-early-free and tcd-cblk-nostatic patches:

amd64: Total decoding time: 9.31 seconds (21.5 fps) 0.2%
p4sse: Total decoding time: 17.34 seconds (11.5 fps) 0.7%
p3sse: Total decoding time: 20.23 seconds ( 9.9 fps) 2.5%
ppc: Total decoding time: 43.69 seconds ( 4.6 fps) 1.0%

Percentages are versus the previous patchset. As intended, the P3
Celeron seems to benefit from the reduction in cache footprint. I'm a
little confused as to why the P4 doesn't seem to be benefiting as much.
I suspect that machine may be bottlenecked by the fact that I only have
PC2100 RAM in it. As expected, the amd64 with its large cache isn't
affected much.

Next in the queue I've got some cycle-shaving patches to the T1 and MQC:

amd64: Total decoding time: 8.66 seconds (23.1 fps) 7.0%
p3sse: Total decoding time: 19.91 seconds (10.0 fps) 1.6%

The patches were originally written and tuned for x86-64, thus the big
win there. The patches reduce branching and take advantage of the fact
that up to four 16-bit T1 flags can be manipulated at a time given a
64-bit processor. I was able to translate some of it to MMX intrinsics
on i386. :)
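
To illustrate the idea of handling four 16-bit flag words in one 64-bit operation, here is a minimal sketch. It is not the actual patch: the flag value and helper names are hypothetical, and Python integers stand in for a 64-bit register.

```python
LANES = 4          # four flag words per 64-bit word
LANE_BITS = 16     # each flag word is 16 bits wide
WORD_MASK = (1 << (LANES * LANE_BITS)) - 1


def pack(flags):
    """Pack four 16-bit flag words into one 64-bit integer."""
    word = 0
    for lane, value in enumerate(flags):
        word |= (value & 0xFFFF) << (lane * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit integer back into four 16-bit flag words."""
    return [(word >> (lane * LANE_BITS)) & 0xFFFF for lane in range(LANES)]


def broadcast(bit_mask):
    """Replicate one 16-bit mask into all four lanes."""
    return pack([bit_mask] * LANES)


FLAG_VISITED = 0x0100  # hypothetical flag bit, not OpenJPEG's real layout

word = pack([0x0001, 0x0002, 0x0100, 0x8000])
word |= broadcast(FLAG_VISITED)               # set the bit in all four lanes
set_result = unpack(word)
word &= ~broadcast(FLAG_VISITED) & WORD_MASK  # clear it in all four lanes
clear_result = unpack(word)
```

One OR or AND touches all four flag words, where a 16-bit-at-a-time loop would need four operations and four loop iterations.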

There's also further memory work that I'm still sorting through.

Dzonatas

Nov 14, 2007, 8:48:29 AM

to open...@googlegroups.com

Callum Lerwick wrote:
> OpenJPEG is used in "open source" builds of the Second Life viewer, the
> official builds use KDU and the performance difference is (was?) very
> noticeable. So I've been putting a lot of effort into optimization. As
> mentioned, I just submitted a patch originally by Dzonatas and mangled
> by me that speeds up the 9-7 DWT decode quite a bit, which is probably
> the biggest single performance increase so far. Although Dzonatas only
> patched the decoder, doing a similar fp-ization to the encoder should
> speed that up nicely as well. :)
>

The floating point math should be made optional. That would require a
new API option for a run-time toggle.

Floating point is good for hardware that has it and for display-quality
video, but for archival purposes the dB figure (PSNR) needs to remain
infinite, that is, the round trip must be lossless. Fixed-point math
emulated in software is the only way to guarantee infinite dB. Considering
that SSE (for floating point) can show no performance gain under
cache-pollution conditions, it is wise to keep both the floating-point and
fixed-point methods so that other scalable solutions can gain
performance. =)
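
As a reminder of what "infinite dB" means here: PSNR is infinite only when the decoded samples match the originals exactly, and any rounding difference makes it finite. A minimal sketch, assuming 8-bit samples (peak 255) and a made-up function name:

```python
import math


def psnr_db(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # bit-exact round trip: lossless
    return 10.0 * math.log10(peak * peak / mse)


samples = [10, 20, 30]
lossless = psnr_db(samples, [10, 20, 30])   # exact reconstruction
off_by_one = psnr_db(samples, [10, 21, 30]) # one sample rounded differently
print(lossless, off_by_one)
```

A single sample that is off by one is enough to make the result finite, which is the archival concern with a lossy floating-point path.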

I run my benchmarks on a mildly stressed machine. It
typically runs Windows and Linux concurrently on a single core. Each OS
can easily steal up to 50% of the hardware utilization, or more if the
other is idle. Any "system" time in my benchmarks is easily distorted
under such conditions, so I run the same test several times and take the
average. The single-core setup allows one to calculate the overhead of a
multi-core operation. Given that most software (even the OS) does not
run efficiently on multi-core architectures, live multi-core samples
won't produce isolated benchmarks, but the samples are good for making
sure the results are within the (calculated) limits. Memory utilization
on multi-core architectures is not as consistent a factor as it is on a
single core, and it is usually the reason why such samples fall outside
the limits.

There is a technique to help detect, at run-time, when the results are
not within the limits, so having optional decode/encode methods is useful
for picking the best one for the run-time conditions at hand.
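
The averaging and run-time selection described above could look roughly like this. It is a sketch under assumptions: the two candidate functions are hypothetical stand-ins that just sleep, not real decode paths, and this is not the detection technique the post refers to.

```python
import statistics
import time


def mean_runtime(func, repeats=5):
    """Run func several times and return the mean wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def pick_fastest(candidates, repeats=5):
    """Return (name of the fastest candidate, all mean timings)."""
    timings = {name: mean_runtime(func, repeats)
               for name, func in candidates.items()}
    return min(timings, key=timings.get), timings


def fast_method():   # hypothetical stand-in for one decode path
    time.sleep(0.001)


def slow_method():   # hypothetical stand-in for another decode path
    time.sleep(0.020)


best, timings = pick_fastest({"fast_method": fast_method,
                              "slow_method": slow_method})
print(best, timings)
```

Averaging over several runs smooths out interference from other load on the machine; a median or minimum could be used instead if outliers dominate.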

rodri...@gmail.com

Nov 15, 2007, 5:23:34 AM

to OpenJPEG

> Regarding Rodrigo's question, I would like to know if you're working
> with release 1.2 or with the SVN version. You should really try the SVN
> version (see "Download the current working version from svn" in the
> download section) which is quite stable and much faster.
> With the patch that I'll commit today, performance should be
> comparable to or better than Jasper's, depending on your machine.
> We should release version 1.3 in the next few weeks.
>

I have tried both, but I didn't get better results with the SVN
version. I'm going to try again, because maybe I didn't use it
correctly.
And yes, we were thinking about using RTP. We didn't know about that
draft, and it sounds promising; we are going to think about writing an
implementation.
Thank you very much for all your efforts,
Rodrigo
