DirectShow wrapper for OpenJPEG


Peter Wimmer

Jan 20, 2010, 5:45:09 PM
to open...@googlegroups.com
Hello,

I'm new to this group and hope my question has not been discussed yet (at
least I found no search results).

Is there a DirectShow wrapper for the OpenJPEG library somewhere? I would
like to use OpenJPEG for Motion JPEG 2000 video playback in DirectShow-based
media players.

Best regards,
Peter


Peter Wimmer

Jan 25, 2010, 9:48:07 AM
to open...@googlegroups.com
Ok, no replies, seems nobody has developed a DirectShow filter yet ;-)

Then I'll do it myself...

But maybe somebody can tell me whether OpenJPEG is the best (= fastest)
library for writing an MJPEG2000 video decoder? Or are there faster
libraries out there? I'm only interested in decoding; encoding performance
is not an issue.

Best regards,
Peter

Callum Lerwick

Jan 27, 2010, 6:35:45 AM
to open...@googlegroups.com
On Mon, Jan 25, 2010 at 8:48 AM, Peter Wimmer <pwi...@gmx.at> wrote:
> But maybe somebody can answer me the question if OpenJPEG is the best
> (=fastest) library to write a MJPEG2000 video decoder? Or are there faster
> libraries out there? I'm only interested in Decoding, encoding performance
> is not an issue.

I think OpenJPEG is the fastest open-source j2k codec at this point,
but there are closed-source ones that are significantly faster. I've got
a pretty good idea what needs to happen to hopefully catch up, but I
haven't had a chance to sit down and hack for a while...

Peter Wimmer

Jan 27, 2010, 7:14:24 AM
to open...@googlegroups.com
Thank you for the answer.

During my research I've found the commercial Kakadu library that claims to
be pretty fast, but it is far too expensive. I'd rather spend half a year
improving OpenJPEG than spend tens of thousands of euros.

The only problem is that I'm not a JPEG2000 expert yet. But I'm a
professional software developer and I have the time to do the optimizations
if somebody could guide me on what needs to be done.

Btw, I need the JPEG2000 decoder to play back 3D DCPs with my Stereoscopic
Player (http://www.3dtv.at). Consequently, it is my goal to play 2k JPEG2000
at 48 fps in real-time on a high-end CPU (Intel Core i7).

Best regards,
Peter


Callum Lerwick

Jan 27, 2010, 6:31:42 PM
to open...@googlegroups.com
On Wed, Jan 27, 2010 at 6:14 AM, Peter Wimmer <pwi...@gmx.at> wrote:
> Thank you for the answer.
>
> During my research I've found the commercial Kakadu library that claims to
> be pretty fast, but it is far too expensive. I'll rather spend half a year
> to improve OpenJPEG than spending ten thousands of Euro.

Yeah, Kakadu is quite fast. Hopefully we can beat it someday. :)

> The only problem is that I'm not an expert for JPEG2000 yet. But I'm a
> professional software developer and I have the time to do the optimizations
> if somebody could guide me what needs to be done.

Well, profiling shows that the T1 is taking up most of the time by far
currently. Right now, the inner T1 loops call the MQC (the arithmetic
(de)compressor) to decompress one bit at a time. That means a function
call and basically a massive context switch for every loop around.
Neither the T1 nor the MQC can effectively keep anything in the
registers, and I think this is what's killing performance. The MQC
alone will probably eat all the registers available on an x86.
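To make the per-bit call pattern concrete, here is a minimal C sketch. It is not OpenJPEG code: toy_coder_t, toy_next_bit and the two counting functions are hypothetical names, and the "coder" is just a plain bit reader standing in for something with struct-held state. Variant A calls the coder once per bit through a pointer; variant B runs the same logic with the state in locals, which a compiler can more easily keep in registers. Whether a given compiler does this transformation on its own depends on its inlining decisions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy bit reader; stands in for a coder whose state lives in a struct. */
typedef struct {
    const uint8_t *bp;  /* next byte to load */
    uint32_t c;         /* byte currently being consumed */
    int ct;             /* bits left in c */
} toy_coder_t;

/* One call per bit: every field is read and written back through the pointer. */
int toy_next_bit(toy_coder_t *s) {
    if (s->ct == 0) {
        s->c = *s->bp++;
        s->ct = 8;
    }
    s->ct--;
    return (int)((s->c >> s->ct) & 1u);
}

/* Variant A: the consumer calls the coder once per bit. */
size_t count_ones_per_call(const uint8_t *data, size_t nbits) {
    toy_coder_t s = { data, 0, 0 };
    size_t ones = 0;
    for (size_t i = 0; i < nbits; i++)
        ones += (size_t)toy_next_bit(&s);
    return ones;
}

/* Variant B: same logic, state held in locals for the whole loop. */
size_t count_ones_local_state(const uint8_t *data, size_t nbits) {
    const uint8_t *bp = data;
    uint32_t c = 0;
    int ct = 0;
    size_t ones = 0;
    for (size_t i = 0; i < nbits; i++) {
        if (ct == 0) {
            c = *bp++;
            ct = 8;
        }
        ct--;
        ones += (c >> ct) & 1u;
    }
    return ones;
}
```

Both variants return the same count for the same input; only the place where the state lives differs.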

I think if the arithmetic decompress was done before doing the T1,
both the MQC and the T1 would then be smaller, tighter, independent
loops that would lend themselves far better to optimization, and go a
long way towards speeding things up.

BTW, restructuring efforts like this really should be directed at
OpenJPEG 2.0, don't want the codebase to fragment too much. Also,
MSVC's optimiser sucks and OpenJPEG is much faster compiled with gcc
or Intel C. :)

Matteo Italia

Jan 27, 2010, 6:45:00 PM
to open...@googlegroups.com
Another thing that may be killing performance is the int used for
the pixel values: the majority of images use just 8 bits per component
for each pixel, while OpenJPEG always uses ints to store the values; three
quarters of that memory is usually wasted. This limits the size of the
images that can be loaded, and slows down decoding since it requires
many more memory accesses.
A viable improvement may be to make OpenJPEG allocate arrays of the
smallest type possible for each component, and report the size of each
pixel value; this may make things more complicated for the library's users,
but it's nothing that can't be solved with a little macro.
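As a rough illustration of the memory argument, here is a small C sketch of the arithmetic. The function names are made up for this example, and it assumes a 32-bit int and a three-component 1920 x 1080 frame; it says nothing about what precision the intermediate decoding stages need.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one frame when every sample occupies sample_bytes. */
size_t frame_bytes(size_t width, size_t height, size_t components,
                   size_t sample_bytes) {
    return width * height * components * sample_bytes;
}

/* Smallest of 1, 2 or 4 bytes that can hold a sample of the given bit depth. */
size_t smallest_sample_bytes(unsigned bit_depth) {
    if (bit_depth <= 8) return 1;
    if (bit_depth <= 16) return 2;
    return 4;
}
```

With these assumptions, a frame stored as 32-bit values takes 24,883,200 bytes, against 6,220,800 bytes for 8-bit samples, which is the "three quarters wasted" figure from the post.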

Peter Wimmer

Jan 27, 2010, 7:43:41 PM
to open...@googlegroups.com
Hi,

Here are the first results from performance testing. I've completed a
working prototype of the DirectShow wrapper today.

It plays the 1920 x 1080 DCP with a frame rate of 1 fps.
CPU: Core 2 Quad 2.6 GHz, only one core used.
Compiler: Microsoft Visual Studio 2008

This means the code must get faster by a factor of 50. ;-)

Decoding four frames in parallel is pretty easy, which will improve
performance by a factor of 4.

I've noticed floating point calculations in the library. Would it be
possible to approximate the floating point calculations with integer
calculations?

My next step is to buy a book on JPEG2000 to understand the code. I don't
see any chance to optimize the code without understanding JPEG2000 in
detail.

Best regards,
Peter

Matteo Italia

Jan 27, 2010, 8:02:19 PM
to open...@googlegroups.com
Optimization tip: did you enable "Omit stack frame pointer"? I noticed
that it's not enabled by default in the example projects, but it can
speed up the code a bit, since it frees one register (and registers on x86
are precious stuff). Another thing that may help quite a bit (and is also
disabled by default) is the "Global optimization" setting (note that,
however, the .lib won't be portable this way).

I'm also interested in a book on JPEG2000, or at least in some well-done
documentation, which I couldn't find anywhere on the net. Maybe some
OpenJPEG developer could give us a hint...

Regards,
Matteo

Peter Wimmer

Jan 28, 2010, 8:57:57 PM
to open...@googlegroups.com
Hello,

A first test version of the multi-threaded JPEG2000 Video Decoder DirectShow
filter can be downloaded from:
http://www.3dtv.at/Temp/J2KDecoder.dll

To install the filter, copy it to the hard disk, open a command window with
admin rights and type:
regsvr32.exe J2KDecoder.dll

I've tested it with mxf files from DCPs and the MediaLooks MXF Splitter:
http://www.medialooks.com/products/directshow_filters/mxf_demultiplexer.html

Adding multi-threading improved the performance from 1 fps to 4 fps on the
Core 2 Quad CPU, which is the expected result.

Peter

Callum Lerwick

Jan 29, 2010, 3:52:51 AM
to open...@googlegroups.com
On Wed, Jan 27, 2010 at 5:45 PM, Matteo Italia
<matteoa...@gmail.com> wrote:
> Another thing that may be killing the performances is the int used for the
> pixel values: the majority of images use just 8 bit per component for each
> pixel, while OpenJPEG always uses ints to store the values; 3 quarters of
> that memory is usually wasted. This limits the size of the images that can
> be loaded, and slows down the decoding since it requires much more accesses
> to the memory.

The T1 and DWT are doing fixed point math (or fp) so I don't think
this is possible. I've played with it actually, and I think the j2k
standard itself demands 32-bit accuracy.

It could be taken down to 8-bit later in the pipeline, but that's not
where the time is being spent. It's the T1 dominating execution time.

> A viable improvement may be to make OpenJPEG allocate arrays of the smallest
> type possible for each component, and report the size of each pixel value;
> this may make the library users more complicated, but it's nothing that
> can't be solved with a little macro.

I think eventually OpenJPEG should have an optimized fast-path for the
majority of apps that just want to get RGBA in the end. (Does MJ2K use
YUV or something? That could be done too.) But first, the T1 is what
needs some serious optimization. Everything else is minuscule by
comparison.

Callum Lerwick

Jan 29, 2010, 4:08:46 AM
to open...@googlegroups.com
On Wed, Jan 27, 2010 at 6:43 PM, Peter Wimmer <pwi...@gmx.at> wrote:
> Hi,
>
> Here are the first results from performance testing. I've completed a
> working prototype of the DirectShow wrapper today.
>
> It plays the 1920 x 1080 DCP with a frame rate of 1 fps.
> CPU: Core 2 Quad 2.6 GHz, only one core used.
> Compiler: Microsoft Visual Studio 2008
>
> This means the code must get faster by factor 50.  ;-)
>
> Decoding four frames in parallel is pretty easy, which will improve
> performance by 4.
>
> I've noticed floating point calculations in the library. Would it be
> possible to approximate the floating point calculations with integer
> calculations?

A bunch of work was done to switch the non-reversible DWT to FP in 1.3
:) On typical PC CPUs with decent FPUs and stuff like SSE, FP is
going to be much faster than 32-bit fixed point.

If you can figure out how to do it in 16-bit fixed point, you may have
something, but accuracy is going to suffer.
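For readers unfamiliar with fixed point, here is a minimal C sketch of the idea under stated assumptions: a Q13 format (13 fractional bits) is picked purely for illustration, and to_q13 and q13_mul are hypothetical helpers, not OpenJPEG functions. The value 1.586134342 used in the note below is the commonly quoted magnitude of the first 9/7 lifting coefficient.

```c
#include <stdint.h>

#define Q13_ONE 8192  /* 1.0 in Q13: 13 fractional bits */

/* Convert a non-negative real number to Q13, rounding to nearest. */
int32_t to_q13(double x) {
    return (int32_t)(x * Q13_ONE + 0.5);
}

/* Multiply two Q13 values; the 64-bit intermediate avoids overflow.
   Division (rather than a shift) keeps the result well defined for
   negative operands. */
int32_t q13_mul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) / Q13_ONE);
}
```

Here to_q13(1.586134342) gives 12994, i.e. 12994/8192, which differs from the real coefficient by roughly 5e-5. Fewer fractional bits make that representation error larger, which is the accuracy trade-off mentioned above.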

(I actually did some experimentation with just not bothering to decode
the "lesser" bitplanes in the T1, with significant speedups and
significant quality loss depending on the image. This may be a route to
explore to get that last little bit of performance for realtime
decode, but the T1 still needs proper optimization first...)
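The effect of skipping the least significant bitplanes on a single coefficient can be sketched in C as follows. This is only an illustration of the resulting value, not of how the T1 skips the passes; drop_low_bitplanes is a hypothetical name and the sign is assumed to be handled separately. A real decoder may also adjust the reconstructed value rather than simply leaving the low bits at zero.

```c
#include <stdint.h>

/* Clear the `skipped` least significant bitplanes of a coefficient
   magnitude, as if those planes had never been decoded. */
uint32_t drop_low_bitplanes(uint32_t magnitude, unsigned skipped) {
    if (skipped >= 32)
        return 0;
    return magnitude & ~((UINT32_C(1) << skipped) - 1u);
}
```

For a 12-bit magnitude such as 0xABC, dropping four planes leaves 0xAB0, so the error per coefficient is bounded by 2^skipped - 1.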

Callum Lerwick

Jan 29, 2010, 5:15:20 AM
to open...@googlegroups.com
On Wed, Jan 27, 2010 at 7:02 PM, Matteo Italia
<matteoa...@gmail.com> wrote:
> Optimization tip: did you enable "Omit stack frame pointer"? I noticed that
> it's not enabled by default in the example projects, but it can boost a bit
> the code, since it frees one register (and registers on x86 are precious
> stuff). Another thing that may help quite a bit (and is disabled by default)
> is the "Global optimization" setting, which also is disabled (note that,
> however, the .lib in this way won't be portable).

From the gcc docs:

`-fomit-frame-pointer'
Don't keep the frame pointer in a register for functions that
don't need one. This avoids the instructions to save, set up and
restore frame pointers; it also makes an extra register available
in many functions. *It also makes debugging impossible on some
machines.*

On some machines, such as the VAX, this flag has no effect, because
the standard calling sequence automatically handles the frame
pointer and nothing is saved by pretending it doesn't exist. The
machine-description macro `FRAME_POINTER_REQUIRED' controls
whether a target machine supports this flag. *Note Register
Usage: (gccint)Registers.

Enabled at levels `-O', `-O2', `-O3', `-Os'.

So if you optimize at all, it's already enabled.

> I'm too interested in a book on JPEG2000, or at least on some well-done
> documentation, that I couldn't find anywhere on the net. Maybe some OpenJPEG
> developer could give us a hint...

There's a final draft spec available here:

http://www.jpeg.org/public/fcd15444-1.pdf

But it's not exactly light reading if you're not a PhD in mathematics
and whatnot...

Francois-Olivier Devaux

Jan 29, 2010, 6:52:57 AM
to open...@googlegroups.com
Hi,

Regarding a good book about JPEG 2000, I can recommend this one:

JPEG2000: image compression fundamentals, standards, and practice
By David S. Taubman, Michael W. Marcellin
http://books.google.be/books?id=Nz_xicZ_S68C&lpg=PP1&ots=ygDE4SgIxN&dq=jpeg2000%20book%20taubman&hl=en&pg=PP1#v=onepage&q=jpeg2000%20book%20taubman&f=false
The two authors have actively participated in the design of the standard at the JPEG committee, and the book is very .

There are many good JPEG 2000 overviews, among which this one
http://www.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/paper_ieee_ce_jpeg2000_Nov2000.pdf

Cheers,

François


Matteo Italia

Jan 29, 2010, 7:39:28 AM
to open...@googlegroups.com

> > From the gcc docs:
>
> `-fomit-frame-pointer'
> Don't keep the frame pointer in a register for functions that
> don't need one. This avoids the instructions to save, set up and
> restore frame pointers; it also makes an extra register available
> in many functions. *It also makes debugging impossible on some
> machines.*
>
> On some machines, such as the VAX, this flag has no effect, because
> the standard calling sequence automatically handles the frame
> pointer and nothing is saved by pretending it doesn't exist. The
> machine-description macro `FRAME_POINTER_REQUIRED' controls
> whether a target machine supports this flag. *Note Register
> Usage: (gccint)Registers.
>
> Enabled at levels `-O', `-O2', `-O3', `-Os'.
>
> So if you optimize at all, it's already enabled.
>
My suggestion was about VC++, which does not enable it by default.

> There's a final draft spec available here:
>
> http://www.jpeg.org/public/fcd15444-1.pdf
>
> But it's not exactly light reading if you're not a PHD in mathematics
> and whatnot...
>
Thank you, I'll try to read it.

Regards,
Matteo

Matteo Italia

Jan 29, 2010, 7:39:52 AM
to open...@googlegroups.com
Thank you!

On 29/01/2010 12:52, Francois-Olivier Devaux wrote:
> Hi,
>
> Regarding a good book about JPEG 2000, I can recommend this one:
>
> JPEG2000: image compression fundamentals, standards, and practice
> By David S. Taubman, Michael W. Marcellin
> http://books.google.be/books?id=Nz_xicZ_S68C&lpg=PP1&ots=ygDE4SgIxN&dq=jpeg2000%20book%20taubman&hl=en&pg=PP1#v=onepage&q=jpeg2000%20book%20taubman&f=false
> The two authors have actively participated in the design of the standard at the JPEG committee, and the book is very .
>
> There are many good JPEG 2000 overviews, among which this one
> http://www.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/paper_ieee_ce_jpeg2000_Nov2000.pdf
>
> Cheers,
>

> François



Peter Wimmer

Jan 29, 2010, 6:45:35 PM
to open...@googlegroups.com
> I think if the arithmetic decompress was done before doing the T1,
> both the MQC and the T1 would then be smaller, tighter, independent
> loops that would lend themselves far better to optimization, and go a
> long way towards speeding things up.

Is this really possible? In the code I see lots of mqc_setcurctx calls
between the mqc_decode. So the decode cannot be done in a separate
independent step.
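The dependency can be shown with a toy C sketch. This is not the MQ coder and not OpenJPEG's T1: toy_decode_bit and toy_decode_row are hypothetical, and the "decoder" is a trivial function whose output depends on the context it is given. The point is only that the context for symbol i is derived from the result of decoding symbol i-1, so the decode calls cannot all be issued up front.

```c
#include <stddef.h>

/* Toy stand-in for a context-adaptive decoder: the decoded bit depends
   on the context passed in. This is NOT the MQ coder. */
static int toy_decode_bit(unsigned char coded, int ctx) {
    return (coded ^ ctx) & 1;
}

/* Decode n symbols; the context of symbol i is the value decoded for
   symbol i-1, so each decode must wait for the previous one. */
void toy_decode_row(const unsigned char *coded, int *out, size_t n) {
    int ctx = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = toy_decode_bit(coded[i], ctx);
        ctx = out[i];  /* the next context is only known now */
    }
}
```

In this toy, the coded input 1, 0, 0, 1 decodes to 1, 1, 1, 0; changing the first symbol would change the contexts, and therefore the results, of all the following ones.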

Callum Lerwick

Jan 31, 2010, 4:59:32 AM
to open...@googlegroups.com

I think that's some code mishmash. Been a while since I looked at it
but the j2k spec makes it sound like it can.

Peter Wimmer

Jan 31, 2010, 6:26:16 PM
to open...@googlegroups.com
>>> I think if the arithmetic decompress was done before doing the T1,
>>> both the MQC and the T1 would then be smaller, tighter, independent
>>> loops that would lend themselves far better to optimization, and go a
>>> long way towards speeding things up.
>>
>> Is this really possible? In the code I see lots of mqc_setcurctx calls
>> between the mqc_decode. So the decode cannot be done in a separate
>> independent step.
>
> I think that's some code mishmash. Been a while since I looked at it
> but the j2k spec makes it sound like it can.

Thanks for your answer. I'm only beginning to understand the spec.
I hope I understand it soon and can fix the mishmash. Any help or guidance
is appreciated...

Peter


Peter Wimmer

Feb 2, 2010, 9:01:17 PM
to open...@googlegroups.com
>>> I think if the arithmetic decompress was done before doing the T1,
>>> both the MQC and the T1 would then be smaller, tighter, independent
>>> loops that would lend themselves far better to optimization, and go a
>>> long way towards speeding things up.
>>
>> Is this really possible? In the code I see lots of mqc_setcurctx calls
>> between the mqc_decode. So the decode cannot be done in a separate
>> independent step.
>
> I think that's some code mishmash. Been a while since I looked at it
> but the j2k spec makes it sound like it can.

The spec also mentions the context labels that can be found in the OpenJPEG
implementation. In the flow chart for all coding passes on a code-block
bit-plane, there is an overview of the context labels. I don't see any
chance to do the arithmetic decoding in a separate step.


Peter Wimmer

Feb 8, 2010, 9:59:25 AM
to open...@googlegroups.com
Hi,

Attached are a few modifications of the OpenJPEG library.

j2k.h: Added comments.

opj_malloc: Added define ALLOC_PERF_OPT that allows the caller of the
library to implement malloc, calloc, realloc and free. Allocating large
memory blocks takes a few milliseconds, so allocating and releasing memory
for each frame of an MJPEG2000 stream wastes a lot of performance. The caller
can implement a more efficient memory manager that reuses memory blocks.
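A caller-side memory manager of this kind could look roughly like the following C sketch. It is an assumption-laden illustration: frame_pool_t and its functions are hypothetical, and the sketch does not show how they would be hooked into the library, since the exact hook signatures are not given here. It keeps one block alive between frames and only reallocates when a larger size is requested.

```c
#include <stdlib.h>

/* Hypothetical single-slot pool: keeps one block alive between frames. */
typedef struct {
    void *block;
    size_t capacity;
} frame_pool_t;

/* Return a block of at least `size` bytes, reallocating only when the
   request is larger than anything seen so far. */
void *frame_pool_acquire(frame_pool_t *pool, size_t size) {
    if (size > pool->capacity) {
        void *grown = realloc(pool->block, size);
        if (grown == NULL)
            return NULL;  /* the old block stays valid */
        pool->block = grown;
        pool->capacity = size;
    }
    return pool->block;
}

/* Release the block once, at the end of playback. */
void frame_pool_destroy(frame_pool_t *pool) {
    free(pool->block);
    pool->block = NULL;
    pool->capacity = 0;
}
```

Since video frames in one stream usually need the same buffer sizes, every frame after the first would reuse the existing block instead of going back to the system allocator.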

t1.c: Created different code paths for raw, mqc and mqc+vsc decoding. This
moves many "if" tests from inner to outer loops.
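The general technique of moving a mode test out of an inner loop can be sketched in C like this. The functions are hypothetical and unrelated to the actual t1.c code; a simple "negate" flag stands in for the decoding mode.

```c
#include <stddef.h>

/* Before: the mode is tested for every element. */
long sum_branch_inside(const int *v, size_t n, int negate) {
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (negate)
            acc -= v[i];
        else
            acc += v[i];
    }
    return acc;
}

/* After: the mode is tested once and each path gets its own tight loop. */
long sum_branch_hoisted(const int *v, size_t n, int negate) {
    long acc = 0;
    if (negate) {
        for (size_t i = 0; i < n; i++)
            acc -= v[i];
    } else {
        for (size_t i = 0; i < n; i++)
            acc += v[i];
    }
    return acc;
}
```

Both functions compute the same result; the second one trades some code duplication for inner loops without the per-element test.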

mqc.h, mqc.c: Moved the bitstream parsing that was previously done in
mqc_byte_in to mqc_init_dec. All functions called by mqc_decode are now
macros so that all variables used inside mqc_decode are kept in registers.
To enable the optimization, the define MQC_PERF_OPT must be set. The old
code is still present. The effect of this optimization is smaller than
expected. Further optimizations are necessary. I suggest keeping the MQC
state (the opj_mqc_t structure) entirely in SSE2 registers to avoid any
loads/stores between memory and registers on every call to decode.


The optimizations above improve the performance from 4 to 5 fps when
decoding a DCP on a Core 2 Quad.


Best regards,
Peter

j2k.h
opj_malloc.h
t1.c
mqc.h
mqc.c

Bob Friesenhahn

Feb 8, 2010, 12:42:55 PM
to Peter Wimmer, open...@googlegroups.com
On Mon, 8 Feb 2010, Peter Wimmer wrote:
>
> mqc.h, mqc.c: Moves bitstream parsing that was previously done in
> mqc_byte_in to mqc_init_dec. All functions called by mqc_decode are now
> macros so that all variables uses inside mqc_decode are kept in registers.
> To enable the optimization, the define MQC_PERF_OPT must be set. The old
> code is still present. The effect of this optimization is smaller than
> expected. Further optimizations are necessary. I suggest to keep the MQC
> state (the opj_mqc_t structure) entirely in SSE2 registers to avoid any
> load/stores from memory to registers at any call to decode.

For people really interested in achieving SSE vector optimizations
with modern GCC (in portable C code), it is useful to investigate
these options:

-ftree-vectorize -ftree-vectorizer-verbose=1

where the -ftree-vectorizer-verbose argument controls the amount of
compiler commentary about why code could not be vectorized.

Notes about this may be found at
"http://gcc.gnu.org/projects/tree-ssa/vectorization.html".

Due to the existing rigid pre-defined structure of my software, I
found it difficult to increase the amount of vectorization, but the
proper organization of data and algorithm should reap large rewards if
a supported vectorization pattern is matched.

I also like to use -Wdisabled-optimization, since this causes GCC to
warn if something silly is done such as excessive belief in 'inline'
requests.

Other compilers besides GCC are working on similar features so it
seems reasonable that in the long term, vector optimizations targeting
GCC will apply to some other compilers as well.
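As an illustration of the kind of loop an auto-vectorizer tends to handle well, here is a minimal C99 sketch. scale_add is a hypothetical function, not taken from OpenJPEG or GraphicsMagick: it has a simple counted loop, contiguous accesses, and restrict-qualified pointers so the compiler may assume the arrays do not overlap.

```c
#include <stddef.h>

/* dst[i] += k * src[i] for n elements. The restrict qualifiers promise
   the compiler that dst and src do not alias. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Compiling such a file with the options mentioned above (for example together with -O2) should make GCC report whether the loop was vectorized; newer GCC releases document -fopt-info-vec for similar reports.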

Bob
--
Bob Friesenhahn
bfri...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

Peter Wimmer

Feb 9, 2010, 12:37:06 PM
to open...@googlegroups.com
Here is a slightly modified dwt.c that compiles with Visual C++ 2008 with
__SSE__ defined.

Who is responsible for checking the code into SVN?

Peter


dwt.c

François-Olivier Devaux

Feb 9, 2010, 1:08:34 PM
to open...@googlegroups.com, <openjpeg@googlegroups.com>
Hi Peter,

Thanks for your valuable inputs. I'll check the patches and push them
to the SVN.

Cheers,

Francois


Callum Lerwick

Feb 9, 2010, 6:02:13 PM
to open...@googlegroups.com
On Mon, Feb 8, 2010 at 8:59 AM, Peter Wimmer <pwi...@gmx.at> wrote:
> t1.c: Created different code paths for raw, mqc and mqc+vsc decoding. This
> removes many "if" from inner to outer loops.

Might want to take a look at the patch I posted a while back:

http://groups.google.com/group/openjpeg/browse_thread/thread/4d27b0eaeba39c8a/fd6a1dee88e6cc6f

It does this and more, and gains about 25% speed and should probably
be merged...

Peter Wimmer

Feb 10, 2010, 3:57:08 PM
to open...@googlegroups.com
Thanks for the hint. I already reinvented some of your optimizations,
however, my code is still missing the loop unrolling. I'll also add it and
then send all modified files.

Btw, in the last few days I improved the performance of the MJPEG2000
decoder from 4 fps to 6.7 fps (DCP decoding on a Core 2 Quad).

Peter


Peter Wimmer

Feb 11, 2010, 10:44:58 PM
to open...@googlegroups.com
Hello,

Here are all my modified files.

Performance optimizations:

* Loop unrolling in T1 and other files
* Splitting RAW, MQC and MQC+VSC in T1 into separate code paths
* Inlining MQC
* Eliminated multiplications in loops
* Added several SSE/SSE2 optimizations
* Added possibility to use custom memory manager (very important for video
decoding to reuse buffers)

The speedup is ~75%

Functional improvements:

* Skipping MQC decoding when resolutions are skipped (TODO: also skip T2)
* Added vs_passes option to skip bitplanes (you can skip a few planes of 12
bit images without visible quality loss)


Francois, I hope you can review these changes and check them in soon. I
should mention that I tested with Visual C++ only and only with DCPs. You
can also contact me offlist if you have any question.


Peter

dwt.c
opj_malloc.h
t1.h
t1.c
tcd.c
j2k.c
j2k.h
mct.c
mqc.h
mqc.c
openjpeg.h
opj_includes.h
openjpeg.c

Michael Wood

Feb 12, 2010, 2:06:55 AM
to open...@googlegroups.com
Hi

On 12 February 2010 05:44, Peter Wimmer <pwi...@gmx.at> wrote:
> Hello,
>
> Here are all my modified files.

I am not an OpenJPEG developer and am not qualified to review these
changes, but if you provide the changes as a separate patch per change
(e.g. one patch for loop unrolling, another patch for splitting into
separate code paths, another for inlining MQC, etc.) then it's easier
to review the changes without looking at everything at once. It's
also easier to merge them if other people are making changes and
sending in patches too. Obviously there don't seem to be many people
doing that at the moment, though.

Of course if Francois says it's fine you are welcome to ignore me :)

> Performance optimizations:
>
> * Loop unrolling in T1 and other files
> * Splitting RAW, MQC and MQC+VSC in T1 into separate code paths
> * Inlining MQC
> * Eliminated multiplications in loops
> * Added several SSE/SSE2 optimizations
> * Added possibility to use custom memory manager (very important for video
> decoding to reuse buffers)
>
> The speedup is ~75%

Excellent :)

> Functional improvements:
>
> * Skipping MQC decoding when resolutions are skipped (TODO: also skip T2)
> * Added vs_passes option to skip bitplanes (you can skip a few planes of 12
> bit images without visible quality loss)

Is this change only for lossy compressed images? Or is it something
that might affect the quality of someone using OpenJPEG for decoding
lossless images too?

> Francois, I hope you can review these changes and check them in soon. I
> should mention that I tested with Visual C++ only and only with DCPs. You
> can also contact me offlist if you have any question.

--
Michael Wood <esio...@gmail.com>

Matteo Italia

Feb 12, 2010, 5:44:28 AM
to open...@googlegroups.com
Hello Peter,

wow, this is great news, a 75% speedup is really significant, my compliments!

Regards,
Matteo

Peter Wimmer

Feb 12, 2010, 10:12:00 AM
to open...@googlegroups.com
If required, I can create patch files. I prefer Araxis Merge to compare and
merge files, but I just noticed that it can also create UNIX diff files
(which should be what you want).

My optimizations target the lossy mode only, but the MQC changes also affect
the lossless mode.

Once the current changes are checked in, I'll continue to improve
performance. The following optimizations look most promising to me:

* Migrating the DWT from float to integer code.
* Skipping T2 decoding for all blocks that are not required in
reduced-resolution mode.

If you have other ideas how to improve OpenJPEG performance, let me know.

The Kakadu lib is still twice as fast as OpenJPEG, so there is definitely
potential for improvements ;-)

Peter

Bob Friesenhahn

Feb 12, 2010, 11:35:53 AM
to open...@googlegroups.com
On Fri, 12 Feb 2010, Michael Wood wrote:
>
> I am not an OpenJPEG developer and am not qualified to review these
> changes, but if you provide the changes as a separate patch per change
> (e.g. one patch for loop unrolling, another patch for splitting into
> separate code paths, another for inlining MQC, etc.) then it's easier
> to review the changes without looking at everything at once. It's

It seems most likely that the changes are all intertwined and it is
not possible to split them out like that.

Michael Wood

Feb 12, 2010, 12:58:32 PM
to open...@googlegroups.com
On 12 February 2010 18:35, Bob Friesenhahn <bfri...@simple.dallas.tx.us> wrote:
> On Fri, 12 Feb 2010, Michael Wood wrote:
>>
>> I am not an OpenJPEG developer and am not qualified to review these
>> changes, but if you provide the changes as a separate patch per change
>> (e.g. one patch for loop unrolling, another patch for splitting into
>> separate code paths, another for inlining MQC, etc.) then it's easier
>> to review the changes without looking at everything at once.  It's
>
> It seems most likely that the changes are all intertwined and it is not
> possible to split them out like that.

If that's the case, then please ignore my comment.

--
Michael Wood <esio...@gmail.com>

Michael Wood

Feb 12, 2010, 1:01:13 PM
to open...@googlegroups.com
On 12 February 2010 17:12, Peter Wimmer <pwi...@gmx.at> wrote:
> If required, I can create patch files. I prefer Araxis Merge to compare and
> merge files, but I just noticed that it can also create UNIX diff files
> (which should be what you want).

Well as I said it's not really for me, and since Bob Friesenhahn seems
to think it's not possible to split them up into a patch per change
please ignore my comment about that.

> My optimizations target the lossy mode only, but the MQC changes also affect
> the lossless mode.

OK, thanks.

> Once the current changes are checked in, I'll continue to improve
> performance. The following optimizations look most promising to me:

More performance is always welcome :)

> * Migrating the DWT from float to integer code.
> * Skipping T2 decoding for all blocks that are not required in
> reduced-resolution mode.
>
> If you have other ideas how to improve OpenJPEG performance, let me know.
>
> The Kakadu lib is still twice as fast as OpenJPEG, so there is definitely
> potential for improvements ;-)
>
> Peter

--
Michael Wood <esio...@gmail.com>

Peter Wimmer

Feb 12, 2010, 5:14:03 PM
to open...@googlegroups.com
> It seems most likely that the changes are all intertwined and it is not
> possible to split them out like that.

Not all of them are intertwined, many can be applied separately. If patches
are required I will create them. But I only want to do it if they are really
required because it also means work on my side to split all the changes into
different patch files.

It seems that there are not many changes applied to OpenJPEG right now. All
my changes are based on the latest code, no merging is required. The easiest
way is to just check in my files in the v2.0 branch. However, somebody
should verify that they still compile with compilers other than Visual
Studio and don't break anything.

Peter

cheikhrouh...@planet.tn

Feb 15, 2010, 3:13:24 AM
to open...@googlegroups.com
It is really good to reach 75%. Congratulations!
What I am interested in is the compression rate of Motion JPEG 2000.
I want to know whether these changes improve the bitrate.
Please reply!

Peter Wimmer

Feb 15, 2010, 7:17:12 AM
to open...@googlegroups.com
The optimizations improve neither compression efficiency nor compression
speed. They only improve decompression speed without any influence on the
quality.

Peter


cheikhrouh...@planet.tn

Feb 15, 2010, 9:51:43 AM
to open...@googlegroups.com
Thank you Peter for your prompt reply.

Saoussen

François-Olivier

Feb 16, 2010, 5:29:41 AM
to OpenJPEG
Hi Peter,

You mention below that the files you provided could be checked in the
v2.0 branch. However it seems they are based on v1.3.

Do you plan to adapt your optimizations for v2.0? Since I believe all
efforts should be (at least) focused on v2.0, this would be an
interesting contribution. By the way, I agree that providing patches
rather than source files is a more efficient way to share updates.

I'll continue to check your contributions for v1.3 and commit them
asap.

Cheers,

François

Peter Wimmer

Feb 16, 2010, 11:30:00 AM
to open...@googlegroups.com
If my patches are based on 1.3, then I've made a mistake during checkout of
the files. I'll have a look at it and probably apply them on v2.0, too. What
are the main differences between the v1.3 and the v2.0 branch?

Peter



Peter Wimmer

unread,
Feb 16, 2010, 11:38:01 AM2/16/10
to open...@googlegroups.com
I've checked out http://openjpeg.googlecode.com/svn/trunk/ and expected to get
the latest version. How do I check out the 2.0 branch?


François-Olivier

unread,
Feb 17, 2010, 4:30:05 AM2/17/10
to OpenJPEG
Hi Peter,

The v2 branch is available here:
http://openjpeg.googlecode.com/svn/branches/v2/

You'll find a small description of the main modifications brought by
v2 in this thread
http://groups.google.com/group/openjpeg/browse_thread/thread/c5189f1a959ca813/46a52883c2827b0a?#46a52883c2827b0a

Important patches are currently applied on both v1.3 and v2.0.

I suggest releasing at the end of this month a version 1.3.1 which will
contain all the bug fixes and improvements from version 1.3. Then,
starting on March 1st, we'll only focus on v2 (we'll then move
v2 to the trunk).

I'll have some time until then to apply the latest patches on v1.3.

Cheers,

François

On Feb 16, 5:38 pm, "Peter Wimmer" <pwim...@gmx.at> wrote:
> I've checked out http://openjpeg.googlecode.com/svn/trunk/ and expected to get
> the latest version. How do I check out the 2.0 branch?

François-Olivier

unread,
Mar 11, 2010, 3:22:46 PM3/11/10
to OpenJPEG
Hi Peter and Callum,

Have you been able to merge the optimizations (cf. the discussion below)?
If you have the chance to do so, and to pass them through the
OPJ_Validate tool, I'd like to commit them to v1.4 (and not lose the
great work you both did).

François-Olivier

On Feb 10, 21:57, "Peter Wimmer" <pwim...@gmx.at> wrote:
> Thanks for the hint. I already reinvented some of your optimizations,
> however, my code is still missing the loop unrolling. I'll also add it and
> then send all modified files.
>
> Btw, in the last few days I improved the performance of the MJPEG2000
> decoder from 4 fps to 6.7 fps (DCP decoding on a Core 2 Quad).
>
> Peter
>
>
>
> -----Original Message-----
> From: open...@googlegroups.com [mailto:open...@googlegroups.com] On Behalf
> Of Callum Lerwick
> Sent: Wednesday, February 10, 2010 12:02 AM
> To: open...@googlegroups.com
> Subject: Re: [OpenJPEG] Optimizations for OpenJPEG
>

> On Mon, Feb 8, 2010 at 8:59 AM, Peter Wimmer <pwim...@gmx.at> wrote:
> > t1.c: Created different code paths for raw, mqc and mqc+vsc decoding. This
> > removes many "if" from inner to outer loops.
>
> Might want to take a look at the patch I posted a while back:
>

> http://groups.google.com/group/openjpeg/browse_thread/thread/4d27b0ea...

Peter Wimmer

unread,
Mar 11, 2010, 3:50:17 PM3/11/10
to open...@googlegroups.com
Hi François-Olivier,

I haven't worked on OpenJPEG in the last few weeks. Unless somebody else
has done it, my optimizations are not merged yet.

I'm not familiar with the workflow to check in modifications. Do you need
patch files? And what is the OPJ_Validate tool? I'm working exclusively with
Microsoft Visual C++ 2008, so it is necessary that somebody makes sure that
my optimizations compile on other platforms before checking them in.

François-Olivier Devaux

unread,
Mar 12, 2010, 3:10:23 AM3/12/10
to open...@googlegroups.com
Hi Peter,

The OPJ_Validate tool also works under Visual C++ 2008. You can obtain it by synchronizing with the latest version of the SVN repository (see here: http://code.google.com/p/openjpeg/source/checkout). If you validate the optimizations under VC++ 2008, I'll do the verifications under Linux.

It would be good to synchronize with Callum to see how his patches
http://groups.google.com/group/openjpeg/browse_thread/thread/4d27b0eaeba39c8a/fd6a1dee88e6cc6f
could be merged with yours.


In any case, patch files are the best way to share the modifications as I have other patches to commit while waiting for your optimizations.

François

Peter Wimmer

unread,
Mar 16, 2010, 5:10:47 PM3/16/10
to open...@googlegroups.com
Hi François-Olivier,

Can you let me know whether the attached patch file works for you? I've
created it with Araxis Merge. If it is ok, I'll make patches for the other
changes, too.

This patch adds SSE2 code, loop unrolling and fixes compiler errors with the
existing SSE code with Visual C++ 2008.

Peter

dwt performance patch.txt

Peter Wimmer

unread,
Mar 16, 2010, 5:19:03 PM3/16/10
to open...@googlegroups.com
Comments in j2k.h
j2k comments.txt

Peter Wimmer

unread,
Mar 16, 2010, 5:17:12 PM3/16/10
to open...@googlegroups.com
Guess I got old/new files reversed. Here's a new file.

Peter



dwt performance patch.txt

Peter Wimmer

unread,
Mar 16, 2010, 5:23:57 PM3/16/10
to open...@googlegroups.com
SSE2 optimization for mct.c
mct performance patch.txt

Peter Wimmer

unread,
Mar 16, 2010, 5:31:27 PM3/16/10
to open...@googlegroups.com
MQC inlining. A new buffer is introduced so that the decoding can be
split into two parts.

The line

mqc->buffer = opj_realloc(mqc->buffer, (2 * len + 1) * sizeof(unsigned int));

allocates a buffer that is always too large, but I have no idea how to do it
better. It simply assumes that in the worst case, two output bytes are
written per input byte.
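
The sizing rule above can be sketched in plain C. This is only an illustration under stated assumptions: it uses the standard `realloc` instead of `opj_realloc`, and `example_worst_case_slots` and `example_grow_buffer` are hypothetical helpers, not part of OpenJPEG or of the patch. The sketch keeps the same worst-case estimate (two output words per input byte, plus one), but checks for size overflow and does not overwrite the old pointer when the allocation fails.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of OpenJPEG): worst-case number of
 * unsigned int slots for `len` input bytes, i.e. 2 * len + 1.
 * Returns 0 if the resulting byte size would overflow size_t. */
static size_t example_worst_case_slots(size_t len) {
    if (len > (SIZE_MAX / sizeof(unsigned int) - 1) / 2) {
        return 0;
    }
    return 2 * len + 1;
}

/* Grows *buffer to the worst-case size. On failure the old buffer is
 * left untouched and is still owned by the caller. Returns 1 on success. */
static int example_grow_buffer(unsigned int **buffer, size_t len) {
    size_t slots = example_worst_case_slots(len);
    unsigned int *grown;

    if (slots == 0) {
        return 0;
    }
    grown = realloc(*buffer, slots * sizeof(unsigned int));
    if (grown == NULL) {
        return 0;
    }
    *buffer = grown;
    return 1;
}
```

Assigning the result of a reallocation directly back to the only copy of the pointer, as in the quoted line, loses the old block if the call returns NULL; whether that matters here depends on how the caller handles out-of-memory conditions.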

mqc inlining part 1.txt
mqc inlining part 2.txt

Peter Wimmer

unread,
Mar 16, 2010, 5:33:58 PM3/16/10
to open...@googlegroups.com
Forced inlining


opj_includes forced inlining.txt
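
Forced inlining is compiler-specific, so a patch like this typically hides it behind a macro. The following is a minimal sketch, not the content of the attached patch: the macro name `EXAMPLE_FORCE_INLINE` and the helper `example_clamp` are hypothetical. It relies on `__forceinline` for Visual C++ and on the `always_inline` attribute for GCC-compatible compilers, and falls back to a plain `static` function elsewhere.

```c
/* Hypothetical macro name; the actual patch may define this differently. */
#if defined(_MSC_VER)
#define EXAMPLE_FORCE_INLINE static __forceinline
#elif defined(__GNUC__)
#define EXAMPLE_FORCE_INLINE static __inline__ __attribute__((always_inline))
#else
#define EXAMPLE_FORCE_INLINE static
#endif

/* A small hot-path helper of the kind that benefits from inlining. */
EXAMPLE_FORCE_INLINE int example_clamp(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}
```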

Peter Wimmer

unread,
Mar 16, 2010, 5:38:47 PM3/16/10
to open...@googlegroups.com
Introduces a new #define that allows an external memory manager to be used.


opj_malloc external mem manager.txt
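
One way such a switch could look is sketched below. Every name in it (`EXAMPLE_USE_EXTERNAL_MEM_MANAGER`, `example_host_malloc`, `example_host_free`, `example_malloc`, `example_free`) is a hypothetical stand-in; the attached patch may be organized quite differently. The idea is that library code calls only the wrapper macros, and the #define decides at compile time whether they map to the C runtime or to functions supplied by the host application.

```c
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical. Defining
 * EXAMPLE_USE_EXTERNAL_MEM_MANAGER at build time routes allocations
 * to functions supplied by the host application. */
#ifdef EXAMPLE_USE_EXTERNAL_MEM_MANAGER
void *example_host_malloc(size_t size);   /* provided by the application */
void example_host_free(void *ptr);        /* provided by the application */
#define example_malloc(size) example_host_malloc(size)
#define example_free(ptr)    example_host_free(ptr)
#else
#define example_malloc(size) malloc(size)
#define example_free(ptr)    free(ptr)
#endif

/* Library code only ever calls the wrapper macros. */
static int *example_alloc_row(size_t width) {
    return (int *) example_malloc(width * sizeof(int));
}
```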

Peter Wimmer

unread,
Mar 16, 2010, 5:46:30 PM3/16/10
to open...@googlegroups.com
This is also required for MQC inlining.

The bad thing is that it also requires changes in the project files: mqc.c
is now included into t1.c with #include "mqc.c" instead of being compiled
separately, and thus must be removed from the make/project files.


mqc inlining part 3.txt

Peter Wimmer

unread,
Mar 16, 2010, 5:58:22 PM3/16/10
to open...@googlegroups.com
T1 optimizations - these are the most complex changes where the code has
been split to have fast procedures for the common cases.

t1.txt
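
Earlier in the thread the T1 change is described as separate code paths for raw, MQC and MQC+VSC decoding, so that mode checks move from the inner loops to the outer ones. The sketch below illustrates only that general technique; it is not the T1 code, the names are hypothetical, and the per-sample "work" is placeholder arithmetic.

```c
#include <stddef.h>

typedef enum { EXAMPLE_MODE_RAW, EXAMPLE_MODE_MQC } example_mode;

/* Stand-ins for the per-sample work of each mode (hypothetical). */
static int example_step_raw(int sample) { return sample + 1; }
static int example_step_mqc(int sample) { return sample * 2; }

/* Before: the mode is tested once per sample inside the loop. */
static void example_decode_generic(int *data, size_t n, example_mode mode) {
    size_t i;
    for (i = 0; i < n; ++i) {
        if (mode == EXAMPLE_MODE_RAW) {
            data[i] = example_step_raw(data[i]);
        } else {
            data[i] = example_step_mqc(data[i]);
        }
    }
}

/* After: the mode is tested once, then a loop without mode checks runs. */
static void example_decode_split(int *data, size_t n, example_mode mode) {
    size_t i;
    if (mode == EXAMPLE_MODE_RAW) {
        for (i = 0; i < n; ++i) data[i] = example_step_raw(data[i]);
    } else {
        for (i = 0; i < n; ++i) data[i] = example_step_mqc(data[i]);
    }
}
```

Both functions produce the same output. An optimizing compiler may perform this transformation on its own for a loop this simple, so any gain from doing it by hand would have to be measured on the real decoder.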

Peter Wimmer

unread,
Mar 26, 2010, 8:53:31 AM3/26/10
to open...@googlegroups.com
The next step I'm planning for speeding up OpenJPEG is migrating the
floating-point DWT to integer. Has anybody already done it? Are there
existing patches?

Peter

Matteo Italia

unread,
Mar 26, 2010, 9:19:21 AM3/26/10
to open...@googlegroups.com
Hi Peter,

IIRC, in the past the opposite was done (integer to floating point) to take
advantage of the SSE/SSE2 extensions.

Matteo

Peter Wimmer

unread,
Apr 7, 2010, 10:00:35 AM4/7/10
to open...@googlegroups.com
> IIRC in the past the inverse operation was done to take advantage of the
> SSE/SSE2 extensions.

Strange. SSE2 supports integer operations. It was definitely the wrong
decision to go from integer to float. Does this mean I can find old integer
versions in SVN? I'll have a look.

Matteo Italia

unread,
Apr 7, 2010, 5:50:44 PM4/7/10
to open...@googlegroups.com
I don't know if it was a good or bad idea, I'm not really into subtle
performance tuning; still, here's the message where I read that
(http://groups.google.it/group/openjpeg/browse_thread/thread/f363f253209dc418/66cd01b13bc3d515?hl=it&lnk=gst&q=fixed+point+sse#66cd01b13bc3d515),
you can ask its author for an explanation.

Regards,
Matteo

Peter Wimmer

unread,
Apr 7, 2010, 6:11:15 PM4/7/10
to open...@googlegroups.com
Thanks for the info. It needs testing to find out what is actually faster. I
bet on integer, but we'll see. I just studied the SIMD instructions on
x86/x64. Some of the SIMD instructions operating on 4x32-bit integers were
introduced in SSE4. Using only SSE and SSE2, floating point could indeed be
faster. But all recent CPUs support SSE4 now.

Peter


Matteo Italia

unread,
Apr 8, 2010, 7:54:45 AM4/8/10
to open...@googlegroups.com
By the way, has anyone tried to compile it for x64? It would benefit
from the new general-purpose registers (which are integer registers)
and could be compiled by default with SSE/SSE2 without
breaking compatibility with anything.

Matteo

Peter Wimmer

unread,
Apr 8, 2010, 10:06:36 AM4/8/10
to open...@googlegroups.com
I didn't try x64. My application's GUI is written in Delphi and there exists
no x64 Delphi compiler yet. Once I can port my application to 64 bit, I will
compile OpenJPEG as 64 bit code, too.


François-Olivier Devaux

unread,
Apr 8, 2010, 11:42:00 AM4/8/10
to open...@googlegroups.com
Hi Peter,

I'm having trouble (segmentation fault under linux) when using the MQC_PERF_OPT flag with the J2K file "p0_02.j2k" available in the OPJ_Validate dataset
http://www.openjpeg.org/OPJ_Validate_OriginalImages.7z

Could you check on your side ?

Thanks,

François

François-Olivier Devaux

unread,
Apr 8, 2010, 11:45:11 AM4/8/10
to open...@googlegroups.com
Peter,

I would prefer an alternative to your suggestion of adding an #include
of mqc.c inside t1.c (see below).

--- C:\Temp\OpenJPEG 1.3\libopenjpeg\t1.c Tue Feb 16 16:32:29 2010 UTC
+++ C:\Daten\Development\C++\LibOpenJPEG\t1.c Thu Feb 11 21:10:12 2010 UTC
@@ -32,6 +32,7 @@

#include "opj_includes.h"
#include "t1_luts.h"
+#include "mqc.c"

Note: In my previous email, the problem comes with the following command
line: j2k_to_image -i original/p0_02.j2k -o temp/p0_02.tif

Peter Wimmer

unread,
Apr 8, 2010, 1:14:26 PM4/8/10
to open...@googlegroups.com
> I would prefer an alternative to your suggestion of adding an #include of mqc.c inside t1.c (see below).

Me too, but I have no better idea. How else can you inline a function across file boundaries?


Dzonatas

unread,
Apr 8, 2010, 10:54:21 PM4/8/10
to open...@googlegroups.com
FP is faster because fewer operations are needed. Integer math produces
consistent results, but is a bit slower. Lossy encoding can use FP, while
lossless needs integer math.

Integer math requires extra operations to normalize the decimal point,
compared with FP. That means more registers are used to normalize
integer math.

Peter Wimmer

unread,
Apr 9, 2010, 7:11:58 AM4/9/10
to open...@googlegroups.com
I could move all mqc decoding functions from the .c to the .h file and implement them as macros. But this is ugly, too.
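
Another option, offered here only as a sketch and not something the patches in this thread do, is to define the small hot functions as static inline functions in a header. Every .c file that includes the header gets its own inlinable copy, and the remaining code can stay a separately compiled unit. All names below are hypothetical and the reader struct is far simpler than the real MQ coder state; a single listing stands in for what would be two files.

```c
/* ---- example_mqc_inl.h (hypothetical header) ---- */
#ifndef EXAMPLE_MQC_INL_H
#define EXAMPLE_MQC_INL_H

/* Visual C++ in C mode spells the keyword __inline; GCC accepts __inline__. */
#if defined(_MSC_VER)
#define EXAMPLE_INLINE __inline
#elif defined(__GNUC__)
#define EXAMPLE_INLINE __inline__
#else
#define EXAMPLE_INLINE
#endif

/* Hypothetical decoder state, much simpler than the real MQ coder. */
typedef struct {
    const unsigned char *next;  /* next input byte */
    const unsigned char *end;   /* one past the last input byte */
} example_reader;

/* Defined in the header so that each including .c file can inline it. */
static EXAMPLE_INLINE int example_read_byte(example_reader *r) {
    return (r->next < r->end) ? *r->next++ : 0xFF;
}

#endif /* EXAMPLE_MQC_INL_H */

/* ---- example_t1.c (hypothetical caller) ---- */
/* In a real two-file layout: #include "example_mqc_inl.h" */
static int example_sum_bytes(const unsigned char *data, int len) {
    example_reader r;
    int i, sum = 0;
    r.next = data;
    r.end = data + len;
    for (i = 0; i < len; ++i) {
        sum += example_read_byte(&r);
    }
    return sum;
}
```

Compared with macros, the functions keep type checking and can be stepped through in a debugger; compared with #include "mqc.c", the make/project files need no change beyond adding the header.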


Peter Wimmer

unread,
Apr 9, 2010, 12:32:10 PM4/9/10
to open...@googlegroups.com
Just trying to understand the requirements for integer decoding. In the
following code:

static INLINE int fix_mul(int a, int b) {
    OPJ_INT64 temp = (OPJ_INT64) a * (OPJ_INT64) b ;
    temp += temp & 4096;
    return (int) (temp >> 13) ;
}

Is temp += temp & 4096 supposed to round the value? It wastes performance:
temp += 4096 does the same.
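
The claimed equivalence can be checked with a small standalone sketch. It uses `int64_t` in place of `OPJ_INT64`, and the function names are hypothetical. The reasoning: if bit 12 of the product is set, both variants add 4096; if it is clear, adding 4096 only sets bit 12 without a carry, and that bit is discarded by the shift by 13.

```c
#include <stdint.h>

/* Rounding as quoted in the thread: add 4096 only if bit 12 is set. */
static int example_fix_mul_masked(int a, int b) {
    int64_t temp = (int64_t) a * (int64_t) b;
    temp += temp & 4096;
    return (int) (temp >> 13);
}

/* Suggested simplification: always add 4096 (half of 2^13). */
static int example_fix_mul_plain(int a, int b) {
    int64_t temp = (int64_t) a * (int64_t) b;
    temp += 4096;
    return (int) (temp >> 13);
}

/* Compares both variants over a grid of positive and negative inputs.
 * Returns the number of mismatches found. */
static long example_count_mismatches(void) {
    long mismatches = 0;
    int a, b;
    for (a = -3000; a <= 3000; a += 7) {
        for (b = -3000; b <= 3000; b += 11) {
            if (example_fix_mul_masked(a, b) != example_fix_mul_plain(a, b)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Whether the unconditional add is measurably faster is a separate question that only profiling can answer.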

Peter


Callum Lerwick

unread,
Apr 15, 2010, 4:12:49 PM4/15/10
to open...@googlegroups.com

I forget what the & is for. But the big thing is it's a 64-bit
multiply. That is not a fast thing on a 32-bit machine. AFAIK SSE2
isn't any good for a 64-bit multiply either. Even on a 64-bit machine
float seems to be faster. Modern desktop CPUs have rather good float
throughput.

Your only hope for integer to beat float is to cut it down to 16-bit
fixed point, so you do a 32-bit intermediate multiply. But then your
accuracy is going to suffer, so the float code should be kept. A
16-bit integer fastpath for real-time applications would be welcome.
(And would be good for anyone using embedded CPUs with no FPU. Does
anyone even still do that on higher end embedded applications?)

Peter Wimmer

unread,
Apr 16, 2010, 12:21:50 PM4/16/10
to open...@googlegroups.com
I came to the same conclusion that SSE doesn't provide the required
instructions for 32-bit integers. For 16 bit, there would be the SSSE3
_mm_mulhrs_epi16 intrinsic that does the following:

r0 := INT16(((a0 * b0) + 0x4000) >> 15)
r1 := INT16(((a1 * b1) + 0x4000) >> 15)
...
r7 := INT16(((a7 * b7) + 0x4000) >> 15)
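
To make the formula concrete, here is a scalar model of one lane written in plain C. It is a sketch of the arithmetic only, not of the intrinsic itself (no SSE headers are used), and the function names are hypothetical. The operands are treated as Q15 fixed-point values, so 16384 stands for 0.5.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane, following the formula above:
 * r = INT16(((a * b) + 0x4000) >> 15).
 * Inputs are treated as Q15 fixed point (value = x / 32768).
 * Assumes >> on a negative int is an arithmetic shift, as on the
 * common x86 compilers; the C standard leaves that to the implementation. */
static int16_t example_mulhrs_lane(int16_t a, int16_t b) {
    int32_t product = (int32_t) a * (int32_t) b;
    return (int16_t) ((product + 0x4000) >> 15);
}

/* Applies the lane model to 8 lanes, mirroring r0..r7 above. */
static void example_mulhrs_8(const int16_t a[8], const int16_t b[8],
                             int16_t r[8]) {
    int i;
    for (i = 0; i < 8; ++i) {
        r[i] = example_mulhrs_lane(a[i], b[i]);
    }
}
```

For example, multiplying 16384 by 16384 (0.5 times 0.5 in Q15) gives 8192, which is 0.25. The 32-bit intermediate product is what keeps this cheaper than the 64-bit multiply discussed earlier, at the cost of only 16 bits of precision per operand.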



Would 16-bit arithmetic be sufficient to decode an image with 8-bit output
precision? As far as I understand the spec, a higher range is required
while performing the DWT. I could certainly implement the 16-bit fast code
path, but I'm no expert on the JPEG2000 spec yet - still learning ;-)



Peter Wimmer

unread,
Apr 21, 2010, 5:32:12 AM4/21/10
to open...@googlegroups.com
Hello,

I've completed the DirectShow decoder based on OpenJPEG and released it at
http://www.3dtv.at/Products/Jpeg2000Decoder/Index_en.aspx. The installer is
available at http://www.3dtv.at/Downloads/Index_en.aspx.

The decoder uses all the patches I've submitted recently and, in addition,
distributes the workload across all CPU cores.

Best regards,
Peter


