T1 decoder optimization patches


Callum Lerwick

Feb 12, 2009, 5:34:09 PM
to open...@googlegroups.com
Hmmm, sidetracked as usual. Anyway, it's about time I got these out to
the world in case I get hit by a bus. I have to credit Sean O'Neil for
the idea of unrolling the loops, which is where most of the speedup
comes from.

My machine with Visual Studio 2005 on it seems to be nice and screwed
and needs a reinstall, so I haven't tested these with MSVC yet.
Could someone please tell me whether these compile, and whether they're
any faster?

openjpeg-svn528-t1-optimize.patch

The main blob, nearly ready for merging. 25% speedup on Athlon 64. It
doesn't change the T1 algorithm at all; it's purely code optimization.
It only really touches the decoder. The encoder is still slow, and
applying the same optimizations to it should gain a similar speedup.
It does:

* Unrolling of the 4-step inner loops (by far the most common case)
* Transposition of the data/flag arrays, optimizing cache usage in the
inner loop
* Loop unswitching
* Branch optimization hints for gcc
* Misc cleanup
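For anyone who hasn't read the patch, here is a rough sketch of how the first three items fit together on a toy pass. This is not the patch's code: `visit()`, `scan_generic()`, `transpose()` and `scan_fast()` are hypothetical, and reading "transposition" as a column-major layout (so the 4 samples of a stripe column are contiguous) is an assumption. The real T1 passes also carry a flags array alongside the data, which is left out here.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-sample work of a coding pass:
 * an order-sensitive checksum, so both scans must visit samples in
 * exactly the same order to produce the same result. */
static unsigned visit(unsigned acc, int sample)
{
    return acc * 31u + (unsigned)sample;
}

/* Baseline: row-major data, element (x, y) at d[y * w + x]. Scan in
 * stripes of 4 rows, column by column; the bounds test runs per sample
 * and each step down a column jumps w elements in memory. */
static unsigned scan_generic(const int *d, int w, int h)
{
    unsigned acc = 0;
    for (int k = 0; k < h; k += 4)
        for (int x = 0; x < w; ++x)
            for (int y = k; y < k + 4 && y < h; ++y)
                acc = visit(acc, d[(size_t)y * w + x]);
    return acc;
}

/* Copy a row-major w x h array into column-major order, so the (up to)
 * 4 samples of one stripe column sit next to each other in memory. */
static void transpose(const int *src, int *dst, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            dst[(size_t)x * h + y] = src[(size_t)y * w + x];
}

/* Optimized: column-major data; full stripes use an unrolled 4-step
 * body, the "is this a full stripe?" test is hoisted out of the inner
 * loop (loop unswitching), and only the last partial stripe falls back
 * to a generic loop. */
static unsigned scan_fast(const int *t, int w, int h)
{
    unsigned acc = 0;
    int k = 0;
    for (; k + 4 <= h; k += 4) {
        for (int x = 0; x < w; ++x) {
            const int *col = &t[(size_t)x * h + k];
            acc = visit(acc, col[0]);
            acc = visit(acc, col[1]);
            acc = visit(acc, col[2]);
            acc = visit(acc, col[3]);
        }
    }
    if (k < h) /* h not a multiple of 4 */
        for (int x = 0; x < w; ++x)
            for (int y = k; y < h; ++y)
                acc = visit(acc, t[(size_t)x * h + y]);
    return acc;
}
```

Calling `transpose(d, t, w, h)` and then `scan_fast(t, w, h)` should give the same result as `scan_generic(d, w, h)`; the point is that the common full-stripe case runs with no per-sample bounds test and walks memory sequentially.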

openjpeg-1.3-t1-mqc-inline.patch

A trick borrowed from Jasper: inlining the MQ coder (MQC) gains another
3% on Athlon 64. Proof of concept, not ready for merging.
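The general idea of the inlining trick, shown on a toy decoder rather than the real MQ coder: the per-symbol function is made `static inline` and visible to the T1 code (e.g. via a shared header), so the compiler can expand it inside the hot loop instead of paying a function call per decoded symbol. `bitreader_t`, `br_getbit()` and `decode_bits()` are hypothetical names for this sketch, not OpenJPEG or Jasper APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entropy decoder's state. */
typedef struct {
    const uint8_t *buf;
    size_t len;
    size_t pos;   /* current byte index */
    unsigned bit; /* bits already consumed from buf[pos], 0..7 */
} bitreader_t;

/* 'static inline' (e.g. in a header shared with the T1 code) lets the
 * compiler expand this at each call site. Reads MSB first and returns
 * 0 once the buffer is exhausted. */
static inline int br_getbit(bitreader_t *br)
{
    if (br->pos >= br->len)
        return 0;
    int b = (br->buf[br->pos] >> (7 - br->bit)) & 1;
    if (++br->bit == 8) {
        br->bit = 0;
        br->pos++;
    }
    return b;
}

/* Hot loop in the style of a coding pass: one decoder call per sample. */
static void decode_bits(bitreader_t *br, int *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = br_getbit(br);
}
```

Whether the compiler actually inlines is still its call; `inline` is only a request, which is one reason the gain may vary by compiler and target.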

openjpeg-1.3-t1-opt5.patch

Really experimental: adding branch hints to the MQC gains 1.6% on
Athlon 64, but is slower on PIII. Don't merge. The two DWT memcpy()
patches in there really should be merged, though.
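Branch hints of this kind are usually written with gcc's `__builtin_expect`, wrapped in macros so other compilers (MSVC has no `__builtin_expect`) still build the code. A minimal sketch follows; the macro names `HINT_LIKELY`/`HINT_UNLIKELY` and the `count_nonzero()` example are hypothetical, not taken from the patches. A hint only tells the compiler which path to lay out as the fast one, so a hint that's wrong for a given CPU or data set can plausibly cost time rather than save it.

```c
/* Branch-hint macros: GCC (and Clang, which also defines __GNUC__)
 * provide __builtin_expect; other compilers get no-op fallbacks. */
#if defined(__GNUC__)
#  define HINT_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define HINT_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define HINT_LIKELY(x)   (x)
#  define HINT_UNLIKELY(x) (x)
#endif

/* Hypothetical example: count nonzero samples, assuming most are zero,
 * so the increment is marked as the rarely taken path. */
static int count_nonzero(const int *v, int n)
{
    int c = 0;
    for (int i = 0; i < n; ++i) {
        if (HINT_UNLIKELY(v[i] != 0))
            ++c;
    }
    return c;
}
```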

In total, that's a 30% speedup on Athlon 64, and IIRC even more on i386.
I'm a little concerned that all this unrolling and inlining might hurt
some embedded platforms, in which case it could be made optional with
some #ifdefs. If anyone sees a slowdown on MIPS/ARM/PPC etc., please
tell me.
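One way the #ifdef route could look, as a sketch only: a single build switch selects between the unrolled, inlined form and a compact rolled loop. `T1_SMALL_CODE`, `T1_INLINE` and `sum_stripe_column()` are hypothetical names invented for this example; nothing like them is claimed to exist in the patches.

```c
/* Hypothetical build switch: compile with -DT1_SMALL_CODE on platforms
 * where code size matters more than the unrolled speedup. */
#ifdef T1_SMALL_CODE
#  define T1_INLINE static          /* no inline request */
#else
#  define T1_INLINE static inline
#endif

/* Sum of the 4 samples in one stripe column (hypothetical helper). */
T1_INLINE unsigned sum_stripe_column(const int *col)
{
#ifdef T1_SMALL_CODE
    /* compact rolled loop: smaller code, one branch per step */
    unsigned acc = 0;
    for (int i = 0; i < 4; ++i)
        acc += (unsigned)col[i];
    return acc;
#else
    /* unrolled: no loop counter or branch */
    return (unsigned)col[0] + (unsigned)col[1]
         + (unsigned)col[2] + (unsigned)col[3];
#endif
}
```

Both variants compute the same value, so the switch trades speed for code size without changing output.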
