I posted a patch to another group to help speed up decompression on
multiprocessor machines. It is a very preliminary patch to test Intel's
Threading Building Blocks.
I'll post an update to this list soon. It should also speed up the
encode process. I still need to build a test suite before I consider
it no longer preliminary.
On a multi-core machine, it is expected to decode at 75% of the normal
time *at least*. For example, if it normally takes 10 seconds on a
single core, then it should take only 1/16 of that on a 16-core CPU
(about 0.6 seconds). Intel's Core 2 Duo can present 4 main physical
threads, so that means 1/4 of the time (equivalent to quad-core). That
should bring it down to about 60 ms (from the 250 ms at 768x576).
These expectations are without SSE vectorization, so there is potential
for even further optimizations with SSE vectorization beyond the
multi-processor scalability.
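To make the arithmetic above easy to check, here is a minimal sketch of
the ideal case, assuming the work divides evenly across cores with no
synchronization or memory overhead. The function name is made up for
illustration and is not part of any patch.

```c
#include <assert.h>

/* Best-case time when the work splits evenly across `cores` and there
 * is no threading or memory overhead. Real runs will be slower. */
double ideal_parallel_time(double single_core_seconds, int cores)
{
    assert(cores > 0);
    return single_core_seconds / cores;
}
```

With 0.250 seconds on one core and 4 threads this returns 0.0625
seconds, which is the "about 60ms" figure. A 10-second single-core
decode on 16 cores comes out at 0.625 seconds.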
Again, this is preliminary, but the work is embarrassingly parallel and
the approach is practicable.
Cheers!
--
Power to Change the Void
I hope you realize that our recent patch drops the DWT time completely
off the charts, at least at non-megaimage sizes. :) Given Amdahl's law
there's relatively little to be gained by just threading the DWT. The
T1/MQC is by far the top CPU user. You might notice in the patch I
submitted I moved the outer T1 loop into tcd_decode_tile(), which allows
the possibility of interleaving the T1 and DWT decode. I'm still
exploring the benefits of this, if any, but you could then very easily
decode each component in parallel.
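For illustration only, decoding each component in parallel could look
roughly like the sketch below. The component_t type and
decode_component() are hypothetical stand-ins, not the real tcd
structures or the T1/DWT code, and the OpenMP pragma is just one
possible way to do it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one component's samples. */
typedef struct {
    int *data;
    size_t count;
} component_t;

/* Hypothetical per-component work standing in for T1 + DWT decode.
 * It only doubles every sample so the sketch has a visible effect. */
void decode_component(component_t *comp)
{
    size_t i;
    for (i = 0; i < comp->count; ++i)
        comp->data[i] *= 2;
}

/* Components share no state here, so each iteration can run on its own
 * thread. Without OpenMP enabled the loop simply runs sequentially. */
void decode_all_components(component_t *comps, int numcomps)
{
    int c;
    assert(comps != NULL && numcomps >= 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (c = 0; c < numcomps; ++c)
        decode_component(&comps[c]);
}
```

The result is the same with or without threads, which is what makes
the per-component split attractive.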
I pulled out my old dual PII workstation so I could experiment with
OpenMP, but haven't quite got around to it yet. Note that TBB requires
you to convert everything to C++, and OpenMP does not. ;P
And as I mentioned on sldev, given how many zillions of textures Second
Life decodes, just running multiple texture decode threads would be the
cleanest way to use multiple cores in slviewer, rather than having the
OpenJPEG library competing for threads. Given that spawning threads has
a non-trivial fixed cost, threading work really should go from the top
down and not bottom up.
Amdahl's law rests on an assumption about what makes up the 100%
sequential part(s). If the equation includes parts that are outside the
scope of the algorithm being made scalable, then the result is not
specific to that algorithm and can produce false predictions.
OpenMP is not supported by all compilers. C++ also now shows better
general optimization than plain C, which was not true before. The choice
between OpenMP and C++ is moot because I'll publish the patch on a
branch separate from libopenjpeg, so there is no requirement to make
wholesale changes for any parallel scheme. It also means that
libopenjpeg can still be used where TBB has not been ported.
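To make the point about scope concrete, here is a small sketch of the
usual Amdahl formula, speedup = 1 / ((1 - p) + p / n), where p is the
fraction of the measured time that runs in parallel and n is the number
of cores. The fractions used in the note below are illustrative
assumptions, not measurements.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when only `parallel_fraction` of the
 * measured time is spread across `cores`. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(cores > 0);
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

If the DWT were 20% of the whole decode, 4 cores would give about 1.18x
overall. If only the DWT itself is measured (p = 1.0), the same formula
gives 4x. Which figure is "right" depends on what is counted in the
100%.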
The top-down approach to parallelization means that a new
synchronization scheme needs to be implemented. In the case of Second
Life, that would be a wholesale change throughout Second Life, as it
depends heavily on its sequential flow to create implicit
synchronization.
I do realize the changes I made to the DWT in the previous patch (not
the TBB-related patch) speed it up by a variable factor of between 2x
and 20x, but that previous patch relied on SSE vectorization to
achieve such performance. TBB does not rely on SSE to achieve similar
performance. TBB also does not require the autovectorization feature
only available in GCC. The effort you put into changing libopenjpeg for
GCC autovectorization is probably similar to the effort needed to
change it to C++.
Cheers!
OpenJPEG is used in "open source" builds of the Second Life viewer; the
official builds use KDU, and the performance difference is (was?) very
noticeable. So I've been putting a lot of effort into optimization. As
mentioned, I just submitted a patch originally by Dzonatas and mangled
by me that speeds up the 9-7 DWT decode quite a bit, which is probably
the biggest single performance increase so far. Although Dzonatas only
patched the decoder, doing a similar fp-ization to the encoder should
speed that up nicely as well. :)
I also submitted some patches that decrease memory usage and cut down
the size of tcd_cblk structure, which had some big static buffers in it,
which measurably speeds things up, especially on smaller cache machines.
I should probably do some benchmarks. :) For performance testing I
mostly use the mj2 codec. For the test file I encoded "Speedway.yuv"
with "-I 1".
Decodes are run on an Athlon 64 3000+, a Celeron D 2.1 GHz (P4
architecture), a Mobile Celeron 1.3 GHz (P3 architecture), and a G3
266 MHz. Here's stock svn470, compiled using Fedora 7 gcc 4.1, with -O3:
amd64: Total decoding time: 11.97 seconds (16.7 fps)
p4: Total decoding time: 21.60 seconds ( 9.3 fps)
p3: Total decoding time: 26.79 seconds ( 7.5 fps)
ppc: Total decoding time: 67.10 seconds ( 3.0 fps)
Here's after the fp/vectorization patch, with and without SSE on the
i386 machines:
amd64: Total decoding time: 9.33 seconds (21.4 fps) 22.1%
p4: Total decoding time: 18.08 seconds (11.1 fps) 16.3%
p4sse: Total decoding time: 17.46 seconds (11.5 fps) 19.2%
p3: Total decoding time: 21.16 seconds ( 9.5 fps) 21.0%
p3sse: Total decoding time: 20.74 seconds ( 9.6 fps) 22.6%
ppc: Total decoding time: 44.14 seconds ( 4.5 fps) 34.2%
Big wins across the board. The biggest win percentage-wise is on PPC;
I'm not entirely sure why. :) As you can see, most of the speedup
doesn't actually come from SSE.
And after the late-alloc-early-free and tcd-cblk-nostatic patches:
amd64: Total decoding time: 9.31 seconds (21.5 fps) 0.2%
p4sse: Total decoding time: 17.34 seconds (11.5 fps) 0.7%
p3sse: Total decoding time: 20.23 seconds ( 9.9 fps) 2.5%
ppc: Total decoding time: 43.69 seconds ( 4.6 fps) 1.0%
Percentages are versus the previous patchset. As intended, the P3
Celeron seems to benefit from the reduction in cache footprint. I'm a
little confused as to why the P4 doesn't seem to be benefiting as much.
I suspect that machine may be bottlenecked by the fact that I only have
PC2100 RAM in it. As expected, the amd64 with its large cache isn't
affected much.
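For anyone who wants to reproduce the fps and percentage columns, here
is a minimal sketch of the arithmetic. The helper names are made up,
and the frame count of 200 used in the note is an assumption inferred
from the figures (11.97 seconds at 16.7 fps), not something stated by
the codec.

```c
#include <assert.h>

/* Frames decoded per second of wall time. */
double frames_per_second(int frames, double seconds)
{
    assert(seconds > 0.0);
    return frames / seconds;
}

/* Percentage of decode time saved relative to the previous run. */
double percent_time_saved(double before_seconds, double after_seconds)
{
    assert(before_seconds > 0.0);
    return 100.0 * (before_seconds - after_seconds) / before_seconds;
}
```

For the amd64 row, going from 11.97 to 9.33 seconds gives about 22.1%,
and 200 frames in 11.97 seconds gives about 16.7 fps.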
Next in the queue I've got some cycle-shaving patches to the T1 and MQC:
amd64: Total decoding time: 8.66 seconds (23.1 fps) 7.0%
p3sse: Total decoding time: 19.91 seconds (10.0 fps) 1.6%
The patches were originally written and tuned for x86-64, thus the big
win there. The patches reduce branching and take advantage of the fact
that up to four 16-bit T1 flags can be manipulated at a time given a
64-bit processor. I was able to translate some of it to MMX intrinsics
on i386. :)
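As a rough illustration of handling four 16-bit flags at once, the
sketch below packs four adjacent flag words into one 64-bit value and
sets or tests a bit in all of them with a single operation. This is not
the code from the patches; the function names are hypothetical and it
assumes the four flag words sit next to each other in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate one 16-bit mask into all four 16-bit lanes of a 64-bit word. */
uint64_t broadcast16(uint16_t mask)
{
    return (uint64_t)mask * UINT64_C(0x0001000100010001);
}

/* Set the bits in `mask` on four adjacent 16-bit flag words with one
 * 64-bit OR. memcpy sidesteps alignment and strict-aliasing problems. */
void set_flags_x4(uint16_t flags[4], uint16_t mask)
{
    uint64_t packed;
    assert(flags != NULL);
    memcpy(&packed, flags, sizeof packed);
    packed |= broadcast16(mask);
    memcpy(flags, &packed, sizeof packed);
}

/* Nonzero if any of the four flag words has any bit of `mask` set:
 * one test instead of four separate branches. */
int any_flag_set_x4(const uint16_t flags[4], uint16_t mask)
{
    uint64_t packed;
    assert(flags != NULL);
    memcpy(&packed, flags, sizeof packed);
    return (packed & broadcast16(mask)) != 0;
}
```

Because the same mask goes into every lane, the result does not depend
on byte order.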
There's also further memory work that I'm still sorting through.
The floating point math should be made optional. That would require a
new API option for a run-time toggle.
Floating point is good for hardware that has it and for display-quality
video, but for archival purposes the dB needs to remain infinite (that
is, lossless). Fixed point math emulated in software is the only way to
guarantee infinite dB. Considering that SSE (for floating point) can
show no performance gain under cache pollution conditions, it is wise
to keep both the floating point and fixed point methods so that other
scalable solutions can gain performance. =)
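A run-time toggle could be sketched along these lines. Everything here
is hypothetical: math_mode_t, scale_sample() and the 13 fractional bits
are assumptions made for the example and are not an existing
libopenjpeg option.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical run-time option; not an existing libopenjpeg setting. */
typedef enum { MATH_FIXED_POINT, MATH_FLOAT } math_mode_t;

#define FIX_SHIFT 13 /* assumed number of fractional bits */

/* Fixed-point multiply using integers only. For non-negative products
 * the result is the same on any conforming compiler; right-shifting a
 * negative value is implementation-defined in C. */
int32_t fixed_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FIX_SHIFT);
}

/* Scale one sample by a coefficient using the selected arithmetic. */
int32_t scale_sample(math_mode_t mode, int32_t sample, double coeff)
{
    assert(mode == MATH_FIXED_POINT || mode == MATH_FLOAT);
    if (mode == MATH_FLOAT)
        return (int32_t)(sample * coeff);
    /* Convert the coefficient to fixed point, then multiply as integers. */
    return fixed_mul(sample, (int32_t)(coeff * (1 << FIX_SHIFT)));
}
```

A caller would pick the mode once at setup and pass it down, so both
code paths stay available in the same build.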
The benchmarks I do are run on a mildly stressed machine. It typically
runs Windows and Linux concurrently on a single core. Each OS can
easily steal 50% or more of the hardware if the other is idle. Any
"system" time in my benchmarks is easily distorted under such
conditions, so I run the same test several times and take the average.
The single-core setup allows one to calculate the overhead of a
multi-core operation. Given that most software (even the OS) does not
run efficiently on multi-core architectures, live multi-core samples
won't produce isolated benchmarks, but the samples are good for making
sure the results are within the (calculated) limits. Memory utilization
on multi-core architectures is not as consistent a factor as it is on a
single core, and it is usually the reason why such samples are not
within the limits.
There is a technique to help detect, at run-time, when the results are
not within the limits, so optional decode/encode methods are useful for
finding the best one for the run-time conditions at hand.
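As a minimal sketch of the averaging and the limit check, assuming the
limit is simply the ideal multi-core time scaled by an allowed overhead
factor: the function names and the overhead factor are assumptions made
for illustration, not the actual technique.

```c
#include <assert.h>
#include <stddef.h>

/* Average of repeated timings of the same test. */
double mean_seconds(const double *samples, size_t count)
{
    double sum = 0.0;
    size_t i;
    assert(samples != NULL && count > 0);
    for (i = 0; i < count; ++i)
        sum += samples[i];
    return sum / count;
}

/* Nonzero if a multi-core measurement is within the calculated limit:
 * the ideal time (single-core time / cores) times an allowed overhead
 * factor such as 1.25. */
int within_limits(double single_core_seconds, int cores,
                  double measured_seconds, double overhead_factor)
{
    assert(cores > 0 && overhead_factor >= 1.0);
    return measured_seconds <= (single_core_seconds / cores) * overhead_factor;
}
```

A decoder could fall back to another method when within_limits()
returns zero for the current one.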