Performance

Curtis Rueden

Dec 23, 2009, 7:25:54 PM

to ImageJX discussion group

Hi everyone,

One important factor to consider when scrutinizing the ImageJ codebase is performance. Historically, Java performance has lagged behind C++ and other natively compiled languages, although Java performance has greatly improved this decade and is now generally within a factor of two of C (see Java section of the LOCI FAQ: http://loci.wisc.edu/faq). In short, Java is easier and more cross-platform than lower-level languages such as C++, and more performant than higher-level scripting languages such as Python, making it a good balance for projects like ImageJ.

The standard approach for improving performance is to identify and target bottlenecks, rather than "prematurely optimize" code. One common technique for doing so, assuming the bottleneck algorithm cannot simply be optimized further, is to delegate that section of code to a more performant framework, such as a natively compiled version (e.g., platform-specific compiled "performance packs"). However, as several people already mentioned, it has recently proven effective and popular to use graphics cards to assist with code execution, as well as parallel processing to improve performance. The OpenCL standard is an effective new way to utilize both techniques in a cross-platform manner, making it a promising candidate for use with ImageJ.

A new ImageJDev team member at LOCI, Rick Lentz, has experience with CUDA and GPU processing, and we plan to explore the best way to leverage OpenCL later next year, once we have pinned down the ImageJDev project's development path over the coming months. It is exciting to think that for some algorithms we can expect a many-fold increase in time performance.

-Curtis

[was: API vs. Scripting?]
On Wed, Oct 7, 2009 at 10:58 AM, Wilhelm Burger <wil...@ieee.org> wrote:

3) I am reading about support for more (unsigned) primitive data
types, use of generics etc. I consider myself a Java person, but this
is where Java is particularly bad, not to mention some notorious
performance issues. How important is Java in this context?

[was: API vs. Scripting?]

On Tue, Dec 15, 2009 at 5:33 PM, Dimiter Prodanov <dimi...@gmail.com> wrote:

I think that we should anticipate also another possibility. Notably,
the use of ImageJ with specific hardware, for example writing code to
employ the parallel core of the GPUs. This can in itself require mass
rewriting of the code.

[was: API vs. Scripting?]

On Wed, Dec 16, 2009 at 4:29 AM, Wilhelm Burger <wil...@ieee.org> wrote:

* Performance IS an issue for image processing, and some of us
would rather get 10x the power of our current computers than only
25%. To obtain this level of performance, some central parts
(e.g., filters and other regular but time-consuming tasks) must be
recoded in one way or another, possibly using GPU functionality.
In an earlier post I called these modules "performance packs",
which will be platform-dependent (as is the JVM) but should
be installed transparently to the API user.

[was: Outsider's opinion]

On Fri, Oct 23, 2009 at 4:38 PM, Johan Henriksson <he.j...@gmail.com> wrote:

> It would be really cool if there were an option to use -if available-
> OpenCL <http://en.wikipedia.org/wiki/OpenCL> for
> computations/transformations... (JOCL <http://www.jocl.org/>)
>
you can also use the OpenCL bindings I have written for Endrov.
http://sourceforge.net/projects/jopencl/

unlike JOCL, it's open source

Johan Henriksson

Dec 24, 2009, 6:10:11 AM

to ima...@googlegroups.com

On Thu, Dec 24, 2009 at 1:25 AM, Curtis Rueden <ctrued...@gmail.com> wrote:

Hi everyone,

One important factor to consider when scrutinizing the ImageJ codebase is performance. Historically, Java performance has lagged behind C++ and other natively compiled languages, although Java performance has greatly improved this decade and is now generally within a factor of two of C (see Java section of the LOCI FAQ: http://loci.wisc.edu/faq). In short, Java is easier and more cross-platform than lower-level languages such as C++, and more performant than higher-level scripting languages such as Python, making it a good balance for projects like ImageJ.

I concur. the only real unsolvable annoyance with java vs C++ is that java sometimes cannot optimize away array out of bounds-checks. these can kill performance (branch prediction reduces this problem a bit). I would like to have an @annotation to tell the compiler to simply not insert them and let me guarantee that they won't be needed. it kills some safety properties of java but so does JNI; these can at least be ignored by the compiler if wanted, and inserted once the code is tested and works. I have mentioned it to some openjdk people but it would help if someone else also gave them a push.

The standard approach for improving performance is to identify and target bottlenecks, rather than "prematurely optimize" code. One common technique for doing so, assuming the bottleneck algorithm cannot simply be optimized further, is to delegate that section of code to a more performant framework, such as a natively compiled version (e.g., platform-specific compiled "performance packs"). However, as several people already mentioned, it has recently proven effective and popular to use graphics cards to assist with code execution, as well as parallel processing to improve performance. The OpenCL standard is an effective new way to utilize both techniques in a cross-platform manner, making it a promising candidate for use with ImageJ.

we're working with opencl now and have realized that it already affects our design of our EvPixels class, and even EvStack. things to think of:
* transporting image data to and from the GPU is expensive. sometimes it is faster to run an algorithm on the CPU than on the card
* knowing when to move to the GPU is tricky. we use endrov Flows for most algorithms so in theory, we could analyze the entire pipeline.
e.g. -> A -> B -> C -> D ->
assume that A and D can only be executed on the CPU. B and C can also be executed on the GPU. when should the pipe go for the GPU? simple-minded, you can do it whenever possible. but the memory transfer might not be worth it for B or C alone if they are trivial processes; OTOH it might be worth it with both B and C together. this shows that there are potential gains but I don't think IJ could easily make good use of it the way it looks today, without user intervention.
* other pixel formats might be preferred e.g. float16
* data might not even fit, CPU fallbacks needed
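
A minimal sketch of the B/C decision described above. It assumes that rough per-stage time estimates are available and that a GPU run pays one upload and one download no matter how many stages it contains. The class, the method names, and all numbers are hypothetical; none of this is Endrov or ImageJ code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical cost model: should a run of pipeline stages go to the GPU? */
final class OffloadPlanner {

    /** One pipeline stage with estimated costs in milliseconds (illustrative values only). */
    static final class Stage {
        final String name;
        final double cpuMs;
        final double gpuMs;
        final boolean gpuCapable;

        Stage(String name, double cpuMs, double gpuMs, boolean gpuCapable) {
            this.name = name;
            this.cpuMs = cpuMs;
            this.gpuMs = gpuMs;
            this.gpuCapable = gpuCapable;
        }
    }

    /**
     * Returns true if running stages [from, to) on the GPU, paying one upload
     * and one download, is estimated to be cheaper than running them on the CPU.
     */
    static boolean worthOffloading(List<Stage> stages, int from, int to,
                                   double uploadMs, double downloadMs) {
        double cpu = 0;
        double gpu = uploadMs + downloadMs;
        for (Stage s : stages.subList(from, to)) {
            if (!s.gpuCapable) {
                return false; // a CPU-only stage breaks the GPU run
            }
            cpu += s.cpuMs;
            gpu += s.gpuMs;
        }
        return gpu < cpu;
    }

    /** The -> A -> B -> C -> D -> pipeline with made-up timings. */
    static List<Stage> demoPipeline() {
        return Arrays.asList(
            new Stage("A", 5, 0, false),
            new Stage("B", 4, 1, true),
            new Stage("C", 4, 1, true),
            new Stage("D", 5, 0, false));
    }

    public static void main(String[] args) {
        List<Stage> pipe = demoPipeline();
        double up = 2, down = 2; // assumed transfer costs in ms
        System.out.println("B alone: " + worthOffloading(pipe, 1, 2, up, down));
        System.out.println("C alone: " + worthOffloading(pipe, 2, 3, up, down));
        System.out.println("B and C: " + worthOffloading(pipe, 1, 3, up, down));
    }
}
```

With these made-up numbers, neither B nor C alone pays for the transfers, but B and C together do. A real planner would also have to account for data that does not fit on the device and for pixel format conversions, as the list above points out.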

we have sort of given up coding a generics system that can be executed on the GPU as well as CPU; they are way too different. we are merely adding optional entry points in our code but all plugins must be able to run on the CPU.

another note on opencl; it is buggy!! and the ATI implementation is way incomplete (no 3d texture support). so yes, you will save some headaches if you postpone this a bit.

/Johan

--
You received this message because you are subscribed to the Google Groups "ImageJX" group.
To post to this group, send email to ima...@googlegroups.com.
To unsubscribe from this group, send email to imagejx+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/imagejx?hl=en.

--
-----------------------------------------------------------
Johan Henriksson
PhD student, Karolinska Institutet
http://mahogny.areta.org http://www.endrov.net

Gábor Bakos

Dec 24, 2009, 7:46:32 AM

to ima...@googlegroups.com

Hi Johan,

2009/12/24 Johan Henriksson <mah...@areta.org>

On Thu, Dec 24, 2009 at 1:25 AM, Curtis Rueden <ctrued...@gmail.com> wrote:

Hi everyone,

One important factor to consider when scrutinizing the ImageJ codebase is performance. Historically, Java performance has lagged behind C++ and other natively compiled languages, although Java performance has greatly improved this decade and is now generally within a factor of two of C (see Java section of the LOCI FAQ: http://loci.wisc.edu/faq). In short, Java is easier and more cross-platform than lower-level languages such as C++, and more performant than higher-level scripting languages such as Python, making it a good balance for projects like ImageJ.

I concur. the only real unsolvable annoyance with java vs C++ is that java sometimes cannot optimize away array out of bounds-checks. these can kill performance (branch prediction reduces this problem a bit). I would like to have an @annotation to tell the compiler to simply not insert them and let me guarantee that they won't be needed. it kills some safety properties of java but so does JNI; these can at least be ignored by the compiler if wanted, and inserted once the code is tested and works. I have mentioned it to some openjdk people but it would help if someone else also gave them a push.

Do you have a reference/test where the performance problem was really caused by the array checks within a recent JVM implementation? (Have you seen this blog entry?)
As far as I know, the Sun JDK does not emit bytecodes to check index bounds; the JVM performs the checks when it executes the array access bytecodes. I might be wrong, but some JVMs omit the checks for an array access if they can prove that an invalid index cannot occur. As a demonstration, please see this session:
$ cat Test.java
public class Test {
public void x(int[] y) {
    int j = y[1];
}
}
$ javac -g:none Test.java

$ javap -verbose Test
public class Test extends java.lang.Object{
    public Test();
    public void x(int[]);
}

C:\Users\root\tmp>C:\Java\jdk1.6.0_16_x86\bin\javap.exe -verbose Test
public class Test extends java.lang.Object
minor version: 0
major version: 50
Constant pool:
const #1 = Method       #3.#9; // java/lang/Object."<init>":()V
const #2 = class        #10;    // Test
const #3 = class        #11;    // java/lang/Object
const #4 = Asciz        <init>;
const #5 = Asciz        ()V;
const #6 = Asciz        Code;
const #7 = Asciz        x;
const #8 = Asciz        ([I)V;
const #9 = NameAndType #4:#5;// "<init>":()V
const #10 = Asciz       Test;
const #11 = Asciz       java/lang/Object;

{
public Test();
Code:
   Stack=1, Locals=1, Args_size=1
   0:   aload_0
   1:   invokespecial   #1; //Method java/lang/Object."<init>":()V
   4:   return

public void x(int[]);
Code:
   Stack=2, Locals=3, Args_size=2
   0:   aload_1
   1:   iconst_1
   2:   iaload
   3:   istore_2
   4:   return

}
Thanks, --g

--
Basic research is what I'm doing when I don't know what I'm doing. ~~~ Wernher von Braun

Wilhelm Burger

Dec 24, 2009, 8:15:02 AM

to ImageJX, abor...@gmail.com, wil...@ieee.org

Gabor,

I would assume that array bounds checking code is generated by the JIT
compiler. I made some experiments in the past with Sun's (and BEA's)
JVMs and noticed that Sun's server version was a lot smarter with this
than the (usual) client version. For example, given something like

int[] A ...
for (int i = 0; i < A.length; i++) {
    // do something with A[i] ...
}

access to A[i] is guaranteed to be within the boundaries of A, and the
server JVM seems to run the loop without boundary checking code. It is
easy to inhibit this mechanism though, for example, if you change the
above to

int[] A ...
int N = A.length;
for (int i = 0; i < N; i++) {
    // do something with A[i] ...
}

Nevertheless, the performance differences were striking in some
situations. I reported this to Wayne but did not follow up on the
subject. Would be interesting to try with the current JVMs.
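
For anyone who wants to repeat the experiment, here is a rough timing sketch of the two loop forms above. The class and method names are made up for illustration, and hand-rolled timings like this are easily distorted by warm-up and JIT behaviour, so the numbers are only indicative; a benchmark harness such as JMH would be more trustworthy. Whether the two forms differ at all depends on the JVM and its flags.

```java
/** Rough comparison of the two loop forms; results depend on the JVM used. */
final class BoundsCheckLoops {

    /** Loop bound read from a.length in the loop condition. */
    static long sumLengthInCondition(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    /** Loop bound copied into a local variable first. */
    static long sumLengthInLocal(int[] a) {
        long sum = 0;
        int n = a.length;
        for (int i = 0; i < n; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 20];
        for (int i = 0; i < a.length; i++) {
            a[i] = i & 0xff;
        }
        long sink = 0;
        // Warm up so the JIT has a chance to compile both methods before timing.
        for (int r = 0; r < 200; r++) {
            sink += sumLengthInCondition(a) + sumLengthInLocal(a);
        }
        long t0 = System.nanoTime();
        for (int r = 0; r < 200; r++) {
            sink += sumLengthInCondition(a);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < 200; r++) {
            sink += sumLengthInLocal(a);
        }
        long t2 = System.nanoTime();
        System.out.printf("a.length in condition: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("length in local N:     %d ms%n", (t2 - t1) / 1_000_000);
        // Printing the checksum keeps the summed values observable.
        System.out.println("checksum: " + sink);
    }
}
```

Running it once with the client VM and once with the server VM (where both are available) would reproduce the comparison described above.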

--Wilhelm

On Dec 24, 1:46 pm, Gábor Bakos <aborga...@gmail.com> wrote:
> Hi Johan,

> 2009/12/24 Johan Henriksson <maho...@areta.org>

>
> > On Thu, Dec 24, 2009 at 1:25 AM, Curtis Rueden <ctrueden.w...@gmail.com>wrote:
>
> >> Hi everyone,
>
> >> One important factor to consider when scrutinizing the ImageJ codebase is
> >> performance. Historically, Java performance has lagged behind C++ and other
> >> natively compiled languages, although Java performance has greatly improved
> >> this decade and is now generally within a factor of two of C (see Java
> >> section of the LOCI FAQ:http://loci.wisc.edu/faq). In short, Java is
> >> easier and more cross-platform than lower-level languages such as C++, and
> >> more performant than higher-level scripting languages such as Python, making
> >> it a good balance for projects like ImageJ.
>
> > I concur. the only real unsolvable annoyance with java vs C++ is that java
> > sometimes cannot optimize away array out of bounds-checks. these can kill
> > performance (branch prediction reduces this problem a bit). I would like to
> > have an @annotation to tell the compiler to simply not insert them and let
> > me guarantee that they won't be needed. it kills some safety properties of
> > java but so does JNI; these can at least be ignored by the compiler if
> > wanted, and inserted once the code is tested and works. I have mentioned it
> > to some openjdk people but it would help if someone else also gave them a
> > push.
>
> Do you have a reference/test where the performance problem was really caused
> by the array checks within a recent JVM implementation? (Have you seen this

> blog entry<http://blogs.azulsystems.com/cliff/2009/09/java-vs-c-performance-agai...>

> ?)
> As I know the Sun JDK does not emit byte codes to check index bounds, the
> JVM executes the checks for the array access byte

> code<http://en.wikipedia.org/wiki/Java_bytecode_instruction_listings>s.


Johan Henriksson

Dec 24, 2009, 10:06:04 AM

to ima...@googlegroups.com

I have read the research articles on this and how it's detected (don't have references available, sorry). it's clear that if you want to optimize your code you have to know which JVM you are using. now I'm leaning toward only considering openjdk since I take it that it will soon be *the* JVM. it is correct that the JVM tries to prove when boundary checks are needed. sometimes it cannot figure this out at compile time, e.g.

int bla(int[] a, int b, int c) = sum from a[b] to a[c]

in this case it can be clever enough to write two versions of the code, one with the ifs (always working) and one without (fast, but only works if b and c are within bounds of a). problem is, these optimizations are for common cases only. if you have complicated access patterns then it can't be optimized. but as a human, you can do way more sophisticated reasoning than the compiler ever will be able to, that's why some additional safety-disabling features would be nice.
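
The "two versions" idea can be imitated by hand: validate the whole index range once, then run a plain loop whose accesses are all known to be in bounds. The sketch below is only an illustration with made-up names. Java still checks every `a[i]` semantically, and whether a given JIT drops the per-access checks after seeing the up-front test is JVM-dependent and not guaranteed.

```java
/** Hand-written analogue of the range-checked "sum from a[b] to a[c]" example. */
final class RangeSum {

    /** Sums a[b] through a[c], inclusive. Returns 0 if b > c. */
    static long sum(int[] a, int b, int c) {
        // One up-front test covers every index the loop below will touch.
        if (b < 0 || c >= a.length) {
            throw new ArrayIndexOutOfBoundsException(
                "range [" + b + ", " + c + "] outside array of length " + a.length);
        }
        long s = 0;
        for (int i = b; i <= c; i++) {
            s += a[i]; // b >= 0 and c < a.length, so i is always in bounds here
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        System.out.println(sum(a, 1, 3));
    }
}
```

This keeps Java's safety guarantees while giving the compiler the simplest possible proof obligation. It does not help with the complicated access patterns mentioned above, where no single range test describes the indices.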

there are many other compiler personality issues to consider e.g. when inlining is done. until openjdk 1.5 there seemed to be no inlining between classes, then someone (schindelin?) observed a case. confusing. no inter-class inlining is *seriously* limiting. this affects how nice and modular code we can write without killing performance so it would be nice to understand how the compiler behaves here. any getPixel(x,y) is a perfect example.

if you forgot to buy a gift, here is a christmas gift for everyone: figure out how [][] behaves with regard to array checking. due to java allowing non-uniform lengths I think this has a serious penalty vs using [] as a 2d array. C# supposedly has some true 2d-array but in some article I saw it still had inferior performance to a [].
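
As a starting point for that experiment, here is a minimal sketch of the "[] as a 2d array" layout. A `float[][]` in Java is an array of separate row arrays, so `rows[y][x]` involves two array accesses, each with its own bounds check in principle, while the flat layout needs one. The class is hypothetical, and how much this matters on a given JVM is exactly what would need measuring.

```java
/** Hypothetical image wrapper storing a 2D image in one flat, row-major array. */
final class FlatImage {
    final int width;
    final int height;
    final float[] pixels; // pixel (x, y) lives at index y * width + x

    FlatImage(int width, int height) {
        this.width = width;
        this.height = height;
        this.pixels = new float[width * height];
    }

    // Note: only the combined index is checked, so an x >= width can
    // silently address a pixel in a later row instead of throwing.
    float get(int x, int y) {
        return pixels[y * width + x];
    }

    void set(int x, int y, float value) {
        pixels[y * width + x] = value;
    }

    /** Copies a rectangular float[height][width] array into flat storage. */
    static FlatImage fromRows(float[][] rows) {
        int h = rows.length;
        int w = (h == 0) ? 0 : rows[0].length;
        FlatImage img = new FlatImage(w, h);
        for (int y = 0; y < h; y++) {
            if (rows[y].length != w) {
                throw new IllegalArgumentException("row " + y + " has a different length");
            }
            System.arraycopy(rows[y], 0, img.pixels, y * w, w);
        }
        return img;
    }
}
```

The `fromRows` check makes the non-uniform-length issue explicit: the flat layout simply cannot represent a jagged array, which is what lets it get away with a single index computation.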

/Johan


Rick Lentz

Jan 5, 2010, 11:18:49 PM

to ima...@googlegroups.com

Greetings Johan and others,

I am very excited to be part of the community and agree with the challenges presented regarding developing with OpenCL. In regard to chaining processing pipelines together, keeping the result of operation A in global memory and passing it as an argument to operation B overcomes the latencies involved with host-to-device and device-to-host transfers. The cost of this mitigation tactic is the increased overhead of effectively managing the process pipelines (which is not trivial). To illustrate the latencies Johan discussed, I have provided a report generated by NVIDIA's freely available GPU Computing example called 'OpenCL Bandwidth Test'. (Note: this tool does not capture or make apparent an additional fact: the GPU device is shared by multiple contexts, including the host operating system's windowing process.)

Additionally, Johan mentioned that there are errors in the implementations of the initial OpenCL specification provided by the Khronos Group. Having solid unit test coverage will be an effective way to help identify and mitigate these types of risks on different platforms. Another challenge (more for the scientific community than for image processing) is double precision support. Most GPUs in circulation today do not have hardware-based double precision support. Therefore the OpenCL 1.0 API treats double types as 'optional'.

Despite these challenges, it is my opinion that the implementations of OpenCL 1.0 have matured enough to open up considerable hardware resources to the development team. In addition, there are language bindings that allow direct use of OpenCL from high level languages. In summary, the current benefits of OpenCL as a suitable complement to Java for improving performance of cross-platform software in a platform-independent way outweigh the current challenges.

Sincerely,

Rick W. Lentz
5 Jan, 2010

Results from ./oclBandwidthTest suggest that:
   1) transfer size can determine IO efficiency with OpenCL / GPU development
   2) use of pinned/pageable and direct/mapped memory impacts IO transfer performance
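
To make point 1 concrete, the reported bandwidths can be converted back into time per transfer. The sketch below does this for a few rows of the host-to-device table in this report. It assumes the tool reports MB as 2^20 bytes, which is an assumption about the tool rather than something stated in its output; the ratios between rows are the same under either definition of MB. The class and method names are made up for illustration.

```java
/** Converts (transfer size, reported bandwidth) pairs into time per transfer. */
final class TransferTime {

    /** Microseconds to move `bytes` at `mbPerSec`, assuming 1 MB = 2^20 bytes. */
    static double micros(long bytes, double mbPerSec) {
        double bytesPerSec = mbPerSec * (1 << 20);
        return bytes / bytesPerSec * 1e6;
    }

    public static void main(String[] args) {
        // Rows copied from the host-to-device table in this report.
        long[] sizes = {1024L, 102400L, 1024000L, 21049344L};
        double[] mbPerSec = {35.5, 2087.6, 3585.4, 4204.2};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%10d bytes: %10.1f us%n",
                sizes[i], micros(sizes[i], mbPerSec[i]));
        }
    }
}
```

Under that assumption, the 1024-byte transfer takes about 27 microseconds and the 1,024,000-byte transfer about 270 microseconds: a thousand times more data for roughly ten times the time. Small transfers are dominated by a fixed per-transfer overhead, which is one reason to batch data and to keep intermediate results on the device.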

./oclBandwidthTest --access=direct --devices=all --mode=shmoo --memory=pinned --size=100000000
./oclBandwidthTest Starting...

--access=direct
--devices=all
--mode=shmoo
--memory=pinned
--size=100000000

WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

Running on...

Device Radeon HD 4670
Shmoo Mode

.................................................................................
Host to Device Bandwidth, 0 Device(s), Pinned memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   1024                35.5
   2048                66.7
   3072                111.4
   4096                171.7
   5120                216.4
   6144                216.5
   7168                245.4
   8192                62.9
   9216                373.9
   10240            432.3
   11264            473.2
   12288            110.0
   13312            487.2
   14336            555.8
   15360            593.1
   16384            631.6
   17408            671.8
   18432            700.6
   19456            693.9
   20480            732.3
   22528            767.3
   24576            283.4
   26624            907.1
   28672            993.9
   30720            1031.6
   32768            983.3
   34816            1122.1
   36864            1095.8
   38912            1154.6
   40960            370.8
   43008            1280.2
   45056            1362.4
   47104            1358.0
   49152            1412.7
   51200            1493.3
   61440            1639.4
   71680            1661.7
   81920            921.2
   92160            1970.2
   102400            2087.6
   204800            2509.5
   307200            2840.8
   409600            3050.6
   512000            2849.5
   614400            3286.6
   716800            3420.7
   819200            3438.4
   921600            3259.3
   1024000            3585.4
   1126400            2619.7
   2174976            3861.2
   3223552            3629.7
   4272128            4022.8
   5320704            4078.0
   6369280            4089.3
   7417856            4101.7
   8466432            4136.5
   9515008            4150.9
   10563584            4153.3
   11612160            4151.7
   12660736            4174.4
   13709312            4181.8
   14757888            4171.7
   15806464            4185.1
   16855040            4164.8
   18952192            4187.9
   21049344            4204.2
   23146496            1292.2
   25243648            1369.0
   27340800            1365.7
   29437952            1369.3
   31535104            1366.2
   33632256            1369.5
   37826560            1368.8
   42020864            1370.2
   46215168            1369.2
   50409472            1321.1
   54603776            1371.0
   58798080            1370.9
   62992384            1371.7
   67186688            1368.2

.................................................................................
Device to Host Bandwidth, 0 Device(s), Pinned memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   1024                13.5
   2048                43.4
   3072                70.3
   4096                102.2
   5120                140.4
   6144                173.2
   7168                184.1
   8192                191.3
   9216                208.3
   10240            170.3
   11264            297.1
   12288            321.3
   13312            329.0
   14336            382.5
   15360            310.3
   16384            379.3
   17408            394.4
   18432            429.9
   19456            455.9
   20480            474.3
   22528            539.3
   24576            464.2
   26624            220.0
   28672            631.3
   30720            661.2
   32768            695.4
   34816            711.3
   36864            763.0
   38912            764.5
   40960            797.4
   43008            830.4
   45056            846.0
   47104            863.7
   49152            903.0
   51200            900.2
   61440            1032.1
   71680            1075.3
   81920            1131.9
   92160            1202.0
   102400            1261.9
   204800            1449.3
   307200            1592.6
   409600            1730.0
   512000            1798.5
   614400            1822.1
   716800            1859.4
   819200            1889.0
   921600            1905.3
   1024000            1927.7
   1126400            1940.3
   2174976            1956.9
   3223552            1938.7
   4272128            1962.1
   5320704            1983.9
   6369280            1980.4
   7417856            1945.9
   8466432            1988.5
   9515008            1988.3
   10563584            2014.6
   11612160            1993.4
   12660736            1991.7
   13709312            1995.8
   14757888            2002.2
   15806464            1998.5
   16855040            1959.5
   18952192            1975.5
   21049344            1969.4
   23146496            1980.4
   25243648            1992.8
   27340800            1972.8
   29437952            1993.5
   31535104            2000.9
   33632256            1995.0
   37826560            1995.6
   42020864            1999.0
   46215168            2000.4
   50409472            2017.8
   54603776            1998.4
   58798080            2003.2
   62992384            2015.5
   67186688            1965.2

.................................................................................
Device to Device Bandwidth, 0 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   1024                2.5
   2048                15.6
   3072                18.0
   4096                12.5
   5120                21.6
   6144                34.4
   7168                63.2
   8192                71.7
   9216                80.4
   10240            90.4
   11264            97.9
   12288            107.8
   13312            115.6
   14336            125.4
   15360            134.0
   16384            143.6
   17408            151.5
   18432            159.7
   19456            170.7
   20480            180.0
   22528            197.3
   24576            214.1
   26624            233.4
   28672            266.1
   30720            215.7
   32768            189.4
   34816            305.6
   36864            319.0
   38912            343.4
   40960            382.1
   43008            372.7
   45056            393.9
   47104            404.9
   49152            428.8
   51200            443.5
   61440            534.2
   71680            564.8
   81920            718.7
   92160            793.4
   102400            889.2
   204800            1146.9
   307200            1482.3
   409600            1577.3
   512000            1592.1
   614400            1631.6
   716800            1415.4
   819200            1639.6
   921600            1682.9
   1024000            1887.7
   1126400            1873.4
   2174976            1933.1
   3223552            1972.1
   4272128            1933.0
   5320704            1798.1
   6369280            1817.8
   7417856            2014.1
   8466432            2002.6
   9515008            2015.1
   10563584            2023.3
   11612160            1962.5
   12660736            1948.4
   13709312            2026.2
   14757888            2045.4
   15806464            2050.5
   16855040            2038.6
   18952192            2050.3
   21049344            2047.2
   23146496            2045.2
   25243648            1986.6
   27340800            2053.4
   29437952            1948.8
   31535104            2058.9
   33632256            2056.2
   37826560            2017.2
   42020864            2030.3
   46215168            2022.9
   50409472            1941.3
   54603776            2017.9
   58798080            2000.0
   62992384            2037.0
   67186688            8096.3

TEST PASSED

Press <Enter> to Quit...
-----------------------------------------------------------
