GPU Execution Slow: Algorithm Architecture Questions

Jared Nelsen

unread,
Aug 15, 2016, 9:47:13 PM8/15/16
to aparapi-discuss
Hi there,

I've managed to get OpenCL compiled and responsive, and I've debugged my Kernel code to the point that no errors are thrown. This is great progress! However, I am very disappointed in the performance thus far and am not certain how to go about diagnosing what is going on. I have several specific Aparapi questions and a few architecture questions.

1. Is allocating the resources necessary for a kernel expensive time-wise? In other words, how expensive is allocating / disposing of a Kernel?

2. Is there something wrong with allocating many kernels in one sweep and then executing / disposing of them serially? Is this a common use case?

3. All of the test programs included with Aparapi function without issue, but my program is extremely slow and I am concerned that my configuration is incorrect. How would you go about diagnosing what is wrong?
- The CPU implementation of my algorithm takes fractions of a second, but the GPU implementation takes upwards of 60 seconds.


Architecture questions:

I am using Aparapi Kernels to multiply matrices together.

I have a set of matrix multiplications M. After each M completes I need to feed the result into the next M:

[M1, M2, M..., MN]

I am processing this set of matrix multiplications recursively: M1 executes, its result is fed into M2, and M1 is removed from the list:

[M1, M2, M..., MN] -> [M2, M..., MN] -> [M..., MN] -> [MN] -> [] : MN's result is returned

I am implementing these matrix multiplications as Kernels to be executed on the GPU:

[K1, K2, K..., KN] -> [K2, K..., KN] -> [K..., KN] -> [KN] -> [] : KN's result is returned

I am simply allocating N Kernels to begin with, loading the first with the input and recursively applying this formula. We can call this formula F.
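Roughly, one pass of F looks like the sketch below. The class and method names (MatMulKernel, setInput(), getOutput()) are just stand-ins for illustration, not my actual code:

// Sketch of one pass of F; MatMulKernel, setInput() and getOutput() are illustrative names.
float[] result = input;
for (MatMulKernel k : kernels) {   // [K1, K2, ..., KN]
  k.setInput(result);              // feed the previous result in
  k.execute(range);
  result = k.getOutput();          // Ki's result becomes Ki+1's input
  k.dispose();                     // Ki is discarded once its result has been consumed
}
return result;                     // KN's result is returned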

In my governing algorithm I need to tile many, many formula F's and execute each F before I advance through the next iteration of the governing algorithm:

[F[K1, K2, K..., KN]1, F[K1, K2, K..., KN]2, F[K1, K2, K..., KN]..., F[K1, K2, K..., KN]X], where X can grow potentially into the millions.

At this point you may see why I am asking about Kernel allocation cost. I need to do it many, many times, so I'd like it to be cheap.

My final question is this:

This foray into GPU computation was started in hopes that I could shrink the cost of these matrix multiplications tremendously. In theory this is a good bet; this problem is perfect for the GPU. But given that I am tiling this operation at scale, concurrently on many threads, is this sane in practice as well as in theory? Is Aparapi / GPU task delegation a good fit for this problem, in your opinion?

I am aware that my paranoia towards a global fatal flaw may be getting the best of me. I am reasonably certain that this is a configuration problem and that is primarily what I would like your input on. I just thought I'd run the background by you to hear what you have to say.

Thanks very much for the learning opportunity,
Jared Nelsen

Gary Frost

unread,
Aug 19, 2016, 4:30:09 PM8/19/16
to aparapi-discuss
You should try to use the same kernel instance. Otherwise each call of execute() will incur the cost of generating and compiling OpenCL.

I know many examples show this form 

(new Kernel(){
  public void run(){
  }
}).execute(n); 

But we do this to demo the API in as few lines as possible. 

It is much better to create your own class inheriting from Kernel and overriding run().

Then create one instance, move data into the instance, call execute(), then move data in again and call execute() again.

This is the most efficient form 
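
As a rough sketch (not from any Aparapi sample; the class name, field names and sizes here are purely illustrative), that pattern looks something like this:

import com.amd.aparapi.Kernel;   // package may be com.aparapi in newer builds
import com.amd.aparapi.Range;

// A reusable Kernel subclass for square, row-major matrix multiply.
class MatMulKernel extends Kernel {
  float[] a, b, c;   // inputs a, b and output c
  int n;             // matrix dimension

  void setData(float[] a, float[] b, float[] c, int n) {
    this.a = a; this.b = b; this.c = c; this.n = n;
  }

  @Override public void run() {
    int row = getGlobalId(0);
    int col = getGlobalId(1);
    float sum = 0f;
    for (int k = 0; k < n; k++) {
      sum += a[row * n + k] * b[k * n + col];
    }
    c[row * n + col] = sum;
  }
}

// ...elsewhere, e.g. in main(): one instance, many executions,
// so the OpenCL is generated and compiled only once.
MatMulKernel mm = new MatMulKernel();
for (int step = 0; step < steps; step++) {
  mm.setData(a, b, c, n);             // move new data into the same instance
  mm.execute(Range.create2D(n, n));   // one work item per output element
  // use c here, e.g. feed it into the next multiplication in the chain
}
mm.dispose();                         // release resources when finished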

Gary


Barney Pitt

unread,
Aug 26, 2016, 4:41:03 PM8/26/16
to aparapi...@googlegroups.com

Hi jnelson,

Actually, it's no longer as bad as Gary suggests, if you are using the current build from GitHub.

Thanks to multiple contributions (some of them mine ;)), most of the old overheads of Kernel instantiation are no longer a factor. In the latest repository version, the reflection, bytecode analysis, CL generation, CL compilation and so on are all cached (even the compiled binaries are cached on the JNI side).

It is still the case that, for maximum performance, you should reuse Kernel instances, but you will probably find that in the latest development build it doesn't make a great deal of difference.

Barney


Jared Nelsen

unread,
Aug 26, 2016, 4:53:24 PM8/26/16
to aparapi-discuss
Hi Barney,

Thanks for the update! I think these performance improvements will help overall. I will soon finish an implementation that uses a single Kernel type; I think it has actually made my project cleaner and more understandable. Once I am done implementing it and have worked out the bugs, I may be able to do some performance comparisons between my original serial Kernel creation / execution implementation and my current single giant Kernel implementation.

Thanks for the updates and I will be sure to report back!

Jared Nelsen

Barney Pitt

unread,
Aug 26, 2016, 5:17:07 PM8/26/16
to aparapi...@googlegroups.com


In my (principal) application it is very difficult to reuse Kernels, because multiple threads might need access to a given Kernel type simultaneously (which prompted me to contribute some of the changes referred to). In a client-server scenario such as mine, things can get quite difficult quite quickly. I think that if it is practical to create one Kernel and reuse it, you should do so, but I would be very interested to see the results of any comparisons you might perform.

Barney

Gary Frost

unread,
Aug 27, 2016, 5:36:52 PM8/27/16
to aparapi-discuss
Thanks Barney. Sorry, I was recalling the bad old days; I recall now that you fixed this.

Gary

Barney Pitt

unread,
Aug 30, 2016, 10:11:09 AM8/30/16
to aparapi...@googlegroups.com


Thanks, but it wasn't just me! I applied some C++ code from Jaeju Kim, and I think it was Ryan LaMothe who did some of the earliest caching stuff.


Jared Nelsen

unread,
Aug 30, 2016, 10:18:51 PM8/30/16
to aparapi-discuss
Hi all,

I have finished the implementation of my single Kernel and have started to run some simple tests, but I am having trouble debugging the execution results. I am able to load data into the Kernel, but my debugger does not let me see the actual execution of the Kernel code, which makes sense because the GPU memory is not shared / local. I can see the final results of the computation, but it is difficult to reason about what is going on with the math, and the results of the multiplications are not what I expected.

What are some tools and methods I might use to debug what is going on here? Is there a way to read the current state of GPU memory as the computation is happening, and would that even be useful? I am using IntelliJ IDEA, if that is of any use, but I would be happy exploring external tools or examining the generated OpenCL.

Thanks for anything you can pass my way.

Jared

Barney Pitt

unread,
Aug 31, 2016, 6:57:30 PM8/31/16
to aparapi...@googlegroups.com

Jared,

It is possible to instruct Aparapi to use a Java "Device", which will run your Kernel as plain old Java in a single thread. Call KernelManager.setKernelManager(KernelManagers.SEQUENTIAL_ONLY) at the start of your main().

This way you can debug your Kernel very easily. You can even include logging; just place any logging calls inside a method annotated @Kernel.NoCL so that the logging is disabled when running in OpenCL.
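
As a rough sketch (imports omitted, since package paths vary between builds; MyKernel is just a placeholder for your own class):

// At the start of main(): force plain-Java, single-threaded execution for debugging.
KernelManager.setKernelManager(KernelManagers.SEQUENTIAL_ONLY);

// A Kernel with logging that is only active when running as Java.
class MyKernel extends Kernel {
  @Override public void run() {
    int i = getGlobalId();
    // ... your computation ...
    log(i);   // this call is disabled when the Kernel runs as OpenCL
  }

  @Kernel.NoCL
  void log(int i) {
    System.out.println("gid = " + i);
  }
}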

Barney

Barney Pitt

unread,
Sep 5, 2016, 6:43:37 PM9/5/16
to aparapi...@googlegroups.com


I guess this prompts the question: when will the current dev build be put live?

I have no idea who would be responsible for such an action, now that Gary no longer works for AMD.

Barney




Ryan LaMothe

unread,
Sep 5, 2016, 6:59:52 PM9/5/16
to aparapi...@googlegroups.com
The primary challenge is finding a Jenkins builder (or equivalent) with all the necessary operating systems and libraries available. We used to have this and were offering it for use to the Aparapi community, but haven't revisited it in over a year, unfortunately :(

Maybe I can try to get that all working again sometime soon...

Sent from my iPhone --- Please excuse any typos or autocorrect mistakes

Jared Nelsen

unread,
Sep 5, 2016, 7:37:40 PM9/5/16
to aparapi-discuss
Barney,

Yes, I just ran into this when I saw that KernelManager.java was not present in the binary releases. I did see it on GitHub, though. Would it be possible for me to build from that source?

The issue I am trying to debug is that any calculations through my neural network matrix model always end up being NaN, and I am not sure why. I am able to load arrays into the Kernel, but any mathematical operation on them produces only NaNs. Does this maybe indicate something other than my math simply being incorrect? Does it ring any bells?

My next step is to isolate the Kernel matrix calculations in their own project and see if they operate correctly. If they do, then I know that my math is likely incorrect in my project or that I am doing something wrong with the Kernel; otherwise, it must be an issue with my configuration. However, the samples work as expected.

I will report back when I am able to work on it.

Again, thanks for the help and consideration! It speaks very well of the Aparapi team!

Jared

Ryan LaMothe

unread,
Sep 5, 2016, 7:53:21 PM9/5/16
to aparapi...@googlegroups.com
Building Aparapi on Linux or OS X is simple; Windows, not so much. That's usually the hang-up, since many people are on Windows.


Sent from my iPhone --- Please excuse any typos or autocorrect mistakes

Jared Nelsen

unread,
Sep 7, 2016, 11:16:09 PM9/7/16
to aparapi-discuss
Compatriots,

I've spent some time attempting to build Aparapi on Linux but have run into persistent issues. I've been able to edit the build.xml entry for my machine's OpenCL distribution so that the build begins, but it does not complete, as g++ is now screaming at me for reasons I don't understand. I am guessing that it too can't find my OpenCL implementation. I've reached beyond my comfort level in Linux here.

As a last-ditch effort, I would again ask that you review my code and try to spot any obvious mistakes I might be making. If you would take a little time to do so, I would be grateful.

http://paste.ofcode.org/vrNRyFmEvHFKgLaqV7ZYi6

I've put a lot of eggs into one basket with Aparapi and parallelization, with no success over at least two months. If I am not able to get this going soon, I am going to have to look for another solution. My specialty and the focus of this project is the implementation of advanced AI algorithms, in this case for a specific (but ultra double top secret) purpose, but I can't continue until I have a viable means to scale them. GPU computation seemed to be the best fit, as I can boil the innermost core of what I am attempting to do down into many serial matrix / tensor multiplications. Parallel computation is a completely new frontier for me, and it seems to have a very steep learning curve. To boot, I am still shaky on the conceptual linear algebra at play here.

Again, I sincerely thank you for your time and consideration. And if you can't help me further or if these topics are out of the intended range of this forum then I understand. Any tips about where to go from here would also be appreciated.

Thanks,

Jared Nelsen

Barney Pitt

unread,
Sep 8, 2016, 6:34:03 PM9/8/16
to aparapi...@googlegroups.com


Hi

We *really* need automated builds working again, not to mention better build scripts. It's definitely something that needs addressing; it took me several hours to figure out how to build on Windows originally (I've not tried Linux), and getting it working required a lot of tampering with scripts (not just property files).

However, it should be possible to debug into a pure-Java implementation with the binary release you have; it just uses a different (per-Kernel) mechanism. If I recall correctly, the code is

kernel.setExecutionMode(EXECUTION_MODE.JTP);

That will cause the code to be invoked on a Java ThreadPoolExecutor.

EXECUTION_MODE.SEQ would run it in a single thread, which is obviously easier for debugging, but I seem to recall it is broken in the current non-development release. Give EXECUTION_MODE.SEQ a try; if it fails, you'll have to use EXECUTION_MODE.JTP.
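
As a sketch (MyKernel stands in for your own Kernel subclass):

MyKernel kernel = new MyKernel();
kernel.setExecutionMode(Kernel.EXECUTION_MODE.SEQ);   // single-threaded Java, easiest to step through
// If SEQ turns out to be broken in your release, fall back to the thread pool:
// kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);
kernel.execute(range);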

Barney. 

Ryan LaMothe

unread,
Sep 8, 2016, 7:25:28 PM9/8/16
to aparapi...@googlegroups.com
Cross-platform Java builds are simple; cross-platform native builds are not. In reality, native builds are a nightmare, as each environment is a special snowflake... which is the entire hang-up with the build scripts.


Sent from my iPhone --- Please excuse any typos or autocorrect mistakes

Jared Nelsen

unread,
Sep 8, 2016, 9:53:13 PM9/8/16
to aparapi-discuss
Barney,

EXECUTION_MODE.SEQ worked! That's confirmation that it isn't broken in the current non-development release. I am now able to step through and debug my algorithm. Long story short: I may be using a parallel matrix multiplication algorithm in a context that is incorrect for my program. I will need to step back and think about this some more.

Thanks team Aparapi for following through!

I will report back whenever I advance through this but that may not be until I am a very old man!