Optimizing an SPMV kernel using Aparapi


Niaz Khan

Oct 27, 2015, 8:04:21 PM
to aparapi-discuss, niaz...@yahoo.com
Dear All,


I am trying out an SPMV (sparse matrix-vector multiply) kernel using Aparapi. I have gone through most of the online resources on Aparapi, including this link where Gary pointed out a few improvements to the dense matrix multiplication kernel:

https://vasanthexperiments.wordpress.com/2011/11/20/aparapi-java-matrix-multiplication-example/

Since this is SPMV, the kernel is slightly different from the dense one. Here it is:

    kernel.execute(rows);

    public void run() {
        int i = getGlobalId();
        mult[i] = 0;
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; j++){
            mult[i] += (data[j] * x[colIndices[j]]);
        }
    }

Here each row is pretty much done by one thread.

The results for a sparse matrix with 48M non-zeros (nnz) on my laptop with an Nvidia GTX 485M, averaged over 10 runs on Windows 7 64-bit, are:

JCUDA kernel (CUSPARSE) takes on average 27ms

Aparapi kernel takes on average 700ms

I have a few questions related to this and would appreciate some suggestions.

(1) Can this kernel be further optimized in Aparapi, along the lines of the dense kernel optimization? That approach might not be applicable to the sparse kernel, but I am not sure whether it could be improved further.

(2) I was trying to optimize the Aparapi SPMV kernel by assigning a single row to a warp. I have read in the online Quick Reference guide for Aparapi that information like the warp, work-item or thread id cannot be obtained before the kernel's launch. It would be very handy to know why there is such a restriction, i.e. why one cannot assign threads before the kernel's launch - e.g. for SPMV, assigning a warp to a row, or setting the number of blocks and threads per block. If this sort of functionality is already available in Aparapi, it would be very handy to have an idea of how to use it.

(3) Have you tried other kernels for benchmarking Aparapi? I am wondering whether the SPMV kernel might simply not be well suited to Aparapi, and whether for other kernels the performance gap between Aparapi and JCUDA etc. would be smaller.


I would really appreciate your suggestions and I am looking forward to hearing from you.

Regards
Niaz

Michael Zucchi

Oct 28, 2015, 2:39:08 AM
to aparapi...@googlegroups.com

Hi Niaz,

On 28/10/15 10:34, Niaz Khan wrote:

> Since I am using SPMV the kernel is slightly different from the dense
> kernel which is
>
> kernel.execute(rows);
>
> public void run() {
> int i = getGlobalId();
> mult[i] = 0;
> for (int j = rowPtr[i]; j < rowPtr[i + 1]; j++){
> mult[i] += (data[j] * x[colIndices[j]]);
> }
> }
>


As a starting point I would change it so that each workgroup calculates
a single result.

Do this by parallelising the for loop across all work items. This will
make the data[j] and colIndices[j] accesses coalescable which improves
efficiency significantly - this would be the main bottleneck here. The
worst-case scenario of the above loop is where each row has N>64 active
items (N depending on hardware), in which case that innocuous looking
data[j] will actually transform into 64 separate global memory fetches
*per loop iteration*. colIndices[j] will do the same.

Reading x[X] will just have to remain scatter-gather but the given
re-arrangement might improve locality of reference and chances for
coalescing.

Rather than 'i' being the globalid it should be the groupid, and the
loop now does 64 items per iteration, one per work-item. Using this
technique the kernel can process any number of items locally but still
with good parallel efficiency (this solves the problem of wanting to set
a per-kernel local size, which is not possible nor necessary).

A parallel sum is used to finish off the calculation.

Use something like Range.create(nrows*64, 64) to set up the work range and
force 64 work-items per group, for gpu-specific optimisation. The reason to use a
hard number rather than letting opencl choose is so you can allocate the sums
array properly, and it just makes the code easier to write - it's hard enough
already. It often also produces faster code due to constant optimisations.

@Local float sums[] = new float[64];

public void run() {
    int i = getGroupId();
    int k = getLocalId();
    float sum = 0;
    int j0 = rowPtr[i];
    int j1 = rowPtr[i + 1];

    for (int j = k; j < (j1 - j0); j += 64) {
        sum += data[j + j0] * x[colIndices[j + j0]];
    }
    sums[k] = sum;
    localBarrier();

    ... insert code for parallel sum here ...

    // in this case work item 0 has the result in sum (see next)
    if (k == 0) {
        mult[i] = sum;
    }
}

Note that this will produce a different numerical result to your
original kernel as + is not generally associative for ieee floats due to
rounding.
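For example (any two values of wildly different magnitude will show it; this tiny snippet is only an illustration):

    float a = 1e8f, b = -1e8f, c = 1e-3f;
    System.out.println((a + b) + c);   // prints 0.001
    System.out.println(a + (b + c));   // prints 0.0 - b + c rounds back to -1e8f

So don't be surprised if the GPU and Java results differ in the last few bits.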

Parallel sum has a simple optimisation:

// add 64 dummy spots, worksize is still 64
@Local float sums[] = new float[64+64];

...
// setup
sums[k+64] = 0;
localBarrier();

// sum code
for (int j=32;j>0;j>>=1) {
sum += sums[j+k];
localBarrier();
sums[j] = sum;
localBarrier();
}
// result already in 'sum' in thread (k==0), or in sums[0] for all threads.
// it's better to run any less-than-worksize work on the lower numbers
...

By having 64 extra zeros you can avoid any range test and branch logic
in both the setup and inner loop.

(if you know the hardware specifically you can also remove all
localBarrier() calls in certain portability-breaking circumstances.
Unlikely to be worth anything here though.).


> Here each row is pretty much done by one thread.
>
> The results for Sparse matrix with nnz 48M on my laptop with Nvidia
> GTX 485M for 10 runs using Windows 7 64 bits are:
>
> JCUDA kernel (CUSPARSE) takes on average 27ms
>
> Aparapi kernel takes on average 700ms
>
> I have few questions related to this and I would appreciate to have
> some suggestions on it.
>
> (1) Is this current kernel can be further optimized from Aparapi,
> somehow the way the dense kernel is optimized. It might not be
> applicabale in the sparse kernel but not sure if this could be further
> improved?
>

Well, as above I guess. That was just off the top of my head so might
be buggy but the general idea is there. It might even be slower but i
don't think so.

> (2) I was trying to optimize the Aparapi SPMV kernel, where a single
> row is assigned to a warp. I have read in the online Quick Reference
> guide for Aparapi, that information like warp or work-item or
> thread-id etc can't be obtained before kernel's launch.Would be very
> handy to know why there is such a restriction that one can't assign
> more threads before the kernel's launch e.g. in the case of SPMV
> assigning a warp to a row , assigning number of blocks and threads per
> block etc. If this sort of functionality is already available in
> Aparapi, it would be very handy to have an idea about how to use that.
>

None of these things 'exist' outside of a running kernel, or are
governed by the hardware capabilities, opencl processing model, and/or
kernel execution range. e.g. all work groups are the same size and the
global work size is some integer multiple thereof.

re cuda nomenclature, I think "thread-id" is local_id/localId in
opencl/aparapi, work-item id is global_id/globalId, and block-id is
group_id/groupId. These are the things I think you're wanting here.
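Roughly, inside an Aparapi kernel's run(), just restating that mapping in code:

    int localId   = getLocalId();   // cuda threadIdx.x
    int groupId   = getGroupId();   // cuda blockIdx.x
    int localSize = getLocalSize(); // cuda blockDim.x
    int globalId  = getGlobalId();  // cuda blockIdx.x * blockDim.x + threadIdx.x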

FWIW I think 'thread' is a poor name choice for the work item
abstraction and local-id makes it clearer that it really isn't a
thread. It's not even a thread on the hardware, but a SIMD lane, and
besides that these machines have other things called threads on them
already. It also gets people expecting things like mutexes to be a good
idea, and they really aren't (on a gpu).


> (3) Have you tried some other kernels for benchmarking Aparapi, I am
> just wondering that may be the SPMV kernel might not be well suited
> for Aparapi and in case of other kernels the performance gap of
> Aparapi with JCUDA etc might not be that much?.
>

Your kernel is just not optimised for a GPU topography but this isn't
too hard to address. aparapi is a fairly thin conceptual layer on
opencl so any opencl optimisations should apply, and these are all
general tight-concurrency techniques applicable to other similar languages.

This has the side-effect that the gpu-optimised version is probably
cpu-unoptimised, but this isn't something you can avoid as opencl
explicitly exposes such details on purpose.

Another side-effect is that the code is probably slower for small
matrices, so the size matters too. And if the problem is very large
(live col length is >> 64) then you can do some other optimisations like
unrolling the inner loop a few steps, which can be worth the effort. With
opencl you can just dynamically create the code to take the problem size
and hardware capabilities into account at runtime.
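For example, a 2-way unroll of the per-work-item loop from the kernel above might look roughly like this (just a sketch; the tail loop picks up whatever the unrolled part doesn't cover):

    int len = j1 - j0;
    int j = k;
    for (; j + 64 < len; j += 128) {
        sum += data[j + j0]      * x[colIndices[j + j0]];
        sum += data[j + j0 + 64] * x[colIndices[j + j0 + 64]];
    }
    for (; j < len; j += 64) {
        sum += data[j + j0] * x[colIndices[j + j0]];
    }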

Cheers,
!Z

Niaz Khan

Oct 29, 2015, 11:36:22 AM
to aparapi-discuss
Hi Michael,


Thanks a lot for your very informative reply. I will try to apply the idea and will let you know how it compares with the initial version.

(1) I will be trying the SPMV kernel on an AMD device, an Nvidia device and also on CPUs. Regarding using a range like Range.create(nrows*64, 64), you have already mentioned that N will depend on the hardware: for an AMD device N would be 64, for an Nvidia device 32, and for a CPU it might be 8 or 16 depending on the hardware. Is there a way, before setting up the kernel range and launching, to specify this in a portable way rather than hard-coding a value for N manually - some sort of device query or check to set N depending on the device?


(2) I had a look at Aparapi's generated OpenCL for the initially posted SPMV kernel. It would be very handy to have an idea of how Aparapi generates the OpenCL. I will be trying some other kernels as well, so this would help in mapping them to Aparapi.



Regards
Niaz

Michael Zucchi

Oct 29, 2015, 8:15:57 PM
to aparapi...@googlegroups.com
On 30/10/15 02:06, Niaz Khan wrote:
> Hi Michael,
>
>
> Thanks a lot for your very informative reply, I would try to apply the
> idea and would let you know the difference with the initial version.
>
> (1) I would be trying the SPMV kernel on AMD device, Nvidia device and
> also on CPUs. Regarding using the range like Range.create(nrows*64,
> 64), you have already mentioned that N will be depend on the hardware.
> In case of AMD device N would be 64 and in case of nvidia device it
> would be 32, for a CPU it might be 8 or 16 depending on the hardware.
> Is there a way before setting up the kernel range & launch to specify
> that in a portable way rather than hard coded way where I have to give
> vale to N manually. I mean some sort of device query or check to have
> N depending on the device.
>

Yes you can, but no you don't want to.

Portability and optimisation are usually opposites.

Everything can be generic and automatic sizing (based on in-kernel
queries) but you will just end up with fatter, slower, harder-to-write,
harder-to-test code as it will need to do more workgroup queries and
address calculations which is just non-functional busy-work. If you're
using opencl you can move these calculations to a kernel generator but
aparapi precludes that by design.

I think you can only query the device worksize from a compiled kernel
which aparapi doesn't expose. It's of limited use (LDS sizing) unless
you're dynamically generating kernels at runtime.

You only need to worry about the local work size when you use LDS, and
using LDS is really only of much use on a GPU at the moment. So you've
already assumed a particular hardware target. Just write for the N=64
case, you're wasting your time also doing N=32, and unless you have very
specific circumstances this will be near optimal on nvidia hardware
too. And if you also need performance-optimised support for non-gpu
devices then you're much better off just writing two separate kernels in
almost every case.

CPU is a bit trickier as it depends on whether the compiler uses SIMD
for work items or not. You're probably better off just not using LDS or
barriers and assuming N=1 and let the compiler choose, particularly for
aparapi as you're limited to scalar ops. Sometimes plain java is faster
too.

If you're not using local memory (many algorithms will not benefit or be
slower with it) then the code can usually be common and you can let the
device choose the local size.
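In other words, for the CPU path something like the original one-row-per-work-item kernel, with no @Local arrays and no barriers, is probably the right shape (sketch only):

    public void run() {
        int i = getGlobalId();
        float sum = 0;
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; j++) {
            sum += data[j] * x[colIndices[j]];
        }
        mult[i] = sum;
    }

executed with kernel.execute(rows) as in your first version, so the runtime picks the local size.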

Or you can ignore my recommendations and learn the hard way as i had to! ;-)

>
> (2) I had a look at the Aparapi's generated OpenCL for the initial
> posted spmv kernel. Would be very handy to have an idea how aparapi
> generates the OpenCL. I would be trying some other kernels as well so
> this would be very handy in mapping them to aparapi.
>

The source-code is all there. It "just" compiles the bytecode to opencl
in a fairly straightforward manner.


Niaz Khan

Oct 31, 2015, 12:53:45 PM
to aparapi-discuss
Hi Michael,


I have been trying out your idea for the SPMV implementation, which is:

    @Local double sums[] = new double[32]; 
    public void run() { 
   int i = getGroupId(); 
   int k = getLocalId(); 
   float sum = 0; 
   int j0 = rowPtr[i]; 
   int j1 = rowPtr[i+1]; 
   for (int j=k;j < (j1-j0);j+=32) { 
      sum += data[j+j0] * x[colIndices[j+j0]]; 
   } 
   sums[k] = sum; 
   localBarrier(); 
   //Parallel sum
   for (int j=32;j>0;j>>=1) { 
     sum += sums[j+k]; 
     localBarrier(); 
     sums[j] = sum; 
     localBarrier(); 
   } 

   if (k == 0) { 
      mult[i] = sum; 
   } 

    } 


I am using double instead of float and also for N I am using 32 instead of 64 as the device is an Nvidia device (GTX 485m).
The kernel range has been set up like this:
         Range range = Range.create(rows*32, 32);
         kernel.execute(range);


I have tried it on a matrix with 1,489,752 rows, 1,489,752 cols and 10,319,760 non-zeros on Windows 7 64-bit, but I am getting this error:

Execution mode=GPU
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00000000775e0e91, pid=10996, tid=6860
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 )
# Problematic frame:
# C  [ntdll.dll+0x50e91]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\NIAZ\Desktop\workspace\ \hs_err_pid10996.log
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.


(1) Not sure why the access violation error is thrown; the kernel code seems fine to me?

(2) I am also trying to debug it, but I am not sure whether it is possible to debug inside the kernel directly from Java?


Regards
Niaz
 

Ryan LaMothe

Oct 31, 2015, 12:56:32 PM
to aparapi...@googlegroups.com
Does it throw this error for all possible valid sizes, or only the ones you specified below?


Niaz Khan

Oct 31, 2015, 1:28:16 PM
to aparapi-discuss
I have tried with N = 64 too on the same matrix and it throws the same error. I am going to test it on a different matrix size.
Thanks
 

Michael Zucchi

Oct 31, 2015, 7:53:19 PM
to aparapi...@googlegroups.com


Debug the kernel using the java driver I guess.  Like I said i just wrote that on the fly based on a tiny snippet of code so i might've messed up some ranges or misinterpreted the storage format.

Niaz Khan

Nov 1, 2015, 11:12:37 AM
to aparapi-discuss
Michael,


I am using CSR as the storage format, so the idea should work with that. I am using Eclipse; I guess in order to debug the kernel using the Java driver I have to do it from the command line, or would it be possible from Eclipse? I tried attaching the source in the Eclipse debugger, but it still can't step through the source when I step into the kernel.
Normally, how do you debug an Aparapi kernel when you have to?


Cheers
Niaz

Ryan LaMothe

Nov 1, 2015, 11:29:02 AM
to aparapi...@googlegroups.com
Michael means run your application in JTP mode (disable GPU execution) and debug normally. Maybe there is a hidden NPE or an out-of-bounds access on an array, and the GPU execution is swallowing the error.
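Something like this (standard Aparapi API, the same calls you already use):

    kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP); // pure Java thread pool, no OpenCL
    kernel.execute(range);
    // breakpoints inside run() then behave like normal Java code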


Niaz Khan

Nov 3, 2015, 12:01:33 PM
to aparapi-discuss
Dear Michael and Ryan,


I have been able to run the kernel in JTP mode. After switching to GPU mode, it still throws the same error:

Execution mode=GPU
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00000000775e0e91, pid=10996, tid=6860
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 )
# Problematic frame:
# C  [ntdll.dll+0x50e91]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\NIAZ\Desktop\workspace\ \hs_err_pid10996.log
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.


Not sure why it is still thrown after running fine in JTP mode. Could you point me to some documentation or resources on how Aparapi works in JTP, CPU and GPU modes? I just want to get some insight into how Aparapi runs on CPUs and GPUs.

Cheers
Niaz

Gary Frost

Nov 3, 2015, 12:34:32 PM
to aparapi-discuss
SEQ - Executes your kernel sequentially from Java.
JTP - Just divides your global size (range) into 'N' threads, each of which executes (global size)/N kernels. N depends on the # of cores you have.
CPU - Runs your kernel via OpenCL on your machine's CPU driver (you will need the Intel or AMD OpenCL runtime for this - NVidia does not support this).
GPU - Runs your kernel via OpenCL on your GPU. You will need a suitable GPU card for this mode.

You might try turning on JNI tracing (it is verbose) to isolate issues. 

Add this to your java command line. 

-Dcom.amd.aparapi.enableVerboseJNI=true

Gary

Niaz Khan

Nov 3, 2015, 1:15:20 PM
to aparapi-discuss
Hi Gary,


 Thanks for the reply.

I am using an Nvidia GTX 480M GPU, which should be suitable for GPU mode. After adding -Dcom.amd.aparapi.enableVerboseJNI=true
the output is:

Execution mode=GPU
in setArgs arg 0 partialSums type 00000e90
in setArgs arg 0 partialSums is local
in setArgs arg 1 rowPtr type 00001288
in setArgs arg 1 rowPtr is *not* local
in setArgs arg 2 data type 00001290
in setArgs arg 2 data is *not* local
in setArgs arg 3 x type 00001290
in setArgs arg 3 x is *not* local
in setArgs arg 4 colIndices type 00001288
in setArgs arg 4 colIndices is *not* local
in setArgs arg 5 mult type 00001690
in setArgs arg 5 mult is *not* local
got type for partialSums: 00000e90
testing for Resync javaArray partialSums: old=0000000000000000, new=000000004AE142E8
Resync javaArray for partialSums: 000000004AE142E8  0000000000000000
NewWeakGlobalRef for partialSums, set to 000000004AF021A0
updateNonPrimitiveReferences, args[0].lengthInBytes=1024
got type for rowPtr: 00001288
testing for Resync javaArray rowPtr: old=0000000000000000, new=000000004AE142F0
Resync javaArray for rowPtr: 000000004AE142F0  0000000000000000
NewWeakGlobalRef for rowPtr, set to 000000004AF021A8
updateNonPrimitiveReferences, args[1].lengthInBytes=124
got type for data: 00001290
testing for Resync javaArray data: old=0000000000000000, new=000000004AE142F8
Resync javaArray for data: 000000004AE142F8  0000000000000000
NewWeakGlobalRef for data, set to 000000004AF021B0
updateNonPrimitiveReferences, args[2].lengthInBytes=1440
got type for x: 00001290
testing for Resync javaArray x: old=0000000000000000, new=000000004AE14300
Resync javaArray for x: 000000004AE14300  0000000000000000
NewWeakGlobalRef for x, set to 000000004AF021B8
updateNonPrimitiveReferences, args[3].lengthInBytes=240
got type for colIndices: 00001288
testing for Resync javaArray colIndices: old=0000000000000000, new=000000004AE14308
Resync javaArray for colIndices: 000000004AE14308  0000000000000000
NewWeakGlobalRef for colIndices, set to 000000004AF021C0
updateNonPrimitiveReferences, args[4].lengthInBytes=720
got type for mult: 00001690
testing for Resync javaArray mult: old=0000000000000000, new=000000004AE14310
Resync javaArray for mult: 000000004AE14310  0000000000000000
NewWeakGlobalRef for mult, set to 000000004AF021C8
updateNonPrimitiveReferences, args[5].lengthInBytes=240
back from updateNonPrimitiveReferences
got type for arg 0, partialSums, type=00000e90
ISLOCAL, clSetKernelArg(jniContext->kernel, 0, 1024, NULL);
got type for arg 1, rowPtr, type=00001288
runKernel: arrayOrBuf ref 000000004AF021A8, oldAddr=0000000000000000, newAddr=0000002BEB3D3D98, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB3D3D98, contents: 00 00 00 00 04 00 00 00 
rowPtr 1 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=0000007c bytes, address=0000002BEB3D3D98, &status)
writing buffer argIndex=1 argPos=1 rowPtr
got type for arg 2, data, type=00001290
runKernel: arrayOrBuf ref 000000004AF021B0, oldAddr=0000000000000000, newAddr=0000002BEB3D33C0, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB3D33C0, contents: e0 7f d0 1f cf a0 8d c0 
data 2 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000005a0 bytes, address=0000002BEB3D33C0, &status)
writing buffer argIndex=2 argPos=2 data
got type for arg 3, x, type=00001290
runKernel: arrayOrBuf ref 000000004AF021B8, oldAddr=0000000000000000, newAddr=0000002BEB26BD30, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB26BD30, contents: 00 00 00 00 00 00 08 40 
x 3 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000000f0 bytes, address=0000002BEB26BD30, &status)
writing buffer argIndex=3 argPos=3 x
got type for arg 4, colIndices, type=00001288
runKernel: arrayOrBuf ref 000000004AF021C0, oldAddr=0000000000000000, newAddr=0000002BEB3D3A70, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB3D3A70, contents: 00 00 00 00 01 00 00 00 
colIndices 4 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000002d0 bytes, address=0000002BEB3D3A70, &status)
writing buffer argIndex=4 argPos=4 colIndices
got type for arg 5, mult, type=00001690
runKernel: arrayOrBuf ref 000000004AF021C8, oldAddr=0000000000000000, newAddr=0000002BEB26C830, ref.mem=0000000#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000077a10e91, pid=9628, tid=3776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 )
# Problematic frame:
# C  [ntdll.dll+0x50e91]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\NIAZ\Desktop\workspace\AparapiTest\hs_err_pid9628.log
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
000000000 isCopy=false
at memory addr 0000002BEB26C830, contents: 00 00 00 00 00 00 00 00 
mult 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, size=000000f0 bytes, address=0000002BEB26C830, &status)
writing buffer argIndex=5 argPos=5 mult



It's showing that the arrays double[] mult, double[] data, double[] x, int[] rowPtr and int[] colIndices are not local - not sure if this is causing the crash.
These arrays are passed by reference from the class that has the main method to the class that has the kernel. Do these arrays need to be @Local? I am using references to avoid extra copying.


Cheers
Niaz

Gary Frost

Nov 3, 2015, 1:47:15 PM
to aparapi-discuss
Can you post your code?

What size are the local arrays?  

Gary


Niaz Khan

Nov 3, 2015, 2:26:33 PM
to aparapi-discuss
Hi Gary,

In the code I am trying to read the CSR from matrix market format. The sizes are based on the rows and cols and nnz of the matrix. In the code snippet I am using a matrix with rows = 30, cols = 30 and nnz =180.




public class AparapiTest{

    public static void main(String args[]) {
   
                 double x[];
                 double mult[];
                 double data[];
                 int colIndices[];
                 int rowPtr[];

 int rows=30, cols=30, size=rows*cols;
 int totalNonZeros = 180;

 x = new double [cols];
 Arrays.fill(x, 3.0);
 SparseRCDoubleMatrix2D A = null;
     
          String mmFile = "C:\\Users\\NIAZ\\Desktop\\workspace\\Spmv\\pores_1.mtx";
     
 
 //CSR variables
 data = new double [totalNonZeros];
 colIndices = new int [totalNonZeros];
 rowPtr = new int [rows+1];
 
 mult = new double [rows];
 

 A = SpmvUtils.getMatrix(mmFile);

 data =  A.getValues();
 colIndices =  A.getColumnIndexes();
 rowPtr = A.getRowPointers();
 
     AparapiKernel kernel = new AparapiKernel(mult, data, x, rowPtr, colIndices);
     try{
         kernel.setExecutionMode(Kernel.EXECUTION_MODE.GPU);

         System.out.println("Execution mode=" + kernel.getExecutionMode());
         
         Range range = Range.create(rows*32, 32);
         kernel.execute(range);
  
     }catch(NullPointerException ne){
         ne.printStackTrace();
     }

     kernel.dispose();
     
    }     
}
       

class AparapiKernel extends Kernel {
    double[] mult;
    double[] data;
    double[] x;
    int[] rowPtr;
    int[] colIndices;
    public AparapiKernel(double[] mult, double[] data, double[] x, int[] rowPtr, int[] colIndices) {
        this.mult = mult;
        this.data = data;
        this.rowPtr = rowPtr;
        this.x = x;
        this.colIndices = colIndices;
    }

    
    @Local double partialSums[] = new double[128];
    public void run() {
    int dim = 180;
    int VECTOR_SIZE = 32;    
    int tid = getLocalId();
    int id = tid & (VECTOR_SIZE-1);
    // One row per warp
    int threadsPerBlock = getLocalSize() / VECTOR_SIZE;
    int row = (getGroupId() * threadsPerBlock) + (tid / VECTOR_SIZE);
    partialSums[tid] = 0;


    if (row < dim)
    {
    int vecStart = rowPtr[row];
    int vecEnd = rowPtr[row+1];
    double sum = 0;
    for (int j = vecStart + id; j < vecEnd; j += VECTOR_SIZE) {
    sum += (data[j] * x[colIndices[j]]);    
    }
    partialSums[tid] = sum;
    localBarrier();
    // Reduce partial sums
    if (id < 16) partialSums[tid] += partialSums[tid +16];
    localBarrier();
    if (id < 8) partialSums[tid] += partialSums[tid + 8];
    localBarrier();
    if (id < 4) partialSums[tid] += partialSums[tid + 4];
    localBarrier();
    if (id < 2) partialSums[tid] += partialSums[tid + 2];
    localBarrier();
    if (id < 1) partialSums[tid] += partialSums[tid + 1];
    localBarrier();
    // Write result
    if (id == 0)
    {
    mult[row] = partialSums[tid];
    }
    }
        
        
    }

}

 

Gary Frost

Nov 3, 2015, 3:58:22 PM
to aparapi-discuss
Thanks.  

So partialSums is the only local data. 

I don't think anything else should be local. 

In the log above you can see that Aparapi has requested a 1024 byte buffer for partialSums

got type for arg 0, partialSums, type=00000e90
ISLOCAL, clSetKernelArg(jniContext->kernel, 0, 1024, NULL);


This correlates to your 'new double[128]' (128 doubles x 8 bytes = 1024 bytes).

Your execution grid implies a local/group size of 32.   So theoretically tid will be 0..31

I wanted to check if your code was walking off the end (or beginning) of this local buffer.  It does not appear to do so. 

Presumably VECTOR_SIZE matches getLocalSize() and is 32.

You might need to add localBarrier() after partialSums[tid]=0; 

Do you expect 'row >= dim' in any lane/work-item?

If so you have an issue with barriers. 

All kernels must pass through the same number of barriers. 

So consider adding an 'else' to your 'if (row<dim)' which contains 6 calls to localBarrier()

Gary

Michael Zucchi

Nov 3, 2015, 4:12:58 PM
to aparapi...@googlegroups.com

My understanding is that the barriers must be the same barriers, not just the same number of barriers.  If you mentally break apart multi-work-item work into serial code you see that this must be so (each barrier boundary becomes an encompassing loop point with an implicit pair of barriers bracketing the whole kernel).

As originally suggested, all range checking in that routine outside of the for loop is entirely superfluous.

Gary Frost

Nov 3, 2015, 4:21:33 PM
to aparapi-discuss
Hey Michael

I think we are saying the same thing.  In this case we only have localBarriers.  So they are indeed the same barrier type. 

We do need to ensure that each work item passes over the same # of local barriers no matter which execution path it takes.

If of course 'row<dim' is always true then this is not the cause of the issue. 

It is my understanding that efficient OpenCL CPU implementations have to jump through hoops to make these barriers work... 

Gary
 

Gary Frost

Nov 3, 2015, 5:11:29 PM
to aparapi-discuss
Ah. 

Michael is 100% correct. My workaround with the 'else' won't work. You need to ensure that all work-items pass through the same code, so you need to avoid the conditional in your code.


Here is the OpenCL spec. 

All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.

If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.

If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

The barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations to local or global memory.

The flags argument specifies the memory address space and can be set to a combination of the following literal values.

CLK_LOCAL_MEM_FENCE - The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory.

CLK_GLOBAL_MEM_FENCE - The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data.



Niaz Khan

Nov 3, 2015, 8:45:44 PM
to aparapi-discuss
Michael and Gary,


I have been working on the idea of all work-items passing through the same code and avoiding the conditional.
The kernel is exactly the same as Michael has suggested.

    @Local double sums[] = new double[64]; 
    public void run() { 
   int i = getGroupId(); 
   int k = getLocalId(); 
   double sum = 0; 
   int j0 = rowPtr[i]; 
   int j1 = rowPtr[i+1]; 
   
   for (int j=k;j < (j1-j0);j+=32) { 
      sum += data[j+j0] * x[colIndices[j+j0]]; 
   } 

   sums[k] = sum; 
   localBarrier(); 
   //  ... insert code for parallel sum here ... 
   // sum code 
   for (int j=32;j>0;j>>=1) { 
     sum += sums[j+k]; 
     localBarrier(); 
     sums[j] = sum; 
     localBarrier(); 
   } 
   

   if (k == 0) { 
      mult[i] = sum; 
   } 

    }  

The results are incorrect in both JTP and GPU mode. Here is the JNI verbose output for GPU mode:

in setArgs arg 0 rowPtr type 00001288
in setArgs arg 0 rowPtr is *not* local
in setArgs arg 1 data type 00001290
in setArgs arg 1 data is *not* local
in setArgs arg 2 x type 00001290
in setArgs arg 2 x is *not* local
in setArgs arg 3 colIndices type 00001288
in setArgs arg 3 colIndices is *not* local
in setArgs arg 4 sums type 00000e90
in setArgs arg 4 sums is local
in setArgs arg 5 mult type 00001690
in setArgs arg 5 mult is *not* local
got type for rowPtr: 00001288
testing for Resync javaArray rowPtr: old=0000000000000000, new=000000004AC5DF88
Resync javaArray for rowPtr: 000000004AC5DF88  0000000000000000
NewWeakGlobalRef for rowPtr, set to 000000004AD49730
updateNonPrimitiveReferences, args[0].lengthInBytes=124
got type for data: 00001290
testing for Resync javaArray data: old=0000000000000000, new=000000004AC5DF90
Resync javaArray for data: 000000004AC5DF90  0000000000000000
NewWeakGlobalRef for data, set to 000000004AD49738
updateNonPrimitiveReferences, args[1].lengthInBytes=1440
got type for x: 00001290
testing for Resync javaArray x: old=0000000000000000, new=000000004AC5DF98
Resync javaArray for x: 000000004AC5DF98  0000000000000000
NewWeakGlobalRef for x, set to 000000004AD49740
updateNonPrimitiveReferences, args[2].lengthInBytes=240
got type for colIndices: 00001288
testing for Resync javaArray colIndices: old=0000000000000000, new=000000004AC5DFA0
Resync javaArray for colIndices: 000000004AC5DFA0  0000000000000000
NewWeakGlobalRef for colIndices, set to 000000004AD49748
updateNonPrimitiveReferences, args[3].lengthInBytes=720
got type for sums: 00000e90
testing for Resync javaArray sums: old=0000000000000000, new=000000004AC5DFA8
Resync javaArray for sums: 000000004AC5DFA8  0000000000000000
NewWeakGlobalRef for sums, set to 000000004AD49750
updateNonPrimitiveReferences, args[4].lengthInBytes=512
got type for mult: 00001690
testing for Resync javaArray mult: old=0000000000000000, new=000000004AC5DFB0
Resync javaArray for mult: 000000004AC5DFB0  0000000000000000
NewWeakGlobalRef for mult, set to 000000004AD49758
updateNonPrimitiveReferences, args[5].lengthInBytes=240
back from updateNonPrimitiveReferences
got type for arg 0, rowPtr, type=00001288
runKernel: arrayOrBuf ref 000000004AD49730, oldAddr=0000000000000000, newAddr=0000002BEB3D3CF8, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB3D3CF8, contents: 00 00 00 00 04 00 00 00 
rowPtr 0 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=0000007c bytes, address=0000002BEB3D3CF8, &status)
writing buffer argIndex=0 argPos=0 rowPtr
got type for arg 1, data, type=00001290
runKernel: arrayOrBuf ref 000000004AD49738, oldAddr=0000000000000000, newAddr=0000002BEB3D3320, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB3D3320, contents: e0 7f d0 1f cf a0 8d c0 
data 1 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000005a0 bytes, address=0000002BEB3D3320, &status)
writing buffer argIndex=1 argPos=1 data
got type for arg 2, x, type=00001290
runKernel: arrayOrBuf ref 000000004AD49740, oldAddr=0000000000000000, newAddr=0000002BEB26BC90, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB26BC90, contents: 00 00 00 00 00 00 08 40 
x 2 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000000f0 bytes, address=0000002BEB26BC90, &status)
writing buffer argIndex=2 argPos=2 x
got type for arg 3, colIndices, type=00001288
runKernel: arrayOrBuf ref 000000004AD49748, oldAddr=0000000000000000, newAddr=0000002BEB3D39D0, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB3D39D0, contents: 00 00 00 00 01 00 00 00 
colIndices 3 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000002d0 bytes, address=0000002BEB3D39D0, &status)
writing buffer argIndex=3 argPos=3 colIndices
got type for arg 4, sums, type=00000e90
ISLOCAL, clSetKernelArg(jniContext->kernel, 4, 512, NULL);
got type for arg 5, mult, type=00001690
runKernel: arrayOrBuf ref 000000004AD49758, oldAddr=0000000000000000, newAddr=0000002BEB26C790, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB26C#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000077a10e91, pid=7588, tid=13308
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 )
# Problematic frame:
# C  [ntdll.dll+0x50e91]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\NIAZ\Desktop\workspace\AparapiTest\hs_err_pid7588.log
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
790, contents: 00 00 00 00 00 00 00 00 
mult 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, size=000000f0 bytes, address=0000002BEB26C790, &status)
writing buffer argIndex=5 argPos=5 mult



Not sure why the error is thrown and the results are incorrect, if the barriers have been used properly this time in the 2nd for loop. Also, I think the sums are computed correctly in the first for loop?


Cheers 
Niaz

Michael Zucchi

Nov 4, 2015, 6:36:17 PM
to aparapi...@googlegroups.com
On 04/11/15 07:51, Gary Frost wrote:
>
> It is my understanding that efficient OpenCL CPU implementations have
> to jump through hoops to make these barriers work...

You followed up on the other details but on this specific point, this is
how i see an opencl implementation working.

So say you have:

lws = 64
lx = get_local_id(0);

local float foo[64];

foo[lx] = read_data();
local_barrier();

if (!(lx & 1))
foo[lx] += foo[lx+1];
local_barrier();
// more stuff follows

On a gpu this will basically compile as shown, just run (effectively)
64-wide.

On a cpu this can be solved a couple of ways whilst still being within
spec, such as:
1. Just use os threads, cyclicbarriers, and 'do the same'.
2. run the whole work-group as a single function on a single thread, or
3. the same as 2 but do some simd stuff as well.

1 is going to be pretty slow and also impact the benefit of multiple
concurrent workgroups. 2 is a pretty simple (conceptually at least) and
decent solution and seems to be used at least by the amd driver. 3. adds
a lot of compiler complexity and would probably only be a significant
benefit in limited circumstances, although the distinction between
gpu/cpu is lessening.

If all work items are in a single function then the local 'threads' is
just a loop iteration. If no barriers are used you just bracket the
whole function with a single loop which steps over the work size. But
if barriers are used you break the whole function into multiple loops -
one for each barrier bracket.

lws = 64

float foo[64];

for (int lx=0;lx<get_local_size(0);lx++) {
foo[lx] = read_data();
} // for loop is 'bound' for local_barrier();

for (int lx=0;lx<get_local_size(0);lx++) {
if (!(lx & 1))
foo[lx] += foo[lx+1];
} // for loop is 'bound' for local_barrier();

Whilst this isn't optimal single-thread code it isn't really that bad
considering the code was specifically optimised for such a radically
different architecture. Whereas going the other way would be abysmal.

I find using this as an abstraction of the processing model simplifies
understanding what will and wont work correctly, or what will be
inefficient. Because you can just think serially rather than having to
worry about thinking concurrently.

I presume JTP just uses threads and cyclicbarriers? Whilst the above is
conceptually simple, i've missed out all the mess of tracking auto
variables which complicates a compiler somewhat since they all need to
be expanded to the size of the workgroup.
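i.e. something very roughly like this per work-group, with a CyclicBarrier standing in for localBarrier() - just a sketch of the idea, not what aparapi actually does internally:

    final int localSize = 4;
    final float[] foo = new float[localSize];
    final java.util.concurrent.CyclicBarrier localBarrier =
            new java.util.concurrent.CyclicBarrier(localSize);

    for (int lx = 0; lx < localSize; lx++) {
        final int id = lx;
        new Thread(() -> {
            try {
                foo[id] = id;              // first barrier bracket
                localBarrier.await();
                if ((id & 1) == 0)         // second bracket
                    foo[id] += foo[id + 1];
                localBarrier.await();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).start();
    }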

(it is also probably not worth the effort: it's easier just to write a
cpu-optimised alternative of every kernel - which will also run faster).

!Z

Michael Zucchi

Nov 5, 2015, 7:50:17 PM
to aparapi...@googlegroups.com

Bugs I guess.



On 04/11/15 12:15, Niaz Khan wrote:
Michael and Gary,


I have been woring on the idea of all workitems passing through the same code and avoiding the conditional.
The kernel is exactly the same as Michael has suggested.

    @Local double sums[] = new double[64]; 
    public void run() { 
   int i = getGroupId(); 
   int k = getLocalId(); 
   double sum = 0; 
   int j0 = rowPtr[i]; 
   int j1 = rowPtr[i+1]; 
   
   for (int j=k;j < (j1-j0);j+=32) { 
      sum += data[j+j0] * x[colIndices[j+j0]]; 
   } 

   sums[k] = sum;

This also needs a "sums[k+32] = 0" to initialise the dummy high half of the array.


   localBarrier(); 
   //  ... insert code for parallel sum here ... 
   // sum code 
   for (int j=32;j>0;j>>=1) {
This loop was for lws=64,1,1, so the start value should be 16 if lws=32,1,1.



     sum += sums[j+k]; 
     localBarrier(); 
     sums[j] = sum; 
     localBarrier(); 
   }

It should also be "sums[k] = sum".  If i typo'd that i apologise but it was just an example anyway and i usually just find where i've done it before and paste it in.
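Pulling those corrections together, the kernel body would be roughly this (still just a sketch, untested, for lws=32,1,1):

    @Local double sums[] = new double[64];   // 32 live slots + 32 dummy zeros

    public void run() {
        int i = getGroupId();
        int k = getLocalId();
        double sum = 0;
        int j0 = rowPtr[i];
        int j1 = rowPtr[i + 1];

        for (int j = k; j < (j1 - j0); j += 32) {
            sum += data[j + j0] * x[colIndices[j + j0]];
        }

        sums[k] = sum;
        sums[k + 32] = 0;        // initialise the dummy high half
        localBarrier();

        for (int j = 16; j > 0; j >>= 1) {   // 16 because the local size is 32
            sum += sums[j + k];
            localBarrier();
            sums[k] = sum;       // sums[k], not sums[j]
            localBarrier();
        }

        if (k == 0) {
            mult[i] = sum;
        }
    }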

Niaz Khan

Nov 5, 2015, 8:28:24 PM
to aparapi-discuss
Thanks Michael,

The code works in JTP mode using the changes you have suggested. For some reason it is still throwing the same EXCEPTION_ACCESS_VIOLATION error. The only difference is that I am using double instead of float; do I have to somehow enable fp64 as is done in OpenCL, or does Aparapi take care of that?



On Friday, November 6, 2015 at 12:50:17 AM UTC, NotZed wrote:

Bugs I guess.


On 04/11/15 12:15, Niaz Khan wrote:
Michael and Gary,


I have been woring on the idea of all workitems passing through the same code and avoiding the conditional.
The kernel is exactly the same as Michael has suggested.

    @Local double sums[] = new double[64]; 
    public void run() { 
   int i = getGroupId(); 
   int k = getLocalId(); 
   double sum = 0; 
   int j0 = rowPtr[i]; 
   int j1 = rowPtr[i+1]; 
   
   for (int j=k;j < (j1-j0);j+=32) { 
      sum += data[j+j0] * x[colIndices[j+j0]]; 
   } 

   sums[k] = sum;

This also needs a "sums[k+32] = 0" to initialise the dummy high half of the array.
For some reason, in JTP mode, when I try to initialise the dummy high half of the array in this way it produces incorrect results, while sums[k] = sum; on its own works.

Gary Frost

Nov 5, 2015, 8:30:48 PM
to aparapi-discuss
I presume your GPU and OpenCL version can process doubles.

Is this an OpenCL 1.1 or 1.2 runtime/device?

If you can build cltest (inside com.amd.aparapi.jni) using 'ant cltest' and run it, it will tell you whether your card/runtime supports doubles. 

Gary 



Niaz Khan

Nov 5, 2015, 8:47:18 PM
to aparapi-discuss
Gary,


That's right, I think it can process doubles

The device is Nvidia GTX 485M and the OpenCL runtime is 1.1

I quickly tested some OpenCL code with doubles in the kernel; for that I had to use
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

So I was wondering whether I have to do that sort of enabling in Aparapi too, or whether Aparapi takes care of it internally?

Niaz

Gary Frost

Nov 5, 2015, 9:01:19 PM
to aparapi-discuss
Aparapi examines the capabilities and *should* detect this one to add the pragma. 

You can dump the generated OpenCL using this on the java launch line. 

-Dcom.amd.aparapi.enableShowGeneratedOpenCL=true

Hopefully we are detecting the capability and generating the pragma. 

Gary



Niaz Khan

Nov 5, 2015, 9:11:35 PM
to aparapi-discuss

That's exactly right: in the generated OpenCL the pragma is there.

Just wondering what else could be going wrong such that the error is still thrown in GPU mode? I will also quickly check the generated OpenCL, but I am pretty sure that will work.

Niaz

Michael Zucchi

Nov 6, 2015, 4:29:49 AM
to aparapi...@googlegroups.com

Can you just try testing with floats to see if that goes away?  At least it might isolate the problem outside of the code and execution process.

It's probably not related to the kernel as it looks like a host-side crash, but perhaps attach the generated OpenCL C?

Could it be an alignment issue from the critical-access array?  I think that use of USE_HOST_PTR is valid but it's not something i've used, and not all implementations are equal.

I haven't ever used it but perhaps you could try adding this to the java command line.

"-XX:ObjectAlignmentInBytes=32"

(or some other power of 2, >8)

I presume it affects arrays in the obvious way, but it might not.

Niaz Khan

Nov 6, 2015, 8:04:36 AM
to aparapi-discuss
Michael,


I have tried with floats and it's still crashing.
Here is the generated OpenCL and the JNI verbose output.

typedef struct This_s{
   __global int *rowPtr;
   __global float *data;
   __global float *x;
   __global int *colIndices;
   __global float *sums;
   __global float *mult;
   int passid;
}This;
int get_pass_id(This *this){
   return this->passid;
}
__kernel void run(
   __global int *rowPtr, 
   __global float *data, 
   __global float *x, 
   __global int *colIndices, 
   __global float *sums, 
   __global float *mult, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->rowPtr = rowPtr;
   this->data = data;
   this->x = x;
   this->colIndices = colIndices;
   this->sums = sums;
   this->mult = mult;
   this->passid = passid;
   {
      int i = get_group_id(0);
      int k = get_local_id(0);
      float sum = 0.0f;
      int j0 = this->rowPtr[i];
      int j1 = this->rowPtr[(i + 1)];
      for (int j = k; j<(j1 - j0); j+=32){
         sum = sum + (this->data[(j + j0)] * this->x[this->colIndices[(j + j0)]]);
      }
      this->sums[k]  = sum;
      for (int j = 16; j>0; j = j >> 1){
         sum = sum + this->sums[(j + k)];
         barrier(CLK_LOCAL_MEM_FENCE);
         this->sums[k]  = sum;
         barrier(CLK_LOCAL_MEM_FENCE);
      }
      if (k==0){
         this->mult[i]  = sum;
      }
      return;
   }
}

in setArgs arg 0 rowPtr type 00001288
in setArgs arg 0 rowPtr is *not* local
in setArgs arg 1 data type 00001284
in setArgs arg 1 data is *not* local
in setArgs arg 2 x type 00001284
in setArgs arg 2 x is *not* local
in setArgs arg 3 colIndices type 00001288
in setArgs arg 3 colIndices is *not* local
in setArgs arg 4 sums type 00000e84
in setArgs arg 4 sums is local
in setArgs arg 5 mult type 00001684
in setArgs arg 5 mult is *not* local
got type for rowPtr: 00001288
testing for Resync javaArray rowPtr: old=0000000000000000, new=000000004ACE36C8
Resync javaArray for rowPtr: 000000004ACE36C8  0000000000000000
NewWeakGlobalRef for rowPtr, set to 000000004ADCF130
updateNonPrimitiveReferences, args[0].lengthInBytes=124
got type for data: 00001284
testing for Resync javaArray data: old=0000000000000000, new=000000004ACE36D0
Resync javaArray for data: 000000004ACE36D0  0000000000000000
NewWeakGlobalRef for data, set to 000000004ADCF138
updateNonPrimitiveReferences, args[1].lengthInBytes=720
got type for x: 00001284
testing for Resync javaArray x: old=0000000000000000, new=000000004ACE36D8
Resync javaArray for x: 000000004ACE36D8  0000000000000000
NewWeakGlobalRef for x, set to 000000004ADCF140
updateNonPrimitiveReferences, args[2].lengthInBytes=120
got type for colIndices: 00001288
testing for Resync javaArray colIndices: old=0000000000000000, new=000000004ACE36E0
Resync javaArray for colIndices: 000000004ACE36E0  0000000000000000
NewWeakGlobalRef for colIndices, set to 000000004ADCF148
updateNonPrimitiveReferences, args[3].lengthInBytes=720
got type for sums: 00000e84
testing for Resync javaArray sums: old=0000000000000000, new=000000004ACE36E8
Resync javaArray for sums: 000000004ACE36E8  0000000000000000
NewWeakGlobalRef for sums, set to 000000004ADCF150
updateNonPrimitiveReferences, args[4].lengthInBytes=256
got type for mult: 00001684
testing for Resync javaArray mult: old=0000000000000000, new=000000004ACE36F0
Resync javaArray for mult: 000000004ACE36F0  0000000000000000
NewWeakGlobalRef for mult, set to 000000004ADCF158
updateNonPrimitiveReferences, args[5].lengthInBytes=120
back from updateNonPrimitiveReferences
got type for arg 0, rowPtr, type=00001288
runKernel: arrayOrBuf ref 000000004ADCF130, oldAddr=0000000000000000, newAddr=0000002BEB7F0B88, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB7F0B88, contents: 00 00 00 00 04 00 00 00 
rowPtr 0 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=0000007c bytes, address=0000002BEB7F0B88, &status)
writing buffer argIndex=0 argPos=0 rowPtr
got type for arg 1, data, type=00001284
runKernel: arrayOrBuf ref 000000004ADCF138, oldAddr=0000000000000000, newAddr=0000002BEB7F0480, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB7F0480, contents: 79 06 6d c4 63 6b b6 46 
data 1 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000002d0 bytes, address=0000002BEB7F0480, &status)
writing buffer argIndex=1 argPos=1 data
got type for arg 2, x, type=00001284
runKernel: arrayOrBuf ref 000000004ADCF140, oldAddr=0000000000000000, newAddr=0000002BEB66BD28, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB66BD28, contents: 00 00 40 40 00 00 40 40 
x 2 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=00000078 bytes, address=0000002BEB66BD28, &status)
writing buffer argIndex=2 argPos=2 x
got type for arg 3, colIndices, type=00001288
runKernel: arrayOrBuf ref 000000004ADCF148, oldAddr=0000000000000000, newAddr=0000002BEB7F0860, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB7F0860, contents: 00 00 00 00 01 00 00 00 
colIndices 3 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000002d0 bytes, address=0000002BEB7F0860, &status)
writing buffer argIndex=3 argPos=3 colIndices
got type for arg 4, sums, type=00000e84
ISLOCAL, clSetKernelArg(jniContext->kernel, 4, 256, NULL);
got type for arg 5, mult, type=00001684
runKernel: arrayOrBuf ref 000000004ADCF158, oldAddr=0000000000000000, newAddr=0000002BEB66C4E0, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB66C4#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00000000773f0e91, pid=8724, tid=8632 

E0, contents: 00 00 00 00 00 00 00 00 
mult 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, size=00000078 bytes, address=0000002BEB66C4E0, &status)
writing buffer argIndex=5 argPos=5 mult


The output of using double is:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct This_s{
   __global int *rowPtr;
   __global double *data;
   __global double *x;
   __global int *colIndices;
   __global double *sums;
   __global double *mult;
   int passid;
}This;
int get_pass_id(This *this){
   return this->passid;
}
__kernel void run(
   __global int *rowPtr, 
   __global double *data, 
   __global double *x, 
   __global int *colIndices, 
   __global double *sums, 
   __global double *mult, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->rowPtr = rowPtr;
   this->data = data;
   this->x = x;
   this->colIndices = colIndices;
   this->sums = sums;
   this->mult = mult;
   this->passid = passid;
   {
      int i = get_group_id(0);
      int k = get_local_id(0);
      double sum = 0.0;
      int j0 = this->rowPtr[i];
      int j1 = this->rowPtr[(i + 1)];
      for (int j = k; j<(j1 - j0); j+=32){
         sum = sum + (this->data[(j + j0)] * this->x[this->colIndices[(j + j0)]]);
      }
      this->sums[k]  = sum;
      for (int j = 16; j>0; j = j >> 1){
         sum = sum + this->sums[(j + k)];
         barrier(CLK_LOCAL_MEM_FENCE);
         this->sums[k]  = sum;
         barrier(CLK_LOCAL_MEM_FENCE);
      }
      if (k==0){
         this->mult[i]  = sum;
      }
      return;
   }
}

in setArgs arg 0 rowPtr type 00001288
in setArgs arg 0 rowPtr is *not* local
in setArgs arg 1 data type 00001290
in setArgs arg 1 data is *not* local
in setArgs arg 2 x type 00001290
in setArgs arg 2 x is *not* local
in setArgs arg 3 colIndices type 00001288
in setArgs arg 3 colIndices is *not* local
in setArgs arg 4 sums type 00000e90
in setArgs arg 4 sums is local
in setArgs arg 5 mult type 00001690
in setArgs arg 5 mult is *not* local
got type for rowPtr: 00001288
testing for Resync javaArray rowPtr: old=0000000000000000, new=000000004ACE1378
Resync javaArray for rowPtr: 000000004ACE1378  0000000000000000
NewWeakGlobalRef for rowPtr, set to 000000004AE142A0
updateNonPrimitiveReferences, args[0].lengthInBytes=124
got type for data: 00001290
testing for Resync javaArray data: old=0000000000000000, new=000000004ACE1380
Resync javaArray for data: 000000004ACE1380  0000000000000000
NewWeakGlobalRef for data, set to 000000004AE142A8
updateNonPrimitiveReferences, args[1].lengthInBytes=1440
got type for x: 00001290
testing for Resync javaArray x: old=0000000000000000, new=000000004ACE1388
Resync javaArray for x: 000000004ACE1388  0000000000000000
NewWeakGlobalRef for x, set to 000000004AE142B0
updateNonPrimitiveReferences, args[2].lengthInBytes=240
got type for colIndices: 00001288
testing for Resync javaArray colIndices: old=0000000000000000, new=000000004ACE1390
Resync javaArray for colIndices: 000000004ACE1390  0000000000000000
NewWeakGlobalRef for colIndices, set to 000000004AE142B8
updateNonPrimitiveReferences, args[3].lengthInBytes=720
got type for sums: 00000e90
testing for Resync javaArray sums: old=0000000000000000, new=000000004ACE1398
Resync javaArray for sums: 000000004ACE1398  0000000000000000
NewWeakGlobalRef for sums, set to 000000004AE142C0
updateNonPrimitiveReferences, args[4].lengthInBytes=512
got type for mult: 00001690
testing for Resync javaArray mult: old=0000000000000000, new=000000004ACE13A0
Resync javaArray for mult: 000000004ACE13A0  0000000000000000
NewWeakGlobalRef for mult, set to 000000004AE142C8
updateNonPrimitiveReferences, args[5].lengthInBytes=240
back from updateNonPrimitiveReferences
got type for arg 0, rowPtr, type=00001288
runKernel: arrayOrBuf ref 000000004AE142A0, oldAddr=0000000000000000, newAddr=0000002BEB7D4090, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB7D4090, contents: 00 00 00 00 04 00 00 00 
rowPtr 0 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=0000007c bytes, address=0000002BEB7D4090, &status)
writing buffer argIndex=0 argPos=0 rowPtr
got type for arg 1, data, type=00001290
runKernel: arrayOrBuf ref 000000004AE142A8, oldAddr=0000000000000000, newAddr=0000002BEB7D36B8, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB7D36B8, contents: e0 7f d0 1f cf a0 8d c0 
data 1 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000005a0 bytes, address=0000002BEB7D36B8, &status)
writing buffer argIndex=1 argPos=1 data
got type for arg 2, x, type=00001290
runKernel: arrayOrBuf ref 000000004AE142B0, oldAddr=0000000000000000, newAddr=0000002BEB66BC90, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB66BC90, contents: 00 00 00 00 00 00 08 40 
x 2 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000000f0 bytes, address=0000002BEB66BC90, &status)
writing buffer argIndex=2 argPos=2 x
got type for arg 3, colIndices, type=00001288
runKernel: arrayOrBuf ref 000000004AE142B8, oldAddr=0000000000000000, newAddr=0000002BEB7D3D68, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB7D3D68, contents: 00 00 00 00 01 00 00 00 
colIndices 3 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=000002d0 bytes, address=0000002BEB7D3D68, &status)
writing buffer argIndex=3 argPos=3 colIndices
got type for arg 4, sums, type=00000e90
ISLOCAL, clSetKernelArg(jniContext->kernel, 4, 512, NULL);
got type for arg 5, mult, type=00001690
runKernel: arrayOrBuf ref 000000004AE142C8, oldAddr=0000000000000000, newAddr=0000002BEB66C790, ref.mem=0000000000000000 isCopy=false
at memory addr 0000002BEB66C#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00000000773f0e91, pid=8728, tid=5380
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 )
# Problematic frame:
# C  [ntdll.dll+0x50e91]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\NIAZ\Desktop\workspace\AparapiTest\hs_err_pid8728.log
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
790, contents: 00 00 00 00 00 00 00 00 
mult 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, size=000000f0 bytes, address=0000002BEB66C790, &status)
writing buffer argIndex=5 argPos=5 mult



I have also tried using "-XX:ObjectAlignmentInBytes=32" and other powers of 2 such as 8, 16 and 64, but I am still getting the crash.

Just wondering if you have tried the code and it's working for you, and only I am getting the crash?

Thanks
Niaz

Michael Zucchi

unread,
Nov 6, 2015, 9:50:47 AM11/6/15
to aparapi...@googlegroups.com
On 06/11/15 23:34, Niaz Khan wrote:
Michael,


I have tried with floats and it's still crashing.
Here is the generated OpenCL and JNI verbose.

typedef struct This_s{
   __global int *rowPtr;
   __global float *data;
   __global float *x;
   __global int *colIndices;
   __global float *sums;
   __global float *mult;
   int passid;
}This;
int get_pass_id(This *this){
   return this->passid;
}
__kernel void run(
   __global int *rowPtr, 
   __global float *data, 
   __global float *x, 
   __global int *colIndices, 
   __global float *sums,

Oh, this is definitely wrong, sums should be 'local float *sums' in both places.

Aparapi knows it's local as it sets this:


got type for arg 4, sums, type=00000e84
ISLOCAL, clSetKernelArg(jniContext->kernel, 4, 256, NULL);

I also think calling clSetKernelArg() like that should fail, but maybe the jni code isn't checking the return value, or the driver is non-conformant.  If setting the arg is failing then enqueueing the kernel should fail.  Not that it matters I guess.

It must be a pretty simple (aparapi) compiler bug outputting the wrong memory type.



   __global float *mult, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->rowPtr = rowPtr;
   this->data = data;
   this->x = x;
   this->colIndices = colIndices;
   this->sums = sums;
   this->mult = mult;
   this->passid = passid;
   {
      int i = get_group_id(0);
      int k = get_local_id(0);
      float sum = 0.0f;
      int j0 = this->rowPtr[i];
      int j1 = this->rowPtr[(i + 1)];
      for (int j = k; j<(j1 - j0); j+=32){
         sum = sum + (this->data[(j + j0)] * this->x[this->colIndices[(j + j0)]]);
      }
      this->sums[k]  = sum;

<- "sums[k+32]=0;" (it really is important with the summation loop below, for portable code)

<- there's a necessary barrier missing here for portable code.



      for (int j = 16; j>0; j = j >> 1){
         sum = sum + this->sums[(j + k)];
         barrier(CLK_LOCAL_MEM_FENCE);
         this->sums[k]  = sum;
         barrier(CLK_LOCAL_MEM_FENCE);
      }
      if (k==0){
         this->mult[i]  = sum;
      }
      return;
   }
}


Wow pretty nice work on getting the compiler to output fair looking code though.




Just wondering if you have tried the code and it's working for you, and only I am getting the crash?



Sorry I don't have all the stuff setup properly atm on this box.

Niaz Khan

unread,
Nov 6, 2015, 10:53:14 AM11/6/15
to aparapi-discuss

   Is there any other way to set up the kernel from within Aparapi?


   __global float *mult, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->rowPtr = rowPtr;
   this->data = data;
   this->x = x;
   this->colIndices = colIndices;
   this->sums = sums;
   this->mult = mult;
   this->passid = passid;
   {
      int i = get_group_id(0);
      int k = get_local_id(0);
      float sum = 0.0f;
      int j0 = this->rowPtr[i];
      int j1 = this->rowPtr[(i + 1)];
      for (int j = k; j<(j1 - j0); j+=32){
         sum = sum + (this->data[(j + j0)] * this->x[this->colIndices[(j + j0)]]);
      }
      this->sums[k]  = sum;

<- "sums[k+32]=0;" (it really is important with the summation loop below, for portable code)

<- there's a necessary barrier missing here for portable code.

 Yea, exactly, the important barrier was missing. Even after adding the barrier and using k+32, the results in JTP mode are incorrect, which is strange.


      for (int j = 16; j>0; j = j >> 1){
         sum = sum + this->sums[(j + k)];
         barrier(CLK_LOCAL_MEM_FENCE);
         this->sums[k]  = sum;
         barrier(CLK_LOCAL_MEM_FENCE);
      }
      if (k==0){
         this->mult[i]  = sum;
      }
      return;
   }
}


Wow pretty nice work on getting the compiler to output fair looking code though.

Yea the compiler output is awesome
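
For reference, here is a minimal Java-side sketch of the kernel with the barrier and the k+32 padding in place. This is an illustration only, not the exact code from this thread: the 64-element @Local buffer, the float types and the one-32-wide-group-per-row mapping are assumptions.

    import com.amd.aparapi.Kernel;

    public class SpmvVectorKernel extends Kernel {
        final int[] rowPtr, colIndices;
        final float[] data, x, mult;
        @Local final float[] sums = new float[64]; // 32 lanes plus 32 padding entries

        public SpmvVectorKernel(int[] rowPtr, float[] data, float[] x, int[] colIndices, float[] mult) {
            this.rowPtr = rowPtr;
            this.data = data;
            this.x = x;
            this.colIndices = colIndices;
            this.mult = mult;
        }

        @Override public void run() {
            int i = getGroupId(0);                // one work-group per row
            int k = getLocalId(0);                // lane 0..31
            float sum = 0f;
            int j0 = rowPtr[i];
            int j1 = rowPtr[i + 1];
            for (int j = k; j < (j1 - j0); j += 32) {
                sum += data[j + j0] * x[colIndices[j + j0]];
            }
            sums[k] = sum;
            sums[k + 32] = 0f;                    // zero the padding lanes the reduction will read
            localBarrier();                       // make every lane's partial sum visible
            for (int j = 16; j > 0; j = j >> 1) {
                sum += sums[j + k];
                localBarrier();
                sums[k] = sum;
                localBarrier();
            }
            if (k == 0) {
                mult[i] = sum;                    // lane 0 holds the row total
            }
        }
    }

On the host side this would be launched with something like kernel.execute(Range.create(rows * 32, 32)).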

Michael Zucchi

unread,
Nov 7, 2015, 1:35:11 AM11/7/15
to aparapi...@googlegroups.com

Hi,

I didn't have anything better to do so I had a quick look at this. Well
I could've gone to the beach but it's a bit late, the gulf waters are
still a bit cold, and tomorrow is going to be warmer.

I formed a skeleton kernel from your last paste, and I have no idea if
it produces the correct result, but the OpenCL output is correctly typed.

__kernel void run(
__global int *rowPtr,
__global double *data,
__global double *x,
__global int *colIndices,
__local double *sums,
__global double *mult,
int passid
)

I used the latest version in git
(https://github.com/aparapi/aparapi.git), and from your original message
it seems you never specified which one you're using. I've attached what
I used, the output from 'make run', and the makefile I wrote to
build/run it since it's trivial even if it's no use to you.

I'm using java 8 to compile/run it.

You should always include the actual code you're using - don't edit it
or cut out a bit of it. You might miss something important or just make
a mistake. Use attachments if your mail program has them - most mail
editors are junk and might make the pasted code uncompilable or just
unreadable (people are less likely to help if it looks like a mess).
For a library like aparapi you must include the version you're using or
where you got it, and if you're not using the absolutely latest version
you need to test that first.

It's also somewhat unreasonable to expect some stranger on the internet
to turn your partial fragment into a fully running application and run
it against completely unknown parameters.

I'm not trying to be rude to you here, just point out some basic tips
for getting the most out of mailing lists of free software projects.

Of course, if you are already using the latest version then the only
option left is to post a standalone example.

!Z
Makefile
Test.java
out.txt

Ryan LaMothe

unread,
Nov 7, 2015, 1:24:17 PM11/7/15
to aparapi...@googlegroups.com
I'll go even further and simply say this: post your code to Github. It's free and makes it a whole lot easier for everyone to help you out.

Sent from my iPhone --- Please excuse any typos or autocorrect mistakes

> <Makefile>
> <Test.java>
> <out.txt>

Niaz Khan

unread,
Nov 7, 2015, 2:36:36 PM11/7/15
to aparapi-discuss
Hi Michael,


I haven't tried it on the latest version in git.

Sorry I haven't mentioned which version I am using at the start. The version I am using is Aparapi-1.0.0:


I am trying to build the latest version in git to see if the problem is because I am using an old version, which might be the case since on your setup it output the correct generated OpenCL.


During the build I am getting an error:
                                                   fatal error LNK1181: cannot open input file 'OpenCL.lib'

In the build.xml, I have specified the OpenCL that comes with the CUDA SDK, as I am using an Nvidia device.

      <available property="win32.amd.app.sdk.exists" file="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5/lib/Win32" type="dir"/>
      <condition property="amd.app.sdk.dir" value="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5">
         <and>
            <os family="windows" />
            <isset property="win32.amd.app.sdk.exists" />
            <not>
               <isset property="amd.app.sdk.dir" />
            </not>
         </and>
      </condition>

      <available property="win64.amd.app.sdk.exists" file="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5/lib/x64" type="dir"/>
      <condition property="amd.app.sdk.dir" value="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5">
         <and>
            <os family="windows" />
            <isset property="win64.amd.app.sdk.exists" />
            <not>
               <isset property="win32.amd.app.sdk.exists" />
            </not>
         </and>
      </condition>


Do I have to specify the OpenCL location somewhere else as well for the build? Or do I need to have an OpenCL SDK installed for the build?

I am using Windows 7 64-bit with an Nvidia GTX 485M.


Cheers 
Niaz

Niaz Khan

unread,
Nov 7, 2015, 2:39:42 PM11/7/15
to aparapi-discuss
Hi Ryan,


That's a nice idea. I will do that once I have tried the latest build, as Michael suggested, to see whether the crash in GPU mode was because I was using an old build.

 Cheers
Niaz

Gary Frost

unread,
Nov 7, 2015, 2:40:21 PM11/7/15
to aparapi-discuss
Seems to work for me

I pulled the latest code from  https://github.com/aparapi/aparapi

built it (needed to make one change to build on Java 7; I let Ryan know separately) 

Unzipped your code

Created a subdir called aparapi and copied the built aparapi.jar and libaparapi_x86_64.so into it

Added an ant build.xml so we can run

built and run using 

$ cd AparapiSpmv/
$ ant 
Buildfile: /home/gfrost/AparapiSpmv/build.xml

clean:

build:
    [mkdir] Created dir: /home/gfrost/AparapiSpmv/classes
    [javac] Compiling 1 source file to /home/gfrost/AparapiSpmv/classes
    [javac] Note: /home/gfrost/AparapiSpmv/src/AparapiTest.java uses or overrides a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
      [jar] Building jar: /home/gfrost/AparapiSpmv/spmv.jar

BUILD SUCCESSFUL
Total time: 1 second

I added an ant target to run it

$ ant run-gpu 
Buildfile: /home/gfrost/AparapiSpmv/build.xml

run-gpu:
     [java] Execution mode=GPU
     [java] #pragma OPENCL EXTENSION cl_khr_fp64 : enable
     [java] 
     [java] typedef struct This_s{
     [java]    __global int *rowPtr;
     [java]    __global double *data;
     [java]    __global double *x;
     [java]    __global int *colIndices;
     [java]    __local double *sums;
     [java]    __global double *mult;
     [java]    int passid;
     [java] }This;
     [java] int get_pass_id(This *this){
     [java]    return this->passid;
     [java] }
     [java] __kernel void run(
     [java]    __global int *rowPtr, 
     [java]    __global double *data, 
     [java]    __global double *x, 
     [java]    __global int *colIndices, 
     [java]    __local double *sums, 
     [java]    __global double *mult, 
     [java]    int passid
     [java] ){
     [java]    This thisStruct;
     [java]    This* this=&thisStruct;
     [java]    this->rowPtr = rowPtr;
     [java]    this->data = data;
     [java]    this->x = x;
     [java]    this->colIndices = colIndices;
     [java]    this->sums = sums;
     [java]    this->mult = mult;
     [java]    this->passid = passid;
     [java]    {
     [java]       int i = get_group_id(0);
     [java]       int k = get_local_id(0);
     [java]       double sum = 0.0;
     [java]       int j0 = this->rowPtr[i];
     [java]       int j1 = this->rowPtr[(i + 1)];
     [java]       for (int j = k; j<(j1 - j0); j+=32){
     [java]          sum = sum + (this->data[(j + j0)] * this->x[this->colIndices[(j + j0)]]);
     [java]       }
     [java]       this->sums[k]  = sum;
     [java]       barrier(CLK_LOCAL_MEM_FENCE);
     [java]       for (int j = 16; j>0; j = j >> 1){
     [java]          sum = sum + this->sums[(j + k)];
     [java]          barrier(CLK_LOCAL_MEM_FENCE);
     [java]          this->sums[k]  = sum;
     [java]          barrier(CLK_LOCAL_MEM_FENCE);
     [java]       }
     [java]       if (k==0){
     [java]          this->mult[i]  = sum;
     [java]       }
     [java]       return;
     [java]    }
     [java] }
     [java] 
     [java] The SPMV is correct

BUILD SUCCESSFUL
Total time: 1 second

Seemed to work ok. 


My ant build.xml file is attached 



--
build.xml

Niaz Khan

unread,
Nov 7, 2015, 3:17:02 PM11/7/15
to aparapi-discuss
Hi Gary,


It seems the issue was that I was using an old build of Aparapi, as the exact same code works when run on the latest build.


I am trying the latest build now, but I am getting a linking error to OpenCL during the build. As you also have an Nvidia device, where should I specify the location of the OpenCL libraries? Is it only in the build.xml (com.amd.aparapi.jni) or somewhere else?
I have already posted how I am specifying that; I suspect it is incorrect.


Thanks
Niaz

Gary Frost

unread,
Nov 7, 2015, 3:36:48 PM11/7/15
to aparapi-discuss
Here (attached) is a patch containing the changes I made to allow ant to find my nvidia GTX480 libs/headers.

I also had to make a change to Device.java (Long.hashCode(long) missing in Java 7)

This assumes your cuda/opencl install is in /usr/local/cuda-7-0  which is where mine was. 

Gary

 

--
gfrost.patch

Niaz Khan

unread,
Nov 7, 2015, 4:46:55 PM11/7/15
to aparapi-discuss
Hi Gary,


I have applied the patch. The only difference between your setup and mine is that I am using Windows rather than Linux.

Attached is the build.xml, which I have modified to add a property for "win64.nvidia.app.sdk.exists".


Also I have given the path for the CUDA SDK; not sure why the build is throwing a linking error for OpenCL?


Thanks
Niaz
OpenCL-link-error.PNG
build.xml

Niaz Khan

unread,
Nov 8, 2015, 2:40:29 PM11/8/15
to aparapi-discuss
I still haven't figured out why the build is failing. I am using the MSVC compiler, and with Visual Studio I can run applications using Nvidia OpenCL when I explicitly specify the linking option in Visual Studio.
In the build.xml I have used the CUDA path like this:

     <available property="win64.nvidia.app.sdk.exists" file="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5" type="dir"/>
     <condition property="nvidia.app.sdk.dir" value="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5">
       <and>
         <os family="windows" />
         <isset property="win64.nvidia.app.sdk.exists" />
         <not>
           <isset property="win32.nvidia.app.sdk.exists" />
         </not>
       </and>
     </condition> 


Do I have to specify nvidia.app.sdk.dir somewhere else for linking?
I have tried to find a dll and jar online for the latest build of aparapi, but haven't found one for Windows with Nvidia OpenCL.


cheers
Niaz

Gary Frost

unread,
Nov 8, 2015, 3:23:01 PM11/8/15
to aparapi-discuss
Sorry I don't have a windows machine. 

Can you confirm that the dir "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5" contains the libraries and headers for opencl?

Also confirm that you are using a 64 bit OS, a 64 bit JVM/JDK 

Windows was the hardest build to get right for Aparapi.  

Sorry I can't be of more help 


--

Niaz Khan

unread,
Nov 8, 2015, 3:35:39 PM11/8/15
to aparapi-discuss
Hi Gary,

   Thanks for the reply.

  I confirm that the dir "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5" contains both the libraries and headers for OpenCL (Win32, x64). In Visual Studio I can run OpenCL applications where I have included the headers from that location and have given OpenCL.lib as an input to the linker.

  I am using a 64-bit OS with a 64-bit JVM/JDK.

   Yea, the build is a pain to set up on Windows compared to Linux. I am not sure if my posted build.xml, where I have added a property for the nvidia.app.sdk, is correct. Also, I am using MSVC and have given the compiler location, but I am not sure where in the build.xml to specify the input to the linker.


Cheers
Niaz


Niaz Khan

unread,
Nov 8, 2015, 3:46:11 PM11/8/15
to aparapi-discuss


On Sunday, November 8, 2015 at 8:23:01 PM UTC, gfrost wrote:
Sorry I don't have a windows machine. 

Can you confirm that the dir "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5" contains the libraries and headers for opencl?

Also confirm that you are using a 64 bit OS, a 64 bit JVM/JDK 

Windows was the hardest build to get right for Aparapi.  

Sorry I can't be of more help 

My device has OpenCL 1.1; not sure if the latest aparapi build would be OK with that or whether it requires OpenCL 1.2?

Niaz Khan

unread,
Nov 9, 2015, 11:57:06 AM11/9/15
to aparapi-discuss
Has anyone got the 64-bit dll and aparapi jar for the latest build? I still have issues with building the latest version for 64-bit Windows with Nvidia OpenCL.


Cheers,
Niaz

Gary Frost

unread,
Nov 9, 2015, 2:55:09 PM11/9/15
to aparapi-discuss
Sorry I don't have a windows machine. 

Look in com.amd.aparapi.jni build.xml around line 681. 

Add 'echo' lines before 
<exec executable="${msvc.dir}\vc\bin\${optional.amd64.subdir}cl.exe" failonerror="true">

to show you where the build is looking for the OpenCL lib.

For example....

<echo message="/libpath:${msvc.dir}\vc\lib\${optional.amd64.subdir}" />
<echo message="/libpath:${msvc.sdk.dir}\lib\${optional.x64.subdir}" />
<echo message="/libpath:${app.sdk.dir}\lib\${x86_or_x86_64}" />
<echo message="/libpath:${app.sdk.dir}\${optional.app.sdk.lib.subdir}" />
Look in each of the dirs echoed to stdout when you do a build and see if the build has the correct dirs. 

If all else fails add a new line like 

<arg value="/libpath:<add path to your opencl.lib here>" />

Around line 720

Gary


--

Niaz Khan

unread,
Nov 9, 2015, 4:02:56 PM11/9/15
to aparapi-discuss
Gary, Michael and Ryan,


 Thanks a lot.


The output of the echo is:

    [echo] /libpath:G:/Applications/MS Visual Studio Ultimate 2010\vc\lib\amd64\
    [echo] /libpath:C:/Program Files (x86)/Microsoft SDKs/Windows/v7.0A\lib\x64\
    [echo] /libpath:C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5\lib\x86_64
    [echo] /libpath:C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5\lib64\


The actual OpenCL.lib dir is C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5/lib/x64, so I have adjusted that and the build is now successful.


Also I have tested the kernel: the generated OpenCL output is correct, the results are correct, and I am not getting any crash in GPU mode.


Cheers 
Niaz

 

Gary Frost

unread,
Nov 9, 2015, 4:05:00 PM11/9/15
to aparapi-discuss
Kudos for sticking with it ! 
 
Gary

--

Michael Zucchi

unread,
Nov 9, 2015, 5:34:38 PM11/9/15
to aparapi...@googlegroups.com

So ... after all that does it run any better?

Niaz Khan

unread,
Nov 9, 2015, 7:17:01 PM11/9/15
to aparapi-discuss
Hi Michael,


Here are the results:


rows 1489752 cols 1489752 nz 10319760

Scalar kernel:

818 ms


Vector kernel (Range.create(rows*32, 32)) with no dummy zeros added to sums
@local sums[64]
669 ms


Vector kernel (Range.create(rows*32, 32)) with no dummy zeros added to sums
@local sums[128]
635ms 


For some reason, when I add the dummy zeros to sums the results fail verification, even though you pointed out that these are important for portable code?

For my Nvidia GTX 485M, I have to keep trying different ranges to see which one is optimal for the device.


Cheers
Niaz

Gary Frost

unread,
Nov 9, 2015, 7:19:54 PM11/9/15
to aparapi-discuss
If the sums array is 'local' then its contents from the 'java' side are of no consequence. Basically we just use the 'length' of this to tell OpenCL to create a buffer of the same size. 

However, one should not assume that its contents are 'zero'. 

Gary


--

Michael Zucchi

unread,
Nov 9, 2015, 7:29:20 PM11/9/15
to aparapi...@googlegroups.com

This is the same problem which was 700ms previously?

Oh how disappointing if so.

Niaz Khan

unread,
Nov 9, 2015, 8:27:31 PM11/9/15
to aparapi-discuss
 
Yea, it's the same problem size; not a huge difference from the initial one though.
 I am using the range as (rows * 32, 32), which might not be the optimal choice in this case?

Michael Zucchi

unread,
Nov 9, 2015, 9:08:42 PM11/9/15
to aparapi...@googlegroups.com
On 10/11/15 11:57, Niaz Khan wrote:
 
Yea, it's the same problem size; not a huge difference from the initial one though.
 I am using the range as (rows * 32, 32), which might not be the optimal choice in this case?

Even if it's not optimal it should be good enough to get an idea.



Niaz Khan

unread,
Nov 13, 2015, 3:46:19 PM11/13/15
to aparapi-discuss
Using a range of (rows * 32, 32):

   I am specifying a group size of 32 as I am using an Nvidia device.
  How about the number of work items per work group?
  Does aparapi take care of that automatically, or can we also specify the number of work items per work group?


cheers
Niaz 

Niaz Khan

unread,
Nov 13, 2015, 4:04:00 PM11/13/15
to aparapi-discuss
I am also trying to get profile info using

List<ProfileInfo> profileInfo = kernel.getProfileInfo();

It's returning null. I have called it after executing the kernel, and I have also used -Dcom.amd.aparapi.enableProfiling=true when running the application.

Any idea why it might be returning null? Do I have to specify some other options? 


Cheers 
Niaz

Gary Frost

unread,
Nov 13, 2015, 4:20:29 PM11/13/15
to aparapi-discuss
It is possible that the Nvidia runtime does not honor 

CL_QUEUE_PROFILING_ENABLE

When creating the command queue.  I will check when I get home if I can get profiling info from my GTX 480. 
Looks like you may have other options. 

Gary

--

Niaz Khan

unread,
Nov 13, 2015, 4:55:56 PM11/13/15
to aparapi-discuss
Gary,

  Thanks for the link, that is quite useful.
   Any idea about the range setup for the spmv kernel using an Nvidia device (GTX 485M)? I am not sure, the way I have used the range, how many work items are assigned to each work group.


Cheers
Niaz

Gary Frost

unread,
Nov 13, 2015, 5:03:42 PM11/13/15
to aparapi-discuss
In your earlier code you had

new Bob().execute(Range.create(128, 64));

Which executes over the range 0..127 (inclusive) 

Splitting the work into groups of 64 

So you should get 2 groups of 64 running 

Your GPU will have a maximum group size it can run (my GTX's is 1024 I think, yours may well be the same) 

You want the group to be as big as possible! (but no bigger ;) ) 

So the above would be better as 

new Bob().execute(Range.create(128, 128));

This way you run one group of 128. 

BTW 128 is a very very small dispatch size.  Ideally your range should be 10's or 100's of thousands.

Group size rules. 

1) groupSize <= min(range, maxDeviceGroupSize) 
2) range % groupSize must be 0
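
For example, a minimal host-side sketch that satisfies both rules for the vector SPMV kernel (the one-32-wide-group-per-row mapping is assumed from earlier in the thread; variable names are illustrative):

    int rows = 1489752;
    int groupSize = 32;                   // rule 1: must not exceed the device's maximum group size
    int globalSize = rows * groupSize;    // rule 2: globalSize % groupSize == 0 by construction
    kernel.execute(Range.create(globalSize, groupSize));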

Gary

--

Niaz Khan

unread,
Nov 13, 2015, 5:30:51 PM11/13/15
to aparapi-discuss
Gary,

  Thats very handy, thanks a lot.

  My GTX 485M also has a maximum group size of 1024.

So for rows = 1489752

Range.create(rows * 32, 32) (This way it satisfies rules 1 and 2 for the group size)

There will be 1489752 groups of 32 running?


Just trying to figure out the optimal range for my GTX 485M where rows = 1489752



Cheers
Niaz

 

Gary Frost

unread,
Nov 13, 2015, 5:34:48 PM11/13/15
to aparapi-discuss
I am confused. 

What is the global size you need to run? Over what 'range of data' do you wish to execute?

If your data is a multiple of 1024, why not run in groups of 1024?

(Generally) Running in groups of 32 is slower than groups of 1024?

Gary

--

Niaz Khan

unread,
Nov 13, 2015, 5:56:03 PM11/13/15
to aparapi-discuss
The 'range of data' is 1489752 which is the # of rows in a sparse matrix

Prior to the discussion with Michael and you, I was using it like this in the scalar kernel:

    kernel.execute(rows);

    public void run() {
        int i = getGlobalId();
        mult[i] = 0;
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; j++){
            mult[i] += (data[j] * x[colIndices[j]]);
        }
    }

Now, for the vector kernel, where we assign a work group per row and use a parallel sum in the kernel, I am trying to use the range like

Range.create(rows * 32, 32)

In this case rows is not a multiple of 1024.

For the spmv, rows is the range which is used to go through the CSR arrays (data, colIndices, rowPtr) to perform the multiplication.


Niaz


Michael Zucchi

unread,
Nov 13, 2015, 7:41:50 PM11/13/15
to aparapi...@googlegroups.com
On 14/11/15 09:04, Gary Frost wrote:
> I am confused.
>
> What is the global size you need to run ? Over what 'range of data' do
> you wish to execute?
>
> If you data is a multiple of 1024 why not run in groups of 1024?
>
> (Generally) Running in groups of 32 is slower than groups of 1024?
>

It's because the kernel is tuned for using local memory for the
accumulation, and once you use LDS that typically becomes the
concurrency limiting factor per CU and you "typically" get better
performance from that compared to wider workgroups.

For AMD, for routines using local memory i almost always just use 64
now, it may not be always optimal but it's always near-optimal and if
budget/time allows it can be tweaked.

I assumed that was reasonable for nvidia too and more or less fits this
problem, but if the matrix is very sparse and each row only contains
(n<<=32) live columns then this solution isn't quite right as most of
the alu's will be idled.

For example if n is closer to 8, you could do 4 rows per 32-item workgroup
instead and implement some of the concurrency manually via addressing
calculations and some other foo.

Or if you have a lot of empty rows, you use a slightly different
approach and rather than supply the size to the launch parameters you
fix the launch size and handle the range in a loop (known as 'persistent
kernel') which is a form of static scheduling. And if things are really
dynamic (in such a way that some kernels end up doing many long rows,
others do nothing) use atomic counters for dynamic scheduling, ... or
determine scheduling beforehand.

!Z
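
A rough Aparapi-flavoured sketch of that 'persistent kernel' shape, purely illustrative: it assumes the same fields as the vector kernel earlier in the thread plus an int numRows, and a fixed number of 32-wide groups chosen on the host.

    @Override public void run() {
        int k = getLocalId(0);                    // lane within the 32-wide group
        int groups = getNumGroups(0);             // fixed at launch time
        for (int row = getGroupId(0); row < numRows; row += groups) {
            float sum = 0f;
            int j0 = rowPtr[row];
            int j1 = rowPtr[row + 1];
            for (int j = j0 + k; j < j1; j += 32) {
                sum += data[j] * x[colIndices[j]];
            }
            sums[k] = sum;
            sums[k + 32] = 0f;                    // zero the padding lanes
            localBarrier();
            for (int j = 16; j > 0; j = j >> 1) { // same local reduction as before
                sum += sums[j + k];
                localBarrier();
                sums[k] = sum;
                localBarrier();
            }
            if (k == 0) {
                mult[row] = sum;
            }
        }
    }

The host would then launch something like Range.create(fixedGroupCount * 32, 32) instead of rows * 32, with fixedGroupCount picked to roughly fill the device.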

Niaz Khan

unread,
Nov 16, 2015, 6:56:08 AM11/16/15
to aparapi-discuss
Hi Gary,


Have you had a chance to check for

CL_QUEUE_PROFILING_ENABLE

on your GTX 480?


Thanks
Niaz

Niaz Khan

unread,
Nov 19, 2015, 1:50:16 PM11/19/15
to aparapi-discuss
Does aparapi allow sharing a GPU from multiple Java threads on a machine? This might have issues with scheduling tasks on the GPU; I am not sure if aparapi can handle this and, if it does, how it is done. I know OpenCL uses command queues to handle that.


Thanks
Niaz 

Gary Frost

unread,
Nov 19, 2015, 2:33:47 PM11/19/15
to aparapi-discuss
Aparapi uses OpenCL's queues, so you should be OK.
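
A minimal sketch of the usual pattern (illustrative only: SpmvKernel is a placeholder name, and each thread gets its own Kernel instance and its own output array so the kernels do not step on each other):

    Runnable job = new Runnable() {
        @Override public void run() {
            // One Kernel instance per host thread; each gets its own OpenCL
            // resources and writes into its own output buffer.
            SpmvKernel k = new SpmvKernel(rowPtr, data, x, colIndices, new float[rows]);
            k.execute(Range.create(rows * 32, 32));
            k.dispose();
        }
    };
    Thread a = new Thread(job);
    Thread b = new Thread(job);
    a.start();
    b.start();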


--

Gary Frost

unread,
Nov 19, 2015, 2:50:12 PM11/19/15
to aparapi-discuss
Also I checked profiling on my NVidia 970

With 
 -Dcom.amd.aparapi.enableProfiling=true 

I can get profiling info from kernels using 

               kernel.execute(range);
               final List<ProfileInfo> profileInfo = kernel.getProfileInfo();
               if ((profileInfo != null) && (profileInfo.size() > 0)) {
                  for (final ProfileInfo p : profileInfo) {
                     System.out.print(" " + p.getType() + " " + p.getLabel() + " " + (p.getStart() / 1000) + " .. "
                           + (p.getEnd() / 1000) + " " + ((p.getEnd() - p.getStart()) / 1000) + "us");
                  }
                  System.out.println();
               }
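
If it is awkward to pass the -D flag from your environment (Eclipse, for instance), here is a small sketch of an alternative, assuming the property gets set before any Aparapi class is initialised:

    public class ProfilingMain {
        public static void main(String[] args) {
            // Must happen before the first Kernel/Range class is loaded, because
            // Aparapi reads its com.amd.aparapi.* properties in static initialisers.
            System.setProperty("com.amd.aparapi.enableProfiling", "true");
            // ... build the kernel, execute it, then call kernel.getProfileInfo() ...
        }
    }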

Niaz Khan

unread,
Nov 19, 2015, 5:38:52 PM11/19/15
to aparapi-discuss
Hi Gary,

   Strangely this does not show the profile information on my GTX 485M even though it has OpenCL 1.1; I am sure your GTX 970 will have 1.1 too?


   Thanks
   Niaz

Ryan LaMothe

unread,
Nov 19, 2015, 6:18:35 PM11/19/15
to aparapi...@googlegroups.com
Windows vs Linux?

Sent from my iPhone --- Please excuse any typos or autocorrect mistakes
--

Niaz Khan

unread,
Nov 19, 2015, 7:11:06 PM11/19/15
to aparapi...@googlegroups.com

Hi Ryan,

I am using Windows and I think Gary is using Linux; both the GTX 970 and the 485M use OpenCL 1.1, so I guess it won't be an issue with the driver.

Niaz


Gary Frost

unread,
Nov 19, 2015, 10:47:04 PM11/19/15
to aparapi-discuss
In the end I just added the above property in mandel.sh in the javaone demo dir. 

It just dumped profile info. 

Gary

--

Niaz Khan

unread,
Dec 15, 2015, 8:50:59 AM12/15/15
to aparapi-discuss
Hi Gary,


Since I can't get the profile info on my Windows machine with the GTX 485M, I am now using another machine running Linux with a GTX 980.

Based on my experience of building aparapi on my Windows machine, I have tried to set up the paths for Nvidia OpenCL in the build.xml, but I am still getting the following errors.

[nkhan@hvan03 aparapi]$ ant clean build dist
Buildfile: /home/nkhan/aparapi/build.xml

clean:

clean:

clean:

clean:

clean:

clean:

clean:

clean:

clean:

clean:
     [echo] OS Name:        Linux
     [echo] OS Version:     3.10.0-123.el7.x86_64
     [echo] OS Arch:        amd64
     [echo] Java Version:       1.7.0_79

clean:

clean:

clean:

clean:

clean:
     [echo] This project REQUIRES NBODY to be built first!!!

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64
     [echo] Java Ver:   1.7.0_79
     [echo]
     [echo]       Note: Since AMD APP SDK 2.9 you can have multiple versions installed/co-exist on same machine,
     [echo]       so AMD introduced a completely new naming convention for their installation directories
     [echo]       For example C:/Program Files/AMD APP SDK/2.9-1
     [echo]   

init:
     [echo] amd.app.sdk.dir /usr/local/cuda-7.0/targets/x86_64-linux/include/
     [echo]  nvidia.app.sdk.dir /usr/local/cuda-7.0
     [echo]  intel.app.sdk.dir ${intel.app.sdk.dir}
     [echo]  freebsd.app.sdk.dir ${freebsd.app.sdk.dir}
     [echo]  vendor.name amd
     [echo] app.sdk.dir /usr/local/cuda-7.0/targets/x86_64-linux/include/
     [echo]  msvc.dir ${msvc.dir}
     [echo]  msvc.sdk.dir ${msvc.sdk.dir}

check:

clean:
   [delete] Deleting directory /home/nkhan/aparapi/com.amd.aparapi.jni/include
   [delete] Deleting directory /home/nkhan/aparapi/com.amd.aparapi.jni/dist
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:
   [delete] Deleting directory /home/nkhan/aparapi/com.amd.aparapi/classes
   [delete] Deleting directory /home/nkhan/aparapi/com.amd.aparapi/dist

clean:

clean:

clean:

clean:

clean:

clean:

clean:

clean:

clean:

clean:
     [echo] OS Name:        Linux
     [echo] OS Version:     3.10.0-123.el7.x86_64
     [echo] OS Arch:        amd64
     [echo] Java Version:       1.7.0_79

clean:

clean:

clean:

clean:

clean:
     [echo] This project REQUIRES NBODY to be built first!!!

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64
     [echo] Java Ver:   1.7.0_79
     [echo]
     [echo]       Note: Since AMD APP SDK 2.9 you can have multiple versions installed/co-exist on same machine,
     [echo]       so AMD introduced a completely new naming convention for their installation directories
     [echo]       For example C:/Program Files/AMD APP SDK/2.9-1
     [echo]   

init:
     [echo] amd.app.sdk.dir /usr/local/cuda-7.0/targets/x86_64-linux/include/
     [echo]  nvidia.app.sdk.dir /usr/local/cuda-7.0
     [echo]  intel.app.sdk.dir ${intel.app.sdk.dir}
     [echo]  freebsd.app.sdk.dir ${freebsd.app.sdk.dir}
     [echo]  vendor.name amd
     [echo] app.sdk.dir /usr/local/cuda-7.0/targets/x86_64-linux/include/
     [echo]  msvc.dir ${msvc.dir}
     [echo]  msvc.sdk.dir ${msvc.sdk.dir}

check:

clean:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:

build:
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64

clean:

build:
    [mkdir] Created dir: /home/nkhan/aparapi/com.amd.aparapi/classes
    [mkdir] Created dir: /home/nkhan/aparapi/com.amd.aparapi/dist
    [javac] Compiling 69 source files to /home/nkhan/aparapi/com.amd.aparapi/classes
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:2238: warning: [rawtypes] found raw type: LocalVariableTableEntry
    [javac]       private LocalVariableTableEntry localVariableTableEntry = null;
    [javac]               ^
    [javac]   missing type arguments for generic class LocalVariableTableEntry<T>
    [javac]   where T is a type-variable:
    [javac]     T extends LocalVariableInfo declared in interface LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:2365: warning: [rawtypes] found raw type: LocalVariableTableEntry
    [javac]       public LocalVariableTableEntry getLocalVariableTableEntry() {
    [javac]              ^
    [javac]   missing type arguments for generic class LocalVariableTableEntry<T>
    [javac]   where T is a type-variable:
    [javac]     T extends LocalVariableInfo declared in interface LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:2535: warning: [rawtypes] found raw type: LocalVariableTableEntry
    [javac]       public LocalVariableTableEntry getLocalVariableTableEntry() {
    [javac]              ^
    [javac]   missing type arguments for generic class LocalVariableTableEntry<T>
    [javac]   where T is a type-variable:
    [javac]     T extends LocalVariableInfo declared in interface LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:2539: warning: [rawtypes] found raw type: LocalVariableTableEntry
    [javac]       void setLocalVariableTableEntry(LocalVariableTableEntry _localVariableTableEntry) {
    [javac]                                       ^
    [javac]   missing type arguments for generic class LocalVariableTableEntry<T>
    [javac]   where T is a type-variable:
    [javac]     T extends LocalVariableInfo declared in interface LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:1066: warning: [fallthrough] possible fall-through into case
    [javac]                            case inArray:
    [javac]                            ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:1087: warning: [fallthrough] possible fall-through into case
    [javac]                            case inArray:
    [javac]                            ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/ClassModel.java:1964: warning: [cast] redundant cast to ClassModel.AttributePool.RealLocalVariableTableEntry.RealLocalVariableInfo
    [javac]             final RealLocalVariableInfo localVariableInfo = (RealLocalVariableInfo) getVariable(_pc, _index);
    [javac]                                                             ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/instruction/InstructionSet.java:1302: warning: [rawtypes] found raw type: LocalVariableTableEntry
    [javac]          final LocalVariableTableEntry localVariableTableEntry = method.getLocalVariableTableEntry();
    [javac]                ^
    [javac]   missing type arguments for generic class LocalVariableTableEntry<T>
    [javac]   where T is a type-variable:
    [javac]     T extends LocalVariableInfo declared in interface LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/instruction/InstructionSet.java:3737: warning: [rawtypes] found raw type: Constant
    [javac]    @SuppressWarnings("unchecked") public interface ConstantPoolEntryConstant extends Constant{
    [javac]                                                                                      ^
    [javac]   missing type arguments for generic class Constant<T>
    [javac]   where T is a type-variable:
    [javac]     T extends Object declared in interface Constant
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/MethodModel.java:1629: warning: [unchecked] unchecked conversion
    [javac]          LocalVariableTableEntry<LocalVariableInfo> localVariableTableEntry = method.getLocalVariableTableEntry();
    [javac]                                                                                                                ^
    [javac]   required: LocalVariableTableEntry<LocalVariableInfo>
    [javac]   found:    LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/MethodModel.java:1680: warning: [unchecked] unchecked conversion
    [javac]       return (method.getLocalVariableTableEntry());
    [javac]                                                ^
    [javac]   required: LocalVariableTableEntry<LocalVariableInfo>
    [javac]   found:    LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/model/MethodModel.java:1680: warning: [unchecked] unchecked conversion
    [javac]       return (method.getLocalVariableTableEntry());
    [javac]              ^
    [javac]   required: LocalVariableTableEntry<LocalVariableInfo>
    [javac]   found:    LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:399: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]       static LinkedHashSet<EXECUTION_MODE> getDefaultExecutionModes() {
    [javac]                                            ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:2808: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]    private final LinkedHashSet<EXECUTION_MODE> executionModes = (Config.executionMode != null) ? EXECUTION_MODE.getDefaultExecutionModes() :  new LinkedHashSet<>(Collections.singleton(EXECUTION_MODE.AUTO));
    [javac]                                                ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:2813: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]    private Iterator<EXECUTION_MODE> currentMode = executionModes.iterator();
    [javac]                                     ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:2818: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]    private EXECUTION_MODE executionMode = currentMode.next();
    [javac]                           ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:2827: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]    public void addExecutionModes(EXECUTION_MODE... platforms) {
    [javac]                ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:2837: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]    public boolean hasNextExecutionMode() {
    [javac]                   ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/Kernel.java:2845: warning: [dep-ann] deprecated item is not annotated with @Deprecated
    [javac]    public void tryNextExecutionMode() {
    [javac]                ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/instruction/ExpressionList.java:785: warning: [unchecked] unchecked conversion
    [javac]                   .getLocalVariableTableEntry();
    [javac]                                              ^
    [javac]   required: LocalVariableTableEntry<LocalVariableInfo>
    [javac]   found:    LocalVariableTableEntry
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/device/OpenCLDevice.java:392: warning: [rawtypes] found raw type: Class
    [javac]       final T instance = (T) Proxy.newProxyInstance(OpenCLDevice.class.getClassLoader(), new Class[] {
    [javac]                                                                                              ^
    [javac]   missing type arguments for generic class Class<T>
    [javac]   where T is a type-variable:
    [javac]     T extends Object declared in class Class
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/device/OpenCLDevice.java:392: warning: [unchecked] unchecked cast
    [javac]       final T instance = (T) Proxy.newProxyInstance(OpenCLDevice.class.getClassLoader(), new Class[] {
    [javac]                                                    ^
    [javac]   required: T
    [javac]   found:    Object
    [javac]   where T is a type-variable:
    [javac]     T extends OpenCL<T> declared in method <T>bind(Class<T>,String)
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/kernel/KernelPreferences.java:33: warning: [rawtypes] found raw type: ArrayList
    [javac]          copy = new ArrayList(preferredDevices);
    [javac]                     ^
    [javac]   missing type arguments for generic class ArrayList<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in class ArrayList
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/kernel/KernelPreferences.java:33: warning: [unchecked] unchecked call to ArrayList(Collection<? extends E>) as a member of the raw type ArrayList
    [javac]          copy = new ArrayList(preferredDevices);
    [javac]                 ^
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in class ArrayList
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/kernel/KernelPreferences.java:33: warning: [unchecked] unchecked conversion
    [javac]          copy = new ArrayList(preferredDevices);
    [javac]                 ^
    [javac]   required: ArrayList<Device>
    [javac]   found:    ArrayList
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/opencl/OpenCLPlatform.java:59: warning: [unchecked] unchecked conversion
    [javac]             return (Collections.EMPTY_LIST);
    [javac]                                ^
    [javac]   required: List<OpenCLPlatform>
    [javac]   found:    List
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/opencl/OpenCLPlatform.java:59: warning: [unchecked] unchecked conversion
    [javac]             return (Collections.EMPTY_LIST);
    [javac]                    ^
    [javac]   required: List<OpenCLPlatform>
    [javac]   found:    List
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/kernel/KernelManager.java:95: warning: [static] static method should be qualified by type name, KernelDeviceProfile, instead of by an expression
    [javac]                   builder.append(deviceProfile.getTableHeader()).append("\n");
    [javac]                                               ^
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/kernel/KernelManager.java:157: warning: [unchecked] unchecked cast
    [javac]          T shared = (T) sharedInstances.get(kernelClass);
    [javac]                                            ^
    [javac]   required: T
    [javac]   found:    Kernel
    [javac]   where T is a type-variable:
    [javac]     T extends Kernel declared in method <T>getSharedKernelInstance(Class<T>)
    [javac] /home/nkhan/aparapi/com.amd.aparapi/src/java/com/amd/aparapi/internal/writer/BlockWriter.java:271: warning: [cast] redundant cast to CompositeInstruction
    [javac]          writeConditional(((CompositeInstruction) instruction).getBranchSet(), true);
    [javac]                            ^
    [javac] 30 warnings
    [javac] Creating empty /home/nkhan/aparapi/com.amd.aparapi/classes/com/amd/aparapi/internal/tool/package-info.class
    [javac] Creating empty /home/nkhan/aparapi/com.amd.aparapi/classes/com/amd/aparapi/opencl/package-info.class
    [javac] Creating empty /home/nkhan/aparapi/com.amd.aparapi/classes/com/amd/aparapi/package-info.class
    [javac] Creating empty /home/nkhan/aparapi/com.amd.aparapi/classes/com/amd/aparapi/exception/package-info.class
    [javac] Creating empty /home/nkhan/aparapi/com.amd.aparapi/classes/com/amd/aparapi/annotation/package-info.class
    [javac] Creating empty /home/nkhan/aparapi/com.amd.aparapi/classes/com/amd/aparapi/device/package-info.class
      [jar] Building jar: /home/nkhan/aparapi/com.amd.aparapi/dist/aparapi.jar
     [echo] OS Name:    Linux
     [echo] OS Version: 3.10.0-123.el7.x86_64
     [echo] OS Arch:    amd64
     [echo] Java Ver:   1.7.0_79
     [echo]
     [echo]       Note: Since AMD APP SDK 2.9 you can have multiple versions installed/co-exist on same machine,
     [echo]       so AMD introduced a completely new naming convention for their installation directories
     [echo]       For example C:/Program Files/AMD APP SDK/2.9-1
     [echo]   

init:
     [echo] amd.app.sdk.dir /usr/local/cuda-7.0/targets/x86_64-linux/include/
     [echo]  nvidia.app.sdk.dir /usr/local/cuda-7.0
     [echo]  intel.app.sdk.dir ${intel.app.sdk.dir}
     [echo]  freebsd.app.sdk.dir ${freebsd.app.sdk.dir}
     [echo]  vendor.name amd
     [echo] app.sdk.dir /usr/local/cuda-7.0/targets/x86_64-linux/include/
     [echo]  msvc.dir ${msvc.dir}
     [echo]  msvc.sdk.dir ${msvc.sdk.dir}

check:

clean:

javah:
    [mkdir] Created dir: /home/nkhan/aparapi/com.amd.aparapi.jni/include

msvc:

gcc:
    [mkdir] Created dir: /home/nkhan/aparapi/com.amd.aparapi.jni/dist
     [echo] linuxcc amd64
     [exec] In file included from src/cpp/runKernel/Aparapi.h:43:0,
     [exec]                  from src/cpp/runKernel/Aparapi.cpp:44:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/ArrayBuffer.h:41:0,
     [exec]                  from src/cpp/runKernel/ArrayBuffer.cpp:39:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/AparapiBuffer.h:41:0,
     [exec]                  from src/cpp/runKernel/AparapiBuffer.cpp:39:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/Config.h:40:0,
     [exec]                  from src/cpp/runKernel/Config.cpp:39:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/JNIContext.h:5:0,
     [exec]                  from src/cpp/runKernel/JNIContext.cpp:1:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/KernelArg.h:5:0,
     [exec]                  from src/cpp/runKernel/KernelArg.cpp:1:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/ProfileInfo.h:40:0,
     [exec]                  from src/cpp/runKernel/ProfileInfo.cpp:39:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/runKernel/Range.h:4:0,
     [exec]                  from src/cpp/runKernel/Range.cpp:2:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/invoke/OpenCLJNI.h:44:0,
     [exec]                  from src/cpp/invoke/OpenCLJNI.cpp:42:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/CLHelper.h:42:0,
     [exec]                  from src/cpp/CLException.h:7,
     [exec]                  from src/cpp/JNIHelper.h:44,
     [exec]                  from src/cpp/invoke/OpenCLArgDescriptor.h:4,
     [exec]                  from src/cpp/invoke/OpenCLArgDescriptor.cpp:1:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/CLHelper.h:42:0,
     [exec]                  from src/cpp/CLException.h:7,
     [exec]                  from src/cpp/JNIHelper.h:44,
     [exec]                  from src/cpp/invoke/OpenCLMem.h:4,
     [exec]                  from src/cpp/invoke/OpenCLMem.cpp:1:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/CLHelper.h:42:0,
     [exec]                  from src/cpp/CLHelper.cpp:40:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.
     [exec] In file included from src/cpp/JNIHelper.cpp:39:0:
     [exec] src/cpp/Common.h:57:19: fatal error: CL/cl.h: No such file or directory
     [exec]  #include <CL/cl.h>
     [exec]                    ^
     [exec] compilation terminated.

BUILD FAILED
/home/nkhan/aparapi/build.xml:47: The following error occurred while executing this line:
/home/nkhan/aparapi/com.amd.aparapi.jni/build.xml:616: exec returned: 1

Total time: 11 seconds



Any idea about this?
Attached is the build.xml where I have specified the CUDA path; I have also added a property for Linux Nvidia.

Regards
Niaz

build.xml

Niaz Khan

unread,
Dec 15, 2015, 9:01:20 AM12/15/15
to aparapi-discuss
The machine is running CentOS 7, I have ant version 1.9.2, and I am using the Oracle JDK jdk1.7.0_79.

Cheers
Niaz

Niaz Khan

unread,
Dec 15, 2015, 11:28:36 AM12/15/15
to aparapi-discuss
Hi Gary,


I have managed to build the latest aparapi version on the Linux machine too.
I manually added the paths to the OpenCL headers and .so for gcc in the build.xml, which fixed those errors, and the build is successful.

Cheers
Niaz

Gary Frost

unread,
Dec 15, 2015, 11:37:57 AM12/15/15
to aparapi-discuss

Great. I am on the bus on the way to the office and was hoping to look at this when I got to my desk.

Glad you sorted it out.

Gary

--

Niaz Khan

unread,
Dec 15, 2015, 12:34:35 PM12/15/15
to aparapi-discuss
Hi Gary,


I had a quick test on the Linux machine of the same code which runs fine on the Windows machine.

On the Linux machine the output is:

Execution mode=GPU
java.lang.NoClassDefFoundError: Could not initialize class sun.util.calendar.ZoneInfoFile
    at sun.util.calendar.ZoneInfo.getTimeZone(ZoneInfo.java:589)
    at java.util.TimeZone.getTimeZone(TimeZone.java:560)
    at java.util.TimeZone.setDefaultZone(TimeZone.java:666)
    at java.util.TimeZone.getDefaultRef(TimeZone.java:636)
    at java.util.TimeZone.getDefault(TimeZone.java:625)
    at java.util.Calendar.getInstance(Calendar.java:1640)
    at java.util.Formatter$FormatSpecifier.printDateTime(Formatter.java:2826)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:2740)
    at java.util.Formatter.format(Formatter.java:2526)
    at java.util.Formatter.format(Formatter.java:2455)
    at java.lang.String.format(String.java:2940)
    at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:161)
    at java.util.logging.StreamHandler.publish(StreamHandler.java:211)
    at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:116)
    at java.util.logging.Logger.log(Logger.java:738)
    at java.util.logging.Logger.doLog(Logger.java:765)
    at java.util.logging.Logger.log(Logger.java:788)
    at java.util.logging.Logger.warning(Logger.java:1476)
    at com.amd.aparapi.internal.kernel.KernelRunner.fallBackToNextDevice(KernelRunner.java:1083)
    at com.amd.aparapi.internal.kernel.KernelRunner.fallBackToNextDevice(KernelRunner.java:1074)
    at com.amd.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1232)
    at com.amd.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1136)
    at com.amd.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1126)
    at com.amd.aparapi.Kernel.execute(Kernel.java:2112)
    at com.amd.aparapi.Kernel.execute(Kernel.java:2069)
    at com.amd.aparapi.Kernel.execute(Kernel.java:2044)
    at spmv_pcj_aparapi.AparapiTest.main(AparapiTest.java:220)
    at org.pcj.internal.PcjThread.run(PcjThread.java:29)


Not sure why I am getting the util errors; I do not get them while running the code on the Windows machine using Eclipse.

On the Linux machine I am using this JDK:
openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)


Do I have to use any specific JDK? Not sure if you have come across this error?


Cheers
Niaz


Gary Frost

unread,
Dec 15, 2015, 1:42:35 PM12/15/15
to aparapi-discuss
Curious

I see this same issue raised over here 


Also referenced



I have seen this referred to as a GC issue (bad) or a concurrent modification issue (which might mean we need to add a lock in aparapi) 

Some suggest switching time-zone (what?)

Gary


Eric Caspole

unread,
Dec 15, 2015, 1:47:45 PM12/15/15
to aparapi...@googlegroups.com
I see you are using an OpenJDK build of Java bundled with your Linux distro. Maybe Gary can say otherwise but I don't recall we ever tested that at all. I recommend getting a regular Oracle JDK from java.sun.com and see what happens.
Regards,
Eric

Gary Frost

unread,
Dec 15, 2015, 2:05:59 PM12/15/15
to aparapi-discuss
Great catch Eric.  

I never saw that.  

Yes.  Absolutely.  Can you switch to using Oracle's JVM rather than OpenJDK?  We have always argued that OpenJDK 'should' work, but mostly we prefer Oracle's.

Gary

Niaz Khan

unread,
Dec 18, 2015, 12:33:34 PM12/18/15
to aparapi-discuss
Hi Gary and Eric,


I have tried running the code on the Linux machine using Oracle's JVM and it works perfectly :)


Gary,


I am still not able to see the profile info on the Linux machine. I have tried using -Dcom.amd.aparapi.enableProfiling=true in the Java command.

This is how I am trying to profile in the code:

              final List<ProfileInfo> profileInfo = kernel.getProfileInfo();
              if(profileInfo == null)
                          System.out.println("profile info null");

              if ((profileInfo != null) && (profileInfo.size() > 0)) {
                 for (final ProfileInfo p : profileInfo) {
                    System.out.print(" " + p.getType() + " " + p.getLabel() + " " + (p.getStart() / 1000) + " .. "
                          + (p.getEnd() / 1000) + " " + ((p.getEnd() - p.getStart()) / 1000) + "us");
                 }
                 System.out.println();
              }

It's returning null for the profile info. Just wondering what could cause this; any idea?

Cheers
Niaz

Barney Pitt

unread,
Dec 18, 2015, 3:14:45 PM12/18/15
to aparapi...@googlegroups.com

Should we have an "Approved JVM" list, to help prevent issues like this?

I've only used Oracle and Excelsior Jet JVMs ... Jet seems to work just fine.

Barney

Niaz Khan

unread,
Dec 20, 2015, 1:40:46 PM12/20/15
to aparapi-discuss
Gary,


I am still not able to see the profile info on either the Linux or the Windows machine. I have tried using -Dcom.amd.aparapi.enableProfiling=true in the Java command and have also tried compiling the code with -g, but still no luck with it.


This is how I am trying to profile in the code:

              final List<ProfileInfo> profileInfo = kernel.getProfileInfo();
              if(profileInfo == null)
                          System.out.println("profile info null");

              if ((profileInfo != null) && (profileInfo.size() > 0)) {
                 for (final ProfileInfo p : profileInfo) {
                    System.out.print(" " + p.getType() + " " + p.getLabel() + " " + (p.getStart() / 1000) + " .. "
                          + (p.getEnd() / 1000) + " " + ((p.getEnd() - p.getStart()) / 1000) + "us");
                 }
                 System.out.println();
              }

It's returning null for the profile info. Just wondering what could cause this; any idea? I remember that last time, when you tried it on your machine, you were able to get the profile info.

Cheers
Niaz

Gary Frost

unread,
Dec 30, 2015, 8:20:58 PM12/30/15
to aparapi-discuss
Niaz

Can you unzip the attached, build it and run it?  

You may need to hack the makefile to match your OpenCL setup. 

It is a pure C application which extracts profile info 

Let's see if it works. 

Gary

--
prof.zip

Niaz Khan

unread,
Jan 1, 2016, 1:38:08 PM1/1/16
to aparapi-discuss
Hi Gary,

   Happy New Year :). 

   I have tested the C application on the Windows machine using Visual Studio and the Nvidia GTX 485M.

   The code seems to work fine as the profile info is extracted.

   The output is attached. I will also try it on the Linux machine when I get back to work.
  
   If the plain C code works, what might cause Aparapi not to display the profile info?


   Cheers
   Niaz

 
  
prof-output.PNG

Gary Frost

unread,
Jan 1, 2016, 1:50:20 PM1/1/16
to aparapi-discuss
That is a good question and the reason I sent a pure c++ implementation. 

Now I am suspicious about whether you are actually running on the GPU, or even via OpenCL at all. 

You are sure that execution is not falling back to Java (no profiling info there). 

And you have 
  -Dcom.amd.aparapi.enableProfiling=true

Try also adding both  
    -Dcom.amd.aparapi.enableExecutionModeReporting=true
    -Dcom.amd.aparapi.dumpFlags=true

Gary

--

Niaz Khan

unread,
Jan 1, 2016, 2:43:34 PM1/1/16
to aparapi-discuss
Gary,

When I run the code it does not give any info about falling back to Java so that should be fine.

I have run the aparapi code (Eclipse) using these options: -Dcom.amd.aparapi.enableShowExecutionModes=true -Dcom.amd.aparapi.enableProfiling=true -Dcom.amd.aparapi.enableExecutionModeReporting=true -Dcom.amd.aparapi.dumpFlags=true


Here is the output:

Execution mode=GPU
execution complete: AparapiKernel, modes=[AUTO], current = GPU
profile info null 


The execution mode is returning "GPU", so the code might run on GPU rather than OpenCL?


Cheers
Niaz