Warning for clCreateCommandQueue() failed: out of host memory


David Wynter

Mar 26, 2013, 3:08:04 PM
to aparapi...@googlegroups.com
Hi,

Just started using aparapi. 

A simple pair of Kernels here:

float myRatio = 0f;
int len = returns.length;
final int[] below = {0};
final float[] sum = {0.0f};
final float[] sumBelow = {0.0f};
final float[] secorder = {0.0f};

Kernel kernel = new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        if( returns[i] < minAcc ) {
            below[0]++;
            sumBelow[0] += returns[i];
        }
        sum[0] += returns[i];
    }
};
kernel.execute(len);

Kernel kernel2 = new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        if( returns[i] < minAcc ) {
            secorder[0] = (float) (secorder[0] +
                (returns[i] - sumBelow[0]/below[0]) * (returns[i] - sumBelow[0]/below[0]));
        }
    }
};
kernel2.execute(len);



causes this warning:

Mar 26, 2013 6:21:01 PM com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class uk.co.davidwynterassoc.sortino.SortinoRatio$1: OpenCL compile failed
!!!!!!! clCreateCommandQueue() failed: out of host memory

Can I allocate more host memory?

Thx.

David

gfrost

Mar 26, 2013, 5:28:23 PM
to aparapi...@googlegroups.com
David, 

What size is the returns[] array? I am trying to determine why you are running out of memory.

Can you include the result of running clinfo? (This way we can determine how much memory you have at your disposal.)

Is this code executed in a loop? If so, we may need to hoist the kernel creation out of the loop (each creates a new context and command queue).

You have a race condition in both kernels. 

array[0]++;

is unlikely to yield the result you expect. The increment is not atomic across threads, so no thread executing can expect to see an updated value from any other thread.

Kernel.atomicAdd(array,index, value); 

May help, but is slow.  
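For example, the integer counter in your first kernel could be updated like this (a sketch only; note that atomicAdd works on int arrays, so the float sums would still need a different approach, such as a reduction):

Kernel countKernel = new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      if( returns[i] < minAcc ) {
         atomicAdd(below, 0, 1); // atomic replacement for below[0]++
      }
   }
};
countKernel.execute(len);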

David Wynter

Mar 26, 2013, 6:18:59 PM
to aparapi...@googlegroups.com
Hi,

Thx for the quick response. I found clinfoGUISources, ran clinfo.exe under Wine (I am on Xubuntu x86_64), and it only shows my Intel Q9300 CPU, not the AMD 7770 card. Catalyst shows it though.

The returns array is only 12 in length.

The array[0]++ was just to add up all the values in the array; array[0] was just a way of having the array final but the value inside changeable. Is it faster to do that outside this Kernel? Is there some way of putting a variable in GPU memory that all globalId tasks can add to? I assume there must be a way of managing contention for that memory?

I have another question: doing things like this in a loop is common

double delta_x = list[i-1] - mean_x[0];

But of course this will not work. There must be a common pattern used to accommodate it, but I have found no examples. It is such a common requirement that I thought someone would have documented it. What pattern is used?

Thx.

David

David Wynter

Mar 26, 2013, 6:26:15 PM
to aparapi...@googlegroups.com
Hi,

Can I use local memory to hold the variable I need, to keep a count of all values that pass the test condition? Then localBarrier() will control access to it?

David


gfrost

Mar 26, 2013, 6:35:17 PM
to aparapi...@googlegroups.com
David, are you building Aparapi from source? If so, then

$ cd com.amd.aparapi.jni
$ ant cltest

This will create the cltest executable under the dist subdir.

Just run this executable. 
$ dist/cltest_x86_64

It should dump info about your devices.  (although now I know you have a 7770 :) ) 

gfrost

Mar 26, 2013, 6:52:50 PM
to aparapi...@googlegroups.com
Local memory will not help unless your global size (Kernel.execute(globalSize)) is less than the group size of your device.  

In OpenCL your global size is executed in groups (usually a max of 64 or 256 items). So if you request a global size of 1024, this might be executed as 4 groups of 256 threads. Local memory can be synchronized across a group, provided you create a barrier. So at any time 256 threads can all access the same local memory, but this does not solve the race condition presented by foo[0]++.

There is no mechanism (other than atomicAdd(array, index, value)) that will work for global memory. 

If you created a local buffer the width of your group (say local[256]), each work item could set local[getLocalId(0)] to 1 or 0. Then use a fence/barrier to assert that all group writes are complete. Then have one member of the group 'sum' across local memory using a for loop:

if (condition)
   local[getLocalId(0)] = 1;
else
   local[getLocalId(0)] = 0;
localBarrier(); // ensure that all work items have completed their writes
if (getLocalId(0) == 0){
   for (int i = 1; i < getLocalSize(0); i++){
      local[0] += local[i];
   }
}
localBarrier();
// now local[0] contains the sum for all items in this group. Still does not help with global memory.
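Wired into an Aparapi kernel for your 'count the values below minAcc' case, that might look something like this (a sketch only: it assumes a single group covers the whole array, that returns, minAcc and len are final or fields of the enclosing class, and that the local buffer is declared with Aparapi's @Local annotation):

final int[] count = {0};
Kernel k = new Kernel(){
   @Local final int[] local = new int[256]; // one slot per work item in the group
   @Override public void run(){
      int lid = getLocalId(0);
      int gid = getGlobalId(0);
      // work items beyond the end of the data contribute 0
      local[lid] = (gid < len && returns[gid] < minAcc) ? 1 : 0;
      localBarrier(); // all writes to local[] are visible group-wide past this point
      if (lid == 0){
         int sum = 0;
         for (int i = 0; i < getLocalSize(0); i++){
            sum += local[i];
         }
         count[0] = sum; // safe: only one work item writes it
      }
   }
};
k.execute(Range.create(256, 256)); // global size == group size, so one group covers the data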

You need a reduction algorithm (prefix sum) to sum across global memory, which involves invoking the kernel multiple times.
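
A minimal sketch of the idea, summing a float array by folding the top half onto the bottom half on each pass (illustrative only; assumes the length is a power of two):

final float[] data = new float[256]; // values to sum; length assumed a power of two
final int[] stride = {0};
Kernel reduce = new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      data[i] += data[i + stride[0]]; // fold an element from the top half onto the bottom half
   }
};
for (int s = data.length / 2; s >= 1; s /= 2){
   stride[0] = s;     // picked up by the kernel on the next execute()
   reduce.execute(s); // one work item per pair being folded
}
// data[0] now holds the sum of the whole array
reduce.dispose();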


Then see if you can grok one of the many OpenCL implementations. Here is a nice one. 


The issue is that your algorithm is really not 'data parallel'.  It can be solved on the GPU (as the prefix sum above demonstrates), but it is not pretty or easy. 

Gary

David Wynter

Mar 26, 2013, 7:09:41 PM
to aparapi...@googlegroups.com
david@david-Q35M-S2:~/git/autoportfolio/aparapi/aparapi-read-only/com.amd.aparapi.jni$ ./cltest_x86_64
clGetPlatformIDs(0,NULL,&platformc) OK!
There is 1 platform
platform 0{
   CL_PLATFORM_VENDOR.."Advanced Micro Devices, Inc."
   CL_PLATFORM_VERSION."OpenCL 1.2 AMD-APP (1084.4)"
   CL_PLATFORM_NAME...."AMD Accelerated Parallel Processing"
   Platform 0 has 2 devices{
      Device 0{
         CL_DEVICE_TYPE..................... GPU (0x0) 
         CL_DEVICE_MAX_COMPUTE_UNITS........ 10
         CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS. 3
             dim[0] = 256
             dim[1] = 256
             dim[2] = 256
         CL_DEVICE_MAX_WORK_GROUP_SIZE...... 256
         CL_DEVICE_MAX_MEM_ALLOC_SIZE....... 536870912
         CL_DEVICE_GLOBAL_MEM_SIZE.......... 867172352
         CL_DEVICE_LOCAL_MEM_SIZE........... 32768
         CL_DEVICE_PROFILE.................. FULL_PROFILE
         CL_DEVICE_VERSION.................. OpenCL 1.2 AMD-APP (1084.4)
         CL_DRIVER_VERSION.................. 1084.4 (VM)
         CL_DEVICE_OPENCL_C_VERSION......... OpenCL C 1.2 
         CL_DEVICE_NAME..................... Capeverde
         CL_DEVICE_EXTENSIONS............... cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_amd_c1x_atomics 
      }
      Device 1{
         CL_DEVICE_TYPE..................... CPU (0x0) 
         CL_DEVICE_MAX_COMPUTE_UNITS........ 4
         CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS. 3
             dim[0] = 1024
             dim[1] = 1024
             dim[2] = 1024
         CL_DEVICE_MAX_WORK_GROUP_SIZE...... 1024
         CL_DEVICE_MAX_MEM_ALLOC_SIZE....... 2147483648
         CL_DEVICE_GLOBAL_MEM_SIZE.......... 8241225728
         CL_DEVICE_LOCAL_MEM_SIZE........... 32768
         CL_DEVICE_PROFILE.................. FULL_PROFILE
         CL_DEVICE_VERSION.................. OpenCL 1.2 AMD-APP (1084.4)
         CL_DRIVER_VERSION.................. 1084.4 (sse2)
         CL_DEVICE_OPENCL_C_VERSION......... OpenCL C 1.2 
         CL_DEVICE_NAME..................... Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
         CL_DEVICE_EXTENSIONS............... cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt 

David Wynter

Mar 26, 2013, 7:39:35 PM
to aparapi...@googlegroups.com
Gary,

I do grok it, very good example. My max array size is 180 btw, so my global size easily fits within the 256 group size. I'll give it a spin tomorrow after work. Thx for your help.

David

David Wynter

Mar 27, 2013, 2:20:26 PM
to aparapi...@googlegroups.com
Gary,

These Kernels execute inside a loop about a million times. I notice it does take a long time to set up the Kernel; are there examples of how to 'hoist' it out?

Thx,

David


gfrost

Mar 27, 2013, 2:53:07 PM
to aparapi...@googlegroups.com
Whilst most of our online demos show the kernel being created using an anonymous inner class

Kernel k = new Kernel(){
    // overloaded run()
};

Then execute using 
k.execute();

You should not create the kernel each time, but re-use it.

So instead of 

int buff[] = /* ... */;
for (/* millions of loops */){
    Kernel k = new Kernel(){
        // overloaded run() using buff[]
    };
    k.execute(range);
}

use
int buff[] = /* ... */;
Kernel k = new Kernel(){
    // overloaded run() using buff[]
};
for (/* millions of loops */){
    // change buff contents
    k.execute(range);
    // get computed content from buff
}

Each time execute() is called on a new kernel instance, it has to convert bytecode to OpenCL (this takes a few hundred millisecs), so we want to re-use the kernel.

Also, each time execute() is called on a new kernel instance, it creates a new command queue/context and buffers, which are not released until you call kernel.dispose(). So this might explain your 'out of memory' error.
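
So once the loop finishes, release the kernel's resources explicitly:

k.dispose(); // frees the context, command queue and buffers held by the kernel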

Gary