Declare threadlocal storage?


Brendan O'Connor

Feb 25, 2012, 1:58:58 AM
to cython...@googlegroups.com
Hi,

Is there a way to declare threadlocal storage when using prange()?  The wiki page [1] mentions a "threadlocal" command, but I can't get it to work -- is it unimplemented?

[1] http://wiki.cython.org/enhancements/prange


My problem is, I need to increment a local variable inside a loop, then use the variable in that same loop -- but using an in-place increment operator automatically makes it a reduction, so it can't be seen.  Example code:

cdef double psum
for ii in prange(start,end, nogil=True):
    for kk in range(K):  psum += (.....)
    sample = draw_sample(..., psum, ....)


Do I need to do something like, create a 'psum' array with a different value per thread?

Thanks, and sorry if I'm missing something obvious --
Brendan
--

mark florisson

Feb 25, 2012, 9:57:20 AM
to cython...@googlegroups.com
On 25 February 2012 06:58, Brendan O'Connor <bren...@gmail.com> wrote:
> Hi,
>
> Is there a way to declare threadlocal storage when using prange()?  The wiki
> page [1] mentions a "threadlocal" command, but I can't get it to work -- is
> it unimplemented?
>
> [1] http://wiki.cython.org/enhancements/prange
>
> My problem is, I need to increment a local variable inside a loop, then use
> the variable in that same loop -- but using an in-place increment operator
> automatically makes it a reduction, so it can't be seen.  Example code:
>
> cdef double psum
> for ii in prange(start,end, nogil=True):
>     for kk in range(K):  psum += (.....)
>     sample = draw_sample(..., psum, ....)
>
>
> Do I need to do something like, create a 'psum' array with a different value
> per thread?

Unfortunately, yes. What we should have is block-local declarations, like

for i in prange(...):
    cdef double psum = ...

to declare something truly private and not a reduction. Currently, you
will need to do something horrible like this:

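from cython.parallel cimport parallel, prange, threadid
from libc.stdlib cimport malloc, free
cimport openmp

cdef Py_ssize_t i, j
cdef double *psum, *sum

# One slot per thread, spaced 32 doubles apart (see the note below).
psum = <double *> malloc(sizeof(double) * openmp.omp_get_max_threads() * 32)
with nogil, parallel():
    # Assigned inside the parallel block, so each thread gets its own pointer.
    sum = psum + 32 * threadid()
    for i in prange(m):
        sum[0] = 0
        for j in range(n):
            # In-place update through a pointer, so no reduction is inferred.
            sum[0] += f(j)

        func(..., sum[0], ...)

free(psum)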

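The multiplication by 32 is to avoid false sharing: each thread's slot
starts 32 doubles (256 bytes) after the previous one, so two slots never
share a cache line (assuming your cache lines aren't bigger than 256
bytes). That is another reason why block-local declarations would be
much nicer here.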
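
Another way to get a private accumulator, not the approach described above, is to move the inner loop into a small nogil helper function. The accumulator is then an ordinary C local of that function, and the prange body only does a plain assignment, which Cython treats as thread-private and not as a reduction. This is a minimal, untested sketch: the names row_sum and doubled_row_sums, the memoryview arguments, and the "multiply by 2" use of the sum are hypothetical stand-ins for the elided parts of the original example.

```cython
from cython.parallel import prange

cdef double row_sum(double[:, :] a, Py_ssize_t i) nogil:
    # psum is a plain C local here, so every call (and therefore
    # every thread) works on its own copy.
    cdef double psum = 0.0
    cdef Py_ssize_t k
    for k in range(a.shape[1]):
        psum += a[i, k]
    return psum

def doubled_row_sums(double[:, :] a, double[:] out):
    # Hypothetical driver: out[i] = 2 * sum(a[i, :]).
    cdef Py_ssize_t i
    cdef double psum
    for i in prange(a.shape[0], nogil=True):
        # Plain assignment (not +=): psum is private to the thread,
        # so it can be read later in the same iteration.
        psum = row_sum(a, i)
        out[i] = 2.0 * psum
```

The module has to be compiled with OpenMP enabled (for example -fopenmp with GCC), otherwise prange runs as an ordinary sequential loop.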

Dag Sverre Seljebotn

Feb 25, 2012, 10:06:12 AM
to cython...@googlegroups.com
On 02/24/2012 10:58 PM, Brendan O'Connor wrote:
> Hi,
>
> Is there a way to declare threadlocal storage when using prange()? The
> wiki page [1] mentions a "threadlocal" command, but I can't get it to
> work -- is it unimplemented?
>
> [1] http://wiki.cython.org/enhancements/prange
>
> My problem is, I need to increment a local variable inside a loop, then
> use the variable in that same loop -- but using an in-place increment
> operator automatically makes it a reduction, so it can't be seen.
> Example code:
>
> cdef double psum
> for ii in prange(start,end, nogil=True):
>     for kk in range(K): psum += (.....)
>     sample = draw_sample(..., psum, ....)
>
>
> Do I need to do something like, create a 'psum' array with a different
> value per thread?

Another approach *may* work in your case:

with nogil, parallel():
    for i in range(...):  # NOTE: range, not prange
        ...

Dag

mark florisson

Feb 25, 2012, 10:23:39 AM
to cython...@googlegroups.com
On 25 February 2012 15:06, Dag Sverre Seljebotn

You'd have to adjust the loop bounds yourself to implement work sharing
in that case, and the in-place operator would still specify a reduction.
In fact, I get the error 'Reductions not allowed for parallel blocks',
although I think they should be allowed (at some point it was
considered "too much magic" for some reason?).
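
A rough sketch of what adjusting the loop bounds by hand could look like: each thread computes its own [lo, hi) slice of the index range from its thread id and then runs a plain range loop over it. The accumulator is updated with an ordinary assignment instead of +=, so no reduction is requested. This is untested and rests on assumptions: the function name doubled_row_sums, the memoryview arguments, and the even block split are all illustrative, and the split gives static scheduling only.

```cython
cimport cython
cimport openmp
from cython.parallel import parallel, threadid

@cython.cdivision(True)  # plain C division, so no zero check is needed in nogil code
def doubled_row_sums(double[:, :] a, double[:] out):
    # Hypothetical example: out[i] = 2 * sum(a[i, :]).
    cdef Py_ssize_t m = a.shape[0], K = a.shape[1]
    cdef Py_ssize_t lo, hi, i, k
    cdef Py_ssize_t tid, nthreads
    cdef double psum
    with nogil, parallel():
        # Everything assigned inside the parallel block is thread-private.
        tid = threadid()
        nthreads = openmp.omp_get_num_threads()
        # Manual work sharing: contiguous, nearly equal blocks of rows.
        lo = m * tid // nthreads
        hi = m * (tid + 1) // nthreads
        for i in range(lo, hi):      # range, not prange
            psum = 0.0
            for k in range(K):
                psum = psum + a[i, k]   # assignment, not an in-place +=
            out[i] = 2.0 * psum
```

Because `cimport openmp` pulls in omp.h, this variant only builds with OpenMP support enabled.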

Alistair Muldal

Mar 7, 2014, 8:57:42 PM
to cython...@googlegroups.com
That was a very useful example. Could a similar approach be used to create large thread-local arrays for cache purposes? I'm trying to multithread a more complicated reduction that involves generating large intermediate arrays. I really hope that explicitly thread-local variables are still on the roadmap for development, as they would save me a lot of pain for this case!

Alistair
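
For larger per-thread scratch arrays, one pattern that appears to fit is to allocate the buffer inside the parallel block, so that each thread owns a separate malloc'ed array, and to free it before the block ends. The sketch below is untested and only illustrative: the function name row_square_sums, the memoryview arguments, and the "square, then sum" computation are hypothetical placeholders for a real intermediate result.

```cython
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free, abort

def row_square_sums(double[:, :] a, double[:] out):
    # Hypothetical example: out[i] = sum(a[i, k] ** 2 for k).
    cdef Py_ssize_t m = a.shape[0], n = a.shape[1]
    cdef Py_ssize_t i, k
    cdef double *scratch
    with nogil, parallel():
        # Assigned inside the parallel block: one private buffer per thread.
        scratch = <double *> malloc(sizeof(double) * n)
        if scratch == NULL:
            abort()
        for i in prange(m):
            # Fill the thread's own intermediate array ...
            for k in range(n):
                scratch[k] = a[i, k] * a[i, k]
            # ... then consume it; indexed in-place updates are not reductions.
            out[i] = 0.0
            for k in range(n):
                out[i] += scratch[k]
        free(scratch)
```

Since each buffer comes from its own malloc call, no manual padding between threads is written out here, unlike the shared psum array earlier in the thread.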