I agree this is a limitation ...
I often use a temporary shared array of size (nthread, output_size), with one row per thread, then perform the merge within that array.
Here is an example, a kind of histogram:
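    import numpy
    from cython cimport parallel

    # data, size1, size2 and numthreads are assumed to be defined earlier.
    cdef int i, j, s, threadid
    # One row of counters per thread; numpy.intc matches the C int of the memoryviews.
    cdef int[:, :] tmp = numpy.zeros((numthreads, size2), numpy.intc)
    cdef int[:] out = numpy.zeros(size2, numpy.intc)

    # Phase 1: each thread counts into its own row of tmp.
    for i in parallel.prange(size1, nogil=True):
        threadid = parallel.threadid()
        j = <int> data[i]
        tmp[threadid, j] += 1

    # Phase 2: merge the per-thread rows, one output bin per iteration.
    for j in parallel.prange(size2, nogil=True):
        s = 0
        for i in range(numthreads):
            s = s + tmp[i, j]
        out[j] += s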
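Notes:
* "s = 0" and "s = s + x" (rather than "s += x") enforce thread locality (i.e. s is not shared): an in-place operator would make Cython treat s as a reduction variable.
* "out[j] += s" accumulates into out, which is shared and not local; each j is handled by exactly one thread, so the writes do not collide.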
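
For readers who want to try the idea without compiling anything, here is a minimal pure-Python sketch of the same two-phase pattern (per-thread rows, then a column-wise merge). It is not the Cython code above: the function name histogram_two_phase and the strided work split are assumptions made for illustration, and because of the GIL it will not run faster than a plain loop. It only shows the data layout.

```python
from concurrent.futures import ThreadPoolExecutor


def histogram_two_phase(data, size2, numthreads):
    """Count occurrences of each bin index in `data` using per-worker rows.

    Hypothetical helper: mirrors the tmp/out layout of the Cython snippet.
    """
    # One private row per worker, like tmp[threadid, :] in the Cython code.
    tmp = [[0] * size2 for _ in range(numthreads)]

    def fill(threadid):
        # Assumed static split: worker `threadid` takes every
        # numthreads-th item and only writes to its own row,
        # so no locking is needed.
        row = tmp[threadid]
        for i in range(threadid, len(data), numthreads):
            row[int(data[i])] += 1

    with ThreadPoolExecutor(max_workers=numthreads) as pool:
        # list() waits for all workers and re-raises any exception.
        list(pool.map(fill, range(numthreads)))

    # Merge phase: each output bin is the sum of one column of tmp.
    out = [0] * size2
    for j in range(size2):
        s = 0
        for i in range(numthreads):
            s = s + tmp[i][j]
        out[j] += s
    return out
```

Whatever the number of workers, the merged result should equal a sequential histogram of the same data, which makes the pattern easy to check.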
This works well when there are a few cores: the overhead is large
with 2 cores or fewer, and the temporary array requires a lot of
memory when there are many cores.
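
To get a rough idea of the memory cost, the temporary array holds numthreads * size2 counters. A small sketch of that arithmetic follows; the helper name tmp_bytes is made up here, and the default of 4 bytes per counter assumes a 32-bit C int.

```python
def tmp_bytes(numthreads, size2, itemsize=4):
    """Size in bytes of the (numthreads, size2) temporary array.

    itemsize=4 is an assumption: a 32-bit C int per counter.
    """
    return numthreads * size2 * itemsize


# Example: 64 threads and one million bins.
footprint = tmp_bytes(64, 1_000_000)  # 256_000_000 bytes, i.e. 256 MB
```

With the same one million bins, 4 threads would need 16 MB, so the footprint grows linearly with the number of threads.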
Cheers,
--
Jérôme Kieffer
tel +33 476 882 445