How to allocate smart-pointer scratch space per parallel thread


David Meyer

Feb 17, 2026, 11:23:39 PM
to cython-users
Hi,

I'm trying to follow a specific example from the docs, namely allocating "scratch space" per thread, but the scratch I need to allocate is a smart pointer (from a third-party library, though one also written in Cython). In my ultimate use case I have wrapped a heavy numeric calculation where the pointer is to a C++ class that implements the calculation, including a number of sizeable internal array allocations (>100 kB), so being able to reuse the class within a thread would help avoid the excessive memory consumption of allocating fresh on every loop iteration.

The direct example is rather complex and domain-specific, so I've come up with a dummy example that illustrates the problem (everything below assumes Jupyter cell magic).

```
%%cython --cplus -a
# distutils: extra_compile_args = /std:c++20 /openmp
# cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np
cimport numpy as np
np.import_array()

from libcpp.memory cimport unique_ptr, make_unique

from cython.parallel cimport prange, parallel
from cython.operator cimport dereference

def parallel_loop_smart_pointer_test(double[::1] in_arr, double x):

    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i

    cdef out_arr = np.empty_like(in_arr)
    cdef double[::1] out_view = out_arr

    cdef unique_ptr[double] x_smart

    with nogil, parallel():

        x_smart = make_unique[double](x)

        try:
            for i in prange(in_size):
                out_view[i] = in_arr[i] + dereference(x_smart)

        finally:
            x_smart.reset()

    return out_arr
```
Compiling throws the error `error C2280: 'std::unique_ptr<double,std::default_delete<double>>::unique_ptr(const std::unique_ptr<double,std::default_delete<double>> &)': attempting to reference a deleted function`

Inspecting the generated C++, Cython is marking `x_smart` as `lastprivate` instead of `private` (meaning it is trying to automatically delete references in other threads?).

Question: is there some subtle incantation that allows using smart pointers in the per-thread scratch space?

Possible work-around I'd like to avoid: with some insight from a collaborator much more knowledgeable in Cython/C++, it is possible to work around this by chunking up the work manually and using prange to thread out the chunks. Modifying my above example looks like this
```
%%cython --cplus -a
# distutils: extra_compile_args = /std:c++20 /openmp
# cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np
cimport numpy as np
np.import_array()

from libcpp.memory cimport unique_ptr, make_unique

from cython.parallel cimport prange, parallel
from cython.operator cimport dereference

cdef void single_thread(double* in_slice, double* out_slice, Py_ssize_t slice_size, double x) noexcept nogil:

    cdef Py_ssize_t i

    cdef unique_ptr[double] x_smart
    x_smart = make_unique[double](x)

    for i in range(slice_size):
        out_slice[i] = in_slice[i] + dereference(x_smart)

def parallel_loop_smart_pointer_workaround(double[::1] in_arr, double x):

    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i

    cdef out_arr = np.empty_like(in_arr)
    cdef double[::1] out_view = out_arr

    cdef int num_threads = 2
    assert in_size % num_threads == 0, 'Dummy example, array input must be easily chunkable'
    cdef int chunk = in_size//num_threads

    for i in prange(num_threads, nogil=True, num_threads=num_threads, schedule='static'):
       
        single_thread(&in_arr[i*chunk], &out_view[i*chunk], chunk, x)

    return out_arr
```

While functional, I'd really rather not re-implement the scheduling logic by hand and would prefer the scratch-space method so I can use standard scheduling methods.

If you've made it this far, thanks for reading! Hopefully you are more experienced than I and know how to make it work.
-David

David Woods

Feb 18, 2026, 2:38:55 AM
to cython...@googlegroups.com
I'm not absolutely sure, but I don't think that `private` vs `lastprivate` is the problem. I think in both cases OpenMP will try to make a copy of the variable, and that this is what's failing: unique_ptr is moveable but not copyable.

I'd probably take a pointer to the smart pointer and then use that inside the loop instead. Obviously you don't get the lifetime management of the smart pointer within the loop, but it doesn't look like you really need it (and if you do then unique_ptr isn't quite right). Something like:

x_smart = make_unique[double](x)
p_x_smart = &x_smart


try:
    for i in prange(in_size):
        out_view[i] = in_arr[i] + dereference(dereference(p_x_smart))

finally:
    x_smart.reset()


This is untested but I'm relatively confident it should work.

da-woods

Feb 18, 2026, 2:52:16 AM
to cython...@googlegroups.com

Thinking about it more... that's probably not quite right. I think the copy happens at the end of the `with nogil, parallel` block, when it's trying to get one of the allocated pointers out of that block.

In that case, you probably just need to move the contents of that block into a function.

with nogil, parallel():
    allocate_scratch_and_loop(in_arr, out_view, ...)


You should still be able to put the `prange` inside that function so there should be no need to manually chunk it. All you're doing is moving the scratch allocation to be tightly scoped inside a function.


David Meyer

Feb 18, 2026, 11:13:08 AM
to cython-users
Thank you! I didn't realize you could put the prange into a function call within the parallel block. The function scoping solves the issue nicely.

Well... almost. I think I've encountered a bug in Cython where the code won't actually use threads unless I pass `use_threads_if=False` to the `parallel` context. Below is example code with the smart-pointer nonsense removed.

```
%%cython --cplus

# distutils: extra_compile_args = /std:c++20 /openmp
# cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np
cimport numpy as np
np.import_array()

from libcpp.memory cimport unique_ptr, make_unique

from cython.parallel cimport prange, parallel, threadid
from cython.operator cimport dereference

cdef void allocate_and_loop(double[::1] in_arr, double[::1] out_arr, Py_ssize_t arr_size, double x, int[::1] tids) noexcept nogil:

    cdef Py_ssize_t i

    for i in prange(arr_size):
        tids[i] = threadid()
        out_arr[i] = in_arr[i] + x

def parallel_loop_scoped_thread_control(double[::1] in_arr, double[::1] out_arr, double x, bint use_parallel):


    cdef Py_ssize_t in_size = in_arr.size
    cdef int[::1] tids = np.zeros(in_size, dtype=np.int32)

    print(f'Using threads: {use_parallel}')

    with nogil, parallel(use_threads_if=use_parallel):

        allocate_and_loop(in_arr, out_arr, in_size, x, tids)

    unique_ids = set(tids)
    print(f'Number of threads used {len(unique_ids)}')

    return out_arr

def parallel_loop_thread_control(double[::1] in_arr, double[::1] out_arr, double x, bint use_parallel):


    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i
    cdef int[::1] tids = np.zeros(in_size, dtype=np.int32)

    print(f'Using threads: {use_parallel}')

    with nogil, parallel(use_threads_if=use_parallel):

        for i in prange(in_size):
            tids[i] = threadid()
            out_arr[i] = in_arr[i] + x

    unique_ids = set(tids)
    print(f'Number of threads used {len(unique_ids)}')

    return out_arr
```

Calling the test functions as
```
big_in_arr = np.linspace(0, 15, 100000)
big_out_arr = np.empty_like(big_in_arr)
parallel_loop_thread_control(big_in_arr, big_out_arr, 3.0, use_parallel=True)
parallel_loop_scoped_thread_control(big_in_arr, big_out_arr, 3.0, use_parallel=False)
```
Gives the following output
```
Using threads: True
Number of threads used 16
Using threads: False
Number of threads used 16
```

As an aside, I learned from this (thanks to a helpful compile-time error from Cython) that `use_threads_if` (and `num_threads`, for that matter) needs to be specified on the parent parallel section, but that error is only thrown in the `parallel_loop_thread_control` function. If you make the same mistake in `parallel_loop_scoped_thread_control`, no error is thrown and compilation succeeds (you just don't get any threading regardless of the value of `use_threads_if`).

Is this an actual bug I should raise over on GitHub, or did I catastrophically misunderstand how this is supposed to work?
-David

da-woods

Feb 18, 2026, 1:55:07 PM
to cython...@googlegroups.com

I think I've given you bad advice.

It looks like nesting an OpenMP prange inside an OpenMP parallel only works if they actually are in the same function. The other detail is that `threadid()` doesn't look to be "absolute" but just a "relative" ID for the local block. So here's what's happening:

* with the "scoped" version and `use_threads_if=False`, the `parallel` block doesn't activate. The `prange` block in the separate function is independent and so does activate.
* with the "scoped" version and `use_threads_if=True`, the `parallel` block spins up many threads. The `prange` block detects that the program is already running in parallel, so it just becomes a `range` block that gets executed once for each thread.
* with the "unscoped" version, the `parallel` and `prange` blocks are linked at compile time, so `use_threads_if` controls both blocks at once.

I don't think this is something we can realistically fix in Cython just because it's largely dictated by OpenMP - at best we might be able to detect and warn about it.

Going back to your original problem, I believe a variation of "pointer to unique_ptr" does work:

```
cdef unique_ptr[double]* x_smart

with nogil, parallel():

    x_smart = new unique_ptr[double](move(make_unique[double](x)))

    try:
        for i in prange(in_size):
            out_view[i] = in_arr[i] + dereference(dereference(x_smart))

    finally:
        del x_smart
```

So it's just a standard C pointer that gets declared as firstprivate here.

It's possible that we could improve Cython here: this variable could just be `private`, because it isn't used after the parallel block, and that would probably fix the issue. I'm not sure if there's a good reason why we don't try to detect that.

David Meyer

Feb 19, 2026, 12:16:40 AM
to cython-users
Ah, that makes sense about the scoping of parallel and prange. It is interesting that the "scoped" version with `parallel(use_threads_if=False)` almost works: it compiles and even runs my basic example over threads (though maybe threadid is lying to me and I don't fully understand your comment about it). It does fail in my ultimate use case though, I believe because the actual object isn't read-only like in my dummy example. Having Cython detect this as `private` would be nice, but that is likely beyond my personal capability and I won't expect someone else to tackle this niche issue. Anyway, neither here nor there.

I am having issues with the "pointer to unique_ptr" approach though. It compiles, but when I run it I get a kernel crash. Below is my test code for completeness. My random attempts to fix it have not worked so far.
```
%%cython --cplus -a
# distutils: extra_compile_args = /std:c++20 /openmp
# cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np
cimport numpy as np
np.import_array()

from libcpp.memory cimport unique_ptr, make_unique
from libcpp.utility cimport move


from cython.parallel cimport prange, parallel, threadid
from cython.operator cimport dereference

def parallel_loop_smart_pointer_pointer(double[::1] in_arr, double x, bint use_parallel):


    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i

    cdef out_arr = np.empty_like(in_arr)
    cdef double[::1] out_view = out_arr

    cdef int[::1] tids = np.zeros(in_size, dtype=np.int32)

    cdef unique_ptr[double]* x_smart_ptr

    with nogil, parallel(use_threads_if=use_parallel):

        x_smart_ptr = new unique_ptr[double](move(make_unique[double](x)))


        try:
            for i in prange(in_size):
                tids[i] = threadid()
                out_view[i] = in_arr[i] + dereference(dereference(x_smart_ptr))

        finally:
            del x_smart_ptr

    unique_threads = set(tids)
    print(f'Used {len(unique_threads)} threads')

    return out_arr
```

da-woods

Feb 19, 2026, 3:47:20 AM
to cython...@googlegroups.com

On Windows, `<long long>dereference(x_smart_ptr).get()` is reporting 0, so somehow the pointer initialization is failing.

Factoring it out to:

```
cdef unique_ptr[double]* make_smart_ptr(double x) nogil:
    cdef unique_ptr[double] tmp = make_unique[double](x)
    return new unique_ptr[double](move(tmp))

...

x_smart_ptr = make_smart_ptr(x)
```


seems to work.

I have no immediate idea why though. On Linux the original code seems to work.

David Meyer

Feb 19, 2026, 5:28:57 PM
to cython-users
Would you look at that. MSVC ruining my day yet again.

Thank you for getting to the bottom of it on my behalf. Your solution works and is cross-platform, which is a nice bonus.

For completeness, I'll post the full, functioning dummy code, modified so that the scratch variable is actually written to in the loop to confirm operation. It's not the most amazing example in the world, but it does highlight the silliness of either what I'm trying to do or the approach we've had to resort to (maybe both).

```
%%cython --cplus -a
# distutils: extra_compile_args = /std:c++20 /openmp
# cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np
cimport numpy as np
np.import_array()

from libcpp.memory cimport unique_ptr, make_unique
from libcpp.utility cimport move

from cython.parallel cimport prange, parallel, threadid
from cython.operator cimport dereference

cdef unique_ptr[double]* make_smart_ptr(double x) nogil:
    cdef unique_ptr[double] tmp = make_unique[double](x)
    return new unique_ptr[double](move(tmp))

def parallel_loop_smart_pointer_pointer_readwrite(double[::1] in_arr, double x, bint use_parallel):


    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i

    cdef out_arr = np.empty_like(in_arr)
    cdef double[::1] out_view = out_arr

    cdef int[::1] tids = np.zeros(in_size, dtype=np.int32)

    cdef unique_ptr[double]* x_smart_ptr
    cdef double* x_ptr

    with nogil, parallel(use_threads_if=use_parallel):

        x_smart_ptr = make_smart_ptr(x)
        x_ptr = &dereference(dereference(x_smart_ptr))


        try:
            for i in prange(in_size):
                tids[i] = threadid()
                x_ptr[0] = x_ptr[0] + <double>i
                out_view[i] = in_arr[i] + dereference(dereference(x_smart_ptr)) - <double>i
                x_ptr[0] = x_ptr[0] - <double>i


        finally:
            del x_smart_ptr

    unique_threads = set(tids)
    print(f'Used {len(unique_threads)} threads')

    return out_arr
```

Now I'll see if I can port this over to my actual use case successfully. Thanks again for your help!
-David

David Meyer

Feb 25, 2026, 2:10:46 AM
to cython-users
If you don't mind, I have a follow-up question related to the unusual MSVC/OpenMP behavior. I've been able to implement all of the above in my actual use case, but ran into a lot of similar pointer-initialization issues around C++ vectors.

My ultimate use case, beyond the smart-pointer class, also uses a fair number of C++ vectors, which apparently exhibit the same pointer-initialization issues. Below is example code highlighting the problem.

```
%%cython --cplus
# distutils: extra_compile_args = /std:c++20 /openmp
# cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np
cimport numpy as np
np.import_array()

from libc.stdint cimport uintptr_t
from libcpp.vector cimport vector


from cython.parallel cimport prange, parallel, threadid

def parallel_vector_assignments(double[::1] in_arr, double x, bint use_parallel):


    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i

    cdef out_arr = np.zeros_like(in_arr)

    cdef double[::1] out_view = out_arr

    cdef int[::1] tids = np.zeros(in_size, dtype=np.int32)

    cdef vector[double] y0
    cdef double* in_ptr = &in_arr[0]

    cdef double y

    with nogil, parallel(use_threads_if=use_parallel, num_threads=2):

        for i in prange(in_size):
            y0 = vector[double](in_ptr, in_ptr + in_size)
            y = x
            if not y0.empty():
                tids[i] = threadid()
                out_view[i] = y0[i] + y
            else:
                with gil:
                    print(f'Loop {i}, thread {threadid()}: in_arr ptr {<uintptr_t>in_ptr}, y0.data() {<uintptr_t>y0.data()}, y {y}')
                    raise RuntimeError(f'Failed to copy in_arr to y0 vector on loop {i:d}')


    unique_threads = set(tids)
    print(f'Used {len(unique_threads)} threads')

    return out_arr

def nonparallel_vector_assignments(double[::1] in_arr, double x):


    cdef Py_ssize_t in_size = in_arr.size
    cdef Py_ssize_t i

    cdef out_arr = np.zeros_like(in_arr)

    cdef double[::1] out_view = out_arr

    cdef vector[double] y0
    cdef double* in_ptr = &in_arr[0]

    cdef double y

    with nogil:

        for i in range(in_size):
            y0 = vector[double](in_ptr, in_ptr + in_size)
            y = x
            if not y0.empty():
                out_view[i] = y0[i] + y
            else:
                with gil:
                    print(f'Loop {i}: in_arr ptr {<uintptr_t>in_ptr}, y0.data() {<uintptr_t>y0.data()}, y {y}')
                    raise RuntimeError(f'Failed to copy in_arr to y0 vector on loop {i:d}')

    return out_arr
```

Calling both functions, I get the following output
```
>>> in_arr = np.linspace(0, 15, 16)
>>> print(nonparallel_vector_assignments(in_arr, 3.0))
>>> print(parallel_vector_assignments(in_arr, 3.0, use_parallel=True))
[ 3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18.]
Loop 0, thread 0: in_arr ptr 2618412717168, y0.data() 0, y 3.0
Loop 8, thread 1: in_arr ptr 2618412717168, y0.data() 0, y 3.0
<snip the RuntimeError>
```

`parallel_vector_assignments` works fine on Linux with GCC. My MSVC comes from Visual Studio Community 2022 (17.14.27). My best guess is that MSVC's OpenMP implementation handles C++ objects poorly, and that marking things `firstprivate`/`lastprivate` causes the pointer-initialization issues in the parallel block, issues that don't extend to plain old C data like `y` above.

As a firm question: does anyone have deeper insight into this problem? This seems like a pretty glaring issue with MSVC. Picking through release notes, maybe this is related, but I can't quite tell whether current VS 2022 is still affected.
-David

da-woods

Feb 27, 2026, 5:01:11 PM
to cython...@googlegroups.com

I'm able to reproduce the problem in a pretty short bit of c++ code so I've reported it to Microsoft at https://developercommunity.visualstudio.com/t/Move-assignment-of-openmp-private-variab/11051965. From experience, they probably will fix it but not quickly.

A few (untested) suggestions for how to work around it with Cython:

* Define the C macro `CYTHON_USE_CPP_STD_MOVE` to 0. At least in my simple example, copying rather than moving doesn't trigger the problem, and initializing `y0` involves moving from a temporary variable. This will make some things perform worse, though.
* Use /std:c++17 - I don't see the problem with this set.
* Use the clang-cl compiler instead of MSVC. This won't work with setuptools unfortunately, which I think means it won't work with IPython, but I believe it will work with Meson as a build system. I'm not really able to provide useful support for this, but the idea of clang-cl is that it should be an almost drop-in replacement for MSVC.
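For the first suggestion, one way to define the macro from the Jupyter cell (assuming MSVC's `/D` define flag; GCC/clang would use `-DCYTHON_USE_CPP_STD_MOVE=0`) is to extend the compile args already in the cell header:

```
# distutils: extra_compile_args = /std:c++20 /openmp /DCYTHON_USE_CPP_STD_MOVE=0
```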

Hopefully some of that's helpful, but I'm afraid I don't have any definitive solution for how to get this to work reliably. So it may depend on how much time you want to spend on this...

David Meyer

Mar 2, 2026, 2:53:08 PM
to cython-users
I appreciate you looking into it more deeply. At least I'm not crazy and it does appear to be a bug. Fortunately, I can get around this in my ultimate use case by moving `y0` into the scratch-space defined per-thread (that I'm already using), which doesn't have these issues.

I have one final question, if you don't mind my troubling you once more. In my ultimate use case, which is working well at this point, I'd like to give the end user some runtime control over the parallelization, since the calculations I'm running tend to vary in character. Is there any way to let the user pass the default `num_threads=None` through to prange (or even schedule/chunksize, though I think I've convinced myself that schedule, at least, is not runtime-configurable)?

```
import numpy as np
from libc.math cimport sin
from cython.parallel cimport prange, threadid

def do_sine_args(double[:,:] input, bint use_threads=True, num_threads=None):
    cdef double[:,:] output = np.empty_like(input)
    cdef Py_ssize_t i, j

    cdef int[::1] tids = np.zeros(input.shape[0], dtype=np.int32)

    for i in prange(input.shape[0], nogil=True, use_threads_if=use_threads, num_threads=num_threads):
        tids[i] = threadid()
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])

    unique_threads = set(tids)
    print(f'Used {len(unique_threads)} threads')

    return np.asarray(output)
```

Though I could hard-code `prange(..., num_threads=num)`, passing a variable argument through doesn't work since prange is inside a nogil block, so I can't hand Python objects to the prange arguments. But typing the argument as an int obviously means I can't use None to signal the default behavior of letting OpenMP decide. Am I missing something obvious, or do I need to re-implement OpenMP's default logic myself to keep None as an input option for the user? Assuming the latter, do you know the recipe for choosing the same default number of threads as OpenMP? On my dev machine it is the number of logical CPUs, but I'm not sure how universal that is.

-David

da-woods

Mar 3, 2026, 3:40:53 PM
to cython...@googlegroups.com

On 02/03/2026 18:47, David Meyer wrote:
> I have one final question, if you don't mind my troubling you one
> final time. In my ultimate use case, which is working well at this
> point, I'd like to give the end user some runtime control over the
> parallelization since the calculations I'm running tend to have
> varying character. Is there any way to allow the user to specify the
> default `num_threads=None` to prange (or even schedule/chunksize,
> though I think I've convinced myself that schedule at least is not
> runtime configurable)?

There isn't a way to do this in Cython. You aren't the first person to
ask for it, but I can't see a good way to implement it in OpenMP.  I
think the best we could do is automatically duplicate the parallel block
with and without the num_threads.

> On my dev machine, it is number of logical cpus, but not sure how
> universal that is.
That's my impression of what it generally is too. From the documentation
of OpenMP it appears like it's implementation-defined though rather than
having a universal value.
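A Python-level sketch of the fallback David describes: resolve `None` to a concrete count before the loop, then always pass a typed int to `prange(..., num_threads=n)`. Using `os.cpu_count()` as a stand-in for the OpenMP default is an assumption here, since as noted above the real default is implementation-defined:

```python
import os

def resolve_num_threads(num_threads=None):
    # None means "let OpenMP decide"; approximate that with the number of
    # logical CPUs, which common OpenMP runtimes default to.
    if num_threads is None:
        return os.cpu_count() or 1  # cpu_count() itself can return None
    return int(num_threads)
```

The resolved value can be assigned to a `cdef int` before entering the nogil section, so no Python object ever has to reach the prange arguments.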