Best way to convert numpy array to C++ vector?


Will Mayner

Feb 27, 2014, 7:03:49 PM
to cython...@googlegroups.com
Hi,

What is the best way to convert a numpy array to a C++ std::vector?

I'm trying to wrap a C++ library for computing the Earth Mover's Distance in Cython so that it can be used to compute the EMD of two numpy arrays from normal Python code, and the library works with C++ vectors. From looking around here I've found that I can probably hack it together by
  • first converting the numpy array to a C-style array by passing the data pointer, as described here;
  • then converting that to a vector, as described here
But I was wondering if there is a better (easier to write, more efficient) way. Any ideas/comments would be greatly appreciated. Thanks!


— Will
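
For reference, a minimal C++ sketch of the second step described above, assuming the data pointer and element count have already been obtained from the numpy array on the Cython side. The helper name copy_to_vector is hypothetical and not part of any library, and the result is always a copy of the data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy `length` doubles starting at `data` into a new
// std::vector. The vector owns its own storage, so the source buffer (for
// example a numpy array's data) can be modified or released afterwards.
std::vector<double> copy_to_vector(const double* data, std::size_t length) {
    // The iterator-range constructor accepts raw pointers as iterators.
    return std::vector<double>(data, data + length);
}
```

Because the vector owns a separate allocation, later changes to the numpy array are not reflected in it.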

Robert Bradshaw

Feb 28, 2014, 1:27:00 AM
to cython...@googlegroups.com
Yep, that'd be the way to do it.

- Robert

Will Mayner

Feb 28, 2014, 4:28:41 PM
to cython...@googlegroups.com
Hi Robert,

Tried to implement it as shown on StackOverflow, and it seems that the call to assign is being treated like (1) in the list below when it should be treated like (2), with (pointer, pointer + length) as a range rather than (count, initial value).

(1) void assign( size_type count, const T& value );
(2) template< class InputIt >
    void assign( InputIt first, InputIt last );
(3) void assign( std::initializer_list<T> ilist );


Here's the full error:

Error compiling Cython file:
------------------------------------------------------------
...

cdef vector[double] _c_array_long_to_vector(double* array, int length):
    cdef vector[double] output_vector
    output_vector.reserve(length)
    output_vector.assign(array, array + length)
                             ^
------------------------------------------------------------

cyemd/emd.pyx:58:30: Cannot assign type 'double *' to 'size_t'

Error compiling Cython file:
------------------------------------------------------------
...

cdef vector[double] _c_array_long_to_vector(double* array, int length):
    cdef vector[double] output_vector
    output_vector.reserve(length)
    output_vector.assign(array, array + length)
                                     ^
------------------------------------------------------------

cyemd/emd.pyx:58:38: Cannot assign type 'double *' to 'double'
 

How should I get the pointer to be treated like an iterator as described in the SO answer? Apologies if I'm missing something obvious here; I'm new to Cython and C++.


– Will
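
For reference, the difference between overloads (1) and (2) can be seen in plain C++. This is only a sketch of the C++ semantics, with hypothetical function names (filled, from_range); it does not show the Cython-side declaration.

```cpp
#include <cstddef>
#include <vector>

// Overload (1): the vector becomes `count` copies of `value`.
std::vector<double> filled(std::size_t count, double value) {
    std::vector<double> v;
    v.assign(count, value);
    return v;
}

// Overload (2): the vector becomes a copy of the range [first, first + length).
std::vector<double> from_range(const double* first, std::size_t length) {
    std::vector<double> v;
    // assign() sizes the vector itself, so a prior reserve() is not required.
    v.assign(first, first + length);
    return v;
}
```

In C++ the compiler selects (2) for two pointer arguments because both have the same type and the template parameter InputIt can be deduced from them.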

Robert Bradshaw

Feb 28, 2014, 5:21:19 PM
to cython...@googlegroups.com


--
 
---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Will Mayner

Feb 28, 2014, 5:34:27 PM
to cython...@googlegroups.com
Awesome, compiles fine now. Kudos!


– Will

pmarini

Apr 9, 2017, 12:53:34 AM
to cython-users
Hello,
is this the way to do the conversion as of today?
I'm wondering whether there exists a way that avoids data copies.
Thanks,
Pietro

Robert Bradshaw

Apr 9, 2017, 1:05:12 AM
to cython...@googlegroups.com
C++ std::vectors don't play well with sharing their memory management [1], so a copy is the most straightforward approach. You could possibly use memcpy if the numpy array is C-contiguous and you're using a modern enough [2] C++ library, though of course the compiler may do that for you. On the other hand, a vector of vectors is a particularly poor representation of 2-d data and isn't even stored the same way in memory as a 2d numpy (or C) array.

[1] Unless, of course, you want to write your own allocator to pass as the second template argument. 
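
A minimal sketch of the memcpy variant mentioned above, assuming a C-contiguous buffer of doubles and a C++11 (or later) standard library so that std::vector::data() is available. The helper name memcpy_to_vector is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy a C-contiguous buffer of `length` doubles into a
// vector with a single memcpy. This is valid because double is trivially
// copyable and std::vector stores its elements contiguously.
std::vector<double> memcpy_to_vector(const double* data, std::size_t length) {
    std::vector<double> out(length);  // allocates and zero-fills `length` slots
    if (length > 0) {
        std::memcpy(out.data(), data, length * sizeof(double));
    }
    return out;
}
```

The range constructor std::vector<double>(data, data + length) gives the same result without the initial zero-fill, and may well compile down to a similar bulk copy.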


Kevin Thornton

Apr 9, 2017, 2:14:14 PM
to cython-users
IIRC, using a custom allocator to do this is undefined behavior.  (It is also hard to get right.) The reason is that the vector is able to reallocate for new objects, delete stuff, etc., resulting in the shared memory being moved around.

You can use custom vector classes that are read-only.  For example, inherit privately from std::vector with your allocator and only expose a read-only API.  Getting that set up is straightforward, but tedious.

A perhaps simpler solution is to use the pointer to data in the numpy array as the entry point for a GSL matrix, which can be constructed as a "view" on pre-allocated memory.

If you want to use std::vector and then act on it as if it were a 2d array, copy the numpy data into the vector and then get a GSL view via std::vector<T>::data().

The relevant GSL functions are here:
https://www.gnu.org/software/gsl/manual/html_node/Matrix-views.html#Matrix-views


You can use them in Cython via CythonGSL.

For types other than double, GSL provides the same matrix API.  gsl_matrix_int, gsl_matrix_float, _char, _short, etc., in separate headers.  That isn't totally documented, but there's a 1-to-1 matching of the gsl_matrix API with _type suffixed on there.

pmarini

Apr 10, 2017, 1:13:42 AM
to cython-users
Hello,
thank you for the reply. I'll have a look at the methods you propose.

Robert Bradshaw

Apr 10, 2017, 12:04:11 PM
to cython...@googlegroups.com
On Sun, Apr 9, 2017 at 11:14 AM, Kevin Thornton
<kevin.tho...@gmail.com> wrote:
> IIRC, using a custom allocator to do this is undefined behavior. (It is
> also hard to get right.) The reason is that the vector is able to reallocate
> for new objects, delete stuff, etc., resulting in the shared memory being
> moved around.

For sure using a custom allocator is not for the faint of heart, which
is why I suggested it as a footnote.

> You can use custom vector classes that are read-only. For example, inherit
> privately from std::vector with your allocator and only expose a read-only
> API. Getting that set up is straightforward, but tedious.
>
> A perhaps simpler solution is to use the pointer to data in the numpy array
> as the entry point for a GSL matrix, which can be constructed as a "view" on
> pre-allocated memory.
>
> If you want to use std::vector and then act on it as if it were a 2d array,
> copy the numpy data into the vector and then get a GSL view via
> std::vector<T>::data().

Note however that the lifetime of the pointer is bound by the lifetime
of the vector.

> The relevant GSL functions are here:
> https://www.gnu.org/software/gsl/manual/html_node/Matrix-views.html#Matrix-views
>
> You can use them in Cython via CythonGSL.
>
> For types other than double, GSL provides the same matrix API.
> gsl_matrix_int, gsl_matrix_float, _char, _short, etc., in separate headers.
> That isn't totally documented, but there's a 1-to-1 matching of the
> gsl_matrix API with _type suffixed on there.

I'm not sure how GSL helps here. C arrays and GSL and NumPy and Cython
MemoryViews all play well together, but they all have the same issues
with C++ std::vector. If the API of your library you're trying to use
uses C++ vectors, you have to copy (or do something tricky with
allocators).

Kevin Thornton

Apr 11, 2017, 3:57:09 PM
to cython-users


On Monday, April 10, 2017 at 9:04:11 AM UTC-7, Robert Bradshaw wrote:


> I'm not sure how GSL helps here. C arrays and GSL and NumPy and Cython
> MemoryViews all play well together, but they all have the same issues
> with C++ std::vector. If the API of your library you're trying to use
> uses C++ vectors, you have to copy (or do something tricky with
> allocators).
 
As an alternative to vector<vector<double>>.  But yes, a copy is still needed.

Robert Bradshaw

Apr 11, 2017, 10:59:43 PM
to cython...@googlegroups.com
The constraint is in what the C++ library takes and returns. It might
only speak nested vectors. (Of course changing the library itself
might be an option, and in this case a good one.)

pmarini

Apr 14, 2017, 4:16:02 PM
to cython-users
Thank you both for the useful answers and comments

I'm currently using a vector<vector<double>> implementation of a matrix in the library I want to wrap in Cython, hence my question.
I can use another data container, such as a 1D C array with manual index arithmetic, instead of a vector<vector<double>>. At this point the requirements are to be able to view the data from Python as NumPy arrays (probably via Cython typed memoryviews) and not to affect the performance of the computations when the same functions are called from C++.

So now the question is: what is the best data structure? You mentioned GSL, C arrays, Armadillo comes to my mind, etc.

In addition, a linked question: Robert, you wrote that "a vector of vectors is a particularly poor representation of 2-d data". Can you please elaborate on that? Why is it so? What are the alternatives?

Darsh Ranjan

Apr 14, 2017, 6:16:17 PM
to cython...@googlegroups.com
On 04/14/2017 08:45 AM, pmarini wrote:
> Thank you both for the useful answers and comments
>
> I'm currently using a vector<vector<double>> implementation of a matrix
> in the library I want to wrap in Cython, hence my question.
> I can use another data container, such as a 1D C array and play with
> indexes, instead of a vector<vector<double>>. At this point the
> requirements are to be able to view the data from python Numpy arrays
> (probably via cython typed memoryview) and do not affect the performance
> of the computations, if calling the same functions from C++.
>

Regarding 1D C arrays and 2D indexing, several C++ libraries include
functionality that makes this pretty convenient. For example, Eigen3
has the "Map" type that allows you to treat a block of memory like a
matrix, with the numbers of rows and columns specified either at
compile-time (as template arguments) or at run-time.

> So now the question is: what is the best data structure? You mentioned
> GSL, C arrays, Armadillo comes to my mind, etc.
>

Pointers to contiguous blocks of memory are pretty easy to deal with on
all sides, and that's usually the route I've taken when interfacing
between C or C++ and Cython. Something like this:


[_cython/wrapper.pxd]
cdef extern from "../src/cpp_code.hpp":
    void cpp_function(const double *input, double *output,
                      long num_rows, long num_cols)

[_cython/wrapper.pyx]
import numpy

cdef double [:, ::1] cython_wrapper(double [:, ::1] input_):
    cdef long num_rows = input_.shape[0]
    cdef long num_cols = input_.shape[1]
    cdef double [:, ::1] output = numpy.empty((num_rows, num_cols))
    # Pass the address of the first element of each C-contiguous memoryview
    # (this assumes both arrays are non-empty).
    cpp_function(
        &input_[0, 0],
        &output[0, 0],
        num_rows,
        num_cols)
    return output

[src/cpp_code.cpp]
#include <Eigen/Core>

using Eigen::Map;
using Eigen::Matrix;
using Eigen::RowMajor;
using Eigen::Dynamic;

void cpp_function(const double *input, double *output,
                  long num_rows, long num_cols)
{
    Map<const Matrix<double, Dynamic, Dynamic, RowMajor>>
        input_map(input, num_rows, num_cols);
    Map<Matrix<double, Dynamic, Dynamic, RowMajor>>
        output_map(output, num_rows, num_cols);
    output_map = input_map*5; // just some arbitrary computation
}


Note 1: this code is for illustrative purposes. It hasn't been tested
and likely won't work as-is.

Note 2: I've used Eigen here on the C++ side, which provides linear
algebra functionality and allows wrapping bare pointers. GSL, which was
mentioned before, also provides those. I don't know in detail what are
the tradeoffs between those (and other similar) packages, except that
code that uses GSL tends to be a bit more verbose and harder to read
than the equivalent code using Eigen.

pmarini

Apr 19, 2017, 6:13:10 AM
to cython-users, darsh....@here.com
Hello Ranjan,
I didn't know Eigen, it looks amazing, I'll check it out. Thanks!

Yes, I think that pointers to contiguous blocks of memory, as you mentioned, is the way to go.

In my library the main interactions with the matrix are not linear algebra methods but mainly column and row selections so that all I need is a fast access interface. For this reason I wouldn't add a dependency such as Eigen, instead I would use bare C arrays, provided that the performance for these operations is comparable with Eigen's.
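
A rough sketch of row and column selection on a bare row-major (C-contiguous) block, using only the standard library. The function names row_ptr and copy_column are hypothetical, and no claim is made here about how this compares with Eigen in speed.

```cpp
#include <cstddef>
#include <vector>

// Row i of a row-major matrix is itself contiguous, so a pointer into the
// buffer is enough to select it; nothing is copied.
const double* row_ptr(const double* data, std::size_t num_cols, std::size_t i) {
    return data + i * num_cols;
}

// Column j is strided: consecutive elements are num_cols apart, so this
// sketch gathers them into a new vector.
std::vector<double> copy_column(const double* data, std::size_t num_rows,
                                std::size_t num_cols, std::size_t j) {
    std::vector<double> col(num_rows);
    for (std::size_t i = 0; i < num_rows; ++i) {
        col[i] = data[i * num_cols + j];
    }
    return col;
}
```

Row access is a single pointer offset, while column access touches one element per row, so the choice between row-major and column-major layout matters if one kind of selection dominates.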

Krishna Bhogaonker

Apr 20, 2017, 8:52:01 PM
to cython-users
You know, I have been looking for a way to get the data from a 2D numpy array into a C++ array, so that I can build a std::vector<std::vector<double>> as well. I read through the tutorial, but I still was not clear on how C++ receives the data from the Numpy C pointer. Note I need a dynamically sized C++ array, which it seems the OP also needs, since the end goal is a dynamically sized std::vector container. The problem is that I am having a ridiculous time trying to figure out how to dynamically allocate a 2D array, either a regular C++ array or a std::array, in which to catch the numpy data. Seems like whenever you want to dynamically allocate a 2-D array in C++ you have to keep one of the dimensions fixed, which kills the dynamic aspect.

I think Chris had a post on this from a couple of years ago. But is the best way to deal with dynamic sizing on the C++ side still to use a 1-D "strided" version of the 2D array? I can create a dynamically sized 1D array in C++ and then essentially copy the data from the Numpy C pointer into the C++ heap. Then I could take that 1-D C++ array and build a 2-D vector out of it. 

Is that the way to do it? I keep looking around for a 2-D Numpy to 2-D C++ marriage, but perhaps there is none? Or better still I probably just have such weak C++ skills that I would not know the answer if I saw it :)
Krishna
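
A minimal sketch of the approach described above: treat the numpy data as a flat row-major block and copy it row by row into a nested vector. The function name to_nested is hypothetical, and the pointer and dimensions are assumed to come from a C-contiguous array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: build a vector of rows from a flat row-major buffer.
// Both dimensions are runtime values; every element is copied.
std::vector<std::vector<double>> to_nested(const double* data,
                                           std::size_t num_rows,
                                           std::size_t num_cols) {
    std::vector<std::vector<double>> out;
    out.reserve(num_rows);
    for (std::size_t i = 0; i < num_rows; ++i) {
        const double* row = data + i * num_cols;
        // Construct each inner vector directly from the row's pointer range.
        out.emplace_back(row, row + num_cols);
    }
    return out;
}
```

No intermediate 1-D C++ array is needed, since the inner vectors can be constructed straight from the source pointer.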

Robert Bradshaw

Apr 20, 2017, 9:18:18 PM
to cython...@googlegroups.com
What it really boils down to is the issue of memory management in C++ (which touches almost every aspect of how C++ is structured). Generally it's an advantage that C++ tries to shy away from operating on bare pointers (whose ownership could be unclear) but in this case you *want* to treat the entire 2d array as a single block of memory whose ownership is shared (or, more likely, controlled in one place and everyone else has a view) rather than controlled by the lifetime of the vectors. With vector<vector<T>> the outer vector owns the memory of the inner vector structures, and the inner vectors own the memory of the elements. What's more, getting rows in and out is often a copy (unless you have a vector of pointers to vectors, which leaves you handling those allocations/deallocations manually while still not giving control over the elements' allocation). Finally, to avoid copies when sharing with other libraries you need to allocate the vector at the right size, grab its data field to share with others, and never resize it or let it go out of scope until all other references are gone.

If you really have a collection of lists of varying lengths, and copying to get the data in and out is OK, then a vector of vectors might be fine. If you can't avoid it, copy the data in, call your C++ library, and copy it back out. But if you have a choice for 2D data a strided C pointer is much nicer (and works well with Python/Numpy and most C libraries, as well as some numeric C++ ones). 
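
A sketch of the "copy it back out" step, assuming the library returns a vector of vectors whose rows all have the same length. The function name flatten is hypothetical. The result is one contiguous row-major block whose data() pointer can be handed to numpy-style code for as long as the vector stays alive and is not resized.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: concatenate the rows of a nested vector into a single
// contiguous buffer. If all rows have equal length the result is row-major.
std::vector<double> flatten(const std::vector<std::vector<double>>& nested) {
    std::size_t total = 0;
    for (const auto& row : nested) {
        total += row.size();
    }
    std::vector<double> flat;
    flat.reserve(total);  // one allocation for the whole block
    for (const auto& row : nested) {
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```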





Krishna Bhogaonker

Apr 21, 2017, 2:13:26 AM
to cython-users
Wow, thanks so much for your help Bob. Your message sheds some light on these deep technical points. Looking through a million stackexchange posts just left me more confused about what was possible and the most efficient and practical approach to solving my problem. I am working with some C++ computational topology libraries that rely on CGAL. Since CGAL points are their own container type, I think a copy is unavoidable no matter what I do. 

I am not too concerned about the performance hit, so it should be okay. I will keep in mind your point for the strided C pointers too when dealing with the output of my computation :). 
Krishna