Storing an array of numbers within a struct AND getting a numpy dtype from the struct


Adam Li

Mar 14, 2022, 3:53:00 PM
to cython-users
Hi,

Problem Description:

I'm currently working on something within scikit-learn, and trying the following:
  1. augment a struct to store an array of numbers (e.g. floats and ints). I use a "vector[float]" as the storage for the data, for efficient access and construction.
  2. get the NumPy dtype for the struct, so that I can do some error checking and comparisons of datatypes.
I have the following example code:

> cdef struct Node:
>     int I
>     double k
>     vector[float] weights  # weights of the vector
>     vector[int] indices    # indices of the features
>
> # for reference, I have functions that use Node in the following way and do something with this array of numbers
> # e.g. this function does some computation to get the array of numbers
> cdef get_node_vals():
>     cdef vector[float] weights = get_weights(...)
>     cdef vector[int] indices = get_indices(...)
>
> # this function then uses them and also stores the numbers within a Node struct
> cdef do_something(vector[float] weights, vector[int] indices):
>     node.weights = weights
>     node.indices = indices
>
> # I need to get the DTYPE of "Node".
> cdef Node dummy;
> NODE_DTYPE = np.asarray(<Node[:1]>(&dummy)).dtype

However, when I try to get the "dtype" of Node, it results in an error:

> Error compiling Cython file:
> ------------------------------------------------------------
> ...
> cdef Node dummy;
> NODE_DTYPE = np.asarray(<Node[:1]>(&dummy)).dtype
>                         ^
> ------------------------------------------------------------
> Invalid base type for memoryview slice: Node

Note: If I don't have the "vector" within the Node,

> cdef Node dummy;
> NODE_DTYPE = np.asarray(<Node[:1]>(&dummy)).dtype

compiles without error. 

Is it possible to store an array of numbers in a standard data structure like a vector inside the "Node" struct, while still being able to get a dtype for it?

da-woods

Mar 15, 2022, 4:06:03 AM
to cython...@googlegroups.com
I don't believe that what you describe will work in a memoryview. They require a fixed data layout with known dimensions. Your "struct containing a vector of floats and a vector of ints" goes through a layer of indirection to subcomponents of varying size.

There simply isn't a numpy dtype representation of that. It would not be possible to create a Numpy structured array that represents that, and so trying to do it in Cython won't work either.
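To make the fixed-layout requirement concrete, here is a minimal NumPy sketch. The value of `n_cols` is made up for illustration. A structured dtype can only describe fields whose size is known when the dtype is built, so every record has one constant `itemsize`; a `std::vector`, whose elements live in separately allocated memory of varying length, has no such description.

```python
import numpy as np

n_cols = 3  # hypothetical value, chosen only for this illustration

# Every field has a size that is known when the dtype is built,
# so each record occupies the same number of bytes.
fixed = np.dtype([
    ("I", np.intc),
    ("k", np.float64),
    ("weights", np.float32, (n_cols,)),
    ("indices", np.intc, (n_cols,)),
])

# With the default packed layout the field sizes simply add up.
expected_size = (
    np.dtype(np.intc).itemsize
    + np.dtype(np.float64).itemsize
    + n_cols * np.dtype(np.float32).itemsize
    + n_cols * np.dtype(np.intc).itemsize
)
print(fixed.itemsize == expected_size)
print(fixed["weights"].shape)  # the sub-array length is part of the dtype
```

Changing `n_cols` produces a different dtype, which is why the length has to be fixed before any array of records is allocated.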

The best you could do with a memoryview is to have a pointer cast to a struct that doesn't contain the vectors but just has enough padding in it to hold the vectors (note that this is holding the vector "control struct"; the actual memory for the vectors will be allocated elsewhere). I doubt this is what you want.

I suspect you don't actually need a dtype and we could possibly suggest a more appropriate alternative for "error checking and comparisons of datatypes".


Adam Li

Mar 15, 2022, 11:35:34 AM
to cython...@googlegroups.com
I see. So I actually do have a fixed number of values (i.e. n_cols), defined a priori by the user, that need to get filled in. I'm just using the vector as a storage mechanism. I'm not sure I follow the pointer suggestion. Are you saying I should define a pointer to yet another struct that holds this array?

In terms of downstream functionality and error checking, here are two example functions where I need to use NODE_DTYPE:

cdef unpickle(d):
    """Unpickle a Node."""
    node_ndarray = d['nodes']
    # raises if the stored dtype cannot be cast to NODE_DTYPE
    node_ndarray.astype(NODE_DTYPE, casting="same_kind")
    ...

cdef np.ndarray _get_node_ndarray(self):
    """Wraps nodes as a NumPy struct array to return to user.

    Keep reference to the existing class.
    """
    cdef np.npy_intp shape[1]
    shape[0] = <np.npy_intp> self.node_count
    cdef np.npy_intp strides[1]
    strides[0] = sizeof(Node)
    cdef np.ndarray arr
    Py_INCREF(NODE_DTYPE)

    # I use NODE_DTYPE here
    arr = PyArray_NewFromDescr(<PyTypeObject *> np.ndarray,
                               <np.dtype> NODE_DTYPE, 1, shape,
                               strides, <void*> self.nodes,
                               np.NPY_DEFAULT, None)
    Py_INCREF(self)
    if PyArray_SetBaseObject(arr, <PyObject*> self) < 0:
        raise ValueError("Can't initialize array.")
    return arr
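
For the error-checking half of this, one possible sketch is a plain dtype comparison on the Python side. This is an assumption about what the check needs to do, not what scikit-learn does; it is stricter than `astype(..., casting="same_kind")` because it accepts only an identical record layout. The helper names `make_node_dtype` and `check_node_array` are hypothetical.

```python
import numpy as np

def make_node_dtype(n_cols):
    """Hypothetical helper: record dtype for a given number of columns."""
    return np.dtype([
        ("I", np.intc),
        ("k", np.float64),
        ("weights", np.float32, (n_cols,)),
        ("indices", np.intc, (n_cols,)),
    ])

def check_node_array(node_ndarray, expected_dtype):
    """Raise ValueError unless the array has exactly the expected layout."""
    if node_ndarray.dtype != expected_dtype:
        raise ValueError(
            f"node array has dtype {node_ndarray.dtype}, "
            f"expected {expected_dtype}"
        )
    return node_ndarray

expected = make_node_dtype(3)
good = np.zeros(2, dtype=expected)
bad = np.zeros(2, dtype=make_node_dtype(4))  # wrong number of columns

check_node_array(good, expected)  # passes silently
try:
    check_node_array(bad, expected)
except ValueError as exc:
    print("rejected:", exc)
```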

Thanks for all the help. Let me know if I can provide more information.



--
Best Regards,

Adam Li (he/him), PhD in Biomedical Engineering 
Postdoctoral Researcher at Columbia University
Causal AI Lab

D Woods

Mar 16, 2022, 3:54:48 AM
to cython-users
(Trying again because the first time I posted this it didn't get through)

Is n_cols known at compile-time or at runtime? Or is there a known upper limit for n_cols? If it (or its upper limit) is known at compile-time then you could do it with Cython and structs. If it's known at runtime then you could allocate the array in Numpy (but without a very easy way to access it quickly in Cython).

Are the features of Numpy you need pickling/unpickling, and providing an indexable array for your users? Both of these could definitely be done in other ways.

My pointer suggestion was something like (untested):

cdef struct Node:
    int I
    double k
    vector[float] weights  # weights of the vector
    vector[int] indices    # indices of the features

cdef struct NodeForView:
    int l
    double k
    char padding[sizeof(vector[float])+sizeof(vector[int])]

cdef Node nodearray[10]
cdef NodeForView[::1] view = <NodeForView[:10]><NodeForView*>(nodearray)
i.e. viewing it with a struct with the same size, but with the vectors hidden. I do not believe this is a particularly good idea though. It'll let you get a memoryview of l and k, but would provide no useful access to weights and indices (and plenty of opportunity to misuse them).
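
The same idea can be sketched on the NumPy side with a dtype that names only the two scalar fields and skips over the rest of each record. The 64-byte record size is an assumption (a 4-byte int, 4 bytes of alignment padding, an 8-byte double and two 24-byte vector objects, which is common on 64-bit platforms but not guaranteed); in real code it would have to come from `sizeof(Node)`. A `bytearray` stands in for the C array here.

```python
import numpy as np

NODE_SIZE = 64  # assumed sizeof(Node); platform dependent, not guaranteed

# Only "l" and "k" are described; the bytes holding the vectors are
# skipped because itemsize is larger than the fields that are named.
view_dtype = np.dtype({
    "names": ["l", "k"],
    "formats": [np.intc, np.float64],
    "offsets": [0, 8],
    "itemsize": NODE_SIZE,
})

raw = bytearray(NODE_SIZE * 10)  # stand-in for `Node nodearray[10]`
view = np.frombuffer(raw, dtype=view_dtype)

view["k"][2] = 1.5               # writes into the underlying buffer
print(view.shape, view["k"].strides)
```

As with the Cython version, this exposes `l` and `k` only; nothing in the view knows how to reach the vector contents.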


Adam Li

Mar 16, 2022, 11:51:31 AM
to cython...@googlegroups.com
Ah yeah I would know it at runtime because it depends on the user passed in data.

Regarding the features of numpy, yes I need it to pickle and provide an indexable array. Moreover, I need the weights/indices accessible within Cython to do nogil computations.

Essentially, I need a storage mechanism for an array of numbers that is convertible to numpy array (so that it's pickleable and can be shown to users), and can be used within Cython without significant performance losses.

Would a Cython dataclass be a suitable replacement?

da-woods

Mar 16, 2022, 5:37:54 PM
to cython...@googlegroups.com
What I'd do I think:

create the dtype at runtime in Numpy:

dtype = np.dtype([("I", np.intc), ("k", np.float64), ("weights", np.float32, n_col), ("indices", np.intc, n_col)])

Allocate the array with numpy:

arr = np.zeros((N,), dtype=dtype)

That gives you a Numpy array for pickling and sending back to your users. Within Cython you can create separate memoryviews of each of the bits of the array:

cdef int[:] I = arr["I"]
cdef double[:] k = arr["k"]
cdef float[:,:] weights = arr["weights"]  # 2D memoryview
cdef int[:,:] indices = arr["indices"]  # 2D memoryview

The `arr[string]` indexing is a Python call and so isn't optimized, but all of the individual part memoryviews can be accessed quickly without the GIL.

A Cython dataclass is not a replacement - it's really just a cdef class with some automatically generated useful functions (__init__, __repr__, etc). It really doesn't do any of what you want.
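
The NumPy half of this approach can be tried without Cython. Below is a minimal sketch; the sizes are made up, the helper name `make_node_dtype` is hypothetical, and `np.intc` is used because it matches a C `int`, which is what an `int[:]` memoryview expects.

```python
import pickle
import numpy as np

def make_node_dtype(n_col):
    """Build the record dtype once n_col is known (at runtime)."""
    return np.dtype([
        ("I", np.intc),
        ("k", np.float64),
        ("weights", np.float32, (n_col,)),
        ("indices", np.intc, (n_col,)),
    ])

n_col, n_nodes = 3, 4  # hypothetical sizes
arr = np.zeros((n_nodes,), dtype=make_node_dtype(n_col))

# Field access returns a view, so writes land in `arr` itself.
weights = arr["weights"]          # shape (n_nodes, n_col)
weights[1, :] = [0.5, 1.5, 2.5]

# The whole array, including its dtype, survives a pickle round trip.
restored = pickle.loads(pickle.dumps(arr))
print(weights.shape, weights.flags.c_contiguous)
print(restored.dtype == arr.dtype, restored[1]["weights"].tolist())
```

The field views are strided rather than C-contiguous, because the other fields of each record sit between consecutive rows. That is why the memoryviews would be declared as `float[:, :]` and not `float[:, ::1]`.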

Adam Li

Mar 17, 2022, 4:36:24 PM
to cython...@googlegroups.com
Ah I see that's a nice workaround! Thanks da-woods. 

Summary: I think I see a way forward now with your explicit explanation. I can keep the `vector[float]`/`vector[int]` within my Node struct, because I may have many Nodes (I am building a binary tree and each node of the tree is parametrized by this Node struct). The size of the vectors within each Node is fixed once the user specifies `n_cols`.

When I initialize the class/functions, I can, as you advised, construct the numpy dtype within a function that holds the GIL, so that it is accessible. Meanwhile, within any nogil functions, I can add data and perform computations with `Nodes`. Finally, when I need to pickle the overall class/object, which requires the dtype for error checking and for extracting the contents of `Nodes` into a numpy array, I can just build up the numpy array by dumping the contents of the Nodes into it.

What I was missing: The main step I was missing, it seems, is that there needs to be a function, called by the class constructor, that computes the numpy dtype upon instantiation, so that it is known at runtime. That seems like a fair thing to do since it is only done once in "Python".

Pseudocode: As a result the following is all possible (I think?)!

cdef struct Node:
    int I
    double k
    vector[float] weights  # weights of the vector
    vector[int] indices    # indices of the features

cdef class MyTreeWithNodes:
    cdef object node_dtype

    def __cinit__(self, n_cols):
        # compute the numpy dtype when instantiating cuz we'll know n_cols
        self.node_dtype = np.dtype([("I", np.intc), ("k", np.float64), ("weights", np.float32, n_cols), ("indices", np.intc, n_cols)])

    cpdef pickle/unpickle():
        # we'll know how many nodes we have
        # so I can initialize the array needed in numpy to hold all the data from all nodes
        # (one record per node; the dtype already accounts for n_cols)
        arr = np.zeros((num_nodes,), dtype=self.node_dtype)

    cdef get_node_vals(node) nogil:
        # I can say set the node weights/indices on the fly as needed
        cdef vector[float] weights = get_weights(...)
        cdef vector[int] indices = get_indices(...)
        node.indices = indices
        node.weights = weights
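
The "dump the contents of the Nodes into a numpy array" step can be sketched in pure Python. This is a stand-in for the Cython class, not the actual implementation: the class name `TreeSketch`, its methods, and the use of a Python list in place of the C++ vectors are all assumptions for illustration.

```python
import numpy as np

class TreeSketch:
    """Pure-Python stand-in for the Cython class; all names are hypothetical."""

    def __init__(self, n_cols):
        self.n_cols = n_cols
        # Built once, as soon as n_cols is known.
        self.node_dtype = np.dtype([
            ("I", np.intc),
            ("k", np.float64),
            ("weights", np.float32, (n_cols,)),
            ("indices", np.intc, (n_cols,)),
        ])
        self.nodes = []  # list of (I, k, weights, indices) tuples

    def add_node(self, I, k, weights, indices):
        if len(weights) != self.n_cols or len(indices) != self.n_cols:
            raise ValueError("each node needs exactly n_cols weights and indices")
        self.nodes.append((I, k, list(weights), list(indices)))

    def to_array(self):
        # One record per node; the dtype already accounts for n_cols.
        arr = np.zeros((len(self.nodes),), dtype=self.node_dtype)
        for row, (I, k, weights, indices) in enumerate(self.nodes):
            arr["I"][row] = I
            arr["k"][row] = k
            arr["weights"][row] = weights
            arr["indices"][row] = indices
        return arr

    def __getstate__(self):
        return {"n_cols": self.n_cols, "nodes": self.to_array()}

    def __setstate__(self, state):
        self.__init__(state["n_cols"])
        arr = state["nodes"]
        if arr.dtype != self.node_dtype:
            raise ValueError("unexpected node dtype in pickled state")
        for rec in arr:
            self.add_node(int(rec["I"]), float(rec["k"]),
                          rec["weights"].tolist(), rec["indices"].tolist())

tree = TreeSketch(n_cols=2)
tree.add_node(1, 0.5, [1.0, 2.0], [0, 3])
state = tree.__getstate__()             # what pickle would store
clone = TreeSketch.__new__(TreeSketch)
clone.__setstate__(state)               # what pickle would call on load
print(clone.nodes == tree.nodes)
```

In the Cython version the loop in `to_array` would copy out of each Node's vectors, and `__setstate__` would copy back into them.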


