How to iterate Arrow table rows in Cython?


Dino Dolinsek

Oct 27, 2021, 11:19:54 AM
to cython-users
Hi there,

We are passing an Arrow table from Python to Cython so that we can quickly iterate over millions of its rows inside Cython. The rows must be processed in sequence, so iterating is the only option. There is a C++ example: https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
which I tested and it works well, but I couldn't get the same methods to work in Cython (my knowledge of the matter is limited).

We plan to iterate over the rows in a for loop, but we can't figure out how to "access / check / print" the values of each row.

How could we, for example, print a single row from the passed arrow table (or multiple rows if you so prefer) of "date, name, age, weight" inside Cython? Could you give us an example please?

python code:
===============================================
import pandas as pd
import pyarrow as pa  # needed for pa.Table.from_pandas below
import prophet.cython.arrow.myarrow as myarrow

df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01 00:00:00', periods=3, freq='1min'),
    'name': ['jack', 'tim', 'frank'],
    'age': [32, 25, 65],
    'weight': [66.46, 84.11, 71.52]
})
table = pa.Table.from_pandas(df)
myarrow.iterate_table(table) # This is where arrow table is being passed from Python to Cython
===============================================

cython code:
===============================================
from __future__ import print_function
cimport pyarrow
from pyarrow.lib cimport *

def iterate_table(obj):
    cdef int num_columns = 0
    cdef int num_rows = 0
    cdef:
        shared_ptr[CTable] table = pyarrow_unwrap_table(obj)
        shared_ptr[CChunkedArray] array
        shared_ptr[CArray] chunk
        shared_ptr[CArrayData] data
        
    if table.get() == NULL:
        raise TypeError("not a table...")

    num_columns = table.get().num_columns()
    num_rows = table.get().num_rows()
    print("num_columns: ",num_columns) # prints 4 as expected
    print("num_rows: ",num_rows) # prints 3 as expected

    array = table.get().column(2)
    chunk = array.get().chunk(0)
    data = chunk.get().data()
    print("chunk length: ", chunk.get().length()) # prints 3 as expected
    print("data length: ", data.get().length) # prints 3 as expected

===============================================

best regards,
Neon

Stefan Behnel

Oct 28, 2021, 2:56:06 AM
to cython...@googlegroups.com
Hi,

Dino Dolinsek wrote on 27.10.21 at 13:16:
> we are passing Arrow table from Python to Cython in order to quickly
> iterate millions of arrow table rows inside Cython. We need to do this in a
> sequence, so iterating is the only option. In C++ there is an example:
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> which i tested and it works good, but i couldn't get the same methods to
> work in Cython (based on my limited knowledge on the matter)
>
> We plan to iterate rows in a for loop, but we can't figure out how to
> "access / check / print" values for each row.

I can't see a loop in your code example below. Have you tried Python's
for-loop? The Apache code example looks like it should work.
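
A minimal sketch of what such a row loop does, assuming a simplified model rather than the real library: plain Python lists stand in for Arrow's chunked columns, and `columns` and `iter_rows` are hypothetical names, not pyarrow or Arrow C++ API. It also assumes that every column is split into chunks at the same row boundaries, which a real Arrow table may not guarantee. The point it illustrates is that each column is stored as a list of chunks, so a loop that only reads chunk(0) sees just the first part of the data. The date column is left out to keep the sample short.

```python
# Plain-Python model of a chunked, columnar table (NOT the pyarrow API).
# Each column is a list of chunks; here all columns share one chunk layout.
columns = {
    "name": [["jack", "tim"], ["frank"]],
    "age": [[32, 25], [65]],
    "weight": [[66.46, 84.11], [71.52]],
}


def iter_rows(columns):
    """Yield one tuple per row, walking every chunk of every column in order."""
    names = list(columns)
    num_chunks = len(columns[names[0]])
    for chunk_index in range(num_chunks):
        # Take the same chunk from each column, then zip the chunks into rows.
        chunks = [columns[name][chunk_index] for name in names]
        for row in zip(*chunks):
            yield row


rows = list(iter_rows(columns))
for row in rows:
    print(row)
```

In Cython the same shape would apply to the unwrapped table: an outer loop over chunks and an inner loop over the rows of each chunk, with each column's chunk cast to its concrete array type before reading values.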


> How could we, for example, print a single row from the passed arrow table
> (or multiple rows if you so prefer) of "date, name, age, weight" inside
> Cython? Could you give us an example please?
>
> *python code:*
> ===============================================
> import pandas as pd
> import prophet.cython.arrow.myarrow as myarrow
>
> df = pd.DataFrame({
> 'date': pd.date_range(start='2020-01-01 00:00:00', periods=3, freq='1min'),
> 'name': ['jack', 'tim', 'frank'],
> 'age': [32, 25, 65],
> 'weight': [66.46, 84.11, 71.52]
> })
> table = pa.Table.from_pandas(df)
> myarrow.iterate_table(table) # This is where arrow table is being passed from Python to Cython
> ===============================================
>
> *cython code:*
> ===============================================
> from __future__ import print_function
> cimport pyarrow
> from pyarrow.lib cimport *

Star-imports are generally a bad idea.


> def iterate_table(obj):
>     cdef int num_columns = 0
>     cdef int num_rows = 0
>     cdef:
>         shared_ptr[CTable] table = pyarrow_unwrap_table(obj)
>         shared_ptr[CChunkedArray] array
>         shared_ptr[CArray] chunk
>         shared_ptr[CArrayData] data
>
>     if table.get() == NULL:
>         raise TypeError("not a table...")
>
>     num_columns = table.get().num_columns()
>     num_rows = table.get().num_rows()
>     print("num_columns: ",num_columns) # prints 4 as expected
>     print("num_rows: ",num_rows) # prints 3 as expected
>
>     array = table.get().column(2)
>     chunk = array.get().chunk(0)
>     data = chunk.get().data()
>     print("chunk length: ", chunk.get().length()) # prints 3 as expected
>     print("data length: ", data.get().length) # prints 3 as expected
>
> ===============================================

So, if everything works, what have you tried that didn't work?

Stefan