How to read/write large files in Cython


Amir Sharif

unread,
Dec 8, 2022, 1:57:03 PM12/8/22
to cython-users
Hi everybody,
I am new to Cython. I am using Python to process several very large data files (>20 GB each). I split the large file across multiple processes, which sped the program up a lot, but it is still not fast enough, so I am trying to open/load the files through Cython to make the program faster.
Is there any sample code for opening, reading, and writing files with Cython?
(I am working with an Apple M1 Pro).

Thanks a lot,
Amir

D Woods

unread,
Dec 8, 2022, 3:16:57 PM12/8/22
to cython-users
You're unlikely to be able to speed up Python file objects much with Cython - they expose a fairly small amount of C API and I don't think Cython does anything to use it. You could try manually calling the C API functions (https://docs.python.org/3/c-api/file.html). If you're reading/writing lots of short lines that might help a little, but it's unlikely to be a big success.

Your best option is to use the C `FILE*` API. For this you are basically writing C code, so you should search for C documentation rather than Cython documentation. You will find it much more low-level than Python, but it's about as fast as you can get.

You could also look at the C++ standard library for something a little more high-level. However, "iostreams" are known to be fairly slow, so they might not be your best option.

To be honest I suspect you will be disappointed and discover that the main limitation is just the speed of your disk. However, that's a guess without seeing much code.

Amir Hossein Sharifzadeh

unread,
Dec 13, 2022, 2:18:45 AM12/13/22
to cython-users

Hello.
Thanks for your reply. Now I have an issue with struct.unpack in Cython. The general scenario: I have a very large binary dataset and need to break it into small chunks through multiprocessing and do some calculations. The program works fine in Python, but I have an issue with the unpack function in Cython.
This is the Python version:
fid = open(rawfile, 'rb')
fid.seek(0, os.SEEK_END)
nFileLen = fid.tell()
fid.seek(0, 0)
...
with Pool(max_number_of_process, maxtasksperchild=1) as pool:
    for i in tqdm.tqdm(range(0, Nt)):
        nLenVals_i = round(chunk_size / 4)
        # On each step we load a new chunk of the file and pass it to a new process.
        nVals_i = struct.unpack('I' * nLenVals_i, fid.read(chunk_size))
        ...


and this is my code in Cython:

ff = rawfile.encode('utf-8')
cdef float *buf
fid = fopen(ff, b'rb')
if fid == NULL:
    return
else:
    fseek(fid, 0, SEEK_END)
    nFileLen = ftell(fid)
    fseek(fid, 0, SEEK_SET)
...
with Pool(max_number_of_process, maxtasksperchild=1) as pool:
    for i in tqdm.tqdm(range(0, Nt)):
        nLenVals_i = round(chunk_size / 4)
        buf = <float *> malloc(10 * sizeof(float))
        result = fread(buf, sizeof(float), chunk_size, fid)
        nVals_i = struct.unpack('I' * nLenVals_i, buf)
        ...
        free(buf)

I get this error: "Cannot convert 'float *' to Python object".

Thanks for any suggestions.
Best,
Amir


Hanno Klemm

unread,
Dec 13, 2022, 2:19:02 AM12/13/22
to cython...@googlegroups.com


It's difficult to say whether this might help you, because I don't know what data you're talking about, but it can help a lot to read your data in once and then save it in a more optimised structure (e.g. HDF5 or NumPy .npy files).

20 GB isn't that big, and if you save it in an access-friendly way, I/O shouldn't be the bottleneck. But, as I said, this is difficult to know without further details.

Hanno

D Woods

unread,
Dec 13, 2022, 12:17:06 PM12/13/22
to cython-users
On Tuesday, December 13, 2022 at 7:18:45 AM UTC amirsha...@gmail.com wrote:

    nVals_i = struct.unpack('I' * nLenVals_i, buf)

    I get this error: "Cannot convert 'float *' to Python object".

Yes - `struct.unpack` is a Python function expecting a Python object that supports the buffer protocol, and a float pointer isn't a Python object.

You could probably get a memoryview of the float array though:

 nVals_i = struct.unpack('I' * nLenVals_i, <float[:10]>buf)

Additionally, memory allocation and deallocation are usually relatively slow. Since "buf" is always the same size, maybe allocate it once and free it once at the end.