Large image arrays, zarr and dask


Alex Liberzon

Aug 10, 2024, 5:36:19 AM
to openpiv-users
Dear OpenPIV users,

We’re closer than ever to an open source PIV solution that can become a viable replacement for the commercial software. One thing that could drastically change the situation is the ability to work with large image datasets. At the moment we work in the most inefficient way: a) we store the large image dataset in some file format, e.g. TIFF, BMP, etc., and then b) we use Python file I/O (read/write), which is not very fast. A much faster solution would be to work with modern data structures that allow access to the image data in parallel (we could run PIV in parallel on chunks of an image or on several image pairs at once), use fast I/O, and store the result in a similarly fast format for post-processing. E.g. PIVPy, our first prototype for post-processing in the spirit of PIVMAT, uses xarray and the NetCDF binary format; it is faster to load and write, and it handles huge datasets in cloud storage efficiently.
There is an even better solution with Zarr and Dask, or we could look at additional options from the microscopy or astronomy communities. Is anyone here interested in helping with this development?
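To make the idea concrete, here is a minimal sketch (an illustration, not a design) of packing a TIFF sequence into a chunked Zarr array and running PIV on image pairs in parallel with Dask. The file names, chunk sizes and PIV parameters are made up, and I'm assuming the current openpiv.pyprocess.extended_search_area_piv interface for the correlation step:

import glob

import dask
import dask.array as da
import numpy as np
import tifffile
import zarr
from openpiv import pyprocess

# 1) Pack the TIFF sequence into one chunked Zarr array (one frame per chunk).
files = sorted(glob.glob("images/frame_*.tif"))   # made-up file layout
first = tifffile.imread(files[0])
store = zarr.open("images.zarr", mode="w",
                  shape=(len(files), *first.shape),
                  chunks=(1, *first.shape), dtype=first.dtype)
for k, name in enumerate(files):
    store[k] = tifffile.imread(name)

# 2) Open the store lazily and run PIV on each (a, b) pair in parallel.
images = da.from_zarr("images.zarr")

@dask.delayed
def piv_pair(frame_a, frame_b):
    # illustrative parameters only
    u, v, s2n = pyprocess.extended_search_area_piv(
        frame_a.astype(np.int32), frame_b.astype(np.int32),
        window_size=32, overlap=16, dt=1.0, search_area_size=64)
    return np.stack([u, v])

tasks = [piv_pair(images[2 * k], images[2 * k + 1])
         for k in range(images.shape[0] // 2)]
fields = dask.compute(*tasks)   # pairs are processed concurrently

The resulting fields could then be written to a second Zarr store (or a NetCDF file via xarray) so that post-processing in PIVPy also reads them chunk by chunk.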

Thanks
Alex

Ivan Nepomnyashchikh

Aug 10, 2024, 1:08:59 PM
to Alex Liberzon, openpiv-users

Hello, Dr. Liberzon,

I think you've brought up an important issue.

You are talking about the speed of PIV processing for big datasets of image pairs. And yes, Dask looks like the thing to go with in this case.

However, in my experience with PIVPY such a solution has been a great pain. PIVPY is based on Xarray, and Xarray is excruciatingly slow. To make a long story short, I ended up cooking a crazy brew of Xarray, Dask and numpy to get an acceptable speed of data processing - in my case it was the analysis of vortex size using custom Gamma1 and Gamma2 functions (which I still plan to "pull-request" to PIVPY). The problem with that is that I have to learn two different things: Xarray and Dask, to say nothing of NetCDF. On the other hand, if I had just one solution, that would be more convenient, because I would have to learn only one library.
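To give a concrete example of what I mean by dropping to numpy: here is a rough sketch (not the pull request itself) of the Gamma1 criterion of Graftieaux et al. computed on plain numpy arrays, e.g. the ones you get out of an xarray Dataset with .values. The variable names and the neighbourhood size are placeholders:

import numpy as np

def gamma1(x, y, u, v, radius=2):
    """Gamma1 vortex-identification criterion (Graftieaux et al. 2001).
    x, y are 2-D meshgrids matching the 2-D velocity components u, v."""
    ny, nx = u.shape
    g = np.zeros_like(u, dtype=float)
    for j in range(ny):
        for i in range(nx):
            j0, j1 = max(j - radius, 0), min(j + radius + 1, ny)
            i0, i1 = max(i - radius, 0), min(i + radius + 1, nx)
            dx = x[j0:j1, i0:i1] - x[j, i]
            dy = y[j0:j1, i0:i1] - y[j, i]
            uu = u[j0:j1, i0:i1]
            vv = v[j0:j1, i0:i1]
            cross = dx * vv - dy * uu                  # (PM x U) . e_z
            norm = np.hypot(dx, dy) * np.hypot(uu, vv)
            norm[norm == 0] = np.nan                   # drop the point P itself
            g[j, i] = np.nanmean(cross / norm)
    return g

# e.g. gamma1(*np.meshgrid(ds.x.values, ds.y.values), ds.u.values, ds.v.values)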

That's why I have become a bit disappointed in a framework where one uses Python-based data structures (e.g., Xarray) in conjunction with Python-based chunking libraries (e.g., Dask) and stores the data in tensor-tailored binary files (e.g., NetCDF).

It's my personal wish to explore a different approach, namely to have an analog of numpy but for data structures. By that I mean a library that is written in a fast low-level programming language and provides a Python "wrapper". And I found such a solution: Polars. I'm beginning to play with it. I don't know yet how to chunk it for the case of huge datasets but, frankly speaking, Polars is so fast you might not need to chunk it at all. My thought was to create a copy of PIVPY that is Polars-based instead of Xarray-based and send it to you for review. But you were faster with this announcement.
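To give a taste of what I have in mind, here is a toy sketch (the file layout and column names are made up) with each vector field saved in long format - columns frame, x, y, u, v - as Parquet files, scanned lazily with a recent Polars:

import polars as pl

# Made-up layout: one Parquet file per run, long-format columns frame, x, y, u, v.
lazy = pl.scan_parquet("results/run_*.parquet")

stats = (
    lazy
    .with_columns((pl.col("u") ** 2 + pl.col("v") ** 2).sqrt().alias("speed"))
    .group_by("frame")
    .agg(
        pl.col("u").mean().alias("u_mean"),
        pl.col("v").mean().alias("v_mean"),
        pl.col("speed").max().alias("speed_max"),
    )
    .sort("frame")
    .collect()   # the query runs lazily and multi-threaded in Rust
)
print(stats)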

I also like your Zarr suggestion. I hate NetCDF. I have been using hdf5 - I love hdf5. Having looked at Zarr, I think, for me personally, it might be a viable replacement for hdf5.
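From what I can tell, writing PIV results to Zarr would look very much like what I already do with hdf5 - a rough sketch with made-up shapes and names:

import numpy as np
import zarr

n_pairs, ny, nx = 1000, 64, 86                 # made-up sizes
z = zarr.open("piv_results.zarr", mode="w",
              shape=(n_pairs, 2, ny, nx),      # [pair, (u, v), y, x]
              chunks=(1, 2, ny, nx), dtype="f4")

for k in range(n_pairs):
    u = np.zeros((ny, nx), dtype="f4")         # placeholders for real PIV output
    v = np.zeros((ny, nx), dtype="f4")
    z[k] = np.stack([u, v])                    # each field is its own chunk on disk

u50, v50 = z[50]                               # reading one field touches one chunk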

Here is what I am thinking. Elon Musk says he likes to assign the same task to two teams within his company and have them compete for the best solution. Shall we do the same thing? I'll keep working out my Polars idea, you (and maybe others interested in your approach) will be working out yours, and we'll compare the results once we are done. What do you think?

Thank you.

Ivan


Alex Liberzon

Aug 10, 2024, 1:29:51 PM
to openpiv-users
Hi Ivan,

I’m not sure why xarray is such a pain for you - basically it’s a Numpy array with some metadata, similar to Pandas I think. NetCDF is the choice of the people who developed xarray - they follow what is popular in the oceanographic, meteorological and other communities - and it is comparable to HDF5.
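For example, a toy sketch - this is all it takes to wrap a velocity field, and .values hands you the plain Numpy array back whenever speed matters:

import numpy as np
import xarray as xr

u = np.random.rand(32, 48)                     # toy velocity component
da = xr.DataArray(
    u,
    dims=("y", "x"),
    coords={"y": np.arange(32) * 0.5, "x": np.arange(48) * 0.5},
    attrs={"units": "m/s"},
)

print(da.sel(y=4.0, x=10.0).item())            # label-based access with metadata
print(type(da.values))                         # <class 'numpy.ndarray'> - the raw data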

Zarr/Dask is a different story, I think, yet I agree with your suggestion - let’s look at different solutions and compare. Eventually we’ll have to put together some benchmarking cases to see the advantages and disadvantages of the various tools. We’re an open source community and we cooperate through joint new ideas and discussions. Please go ahead.

Alex

Ivan Nepomnyashchikh

Aug 10, 2024, 2:56:12 PM
to openpi...@googlegroups.com

OK, thank you, I will do something with Polars and share the results, then. I will also be thinking about what can be used as benchmarking cases.

When I get my Gamma functions ready for the PIVPY pull request, I'll try to make a Jupyter notebook showing a case where Xarray completely fails in speed.
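Something along these lines (a toy sketch; exact numbers will of course vary by machine) already shows the per-element overhead I keep running into:

import time

import numpy as np
import xarray as xr

a = np.random.rand(2000, 2000)
da = xr.DataArray(a, dims=("y", "x"))

t0 = time.perf_counter()
s = 0.0
for _ in range(10_000):
    s += float(da[100, 100])                   # element access through xarray
t_xr = time.perf_counter() - t0

t0 = time.perf_counter()
s = 0.0
for _ in range(10_000):
    s += float(a[100, 100])                    # element access through numpy
t_np = time.perf_counter() - t0

print(f"xarray: {t_xr:.3f} s, numpy: {t_np:.3f} s")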

Ivan
