Hello, Dr. Liberzon,
I think you've brought up an important issue.
You are talking about the processing speed for big datasets of PIV image pairs. And yes, Dask looks like the right tool for this case.
However, in my experience with PIVPY such a solution has been a
great pain. PIVPY is based on Xarray. Xarray is excruciatingly
slow. To make a long story short, I ended up cooking a crazy
brew of Xarray, Dask and numpy to get an acceptable speed of data
processing - in my case it was the analysis of vortex size using
custom Gamma1 and Gamma2 functions (which I still plan to
"pull-request" to PIVPY). The problem with that is that I have to
learn two different things: Xarray and Dask. To say nothing of
NetCDF. On the other hand, if I had just one solution, that would
be more convenient, because I would have to learn only one
library.
That's why I have become a bit disappointed in the kind of framework where one uses Python-based data structures (e.g., Xarray) together with Python-based chunking libraries (e.g., Dask) and stores data in tensor-tailored binary files (e.g., NetCDF).
It's my personal wish to explore a different approach. Namely, to
have an analog of numpy but for data structures. By that I mean a
library which is written in a fast low-level programming language
and that provides a Python "wrapper". And I found such a solution:
Polars. I'm beginning to play with it. I don't know yet how to
chunk it for the case of huge datasets, but, frankly speaking,
Polars is so fast you might not need to chunk it. My thought was
to create a Polars-based copy of PIVPY, instead of the
Xarray-based one, and send it to you for review. But you were faster
with this announcement.
I also like your Zarr suggestion. I hate NetCDF. I have been using HDF5, and I love it. Having looked at Zarr, I think that, for me personally, it might be a viable replacement for HDF5.
Here is what I am thinking. Elon Musk says he likes to assign the same task to two teams within his company and have them compete for the best solution. Shall we do the same thing? I'll keep working on my Polars idea. You (and maybe others interested in your approach) will work on your idea. And we'll compare the results once we are done. What do you think?
Thank you.
Ivan
--
You received this message because you are subscribed to the Google Groups "openpiv-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openpiv-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/openpiv-users/252ff6ca-974d-408a-b34c-96afd27431d1n%40googlegroups.com.
OK, thank you. I will do something with Polars and share the results, then. I will also think about what can be used as benchmark cases.
When I have my Gamma functions ready for a PIVPY pull request, I'll try to make a Jupyter notebook showing a case where Xarray is far too slow to be usable.
Ivan