So we decided to apply for a NumFOCUS Small Development Grant ($9,500 USD) to initiate an effort in this direction. I am attaching the application so that you can see where we would be headed. Criticisms or suggestions for improvement are welcome!
PyTables to leverage the HDF5 enhanced direct chunk capabilities
Two Sentence Summary of Proposal
HDF5 supports a "direct chunking" paradigm, where data chunks can be offloaded to the application for compression and decompression. This opens significant opportunities to accelerate I/O, and hence to improve performance in applications that use PyTables for handling (potentially very large) datasets.
Description of Proposal
PyTables is a popular Python wrapper for HDF5, with special attention to handling table (heterogeneous data types) as well as array (homogeneous types) datasets, which can be very large. Although PyTables has had support for the fast Blosc compression library since 2010, I/O has traditionally been bottlenecked by HDF5's complex filter pipeline and chunk-processing machinery.
Fortunately, the HDF5 crew figured this out some time ago and created a way to bypass this mechanism, letting the application handle the process on its own. The mechanism is fully documented at https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite and, even though only the write part is described there, a counterpart mechanism exists for reading as well. Recently we performed an actual benchmark of this approach (https://github.com/oscargm98/HDF5-Blosc2) with very encouraging results: writes can be 30x to 50x faster, while reads can be 40x to 60x faster. See this presentation on the Blosc2 + HDF5 integration for details: https://www.blosc.org/docs/Blosc2-and-HDF5-European-HUG2022.pdf
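The essence of direct chunking can be illustrated without HDF5 at all: the application, not the library, compresses each chunk and hands opaque bytes to the storage layer. The sketch below is only an analogy, using stdlib zlib as a stand-in for Blosc2 and a dict as a stand-in for the HDF5 chunk store (H5Dwrite_chunk / H5Dread_chunk); chunk size and helper names are assumptions, not PyTables API.

```python
import zlib

CHUNK_SIZE = 1 << 16  # 64 KiB chunks; an assumption for this sketch


def write_chunks(data: bytes, store: dict) -> None:
    """Application-side chunking: compress each chunk ourselves and hand the
    opaque compressed bytes to the storage layer (the dict stands in for
    H5Dwrite_chunk, which stores bytes without running any HDF5 filter)."""
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        store[i // CHUNK_SIZE] = zlib.compress(chunk)  # Blosc2 in the real plan


def read_chunk(store: dict, chunk_no: int) -> bytes:
    """Read back one chunk: the storage layer returns the compressed bytes
    untouched (as H5Dread_chunk does) and the application decompresses."""
    return zlib.decompress(store[chunk_no])


store = {}
payload = bytes(range(256)) * 1024  # 256 KiB of sample data -> 4 chunks
write_chunks(payload, store)
assert read_chunk(store, 1) == payload[CHUNK_SIZE:2 * CHUNK_SIZE]
```

Because the filter pipeline is bypassed, the application is free to choose the compressor, thread count, and memory layout, which is where Blosc2's speed comes from.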
We propose to adapt PyTables to support the direct chunking mechanism, allowing much faster and more flexible data access. The proposed implementations are:
1) Data chunks will be compressed/decompressed inside PyTables, using the fast Blosc2 compression library directly via the direct chunking mechanism.
2) When slices are requested, PyTables can tell Blosc2 which parts of the data *inside* each chunk need to be decompressed, and Blosc2 will decompress *only* those parts, providing much better efficiency for selections that require partial chunk reads. See the slides about block masks and parallel I/O in https://www.blosc.org/docs/Blosc2-and-HDF5-European-HUG2022.pdf
3) Enable C-Blosc2 filter and codec plugin capabilities in the PyTables API, so that users who want access to codecs or filters other than the standard ones in C-Blosc2 will be able to use new ones straight from PyTables.
4) Offer support for the regular HDF5 filter mechanism, so that HDF5 files produced with the new direct chunking mechanism can at least be read (at the expense of speed, of course) by applications that lack this support. Although we are not completely sure how this can be achieved, we are confident that the HDF5 folks will guide us toward a solution.
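The partial-decompression idea in item 2 can be sketched in a few lines of stdlib Python. The block size, the zlib stand-in for Blosc2's codecs, and the helper names are all assumptions of this sketch; in the real implementation, Blosc2's block masks perform this bookkeeping internally.

```python
import zlib

BLOCK = 1 << 12  # 4 KiB blocks inside a chunk (assumed size); Blosc2 chunks
                 # are internally split into independently compressed blocks


def store_chunk(data: bytes) -> list:
    """Compress each block of a chunk independently, as Blosc2 does."""
    return [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


def read_slice(blocks: list, start: int, stop: int) -> bytes:
    """Decompress only the blocks overlapping [start, stop) -- the block-mask
    idea: blocks outside the requested slice are never decompressed."""
    first, last = start // BLOCK, (stop - 1) // BLOCK
    out = b"".join(zlib.decompress(blocks[b]) for b in range(first, last + 1))
    base = first * BLOCK
    return out[start - base:stop - base]


chunk = bytes(range(256)) * 256   # one 64 KiB chunk of sample data
blocks = store_chunk(chunk)       # 16 independently compressed blocks
assert read_slice(blocks, 5000, 5100) == chunk[5000:5100]
```

Here a 100-byte slice touches a single 4 KiB block, so only 1/16 of the chunk is decompressed, which is exactly the saving that makes partial chunk reads attractive.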
Besides this, we believe that properly publicizing the work on social networks is a good way to interact with the community, actively asking for feedback so that their input can help us improve the implementation. We plan several actions to achieve these goals:
1) Publish a short blog entry for each of the features implemented above. This should be done during or immediately after the implementation, and preferably before a formal release of the PyTables package.
2) Communicate our achievements via social networks like Twitter and LinkedIn.
3) Regularly scan forums like StackOverflow so that we can be more proactive in helping people interested in using compression in their data flow and, where appropriate, suggest the PyTables package.
4) Ask other open source projects, or NumFOCUS, for help with other ways to reach out to the community.
Benefit for the community
PyTables is used in scenarios where very large datasets in the form of tables or arrays are handled and processed. Providing support for C-Blosc2 via the direct chunking mechanism will bring much faster I/O and more efficient slicing (via block masks) to PyTables users. As a result, the existing community will have the freedom to choose between the existing standard chunk handling pipeline and the new one, with greatly enhanced performance and flexibility.
More specifically, the benefits for the data ecosystem should be substantial, because this proposal will provide:
* Faster data compression and decompression
* More selective slicing operations, requiring less I/O
* Plugin support directly inside the Blosc2 library, offering less impedance and faster operation
* Improved flexibility in using Blosc2 capabilities straight from the PyTables API
* Support for the standard HDF5 filter mechanism, so the data remains readable from any HDF5 application
This will add significant value to many areas of data handling, especially data architecture, out-of-core high-performance computing and storage efficiency, which are always in need of better ways to handle larger datasets and/or use fewer resources.
Brief Budget Justification
The whole amount of the grant will be spent on stipends, as follows:
Oscar Guiñón: $7500 (300 hours, $25/hour)
Francesc Alted: $2000 (40 hours, $50/hour)
As can be seen, Oscar Guiñón will be doing most of the work. Besides being part of the Blosc development team, Oscar has already gained good exposure to this scenario by creating the benchmark demonstrating the feasibility of this project (https://github.com/oscargm98/HDF5-Blosc2). Francesc Alted (BDFL of the Blosc project, and creator of the PyTables library: https://blosc.org/pages/francesc-alted-resume/) will act as mentor for the work.
Timeline of deliverables
+ July 18th, 2022
* Analysis of requirements.
+ July 25th, 2022
* Start the implementation of the proposed features.
+ September 30th, 2022
* Completion of the first prototype. Publish a first blog post and start advertising/communicating these developments on social networks. Start contributing to questions on StackOverflow, Twitter and others.
+ October 15th, 2022
* Complete support for the direct chunking mechanism. Study possible suggestions from the community and work on improving the user interface.
* Fully document the new features, with examples.
+ December 15th, 2022
* Finish the regular HDF5 filter feature, for maximum portability of the data generated via the new mechanism.
* Final version of a pull request for the PyTables project with all features implemented and ready to use.
* Publish a new blog post explaining the benefits for users. Announce it to the community (Twitter, mailing lists...). We will follow up on discussions about possible future work.
Oscar Guiñón: Member of the Blosc team. Author of the experiment proving the benefits of HDF5's direct chunking.
Francesc Alted: Creator of the PyTables project. BDFL of the Blosc project.
How will someone be identified to carry out the work?
We plan to use GitHub for completing the development of the tasks listed here, and the commits will serve as an accounting record of the work done. In addition, the blog series (published at https://pytables.org) will be signed by its authors, which will serve as another way of identifying the people doing the work.