NumFOCUS Small Development Grant


Francesc Alted

Jun 4, 2022, 4:45:54 AM
to pytabl...@googlegroups.com
Hi PyTables developers,

As a consequence of interest from part of the HDF5 community in better integrating Blosc2, I gave a presentation at the recent European HDF5 User Group (see https://www.blosc.org/docs/Blosc2-and-HDF5-European-HUG2022.pdf), where a colleague and I determined that the direct chunking mechanism would offer many advantages for this integration.

So we decided to apply for a NumFOCUS Small Development Grant ($9500 USD) to initiate an effort in this direction.  I am attaching the application so that you can see where we are headed.  Criticism or suggestions for improvement are welcome!

Cheers,
Francesc

PyTables to leverage the HDF5 enhanced direct chunk capabilities
================================================================

Two Sentence Summary or Proposal
--------------------------------

HDF5 has support for a "direct chunking" paradigm, where data chunks can be offloaded to the application for compression and decompression.  This opens up big opportunities to accelerate I/O, and hence to improve the performance of applications that use PyTables for handling (potentially very large) datasets.

Description of Proposal
-----------------------

PyTables (https://www.pytables.org) is a popular Python wrapper for HDF5 with special attention to handling table (heterogeneous data types) as well as array (homogeneous data types) datasets, which can be very large.  Although PyTables has had support for the fast Blosc compression library since 2010, I/O has traditionally been bottlenecked by the complex HDF5 mechanism for handling filter pipelines, and chunk processing in general.
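As a quick illustration of the current filter-pipeline path (a minimal sketch; the file name and data are made up), this is roughly how a PyTables user enables Blosc compression today:

```python
# A minimal sketch of how compression is used in PyTables today, through the
# standard HDF5 filter pipeline whose overhead is discussed in the text.
# The file name and array contents are purely illustrative.
import os
import tempfile

import numpy as np
import tables

arr = np.arange(1000, dtype=np.int64)
# Ask for Blosc compression via the regular filter pipeline.
filters = tables.Filters(complevel=5, complib="blosc")

path = os.path.join(tempfile.mkdtemp(), "example.h5")
with tables.open_file(path, mode="w") as f:
    f.create_carray(f.root, "data", obj=arr, filters=filters)

with tables.open_file(path, mode="r") as f:
    roundtrip = f.root.data[:]

print(bool((roundtrip == arr).all()))  # True
```

Every chunk written or read this way passes through HDF5's internal filter pipeline, which is exactly the overhead the direct chunking mechanism avoids.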

Fortunately, the HDF5 crew figured this out some time ago and created a way to bypass this mechanism and let the application handle the process on its own.  This mechanism is fully documented at https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite, and even though only the write part is described there, a counterpart mechanism exists for reading as well.  We recently performed an actual benchmark of this (https://github.com/oscargm98/HDF5-Blosc2) with very encouraging results: writes can be 30x to 50x faster, while reads can be 40x to 60x faster.  See this presentation on Blosc2 + HDF5 integration for details: https://www.blosc.org/docs/Blosc2-and-HDF5-European-HUG2022.pdf.
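The mechanism can be sketched in pure Python.  In this sketch, zlib stands in for Blosc2, and a plain dict stands in for HDF5's chunk store, since H5Dwrite_chunk/H5Dread_chunk essentially store and return whatever raw bytes the application hands over, with no filter pipeline involved:

```python
import zlib

# A toy "chunk store": maps chunk index -> raw compressed bytes, standing in
# for HDF5's H5Dwrite_chunk / H5Dread_chunk, which store and return chunks
# exactly as the application provides them (no filter pipeline involved).
chunk_store = {}

CHUNK_SIZE = 4  # elements per chunk (illustrative)

def write_direct_chunk(index, values):
    """Compress a chunk in the application and store the raw bytes."""
    payload = bytes(values)  # assume uint8 data for simplicity
    chunk_store[index] = zlib.compress(payload)

def read_direct_chunk(index):
    """Fetch the raw bytes and decompress them in the application."""
    return list(zlib.decompress(chunk_store[index]))

# Write a small dataset chunk by chunk.
data = list(range(8))
for i in range(0, len(data), CHUNK_SIZE):
    write_direct_chunk(i // CHUNK_SIZE, data[i:i + CHUNK_SIZE])

# Read it back, chunk by chunk.
out = []
for i in sorted(chunk_store):
    out.extend(read_direct_chunk(i))
print(out)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The point of the real mechanism is the same as in the sketch: compression and decompression happen entirely in the application (here, PyTables calling Blosc2), while HDF5 only moves opaque bytes to and from disk.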

We propose to adapt PyTables to support the direct chunking mechanism, allowing much faster access to data, and with more flexibility.  These are the proposed features:

1) Data chunks will be compressed/decompressed inside PyTables, using the fast Blosc2 compression library directly via the direct chunking mechanism.

2) When slices are requested, PyTables can determine which data *inside* a chunk actually needs to be decompressed, and Blosc2 will decompress *only* the necessary parts, providing much better efficiency for selections that require partial chunk reads.  See the slides about block masks and parallel I/O in https://www.blosc.org/docs/Blosc2-and-HDF5-European-HUG2022.pdf.

3) Enable C-Blosc2 filter and codec plugin capabilities in the PyTables API, so that users who want access to codecs or filters other than the ones standard in C-Blosc2 will be able to use new ones straight from PyTables.

4) Offer support for the regular HDF5 filter mechanism, so that when HDF5 files produced using the new direct chunking mechanism are used in applications that lack this support, the data can at least be read (at the expense of speed, of course).  Although we are not completely sure how this can be achieved, we are confident that the HDF5 folks can guide us towards achieving it.
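The partial-read idea in point 2 can be sketched in pure Python as well.  Here zlib stands in for Blosc2's per-block compression, and the block layout is made up; the key property is that each block inside a chunk is compressed independently, so a slice only needs to decompress the blocks it actually touches:

```python
import zlib

BLOCK = 4      # elements per block (illustrative)
NBLOCKS = 8    # blocks per chunk (illustrative)

def compress_chunk(values):
    """Compress each block of the chunk independently (as Blosc2 does
    internally), so blocks can later be decompressed on their own."""
    return [zlib.compress(bytes(values[i * BLOCK:(i + 1) * BLOCK]))
            for i in range(NBLOCKS)]

def read_slice(blocks, start, stop):
    """Decompress only the blocks that intersect [start, stop) --
    the 'block mask' idea: untouched blocks are never decompressed."""
    first, last = start // BLOCK, (stop - 1) // BLOCK
    touched = range(first, last + 1)
    out = []
    for b in touched:
        out.extend(zlib.decompress(blocks[b]))
    # Trim to the exact slice within the decompressed region.
    offset = first * BLOCK
    return out[start - offset:stop - offset], len(touched)

chunk = list(range(NBLOCKS * BLOCK))      # one 32-element chunk
blocks = compress_chunk(chunk)
values, nblocks = read_slice(blocks, 5, 11)
print(values, nblocks)  # [5, 6, 7, 8, 9, 10] 2 -- only 2 of 8 blocks touched
```

For selections that hit a small fraction of a chunk, this means decompression work proportional to the slice, not to the chunk.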

Besides this, we consider that advertising this work properly on social networks is a good way to interact with the community, actively asking for feedback so that their input can help us improve the implementation.  We plan to take several actions to achieve these goals:

1) Publish a small blog entry for each of the features implemented above.  This should be done during or immediately after the implementation, and preferably before a formal release of the PyTables package.

2) Communicate our achievements via traditional networks like Twitter or LinkedIn.

3) Regularly scan forums like StackOverflow so that we can be more proactive in helping people interested in using compression in their data flow and, if appropriate, suggest the use of the PyTables package.

4) Ask for help from other open source projects, or from NumFOCUS, about other ways to reach out to the community.


Benefit for the community
-------------------------

PyTables is used in scenarios where very large datasets in the form of tables or arrays are handled and processed.  Providing support for C-Blosc2 via the direct chunking mechanism will bring much faster I/O and more efficient slicing (via block masks) to PyTables users.  As a result, the existing community will have the freedom to choose between the existing standard chunk handling pipeline and the new one, with much enhanced performance and flexibility.

More specifically, the benefits for the data ecosystem should be significant, since this proposal will bring:

* Faster data compression and decompression
* More data selectivity during slicing operations, requiring less I/O
* Support for plugins directly inside the Blosc2 library, offering less impedance and faster operation
* Improved flexibility in using Blosc2 capabilities straight from the PyTables API
* Support for the standard HDF5 plugin mechanism, for read support from any HDF5 application

This will add a lot of value to many areas of data handling, but especially in areas like data architecture, out-of-core high performance computing and storage efficiency, which are always in need of better ways to handle bigger datasets while using fewer resources.


Brief Budget Justification
--------------------------

The whole amount of the grant will be spent on stipends, as follows:

Oscar Guiñón:  $7500 (300 hours, $25/hour)
Francesc Alted: $2000 (40 hours, $50/hour)

As can be seen, Oscar Guiñón will be doing most of the work.  Besides being part of the Blosc development team, Oscar has also had good exposure to this scenario, having created the benchmark demonstrating the feasibility of this project (https://github.com/oscargm98/HDF5-Blosc2).  Francesc Alted (BDFL of the Blosc project, and creator of the PyTables library: https://blosc.org/pages/francesc-alted-resume/) will act as mentor for the work.


Timeline of deliverables
------------------------

+ July 18th, 2022

    * Analysis of requirements.

+ July 25th, 2022

    * Start the implementation of the proposed features.

+ September 30th, 2022

    * Completion of the first prototype.  Write a first blog post and start advertising/communicating these developments on social networks.  Start answering questions on StackOverflow, Twitter and elsewhere.

+ October 15th, 2022

    * Complete support for the direct chunking mechanism.  Study possible suggestions from the community on improving the user interface.

    * Fully document the new features (with examples).


+ December 15th, 2022

    * Finish the regular HDF5 filter feature for maximum portability of data generated via the new mechanism.

    * Final version of a pull request for the PyTables project with all features implemented and ready to use.

    * New blog post explaining the benefits for users.  Announce it to the community (Twitter, mailing lists...).  We will follow up on discussions about possible future work.
 

Project team
------------

Oscar Guiñón: Member of the Blosc team.  Author of the experiment proving the benefits of HDF5's direct chunking.
Francesc Alted: Creator of the PyTables project.  BDFL of the Blosc project.


How will someone be identified to carry out the work?
-----------------------------------------------------

We plan to use GitHub for completing the development of the tasks listed here, and the commits will serve as an accounting record of the work done.  In addition, the series of blog posts (published at https://pytables.org) will be signed by their authors, which will serve as another way of identifying the people doing the work.

--
Francesc Alted

antonio....@tiscali.it

Jun 7, 2022, 1:15:17 PM
to pytabl...@googlegroups.com

Dear Francesc,
thanks a lot for the very interesting initiative.
I think that the new features that you are proposing would be a valuable addition for PyTables.

I took a quick look at the application text.
It is very clear and well written.
One important point, IMHO, is the one regarding the compatibility with other libraries and applications.
Can you already say something about possible limitations?


cheers
Antonio
 

--
You received this message because you are subscribed to the Google Groups "pytables-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pytables-dev/CAFrp1vqHukt78WFaQWowFsW6a43XPLNHZMfdCRQHeY63t23Nng%40mail.gmail.com.




Francesc Alted

Jun 15, 2022, 11:56:00 AM
to pytabl...@googlegroups.com
Dear Antonio,

Thanks for your time in evaluating the proposal.  Regarding compatibility with other libraries, you are right, that's an important constraint and, frankly, we are not completely sure how we can achieve it.  However, efforts like ImarisWriter (https://github.com/imaris/ImarisWriter) show that it is indeed possible to write data using the low-level H5Dwrite_chunk/H5Dread_chunk calls and then use standard tools (e.g. h5view) to read the data out.  Although we need to dig a bit deeper into this, I think we should use the HDF5 plugin mechanism for that.  That means that even if high performance can only be achieved from the HDF5 application itself (in this case PyTables), at least other libraries (e.g. h5py) would still be able to read the data (in theory, at least ;-).

Best,
Francesc

Francesc Alted

Jul 28, 2022, 3:10:09 PM
to pytabl...@googlegroups.com
Hi Developers,

This is to inform you that we have been awarded the Small Development Grant from the NumFOCUS foundation, although this time only a partial amount ($4300 USD vs the $9500 USD we were asking for).  However, as there were about $3000 USD in the PyTables Collective account (coming from donations; IIRC, a good one was made by a company sponsoring a release that I did some years ago), I think we can still do most of the planned work with the combined $7300 USD.

We have actually started the work, and a first piece of good news is that we have strong evidence that we will be able to create a standard HDF5 plugin so that other HDF5 applications will be able to read data written with the direct chunking mechanism.  So we should have the best of both worlds: very good I/O speed from PyTables, while keeping compatibility with other HDF5 applications :-)

We will keep you informed as we make progress towards project completion.

Cheers!
--
Francesc Alted

Antonio Valentino

Jul 29, 2022, 2:02:59 AM
to pytabl...@googlegroups.com
Hi Francesc,
congratulations on getting the grant.
Looking forward to seeing these new features in PyTables.


kind regards
antonio

--
Antonio Valentino