New NumFOCUS Small Development Grant

Ivan Vilata i Balaguer

Feb 27, 2024, 8:35:21 AM
to PyTables development
Hi PyTables developers,

Recent Blosc2 development experience, integrating new compression
algorithms and filters into Blosc2 (also in the form of plugins) and
using them in HDF5 via h5py, showed us how useful and powerful it is to
read and write HDF5 chunk data directly, bypassing the HDF5 filter
pipeline.

Such "direct chunking" is already available via h5py, and it is also
used internally by PyTables for optimized reading of Blosc2-compressed
datasets, but it is not directly exposed to the user.
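For reference, h5py exposes this through its low-level dataset API; a minimal round trip (a sketch assuming h5py >= 3.0, where `read_direct_chunk` is available) might look like:

```python
# Sketch: direct chunk access via h5py's low-level API (h5py >= 3.0).
import numpy as np
import h5py

data = np.arange(16, dtype="<i4").reshape(4, 4)

with h5py.File("direct.h5", "w") as f:
    # A single 4x4 chunk with no filters, so raw chunk bytes == array bytes.
    dset = f.create_dataset("x", shape=(4, 4), dtype="<i4", chunks=(4, 4))
    # Write the chunk at offset (0, 0) directly, bypassing the filter pipeline.
    dset.id.write_direct_chunk((0, 0), data.tobytes(), filter_mask=0)
    # Read the same chunk back as raw bytes.
    filter_mask, raw = dset.id.read_direct_chunk((0, 0))
```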

This is why we decided to apply for a NumFOCUS Small Development Grant
($10,000) to add such support to the public PyTables API. Below you can
find the proposal that we plan to submit, in case you have any comments
or suggestions.

Thanks a lot, and cheers!

----

# HDF5 direct chunking support for PyTables

## Two Sentence Summary or Proposal

Extend PyTables chunked dataset support to allow users to read and write
HDF5 chunk content directly, avoiding the overhead and limitations of
the HDF5 filter pipeline. Include benchmarks, tests, documentation and
dissemination materials.

## Description of Proposal

PyTables (<https://pytables.org/>) provides efficient, high-level access
to HDF5 datasets. Chunking allows datasets to be compressed using a
series of filters that form a pipeline. The user can configure this
pipeline to some extent, but custom filters must be registered with the
HDF5 development team, and ad hoc C plugins must be loaded for
transparent operation, with very rigid and limited parametrization.
Moreover, the implementation of this pipeline adds noticeable overhead
that can hurt performance.

With funding from NumFOCUS, PyTables was extended with a Blosc2 HDF5
filter plugin (adopted by the hdf5plugin project) and optimized access
to Blosc2-compressed tables and arrays that avoided the HDF5 filter
pipeline, with remarkable performance increases (see
<https://blosc.org/posts/blosc2-pytables-perf/> and
<https://blosc.org/posts/pytables-b2nd-slicing/>).

By supporting direct chunking, PyTables will be able to inject
already-compressed chunks (using any available compressed container,
including Blosc2), bypassing the inefficient HDF5 pipeline, while still
allowing data to be decompressed transparently by a filter available to
HDF5, such as the aforementioned (and already very fine-tuned) one for
Blosc2.
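The pattern described above can already be sketched with h5py, using HDF5's built-in deflate (gzip) filter as a stand-in for Blosc2: the chunk is compressed at the Python level and injected directly, yet it reads back transparently because the dataset's filter pipeline knows how to decompress it.

```python
# Sketch: inject a pre-compressed chunk, then read it back transparently.
import zlib
import numpy as np
import h5py

data = np.arange(64, dtype="<i4").reshape(8, 8)

with h5py.File("inject.h5", "w") as f:
    dset = f.create_dataset("y", shape=(8, 8), dtype="<i4",
                            chunks=(8, 8), compression="gzip")
    # Compress the chunk ourselves and write it directly; the default
    # filter_mask=0 tells HDF5 the stored bytes already went through
    # the whole pipeline.
    dset.id.write_direct_chunk((0, 0), zlib.compress(data.tobytes()))

with h5py.File("inject.h5", "r") as f:
    # Normal slicing decompresses transparently via the deflate filter.
    out = f["y"][()]
```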

With funding from this grant, we would:

1. Implement new interfaces in PyTables' chunked datasets (CArray,
EArray, Table) to read and write chunk data directly (plus support
for gathering information about chunks). This includes creating unit
tests and a new PyTables release.

2. Carry out benchmarks to compare the performance of (i) filter-based,
(ii) optimized, and (iii) direct reading of chunk data.

3. Document the new interfaces in the reference section of the User
Manual, and add usage instructions or examples in the “Tutorial” or
“Cookbook” sections.

4. Publicize the new feature and benchmark results in a blog post, in
mailing lists and social networks.
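As a purely illustrative sketch, the interfaces in item 1 might look roughly like the following; all the method names here (`chunk_info`, `read_chunk`, `write_chunk`) are hypothetical placeholders, to be settled during the API specification phase:

```python
# Hypothetical sketch only; these methods do not exist yet.
import tables

with tables.open_file("data.h5", "a") as f:
    carray = f.root.compressed_data

    # Gather information about the chunk containing a given coordinate.
    info = carray.chunk_info((0, 0))

    # Read the raw (still compressed) bytes of that chunk.
    raw = carray.read_chunk((0, 0))

    # Write pre-compressed bytes directly as a chunk, bypassing the
    # HDF5 filter pipeline.
    carray.write_chunk((0, 0), raw)
```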

## Benefit of this proposal (project, scientific ecosystem, community)

Currently, PyTables only allows creating datasets with standard HDF5
filters, plus Blosc2 compressors (via a custom C plugin). The
configuration of the latter is further restricted due to HDF5's limited
filter parametrization, barring access to some advanced algorithms like
JPEG2000.

On the other hand, compressed containers like Blosc2 frames include all
the parameters needed to decompress their data. With just a hint about
the container type in the HDF5 filter configuration, and with the
desired support for direct chunk read/write, a PyTables user should be
able to pack and unpack data with any configuration supported natively
by the compressor, even if HDF5 itself cannot encode it. This would
provide *improved flexibility* to users requiring very specialized or
experimental filters and configurations (Blosc2 or otherwise), as is the
case in demanding scientific applications, making the project useful in
a wider range of scenarios.

This flexibility would extend to choosing which operations to implement
in a C plugin or handle at the Python level. For instance, one may
write custom chunks manually from Python, and implement just the
decompression part in a C plugin for transparent reading operations.
This would help developers support new compressors in a progressive
manner.

Last but not least, writing and reading chunks directly may yield
*significant performance gains* for time-critical operations like the
collection of scientific measurements, when only a single custom
compressor/filter like Blosc2 is to be used (as illustrated in the blog
posts linked above).

## Brief Budget Justification

The bulk of development and documentation work for this feature will be
carried out by Ivan Vilata (<https://elvil.net/vcard/>), who was a core
PyTables developer years ago, and was recently tasked with extending
PyTables' support for optimized reading of Blosc2-compressed datasets to
also cover n-dimensional arrays.

Francesc Alted (<https://github.com/FrancescAlted/>) will act as a
mentor, applying his long-time experience with large dataset handling to
ensure that the implementation includes meaningful and comprehensive
tests and benchmarks, and helping with the dissemination of results.

The whole amount of the grant will be spent on stipends for each person
as follows:

- Ivan Vilata: $7500 (150 hours, $50/hour)
- Francesc Alted: $2500 (50 hours, $50/hour)

## Timeline of deliverables

- May 2nd, 2024:
  - Analysis of requirements.
- May 15th, 2024:
  - Public specification of the API for querying, reading and writing
    direct chunks.
- May 22nd, 2024:
  - Unit tests exercising the direct chunking API in the current
    PyTables test suite.
- July 5th, 2024:
  - Implementation of the direct chunking API in PyTables chunked
    datasets.
- July 19th, 2024:
  - Benchmark results and performance fixes.
- July 31st, 2024:
  - Updated User Manual sections with usage and optimization tips.
- September 4th, 2024:
  - Blog post with coding example and benchmark results.
  - PyTables release with the new API.
  - Public announcements in relevant channels.

## Project team

- Ivan Vilata: Recent member of the Blosc team and PyTables developer
  who returned to the project after many years.
- Francesc Alted: Creator and BDFL of the Blosc project.

## How will someone be identified to carry out the work?

We plan to use GitHub to carry out the development of the tasks listed
here, with the commit history serving as a record of the work done. In
addition, the blog posts (published at <https://blosc.org>) will be
signed by their authors, providing another way to identify the people
doing the work.

----

--
Ivan Vilata i Balaguer -- https://elvil.net/

Antonio Valentino

Feb 27, 2024, 1:46:45 PM
to pytabl...@googlegroups.com, Ivan Vilata i Balaguer
Dear Ivan,
thanks for preparing the proposal.
It is an extremely interesting enhancement for PyTables.

kind regards
antonio
