Requirements & API proposal for direct chunking support


Ivan Vilata i Balaguer

May 2, 2024, 11:29:05 AM
to PyTables development
Hi everyone! I'm happy to announce that our proposal to NumFOCUS for adding
HDF5 direct chunking support for PyTables was granted, and we just started
working on it.

With this email I'd like to summarize the requirements that Francesc and I saw
for this support, and propose a simple extension to the PyTables API that
fulfills these requirements. It should be familiar to h5py users, though it
is a little more streamlined to avoid extra features and align with current
PyTables style.

Your opinions are welcome!

## Requirements

- To be added to table and chunked/extendable array objects.
- Directly as properties and methods, so keep the interface minimal.
- Similar to h5py in semantics to ease adoption by familiarity, however:
  - Simple explicit access to chunks only by coordinates.
    - Not by chunk index (may be out of order, with gaps… not very useful).
  - No internal support for advanced slicing, i.e. no
    `dataset.iter_chunks()` (with or without selection), unless there is
    spare time.
  - Handle the chunk's filter mask to avoid misinterpreting chunk data.
  - Not implementing a whole separate `dataset.id` object.
- Dataset interfaces:
  - Basic: Read a chunk given its start coordinates.
  - Basic: Write a chunk given its start coordinates.
  - Support: Get chunk information given a coordinate, returning at least
    start coordinates & byte offset (the chunk may not exist).
    - The byte offset allows decompressors to re-open the HDF5 file and read
      directly from that offset (see the sketch after this list).
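
For illustration, here is a minimal sketch of that last use case, reading a
chunk's raw bytes straight from the file without going through HDF5. The
`info` value is assumed to come from the `chunk_info()` call proposed below,
and `h5_filename` and `my_decompress()` are hypothetical names for this sketch:

```python
# Sketch only: `h5_filename`, `info` and `my_decompress()` are assumptions,
# not part of any released PyTables API.
with open(h5_filename, "rb") as f:
    f.seek(info.offset)      # byte offset of the chunk within the HDF5 file
    raw = f.read(info.size)  # raw (still compressed/filtered) chunk bytes
array = my_decompress(raw)   # decode outside of HDF5, e.g. with Blosc2
```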

## API draft

New methods are added to the `Leaf` class as it already includes some support
for chunks (even if the actual dataset is not chunked, i.e. a plain `Array`);
this adds support to the target classes `Table`, `CArray` and `EArray` in a
single place (and also to `VLArray`, which should be ok as the methods are
blind to the atom/dtype):

```python
Coords: TypeAlias = tuple[int, ...]  # in dataset units

class ChunkInfo(NamedTuple):  # compatible with StoreInfo
    start: Coords | None  # None for missing chunk
    filter_mask: int
    offset: int | None  # in storage bytes, None for missing chunk
    size: int  # raw size in storage

class Leaf:
    ...
    def chunk_info(self, coords: Coords) -> ChunkInfo:
        ...
    def read_chunk(self, coords: Coords,
                   out: bytearray | NDArray[uint8] | None = None,
                   ) -> bytes | memoryview:
        ...
    def write_chunk(self, coords: Coords, data: Buffer,
                    filter_mask: int = 0) -> None:
        ...
```

Calling any of the new methods on a non-chunked dataset raises an
`HDF5ExtError`.

`chunk_info(coords)` returns a `ChunkInfo` instance with information about the
chunk that contains the element at the given `coords`: the dataset coordinates
where the chunk starts, its filter mask, and its byte offset and size in the
file. It is similar to h5py's [get_chunk_info_by_coord][] and `StoreInfo`. If
there is no such chunk in storage (or it is beyond the maximum dataset shape),
a `ChunkInfo` with `start = offset = None` is returned.

[get_chunk_info_by_coord]: https://api.h5py.org/h5d.html#h5py.h5d.DatasetID.get_chunk_info_by_coord

`read_chunk(coords[, out])` reads the chunk that starts at the given `coords`
and returns its raw bytes. It is similar to h5py's [read_direct_chunk][],
without data transfer properties or a filter mask return value. The encoded
data is expected to have gone through the dataset filters minus those in the
chunk's filter mask (use `chunk_info()` to get it). If `out` is given, the
read bytes are put there and a memoryview of it is returned (`chunk_info()`
may help find the minimum capacity of `out`). If `coords` are not multiples
of the chunk shape, an `HDF5ExtError` is raised. Reading a missing chunk also
raises an `HDF5ExtError`.

[read_direct_chunk]: https://api.h5py.org/h5d.html#h5py.h5d.DatasetID.read_direct_chunk
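
For instance, `chunk_info()` can be used to size a reusable buffer before the
read. A minimal sketch against the draft API above, where the file name and
node are made up for illustration:

```python
import tables as tb

# Sketch only: `data.h5` and `/carray` are hypothetical, and the methods
# below follow the draft in this thread, not a released PyTables API.
with tb.open_file("data.h5", mode="r") as h5f:
    leaf = h5f.root.carray
    info = leaf.chunk_info((0, 10))    # chunk containing element (0, 10)
    if info.offset is not None:        # the chunk exists in storage
        out = bytearray(info.size)     # capacity needed for the raw chunk
        view = leaf.read_chunk(info.start, out=out)  # memoryview into `out`
```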

`write_chunk(coords, data[, filter_mask])` writes the bytes in the `data`
buffer as the raw bytes of the chunk that starts at the given `coords`. It is
similar to h5py's [write_direct_chunk][], without data transfer properties.
The encoded data is expected to have gone through the dataset filters minus
those in the given `filter_mask` (which is saved along with the data). If
`coords` are not multiples of the chunk shape, an `HDF5ExtError` is raised.
Writing a chunk beyond the maximum dataset shape raises an `HDF5ExtError`
(extendable dimensions in PyTables are infinite, but check for finite ones
anyway).

[write_direct_chunk]: https://api.h5py.org/h5d.html#h5py.h5d.DatasetID.write_direct_chunk

## Example usage

```python
>>> leaf.shape
(100, 20)
>>> leaf.chunkshape
(10, 10)
>>> chunk_info = leaf.chunk_info((2, 15))
>>> chunk_info
ChunkInfo(start=(0, 10), filter_mask=0, offset=8952, size=42)
>>> chunk = leaf.read_chunk((0, 10))
>>> chunk
b'...'
>>> array1 = my_decompress(chunk)
>>> array2 = my_decompress_from_file(leaf._v_file.filename,
... offset=chunk_info.offset)
>>> numpy.array_equal(array1, array2)
True
>>> array = numpy.arange(100, dtype=leaf.dtype).reshape(leaf.chunkshape)
>>> chunk = my_compress(array)
>>> leaf.write_chunk((0, 10), chunk)
```

--
Ivan Vilata i Balaguer -- https://elvil.net/

Antonio Valentino

May 4, 2024, 3:37:40 AM
to pytabl...@googlegroups.com
Hi Ivan,
congratulations for the grant, and thanks for sharing the details of
your plan.

cheers
antonio



Ivan Vilata i Balaguer

May 7, 2024, 6:08:21 AM
to pytabl...@googlegroups.com
Thanks, Antonio! I guess that tomorrow I'll create a new branch and start
adding unit tests with the proposed API. I'll wait a little more for feedback
on the API, but during the next week I'll freeze it so as to stabilize tests
and start with the implementation later on.

Cheers!


Antonio Valentino (2024-05-04 09:37:36 +0200) wrote:

> Hi Ivan,
> congratulations for the grant, and thanks for sharing the details of your
> plan.
>
> cheers
> antonio
>
> On 02/05/24 17:29, Ivan Vilata i Balaguer wrote: […]

Ivan Vilata i Balaguer

May 8, 2024, 12:30:26 PM
to pytabl...@googlegroups.com
Hi! While writing unit tests I noticed that the new methods abuse the
`HDF5ExtError` exception, which should be reserved for Cython code. I propose
defining the following exceptions:

```python
class ChunkError(ValueError):
    ...

class NotChunkedError(ChunkError):
    ...

class NotChunkAlignedError(ChunkError):
    ...

class NoSuchChunkError(ChunkError):
    ...
```

And the following behaviour changes:

Calling any of the new methods on a non-chunked dataset raises a
`NotChunkedError`.

`chunk_info(coords)`: If there is no such chunk in storage, a `ChunkInfo` with
`start = offset = None` is returned. **Note:** I'm considering raising
`ChunkError` when specifying a chunk beyond maximum dataset shape (this is in
contrast with h5py, which doesn't differentiate a missing chunk within
`maxshape` from a chunk beyond it, and returns an invalid `StoreInfo` in both
cases).

`read_chunk(coords[, out])`: If `coords` are not multiples of chunk shape,
raise a `NotChunkAlignedError`. Reading a chunk which is missing raises a
`NoSuchChunkError`.

`write_chunk(coords, data[, filter_mask])`: If `coords` are not multiples of
chunk shape, raise a `NotChunkAlignedError`. Writing a chunk beyond maximum
dataset shape raises a `ChunkError`.
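
With these exceptions, calling code can tell the failure modes apart. A
sketch, assuming the new classes end up exported at the package level like
other PyTables exceptions (`leaf` is illustrative):

```python
import tables as tb

# Hedged sketch: the exception names follow the proposal above and are
# assumed to be importable from the `tables` package once implemented.
try:
    chunk = leaf.read_chunk((3, 7))  # (3, 7) not aligned to the chunk shape
except tb.NotChunkedError:
    chunk = None  # e.g. a plain Array: the dataset has no chunks at all
except tb.NotChunkAlignedError:
    raise         # caller bug: coords must be multiples of leaf.chunkshape
except tb.NoSuchChunkError:
    chunk = None  # coordinates are fine, but the chunk was never written
```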

Comments welcome. Cheers!

Antonio Valentino

May 9, 2024, 2:15:35 AM
to pytabl...@googlegroups.com, Ivan Vilata i Balaguer
Dear Ivan,
I think that the Exception hierarchy that you propose is good.

By the way, HDF5ExtError has specific machinery to retrieve and
display the (HDF5) traceback of errors generated in the HDF5 C library.
If you need to manage potential errors generated in the HDF5 C library, then
you should in any case inherit from HDF5ExtError IMHO.


kind regards
antonio


Ivan Vilata i Balaguer

May 9, 2024, 3:16:08 AM
to pytabl...@googlegroups.com
Thanks Antonio for the heads up, that's good to know!

I think that all the conditions for which a `ChunkError` is to be raised can
be safely checked *before* performing the relevant HDF5 library call in the
extension, so there should be no need to worry about a library traceback for
these errors.

Cheers,


Antonio Valentino (2024-05-09 08:15:31 +0200) wrote:

> Dear Ivan,
> I think that the Exception hierarchy that you propose is good.
>
> By the way, HDF5ExtError has specific machinery to retrieve and display
> the (HDF5) traceback of errors generated in the HDF5 C library.
> If you need to manage potential errors generated in the HDF5 C library, then
> you should in any case inherit from HDF5ExtError IMHO.
>
>

Ivan Vilata i Balaguer

May 16, 2024, 6:48:12 AM
to pytabl...@googlegroups.com
Some more minor updates to the proposed API while writing the (nearly
complete) unit tests:

`read_chunk(coords[, out])`: Reading a chunk beyond maximum dataset shape
raises a `ChunkError`. If `out` has insufficient storage for the read chunk,
raise a `ValueError`.

A note on `write_chunk()`: `Leaf.truncate()` can be used to grow a dataset
cheaply along its enlargeable dimension, as it doesn't write new chunks; new
chunks with actual data may then be written with `Leaf.write_chunk()` (see the
sketch below). Doing it the other way around should also work (it does with
h5py) and show the data in the new chunks; however, shrinking a dataset dumps
chunks beyond the new shape (and truncating to the current shape is a no-op).
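
A minimal sketch of that grow-then-fill pattern, assuming a 2-D `EArray`
enlargeable along its first dimension, with the current number of rows already
chunk-aligned and `my_compress()` standing in for a compressor that matches
the dataset's filter pipeline:

```python
import numpy as np

# Sketch only: grow first (cheap, writes no chunks), then fill directly.
old_nrows = earray.nrows  # assumed to be a multiple of the chunk size
earray.truncate(old_nrows + earray.chunkshape[0])
data = np.ones(earray.chunkshape, dtype=earray.dtype)
earray.write_chunk((old_nrows, 0), my_compress(data))  # fill the new chunk
```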

(As an unrelated aside, I noticed that classes like `Table` or `EArray` may
currently need to extend `truncate` with some maintenance actions as done in
`append` or `remove_rows` operations, e.g. to invalidate indexes.)

Unless I find some very weird situation while adding `Table` unit tests, I
consider the new API as frozen and will start adding docstrings ASAP.

BTW, you may check code progress in this branch:
<https://github.com/ivilata/PyTables/tree/direct-chunking-api>

Cheers!


Ivan Vilata i Balaguer (2024-05-07 12:08:14 +0200) wrote:

> Thanks, Antonio! I guess that tomorrow I'll create a new branch and start
> adding unit tests with the proposed API. I'll wait a little more for feedback
> on the API, but during the next week I'll freeze it so as to stabilize tests
> and start with the implementation later on.
>
> Antonio Valentino (2024-05-04 09:37:36 +0200) wrote: […]

Ivan Vilata i Balaguer

May 23, 2024, 4:45:11 AM
to pytabl...@googlegroups.com
Hi! I finished adding docstrings for the new classes and methods:

- Exceptions: <https://github.com/ivilata/PyTables/blob/4359cf65782852e4dec58db84f112b8a3b818c68/tables/exceptions.py#L404>
- ChunkInfo: <https://github.com/ivilata/PyTables/blob/4359cf65782852e4dec58db84f112b8a3b818c68/tables/leaf.py#L103>
- Leaf methods: <https://github.com/ivilata/PyTables/blob/4359cf65782852e4dec58db84f112b8a3b818c68/tables/leaf.py#L832>

The only changes were a minor clarification on getting chunk info for
coordinates beyond maximum size but within boundary chunks, and making
`ChunkError` inherit from `Exception` rather than `ValueError`.

With this I'm starting with the actual implementation. We expect this to get
well into the summer, so you may not be hearing from us in a while, but you
may always keep an eye on the repo linked below for activity.

Cheers!


Ivan Vilata i Balaguer (2024-05-16 12:48:05 +0200) wrote:

> Unless I find some very weird situation while adding `Table` unit tests, I
> consider the new API as frozen and will start adding docstrings ASAP.
>
> BTW, you may check code progress in this branch:
> <https://github.com/ivilata/PyTables/tree/direct-chunking-api>
> […]

Antonio Valentino

May 24, 2024, 3:26:34 AM
to pytabl...@googlegroups.com, Ivan Vilata i Balaguer
Dear Ivan,

On 23/05/24 10:45, Ivan Vilata i Balaguer wrote:
> Hi! I finished adding docstrings for the new classes and methods: […]
>
> With this I'm starting with the actual implementation. We expect this to get
> well into the summer, so you may not be hearing from us in a while, but you
> may always keep an eye on the repo linked below for activity.

Thanks for the update.
Of course please feel free to use a branch that is directly on the main
repository if you prefer.


kind regards
--
Antonio Valentino

Ivan Vilata i Balaguer

Jun 5, 2024, 2:58:17 AM
to pytabl...@googlegroups.com
Ivan Vilata i Balaguer (2024-05-16 12:48:05 +0200) wrote:

> Unless I find some very weird situation while adding `Table` unit tests, I
> consider the new API as frozen and will start adding docstrings ASAP.
>
> BTW, you may check code progress in this branch:
> <https://github.com/ivilata/PyTables/tree/direct-chunking-api>

Hi! While doing the implementation I noticed that some of my initial tests
of HDF5 functionality were wrong: writing a chunk beyond the current shape
of an enlargeable array, but still within the array's maximum shape, causes
HDF5 to produce an error.

Since reading a chunk between the current and maximum shape is also not very
intuitive (and kind of unreliable in h5py, causing two different errors for
the same operation), I decided to change the API slightly so that direct
chunking operations beyond the current shape always cause a `ChunkError`
(instead of a `NoSuchChunkError` on read or no error on write, both if the
chunk is still within the maximum shape — remember that native PyTables
enlargeable leaves can grow without bound along a single dimension, whose
maximum size is infinite).

This commit contains all the required changes to docstrings and tests:
<https://github.com/ivilata/PyTables/commit/dfc3daf7515444a91b4c4275529d530e7fd721d7>

Fixes to the implementation will come next.

Cheers,

Ivan Vilata i Balaguer

Jun 6, 2024, 7:34:01 AM
to pytabl...@googlegroups.com
Ivan Vilata i Balaguer (2024-06-05 08:58:08 +0200) wrote:

> Since reading a chunk between the current and maximum shape is also not very
> intuitive (and kind of unreliable in h5py, causing two different errors for
> the same operation), I decided to change the API slightly so that direct
> chunking operations beyond the current shape always cause a `ChunkError`
> (instead of a `NoSuchChunkError` on read or no error on write, both if the
> chunk is still within the maximum shape — remember that native PyTables
> enlargeable leaves can grow without bound along a single dimension, whose
> maximum size is infinite).

After this change I noticed that plain `ChunkError` was only being used to
signal coordinates not within the dataset shape, so I decided to use plain
`IndexError` in that case, which is IMO more standard and expected. The
changes to docstrings, tests and implementation (trivial) are here:
<https://github.com/ivilata/PyTables/commit/708ae792986c39289a89fe026ed092544feee0bf>

Finally, I also decided to make `ChunkInfo.start` always have a valid value,
as it's only returned when the requested coordinates are within the dataset
shape, so it may still be useful e.g. to query the start coordinates of a
missing chunk. Then, to ease detection of a missing chunk, the rest of the
fields in `ChunkInfo` (i.e. `filter_mask`, `offset` and `size`) would be
`None`. Here is the change to docstrings and tests:
<https://github.com/ivilata/PyTables/commit/7556846be7634dfd8212c20c433ae9f8f1db3247>
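
As a small illustration of the resulting semantics (with `leaf` as an
illustrative open dataset): `start` is always valid for coordinates within
the dataset shape, while the storage-related fields are `None` for a chunk
that has no storage yet:

```python
info = leaf.chunk_info((2, 15))
if info.offset is None:                # filter_mask and size are None as well
    print(f"chunk starting at {info.start} has not been written yet")
else:
    raw = leaf.read_chunk(info.start)  # safe: the chunk exists in storage
```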

I hope that no more changes to the proposed API are needed!

Ivan Vilata i Balaguer

Jun 17, 2024, 12:44:37 PM
to pytabl...@googlegroups.com
Antonio Valentino (2024-05-24 09:26:27 +0200) wrote:

> On 23/05/24 10:45, Ivan Vilata i Balaguer wrote:
> >
> > With this I'm starting with the actual implementation. We expect this to get
> > well into the summer, so you may not be hearing from us in a while, but you
> > may always keep an eye on the repo linked below for activity.
>
> Thanks for the update.
> Of course please feel free to use a branch that is directly on the main
> repository if you prefer.

Since the implementation as it sits currently is already passing all new unit
tests and not breaking any additional old ones (at least those not broken by
the NumPy v2.0 release), I moved the branch under the main repo as Antonio
suggested. I still need to perform some code profiling and maybe other
updates, though.

<https://github.com/PyTables/PyTables/tree/direct-chunking-api>

Cheers,

Ivan Vilata i Balaguer

Jul 17, 2024, 6:35:52 AM
to pytabl...@googlegroups.com
Ivan Vilata i Balaguer (2024-06-17 18:44:28 +0200) wrote:

> Since the implementation as it sits currently is already passing all new unit
> tests and not breaking any additional old ones (at least those not broken by
> the NumPy v2.0 release), I moved the branch under the main repo as Antonio
> suggested. I still need to perform some code profiling and maybe other
> updates, though.
>
> <https://github.com/PyTables/PyTables/tree/direct-chunking-api>

Since the test failures caused by NumPy 2 seem to have finally been fixed in
the `master` branch (big thanks to everyone who made it possible!), I took the
chance to merge `master` back into the `direct-chunking-api` branch and apply
the last pending retouches after some profiling. Windows CI runs still fail [1]
because they don't find the Bzip2 library (they do in `master` runs), which is
weird but most probably spurious and not related to this branch.

[1]: https://github.com/PyTables/PyTables/actions/runs/9971898097

Thus we're considering the implementation phase as complete. We're working on
benchmarks and Manual documentation as planned.

Thanks and cheers,

Antonio Valentino

Jul 18, 2024, 1:39:34 AM
to pytabl...@googlegroups.com
Dear Ivan,
thank you for the update

On 17/07/24 12:35, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> Ivan Vilata i Balaguer (2024-06-17 18:44:28 +0200) wrote: […]
>
> Since the test failures caused by NumPy 2 seem to have finally been fixed in
> the `master` branch (big thanks to everyone who made it possible!), I took the
> chance to merge `master` back into the `direct-chunking-api` branch and apply
> the last pending retouches after some profiling. Windows CI runs still fail [1]
> because they don't find the Bzip2 library (they do in `master` runs), which is
> weird but most probably spurious and not related to this branch.
>
> [1]: https://github.com/PyTables/PyTables/actions/runs/9971898097

Please note that full support for NumPy 2 is still not in master; it is in
[PR-1183], which is ready for review.
Please feel free to review it if you have time.

Moreover, [PR-1183] also includes a small fix related to bzip2 on Windows.


[PR-1183] https://github.com/PyTables/PyTables/pull/1183


> Thus we're considering the implementation phase as complete. We're working on
> benchmarks and Manual documentation as planned.
>
> Thanks and cheers,

Fantastic! Thanks a lot.

cheers
--
Antonio Valentino