Hi everyone! I'm happy to announce that our proposal to NumFOCUS for adding
HDF5 direct chunking support to PyTables was accepted, and we have just
started working on it.
With this email I'd like to summarize the requirements that Francesc and I saw
for this support, and propose a simple extension to the PyTables API that
fulfills these requirements. It should be familiar to h5py users, though it
is a little more streamlined to avoid extra features and align with current
PyTables style.
Your opinions are welcome!
## Requirements
- To be added to table and chunked/extendable array objects.
- Added directly as properties and methods, so as to keep the interface minimal.
- Similar to h5py in semantics to ease adoption by familiarity, however:
  - Simple, explicit access to chunks only by coordinates (see the small
    sketch after this list).
    - Not by chunk index (may be out of order, with gaps… not very useful).
  - No internal support for advanced slicing, i.e. no
    `dataset.iter_chunks()` (with or without selection), unless there is
    spare time.
  - Handle the chunk's filter mask to avoid misinterpreting chunk data.
  - Not implementing a whole separate `dataset.id` object.
- Dataset interfaces:
  - Basic: Read a chunk given its start coordinates.
  - Basic: Write a chunk given its start coordinates.
  - Support: Get chunk information given a coordinate, return at least start
    coordinates & byte offset (chunk may not exist).
    - Byte offset allows decompressors to re-open HDF5 file and read directly
      from that offset.
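As a small illustration of the coordinate-based addressing above (not part of
the proposal itself), the start coordinates of the chunk containing a given
element are just its coordinates floored to multiples of the chunk shape:
```python
def chunk_start(coords, chunkshape):
    """Start coordinates of the chunk containing the element at `coords`."""
    return tuple(c - c % cs for c, cs in zip(coords, chunkshape))

assert chunk_start((2, 15), (10, 10)) == (0, 10)
```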
## API draft
New methods are added to the `Leaf` class as it already includes some support
for chunks (even if the actual dataset is not chunked, i.e. a plain `Array`);
this adds support to the target classes `Table`, `CArray` and `EArray` in a
single place (and also to `VLArray`, which should be ok as the methods are
blind to the atom/dtype):
```python
from collections.abc import Buffer  # Python >= 3.12
from typing import NamedTuple, TypeAlias

from numpy import uint8
from numpy.typing import NDArray

Coords: TypeAlias = tuple[int, ...]  # in dataset units


class ChunkInfo(NamedTuple):  # compatible with StoreInfo
    start: Coords | None  # None for missing chunk
    filter_mask: int
    offset: int | None  # in storage bytes, None for missing chunk
    size: int  # raw size in storage


class Leaf:
    ...

    def chunk_info(self, coords: Coords) -> ChunkInfo:
        ...

    def read_chunk(self, coords: Coords,
                   out: bytearray | NDArray[uint8] | None = None,
                   ) -> bytes | memoryview:
        ...

    def write_chunk(self, coords: Coords, data: Buffer,
                    filter_mask: int = 0) -> None:
        ...
```
Calling any of the new methods on a non-chunked dataset raises an
`HDF5ExtError`.
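For instance, calling them on a plain (non-chunked) `Array` would fail; a
minimal sketch of the intended behaviour, assuming the methods land on `Leaf`
as drafted above:
```python
import numpy
import tables

with tables.open_file("plain.h5", "w") as h5f:
    arr = h5f.create_array("/", "a", numpy.arange(10))  # contiguous, not chunked
    try:
        arr.chunk_info((0,))
    except tables.HDF5ExtError:
        print("not a chunked dataset")
```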
`chunk_info(coords)` returns a `ChunkInfo` instance with information about the
chunk that contains the element at the given `coords`: the dataset coordinates
where the chunk starts, its filter mask, and its byte offset and size in the
file. It is similar to h5py's [get_chunk_info_by_coord][] and `StoreInfo`. If
there is no such chunk in storage (or it is beyond maximum dataset shape), a
`ChunkInfo` with `start = offset = None` is returned.
[get_chunk_info_by_coord]:
https://api.h5py.org/h5d.html#h5py.h5d.DatasetID.get_chunk_info_by_coord
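To make the missing-chunk case concrete, a small sketch (assuming `leaf` is a
chunked dataset where the queried chunk may not have been written yet):
```python
info = leaf.chunk_info((0, 0))
if info.start is None:
    print("no such chunk in storage yet")
else:
    print(f"chunk starts at {info.start}: "
          f"{info.size} bytes at file offset {info.offset}")
```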
`read_chunk(coords[, out])` reads the chunk that starts at the given `coords`
and returns its raw bytes. It is similar to h5py's [read_direct_chunk][],
without the data transfer properties or the filter mask return value. The
encoded data is supposed to have gone through the dataset filters minus those
in the chunk's filter mask (use `chunk_info()` to get it). If `out` is given,
the read bytes are placed there and a memory view of it is returned
(`chunk_info()` may help find the minimum capacity of `out`). If `coords` are
not multiples of the chunk shape, an `HDF5ExtError` is raised; reading a chunk
which is missing from storage also raises an `HDF5ExtError`.
[read_direct_chunk]:
https://api.h5py.org/h5d.html#h5py.h5d.DatasetID.read_direct_chunk
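A possible pattern for reusing a preallocated buffer, sizing it from
`chunk_info()` (a sketch only, assuming the proposed semantics above):
```python
info = leaf.chunk_info((0, 10))
out = bytearray(info.size)                 # minimum capacity for this chunk
view = leaf.read_chunk((0, 10), out=out)   # memoryview over `out`
assert bytes(view) == leaf.read_chunk((0, 10))  # same raw bytes either way
```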
`write_chunk(coords, data[, filter_mask])` writes the bytes in the `data`
buffer as the raw bytes of the chunk that starts at the given `coords`. It is
similar to h5py's [write_direct_chunk][], without the data transfer
properties. The encoded data is supposed to have already gone through the
dataset filters minus those in the given `filter_mask` (which is saved along
with the data). If `coords` are not multiples of the chunk shape, an
`HDF5ExtError` is raised. Writing a chunk beyond the maximum dataset shape
also raises an `HDF5ExtError` (extendable dimensions in PyTables are
unbounded, but the check still applies to the finite ones).
[write_direct_chunk]:
https://api.h5py.org/h5d.html#h5py.h5d.DatasetID.write_direct_chunk
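For illustration, a small sketch (not part of the proposal) of using the
filter mask when storing a chunk that skips the dataset's single compression
filter, so its raw, unfiltered bytes go straight to storage:
```python
import numpy

raw = numpy.zeros(leaf.chunkshape, dtype=leaf.dtype).tobytes()
# Bit 0 set means the first filter in the pipeline was *not* applied here.
leaf.write_chunk((20, 0), raw, filter_mask=0b1)
assert leaf.chunk_info((20, 0)).filter_mask == 0b1
```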
## Example usage
```python
>>> leaf.shape
(100, 20)
>>> leaf.chunkshape
(10, 10)
>>> chunk_info = leaf.chunk_info((2, 15))
>>> chunk_info
ChunkInfo(start=(0, 10), filter_mask=0, offset=8952, size=42)
>>> chunk = leaf.read_chunk((0, 10))
>>> chunk
b'...'
>>> array1 = my_decompress(chunk)
>>> array2 = my_decompress_from_file(leaf._v_file.filename,
... offset=chunk_info.offset)
>>> numpy.array_equal(array1, array2)
True
>>> array = numpy.arange(100, dtype=leaf.dtype).reshape(leaf.chunkshape)
>>> chunk = my_compress(array)
>>> leaf.write_chunk((0, 10), chunk)
```
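For completeness, here is a rough sketch of what the hypothetical
`my_compress`/`my_decompress`/`my_decompress_from_file` helpers above might
look like, assuming compression is done outside of HDF5 with plain `zlib` and
that the dataset's own filter pipeline does not interfere (the extra `size`,
`dtype` and `chunkshape` parameters are only there to keep the sketch
self-contained):
```python
import zlib

import numpy


def my_compress(array):
    """Compress a chunk-shaped array into raw chunk bytes."""
    return zlib.compress(array.tobytes())


def my_decompress(chunk, dtype, chunkshape):
    """Decode raw chunk bytes back into an array."""
    return numpy.frombuffer(zlib.decompress(chunk), dtype=dtype).reshape(chunkshape)


def my_decompress_from_file(filename, offset, size, dtype, chunkshape):
    """Read raw chunk bytes straight from the HDF5 file, bypassing the library."""
    with open(filename, "rb") as f:
        f.seek(offset)
        chunk = f.read(size)
    return my_decompress(chunk, dtype, chunkshape)
```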
--
Ivan Vilata i Balaguer -- https://elvil.net/