ANN: python-blosc2 4.3.0 is out!

Francesc Alted

May 18, 2026, 12:57:02 PM

to Blosc, pyd...@googlegroups.com, pytabl...@googlegroups.com

Announcing Python-Blosc2 4.3.0
===============================

We are happy to announce Python-Blosc2 4.3.0. This release deepens the
``CTable`` container with three major additions: **N-dimensional columns**,
**group-by aggregation**, and **dictionary/categorical column types**.

N-dimensional columns let each cell in a CTable hold a full compressed
multidimensional array, which is ideal for embedding vectors, image patches,
time-series windows or any per-row tensor payload. These columns support
CSV/DataFrame round-trips and nullable semantics, and they are automatically
detected when importing from pandas.

N-dimensional columns inherit the full power of Blosc2 ``NDArray`` objects.
They are not opaque blobs, but first-class arrays that support slicing,
indexing, and full Blosc2 operations, including multi-threaded and
SIMD-accelerated expression evaluation. This gives ``CTable`` a strong bridge
between tabular and array workflows in a single container.

See an example at: https://github.com/Blosc/python-blosc2/blob/main/examples/ctable/ndim-cols.py

Group-by brings SQL-style ``GROUP BY`` directly to CTables::

    by_city = t.group_by("city", sort=True)
    by_city.agg({"sales": ["sum", "mean"], "qty": "sum"})

Multi-key groupings, filtered aggregates (``where=`` pushdown), and persistent
output (``urlpath=``) are all supported. Behind the scenes, Cython-accelerated
kernels deliver speedups of about 25× for float keys and about 8× for integer
keys, backed by dense-indexing and general-purpose hash-table paths.
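
To make the semantics concrete, here is a minimal pure-Python sketch of what a hash-table group-by with a pre-filter computes. It does not use the Blosc2 API: the ``group_by_agg`` helper, the sample rows and the output layout are all hypothetical, and ``sort=True`` is assumed to mean "order the groups by key".

```python
from collections import defaultdict

# Hypothetical sample rows; the column names mirror the snippet above.
rows = [
    {"city": "Paris", "sales": 10.0, "qty": 1},
    {"city": "Oslo", "sales": 4.0, "qty": 2},
    {"city": "Paris", "sales": 30.0, "qty": 3},
    {"city": "Oslo", "sales": 6.0, "qty": 5},
]


def group_by_agg(rows, keys, where=None):
    """Hash-table group-by: one accumulator per distinct key tuple."""
    acc = defaultdict(lambda: {"sales_sum": 0.0, "qty_sum": 0, "count": 0})
    for row in rows:
        # Filter before grouping, which is the effect a where= pushdown has.
        if where is not None and not where(row):
            continue
        slot = acc[tuple(row[k] for k in keys)]
        slot["sales_sum"] += row["sales"]
        slot["qty_sum"] += row["qty"]
        slot["count"] += 1
    # Emit groups ordered by key (the assumed meaning of sort=True).
    return {
        key: {
            "sales_sum": s["sales_sum"],
            "sales_mean": s["sales_sum"] / s["count"],
            "qty_sum": s["qty_sum"],
        }
        for key, s in sorted(acc.items())
    }


by_city = group_by_agg(rows, ["city"])
big_orders = group_by_agg(rows, ["city"], where=lambda r: r["qty"] >= 3)
print(by_city)
print(big_orders)
```

Passing several names in ``keys`` gives a multi-key grouping, since each group is identified by a tuple of key values.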

Also, ``DictionarySpec`` introduces dictionary-encoded (categorical) columns
that store compact integer codes mapped to a shared string dictionary, giving
both compact storage and accelerated equality/membership queries. Dictionary
columns work transparently in ``where`` clauses and nested dotted-name
expressions.
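
The idea behind dictionary encoding can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the ``DictionarySpec`` implementation; the helper names below are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): unique strings plus one small int per row."""
    dictionary = []  # code -> string
    index = {}       # string -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def rows_equal_to(dictionary, codes, target):
    """Equality query: one string lookup, then integer comparisons only."""
    if target not in dictionary:
        return []
    code = dictionary.index(target)
    return [i for i, c in enumerate(codes) if c == code]


def rows_in(dictionary, codes, targets):
    """Membership query: resolve the wanted codes once, then scan the codes."""
    wanted = {c for c, s in enumerate(dictionary) if s in targets}
    return [i for i, c in enumerate(codes) if c in wanted]


cities = ["Paris", "Oslo", "Paris", "Rome", "Oslo", "Paris"]
dictionary, codes = dictionary_encode(cities)
print(dictionary, codes)
print(rows_equal_to(dictionary, codes, "Paris"))
print(rows_in(dictionary, codes, {"Oslo", "Rome"}))
```

Each distinct string is stored once, and queries compare integers instead of strings, which is where both the storage and the speed benefits come from.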

Other highlights in 4.3.0 include:

- **Nested columns and field-name escaping**: Columns from Arrow/Parquet struct
  hierarchies are flattened into physical leaf columns under hierarchical
  ``_cols`` storage paths, with logical dotted-name access preserved.
  Round-trip fidelity is maintained for nested schemas, and literal ``.`` and
  ``/`` characters in field names are automatically escaped.

- **Parquet import improvements**: The Arrow serializer is now the default;
  nested columns are always separated; new ``--progress``/``--max-rows``/
  ``--timestamp-unit``/``--float-trunc-prec`` options for the
  ``parquet-to-blosc2`` CLI; and a ``list_serializer`` parameter for fine-tuning
  list-type column storage.

- **Inline CTable support in TreeStore**: CTables can now be stored as items
  inside a ``TreeStore``, enabling hierarchical containers that mix arrays
  and tables.

- **Performance wins**: ``CTable.open()`` is faster thanks to lazy ``nrows``
  and deferred column metainfo loading. Scalar and small-slice access paths
  have been overhauled. ``import blosc2`` is leaner via late-import
  optimizations for heavy optional dependencies.

- **New tutorials and examples**: Group-by, nested fields, dictionary columns,
  TreeStore–CTable integration, and dedicated benchmarks for group-by,
  nested-filter, and Parquet round-trips.

- **Fixes**: Null/NaN sentinel normalization, empty aggregate results,
  generated-column safety, miniexpr bundling, and more.

- **Updated C-Blosc2** to version 3.0.3 (latest).
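
As a rough illustration of the nested-column flattening mentioned in the list above, the sketch below maps a nested record to dotted leaf names and then to storage paths. The percent-style escaping and the exact ``_cols/...`` layout are assumptions made for this example; the announcement does not specify the scheme Python-Blosc2 actually uses.

```python
def escape(name):
    # Hypothetical scheme: percent-encode the characters that would clash
    # with the "." (logical) and "/" (storage) separators.
    return name.replace("%", "%25").replace(".", "%2E").replace("/", "%2F")


def flatten(record, prefix=()):
    """Map a nested dict to {logical dotted name: leaf value}."""
    leaves = {}
    for name, value in record.items():
        path = prefix + (escape(name),)
        if isinstance(value, dict):
            leaves.update(flatten(value, path))
        else:
            leaves[".".join(path)] = value
    return leaves


def storage_path(dotted_name):
    # Hypothetical layout: leaf columns live under a "_cols" root.
    return "_cols/" + dotted_name.replace(".", "/")


row = {"user": {"name": "Ada", "geo": {"lat": 1.5}}, "a.b": 7}
leaves = flatten(row)
for dotted in leaves:
    print(dotted, "->", storage_path(dotted))
```

Escaping the separators first is what keeps a field literally named ``a.b`` from being confused with a field ``b`` nested inside ``a``.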

You can think of Python-Blosc2 4.x as an extension of NumPy/numexpr that:

- Can deal with NDArray compressed objects using first-class codecs & filters.
- Performs many kinds of math expressions, including reductions, indexing...
- Supports multi-threading and SIMD acceleration (via numexpr/miniexpr).
- Can operate with data from other libraries (like PyTables, h5py, Zarr, Dask, etc.).
- Supports NumPy ufunc mechanism: mix and match NumPy and Blosc2 computations.
- Integrates with Numba and Cython via UDFs (User Defined Functions).
- Adheres to modern array API standard conventions (https://data-apis.org/array-api/).
- Can perform linear algebra operations (like ``blosc2.tensordot()``).
- Can store and query compressed columnar tables via ``blosc2.CTable``.

Install it with::

    pip install blosc2 --upgrade  # if you prefer wheels
    conda install -c conda-forge python-blosc2 mkl  # if you prefer conda and MKL

For more info, you can have a look at the release notes at:

https://github.com/Blosc/python-blosc2/releases

Small CTable group-by example::

    import blosc2
    from dataclasses import dataclass

    @dataclass
    class Order:
        city: str = blosc2.field(blosc2.string(max_length=16))
        product: str = blosc2.field(blosc2.string(max_length=16))
        qty: int = blosc2.field(blosc2.int32())
        price: float = blosc2.field(blosc2.float64(nullable=True), default=0.0)

    # Create a table with 200 random orders
    # (`orders` must be defined beforehand; its construction is omitted here)
    t = blosc2.CTable(Order, new_data=orders)

    # Group by city: total and average price, plus total quantity, in one call
    print(t.group_by("city", sort=True).agg({"price": ["sum", "mean"], "qty": "sum"}))

    # Multi-key: city + product breakdown
    print(t.group_by(["city", "product"], sort=True).agg({"qty": "sum", "price": "mean"}))

    # Filtered: only Widget orders, grouped by city
    print(t.where(t.product == "Widget").group_by("city", sort=True).agg({"qty": "sum"}))
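
The ``orders`` input used above is not defined in the snippet. One possible way to build it with the standard library is sketched below, assuming that ``new_data`` accepts a list of tuples in the field order of ``Order``; that input format, as well as the city and product names, are assumptions made for illustration.

```python
import random

random.seed(0)  # reproducible sample data
cities = ["Paris", "Oslo", "Rome", "Lima"]   # hypothetical values
products = ["Widget", "Gadget", "Gizmo"]     # "Widget" matches the filter above

# One tuple per row, in the field order of Order: (city, product, qty, price).
orders = [
    (
        random.choice(cities),
        random.choice(products),
        random.randint(1, 10),
        round(random.uniform(5.0, 100.0), 2),
    )
    for _ in range(200)
]
print(len(orders))
```

All strings stay well under the ``max_length=16`` declared in the schema.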

Sources repository
------------------

The sources and documentation are managed through GitHub services at:

https://github.com/Blosc/python-blosc2

Python-Blosc2 is distributed using the BSD license, see
https://github.com/Blosc/python-blosc2/blob/main/LICENSE.txt
for details.

Mastodon feed
-------------

Follow https://fosstodon.org/@Blosc2 to get informed about the latest
developments.

Enjoy!

- Blosc Development Team
Compress Better, Compute Bigger
