Writeup on 20 years of PyTables

Francesc Alted

Dec 30, 2022, 7:06:07 AM
to pytabl...@googlegroups.com
Hi Developers,

20 years ago PyTables 0.1 was announced.  I had been meaning to blog about this for a while, and only recently found some time for it (see below).  My memories are probably fuzzy and sometimes even wrong, so please feel free to chime in and suggest changes or additions!

Once we are comfortable with the result we should decide where to publish this.  I'm comfortable adding it to the blosc.org blog, but other proposals are welcome!

Thank you guys!  It is an honor to continue being part of the team.
Francesc


20 years of PyTables

* The beginning

Back in October 2002 the first version of PyTables was released.  It was an attempt to store a large amount of tabular data while providing a hierarchical structure around it.  Here is the first announcement by Francesc Alted:

“””
Hi!,

PyTables is a Python package which allows dealing with HDF5 tables.  Such a table is defined as a collection of records whose values are stored in fixed-length fields.  PyTables is intended to be easy-to-use, and tries to be a high-performance interface to HDF5.  To achieve this, the newest improvements in Python 2.2 (like generators or slots and metaclasses in brand-new classes) have been used.  Pyrex creation extension tool has been chosen to access the HDF5 library.

This package should be platform independent, but until now I’ve tested it only with Linux.  It’s the first public release (v 0.1), and it is in alpha state.
“””

As you can see, PyTables was an early adopter of generators and metaclasses, which had been introduced in the then-new Python 2.2.  Generators went on to prove an excellent tool in many libraries related to data science.  Also, the adoption of Pyrex (which had been released just a few months earlier: http://blog.behnel.de/posts/cython-is-20/) greatly simplified the wrapping of native C libraries like HDF5.

At that time there were few libraries for persisting tabular data in a format that allowed compression, so that gave PyTables a chance to be considered as an alternative to other existing formats.  Some months later, PyCon 2003 accepted the first-ever talk about PyTables (http://www.pytables.org/docs/pycon2003.pdf).  Since then, we (mainly Francesc Alted, with support from Scott Prater on the documentation) gave several presentations at international conferences such as SciPy and EuroSciPy.

* Cárabos Coop. V.

In 2005, after receiving good feedback on PyTables from some customers (including The HDF Group: https://www.hdfgroup.org), we decided to try to make a living out of PyTables development, and together with Vicent Mas (https://vitables.org) and Ivan Vilata (https://elvil.net) set out to create a cooperative called Cárabos.  Unfortunately, after 3 years of hard work, we did not succeed in making a living out of it, and we closed the business in 2008.

However, during this period we managed to build a professional version of PyTables that used indexes (see OPSI indexes) and provided a visual interface called ViTables (https://vitables.org).  Immediately after closing Cárabos we open sourced both technologies, and we are proud to say that they are still in heavy use, especially OPSI indexes, which are used to perform fast queries on very large datasets (http://www.pytables.org/usersguide/optimization.html#indexed-searches).

* Crew renewal and attempt to merge with h5py

After the closure of Cárabos, Francesc Alted continued to maintain PyTables for a while, but in 2010 he expressed his desire to hand over the project.  Shortly after, a new set of people, including Anthony Scopatz and Antonio Valentino (with Andrea Bedini joining shortly after), stepped up and took PyTables over.  This is where open source is strong: whenever a project faces difficulties, there are always people eager to jump on the wagon and keep providing traction for the project.

Meanwhile, the h5py project (http://www.h5py.org) was gaining wide adoption, especially in the community that valued multidimensional arrays more than tables.  There was a feeling that we were duplicating efforts, and in 2016 Andrea Bedini, with the help of Anthony Scopatz, organized a HackFest in Perth, Australia (https://curtinic.github.io/python-and-hdf5-hackfest/) where developers of h5py and PyTables gathered to attempt a merge of the two projects.  After the initial work there, we continued this effort with a grant from NumFOCUS.  Unfortunately, the effort proved complex enough that we could not finish it properly (for the sake of curiosity, the attempt is still available: https://github.com/PyTables/PyTables/pull/634).  At any rate, we encourage people to use either package depending on their needs; for example, Tom Kooij taught a tutorial on both h5py and PyTables at SciPy 2017 (https://github.com/tomkooij/scipy2017).

* Satellite Projects

Like many other open source libraries, PyTables stands on the shoulders of giants, and leverages amazing libraries like HDF5 and NumPy for doing its magic.  In addition, in order to push against the hardware I/O and computational limits, PyTables leverages two high-performance packages: Blosc (https://www.blosc.org) and numexpr (https://github.com/pydata/numexpr).  Blosc is in charge of compressing data efficiently and at very high speeds so as to overcome limits imposed by the I/O subsystem, while numexpr gets maximum performance out of the CPU when processing queries on large tables.  Both projects have been substantially improved out of the needs of PyTables.
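
As a rough illustration of how these two pieces surface in everyday use, here is a minimal sketch, assuming a recent PyTables 3.x built with Blosc support.  The table layout, the column names (sensor_id, value) and the helper build_and_query are made up for this example; what comes from PyTables itself is that the compressor is selected through tables.Filters and that the condition string passed to read_where is evaluated by numexpr.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical record layout, invented for this sketch
ROW_DTYPE = np.dtype([("sensor_id", "i4"), ("value", "f8")])


def build_and_query(path, nrows=10_000):
    """Write a Blosc-compressed table, then run an in-kernel query on it."""
    # Blosc is chosen as the compression library via the Filters object
    filters = tb.Filters(complevel=5, complib="blosc")
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table(
            "/", "readings", description=ROW_DTYPE, filters=filters
        )
        rows = np.empty(nrows, dtype=ROW_DTYPE)
        rows["sensor_id"] = np.arange(nrows) % 100
        rows["value"] = np.linspace(0.0, 1.0, nrows)
        table.append(rows)
        table.flush()
        # The condition is a string, so it can be compiled and evaluated
        # by numexpr instead of row by row in the Python interpreter
        return table.read_where("(sensor_id == 7) & (value > 0.5)")


with tempfile.TemporaryDirectory() as tmp:
    hits = build_and_query(os.path.join(tmp, "demo.h5"))
    print(len(hits), "matching rows")
```

Passing the condition as a string rather than as a Python expression is what lets the query run over whole chunks of decompressed data at once.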

In an unexpected twist, the Blosc compressor, although born out of the needs of PyTables, took off as a standalone compressor (or meta-compressor, as it can use several codecs internally) meant to accelerate not just disk I/O but also memory access (https://www.blosc.org/pages/blosc-in-depth/).  More recently, the latest generation of Blosc, aka Blosc2, has grown its own multi-level data partitioning system, and it currently helps PyTables go well beyond the limitations imposed by the current HDF5 filter pipeline, approaching the speed of in-memory, high-performance libraries like pandas (https://www.blosc.org/posts/blosc2-pytables-perf/).  With that, things have come full circle: a subproject like Blosc has grown in functionality to the point that it can help the HDF5 dependency overcome its own limits.

* Conclusions

It has been a long way since PyTables started 20 years ago.  We are happy to have helped fulfill the data storage and filtering needs of many people along the way.  Many thanks to all maintainers and contributors (whether of code or donations) to the project; they are too numerous to mention here, but if you are reading this and are among them, you can be proud of having contributed to PyTables.  The road has certainly been bumpy, but somehow it worked, and many difficulties have been overcome; such is the magic and the grace of open source!
--
Francesc Alted

Antonio Valentino

Dec 30, 2022, 7:27:28 AM
to pytabl...@googlegroups.com
Hi Francesc,

On 30/12/22 13:05, Francesc Alted wrote:
> Hi Developers,
>
> 20 years ago PyTables 0.1 was announced. I had been meaning to blog about
> this for a while, and only recently found some time for it (see below).
> My memories are probably fuzzy and sometimes even wrong, so please feel
> free to chime in and suggest changes or additions!
>
> Once we are comfortable with the result we should decide where to
> publish this. I'm comfortable adding it to the blosc.org blog, but
> other proposals are welcome!

[...]

Thank you very much for the very nice initiative.
It is a fantastic blog post.
To me the text is perfect and I'm also fine with publishing on the
blosc.org blogsite.

cheers
--
Antonio Valentino

Francesc Alted

Dec 31, 2022, 11:13:12 AM
to pytabl...@googlegroups.com

--
Francesc Alted