Benchmarks & documentation for direct chunking support

Ivan Vilata i Balaguer

Jul 30, 2024, 12:04:19 PM
to PyTables development
Hi everyone,

So, continuing with the NumFOCUS-sponsored work on adding direct chunking
support to PyTables, Francesc Alted and I performed some benchmarking on the
current implementation and extended the User's Guide with usage & optimization
tips.

The benchmarks yielded very positive results. Francesc created a notebook
where he documents the tests performed and comments on the results:
<https://github.com/PyTables/PyTables/blob/direct-chunking-api/bench/direct-chunking-AMD-7800X3D.ipynb>

The new direct chunking API was already added to the User's Guide reference
chapter a while ago. Now I added an introductory section to «Optimization
tips» on how to use direct chunking:
<https://github.com/PyTables/PyTables/blob/direct-chunking-api/doc/source/usersguide/optimization.rst#low-level-access-to-chunks-direct-chunking>

I also took Francesc's benchmark results (that he got on different machines)
and added another section to «Optimization tips» discussing the potential
performance benefits of using direct chunking:
<https://github.com/PyTables/PyTables/blob/direct-chunking-api/doc/source/usersguide/optimization.rst#avoiding-filter-pipeline-overhead-with-direct-chunking>

With that, we think that we're done with our changes to the
`direct-chunking-api` branch, so I created
<https://github.com/PyTables/PyTables/pull/1187>. I encourage you to have a
look at the changes and comment on them here or there, thanks!

Again, thanks to NumFOCUS for sponsoring this work!

--
Ivan Vilata i Balaguer -- https://elvil.net/

Ivan Vilata i Balaguer

Jul 31, 2024, 6:09:32 AM
to pytabl...@googlegroups.com
'Ivan Vilata i Balaguer' via pytables-dev (2024-07-30 18:04:12 +0200) wrote:

> So, continuing with the NumFOCUS-sponsored work on adding direct chunking
> support to PyTables, Francesc Alted and I performed some benchmarking on the
> current implementation and extended the User's Guide with usage & optimization
> tips. […]
>
> With that, we think that we're done with our changes to the
> `direct-chunking-api`, thus I created
> <https://github.com/PyTables/PyTables/pull/1187>. I encourage you to have a
> look at the changes and comment on it here or there, thanks!

Antonio and Francesc gave positive reviews of the PR (thanks!), so I merged
it. Direct chunking support is now in `master` and hopefully ready for
the next release! 🙂

The Grant proposal plans a release with the new feature in early September,
but we may discuss (in another thread) whether we prefer an earlier release.
I have time to help with or take care of the release process, too.

Cheers,

Antonio Valentino

Jul 31, 2024, 2:02:14 PM
to pytabl...@googlegroups.com
Dear Ivan,

On 31/07/24 12:09, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> 'Ivan Vilata i Balaguer' via pytables-dev (2024-07-30 18:04:12 +0200) wrote:
>
>> So, continuing with the NumFOCUS-sponsored work on adding direct chunking
>> support to PyTables, Francesc Alted and I performed some benchmarking on the
>> current implementation and extended the User's Guide with usage & optimization
>> tips. […]
>>
>> With that, we think that we're done with our changes to the
>> `direct-chunking-api`, thus I created
>> <https://github.com/PyTables/PyTables/pull/1187>. I encourage you to have a
>> look at the changes and comment on it here or there, thanks!
>
> Antonio and Francesc gave positive reviews of the PR (thanks!), so I merged
> it. So direct chunking support is already in `master` and hopefully ready for
> the next release! 🙂

thanks a lot for the great job that you did with the direct chunking
support.

> The Grant proposal plans a release with the new feature in early September,
> but we may discuss (in another thread) whether we prefer a release closer in
> time. I have time to help with or take care of the release process, too.

That would be fantastic!

I'm investigating https://github.com/PyTables/PyTables/issues/1185.
Updating the index does indeed seem to be broken, but so far I have no clue
about the real origin of the issue.
It is triggered by NumPy 2.
Any help on this front is more than welcome.

Moreover, I see that Eric is working on the issue with Windows wheel
generation in https://github.com/PyTables/PyTables/pull/1188.

From my perspective, those are the only blocking issues for a new
release.
I suggest v3.10, by the way.


kind regards
--
Antonio Valentino

Ivan Vilata i Balaguer

Aug 6, 2024, 1:52:34 PM
to pytabl...@googlegroups.com
Antonio Valentino (2024-07-31 20:02:11 +0200) wrote:

> On 31/07/24 12:09, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
>
> > The Grant proposal plans a release with the new feature in early September,
> > but we may discuss (in another thread) whether we prefer a release closer in
> > time. I have time to help with or take care of the release process, too.
>
> That would be fantastic!
>
> I'm investigating https://github.com/PyTables/PyTables/issues/1185.
> The update of the index seems indeed to be broken but I have no clue, so
> far, about the real origin of the issue.
> It is triggered by numpy 2.
> Any help on this side is more than welcome.
>
> Moreover, I see that Eric is working on the issue with windows wheel
> generation in https://github.com/PyTables/PyTables/pull/1188.
>
> From my perspective, those are the only blocking issues for a new
> release.
> I suggest v3.10 by the way.

Thanks Antonio! I've been having a look at bugs and I see these issues for
the release:

- #1185 regarding indexing and NumPy 2, as you mentioned.

- Upgrading wheels to depend on Python-Blosc2 >= 2.7.1, which would fix #1186
(related to PR #1188 too). I don't know how to correctly update the
various requirements files with `pip-compile`, though.

- The situation with NumPy 2. Everything seems to work according to CI (with
the exception of #1185), but requirements files here and there still
indicate `<2`. Do we intend to support NumPy 2 or build wheels upon it for
this release?

Thanks again, and cheers!

Antonio Valentino

Aug 7, 2024, 9:33:25 AM
to pytabl...@googlegroups.com
Dear Ivan, dear all,

On 06/08/24 19:52, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> Antonio Valentino (2024-07-31 20:02:11 +0200) wrote:
>
>> On 31/07/24 12:09, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
>>
>>> The Grant proposal plans a release with the new feature in early September,
>>> but we may discuss (in another thread) whether we prefer a release closer in
>>> time. I have time to help with or take care of the release process, too.
>>
>> That would be fantastic!
>>
>> I'm investigating https://github.com/PyTables/PyTables/issues/1185.
>> The update of the index seems indeed to be broken but I have no clue, so
>> far, about the real origin of the issue.
>> It is triggered by numpy 2.
>> Any help on this side is more than welcome.
>>
>> Moreover, I see that Eric is working on the issue with windows wheel
>> generation in https://github.com/PyTables/PyTables/pull/1188.
>>
>> From my perspective, those are the only blocking issues for a new
>> release.
>> I suggest v3.10 by the way.
>
> Thanks Antonio! I've been having a look at bugs and I see these issues with
> the release.
>
> - #1185 regarding indexing and NumPy 2, as you mentioned.

This should hopefully be fixed now.
I opened PR https://github.com/PyTables/PyTables/pull/1192

> - Upgrading wheels to depend on Python-Blosc2 >= 2.7.1, which would fix #1186
> (related with PR#1188 too). I don't know how to correctly update the
> various requirements files with `pip-compile`, though.

Instructions are in the various requirements*.in files.
The new machinery was provided via a PR by a contributor.
Unfortunately we do not have it documented in our wiki.
If needed, I could help with this.

> - The situation with NumPy 2. Everything seems to work according to CI (with
> the exception of #1185), but requirements files here and there still
> indicate `<2`. Do we intend to support NumPy 2 or build wheels upon it for
> this release?

I would like to keep numpy < 2 for at least one test for the time being, to ensure that (build) compatibility is not broken.
I totally agree, of course, that we should build wheels using the latest numpy.
That should already be the case, by the way.


In addition, we should probably also update the embedded versions of cblosc and hdf5-blosc* (?)


Finally I would like to have your opinion on https://github.com/conda-forge/pytables-feedstock/issues/70.
Should we build conda packages without lzo?


Kind regards
--
Antonio Valentino

Ivan Vilata i Balaguer

Aug 7, 2024, 11:21:31 AM
to pytabl...@googlegroups.com
Antonio Valentino (2024-08-07 15:33:21 +0200) wrote:

> On 06/08/24 19:52, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> >
> > Thanks Antonio! I've been having a look at bugs and I see these issues with
> > the release.
> >
> > - #1185 regarding indexing and NumPy 2, as you mentioned.
>
> This should be hopefully fixed now
> I opened PR https://github.com/PyTables/PyTables/pull/1192

Wow, thanks a lot Antonio! I already approved the PR.

>
> > - Upgrading wheels to depend on Python-Blosc2 >= 2.7.1, which would fix #1186
> > (related with PR#1188 too). I don't know how to correctly update the
> > various requirements files with `pip-compile`, though.
>
> instructions are in the various requirements*.in files.
> The new machinery was provided via PR by a contributor.
> Unfortunately we do not have it in our wiki.
> If needed I could help on this.

That'd be great, thanks! I did follow the instructions in the files but got
weird results (like NumPy reverting to 1.x). I'm afraid I may be doing
something incorrectly there, as I'm unfamiliar with the workings of these
compiled requirements files.

>
> > - The situation with NumPy 2. Everything seems to work according to CI (with
> > the exception of #1185), but requirements files here and there still
> > indicate `<2`. Do we intend to support NumPy 2 or build wheels upon it for
> > this release?
>
> I would like to keep numpy < 2 at least for one test for the time being, to ensure that (build) compatibility is not broken.
> I totally agree, of course, that we should build wheels using the latest numpy.
> It should be already the case by the way.

Oh ok, then it makes total sense. 🙂

>
> In addition probably we should also update the embedded version of cblosc and hdf5-blosc* (?)

+1 for C-Blosc and HDF5-Blosc; regarding HDF5-Blosc2, AFAIK the source in
PyTables is the reference one (I can suggest that Francesc and Óscar move
<https://github.com/oscargm98/HDF5-Blosc2> into the Blosc GH org to continue
updating it). I can take care of these.

>
> Finally I would like to have your opinion on https://github.com/conda-forge/pytables-feedstock/issues/70.
> Should we build conda packages without lzo?

I'm sorry, but I'm not familiar with Conda packaging. I can only say that, to my
knowledge, there's no licensing exception for linking LZO with PyTables. 🤷

Thanks again and cheers!

Francesc Alted

Aug 7, 2024, 11:32:12 AM
to pytabl...@googlegroups.com
On Wed, Aug 7, 2024 at 5:21 PM 'Ivan Vilata i Balaguer' via
pytables-dev <pytabl...@googlegroups.com> wrote:
>
> Antonio Valentino (2024-08-07 15:33:21 +0200) wrote:
>
> > On 06/08/24 19:52, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> > >
> > > Thanks Antonio! I've been having a look at bugs and I see these issues with
> > > the release.
> > >
> > > - #1185 regarding indexing and NumPy 2, as you mentioned.
> >
> > This should be hopefully fixed now
> > I opened PR https://github.com/PyTables/PyTables/pull/1192
>
> Wow, thanks a lot Antonio! I already approved the PR.

Me too. Excellent work Antonio.

> > > - The situation with NumPy 2. Everything seems to work according to CI (with
> > > the exception of #1185), but requirements files here and there still
> > > indicate `<2`. Do we intend to support NumPy 2 or build wheels upon it for
> > > this release?
> >
> > I would like to keep numpy < 2 at least for one test for the time being, to ensure that (build) compatibility is not broken.
> > I totally agree, of course, that we should build wheels using the latest numpy.
> > It should be already the case by the way.
>
> Oh ok, then it makes total sense. 🙂

+1

>
> >
> > In addition probably we should also update the embedded version of cblosc and hdf5-blosc* (?)
>
> +1 for C-Blosc and HDF5-Blosc; regarding HDF5-Blosc2, AFAIK the source in
> PyTables is the reference one (I can suggest Francesc and Óscar to move
> <https://github.com/oscargm98/HDF5-Blosc2> into the Blosc GH org to continue
> updating it). I can take care of these.

+1

>
> >
> > Finally I would like to have your opinion on https://github.com/conda-forge/pytables-feedstock/issues/70.
> > Should we build conda packages without lzo?
>
> I'm sorry but I'm not familiar with Conda packaging, I can only say that to my
> knowledge there's no exception for linking LZO with PyTables. 🤷

I'm not a guru on licensing, but it is true that GPL may pose an issue
with dynamic linking, so my personal vote is to build new binary
packages without LZO support out-of-the-box. In case anyone needs LZO,
they can always build from scratch.

Cheers,

Francesc Alted

Ivan Vilata i Balaguer

Aug 8, 2024, 8:14:08 AM
to pytabl...@googlegroups.com
Cool! Since all of the issues below are fixed now, I created the
`releases/v3.10.0` branch to start the release procedure; the draft PR is
<https://github.com/PyTables/PyTables/pull/1198>.

Antonio, if you're ok with it I'll start collecting release notes.

We can sync further in the PR's discussion.

Thanks everyone for the effort put into this!


Francesc Alted (2024-08-07 17:31:59 +0200) wrote:

> On Wed, Aug 7, 2024 at 5:21 PM 'Ivan Vilata i Balaguer' via
> pytables-dev <pytabl...@googlegroups.com> wrote:
> >
> > Antonio Valentino (2024-08-07 15:33:21 +0200) wrote:
> >
> > > On 06/08/24 19:52, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> > > >
> > > > - #1185 regarding indexing and NumPy 2, as you mentioned.
> > >
> > > This should be hopefully fixed now
> > > I opened PR https://github.com/PyTables/PyTables/pull/1192
> >
> > Wow, thanks a lot Antonio! I already approved the PR.
>
> Me too. Excellent work Antonio.
>
> > > I would like to keep numpy < 2 at least for one test for the time being, to ensure that (build) compatibility is not broken.
> > > I totally agree, of course, that we should build wheels using the latest numpy.
> > > It should be already the case by the way.
> >
> > Oh ok, then it makes total sense. 🙂
>
> +1
>
> > > In addition probably we should also update the embedded version of cblosc and hdf5-blosc* (?)
> >
> > +1 for C-Blosc and HDF5-Blosc; regarding HDF5-Blosc2, AFAIK the source in
> > PyTables is the reference one (I can suggest Francesc and Óscar to move
> > <https://github.com/oscargm98/HDF5-Blosc2> into the Blosc GH org to continue
> > updating it). I can take care of these.
>
> +1
>
> > > Finally I would like to have your opinion on https://github.com/conda-forge/pytables-feedstock/issues/70.
> > > Should we build conda packages without lzo?
> >
> > I'm sorry but I'm not familiar with Conda packaging, I can only say that to my
> > knowledge there's no exception for linking LZO with PyTables. 🤷
>
> I'm not a guru on licensing, but it is true that GPL may pose an issue
> with dynamic linking, so my particular vote is to build new binary
> packages without LZO support out-of-the-box. In case anyone needs LZO,
> they can always build from scratch.

Antonio Valentino

Aug 8, 2024, 11:00:08 AM
to pytabl...@googlegroups.com
Dear Ivan,

On 08/08/24 14:14, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> Cool! Since all of the issues below are fixed now, I created the
> `releases/v3.10.0` branch to start the release procedure, the draft PR is
> <https://github.com/PyTables/PyTables/pull/1198>.
>
> Antonio, if you're ok with it I'll start collecting release notes.
>
> We can sync further in the PR's discussion.
>
> Thanks everyone for the effort put into this!

I have just renamed the GitHub milestone to v3.10 and cleaned up the
bugs assigned to it (see
https://github.com/PyTables/PyTables/milestone/26).
There are 3 remaining issues linked to the upcoming release:

#1165: Request: Make Apple Silicon Wheels Available
This should be automatically closed by the release itself

The following ones, instead, are either very simple to address or already have
a PR that could be quickly merged (or rejected):

#1010 Update leaf.py
#1100 Bug in __init()__

Please have a look and let me know how you want to proceed.
We could always push them to the next release if deemed appropriate.

Ivan Vilata i Balaguer

Aug 8, 2024, 1:41:03 PM
to pytabl...@googlegroups.com
Hi Antonio, thanks for the update! More inline…

Antonio Valentino (2024-08-08 17:00:05 +0200) wrote:

> On 08/08/24 14:14, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
> > Cool! Since all of the issues below are fixed now, I created the
> > `releases/v3.10.0` branch to start the release procedure, the draft PR is
> > <https://github.com/PyTables/PyTables/pull/1198>.
> >
> > Antonio, if you're ok with it I'll start collecting release notes.

I've collected new release notes in `RELEASE_NOTES.rst`; the blurb is still
pending.

>
> I have just renamed the github milestone to v3.10 and made some cleanup of
> bugs assigned to it (see https://github.com/PyTables/PyTables/milestone/26).
> There are 3 remaining issues linked to the upcoming release:
>
> #1165: Request: Make Apple Silicon Wheels Available
> This should be automatically closed by the release itself
>
> The following ones, instead, are either very simple to address or already have a PR
> that could be quickly merged (or rejected):
>
> #1010 Update leaf.py
> #1100 Bug in __init()__
>
> Please have a look and let me know how you want to proceed.
> We could always push them to the next release if deemed appropriate.

Cool! I won't be able to have a look at these until Tuesday, but feel free to
push fixes to master or the release branch in the meantime if you feel like
it.

Thanks and cheers!

Ivan Vilata i Balaguer

Aug 12, 2024, 5:05:17 AM
to pytabl...@googlegroups.com
Antonio Valentino (2024-08-08 17:00:05 +0200) wrote:

> I have just renamed the github milestone to v3.10 and made some cleanup of
> bugs assigned to it (see https://github.com/PyTables/PyTables/milestone/26).
> There are 3 remaining issues linked to the upcoming release:
>
> #1165: Request: Make Apple Silicon Wheels Available
> This should be automatically closed by the release itself
>
> The following ones, instead, are either very simple to address or already have a PR
> that could be quickly merged (or rejected):
>
> #1010 Update leaf.py
> #1100 Bug in __init()__
>
> Please have a look and let me know how you want to proceed.
> We could always push them to the next release if deemed appropriate.

So all relevant issues for the milestone are closed now, and release notes are
up to date. I made the release PR ready for review.

Thanks to everyone who helped with this!

Ivan Vilata i Balaguer

Aug 26, 2024, 6:12:51 AM
to pytabl...@googlegroups.com
Hi everyone! We just posted a new article in the Blosc blog about the new
direct chunking API of PyTables v3.10, including some coding examples and
benchmark results. This is a little more high-level than the documentation in
the User's Guide, so it may be more readable as an introduction to the
feature. Please share it!

<https://www.blosc.org/posts/pytables-direct-chunking/>

With this post we completed the pending tasks for the Small Development Grant
that funded the development of this new feature. Thanks again to NumFOCUS for
supporting this effort, and to everyone who helped us get there!

Cheers,


'Ivan Vilata i Balaguer' via pytables-dev (2024-07-30 18:04:12 +0200) wrote:

> So, continuing with the NumFOCUS-sponsored work on adding direct chunking
> support to PyTables, Francesc Alted and I performed some benchmarking on the
> current implementation and extended the User's Guide with usage & optimization
> tips.
>
> The benchmarks yielded very positive results. Francesc created a notebook
> where he documents the tests performed and comments on the results:
> <https://github.com/PyTables/PyTables/blob/direct-chunking-api/bench/direct-chunking-AMD-7800X3D.ipynb>
>
> The new direct chunking API was already added to the User's Guide reference
> chapter a while ago. Now I added an introductory section to «Optimization
> tips» on how to use direct chunking:
> <https://github.com/PyTables/PyTables/blob/direct-chunking-api/doc/source/usersguide/optimization.rst#low-level-access-to-chunks-direct-chunking>
>
> I also took Francesc's benchmark results (that he got on different machines)
> and added another section to «Optimization tips» discussing the potential
> performance benefits of using direct chunking:
> <https://github.com/PyTables/PyTables/blob/direct-chunking-api/doc/source/usersguide/optimization.rst#avoiding-filter-pipeline-overhead-with-direct-chunking>

Antonio Valentino

Aug 26, 2024, 12:23:37 PM
to pytabl...@googlegroups.com
On 26/08/24 12:12, 'Ivan Vilata i Balaguer' via pytables-dev wrote:
Thanks Ivan

cheers
--
Antonio Valentino