
EArray caching of large chunks


Jason Brodsky

Sep 6, 2024, 6:14:13 PM
to pytables-users
Hello everyone! I'm hitting an issue with efficiently accessing data in an EArray.

I have an EArray with chunk size (5000, 2, 255, 200), with the first axis being the main (extensible) axis. I am using these big chunks because I intend to shuffle elements. I don't need truly random access into the file, but I'd like to shuffle, say, 5000 elements together before moving on to the next chunk.

Accessing a single element takes about 1.5 seconds. That doesn't surprise me as a time to load and decompress one 2 GB chunk.

Accessing a slice of hundreds of elements also takes about 1.5 seconds. Again, no surprise: once the chunk is decompressed, grabbing a bigger slice of it doesn't take much time.

Accessing a single element twice takes 3 seconds. That is:
array[0]
array[0]
takes twice as long as
array[0]

This looks to me like a caching issue. The chunk is decompressed on the first access, then dropped from memory so it needs to be reloaded and decompressed again.

If this is a caching issue, what do I need to do to change that? Opening the file with:
tables.open_file(path, 'r', CHUNK_CACHE_SIZE=3_000_000_000)
doesn't do the trick. Is there something else required to ensure at least one chunk is kept in memory? Is there any way I can confirm whether the cache contains chunks, besides inferring that it doesn't from the above timing exercise?

Thanks,
Jason

Antonio Valentino

Sep 7, 2024, 5:50:51 AM
to pytable...@googlegroups.com
Dear Jason,

On 07/09/24 00:14, Jason Brodsky wrote:
The CHUNK_CACHE_SIZE parameter is not directly used by PyTables.
It is just passed to the underlying HDF5 library.

I will do some debugging to verify that the parameter is
handled as expected.

I honestly do not know what else to suggest, apart from the obvious
workaround of using a local variable to store the contents of the
decompressed chunk.
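That workaround can be packaged as a small wrapper that remembers the most recently decompressed chunk, so repeated reads of nearby rows trigger only one load. This is a minimal sketch; `load_chunk`, `OneChunkCache`, and the fake loader are illustrative stand-ins, not PyTables API:

```python
# One-chunk cache: keep the last decompressed chunk in an attribute so
# consecutive reads from the same chunk skip the load-and-decompress step.
CHUNK_ROWS = 5000  # first-axis chunk length from the original post

class OneChunkCache:
    def __init__(self, load_chunk):
        self._load_chunk = load_chunk   # callable: chunk_index -> row sequence
        self._index = None              # index of the chunk currently held
        self._chunk = None

    def __getitem__(self, row):
        index, offset = divmod(row, CHUNK_ROWS)
        if index != self._index:        # miss: load and remember the chunk
            self._chunk = self._load_chunk(index)
            self._index = index
        return self._chunk[offset]

# Fake loader standing in for the expensive HDF5 read + decompression.
loads = []
def fake_loader(index):
    loads.append(index)
    return [index] * CHUNK_ROWS

cache = OneChunkCache(fake_loader)
cache[0]; cache[1]; cache[4999]         # all three rows live in chunk 0
assert loads == [0]                     # the chunk was loaded only once
```

In the real setting, `load_chunk` would be a function that slices one whole chunk out of the EArray (e.g. `array[index * CHUNk : (index + 1) * CHUNK]`-style reads), amortizing the decompression over all rows used from it.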

kind regards
--
Antonio Valentino

Jason Brodsky

Sep 9, 2024, 12:47:48 PM
to pytables-users
Thank you! I have set up a basic (one-chunk) local cache of my own for the time being, but I anticipate adding features to this data-loading process where an LRU cache would be very useful. Being able to use a built-in LRU instead of rolling my own would be a nice feature.
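Pending a built-in solution, the standard library's `functools.lru_cache` can serve as the LRU layer if chunk loading is factored into a function of the chunk index. A hedged sketch, with a fake loader in place of the real HDF5 read:

```python
# LRU cache over whole chunks: decorate the chunk loader, then route
# per-row reads through it. `load_chunk`/`read_row` are illustrative names.
from functools import lru_cache

CHUNK_ROWS = 5000

calls = []                       # records which chunks were actually loaded
@lru_cache(maxsize=2)            # keep the two most recently used chunks
def load_chunk(index):
    calls.append(index)
    return [index] * CHUNK_ROWS  # stand-in for the decompressed chunk data

def read_row(row):
    index, offset = divmod(row, CHUNK_ROWS)
    return load_chunk(index)[offset]

read_row(0); read_row(1)         # chunk 0: one load, one cache hit
read_row(5000)                   # chunk 1: second load
read_row(2)                      # chunk 0 is still cached
assert calls == [0, 1]           # only two loads for four reads
```

With real data, `maxsize` times the decompressed chunk size bounds the memory held by the cache, so for 2 GB chunks even `maxsize=2` is a deliberate choice.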

Thanks,
Jason

Francesc Alted

Sep 11, 2024, 1:49:31 AM
to Jason Brodsky, pytables-users
Hello Jason,

I think a (much) better approach for getting good performance on small slices of large chunks would be to use the Blosc2 compressor via direct chunking. See this blog post on why this works: https://www.blosc.org/posts/pytables-direct-chunking/

Francesc 
