
EArray caching of large chunks


Jason Brodsky

Sep 6, 2024, 6:14:13 PM
to pytables-users
Hello everyone! I'm hitting an issue with efficiently accessing data in an EArray.

I have an EArray with chunk size (5000, 2, 255, 200), with the first axis being the main (extensible) axis. I am using these big chunks because I intend to shuffle elements. I don't need truly random access into the file, but I'd like to shuffle, say, 5000 elements together before moving on to the next chunk.

Accessing a single element takes about 1.5 seconds. That doesn't surprise me as a time to load and decompress one 2 GB chunk.

Accessing a slice of hundreds of elements also takes about 1.5 seconds. Again, no surprise: once the chunk is decompressed, grabbing a bigger slice of it doesn't take much time.

Accessing a single element twice takes 3 seconds. That is:
array[0]
array[0]
takes twice as long as
array[0]

This looks to me like a caching issue. The chunk is decompressed on the first access, then dropped from memory so it needs to be reloaded and decompressed again.

If this is a caching issue, what do I need to do to change that? Opening the file with:
tables.open_file(path, 'r', CHUNK_CACHE_SIZE=3_000_000_000)
doesn't do the trick. Is there something else required to ensure at least one chunk is kept in memory? Is there any way I can confirm whether the cache contains chunks, besides inferring that it doesn't from the above timing exercise?

Thanks,
Jason

Antonio Valentino

Sep 7, 2024, 5:50:51 AM
to pytable...@googlegroups.com
Dear Jason,

On 07/09/24 00:14, Jason Brodsky wrote:
The CHUNK_CACHE_SIZE parameter is not directly used by PyTables.
It is just passed to the underlying HDF5 library.

I will do some debugging to verify that the parameter is
handled as expected.

I honestly do not know what else to suggest, apart from the obvious
workaround of using a local variable to store the contents of the
decompressed chunk.
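That workaround can be packaged as a small wrapper that remembers the most recently decompressed chunk, so repeated reads of nearby rows trigger only one load. This is a minimal sketch; `load_chunk`, `OneChunkCache`, and the fake loader are illustrative stand-ins, not PyTables API:

```python
# One-chunk cache: keep the last decompressed chunk in an attribute so
# consecutive reads from the same chunk skip the load-and-decompress step.
CHUNK_ROWS = 5000  # first-axis chunk length from the original post

class OneChunkCache:
    def __init__(self, load_chunk):
        self._load_chunk = load_chunk   # callable: chunk_index -> row sequence
        self._index = None              # index of the chunk currently held
        self._chunk = None

    def __getitem__(self, row):
        index, offset = divmod(row, CHUNK_ROWS)
        if index != self._index:        # miss: load and remember the chunk
            self._chunk = self._load_chunk(index)
            self._index = index
        return self._chunk[offset]

# Fake loader standing in for the expensive HDF5 read + decompression.
loads = []
def fake_loader(index):
    loads.append(index)
    return [index] * CHUNK_ROWS

cache = OneChunkCache(fake_loader)
cache[0]; cache[1]; cache[4999]         # all three rows live in chunk 0
assert loads == [0]                     # the chunk was loaded only once
```

In the real setting, `load_chunk` would be a function that slices one whole chunk out of the EArray (e.g. `array[index * CHUNk : (index + 1) * CHUNK]`-style reads), amortizing the decompression over all rows used from it.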

kind regards
--
Antonio Valentino

Jason Brodsky

Sep 9, 2024, 12:47:48 PM
to pytables-users
Thank you! I have set up a basic (one-chunk) local cache of my own for the time being, but I anticipate adding features to this data-loading process where an LRU cache would be very useful. Being able to use a built-in LRU instead of rolling my own would be a nice feature.
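Pending a built-in solution, the standard library's `functools.lru_cache` can serve as the LRU layer if chunk loading is factored into a function of the chunk index. A hedged sketch, with a fake loader in place of the real HDF5 read:

```python
# LRU cache over whole chunks: decorate the chunk loader, then route
# per-row reads through it. `load_chunk`/`read_row` are illustrative names.
from functools import lru_cache

CHUNK_ROWS = 5000

calls = []                       # records which chunks were actually loaded
@lru_cache(maxsize=2)            # keep the two most recently used chunks
def load_chunk(index):
    calls.append(index)
    return [index] * CHUNK_ROWS  # stand-in for the decompressed chunk data

def read_row(row):
    index, offset = divmod(row, CHUNK_ROWS)
    return load_chunk(index)[offset]

read_row(0); read_row(1)         # chunk 0: one load, one cache hit
read_row(5000)                   # chunk 1: second load
read_row(2)                      # chunk 0 is still cached
assert calls == [0, 1]           # only two loads for four reads
```

With real data, `maxsize` times the decompressed chunk size bounds the memory held by the cache, so for 2 GB chunks even `maxsize=2` is a deliberate choice.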

Thanks,
Jason

Francesc Alted

Sep 11, 2024, 1:49:31 AM
to Jason Brodsky, pytables-users
Hello Jason,

I think a (much) better approach for getting good performance on small slices of large chunks would be to use the Blosc2 compressor via direct chunking. See this blog post on why this works: https://www.blosc.org/posts/pytables-direct-chunking/

Francesc 
