Hello everyone! I'm hitting a performance issue when accessing data in an EArray.
I have an EArray with chunkshape (5000, 2, 255, 200), where the first axis is the main (extensible) axis. I am using these big chunks because I intend to shuffle elements. I don't need truly random access into the file; I'd like to shuffle, say, 5000 elements together before moving on to the next chunk.
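For context, the per-chunk shuffle I have in mind is roughly this (a sketch using a small in-memory NumPy array as a stand-in for one chunk-sized slice; in the real code the slice would come from the EArray, e.g. something like array[i*5000:(i+1)*5000]):

```python
import numpy as np

# Stand-in for one chunk-sized slice read from the EArray.
# Trailing dims shrunk from (2, 255, 200) to keep the example small.
chunk = np.arange(5000 * 2 * 3 * 4).reshape(5000, 2, 3, 4)

# Shuffle along the main (extensible) axis only; the other axes
# stay intact for each element.
rng = np.random.default_rng(0)
perm = rng.permutation(chunk.shape[0])
shuffled = chunk[perm]

assert shuffled.shape == chunk.shape
```

The point is that each shuffle only ever touches one chunk's worth of data, so ideally only one decompressed chunk needs to be resident at a time.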
Accessing a single element takes about 1.5 seconds. That doesn't surprise me as a time to load and decompress one 2 GB chunk.
Accessing a slice of hundreds of elements also takes about 1.5 seconds. Again, no surprise: once the chunk is decompressed, grabbing a bigger slice of it doesn't take much time.
Accessing a single element twice takes 3 seconds. That is:

    array[0]
    array[0]

takes twice as long as

    array[0]
This looks to me like a caching issue: the chunk is decompressed on the first access, then apparently dropped from memory, so it has to be reloaded and decompressed again on the second access.
If this is a caching issue, what do I need to do to change that? Opening the file with:
    tables.open_file(path, 'r', CHUNK_CACHE_SIZE=3_000_000_000)
doesn't do the trick. Is there something else required to ensure at least one chunk is kept in memory? Is there any way I can confirm whether the cache contains chunks, besides inferring that it doesn't from the above timing exercise?
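In case there's no way to make the HDF5 chunk cache do this, the fallback I'm considering is caching the most recent chunk manually. A minimal sketch (the `reader` callable is a placeholder; with my EArray it would be something like lambda i: array[i*5000:(i+1)*5000]):

```python
import numpy as np

class OneChunkCache:
    """Keep the most recently read chunk in memory.

    `reader` is any callable mapping a chunk index to an ndarray.
    """
    def __init__(self, reader):
        self.reader = reader
        self.index = None
        self.chunk = None

    def get(self, i):
        # Only hit the underlying reader when a different chunk is requested.
        if i != self.index:
            self.chunk = self.reader(i)
            self.index = i
        return self.chunk

# Demonstration with a fake reader that records each real read.
calls = []
def fake_reader(i):
    calls.append(i)
    return np.full((4, 2), i)

cache = OneChunkCache(fake_reader)
cache.get(0)
cache.get(0)   # served from memory, no second read
cache.get(1)
assert calls == [0, 1]
```

That works, but it feels like I'd be reimplementing what the HDF5 chunk cache is supposed to do, hence the question.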
Thanks,
Jason