Hi Matt,
Thank you so much for the informative advice! :)
"The HDF5 site claims that chunking can substantially increase I/O speed,
but I have yet to see this in practice for most datasets that are dense
and accessed sequentially. "
- Good to know! :) My dataset is composed of two parts, input features and target data, and the majority, i.e. the input features, is quite dense and indeed accessed sequentially.
"Chunking is particularly beneficial when your data is sparse, randomly
accessed, or compressed, *and* you know *exactly* how you will be
slicing your data."
- Hmmm... it sounds like this is not my case...
"Given that you seem to be iterating over slices of your data, you can probably make that happen."
- Yeah, the current iterator was not written by me, so I don't quite know what it is actually doing. I will probably need to rewrite it to fit the data structure better, e.g. by taking the chunk layout into account.
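Roughly what I have in mind (just a sketch for now; the file name "data.h5" and dataset name "features" are placeholders for our real ones):

    import h5py

    def chunk_aligned_blocks(path="data.h5", name="features"):
        # Placeholder file/dataset names; read in blocks aligned to the
        # dataset's chunk layout so each read covers whole chunks.
        with h5py.File(path, "r") as f:
            ds = f[name]
            # ds.chunks is None for contiguous datasets; fall back to a guess.
            rows = ds.chunks[0] if ds.chunks else 4096
            for start in range(0, ds.shape[0], rows):
                yield ds[start:start + rows]  # one read per block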
"It's also curious that you're spending a quarter of your time in
select_hyperslab. That says that you're (probably) doing many, many
small reads."
- I agree. I guess this is because of the iterator...
"If you can afford to read in bigger subsets of your data (say, read 10
GB at a time, then have your iterator dole out slices of that until you
need to read the next 10 GB), you might come out way ahead on
performance. HDF5 (generally speaking) takes care of the nitty-gritty
actions of reading and writing data, but leaves things like access
pattern optimization (like prefetching and caching data) to the user.
I've managed to increase my read performance a couple of orders of
magnitude by eliminating chunking and then pre-fetching data in the
largest possible useful subsets."
- Good lesson to learn! In the design of our experiment we have ruled out random access for now, so, as you said, there is indeed room to optimize the data access: the iterator could simply serve slices out of a large buffer held in memory, and hopefully that will speed things up a lot. Currently the system is around 20 times slower than I expected, and it is quite amazing to me that the time spent reading data is ~100 times the time spent on the actual computation!
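To make that concrete, the buffered iterator could look something like this rough sketch ("data.h5"/"features" and the row counts are placeholders I would still have to tune):

    import h5py

    def buffered_slices(path="data.h5", name="features",
                        buffer_rows=1_000_000, slice_rows=128):
        # Placeholder names and sizes. One large HDF5 read fills the buffer;
        # the small slices then come out of the in-memory NumPy array.
        with h5py.File(path, "r") as f:
            ds = f[name]
            for start in range(0, ds.shape[0], buffer_rows):
                buf = ds[start:start + buffer_rows]   # the only HDF5 read
                for off in range(0, len(buf), slice_rows):
                    yield buf[off:off + slice_rows]

That way HDF5 sees one big hyperslab per buffer instead of thousands of tiny ones, so select_hyperslab should mostly drop out of the profile.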
I will first try to track down the source of the slicing problem. Then I will set a large default chunk size (as long as our grid manager does not kill me :P) and have another run. :)
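For that run I plan to write the data with large chunks and also enlarge h5py's raw-data chunk cache, along these lines (the sizes are guesses I still need to tune):

    import h5py
    import numpy as np

    data = np.random.rand(100_000, 128).astype("float32")  # stand-in data

    # Write with large row-wise chunks (~8 MB each here).
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("features", data=data, chunks=(16384, 128))

    # Reopen with a 1 GB raw-data chunk cache so hot chunks stay resident.
    with h5py.File("data.h5", "r", rdcc_nbytes=1024**3) as f:
        print(f["features"].chunks)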
Matt, thank you so much again! :)
Best
Yulan