Slow data reading with h5py


Yulan Liu

Nov 7, 2014, 10:17:01 AM
to h5...@googlegroups.com
Dear all,

I am new to HDF5, and a few days ago I started using h5py to fetch data for training deep neural network models for speech recognition tasks. However, I immediately ran into a problem: the data access speed is extremely slow. Here are the profiling results from Python's cProfile:

-----------------------------------------------------------
In [5]: p.sort_stats('time').print_stats(10)
Thu Nov  6 20:03:53 2014    log-profile_train-dnn.txt

         515532571 function calls (515450941 primitive calls) in 9971.452 seconds

   Ordered by: internal time
   List reduced from 5476 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    17740 6313.706    0.356 6313.706    0.356 {method 'read' of 'h5py.h5d.DatasetID' objects}
 18145640 2457.244    0.000 2457.244    0.000 {method 'select_hyperslab' of 'h5py.h5s.SpaceID' objects}
    35480  346.389    0.010 9829.833    0.277 build/bdist.linux-x86_64/egg/h5py/_hl/dataset.py:316(__getitem__)
 18145640  154.193    0.000  444.183    0.000 build/bdist.linux-x86_64/egg/h5py/_hl/selections.py:491(_handle_simple)
    17740  132.929    0.007 3153.816    0.178 build/bdist.linux-x86_64/egg/h5py/_hl/selections.py:399(__getitem__)
 18163380   95.416    0.000  150.156    0.000 build/bdist.linux-x86_64/egg/h5py/_hl/selections.py:468(_expand_ellipsis)
    17740   90.050    0.005   90.050    0.005 {method 'nonzero' of 'numpy.ndarray' objects}
     8881   62.871    0.007   63.354    0.007 /share/spandh.ami1/sw/std/python/v2.7.6/x86_64/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py:483(__call__)
199827784   57.468    0.000   57.468    0.000 {method 'append' of 'list' objects}
    21827   31.181    0.001   31.181    0.001 {numpy.core.multiarray.array}
----------------------------------------------------------
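(For reference, this is roughly how such a profile can be collected and loaded; the training script name here is hypothetical.)

-----------------------------------------------------------
# Collect the profile from the shell (Python 2 here, matching the thread):
#   python -m cProfile -o log-profile_train-dnn.txt train_dnn.py
import pstats

# Load the saved stats and show the ten most expensive calls by internal time.
p = pstats.Stats('log-profile_train-dnn.txt')
p.sort_stats('time').print_stats(10)
-----------------------------------------------------------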

As we can see, the HDF5 reading is really time-consuming. Since I have heard that HDF5 is a good direction to move in for fast access to large datasets, I suspect there is something I did wrong, say, an improper chunk size or an improper reading method.

As suggested by some people, I played a little with the chunk size, and it does indeed matter. Currently I am using HDF5 to store the data as a numpy matrix. Let's say the matrix has shape (n_frame, dim_fea). I set the chunk shape to (512, dim_fea), which gives faster reading than either (512, 512) or (1, dim_fea), although I still don't know what the optimal chunk shape is. However, the speed gain from changing the chunk shape is small, so there must be some other problem.
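
For illustration, this is roughly how such a chunk shape is set when creating the dataset with h5py (the file name, dataset name and sizes below are made up):

-----------------------------------------------------------
import numpy as np
import h5py

n_frame, dim_fea = 100000, 440          # made-up sizes

with h5py.File('features.h5', 'w') as f:
    # Each chunk holds 512 complete rows, which favours row-wise reads.
    dset = f.create_dataset('X', shape=(n_frame, dim_fea),
                            dtype='float32', chunks=(512, dim_fea))
    dset[:] = np.random.rand(n_frame, dim_fea).astype('float32')
-----------------------------------------------------------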

Could anyone please give me some advice or comments? For example, what are the common pitfalls in h5py that lead to slow reading?

Another thing I want to ask, not related to this problem but related to what I am going to do next: in some code I have written but not yet tested, I have quite a few operations like "numpy.lib.stride_tricks.as_strided" that change the array strides to reduce memory consumption. Could that be a potential problem that slows down h5py file reading in that code?
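
To make that concrete, the kind of strided view I mean looks roughly like this (the sizes and window length are made up); it operates on a NumPy array that is already in memory, i.e. after the h5py read:

-----------------------------------------------------------
import numpy as np
from numpy.lib.stride_tricks import as_strided

feats = np.random.rand(1000, 40).astype('float32')   # (n_frame, dim_fea), made up
context = 5                                           # frames per context window
n_win = feats.shape[0] - context + 1
row_stride, col_stride = feats.strides

# Overlapping windows of shape (context, dim_fea) without copying any data.
windows = as_strided(feats,
                     shape=(n_win, context, feats.shape[1]),
                     strides=(row_stride, row_stride, col_stride))
-----------------------------------------------------------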

Thank you so much! :)

Best regards
Yulan

Jerome Kieffer

Nov 7, 2014, 10:32:41 AM
to h5...@googlegroups.com
On Fri, 7 Nov 2014 07:17:01 -0800 (PST)
Yulan Liu <acp...@sheffield.ac.uk> wrote:

> Dear all,
>
> I am new to HDF5, and a few days ago I started using h5py to fetch data
> for training deep neural network models for speech recognition tasks.
> However, I immediately ran into a problem: the data access speed is
> extremely slow.

Are you using Windows?

This could be the issue, as HDF5 relies on a POSIX filesystem:
http://www.hdfgroup.org/HDF5/doc/UG/UG_frame08TheFile.html
--
Jerome Kieffer <goo...@terre-adelie.org>

Yulan Liu

Nov 7, 2014, 10:47:08 AM
to h5...@googlegroups.com
Hi Jerome,

No, I am using Linux:

------------------------------
[yulan@snarl acn]$ cat /etc/*-release
Scientific Linux release 6.3 (Carbon)
Scientific Linux release 6.3 (Carbon)
------------------------------

As for Python:

--------------------------------
Python 2.7.6 (default, Feb 13 2014, 17:36:15)
---------------------------------

As for CPU:

--------------------------------------
[yulan@snarl acn]$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
stepping    : 6
cpu MHz        : 2999.997
cache size    : 4096 KB
physical id    : 0
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dts tpr_shadow
bogomips    : 5984.61
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
stepping    : 6
cpu MHz        : 2999.997
cache size    : 4096 KB
physical id    : 3
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 6
initial apicid    : 6
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dts tpr_shadow
bogomips    : 5983.91
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 2
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
stepping    : 6
cpu MHz        : 2999.997
cache size    : 4096 KB
physical id    : 0
siblings    : 2
core id        : 1
cpu cores    : 2
apicid        : 1
initial apicid    : 1
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dts tpr_shadow
bogomips    : 5984.61
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 3
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
stepping    : 6
cpu MHz        : 2999.997
cache size    : 4096 KB
physical id    : 3
siblings    : 2
core id        : 1
cpu cores    : 2
apicid        : 7
initial apicid    : 7
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dts tpr_shadow
bogomips    : 5983.91
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:
---------------------------------------

Matthew Zwier

Nov 7, 2014, 11:13:38 AM
to h5...@googlegroups.com
Hi Yulan,

Do you have the option of *not* chunking your data? Access to non-chunked data can be substantially faster.

Also, can you share the snippet of code that does the reading? That will give us some insight as to what's going on.

Cheers,
Matt Z.


Yulan Liu

Nov 7, 2014, 11:19:03 AM
to h5...@googlegroups.com
Hi Matt,

Thank you a lot for the advice! :)

I will try not chunking the data and run it again. (Actually, I didn't really understand the benefit of chunking the data; I just did it because some Google search results mentioned it...)

Here is the code:

https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/datasets/hdf5.py

In particular, it is the class "HDF5Dataset".

There are some other higher-level functions, but I haven't figured out their structure yet. The code above is definitely the lowest-level h5py interface in the package I am using.

Thank you! :)

Best
Yulan

Yulan Liu

Nov 7, 2014, 11:31:01 AM
to h5...@googlegroups.com
Hi Matt,

I just read the h5py page again. It says: "Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk." So I am afraid I have to keep the data chunked, but probably with larger chunks, as my dataset could grow to 100 GB later and I want to keep memory consumption under 10 GB...

If there is any way to improve the speed without pushing the chunk size to the limit, that would be great! :) Thank you a lot!

Best regards


Yulan


Matthew Zwier

Nov 7, 2014, 11:42:49 AM
to h5...@googlegroups.com
Hi Yulan,

The HDF5 site claims that chunking can substantially increase I/O speed, but I have yet to see this in practice for most datasets that are dense and accessed sequentially. Chunking is particularly beneficial when your data is sparse, randomly accessed, or compressed, *and* you know *exactly* how you will be slicing your data. Your access pattern appears sequential, so I would expect a speedup by storing your data un-chunked on disk. As long as you never read a single *slice* bigger than 10 GB (actually, probably more like 2-4 GB), you'll not have memory consumption above 10 GB. Given that you seem to be iterating over slices of your data, you can probably make that happen.
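
For example, a contiguous (un-chunked) dataset is simply what you get when you don't pass chunks= (or enable compression) at creation time; the names and sizes below are only placeholders:

-----------------------------------------------------------
import numpy as np
import h5py

data = np.random.rand(100000, 440).astype('float32')   # placeholder sizes

with h5py.File('features_contig.h5', 'w') as f:
    # No chunks= and no compression: the dataset is stored contiguously.
    f.create_dataset('X', data=data)
-----------------------------------------------------------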

It's also curious that you're spending a quarter of your time in select_hyperslab. That says that you're (probably) doing many, many small reads. If you can afford to read in bigger subsets of your data (say, read 10 GB at a time, then have your iterator dole out slices of that until you need to read the next 10 GB), you might come out way ahead on performance. HDF5 (generally speaking) takes care of the nitty-gritty actions of reading and writing data, but leaves things like access pattern optimization (like prefetching and caching data) to the user. I've managed to increase my read performance a couple of orders of magnitude by eliminating chunking and then pre-fetching data in the largest possible useful subsets.
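
A rough sketch of that prefetch-and-slice pattern, with placeholder names and sizes (not a drop-in replacement for your actual iterator):

-----------------------------------------------------------
import h5py

def iter_batches(path, name, block_rows=200000, batch_rows=512):
    """Read big contiguous blocks from HDF5, then hand out in-memory slices."""
    with h5py.File(path, 'r') as f:
        dset = f[name]
        n_rows = dset.shape[0]
        for start in range(0, n_rows, block_rows):
            block = dset[start:start + block_rows]       # one large HDF5 read
            for i in range(0, block.shape[0], batch_rows):
                yield block[i:i + batch_rows]            # cheap NumPy slices

# e.g.: for batch in iter_batches('features.h5', 'X'): train_on(batch)
-----------------------------------------------------------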

So, in summary: (1) don't chunk if you don't have to; (2) try to reduce the number of HDF5 reads by reading bigger subsets of your data and then slicing them in memory.

Cheers,
Matt Z.

Yulan Liu

Nov 7, 2014, 11:59:31 AM
to h5...@googlegroups.com
Hi Matt,

Thank you a lot for the informative advice! :)


"The HDF5 site claims that chunking can substantially increase I/O speed, but I have yet to see this in practice for most datasets that are dense and accessed sequentially. "

- Good to know! :) My dataset is composed of two parts, the input features and the target data, and the majority, i.e. the input features, is quite dense and indeed accessed sequentially.



"Chunking is particularly beneficial when your data is sparse, randomly accessed, or compressed, *and* you know *exactly* how you will be slicing your data."

- Hmmm... it sounds like this is not my case...



"Given that you seem to be iterating over slices of your data, you can probably make that happen."

- Yeah, the current iterator was not written by me, so I don't quite know what it is actually doing. I will probably need to rewrite the iterator to better fit the data structure, e.g. to take the chunk layout into account.



"It's also curious that you're spending a quarter of your time in select_hyperslab. That says that you're (probably) doing many, many small reads."

- I agree. I guess this is because of the iterator...



"If you can afford to read in bigger subsets of your data (say, read 10 GB at a time, then have your iterator dole out slices of that until you need to read the next 10 GB), you might come out way ahead on performance. HDF5 (generally speaking) takes care of the nitty-gritty actions of reading and writing data, but leaves things like access pattern optimization (like prefetching and caching data) to the user. I've managed to increase my read performance a couple of orders of magnitude by eliminating chunking and then pre-fetching data in the largest possible useful subsets."

- A good lesson to learn! In the design of our experiment we have ruled out the problem of random access for now. As you said, that leaves us some room to optimize the data access, since the iterator could simply make use of data buffered in memory, and hopefully this will speed up the process a lot. Currently this system is around 20 times slower than I expected, and it is quite amazing to me that the time spent on data reading is ~100 times the time spent on actual computation!

I will first try to find the source of the slicing problem. I will also set a large default chunk (as long as our grid manager does not kill me :P) and have another run. :)

Matt, thank you so much again! :)


Best
Yulan

Yulan Liu

Nov 12, 2014, 1:33:57 PM
to h5...@googlegroups.com
Hi Matt,

I found the HDF5 chunking explanation (link below), from a previous discussion in the h5py group, very useful:

http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/

According to Example 2 in that document, the many "select_hyperslab" operations in my case correspond to the "H5Sselect_hyperslab" function. A large number of such operations means the chunk shape is set improperly and the code has fallen into a situation where it has to read a lot of column vectors, whereas my chunk shape is set to be most efficient for fast reading of row vectors.
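
In h5py terms, the contrast looks roughly like this (the file and dataset names are made up): indexing row by row triggers one hyperslab selection and one read per row, while a single slice covering the same rows is one selection and one read.

-----------------------------------------------------------
import h5py

with h5py.File('features.h5', 'r') as f:
    dset = f['X']

    # Many tiny reads: one hyperslab selection + one read per row.
    rows = [dset[i] for i in range(10000)]

    # One big read: a single hyperslab selection covering the same rows.
    block = dset[0:10000]
-----------------------------------------------------------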


Thank you again, Matt, for pointing out and emphasizing the "select_hyperslab" function! :)

Best
Yulan