hdf5 file is 6x bigger than simple txt file


Ondrej Certik

May 19, 2011, 2:07:46 AM
to h5...@googlegroups.com
Hi,

I use h5py, and I saved 92 groups, each containing a float E_tot and a 1D double NumPy array (of size 19 or less), using the following code:

from h5py import File

f = File("data.hdf5", "w")
r = range(1, 93)
r.reverse()
for Z in r:
    f.create_group("Z%02d" % Z)
    E_tot, ks_energies = get_energies(Z, True)
    f.create_dataset("/Z%02d/E_tot" % Z, data=E_tot)
    f.create_dataset("/Z%02d/ks_energies" % Z, data=ks_energies)
    print "%d %e" % (Z, E_tot)


I then read it back and write a simple text file with ASCII output of the numbers:

from h5py import File
from numpy import array

f = File("nonrel_energies.hdf5")
g = open("xx.txt", "w")
for Z in range(1, 93):
    E_tot = array(f["/Z%02d/E_tot" % Z])
    ks_energies = f["/Z%02d/ks_energies" % Z][...]
    g.write("Z = %02d\n" % Z)
    g.write("E_tot = %20.14f\n" % E_tot)
    g.write("ks_energies =\n")
    for e in ks_energies:
        g.write("%20.14f\n" % e)
    g.write("\n")


The resulting xx.txt is 24K, while the HDF5 file is 159K. I thought one of the advantages of the HDF5 format was that it is smaller than just saving the numbers as ASCII, so I must be doing something wrong. What would be the most efficient way to save my data? The 92 arrays have sizes from 1 to 19.

If HDF5 is not a good format for this, what would be the best way to store the data so that I don't have to worry about precision or platform-dependent issues, and so that the files stay small?

Thanks,
Ondrej Certik

nils

May 19, 2011, 8:10:01 AM
to h5py
Hi,

Each group adds around 1K of overhead, and each dataset maybe half of
that or so. That already puts you at > 100K without any data.
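
Back-of-the-envelope, just to see where the 159K could come from (the per-object costs above are only rough estimates):

n_groups = 92
n_datasets = 2 * 92                            # E_tot + ks_energies per group
overhead_kb = n_groups * 1.0 + n_datasets * 0.5
print "~%.0fK of metadata overhead alone" % overhead_kb   # ~184K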

For a scalar, you want to store it in an attribute instead; the same
is probably true for your 1D arrays (I read somewhere there is a 32K
minimum for a dataset). That way you at least save the dataset overhead.
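
An untested sketch of what I mean, reusing your get_energies helper (the file name is just an example):

from h5py import File

f = File("data_attrs.hdf5", "w")
for Z in range(1, 93):
    grp = f.create_group("Z%02d" % Z)
    E_tot, ks_energies = get_energies(Z, True)
    grp.attrs["E_tot"] = E_tot               # scalar stored as an attribute
    grp.attrs["ks_energies"] = ks_energies   # a small 1D array fits in an attribute too
f.close()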

In general, for compact storage you want not-too-small datasets. Only
then does it make sense to switch on compression, which can then
actually save space.
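
Compression is just a keyword to create_dataset; a rough sketch (file name, dataset name, and size are made up, and it only pays off for reasonably large arrays):

import numpy as np
from h5py import File

f = File("compressed.hdf5", "w")
data = np.linspace(0.0, 1.0, 100000)   # something big enough to be worth compressing
f.create_dataset("data", data=data, compression="gzip", compression_opts=4)
# note: gzip implies chunked storage under the hood
f.close()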

I had a similar, but different problem a while back; see this thread
on the hdf user group

http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2011-February/004230.html

In my case it turned out that 'chunked' storage adds a large overhead
for small datasets; 'contiguous' or 'compact' storage layout is more
space efficient. In h5py the kind of layout you get depends on how you
create the datasets. With create_dataset and the data kwarg you most
likely already have contiguous storage.
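
A quick way to check which layout you ended up with (sketch; file and dataset names are made up):

import numpy as np
from h5py import File

f = File("layout_test.hdf5", "w")
a = f.create_dataset("contig", data=np.arange(19.0))                  # contiguous by default
b = f.create_dataset("chunked", data=np.arange(19.0), chunks=(19,))   # explicitly chunked
print a.chunks, b.chunks   # None for contiguous, a tuple for chunked
f.close()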

cheers, Nils