>>> io = IO('/tmp/dataset.hdf5')
>>> io['y'] = y # <--- save
>>> z = io['y'] # <--- load
>>> del io['y'] # <--- delete from archive
That's easy to do since I am using the excellent (dictionary-like)
h5py package.
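For concreteness, such a wrapper can be sketched in a few lines (purely illustrative, assuming only h5py and numpy; the class name IO comes from the example above):

```python
import h5py

class IO:
    """Dict-like facade over an HDF5 archive (illustrative sketch)."""

    def __init__(self, path):
        # 'a' opens read/write, creating the file if it doesn't exist
        self.f = h5py.File(path, 'a')

    def __setitem__(self, key, value):
        if key in self.f:
            del self.f[key]  # h5py won't overwrite an existing name
        self.f[key] = value

    def __getitem__(self, key):
        return self.f[key][...]  # read the whole dataset into memory

    def __delitem__(self, key):
        del self.f[key]  # unlinks only; the space is not reclaimed

    def close(self):
        self.f.close()
```

Note that __setitem__ has to unlink any existing dataset first, so even a plain overwrite triggers the freespace problem described below.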
The problem I am having is with deleting from the archive. A delete on
a HDF5 file is just an unlink and often the space doesn't get reused.
So the archive has to be repacked from time to time. That's not
something I want the user to worry about.
I could repack after every delete. But that doesn't sound like a good
idea if the archive is big and the delete is small. Alternatively, the
user could specify a size, say 100MB, and I could repack if a delete
would cause the total unlinked space to exceed 100MB.
How do I determine the size of the unlinked space? Here's my
unsuccessful attempt:
>>> f = h5py.File('/tmp/data.hdf5')
>>> f.__sizeof__()
32
>>> x = np.random.rand(2000,2000)
>>> f['x'] = x
>>> f.__sizeof__()
32
>>> f['x'].__sizeof__()
32
>>> f.__sizeof__?
Type: builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
String Form: <built-in method __sizeof__ of File object at 0x1add2d0>
Namespace: Interactive
Docstring:
__sizeof__() -> size of object in memory, in bytes
I guess memory is RAM and not disk space.
So do I need to ask the filesystem the total size of the archive and
then estimate the size of each array based on shape, dtype,
compression? Ugh. Repacking after every delete is starting to look
good. Or just let the user decide when to repack.
How do you guys deal with all this?
> How do I determine the size of the unlinked space?
Yes, freespace in HDF5 is a bit of a pain. Fortunately HDF5 tracks
freespace internally until the file is closed. You can get at this
using the low-level interface (http://h5py.alfven.org/docs/api/):
>>> f = h5py.File('foo.hdf5','w')
>>> f.fid.get_freespace()
0L
You can also manually query each dataset to find out how much space
it's taking on disk:
>>> ds = f.create_dataset('foo', data=np.arange(10000))
>>> ds.id.get_storage_size()
40000L
And if you have HDF5 1.8, you can determine the amount of space
"really" occupied by datasets by recursive iteration:
size = 0
def sizefinder(name, obj):
    global size
    if isinstance(obj, h5py.Dataset):
        size += obj.id.get_storage_size()
f.visititems(sizefinder)
Keep in mind this will never exactly match the filesize, as HDF5 has
some overhead for things like groups and metadata. However, with
multi-hundred-megabyte files this is likely unimportant.
(The accumulation line needs "+=" rather than "=", of course, or only
the last dataset visited gets counted. :)
HTH,
Andrew
I don't have HDF5 1.8, but I see that it (and h5py!) will be in Ubuntu 10.04.
Your first solution works well for me:
>>> f = h5py.File('/tmp/data.hdf5')
>>> f['x'] = np.random.rand(1000,1000)
>>> f.fid.get_filesize() - sum([f[z].id.get_storage_size() for z in f.keys()])
2048L
>>> del f['x']
>>> f.fid.get_filesize() - sum([f[z].id.get_storage_size() for z in f.keys()])
8002048L
I like h5py. I hope you plan to keep working on it after your
dissertation is done.
Oh, does the low-level h5py interface give me access to HDF5's repack?
If not, it should be easy to roll my own.
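A hand-rolled repack just copies every live object into a fresh file and swaps it in. A sketch (the function name repack is made up here, and this ignores details like root attributes beyond a shallow copy):

```python
import os
import h5py

def repack(path):
    """Reclaim unlinked space by copying live objects to a new file (sketch)."""
    tmp = path + '.repack-tmp'
    with h5py.File(path, 'r') as src, h5py.File(tmp, 'w') as dst:
        for name in src:
            src.copy(name, dst)  # copies the object (and children) with data
        for k in src.attrs:
            dst.attrs[k] = src.attrs[k]  # carry over root attributes
    os.replace(tmp, path)  # swap the compact copy in for the original
```

The official tool that ships with HDF5 (h5repack) does the same job in C and will usually be faster on big archives.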
> Your first solution works well for me:
Sounds good. Just make sure to call f.flush() before you test the
sizes; in this case, it looks like HDF5 does indeed truncate the file
after 'x' is deleted. However, if you create an "x" and "y" and then
delete "x", a hole appears:
>>> f['x'] = np.random.rand(1000,1000)
>>> f['y'] = np.random.rand(1000,1000)
>>> del f['x']
>>> f.flush()
>>> f.fid.get_filesize() - sum(x.id.get_storage_size() for x in f.itervalues())
8002048L
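That measurement can be bundled into a helper (a sketch; note that current h5py spells the low-level handle f.id rather than f.fid, and the result includes metadata overhead for groups etc., so treat it as an upper bound on reclaimable space):

```python
import h5py

def unlinked_space(f):
    """Filesize minus the storage actually used by datasets (sketch)."""
    f.flush()  # make sure the file on disk reflects pending changes
    used = 0
    def visit(name, obj):
        nonlocal used
        if isinstance(obj, h5py.Dataset):
            used += obj.id.get_storage_size()
    f.visititems(visit)
    return f.id.get_filesize() - used
```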
> I like h5py. I hope you plan to keep working on it after your
> dissertation is done.
Thanks! I expect to support h5py indefinitely, even if part-time.
It's not going anywhere. :)
> Oh, does the low-level h5py interface give me access to HDF5's repack?
> If not, it should be easy to roll my own.
No, this is done by a standalone tool from the HDF Group (h5repack),
which ships with HDF5 (and should be in Ubuntu). You can write one in
h5py, but if theirs is available it will likely be faster.
Andrew
Thank god.