Deleting and file size


kwgoodman

Jan 4, 2010, 10:56:54 AM
to h5py
I am making a dictionary-like interface for archiving an object that
contains a numpy array and some other stuff:

>>> io = IO('/tmp/dataset.hdf5')
>>> io['y'] = y # <--- save
>>> z = io['y'] # <--- load
>>> del io['y'] # <--- delete from archive

That's easy to do since I am using the excellent (dictionary-like)
h5py package.

The problem I am having is with deleting from the archive. A delete on
a HDF5 file is just an unlink and often the space doesn't get reused.
So the archive has to be repacked from time to time. That's not
something I want the user to worry about.

I could repack after every delete. But that doesn't sound like a good
idea if the archive is big and the delete is small. Alternatively, the
user could specify a size, say 100MB, and I could repack if a delete
would cause the total unlinked space to exceed 100MB.
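
To make the question concrete, here's roughly the wrapper I have in
mind (all the names are made up, and the two stubbed methods are
exactly the parts I don't know how to write):

```python
import h5py

REPACK_THRESHOLD = 100 * 2**20  # repack once unlinked space exceeds 100 MB

class IO(object):
    """Dict-like archive that repacks itself when too much space is wasted."""

    def __init__(self, path):
        self.path = path
        self.f = h5py.File(path, 'a')

    def __setitem__(self, key, value):
        self.f[key] = value  # save

    def __getitem__(self, key):
        return self.f[key][...]  # load back as a numpy array

    def __delitem__(self, key):
        del self.f[key]  # unlink; the space is not necessarily reclaimed
        if self._unlinked_space() > REPACK_THRESHOLD:
            self._repack()

    def _unlinked_space(self):
        # How do I compute this?
        raise NotImplementedError

    def _repack(self):
        # Copy live objects to a fresh file and swap it in.
        raise NotImplementedError
```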

How do I determine the size of the unlinked space? Here was my
unsuccessful attempt:

>> f = h5py.File('/tmp/data.hdf5')
>> f.__sizeof__()
32
>> x = np.random.rand(2000,2000)
>> f['x'] = x
>> f.__sizeof__()
32
>> f['x'].__sizeof__()
32
>> f.__sizeof__?
Type: builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
String Form: <built-in method __sizeof__ of File object at 0x1add2d0>
Namespace: Interactive
Docstring:
__sizeof__() -> size of object in memory, in bytes

I guess memory is RAM and not disk space.

So do I need to ask the filesystem the total size of the archive and
then estimate the size of each array based on shape, dtype,
compression? Ugh. Repacking after every delete is starting to look
good. Or just let the user decide when to repack.

How do you guys deal with all this?

Andrew Collette

Jan 4, 2010, 1:21:09 PM
to h5...@googlegroups.com
Hi,

> How do I determine the size of the unlinked space?

Yes, freespace in HDF5 is a bit of a pain. Fortunately HDF5 tracks
freespace internally until the file is closed. You can get at this
using the low-level interface (http://h5py.alfven.org/docs/api/):

>>> f = h5py.File('foo.hdf5','w')
>>> f.fid.get_freespace()
0L

You can also manually query each dataset to find out how much space
it's taking on disk:

>>> ds = f.create_dataset('foo', data=np.arange(10000))
>>> ds.id.get_storage_size()
40000L

And if you have HDF5 1.8, you can determine the amount of space
"really" occupied by datasets by recursive iteration:

size = 0
def sizefinder(name, obj):
    global size
    if isinstance(obj, h5py.Dataset):
        size = obj.id.get_storage_size()

f.visititems(sizefinder)

Keep in mind this will never exactly match the filesize, as HDF5 has
some overhead for things like groups and metadata. However, with
multi-hundred-megabyte files this is likely unimportant.

Andrew Collette

Jan 4, 2010, 1:29:40 PM
to h5...@googlegroups.com
>    if isinstance(obj, h5py.Dataset):
>        size = obj.id.get_storage_size()

This should be "+=", of course. :)

HTH,
Andrew

Keith Goodman

Jan 4, 2010, 2:01:35 PM
to h5...@googlegroups.com

I don't have HDF5 1.8 but I see that it (and h5py!) will be in Ubuntu 10.04.

Your first solution works well for me:

>> f = h5py.File('/tmp/data.hdf5')

>> f['x'] = np.random.rand(1000,1000)
>> f.fid.get_filesize() - sum([f[z].id.get_storage_size() for z in f.keys()])
2048L
>> del f['x']
>> f.fid.get_filesize() - sum([f[z].id.get_storage_size() for z in f.keys()])
8002048L

I like h5py. I hope you plan to keep working on it after your
dissertation is done.

Oh, does the low-level h5py interface give me access to HDF5's repack?
If not, it should be easy to roll my own.

Andrew Collette

Jan 4, 2010, 6:31:58 PM
to h5...@googlegroups.com
Hi,

> Your first solution works well for me:

Sounds good. Just make sure to call f.flush() before you test the
sizes; in this case, it looks like HDF5 does indeed truncate the file
after 'x' is deleted. However, if you create an "x" and "y" and then
delete "x", a hole appears:

>>> f['x'] = np.random.rand(1000,1000)
>>> f['y'] = np.random.rand(1000,1000)
>>> del f['x']
>>> f.flush()
>>> f.fid.get_filesize() - sum(x.id.get_storage_size() for x in f.itervalues())
8002048L

> I like h5py. I hope you plan to keep working on it after your
> dissertation is done.

Thanks! I expect to support h5py indefinitely, even if part-time.
It's not going anywhere. :)

> Oh, does the low-level h5py interface give me access to HDF5's repack?
> If not, it should be easy to roll my own.

No, this is done by a standalone tool, h5repack, which the HDF Group
ships with HDF5 (and should be in Ubuntu). You can write one in h5py,
but if theirs is available it might be faster.
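
If you do end up rolling your own, a rough sketch might look like the
following. This is a simplification of what h5repack does: it copies
groups and datasets recursively (attributes come along for the ride),
but it ignores things like external storage and root-group attributes.

```python
import os
import h5py

def repack(path):
    """Copy every live object into a fresh file and swap it over the original.

    Only reachable objects are copied, so any unlinked space in the old
    file is left behind.
    """
    tmp = path + '.repack'
    src = h5py.File(path, 'r')
    dst = h5py.File(tmp, 'w')
    try:
        for name in src:
            src.copy(name, dst)  # recursive copy of each top-level object
    finally:
        src.close()
        dst.close()
    os.rename(tmp, path)  # replace the fragmented file
```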

Andrew

Darren Dale

Jan 5, 2010, 7:42:58 AM
to h5...@googlegroups.com
On Mon, Jan 4, 2010 at 6:31 PM, Andrew Collette
<andrew....@gmail.com> wrote:
>> I like h5py. I hope you plan to keep working on it after your
>> dissertation is done.
>
> Thanks!  I expect to support h5py indefinitely, even if part-time.
> It's not going anywhere. :)

Thank god.
