Read HDF5 file into memory, and then process

bob.w...@gmail.com

Oct 9, 2013, 3:48:22 AM
to pytable...@googlegroups.com
Hi Guys,
I am batch-processing existing HDF5 files and loading the results into HBase on Hadoop. I have this working with PyTables: open the file, process it, then load into HBase.

For complicated reasons, I now need to read the HDF5 files from stdin into memory and then do the processing.
Is this possible using Python? Any clues? The files are around 200 MB, so they easily fit in memory.
E.g.:

cat on_disk_file.hd5 | python process_hd5.py


where process_hd5.py will be something like:

#!/usr/bin/env python

import sys
import tables

def read_in_chunks( file_object, chunk_size=2**16 ):
    while True:
        chunk = file_object.read( chunk_size )
        if not chunk:
            break
        yield chunk

for chunk in read_in_chunks( sys.stdin ):
    <now I am stuck>

How can I get the file into memory and accessible with pytables? I would normally process with something like this:

with tables.openFile( h5_filename, 'r' ) as h5f:
        query_tables = h5f.walkNodes("/", "Table")


My Python is too weak for me to see how to get there.
The documentation for PyTables 3.0 shows the sort of thing I want to do:

h5file = tables.openFile("new_sample.h5", "w", driver="H5FD_CORE", driver_core_backing_store=0)

But this is opening a file on disk, whereas I want to open a 'file' in memory?
Thanks for helping out a newbie



Antonio Valentino

Oct 9, 2013, 4:45:07 AM
to pytable...@googlegroups.com
Hi Bob,
You can use the DRIVER_CORE_IMAGE parameter [1] of the tables.open_file
function [2] to pass the memory buffer containing the hdf5 file:

# read data from stdin
data = sys.stdin.read() # python 2

or

data = sys.stdin.buffer.read() # python 3

# open the HDF5 handler passing a memory buffer
with tables.openFile("new_sample.h5", "w", driver="H5FD_CORE",
                     driver_core_backing_store=0, DRIVER_CORE_IMAGE=data) as h5file:
    [CODE HERE]


[1] http://pytables.github.io/usersguide/parameter_files.html#tables.parameters.DRIVER_CORE_IMAGE
[2] http://pytables.github.io/usersguide/libref/top_level.html#tables.open_file
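
The two read calls above can be folded into one helper that also reuses the chunked reader from the original question. This is only a sketch: `read_all_bytes` is a hypothetical name, and the `getattr` fallback assumes that `sys.stdin.buffer` is present on Python 3 and absent on Python 2.

```python
import sys


def read_in_chunks(file_object, chunk_size=2**16):
    """Yield successive chunks until the stream is exhausted."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


def read_all_bytes(stream=None):
    """Return the whole stream as a single bytes object.

    Defaults to standard input. On Python 3 the underlying binary
    buffer is used, so the HDF5 bytes are not decoded as text.
    """
    if stream is None:
        # Hypothetical fallback: Python 2's sys.stdin has no .buffer.
        stream = getattr(sys.stdin, "buffer", sys.stdin)
    return b"".join(read_in_chunks(stream))
```

Any binary file-like object can be passed in place of stdin, for example an `io.BytesIO`, which makes the helper easy to try without a pipe.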


cheers

--
Antonio Valentino

bob.w...@gmail.com

Oct 9, 2013, 7:11:13 AM
to pytable...@googlegroups.com

Thanks Antonio. That has moved me along. However there is still something I don't understand. Isn't the new file 'new_sample.h5' empty because of the 'w' flag?
This means my operations on the file are on an empty file?
Here is my actual code for testing and result:

print 'start reading'
data         = sys.stdin.read()
print 'finish reading'
print len( data )

with tables.openFile("new_sample.h5", "w", driver="H5FD_CORE", driver_core_backing_store=0, driver_core_image=data) as h5f:
    print h5f


And the result:

$ cat pktdump.h5 | python hd5_test.py
start reading
finish reading
211063152
new_sample.h5 (File) u''
Last modif.: 'Wed Oct  9 12:01:44 2013'
Object Tree:
/ (RootGroup) u''



Thanks




Antonio Valentino

Oct 9, 2013, 7:22:28 AM
to pytable...@googlegroups.com
Hi Bob,
Yes, sorry, of course you should open the file in read mode 'r' not 'w'
in this case.
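
For reference, a minimal sketch of the corrected approach, assuming the PyTables 3.x names (`open_file`, `walk_nodes`) rather than the older `openFile`/`walkNodes` spellings used earlier in the thread. The helper names `open_hdf5_image`, `table_paths` and `main`, and the label "in_memory.h5", are made up for illustration; an HDF5 library recent enough to support file images is also assumed.

```python
import sys


def open_hdf5_image(data, name="in_memory.h5"):
    """Open raw HDF5 bytes as a read-only, in-memory PyTables file."""
    # Imported here so the module still loads where PyTables is absent.
    import tables

    return tables.open_file(
        name,                         # label only; with the backing store
                                      # off, the thread suggests no file of
                                      # this name has to exist on disk
        "r",                          # read mode: 'w' would start empty
        driver="H5FD_CORE",           # keep the whole file in memory
        driver_core_image=data,       # the bytes read from stdin
        driver_core_backing_store=0,  # never write the image to disk
    )


def table_paths(h5file):
    """Return the path of every Table node in the open file."""
    return [node._v_pathname for node in h5file.walk_nodes("/", "Table")]


def main():
    # Binary stdin: .buffer on Python 3, sys.stdin itself on Python 2.
    data = getattr(sys.stdin, "buffer", sys.stdin).read()
    h5file = open_hdf5_image(data)
    try:
        for path in table_paths(h5file):
            print(path)
    finally:
        h5file.close()
```

Calling `main()` at the bottom of `process_hd5.py` should then let the script be driven as in the original question: `cat on_disk_file.hd5 | python process_hd5.py`.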


cheers

--
Antonio Valentino

bob.w...@gmail.com

Oct 9, 2013, 7:24:47 AM
to pytable...@googlegroups.com

Great! Now I understand, and it works of course!!

I just wanted to be clear that I hadn't misunderstood how it was working.
Thanks again

Antonio Valentino

Oct 9, 2013, 8:18:43 AM
to pytable...@googlegroups.com
Hi Bob,

On 09/10/2013 13:24, bob.w...@gmail.com wrote:
>
> Great! Now I understand, and it works of course!!
>
> I just wanted to be clear that I hadn't misunderstood how it was working.
> Thanks again
>

Very good!

Your use case is very interesting.
Would you like to update the "In-memory HDF5 files" recipe [1] to
include your example?
It could be useful for other users.

You can get the source code from github [2].
The relevant file is
"PyTables/doc/source/cookbook/inmemory_hdf5_files.rst" [3].


[1] http://pytables.github.io/cookbook/inmemory_hdf5_files.html
[2] https://github.com/PyTables/PyTables
[3] https://github.com/PyTables/PyTables/blob/develop/doc/source/cookbook/inmemory_hdf5_files.rst


cheers

--
Antonio Valentino