Hi Guys,
I am doing some batch processing of existing HDF5 files: I process each file and then load the results into HBase on Hadoop. I have this working with PyTables by opening the file from disk, processing it, and then loading it into HBase.
For complicated reasons, I now need to read the HDF5 files from stdin into memory, and then do the processing.
Is this possible using Python? Any clues? The files are around 200MB, so they easily fit in memory.
e.g.:

    cat on_disk_file.hd5 | python process_hd5.py

where process_hd5.py will be something like:
    #!/usr/bin/env python
    import sys
    import tables

    def read_in_chunks(file_object, chunk_size=2**16):
        while True:
            chunk = file_object.read(chunk_size)
            if not chunk:
                break
            yield chunk

    for chunk in read_in_chunks(sys.stdin):
        <now I am stuck>
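I suppose the first step is just to slurp the whole stream into one string rather than looping over the chunks, something along these lines (untested, and assuming Python 2 where sys.stdin hands back raw bytes):

    # collect all of stdin into a single string holding the HDF5 file image
    file_image = "".join(read_in_chunks(sys.stdin))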
How can I get the file into memory and make it accessible to PyTables? I would normally process it with something like this:
    with tables.openFile(h5_filename, 'r') as h5f:
        query_tables = h5f.walkNodes("/", "Table")

My Python is so poor that I can't see how to get there.
The documentation for PyTables 3.0 shows the sort of thing I want to do:
    h5file = tables.openFile("new_sample.h5", "w", driver="H5FD_CORE",
                             driver_core_backing_store=0)
But that example creates a new file, whereas I want to open an existing 'file' that only lives in memory (the bytes read from stdin)?
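The docs also seem to mention a driver_core_image argument that takes the file contents as a string, so I'm guessing (completely untested) the whole script would end up roughly like this:

    #!/usr/bin/env python
    import sys
    import tables

    # read the whole HDF5 file image from stdin (~200MB, so fine to hold in memory)
    file_image = sys.stdin.read()

    # guess: hand the image to the CORE driver with no backing store, so nothing touches disk;
    # the filename looks like it is just a label when an image is supplied
    with tables.openFile("in-memory.h5", mode="r", driver="H5FD_CORE",
                         driver_core_image=file_image,
                         driver_core_backing_store=0) as h5f:
        for table in h5f.walkNodes("/", "Table"):
            # ... same processing and loading into HBase as before ...
            print(table)

Is that anywhere near right?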
Thanks for helping out a newbie