Storing metadata and files in an HDFStore


Andreas Hilboll

unread,
May 8, 2013, 1:40:52 PM5/8/13
to pyd...@googlegroups.com
Hi,

I downloaded some XLS files from the internet and converted the data to a
DataFrame. This DataFrame I want to store in an HDFStore. So far so good.

To keep things neat and tidy, I'd like to store some metadata in the
HDFStore as well (URL, date downloaded). What would be even better: if I
could include whole files in the HDFStore as well (the script I used to
convert the data from XLS to DataFrame, and the XLS file itself).
Because then, this one HDFStore would contain all data to reproduce the
results at any later time.

Any idea if / how this can be achieved? I know that pytables supports a
filenode feature which would at least allow storing files, right?

I think this would be nice, in the light of reproducible science ...

Cheers, Andreas.

Jeff Reback

unread,
May 8, 2013, 1:58:19 PM5/8/13
to pyd...@googlegroups.com
see the last topic in this section

http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore

things are pickled, so you can store pretty much anything
I think there is a limit of 64kb in size on a single node though
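A minimal sketch of that cookbook recipe (requires PyTables; the URL and date values are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
store = pd.HDFStore("meta_example.h5", "w")
store.put("df", df)
# arbitrary picklable metadata goes on the storer's attrs;
# each attribute must stay under the ~64 KB node limit
store.get_storer("df").attrs.metadata = {
    "url": "http://example.com/data.xls",  # hypothetical source URL
    "downloaded": "2013-05-08",
}
meta = store.get_storer("df").attrs.metadata
store.close()
```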
--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Andreas Hilboll

unread,
May 9, 2013, 7:48:03 AM5/9/13
to pyd...@googlegroups.com, Jeff Reback
On 08.05.2013 19:58, Jeff Reback wrote:
> see the last topic in this section
>
> http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore
>
> things are pickles so you can store pretty much anything
> I think there is a limit of 64kb in size on a single node though

Thanks, Jeff, that got me started in the right direction.

There's indeed a 64k limit on single attributes. Since my XLS files are
larger, I settled for storing the binary data directly using the
pytables filenode:

import pandas as pd
from tables.nodes import filenode

# open the store; pandas keeps the underlying PyTables file in ._handle
H = pd.HDFStore("test.h5", "w", complib='blosc', complevel=9)
with open("data.xls", "rb") as fd:
    XLS = fd.read()
# embed the raw XLS bytes via the PyTables filenode module
# (with PyTables >= 3.0 these are spelled create_group / new_node)
H._handle.createGroup('/', 'source_data')
fnode = filenode.newNode(H._handle, where='/source_data', name="data.xls")
fnode.write(XLS)
fnode.close()
H.close()
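For reference, here is a self-contained round-trip of the same idea using the PyTables >= 3.0 spellings (create_group / new_node / open_node); the payload bytes are made up:

```python
import pandas as pd
from tables.nodes import filenode

payload = b"pretend these are the raw data.xls bytes"

# write: embed the bytes as a filenode under /source_data
H = pd.HDFStore("roundtrip.h5", "w", complib='blosc', complevel=9)
H._handle.create_group('/', 'source_data')
fnode = filenode.new_node(H._handle, where='/source_data', name="data.xls")
fnode.write(payload)
fnode.close()
H.close()

# read: open the node again and pull the bytes back out
H = pd.HDFStore("roundtrip.h5", "r")
fnode = filenode.open_node(H._handle.get_node('/source_data/data.xls'), 'r')
restored = fnode.read()
fnode.close()
H.close()
```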

Again, thanks for your help, Jeff!

-- Andreas.

Jeff

unread,
May 9, 2013, 10:09:10 AM5/9/13
to pyd...@googlegroups.com, li...@hilboll.de
great...glad it worked out for you

my 2c...I would simply store a pathname to this XLS rather than the actual file itself
for simplicity, but you may have a reason to bind it to the actual file

Jeff

Andreas Hilboll

unread,
May 9, 2013, 2:21:59 PM5/9/13
to Jeff, pyd...@googlegroups.com
On 09.05.2013 16:09, Jeff wrote:
> great...glad it worked out for you
>
> my 2c...I would simply store a pathname to this XLS rather than the
> actual file itself
> for simplicity, but you may have a reason to bind it to the actual file

My goal is traceability. I often download data files (often XLS) from
the web, in order to use the data in my research. Different XLS files
have different formats, so the parsing always differs a bit. Also, maybe
I need to reindex stuff according to my own standards. And the original
data producers might not care about reproducibility and change file
contents without changing file names.

By having the original XLS file, the resulting pandas object, and the
Python code which converted the XLS file to the pandas object, all
within one file, I get total traceability. No matter if I reorganize my
hard disk, or if I pass the data file on to a coworker, it's always
totally clear where the data came from.

In order to be able to reproduce my results (or find the reason why
reproducing them isn't possible), I chose to go this way.

Actually, I just hacked together a small module to do exactly this. I
put it on Github:

https://github.com/andreas-h/pyrepsci

Maybe it can be helpful to someone out there. Any comments are welcome.

-- Andreas.

Jeff

unread,
May 9, 2013, 3:00:30 PM5/9/13
to pyd...@googlegroups.com, li...@hilboll.de
very interesting...

I read a thread not too long ago about someone doing this via MongoDB...
mongo apparently keeps nice 'versioning' of data and such, but has
a limit on document size (they hacked it to get, I think, 500 MB a document),
but HDFStore is still 'better' in that regard

but as you note, it has its pitfalls as well
Your solution is a pretty nice idea:

a single file of versioned 'data'

thanks!



Jeff

unread,
May 9, 2013, 3:05:15 PM5/9/13
to pyd...@googlegroups.com, li...@hilboll.de
I think you can require pandas >= 0.10.1 only; there were many changes in 0.10/0.10.1 for HDFStore,
so before that it is pretty different (though you can still read files created before then)

You might find this interesting as well (probably going to be introduced in 0.12)
https://github.com/pydata/pandas/pull/3525


Andrew Giessel

unread,
May 9, 2013, 3:44:17 PM5/9/13
to pyd...@googlegroups.com
Just to interject: I've also done this with mongoDB, and the changes needed to use bigger data (up to half a gig) are the first 4 commits here:

https://github.com/dattalab/mongo/commits/big-data
The SONManipulator classes are really nice for shoving docs in and out. We use one for numpy arrays, but you could easily write one for dataframes as well.
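For anyone curious, the heart of such a manipulator is just an encode/decode pair; a rough sketch of the ndarray round-trip (the helper names are made up, and the pymongo wiring is omitted):

```python
import pickle
import numpy as np

def encode_array(arr):
    # serialize an ndarray into bytes suitable for a BSON Binary field
    # (roughly what a SONManipulator's transform_incoming would do)
    return pickle.dumps(arr, protocol=2)

def decode_array(blob):
    # inverse transform, as transform_outgoing would do on retrieval
    return pickle.loads(blob)

a = np.arange(6).reshape(2, 3)
b = decode_array(encode_array(a))
```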

best,

ag




--
Andrew Giessel, PhD

Department of Neurobiology, Harvard Medical School
220 Longwood Ave Boston, MA 02115
ph: 617.432.7971 email: andrew_...@hms.harvard.edu

Andreas Hilboll

unread,
May 9, 2013, 5:10:11 PM5/9/13
to Jeff, pyd...@googlegroups.com
Thanks for the hint, Jeff, I'll look into msgpack at some point. And
I'll try out if 0.10.1 works for me.

Cheers, Andreas.
-- Andreas.

Tim Michelsen

unread,
May 9, 2013, 6:06:01 PM5/9/13
to pyd...@googlegroups.com
Hello Andreas,

> My goal is traceability. I often download data files (often XLS) from
> the web, in order to use the data in my research. Different XLS files
> have different formats, so the parsing always differs a bit. Also, maybe
> I need to reindex stuff according to my own standards. And the original
> data producers might not care about reproducibility and change file
> contents without changing file names.
>
> By having the original XLS file, the resulting pandas object, and the
> Python code which converted the XLS file to the pandas object, all
> within one file, I get total traceability. No matter if I reorganize my
> harddisk, or if I pass the data file on to a coworker, it's always
> totally clear where the data came from.

> In order to be able to reproduce one's results (or find a reason why
> reproducing isn't possible) I chose to go this way.
I really liked your idea and approach.
I was actually waiting for something like this:
* upload raw data to database
* add meta data on source and properties
* track name of the modifier
* track commands & scripts used to generate this data
* retrieve the data for analysis
* do analysis and save back with the respective metadata

If you look at other domains, GRASS GIS uses a metadata descriptor:
http://grass.osgeo.org/grass64/manuals/r.info.html

    Data Description:
       generated by r.slope.aspect

Together with a history log, that already takes you somewhere...
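A sketch of what such a history log could look like on an HDFStore node (the log entry text and file names are made up):

```python
import pandas as pd

store = pd.HDFStore("history_example.h5", "w")
store.put("df", pd.DataFrame({"x": [1.0, 2.0]}))
attrs = store.get_storer("df").attrs
# append-style history attribute, in the spirit of GRASS's r.info record
attrs.history = getattr(attrs, "history", []) + [
    "2013-05-09: generated by convert_xls.py",  # hypothetical entry
]
store.close()

# read the log back
store = pd.HDFStore("history_example.h5", "r")
log = store.get_storer("df").attrs.history
store.close()
```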

> Actually, I just hacked together a small module to do exactly this. I
> put it on Github:
>
> https://github.com/andreas-h/pyrepsci
>
> Maybe it can be helpful to someone out there. Any comments are welcome.
I checked it out. It's really cool.
I just miss a simple method to extract all stuff from the HDF-file into
a directory structure.
Maybe even with an IPython log or notebook template: once started or run
in IPython, we would have the pandas dataframe ready at hand...

If we manage to add this into IPython and/or Spyder so that the script
logging becomes an automatic process, it would be really easy to prove
your results.

You may also check Gloo (http://tshauck.github.io/Gloo/), which
gives the general structure that you could use to set up your HDF5 file.

BTW, did you compress the XLS files before putting them into the HDF file?
Would this help to reduce the size?
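For what it's worth, whether pre-compression helps depends on the format: legacy .xls is uncompressed binary, while .xlsx is already a zip archive, so gains vary. A quick stdlib check of the idea:

```python
import gzip

# stand-in for uncompressed, repetitive spreadsheet bytes
raw = b"row data; row data; row data; " * 200
compressed = gzip.compress(raw)
# repetitive binary compresses well; already-zipped .xlsx would not
print(len(raw), len(compressed))
```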

Regards,
Timmie


Tim Michelsen

unread,
May 9, 2013, 6:08:59 PM5/9/13
to pyd...@googlegroups.com
> Just to interject: I've also done this with mongoDB and the changes
> needed to use bigger data (up to half a gig) are the first 4 commits here:
>
> https://github.com/dattalab/mongo/commits/big-data
So did you follow a similar approach as Andreas?

Tim Michelsen

unread,
May 9, 2013, 6:10:09 PM5/9/13
to pyd...@googlegroups.com
> I read a thread not too long ago about someone doing this via MongoDB....
> mongo apparently keeps nice 'versioning' of data and such, but has
> a limitation of data size (they hacked it to get i think 500mb a document),
> but HDFStore is still 'better' in that regards
Are you aware of people working on something similar with MongoDB?

This would be nice because it could be used in webapps, too.
