Re: Exchanging tabular data between R and pandas using HDF5


Jeff

Jan 14, 2013, 1:39:55 PM1/14/13
to pyd...@googlegroups.com
If you would like to make a specification of a dataframe format that is easily readable by R, then it's pretty straightforward to have HDFStore read/write this format
there is now plenty of support in HDFStore to read/write foreign formats..

just put out an example file in the format that R can read and I can put something together


On Monday, January 14, 2013 1:13:48 PM UTC-5, Moritz wrote:
Hello all,

My query is related to the following question on StackOverflow:
http://stackoverflow.com/questions/12278347/how-can-i-efficiently-save-a-python-pandas-dataframe-in-hdf5-and-open-it-as-a-da

and is partially addressed by trtools, mentioned in the notebook by Dale Jung:
https://groups.google.com/forum/?fromgroups=#!searchin/pydata/hdf5/pydata/bsgm5V4xP04/hCxi04Iq9z8J

I have data prepared in R using Bioconductor, which I can save quite nicely in an HDF5 file. The data.frame is stored as a compound dataset. When I read the file using the HDFStore class, the dataset is not recognised as a single entity (a DataFrame); instead, each column is available to me individually. I would like to work with the data and then store results again in a way that R can read them as a data.frame. So my questions:

1. If I can get R to write the data.frame as a table inside the hdf5 file, can pandas read that as a DataFrame?
2. Is there a way to have pandas write a DataFrame that can be read by R as a data.frame?
3. Is there a general interest in such round-trip data exchange, i.e., would hacking together a more elaborate solution, maybe based on Dale Jung's trtools, be worthwhile to the community?

Thanks in advance for any replies,
Moritz


Moritz

Jan 15, 2013, 7:01:37 AM1/15/13
to pyd...@googlegroups.com
Great, that's encouraging. So I did some testing, and a major show-stopper seems to be that pandas writes out even individual values as arrays. I have a screenshot that compares a table created by the PyTables tutorial with a stored DataFrame created with pandas: http://s47.beta.photobucket.com/user/ernie999/media/Screenshotat2013-01-15124147_zpsb14baa8e.png.html

The code used to create the DataFrame is:

import numpy
import pandas
import string

dim_i = 8
indices = list(string.lowercase[:dim_i])  # 'a' through 'h' (Python 2)
df = dict()
df["A"] = pandas.Series(numpy.random.randn(dim_i), index=indices)
df["gender"] = pandas.Factor(numpy.random.random_integers(0, 1, dim_i), ["woman", "man"])
df = pandas.DataFrame(df)
df

On the R side I'm using the rhdf5 library by Bernd Fischer, which is part of Bioconductor. It can read the table from the PyTables tutorial perfectly but completely chokes on the table generated by pandas.

Another point I wanted to try and coordinate with you, and hopefully Bernd Fischer, is the handling of factors. rhdf5 currently only writes the integer values of factors, so the levels are lost; pandas writes out the strings. I would propose writing the integer values but creating an Enum attribute that contains the levels, which both ends could use to restore the factor. What do you think?
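Roughly what I have in mind, as a minimal sketch (the attribute name is just a placeholder; neither library does this today):

import numpy
import tables

# store the factor as its integer codes and attach the level labels as an
# attribute, so both R and pandas could rebuild the factor from them
codes = numpy.random.random_integers(0, 1, 8)   # 0 = "woman", 1 = "man"
h5 = tables.openFile('factors.h5', 'w')         # PyTables 2.x API
arr = h5.createArray('/', 'gender', codes)
arr.attrs.levels = ['woman', 'man']             # assumed attribute name
h5.close()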

Jeff

Jan 15, 2013, 8:03:23 AM1/15/13
to pyd...@googlegroups.com
this explains the difference between the 'regular' format (which is what you are doing with the dataframe) and the Table format (which is like the detector example in PyTables)

make sure you are on the latest master of 0.10.1-dev

ok, try this (for exporting data from pandas -> HDF5); this will write in 'table' format:

import pandas as pd

store = pd.HDFStore('foo.h5')
# data_columns is the list of columns you want written as individual
# fields; df.columns works, for example
store.append('df', df, data_columns=list(df.columns))
store.close()

# creating data columns creates the individual fields (like in the detector example)
# you can also try w/o data_columns, but you will likely have to split the columns up when you read in from R (they are stored in blocks by type if you don't specify data columns)
# the indices are automatically stored
# HDFStore creates a bunch of attributes which you can ignore/use when you read it in; these have all of the data to easily reconstruct the table (and correctly set things like dtypes)
#    it's a little more involved than the detector example because we want to allow the table to be inspected w/o reading in all of the data (e.g. doing queries) from pandas

reading - it's pretty straightforward to read in the detector example (but I'll have to put in a fix, I think, to do it)
I was actually suggesting a 3rd option: that we create a format (e.g. a set of attributes that are stored along with the table), so it is easy to identify a priori what format it's in, e.g. a pandas_type attribute, which allows us not to have to guess about what we are doing
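e.g., on the reading side, something like this (illustrative only; assumes the 'foo.h5'/'df' names from the example above):

import tables

h5 = tables.openFile('foo.h5', 'r')   # PyTables 2.x API
group = h5.getNode('/df')
# a pandas-written node carries pandas_type; a generic one would not
kind = getattr(group._v_attrs, 'pandas_type', None)
print(kind)   # 'frame_table' for a table-format DataFrame, None if generic
h5.close()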
 
this may not be necessary if I build in support for generic PyTables

I'll let you know
 
Jeff

Moritz

Jan 15, 2013, 8:12:29 AM1/15/13
to pyd...@googlegroups.com
My apologies, I didn't mention that I stored the DataFrame with:

store.put("df", df, table=True)

I also used the append method and the results are the same as in the screenshot posted before; single values are still written as arrays.

Jeff

Jan 15, 2013, 8:28:06 AM1/15/13
to pyd...@googlegroups.com
that will write a table as well (e.g. put with table=True is equivalent to append)...

(you need to be on 0.10.1-dev)
use data_columns=df.columns (this will write each column individually)
 
 
e.g.
 
store.put('df',df,table=True,data_columns=df.columns)

Moritz Emanuel Beber

Jan 15, 2013, 9:18:34 AM1/15/13
to pyd...@googlegroups.com
That actually works! A bunch of (seemingly) nonsensical data is extracted into a variable called _i_table, but all the other data is extracted correctly into a data.frame. Now I just need to pester Bernd Fischer about factors...

My thanks!

As you say, reading generic PyTables files still lacks recognition of a bunch of attributes for the proper HDFStore representation. Last question: is there a way to read a table back into an in-memory DataFrame?

Jeff

Jan 15, 2013, 9:28:45 AM1/15/13
to pyd...@googlegroups.com
if you cannot get the R reading of the table format (with data columns) to work, and you are adventurous.....you can update to my branch on GitHub

this is pushed as a PR in pandas but not merged into master yet...

it adds the ability to read in generic PyTables tables (like detector)....so this should allow you to export from R and read into pandas
 
Jeff

Jeff

Jan 15, 2013, 9:31:26 AM1/15/13
to pyd...@googlegroups.com
what do you mean by 'read a table into an in-memory DataFrame again'?

you mean like

store.select('df') ? (which, if you store as a table, lets you use queries; store['df'] will work as well if you don't have a where criterion)
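e.g., a minimal round trip, assuming the 'foo.h5' store from earlier (the string where syntax shown here is from later pandas releases; in 0.10 you would pass Term objects):

import pandas as pd

store = pd.HDFStore('foo.h5')
df = store.select('df')    # the whole table back as an in-memory DataFrame
# with data_columns set on write, you can also query on those columns
men = store.select('df', where="gender == 'man'")
store.close()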

Moritz Emanuel Beber

Jan 15, 2013, 10:46:08 AM1/15/13
to pyd...@googlegroups.com
So I tested your experimental branch but still no go. I'm not sure whether you use the IPython notebook, so I just attached a PDF of my workflow. I used the PyTables tutorial file again.

As you can see, the keys are not updated properly and it is unclear to me how to select the data. Updating the keys is probably an easy hack (I don't mind contributing once the goals are clear) but the larger issue is probably the expected data format.

General hdf5 files can behave like file systems by having multiple (sub-)groups. Under each group it is possible to store multiple tables and attributes, etc.

When I generate an HDF5 file with pandas using table=True, another group is created at the chosen path in the file, with the name that I specified, for example "df", and that group contains a table called "table".

I can then extract the data from that group using store.select("path/df"), which actually retrieves the "table" object.
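For example, walking the file shows this layout (a quick sketch; df and the file name are from earlier in the thread, and the camelCase calls are the PyTables 2.x API):

import pandas as pd

store = pd.HDFStore('foo.h5')
store.put('df', df, table=True)
for node in store.handle.walkNodes('/'):
    print(node._v_pathname)    # prints '/df' and then '/df/table'
store.close()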

I think the general model makes more sense but, of course, I don't know why you chose this storage model. Either way, I'm happy to help with coding.

Moritz

pandas.pdf

Jeff

Jan 15, 2013, 11:02:12 AM1/15/13
to pyd...@googlegroups.com
this would work if the node were named the way it is in my example file....not sure why yours is different:

/detector/table

you could try renaming the node; store.get_node('/detector/readout')._f_rename('table') might work (that's PyTables' node rename method)
 
HDFStore has to have a way to figure out what kind of object this is (normally pandas_type/table_type do this...but in this case they are missing)....so it has to look for something....
will have to think about this....a bit non-trivial
 
(you can of course create your own hierarchy...it's just the terminal node that has a fixed name)....
 
e.g. '/detector/readout/table' would work too
 
maybe I can eliminate this check...


Jeff

Jan 15, 2013, 11:36:25 AM1/15/13
to pyd...@googlegroups.com
ok...update to my repository again....

then select('detector/readout') will work (as will select('detector'), if the node underneath is named 'table', which is how HDFStore stores everything)

Moritz Emanuel Beber

Jan 15, 2013, 12:21:44 PM1/15/13
to pyd...@googlegroups.com
Sweet, that works. If you want help with code that populates the information contained in HDFStore after opening a generic file, let me know; at the moment it still does not update the groups.

Thank you for your quick code fixes and responses. I appreciate that very much.

Jeff Reback

Jan 15, 2013, 12:29:51 PM1/15/13
to pyd...@googlegroups.com
what does not update the groups?

Jeff

Jan 15, 2013, 12:39:59 PM1/15/13
to pyd...@googlegroups.com
so...here are 2 'issues'....

1) the generic_table reader can read in the tables that are produced in, say, R, but I can't determine what the index is (if there is one)....
2) is the exported table format good enough to be read by R? (alternatively I could produce a generic_table writer, e.g. w/o much meta-data).....but then how would the API look...e.g. store.put('df',df,table=True,export='generic')? (or even store.to_generic('df',df))...
 
these could both be solved by having another format that embeds some attributes on the R side, enabling more info to be transmitted (e.g. the index columns), and a cleaner read on the R side
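e.g. something as simple as this tagged onto the R-written file (the attribute and file names are just illustrative):

import tables

h5 = tables.openFile('from_r.h5', 'a')   # PyTables 2.x API
node = h5.getNode('/detector/readout')
# the R writer (or a post-processing step) records the index columns,
# so the pandas reader doesn't have to guess
node._v_attrs.index_cols = ['index']
h5.close()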
 
this may or may not be worth it depending on what/how you are using these features...

Moritz Emanuel Beber

Jan 15, 2013, 1:07:38 PM1/15/13
to pyd...@googlegroups.com
I'm quite satisfied with the capabilities right now. So it's only worth putting more work into this if many other people would like to use generically written HDF5 files with pandas. I'm not sure about the index issue, but here is roughly what I meant:

When you create an HDF5 file with pandas, HDFStore stores the groups in the file and they can be accessed using the store.keys() method. With a generic file, printing the store object shows it as empty. I assume that is because pandas expects a leaf node called "table" or previously stored special attributes. I can, however, access the required information with the store.handle.walkGroups() or walkNodes() methods and access the tables directly with your adapted select method.

It also seems that files written with pandas contain information about the tables within the parent group node rather than the table node itself. For example, from the docs:
In [969]: store.root.df._v_attrs.pandas_type
Out[969]: 'frame_table'
where df is a group node. Now, with the generic reader:
r_store.root.detector.readout._v_attrs.pandas_type
'frame_table'
where readout is a table node.

Another issue: I would expect to append to the table...
r_store.append("detector/readout", df)
but that throws an attribute error because the table object lacks the necessary attribute:
AttributeError: 'Table' object has no attribute '_v_filters'
If I instead append to the group:
r_store.append("detector", df)
it creates a new table "table" as per usual.

Sorry for the rambling. I guess my point is pandas writes and needs the following attributes to restore data faithfully:
    data_columns := [],
    index_cols := [(0, 'index')],
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['ADCcount', 'TDCcount', 'energy', 'grid_i', 'grid_j', 'idnumber', 'name', 'pressure'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2']
which you attach to the table node when reading a generic file. Why does that information have to be on the group node that is parent to the table when pandas writes the data itself?

Jeff

Jan 15, 2013, 1:21:19 PM1/15/13
to pyd...@googlegroups.com
hmm....store.keys() should work (even with the detector/readout example), and it should show the generic_table correctly...can you send me your example file....?

the reason things are in the parent node is kind of historical....that's how it was done.....and I didn't want to break backwards compatibility even more
the idiom was only to store a single object (or perhaps a complex object) in a single group....I don't think this really stops anything....you can just create sub-nodes and such
 
appending is very tricky....you can only append to tables that are in the new format
 
so in your example if you wanted to append, you would do something like:
 
# in effect, convert to the table format
df = store.select('detector/readout')
store.append('df', df)
str(store)  # then inspect the store
 
in general the tables in HDFStore have much more functionality than generic PyTables tables..the price is a bit more attribute storage and a 'defined' format
 
the stored table is a different type than the generic table; all of the extra attributes are necessary for easy recreation of the objects (which could be frames, panels, panel4d)...etc., possibly with different orientations...
I happen to use the 4d Panels a lot, so that's what I originally designed this for.....a frame is conceptually much simpler, so it was pretty easy to support....of course once you deal with mixed types and strings...etc...it gets a bit hairy
 
ok...so the feature set is ok now...I added a bit to the docs....feel free to add more if you'd like
 
thanks
 
Jeff

Moritz Emanuel Beber

Jan 15, 2013, 1:28:07 PM1/15/13
to pyd...@googlegroups.com
Here is the dataset.

tutorial1.h5

Jeff

Jan 15, 2013, 2:24:23 PM1/15/13
to pyd...@googlegroups.com
ok...if you would update to my repository again....
 
the store will correctly report 'detector/readout' (as well as 'detector', if the underlying node has a 'table' in it)....