Problem with concat, category and HDF5


Benito Carmona

Jun 8, 2016, 2:58:52 AM
to PyData
Hi,

I decided to use the HDF format (with the table option) to concatenate several large dataframes (only one of them fits in memory at a time). So the idea was to iterate through them and append them to an HDFStore. However, I have run into a problem: some of the columns of those dataframes are of category type, and the following error shows up:

cannot append a categorical with different categories to the existing
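The mismatch arises because each chunk's categorical infers only the categories present in that chunk, so two chunks of the same column can end up with different category sets. A minimal in-memory illustration (hypothetical data, no HDF needed to see the mismatch):

```python
import pandas as pd

# Each chunk infers its own categories from the values it happens to contain
chunk1 = pd.Series(['a', 'b', 'a']).astype('category')
chunk2 = pd.Series(['b', 'c']).astype('category')

print(list(chunk1.cat.categories))  # ['a', 'b']
print(list(chunk2.cat.categories))  # ['b', 'c']
# HDFStore.append refuses to mix these two category sets in one table
```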

Is there any way to workaround it?

Thanks


Pietro Battiston

Jun 8, 2016, 4:07:25 AM
to pyd...@googlegroups.com
Yes, however the feasibility depends on what exactly you need to do
(e.g. the source of the categorical data).

When you create a Categorical, for instance from an existing series,
you can add to it (currently) unused categories:

In [2]: s = pd.Series(['a', 'b', 'a'])

In [3]: t = s.astype('category', categories=['a', 'b', 'c'])

In [4]: t.values.categories
Out[4]: Index(['a', 'b', 'c'], dtype='object')

So the workaround would be to build a list of all of the possible
categories in advance, and then pass it as the "categories" argument
when you create each chunk of the categorical column.
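For reference, in more recent pandas versions (0.21+) the `categories=` keyword to `astype` was replaced by `pd.CategoricalDtype`; the same workaround under that API would look like this (the category list ['a', 'b', 'c'] is a made-up example):

```python
import pandas as pd

# Fix the full category set once, and apply it to every chunk
full_dtype = pd.CategoricalDtype(categories=['a', 'b', 'c'])

chunk1 = pd.Series(['a', 'b', 'a']).astype(full_dtype)
chunk2 = pd.Series(['b', 'c']).astype(full_dtype)

# Both chunks now share identical categories, so appending is consistent
print(list(chunk1.cat.categories))  # ['a', 'b', 'c']
print(list(chunk2.cat.categories))  # ['a', 'b', 'c']
```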

Pietro

Benito Carmona

Jun 8, 2016, 8:32:46 AM
to PyData
Hi,

Thanks for the suggestion, but that is the hard way in my use case, because I don't know the values of the categories up front.
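If the chunks can be scanned once before writing (a two-pass approach), one way to avoid knowing the categories up front is `pandas.api.types.union_categoricals`, which merges the category sets actually observed. A sketch with made-up chunks:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Chunks arrive with whatever categories they happen to contain
chunks = [pd.Series(['a', 'b']).astype('category'),
          pd.Series(['b', 'c']).astype('category')]

# First pass: union the categories actually observed across all chunks
all_cats = union_categoricals([c.values for c in chunks]).categories

# Second pass: recode every chunk to the shared category set
aligned = [c.cat.set_categories(all_cats) for c in chunks]
print([list(c.cat.categories) for c in aligned])
```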

Regards

Benito Carmona

Jun 11, 2016, 3:46:37 AM
to PyData

nir izraeli

Jun 15, 2016, 6:22:47 PM
to PyData

Hi,

This doesn't exactly answer your question, but your attempt to work without having all of the data in memory, with different portions of pandas dataframes having slightly different columns, resembles some of the problems I had until recently.

Saving tables in HDF format doesn't allow differing structure (column names, counts, types, etc.) within the same node. Once the structure is created, it can't easily be modified.
You may have already figured out that you can work around that by saving the dataframes to different nodes or HDF files.
If you just save the dataframes to separate nodes or files and operate on each dataframe individually, you're pretty much OK.

However, if you use dask (a library built on pandas that enables out-of-core computation by working on multiple dataframe partitions), most of the heavy lifting of working with multiple dataframes separately is done for you, and since version 0.10.0 (the latest) it seamlessly allows saving every internal partition to a different HDF node (full disclosure: that's an ability I contributed).

Dask does some things well and other things not so much, so YMMV.

Hope that helps,
Nir



Benito Carmona

Jun 18, 2016, 2:07:11 AM
to PyData
Thanks very much for your input.

Yes, I had already started working with several nodes on HDF5 (one per dataframe). The thing is that in my use case I normally don't need all the columns in memory, and I have seen that HDF5 does not handle this scenario very well; I mean, you do not save much memory by loading only a subset of the columns.

Currently, I have decided to try manually sharding a set of daily files by a certain column (a kind of user id), so the concatenation can be done in memory using the Feather format, which is columnar and, combined with categories, incredibly fast. For processing, I am thinking about doing the iterations myself (i.e. processing each shard and aggregating the results), and maybe I will test dask for block algorithms.
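The shard-then-aggregate plan described above can be sketched in plain pandas (the data and the `user_id`/`value` column names are made up; in practice each shard would live in its own daily file):

```python
import pandas as pd

# Hypothetical data; in practice each shard would be read from its own file
df = pd.DataFrame({'user_id': [1, 2, 3, 4, 1, 2],
                   'value':   [10, 20, 30, 40, 50, 60]})

n_shards = 2
partials = []
for shard in range(n_shards):
    # Shard by user id so every row for a given user lands in the same shard
    part = df[df['user_id'] % n_shards == shard]
    # Process each shard independently (here: a per-user sum)
    partials.append(part.groupby('user_id')['value'].sum())

# Aggregate per-shard results; user ids never cross shards, so concat suffices
result = pd.concat(partials).sort_index()
print(result)
```

Because the shard key is the same column the per-shard computation groups by, each partial result is already final for its users and no cross-shard merge is needed.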