HDF memory usage question

Ryan Nelson

Feb 2, 2015, 11:28:30 AM
to pyd...@googlegroups.com
I'm trying to extract a single column from a large HDF table, but the memory usage seems to be the same as if I had selected the entire table. How can I limit the memory usage? Details below. (Windows 7, Python 3.4, Pandas 0.15.2)

All data tables in the HDF file were created using Pandas. The largest table (right now ~100000 rows by 500 columns) was created with two columns set as data_columns: "col_1" and "col_2". I want to be able to filter sections of this large table using certain values from "col_1". I'm doing the following:

import pandas as pd
h5 = pd.HDFStore('all_data.h5')
t_df = h5.select('all_data_table', where='col_2 != 0', columns=['col_1'])

However, the memory usage here is very close to this:

import pandas as pd
h5 = pd.HDFStore('all_data.h5')
t_df = h5.select('all_data_table')

I thought that the first selection worked on-disk and would only return a single column, which should be a small amount of data. I'm worried that once the table gets too large for my available memory, this will cause problems. Am I missing something in the selection process?
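
For reference, the table itself was written along these lines (a sketch; the shape and the random data here are just placeholders for the real thing):

import numpy as np
import pandas as pd

# format='table' plus data_columns makes 'col_1' and 'col_2' queryable on disk
df = pd.DataFrame(np.random.randn(1000, 500),
                  columns=['col_%d' % i for i in range(500)])
df.to_hdf('all_data.h5', 'all_data_table', format='table',
          data_columns=['col_1', 'col_2'])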

Thanks


Ryan Nelson

Feb 2, 2015, 2:08:50 PM
to pyd...@googlegroups.com
Sorry. I missed this one: http://stackoverflow.com/questions/25902114/pandas-retrieving-hdf5-columns-and-memory-usage

It also seems that HDFStore.select_column is much more efficient, but then I can't do the nice 'where=' filters on other columns...
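
One workaround I might try (a minimal sketch, assuming both 'col_1' and 'col_2' are data_columns as above): get the row coordinates for the where clause with select_as_coordinates, then read just 'col_1' with select_column and subset it by those coordinates.

import pandas as pd

h5 = pd.HDFStore('all_data.h5')

# Row numbers of the rows matching the filter (evaluated on disk,
# since 'col_2' is a data_column)
coords = h5.select_as_coordinates('all_data_table', 'col_2 != 0')

# select_column reads a single column and returns a Series indexed by
# row number, so the coordinates above can pick out the matching rows
col_1 = h5.select_column('all_data_table', 'col_1')[coords]

h5.close()

select_column still pulls the whole single column into memory, but that's ~100000 values instead of the full 500-column block.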

Ultimately, I'd like to do some groupby stuff, so I'll try this:

Maybe I'll try bcolz as well. It would be nice if this got included in Pandas:
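
In the meantime, a rough sketch of the chunked approach to the groupby (the sum/count/mean of 'col_2' grouped by 'col_1' is just an example aggregation; this only combines cleanly across chunks for things like sums and counts):

import pandas as pd

h5 = pd.HDFStore('all_data.h5')

# Read only the needed columns in chunks, so only one chunk of rows is in
# memory at a time, and collect partial sums/counts per group
pieces = []
for chunk in h5.select('all_data_table', where='col_2 != 0',
                       columns=['col_1', 'col_2'], chunksize=50000):
    pieces.append(chunk.groupby('col_1')['col_2'].agg(['sum', 'count']))

# Combine the per-chunk results into the final answer
combined = pd.concat(pieces).groupby(level=0).sum()
combined['mean'] = combined['sum'] / combined['count']

h5.close()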

Sorry for the noise.