I'm trying to extract a single column from a large HDF5 table, but memory usage appears to be the same as when I select the entire table. How can I limit memory usage? Details below. (Windows 7, Python 3.4, pandas 0.15.2)
All of the tables in the HDF5 file were created using pandas. The largest table (currently ~100,000 rows by 500 columns) was created with two columns set as data_columns: "col_1" and "col_2". I want to be able to filter sections of this large table on the data columns (here, rows where "col_2" is non-zero) while pulling back only "col_1". I'm doing the following:
import pandas as pd
h5 = pd.HDFStore('all_data.h5')
t_df = h5.select('all_data_table', where='col_2 != 0', columns=['col_1'])
However, peak memory usage for that is very close to selecting the entire table:
import pandas as pd
h5 = pd.HDFStore('all_data.h5')
t_df = h5.select('all_data_table')
I thought the first selection was evaluated on-disk and would return only a single, filtered column, which should be a small amount of data. I'm worried that once this table outgrows my available memory, the selection will cause problems. Am I missing something in the selection process?
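For what it's worth, one workaround I'm considering is iterating over the selection with select's chunksize parameter, so only one chunk of rows is materialized at a time; I'd still like to understand why the column/where restriction alone doesn't help, though. A minimal sketch (using a small synthetic file standing in for my real all_data.h5):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small demo file as a stand-in for the real all_data.h5.
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')

df = pd.DataFrame(np.random.randn(1000, 4),
                  columns=['col_1', 'col_2', 'col_3', 'col_4'])
df['col_2'] = (df['col_2'] > 0).astype(int)  # make the filter meaningful

with pd.HDFStore(path) as store:
    # format='table' plus data_columns enables on-disk 'where' queries.
    store.put('all_data_table', df, format='table',
              data_columns=['col_1', 'col_2'])

# Read the filtered column in fixed-size chunks, so only one chunk of
# rows needs to be in memory at a time.
pieces = []
with pd.HDFStore(path) as store:
    for chunk in store.select('all_data_table',
                              where='col_2 != 0',
                              columns=['col_1'],
                              chunksize=250):
        pieces.append(chunk)

t_df = pd.concat(pieces)
```

The trade-off is an extra concat at the end, but peak usage is bounded by the chunk size rather than the whole selection.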
Thanks