Memory Error with large dataframe


Josh Friedlander

Aug 7, 2019, 11:09:40 AM
to pytables-users
My code reads several large CSVs, which takes a long time, so I had the idea of reading them once into an HDF store and then pulling them from there afterwards (using it like a dict). However, trying to save the largest DataFrame (shape (129435411, 11)) to HDF5 results in a memory error. The problem comes from one column that holds a mixture of numbers and strings, which I coerced to string to avoid a mixed dtype; it seems PyTables then tries to pickle the column, and that triggers the memory error. Reproduce as follows:

```
import pandas as pd
import numpy as np
from pandas import HDFStore

x = pd.DataFrame({'session_id': np.random.randint(0, 1000000, 129435411)})
# column contains NaNs, characters, mixtures of numbers and characters
x.loc[0, 'session_id'] = np.NaN
x.loc[1, 'session_id'] = ''
x.loc[2, 'session_id'] = '1212dvfsd .?'
x.loc[3, 'session_id'] = 'as I was sastying?'
x.loc[4, 'session_id'] = '...?'
# add more string columns
for i in range(10):
    x[str(i)] = np.array(['foo'] * len(x))
# enforce session_id type as str, to avoid mixed dtype
x.session_id = x.session_id.fillna('nan').astype(str)

hdf = HDFStore('storage.h5')
# note: data_columns only applies to format='table'; it has no effect with 'fixed'
hdf.put('x', x, format='fixed', data_columns=True)
```
Stack trace:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
    self.write_array('block%d_values' % i, blk.values, items=blk_items)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2730, in write_array
    vlarr.append(value)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/vlarray.py", line 529, in append
    sequence = atom.toarray(sequence)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/atom.py", line 1085, in toarray
    buffer_ = self._tobuffer(object_)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/atom.py", line 1219, in _tobuffer
    return six.moves.cPickle.dumps(object_, six.moves.cPickle.HIGHEST_PROTOCOL)
```

Any help would be much appreciated!
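
For reference, the traceback ends in PyTables' ObjectAtom, which pickles the whole object column in one shot; the 'table' format stores strings as fixed-width columns instead and can be written in chunks. A rough sketch of that alternative (the chunk size and the min_itemsize width are assumptions, not values taken from the real data):

```
# sketch: chunked write in table format, which stores strings as
# fixed-width columns instead of pickling the whole object column
hdf = HDFStore('storage_table.h5')
chunksize = 5_000_000  # assumed chunk size; tune to available memory
for start in range(0, len(x), chunksize):
    hdf.append('x', x.iloc[start:start + chunksize],
               format='table', data_columns=['session_id'],
               min_itemsize={'session_id': 50})  # assumed max string length
hdf.close()
```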

Sal Abbasi

Aug 7, 2019, 9:02:54 PM
to pytables-users
Josh,

Why not use h5py directly to store each column of your dataset separately? I do this when I have large datasets. See the code below, where I added a few lines at the end of your code to write and read the data.


```
import pandas as pd
import numpy as np
from pandas import HDFStore
import h5py

x = pd.DataFrame({'session_id': np.random.randint(0, 1000000, 129435411)})
# column contains NaNs, characters, mixtures of numbers and characters
x.loc[0, 'session_id'] = np.NaN
x.loc[1, 'session_id'] = ''
x.loc[2, 'session_id'] = '1212dvfsd .?'
x.loc[3, 'session_id'] = 'as I was sastying?'
x.loc[4, 'session_id'] = '...?'
# enforce session_id type as str, to avoid mixed dtype
x.session_id = x.session_id.fillna('nan').astype(str)

# Write, add other columns of x as needed here.
with h5py.File('storage.h5', 'w') as f:
    dset = f.create_dataset(name = 'session_id', data = x.session_id.values, dtype = h5py.special_dtype(vlen=str))
    
# Read back
with h5py.File('storage.h5', 'r') as f:
    session_id = f['session_id'][()]
    
x = pd.DataFrame({'session_id': session_id})
```
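
If the other columns need to be stored as well, the same pattern can be looped over every column. A rough sketch (it assumes all of the remaining columns are strings, as in the repro above, so everything uses the vlen str dtype; adjust per column otherwise):

```
# write every column of x as its own variable-length string dataset
with h5py.File('storage.h5', 'w') as f:
    str_dt = h5py.special_dtype(vlen=str)
    for col in x.columns:
        f.create_dataset(name=col, data=x[col].astype(str).values, dtype=str_dt)

# read the columns back and reassemble the DataFrame
with h5py.File('storage.h5', 'r') as f:
    x_back = pd.DataFrame({col: f[col][()] for col in f.keys()})
```
Depending on the h5py version, the strings may come back as bytes and need decoding.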