Memory Error with large dataframe


Josh Friedlander

Aug 7, 2019, 11:09:40 AM
to pytables-users
My code reads several large CSVs, which takes a long time, so I had the idea of reading them once into an HDF store and then pulling them from there afterwards (using it like a dict). However, trying to save the largest DataFrame (shape (129435411, 11)) to HDF5 results in a memory error. The problem comes from one column that holds a mixture of numbers and strings, which I coerced to string to avoid a mixed dtype; it seems PyTables then tries to pickle the column, and that triggers the memory error. Reproduce as follows:

```
import pandas as pd
import numpy as np
from pandas import HDFStore

x = pd.DataFrame({'session_id': np.random.randint(0, 1000000, 129435411)})
# column contains NaNs, characters, mixtures of numbers and characters
x.loc[0, 'session_id'] = np.NaN
x.loc[1, 'session_id'] = ''
x.loc[2, 'session_id'] = '1212dvfsd .?'
x.loc[3, 'session_id'] = 'as I was sastying?'
x.loc[4, 'session_id'] = '...?'
# add more string columns
for i in range(10):
    x[str(i)] = np.array(['foo'] * len(x))
# enforce session_id type as str, to avoid mixed dtype
x.session_id = x.session_id.fillna('nan').astype(str)

hdf = HDFStore('storage.h5')
# note: data_columns only applies to format='table'; it has no effect with 'fixed'
hdf.put('x', x, format='fixed', data_columns=True)
```
Stack trace:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
    self.write_array('block%d_values' % i, blk.values, items=blk_items)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2730, in write_array
    vlarr.append(value)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/vlarray.py", line 529, in append
    sequence = atom.toarray(sequence)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/atom.py", line 1085, in toarray
    buffer_ = self._tobuffer(object_)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/atom.py", line 1219, in _tobuffer
    return six.moves.cPickle.dumps(object_, six.moves.cPickle.HIGHEST_PROTOCOL)
```

Any help would be much appreciated!
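
For reference, the traceback ends in PyTables' ObjectAtom, which pickles the whole object column in one shot; the 'table' format stores strings as fixed-width columns instead and can be written in chunks. A rough sketch of that alternative (the chunk size and the min_itemsize width are assumptions, not values taken from the real data):

```
# sketch: chunked write in table format, which stores strings as
# fixed-width columns instead of pickling the whole object column
hdf = HDFStore('storage_table.h5')
chunksize = 5_000_000  # assumed chunk size; tune to available memory
for start in range(0, len(x), chunksize):
    hdf.append('x', x.iloc[start:start + chunksize],
               format='table', data_columns=['session_id'],
               min_itemsize={'session_id': 50})  # assumed max string length
hdf.close()
```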

Sal Abbasi

Aug 7, 2019, 9:02:54 PM
to pytables-users
Josh,

Why not use h5py directly to store each column of your dataset separately? I do this when I have large datasets. See the code below, where I added a few lines at the end of your code to write and read the data.


```
import pandas as pd
import numpy as np
from pandas import HDFStore
import h5py

x = pd.DataFrame({'session_id': np.random.randint(0, 1000000, 129435411)})
# column contains NaNs, characters, mixtures of numbers and characters
x.loc[0, 'session_id'] = np.NaN
x.loc[1, 'session_id'] = ''
x.loc[2, 'session_id'] = '1212dvfsd .?'
x.loc[3, 'session_id'] = 'as I was sastying?'
x.loc[4, 'session_id'] = '...?'
# enforce session_id type as str, to avoid mixed dtype
x.session_id = x.session_id.fillna('nan').astype(str)

# Write, add other columns of x as needed here.
with h5py.File('storage.h5', 'w') as f:
    dset = f.create_dataset(name = 'session_id', data = x.session_id.values, dtype = h5py.special_dtype(vlen=str))
    
# Read back
with h5py.File('storage.h5', 'r') as f:
    session_id = f['session_id'][()]
    
x = pd.DataFrame({'session_id': session_id})
```
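
If the other columns need to be stored as well, the same pattern can be looped over every column. A rough sketch (it assumes all of the remaining columns are strings, as in the repro above, so everything uses the vlen str dtype; adjust per column otherwise):

```
# write every column of x as its own variable-length string dataset
with h5py.File('storage.h5', 'w') as f:
    str_dt = h5py.special_dtype(vlen=str)
    for col in x.columns:
        f.create_dataset(name=col, data=x[col].astype(str).values, dtype=str_dt)

# read the columns back and reassemble the DataFrame
with h5py.File('storage.h5', 'r') as f:
    x_back = pd.DataFrame({col: f[col][()] for col in f.keys()})
```
Depending on the h5py version, the strings may come back as bytes and need decoding.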