Improving I/O Time for reading large CSVs


Josh Friedlander

Aug 7, 2019, 3:32:27 PM
to PyData
I have a script that opens about 10 CSV files, adds them to a ClientData object, and then passes it on for processing. The files are large, between 2 and 20GB each, and pandas read_csv takes about 5-15 minutes on each one. Since the data doesn't change nearly as often as the script runs, I thought of saving time by reading each into an HDF5 object, saving it, and then loading from that in future runs.
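(Not part of the original post: a minimal sketch of the caching pattern being described, with hypothetical names - load_frame, csv_paths, the cache file suffix - just to make the idea concrete.)

import os
import pandas as pd

def load_frame(csv_path, cache_path):
    """Parse the CSV once, cache it as HDF5, and reuse the cache on later runs."""
    if os.path.exists(cache_path):
        return pd.read_hdf(cache_path, key='df')
    df = pd.read_csv(csv_path)                         # the slow 5-15 minute step
    df.to_hdf(cache_path, key='df', format='table')    # the step that fails below
    return df

# frames = [load_frame(p, p + '.h5') for p in csv_paths]   # csv_paths: the ~10 input files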

The problem I have encountered is that I have a lot of string data, and when saving the biggest dataframe with pandas.HDFStore.put('df1', df1, format='table'), I run out of memory, even on a fairly powerful cloud machine. It seems like the data on this 20GB file cannot be condensed easily. (I have written more details, with a reproducible example, in this post in the pytables-users group.)

So: is there a simpler solution that I'm missing for data with a lot of string fields/NaNs? Will HDFStore provide a significant time saving? Any insights greatly appreciated.

Max Linke

Aug 7, 2019, 4:19:38 PM
to pyd...@googlegroups.com
Hi

Have you tried using parquet files?


Pandas supports a lot more formats than just HDF5. Maybe one of the others works for you.
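(A minimal sketch of the parquet round-trip Max is suggesting, not from his message; the file name is hypothetical and pyarrow or fastparquet must be installed.)

import pandas as pd

# pay the slow read_csv once, then convert
df = pd.read_csv('big_file.csv')
df.to_parquet('big_file.parquet')      # string/NaN columns are handled natively

# later runs load the parquet file instead of re-parsing the CSV
df = pd.read_parquet('big_file.parquet')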

Best Max



Josh Friedlander

Aug 7, 2019, 4:25:11 PM
to pyd...@googlegroups.com
Thanks for the response Max. I have considered parquet, as well as feather. The reason I was leaning towards HDF5 was that it would let me load all 10 files into memory with a single I/O action. However, I can't confirm that it is actually quicker, since I haven't managed to load the files into the HDFStore.

Sal Abbasi

Aug 7, 2019, 9:00:22 PM
to pyd...@googlegroups.com
Josh,

Why not use h5py directly to store each column in your dataset separately?  I do this when I have large datasets.  See code below where I added the last few lines to your code to write and read the data.


import pandas as pd
import numpy as np
from pandas import HDFStore
import h5py

x = pd.DataFrame({'session_id': np.random.randint(0, 1000000, 129435411)})
# column contains NaNs, characters, mixtures of numbers and characters
x.loc[0, 'session_id'] = np.NaN
x.loc[1, 'session_id'] = ''
x.loc[2, 'session_id'] = '1212dvfsd .?'
x.loc[3, 'session_id'] = 'as I was sastying?'
x.loc[4, 'session_id'] = '...?'
# enforce session_id type as str, to avoid mixed dtype
x.session_id = x.session_id.fillna('nan').astype(str)

# Write, add other columns of x as needed here.
with h5py.File('storage.h5', 'w') as f:
    dset = f.create_dataset(name = 'session_id', data = x.session_id.values, dtype = h5py.special_dtype(vlen=str))
    
# Read back
with h5py.File('storage.h5', 'r') as f:
    session_id = f['session_id'][()]
    
x = pd.DataFrame({'session_id' : session_id})


Pietro Battiston

Aug 8, 2019, 3:00:20 AM
to pyd...@googlegroups.com
On Wed, 07/08/2019 at 12:32 -0700, Josh Friedlander wrote:
> I have a script that opens about 10 CSV files, adds them to a
> ClientData object, and then passes it on for processing, The files
> are large, between 2 and 20GB each, and pandas read_csv takes about
> 5-15 minutes on each one. Since the data doesn't change nearly as
> often as the script runs, I thought of saving time by reading each
> into an HDF5 object, saving it, and then loading from that in future
> runs.

I think this is a good approach; I always work like this with one of my own
data sets. Depending on the server specs, using HDF compression might
speed up reading further.
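(A sketch of what enabling compression could look like, not from Pietro's message; the complevel/complib values and file names are only illustrative.)

import pandas as pd

df1 = pd.read_csv('file1.csv')   # hypothetical stand-in for one of the big CSVs

# blosc is usually a reasonable speed/size trade-off
with pd.HDFStore('cache.h5', complevel=9, complib='blosc') as store:
    store.put('df1', df1, format='table')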


> The problem I have encountered is that I have a lot of string data,
> and when saving the biggest dataframe with pandas.HDFStore.put('df1',
> df1, format='table'), I run out of memory, even on a fairly powerful
> cloud machine. It seems like the data on this 20GB file cannot be
> condensed easily. (I have written more details, with a reproducible
> example, in this post in the pytables-users group.)

You can work with chunks. In your case it seems like you have enough
memory to read the CSV, so you just need to write the data bit by bit
(with append=True).

The only problems can come:
- if later chunks have longer strings than previous ones: you solve
this with the "min_itemsize" argument
- if you have categoricals: you have to provide all the categories
from the beginning.
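(A sketch of the chunked append Pietro describes, assuming a string column called session_id as in the thread; the chunk size and min_itemsize value are only illustrative.)

import pandas as pd

df = pd.read_csv('big_file.csv')                 # per the thread, the CSV itself fits in memory
df['session_id'] = df['session_id'].astype(str)  # avoid mixed object columns

step = 1_000_000                                 # illustrative chunk size
with pd.HDFStore('cache.h5') as store:
    for start in range(0, len(df), step):
        store.append('df1', df.iloc[start:start + step],
                     # reserve room so later chunks with longer strings still fit
                     min_itemsize={'session_id': 64},
                     data_columns=True)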

Pietro

Josh Friedlander

Aug 12, 2019, 12:08:29 PM
to PyData
Thanks Sal and Pietro for your answers. However, neither of them works for the example I provided.

I have spent several days now researching this issue. There are scattered posts on GitHub, message boards, and Stack Overflow; however, many of them are very old.

I know that my data contains varying-length string columns and Unicode, and is quite large (~125M rows x 11 columns, about 21GB as CSV). However, from everything I have read, modern PyTables/pandas should be able to handle it.

If anyone comes across something similar, I would be most grateful to hear from them.

Pietro Battiston

Aug 12, 2019, 12:16:00 PM
to pyd...@googlegroups.com
On Mon, 12/08/2019 at 09:08 -0700, Josh Friedlander wrote:
> Thanks Sal and Pietro for your answers. However, neither of them work
> for the example I provided.

It might be useful - for people reading the thread, but ultimately
maybe also for you - if you describe the problems you found.

Pietro

Max Linke

Aug 12, 2019, 1:00:15 PM
to pyd...@googlegroups.com
Why do you need to load all the data into memory at once? Is a lazy loading approach not possible for you? Dask has a partial wrapper for pandas dataframes to automatically distribute the workload among many cores and parallelize reading the file. There are other Python libraries out there that support lazy loading to work on 2-20GB files with machines having only 8GB RAM.
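(A minimal sketch of the lazy/Dask route, not from Max's message; the file pattern and the column used are hypothetical.)

import dask.dataframe as dd

# lazy: builds a task graph over the CSV chunks, nothing is read yet
ddf = dd.read_csv('data/clients_*.csv', dtype={'session_id': 'object'})

# work is split across chunks/cores and only materialised by .compute()
session_counts = ddf['session_id'].value_counts().compute()

# or do the one-time conversion to a faster on-disk format
ddf.to_parquet('data/clients_parquet/')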

I also remembered the following post. tl;dr: use a one-time preprocessing step with a well-chosen intermediate on-disk representation to achieve significant speed-ups.

Best Max


Josh Friedlander

Aug 12, 2019, 3:17:29 PM
to PyData
Thanks Pietro - in fact, as you suggested, trying to summarise was helpful for me:

Starting with Sal's answer, I can successfully recreate one column as he suggested. But when I try to do it for all columns:

with h5py.File('storage.h5', 'w') as f:
    for col in list(x):
        dset = f.create_dataset(name=col, data=x[col].values, dtype=h5py.special_dtype(vlen=str))

with h5py.File('storage.h5', 'r') as f:
    x_ = pd.DataFrame({'session_id': f['session_id'][()]})
    for col in list(x):
        x_[col] = f[col][()]

It takes a very long time on my machine (MBP, 16GB memory), and eventually I run out of memory.

****
I run: store = pd.HDFStore('foo.h5')

I would ideally like to save the tables in fixed format, but this seems not to work with weirdly shaped strings/NaNs/Unicode, or whatever I have. (I also tried trimming my strings to the median length of the series, and removing Unicode, but still ended up getting similar errors.) Plus this way you can't use append=True. So I use format='table'.

When I call store.put('x', x, format='t', data_columns=True), this is the warning I get (with my original data):


/shared_directory/projects/env/lib/python3.6/site-packages/tables/leaf.py:414: PerformanceWarning: The Leaf ``/mobile/_i_table/application_data/sorted`` is exceeding the maximum recommended rowsize (104857600 bytes); be ready to see PyTables asking for *lots* of memory and possibly slow I/O.  You may want to reduce the rowsize by trimming the value of dimensions that are orthogonal (and preferably close) to the *main* dimension of this leave.  Alternatively, in case you have specified a very small/large chunksize, you may want to increase/decrease it. PerformanceWarning)

/shared_directory/projects/env/lib/python3.6/site-packages/tables/leaf.py:414: PerformanceWarning: The Leaf ``/sorted`` is exceeding the maximum recommended rowsize (104857600 bytes); be ready to see PyTables asking for *lots* of memory and possibly slow I/O.  You may want to reduce the rowsize by trimming the value of dimensions that are orthogonal (and preferably close) to the *main* dimension of this leave.  Alternatively, in case you have specified a very small/large chunksize, you may want to increase/decrease it. PerformanceWarning)

/shared_directory/projects/env/lib/python3.6/site-packages/tables/leaf.py:414: PerformanceWarning: The Leaf ``/sorted2`` is exceeding the maximum recommended rowsize (104857600 bytes); be ready to see PyTables asking for *lots* of memory and possibly slow I/O.  You may want to reduce the rowsize by trimming the value of dimensions that are orthogonal (and preferably close) to the *main* dimension of this leave.  Alternatively, in case you have specified a very small/large chunksize, you may want to increase/decrease it. PerformanceWarning)


(mobile is the table name and application_data is a column name; the terms sorted/sorted2 are not names from my data)


And on the example, I get a long list of warnings that are mostly variations on this:


/shared_directory/projects/env/lib/python3.6/site-packages/tables/path.py:157: NaturalNameWarning: object name is not a valid Python identifier: '9'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though

  check_attribute_name(name)

/shared_directory/projects/env/lib/python3.6/site-packages/tables/attributeset.py:475: NaturalNameWarning: object name is not a valid Python identifier: '0_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though

  check_attribute_name(name)

/shared_directory/projects/env/lib/python3.6/site-packages/tables/attributeset.py:475: NaturalNameWarning: object name is not a valid Python identifier: '0_meta'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though

  check_attribute_name(name)

/shared_directory/projects/env/lib/python3.6/site-packages/tables/attributeset.py:475: NaturalNameWarning: object name is not a valid Python identifier: '0_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though


But it does eventually seem to finish. And load - very, very slowly.


If I call store.put('engagement', df, format='t', data_columns=True), I get this:


>>> store.put('x', x, format='t', append=True)
Traceback (most recent call last):
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 3612, in create_axes
    info=self.info)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1952, in set_atom
    errors)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1988, in set_atom_string
    data_converted = _convert_string_array(data, encoding, errors)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 4606, in _convert_string_array
    encoding, errors).values.reshape(data.shape)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 2507, in encode
    result = str_encode(self._data, encoding, errors)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 1832, in str_encode
    return _na_map(f, arr)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 150, in _na_map
    return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 165, in _map
    result = lib.map_infer_mask(arr, f, mask.view(np.uint8), convert)
  File "pandas/_libs/src/inference.pyx", line 1443, in pandas._libs.lib.map_infer_mask
  File "pandas/_libs/src/inference.pyx", line 1233, in pandas._libs.lib.maybe_convert_objects
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 3946, in write
    **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 3622, in create_axes
    % (b.dtype.name, b_items, str(detail))
Exception: cannot find the correct atom type -> [dtype->object,items->Index(['session_id', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')]

Josh Friedlander

Aug 12, 2019, 3:20:33 PM
to PyData
Dask is definitely a good option; I was just tempted by HDF5 as it seems to fit my use case in many other ways. For all-numeric datasets of comparable size, it is blisteringly fast. But it just seems to get dramatically worse with a large df with badly formed strings.

I will read that article, thank you!



Sal Abbasi

Aug 12, 2019, 3:39:07 PM
to pyd...@googlegroups.com
Josh,

If you run out of memory re-creating the pandas dataframe, then this does not sound like a serialization issue; rather, it sounds like your data is too big to fit into a pandas dataframe with the RAM and swap space you have.

Also make sure you use the correct datatype when creating each column: don't use dtype=h5py.special_dtype(vlen=str) for columns that don't contain variable-length strings. For example, use dtype='<f4' or dtype='<i4' for 4-byte floats and 4-byte ints respectively.

Try re-loading the data using numpy, and then creating a pandas dataframe in one final step, to see where you run out of memory. Something like:

def get_dtype(name):
    if name == 'session_id': return h5py.special_dtype(vlen=str)
    if name == 'my_int_col': return '<i4'
    # ... and so on, one entry per remaining column

with h5py.File('storage.h5', 'w') as f:
    for col in list(x):
        f.create_dataset(name=col, data=x[col].values, dtype=get_dtype(col))

del x

col_dict = {}
with h5py.File('storage.h5', 'r') as f:
    for col in f.keys():  # x has been deleted, so iterate over the datasets in the file
        col_dict[col] = f[col][()]

my_df = pd.DataFrame(col_dict)




Josh Friedlander

Aug 13, 2019, 8:38:12 AM
to PyData
Thanks Sal. I followed these instructions, and it was a chance to learn a bit about string treatment in HDF5, as well as what exactly pandas means by 'object' - it could be mixed datatypes or strings of any sort - and how pandas doesn't let you set a NumPy-style string dtype.

In the end I went with the following:

dtypes_dict = df.dtypes.to_dict()


And then overwrote my string column types (which are pandas object, not recognised in HDF5) with either

dtypes_dict['str_col_foo'] = h5py.special_dtype(vlen=bytes)


(i.e. ASCII-type) strings, or, if that failed, I used

dtypes_dict['sdr_u'] = h5py.special_dtype(vlen=str)


(i.e. Unicode). I also coerced df[col] = df[col].astype(str) to fill NaNs and transform any accidental non-text scalars. I followed up with

>>> with h5py.File('foo.h5', 'w') as f:
...     for col in list(df):
...         f.create_dataset(name=col, data=df[col].values, dtype=dtypes_dict[col])
>>> del df
>>> col_dict = {}
>>> with h5py.File('foo.h5', 'r') as f:
...     for col in list(dtypes_dict.keys()):
...         col_dict[col] = f[col][()]
>>> df_ = pd.DataFrame(col_dict)

And it worked - not the fastest, and I have to decode my strings afterwards, but quite manageable. The only issue I have is that the size of foo.h5 is 30GB, and this is after I reduced my original CSV input to just 12GB. (The other method I used yields an h5 file of 7.5GB.)

The method I will likely end up using is splitting my df every 10m rows, appending to HDF5 in pieces, and then using pd.concat to put it back together. Here is my code, for those who are interested:

>>> lim = 10000000  # found this size by trial and error
>>> store = pd.HDFStore('foo.h5')
>>> for i in range(len(df) // lim + 1):
...     tmp = df.loc[i * lim: lim + (i * lim) - 1, :].copy()
...     print(i * lim)
...     store['df_' + str(i)] = tmp
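(For completeness, a sketch of the read-back half he mentions - loading the pieces and concatenating them - which isn't shown in the snippet above; it assumes foo.h5 contains only the df_0, df_1, ... keys written by that loop.)

>>> store = pd.HDFStore('foo.h5')
>>> pieces = [store['df_' + str(i)] for i in range(len(store.keys()))]
>>> df = pd.concat(pieces)
>>> store.close()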

