pandas read_csv options memory_map and low_memory


Michael Aye

Nov 13, 2013, 10:29:39 PM11/13/13
to pystat...@googlegroups.com
Hi!

I can't find documentation for these options, or for how and when to use them.
The reason I ask is that I am unable to parse a 2.7 GB CSV file. I let it run for 15 minutes on a 96 GB machine; I got this warning after 2 or 3 minutes, but it never finished:

In [2]: df = pd.read_csv(fname, parse_dates=[1])
/u/paige/maye/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas-0.12.0_1100_g0c30665-py2.7-linux-x86_64.egg/pandas/io/parsers.py:1033: DtypeWarning: Columns (15,18,19) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)


Playing with low_memory=False and memory_map unfortunately produces core dump crashes, which is why I am wondering how to use them properly.
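
For reference, both are ordinary keyword arguments to read_csv; a minimal sketch of how they are passed (the file name is a placeholder) would be:

    import pandas as pd

    # Sketch only: low_memory=False asks the parser to read the whole file
    # before inferring dtypes, and memory_map=True memory-maps the file
    # instead of streaming it. "big_file.csv" is a placeholder name.
    df = pd.read_csv("big_file.csv", parse_dates=[1],
                     low_memory=False, memory_map=True)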

Cheers,
Michael
 

Jeff Reback

Nov 14, 2013, 8:59:33 AM11/14/13
to pystat...@googlegroups.com
Try reading chunk-by-chunk (then concat at the end); it's much more memory efficient.
These options are not documented AFAIK, and are only pseudo-public in any event.
http://pandas.pydata.org/pandas-docs/dev/io.html#iterating-through-files-chunk-by-chunk
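
A rough sketch of what I mean (file name and chunk size are placeholders):

    import pandas as pd

    # Read the file in pieces, then stitch the pieces together at the end.
    chunks = []
    for chunk in pd.read_csv("big_file.csv", parse_dates=[1],
                             chunksize=1000000):
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)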

See where it hangs... (and what the CSV looks like at that point).

Wes McKinney

Nov 14, 2013, 1:24:50 PM11/14/13
to pystat...@googlegroups.com
Michael-- any chance I could have a look at this file offline?

- Wes

Tommy Guy

Nov 14, 2013, 7:45:55 PM11/14/13
to pystat...@googlegroups.com
Hi Michael,

That warning is triggered when the parser infers different dtypes in different chunks of the file. For instance, it probably thinks a column is int until it sees a float. You have three options:
1) Ignore the warning (but be aware that your data has mixed types)
2) Explicitly specify dtype for those columns on import (see the sketch below)
3) After the file loads, set the type of those columns explicitly
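
For options 2 and 3, roughly (the column names here are made up; substitute the ones flagged in the warning, i.e. columns 15, 18 and 19):

    import pandas as pd

    # Option 2: tell the parser up front what those columns should be.
    df = pd.read_csv("big_file.csv", parse_dates=[1],
                     dtype={"col15": object, "col18": object, "col19": object})

    # Option 3: load first, then coerce the columns afterwards.
    # df["col15"] = df["col15"].astype(float)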

Tommy

Michael Aye

Nov 21, 2013, 2:27:57 PM11/21/13
to pystat...@googlegroups.com
Weren't we doing bottom-posting here? I forgot, sorry.

Anywho, I got it running with chunks of 1e6 rows, but it still takes 15-20 minutes. I wonder if Wes could improve it even more; not sure if that's realistic.

Michael

Wes McKinney

Nov 21, 2013, 3:25:24 PM11/21/13
to pystat...@googlegroups.com
I got the data file, thanks. I'll have a look-see when I can to see why it's taking so long to parse.