[pandas] read_csv faster than read_fwf


Michael Aye

Jul 5, 2013, 10:43:51 PM
to pystat...@googlegroups.com
I just compared read_fwf with read_csv, with a 10k lines file:

%timeit pd.io.parsers.read_fwf('/u/paige/maye/2013030100_RDR.TAB.head',skiprows=1,colspecs=colspecs,header=None)

1 loops, best of 3: 282 ms per loop

%timeit pd.io.parsers.read_csv('/u/paige/maye/2013030100_RDR.TAB.head',skiprows=1,skipinitialspace=True,header=None)

10 loops, best of 3: 59.8 ms per loop

Shouldn't a fixed width reader be faster than something that needs to scan for the separator?
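For anyone wanting to reproduce the comparison without the original file, here is a minimal in-memory sketch; the colspecs and data layout are invented, since the real ones aren't shown, but like the real file the sample is both fixed-width and comma-separated:

```python
import io
import pandas as pd

# Tiny synthetic sample: fixed-width columns that also happen to be
# comma-separated, so both readers can parse it.
text = "  1,  2.5,  foo\n 10, 12.5,  bar\n"

# Fixed-width read: column boundaries given explicitly.
colspecs = [(0, 3), (4, 9), (10, 15)]
fwf = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None)

# CSV read: the parser scans for the separator instead.
csv = pd.read_csv(io.StringIO(text), skipinitialspace=True, header=None)

print(fwf.shape, csv.shape)  # both parse the same 2x3 table
```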


Wes McKinney

Jul 5, 2013, 10:57:45 PM
to pystat...@googlegroups.com
They go through different code paths: read_csv has a fast C
implementation, whereas read_fwf is pure Python.
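The two code paths can actually be compared directly through read_csv's engine argument; a rough, machine-dependent sketch on synthetic data:

```python
import io
import timeit
import pandas as pd

# Synthetic 10k-line CSV; timings vary by machine, but the C engine is
# typically several times faster than the pure-Python one.
text = "\n".join("1,2.5,foo" for _ in range(10000))

results = {}
for engine in ("c", "python"):
    results[engine] = timeit.timeit(
        lambda: pd.read_csv(io.StringIO(text), header=None, engine=engine),
        number=3,
    )
print(results)
```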

Michael Aye

Jul 5, 2013, 11:10:50 PM
to pystat...@googlegroups.com
Oh, I wasn't expecting that, but I remember your tuning efforts for read_csv.
So would you say that, of all the parsers available, read_csv is the fastest way to read text files?
I have to read many large (1.7 GB) text files that happen to be both fixed-width AND comma-separated (Planetary Data System format).
 

Wes McKinney

Jul 5, 2013, 11:13:26 PM
to pystat...@googlegroups.com
yep, read_csv for fastest speed and lowest memory usage

Michael Aye

Jul 6, 2013, 3:27:08 AM
to pystat...@googlegroups.com
ugh, I just realised the drastic punishment from parse_dates=[[0,1]]:

Parsing 10 k lines WITHOUT parse_dates: 10 loops, best of 3: 68.9 ms per loop
Parsing 10 k lines WITH parse_dates: 1 loops, best of 3: 1.54 s per loop

This pushes the read time for a 5.5 M-line file with parse_dates to 1.17 h!

This is my full command line:
%timeit pd.io.parsers.read_csv('2013030100_RDR.TAB',nrows=1e4, skipinitialspace=True,names=rdrreader.headers,skiprows=1,na_values=['-9999.0'],parse_dates=[[0,1]])

and this 1 line of data:
"01-Mar-2013", "00:00:00.090", 2456352.500001049, 16796, 0.99276,  1.44929, 316.78226, 0383788800.04259,  16.58826, 328.58567,  1857.18354,   119.74794, 180.000, 240.000,  110,  -1.03585,  58.16150, 1,   1, -0.806293,  0.534740, -0.252872,    31.3819,   86.512,  16.72140, 328.73477,   3.02910,  19.27805,   8.13338, 12.79667, 000, 012, 000

Is there a faster work-around?
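For reference, a runnable miniature of the slow call, truncated to the first few columns of the sample line (note that this list-of-lists form of parse_dates has since been deprecated in newer pandas versions):

```python
import io
import pandas as pd

# Two rows shaped like the sample line above. parse_dates=[[0, 1]] joins
# columns 0 and 1 into one field and parses it row by row with the
# generic date parser, which is what makes it so slow.
text = ('"01-Mar-2013", "00:00:00.090", 2456352.500001049, 16796\n'
        '"01-Mar-2013", "00:00:00.218", 2456352.500002531, 16797\n')

df = pd.read_csv(io.StringIO(text), header=None,
                 skipinitialspace=True, parse_dates=[[0, 1]])
print(df.dtypes)  # the combined column '0_1' comes back as datetime64
```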

Michael Aye

Jul 6, 2013, 5:23:00 AM
to pystat...@googlegroups.com
So, by using
parse = lambda x: dt.datetime.strptime(x, '%d-%b-%Y %H:%M:%S.%f')
I was able to bring the read time for 10 k lines down to:

1 loops, best of 3: 393 ms per loop

when I put the date_parser into read_csv like so:

%timeit pd.io.parsers.read_csv('2013030100_RDR.TAB',nrows=1e4, skipinitialspace=True,names=rdrreader.headers,skiprows=1,na_values=['-9999.0'],parse_dates=[[0,1]],index_col=0,date_parser=parse)

and, interestingly, it's a little bit faster when I do the parsing afterwards:

%%timeit
df = pd.io.parsers.read_csv('2013030100_RDR.TAB',nrows=1e4, skipinitialspace=True,names=rdrreader.headers,skiprows=1,na_values=['-9999.0'])
df['date_utc'] = df.date + ' ' + df.utc
df['time'] = df.date_utc.map(parse)

1 loops, best of 3: 370 ms per loop
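The same post-hoc approach, self-contained on a two-line sample; the names 'date', 'utc', 'counter' stand in for rdrreader.headers, which isn't shown:

```python
import datetime as dt
import io
import pandas as pd

# Read the raw strings first, with no date handling in read_csv at all.
text = ('"01-Mar-2013", "00:00:00.090", 16796\n'
        '"01-Mar-2013", "00:00:00.218", 16797\n')
df = pd.read_csv(io.StringIO(text), header=None,
                 skipinitialspace=True, names=['date', 'utc', 'counter'])

# Then combine the two string columns and parse with an explicit format,
# which is far cheaper than letting the generic parser guess per row.
parse = lambda x: dt.datetime.strptime(x, '%d-%b-%Y %H:%M:%S.%f')
df['date_utc'] = df.date + ' ' + df.utc
df['time'] = df.date_utc.map(parse)
```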

This gets the read time for a 5.5 M-line file with date parsing down to 3.4 minutes. Not bad; I would like it faster, obviously, but if that's the fastest it gets, I'll take it.

I would like to try it with Cython; it would be a good learning project, but I have no clue whether it would actually help. What do you guys think?

Michael Aye

Jul 6, 2013, 5:31:12 AM
to pystat...@googlegroups.com
I forgot to add here:

df.set_index(df.time, inplace=True)
df = df.drop(['date','utc','date_utc','time'], axis=1)

but that costs only 2 ms more.

Jeff Reback

Jul 6, 2013, 7:17:37 AM
to pystat...@googlegroups.com
don't parse the dates at all in read_csv
use pd.to_datetime(..)
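A sketch of what this suggestion looks like on the same kind of data (column names here are placeholders): read with no date handling at all, then convert in one vectorised call with an explicit format.

```python
import io
import pandas as pd

# Plain read, no parse_dates.
text = ('"01-Mar-2013", "00:00:00.090", 16796\n'
        '"01-Mar-2013", "00:00:00.218", 16797\n')
df = pd.read_csv(io.StringIO(text), header=None,
                 skipinitialspace=True, names=['date', 'utc', 'counter'])

# One vectorised conversion over the whole column.
df['time'] = pd.to_datetime(df.date + ' ' + df.utc,
                            format='%d-%b-%Y %H:%M:%S.%f')
```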

Michael Aye

Jul 6, 2013, 1:37:54 PM
to pystat...@googlegroups.com


On Saturday, July 6, 2013 4:17:37 AM UTC-7, Jeff Reback wrote:
don't parse the dates at all in read_csv
use pd.to_datetime(..)

I thought it's a 'post below' policy here? ;)
Anywho, thanks!

Down to 193 ms for 10 k lines, but only when using the format string argument in pd.to_datetime(..).
That gets me to 1 min 40 sec, great!

Michael Aye

Jul 7, 2013, 1:51:42 AM
to pystat...@googlegroups.com


On Saturday, July 6, 2013 10:37:54 AM UTC-7, Michael Aye wrote:
Down to 193 ms for 10 k lines, but only when using the format string argument in pd.to_datetime(..).
That gets me to 1 min 40 sec, great!

Found a bug with to_datetime though. :(

Back to slower reads...