[pandas] read_csv faster than read_fwf


Michael Aye

Jul 5, 2013, 10:43:51 PM
to pystat...@googlegroups.com
I just compared read_fwf with read_csv, with a 10k lines file:

%timeit pd.io.parsers.read_fwf('/u/paige/maye/2013030100_RDR.TAB.head',skiprows=1,colspecs=colspecs,header=None)

1 loops, best of 3: 282 ms per loop

%timeit pd.io.parsers.read_csv('/u/paige/maye/2013030100_RDR.TAB.head',skiprows=1,skipinitialspace=True,header=None)

10 loops, best of 3: 59.8 ms per loop

Shouldn't a fixed width reader be faster than something that needs to scan for the separator?
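For anyone wanting to reproduce the comparison without the original file, here is a minimal in-memory sketch; the colspecs and data layout are invented, since the real ones aren't shown, but like the real file the sample is both fixed-width and comma-separated:

```python
import io
import pandas as pd

# Tiny synthetic sample: fixed-width columns that also happen to be
# comma-separated, so both readers can parse it.
text = "  1,  2.5,  foo\n 10, 12.5,  bar\n"

# Fixed-width read: column boundaries given explicitly.
colspecs = [(0, 3), (4, 9), (10, 15)]
fwf = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None)

# CSV read: the parser scans for the separator instead.
csv = pd.read_csv(io.StringIO(text), skipinitialspace=True, header=None)

print(fwf.shape, csv.shape)  # both parse the same 2x3 table
```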


Wes McKinney

Jul 5, 2013, 10:57:45 PM
to pystat...@googlegroups.com
They go through different code paths: read_csv has a fast C
implementation, whereas read_fwf is pure Python.
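The two code paths can actually be compared directly through read_csv's engine argument; a rough, machine-dependent sketch on synthetic data:

```python
import io
import timeit
import pandas as pd

# Synthetic 10k-line CSV; timings vary by machine, but the C engine is
# typically several times faster than the pure-Python one.
text = "\n".join("1,2.5,foo" for _ in range(10000))

results = {}
for engine in ("c", "python"):
    results[engine] = timeit.timeit(
        lambda: pd.read_csv(io.StringIO(text), header=None, engine=engine),
        number=3,
    )
print(results)
```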

Michael Aye

Jul 5, 2013, 11:10:50 PM
to pystat...@googlegroups.com
Oh, I wasn't expecting that, but I remember your tuning efforts for read_csv.
So would you say that, of all the parsers available, read_csv is the fastest way to read text files?
I have to read many large (1.7 GB) text files that happen to be both fixed-width AND comma-separated (Planetary Data System format).
 

Wes McKinney

Jul 5, 2013, 11:13:26 PM
to pystat...@googlegroups.com
yep, read_csv for fastest speed and lowest memory usage

Michael Aye

Jul 6, 2013, 3:27:08 AM
to pystat...@googlegroups.com
ugh, I just realised the drastic punishment from parse_dates=[[0,1]]:

Parsing 10 k lines WITHOUT parse_dates: 10 loops, best of 3: 68.9 ms per loop
Parsing 10 k lines WITH parse_dates: 1 loops, best of 3: 1.54 s per loop

This pushes the read time for a 5.5 M-line file with parse_dates to 1.17 h!

This is my full command line:
%timeit pd.io.parsers.read_csv('2013030100_RDR.TAB',nrows=1e4, skipinitialspace=True,names=rdrreader.headers,skiprows=1,na_values=['-9999.0'],parse_dates=[[0,1]])

and this 1 line of data:
"01-Mar-2013", "00:00:00.090", 2456352.500001049, 16796, 0.99276,  1.44929, 316.78226, 0383788800.04259,  16.58826, 328.58567,  1857.18354,   119.74794, 180.000, 240.000,  110,  -1.03585,  58.16150, 1,   1, -0.806293,  0.534740, -0.252872,    31.3819,   86.512,  16.72140, 328.73477,   3.02910,  19.27805,   8.13338, 12.79667, 000, 012, 000

Is there a faster work-around?
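For reference, a runnable miniature of the slow call, truncated to the first few columns of the sample line (note that this list-of-lists form of parse_dates has since been deprecated in newer pandas versions):

```python
import io
import pandas as pd

# Two rows shaped like the sample line above. parse_dates=[[0, 1]] joins
# columns 0 and 1 into one field and parses it row by row with the
# generic date parser, which is what makes it so slow.
text = ('"01-Mar-2013", "00:00:00.090", 2456352.500001049, 16796\n'
        '"01-Mar-2013", "00:00:00.218", 2456352.500002531, 16797\n')

df = pd.read_csv(io.StringIO(text), header=None,
                 skipinitialspace=True, parse_dates=[[0, 1]])
print(df.dtypes)  # the combined column '0_1' comes back as datetime64
```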

Michael Aye

Jul 6, 2013, 5:23:00 AM
to pystat...@googlegroups.com
So, by using
parse = lambda x: dt.datetime.strptime(x, '%d-%b-%Y %H:%M:%S.%f')
I was able to bring the read time for 10 k lines down to:

1 loops, best of 3: 393 ms per loop

when I put the date_parser into read_csv like so:

%timeit pd.io.parsers.read_csv('2013030100_RDR.TAB',nrows=1e4, skipinitialspace=True,names=rdrreader.headers,skiprows=1,na_values=['-9999.0'],parse_dates=[[0,1]],index_col=0,date_parser=parse)

and, interestingly, it's a little bit faster when I do the parsing afterwards:

%%timeit
df = pd.io.parsers.read_csv('2013030100_RDR.TAB',nrows=1e4, skipinitialspace=True,names=rdrreader.headers,skiprows=1,na_values=['-9999.0'])
df['date_utc'] = df.date + ' ' + df.utc
df['time'] = df.date_utc.map(parse)

1 loops, best of 3: 370 ms per loop
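The same post-hoc approach, self-contained on a two-line sample; the names 'date', 'utc', 'counter' stand in for rdrreader.headers, which isn't shown:

```python
import datetime as dt
import io
import pandas as pd

# Read the raw strings first, with no date handling in read_csv at all.
text = ('"01-Mar-2013", "00:00:00.090", 16796\n'
        '"01-Mar-2013", "00:00:00.218", 16797\n')
df = pd.read_csv(io.StringIO(text), header=None,
                 skipinitialspace=True, names=['date', 'utc', 'counter'])

# Then combine the two string columns and parse with an explicit format,
# which is far cheaper than letting the generic parser guess per row.
parse = lambda x: dt.datetime.strptime(x, '%d-%b-%Y %H:%M:%S.%f')
df['date_utc'] = df.date + ' ' + df.utc
df['time'] = df.date_utc.map(parse)
```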

This gets the read time for a 5.5 M-line file with date parsing down to 3.4 minutes. Not bad; I would like it faster, obviously, but if that's the fastest it gets, I'll take it.

I would like to try it with Cython; it would be a good learning project, but I have no clue whether it would actually help. What do you guys think?

Michael Aye

Jul 6, 2013, 5:31:12 AM
to pystat...@googlegroups.com
I forgot to add here:

df.set_index(df.time, inplace=True)
df = df.drop(['date','utc','date_utc','time'], axis=1)

but that costs only 2 ms more.

Jeff Reback

Jul 6, 2013, 7:17:37 AM
to pystat...@googlegroups.com
don't parse the dates at all in read_csv
use pd.to_datetime(..)
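A sketch of what this suggestion looks like on the same kind of data (column names here are placeholders): read with no date handling at all, then convert in one vectorised call with an explicit format.

```python
import io
import pandas as pd

# Plain read, no parse_dates.
text = ('"01-Mar-2013", "00:00:00.090", 16796\n'
        '"01-Mar-2013", "00:00:00.218", 16797\n')
df = pd.read_csv(io.StringIO(text), header=None,
                 skipinitialspace=True, names=['date', 'utc', 'counter'])

# One vectorised conversion over the whole column.
df['time'] = pd.to_datetime(df.date + ' ' + df.utc,
                            format='%d-%b-%Y %H:%M:%S.%f')
```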

Michael Aye

Jul 6, 2013, 1:37:54 PM
to pystat...@googlegroups.com


On Saturday, July 6, 2013 4:17:37 AM UTC-7, Jeff Reback wrote:
don't parse the dates at all in read_csv
use pd.to_datetime(..)

I thought it's a 'post below' policy here? ;)
Anywho, thanks!

Down to 193 ms for 10 k lines, but only when using the format string argument in pd.to_datetime(..).
That gets me to 1 min 40 sec, great!

Michael Aye

Jul 7, 2013, 1:51:42 AM
to pystat...@googlegroups.com


On Saturday, July 6, 2013 10:37:54 AM UTC-7, Michael Aye wrote:
Down to 193 ms for 10 k lines, but only when using the format string argument in pd.to_datetime(..).
That gets me to 1 min 40 sec, great!

Found a bug with to_datetime though. :(

Back to slower reads...