New parser branch merged; please help with testing


Wes McKinney

Nov 15, 2012, 9:44:02 PM

to pyd...@googlegroups.com

hi folks,

I just merged the new-and-improved (faster, low-memory use) file
parser branch (i.e. the guts of read_csv and read_table). If you work
regularly with medium-size, 100MB+ datasets, I dare say this will be
life-altering.

The new parser branch also includes:

- Ability to yield NumPy record arrays instead of pandas.DataFrame if
you want (as_recarray=True)
- Explicit dtypes: e.g. dtype={'C': np.float64, 'D': 'S5'}
- usecols option: read a subset of the columns in a file with low memory use
- Reading of compressed (gzip, bz2) files (e.g. compression='gzip')
- Easier specification of CSV/delimited file dialect options (e.g.
skipinitialspace=True)
- Lower-level/faster handling of European decimal formats (decimal=',')
- Special-casing whitespace delimited files for high performance
(delim_whitespace=True)
- Ability to disable NA detection logic altogether (na_filter=False)
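
To make a few of these options concrete, here is a minimal sketch that combines several of them in one read_csv call. The sample data, the file name, and the helper read_sample are made up for illustration. as_recarray and delim_whitespace are left out because they may not be available in every pandas version.

```python
import gzip
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample: ';' delimited, European decimals, and a space
# after each delimiter.
RAW = "A;B;C;D\n1; 2,5; 3,0; abc\n4; 5,5; 6,0; NA\n"


def read_sample():
    """Write RAW to a gzip file and read it back with the options above."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv.gz")
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            fh.write(RAW)
        return pd.read_csv(
            path,
            sep=";",
            compression="gzip",        # read the gzip-compressed file
            usecols=["A", "C", "D"],   # skip column B entirely
            dtype={"A": np.float64},   # explicit dtype for one column
            decimal=",",               # parse '3,0' as 3.0
            skipinitialspace=True,     # drop the space after each ';'
            na_filter=False,           # keep 'NA' as a plain string
        )


df = read_sample()
print(df.dtypes)
print(df)
```

With na_filter=False the string 'NA' in column D should come through unchanged rather than being converted to a missing value, which is the point of that option when the data is known to have no NAs.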

If you're able, I'd appreciate some help beating any remaining bugs
out of the code. All you need to do is install the development branch
and use pandas normally. If you run into any problems, please report
them on GitHub.

http://github.com/pydata/pandas

For Windows users, we're working on getting development builds (look
for 0.9.2dev-...) up on the pandas website (they aren't there yet):

http://pandas.pydata.org/pandas-build/dev/

- Wes
