[pandas] Keep leading zeros with read_csv

Carson Farmer

Jul 13, 2012, 8:21:28 AM
to pystat...@googlegroups.com
Just wondering if anyone knows how to avoid having columns with values
like this: 01001 being converted to ints like this: 1001 when reading
via read_csv? I tried using the converters argument, but it seems that
"maybe_convert_objects" is run from "map_infer" *after* the converter
function is applied. There is a potential solution here:
http://permalink.gmane.org/gmane.comp.python.pystatsmodels/6921, but
as Wes points out, it is really quite slow compared with read_csv
(which is, in fact, totally awesome).
In my current case, I can simply use something like this once my
dataframe is loaded:

from pandas import read_csv
df = read_csv('test_data.csv')
df.head()
    oid   did  mode             ox             oy      dx      dy
0  1001  1001    01  272311.659358  176751.822655  272675  176375
1  1001  1001    01  272311.659358  176751.822655  272375  176375
2  1001  1001    01  272311.659358  176751.822655  272125  176675
3  1001  1001    06  272311.659358  176751.822655  272675  177125
4  1001  1001    06  272311.659358  176751.822655  272675  176375

df.oid = df.oid.apply(lambda x: str(x).zfill(5))
df.head()
     oid   did  mode             ox             oy      dx      dy
0  01001  1001    01  272311.659358  176751.822655  272675  176375
1  01001  1001    01  272311.659358  176751.822655  272375  176375
2  01001  1001    01  272311.659358  176751.822655  272125  176675
3  01001  1001    06  272311.659358  176751.822655  272675  177125
4  01001  1001    06  272311.659358  176751.822655  272675  176375

because I know how many leading zeros are required, but I can imagine
some cases where one wouldn't know this ahead of time. In fact, the
original file contains double quotes around these values, so I would
have expected read_csv to interpret these as strings (objects). I do
appreciate the fact that read_csv is being a bit clever here, but in
cases where I actually know what I want, it might be nice to be able
to specify the column dtypes explicitly.
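
For reference, the zero-padding workaround described above can be sketched
without a per-row lambda. This is only a sketch: it assumes a pandas version
that provides the .str accessor on Series, and the inline csv_text is a
made-up stand-in for test_data.csv, not the real file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for test_data.csv (quoted, zero-padded ids).
csv_text = 'oid,did,mode\n"01001","01001","01"\n"01001","01001","06"\n'

# Default type inference: the quoted values are still parsed as integers.
df = pd.read_csv(io.StringIO(csv_text))

# Restore the padding after loading. This only works when the
# target width (5 here) is known ahead of time.
df["oid"] = df["oid"].astype(str).str.zfill(5)
```

As noted, this depends on knowing the width in advance, so it does not help
when the number of leading zeros varies from row to row.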

Thanks,

Carson


--
Dr. Carson J. Q. Farmer
Centre for GeoInformatics (CGI)
School of Geography and Geosciences
University of St Andrews
[w] www.carsonfarmer.com
[e] cj...@st-andrews.ac.uk
[t] @CarsonFarmer

Wes McKinney

Sep 7, 2012, 5:54:10 PM
to pystat...@googlegroups.com
I'm planning some work on the file parsers in pandas and this will
definitely be a new feature to add (explicit dtype specification for
one or more columns):

https://github.com/pydata/pandas/issues/1858

- Wes
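
A note for later readers: a minimal sketch of what per-column dtype
specification looks like, assuming a pandas release in which read_csv accepts
a dtype argument mapping column names to types. The inline csv_text is
hypothetical sample data, not the file from the original post.

```python
import io

import pandas as pd

# Hypothetical sample data with quoted, zero-padded identifiers.
csv_text = 'oid,did,mode\n"01001","01001","01"\n"01001","01001","06"\n'

# Ask the parser to keep 'oid' and 'mode' as strings; 'did' is left
# to the default type inference and is parsed as an integer.
df = pd.read_csv(io.StringIO(csv_text), dtype={"oid": str, "mode": str})
```

With this approach the leading zeros are preserved at parse time, so the
number of zeros does not need to be known in advance.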