using converters with read_csv

5,672 views
Skip to first unread message

luc.k

unread,
Jan 6, 2012, 2:21:17 PM1/6/12
to pystat...@googlegroups.com
For some "european style" csv's you don't only have a ';' as delimiter but als the decimal sign is a comma!
It would be nice to have an argument like the sep argument, for example decsep=',', especially when working with csv's with many columns.
 
For the moment I'm trying to use a converter for specific columns by using a lambda function: f = lambda x : x.replace(",",".")
While its works with np.genfromtxt, it doesn't with pd.read_csv...?
In my lambda function I also used f = lambda x : float(x.replace(",",".")) without success
 
The inputfile
Id;Number1;Number2;Text1;Text2;Number3
1;1521,1541;187101,9543;ABC;poi;4,738797819
2;121,12;14897,76;DEF;uyt;0,377320872
3;878,158;108013,434;GHI;rez;2,735694704
 
import pandas as pd

df1 = pd.read_csv(r'c:\temp\input1.csv',sep=';')

print df1

print df1.dtypes

print '-----------'

f = lambda x : x.replace(",",".")

print f('12,55')

print '-----------'

converter = {'Number1':f,'Number2':f, 'Number3':f}

df2 = pd.read_csv(r'c:\temp\input1.csv',sep=';',converters=converter)

print df2

print df2.dtypes

 
 
gives the following output
 
   Id  Number1    Number2      Text1  Text2  Number3   
0  1   1521,1541  187101,9543  ABC    poi    4,738797819
1  2   121,12     14897,76     DEF    uyt    0,377320872
2  3   878,158    108013,434   GHI    rez    2,735694704
Id         int64
Number1    object
Number2    object
Text1      object
Text2      object
Number3    object
-----------
12.55
-----------
Traceback (most recent call last):
  File "C:\temp\testpandaswithconverter.py", line 11, in <module>
    df2 = pd.read_csv(r'c:\temp\input1.csv',sep=';',converters=converter)
  File "C:\Python27\lib\site-packages\pandas-0.6.1-py2.7-win32.egg\pandas\io\parsers.py", line 64, in read_csv
    return parser.get_chunk()
  File "C:\Python27\lib\site-packages\pandas-0.6.1-py2.7-win32.egg\pandas\io\parsers.py", line 418, in get_chunk
    data = _convert_to_ndarrays(data, self.na_values)
  File "C:\Python27\lib\site-packages\pandas-0.6.1-py2.7-win32.egg\pandas\io\parsers.py", line 462, in _convert_to_ndarrays
    result[c] = _convert_types(values, na_values)
  File "C:\Python27\lib\site-packages\pandas-0.6.1-py2.7-win32.egg\pandas\io\parsers.py", line 469, in _convert_types
    lib.sanitize_objects(values, na_values)
  File "parsing.pyx", line 222, in pandas._tseries.sanitize_objects (pandas\src\tseries.c:54267)
ValueError: Does not understand character buffer dtype format string ('s')
 

Wes McKinney

unread,
Jan 6, 2012, 3:16:17 PM1/6/12
to pystat...@googlegroups.com

hi Luc,

using f = lambda x : float(x.replace(",", ".")) worked for me with the
development version of pandas but breaks in 0.6.1, so this is
obviously a bug that I've fixed since then:

In [6]: df2
Out[6]:


Id Number1 Number2 Text1 Text2 Number3

0 1 1521 1.871e+05 ABC poi 4.739
1 2 121.1 1.49e+04 DEF uyt 0.3773
2 3 878.2 1.08e+05 GHI rez 2.736

In [9]: df2.dtypes
Out[9]:
Id int64
Number1 float64
Number2 float64
Text1 object
Text2 object
Number3 float64

It's certainly safe to install and use the version from GitHub.
Otherwise you can hang on a few more days for the official 0.7.0
release.

The exception you show when returning a string still happens in the
current git version, though. Creating an issue for it and will fix it
imminently:

https://github.com/wesm/pandas/issues/583

thanks,
Wes

Wes McKinney

unread,
Jan 6, 2012, 3:25:04 PM1/6/12
to pystat...@googlegroups.com

OK I fixed the issue:

https://github.com/wesm/pandas/commit/cd4636bb6b69dde447b2529b8d940f459a9c598e

After the fix using

f = lambda x : x.replace(",", ".")

works and the result DataFrame has floating point columns for the Number* ones.

Thanks for reporting and the test case (often with fixing bugs the
hardest part is writing the test case).

- Wes

luc.k

unread,
Jan 7, 2012, 7:33:26 AM1/7/12
to pystat...@googlegroups.com
perfect.
What do you think of my suggestion of having an extra argument for the decimal sign (default = '.') ?
A lot of csv's I receive, come from other systems that by default use the comma as decimal sign. Since such csv's are a mix of types (string, integer, float,...) it would be interesting to have a quicker way like an additional argument
Or maybe there is already a quick way I'm not aware of?
 
FYI (source: wikipedia):
 
Decimal marks:
Period — Blue
Comma — Green
Non-West-Arabic Numerals — Red
Data unavailable — Grey

Wes McKinney

unread,
Jan 7, 2012, 11:03:37 AM1/7/12
to pystat...@googlegroups.com
I created an issue about it here:


I think it should be straightforward. I'm not sure if I'll get to it by the time of release but sometime in the relatively near future.

- Wes

luc.k

unread,
Feb 15, 2012, 8:40:02 AM2/15/12
to pystat...@googlegroups.com
Hi Wes,
 
the counterpart of  this story is the "to_csv" method.  Is it also possible there?
 
Luc

Adam Klein

unread,
Feb 15, 2012, 10:48:34 AM2/15/12
to pystat...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages