pandas.read_csv and missing integers

John Hunter

Feb 5, 2012, 2:21:26 PM
to pystat...@googlegroups.com
I am parsing some CSV with pandas, using converter funcs to parse
strings to floats and ints, and I am wondering what the right way to
represent missing data for ints is. For example, in the following
code I return -1 for missing ints, but I am not sure how to tell
pandas that this is the "missing" value for this column. I tried
the na_values arg in read_csv, e.g. na_values=[-1], but this seemed to
have no effect.

After reading http://pandas.sourceforge.net/missing_data.html I
gathered that missing values may not be supported for int types. Should
I just return floats in "convert_days" with np.nan as the missing
sentinel, or is there a proper pandas way of handling the
missing data in the "days" column?

Script illustrating the problem is:


# Python 2, as posted in 2012
import StringIO
import numpy as np
import pandas

csv = """\
id,score,days
1,2,12
2,2-5,
3,,14+
4,6-12,2
"""

def convert_days(x):
    # '' -> -1 (intended missing sentinel); '14+' -> 15; '12' -> 12
    x = x.strip()
    if not x: return -1

    is_plus = x.endswith('+')
    if is_plus:
        x = int(x[:-1]) + 1
    else:
        x = int(x)
    return x

def convert_score(x):
    # '' -> NaN; a range such as '2-5' -> its midpoint 3.5; '2' -> 2.0
    x = x.strip()
    if not x: return np.nan
    if x.find('-')>0:
        valmin, valmax = map(int, x.split('-'))
        val = 0.5*(valmin + valmax)
    else:
        val = float(x)

    return val

fh = StringIO.StringIO(csv)

p = pandas.read_csv(fh, converters={'score':convert_score,
                                    'days':convert_days},
                    na_values=[-1,'',None])

print p

Wes McKinney

Feb 5, 2012, 2:35:50 PM
to pystat...@googlegroups.com

What is your pandas version? At least on 0.7.0rc1 and git master I get:

   id  score  days
0   1    2.0    12
1   2    3.5    -1
2   3    NaN    15
3   4    9.0     2

- Wes

Wes McKinney

Feb 5, 2012, 2:38:27 PM
to pystat...@googlegroups.com

Ah, my apologies, I see the issue: it looks like the "converted"
value is not checked against the na_values set. Maybe it should be?
Happy to follow POLS (the principle of least surprise) here.

- Wes

Wes McKinney

Feb 5, 2012, 2:44:27 PM
to pystat...@googlegroups.com

I found a bug by changing the days converter to return np.nan instead of -1:

https://github.com/wesm/pandas/issues/753

- W

John Hunter

Feb 5, 2012, 3:56:29 PM
to pystat...@googlegroups.com
On Sun, Feb 5, 2012 at 1:44 PM, Wes McKinney <wesm...@gmail.com> wrote:
>> Ah, my apologies, I see the issue-- so it looks like the "converted"
>> value is not used with the na_values set. Maybe it should? Happy to
>> follow POLS here

Mainly I was just looking for a way to specify "missing" with ints
when parsing CSV. I am working around it currently by using np.nan
and explicitly casting all the ints to floats before returning from my
converter func.
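
The workaround described above can be sketched as follows. This is a minimal illustration in Python 3 against a recent pandas release, not the converter actually used in the thread; the sample data is trimmed from the original script, and the name csv_text is hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data: row 2 has no 'days' value.
csv_text = "id,days\n1,12\n2,\n3,14+\n4,2\n"


def convert_days(x):
    """Parse a 'days' cell, always returning a float so NaN can mark missing."""
    x = x.strip()
    if not x:
        return np.nan                    # missing sentinel
    if x.endswith('+'):
        return float(int(x[:-1]) + 1)    # '14+' is treated as 15
    return float(int(x))                 # explicit cast of valid ints to float


df = pd.read_csv(io.StringIO(csv_text), converters={'days': convert_days})
print(df)
print(df['days'].dtype)
```

Because every value the converter returns is a float, the column has one consistent floating-point dtype and the empty cell shows up as NaN.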

Wes McKinney

Feb 5, 2012, 4:26:45 PM
to pystat...@googlegroups.com

Fixed a couple of bugs there: the fix uses my more robust alternative to
np.vectorize and handles NA sentinels correctly (so your original
example returning -1 with -1 among the NA values will work).

Integer NAs are the one thing that I wish NumPy, and by association
pandas, had better handling for. But all in all it works all right.
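
The limitation mentioned here comes from NumPy's dtypes: NaN is a floating-point value, so an integer array has no slot for it. A small sketch of the resulting promotion, assuming Python 3 and NumPy (the array names are illustrative only):

```python
import numpy as np

# A purely integer column gets an integer dtype.
days_complete = np.array([12, 15, 2])

# A single NaN forces NumPy to promote the whole array to float,
# because NaN has no integer representation.
days_with_missing = np.array([12, np.nan, 15, 2])

print(days_complete.dtype.kind)   # 'i' (integer)
print(days_with_missing.dtype)    # float64
```

This is why a column with even one missing value ends up as floats in the results discussed in this thread.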

John Hunter

Feb 5, 2012, 4:54:31 PM
to pystat...@googlegroups.com
On Sun, Feb 5, 2012 at 3:26 PM, Wes McKinney <wesm...@gmail.com> wrote:
> Fixed a couple bugs there-- use my more robust alternative to
> np.vectorize and handle NA sentinels correctly (so your original
> example returning -1 with -1 among the NA values will work)

Just took that for a test drive and it seems to behave well. If I
return np.nan for missing but don't cast the valid returns to floats,
all of my data is cast to float by pandas (makes sense). If I return
-1 and specify -1 as the sentinel, it sets the missing data to NaN and
casts all my data to floats (the best we can do w/o generic NA handling
in numpy). Returning None or '' for the "missing" case also casts the
data to floats and sets the missing values to NaN, even if '' and None
are not listed in na_values. This is debatable, but I don't have a
problem with it.

I believe that to make this maximally useful, na_values should
optionally be a dictionary keyed by column. But this is not really a
big deal because it can be handled with a custom converter function.
E.g., if you have a CSV file where -99 is the missing sentinel, you can
write a converter converting that to np.nan. A dict of na_values would
be *slightly* easier, but is not necessary.

Thanks for the quick fix!
JDH

Wes McKinney

Feb 5, 2012, 5:05:31 PM
to pystat...@googlegroups.com

I agree with you re: na_values: you might have different NA sentinels
depending on the column. I will create a ticket to remind me to do this
one of these days.

- W
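
Later pandas releases do accept a dict for na_values, mapping column names to their own sentinels. A minimal sketch, assuming Python 3 and a recent pandas; the sample data and the sentinel choices (-99 for days, -1 for score) are made up for illustration and do not come from the thread:

```python
import io

import pandas as pd

# Hypothetical data: -99 means "missing" in 'days', -1 means "missing" in 'score'.
csv_text = "id,days,score\n1,12,-99\n2,-99,7\n3,15,-1\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    # Per-column NA sentinels: each key is a column name.
    na_values={"days": [-99], "score": [-1]},
)

# 'days' treats -99 as missing; in 'score' the same -99 is kept as data.
print(df)
print(df.isna())
```

Both columns are still promoted to float once they contain a missing value, for the reason discussed earlier in the thread.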
