After reading http://pandas.sourceforge.net/missing_data.html I
gathered that missing values may not be supported for int types. Should
I just return floats from "convert_days" with np.nan as the missing
sentinel, or is there a proper pandas way to handle the missing data
in the "days" column?
Script illustrating the problem is:
import StringIO
import numpy as np
import pandas

csv = """\
id,score,days
1,2,12
2,2-5,
3,,14+
4,6-12,2
"""

def convert_days(x):
    x = x.strip()
    if not x: return -1
    is_plus = x.endswith('+')
    if is_plus:
        x = int(x[:-1]) + 1
    else:
        x = int(x)
    return x

def convert_score(x):
    x = x.strip()
    if not x: return np.nan
    if x.find('-')>0:
        valmin, valmax = map(int, x.split('-'))
        val = 0.5*(valmin + valmax)
    else:
        val = float(x)
    return val

fh = StringIO.StringIO(csv)
p = pandas.read_csv(fh, converters={'score':convert_score,
                                    'days':convert_days}, na_values=[-1,'',None])
print p
What is your pandas version? At least on 0.7.0rc1 and git master I get:
   id  score  days
0   1    2.0    12
1   2    3.5    -1
2   3    NaN    15
3   4    9.0     2
- Wes
Ah, my apologies, I see the issue: it looks like the "converted"
value is not checked against the na_values set. Maybe it should be?
Happy to follow POLS (the principle of least surprise) here
- Wes
I found a bug by changing the days converter to return np.nan instead of -1:
https://github.com/wesm/pandas/issues/753
- W
Mainly I was just looking for a way to specify "missing" with ints
when parsing CSV. I am working around it currently by using np.nan
and explicitly casting all the ints to floats before returning from my
converter func.
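
The workaround described above might look roughly like the following. This is only a sketch, written for Python 3 and a current pandas rather than the 0.7-era setup used in this thread; the name convert_days_float is made up for illustration, and the exact dtype inference may differ between pandas versions.

```python
import io

import numpy as np
import pandas as pd

CSV_TEXT = """id,score,days
1,2,12
2,2-5,
3,,14+
4,6-12,2
"""


def convert_days_float(x):
    """Parse the 'days' field as a float, using NaN for an empty field."""
    x = x.strip()
    if not x:
        return np.nan
    if x.endswith('+'):
        # "14+" is treated as 15, as in the original convert_days.
        return float(int(x[:-1]) + 1)
    return float(x)


# Only the 'days' column gets a converter here; 'score' is left untouched.
frame = pd.read_csv(io.StringIO(CSV_TEXT),
                    converters={'days': convert_days_float})
print(frame['days'])
```

Because every value the converter returns is already a float, the column does not depend on a separate na_values pass to mark the missing row.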
Fixed a couple of bugs there: the parser now uses my more robust
alternative to np.vectorize and handles NA sentinels correctly (so your
original example returning -1 with -1 among the NA values will work)
Integer NAs are the one thing that I wish NumPy and by association
pandas had better handling for. But all in all it works alright.
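
To illustrate the integer NA limitation mentioned above, here is a small NumPy-only sketch (the array names are just for illustration). An integer array has no slot for NaN, so a column with a missing value ends up stored as floats.

```python
import numpy as np

# A plain integer array, using -1 as an ad hoc "missing" marker.
days_int = np.array([12, -1, 15, 2])
print(days_int.dtype.kind)   # 'i' (integer)

# NaN cannot be stored in an integer array; NumPy raises ValueError.
try:
    days_int[1] = np.nan
    assigned = True
except ValueError:
    assigned = False
print(assigned)

# Including NaN when building the array makes the whole array float.
days_float = np.array([12, np.nan, 15, 2])
print(days_float.dtype)
```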
Just took that for a test drive and it seems to behave well. If I
return np.nan for missing but don't cast the valid returns to floats,
all of my data is cast to float by pandas (makes sense). If I return
-1 and specify -1 as the sentinel, it sets the missing data to NaN and
casts all my data to floats (the best we can do without generic NA
handling in numpy). Returning None or '' for the "missing" case also
casts the data to floats and sets the missing values to NaN, even if ''
and None are not listed in the na_values -- this is debatable but I
don't have a problem with it.
I believe to make this maximally useful, na_values should optionally
be a dictionary. But this is not really a big deal because it can be
handled with a custom converter function. E.g., if you have a CSV file
where -99 is the missing sentinel, you can write a converter
converting that to np.nan. A dict of na_values would be *slightly*
easier, but not necessary.
Thanks for the quick fix!
JDH
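
A per-column sentinel handled through converters, as suggested above, could be sketched like this. The helper make_na_converter and the 'NA' sentinel for the score column are hypothetical, chosen only for illustration; the -99 sentinel is the one from the message.

```python
import math


def make_na_converter(sentinel, cast=float):
    """Build a converter mapping one sentinel string (or an empty field) to NaN."""
    def convert(raw):
        raw = raw.strip()
        if not raw or raw == sentinel:
            return float('nan')
        return cast(raw)
    return convert


# One converter per column, each with its own missing-value sentinel.
converters = {
    'days': make_na_converter('-99'),
    'score': make_na_converter('NA'),
}

print(converters['days']('-99'))
print(converters['days'](' 12 '))
print(converters['score']('3.5'))
```

A dict like this is the shape that the converters argument of read_csv expects, so each column can carry its own sentinel without any change to na_values.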
I agree with you re: na_values-- you might have different NA sentinels
depending on the column. Will create a ticket to remind me to do this
one of these days.
- W