Handling NA values in DataFrames


Glen Hertz

Dec 19, 2012, 12:37:49 AM
to julia...@googlegroups.com
Hi,

I have some NA values in a DataFrame and I do this:

u = unique(df["col"])

and NA is considered a unique value.  If I do:

sort(u)

Then it errors out.

Should NA values be considered unique?  I can't see how something can be unique if it wasn't there.  Shouldn't most functions gracefully ignore NA values?

Glen

Harlan Harris

Dec 19, 2012, 7:47:18 AM
to julia...@googlegroups.com
Hi Glen,

NA is considered a unique value in R too. I believe that returning NA is what most people would expect. Note that u is a Vector{Any}, because unique doesn't return a DataVec. (Should it?)

And that's why you get an error in the sort -- because you can't sort heterogeneous arrays. 

No, I definitely don't think most functions should gracefully ignore NAs. They should have graceful ways of specifying what to do with NAs, but ignoring by default is rarely good.
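That principle can be sketched in Python, with None standing in for NA (the names `mean` and `skipna` here are illustrative, not the DataFrames API):

```python
# Illustrative sketch (not the DataFrames API): a reduction that takes
# an explicit flag for NA handling rather than silently skipping.
# None stands in for NA; `mean` and `skipna` are made-up names here.
def mean(xs, skipna=False):
    if skipna:
        xs = [x for x in xs if x is not None]
    if any(x is None for x in xs):
        return None  # propagate NA by default
    return sum(xs) / len(xs)

print(mean([1, 2, None]))               # None: NA propagates
print(mean([1, 2, None], skipna=True))  # 1.5
```

The point is that the caller opts in to dropping NAs; the default makes the missing data visible rather than quietly discarding it.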

And just a note -- we also have the julia-stats mailing list. That might be a better place for any further discussion?

Thanks for trying stuff out! :)

 -Harlan

John Myles White

Dec 19, 2012, 7:49:24 AM
to julia...@googlegroups.com
Hi Glen,

First off, thanks for trying out the DataFrames package. Having feedback is really helpful.

But I should warn you that the DataFrames package is both underdocumented and still quite brittle.

This example brings up one general source of brittleness: all of the basic functionality of Julia has to be duplicated to cope intelligently with NA's. We've already done that for unique(), but not for sort(). Every time you find one of these things and bring it up, we can fix that piece of the system. But you are unfortunately going to hit a lot of them as you go.

-- John

Stefan Karpinski

Dec 19, 2012, 10:39:18 AM
to Julia Users
Sorting Any arrays is no problem as long as all the pairs of element types are comparable. For example:

julia> sort({2,1,3,pi,e})
5-element Any Array:
 1      
 2      
 2.71828
 3      
 3.14159

julia> sort({2,1,3,pi,e,NA})
type error: non-boolean (NAtype) used in boolean context
 in insertionsort! at sort.jl:141
 in mergesort! at sort.jl:223
 in sort at sort.jl:495

The sorting issue for NA is that the isless function should never return NA, unlike the < function, which can. Here's the pinpointed problem:

julia> isless(1,NA)
NA

For comparison, consider NaN, which is <-unordered with respect to all numbers:

julia> 1.0 < NaN
false

julia> NaN < 1.0
false

The isless function exists precisely to give a total sorting order; otherwise all your sorts would fail in the presence of NaNs, since comparison sorts don't work if the ordering function isn't total. NaNs sort to the end, so all other numbers are isless than NaN:

julia> isless(1.0,NaN)
true

julia> isless(NaN,1.0)
false

julia> isless(Inf,NaN)
true

julia> isless(NaN,Inf)
false

Another subtlety here is that the isless and isequal comparisons need to be compatible with hashing. You can't have two things be isequal but hash differently, because the isequal comparison is what is used for hashing. This implies, for example, that -0.0 and 0.0, which are distinct IEEE 754 values with -0.0 sorting before 0.0, must not hash the same: since isless(-0.0, 0.0) holds, they cannot be isequal.

Sorry to drop all that insanity on you guys. Sorting, hashing, and floating-point numbers have a very complicated relationship. I suspect that NA should probably mostly behave like NaN in that regard, although, of course, it's distinct from NaN as well (NaN is a *known* value that is not a number, whereas NA is an unknown value).
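The total-order trick can be sketched in plain Python, where every comparison against NaN is likewise false (the key function below is illustrative):

```python
import math

# A total ordering for floats in the spirit of Julia's isless: NaN
# compares after every other value, so a comparison sort stays correct
# even with NaNs mixed in. `total_key` is an illustrative name.
def total_key(x):
    return (1, 0.0) if math.isnan(x) else (0, x)

data = [2.0, float("nan"), 1.0, math.inf]
print(sorted(data, key=total_key))  # NaN ends up last
```

Without the key, Python's sort would silently produce an arbitrary order, since the `<` on floats isn't total once NaN is involved.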



Stefan Karpinski

Dec 19, 2012, 11:19:14 AM
to Julia Users
This actually led me to notice that we were no longer printing -0.0 with a leading minus sign, which I just fixed. Now you can see what I'm talking about with hashing:

julia> d = Dict()
Dict{Any,Any}()

julia> d[0.0] = true
true

julia> d[-0.0] = false
false

julia> d
{0.0=>true,-0.0=>false}
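For contrast, Python makes the opposite choice: 0.0 and -0.0 are equal, identically-hashing dict keys, so the same two assignments leave one entry, not two:

```python
# For contrast with the Julia behavior above: Python treats 0.0 and
# -0.0 as equal dict keys that hash the same, so the second
# assignment overwrites the first entry instead of adding a new one.
d = {}
d[0.0] = True
d[-0.0] = False
print(d)                        # {0.0: False}
print(hash(0.0) == hash(-0.0))  # True
```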

Kevin Squire

Dec 19, 2012, 12:13:42 PM
to julia...@googlegroups.com
Here's a workaround that pushes NAs to the end of the array.  Note that it won't be as fast as normal sorting.

julia> sort_by(a ->(isna(a) ? NaN : a), {2,1,3,pi,e,NA})
6-element Any Array:
 1      
 2      
 2.71828
 3      
 3.14159
  NA    


Also note that this copies the array.  You can use sort_by!() to sort in-place.
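A hypothetical Python equivalent of the same workaround, with None standing in for NA (the `na_last` key is a made-up helper):

```python
import math

# Equivalent of the sort_by workaround above, with None standing in
# for NA: map NA to a key tuple that sorts after every number.
def na_last(x):
    v = math.nan if x is None else x
    if isinstance(v, float) and math.isnan(v):
        return (1, 0.0)  # NaN/NA sorts last
    return (0, v)

result = sorted([2, 1, 3, math.pi, math.e, None], key=na_last)
print(result)  # the None (NA) lands at the end
```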

Kevin

Stefan Karpinski

Dec 19, 2012, 12:44:53 PM
to Julia Users
The solution for sorting in the presence of NaNs that Julia uses is to have an initial phase that just moves all NaN values to the end of the array and then use normal sorting on the non-NaN values. Matlab actually just *counts* the NaNs and then produces that many NaNs at the end of the array, which is a bit faster (when there are tons of NaNs), but doesn't preserve NaN payloads, which some people may want to use for something.

In the case of NAs, something similar should probably be done and I do actually think that unique applied to a DataVec should return a DataVec. Likewise unique applied to a DataFrame should probably return a DataFrame with unique rows.
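The partition-then-sort strategy can be sketched in a few lines of Python (names illustrative):

```python
import math

# Sketch of the strategy described above: move the NaNs to the end,
# preserving their payloads and order, and sort only the numeric part.
def sort_nans_last(xs):
    nums = [x for x in xs if not math.isnan(x)]
    nans = [x for x in xs if math.isnan(x)]
    return sorted(nums) + nans

out = sort_nans_last([3.0, float("nan"), 1.0, 2.0])
print(out)  # [1.0, 2.0, 3.0, nan]
```

The two-pass split is what preserves NaN payloads, unlike the counting approach.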



Glen Hertz

Dec 19, 2012, 11:37:42 PM
to julia...@googlegroups.com

On Wednesday, December 19, 2012 7:47:18 AM UTC-5, Harlan Harris wrote:
Hi Glen,

NA is considered a unique value in R too. I believe that returning NA is what most people would expect. Note that u is a Vector{Any}, because unique doesn't return a DataVec. (Should it?)

I was used to Pandas, where NaN is used for non-values, even for non-numbers.  It is a bit complex trying to keep track of the differences between how NA and NaN behave.

I'm not sure if unique should return a DataVec.  I was just glad that sort() and unique() worked and were easy to find...more intuitive than Pandas.  I just got hung up on the NA concept.  
 
And that's why you get an error in the sort -- because you can't sort heterogeneous arrays. 

This makes sense now...Thanks. 
 
No, I definitely don't think most functions should gracefully ignore NAs. They should have graceful ways of specifying what to do with NAs, but ignoring by default is rarely good.
 
And just a note -- we also have the julia-stats mailing list. That might be a better place for any further discussion?

I'm doing financial stuff, not stats.  Does that belong in stats? 

Glen

Glen Hertz

Dec 19, 2012, 11:41:37 PM
to julia...@googlegroups.com
On Wednesday, December 19, 2012 10:39:18 AM UTC-5, Stefan Karpinski wrote:
For comparison, consider NaN, which is <-unordered with respect to all numbers:

julia> 1.0 < NaN
false

julia> NaN < 1.0
false

The isless function exists precisely to give a total sorting order; otherwise all your sorts would fail in the presence of NaNs, since comparison sorts don't work if the ordering function isn't total. NaNs sort to the end, so all other numbers are isless than NaN:

julia> isless(1.0,NaN)
true

julia> isless(NaN,1.0)
false

julia> isless(Inf,NaN)
true

julia> isless(NaN,Inf)
false

Does this make sense?

julia> min([1,2,NaN])
1.0

julia> max([1,2,NaN])
2.0

julia> mean([1,2,NaN])
NaN

I believe NumPy returns NaN for all these cases.
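The same order-dependence is easy to reproduce in plain Python, where the builtin min/max behave like Julia's here because every comparison against NaN is false:

```python
import math

nan = float("nan")

# Plain Python's builtin min/max are order-dependent when NaN is
# present, because every comparison against NaN is false:
print(min([1, 2, nan]))  # 1 (NaN never wins a comparison)
print(min([nan, 1, 2]))  # nan (but it never loses one either)
```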

Glen

John Myles White

Dec 20, 2012, 11:34:01 AM
to julia...@googlegroups.com
I personally think it makes sense, although it's clearly very distinct from the approach we've taken with NA for DataFrames and DataVecs.

To me this is a strong argument for not conflating NA with NaN in the way that Python has.

 -- John


Stefan Karpinski

Dec 20, 2012, 11:43:17 AM
to Julia Users
I'm not sure how intentional this behavior was, but Matlab and NumPy do the same thing, so it may be the best choice. It's certainly the easiest. On the principle that NaN is neither greater than nor less than any number, it is also arguably correct.



Stefan Karpinski

Dec 20, 2012, 12:06:33 PM
to Julia Users
On Thu, Dec 20, 2012 at 11:34 AM, John Myles White <johnmyl...@gmail.com> wrote:
To me this is a strong argument for not conflating NA with NaN in the way that Python has.

I think conflating NaN and NA is conceptually quite bad. NaN is a known floating-point value that is incomparable to other floating-point values. NA is a placeholder for an unknown value. The fact that 1.5*NaN = NaN and 1.5*NA = NA is just a coincidence. For example, 0.0*NaN = NaN but 0.0*NA should probably be 0.0 (although the possibility of NA being NaN means that the result could be 0.0 or NaN which maybe means the answer should be NA).
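A toy sketch of that distinction in Python, using a hypothetical NA sentinel and implementing the 0.0*NA == 0.0 option discussed above (the other option, returning NA to cover the case where NA hides a NaN, would drop the special case):

```python
NA = object()  # hypothetical NA sentinel, for illustration only

# One of the two behaviors sketched above: 0.0 * NA is taken to be
# 0.0, since the product is 0.0 for any finite value NA might hide;
# any other operand makes the result unknown, i.e. NA.
def times(a, b):
    if a is NA or b is NA:
        known = b if a is NA else a
        if known == 0.0:
            return 0.0
        return NA
    return a * b

print(times(1.5, NA) is NA)  # True
print(times(0.0, NA))        # 0.0
```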

John Myles White

Dec 20, 2012, 3:57:31 PM
to julia...@googlegroups.com
I think the strongest argument is that putting NaN inside either of the following seems pretty silly:

DataVec(["A", "B"])
DataVec({[1, 2], [2, 1]})

Using a BitArray mask means that we can easily wrap any type inside of a DataVec, which R achieves by having NA be part of the core language.
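The mask idea can be sketched in Python (MaskedVec is a made-up name for illustration, not the actual DataVec implementation):

```python
# Illustrative sketch of the mask idea: values of any element type
# live in one array, and a parallel boolean mask marks the NAs, so
# no sentinel value has to fit inside the element type itself.
class MaskedVec:
    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)  # True where the entry is NA
    def __getitem__(self, i):
        return None if self.mask[i] else self.values[i]

v = MaskedVec(["A", "B"], [False, True])
print(v[0])  # A
print(v[1])  # None (standing in for NA)
```

NumPy's masked arrays use the same values-plus-mask layout for the same reason.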

 -- John

