vcat() with NAs


Milan Bouchet-Valat

Dec 2, 2013, 1:01:59 PM
to julia-stats
Hi!

I've been looking at ways to get things like [1, NA] to create DataArrays. Currently this triggers an error because there is a promote rule [1] (with a skeptical comment in the code) that promotes NAtype to whatever type it is combined with. Since there is no convert() method from NA to that type (and there cannot be one), the construction fails.

I know that one can currently write @data [1, NA] to get this behavior, and that works very well. Still, I'm wondering whether we can do better, since [1, NA] not working at all is really disconcerting.

So far I have found one possible design that would allow this behavior in a (I think) Julian fashion. A new type could be created that is similar to DataArray but holds a single value or NA; let's call it DataValue:
type DataValue{T}
    value::T
    isna::Bool
end

Any combination of NA and another type would be promoted to DataValue of this type, with a very simple "conversion":
import Base.convert

convert{T}(::Type{DataValue{T}}, x::T) = DataValue{T}(x, false)
convert{T}(::Type{DataValue{T}}, x::NAtype) = DataValue{T}(0, true)

(With this definition, DataValue{T}(0, true) only works for Int. I couldn't find a generic way of creating a placeholder value of an arbitrary type, but one surely exists...)
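For illustration, here is a minimal sketch of one generic way to build the NA case. It is written in current Julia 1.x syntax (struct, where) rather than the 2013 syntax above, and it makes two assumptions: NAtype and NA are redefined locally as stand-ins for the ones in DataArrays, and the fields are reordered so that isna comes first. With that order, an inner constructor can call new with only the first field and leave value uninitialized, so no placeholder like 0 is needed.

```julia
# Stand-ins (assumption) for the NAtype/NA that DataArrays defines,
# redefined here so the sketch is self-contained.
struct NAtype end
const NA = NAtype()

struct DataValue{T}
    isna::Bool
    value::T
    # Regular value: both fields are set.
    DataValue{T}(x) where {T} = new{T}(false, x)
    # Missing value: only `isna` is set; `value` is left uninitialized,
    # so this works for any T, not just numeric types.
    DataValue{T}() where {T} = new{T}(true)
end

Base.convert(::Type{DataValue{T}}, x::T) where {T} = DataValue{T}(x)
Base.convert(::Type{DataValue{T}}, ::NAtype) where {T} = DataValue{T}()
```

The value field of a missing DataValue must never be read: for a type like String it is undefined, and for bits types such as Int it holds arbitrary data.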

But of course this does not yet allow creating DataArrays:
julia> [1, NA]
2-element Array{DataValue{Int64},1}:
DataValue{Int64}(1,false)
DataValue{Int64}(0,true)

Looking at how [ ] and thus vcat() work, it appears that cat(1, ...) at abstractarray.jl:792 is called, and that the type promotion happens there. So I figured a solution would be to make cat() call another method just after it has determined the type to which the elements being concatenated will be promoted. This would let DataArrays hook into the process at that point with a method specific to DataValues, which would convert the DataValues into a proper DataArray. NAs would then be handled in a very integrated way, without hardcoding them in core Julia at all.
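To make the hook idea concrete, here is a rough sketch in current Julia 1.x syntax. Everything in it is hypothetical: my_vcat stands in for the [ ] / cat() entry point, finish_vcat is the proposed hook, SimpleDataArray is a toy values-plus-mask container rather than the real DataArray, and NAtype/NA are local stand-ins. Only the single promotion rule needed for a case like [1, NA] is defined; a real implementation would need more rules. The hook here works on the raw elements instead of first materializing DataValues, which is one way to avoid the intermediate objects.

```julia
# Stand-ins (assumptions) for types that DataArrays would provide.
struct NAtype end
const NA = NAtype()

struct DataValue{T}          # used here only as the promotion target
    value::T
    isna::Bool
end

struct SimpleDataArray{T}    # hypothetical toy container: values plus NA mask
    data::Vector{T}
    na::BitVector
end

# NA combined with another type T promotes to DataValue{T}.
Base.promote_rule(::Type{NAtype}, ::Type{T}) where {T} = DataValue{T}

# Default hook: behave like ordinary concatenation and return a Vector{T}.
finish_vcat(::Type{T}, xs) where {T} = T[xs...]

# Specialized hook: the promoted type is DataValue{T}, so build the
# values-plus-mask container instead of an Array of DataValues.
function finish_vcat(::Type{DataValue{T}}, xs) where {T}
    na = falses(length(xs))
    data = Vector{T}(undef, length(xs))
    for (i, x) in enumerate(xs)
        if x isa NAtype
            na[i] = true      # the data slot stays uninitialized
        else
            data[i] = x
        end
    end
    return SimpleDataArray{T}(data, na)
end

# Hypothetical entry point: promote first, then dispatch on the promoted type.
my_vcat(xs...) = finish_vcat(promote_type(map(typeof, xs)...), xs)
```

With these definitions, my_vcat(1, 2.5) goes through the default method and returns a plain Vector{Float64}, while my_vcat(1, NA) reaches the specialized method and returns a SimpleDataArray{Int}.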

Creating DataValues just to combine them a moment later adds overhead, but the [ ] syntax is mostly useful for small arrays, since you need to have all the values written literally (concatenating DataArrays is still handled differently of course). This syntax is useful in particular when presenting examples and teaching, where showing that NAs are handled consistently in the language is very important to convince people.

Does that sound reasonable, or am I totally on crack?


Regards


[1] https://github.com/JuliaStats/DataArrays.jl/blob/master/src/natype.jl#L36

John Myles White

Dec 2, 2013, 1:20:50 PM
to julia...@googlegroups.com
This is an interesting idea. Not sure how this would work without completely replacing all of the existing infrastructure, though.

 -- John

Milan Bouchet-Valat

Dec 15, 2013, 9:56:51 AM
to julia...@googlegroups.com
On Monday, December 2, 2013 at 10:20 -0800, John Myles White wrote:
> This is an interesting idea. Not sure how this would work without completely replacing all of the existing infrastructure, though.
Late reply; I thought more knowledgeable people would comment.

As I see it, none of the existing infrastructure would need to be replaced. cat() would call a method specialized on the type it has determined using the promotion rules. Base Julia would provide a default method for Any that gives the same result cat() currently returns, and DataArrays would simply provide a specific method for the DataValue type.

I really think a solution should be found to handle NAs in a more natural way. This is essential to be credible as a language for working with data, and people doing so are going to judge Julia according to this criterion in the first place.


Regards

John Myles White

Dec 15, 2013, 11:46:00 AM
to julia...@googlegroups.com
I’m not really convinced that handling NA in a different way is going to make or break Julia. I know a lot of people doing data analysis in Python, Matlab, etc, where NA isn’t cooked into the language in the way it is in R. How do I explain their behavior except by believing that R’s style of NA handling isn’t the sine qua non of data analysis tools design?

My sense is that Julia is lacking a thousand things it needs to be a robust replacement for R. What keeps people away isn’t a small thing like array literals, but the absence of virtually the entire ecosystem they’re familiar with. Given the large quantity of things that we are worse than R at right now, it’s not clear to me that it’s even theoretically possible to causally identify our handling of NA as a relevant force on adoption of the language.

Personally, I think the representation of NA is a ridiculous hack in every language I’ve ever used, including R. And yet, despite that, people manage to build amazing things around that hack in every language. For building tools in Julia, I think we should worry more about how NA is represented when the current representation inhibits us from building other tools. For now, I think it’s a minor point that is not likely to have a large effect on people deciding whether to use Julia or not.

— John

Milan Bouchet-Valat

Dec 16, 2013, 7:41:37 AM
to julia...@googlegroups.com
On Sunday, December 15, 2013 at 08:46 -0800, John Myles White wrote:
> I’m not really convinced that handling NA in a different way is going
> to make or break Julia. I know a lot of people doing data analysis in
> Python, Matlab, etc, where NA isn’t cooked into the language in the
> way it is in R. How do I explain their behavior except by believing
> that R’s style of NA handling isn’t the sine qua non of data analysis
> tools design?
>
> My sense is that Julia is lacking a thousand things it needs to be a
> robust replacement for R. What keeps people away isn’t a small thing
> like array literals, but the absence of virtually the entire ecosystem
> they’re familiar with. Given the large quantity of things that we are
> worse than R at right now, it’s not clear to me that it’s even
> theoretically possible to causally identify our handling of NA as a
> relevant force on adoption of the language.
>
> Personally, I think the representation of NA is a ridiculous hack in
> every language I’ve ever used, including R. And yet, despite that,
> people manage to build amazing things around that hack in every
> language. For building tools in Julia, I think we should worry more
> about how NA is represented when the current representation inhibits
> us from building other tools. For now, I think it’s a minor point that
> is not likely to have a large effect on people deciding whether to use
> Julia or not.
You'd be surprised to know that the question of how well missing values
are integrated was raised in discussions I had about Julia with people
working in my field (sociology). Basically no social scientists work
with Python or Matlab: they use Stata, SAS, SPSS and sometimes R. I
wouldn't say it's because of the way they handle NAs, but I do believe
that the degree of integration of missing values is a signal you send to
this kind of user about whether your language is supposed to be easy for
them to use or not. Social scientists are usually not very good
programmers, and their needs are often more basic than in many other
fields, so every glitch in basic usage patterns is a problem for them. ;-)

I'm emphasizing the seemingly innocuous issue of NAs in array literals
not because that's a major feature, but because relative to its symbolic
importance it should be relatively easy to fix. With DataArrays I think
Julia's story about NAs is already pretty good, it only needs to
develop. Let's just find a way to fill this small remaining gap.

Of course there are many major areas to improve, but I'm willing to work
on fixing this if you are interested in it.


Regards

John Myles White

Dec 16, 2013, 10:23:00 AM
to julia...@googlegroups.com
You’re always welcome to work on anything that interests you: we don’t have any kind of top-down control in our community at all. If you can build something that’s noticeably better and provides as good or better performance, I’ll be happy to merge it.

My only concern is that this project is risky: you might end up spending a lot of time on it and still not get much out of it. Since you’ve been so extremely productive in the few weeks you’ve been on the mailing list, having you work on something that might not work out is potentially a non-trivial loss to the community. But it’s always your call.

— John

Milan Bouchet-Valat

Dec 16, 2013, 10:48:37 AM
to julia...@googlegroups.com
On Monday, December 16, 2013 at 07:23 -0800, John Myles White wrote:
> You’re always welcome to work on anything that interests you: we don’t
> have any kind of top-down control in our community at all. If you can
> build something that’s noticeably better and provides as good or
> better performance, I’ll be happy to merge it.
>
> My only concern is that this project is risky: you might end up
> spending a lot of time on it and still not get much out of it. Since
> you’ve been so extremely productive in the few weeks you’ve been on
> the mailing list, having you work on something that might not work out
> is potentially a non-trivial loss to the community. But it’s always
> your call.
The solution I came up with would not require much work at all. But I'm
wondering whether it sounds correct to others, and whether you have
better ideas. If it turns out this small feature requires a major
refactoring of the code, I may indeed reconsider my position. ;-)


Regards

John Myles White

Dec 16, 2013, 10:53:11 AM
to julia...@googlegroups.com
Nothing jumps out at me. I feel like we had discussions related to this topic in the past, but can’t track them down now.

— John

Stefan Karpinski

Dec 16, 2013, 11:03:03 AM
to julia-stats
This is something where staged functions, which I mentioned in the discussion of vectorization, could help. The vcat operation basically needs to generate code after type information is known but before run-time. There's a chance we could make it more general as well as more efficient that way, but for the moment, this cannot be made to work since [1,2,3] is a literal syntax for Array construction.
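As a rough illustration of "generate code after type information is known but before run-time": later Julia versions provide this capability as generated functions (the @generated macro). The sketch below is a toy in current Julia 1.x syntax, not a vcat implementation. The names literal and is_na_type are hypothetical, and NAtype/NA are local stand-ins for the DataArrays definitions. The point is only that the NA-or-not branch is decided from the argument types when the method body is generated.

```julia
# Stand-ins (assumption) for the NAtype/NA that DataArrays defines.
struct NAtype end
const NA = NAtype()

# Helper defined before the generated function so the generator may call it.
is_na_type(T) = T === NAtype

@generated function literal(xs...)
    # Inside this body, `xs` is a tuple of the argument *types*, so the
    # branch below is resolved per type signature, not per element at run time.
    if NAtype in xs
        # Build the NA mask as a literal expression, e.g. [false, true, false].
        mask = Expr(:vect, map(is_na_type, xs)...)
        return :(("with NA", $mask))
    else
        # No NA among the argument types: fall back to ordinary promotion.
        return :(("plain", Base.vect(xs...)))
    end
end
```

Calling literal(1, NA, 2) takes the first branch and returns the precomputed mask, while literal(1, 2.5) takes the second and returns an ordinary promoted vector.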

Milan Bouchet-Valat

Dec 16, 2013, 11:39:15 AM
to julia...@googlegroups.com
On Monday, December 16, 2013 at 11:03 -0500, Stefan Karpinski wrote:
> This is something where staged functions, which I mentioned in the
> discussion of vectorization, could help. The vcat operation basically
> needs to generate code after type information is known but before
> run-time. There's a chance we could make it more general as well as
> more efficient that way, but for the moment, this cannot be made to
> work since [1,2,3] is a literal syntax for Array construction.
OK. I'm perfectly happy to wait until staged functions are implemented.

To be sure I understand: do you think my proposal wouldn't work at the
moment, or just that it wouldn't be clean/efficient?


Regards

Stefan Karpinski

Dec 16, 2013, 11:42:08 AM
to julia-stats
It could work, but it would obviously be much better to just let [1,NA,2] construct a DataArray directly. That can't be done yet, but I for one would like to allow it.

Milan Bouchet-Valat

Dec 16, 2013, 11:45:15 AM
to julia...@googlegroups.com
On Monday, December 16, 2013 at 11:42 -0500, Stefan Karpinski wrote:
> It could work, but it would obviously be much better to just let
> [1,NA,2] construct a DataArray directly. That can't be done yet,
> but I for one would like to allow it.
That would be ideal. Please keep me posted when the needed
infrastructure is ready.


Regards
