Re: RFC: data frame proposal


Douglas Bates Mar 22, 2012 10:56 AM
Posted in group: julia-dev
On Thursday, March 22, 2012 9:51:32 AM UTC-5, Stefan Karpinski wrote:
This proposal is the result of a rather fruitful discussion with Harlan Harris that started over a beer in person (best way to have them!) on Monday and has continued a bit by email since then. I think it's ripe enough to put out here for some feedback from people.

The basic idea is to have a parametric DataVec{T} type something like this:

type DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}

    # inner constructors...
end
# outer constructors...

Then a DataFrame type would be an ordered, named bundle of DataVec, which can have different types. Using the trick of indexing into a type to construct "vectors" of that type, we could make this work:

The approach in R has been somewhat different. The easy case is the floating-point types, where NA is simply one specific bit pattern among the available NaN representations. The integer, factor, etc. NAs are also specific patterns, and the overhead in arithmetic operations lies in checking for that pattern. I can understand that it would be undesirable to compromise the performance of arithmetic on vectors in Julia by making this part of the language for all operations on integer types, but the proposed scheme adds some overhead in storage and processing even in the floating-point case, where NaN values should be handled by the hardware.
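To make the floating-point case concrete, here is a minimal Python sketch of an NA encoded as one particular NaN bit pattern. The payload value 1954 in the low 32 bits is, as far as I know, what R uses, but treat it as an assumption here; the helper names (`bits_to_float`, `float_to_bits`, `is_na`) are made up for illustration.

```python
import math
import struct

# Assumption: a missing double is a NaN whose low 32 bits hold the payload 1954.
NA_BITS = 0x7FF00000000007A2

def bits_to_float(bits):
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def float_to_bits(x):
    """Reinterpret an IEEE 754 double as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def is_na(x):
    """True only for a NaN that carries the NA payload in its low word."""
    return math.isnan(x) and (float_to_bits(x) & 0xFFFFFFFF) == 1954

NA_REAL = bits_to_float(NA_BITS)
```

With this encoding no separate mask is stored: `math.isnan(NA_REAL)` holds, so ordinary NaN handling applies, and only code that needs to tell NA apart from other NaNs has to inspect the payload.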

Your examples of operations below do not match the default behavior in R. In R an NA always propagates, so the sum, mean, var, etc. of a vector containing an NA is NA unless the optional argument na.rm is TRUE (the default is FALSE).
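The R default described here can be sketched in a few lines of Python. This is only an illustration of the semantics, not R's implementation: `None` stands in for NA, and `r_sum` and `na_rm` are hypothetical names mirroring R's `sum` and `na.rm`.

```python
NA = None  # hypothetical stand-in for R's NA

def r_sum(values, na_rm=False):
    """Sum that propagates NA unless na_rm is set, as R's sum does by default."""
    total = 0
    for v in values:
        if v is NA:
            if na_rm:
                continue   # drop the missing value
            return NA      # a single NA poisons the whole result
        total += v
    return total
```

Here `r_sum([1, 2, NA, 3])` gives NA, while `r_sum([1, 2, NA, 3], na_rm=True)` gives 6.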

I can see the sense in the approach you have outlined, but I must admit I still don't feel comfortable with it. It may be best to do some trial implementations and check whether they feel R-like. To me the ability to handle missing values in computation is important, but it is not the "raison d'être" of a data frame. An important aspect of data frames that does not seem to be part of this definition is the requirement that all the vectors have the same length, so that the data frame can be regarded as a table and indexed like a matrix.
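The equal-length requirement is easy to state in code. The following Python sketch uses a hypothetical `Frame` class (not part of the proposal) that rejects ragged columns at construction and then allows matrix-style `[row, column]` indexing.

```python
class Frame:
    """Hypothetical table: ordered, named columns that must share one length."""

    def __init__(self, columns):
        # columns is a list of (name, values) pairs; order is preserved.
        lengths = {len(vals) for _, vals in columns}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.names = [name for name, _ in columns]
        self.columns = [list(vals) for _, vals in columns]
        self.nrow = lengths.pop() if lengths else 0

    def __getitem__(self, key):
        # Matrix-style access: frame[row, column], both zero-based.
        i, j = key
        return self.columns[j][i]

df = Frame([("id", [1, 2, 3]), ("score", [9.5, 7.0, 8.25])])
```

Because the invariant is checked once in the constructor, `df[1, 1]` can index straight into the column without re-validating the shape.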


DataVec[1,2,NA,3]

Here NA would be defined like this:

type NAType; end
const NA = NAType()

This is just like nothing (see src/boot.jl). The implementation of ref(::Type{DataVec}, vals...) would just iterate through all of vals, initializing the data vector and na vector as appropriate and returning the resulting DataVec object. That's a nice pleasant literal syntax for data vectors.
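As a rough Python analogue of that constructor (a sketch only; `make_datavec` is a hypothetical name, and the choice of 0 as the filler under a masked slot is an assumption), the loop over the values might look like this:

```python
class NAType:
    """Marker type for a missing value, analogous to the NAType above."""

    def __repr__(self):
        return "NA"

NA = NAType()

def make_datavec(*vals):
    """Split a literal sequence into parallel data and na lists."""
    data, na = [], []
    for v in vals:
        missing = v is NA
        na.append(missing)
        # A masked slot still needs some value of the element type;
        # 0 is an arbitrary filler that is never meant to be read.
        data.append(0 if missing else v)
    return data, na

data, na = make_datavec(1, 2, NA, 3)
```

After the call, `data` is `[1, 2, 0, 3]` and `na` is `[False, False, True, False]`.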

To implement parallel element-wise data operations on data vectors, one would do something like this:

./(v::DataVec, w::DataVec) = DataVec(v.data ./ w.data, v.na | w.na)

That way NA is poisonous: if either value in such an operation is NA, the corresponding result value is NA.
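The same poisoning rule, sketched in Python on plain lists (the function name `divide` and the 0.0 filler are assumptions for illustration). Unlike the Julia one-liner, this version skips the division in masked slots, because Python raises on division by zero instead of returning Inf or NaN.

```python
def divide(v_data, v_na, w_data, w_na):
    """Element-wise division where NA in either operand yields NA."""
    # The result mask is the element-wise OR of the two input masks.
    na = [a or b for a, b in zip(v_na, w_na)]
    # Only compute where both inputs are present; masked slots get a filler.
    data = [0.0 if m else x / y for x, y, m in zip(v_data, w_data, na)]
    return data, na
```

For example, dividing `[1.0, 4.0, 9.0]` with mask `[False, True, False]` by `[2.0, 2.0, 0.0]` with mask `[False, False, True]` leaves only the first element present, with value 0.5.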

For aggregating operations, like sum, mean, var, etc. you would write the underlying operation to ignore NA values. Something like this:

sum(v::DataVec)  = sum(v.data[!v.na])
mean(v::DataVec) = mean(v.data[!v.na])
var(v::DataVec)  = var(v.data[!v.na])

except the real implementations may want to avoid creating a temporary subarray for this computation. One thing to point out is what happens when there are no non-NA values for something like mean: the answer is NaN. This just falls out of the definition of mean. NA and NaN are not the same thing; NaN is the correct answer here. The same thing would happen for var when there aren't at least two non-NA values. Sum will automatically return zero of the correct underlying data type when there are no non-NA values.
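A version that avoids the temporary can walk the data and mask together. This Python sketch uses hypothetical names (`na_sum`, `na_mean`); note that the NaN for an all-NA mean has to be returned explicitly here, since Python raises on 0.0 / 0 rather than producing NaN the way floating-point division does in Julia.

```python
import math

def na_sum(data, na, zero=0):
    """Sum of the non-NA entries; returns `zero` when every entry is NA."""
    total = zero
    for x, missing in zip(data, na):
        if not missing:
            total += x
    return total

def na_mean(data, na):
    """Mean of the non-NA entries; NaN when there are none."""
    count = sum(1 for missing in na if not missing)
    if count == 0:
        return math.nan  # explicit: Python would raise on 0.0 / 0
    return na_sum(data, na, 0.0) / count
```

With data `[1, 2, 0, 3]` and mask `[False, False, True, False]`, the sum is 6 and the mean is 2.0; with every slot masked, the sum is 0 and the mean is NaN.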

One question is what to return when getting a single value from a data vector. I think this is simple:

ref{T}(v::DataVec{T}, i::Int) = v.na[i] ? NA : v.data[i]

This has a return type signature of Union(NAType,T), which is a bit unfortunate, but probably just fine — you shouldn't be writing big computations on data frames this way; you should use the underlying data and na vectors instead. Of course, you can write naïve code that works, but it will be slower.
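To illustrate the trade-off, here is a small Python sketch (all names hypothetical) contrasting element access that returns either NA or a value with code that works on the mask directly.

```python
from typing import List, Union

class NAType:
    """Marker type for a missing value."""

    def __repr__(self):
        return "NA"

NA = NAType()

def ref(data: List[int], na: List[bool], i: int) -> Union[NAType, int]:
    """Return NA for a masked slot, otherwise the stored value."""
    return NA if na[i] else data[i]

def count_present(data, na):
    """Naive loop: every element goes through ref and an NA check."""
    return sum(1 for i in range(len(data)) if ref(data, na, i) is not NA)

def count_present_fast(na):
    """Same answer from the mask alone, without touching data."""
    return sum(1 for missing in na if not missing)
```

Both counting functions agree, but the second never produces a value of the union type, which is the style of computation the paragraph above recommends.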

As I said at the outset, a DataFrame would then just be an ordered, named bundle of DataVec objects of heterogeneous type.

So, RFC: what do the pro R users here think of this? Does this seem sensible? Does it satisfy what they need? There are a few reasons I like it:
  1. Doesn't complicate underlying value types: Int, Float64, String, etc.
  2. Having NA as a standalone value in general computation doesn't really make much sense; it makes sense in the context of a collection of data, which is how it's used here. The NA value is just a special value that lets one conveniently indicate unavailability.
  3. Will work with any underlying data type: if someone defines a Date type with appropriate operations, it will immediately be usable in a DataFrame.
  4. As usual in Julia, the entire implementation is transparently exposed to the programmer. They can see what's going on, and therefore understand it. They can also, if need be, mess around with it, although that may not always be advised.

