RFC: data frame proposal

Stefan Karpinski

Mar 22, 2012, 10:51:32 AM
to Julia Dev
This proposal is the result of a rather fruitful discussion with Harlan Harris that started over a beer in person (best way to have them!) on Monday and has continued a bit by email since then. I think it's ripe enough to put out here for some feedback from people.

The basic idea is to have a parametric DataVec{T} type something like this:

type DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}

    # inner constructors...
end
# outer constructors...

Then a DataFrame type would be an ordered, named bundle of DataVec, which can have different types. Using the trick of indexing into a type to construct "vectors" of that type, we could make this work:

DataVec[1,2,NA,3]

Here NA would be defined like this:

type NAType; end
const NA = NAType()

This is just like nothing (see src/boot.jl). The implementation of ref(::Type{DataVec}, vals...) would just iterate through all of vals, initializing the data vector and na vector as appropriate and returning the resulting DataVec object. That's a nice pleasant literal syntax for data vectors.
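A minimal sketch of that literal syntax, written in current Julia (struct and Base.getindex rather than the type and ref spellings used in this thread). Storing the first non-NA value as a placeholder under NA slots, and requiring at least one non-NA value so the element type can be inferred, are assumptions made here for illustration and are not part of the proposal.

```julia
struct NAType end
const NA = NAType()

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# `DataVec[1, 2, NA, 3]` lowers to `getindex(DataVec, 1, 2, NA, 3)`.
function Base.getindex(::Type{DataVec}, vals...)
    present = Any[v for v in vals if !(v isa NAType)]
    isempty(present) && throw(ArgumentError("need at least one non-NA value"))
    T = promote_type(map(typeof, present)...)
    filler = convert(T, first(present))   # placeholder stored under NA slots (assumption)
    na = Bool[v isa NAType for v in vals]
    data = T[v isa NAType ? filler : convert(T, v) for v in vals]
    return DataVec{T}(data, na)
end

v = DataVec[1, 2, NA, 3]
```

Here v.na marks the third slot as missing, and whatever v.data holds in that slot is never meant to be read.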

To implement parallel element-wise data operations on data vectors, one would do something like this:

./(v::DataVec, w::DataVec) = DataVec(v.data ./ w.data, v.na | w.na)

That way NA is poisonous: if either value in such an operation is NA, the corresponding result value is NA.
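As a rough illustration in current Julia, where ./ is broadcast syntax and cannot be defined as a function, the same idea can be written as a named helper. The name divide is hypothetical.

```julia
struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Hypothetical helper: divide the payloads and OR the NA masks together.
divide(v::DataVec, w::DataVec) = DataVec(v.data ./ w.data, v.na .| w.na)

v = DataVec([1.0, 4.0, 9.0], [false, true, false])
w = DataVec([2.0, 2.0, 3.0], [false, false, true])
r = divide(v, w)   # slots 2 and 3 are NA because one of the inputs was NA there
```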

For aggregating operations, like sum, mean, var, etc. you would write the underlying operation to ignore NA values. Something like this:

sum(v::DataVec)  = sum(v.data[!v.na])
mean(v::DataVec) = mean(v.data[!v.na])
var(v::DataVec)  = var(v.data[!v.na])

except that real implementations may want to avoid creating a temporary subarray for this computation. One thing to point out is what happens when there are no non-NA values for something like mean: the answer is NaN. This just falls out of the definition of mean. NA and NaN are not the same thing, and NaN is the correct answer here. The same thing would happen for var when there aren't at least two non-NA values. Sum will automatically return a zero of the correct underlying data type when there are no non-NA values.
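The edge cases described here can be checked with a small sketch in current Julia. It assumes the Statistics standard library for mean and var, and the helper name present is hypothetical. It still allocates the temporary vector mentioned above.

```julia
using Statistics  # standard library: mean, var

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Hypothetical helper: the non-NA values as a plain vector (allocates a copy).
present(v::DataVec) = v.data[.!v.na]

Base.sum(v::DataVec) = sum(present(v))
Statistics.mean(v::DataVec) = mean(present(v))
Statistics.var(v::DataVec) = var(present(v))

some_na = DataVec([1.0, 2.0, 3.0], [false, true, false])
all_na  = DataVec([0, 0], [true, true])

sum(some_na)    # adds 1.0 and 3.0 only
sum(all_na)     # zero of the element type (Int here)
mean(all_na)    # NaN: the mean of zero values
```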

One question is what to return when getting a single value from a data vector. I think this is simple:

ref{T}(v::DataVec{T}, i::Int) = v.na[i] ? NA : v.data[i]

This has a return type signature of Union(NAType,T), which is a bit unfortunate, but probably just fine — you shouldn't be writing big computations on data frames this way; you should use the underlying data and na vectors instead. Of course, you can write naïve code that works, but it will be slower.
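A sketch of both styles in current Julia, where ref is spelled Base.getindex and the union is written Union{NAType,T}. The function count_above is a hypothetical example of working on the underlying data and na vectors directly instead of going through scalar indexing.

```julia
struct NAType end
const NA = NAType()

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Scalar access: returns either NA or a value of type T.
Base.getindex(v::DataVec, i::Int) = v.na[i] ? NA : v.data[i]

# Hypothetical bulk operation that reads the two fields directly.
function count_above(v::DataVec, threshold)
    n = 0
    for i in eachindex(v.data)
        if !v.na[i] && v.data[i] > threshold
            n += 1
        end
    end
    return n
end

v = DataVec([1, 2, 3], [false, true, false])
```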

As I said at the outset, a DataFrame would then just be an ordered, named bundle of DataVec objects of heterogeneous type.

So, RFC: what do the pro R users here think of this? Does this seem sensible? Does it satisfy what they need? There are a few reasons I like it:
  1. Doesn't complicate underlying value types: Int, Float64, String, etc.
  2. Having NA as a standalone value in general computation doesn't really make much sense; it makes sense in the context of a collection of data, which is how it's used here. The NA value is just a special value that lets one conveniently indicate unavailableness.
  3. Will work with any underlying data type: if someone defines a Date type with appropriate operations, it will immediately be usable in a DataFrame.
  4. As usual in Julia, the entire implementation is transparently exposed to the programmer. They can see what's going on, and therefore understand it. They can also, if need be, mess around with it, although that may not always be advised.

Tim Holy

Mar 22, 2012, 1:46:18 PM
to juli...@googlegroups.com
Hi Stefan,

This is interesting, and timely because it interacts with what we're thinking
about with the image library. I gather this is R-like, and I've not used R, so
I can't comment as fully as others surely can.

On Thursday, March 22, 2012 09:51:32 am Stefan Karpinski wrote:
> The basic idea is to have a parametric DataVec{T} type something like this:
>
> type DataVec{T}
> data::Vector{T}
> na::AbstractVector{Bool}

I'd add:
fillvalue::T
That controls what gets put in place of an NA when you construct the object. I
suspect one wants it to be 0 in most cases, but might as well make it flexible.

The rest seems reasonable to me, with the caveat that I don't know what a
DataFrame is.

Best,
--Tim

Douglas Bates

Mar 22, 2012, 1:56:57 PM
to juli...@googlegroups.com
On Thursday, March 22, 2012 9:51:32 AM UTC-5, Stefan Karpinski wrote:
> This proposal is the result of a rather fruitful discussion with Harlan Harris that started over a beer in person (best way to have them!) on Monday and has continued a bit by email since then. I think it's ripe enough to put out here for some feedback from people.
>
> The basic idea is to have a parametric DataVec{T} type something like this:
>
> type DataVec{T}
>     data::Vector{T}
>     na::AbstractVector{Bool}
>
>     # inner constructors...
> end
> # outer constructors...
>
> Then a DataFrame type would be an ordered, named bundle of DataVec, which can have different types. Using the trick of indexing into a type to construct "vectors" of that type, we could make this work:

The approach in R has been somewhat different.  The easy case is floating point types where NA is simply one specific pattern of the available NaN representations.  The integer, factor, etc. NA's are also specific patterns and the overhead in the arithmetic operations is in checking for that pattern.  I can understand that it would not be desirable to compromise the performance of arithmetic on vectors in Julia by making this part of the language for all operations on integer types but the proposed scheme adds some overhead in storage and processing even for the floating point case, where NaN values should be handled by the hardware.

Your examples of operations below are not the default in R.  In R an NA always propagates so sum, mean, var, etc. of a vector with an NA is NA except when the optional argument na.rm is true (default is false).  

I can see the sense in the approach you have outlined but I must admit I still don't feel comfortable with it.  It may be best to do some trial implementations and check them out to see if they feel R-like.  To me the ability to handle missing values in computation is important but not the "raison d'etre" of a data frame.  An important aspect of data frames that does not seem to be part of this definition is the requirement that all the vectors have the same length so that the data frame can be regarded as a table and indexed like a matrix.




Harlan Harris

Mar 22, 2012, 2:12:00 PM
to juli...@googlegroups.com
On Thu, Mar 22, 2012 at 1:56 PM, Douglas Bates <dmb...@gmail.com> wrote:
> The approach in R has been somewhat different.  The easy case is floating point types where NA is simply one specific pattern of the available NaN representations.  The integer, factor, etc. NA's are also specific patterns and the overhead in the arithmetic operations is in checking for that pattern.  I can understand that it would not be desirable to compromise the performance of arithmetic on vectors in Julia by making this part of the language for all operations on integer types but the proposed scheme adds some overhead in storage and processing even for the floating point case, where NaN values should be handled by the hardware.

Yes, I think the plan is to eventually optimize the implementation for content with bits to spare (float, bool), but to write the general case first.
 
> Your examples of operations below are not the default in R.  In R an NA always propagates so sum, mean, var, etc. of a vector with an NA is NA except when the optional argument na.rm is true (default is false).

Yes, I'm not totally sure if I agree with Stefan on this one either. It seems to me that operations that are undefined with NAs should either propagate, like R, or maybe throw an error. Note that with properly constructed iterators, an expression like "mean(naFilter(x))" should be very fast. I guess I'd vote for throwing errors and iterators as a replacement for "na.rm=TRUE". "mean(naReplace(x,0))" should work too, and would be generally useful.
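One way the two wrappers could look in current Julia is as lazy generators over the two underlying vectors. This is only a sketch: the names naFilter and naReplace come from the paragraph above, while implementing them as generators is an assumption.

```julia
using Statistics  # standard library: mean

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Lazily yields only the non-NA values; nothing is copied.
naFilter(v::DataVec) = (x for (x, isna) in zip(v.data, v.na) if !isna)

# Lazily yields every value, substituting c wherever the NA flag is set.
naReplace(v::DataVec, c) = (isna ? c : x for (x, isna) in zip(v.data, v.na))

x = DataVec([1.0, 2.0, 6.0], [false, true, false])
mean(naFilter(x))          # averages 1.0 and 6.0
mean(naReplace(x, 0.0))    # averages 1.0, 0.0 and 6.0
```

Because mean only needs something it can iterate over, it is written once and works on either wrapper.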


> I can see the sense in the approach you have outlined but I must admit I still don't feel comfortable with it.  It may be best to do some trial implementations and check them out to see if they feel R-like.

I agree with trying different options. Although I think the goal is to have R-like functionality (or better) with Julian feel, which may be somewhat different...
 
> To me the ability to handle missing values in computation is important but not the "raison d'etre" of a data frame.  An important aspect of data frames that does not seem to be part of this definition is the requirement that all the vectors have the same length so that the data frame can be regarded as a table and indexed like a matrix.

Yes, that's just an omission. There are more details for data frames (or tables, or whatever) that would need to be hashed out... (Oh, someone asked what a data.frame is -- it's basically a columnar database table, and is a critical representation for the types of heterogeneous data that statisticians and social scientists, among others, use constantly.)

 -Harlan

Tim Holy

Mar 22, 2012, 3:06:30 PM
to juli...@googlegroups.com
Hi,

> > Your examples of operations below are not the default in R. In R an NA
> > always propagates so sum, mean, var, etc. of a vector with an NA is NA
> > except when the optional argument na.rm is true (default is false).
>
> Yes, I'm not totally sure if I agree with Stefan on this one either. It
> seems to me that operations that are undefined with NAs should either
> propagate, like R, or maybe throw an error. Note that with properly
> constructed iterators, an expression like "mean(naFilter(x))" should be
> very fast. I guess I'd vote for throwing errors and iterators as a
> replacement for "na.rm=TRUE". "mean(naReplace(x,0))" should work too, and
> would be generally useful.

Likewise, matlab provides separate functions, "sum" and "nansum."

In julia, presumably a cleaner way would be this:

julia> x = DataVec[...] # define this with some NAs in it
julia> sum(x)
NaN
julia> xna = convert(DataVecNA,x)
julia> sum(xna)
172
julia> y = convert(DataVec,xna)
julia> sum(y)
NaN

The DataVec type can _store_ NA information for any type (integers included),
but those NAs are treated as NaN for the purposes of computation. The
DataVecNA triggers the running of algorithms that skip over NAs. Both would
have exactly the same fields & storage (the "convert" is really just syntactic
sugar), and so interconversion involves no overhead. In other words, julia's
type system replaces the need for a na.rm field.
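A rough sketch of this type-switching idea in current Julia, restricted to floating-point data so that NaN is representable. The field layout and the choice to share (not copy) the underlying arrays on conversion are assumptions made for illustration.

```julia
struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Same fields as DataVec; the type alone selects the NA-skipping methods.
struct DataVecNA{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Conversions rewrap the same arrays, so no data is copied.
Base.convert(::Type{DataVecNA}, v::DataVec) = DataVecNA(v.data, v.na)
Base.convert(::Type{DataVec}, v::DataVecNA) = DataVec(v.data, v.na)

# DataVec: any NA poisons the result. DataVecNA: NAs are skipped.
Base.sum(v::DataVec{T}) where {T<:AbstractFloat} = any(v.na) ? T(NaN) : sum(v.data)
Base.sum(v::DataVecNA) = sum(v.data[.!v.na])

x = DataVec([100.0, 0.0, 72.0], [false, true, false])
xna = convert(DataVecNA, x)
y = convert(DataVec, xna)
```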

Perhaps this is what you were proposing, Stefan.

Best,
--Tim


Harlan Harris

Mar 22, 2012, 3:20:02 PM
to juli...@googlegroups.com
On Thu, Mar 22, 2012 at 3:06 PM, Tim Holy <tim....@gmail.com> wrote:
> Likewise, matlab provides separate functions, "sum" and "nansum."
>
> In julia, presumably a cleaner way would be this:
>
> julia> x = DataVec[...]  # define this with some NAs in it
> julia> sum(x)
> NaN
> julia> xna = convert(DataVecNA,x)
> julia> sum(xna)
> 172
> julia> y = convert(DataVec,xna)
> julia> sum(y)
> NaN
>
> The DataVec type can _store_ NA information for any type (integers included),
> but those NAs are treated as NaN for the purposes of computation. The
> DataVecNA triggers the running of algorithms that skip over NAs. Both would
> have exactly the same fields & storage (the "convert" is really just syntactic
> sugar), and so interconversion involves no overhead. In other words, julia's
> type system replaces the need for a na.rm field.

I don't like returning NaN because of NAs. That's not what NaN means. I'm also concerned that "running of algorithms that skip over NAs" sounds like implementing everything twice (at least). An advantage of the iterator naFilter wrapper is that it's totally generic, so you only have to write mean() once. And there's no conversion using naFilter either -- it just returns the next non-NA data in the original data type.

Oh, and in case it wasn't clear in my suggestion, if a Data vector has no NAs, then mean(x) works as expected. The version of mean for Data will check for NAs, throw an error if there is one, then call mean() on the original data vector.
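A sketch of that strict behaviour in current Julia, assuming the Statistics standard library. The error type and message are placeholders chosen for illustration.

```julia
using Statistics  # standard library: mean

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Refuse to guess: error out if any NA is present, otherwise defer to the plain mean.
function Statistics.mean(v::DataVec)
    any(v.na) && throw(ArgumentError("mean: data contains NA; filter or replace it first"))
    return mean(v.data)
end

clean = DataVec([1.0, 3.0], [false, false])
dirty = DataVec([1.0, 3.0], [false, true])
mean(clean)     # works as for a plain vector
# mean(dirty)   # would throw an ArgumentError
```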

We'll have to think about how to do slice and index operations on the vectors efficiently: x[3:17], x[x>4], etc...

I do agree with avoiding na.rm options, in any case.

 -Harlan

 

Tim Holy

Mar 22, 2012, 3:37:07 PM
to juli...@googlegroups.com
Hi Harlan,

On Thursday, March 22, 2012 02:20:02 pm Harlan Harris wrote:
> I don't like returning NaN because of NAs. That's not what NaN means.

I agree that's ugly. Going back and re-reading, you're proposing they give an
error when there are no valid items? I like that idea. Errors are good.

> I'm
> also concerned that "running of algorithms that skip over NAs" sounds like
> implementing everything twice (at least). An advantage of the iterator
> naFilter wrapper is that it's totally generic, so you only have to write
> mean() once. And there's no conversion using naFilter either -- it just
> returns the next non-NA data in the original data type.

Won't that kill random-access performance? Suppose I ask for x[100000]? Or are
you proposing this only for operations that don't need that kind of access?

From the standpoint of someone interested in local operations on
multidimensional arrays, it's much more efficient to have one random-access
variable storing real values (with 0 filled in for NaNs), and a second
similarly-addressable variable storing a flag for the NaNs. But different
strategies are presumably warranted for different applications.

Best,
--Tim

Harlan Harris

Mar 22, 2012, 3:55:19 PM
to juli...@googlegroups.com
On Thu, Mar 22, 2012 at 3:37 PM, Tim Holy <tim....@gmail.com> wrote:
> On Thursday, March 22, 2012 02:20:02 pm Harlan Harris wrote:
> > I don't like returning NaN because of NAs. That's not what NaN means.
>
> I agree that's ugly. Going back and re-reading, you're proposing they give an
> error when there are no valid items? I like that idea. Errors are good.

How very Pythonic of you to say that. :) I'm not strongly opposed to NA propagation in vector operations, but I think errors and appropriate wrappers are potentially cleaner, at least for Julia...
 

> > I'm
> > also concerned that "running of algorithms that skip over NAs" sounds like
> > implementing everything twice (at least). An advantage of the iterator
> > naFilter wrapper is that it's totally generic, so you only have to write
> > mean() once. And there's no conversion using naFilter either -- it just
> > returns the next non-NA data in the original data type.
>
> Won't that kill random-access performance? Suppose I ask for x[100000]? Or are
> you proposing this only for operations that don't need that kind of access?

The latter. Stefan is proposing that x[100000] would return a single value of type Union(Float, NAType) in constant time. My proposal is that you'd have to manually wrap the vector in the iterator-generator if you wanted something like the 100000th non-NA element: ref(naFilter(x), 100000) or something -- and that would be O(n).

Also note that the convention of naming functions that generate iterators in mixed case suggests that naFilter(x) returns an iterator, nafilter(x) generates a new, possibly smaller vector, naReplace(x, c) returns an iterator of the same length as x, replacing NAs with the parameter c, and nareplace(x, c) returns a copied vector with NAs replaced.
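The four names, plus the O(n) lookup of the n-th non-NA element, might be sketched in current Julia as follows. Implementing the lazy versions as generators and the helper name nth_present are assumptions made for illustration.

```julia
struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Mixed case: lazy iterators over the original storage (no copy).
naFilter(v::DataVec) = (x for (x, isna) in zip(v.data, v.na) if !isna)
naReplace(v::DataVec, c) = (isna ? oftype(x, c) : x for (x, isna) in zip(v.data, v.na))

# Lower case: eager versions that allocate a new plain vector.
nafilter(v::DataVec) = collect(naFilter(v))
nareplace(v::DataVec, c) = collect(naReplace(v, c))

# Hypothetical helper: the n-th non-NA element, found by walking the iterator, so O(n).
nth_present(v::DataVec, n::Int) = first(Iterators.drop(naFilter(v), n - 1))

x = DataVec([10, 20, 30, 40], [false, true, false, false])
```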

> From the standpoint of someone interested in local operations on
> multidimensional arrays, it's much more efficient to have one random-access
> variable storing real values (with 0 filled in for NaNs), and a second
> similarly-addressable variable storing a flag for the NaNs. But different
> strategies are presumably warranted for different applications.

NaNs or NAs? 0/0 or missing? If the former, I'd avoid this data type entirely and just use a matrix of IEEE floats!

Stefan, have I confused everyone?

 -Harlan



Stefan Karpinski

Mar 22, 2012, 4:16:01 PM
to juli...@googlegroups.com
On Thu, Mar 22, 2012 at 1:56 PM, Douglas Bates <dmb...@gmail.com> wrote:

> The approach in R has been somewhat different.  The easy case is floating point types where NA is simply one specific pattern of the available NaN representations.  The integer, factor, etc. NA's are also specific patterns and the overhead in the arithmetic operations is in checking for that pattern.  I can understand that it would not be desirable to compromise the performance of arithmetic on vectors in Julia by making this part of the language for all operations on integer types but the proposed scheme adds some overhead in storage and processing even for the floating point case, where NaN values should be handled by the hardware.

We could accommodate that without complicating existing types "simply" by defining new bitstypes that behave similarly to Float64, etc.:

bitstype 64 Float64NA <: Float
bitstype 64 Int64NA <: Signed

Then duplicate the normal float behavior with some tweaks. However, that's an awful lot of duplication and doesn't extend to any new types someone might want to use.

I also think that the approach I'm proposing would have less computational overhead in most cases. Suppose the na vectors are sparse by default and generally are quite sparse. In that case, the v.na | w.na computation is really cheap and fast because it only takes time proportional to the number of actual NAs, while the v.data./w.data computation goes at full vectorized floating point speed without having to stop and mess around with special NaN values. Put another way, the way R handles NAs, the overhead for dealing with them is always proportional to the amount of data, whereas with this approach, using sparse na vectors, the overhead is proportional to the number of NAs, which is typically much, much smaller.

In cases where the number of NAs is much higher, you can use a dense na vector and get better performance and improved storage. That's another advantage of this scheme: you can change the structure used for storing the na vector. Sparse is one option. We've also long talked about having a BitVector type that subtypes AbstractVector{Bool} and stores boolean data compactly (8 values per byte). No one's implemented it yet, but it shouldn't be too hard. Then you get even better storage and performance since you should get at least an 8x speedup in the v.na | w.na operation and an 8x improvement in storage overhead. The whole thing also gives the user options: do you know your data has almost no NAs? Great, use a sparse storage type. Do you know it has lots of NAs? Great, use a dense bit-vector type. The defaults, of course, should be chosen to work well enough for all cases.
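To illustrate the pluggable-storage point, here is a sketch in current Julia, where a BitVector type and a SparseArrays standard library now exist. The helper name combined_na is hypothetical, and no performance claim is made for either representation.

```julia
using SparseArrays  # standard library: sparsevec

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Hypothetical helper: the NA mask of any element-wise binary operation.
combined_na(v::DataVec, w::DataVec) = v.na .| w.na

xs = [1.0, 2.0, 3.0, 4.0]

# Dense, bit-packed masks (one bit per element).
a = DataVec(xs, BitVector([false, true, false, false]))
b = DataVec(xs, BitVector([false, false, false, true]))
dense_mask = combined_na(a, b)

# Sparse masks that store only the positions of the NAs.
c = DataVec(xs, sparsevec([2], [true], 4))
d = DataVec(xs, sparsevec([4], [true], 4))
sparse_mask = combined_na(c, d)
```

The DataVec definition is the same in both cases; only the object placed in the na field changes.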

> Your examples of operations below are not the default in R.  In R an NA always propagates so sum, mean, var, etc. of a vector with an NA is NA except when the optional argument na.rm is true (default is false).

I vaguely remember this from my days of doing more R programming, i.e. I remember passing na.rm=TRUE a lot. I wonder why ignoring NAs isn't the default? I feel like the usual order of business was to run something, see that the result was NA, curse a bit, go back and add na.rm=TRUE, then run it again. Doesn't it make more sense to have ignoring NAs be the default? Of course, we can make the default either way, and once there is keyword argument support, allow changing that behavior the same way that R does.

> I can see the sense in the approach you have outlined but I must admit I still don't feel comfortable with it.  It may be best to do some trial implementations and check them out to see if they feel R-like.  To me the ability to handle missing values in computation is important but not the "raison d'etre" of a data frame.  An important aspect of data frames that does not seem to be part of this definition is the requirement that all the vectors have the same length so that the data frame can be regarded as a table and indexed like a matrix.

As Harlan wrote, that was just an omission in my description — the data vectors in a data frame should all be required to be of the same length. They would certainly support indexing and slicing. Although slicing data frames will definitely not work exactly the way it does in R — in particular, this is not going to be possible:

> data = data.frame(foo=c(1,2,3), bar=c("a","b","c"))
> data[1]
  foo
1   1
2   2
3   3
> data[1,]
  foo bar
1   1   a
> data[,1]
[1] 1 2 3

I'm not quite sure how slicing ought to work, but indexing with one thing probably should return a DataVec object, which could easily be accessed either by name or by index (easily distinguished by type and implemented using dispatch). Here are some examples of how I picture interacting with data frames:

julia> data = DataFrame[
         "foo"  "bar"
            1     "a"
            2     "b"
            3     "c"
       ]
3x2 DataFrame:
   foo  bar
1    1  "a"
2    2  "b"
3    3  "c"

julia> data["foo"]
DataVec{Int64}:
   foo
1    1
2    2
3    3

julia> data["bar"]
DataVec{String}:
   bar
1  "a"
2  "b"
3  "c"

julia> data["foo",2]
2

julia> data["bar",2]
"b"

julia> data["bar",2:end]
DataVec{String}:
   bar
2  "b"
3  "c"

julia> data[:,2]
1x2 DataFrame:
   foo  bar
1    2  "b"

julia> data[:,2:3]
2x2 DataFrame:
   foo  bar
1    2  "b"
2    3  "c"

Something like that. (Makes me think that data vectors ought to have names, and data frames are just an ordered collection of data vectors.) Does that example make it feel a bit more R-like? The NA stuff is really just an implementation detail. From the user perspective, I think this should all behave in a very R-like manner, with some changes to how data frame slicing is written.
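A small sketch of how the name-based indexing above could be wired up with dispatch in current Julia. It covers only lookup by column name, by name and row, and by name and row range; the field names, the equal-length check, and the use of a constructor call in place of the matrix-style DataFrame[...] literal are assumptions.

```julia
struct NAType end
const NA = NAType()

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end
Base.length(v::DataVec) = length(v.data)

# An ordered, named bundle of equal-length data vectors of possibly different types.
struct DataFrame
    names::Vector{String}
    columns::Vector{DataVec}

    function DataFrame(names, columns)
        length(names) == length(columns) ||
            throw(ArgumentError("need exactly one name per column"))
        length(unique(map(length, columns))) <= 1 ||
            throw(ArgumentError("all columns must have the same length"))
        return new(names, columns)
    end
end

# data["foo"]: a whole column, looked up by name.
function Base.getindex(df::DataFrame, name::String)
    j = findfirst(==(name), df.names)
    j === nothing && throw(KeyError(name))
    return df.columns[j]
end

# data["foo", 2]: a single value, or NA.
function Base.getindex(df::DataFrame, name::String, row::Int)
    col = df[name]
    return col.na[row] ? NA : col.data[row]
end

# data["bar", 2:3]: part of a column, still a DataVec.
function Base.getindex(df::DataFrame, name::String, rows::AbstractVector{Int})
    col = df[name]
    return DataVec(col.data[rows], col.na[rows])
end

data = DataFrame(["foo", "bar"],
                 [DataVec([1, 2, 3], falses(3)), DataVec(["a", "b", "c"], falses(3))])
```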

Stefan Karpinski

Mar 22, 2012, 4:23:27 PM
to juli...@googlegroups.com
On Thu, Mar 22, 2012 at 3:37 PM, Tim Holy <tim....@gmail.com> wrote:
 
> On Thursday, March 22, 2012 02:20:02 pm Harlan Harris wrote:
> > I don't like returning NaN because of NAs. That's not what NaN means.
>
> I agree that's ugly. Going back and re-reading, you're proposing they give an
> error when there are no valid items? I like that idea. Errors are good.

There are two different questions here.

The first is what to do when encountering an NA when operating in a mode that doesn't like NAs. In that case, throwing an error seems right: you are doing an operation that doesn't support NAs and you encountered one, the only correct answer is to bail out immediately.

The second is what to do when taking an operation like mean or var without enough non-NA data. That can happen either because your data is just too short or because there are too many NAs. In the case of mean, the correct answer is to return NaN because that's what the mean of a zero-length vector is.

Stefan Karpinski

Mar 22, 2012, 4:26:08 PM
to juli...@googlegroups.com
I don't really get why ignoring NAs isn't the default for things like sum and mean. Seems crazy and useless to me. I wonder if it doesn't simply stem from implementing NA using NaNs, where that's what the IEEE semantics force on you because NaN is poisonous...

Stefan Karpinski

Mar 22, 2012, 5:02:02 PM
to juli...@googlegroups.com
Also, I wanted to say here that I hugely appreciate the feedback in this thread. I'm confident that we can get to a solution here that's both truly Julian and makes users coming from R happy initially and even happier once they get more used to it.

kem

Mar 22, 2012, 6:59:32 PM
to juli...@googlegroups.com
Forgive me if my thinking is fuzzy here--I'm trying to catch up after a conference.

I really like the naFilter() idea -- it's similar to na.omit() in R, which I seemed to use a lot more early on with R, and I agree that it's pretty generalizable and intuitive.

One thing to consider, though, is the implicit standards that are being created by NA handling as it applies to more complex functions.

The solutions that seem to be being discussed are to either implicitly ignore the NA values, throw an error (unless NA values are explicitly ignored), or propagate the NA values.

With a function like mean(), it might seem reasonable to either implicitly ignore NAs, or throw an error unless NAs are explicitly ignored.

However, when you get to a more complex function like predict() (e.g., used to calculate expected values of outcome variables from predictors in a regression), thinking about NA handling becomes a little different. There, propagating NA values seems like a valuable option, because it clarifies that the predictors contain NA values. If you threw an error, it would be somewhat annoying and difficult because you could calculate expected values for some cases in the dataframe. It's unclear what ignoring the NA values would mean--in some cases it would be undefined, but in other cases it might be defined (e.g., ML prediction using available data but ignoring missing data, which I wish R would incorporate more by default).

In R, it's conventional to be able to specify NA handling explicitly in one of the three ways (which is in my mind implemented unevenly across functions). It might be nice to encourage similar practices in Julia,  by regularly creating explicit methods for the three options--naIgnore(), naProp(), and maybe an error by default or something.

Sorry if I'm being unclear with this--I guess my point is that I'm not sure there can be a single approach to NA handling, that it's probably handled best at the function level rather than the dataframe/structure level, and that it would be nice if some implicit standard or convention were established that covered the bases.
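As one possible shape for such a convention, the three behaviours could be selected per call with a keyword argument in current Julia. The function name na_mean, the keyword policy, and its three symbols are all hypothetical, as is the choice of erroring by default.

```julia
using Statistics  # standard library: mean

struct NAType end
const NA = NAType()

struct DataVec{T}
    data::Vector{T}
    na::AbstractVector{Bool}
end

# Hypothetical per-function NA policy: :error (default), :ignore, or :propagate.
function na_mean(v::DataVec; policy::Symbol = :error)
    if !any(v.na)
        return mean(v.data)              # no NAs: nothing to decide
    elseif policy === :ignore
        return mean(v.data[.!v.na])      # drop NAs, like na.rm=TRUE in R
    elseif policy === :propagate
        return NA                        # any NA makes the result NA
    elseif policy === :error
        throw(ArgumentError("na_mean: data contains NA"))
    else
        throw(ArgumentError("unknown NA policy: $policy"))
    end
end

v = DataVec([1.0, 2.0, 6.0], [false, true, false])
na_mean(v; policy = :ignore)      # averages 1.0 and 6.0
na_mean(v; policy = :propagate)   # NA
```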

kem

Mar 22, 2012, 7:16:09 PM
to juli...@googlegroups.com
I was going to say that error-throwing should be the default, but then I thought of another reason why propagating NAs is common in R: conversion of dataframes to matrices, and subsequent matrix algebra.

It's very common for me to convert a dataframe where all variables are of the same [numeric] type into a matrix, and then proceed to do linear algebra operations on them. How would you do these conversions in Julia with NAs, and how would you handle them? It's very nice to have the NAs propagate through matrix multiplication, etc. where it makes sense to do so.

Maybe that's too big of an issue, or different, but it seems important to me to streamline the conversion from dataframe to matrix as much as possible, because it's a big strength of R--being able to seamlessly go from the data to the underlying math, without erecting barriers.


Harlan Harris

Mar 23, 2012, 7:54:32 AM
to juli...@googlegroups.com
Kristian,

Great feedback. Thanks.


Yes, it would make sense to have a DataMat type that implements a masked matrix, in addition to the DataVec type that implements a masked vector (and forms the basis for DataTable or whatever). And appropriate functions could do that transformation, retaining NAs. More broadly, we've talked about the need to implement equivalents to formulae, model matrices, etc., to do the DataTable to DataMat (or whatever) transformations.
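A sketch of what a masked matrix with NA-propagating multiplication might look like in current Julia. The function name matmul is hypothetical, and the sketch assumes that slots flagged NA hold some finite placeholder (such as 0) in the data field; result entries whose flag is set are meaningless.

```julia
struct DataMat{T}
    data::Matrix{T}
    na::AbstractMatrix{Bool}
end

# Entry (i, j) of the product is NA if row i of `a` or column j of `b` contains an NA.
function matmul(a::DataMat, b::DataMat)
    row_has_na = any(a.na, dims = 2)   # n-by-1
    col_has_na = any(b.na, dims = 1)   # 1-by-m
    return DataMat(a.data * b.data, row_has_na .| col_has_na)
end

a = DataMat([1.0 2.0; 3.0 0.0], [false false; false true])   # a[2, 2] is NA
b = DataMat([1.0 0.0; 0.0 1.0], falses(2, 2))
c = matmul(a, b)   # the whole second row of c is NA
```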

However, when you get to a more complex function like predict() (e.g., used to calculate expected values of outcome variables from predictors in a regression), thinking about NA handling becomes a little different. There, propagating NA values seems like a valuable option, because it clarifies that the predictors contain NA values. If you threw an error, it would be somewhat annoying and difficult because you could calculate expected values for some cases in the dataframe. It's unclear what ignoring the NA values would mean--in some cases it would be undefined, but in other cases it might be defined (e.g., ML prediction using available data but ignoring missing data, which I wish R would incorporate more by default).

Of course. Any function that's designed to work on DataVec/DataTable objects could handle NAs as expected. So, predict(model::ModelType, data::DataTable) would, if the ModelType allowed, do the right thing with NAs. The various functions/iterators that get rid of NAs would only be needed when you're trying to convert a Data type to a base type.



In R, it's conventional to be able to specify NA handling explicitly in one of the three ways (which is in my mind implemented unevenly across functions). It might be nice to encourage similar practices in Julia,  by regularly creating explicit methods for the three options--naIgnore(), naProp(), and maybe an error by default or something.

Yes, I completely agree, and I think that's what I'm proposing. Functions whose computations are meaningless on NA should throw an error, but the language should include easy transformations to get rid of NAs. Functions whose computations are defined for NAs (such as many statistical/ML algorithms) should accept data of that type and do the right thing. R does this pretty well, except that the na.rm=TRUE convention would be better written functionally, I'd argue.
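
A minimal sketch of the functional convention described above, assuming a masked-vector layout like the DataVec in the original proposal. It uses current Julia syntax (struct, dot-broadcasting) rather than the 2012 syntax used in this thread, and MaskedVec, nafilter, nareplace, and strictmean are hypothetical names for illustration only, not the actual implementation.

```julia
using Statistics  # provides mean in current Julia

# Hypothetical masked vector: na[i] == true marks data[i] as missing.
struct MaskedVec{T}
    data::Vector{T}
    na::Vector{Bool}
end

# Drop NAs, returning a plain Vector of the remaining values.
nafilter(v::MaskedVec) = v.data[.!v.na]

# Replace NAs with r, returning a plain Vector of the same length.
nareplace(v::MaskedVec{T}, r::T) where {T} =
    T[v.na[i] ? r : v.data[i] for i in eachindex(v.data)]

# An aggregate that refuses to guess: it errors if any NA is present.
function strictmean(v::MaskedVec)
    any(v.na) && throw(ArgumentError("input contains NA; apply nafilter or nareplace first"))
    return mean(v.data)
end

v = MaskedVec([1.0, 0.0, 3.0], [false, true, false])  # the 0.0 is a placeholder under the mask
mean(nafilter(v))         # mean of the non-NA values only
mean(nareplace(v, 0.0))   # mean with NA treated as 0.0
```

Under this convention the caller states the NA policy by composing functions, instead of passing a flag such as na.rm=TRUE.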
 

Sorry if I'm being unclear with this--I guess my point is that I'm not sure there can be a single approach to NA handling, that it's probably handled best at the function level rather than the dataframe/structure level, and that it would be nice if some implicit standard or convention were established that covered the bases.

Yep, we agree. (And I agree with almost everything Stefan said in his last few messages, too! Exception being mean() etc. working on Data vectors with NAs by default -- I think they shouldn't.)

  -Harlan
 

kem

Mar 23, 2012, 12:08:34 PM
to juli...@googlegroups.com
On Thursday, March 22, 2012 3:16:01 PM UTC-5, Stefan Karpinski wrote:
Does that example make it feel a bit more R-like?

That made sense to me until I got to here:




julia> data[:,2]
1x2 DataFrame:
   foo  bar
1    2  "b"

julia> data[:,2:3]
2x2 DataFrame:
   foo  bar
1    2  "b"
2    3  "c"



After I thought about it, it made sense to me--I think it would take some getting used to but I think I'd understand it. Just to carry it a bit further. If you had this:

 
julia> data = DataFrame[
         "foo"  "bar"  "var"
            1     "a"    "x"
            2     "b"    "y"
            3     "c"    "z"
       ]
3x3 DataFrame:
   foo  bar  var
1    1  "a"  "x"
2    2  "b"  "y"
3    3  "c"  "z"

You'd get this?

julia> data[2:3,:]
3x2 DataFrame:
   bar  var
1  "a"  "x"
2  "b"  "y"
3  "c"  "z"

Indexing in the way you're suggesting isn't totally unfamiliar, because you can treat data frames as a special type of list in R. E.g., in R:

x = as.data.frame(list(a=rnorm(3), b=runif(3), c=rbeta(3,50,1)))

            a         b         c
1  0.01771559 0.6852642 0.9951766
2 -0.12215320 0.8791794 0.9819956
3  0.91772831 0.3895959 0.9997918

> x[[3]][3]
[1] 0.9997918
> x[[3]][1:3]
[1] 0.9951766 0.9819956 0.9997918
> x[["b"]][2]
[1] 0.8791794

but:

> x[[2:3]]

[1] 0.3895959


This is just to illustrate that the convention you're proposing isn't totally "un-R-like."

For me, though, this then raises the question of indexing in nested lists in Julia, because in R there's some symmetry between indexing in dataframes and lists, so the list notation is sort of intuitive. Would your notation extend to other data structures, or just be for data frames?

E.g.,

> y = list(x, rnorm(4))
> y
[[1]]
            a         b         c
1  0.01771559 0.6852642 0.9951766
2 -0.12215320 0.8791794 0.9819956
3  0.91772831 0.3895959 0.9997918

[[2]]
[1] -1.03428677 -0.01332083  1.63937353 -1.43546416

> y[[1]][[2]][2:3]
[1] 0.8791794 0.3895959



Harlan Harris

Mar 23, 2012, 6:48:56 PM
to juli...@googlegroups.com
All, I've made some progress on DataVec's, here: https://github.com/HarlanH/julia/tree/Data

In base/data.jl are definitions of NA, DataVec, etc.
In test/test_data.jl are test cases for various things, using the test suite (which is currently not in sysimg.jl?)

So:

load("base/data.jl")
load("base/test.jl")
tests("test/test_data.jl")

Shows all tests pass, so far. I've tried Int, Float, and String vectors so far with no problems.

Quite a bit to do yet, just on DataVec:

I've written nafilter() and nareplace(), but not naFilter() or naReplace(), which will require new types.
Ideally, String DataVecs should be implemented such that unique strings are only stored once.
copy_to() needs to be written
similar() needs to be written
logical indexing needs to be written
array indexing (e.g., y=[1,2,4]; x[y]) needs to be written
assign() needs to be written
and probably a lot more

A lot of the basic array functionality shouldn't be rewritten, such as append and reverse, I don't think.

 -Harlan

Douglas Bates

Mar 24, 2012, 5:38:37 AM
to juli...@googlegroups.com
On Thursday, March 22, 2012 3:26:08 PM UTC-5, Stefan Karpinski wrote:
I don't really get why ignoring NAs isn't the default for things like sum and mean. Seems crazy and useless to me. I wonder if it doesn't simply stem from implementing NA using NaNs, where that's what the IEEE semantics force on you because NaN is poisonous...

Because of alignment issues.  For a single vector it may make sense to skip the NAs but not when that vector is part of a data frame.

One operation that is particularly sensitive to NAs is evaluating correlations of several variables, something that social scientists do frequently.  The simple way of handling NAs in such an operation is the na.omit function which, when applied to a data frame, eliminates any row with an NA.  For some data frames, however, this would leave you with only a small, perhaps unrepresentative sample of the rows (I have seen cases where you would get less than 20% of the original rows).  An alternative is pairwise elimination where each pair of variables uses all the cases where both variables are not missing.  But this can produce a correlation matrix that is not positive semi-definite.
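
To make the two strategies concrete, here is a small sketch in current Julia syntax (not the 2012 syntax used in this thread). The data, the Bool masks (true = NA), and the helper pairwise_cor are made up for illustration. Listwise deletion mirrors what na.omit does; pairwise deletion uses, for each pair, the rows where both variables are present.

```julia
using Statistics  # provides cor in current Julia

# Three columns with Bool masks; values under a true mask are placeholders.
x = [1.0, 2.0, 3.0, 4.0, 5.0];  xna = [false, false, false, false, true]
y = [2.0, 1.0, 4.0, 3.0, 6.0];  yna = [false, true,  false, false, false]
z = [5.0, 3.0, 4.0, 1.0, 2.0];  zna = [false, false, false, false, false]

cols  = [x, y, z]
masks = [xna, yna, zna]

# Listwise deletion: keep only rows where no column is NA.
complete = .!(xna .| yna .| zna)
listwise = [cor(a[complete], b[complete]) for a in cols, b in cols]

# Pairwise deletion: each pair of columns uses the rows where both are present.
function pairwise_cor(a, ana, b, bna)
    keep = .!(ana .| bna)
    return cor(a[keep], b[keep])
end
pairwise = [pairwise_cor(cols[i], masks[i], cols[j], masks[j]) for i in 1:3, j in 1:3]
```

Here listwise deletion keeps three of the five rows for every entry, while the pairwise entries are each computed from a different subset of rows, which is why the resulting matrix is not guaranteed to be positive semi-definite.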

Because this is a tricky area, the choice was made to require the user to be specific about the handling of NAs.

Douglas Bates

Mar 24, 2012, 5:51:09 AM
to juli...@googlegroups.com
I'm surprised that this was not an error.  The double bracket indexing only makes sense for a single index.  It extracts the element rather than extracting the sub-list.  So if x is a list then x[<whatever>] is also a list but x[[<single name or index>]] can be of any type. This is why a single value is returned here.

Harlan Harris

Mar 24, 2012, 12:19:33 PM
to juli...@googlegroups.com
Quick design question, for the naFilter/naReplace methods on a DataVec.

The goal is to have a DataVec be iterable, returning a Union(T,NAType), but naFilter(x::DataVec) and naReplace(x::DataVec{T}, r::T) also return iterables. Seems like there are a couple ways to do that.

One option would be for naFilter/naReplace to return an object of class DataVecIterator, or something, that contains/wraps a DataVec:

type DataVecIterator
    datavec::DataVec
end

The start/next/done methods for a DataVecIterator would then have to dive through the extra layer of reference to get the data.

Another option would be for DataVec and FilteredDataVec and ReplacedDataVec to be implementations of an AbstractDataVec type, where almost all methods would refer to AbstractDataVec, except for start/next/done, which would have specific implementations that do the right thing. naFilter(d::DataVec) would then just create a FilteredDataVec, referring to the original data and na vectors. Similar for ReplacedDataVec.

A third option would be to keep just a single type, but add filterNA and replaceNA fields to the type, defaulting to false and nothing, respectively. naFilter and naReplace then just create new objects (referring to the old data) with the appropriate changes to those fields, which start/next/done would examine at run-time.
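
For comparison, the second option (an abstract type with specialized iteration) might look roughly like the sketch below. It uses current Julia syntax and the current iterate protocol rather than the start/next/done functions named above, and the field names and the length/eltype definitions are assumptions made for illustration.

```julia
struct NAType end
const NA = NAType()

abstract type AbstractDataVec{T} end

struct DataVec{T} <: AbstractDataVec{T}
    data::Vector{T}
    na::Vector{Bool}      # true where the value is missing
end

# Thin views that refer to the original data and na vectors.
struct FilteredDataVec{T} <: AbstractDataVec{T}
    parent::DataVec{T}
end
struct ReplacedDataVec{T} <: AbstractDataVec{T}
    parent::DataVec{T}
    replacement::T
end

naFilter(v::DataVec) = FilteredDataVec(v)
naReplace(v::DataVec{T}, r::T) where {T} = ReplacedDataVec(v, r)

Base.length(v::DataVec) = length(v.data)
Base.length(v::ReplacedDataVec) = length(v.parent.data)
Base.length(v::FilteredDataVec) = count(!, v.parent.na)
Base.eltype(::Type{DataVec{T}}) where {T} = Union{T,NAType}
Base.eltype(::Type{FilteredDataVec{T}}) where {T} = T
Base.eltype(::Type{ReplacedDataVec{T}}) where {T} = T

# A plain DataVec yields either a value or NA.
function Base.iterate(v::DataVec, i::Int=1)
    i > length(v.data) && return nothing
    return (v.na[i] ? NA : v.data[i], i + 1)
end

# The filtered view skips over NA positions.
function Base.iterate(v::FilteredDataVec, i::Int=1)
    p = v.parent
    while i <= length(p.data) && p.na[i]
        i += 1
    end
    i > length(p.data) && return nothing
    return (p.data[i], i + 1)
end

# The replaced view substitutes the replacement value at NA positions.
function Base.iterate(v::ReplacedDataVec, i::Int=1)
    p = v.parent
    i > length(p.data) && return nothing
    return (p.na[i] ? v.replacement : p.data[i], i + 1)
end

v = DataVec([1.0, 0.0, 3.0], [false, true, false])
collect(naFilter(v))          # [1.0, 3.0]
collect(naReplace(v, -1.0))   # [1.0, -1.0, 3.0]
```

Only the iteration methods differ per type; anything written against AbstractDataVec would be shared, and dispatch picks the right behavior without a run-time flag check.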

Which option is most Julian? Or is there a better option I missed?

 -Harlan

Stefan Karpinski

Mar 24, 2012, 1:21:17 PM
to juli...@googlegroups.com
Thanks for the explanation. For correlation/covariance calculations, this makes complete sense to me — it is definitely a very tricky business. For simpler things like mean and sum, I'm still not sure, but it seems that all the heavy R users here agree that ignoring NAs should not be the default, so I'll defer to that.

Stefan Karpinski

Mar 24, 2012, 1:45:12 PM
to juli...@googlegroups.com
I have to confess, I didn't really follow this example very well. What I proposed in my mocked up Julia code for data frames is basically using matrix-like slicing, Matlab style. However, there are a few troubling departures from matrix compatibility:
  1. The indexing is data[col,row] as compared to matrix[row,col].
  2. Writing data[col] picks out a vector rather than an element by linear indexing.
  3. Writing data[col,:] would slice an nx1 DataFrame rather than a n-elt DataVector.
These all kind of make sense to me in the context of data frames, but I worry a lot about introducing conflicting and/or confusing indexing behaviors into a single language that has both matrices and data frames.

Another thing to keep in mind is that we probably do *not* want to do things like write data[a][b] instead of data[a,b]. The former might work if data[a] is an object that can further be sliced by b, but this is not going to be magically optimized away — the intermediate object data[a] will get passed to a second ref operation for further slicing. Writing data[a,b] allows this to be done in a single operation without creating the data[a] object.

Harlan Harris

Mar 24, 2012, 2:12:28 PM
to juli...@googlegroups.com
On Sat, Mar 24, 2012 at 1:45 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
I have to confess, I didn't really follow this example very well. What I proposed in my mocked up Julia code for data frames is basically using matrix-like slicing, Matlab style. However, there are a few troubling departures from matrix compatibility:
  1. The indexing is data[col,row] as compared to matrix[row,col].
Oh, that's weird. I definitely think it should be data[row,col], to maintain compatibility with both R and Julia matrices.
  2. Writing data[col] picks out a vector rather than an element by linear indexing.
Yes, definitely. In R you can write data$col instead of data["col"], which is handy, but not strictly necessary. Both return the column vector. I'm not sure if there's going to be a way to use a bare word for a column index in Julia, without changes to the parser... R allows you (and sometimes requires you) to do data$"col" -- maybe something like that would be possible to save one character?
  3. Writing data[col,:] would slice an nx1 DataFrame rather than a n-elt DataVector.
I assume this should be data[:,col]? So, data[:,"a"] or data[:,3] return a DataFrame/Table. As do data[:,["a","b"]] and data[:,[3,5,2]]. But data["a"] and data[3] return column vectors. To get the third row, which would be a DataTable (prefer the latter to R's data.frame!), you'd say data[3,:]

Does data[3,3] give you a 1-row 1-col DataTable, or a singleton element? In R it gives you the latter. Not sure if that's the best answer.

Keep in mind we'll eventually want both row names and col names...

So, signatures...

ref(d::DataTable, rows::Vector{Int}, cols::Vector{Int}) gives a DataTable
ref(d::DataTable, row::Int, cols::Vector{Int}) gives a DataTable
ref(d::DataTable, rows::Vector{Int}, col::Int) gives a DataVec
ref(d::DataTable, row::Int, col::Int) gives a singleton (?)
ref(d::DataTable, col::Int) gives a DataVec
ref(d::DataTable, col::String) gives a DataVec
ref(d::DataTable, cols::Vector{String}) gives a DataTable

plus similar for indexing by booleans, ranges, and row names...
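
As a rough illustration of how those signatures could dispatch, here is a toy version in current Julia syntax, where indexing is implemented through Base.getindex rather than ref. ToyTable and colindex are hypothetical, and the columns are plain vectors rather than DataVecs to keep the sketch short.

```julia
# Hypothetical toy table: column names plus one vector per column.
struct ToyTable
    names::Vector{String}
    cols::Vector{Any}
end

# Position of a column name; returns nothing (and indexing then errors) if absent.
colindex(t::ToyTable, name::String) = findfirst(==(name), t.names)

# Two indices: t[rows, cols]
Base.getindex(t::ToyTable, rows::Vector{Int}, cols::Vector{Int}) =
    ToyTable(t.names[cols], Any[t.cols[c][rows] for c in cols])   # a table
Base.getindex(t::ToyTable, row::Int, cols::Vector{Int}) = t[[row], cols]   # a one-row table
Base.getindex(t::ToyTable, rows::Vector{Int}, col::Int) = t.cols[col][rows]  # a vector
Base.getindex(t::ToyTable, row::Int, col::Int) = t.cols[col][row]            # a single value

# One index: select whole columns.
Base.getindex(t::ToyTable, col::Int) = t.cols[col]
Base.getindex(t::ToyTable, col::String) = t.cols[colindex(t, col)]
Base.getindex(t::ToyTable, cols::Vector{String}) =
    ToyTable(cols, Any[t[c] for c in cols])

t = ToyTable(["foo", "bar", "var"], Any[[1, 2, 3], ["a", "b", "c"], ["x", "y", "z"]])
t["bar"]            # the whole "bar" column
t[2, 1]             # a single value
t[[2, 3], 2]        # part of one column
t[[1, 2], [1, 3]]   # a 2x2 sub-table
```

The return type here depends only on the argument types, so each case above is a separate method and no run-time branching on the index kind is needed.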

These all kind of make sense to me in the context of data frames, but I worry a lot about introducing conflicting and/or confusing indexing behaviors into a single language that has both matrices and data frames.

Yeah. I think it should be possible to give pretty similar behaviors.

At least until we get to DataMat, which is a single-type matrix with NAs, row names, and col names.
 

Another thing to keep in mind is that we probably do *not* want to do things like write data[a][b] instead of data[a,b]. The former might work if data[a] is an object that can further be sliced by b, but this is not going to be magically optimized away — the intermediate object data[a] will get passed to a second ref operation for further slicing. Writing data[a,b] allows this to be done in a single operation without creating the data[a] object.

The usual idiom in R is data$col[row], equivalent to data[row, "col"]. I'm pretty sure it doesn't create a separate column in this case. data[,"col"][row] definitely would, though.

 -Harlan

 

Stefan Karpinski

Mar 24, 2012, 3:04:26 PM
to juli...@googlegroups.com
I think perhaps the best approach to indexing then is to always just do data[row,col]. If you want to extract a column, you write data[:,col] and if you want to extract a row, you write data[row,:]. That's simple, completely consistent, and supports named rows easily. Using a single index into a data frame would be an error.

I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion. There's a consistency to this: always dropping trailing dimensions that are sliced with scalars. In the data frame, data vector case, it's a little different because there isn't a tower of higher dimensional tensor types over this...

But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
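
A very rough sketch of what that composition could look like, in current Julia syntax. NamedMatrix, MaskedArray, and lookup are hypothetical names; this only shows scalar indexing and is not a worked-out design.

```julia
struct NAType end
const NA = NAType()

# Wrapper 1: an NA mask over any dense array.
struct MaskedArray{T,N}
    data::Array{T,N}
    na::Array{Bool,N}     # true where the value is missing
end
Base.getindex(m::MaskedArray, i::Int...) = m.na[i...] ? NA : m.data[i...]

# Wrapper 2: row and column names over anything that supports m[i, j].
struct NamedMatrix{A}
    values::A
    rownames::Vector{String}
    colnames::Vector{String}
end

# Translate a name to a position; integer indices pass through unchanged.
lookup(names, key::String) = findfirst(==(key), names)
lookup(names, key::Int) = key

Base.getindex(m::NamedMatrix, r, c) =
    m.values[lookup(m.rownames, r), lookup(m.colnames, c)]

# Names over a plain matrix.
plain = NamedMatrix([1 2; 3 4], ["r1", "r2"], ["a", "b"])
plain["r2", "b"]    # 4

# Names over a masked matrix: the two wrappers compose.
inner = MaskedArray([1.0 2.0; 3.0 4.0], [false true; false false])
dm = NamedMatrix(inner, ["r1", "r2"], ["a", "b"])
dm["r1", "b"]       # NA
dm["r2", "a"]       # 3.0
```

Because NamedMatrix is parametric in the wrapped type, the same naming layer works over a plain matrix or a masked one without any extra code.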

Or maybe DataFrames are really special. They match up with the relational data model very nicely, and in that model there's no need to go up to tensors: all you need are tables with rows and columns. In that case, however, having named rows seems a bit weird (who names their data points?), and the DataMat type is a completely different kind of beast because it's not relational at all.

kem

Mar 24, 2012, 11:46:55 PM
to juli...@googlegroups.com


Yes, I was surprised by that too. I tried it only because I was interested in how it would handle parallels to Stefan's example.


On Saturday, March 24, 2012 4:51:09 AM UTC-5, Douglas Bates wrote:


but:

> x[[2:3]]

[1] 0.3895959

kem

Mar 25, 2012, 12:04:35 AM
to juli...@googlegroups.com


On Saturday, March 24, 2012 2:04:26 PM UTC-5, Stefan Karpinski wrote:
I think perhaps the best approach to indexing then is to always just do data[row,col]. If you want to extract a column, you write data[:,col] and if you want to extract a row, you write data[row,:]. That's simple, completely consistent, and supports named rows easily. Using a single index into a data frame would be an error.

I like this proposal. It's intuitive, simple, and clean to me.

 
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion.
I disagree with this, actually, but can see arguments about it either way. My reason for this has to do with confusion that results (it may only be my own personal confusion) when feeding a dimensionless vector into a linear algebra statement that requires dimensionality. E.g., if you slice a data frame with a dimension, lose the dimension, but then require it for subsequent linear algebra, it adds a layer that wouldn't be present if the dimension were never lost. But I don't feel strongly about it.

 

But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.

I like this idea, even though I doubt I personally would ever use it. I do sort of think that dataframes are mostly just named arrays with NA handling. It's worth noting that in some other systems similar constructs contain more attributes per variable than just names (e.g., a description), although I don't really see it as being necessary (and it isn't present in R).

Harlan Harris

Mar 25, 2012, 9:35:33 AM
to juli...@googlegroups.com
On Sat, Mar 24, 2012 at 3:04 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
I think perhaps the best approach to indexing then is to always just do data[row,col]. If you want to extract a column, you write data[:,col] and if you want to extract a row, you write data[row,:]. That's simple, completely consistent, and supports named rows easily. Using a single index into a data frame would be an error.

I just disagree with this. Especially in an interactive environment, having to say a[:,"b"] is way, way more annoying than a$b (best, but maybe impossible) or a["b"]. 6, 1, and 4 keystrokes, respectively.  Let's implement the 4-keystroke option now, and think about whether R-like 1-keystroke syntax is possible in the future.

I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion. There's a consistency to this: always dropping trailing dimensions that are sliced with scalars. In the data frame, data vector case, it's a little different because there isn't a tower of higher dimensional tensor types over this...

I agree. For a DataTable, dat[:,1] should be a DataVec. Although I do think that for a DataMat, dat[:,1] should do whatever matrixes do, which it seems has not been decided. I don't have an opinion on that. Either a vector or a nx1 matrix seems reasonable...
 
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.

It may be that DataMat should allow arbitrary matrix dimensionalities, implemented as you suggest. But I do think that, for the sake of eventual use by social scientists, we should have DataVecs be separate and as easy-to-use as possible.
 
Or maybe DataFrames are really special. They match up with the relational data model very nicely, and in that model there's no need to go up to tensors: all you need are tables with rows and columns. In that case, however, having named rows seems a bit weird (who names their data points?), and the DataMat type is a completely different kind of beast because it's not relational at all.

Yes, that's exactly right. DataFrames (Tables!) are special, and DataMats are non-relational.

Named rows aren't technically necessary in DataTables, but it's handy in some cases to have row names separate from the data. Say your row names are patient codes, and you have all of your data with columns "outcome", "predictor1", "predictor2". Then in R syntax you can do:

fit <- lm(outcome ~ ., dat)

Where the "." gets interpreted as "all other columns". If you have "P73" as a patient ID column, you can't do that.

I don't think it's a big deal to have a separate, optional row-name vector in the DataTable implementation...

More in a sec, in response to kem...

 -Harlan

Harlan Harris

Mar 25, 2012, 9:43:25 AM
to juli...@googlegroups.com
On Sun, Mar 25, 2012 at 12:04 AM, kem <kristian...@gmail.com> wrote:
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion.
I disagree with this, actually, but can see arguments about it either way. My reason for this has to do with confusion that results (it may only be my own personal confusion) when feeding a dimensionless vector into a linear algebra statement that requires dimensionality. E.g., if you slice a data frame with a dimension, lose the dimension, but then require it for subsequent linear algebra, it adds a layer that wouldn't be present if the dimension were never lost. But I don't feel strongly about it.

Just to be clear, I think Stefan was talking about DataMats here, not DataTables.
 

But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.

I like this idea, even though I doubt I personally would ever use it. I do sort of think that dataframes are mostly just named arrays with NA handling. It's worth noting that in some other systems similar constructs contain more attributes per variable than just names (e.g., a description), although I don't really see it as being necessary (and it isn't present in R).

No, data.frames in R are most definitely not just named arrays with NA handling! They allow heterogeneous types and simple column indexing, and they afford a huge number of operations that make little sense with matrixes (even ones with names and NAs). The types of operations that people do on DataTables are not going to be linear-algebra-like. They're going to be map-reduce-like, and SQL-join-like, and reshape-like.

Good point about additional attributes per variable, but I agree with you -- doesn't seem necessary to store those as part of the data structure. As a side note, R implements row names, column names, dimension labels, and a bunch of other things by allowing arbitrary "attribute" lists on every object, no matter how simple. I don't think that that's a great idea for Julia's Data types...

 -Harlan

 

Stefan Karpinski

Mar 25, 2012, 4:38:40 PM
to juli...@googlegroups.com
On Sun, Mar 25, 2012 at 9:35 AM, Harlan Harris <har...@harris.name> wrote:
 
On Sat, Mar 24, 2012 at 3:04 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
 
I think perhaps the best approach to indexing then is to always just do data[row,col]. If you want to extract a column, you write data[:,col] and if you want to extract a row, you write data[row,:]. That's simple, completely consistent, and supports named rows easily. Using a single index into a data frame would be an error.

I just disagree with this. Especially in an interactive environment, having to say a[:,"b"] is way, way more annoying than a$b (best, but maybe impossible) or a["b"]. 6, 1, and 4 keystrokes, respectively.  Let's implement the 4-keystroke option now, and think about whether R-like 1-keystroke syntax is possible in the future.

This is a fair point. I guess that having data[col] select a column of a data frame is reasonable. I don't think any syntax shorter than that is ever going to happen though: we're really running out of syntax. More specifically, we're running out of ASCII characters to use for syntax. The obvious analogue of a$b in R would be a.b, but that already means something: field access. One of my favorite things about Julia is the lack of confusing syntax overloading, so I'm not really willing to have a.b sometimes mean something simple and fundamental like field access and other times mean something very different.

If someone comes up with a really clever syntactic solution, that's cool, but it just seems kind of unlikely. My favorite syntactic solution that lets us cram huge amounts of functionality into a single feature is non-standard string literals. Something general like that might just be possible, but I'm kind of skeptical.
 
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion. There's a consistency to this: always dropping trailing dimensions that are sliced with scalars. In the data frame, data vector case, it's a little different because there isn't a tower of higher dimensional tensor types over this...

I agree. For a DataTable, dat[:,1] should be a DataVec. Although I do think that for a DataMat, dat[:,1] should do whatever matrixes do, which it seems has not been decided. I don't have an opinion on that. Either a vector or a nx1 matrix seems reasonable...
 
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.

It may be that DataMat should allow arbitrary matrix dimensionalities, implemented as you suggest. But I do think that, for the sake of eventual use by social scientists, we should have DataVecs be separate and as easy-to-use as possible.
 
Or maybe DataFrames are really special. They match up with the relational data model very nicely, and in that model there's no need to go up to tensors: all you need are tables with rows and columns. In that case, however, having named rows seems a bit weird (who names their data points?), and the DataMat type is a completely different kind of beast because it's not relational at all.

Yes, that's exactly right. DataFrames (Tables!) are special, and DataMats are non-relational.

Ok, for now, I'm going to suggest that we punt entirely on DataMat or anything like it. I don't think the requirements and applications are clear enough. DataVec (or maybe it should be called DataCol) and DataFrame seem pretty clear and complete: DataFrame provides an essentially relational representation of named, ordered, heterogeneously typed, nullable data. DataVec represents a single column of that. Given the connection to relational data representation, I'm wondering if we shouldn't maybe call these types Table and Column instead of DataFrame and DataVec. The names Table and Column are a bit generic, but I'm assuming all of this will have to be imported before use anyway.

Another thing about DataMat is that I'm not entirely convinced that it needs to exist at all. It's awfully close to Matrix. Can't you just provide a replacement value for NA when converting data from a data frame to some sort of matrix representation? If the data happens to be floating-point, as it very often would be, then NaN is even an obvious default replacement for NA. In any case, I'd really like to explore use cases for something like DataMat before we go ahead and implement it.
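
A minimal sketch of that conversion, in current Julia syntax. The column/mask layout and the tomatrix helper with its replacement keyword are assumptions for illustration. With the NaN default, missingness still propagates through ordinary matrix arithmetic via IEEE semantics.

```julia
# Two floating-point columns with Bool masks (true = NA); masked values are placeholders.
cols  = [[1.0, 2.0, 3.0], [4.0, 0.0, 6.0]]
masks = [[false, false, false], [false, true, false]]

# Build a plain Matrix{Float64}, substituting `replacement` wherever the mask is set.
function tomatrix(cols, masks; replacement=NaN)
    n, p = length(cols[1]), length(cols)
    M = Matrix{Float64}(undef, n, p)
    for j in 1:p, i in 1:n
        M[i, j] = masks[j][i] ? replacement : cols[j][i]
    end
    return M
end

M = tomatrix(cols, masks)                      # NaN stands in for NA
w = [1.0, 1.0]
M * w                                          # the row that held an NA comes out as NaN

M0 = tomatrix(cols, masks; replacement=0.0)    # or pick an explicit replacement value
M0 * w
```

One caveat is that after conversion a NaN that came from an NA is indistinguishable from a NaN produced by the arithmetic itself, which is part of the NA versus NaN distinction made at the start of this thread.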

Named rows aren't technically necessary in DataTables, but it's handy in some cases to have row names separate from the data. Say your row names are patient codes, and you have all of your data with columns "outcome", "predictor1", "predictor2". Then in R syntax you can do:

fit <- lm(outcome ~ ., dat)

Where the "." gets interpreted as "all other columns". If you have "P73" as a patient ID column, you can't do that.

I don't think it's a big deal to have a separate, optional row-name vector in the DataTable implementation...

This makes a lot of sense, especially for plotting. Interpreting data point indices is deeply annoying. And of course sometimes data point labels are something like dates. So it should probably be more flexible than just allowing string labels.

Stefan Karpinski

Mar 25, 2012, 4:44:53 PM
to juli...@googlegroups.com
On Sun, Mar 25, 2012 at 12:04 AM, kem <kristian...@gmail.com> wrote:

I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion.
 
I disagree with this, actually, but can see arguments about it either way. My reason for this has to do with confusion that results (it may only be my own personal confusion) when feeding a dimensionless vector into a linear algebra statement that requires dimensionality. E.g., if you slice a data frame with a dimension, lose the dimension, but then require it for subsequent linear algebra, it adds a layer that wouldn't be present if the dimension were never lost. But I don't feel strongly about it.

Like we've done (finally), with vectors and column matrices, data vectors should automatically upgrade to single-column data matrices. That should take any annoyance out of this behavior. This worries me with regards to the single-index behavior though: for a data vector v[1] means the first scalar value; for a single-column data frame, d[1] would mean the first column (assuming we accept using one value indexing into a data frame for pulling out columns). That difference is potentially very problematic.
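The indexing behaviour being debated here can be sketched in a few lines. This is only an illustration in present-day Julia syntax (where `ref` has since become `getindex`); `MiniFrame` and its fields are hypothetical names, not the proposed implementation.

```julia
# Hypothetical toy table: column names plus one vector per column.
struct MiniFrame
    names::Vector{String}
    cols::Vector{Vector{Any}}
end

# d[row, col] is a plain value, not a 1x1 frame.
Base.getindex(d::MiniFrame, row::Int, col::Int) = d.cols[col][row]

# d[:, col] is the whole column as a vector, not an n-by-1 frame.
Base.getindex(d::MiniFrame, ::Colon, col::Int) = d.cols[col]

# d[col] with a single index selects a column, whereas v[1] on a
# vector selects a scalar. This is the asymmetry raised above.
Base.getindex(d::MiniFrame, col::Int) = d.cols[col]

d = MiniFrame(["x", "y"], [Any[1, 2, 3], Any["a", "b", "c"]])
v = d[:, 1]        # a column
x = d[:, 1][2]     # verbose spelling of d[2, 1]
```

With these definitions `d[1]` is a whole column while `v[1]` is a single value, which is exactly the difference that would bite if vectors were silently upgraded to single-column frames.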

But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
 
I like this idea, even though I doubt I personally would ever use it. I do sort of think that dataframes are mostly just named arrays with NA handling. It's worth noting that in some other systems similar constructs contain more attributes per variable than just names (e.g., a description), although I don't really see it as being necessary (and it isn't present in R).

The phrase "[I] doubt I personally would ever use it" is a sure danger sign. It's a little too over-engineered for something that isn't an immediately pressing need. Smells a bit of over-thinking it :-\. That's why I think that punting on DataMat for now is the best course of action.
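For reference, the wrapper idea quoted above might look roughly like the following in present-day Julia. It is a sketch only: `NamedMatrix` and its field names are made up for illustration, and the separate NA-masking wrapper is left out.

```julia
# Hypothetical wrapper: any matrix plus row and column labels.
struct NamedMatrix{A<:AbstractMatrix}
    data::A
    rownames::Vector{String}
    colnames::Vector{String}
end

# Look up each label's position, then index the wrapped array.
function Base.getindex(m::NamedMatrix, r::String, c::String)
    i = findfirst(==(r), m.rownames)
    j = findfirst(==(c), m.colnames)
    (i === nothing || j === nothing) && throw(KeyError((r, c)))
    return m.data[i, j]
end

# Positional indexing simply forwards to the wrapped array.
Base.getindex(m::NamedMatrix, i::Int, j::Int) = m.data[i, j]

m = NamedMatrix([1.0 2.0; 3.0 4.0], ["a", "b"], ["x", "y"])
```

Because the wrapper is parametric on the array type, the same few methods would cover dense, sparse, or distributed matrices, which is the "composition" argument made above.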

Stefan Karpinski

Mar 25, 2012, 4:48:01 PM3/25/12
to juli...@googlegroups.com
On Sun, Mar 25, 2012 at 9:43 AM, Harlan Harris <har...@harris.name> wrote:

On Sun, Mar 25, 2012 at 12:04 AM, kem <kristian...@gmail.com> wrote:
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion.
 
I disagree with this, actually, but can see arguments about it either way. My reason for this has to do with confusion that results (it may only be my own personal confusion) when feeding a dimensionless vector into a linear algebra statement that requires dimensionality. E.g., if you slice a data frame with a dimension, lose the dimension, but then require it for subsequent linear algebra, it adds a layer that wouldn't be present if the dimension were never lost. But I don't feel strongly about it.

Just to be clear, I think Stefan was talking about DataMats here, not DataTables.

I was actually talking about data tables (see, now you're calling them tables too!).
 
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
 
I like this idea, even though I doubt I personally would ever use it. I do sort of think that dataframes are mostly just named arrays with NA handling. It's worth noting that in some other systems similar constructs contain more attributes per variable than just names (e.g., a description), although I don't really see it as being necessary (and it isn't present in R).

No, data.frames in R are most definitely not just named arrays with NA handling! They allow heterogeneous types and simple column indexing, and they afford a huge number of operations that make little sense with matrixes (even ones with names and NAs). The types of operations that people do on DataTables are not going to be linear-algebra-like. They're going to be map-reduce-like, and SQL-join-like, and reshape-like.

But matrix multiplication is just a join on the inner indices, followed by a group-by and sum!
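To make that remark concrete, here is a small sketch in present-day Julia that stores two matrices relationally as (row, col, value) triples and multiplies them with a join and a grouped sum. The helper names `triples` and `join_groupby_sum` are invented for this illustration.

```julia
A = [1 2; 3 4]
B = [5 6; 7 8]

# Relational form of a matrix: one (row, col, value) triple per entry.
triples(M) = [(i, j, M[i, j]) for i in axes(M, 1) for j in axes(M, 2)]

# Join on the inner index (A's column == B's row), then group by the
# outer pair (i, k) and sum the products within each group.
function join_groupby_sum(ta, tb)
    groups = Dict{Tuple{Int,Int},Int}()
    for (i, j, a) in ta, (j2, k, b) in tb
        j == j2 || continue                      # join condition
        groups[(i, k)] = get(groups, (i, k), 0) + a * b
    end
    return groups
end

C = join_groupby_sum(triples(A), triples(B))
```

Each key `(i, k)` of `C` holds the same number as entry `[i, k]` of `A * B`.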
 
Good point about additional attributes per variable, but I agree with you -- doesn't seem necessary to store those as part of the data structure. As a side note, R implements row names, column names, dimension labels, and a bunch of other things by allowing arbitrary "attribute" lists on every object, no matter how simple. I don't think that that's a great idea for Julia's Data types...

Yeah, that sounds like way too much.

Harlan Harris

Mar 25, 2012, 4:53:57 PM3/25/12
to juli...@googlegroups.com
On Sun, Mar 25, 2012 at 4:38 PM, Stefan Karpinski <ste...@karpinski.org> wrote:

This is a fair point. I guess that having data[col] select a column of a data frame is reasonable. I don't think any syntax shorter than that is ever going to happen though: we're really running out of syntax. More specifically, we're running out of ASCII characters to use for syntax. The obvious analogue of a$b in R would be a.b, but that already means something: field access. One of my favorite things about Julia is the lack of confusing syntax overloading, so I'm not really willing to have a.b sometimes mean something simple and fundamental like field access and other times mean something very different.

If someone comes up with a really clever syntactic solution, that's cool, but it just seems kind of unlikely. My favorite syntactic solution that lets us cram huge amounts of functionality into a single feature is non-standard string literals. Something general like that might just be possible, but I'm kind of skeptical.

OK. Works for now for me.
 
 
Ok, for now, I'm going to suggest that we punt entirely on DataMat or anything like it.

Concur. It wasn't a high priority for me, anyway...
 
I don't think the requirements and applications are clear enough. DataVec (or maybe it should be called DataCol) and DataFrame seem pretty clear and complete: DataFrame provides an essentially relational representation of named, ordered heterogeneously typed, nullable data. DataVec represents a single column of that. Given the connection to relational data representation, I'm wondering if we shouldn't maybe call these types Table and Column instead of DataFrame and DataVec. The names Table and Column are a bit generic, but I'm assuming all of this will have to be imported before use anyway.

Hm, I'm OK with Table and Column! It's very clear for people with db backgrounds! Will rename those in my code presently...

Another thing about DataMat is that I'm not entirely convinced that it needs to exist at all. It's awfully close to Matrix. Can't you just provide a replacement value for NA when converting data from a data frame to some sort of matrix representation? If the data happens to be floating-point, as it very often would be, then NaN is even an obvious default replacement for NA. In any case, I'd really like to explore use cases for something like DataMat before we go ahead and implement it.

Concur. Most of the time, the conversion from Table to a matrix will involve a process (deletion, NaNification, imputation) that gets rid of the NAs.
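Two of those conversions, NaNification and deletion, are easy to sketch. The example below is written in present-day Julia and uses the built-in `missing` value as a stand-in for the NA discussed in this thread; `nanify` and `complete_rows` are hypothetical helper names.

```julia
# Two hypothetical numeric columns; `missing` plays the role of NA.
cols = [
    Union{Float64,Missing}[5.1, missing, 4.7],
    Union{Float64,Missing}[3.5, 3.0, 3.2],
]

# NaNification: replace every NA with NaN while packing into a Matrix.
nanify(cols) =
    Float64[coalesce(c[i], NaN) for i in eachindex(first(cols)), c in cols]

# Deletion: keep only the rows that have no NA in any column.
function complete_rows(cols)
    keep = [i for i in eachindex(first(cols))
            if !any(ismissing(c[i]) for c in cols)]
    return Float64[c[i] for i in keep, c in cols]
end

X = nanify(cols)          # 3x2, with NaN where the NA was
Y = complete_rows(cols)   # 2x2, the row containing the NA is dropped
```

Imputation would follow the same shape, with the replacement computed from the other values in the column rather than fixed at NaN.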
 
This makes a lot of sense, especially for plotting. Interpreting data point indices is deeply annoying. And of course sometimes data point labels are something like dates. So it should probably be more flexible than just allowing string labels.

Hm, interesting. I guess it might be useful to allow non-strings in row labels. We'll have to think about things like how those labels get propagated in various reshape/reorder operations, whether they have to be unique, etc... I don't think the initial version needs row labels, though...

 -Harlan

Harlan Harris

Mar 26, 2012, 9:41:47 AM3/26/12
to juli...@googlegroups.com
I think this question below got lost in the shuffle -- could I please get a ruling? :)

s/DataVec/Column/g

 -Harlan

James Bullard

Mar 27, 2012, 2:00:03 AM3/27/12
to juli...@googlegroups.com
On Sun, Mar 25, 2012 at 1:53 PM, Harlan Harris <har...@harris.name> wrote:
On Sun, Mar 25, 2012 at 4:38 PM, Stefan Karpinski <ste...@karpinski.org> wrote:

This is a fair point. I guess that having data[col] select a column of a data frame is reasonable. I don't think any syntax shorter than that is ever going to happen though: we're really running out of syntax. More specifically, we're running out of ASCII characters to use for syntax. The obvious analogue of a$b in R would be a.b, but that already means something: field access. One of my favorite things about Julia is the lack of confusing syntax overloading, so I'm not really willing to have a.b sometimes mean something simple and fundamental like field access and other times mean something very different.

If someone comes up with a really clever syntactic solution, that's cool, but it just seems kind of unlikely. My favorite syntactic solution that lets us cram huge amounts of functionality into a single feature is non-standard string literals. Something general like that might just be possible, but I'm kind of skeptical.

OK. Works for now for me.

Also, don't forget that in most R programming environments you have tab-complete on data$the_very_specificcolumnname - which does help a good deal (if you want to be interactive) ...

 
 
Ok, for now, I'm going to suggest that we punt entirely on DataMat or anything like it.

Concur. It wasn't a high priority for me, anyway...
 
I don't think the requirements and applications are clear enough. DataVec (or maybe it should be called DataCol) and DataFrame seem pretty clear and complete: DataFrame provides an essentially relational representation of named, ordered heterogeneously typed, nullable data. DataVec represents a single column of that. Given the connection to relational data representation, I'm wondering if we shouldn't maybe call these types Table and Column instead of DataFrame and DataVec. The names Table and Column are a bit generic, but I'm assuming all of this will have to be imported before use anyway.

Hm, I'm OK with Table and Column! It's very clear for people with db backgrounds! Will rename those in my code presently...

I hate the name data.frame, but just Table can be confused with the other common use of table in a statistical environment, which is a cross-tabulation. At the expense of one letter (as compared to DataFrame), why not DataTable and DataCol?

kem

Mar 29, 2012, 6:22:46 PM3/29/12
to juli...@googlegroups.com

Hm, I'm OK with Table and Column! It's very clear for people with db backgrounds! Will rename those in my code presently...

I hate the name data.frame, but just Table can be confused with the other common use of table in a statistical environment, which is a cross-tabulation. At the expense of one letter (as compared to DataFrame), why not DataTable and DataCol?

I was going to say the same thing. "Table" has a lot of meanings in statistics, especially contingency tables, and I can see this introducing a lot of confusion.

Stefan Karpinski

Mar 29, 2012, 6:24:06 PM3/29/12
to juli...@googlegroups.com
Maybe we should just stick with DataFrame and DataCol then.

kem

Mar 29, 2012, 6:25:24 PM3/29/12
to juli...@googlegroups.com


I just disagree with this. Especially in an interactive environment, having to say a[:,"b"] is way, way more annoying than a$b (best, but maybe impossible) or a["b"]. 6, 1, and 4 keystrokes, respectively.  Let's implement the 4-keystroke option now, and think about whether R-like 1-keystroke syntax is possible in the future.



I understand what you're saying here, but wasn't the issue that began this discussion about how to reference slices? I like the [row, col] syntax because it's clear and consistent. The extra typing doesn't seem to interfere to me.

I'm probably missing something though--I was gone for a few days and I feel like there are some gaps in how I'm putting things together.

 

kem

Mar 29, 2012, 6:27:02 PM3/29/12
to juli...@googlegroups.com





No, data.frames in R are most definitely not just named arrays with NA handling! They allow heterogeneous types and simple column indexing, and they afford a huge number of operations that make little sense with matrixes (even ones with names and NAs). The types of operations that people do on DataTables are not going to be linear-algebra-like. They're going to be map-reduce-like, and SQL-join-like, and reshape-like.

You're right. I forgot about the heterogeneous types, etc., which are extremely important.

kem

Mar 29, 2012, 6:59:30 PM3/29/12
to juli...@googlegroups.com
Sorry for the delay--I was getting caught up on some other things and wanted to let this settle in my mind a bit anyway.




One option would be for naFilter/naReplace to return an object of class DataVecIterator, or something, that contains/wraps a DataVec:

type DataVecIterator
    datavec::DataVec
end

The start/next/done methods for a DataVecIterator would then have to dive through the extra layer of reference to get the data.


I liked this option the least. I'm not sure I have a good reason for this, other than your last point--it seems like the extra layer could complicate things or slow things down performance-wise. I have a general distrust of these sorts of hierarchies, though.

 
Another option would be for DataVec and FilteredDataVec and ReplacedDataVec to be implementations of an AbstractDataVec type, where almost all methods would refer to AbstractDataVec, except for start/next/done, which would have specific implementations that do the right thing. naFilter(d::DataVec) would then just create a FilteredDataVec, referring to the original data and na vectors. Similar for ReplacedDataVec.
 

A third option would be to keep just a single type, but add filterNA and replaceNA fields to the type, defaulting to false and nothing, respectively. naFilter and naReplace then just create new objects (referring to the old data) with the appropriate changes to those fields, which start/next/done would examine at run-time.

I vacillate between these two options.

I like the third option because it seems straightforward and flexible, and it seems like the fields could be useful for other reasons, or be expanded upon later (e.g., just hypothetically, in the abstract, if later it was decided you wanted to have multiple types of replace). On the other hand, that adds a teeny bit of overhead for every single DataVec, etc. I could also see a possible need to enforce certain constraints between the filterNA and replaceNA fields, in case there is a conflict.

The second option avoids that overhead for DataVec, but then it seems like it adds a little complexity in other ways -- e.g., if I were writing a function operating on DataVecs, I'd probably prefer the third option because the fields would always be there, and I wouldn't have to worry about making inferences about the type of DataVec.

It sort of depends on exactly how these things would be implemented--e.g., what you mean by "changes to those fields" and "referring to the original data and na vectors." What would go in those fields and how would the na vectors be represented?

In the back of my mind, I'm sort of trying to imagine writing a function that implements an EM algorithm and trying to figure out what would be most desirable to work with. I think it depends on the details.
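For comparison, the second option (separate wrapper types under a common abstract type) could be sketched like this. It is written in present-day Julia, where the `start`/`next`/`done` protocol has been replaced by `iterate`; the type and function names follow the thread, but the bodies are an assumption made for illustration.

```julia
abstract type AbstractDataVec{T} end

struct DataVec{T} <: AbstractDataVec{T}
    data::Vector{T}
    na::Vector{Bool}
end

# The wrappers hold the original vectors; nothing is copied.
struct FilteredDataVec{T} <: AbstractDataVec{T}
    data::Vector{T}
    na::Vector{Bool}
end

struct ReplacedDataVec{T} <: AbstractDataVec{T}
    data::Vector{T}
    na::Vector{Bool}
    replacement::T
end

naFilter(d::DataVec) = FilteredDataVec(d.data, d.na)
naReplace(d::DataVec{T}, r) where {T} =
    ReplacedDataVec(d.data, d.na, convert(T, r))

Base.eltype(::Type{<:AbstractDataVec{T}}) where {T} = T
Base.IteratorSize(::Type{<:FilteredDataVec}) = Base.SizeUnknown()
Base.length(d::ReplacedDataVec) = length(d.data)

# Only iteration differs between the wrappers.
function Base.iterate(d::FilteredDataVec, i=1)
    while i <= length(d.data) && d.na[i]
        i += 1                           # skip NA positions
    end
    return i > length(d.data) ? nothing : (d.data[i], i + 1)
end

Base.iterate(d::ReplacedDataVec, i=1) =
    i > length(d.data) ? nothing :
    ((d.na[i] ? d.replacement : d.data[i]), i + 1)

dv = DataVec([5.1, 0.0, 4.7], [false, true, false])
filtered = collect(naFilter(dv))
replaced = collect(naReplace(dv, 100.0))
```

Here the choice of behaviour is made once, by dispatch on the wrapper type, rather than by checking flags on every iteration step.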


Harlan Harris

Mar 31, 2012, 12:37:43 PM3/31/12
to juli...@googlegroups.com
On Thu, Mar 29, 2012 at 6:59 PM, kem <kristian...@gmail.com> wrote:
Sorry for the delay--I was getting caught up on some other things and wanted to let this settle in my mind a bit anyway.

No worries -- I haven't touched this for a week either!
 
One option would be for naFilter/naReplace to return an object of class DataVecIterator, or something, that contains/wraps a DataVec:

type DataVecIterator
    datavec::DataVec
end

The start/next/done methods for a DataVecIterator would then have to dive through the extra layer of reference to get the data.


I liked this option the least. I'm not sure I have a good reason for this, other than your last point--it seems like the extra layer could complicate things or slow things down performance-wise. I have a general distrust of these sorts of hierarchies, though.

Agree entirely.
 
Another option would be for DataVec and FilteredDataVec and ReplacedDataVec to be implementations of an AbstractDataVec type, where almost all methods would refer to AbstractDataVec, except for start/next/done, which would have specific implementations that do the right thing. naFilter(d::DataVec) would then just create a FilteredDataVec, referring to the original data and na vectors. Similar for ReplacedDataVec.
 

A third option would be to keep just a single type, but add filterNA and replaceNA fields to the type, defaulting to false and nothing, respectively. naFilter and naReplace then just create new objects (referring to the old data) with the appropriate changes to those fields, which start/next/done would examine at run-time.

I vacillate between these two options.

I like the third option because it seems straightforward and flexible, and it seems like the fields could be useful for other reasons, or be expanded upon later (e.g., just hypothetically, in the abstract, if later it was decided you wanted to have multiple types of replace). On the other hand, that adds a teeny bit of overhead for every single DataVec, etc. I could also see a possible need to enforce certain constraints between the filterNA and replaceNA fields, in case there is a conflict.

True.
 
The second option avoids that overhead for DataVec, but then it seems like it adds a little complexity in other ways -- e.g., if I were writing a function operating on DataVecs, I'd probably prefer the third option because the fields would always be there, and I wouldn't have to worry about making inferences about the type of DataVec.

I think with the 3rd design you'd never want the user to access the flags at all. They'd only be relevant when you're using the DataVec as an iterator. If you were writing an operation on a DataVec you'd either treat the vector as an iterator, and throw an error if you got an NA, or you'd be doing something that specifically dealt with the NAs somehow, in which case you'd use either ref() access or you'd allow the iterator to provide an NA. So, sum(dat) would throw an error if there were NAs, but sum(naFilter(dat)) or sum(naReplace(dat, 0)) would work as expected.
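A rough sketch of that third design, again in present-day Julia (`iterate` instead of `start`/`next`/`done`). The type name `FlaggedDataVec` is made up here to keep the example self-contained; the field names `filterNA` and `replaceNA` come from the description above, and everything else is an assumption.

```julia
struct FlaggedDataVec{T}
    data::Vector{T}
    na::Vector{Bool}
    filterNA::Bool
    replaceNA::Union{Nothing,T}
end

# Flags default to "no filtering, no replacement".
FlaggedDataVec(data::Vector{T}, na::Vector{Bool}) where {T} =
    FlaggedDataVec{T}(data, na, false, nothing)

# New objects point at the same underlying vectors; only flags change.
naFilter(d::FlaggedDataVec{T}) where {T} =
    FlaggedDataVec{T}(d.data, d.na, true, nothing)
naReplace(d::FlaggedDataVec{T}, r) where {T} =
    FlaggedDataVec{T}(d.data, d.na, false, convert(T, r))

Base.eltype(::Type{FlaggedDataVec{T}}) where {T} = T
Base.IteratorSize(::Type{<:FlaggedDataVec}) = Base.SizeUnknown()

# The flags are examined at run time, inside the iterator.
function Base.iterate(d::FlaggedDataVec, i=1)
    while d.filterNA && i <= length(d.data) && d.na[i]
        i += 1                                   # skip NA positions
    end
    i > length(d.data) && return nothing
    if d.na[i]
        d.replaceNA === nothing &&
            error("NA encountered; use naFilter or naReplace")
        return (d.replaceNA, i + 1)
    end
    return (d.data[i], i + 1)
end

dat = FlaggedDataVec([1.0, 2.0, 4.0], [false, true, false])
s_filtered = sum(naFilter(dat))
s_replaced = sum(naReplace(dat, 0))
```

Calling `sum(dat)` directly raises an error because the iterator reaches an NA with neither flag set, while the two wrapped calls work as described above.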
 

It sort of depends on exactly how these things would be implemented--e.g., what you mean by "changes to those fields" and "referring to the original data and na vectors." What would go in those fields and how would the na vectors be represented?

Oh, this is stuff I've already dealt with. It's a Vector{T} for the data and an AbstractVector{Bool} for the NAs. And for the third option, naFilter would just build a new DataVec with the data and na fields pointing at the same underlying Vectors as the original object.

In the back of my mind, I'm sort of trying to imagine writing a function that implements an EM algorithm and trying to figure out what would be most desirable to work with. I think it depends on the details.

In my mind, I was thinking that, unlike R, most of the heavy math would be done on Matrix objects, not Data* objects. So you'd use various conversion functions to provide data for EM or MCMC or what have you. So the workflow would be to load DataTable objects from files or DBs or whatever, then you'd do all sorts of munging and merging and processing on the DataTable objects, then you'd apply some sort of logic to deal with missing data, be it ignoring cases or imputation of some sort, as you convert to a strictly numerical representation for model fitting. Most or all of the latter would be hidden from the casual user, just as it is in R.

I'm going to work on filling in some gaps in the DataVec functionality, then move on to a very early draft of a DataTable this afternoon, hopefully! I imagine it'll take a few rewrites with reference to data.table and Pandas before the design settles down...

 -Harlan


Harlan Harris

Apr 16, 2012, 9:27:09 PM4/16/12
to juli...@googlegroups.com
OK! I've now got code working (see extras/data.jl in my fork: https://github.com/HarlanH/julia/tree/Data ) that lets me do the following!

(Iris data with an NA added for fun, attached.)

This is very much a "build one to throw away" implementation, so I'd love comments, suggestions, assistance. After speaking with a variety of people, here and elsewhere, I'm thinking that there is enough complexity to designing this right that a spec/RFC might be worthwhile before proceeding much further.

Also note that this implementation is a bunch of stuff marked TODO, and a lot of slicing isn't implemented. There are absolutely no ideas here yet about how to do joins, reshapes, fancy selects, etc. Still, I hope this is an interesting start:

julia> load("data.jl")
Warning: New definition ref(DataFrame{RT,CT},RT,Int64) is ambiguous with ref(DataFrame{RT,CT},Array{Int64,1},CT).
         Make sure ref(DataFrame{Array{Int64,1},Int64},Array{Int64,1},Int64) is defined first.
Warning: New definition ref(DataFrame{RT,CT},Int64,CT) is ambiguous with ref(DataFrame{RT,CT},RT,Int64).
         Make sure ref(DataFrame{Int64,Int64},Int64,Int64) is defined first.

julia> load("test.jl")

julia> tests("test/test_data.jl")
.....................................................

julia> df = csvDataFrame("/Users/hharris/Desktop/iris.csv")
        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
[1,]             5.1         3.5          1.4         0.2     setosa
[2,]              NA         3.0          1.4         0.2     setosa
[3,]             4.7         3.2          1.3         0.2     setosa
[4,]             4.6         3.1          1.5         0.2     setosa
[5,]             5.0         3.6          1.4         0.2     setosa
[6,]             5.4         3.9          1.7         0.4     setosa
[7,]             4.6         3.4          1.4         0.3     setosa
[8,]             5.0         3.4          1.5         0.2     setosa
[9,]             4.4         2.9          1.4         0.2     setosa
[10,]            4.9         3.1          1.5         0.1     setosa
(snip)

julia> head(df)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]           5.1         3.5          1.4         0.2  setosa
[2,]            NA         3.0          1.4         0.2  setosa
[3,]           4.7         3.2          1.3         0.2  setosa
[4,]           4.6         3.1          1.5         0.2  setosa
[5,]           5.0         3.6          1.4         0.2  setosa
[6,]           5.4         3.9          1.7         0.4  setosa


julia> summary(df)
Sepal.Length
Min      4.3
1st Qu.  5.1
Median   5.8
Mean     5.849664429530203
3rd Qu.  6.4
Max      7.9
NAs      1

Sepal.Width
Min      2.0
1st Qu.  2.8
Median   3.0
Mean     3.057333333333334
3rd Qu.  3.3
Max      4.4

Petal.Length
Min      1.0
1st Qu.  1.6
Median   4.35
Mean     3.7580000000000027
3rd Qu.  5.1
Max      6.9

Petal.Width
Min      0.1
1st Qu.  0.3
Median   1.3
Mean     1.199333333333334
3rd Qu.  1.8
Max      2.5

Species
Length: 150
Type  : ASCIIString
NAs   : 0


julia> str(df)
150 observations of 5 variables
Sepal.Length: Float64 5.1 NA 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
Sepal.Width: Float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
Petal.Length: Float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
Petal.Width: Float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
Species: ASCIIString setosa setosa setosa setosa setosa ...

julia> colnames(df)
5-element ASCIIString Array:
 "Sepal.Length"
 "Sepal.Width"
 "Petal.Length"
 "Petal.Width"
 "Species"    

julia> df[1,1]
5.1

julia> df[2,1]
NA

julia> df["Sepal.Length"]
[5.1,NA,4.7,4.6,5.0,5.4,4.6,5.0,4.4,4.9,5.4,4.8,4.8,4.3,5.8,5.7,5.4,5.1,5.7,5.1,5.4,5.1,4.6,5.1,4.8,5.0,5.0,5.2,5.2,4.7,4.8,5.4,5.2,5.5,4.9,5.0,5.5,4.9,4.4,5.1,5.0,4.5,4.4,5.0,5.1,4.8,5.1,4.6,5.3,5.0,7.0,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,5.0,5.9,6.0,6.1,5.6,6.7,5.6,5.8,6.2,5.6,5.9,6.1,6.3,6.1,6.4,6.6,6.8,6.7,6.0,5.7,5.5,5.5,5.8,6.0,5.4,6.0,6.7,6.3,5.6,5.5,5.5,6.1,5.8,5.0,5.6,5.7,5.7,6.2,5.1,5.7,6.3,5.8,7.1,6.3,6.5,7.6,4.9,7.3,6.7,7.2,6.5,6.4,6.8,5.7,5.8,6.4,6.5,7.7,7.7,6.0,6.9,5.6,7.7,6.3,6.7,7.2,6.2,6.1,6.4,7.2,7.4,7.9,6.4,6.3,6.1,7.7,6.3,6.4,6.0,6.9,6.7,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9]

julia> head(df[1:2])
      Sepal.Length Sepal.Width
[1,]           5.1         3.5
[2,]            NA         3.0
[3,]           4.7         3.2
[4,]           4.6         3.1
[5,]           5.0         3.6
[6,]           5.4         3.9

julia> df[1, "Petal.Length"]
1.4

julia> df[1] + df[2]
[8.6,NA,7.9,7.699999999999999,8.6,9.3,8.0,8.4,7.300000000000001,8.0,9.100000000000001,8.2,7.8,7.3,9.8,10.100000000000001,9.3,8.6,9.5,8.899999999999999,8.8,8.8,8.2,8.399999999999999,8.2,8.0,8.4,8.7,8.6,7.9,7.9,8.8,9.3,9.7,8.0,8.2,9.0,8.5,7.4,8.5,8.5,6.8,7.6000000000000005,8.5,8.899999999999999,7.8,8.899999999999999,7.8,9.0,8.3,10.2,9.600000000000001,10.0,7.8,9.3,8.5,9.6,7.300000000000001,9.5,7.9,7.0,8.9,8.2,9.0,8.5,9.8,8.6,8.5,8.4,8.1,9.100000000000001,8.899999999999999,8.8,8.899999999999999,9.3,9.6,9.6,9.7,8.9,8.3,7.9,7.9,8.5,8.7,8.4,9.4,9.8,8.6,8.6,8.0,8.1,9.1,8.4,7.3,8.3,8.7,8.6,9.1,7.6,8.5,9.6,8.5,10.1,9.2,9.5,10.6,7.4,10.2,9.2,10.8,9.7,9.100000000000001,9.8,8.2,8.6,9.600000000000001,9.5,11.5,10.3,8.2,10.100000000000001,8.399999999999999,10.5,9.0,10.0,10.4,9.0,9.1,9.2,10.2,10.2,11.7,9.2,9.1,8.7,10.7,9.7,9.5,9.0,10.0,9.8,10.0,8.5,10.0,10.0,9.7,8.8,9.5,9.6,8.9]

julia> dv1 = df[1]
[5.1,NA,4.7,4.6,5.0,5.4,4.6,5.0,4.4,4.9,5.4,4.8,4.8,4.3,5.8,5.7,5.4,5.1,5.7,5.1,5.4,5.1,4.6,5.1,4.8,5.0,5.0,5.2,5.2,4.7,4.8,5.4,5.2,5.5,4.9,5.0,5.5,4.9,4.4,5.1,5.0,4.5,4.4,5.0,5.1,4.8,5.1,4.6,5.3,5.0,7.0,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,5.0,5.9,6.0,6.1,5.6,6.7,5.6,5.8,6.2,5.6,5.9,6.1,6.3,6.1,6.4,6.6,6.8,6.7,6.0,5.7,5.5,5.5,5.8,6.0,5.4,6.0,6.7,6.3,5.6,5.5,5.5,6.1,5.8,5.0,5.6,5.7,5.7,6.2,5.1,5.7,6.3,5.8,7.1,6.3,6.5,7.6,4.9,7.3,6.7,7.2,6.5,6.4,6.8,5.7,5.8,6.4,6.5,7.7,7.7,6.0,6.9,5.6,7.7,6.3,6.7,7.2,6.2,6.1,6.4,7.2,7.4,7.9,6.4,6.3,6.1,7.7,6.3,6.4,6.0,6.9,6.7,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9]

julia> dv1 = df[1][1:5]
[5.1,NA,4.7,4.6,5.0]

julia> [sqrt(x) | x = naFilter(dv1)]
5-element Any Array:
   2.25832
   2.16795
   2.14476
   2.23607
 #undef  

julia> [sqrt(x) | x = naReplace(dv1, 100.0)]
5-element Any Array:
  2.25832
 10.0   
  2.16795
  2.14476
  2.23607

Like I said, a lot of stuff doesn't work. There are a LOT of valid ways of ref'ing a DataFrame, and I haven't figured out the macro magic to write all of those ways efficiently, so many of them are missing or broken. The only DataVec types that csvDataFrame can deal with are numbers and strings. Nothing is fast.

It'd be great if we could write a quick-and-dirty draft of a model.matrix function to expand out a DF with a string "factor", then plug it into some of the code that Doug Bates and others have written, just as a demo of running a linear model on data loaded from a file!
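As a starting point for that demo, here is a quick-and-dirty sketch in present-day Julia of expanding a string factor into indicator columns, the way R's model.matrix does with its default treatment contrasts. The function name `dummy_matrix` and the toy response values are invented for illustration; it does not plug into anyone's existing model code.

```julia
using LinearAlgebra

# Expand a string "factor" into an intercept column plus one 0/1
# indicator column per level, dropping the first (reference) level.
function dummy_matrix(factor::Vector{String})
    levels = sort(unique(factor))
    X = zeros(length(factor), length(levels))
    X[:, 1] .= 1.0                               # intercept
    for (i, v) in enumerate(factor), (j, lev) in enumerate(levels[2:end])
        v == lev && (X[i, j + 1] = 1.0)
    end
    return X, levels
end

species = ["setosa", "versicolor", "virginica", "setosa"]
X, levels = dummy_matrix(species)

# Toy response; ordinary least squares via the backslash operator.
y = [1.0, 3.0, 5.0, 1.0]
beta = X \ y
```

The first coefficient is the mean response for the reference level and the remaining ones are offsets from it, as in R's default output for a factor predictor.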

 -Harlan

iris.csv

Stefan Karpinski

Apr 17, 2012, 10:48:52 AM4/17/12
to juli...@googlegroups.com
This is awesome! I love it. Makes me feel like we're a lot closer to having something that will be usable by R folks for day-to-day work.
<iris.csv>

Harlan Harris

Apr 17, 2012, 12:04:51 PM4/17/12
to juli...@googlegroups.com
Great! Definitely a ways from day-to-day work for anybody, but a start and a great learning experience!

I've been thinking about design choices and features that would make DataFrames in Julia be particularly Julian. One obvious need is to have a "darray" equivalent for map-reduce and embarrassingly parallel computation. I also think maybe it should be designed from the start to be extendable to bigger-than-RAM data that lives on disk, perhaps in combination with distributed processing. It's a shame that there aren't really decent existing open-source columnar databases that could be plugged in, as far as I can tell... A simple one-file-per-column-per-machine representation with optional indexes might be an interesting approach...

For my next trick, I'm going to start documenting all of these various design decisions and tradeoffs that can/should be made with regards to representations and operations on those representations. There are a LOT of them...

 -Harlan

Fabio Zottele

May 9, 2012, 5:03:47 AM5/9/12
to juli...@googlegroups.com
Thank you, Harlan, for your work.
It is valuable to me because I am coming from R.
I will try your data frame structure, even though it is still a work in progress.
I plan to run some simple benchmarks comparing my functions with the equivalent R functions.
When I have some numbers, I will let you know... and I hope in the future to help you refine Julia's data frame (I'm still a newbie, however...).

Fabio

Harlan Harris

May 9, 2012, 8:21:35 AM5/9/12
to juli...@googlegroups.com
Thanks, Fabio!

I think all but four of us are noobies with Julia...

Quick status update. Haven't had too much time to code, but started working on a PooledDataVec type for the very common situation where you have a small number of repeated items in a column. Hope to make it eventually act like the String and Factor vector types in R, but a bit more general.
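The pooling idea can be sketched as follows, in present-day Julia. This is not the PooledDataVec being worked on; `PooledVec`, its fields, and the choice of `UInt16` references are assumptions for illustration, and NA handling is left out.

```julia
# Hypothetical pooled column: each distinct value is stored once in
# `pool`, and `refs` holds one small integer index per row.
struct PooledVec{T}
    refs::Vector{UInt16}
    pool::Vector{T}
end

function PooledVec(values::Vector{T}) where {T}
    pool = unique(values)
    index = Dict(v => UInt16(i) for (i, v) in enumerate(pool))
    return PooledVec{T}([index[v] for v in values], pool)
end

Base.length(p::PooledVec) = length(p.refs)

# Element access goes through the reference into the pool.
Base.getindex(p::PooledVec, i::Int) = p.pool[p.refs[i]]

species = PooledVec(["setosa", "setosa", "versicolor", "setosa"])
```

For a column like Species in the iris data, this stores three strings and 150 two-byte references instead of 150 strings.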

Have also been thinking a lot about indexes, chunked and sharded arrays, memory-mapped files, and so forth. Can of worms.

 -Harlan

Fabio Zottele

May 9, 2012, 10:14:28 AM5/9/12
to juli...@googlegroups.com
Your work is really valuable. I think that some data structures (such as factors in R) are really instructive when learning statistics (for example, ANOVA). However, the learning curve of R becomes steep when dealing with apply, sapply, mapply and vectorized code. Or rather, it is hard for me. So I like Julia because it should give me good performance with for loops while letting me write more readable code (so learning and discussing code and analyses is more fun).
Also, memory-mapped files could be very useful for GIS analyses on raster data. Another data type I want to implement in the future is a spatialGrid for images, volumes, etc. So you've gained a new supporter! :-)

fabio


 

