This is interesting, and timely because it interacts with what we're thinking
about with the image library. I gather this is R-like, and I've not used R, so
I can't comment as fully as others surely can.
On Thursday, March 22, 2012 09:51:32 am Stefan Karpinski wrote:
> The basic idea is to have a parametric DataVec{T} type something like this:
>
> type DataVec{T}
> data::Vector{T}
> na::AbstractVector{Bool}
I'd add:
fillvalue::T
That controls what gets put in place of an NA when you construct the object. I
suspect one would want it to be 0 in most cases, but might as well make it flexible.
The rest seems reasonable to me, with the caveat that I don't know what a
DataFrame is.
Best,
--Tim
This proposal is the result of a rather fruitful discussion with Harlan Harris that started over a beer in person (best way to have them!) on Monday and has continued a bit by email since then. I think it's ripe enough to put out here for some feedback from people.

The basic idea is to have a parametric DataVec{T} type something like this:

type DataVec{T}
  data::Vector{T}
  na::AbstractVector{Bool}
  # inner constructors...
end
# outer constructors...

Then a DataFrame type would be an ordered, named bundle of DataVecs, which can have different types. Using the trick of indexing into a type to construct "vectors" of that type, we could make this work:

DataVec[1,2,NA,3]

Here NA would be defined like this:

type NAType; end
const NA = NAType()

This is just like nothing (see src/boot.jl). The implementation of ref(::Type{DataVec}, vals...) would just iterate through all of vals, initializing the data vector and na vector as appropriate and returning the resulting DataVec object. That's a nice pleasant literal syntax for data vectors.

To implement parallel element-wise data operations on data vectors, one would do something like this (see the sketch after the list below). That way NA is poisonous: if either value in such an operation is NA, the corresponding result value is NA.

For aggregating operations, like sum, mean, var, etc., you would write the underlying operation to ignore NA values, something like this (also sketched below), except the real implementations may want to avoid creating a temporary subarray for this computation. One thing to point out is what happens when there are no non-NA values for something like mean: the answer is NaN. This just falls out of the definition of mean. NA and NaN are not the same thing; NaN is the correct answer here. The same thing would happen for var when there aren't at least two non-NA values. Sum will automatically return zero of the correct underlying data type when there are no non-NA values.

One question is what to return when getting a single value from a data vector. I think this is simple:

ref{T}(v::DataVec{T}, i::Int) = v.na[i] ? NA : v.data[i]

This has a return type signature of Union(NAType,T), which is a bit unfortunate, but probably just fine — you shouldn't be writing big computations on data frames this way; you should use the underlying data and na vectors instead. Of course, you can write naïve code that works, but it will be slower.

As I said at the outset, a DataFrame would then just be an ordered, named bundle of DataVec objects of heterogeneous type.

So, RFC: what do the pro R users here think of this? Does this seem sensible? Does it satisfy what they need? There are a few reasons I like it:
- Doesn't complicate underlying value types: Int, Float64, String, etc.
- Having NA as a standalone value in general computation doesn't really make much sense; it makes sense in the context of a collection of data, which is how it's used here. The NA value is just a special value that lets one conveniently indicate unavailableness.
- Will work with any underlying data type: if someone defines a Date type with appropriate operations, it will immediately be usable in a DataFrame.
- As usual in Julia, the entire implementation is transparently exposed to the programmer. They can see what's going on, and therefore understand it. They can also, if need be, mess around with it, although that may not always be advised.
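For concreteness, the two operation sketches referred to above (the element-wise operations and the NA-ignoring aggregations) might look roughly like this. This is a minimal sketch rather than code from the proposal; it assumes the default outer constructor DataVec(data, na) and element-wise |, !, and logical indexing on the Bool mask:

# element-wise operation where NA is poisonous: if either input element is NA,
# the corresponding result element is NA
+(a::DataVec, b::DataVec) = DataVec(a.data + b.data, a.na | b.na)

# aggregations ignore NA values by reducing over the non-NA entries
# (real implementations would avoid materializing the temporary subarray)
sum(v::DataVec)  = sum(v.data[!v.na])
mean(v::DataVec) = mean(v.data[!v.na])

When every value is NA, sum(v) falls through to zero of the element type and mean(v) to NaN, matching the behavior described above.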
On Thursday, March 22, 2012 9:51:32 AM UTC-5, Stefan Karpinski wrote:
> The basic idea is to have a parametric DataVec{T} type something like this: [...]

The approach in R has been somewhat different. The easy case is floating point types where NA is simply one specific pattern of the available NaN representations. The integer, factor, etc. NA's are also specific patterns and the overhead in the arithmetic operations is in checking for that pattern. I can understand that it would not be desirable to compromise the performance of arithmetic on vectors in Julia by making this part of the language for all operations on integer types but the proposed scheme adds some overhead in storage and processing even for the floating point case, where NaN values should be handled by the hardware.
Your examples of operations below are not the default in R. In R an NA always propagates so sum, mean, var, etc. of a vector with an NA is NA except when the optional argument na.rm is true (default is false).
I can see the sense in the approach you have outlined but I must admit I still don't feel comfortable with it. It may be best to do some trial implementations and check them out to see if they feel R-like.
To me the ability to handle missing values in computation is important but not the "raison d'etre" of a data frame. An important aspect of data frames that does not seem to be part of this definition is the requirement that all the vectors have the same length so that the data frame can be regarded as a table and indexed like a matrix.
> > Your examples of operations below are not the default in R. In R an NA
> > always propagates so sum, mean, var, etc. of a vector with an NA is NA
> > except when the optional argument na.rm is true (default is false).
>
> Yes, I'm not totally sure if I agree with Stefan on this one either. It
> seems to me that operations that are undefined with NAs should either
> propagate, like R, or maybe throw an error. Note that with properly
> constructed iterators, an expression like "mean(naFilter(x))" should be
> very fast. I guess I'd vote for throwing errors and iterators as a
> replacement for "na.rm=TRUE". "mean(naReplace(x,0))" should work too, and
> would be generally useful.
Likewise, matlab provides separate functions, "sum" and "nansum."
In julia, presumably a cleaner way would be this:
julia> x = DataVec[...] # define this with some NAs in it
julia> sum(x)
NaN
julia> xna = convert(DataVecNA,x)
julia> sum(xna)
172
julia> y = convert(DataVec,xna)
julia> sum(y)
NaN
The DataVec type can _store_ NA information for any type (integers included),
but those NAs are treated as NaN for the purposes of computation. The
DataVecNA triggers the running of algorithms that skip over NAs. Both would
have exactly the same fields & storage (the "convert" is really just syntactic
sugar), and so interconversion involves no overhead. In other words, julia's
type system replaces the need for a na.rm field.
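For illustration, the two-type arrangement described here might be sketched as follows, in the thread's Julia-circa-2012 style. The names and constructors are hypothetical rather than code from the thread; the point is just that both types share the same fields, so the convert "sugar" only re-wraps the existing vectors:

type DataVec{T}
    data::Vector{T}
    na::Vector{Bool}
end

type DataVecNA{T}    # identical fields; only dispatch differs
    data::Vector{T}
    na::Vector{Bool}
end

# "conversion" re-wraps the same underlying vectors, so it costs nothing
convert{T}(::Type{DataVecNA}, v::DataVec{T}) = DataVecNA{T}(v.data, v.na)
convert{T}(::Type{DataVec}, v::DataVecNA{T}) = DataVec{T}(v.data, v.na)

# a DataVec treats stored NAs like NaN for computation (poisonous) ...
sum(v::DataVec) = any(v.na) ? NaN : sum(v.data)
# ... while a DataVecNA runs the NA-skipping algorithm
sum(v::DataVecNA) = sum(v.data[!v.na])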
Perhaps this is what you were proposing, Stefan.
Best,
--Tim
On Thursday, March 22, 2012 02:20:02 pm Harlan Harris wrote:
> I don't like returning NaN because of NAs. That's not what NaN means.
I agree that's ugly. Going back and re-reading, you're proposing they give an
error when there are no valid items? I like that idea. Errors are good.
> I'm
> also concerned that "running of algorithms that skip over NAs" sounds like
> implementing everything twice (at least). An advantage of the iterator
> naFilter wrapper is that it's totally generic, so you only have to write
> mean() once. And there's no conversion using naFilter either -- it just
> returns the next non-NA data in the original data type.
Won't that kill random-access performance? Suppose I ask for x[100000]? Or are
you proposing this only for operations that don't need that kind of access?
From the standpoint of someone interested in local operations on
multidimensional arrays, it's much more efficient to have one random-access
variable storing real values (with 0 filled in for NaNs), and a second
similarly-addressable variable storing a flag for the NaNs. But different
strategies are presumably warranted for different applications.
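A sketch of that layout, assuming the DataVec fields from Stefan's proposal (the helper name is hypothetical): pull out a plain, randomly addressable array with a fill value substituted for the NAs, alongside the flag vector, so access like x[100000] stays cheap.

# materialize a plain array with `fillvalue` written wherever the NA flag is set,
# and return the flag vector alongside it
function withfill{T}(v::DataVec{T}, fillvalue::T)
    out = copy(v.data)
    for i = 1:length(out)
        if v.na[i]
            out[i] = fillvalue
        end
    end
    (out, v.na)
end

# e.g. (vals, mask) = withfill(x, 0.0)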
Best,
--Tim
The approach in R has been somewhat different. The easy case is floating point types where NA is simply one specific pattern of the available NaN representations. The integer, factor, etc. NA's are also specific patterns and the overhead in the arithmetic operations is in checking for that pattern. I can understand that it would not be desirable to compromise the performance of arithmetic on vectors in Julia by making this part of the language for all operations on integer types but the proposed scheme adds some overhead in storage and processing even for the floating point case, where NaN values should be handled by the hardware.
bitstype 64 Float64NA <: Float
bitstype 64 Int64NA <: Signed
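For illustration, the bit-pattern scheme that such bitstypes would formalize works roughly as follows. The sketch below is hypothetical and uses a sentinel Int64 value rather than a real bitstype, but it shows where the per-operation check (the pattern-checking overhead described above) comes from:

# reserve one bit pattern of Int64 to play the role of NA (this is what R does
# for its integer type)
const INT64_NA = typemin(Int64)

isna(x::Int64) = x == INT64_NA

# checked addition: the NA pattern propagates, at the cost of a test per operation
na_add(a::Int64, b::Int64) = (isna(a) || isna(b)) ? INT64_NA : a + b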
Your examples of operations below are not the default in R. In R an NA always propagates so sum, mean, var, etc. of a vector with an NA is NA except when the optional argument na.rm is true (default is false).
I can see the sense in the approach you have outlined but I must admit I still don't feel comfortable with it. It may be best to do some trial implementations and check them out to see if they feel R-like. To me the ability to handle missing values in computation is important but not the "raison d'etre" of a data frame. An important aspect of data frames that does not seem to be part of this definition is the requirement that all the vectors have the same length so that the data frame can be regarded as a table and indexed like a matrix.
> data = data.frame(foo=c(1,2,3), bar=c("a","b","c"))
> data[1]
  foo
1   1
2   2
3   3
> data[1,]
  foo bar
1   1   a
> data[,1]
[1] 1 2 3
julia> data = DataFrame["foo" "bar"
                            1   "a"
                            2   "b"
                            3   "c"]
3x2 DataFrame:
   foo  bar
1    1  "a"
2    2  "b"
3    3  "c"

julia> data["foo"]
DataVec{Int64}:
   foo
1    1
2    2
3    3

julia> data["bar"]
DataVec{String}:
   bar
1  "a"
2  "b"
3  "c"

julia> data["foo",2]
2

julia> data["bar",2]
"b"

julia> data["bar",2:end]
DataVec{String}:
   bar
2  "b"
3  "c"

julia> data[:,2]
1x2 DataFrame:
   foo  bar
1    2  "b"

julia> data[:,2:3]
3x2 DataFrame:
   foo  bar
1    2  "b"
2    3  "c"
However, when you get to a more complex function like predict() (e.g., used to calculate expected values of outcome variables from predictors in a regression), thinking about NA handling becomes a little different. There, propagating NA values seems like a valuable option, because it clarifies that the predictors contain NA values. If you threw an error, it would be somewhat annoying and difficult because you could calculate expected values for some cases in the dataframe. It's unclear what ignoring the NA values would mean--in some cases it would be undefined, but in other cases it might be defined (e.g., ML prediction using available data but ignoring missing data, which I wish R would incorporate more by default).
In R, it's conventional to be able to specify NA handling explicitly in one of the three ways (which is in my mind implemented unevenly across functions). It might be nice to encourage similar practices in Julia, by regularly creating explicit methods for the three options--naIgnore(), naProp(), and maybe an error by default or something.
Sorry if I'm being unclear with this--I guess my point is that I'm not sure there can be a single approach to NA handling, that it's probably handled best at the function level rather than the dataframe/structure level, and that it would be nice if some implicit standard or convention were established that covered the bases.
Does that example make it feel a bit more R-like?
julia> data[:,2]
1x2 DataFrame:
   foo  bar
1    2  "b"

julia> data[:,2:3]
3x2 DataFrame:
   foo  bar
1    2  "b"
2    3  "c"
I don't really get why ignoring NAs isn't the default for things like sum and mean. Seems crazy and useless to me. I wonder if it doesn't simply stem from implementing NA using NaNs, where that's what the IEEE semantics force on you because NaN is poisonous...
I have to confess, I didn't really follow this example very well. What I proposed in my mocked up Julia code for data frames is basically using matrix-like slicing, Matlab style. However, there are a few troubling departures from matrix compatibility:
- The indexing is data[col,row] as compared to matrix[row,col].
- Writing data[col] picks out a vector rather than an element by linear indexing.
- Writing data[col,:] would slice an nx1 DataFrame rather than an n-element DataVec.
These all kind of make sense to me in the context of data frames, but I worry a lot about introducing conflicting and/or confusing indexing behaviors into a single language that has both matrices and data frames.
Another thing to keep in mind is that we probably do *not* want to do things like write data[a][b] instead of data[a,b]. The former might work if data[a] is an object that can further be sliced by b, but this is not going to be magically optimized away — the intermediate object data[a] will get passed to a second ref operation for further slicing. Writing data[a,b] allows this to be done in a single operation without creating the data[a] object.
but:
> x[[2:3]]
[1] 0.3895959
I think perhaps the best approach to indexing then is to always just do data[row,col]. If you want to extract a column, you write data[:,col] and if you want to extract a row, you write data[row,:]. That's simple, completely consistent, and supports named rows easily. Using a single index into a data frame would be an error.
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion. There's a consistency to this: always dropping trailing dimensions that are sliced with scalars. In the data frame, data vector case, it's a little different because there isn't a tower of higher dimensional tensor types over this...
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provide a NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
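One way the "menagerie by composition" idea could be sketched (purely illustrative; none of these names come from the thread's code): small parametric wrappers that each add one capability and can be stacked.

# adds row/column names to anything indexable by a pair of integers
type Named{A}
    data::A
    rownames::Vector{ByteString}
    colnames::Vector{ByteString}
end

# adds an NA mask to anything indexable by a pair of integers
type Masked{A}
    data::A
    na::Matrix{Bool}
end

function nameindex(names, n)
    for i = 1:length(names)
        if names[i] == n
            return i
        end
    end
    error("no such name")
end

ref(x::Named, r::String, c::String) =
    x.data[nameindex(x.rownames, r), nameindex(x.colnames, c)]
ref(x::Masked, i::Int, j::Int) = x.na[i,j] ? NA : x.data[i,j]

# Named{Masked{Matrix{Float64}}} then behaves like the proposed DataMat,
# without introducing a new monolithic type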
Or maybe DataFrames are really special. They match up with the relational data model very nicely, and in that model there's no need to go up to tensors: all you need are tables with rows and columns. In that case, however, having named rows seems a bit weird (who names their data points?), and the DataMat type is a completely different kind of beast because it's not relational at all.
I think perhaps the best approach to indexing then is to always just do data[row,col]. If you want to extract a column, you write data[:,col] and if you want to extract a row, you write data[row,:]. That's simple, completely consistent, and supports named rows easily. Using a single index into a data frame would be an error.
I just disagree with this. Especially in an interactive environment, having to say a[:,"b"] is way, way more annoying than a$b (best, but maybe impossible) or a["b"]. 6, 1, and 4 keystrokes, respectively. Let's implement the 4-keystroke option now, and think about whether R-like 1-keystroke syntax is possible in the future.
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion. There's a consistency to this: always dropping trailing dimensions that are sliced with scalars. In the data frame, data vector case, it's a little different because there isn't a tower of higher dimensional tensor types over this...
I agree. For a DataTable, dat[:,1] should be a DataVec. Although I do think that for a DataMat, dat[:,1] should do whatever matrixes do, which it seems has not been decided. I don't have an opinion on that. Either a vector or a nx1 matrix seems reasonable...
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provided NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
It may be that DataMat should allow arbitrary matrix dimensionalities, implemented as you suggest. But I do think that, for the sake of eventual use by social scientists, we should have DataVecs be separate and as easy-to-use as possible.
Or maybe DataFrames are really special. They match up with the relational data model very nicely, and in that model there's no need to go up to tensors: all you need are tables with rows and columns. In that case, however, having named rows seems a bit weird (who names their data points?), and the DataMat type is a completely different kind of beast because it's not relational at all.
Yes, that's exactly right. DataFrames (Tables!) are special, and DataMats are non-relational.
Named rows isn't technically necessary in DataTables, but it's handy in some cases to have row names separate from the data. Say your row names are patient codes, and you have all of your data with columns "outcome", "predictor1", "predictor2". Then in R syntax you can do:
fit <- lm(outcome ~ ., dat)
Where the "." gets interpreted as "all other columns". If you have "P73" as a patient ID column, you can't do that.
I don't think it's a big deal to have a separate, optional row-name vector in the DataTable implementation...
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion.
I disagree with this, actually, but can see arguments about it either way. My reason for this has to do with confusion that results (it may only be my own personal confusion) when feeding a dimensionless vector into a linear algebra statement that requires dimensionality. E.g., if you slice a data frame with a dimension, lose the dimension, but then require it for subsequent linear algebra, it adds a layer that wouldn't be present if the dimension were never lost. But I don't feel strongly about it.
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provided NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
I like this idea, even though doubt I personally would ever use it. I do sort of think that dataframes are mostly just named arrays with NA handling. It's worth noting that in some other systems similar constructs contain more attributes per variable than just names (e.g., a description), although I don't really see it as being necessary (and isn't present in R).
On Sun, Mar 25, 2012 at 12:04 AM, kem <kristian...@gmail.com> wrote:
I think that data[1,1] should just be a value, not a 1x1 data frame. Also, data[:,1] ought to return a data vector rather than a nx1 data frame. Doing linear indexing into a data vector makes sense, so you could write data[:,col][row] as a verbose, inefficient way of writing data[row,col]. Along that same line of thinking, I actually feel that X[:,1] when slicing a matrix should be a vector rather than a column matrix, but that's a whole different discussion.
I disagree with this, actually, but can see arguments about it either way. My reason for this has to do with confusion that results (it may only be my own personal confusion) when feeding a dimensionless vector into a linear algebra statement that requires dimensionality. E.g., if you slice a data frame with a dimension, lose the dimension, but then require it for subsequent linear algebra, it adds a layer that wouldn't be present if the dimension were never lost. But I don't feel strongly about it.
Just to be clear, I think Stefan was talking about DataMats here, not DataTables.
But maybe there should be? The DataMat thing makes me feel like we're reinventing the wheel one piece at a time, instead of doing the whole thing at once. Are we going to want 3-index data tensors next? Or is DataMat really just a matrix with an NA mask and named rows and columns? If it's the latter, maybe we should just provided NamedArray that just wraps an array implementation and allows access to the rows and columns by name? And another wrapper that provides NA masking. That way, we get a whole systematic menagerie of data types by composition of parametric types instead of poking at the problem piecemeal.
I like this idea, even though doubt I personally would ever use it. I do sort of think that dataframes are mostly just named arrays with NA handling. It's worth noting that in some other systems similar constructs contain more attributes per variable than just names (e.g., a description), although I don't really see it as being necessary (and isn't present in R).
No, data.frames in R are most definitely not just named arrays with NA handling! They allow heterogeneous types and simple column indexing, and they afford a huge number of operations that make little sense with matrixes (even ones with names and NAs). The types of operations that people do on DataTables are not going to be linear-algebra-like. They're going to be map-reduce-like, and SQL-join-like, and reshape-like.
Good point about additional attributes per variable, but I agree with you -- doesn't seem necessary to store those as part of the data structure. As a side note, R implements row names, column names, dimension labels, and a bunch of other things by allowing arbitrary "attribute" lists on every object, no matter how simple. I don't think that that's a great idea for Julia's Data types...
This is a fair point. I guess that having data[col] select a column of a data frame is reasonable. I don't think any syntax shorter than that is ever going to happen though: we're really running out of syntax. More specifically, we're running out of ASCII characters to use for syntax. The obvious analogue of a$b in R would be a.b, but that already means something: field access. One of my favorite things about Julia is the lack of confusing syntax overloading, so I'm not really willing to have a.b sometimes mean something simple and fundamental like field access and other times mean something very different.

If someone comes up with a really clever syntactic solution, that's cool, but it just seems kind of unlikely. My favorite syntactic solution that lets us cram huge amounts of functionality into a single feature is non-standard string literals. Something general like that might just be possible, but I'm kind of skeptical.
Ok, for now, I'm going to suggest that we punt entirely on DataMat or anything like it.
I don't think the requirements and applications are clear enough. DataVec (or maybe it should be called DataCol) and DataFrame seem pretty clear and complete: DataFrame provides an essentially relational representation of named, ordered heterogeneously typed, nullable data. DataVec represents a single column of that. Given the connection to relational data representation, I'm wondering if we shouldn't maybe call these types Table and Column instead of DataFrame and DataVec. The names Table and Column are a bit generic, but I'm assuming all of this will have to be imported before use anyway.
Another thing about DataMat is that I'm not entirely convinced that it needs to exist at all. It's awfully close to Matrix. Can't you just provide a replacement value for NA when converting data from a data frame to some sort of matrix representation? If the data happens to be floating-point, as it very often would be, then NaN is even an obvious default replacement for NA. In any case, I'd really like to explore use cases for something like DataMat before we go ahead and implement it.
This makes a lot of sense, especially for plotting. Interpreting data point indices is deeply annoying. And of course sometimes data point labels are something like dates. So it should probably be more flexible than just allowing string labels.
On Sun, Mar 25, 2012 at 4:38 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
> This is a fair point. I guess that having data[col] select a column of a data frame is reasonable. I don't think any syntax shorter than that is ever going to happen though: we're really running out of syntax. More specifically, we're running out of ASCII characters to use for syntax. The obvious analogue of a$b in R would be a.b, but that already means something: field access. One of my favorite things about Julia is the lack of confusing syntax overloading, so I'm not really willing to have a.b sometimes mean something simple and fundamental like field access and other times mean something very different. If someone comes up with a really clever syntactic solution, that's cool, but it just seems kind of unlikely. My favorite syntactic solution that lets us cram huge amounts of functionality into a single feature is non-standard string literals. Something general like that might just be possible, but I'm kind of skeptical.
OK. Works for now for me.
Ok, for now, I'm going to suggest that we punt entirely on DataMat or anything like it.
Concur. It wasn't a high priority for me, anyway...
I don't think the requirements and applications are clear enough. DataVec (or maybe it should be called DataCol) and DataFrame seem pretty clear and complete: DataFrame provides an essentially relational representation of named, ordered heterogeneously typed, nullable data. DataVec represents a single column of that. Given the connection to relational data representation, I'm wondering if we shouldn't maybe call these types Table and Column instead of DataFrame and DataVec. The names Table and Column are a bit generic, but I'm assuming all of this will have to be imported before use anyway.
Hm, I'm OK with Table and Column! It's very clear for people with db backgrounds! Will rename those in my code presently...
I hate the name data.frame, but just Table can be confused with the other common use of table in a statistical environment, which is a cross-tabulation. At the expense of one letter (as compared to DataFrame), why not DataTable and DataCol?
One option would be for naFilter/naReplace to return an object of class DataVecIterator, or something, that contains/wraps a DataVec:
type DataVecIterator
datavec::DataVec
end
The start/next/done methods for a DataVecIterator would then have to dive through the extra layer of reference to get the data.
Another option would be for DataVec and FilteredDataVec and ReplacedDataVec to be implementations of an AbstractDataVec type, where almost all methods would refer to AbstractDataVec, except for start/next/done, which would have specific implementations that do the right thing. naFilter(d::DataVec) would then just create a FilteredDataVec, referring to the original data and na vectors. Similar for ReplacedDataVec.
A third option would be to keep just a single type, but add filterNA and replaceNA fields to the type, defaulting to false and nothing, respectively. naFilter and naReplace then just create new objects (referring to the old data) with the appropriate changes to those fields, which start/next/done would examine at run-time.
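For concreteness, the second option might look roughly like this (hypothetical code, not from the thread; it relies on the start/next/done iteration protocol, which generic reductions like mean would be written against):

abstract AbstractDataVec{T}

type DataVec{T} <: AbstractDataVec{T}
    data::Vector{T}
    na::Vector{Bool}
end

type FilteredDataVec{T} <: AbstractDataVec{T}
    data::Vector{T}    # shares the parent's vectors; naFilter copies nothing
    na::Vector{Bool}
end

naFilter{T}(d::DataVec{T}) = FilteredDataVec{T}(d.data, d.na)

# plain DataVec iteration visits every element (NA entries included)
start(v::DataVec) = 1
done(v::DataVec, i) = i > length(v.data)
next(v::DataVec, i) = (v.na[i] ? NA : v.data[i], i + 1)

# find the next index that is not flagged NA
function skipna(v, i)
    while i <= length(v.data) && v.na[i]
        i += 1
    end
    i
end

# filtered iteration skips NA entries entirely
start(v::FilteredDataVec) = skipna(v, 1)
done(v::FilteredDataVec, i) = i > length(v.data)
next(v::FilteredDataVec, i) = (v.data[i], skipna(v, i + 1))

A mean written once against this protocol then works for both, and mean(naFilter(x)) skips NAs without copying anything.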
Sorry for the delay--I was getting caught up on some other things and wanted to let this settle in my mind a bit anyway.
One option would be for naFilter/naReplace to return an object of class DataVecIterator, or something, that contains/wraps a DataVec:
type DataVecIterator
datavec::DataVec
end
The start/next/done methods for a DataVecIterator would then have to dive through the extra layer of reference to get the data.
I liked this option the least. I'm not sure I have a good reason for this, other than your last point--it seems like the extra layer could complicate things or slow things down performance-wise. I have a general distrust of these sorts of hierarchies, though.
Another option would be for DataVec and FilteredDataVec and ReplacedDataVec to be implementations of an AbstractDataVec type, where almost all methods would refer to AbstractDataVec, except for start/next/done, which would have specific implementations that do the right thing. naFilter(d::DataVec) would then just create a FilteredDataVec, referring to the original data and na vectors. Similar for ReplacedDataVec.
A third option would be to keep just a single type, but add filterNA and replaceNA fields to the type, defaulting to false and nothing, respectively. naFilter and naReplace then just create new objects (referring to the old data) with the appropriate changes to those fields, which start/next/done would examine at run-time.
I vacillate between these two options.
I like the third option because it seems straightforward and flexible, and it seems like the fields could be useful for other reasons, or be expanded upon later (e.g., just hypothetically, in the abstract, if later it was decided you wanted to have multiple types of replace). On the other hand, that adds a teeny bit of overhead for every single DataVec, etc. I could also see a possible need to enforce certain constraints between the filterNA and replaceNA fields, in case there is a conflict.
The second option avoids that overhead for DataVec, but then it seems like adds a little complexity in other ways -- e.g., if I were writing a function operating on DataVecs, I'd probably prefer the third option because the fields would always be there, and I wouldn't have to worry about making inferences about the type of DataVec.
It sort of depends on exactly how these things would be implemented--e.g., what you mean by "changes to those fields" and "referring to the original data and na vectors." What would go in those fields and how would the na vectors be represented?
In the back of my mind, I'm sort of trying to imagine writing a function that implements an EM algorithm and trying to figure out what would be most desirable to work with. I think it depends on the details.