Julia's subsetting is unpredictable. Consider str = "sdfdsfd"; then
str[a] returns a char or a string depending on whether a is a range or a
number. Thus in programs one will always have to check explicitly whether
str[a] is a char or a string. Same story with arrays: A[a, 1] returns
either a number or an array.
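For example (with literal indices here, but the same applies when the
index is a variable):

str = "sdfdsfd"
str[2]        # 'd'     -- a Char
str[2:2]      # "d"     -- a String
A = [1 2; 3 4]
A[1, 1]       # 1       -- a number
A[1:2, 1]     # [1, 3]  -- an array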
With the addition of new, complex data structures this problem will
likely cause real havoc at some stage. I've already noticed a discussion
of an R-style data.frame implementation, which proposed that DF[1, :] is
a data.frame, DF[:, 1] is a vector, but DF[1, 1] is a singleton type.
I understand that currently subsetting is driven by the type of the
argument: if the argument is an array, return an array; if an integer,
return an integer. But this will not hold for more complex types like
data frames. You cannot return an array by subsetting a data.frame with
an array. What will you return if the argument is a tuple? Every type
designer will invent his own subsetting rules based on the type of the
argument, which will make life hard for everyone else.
The R folks had a brilliant idea: two subsetting operators, [] and [[]],
one for non-destructive subsetting, which preserves the original type,
and one for type-demoting subsetting. But they screwed it up in the
implementation (partly because they drop array dimensions). As a result
you never know whether an object is a vector, an array, or a number.
Consider this in R:
1) Numeric matrix A:
A[1, 1] or A[1, 1:2] -> a numeric vector (NULL dimension) [inconsistent]
2) List L or expression EXPR
L[a] -> always a list [consistent]
EXPR[a] -> always an expression [consistent]
L[[1]], EXPR[[1]] -> elemental type [consistent]
3) Data frame DF:
DF[1] DF[1, ] DF[, 1] -> data.frame [consistent]
DF[1, 1] -> vector [inconsistent]
4) Environment E (hash-table equivalent):
E[] is not implemented
E[["abc"]] -> element [consistent]
E[[1]] is not implemented
Note that the biggest trouble is the first one, which drops the
dimensions. R code is riddled with "if (is.vector(A)){...} else{ ... }".
Thanks, Julia, for not dropping array dims!
R doesn't have elementary types like Int or Char; the most basic type is
a vector. This keeps things pretty simple. In Julia the degrees of
freedom for arbitrary subsetting rules are much higher, since it allows
low-level types. Every designer of a new type will choose the behavior
of [] as he likes.
I dare to propose introducing a demoting subsetting operator [[]] (which
might potentially also drop dimensions). If A is a matrix then A[[1, :]]
is a vector, A[[1, 1]] is an elemental type, STR[[2]] is a char, and so
on. But [] subsetting would always keep the parent type.
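If the syntax itself is contentious, the intent can at least be spelled
out with ordinary functions (hypothetical names, just a sketch):

# what I would like [] to do: always preserve the parent type
preserving(s::String, i::Int) = s[i:i]            # still a String
# what [[]] would do: demote to the elemental type / drop dimensions
demoting(s::String, i::Int) = s[i]                # a Char
demoting(A::Matrix, i::Int, j::Int) = A[i, j]     # an elemental type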
Sorry if this has already been discussed somewhere else.
My 2 cents,
Vitalie.
> 3) Data frame DF:
> DF[1] DF[1, ] DF[, 1] -> data.frame [consistent]
> DF[1, 1] -> vector [inconsistent]
Dammit, after so many years in R I still cannot write it from
memory. It's like this:
DF[1, ] -> data.frame
DF[, 1] or DF[, "foo"] -> vector
DF[1] -> DF["foo"] -> data.frame
DF[1, 1] -> vector with length 1
Hate this.
> On Tue, Apr 10, 2012 at 5:43 AM, Vitalie Spinu <spin...@gmail.com> wrote:
>> Julia's subsetting is unpredictable. Consider str = "sdfdsfd", then
>> str[a] returns a char or a string depending on whether a is a range or a
>> number. Thus in programs one always will have to explicitly check if
>> str[a] is a char or a string.
> Under what circumstances would you have to check? str[a:b] always returns a
> string, even if a==b, right? Programmatically, you're either going to be
> indexing with a range or you're not.
Often you don't know the type of a in str[a] in advance. It might be a
user input or the result of some other computation. You will always have
to convert it to a Range in order to ensure there is no breakage.
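(I.e., one ends up defensively writing something like
str[isa(a, Integer) ? (a:a) : a] everywhere just to be safe.)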
>> With the addition of new, complex data structures this problem will
>> likely cause real havoc at some stage. I've already noticed a
>> discussion of R's data.frames implementation, which proposed that DF[1,
>> :] is a data.frame, DF[:, 1] is a vector, but DF[1, 1] is a singleton
>> type.
>>
>> I understand that currently subsetting is driven by the type of the
>> argument, if argument is an array return array, if an integer return
>> integer. But this will not be true for more complex types like data
>> frames. You cannot return an array by subsetting a data.frame with an
>> array. What will you return if an argument is a tuple? Every type
>> designer will invent his own subsetting rules based on the type of an
>> argument, which will make a hard life for everyone else.
>>
> I'll take the DataFrame aspect of this... Look, a DataFrame (or a
> data.frame, or a data.table) is not a matrix, full stop. It's a list of
> heterogeneous vectors of the same length. There's no reason why it needs to
> have the same semantics as a matrix or n-dimensional array. It should have
> the semantics that make it easiest for people dealing with tabular data to
> interactively build and work with the data structure. I fully, fully agree
> that there are giant inconsistencies in R that make programming difficult
> (don't get me started on sample()!), but I think the goal should be
> sensible semantics within each data structure type.
This is the biggest problem. Everyone considers what is sensible in
different ways. There will be 10 different data.xxx structures by 10
different authors, with 10 different "sensible" semantics.
But one cannot even remember the semantics of a data.frame in R :(
> As you say, for R:
> DF[1, ] -> data.frame
>> DF[, 1] or DF[, "foo"] -> vector
>> DF[1] -> DF["foo"] -> data.frame
>> DF[1, 1] -> vector with length 1
>>
> And also DF$foo -> vector, which sadly is unlikely to happen in Julia.
I don't think going without it is an option; there should be something
of the sort. How about DF@foo? @ is not taken as a postfix operator, is
it?
> And for Julia (proposed, half implemented in my fork):
> Any combination of two-argument ref's return a DataFrame, except when both
> arguments are simple index types:
> DF[scalar, :] -> 1-row DataFrame
> DF[vector, :] or DF[range, :] -> n-row DataFrame
> DF[:, scalar] -> 1-col DataFrame
> DF[:, vector] or DF[:, range] -> n-col DataFrame
Great.
> DF[scalar, scalar] -> scalar
As I said, I am not very convinced here. But at least it's consistent
with Julia's way of doing it.
> Note that vectors can include boolean vectors, row/column names, or
> (possibly singleton) ranges. So DF[1, 1:1] returns a 1x1 DataFrame. As does
> DF[[1], [1]], because 1-element vectors in Julia are not the same as
> scalars.
Good.
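Indeed, Julia already makes the scalar vs. one-element-vector distinction
at the base level:

typeof(1)      # Int64
typeof([1])    # Array{Int64,1}, i.e. a one-element vector -- a different type

so carrying it into DataFrame indexing is the consistent thing to do.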
> Single-argument ref always returns a column DataVec:
> DF[i] or DF["cat"]
It's the same as in the "scalar" case above, so it's consistent. I
presume you have DF[vector] as well.
> I don't think there are any inconsistencies here. Requiring users to do
> DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
> that's not an option.
Right, but DF[["cat"]] or DF@cat might possibly be!
> For Julia DataFrames, the real squickiness, it seems like, will be in the
> demotion to simple arrays, and when and where that happens. In R, you can
> have NAs all the way down, but I think in Julia as soon as you want to
> start doing heavy math on a DataFrame, you need to resolve the NAs early
> and convert to a matrix of homogeneous type. I've been playing with
> expressions like mean(naFilter(dv)) to efficiently deal with NAs in core
> math functions, but it's going to take some iteration when we get to
> DataFrames and things like conversions based on formulas to model matrices.
> I tend to think, at the moment, that most mathematical operations won't
> work at all on DataFrames, cf R.
I really hope that we will be able to do heavy stuff with DataFrames in
Julia. If you want a language for data analysis, you must be able to do
that.
> Incidentally, Vitalie, what do you think about Pandas' DataFrames or the R
> data.table types? To me, Pandas DataFrames are too focused on time-series
> data, and have the expectation that non-homogeneous types are mostly just
> for indexing, which isn't generally true. And if you have trouble with
> data.frame semantics, I assume just thinking about data.table makes you
> grind your teeth!
I don't know Pandas, but as for data.table, on the contrary, it is a
remarkable idea. I would have liked to have DT[v > 2, ] semantics by
default in R (here v is a column of DT), because, as you said, it's
super handy for interactive computation. But then an escape like
DT[.(v) > 2, ] is necessary to refer to a "v" variable in the outer
scope. The opposite convention would also be fine: DT[.(v) > 2, ] to
refer to a column of DT instead.
I hope the above is possible in Julia, is it?
Bad things about DT:
DT["v"] subsets rows by key, even if you have a column "v"; confusing.
DT[, sum(v), by=x] is awfully ugly, given the [] semantics in R. Why we
don't have DT[by=x, do=sum(v)] instead I will never get.
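For what it's worth, the by/do idea is easy to write down in plain
Julia; a rough sketch, with made-up names:

# sum v within the groups defined by x -- what DT[by=x, do=sum(v)] would mean
function groupsum(x::Vector, v::Vector)
    h = Dict{Any, Any}()
    for i in 1:length(x)
        h[x[i]] = get(h, x[i], 0) + v[i]
    end
    return h
end

groupsum(["a", "b", "a"], [1, 2, 3])    # -> "a" => 4, "b" => 2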
Vitalie.
>> I don't think there are any inconsistencies here. Requiring users to do
>> DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
>> that's not an option.
> Right, but DF[["cat"]] or DF@cat might possibly be!
(Side note: there will be partial matching on row/column indexes over my dead body.)
Now that I think of it, I want to half-suggest that sqldf might be the better model for Julia, given the impossibility of dealing with bare words. One approach would be something using non-standard string literals to generate a DSL.
df2 = do(Q"select col1, col2 from df1 where col3 = 'dog' and col4 > 7")
Another approach would be some variation on Pandas' split-apply-combine, with a DSL for the grouping:
combine(apply(split(df, G"col1 and col3"), x -> sum(x["col2"])))
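Very roughly, and with entirely made-up names, the string-literal half
could be as simple as capturing the query text and leaving the actual
interpretation to a separate runner:

# A non-standard string literal that just hands back the raw query text;
# a hypothetical do(query) / select(query, df) would parse it and build
# the resulting DataFrame.
macro Q_str(s)
    s
end

q = Q"select col1, col2 from df1 where col3 = 'dog' and col4 > 7"
# do(q) would then interpret q against df1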
On Tuesday, April 10, 2012 10:44:03 AM UTC-5, Harlan Harris wrote:
> I don't think there are any inconsistencies here. Requiring users to do
> DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
> that's not an option.
For what it's worth, I always liked having this option (if not a requirement), as it corresponded to a series of nested functions being applied to the data frame and seemed logical to me:
(DF["cat"])[3]
( ( DF[1:10,:] )[:,1] ) [1]
It can get confusing, but is sometimes handy and always seemed consistent to me in a lisp-y sort of way (maybe even more consistent than the DF$var syntax).
> (Side note: there will be partial matching on row/column indexes over my dead body.)
For some reason I'm drawing a blank on what you mean by this.
I like the idea in general of a sort of data-structure DSL; it seems
flexible to me. Neither of the two above seems particularly appealing to
me as a direct inspiration, though; I think something closer to what you
might see elsewhere would suit me better. E.g., very roughly
df2 = select(df1, [col1, col2], ["col3=='dog'", "col4 > 7"])
Anyway, this is relatively minor, but how are you thinking of handling
negative indices? E.g., would DF[-1, :] remove the first row? Or not be
allowed at all?
I have far too much programming, writing, and other work to do to
defend myself in this public forum, but I don't think I have ever said
that data *should* be in one format versus the other. My view is that
you can implement a data structure that performs both column- and
row-wise operations equally well, subject only to cache performance due
to the actual memory layout (C/row-major or Fortran/column-major order).
I also very strongly disagree with the notion that pandas is
necessarily designed for time series data. Since I started building
pandas in a financial setting many of my examples on the internet come
from that domain because it's a very easy-to-understand unit of data
having a meaningful ordered row index. A small piece of pandas (<< 20%
of the codebase, I could look and give you an exact measurement) could
be completely torn out, thus removing all time series functionality
without harming any other library features. So please, do not spread
misinformation that pandas is not good for other kinds of data
analysis, because that is simply not true (you might have written this
before I clarified in private e-mail) =P
> OK, back to work!
>
> -Harlan
>
- Wes
>>>> Harlan Harris <har...@harris.name>
I completely agree with Wes. Data frames in R are an old and inflexible
data structure that treats columns and rows asymmetrically. I don't know
much about Pandas, but if it treats columns and rows symmetrically it
must be a very smart tool.
Reshaping in R is painful even with the reshape2 package, because
data.frame doesn't have a notion of row variables. But, Harlan, I also
think that your work on data frames in Julia is extremely useful and a
must-have, as people are used to data.frames and more similarity is
always better.
As to row names, I am not sure they are really useful. At least for me
they were a continuous source of headaches in R. A far more fundamental
concept is the key. You can have many keys in a data.frame and use all
of them to index rows. For example:
DF[("b"), ] would extract the part of the data frame whose row key
matches "b". Or DF[("b", 3), ] would extract the rows whose first key
matches "b" and whose second key matches 3. Of course one would also
need some sort of labeling for this:
DF[(:key1 "b", :key2 3), ]
You could have a primary key, which by default would just be the
row-number key, and use DF[2:3, ] to index the rows without the tuple
notation.
But named tuples are not yet implemented in Julia, right? And it would
be really nice to have that first, before thinking further about complex
indexing in data structures.
Also, by using keys, one can from the very beginning implement
non-linear search for indexing, which is hugely more efficient than the
linear search in R's data.frames. This is how indexing in R's data.table
package works.
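To be concrete, a key is essentially a hash index over a column; a rough
sketch with made-up names:

# key value -> row numbers; lookups then avoid a linear scan over all rows
function build_index(keycol::Vector)
    idx = Dict{Any, Vector{Int}}()
    for i in 1:length(keycol)
        rows = get!(idx, keycol[i], Int[])
        push!(rows, i)
    end
    return idx
end

idx = build_index(["a", "b", "a", "c"])
idx["b"]    # -> [2], i.e. the rows that DF[("b"), ] would select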
>>>> Stefan Karpinski <ste...@karpinski.org>
Things like this:
http://stackoverflow.com/a/10149202/776560
are very tricky / ad hoc to do without a consistent and coherent way to
label / index any dimension.
Have any of you thought much about how Julia might provide
implementations of time-series data structures like zoo or xts in R,
which provide rudimentary auto-alignment functionality (I've taken it a
step further)? In pandas I'm able to do that only by having a different
row index type, which is actually very nice for users: one completely
consistent data structure for doing everything (rather than R's mixed
bag of 8 different data structures that all do slightly different
things).
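By auto-alignment I mean roughly the following, sketched here with a
plain Dict standing in for a real index type:

# add two (timestamp -> value) series only on the timestamps they share;
# a real implementation would keep the union and fill the gaps with NA
function align_add(a::Dict, b::Dict)
    out = Dict{Any, Any}()
    for (t, va) in a
        if haskey(b, t)
            out[t] = va + b[t]
        end
    end
    return out
end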
- Wes
f(str::String, x::Int) = ... str[x] ...      # one method per index type,
f(str::String, x::Range1) = ... str[x] ...   # so each return type is known

g(str::String, x) = ... str[x] ...           # or a single generic method

function g(str::String, idxs...)
    h = HashTable()
    for idx in idxs
        h[idx] = str[idx]    # each entry keeps whatever type str[idx] yields
    end
    return h
end

julia> g("Hello, world.\n", 2, 1:3, 5)
{1:3=>"Hel",5=>'o',2=>'e',}
I did have an email exchange with Achim Zeileis a long time back on the
performance of zoo, and he shared some snippets of code with me that did
some kind of complex dynamic programming. He eventually ended up writing
this kind of stuff in C, callable from R. It would be great to get a
discussion going with Achim and others.
-viral