Subsetting: Is Julia making the same mistake R made 20 years ago?


Vitalie Spinu

Apr 10, 2012, 5:43:43 AM
to julia-dev

Hi,

Julia's subsetting is unpredictable. Consider str = "sdfdsfd": str[a]
returns a char or a string depending on whether a is a number or a range.
Thus in programs one will always have to check explicitly whether str[a]
is a char or a string. Same story with arrays: A[a, 1] returns a
number or an array.
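For concreteness, the behavior in question looks like this (results shown
are from present-day Julia; note that a singleton range still yields a
string):

str = "sdfdsfd"
str[2]     # 'd'  - a Char, because the index is a number
str[2:3]   # "df" - a String, because the index is a range
str[3:3]   # "f"  - still a String, even for a one-element range

A = [1 2; 3 4]
A[1, 1]    # 1 - a scalar
A[1:1, 1]  # a 1-element array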

With the addition of new, complex data structures this problem is
likely to cause real havoc at some stage. I've already noticed a
discussion of R's data.frames implementation, which proposed that DF[1,
:] is a data.frame, DF[:, 1] is a vector, but DF[1, 1] is a singleton
type.

I understand that subsetting is currently driven by the type of the
argument: if the argument is an array, return an array; if an integer,
return an integer. But this will not hold for more complex types like
data frames. You cannot return an array by subsetting a data.frame with
an array. What will you return if the argument is a tuple? Every type
designer will invent his own subsetting rules based on the type of the
argument, which will make life hard for everyone else.

R folks had a brilliant idea: two subsetting operators, [] and [[]], one
for non-destructive subsetting, which preserves the original type,
and another for type-demoting subsetting. But they screwed it up in the
implementation (partly because they drop array dimensions). As a
result you never know whether an object is a vector, an array, or a
number. Consider this in R:

1) Numeric matrix A:
A[1, 1] or A[1, 1:2] -> a numeric vector (NULL dimension) [inconsistent]

2) List L or expression EXPR
L[a] -> always a list [consistent]
EXPR[a] -> always an expression [consistent]
L[[1]], EXPR[[1]] -> elemental type [consistent]

3) Data frame DF:
DF[1] DF[1, ] DF[, 1] -> data.frame [consistent]
DF[1, 1] -> vector [inconsistent]

4) Environment E (hash-table equivalent):
E[] is not implemented
E[["abc"]] -> element [consistent]
E[[1]] is not implemented

Note that the biggest trouble is the first one, which drops the
dimensions. R code is riddled with "if (is.vector(A)) {...} else {
... }". Thanks, Julia, for not dropping array dims!

R doesn't allow elementary types like Int or Char; the most granular
type is a vector. This keeps things pretty easy. In Julia the degrees of
freedom for arbitrary subsetting rules are much higher, since it allows
low-level types. Every designer of a new type will choose the behavior
of [] as he likes.
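Nothing in the language stops two authors from picking clashing
conventions. A toy sketch (both types are invented here purely for
illustration, in modern Julia syntax):

struct TableA; cols::Vector{Vector{Int}}; end
struct TableB; cols::Vector{Vector{Int}}; end

Base.getindex(t::TableA, i::Integer) = t.cols[i]               # author A: [] picks a column
Base.getindex(t::TableB, i::Integer) = [c[i] for c in t.cols]  # author B: [] picks a row

ta = TableA([[1, 2], [3, 4]]); ta[1]   # [1, 2] - a column
tb = TableB([[1, 2], [3, 4]]); tb[1]   # [1, 3] - a row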

I dare propose introducing a demoting subsetting operator [[]] (which
could potentially also drop dimensions). If A is a matrix, then A[[1, :]]
is a vector, A[[1, 1]] is an elemental type, STR[[2]] is a char, and so
on. But [] subsetting would always preserve the parent type.
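Since [[ ]] is not Julia syntax, here is a sketch of the intended
semantics with an ordinary function, demote, standing in for the proposed
operator. Nothing below exists in Julia; it only illustrates the
proposal, under which plain [] would keep A[1, 1] a 1x1 matrix and str[2]
a one-character string:

demote(A::AbstractMatrix, i::Integer, ::Colon) = vec(A[i:i, :])  # A[[i, :]] -> Vector
demote(A::AbstractMatrix, i::Integer, j::Integer) = A[i, j]      # A[[i, j]] -> scalar
demote(s::AbstractString, i::Integer) = s[i]                     # STR[[i]] -> Char

A = [1 2; 3 4]
demote(A, 1, :)       # [1, 2] - dimensions dropped
demote(A, 1, 1)       # 1 - an elemental type
demote("sdfdsfd", 2)  # 'd'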

Sorry if this has already been discussed somewhere else.

My 2 cents,
Vitalie.

Vitalie Spinu

Apr 10, 2012, 7:35:07 AM
to julia-dev
>>>> Vitalie Spinu <spin...@gmail.com>

>>>> on Tue, 10 Apr 2012 11:43:43 +0200 wrote:

> 3) Data frame DF:
> DF[1] DF[1, ] DF[, 1] -> data.frame [consistent]
> DF[1, 1] -> vector [inconsistent]

Dammit, after so many years in R I still cannot write it from
memory. It's like this:

DF[1, ] -> data.frame
DF[, 1] or DF[, "foo"] -> vector
DF[1] -> DF["foo"] -> data.frame
DF[1, 1] -> vector with length 1

Hate this.

Harlan Harris

Apr 10, 2012, 8:36:37 AM
to juli...@googlegroups.com
On Tue, Apr 10, 2012 at 5:43 AM, Vitalie Spinu <spin...@gmail.com> wrote:
> Julia's subsetting is unpredictable. Consider str = "sdfdsfd": str[a]
> returns a char or a string depending on whether a is a number or a range.
> Thus in programs one will always have to check explicitly whether str[a]
> is a char or a string.

Under what circumstances would you have to check? str[a:b] always returns a string, even if a==b, right? Programmatically, you're either going to be indexing with a range or you're not.

> With the addition of new, complex data structures this problem is
> likely to cause real havoc at some stage. I've already noticed a
> discussion of R's data.frames implementation, which proposed that DF[1,
> :] is a data.frame, DF[:, 1] is a vector, but DF[1, 1] is a singleton
> type.
>
> I understand that subsetting is currently driven by the type of the
> argument: if the argument is an array, return an array; if an integer,
> return an integer. But this will not hold for more complex types like
> data frames. You cannot return an array by subsetting a data.frame with
> an array. What will you return if the argument is a tuple? Every type
> designer will invent his own subsetting rules based on the type of the
> argument, which will make life hard for everyone else.

I'll take the DataFrame aspect of this... Look, a DataFrame (or a data.frame, or a data.table) is not a matrix, full stop. It's a list of heterogeneous vectors of the same length. There's no reason why it needs to have the same semantics as a matrix or n-dimensional array. It should have the semantics that make it easiest for people dealing with tabular data to interactively build and work with the data structure. I fully, fully agree that there are giant inconsistencies in R that make programming difficult (don't get me started on sample()!), but I think the goal should be sensible semantics within each data structure type.

As you say, for R:

DF[1, ] -> data.frame
DF[, 1] or DF[, "foo"] -> vector
DF[1] -> DF["foo"] -> data.frame
DF[1, 1] -> vector with length 1

And also DF$foo -> vector, which sadly is unlikely to happen in Julia.

And for Julia (proposed, half implemented in my fork):

Any combination of two-argument refs returns a DataFrame, except when both arguments are simple index types:
DF[scalar, :] -> 1-row DataFrame
DF[vector, :] or DF[range, :] -> n-row DataFrame
DF[:, scalar] -> 1-col DataFrame
DF[:, vector] or DF[:, range] -> n-col DataFrame
DF[scalar, scalar] -> scalar
Note that vectors can include boolean vectors, row/column names, or (possibly singleton) ranges. So DF[1, 1:1] returns a 1x1 DataFrame. As does DF[[1], [1]], because 1-element vectors in Julia are not the same as scalars.

Single-argument ref always returns a column DataVec:
DF[i] or DF["cat"]

I don't think there are any inconsistencies here. Requiring users to do DF["cat"][3] to get a scalar will be slow to type and slow to execute, so that's not an option.
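To make those rules concrete, here is a toy sketch of the dispatch
involved, in present-day Julia syntax (MiniFrame is invented for
illustration; this is not the actual DataFrame code):

struct MiniFrame
    names::Vector{String}
    cols::Vector{Vector}              # heterogeneous columns of equal length
end

nrows(df::MiniFrame) = isempty(df.cols) ? 0 : length(df.cols[1])

# Promote scalar and colon indices to ranges so slices stay MiniFrames.
span(i::Integer, n) = i:i
span(::Colon, n) = 1:n
span(i, n) = i                        # vectors and ranges pass through

# Both arguments scalar -> a scalar element.
Base.getindex(df::MiniFrame, i::Integer, j::Integer) = df.cols[j][i]

# Anything else -> another MiniFrame.
function Base.getindex(df::MiniFrame, i, j)
    cs = span(j, length(df.cols))
    rs = span(i, nrows(df))
    MiniFrame(df.names[cs], [df.cols[k][rs] for k in cs])
end

# Single argument -> the bare column (stand-in for a DataVec).
Base.getindex(df::MiniFrame, j::Integer) = df.cols[j]

df = MiniFrame(["a", "b"], Vector[[1, 2, 3], ["x", "y", "z"]])
df[1, 1]     # 1 - a scalar, because both indices are scalars
df[1, 1:1]   # a 1x1 MiniFrame, because 1:1 is a range, not a scalar
df[2]        # ["x", "y", "z"] - the column itself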

For Julia DataFrames, the real squickiness, it seems like, will be in the demotion to simple arrays, and when and where that happens. In R, you can have NAs all the way down, but I think in Julia as soon as you want to start doing heavy math on a DataFrame, you need to resolve the NAs early and convert to a matrix of homogeneous type. I've been playing with expressions like mean(naFilter(dv)) to efficiently deal with NAs in core math functions, but it's going to take some iteration when we get to DataFrames and things like conversions based on formulas to model matrices. I tend to think, at the moment, that most mathematical operations won't work at all on DataFrames, cf R.
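A toy sketch of that naFilter idea (the DataVec type here is invented for
illustration and is not the actual package code):

using Statistics   # for mean() on present-day Julia

struct DataVec{T}
    data::Vector{T}
    na::BitVector                  # true where the value is missing
end

naFilter(dv::DataVec) = dv.data[.!dv.na]   # keep only the non-NA entries

dv = DataVec([1.0, 2.0, 4.0], BitVector([false, true, false]))
mean(naFilter(dv))   # 2.5 - core math runs on a plain, NA-free vector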

Incidentally, Vitalie, what do you think about Pandas' DataFrames or the R data.table types? To me, Pandas DataFrames are too focused on time-series data, and have the expectation that non-homogeneous types are mostly just for indexing, which isn't generally true. And if you have trouble with data.frame semantics, I assume just thinking about data.table makes you grind your teeth!

 -Harlan

Vitalie Spinu

Apr 10, 2012, 9:55:56 AM
to juli...@googlegroups.com
>>>> Harlan Harris <har...@harris.name>

>>>> on Tue, 10 Apr 2012 08:36:37 -0400 wrote:

> On Tue, Apr 10, 2012 at 5:43 AM, Vitalie Spinu <spin...@gmail.com> wrote:
>> Julia's subsetting is unpredictable. Consider str = "sdfdsfd": str[a]
>> returns a char or a string depending on whether a is a number or a range.
>> Thus in programs one will always have to check explicitly whether str[a]
>> is a char or a string.

> Under what circumstances would you have to check? str[a:b] always returns a
> string, even if a==b, right? Programmatically, you're either going to be
> indexing with a range or you're not.

Often you don't know the type of a in str[a] in advance. It might be
user input or the result of some other computation. You will always have
to convert it to a Range to ensure there is no breakage.
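That defensive conversion could be written as a small helper (asrange is
invented here for illustration, not a standard function):

asrange(i::Integer) = i:i       # promote a scalar index to a singleton range
asrange(r::AbstractRange) = r   # ranges pass through unchanged

str = "sdfdsfd"
x = 3                 # might equally have come in as 3:4
str[asrange(x)]       # "f" - always a String, whatever x was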

>> With the addition of new, complex data structures this problem is
>> likely to cause real havoc at some stage. I've already noticed a
>> discussion of R's data.frames implementation, which proposed that DF[1,
>> :] is a data.frame, DF[:, 1] is a vector, but DF[1, 1] is a singleton
>> type.
>>
>> I understand that subsetting is currently driven by the type of the
>> argument: if the argument is an array, return an array; if an integer,
>> return an integer. But this will not hold for more complex types like
>> data frames. You cannot return an array by subsetting a data.frame with
>> an array. What will you return if the argument is a tuple? Every type
>> designer will invent his own subsetting rules based on the type of the
>> argument, which will make life hard for everyone else.

> I'll take the DataFrame aspect of this... Look, a DataFrame (or a
> data.frame, or a data.table) is not a matrix, full stop. It's a list of
> heterogeneous vectors of the same length. There's no reason why it needs to
> have the same semantics as a matrix or n-dimensional array. It should have
> the semantics that make it easiest for people dealing with tabular data to
> interactively build and work with the data structure. I fully, fully agree
> that there are giant inconsistencies in R that make programming difficult
> (don't get me started on sample()!), but I think the goal should be
> sensible semantics within each data structure type.

This is the biggest problem: everyone considers different things
sensible. There will be 10 different data.xxx structures by 10
different authors, with 10 different "sensible" semantics.

But one cannot even remember the semantics of a data.frame in R :(

> As you say, for R:

> DF[1, ] -> data.frame
> DF[, 1] or DF[, "foo"] -> vector
> DF[1] or DF["foo"] -> data.frame
> DF[1, 1] -> vector with length 1

> And also DF$foo -> vector, which sadly is unlikely to happen in Julia.

I don't think that's an option, but there should be something. How
about DF@foo? @ is not taken as a postfix operator, is it?

> And for Julia (proposed, half implemented in my fork):

> Any combination of two-argument refs returns a DataFrame, except when both
> arguments are simple index types:
> DF[scalar, :] -> 1-row DataFrame
> DF[vector, :] or DF[range, :] -> n-row DataFrame
> DF[:, scalar] -> 1-col DataFrame
> DF[:, vector] or DF[:, range] -> n-col DataFrame

Great.

> DF[scalar, scalar] -> scalar

As I said, I am not very convinced here. But at least it's consistent
with Julia's way of doing it.

> Note that vectors can include boolean vectors, row/column names, or
> (possibly singleton) ranges. So DF[1, 1:1] returns a 1x1 DataFrame. As does
> DF[[1], [1]], because 1-element vectors in Julia are not the same as
> scalars.

Good.

> Single-argument ref always returns a column DataVec:
> DF[i] or DF["cat"]

It's the same as the "scalar" case above, so it's consistent. I presume
you have DF[vector] as well.

> I don't think there are any inconsistencies here. Requiring users to do
> DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
> that's not an option.

Right, but DF[["cat"]] or DF@cat might possibly be!

> For Julia DataFrames, the real squickiness, it seems like, will be in the
> demotion to simple arrays, and when and where that happens. In R, you can
> have NAs all the way down, but I think in Julia as soon as you want to
> start doing heavy math on a DataFrame, you need to resolve the NAs early
> and convert to a matrix of homogeneous type. I've been playing with
> expressions like mean(naFilter(dv)) to efficiently deal with NAs in core
> math functions, but it's going to take some iteration when we get to
> DataFrames and things like conversions based on formulas to model matrices.
> I tend to think, at the moment, that most mathematical operations won't
> work at all on DataFrames, cf R.

I really hope we will be able to do heavy stuff with DataFrames in
Julia. If you want a language for data analysis, you must be able to.

> Incidentally, Vitalie, what do you think about Pandas' DataFrames or the R
> data.table types? To me, Pandas DataFrames are too focused on time-series
> data, and have the expectation that non-homogeneous types are mostly just
> for indexing, which isn't generally true. And if you have trouble with
> data.frame semantics, I assume just thinking about data.table makes you
> grind your teeth!

I don't know Pandas, but as for data.table, on the contrary, I think it
is a remarkable idea. I would have liked DT[v > 2, ] semantics by
default in R (here v is a column of DT) because, as you said, it's
super handy for interactive computation. But then an escape like
DT[.(v) > 2, ] is necessary to refer to the variable "v" in the outer
scope. The opposite convention is also fine: DT[.(v) > 2, ] to refer to
a column of DT.

I hope the above is possible in Julia, is it?

Bad things about DT:
DT["v"] subsets rows by key even if you have a column "v"; confusing.
DT[, sum(v), by=x] is awfully ugly given the [] semantics in R. Why we
don't have DT[by=x, do=sum(v)] I will never understand.

Vitalie.

Harlan Harris

Apr 10, 2012, 11:44:03 AM
to juli...@googlegroups.com
(Others should feel free to chime in about matrix stuff! I have nothing to say about that!)

On Tue, Apr 10, 2012 at 9:55 AM, Vitalie Spinu <spin...@gmail.com> wrote:
 > Under what circumstances would you have to check? str[a:b] always returns a
 > string, even if a==b, right? Programmatically, you're either going to be
 > indexing with a range or you're not.


> Often you don't know the type of a in str[a] in advance. It might be
> user input or the result of some other computation. You will always have
> to convert it to a Range to ensure there is no breakage.

In the case of user input, I think it is entirely reasonable to do a whole range of checks before passing the result to a ref function! In the case of some other computation, I think you can always be sure that the indexing expression is (say) an array, even a singleton array. I just don't see a problem here.

 > I'll take the DataFrame aspect of this... Look, a DataFrame (or a
 > data.frame, or a data.table) is not a matrix, full stop. It's a list of
 > heterogeneous vectors of the same length. There's no reason why it needs to
 > have the same semantics as a matrix or n-dimensional array. It should have
 > the semantics that make it easiest for people dealing with tabular data to
 > interactively build and work with the data structure. I fully, fully agree
 > that there are giant inconsistencies in R that make programming difficult
 > (don't get me started on sample()!), but I think the goal should be
 > sensible semantics within each data structure type.

> This is the biggest problem: everyone considers different things
> sensible. There will be 10 different data.xxx structures by 10
> different authors, with 10 different "sensible" semantics.

Not if the core team has something to say about it, and not if we get it right the first time!
 
 
 > Single-argument ref always returns a column DataVec:
 > DF[i] or DF["cat"]

> It's the same as the "scalar" case above, so it's consistent. I presume
> you have DF[vector] as well.

Actually, I don't. I'd prefer requiring DF[:,vector] or DF[vector,:] to get row or column slices.
 

 > I don't think there are any inconsistencies here. Requiring users to do
 > DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
 > that's not an option.

> Right, but DF[["cat"]] or DF@cat might possibly be!

The problem with the former is that it's already a legal expression, calling ref(df, Array{String,1}), which I'd prefer to leave undefined.

In theory I'd like the latter, but there are big parser issues with bare words in Julia, I think. R fakes it with lazy evaluation and with a requirement to quote things like DF$"my col".

(Side note: there will be partial matching on row/column indexes over my dead body.)


 > For Julia DataFrames, the real squickiness, it seems like, will be in the
 > demotion to simple arrays, and when and where that happens. In R, you can
 > have NAs all the way down, but I think in Julia as soon as you want to
 > start doing heavy math on a DataFrame, you need to resolve the NAs early
 > and convert to a matrix of homogeneous type. I've been playing with
 > expressions like mean(naFilter(dv)) to efficiently deal with NAs in core
 > math functions, but it's going to take some iteration when we get to
 > DataFrames and things like conversions based on formulas to model matrices.
 > I tend to think, at the moment, that most mathematical operations won't
 > work at all on DataFrames, cf R.

> I really hope we will be able to do heavy stuff with DataFrames in
> Julia. If you want a language for data analysis, you must be able to.

By "heavy", I mean numerically heavy. My thinking, and others may have better ideas, is to use DataFrames to do all of the relational/set-based processing, stuff that in R you'd do with the plyr and reshape2 packages (or base functionality, if you like frustration), but that functions like model.matrix that convert data objects to matrices would have to deal with NAs sooner than they do in R, where matrices support missing data.

And we should be sure to learn from other approaches to make those operations fast, efficient, orthogonal, and incredibly easy to read and write in the REPL.
 

 > Incidentally, Vitalie, what do you think about Pandas' DataFrames or the R
 > data.table types? To me, Pandas DataFrames are too focused on time-series
 > data, and have the expectation that non-homogeneous types are mostly just
 > for indexing, which isn't generally true. And if you have trouble with
 > data.frame semantics, I assume just thinking about data.table makes you
 > grind your teeth!

> I don't know Pandas, but as for data.table, on the contrary, I think it
> is a remarkable idea. I would have liked DT[v > 2, ] semantics by
> default in R (here v is a column of DT) because, as you said, it's
> super handy for interactive computation. But then an escape like
> DT[.(v) > 2, ] is necessary to refer to the variable "v" in the outer
> scope. The opposite convention is also fine: DT[.(v) > 2, ] to refer to
> a column of DT.
>
> I hope the above is possible in Julia, is it?

Oy, it's the bare-words issue again. You can't even do DT["v" > 2, ] in Julia, because the expression gets eval'ed before it hits ref().

> Bad things about DT:
> DT["v"] subsets rows by key even if you have a column "v"; confusing.
> DT[, sum(v), by=x] is awfully ugly given the [] semantics in R. Why we
> don't have DT[by=x, do=sum(v)] I will never understand.

In general, although I find data.table an extremely interesting approach, I agree that its syntax choices are puzzling.

I need to write up my thoughts on data.frame, DataFrame in Pandas, and data.table. I don't feel like any of them are a very good direct inspiration for Julia.

Now that I think of it, I want to half-suggest that sqldf might be the better model for Julia, given the impossibility of dealing with bare words. One approach would be something using non-standard string literals to generate a DSL.

df2 = do(Q"select col1, col2 from df1 where col3 = 'dog' and col4 > 7")

Another approach would be some variation on Pandas' split-apply-combine, with a DSL for the grouping:

combine(apply(split(df, G"col1 and col3"), x -> sum(x["col2"])))
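To make the string-literal mechanism concrete, here is a hedged sketch
(the G_str macro below is invented to show the idea; it is not an actual
API):

macro G_str(s)
    String.(split(s, " and "))   # parse the grouping spec at macro-expansion time
end

G"col1 and col3"   # -> ["col1", "col3"]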

I'm not sure what would be most Julian for joins or reshaping. I do tend to agree with Hadley/plyr/reshape2/ggplot2 (and not with Wes/pandas) that data should be in long format most of the time, and that split-apply-combine should be the default method of performing operations on the data.

I also disagree with Wes on DataFrames being sorta symmetrical in terms of the operations you can do on them. I'd strongly prefer requiring some sort of colwise() functional to do operations on a set of columns.

OK, back to work!

 -Harlan

Stefan Karpinski

Apr 10, 2012, 12:02:58 PM
to juli...@googlegroups.com
These are legitimate concerns and a good discussion to have. Obviously, we can't stop people writing their own types from implementing any indexing behavior they want, no matter how crazy or inconsistent. That's the downside of such a malleable language that's built in itself. However, we can, as Harlan says, enforce consistency in what gets accepted as standard packages. To that end, we should come up with some high-level indexing "rules": what various slicing behaviors should do in spirit, which then need to be interpreted and applied to new data types like DataFrame, etc. We're still working that out, of course — matrix slicing isn't even entirely settled yet.

kem

Apr 11, 2012, 1:23:54 AM
to juli...@googlegroups.com


On Tuesday, April 10, 2012 10:44:03 AM UTC-5, Harlan Harris wrote:
  > I don't think there are any inconsistencies here. Requiring users to do
 > DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
 > that's not an option.

For what it's worth, I always liked having this option (if not a requirement), as it corresponded to a series of nested functions being applied to the data frame and seemed logical to me:

(DF["cat"])[3]

((DF[1:10, :])[:, 1])[1]

It can get confusing, but is sometimes handy and always seemed consistent to me in a lisp-y sort of way (maybe even more consistent than the DF$var syntax).


 

> Right, but DF[["cat"]] or DF@cat might possibly be!

If this sort of notation were introduced, it would be nice to have it generalize to other sorts of data structures, and not just be a "data frame thing." I'm not sure what that would be, and am not suggesting it wouldn't generalize, but just a reaction.


 
> (Side note: there will be partial matching on row/column indexes over my dead body.)

For some reason I'm drawing a blank on what you mean by this.

 

> Now that I think of it, I want to half-suggest that sqldf might be the better model for Julia, given the impossibility of dealing with bare words. One approach would be something using non-standard string literals to generate a DSL.
>
> df2 = do(Q"select col1, col2 from df1 where col3 = 'dog' and col4 > 7")
>
> Another approach would be some variation on Pandas' split-apply-combine, with a DSL for the grouping:
>
> combine(apply(split(df, G"col1 and col3"), x -> sum(x["col2"])))


I like the idea in general of a sort of data-structure DSL. It seems flexible to me. Neither of the two above seems particularly appealing as a direct inspiration, though--I think something closer to what you might see elsewhere would be more appealing to me. E.g., very roughly

df2 = select(df1, [col1, col2], ["col3=='dog'", "col4 > 7"])

Anyway, this is relatively minor, but how are you thinking of handling negative indices?

e.g.,

DF[-1, :] would remove the first row? Not be allowed?
 

Harlan Harris

Apr 11, 2012, 9:03:18 AM
to juli...@googlegroups.com
On Wed, Apr 11, 2012 at 1:23 AM, kem <kristian...@gmail.com> wrote:
On Tuesday, April 10, 2012 10:44:03 AM UTC-5, Harlan Harris wrote:
  > I don't think there are any inconsistencies here. Requiring users to do
 > DF["cat"][3] to get a scalar will be slow to type and slow to execute, so
 > that's not an option.

> For what it's worth, I always liked having this option (if not a requirement), as it corresponded to a series of nested functions being applied to the data frame and seemed logical to me:
>
> (DF["cat"])[3]
>
> ((DF[1:10, :])[:, 1])[1]
>
> It can get confusing, but is sometimes handy and always seemed consistent to me in a lisp-y sort of way (maybe even more consistent than the DF$var syntax).

Oh, sorry. I meant to say that that shouldn't be the only option for getting singletons. That will definitely work.
 
> (Side note: there will be partial matching on row/column indexes over my dead body.)

> For some reason I'm drawing a blank on what you mean by this.


In R, if you have df <- list(cat=7, dog=12), you can do df$c and it evaluates to 7. Maybe made sense in 1995, but now that's what IDEs are for.

> I like the idea in general of a sort of data-structure DSL. It seems flexible to me. Neither of the two above seems particularly appealing as a direct inspiration, though--I think something closer to what you might see elsewhere would be more appealing to me. E.g., very roughly
>
> df2 = select(df1, [col1, col2], ["col3=='dog'", "col4 > 7"])

Yeah, some sort of combination of select(from, cols, where) and groupby() à la Pandas might make a reasonably readable syntax. I'm slightly partial to split-apply-combine, personally...


> Anyway, this is relatively minor, but how are you thinking of handling negative indices?
>
> e.g.,
>
> DF[-1, :] would remove the first row? Not be allowed?

Hadn't thought about it yet! Opinions?


Stefan Karpinski

Apr 11, 2012, 2:50:39 PM
to juli...@googlegroups.com
Negative indices are an error. Something like DF["cat"][3] will work for data frames assuming that DF["cat"] pulls out the DataCol named "cat". I'm still not *entirely* convinced about that behavior, but let's see how it feels.

Tom Short

Apr 11, 2012, 3:29:00 PM
to julia-dev
On Apr 11, 9:03 am, Harlan Harris <har...@harris.name> wrote:
> > I like the idea in general of a sort of data-structure DSL. It seems
> > flexible to me. Neither of the two above seems particularly appealing
> > as a direct inspiration, though--I think something closer to what you
> > might see elsewhere would be more appealing to me. E.g., very roughly
> >
> > df2 = select(df1, [col1, col2], ["col3=='dog'", "col4 > 7"])

Harlan, you could also write methods using expressions. That might
simplify indexing. I think you could set it up to do the following:

df2 = select(df1, :(col1, col2), :(col3=='dog' & col4 > 7))

or

df2 = df1[:(col3=='dog' & col4 > 7), :(col1, col2)]

This one is close to data.table style of indexing.
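For what it's worth, a quoted expression really is just data that a
select() along these lines could walk and rewrite. A minimal look, in
modern Julia syntax (illustrative only, not implementation code):

ex = :(col3 == "dog")
ex.head   # :call
ex.args   # Any[:(==), :col3, "dog"] - the column name arrives as a Symbol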

- Tom

Wes McKinney

Apr 12, 2012, 5:25:33 PM
to juli...@googlegroups.com

I have far too much programming, writing, and other work to do to
defend myself in this public forum-- but I don't think I have ever
said that data *should* be in one format versus the other. My view is
that you can implement a data structure that performs both column- and
row-wise operations equally well, subject only to cache performance
due to the actual memory layout (C/row-major or Fortran/column-major
order).

I also very strongly disagree with the notion that pandas is
necessarily designed for time series data. Since I started building
pandas in a financial setting, many of my examples on the internet come
from that domain, because it's a very easy-to-understand unit of data
with a meaningful ordered row index. A small piece of pandas (<< 20%
of the codebase; I could look and give you an exact measurement) could
be completely torn out, removing all time series functionality
without harming any other library features. So please, do not spread
misinformation that pandas is not good for other kinds of data
analysis, because that is simply not true (you might have written this
before I clarified in private e-mail) =P


- Wes

Harlan Harris

Apr 12, 2012, 5:42:27 PM
to juli...@googlegroups.com
Thanks, Wes! Yes, I'm not a Pandas user myself, and you've clarified your design philosophy very well! I apologize for stating things inaccurately.

 -Harlan

Vitalie Spinu

Apr 13, 2012, 4:28:22 AM
to juli...@googlegroups.com

I completely agree with Wes. R's data frames are an old and inflexible
data structure that treats columns and rows asymmetrically. I don't know
much about Pandas, but if it treats columns and rows symmetrically it
must be a very smart tool.

Reshaping in R is painful even with the reshape2 package, because
data.frame has no notion of row variables. But, Harlan, I also think
your work on data frames in Julia is extremely useful and a must-have;
people are used to data.frames, and more similarity is always better.

Vitalie.


Harlan Harris

Apr 13, 2012, 7:16:03 AM
to juli...@googlegroups.com
I still partially disagree on the first point, but completely agree on the second.

On Fri, Apr 13, 2012 at 4:28 AM, Vitalie Spinu <spin...@gmail.com> wrote:
> I completely agree with Wes. R's data frames are an old and inflexible
> data structure that treats columns and rows asymmetrically. I don't know
> much about Pandas, but if it treats columns and rows symmetrically it
> must be a very smart tool.

Data.frames in R are indeed old and inflexible. Nobody wants that. :) It looks to me like Pandas doesn't completely treat rows and columns symmetrically, but many or most operations can be done either way. The problem, to me, is that _data_ is not symmetrical, in the sense that linear-algebra matrices are symmetrical. And when you're dealing with heterogeneous types, which is essentially all of the time in many, many use cases, the additional flexibility trades off with simpler syntax. A central goal of having this domain-specific representation for statistical data is to be able to use simple, clear syntax to perform the goals of data cleaning, merging, reshaping, aggregating, etc. If the trade-off is to lose symmetry, I personally feel that it is worth it. I'm not opposed to symmetry, but for what I feel the goals of the DataFrame structure ought to be, it's not very high on my list. But we'll see what happens...
 
> Reshaping in R is painful even with the reshape2 package, because
> data.frame has no notion of row variables. But, Harlan, I also think
> your work on data frames in Julia is extremely useful and a must-have;
> people are used to data.frames, and more similarity is always better.

I agree that reshaping in R is torture, and requires programming-by-trial-and-error, even with Hadley's work. The efficiency in Pandas is due to Wes' work in developing indexes and efficient merges by indexes. What I've been wondering is whether an even better solution is to keep rownames and colnames distinct from indexes, in a database sense. If individual columns could be optionally indexed, then searches and merges on those columns would be fast, and you wouldn't have to redefine the indexes or move columns in and out of the row index to do common operations. There's a lot to be said for row labels being just labels, rather than having to be keys too. (Although I don't think they have to be strings, as in R.)

I should stop talking about this and spend the time programming it. :)

Thanks,

 -Harlan

Stefan Karpinski

Apr 13, 2012, 6:28:55 PM
to juli...@googlegroups.com
I'm not really sure that's what Wes said at all. This is a pretty slanted interpretation of a statement that just appears (to me at least) to be defending Pandas by saying that Pandas' DataTables are *not* just good for time series.

Stefan Karpinski

Apr 13, 2012, 6:34:14 PM
to juli...@googlegroups.com
I'm pretty unclear on why or how DataFrames would be symmetrical. Vitalie, can you provide a concrete argument for why that's better and how it would work? Seems to me that DataFrames are an essentially relational model with heterogeneous column types, which implies two things: asymmetry and the fact that rows and columns suffice (you don't need to generalize to higher dimensions). For higher dimensions and symmetry between those dimensions, it seems like what you really want is an ArrayNA type that works like regular arrays but also keeps track of an NA mask. Allowing row labels that could be strings, numbers, dates, or whatever, seems like an excellent idea, however.

Vitalie Spinu

Apr 14, 2012, 4:52:46 AM
to juli...@googlegroups.com

It's not about multiple dimensions. It's about the possibility of having
ID variables (i.e., keys) that index columns as well as rows. I know this
is a bit difficult to grasp if you haven't done much reshaping in R,
and a full explanation would require a careful example and plenty of
space. I promise to find some time to write a detailed wiki page on this
issue, and hopefully some implementation in Julia, so we can be more
specific in our discussion. I also have to see what Harlan did in his
DataFrame code.

As to row names, I am not sure they are really useful; at least
for me they were a continuous source of headache in R. A far more
fundamental concept is the key. You can have many keys in a data.frame
and use all of them to index rows. For example, DF[("b"), ] would
extract the part of the data frame whose row key matches "b", and
DF[("b", 3), ] the part whose first key matches "b" and second key
matches 3. Of course one would also need some sort of labeling for
this:

DF[(:key1 "b", :key2 3), ]

You can have a primary key, which by default would just be the
row-number key, and use DF[2:3, ] to index rows without the tuple
notation.

But named tuples are not yet implemented in Julia, right? And it would
be really nice to have that first, before thinking further about complex
indexing in data structures.

Also, by using keys one can from the very beginning implement
non-linear search for indexing, which is hugely more efficient than the
linear search in R's data.frames. This is how indexing in R's
data.table package works.
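A minimal sketch of why keys help: a hash index over a key column turns
row selection into dictionary lookup instead of a linear scan (all names
below are invented for illustration):

key = ["a", "b", "b", "c"]           # the key column of some data frame
index = Dict{String, Vector{Int}}()
for (i, k) in enumerate(key)
    push!(get!(index, k, Int[]), i)
end

index["b"]   # [2, 3] - the rows whose key matches "b", found without a scan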



Wes McKinney

Apr 14, 2012, 9:39:55 AM
to juli...@googlegroups.com

Things like this

http://stackoverflow.com/a/10149202/776560

are very tricky and ad hoc to do without a consistent and coherent way
to label / index any dimension.

Have any of you thought much about how Julia might provide
implementations of time series data structures like zoo or xts in R,
which provide rudimentary auto-alignment functionality (I've taken it
a step further)? In pandas I'm able to do that only by having a
different row index type, which is actually very nice for users -- one
completely consistent data structure for doing everything (rather than
R's mixed bag of 8 different data structures that all do slightly
different things).

- Wes

Stefan Karpinski

Apr 14, 2012, 12:10:39 PM
to juli...@googlegroups.com
I want to re-address the original concern of this thread, since the discussion has gotten pretty far off-topic. Basically, I think the original concern is not valid: all systems that have any sort of slicing give differently shaped/typed results when differently shaped/typed objects are used for slicing. That's the whole point of slicing — it's inherent to the very concept that a single index and a slice give you different things, with shape determined by what you're using for indexing. This is not only true in R, but also in Matlab, Python — every single language that supports array slicing. Consistency is a concern, and it's arguable that some languages don't get it right, but I don't see why we'll have any problems being consistent, since we're paying very close attention to exactly this sort of consistency.

The explicitly expressed specific concern is that the expression str[x] can be either a character or a string depending on the type of x. There are two cases for the origin of x:
  • x is defined locally, most often right in the indexing, as in str[1] or str[1:n]
  • x comes in as a function parameter
In the case where x is defined locally, it's completely clear from looking at the code what str[x] will be, since we can see what x is. In the case where x is a function parameter, one possibility is that x is dispatched on like so:

f(str::String, x::Int) = ... str[x] ...
f(str::String, x::Range1) = ... str[x] ...

In both definitions, it's completely clear what's going on since you know whether x is an int or a range. Another possibility is that you have a function definition that doesn't dispatch on x:

g(str::String, x) = ... str[x] ...

In that case, you have explicitly polymorphic behavior, where the behavior of g is intentionally dependent on the polymorphic behavior of str[x]. A mildly contrived example of this is a function which takes a string and any number of indexing expressions and returns a hash mapping indexing expressions to substrings:

function g(str::String, idxs...)
  h = HashTable()
  for idx in idxs
    h[idx] = str[idx]
  end
  return h
end

Here it is in action:

julia> g("Hello, world.\n", 2, 1:3, 5)
{1:3=>"Hel",5=>'o',2=>'e',}

If indexing with an integer and a range used two distinct operators, then this polymorphic function would be extremely nasty to write (although not impossible). What you're suggesting by having two different indexing operators would completely cripple the ability to write generic polymorphic indexing code in Julia. Worse still, there would be nothing gained by having separate indexing syntaxes: in none of the above cases is there any lack of clarity about what's going to happen just by looking at nearby code. Imo, the fact that R has two different indexing operators is a confusing misfeature. It means that there are now two different and incompatible indexing systems in one language that I have to remember.

I think the deeper issue causing your concern may be a fundamental discomfort with the level of polymorphism found in Julia. Every time I see a concern that starts with "but then we'd have to write code that does `if is.vector(x) ... elseif is.matrix(x) ... end`" I'm immediately convinced that the person voicing the concern doesn't get Julia: they're missing both the enormous degree of polymorphism that it allows *and* the power afforded by multiple dispatch to control that polymorphism in a fine-grained, expressive manner. You *never* have to write code like that in Julia, because that's what the dispatch system is for. Moreover, it's not just how indexing works; it's how everything works. What's the type of x+y? I have no idea unless I know what the types of x and y are. The language, like mathematical notation itself, is insanely polymorphic. If you ever find yourself writing something that manually checks types, chances are you're doing it wrong.

Viral Shah

Apr 14, 2012, 12:55:34 PM
to juli...@googlegroups.com
While I personally don't know much about zoo and such things, my R friends swear by this stuff. I can only imagine that if we make Julia really fast and general-purpose, Achim Zeileis (author of zoo) and others will write zoo for Julia.

I did have an email exchange with Achim Zeileis a long time back about the performance of zoo, and he shared some snippets of code with me that were doing some kind of complex dynamic programming. He eventually ended up writing that kind of stuff in C, callable from R. It would be great to get a discussion going with Achim and others.

-viral

Harlan Harris

Apr 14, 2012, 1:16:17 PM
to juli...@googlegroups.com
Yes, I think that there's a certain amount of tension between data structures and operations that are best for relational data, such as data.frame in R, and data structures and operations that are best for ordered data, such as ts/zoo in R and the aspects of Pandas' DataFrame that do row alignment. Often the two overlap; time-series data can have covariates with relational properties, and relational data can have covariates with temporal properties. That is, you may want to do joins against non-index columns in time-series data, to annotate covariates in various ways, and you may want to sort relational data structures.

Making everyone and their use cases happy seems to be of the same class of impossibility as the matrix slicing issue!

-Harlan