DataArrays / DataFrames Chaos Month

John Myles White

Nov 13, 2013, 11:33:16 AM
to julia...@googlegroups.com
Over the next month, I’m going to clean up the DataArrays and DataFrames codebases to remove un-Julian idioms.

This includes things like:

* Using nrow/ncol instead of size
* Default typing of DataArrays to Float64
* Using row* and col* functions instead of specifying reduction dimensions as a numeric index
* Using cbind/rbind instead of hcat/vcat.

While this happens, I’ll avoid updating METADATA, which should point to stable releases of both packages that use the documented API.

If you’re interested in helping remove the cruft we’ve accumulated, master will track the cleaned-up version of those packages. The more un-Julian idioms we can find and purge, the better. I’d like to end 2013 with a version of DataArrays and DataFrames that has an API we’ll stick with for the long term.

— John
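
For reference, a minimal sketch of the Base array idioms the cleanup is aiming at, shown on a plain Matrix. It is written in current Julia syntax (the `dims` keyword is newer than this thread, where the dimension was a positional argument), and the variable names are illustrative only.

```julia
A = [1 2 3; 4 5 6]          # 2×3 Matrix{Int}

# size instead of nrow/ncol
nrows, ncols = size(A)      # (2, 3)
size(A, 1)                  # 2
size(A, 2)                  # 3

# reduction dimension given as a number instead of row*/col* functions
col_sums = sum(A, dims=1)   # 1×3 matrix holding 5, 7, 9
row_sums = sum(A, dims=2)   # 2×1 matrix holding 6, 15

# hcat/vcat instead of cbind/rbind
wide = hcat(A, A)           # 2×6
tall = vcat(A, A)           # 4×3
```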

Simon Kornblith

Nov 22, 2013, 12:02:50 AM
to julia...@googlegroups.com
This is great. It might be a good idea to deprecate the old functions at first instead of removing them entirely.

Simon

John Myles White

Nov 22, 2013, 12:26:03 AM
to julia...@googlegroups.com
Yeah, I’ve been a little too eager. I’ll try to add deprecations for what I’ve already taken out.

— John
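
One way such a deprecation could look, as a rough sketch only: the old name keeps working but points users at the replacement. `nrow_legacy` is a made-up name for illustration, not what DataFrames actually did; `Base.depwarn` is the Base function for emitting deprecation warnings.

```julia
# Hypothetical function name, for illustration only.
function nrow_legacy(x)
    # The warning is shown only when deprecation warnings are enabled
    # (for example with `julia --depwarn=yes`).
    Base.depwarn("`nrow_legacy(x)` is deprecated, use `size(x, 1)` instead.", :nrow_legacy)
    return size(x, 1)
end

n = nrow_legacy(zeros(4, 2))    # still returns 4 during the deprecation period
```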

David van Leeuwen

Dec 4, 2013, 7:51:59 AM
to julia...@googlegroups.com
Hello, 

Regarding the remark:


On Wednesday, November 13, 2013 5:33:16 PM UTC+1, John Myles White wrote:
Over the next month, I’m going to clean up the DataArrays and DataFrames codebases to remove un-Julian idioms.

This includes things like:

* Using nrow/ncol instead of size

I am not sure how to read this: replacing nrow/ncol with size(), or the other way around?

Given the explicit mentioning in several places that Julia doesn't support nrow() and ncol(), I suspect the former. 

Is there any reason why Julia doesn't want me to think in terms of rows and columns?  Is it better to abstract this away into a numbered dimension?

Cheers, 

---david
 

John Myles White

Dec 4, 2013, 10:24:01 AM
to julia...@googlegroups.com
Yes, rows and columns are not sufficiently general. Julia’s data structures often have more than two dimensions, so an interface that only makes sense for two dimensions is needlessly restrictive.

— John

David van Leeuwen

Dec 5, 2013, 6:32:59 PM
to julia...@googlegroups.com
Hi, 


On Wednesday, December 4, 2013 4:24:01 PM UTC+1, John Myles White wrote:
Yes, rows and columns are not sufficiently general. Julia’s data structures often have more than two dimensions, so an interface that only makes sense for two dimensions is needlessly restrictive.

I don't see how a shorthand definition

nrow(x) = size(x,1)

can be seen as restrictive. We also have a shorthand

Vector{T} = Array{T,1}

and there is nothing wrong with having such a special case.

nrow is just the size of the first dimension. That's all. Personally I think linear algebra code is much easier to read with nrow() and ncol() than with size(,d). But that may be a personal flaw...

---david
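
A minimal sketch of the shorthands under discussion, assuming they are defined in user code (Base does not provide `nrow`/`ncol` for arrays). Restricting them to `AbstractMatrix` is a choice made here, not something proposed in the thread.

```julia
# User-defined helpers; hypothetical, not part of Base.
nrow(x::AbstractMatrix) = size(x, 1)
ncol(x::AbstractMatrix) = size(x, 2)

X = rand(5, 3)
dims = (nrow(X), ncol(X))                        # (5, 3)

# The built-in alias mentioned above really is just another name:
same_type = Vector{Float64} === Array{Float64,1} # true
```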

John Myles White

Dec 5, 2013, 11:22:53 PM
to julia...@googlegroups.com
Here are my objections to having nrow and ncol:

(1) They restrict the imagination. Because you don’t get used to using size(x, 1) and size(x, 2), you will be slower to understand size(x, 3). The size idiom is strictly superior, because it generalizes to more cases than ad hoc things like nrow or ncol.

(2) They encourage people to think in R’s idioms, which keeps them from learning idiomatic Julia.

(3) They violate a principle I’ve come to think is essential to the long-term health of Julia: there should be at most one way to achieve a simple goal in Julia. Every time there are two ways to do one thing, you bifurcate the community into two groups of users: those who use Idiom A and those who use Idiom B. Gradually, they come to speak more and more different dialects, which harms communication in the group. An even worse problem is that the code paths will come to diverge, so that both of them receive less use and fewer bug fixes. This makes everything in the system brittle.

In the case of simple aliases, this isn’t a big problem. But it’s a bad precedent that I’d like to remove from Julia.

— John
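
To illustrate point (1), a small sketch of how the size idiom carries over unchanged to a third dimension. The array and variable names are arbitrary.

```julia
T = zeros(2, 3, 4)     # a 3-dimensional array

size(T)                # (2, 3, 4)
size(T, 3)             # 4
ndims(T)               # 3

# Looping over every dimension needs no per-dimension function names.
extents = [size(T, d) for d in 1:ndims(T)]   # [2, 3, 4]
```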

Tom Short

Dec 6, 2013, 7:10:22 AM
to julia...@googlegroups.com
Having only one way to do things is a worthy goal, but it'll likely be a never-ending battle because Julia is so expressive.

While I like nrow and ncol because it's much more readable, I agree with John's decision to take them out for the following reason. If users are used to nrow and ncol, they'll expect them to work for Arrays, and they don't.

Harlan Harris

Dec 6, 2013, 7:41:16 AM
to julia...@googlegroups.com
What about something DataFrame-specific, that won't surprise people by being missing from arrays? Something like nvars and nobs? 

Milan Bouchet-Valat

Dec 6, 2013, 8:53:04 AM
to julia...@googlegroups.com
On Friday, December 6, 2013 at 07:10 -0500, Tom Short wrote:
> Having only one way to do things is a worthy goal, but it'll likely be a never-ending battle because Julia is so expressive.
>
> While I like nrow and ncol because it's much more readable, I agree with John's decision to take them out for the following reason. If users are used to nrow and ncol, they'll expect them to work for Arrays, and they don't.

I'd say nrow() and ncol() should be defined for Arrays too. ;-) I have code working with 3D arrays where I use them all the time. Actually, they make more sense for Arrays than for DataFrames, where you think more naturally in terms of observations and variables rather than rows and columns (the row/column choice is somewhat arbitrary).

While I fully share your goals of not having several ways of doing the same thing, I think in that case the correspondence between nrow(x) and size(x, 1) is so direct that it's not a problem. That said, I don't see the lack of such a convenience feature as a showstopper...


Regards

Kevin Squire

Dec 6, 2013, 9:19:18 AM
to julia...@googlegroups.com
> I'd say nrow() and ncol() should be defined for Arrays too. ;-) I have code working with 3D arrays where I use them all the time. Actually, they make more sense for Arrays than for DataFrames, where you think more naturally in terms of observations and variables rather than rows and columns (the row/column choice is somewhat arbitrary).

Just for curiosity, for 3D arrays, what function do you use to query the size of the 3rd dimension?

For someone like me who has avoided R like the plague (I use pandas), nrow and ncol are understandable but foreign.  I'm perfectly happy with making the interface more Julian.

Kevin

David van Leeuwen

Dec 6, 2013, 7:55:55 PM
to julia...@googlegroups.com
I am not too much into 3D arrays (I have nothing against them, but it is just that most of the stuff I do fits in a good old matrix), but I could imagine "nlayers()" would help me visualize the data:-)  Although in R I once needed a function for "give me a slice where the nth dimension is k" which was kind of hard at the time. 

So I understand your arguments for not having nrow() and ncol() in base, but is it therefore considered un-Julian to define my own little helper functions in code that never goes beyond matrices (as lots of linear algebra code does)? I would still want to make the case for readability.

Cheers, 

---david
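
On the "slice where the nth dimension is k" problem: later Julia versions (1.0 and up, so well after this thread) have `selectdim` in Base for this. A minimal sketch, with an arbitrary example array:

```julia
T = reshape(1:24, 2, 3, 4)    # 2×3×4 array holding 1, 2, ..., 24

# All entries whose 3rd index equals 2: a 2×3 view into T.
layer = selectdim(T, 3, 2)
size(layer)                   # (2, 3)
layer == T[:, :, 2]           # true

# The dimension can be a run-time value, so the same call works for any n.
n, k = 2, 1
slice_shape = size(selectdim(T, n, k))   # (2, 4)
```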

John Myles White

Dec 6, 2013, 8:12:55 PM
to julia...@googlegroups.com
You’re always welcome to do what you want in your own code.

I’ve been a big fan of standardized idioms for a while now and, now that I work at a company with thousands of people sharing a single codebase, I’m violently in favor of removing subjective decision-making from coding whenever possible. So I would discourage using nrow and ncol personally, but it’s not really the Julian way to tell people what to do in their private work.

— John

Harlan Harris

Dec 6, 2013, 8:40:06 PM
to julia...@googlegroups.com
I gotta say, for DataFrames, the size(,1/2) thing really feels puzzling to me. Unlike with general arrays (and DataArrays), the dimensions are really drastically qualitatively different, not just a property of how a matrix is laid out in memory. If nrow/ncol seem like bad ideas, I'll repeat my nvars/nobs suggestion. 

John Myles White

Dec 6, 2013, 8:44:13 PM
to julia...@googlegroups.com
I’m pretty opposed to that, but if you feel sufficiently strongly we can put them in. That does mean we can’t use those variable names, though.

Even though I agree size(, 1) and size(, 2) are different here, we have to adopt that nomenclature for things like mean(df, 2) to ensure that we track Julia’s idioms properly.

— John

Harlan Harris

Dec 7, 2013, 12:05:57 PM
to julia...@googlegroups.com
Not sure how to best make this decision...! I'm just in the camp of finding 1s and 2s to be good ways of breaking my flow, while either reading or writing code.

What about a symbol-based solution, à la size(df, :obs) or size(df, :vars), that only works for DataFrames?




John Myles White

Dec 7, 2013, 12:09:19 PM
to julia...@googlegroups.com
I like that approach a whole lot more. It’s still a little weird, but it feels Julian at least.

For me the trouble is that we’re going to end up committing to some of these decisions forever, which makes me want to be really conservative. I found size(, 1) confusing at first, but am now totally fluent in it.

Let’s put the symbols in with the agreement that we can revert that decision before Julia hits 1.0.

— John

Milan Bouchet-Valat

Dec 7, 2013, 12:38:17 PM
to julia...@googlegroups.com
On Friday, December 6, 2013 at 06:19 -0800, Kevin Squire wrote:
>         I'd say nrow() and ncol() should be defined for Arrays
>         too. ;-) I have code working with 3D arrays where I use them
>         all the time. Actually, they make more sense for Arrays than
>         for DataFrames, where you think more naturally in terms of
>         observations and variables rather than rows and columns (the
>         row/column choice is somewhat arbitrary).
> 
> Just for curiosity, for 3D arrays, what function do you use to query
> the size of the 3rd dimension?
The code is in R, so I'm doing dim(x)[3], which is terrible to type. I wouldn't be against a nlayers() function, but I admit this would start being overkill... :-) I think once you get beyond rows/columns, thinking in terms of the dimension number is more natural - but when you work only with matrices numbers feel weird.

> For someone like me who has avoided R like the plague (I use pandas),
> nrow and ncol are understandable but foreign.  I'm perfectly happy
> with making the interface more Julian.
> 
> 
> Kevin

Milan Bouchet-Valat

Dec 7, 2013, 12:45:30 PM
to julia...@googlegroups.com
On Friday, December 6, 2013 at 17:44 -0800, John Myles White wrote:
> I’m pretty opposed to that, but if you feel sufficiently strongly we can put them in. That does mean we can’t use those variable names, though.
>
> Even though I agree size(, 1) and size(, 2) are different here, we have to adopt that nomenclature for things like mean(df, 2) to ensure that we track Julia’s idioms properly.

Why would you need mean(df, 2)? One could argue that mean(df) (scalar) and mean(df, 1) (vector) do not have any meaning in DataFrames, except for special cases where all columns are numeric - and in those cases, you should use a DataArray. This would imply that mean(df) should be equivalent to mean(df, 2), and that people who want to compute row-wise means would do it by hand. Or do you think this is a common enough case?


Regards

John Myles White

Dec 7, 2013, 12:58:48 PM
to julia...@googlegroups.com
I’m happy making mean(df) be equivalent to mean(df, 2). But I would argue that we also need mean(df, 2) for generic programming support.

— John
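
For comparison, here is how the scalar, per-column, and per-row means look on a plain numeric Matrix in current Julia (the `dims` keyword replaced the positional dimension argument used in this thread). This is only the Base array behaviour; it does not settle what `mean(df)` should do for a DataFrame.

```julia
using Statistics

# Rows are observations, columns are variables (all numeric here).
M = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

overall  = mean(M)           # 11.0, a single scalar over every entry
per_col  = mean(M, dims=1)   # 1×2 matrix: one mean per column (2.0 and 20.0)
per_row  = mean(M, dims=2)   # 3×1 matrix: one mean per row (5.5, 11.0, 16.5)
```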

Andreas Noack Jensen

Dec 8, 2013, 4:06:12 AM
to julia...@googlegroups.com
I would like to support Harlan's view on this. When you use a data frame you want something different than an n-dimensional array, and therefore I don't think the terminology related to n-dimensional arrays should be considered more Julian here. For data frames the natural labels are variables and observations. This also applies to the mean function. For a data frame it only makes sense to calculate the mean over each variable, and therefore there shouldn't be a dimensional argument.


--
Best regards,

Andreas Noack Jensen

Sean Garborg

Dec 8, 2013, 9:33:54 AM
to julia...@googlegroups.com

I'm on the fence about the main topic -- just want to offer that 'row-wise' min/max/mean/etc. are common enough, but usually on a subset of columns of a dataset.

Michael Weylandt

Dec 8, 2013, 10:43:29 AM
to julia...@googlegroups.com, julia...@googlegroups.com


On Dec 8, 2013, at 4:06, Andreas Noack Jensen <andreasno...@gmail.com> wrote:

> For a data frame it only makes sense to calculate the mean over each variable and therefore there shouldn't be a dimensional argument.

Not necessarily: suppose I have a data frame of the average weights of men and women broken down by (US) state. Not unlikely that I might want the average weight per state or the average weight per gender. Ignoring the fact that I really should use a weighted mean, that's both column-wise and row-wise application of mean().

A second, perhaps more common, example would be a data frame of stock returns. You might want a cross-portfolio mean return or the mean return of a single asset across the entire set of observations.

Michael

John Myles White

Dec 8, 2013, 12:00:00 PM
to julia...@googlegroups.com
Isn’t that a split-apply-combine operation, not naive row-wise mean?

— John

John Myles White

Dec 8, 2013, 12:00:55 PM
to julia...@googlegroups.com
Let’s add size(df, s::Symbol) to DataFrames then.

— John
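
A rough sketch of what a `size(df, s::Symbol)` method could look like. `ToyFrame` is a hypothetical stand-in defined here for illustration, not the DataFrames type, and the `:obs`/`:vars` names are taken from Harlan's suggestion; the error behaviour for other symbols is an assumption.

```julia
# `ToyFrame` is a hypothetical stand-in, not the DataFrames type.
struct ToyFrame
    columns::Vector{Vector{Any}}   # one vector per variable, all of equal length
end

Base.size(df::ToyFrame) =
    (isempty(df.columns) ? 0 : length(df.columns[1]), length(df.columns))
Base.size(df::ToyFrame, d::Integer) = size(df)[d]

# Symbol-based lookup: :obs for observations (rows), :vars for variables (columns).
function Base.size(df::ToyFrame, s::Symbol)
    s === :obs  && return size(df, 1)
    s === :vars && return size(df, 2)
    throw(ArgumentError("expected :obs or :vars, got :$s"))
end

df = ToyFrame([Any[1, 2, 3], Any["a", "b", "c"]])
size(df, :obs)     # 3
size(df, :vars)    # 2
```

The numeric forms `size(df, 1)` and `size(df, 2)` keep working alongside the symbols, which matches the point about generic code still needing them.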

Michael Weylandt

Dec 8, 2013, 1:52:12 PM
to julia...@googlegroups.com
The first or the second?

Either could be made to fit split-apply-combine if you allow trivial splits (i.e. each row), apply functions (identity function), and combine functions (hcat/vcat) but it feels heavy.

Agreed it's a somewhat degenerate case and one would probably use weighted means => matrix multiplication for real cases, but for folks working entirely with numerical data, the 'it looks like a matrix, so why can't I take row/column means?' urge is likely to be strong.

Michael

Andreas Noack Jensen

Dec 8, 2013, 3:39:52 PM
to julia...@googlegroups.com
> suppose I have a data frame of the average weights of men and women broken down by (US) state.

Then I would say that you either have your data in an unfortunate format (wide) for the operation or should consider using an n-dimensional array instead of a data frame.
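
To make the long-format alternative concrete, here is a hand-rolled split-apply-combine sketch in plain Julia. The records, the numbers, and the `group_mean` helper are all made up for illustration; a real data frame library would provide its own grouping functions.

```julia
using Statistics

# Hypothetical long-format records: one row per (state, gender) observation.
records = [
    (state="NY", gender="F", weight=70.0),
    (state="NY", gender="M", weight=84.0),
    (state="CA", gender="F", weight=68.0),
    (state="CA", gender="M", weight=82.0),
]

# Split rows by a key, apply `mean` to each group, combine into a Dict.
function group_mean(rows, key::Symbol)
    groups = Dict{String,Vector{Float64}}()
    for r in rows
        push!(get!(groups, getproperty(r, key), Float64[]), r.weight)
    end
    return Dict(k => mean(v) for (k, v) in groups)
end

by_state  = group_mean(records, :state)    # "NY" => 77.0, "CA" => 75.0
by_gender = group_mean(records, :gender)   # "F" => 69.0, "M" => 83.0
```

In this layout both averages are the same operation with a different grouping key, so no row-wise mean is needed.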



John Myles White

Dec 8, 2013, 9:09:58 PM
to julia...@googlegroups.com
I agree with Andreas: you’re describing a setting in which you should use a DataMatrix, not a DataFrame.

In general, operations like taking row means should either always work for a data structure or always fail. Their success shouldn’t depend upon the types that the columns of a DataFrame just happen to have.

— John
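
A small sketch of the failure mode described here, using plain vectors of columns rather than a DataFrame. The `row_mean` helper and the data are hypothetical; the point is only that the same call succeeds or fails depending on what the columns happen to contain.

```julia
using Statistics

# Hypothetical helper: mean of the i-th entry across a collection of columns.
row_mean(cols, i) = mean(col[i] for col in cols)

numeric_cols = [[1.0, 2.0], [3.0, 4.0]]
mixed_cols   = Any[[1.0, 2.0], ["a", "b"]]

ok = row_mean(numeric_cols, 1)    # 2.0

# Same call, different column types: a Float64 and a String cannot be averaged.
failed = try
    row_mean(mixed_cols, 1)
    false
catch
    true
end
```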

R. Michael Weylandt

Dec 8, 2013, 10:25:40 PM
to julia...@googlegroups.com
Fair enough: I hadn't seen DataMatrix and had mistakenly assumed that
1D DataArrays were the only other game in town.


John Myles White

Dec 8, 2013, 10:42:32 PM
to julia...@googlegroups.com
Nope, the goal is for DataArrays to exactly mirror Array, but with the possibility of NA values.

— John
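
The same "an Array, but with possible missing values" idea can be sketched in current Julia with the built-in `missing` value. Note this swaps in Base's `Union{Missing, T}` arrays for the DataArrays `NA` discussed in the thread; it is not the DataArrays API.

```julia
using Statistics

# Modern Julia's `missing`, used here in place of DataArrays' NA.
x = [1.0, missing, 3.0]          # Vector{Union{Missing, Float64}}

size(x)                          # (3,), same interface as any Array
ismissing(x[2])                  # true
propagated = mean(x)             # missing: missingness propagates
skipped    = mean(skipmissing(x))  # 2.0

m = [1.0 missing; 3.0 4.0]       # the 2-D case works the same way
size(m, 2)                       # 2
```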

David van Leeuwen

Dec 13, 2013, 7:23:43 AM
to julia...@googlegroups.com
If we had the sort of imagination particle physicists have (e.g., Murray Gell-Mann), we'd just invent names for dimensions > 3...

---david