DataFrame order on multiple columns?


Jacob Quinn

Feb 1, 2013, 3:51:31 PM
to julia...@googlegroups.com
I've continued my exploration of DataFrames and just have a quick question:

is ordering a DataFrame over multiple columns currently supported?

I couldn't find any documentation and my tinkering with order() couldn't find a solution.

-Jacob

John Myles White

Feb 1, 2013, 3:55:17 PM
to julia...@googlegroups.com
No, the only thing we have so far is sortby(), which only supports one column. It would be great to have an extended sortby() that breaks ties using additional columns.

-- John

Tom Short

Feb 1, 2013, 4:38:00 PM
to julia...@googlegroups.com
It looks like the sort routines have undergone a lot of changes.
`order` is no longer a Base function. It's now `sortperm`, and I don't
see a way to call it on multiple vectors.

Tom Short

Feb 1, 2013, 5:01:47 PM
to julia...@googlegroups.com
It looks like it's not too bad to extend `sortperm` to cover multiple
columns. Here is a start that covers two columns. There might be a
more efficient way to do it that would involve creating a new Ordering
type like Perm that handles multiple vectors.

import Base.sortperm

sortperm(v1::AbstractVector, v2::AbstractVector) =
    sort(Sort.MergeSort(), Sort.Perm(Sort.Forward(), v1), sortperm(v2))

julia> d = @DataFrame(x => [9,2,2,3], y => [1,2,3,1])
4x2 DataFrame:
        x y
[1,]    9 1
[2,]    2 2
[3,]    2 3
[4,]    3 1


julia> d[sortperm(d["x"], d["y"]), :]
4x2 DataFrame:
        x y
[1,]    2 2
[2,]    2 3
[3,]    3 1
[4,]    9 1


julia> d[sortperm(d["y"], d["x"]), :]
4x2 DataFrame:
        x y
[1,]    3 1
[2,]    9 1
[3,]    2 2
[4,]    2 3
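
For readers on a current Julia release, the same two-pass idea can be sketched with Base alone. This is only a rough sketch, not the code above: `sortperm2` is a hypothetical name, and it assumes the `alg` keyword of `sortperm` and the exported `MergeSort` algorithm in place of the 2013-era `Sort.Perm` interface.

```julia
# Two-pass stable sort: order by the tie-breaking column first, then sort
# that permutation stably by the primary column, so ties in v1 keep the
# order established by v2.
function sortperm2(v1::AbstractVector, v2::AbstractVector)
    p = sortperm(v2; alg = MergeSort)           # pass 1: secondary key
    return p[sortperm(v1[p]; alg = MergeSort)]  # pass 2: stable sort on primary key
end

x = [9, 2, 2, 3]
y = [1, 2, 3, 1]

println(sortperm2(x, y))  # [2, 3, 4, 1]  (x ascending, ties broken by y)
println(sortperm2(y, x))  # [4, 1, 2, 3]  (y ascending, ties broken by x)
```

The second pass has to be stable; an unstable algorithm could scramble the order that the first pass set up within each group of ties.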

Jacob Quinn

Feb 1, 2013, 8:19:43 PM
to julia...@googlegroups.com
Looks promising. On a related note, what would be the syntax for doing a groupby or by call but with multiple aggregated columns.

I know grouping by multiple columns is easy, but what if I wanted more than one aggregated column in the resultset?

by(iris, "Species", :(MeanPetalLength = mean(Petal_Length)))
#this works great
by(iris, ["Species","Sepal_Length"] , :(MeanPetalLength = mean(Petal_Length)))
#as does this
#but what about something like this?
by(iris, "Species", :(MeanPetalLength = mean(Petal_Length, StdPetalLength = std(Petal_length))))

John Myles White

Feb 1, 2013, 8:58:04 PM
to julia...@googlegroups.com
This is much more Tom's domain than mine, but I would provide a function to by() rather than a simple expression. The example below should get you started:

using DataFrames, RDatasets

function f(df)
  res = DataFrame()
  res["MeanPetalLength"] = mean(df["Petal_Length"])
  res["MedianPetalLength"] = median(df["Petal_Length"])
  return res
end

iris = data("datasets", "iris")
clean_colnames!(iris)
by(iris, "Species", f)

 -- John

Tom Short

Feb 1, 2013, 9:56:08 PM
to julia...@googlegroups.com

Jacob, you can use John's method. Your first try was also pretty close. Just put a semicolon between the two assignments, or use quote-end and a multi-line expression. I'd show an example, but I'm not at my computer.

Tom Short

Feb 2, 2013, 11:46:47 AM
to julia...@googlegroups.com
Here are several ways to do the same thing (in addition to John's).
The first couple use arrays of symbols. This is easiest when you want
to apply similar functions to multiple columns and are okay with the
names generated. The last two use expressions.

iris = data("datasets", "iris")
clean_colnames!(iris)

method1 = by(iris, "Species", [:mean, :std])  # applies these to all non-key columns
method2 = by(iris[["Species","Petal_Length"]], "Species", [:mean, :std])
method3 = by(iris, "Species", quote
    MeanPetalLength = mean(Petal_Length)
    StdPetalLength = std(Petal_Length)
end)
method4 = by(iris, "Species", :(MeanPetalLength = mean(Petal_Length);
                                StdPetalLength = std(Petal_Length)))
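
To make the split-apply-combine idea behind these calls concrete, here is a package-free sketch of a grouped mean and standard deviation. The data is made up, `groups` and `stats` are hypothetical names, and this only illustrates the idea; it is not how `by` is implemented in DataFrames.

```julia
using Statistics  # standard library: mean, std

# Toy data standing in for two columns of a data frame (made up for illustration).
species      = ["a", "a", "b", "b", "b"]
petal_length = [1.0, 3.0, 2.0, 4.0, 6.0]

# Split: collect the values belonging to each key.
groups = Dict{String, Vector{Float64}}()
for (k, v) in zip(species, petal_length)
    push!(get!(groups, k, Float64[]), v)
end

# Apply and combine: several aggregates per group, stored as a named tuple.
stats = Dict(k => (MeanPetalLength = mean(v), StdPetalLength = std(v))
             for (k, v) in groups)

println(stats["a"].MeanPetalLength)  # 2.0
println(stats["b"].MeanPetalLength)  # 4.0
```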

Stefan Karpinski

Feb 2, 2013, 2:14:34 PM
to julia...@googlegroups.com
A little clarification here. When you say "ordering" a data frame, what does that mean? Finding the permutation that would put the data in order, or actually sorting the rows? For the latter, I would use sort! or sortby!. The sortby! function for vectors takes as its first argument some function that maps each element to a proxy value used for sorting. For data frames, the names of some set of columns naturally correspond to the function mapping each row to just those column values, which one can then sort by.

There is some argument for making sort!(f,v) do what sortby!(f,v) currently does, especially since sort!(v) is equivalent to sortby!(identity,v). Then we need a way to provide a fully custom sort comparison function, but that can always be done with sort!(Sort.Lt((a,b)->xxx),v).
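
In current Julia releases the two roles described here are covered by keyword arguments: `by` for the proxy function and `lt` for a full comparison. A small sketch with made-up data, assuming that keyword API rather than the `sortby!` and `Sort.Lt` forms discussed in this thread:

```julia
v = ["pear", "fig", "banana"]

# Proxy-value sort: each element is mapped through `length` before comparing.
by_len = sort!(copy(v); by = length)
println(by_len)   # ["fig", "pear", "banana"]

# Fully custom comparison: longest string first.
longest_first = sort!(copy(v); lt = (a, b) -> length(a) > length(b))
println(longest_first)   # ["banana", "pear", "fig"]
```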

Kevin Squire

Feb 2, 2013, 4:45:35 PM
to julia...@googlegroups.com
I just sent a pull request which lets DataFrames be sorted with the new Sort framework (https://github.com/HarlanH/DataFrames.jl/pull/177).

There was somewhat of an impedance mismatch--in particular, the sort functions expect an AbstractVector as input, and DataFrames are not AbstractVectors, so I needed a wrapper class there, and another for DataFrame rows to allow useful comparisons.  The former wrapper isn't sophisticated enough to allow the modifying forms (sort! or sortby!) to work right now.

This implementation has some advantages over Tom's two-column sort above (in particular, it sorts the data only once, instead of sorting different columns at different times), and some potential disadvantages (sorting different columns in different ways, e.g. one ascending and one descending, may be more challenging). Some thought needs to be put into how to specify different sort orders for different columns (how does R do it?).
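
One way to picture a single-pass sort with mixed directions is to compare each row as a tuple of keys. A rough sketch in plain Julia, using the toy columns from Tom's example and assuming numeric columns so that negation can stand in for a descending order; this is not the approach taken in the pull request, and `asc` and `mixed` are hypothetical names:

```julia
x = [9, 2, 2, 3]
y = [1, 2, 3, 1]

# Both ascending: tuples compare lexicographically, so y breaks ties in x.
asc = sortperm(collect(zip(x, y)))
println(asc)    # [2, 3, 4, 1]

# x ascending, y descending: negate the key that should run in reverse.
mixed = sort(collect(eachindex(x)); by = i -> (x[i], -y[i]))
println(mixed)  # [3, 2, 4, 1]
```

Negation only works for numeric keys; other column types would need a custom comparison instead.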

Q: what should happen with NAs?

Cheers!

  Kevin


On Saturday, February 2, 2013 11:14:34 AM UTC-8, Stefan Karpinski wrote:
A little clarification here. When you say "ordering" a data frame, what does that mean? Finding the permutation that would put the data in order, or actually sorting the rows? For the latter, I would use sort! or sortby!. The sortby! function for vectors takes as its first argument some function that maps each element to a proxy value used for sorting. For data frames, the names of some set of columns naturally correspond to the function mapping each row to just those column values, which one can then sort by.

There is some argument for making sort!(f,v) do what sortby!(f,v) currently does, especially since sort!(v) is equivalent to sortby!(identity,v). Then we need a way to provide a fully custom sort comparison function, but that can always be done with sort!(Sort.Lt((a,b)->xxx),v).

I would second this change.  I