DataFrames : Apply a function by rows


Fred

Nov 21, 2015, 7:19:17 AM
to julia-users
Hi,

In DataFrames, it is easy to apply a function by columns using the colwise() function. But I find it very difficult and inefficient to apply a function by rows.

For example :


 
julia> df = DataFrame(a=1:5, b=7:11, c=10:14)
5x3 DataFrames.DataFrame
| Row | a | b  | c  |
|-----|---|----|----|
| 1   | 1 | 7  | 10 |
| 2   | 2 | 8  | 11 |
| 3   | 3 | 9  | 12 |
| 4   | 4 | 10 | 13 |
| 5   | 5 | 11 | 14 |

 
 
julia> colwise(mean,df)
3-element Array{Any,1}:
 [3.0]
 [9.0]
 [12.0]

julia> colwise(mean,df[1,1:2])
2-element Array{Any,1}:
 [1.0]
 [7.0]



To calculate the mean of a row (or a subset), the only way I found is this :

julia> mean(convert(Array,df[1,1:3]))
6.0
 



I think this is inefficient and probably very slow. Is there a better way to apply a function by rows?

Thanks !

Tom Short

Nov 21, 2015, 8:04:11 AM
to julia...@googlegroups.com

You can try `eachrow`. It probably won't be fast, though. Here's an example:

https://github.com/JuliaStats/DataFrames.jl/blob/master/test/iteration.jl#L34

Fred

Nov 21, 2015, 8:43:53 AM
to julia-users
Thanks for the answer. I tried "eachrow" but I have 2 problems :

1- I still have to do an array conversion, which I think is slow


julia> for r in eachrow(df)
           println(mean(convert(Array,r)))
       end
6.0
7.0
8.0
9.0
10.0



2- I cannot manage to use a subset of the row, for example the first 2 values:
 
julia> for r in eachrow(df)
           println(mean(convert(Array,r[1:2])))
       end
WARNING: [a] concatenation is deprecated; use collect(a) instead
 in depwarn at deprecated.jl:73
 in oldstyle_vcat_warning at ./abstractarray.jl:29
 [inlined code] from none:2
 in anonymous at no file:0
while loading no file, in expression starting on line 0
4.0
 


Tom Short

Nov 21, 2015, 9:08:34 AM
to julia...@googlegroups.com

For the subset, do the indexing after the conversion to an array, or subset the DataFrame first (probably faster).
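
A minimal sketch of both options, assuming a recent DataFrames.jl (1.x), where `Matrix(df)` takes the place of the `convert(Array, df)` call used earlier in this thread and `mean` comes from the `Statistics` standard library. The variable names are illustrative only.

```julia
using DataFrames, Statistics

df = DataFrame(a=1:5, b=7:11, c=10:14)

# Option 1: convert the whole table once, then index the matrix.
M = Matrix(df)                                   # 5x3 matrix of Int
means_after = [mean(M[i, 1:2]) for i in 1:size(M, 1)]

# Option 2: subset the columns first, then convert only what is needed.
sub = Matrix(df[:, [:a, :b]])                    # 5x2 matrix of Int
means_first = [mean(sub[i, :]) for i in 1:size(sub, 1)]
```

Both vectors hold the mean of the first two values of each row. Option 2 converts less data, which is presumably why subsetting first is expected to be faster.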

Fred

Nov 21, 2015, 1:17:27 PM
to julia-users
It is a good idea, but how is it possible to iterate over two DataFrames at the same time? Something like:

julia> df = DataFrame(a=1:5, b=7:11, c=10:14, d=20:24)
5x4 DataFrames.DataFrame
| Row | a | b  | c  | d  |
|-----|---|----|----|----|
| 1   | 1 | 7  | 10 | 20 |
| 2   | 2 | 8  | 11 | 21 |
| 3   | 3 | 9  | 12 | 22 |
| 4   | 4 | 10 | 13 | 23 |
| 5   | 5 | 11 | 14 | 24 |

julia> df1 = df[1:2,]
5x2 DataFrames.DataFrame
| Row | a | b  |
|-----|---|----|
| 1   | 1 | 7  |
| 2   | 2 | 8  |
| 3   | 3 | 9  |
| 4   | 4 | 10 |
| 5   | 5 | 11 |

julia> df2 = df[3:4,]
5x2 DataFrames.DataFrame
| Row | c  | d  |
|-----|----|----|
| 1   | 10 | 20 |
| 2   | 11 | 21 |
| 3   | 12 | 22 |
| 4   | 13 | 23 |
| 5   | 14 | 24 |

julia> for r1,r2 in eachrow(df1, df2)
           println(mean(r1,r2))
       end
ERROR: syntax: invalid iteration specification
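
The loop above fails for two reasons: destructuring in a `for` header needs parentheses, as in `for (r1, r2) in ...`, and `eachrow` takes a single table. One possible way to walk two tables in step is `zip` from Base. The sketch below assumes a recent DataFrames.jl (1.x), where `Vector(row)` turns a row into a plain vector. The difference of row means is only a placeholder computation, and `diffs` is an illustrative name.

```julia
using DataFrames, Statistics

df  = DataFrame(a=1:5, b=7:11, c=10:14, d=20:24)
df1 = df[:, [:a, :b]]     # first two columns
df2 = df[:, [:c, :d]]     # last two columns

# zip pairs the i-th row of df1 with the i-th row of df2.
diffs = Float64[]
for (r1, r2) in zip(eachrow(df1), eachrow(df2))
    x = Vector(r1)        # values of the df1 row as a plain vector
    y = Vector(r2)        # values of the df2 row as a plain vector
    push!(diffs, mean(y) - mean(x))
end
```

Any function of two vectors can replace the placeholder inside the loop.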

Fred

Nov 22, 2015, 2:54:27 AM
to julia-users
In my last example, the function mean() is not well chosen. In fact, what I would like to calculate is a statistical test line by line, like TTest or Wilcoxon. This is why I need to iterate through 2 DataFrames at the same time if I subset the DataFrame first to increase speed :)


Something like :

julia> for r1,r2 in eachrow(df1, df2)
           println(TTest(r1,r2))
       end
ERROR: syntax: invalid iteration specification

Tom Short

Nov 22, 2015, 8:11:21 AM
to julia...@googlegroups.com

I'd convert the whole DataFrame to a matrix and use a loop over rows.
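
A rough sketch of that approach, assuming a recent DataFrames.jl (1.x) for `Matrix(df)`. The helper `welch_t` is hypothetical: it is a hand-written Welch t statistic standing in for the row-wise test discussed above, not the API of any statistics package.

```julia
using DataFrames, Statistics

# Hypothetical helper: Welch's t statistic for two samples x and y.
welch_t(x, y) = (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))

df = DataFrame(a=1:5, b=7:11, c=10:14, d=20:24)
M = Matrix(df)            # convert once, outside the loop

# Loop over rows; views avoid copying each row slice.
tstats = [welch_t(view(M, i, 1:2), view(M, i, 3:4)) for i in 1:size(M, 1)]
```

Converting once keeps the per-row work down to plain array indexing, which avoids the repeated row-to-array conversions from earlier in the thread.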

Fred

Nov 22, 2015, 8:23:52 AM
to julia-users
Yes, it is a good solution, but it means that DataFrames cannot be used to do some calculations by rows, which is a severe limitation. An equivalent of colwise() would be very useful.

Tom Short

Nov 22, 2015, 9:48:37 AM
to julia...@googlegroups.com

Contributions/pull requests from folks that need that are welcome. I don't have that need. For row operations, I can generally get by with loops or `@byrow!` in DataFramesMeta.

Fred

Nov 22, 2015, 10:03:54 AM
to julia-users
OK, I hope this exchange can contribute new ideas to improve DataFrames, although there are other ways to do it, like converting a DataFrame or a row into an array. Thank you for your help!