DataFrames : Apply a function by rows

Fred

Nov 21, 2015, 7:19:17 AM

to julia-users

Hi,

In DataFrames, it is easy to apply a function by columns using the colwise() function. But I find it very difficult and inefficient to apply a function by rows.

For example :


 
 julia> df = DataFrame(a=1:5, b=7:11, c=10:14) 
5x3 DataFrames.DataFrame 
| Row | a | b  | c  | 
|-----|---|----|----| 
| 1   | 1 | 7  | 10 | 
| 2   | 2 | 8  | 11 | 
| 3   | 3 | 9  | 12 | 
| 4   | 4 | 10 | 13 | 
| 5   | 5 | 11 | 14 | 

 
 
 julia> colwise(mean,df) 
3-element Array{Any,1}: 
 [3.0]  
 [9.0]  
 [12.0]
 
 
 julia> colwise(mean,df[1,1:2]) 
2-element Array{Any,1}: 
 [1.0] 
 [7.0]

To calculate the mean of a row (or a subset), the only way I found is this :

julia> mean(convert(Array,df[1,1:3])) 
6.0

I think this is inefficient and probably very slow. Is there a better way to apply a function by rows ?

Thanks !

Tom Short

Nov 21, 2015, 8:04:11 AM

to julia...@googlegroups.com

You can try `eachrow`. It probably won't be fast, though. Here's an example:

https://github.com/JuliaStats/DataFrames.jl/blob/master/test/iteration.jl#L34
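As a rough sketch of the `eachrow` approach, assuming a recent DataFrames.jl release (1.x) rather than the 2015 version discussed in this thread, the row means could be collected like this. The name `row_means` is only illustrative.

```julia
using DataFrames, Statistics

df = DataFrame(a=1:5, b=7:11, c=10:14)

# Each `r` is a DataFrameRow; Vector(r) copies its values into a plain vector,
# so a conversion per row is still happening here.
row_means = [mean(Vector(r)) for r in eachrow(df)]

println(row_means)
```

This still allocates one small vector per row, so it is convenient rather than fast.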

Fred

Nov 21, 2015, 8:43:53 AM

to julia-users

Thanks for the answer. I tried "eachrow" but I have 2 problems :

1- I still have to do an array conversion, which I think is slow

julia> for r in eachrow(df) 
              println(mean(convert(Array,r))) 
       end 
6.0 
7.0 
8.0 
9.0 
10.0

2- I cannot manage to use a subset of the row, for example the first 2 values :

julia> for r in eachrow(df) 
              println(mean(convert(Array,r[1:2]))) 
       end 
WARNING: [a] concatenation is deprecated; use collect(a) instead 
 in depwarn at deprecated.jl:73 
 in oldstyle_vcat_warning at ./abstractarray.jl:29 
 [inlined code] from none:2 
 in anonymous at no file:0 
while loading no file, in expression starting on line 0 
4.0

Tom Short

Nov 21, 2015, 9:08:34 AM

to julia...@googlegroups.com

For the subset, do the indexing after the conversion to an array, or subset the DataFrame first (probably faster).
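Both options might look roughly like the following, again assuming a recent DataFrames.jl (1.x) where `df[:, 1:2]` selects columns. The variable names are placeholders, not part of any API.

```julia
using DataFrames, Statistics

df = DataFrame(a=1:5, b=7:11, c=10:14)

# Option 1: convert each row to a vector, then index the vector.
means_after = [mean(Vector(r)[1:2]) for r in eachrow(df)]

# Option 2: select the two columns once, then iterate over the smaller table.
sub = df[:, 1:2]
means_first = [mean(Vector(r)) for r in eachrow(sub)]

println(means_after)
println(means_first)
```

Option 2 does the column selection once instead of once per row, which is why it is likely the cheaper of the two.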

Fred

Nov 21, 2015, 1:17:27 PM

to julia-users

It is a good idea, but how is it possible to iterate over two DataFrames at the same time ? Something like :

julia> df = DataFrame(a=1:5, b=7:11, c=10:14, d=20:24)
5x4 DataFrames.DataFrame
| Row | a | b  | c  | d  |
|-----|---|----|----|----|
| 1   | 1 | 7  | 10 | 20 |
| 2   | 2 | 8  | 11 | 21 |
| 3   | 3 | 9  | 12 | 22 |
| 4   | 4 | 10 | 13 | 23 |
| 5   | 5 | 11 | 14 | 24 |

julia> df1 = df[1:2,]
5x2 DataFrames.DataFrame
| Row | a | b  |
|-----|---|----|
| 1   | 1 | 7  |
| 2   | 2 | 8  |
| 3   | 3 | 9  |
| 4   | 4 | 10 |
| 5   | 5 | 11 |

julia> df2 = df[3:4,]
5x2 DataFrames.DataFrame
| Row | c  | d  |
|-----|----|----|
| 1   | 10 | 20 |
| 2   | 11 | 21 |
| 3   | 12 | 22 |
| 4   | 13 | 23 |
| 5   | 14 | 24 |

julia> for r1,r2 in eachrow(df1, df2)
              println(mean(r1,r2))
       end
ERROR: syntax: invalid iteration specification

Fred

Nov 22, 2015, 2:54:27 AM

to julia-users

In my last example, the function mean() is not well chosen. In fact, what I would like to calculate is a statistical test line by line, like TTest or Wilcoxon. This is why I need to iterate through 2 DataFrames at the same time if I subset the DataFrame first to increase speed :)

Something like :

julia> for r1,r2 in eachrow(df1, df2)
              println(TTest(r1,r2))
       end
ERROR: syntax: invalid iteration specification
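
The loop above fails because `for r1,r2 in ...` is not valid Julia syntax, and `eachrow` takes a single table. One possible way to walk two tables in lockstep is `zip` with a destructured tuple, sketched here under the assumption of a recent DataFrames.jl (1.x). `welch_t` is a hypothetical helper that computes Welch's t statistic by hand as a stand-in for a real test function. It is not taken from any package.

```julia
using DataFrames, Statistics

df = DataFrame(a=1:5, b=7:11, c=10:14, d=20:24)
df1 = df[:, 1:2]   # columns a, b
df2 = df[:, 3:4]   # columns c, d

# Hypothetical helper: Welch's t statistic for two samples (illustration only).
welch_t(x, y) = (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))

# zip pairs row i of df1 with row i of df2; note the parentheses around (r1, r2).
stats = [welch_t(Vector(r1), Vector(r2)) for (r1, r2) in zip(eachrow(df1), eachrow(df2))]

println(stats)
```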

Tom Short

Nov 22, 2015, 8:11:21 AM

to julia...@googlegroups.com

I'd convert the whole DataFrame to a matrix and use a loop over rows.
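
A minimal sketch of that idea, assuming a recent DataFrames.jl (1.x) where `Matrix(df)` performs the conversion and all columns share a numeric type. The names `m`, `diffs`, and `row_means` are illustrative only.

```julia
using DataFrames, Statistics

df = DataFrame(a=1:5, b=7:11, c=10:14, d=20:24)
m = Matrix(df)   # one conversion for the whole table

# Loop over rows; views avoid copying the two halves of each row.
diffs = Float64[]
for i in axes(m, 1)
    x = @view m[i, 1:2]
    y = @view m[i, 3:4]
    push!(diffs, mean(y) - mean(x))
end

# For simple reductions no loop is needed: reduce along the second dimension.
row_means = vec(mean(m, dims=2))

println(diffs)
println(row_means)
```

The conversion cost is paid once up front, after which each row is ordinary array indexing.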

Fred

Nov 22, 2015, 8:23:52 AM

to julia-users

Yes, it is a good solution, but it means that DataFrames cannot be used to do some calculations by rows, which is a severe limitation. An equivalent of colwise() would be very useful.

Tom Short

Nov 22, 2015, 9:48:37 AM

to julia...@googlegroups.com

Contributions/pull requests from folks that need that are welcome. I don't have that need. For row operations, I can generally get by with loops or `@byrow!` in DataFramesMeta.

Fred

Nov 22, 2015, 10:03:54 AM

to julia-users

Ok, I hope this exchange can help bring new ideas to improve DataFrames, although there are other ways to do it, like converting a DataFrame or a row into an array. Thank you for your help !
