Adding a row to a DataFrame


Tomas Lycken

May 26, 2014, 11:59:28 AM
to julia...@googlegroups.com
I'm probably just being incredibly daft, but I can't figure out how to add a new row to a DataFrame.

Basically, I have a bunch of data sets for which I want to perform some calculations - let's say the mean and standard deviation of something - each dataset corresponding to some named category of data. So I do the following to construct my new DataFrame:

julia> measures = DataFrame()
julia> measures[:Mean] = Float64[]
julia> measures[:StdDev] = Float64[]
julia> measures[:Category] = Symbol[]

Now, I want to add some values that are the results of a calculation on a different data set, and I try this:

julia> push!(psispread, [1.0,0.1,:Fake])
ERROR: no method push!(DataFrame, Array{Any,1})
julia> append!(psispread, [1.0,0.1,:Fake])
ERROR: no method append!(DataFrame, Array{Any,1})
julia> psispread[1,:] = [1.0,0.1,:Fake]
ERROR: BoundsError()
 in setindex! at /home/tlycken/.julia/v0.3/DataArrays/src/dataarray.jl:764
 in insert_single_entry! at /home/tlycken/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:410
 in setindex! at /home/tlycken/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:521

Is there a nice and simple way to add a row to a DataFrame without having to do it one value at a time?

// T

John Myles White

May 26, 2014, 7:49:41 PM
to julia...@googlegroups.com
You can append a one-row DataFrame to your existing DataFrame.

— John

Jason Solack

May 26, 2014, 9:08:57 PM
to julia...@googlegroups.com
this works for me:

dfA = DataFrame(A=[1:10], B=[11:20])
dfB = DataFrame(A=11, B=21)
append!(dfA, dfB)
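
For readers on a current Julia and DataFrames.jl (an assumption; the snippet above was written for Julia 0.3), a minimal sketch of the same one-row append might look like this. Note that `[1:10]` no longer expands a range into a vector on current Julia, so `collect` is used instead.

```julia
using DataFrames

# collect(1:10) builds a Vector{Int}; on current Julia, [1:10] would be a
# one-element vector holding the range itself.
dfA = DataFrame(A = collect(1:10), B = collect(11:20))

# A one-row DataFrame built from scalars.
dfB = DataFrame(A = 11, B = 21)

# append! mutates dfA in place, adding the rows of dfB at the end.
append!(dfA, dfB)

println(size(dfA))
```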

Kevin Squire

May 26, 2014, 11:14:24 PM
to julia...@googlegroups.com
It shouldn't be that hard to make the array version work.  I might give it a shot, unless that isn't desired. 

Kevin

John Myles White

May 26, 2014, 11:19:16 PM
to julia...@googlegroups.com
I’m not really opposed to it, but I’m also not super excited about it. It’s a redundant and non-obvious interface: I’ve seen people try to use both vectors and 1-row matrices to do this. That suggests to me there’s no clear right answer, so picking one way arbitrarily (appending only DataFrames to DataFrames) is pretty reasonable.

 — John

Kevin Squire

May 26, 2014, 11:38:34 PM
to julia...@googlegroups.com
So, the other argument is that, if the types fit, why not make it easy to append data to a DataFrame via any iterable?  Constructing a DataFrame just to append it to another DataFrame and throw it away seems wasteful, especially since a new array is allocated for each column, and (I think) each array allocates space for 16 elements.  That means we're allocating and throwing away, e.g., 128 bytes per Float64 column, just so we can append one number to the column. 
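
The 128-byte figure is just the element size times the preallocated slot count. A quick check, where the 16 slots are the estimate from the paragraph above ("I think"), not a verified property of any Julia version:

```julia
# Assumed preallocation per column, taken from the estimate above.
slots = 16

# A Float64 occupies 8 bytes, so this is the space allocated and then
# discarded for each Float64 column of the temporary one-row DataFrame.
bytes_per_column = slots * sizeof(Float64)

println(bytes_per_column)
```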

If we had a separate type for DataFrame rows, on the other hand... 

Cheers,
  Kevin

Tomas Lycken

May 27, 2014, 2:52:41 AM
to julia...@googlegroups.com
Aside from the memory allocation concerns already raised, I also think that constructing a dataframe just to add it to another adds quite a lot of redundancy in the code. For example, I'll have to specify the column names an extra time for each row I append, rather than just once at the beginning. (However, this argument might be moot if the column order is not always well-defined - in that case, I don't really see a way around creating a new dataframe, since the columns need to be named.)

I just find the whole procedure of constructing a full data frame, only to append it and throw it away, very roundabout and complicated.

// T



Jacques Rioux

May 27, 2014, 4:11:15 PM
to julia...@googlegroups.com
Let me add a thought here. I also think that adding a row to a dataframe should be easier. However, I do not think that an array would be the best container to represent a row, because array members must all be of the same type, which leaves Any as the only option in your example.

I think that appending or pushing a tuple with the right types could be made to work. 

So it would be 

julia> push!(psispread, (1.0,0.1,:Fake))

or

julia> append!(psispread, (1.0,0.1,:Fake))

since 

julia> typeof((1.0, 0.1, :fake))
(Float64,Float64,Symbol)

Note, I am not saying that this works now but that it could be made to work by adding the corresponding method to either function. It seems it is the right construct.
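
To make the type argument concrete, here is a small sketch on a current Julia with a recent DataFrames.jl (both assumptions; the thread predates them, and tuple types print differently today than in the 0.3 output above). The name `measures` is borrowed from the original question.

```julia
using DataFrames

row_vec = [1.0, 0.1, :Fake]   # mixed values force the element type to widen to Any
row_tup = (1.0, 0.1, :Fake)   # a tuple keeps a separate type for each slot

println(typeof(row_vec))
println(typeof(row_tup))

# An empty DataFrame with typed columns, then a tuple pushed as one row.
measures = DataFrame(Mean = Float64[], StdDev = Float64[], Category = Symbol[])
push!(measures, row_tup)

println(measures)
```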

Any thoughts?

Tomas Lycken

May 28, 2014, 2:37:43 AM
to julia...@googlegroups.com
I like it - but maybe it wasn't so hard to guess that I would ;)

// T

John Myles White

May 28, 2014, 10:43:24 AM
to julia...@googlegroups.com
I’m happy with using tuples since that will make it easier to construct DataFrames from iterators.

 — John

John Myles White

Jun 6, 2014, 12:12:30 PM
to julia...@googlegroups.com
If someone wants to submit a PR to allow adding a tuple as a row to a DataFrame, I’ll merge it.

 — John

Stefan Karpinski

Jun 6, 2014, 12:20:43 PM
to julia...@googlegroups.com
See also https://github.com/JuliaStats/DataFrames.jl/issues/585. Using a tuple may make more sense, but it probably wouldn't hurt to allow an array as well.

John Myles White

Jun 6, 2014, 12:45:42 PM
to julia...@googlegroups.com
The thing that annoys me about arrays is that we arguably need to accept both vectors and 1-row matrices as inputs.

 -- John

Stefan Karpinski

Jun 6, 2014, 12:55:23 PM
to julia...@googlegroups.com
Since all three can be indexed the same way, it seems like that should be a minimal annoyance, no?

John Myles White

Jun 6, 2014, 12:58:51 PM
to julia...@googlegroups.com
Yeah, I just dislike the gratuitous multiplicity of ways to do the same thing.

 -- John

Ivar Nesje

Jun 6, 2014, 4:07:59 PM
to julia...@googlegroups.com
Why can't any iterable (of the correct length) be accepted?

As long as the DataFrame has predefined types on the columns, it is just a matter of asserting or converting the type and copying it in. Convert would probably be slower because the types would be unknown and it would have to dispatch dynamically to the right convert method.
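
A rough sketch of that idea in plain Julia (current syntax), with a vector of typed column vectors standing in for the DataFrame. The names `push_row!`, `columns` and `strict` are made up for illustration and are not part of DataFrames.jl.

```julia
# `columns` is a hypothetical stand-in for a table: one typed vector per column.
# strict = true asserts that each value already has the column's type;
# strict = false converts each value to the column's type instead.
function push_row!(columns, iterable; strict::Bool = false)
    length(iterable) == length(columns) ||
        throw(ArgumentError("expected $(length(columns)) values, got $(length(iterable))"))

    # Check or convert every value before touching any column, so that a bad
    # value cannot leave a partial row behind.
    checked = Any[]
    for (col, v) in zip(columns, iterable)
        T = eltype(col)
        if strict
            v isa T || throw(ArgumentError("$(repr(v)) is not a $T"))
            push!(checked, v)
        else
            push!(checked, convert(T, v))
        end
    end

    for (col, v) in zip(columns, checked)
        push!(col, v)
    end
    return columns
end

columns = AbstractVector[Float64[], Float64[], Symbol[]]
push_row!(columns, (1.0, 0.1, :Fake))               # a tuple
push_row!(columns, (x for x in (2, 0.2, :Other)))   # a generator; 2 is converted to 2.0
```

The conversion branch calls `convert` with a type that is only known at run time, which is the dynamic dispatch cost mentioned above.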

John Myles White

Jun 6, 2014, 5:16:11 PM
to julia...@googlegroups.com
You're right: any iterable could work.

Personally, I tend to minimize the use of functionality that depends upon the columns of a DataFrame being in a specific order. It's certainly useful in many cases, so we can't get rid of it. But I'm not excited about people writing a lot more code that depends upon order than they do now.
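
For comparison, a sketch of order-independent row insertion, assuming a recent DataFrames.jl (1.x) in which `push!` accepts a NamedTuple or a Dict and matches values to columns by name. Nothing like this existed when the thread was written.

```julia
using DataFrames

measures = DataFrame(Mean = Float64[], StdDev = Float64[], Category = Symbol[])

# Values are matched to columns by name, so the order written here is irrelevant.
push!(measures, (Category = :Fake, StdDev = 0.1, Mean = 1.0))
push!(measures, Dict(:Mean => 2.0, :Category => :Other, :StdDev => 0.2))

println(measures)
```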

 -- John

Gustavo Lacerda

Jun 9, 2014, 3:44:28 PM
to julia...@googlegroups.com
I've implemented this:

function push!(df::DataFrame, arr::Array)
    K = length(arr)
    assert(size(df,2)==K)
    col_types = map(eltype, eachcol(df))
    converted = map(i -> convert(col_types[i][1], arr[i]), 1:K)
    ## To do: throw error if convert fails
    df2 = DataFrame(reshape(converted, 1, K))
    names!(df2, names(df))
    append!(df,df2)
end

X1 = rand(Normal(0,1), 10); X2 = rand(Normal(0,1), 10); X3 = rand(Normal(0,1), 10); Y = X1 - X2 + rand(Normal(0,1), 10)
df = DataFrame(Y=Y, X1=X1, X2=X2, X3=X3)
push!(df, [1,2,3,4])


I tried to generalize it by replacing Array with Tuple.


function push!(df::DataFrame, tup::Tuple)
    K = length(tup)
    assert(size(df,2)==K)
    col_types = map(eltype, eachcol(df))
    converted = map(i -> convert(col_types[i][1], tup[i]), 1:K)
    ## To do: throw error if convert fails
    df2 = DataFrame(reshape(converted, 1, K))
    names!(df2, names(df))
    append!(df,df2)
end

julia> df[:greeting] = "hello"
"hello"

julia> df
11x5 DataFrame
|-------|-----------|-------------|-----------|------------|----------|
| Row # | Y         | X1          | X2        | X3         | greeting |
| 1     | 0.39624   | 0.163897    | -0.146526 | 0.592489   | "hello"  |
| 2     | -0.236239 | -1.81627    | -0.726978 | 0.638524   | "hello"  |
| 3     | -0.801656 | 0.000801096 | 0.543645  | -0.997613  | "hello"  |
| 4     | -0.30888  | -0.166953   | 0.640827  | 1.53217    | "hello"  |
| 5     | -0.662719 | -1.38129    | -0.194937 | 0.928446   | "hello"  |
| 6     | 4.37102   | 2.22107     | -2.15648  | -0.703392  | "hello"  |
| 7     | 0.0866397 | -0.633333   | -0.745456 | -0.0144429 | "hello"  |
| 8     | 0.581942  | 1.24061     | -0.867256 | 0.283671   | "hello"  |
| 9     | -3.15614  | -1.39045    | 1.34395   | 0.343224   | "hello"  |
| 10    | -1.67029  | 0.634846    | 2.08062   | -0.845479  | "hello"  |
| 11    | 1.0       | 2.0         | 3.0       | 4.0        | "hello"  |


But then this happens:

julia> push!(df, (1,2,3,4, "hi"))
ERROR: no method convert(Type{Float64}, ASCIIString)
 in setindex! at array.jl:305
 in map_range_to! at range.jl:523
 in map at range.jl:534
 in push! at none:5


It apparently tries to convert "hi" to Float64, even though the 5th type is ASCIIString:

julia> col_types
1x5 DataFrame
|-------|---------|---------|---------|---------|-------------|
| Row # | Y       | X1      | X2      | X3      | label       |
| 1     | Float64 | Float64 | Float64 | Float64 | ASCIIString |
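
One plausible reading of the stack trace: the failure comes from `map` over the range `1:K`, not from `col_types`. On Julia 0.3, `map` over a range appears to have allocated its result array using the type of the first result (Float64 here), so storing "hi" into that array triggered the conversion error. A sketch of a way around it on current Julia is to map over tuples, which keeps one type per position. `String` stands in for 0.3's `ASCIIString`, and the column types are written out by hand as an assumption.

```julia
# Column types written out by hand (an assumption for this sketch).
col_types = (Float64, Float64, Float64, Float64, String)
tup = (1, 2, 3, 4, "hi")

# map over two tuples calls convert(T, x) pairwise and returns a tuple,
# so each position keeps its own type.
converted = map(convert, col_types, tup)

println(converted)
println(typeof(converted))
```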


Gustavo

Keith Campbell

Jun 9, 2014, 4:17:28 PM
to julia...@googlegroups.com
Thanks for putting this together.
Under 0.3 pre from yesterday, I get a deprecation warning in the Array version where df2 is assigned.  The tweak below appears to resolve that warning:

function push!(df::DataFrame, arr::Array)
    K = length(arr)
    assert(size(df,2)==K)
    col_types = map(eltype, eachcol(df))
    converted = map(i -> convert(col_types[i][1], arr[i]), 1:K)
    ## To do: throw error if convert fails
    df2 = convert( DataFrame, reshape(converted, 1, K) )   # <==tweaked
    names!(df2, names(df))
    append!(df,df2)
end

John Myles White

Jun 9, 2014, 10:41:39 PM
to julia...@googlegroups.com
Would be good to clean this up by removing some of the slow parts (map usage, anonymous function usage) and have it submitted as a PR.

 — John

Gustavo Lacerda

Jun 9, 2014, 11:14:24 PM
to julia...@googlegroups.com
OK, but first I want to make it work for heterogeneous lists (tuples), which is mysteriously failing.

Gustavo
--
Gustavo Lacerda
http://www.optimizelife.com

Keith Campbell

Jun 10, 2014, 7:35:36 AM
to julia...@googlegroups.com, gus...@optimizelife.com
Hey Gustavo,

Below is a crack at a version that handles tuples and deals with some of the issues John raised.  You can see some simple tests at http://nbviewer.ipython.org/gist/catawbasam/003743259cf0a6ec968d.

If you're interested in working it over for a pull request, please feel free.  If you'd like me to do it, I'd be happy to. And if this seems like the wrong approach, that's fine too.
cheers,
Keith

import Base.push!
function push!(df::DataFrame, iterable)
    K = length(iterable)
    assert(size(df,2)==K)
    i=1
    for t in iterable
        try 
            #println(i,t, typeof(t))
            push!(df.columns[i], t)
        catch
            #clean up partial row
            for j in 1:(i-1)
                pop!(df.columns[j])
            end
            msg = "Error adding $t to column $i."
            throw(ArgumentError(msg))
        end    
        i=i+1
    end
end
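
The same push-then-roll-back technique can be tried without DataFrames on a current Julia. This is only a sketch under that assumption: a plain vector of typed vectors replaces `df.columns`, `push_row_rollback!` is a made-up name, and it returns the columns for convenience.

```julia
# `columns` is a hypothetical stand-in for df.columns: one typed vector per column.
function push_row_rollback!(columns, iterable)
    length(iterable) == length(columns) ||
        throw(ArgumentError("row length does not match the number of columns"))
    i = 1
    for v in iterable
        try
            # Each column's own push! converts v to the column's element type.
            push!(columns[i], v)
        catch
            # Undo the values already pushed to columns 1..i-1 for this row.
            for j in 1:(i - 1)
                pop!(columns[j])
            end
            throw(ArgumentError("Error adding $v to column $i."))
        end
        i += 1
    end
    return columns
end

columns = AbstractVector[Float64[], Float64[], Symbol[]]
push_row_rollback!(columns, (1.0, 0.1, :Fake))
```

If a later value cannot be stored, the earlier columns are popped back, so every column keeps the same length.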

Gustavo Lacerda

Jun 10, 2014, 10:01:36 AM
to Keith Campbell, julia...@googlegroups.com
hey Keith,

Your solution is elegant because it delegates conversion to the column
push!, i.e. push!{S,T}(dv::DataArray{S,1},v::T)

I have tested it, and it works for me too. This is your code, so I
think you should get all the credit.

Gustavo
--
Gustavo Lacerda
http://www.optimizelife.com


Keith Campbell

Jun 10, 2014, 11:14:55 AM
to julia...@googlegroups.com, keith...@gmail.com, gus...@optimizelife.com
Thanks for the kind words.  I'll put together  a pull request.

Gustavo Lacerda

Jul 19, 2014, 1:53:58 AM
to Keith Campbell, julia...@googlegroups.com
hi Keith,

Are you still planning to do the pull request?

Gustavo

Keith Campbell

Jul 19, 2014, 11:02:18 AM
to julia...@googlegroups.com, keith...@gmail.com, gus...@optimizelife.com
(copied from email reply)
"Sorry! I thought you would be notified, but I guess the discussion was all on the list rather than in an Issue. 

It was pull request #621, merged June 10.
You can see the code changes at:
"
A 2nd pull request relaxed the typing on the Associative argument:

Gustavo Lacerda

Jul 19, 2014, 11:38:00 AM
to julia...@googlegroups.com
oh, so why can't I see your method signature? I just did a fresh
Pkg.update() to be sure...

julia> methods(push!)
# 13 methods for generic function "push!":
push!(a::Array{Any,1},item) at array.jl:464
push!{T}(a::Array{T,1},item) at array.jl:453
push!(B::BitArray{1},item) at bitarray.jl:454
push!(s::IntSet,n::Integer) at intset.jl:32
push!(::EnvHash,k::String,v) at env.jl:114
push!(t::Associative{K,V},key,v) at dict.jl:241
push!(s::Set{T},x) at set.jl:18
push!(a::PyVector{T},item) at /Users/gustavolacerda/.julia/v0.3/PyCall/src/conversions.jl:276
push!(a::SynchronousStepCollection,args...) at /Users/gustavolacerda/.julia/v0.3/BinDeps/src/BinDeps.jl:106
push!(c::Choices,args...) at /Users/gustavolacerda/.julia/v0.3/BinDeps/src/BinDeps.jl:177
push!(A) at abstractarray.jl:1390
push!(A,a,b) at abstractarray.jl:1391
push!(A,a,b,c...) at abstractarray.jl:1392

Keith Campbell

Jul 19, 2014, 2:10:29 PM
to julia...@googlegroups.com, gus...@optimizelife.com
Are you using Julia 0.3, and did you do 'using DataFrames'?

I get 22 methods, including DataFrames methods, after 'using DataFrames':

julia> methods(push!)
# 22 methods for generic function "push!":
push!(a::Array{Any,1},item) at array.jl:464
push!{T}(a::Array{T,1},item) at array.jl:453
push!(B::BitArray{1},item) at bitarray.jl:454
push!(s::IntSet,n::Integer) at intset.jl:32
push!(::EnvHash,k::String,v) at env.jl:114
push!(t::Associative{K,V},key,v) at dict.jl:241
push!(s::Set{T},x) at set.jl:18
push!{T,E}(h::Histogram{T,1,E},x::Real,w::Real) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:112
push!{T,E}(h::Histogram{T,1,E},x::Real) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:122
push!{T,N}(h::Histogram{T,N,E},xs::NTuple{N,Real},w::Real) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:149
push!{T,N}(h::Histogram{T,N,E},xs::NTuple{N,Real}) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:161
push!(dv::DataArray{T,1},v::NAtype) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:9
push!{S,T}(dv::DataArray{S,1},v::T) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:15
push!{T,R}(pdv::PooledDataArray{T,R,1},v::NAtype) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:123
push!{S,R,T}(pdv::PooledDataArray{S,R,1},v::T) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:128
push!(x::Index,nm::Symbol) at /home/keithc/.julia/v0.3/DataFrames/src/other/index.jl:66
push!(df::DataFrame,associative::Associative{Symbol,Any}) at /home/keithc/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:1038
push!(df::DataFrame,associative::Associative{K,V}) at /home/keithc/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:1056
push!(df::DataFrame,iterable) at /home/keithc/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:1076
push!(A) at abstractarray.jl:1390
push!(A,a,b) at abstractarray.jl:1391
push!(A,a,b,c...) at abstractarray.jl:1392
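
The difference between the two listings comes down to when the extra methods get defined: a method added to `push!` only shows up once the code defining it has been loaded. A small self-contained illustration on current Julia, using a made-up `TinyTable` type in place of a package:

```julia
# TinyTable is a hypothetical type standing in for a package-defined container.
struct TinyTable
    rows::Vector{Any}
end

n_before = length(methods(push!))

# Extending Base.push! for the new type, as a package does when it is loaded.
Base.push!(t::TinyTable, row) = (push!(t.rows, row); t)

n_after = length(methods(push!))

println(n_after - n_before)
println(hasmethod(push!, Tuple{TinyTable, Any}))
```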

Gustavo Lacerda

Jul 19, 2014, 2:12:07 PM
to julia...@googlegroups.com
ah, I see it now! Thanks and sorry for the silly question.