Adding a row to a DataFrame

Tomas Lycken

May 26, 2014, 11:59:28 AM

to julia...@googlegroups.com

I'm probably just being incredibly daft, but I can't figure out how to add a new row to a DataFrame.

Basically, I have a bunch of data sets for which I want to perform some calculations (let's say the mean and standard deviation of something), each data set corresponding to some named category of data. So I do the following to construct my new DataFrame:

julia> measures = DataFrame()

julia> measures[:Mean] = Float64[]

julia> measures[:StdDev] = Float64[]

julia> measures[:Category] = Symbol[]

Now, I want to add some values that are the results of a calculation on a different data set, and I try this:

julia> push!(psispread, [1.0,0.1,:Fake])

ERROR: no method push!(DataFrame, Array{Any,1})

julia> append!(psispread, [1.0,0.1,:Fake])

ERROR: no method append!(DataFrame, Array{Any,1})

julia> psispread[1,:] = [1.0,0.1,:Fake]

ERROR: BoundsError()

in setindex! at /home/tlycken/.julia/v0.3/DataArrays/src/dataarray.jl:764

in insert_single_entry! at /home/tlycken/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:410

in setindex! at /home/tlycken/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:521

Is there a nice and simple way to add a row to a DataFrame without having to do it one value at a time?

// T

John Myles White

May 26, 2014, 7:49:41 PM

to julia...@googlegroups.com

You can append a one-row DataFrame to your existing DataFrame.

— John

Jason Solack

May 26, 2014, 9:08:57 PM

to julia...@googlegroups.com

this works for me:

dfA = DataFrame(A=[1:10], B=[11:20])
dfB = DataFrame(A=11, B=21)
append!(dfA, dfB)

Kevin Squire

May 26, 2014, 11:14:24 PM

to julia...@googlegroups.com

It shouldn't be that hard to make the array version work. I might give it a shot, unless that isn't desired.

Kevin

John Myles White

May 26, 2014, 11:19:16 PM

to julia...@googlegroups.com

I’m not really opposed to it, but I’m also not super excited about it. It’s a redundant and non-obvious interface: I’ve seen people try to use both vectors and 1-row matrices to do this. That suggests to me there’s no clear right answer, so picking one way arbitrarily (appending only DataFrames to DataFrames) is pretty reasonable.

— John

Kevin Squire

May 26, 2014, 11:38:34 PM

to julia...@googlegroups.com

So, the other argument is that, if the types fit, why not make it easy to append data to a DataFrame via any iterable? Constructing a DataFrame just to append it to another DataFrame and throw it away seems wasteful, especially since a new array is allocated for each column, and (I think) each array allocates space for 16 elements. That means we're allocating and throwing away, e.g., 128 bytes per Float64 column, just so we can append one number to the column.

If we had a separate type for DataFrame rows, on the other hand...

Cheers,

Kevin
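
The 128-byte figure above is just the per-column arithmetic. A minimal sketch, assuming (as the post does, with an "I think") that each freshly allocated column reserves space for 16 elements; that capacity is taken from the post, not measured:

```julia
# Assumption from the post: a new column array reserves room for 16 elements.
initial_capacity = 16
bytes_per_value = sizeof(Float64)          # a Float64 occupies 8 bytes

# Bytes allocated per Float64 column of a throwaway one-row DataFrame.
wasted_bytes = initial_capacity * bytes_per_value
println(wasted_bytes)  # 128
```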

Tomas Lycken

May 27, 2014, 2:52:41 AM

to julia...@googlegroups.com

Aside from the memory allocation concerns already raised, I also think that constructing a dataframe just to add it to another adds quite a lot of redundancy in the code. For example, I'll have to specify the column names an extra time for each row I append, rather than just once at the beginning. (However, this argument might be moot if the column order is not always well-defined - in that case, I don't really see a way around creating a new dataframe, since the columns need to be named.)

I just find that the whole procedure of constructing a full data frame just to append it and throw it away seems very roundabout and complicated.

// T


Jacques Rioux

May 27, 2014, 4:11:15 PM

to julia...@googlegroups.com

Let me add a thought here. I also think that adding a row to a DataFrame should be easier. However, I do not think that an array would be the best container to represent a row, because array members must all be of the same type, which leaves Any as the only option in your example.

I think that appending or pushing a tuple with the right types could be made to work.

So it would be

julia> push!(psispread, (1.0,0.1,:Fake))

or

julia> append!(psispread, (1.0,0.1,:Fake))

since

julia> typeof((1.0, 0.1, :fake))

(Float64,Float64,Symbol)

Note, I am not saying that this works now but that it could be made to work by adding the corresponding method to either function. It seems it is the right construct.

Any thoughts?
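
The difference between the two containers can be checked directly. A small sketch, assuming a Julia 1.x session (tuple types print differently there than in the 0.3 output quoted above); the variable names are only for illustration:

```julia
row_as_array = [1.0, 0.1, :Fake]   # mixed values force a common element type
row_as_tuple = (1.0, 0.1, :Fake)   # a tuple keeps each position's own type

println(eltype(row_as_array))       # Any
println(map(typeof, row_as_tuple))  # (Float64, Float64, Symbol)
```

So the tuple still knows which column type each value belongs to, while the array has already erased that information.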

Tomas Lycken

May 28, 2014, 2:37:43 AM

to julia...@googlegroups.com

I like it - but maybe that wasn't so hard to guess I would ;)

// T

John Myles White

May 28, 2014, 10:43:24 AM

to julia...@googlegroups.com

I’m happy with using tuples since that will make it easier to construct DataFrames from iterators.

— John

John Myles White

Jun 6, 2014, 12:12:30 PM

to julia...@googlegroups.com

If someone wants to submit a PR to allow adding a tuple as a row to a DataFrame, I’ll merge it.

— John

Stefan Karpinski

Jun 6, 2014, 12:20:43 PM

to julia...@googlegroups.com

See also https://github.com/JuliaStats/DataFrames.jl/issues/585. Using a tuple may make more sense, but it probably wouldn't hurt to allow an array as well.

John Myles White

Jun 6, 2014, 12:45:42 PM

to julia...@googlegroups.com

The thing that annoys me about arrays is that we arguably need to accept both vectors and 1-row matrices as inputs.

-- John

Stefan Karpinski

Jun 6, 2014, 12:55:23 PM

to julia...@googlegroups.com

Since all three can be indexed the same way, it seems like that should be a minimal annoyance, no?

John Myles White

Jun 6, 2014, 12:58:51 PM

to julia...@googlegroups.com

Yeah, I just dislike the gratuitous multiplicity of ways to do the same thing.

-- John

Ivar Nesje

Jun 6, 2014, 4:07:59 PM

to julia...@googlegroups.com

Why can't any iterable (of the correct length) be accepted?

As long as the DataFrame has predefined types on the columns, it is just a matter of asserting or converting the type and copying it in. Convert would probably be slower, because the types would be unknown and it would have to dispatch dynamically to the right convert method.
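
A minimal sketch of that idea, assuming Julia 1.x and using a plain vector of typed column vectors as a stand-in for a DataFrame. `push_row!` and `columns` are made-up names for illustration, not part of DataFrames, and this version does nothing about a row that fails halfway through:

```julia
# Hypothetical stand-in for a DataFrame: one typed vector per column.
columns = Any[Float64[], Float64[], Symbol[]]

# Made-up helper: accepts any iterable that has a length.
function push_row!(columns, iterable)
    length(iterable) == length(columns) ||
        throw(ArgumentError("expected $(length(columns)) values"))
    for (column, value) in zip(columns, iterable)
        # push! on a typed vector converts `value` to the column's element type
        push!(column, value)
    end
    return columns
end

push_row!(columns, (1, 0.1, :Fake))                # tuple; the Int 1 becomes 1.0
push_row!(columns, Any[2.0, 0.2, :Real])           # array
push_row!(columns, (v for v in (3.0, 0.3, :Gen)))  # generator
println(columns[1])  # [1.0, 2.0, 3.0]
```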

John Myles White

Jun 6, 2014, 5:16:11 PM

to julia...@googlegroups.com

You're right: any iterable could work.

Personally, I tend to minimize the use of functionality that depends upon the columns of a DataFrame being in a specific order. It's certainly useful in many cases, so we can't get rid of it. But I'm not excited about people writing a lot more code that depends upon order than they do now.

-- John

Gustavo Lacerda

Jun 9, 2014, 3:44:28 PM

to julia...@googlegroups.com

I've implemented this:

function push!(df::DataFrame, arr::Array)
    K = length(arr)
    assert(size(df,2)==K)
    col_types = map(eltype, eachcol(df))
    converted = map(i -> convert(col_types[i][1], arr[i]), 1:K)
    ## To do: throw error if convert fails
    df2 = DataFrame(reshape(converted, 1, K))
    names!(df2, names(df))
    append!(df,df2)
end

using DataFrames, Distributions  # Normal comes from Distributions

X1 = rand(Normal(0,1), 10)
X2 = rand(Normal(0,1), 10)
X3 = rand(Normal(0,1), 10)
Y = X1 - X2 + rand(Normal(0,1), 10)

df = DataFrame(Y=Y, X1=X1, X2=X2, X3=X3)

push!(df, [1,2,3,4])

I tried to generalize it by replacing Array with Tuple.

function push!(df::DataFrame, tup::Tuple)
    K = length(tup)
    assert(size(df,2)==K)
    col_types = map(eltype, eachcol(df))
    converted = map(i -> convert(col_types[i][1], tup[i]), 1:K)
    ## To do: throw error if convert fails
    df2 = DataFrame(reshape(converted, 1, K))
    names!(df2, names(df))
    append!(df,df2)
end

julia> df[:greeting] = "hello"

"hello"

julia> df

11x5 DataFrame
|-------|-----------|-------------|-----------|------------|----------|
| Row # | Y         | X1          | X2        | X3         | greeting |
| 1     | 0.39624   | 0.163897    | -0.146526 | 0.592489   | "hello"  |
| 2     | -0.236239 | -1.81627    | -0.726978 | 0.638524   | "hello"  |
| 3     | -0.801656 | 0.000801096 | 0.543645  | -0.997613  | "hello"  |
| 4     | -0.30888  | -0.166953   | 0.640827  | 1.53217    | "hello"  |
| 5     | -0.662719 | -1.38129    | -0.194937 | 0.928446   | "hello"  |
| 6     | 4.37102   | 2.22107     | -2.15648  | -0.703392  | "hello"  |
| 7     | 0.0866397 | -0.633333   | -0.745456 | -0.0144429 | "hello"  |
| 8     | 0.581942  | 1.24061     | -0.867256 | 0.283671   | "hello"  |
| 9     | -3.15614  | -1.39045    | 1.34395   | 0.343224   | "hello"  |
| 10    | -1.67029  | 0.634846    | 2.08062   | -0.845479  | "hello"  |
| 11    | 1.0       | 2.0         | 3.0       | 4.0        | "hello"  |

But then this happens:

julia> push!(df, (1,2,3,4, "hi"))

ERROR: no method convert(Type{Float64}, ASCIIString)

in setindex! at array.jl:305

in map_range_to! at range.jl:523

in map at range.jl:534

in push! at none:5

It apparently tries to convert "hi" to Float64, even though the 5th type is ASCIIString:

julia> col_types

1x5 DataFrame

|-------|---------|---------|---------|---------|-------------|

| Row # | Y | X1 | X2 | X3 | label |

Gustavo

P.S. Should the code go here? https://github.com/JuliaStats/DataFrames.jl/blob/master/src/dataframe/dataframe.jl
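
One possible reading of the stack trace, offered as an inference rather than something confirmed in the thread: the error is raised in setindex! inside map, which suggests that map sized its result array after the first converted value (a Float64) and then could not store the string in it. The sketch below, assuming Julia 1.x and made-up names (`col_types`, `row`), reproduces that mechanism with an explicitly typed container and shows that a container with element type Any avoids it:

```julia
col_types = (Float64, Float64, String)  # hypothetical column element types
row = (1, 2, "hi")

# A result container typed after the first column: storing "hi" in it
# requires convert(Float64, ::String), which does not exist.
failed = try
    Float64[convert(T, v) for (T, v) in zip(col_types, row)]
    false
catch err
    err isa MethodError
end

# A container that can hold anything: each value keeps its column's type.
converted = Any[convert(T, v) for (T, v) in zip(col_types, row)]

println(failed)
println(converted)
```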

Keith Campbell

Jun 9, 2014, 4:17:28 PM

to julia...@googlegroups.com

Thanks for putting this together.

Under 0.3 pre from yesterday, I get a deprecation warning in the Array version where df2 is assigned. The tweak below appears to resolve that warning:

function push!(df::DataFrame, arr::Array)
    K = length(arr)
    assert(size(df,2)==K)
    col_types = map(eltype, eachcol(df))
    converted = map(i -> convert(col_types[i][1], arr[i]), 1:K)
    ## To do: throw error if convert fails
    df2 = convert( DataFrame, reshape(converted, 1, K) ) # <==tweaked
    names!(df2, names(df))
    append!(df,df2)
end

John Myles White

Jun 9, 2014, 10:41:39 PM

to julia...@googlegroups.com

Would be good to clean this up by removing some of the slow parts (map usage, anonymous function usage) and have it submitted as a PR.

— John

Gustavo Lacerda

Jun 9, 2014, 11:14:24 PM

to julia...@googlegroups.com

OK, but first I want to make it work for heterogeneous lists (tuples), which is mysteriously failing.

Gustavo

--
Gustavo Lacerda
http://www.optimizelife.com

Keith Campbell

Jun 10, 2014, 7:35:36 AM

to julia...@googlegroups.com, gus...@optimizelife.com

Hey Gustavo,

Below is a crack at a version that handles tuples and deals with some of the issues John raised. You can see some simple tests at http://nbviewer.ipython.org/gist/catawbasam/003743259cf0a6ec968d.

If you're interested in working it over for a pull request, please feel free. If you'd like me to do it, I'd be happy to. And if this seems like the wrong approach, that's fine too.

cheers,

Keith

import Base.push!

function push!(df::DataFrame, iterable)
    K = length(iterable)
    assert(size(df,2)==K)
    i=1
    for t in iterable
        try
            #println(i,t, typeof(t))
            push!(df.columns[i], t)
        catch
            #clean up partial row
            for j in 1:(i-1)
                pop!(df.columns[j])
            end
            msg = "Error adding $t to column $i."
            throw(ArgumentError(msg))
        end
        i=i+1
    end
end
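
The clean-up step is what keeps the columns the same length when a value cannot be stored. A small sketch of that behaviour, assuming Julia 1.x and a plain vector of typed vectors in place of `df.columns`; `push_row_or_rollback!` is a made-up name and this is not the code that went into DataFrames:

```julia
# Made-up helper mirroring the try/catch/pop! pattern above.
function push_row_or_rollback!(columns, iterable)
    length(iterable) == length(columns) ||
        throw(ArgumentError("wrong number of values"))
    i = 1
    for value in iterable
        try
            push!(columns[i], value)
        catch
            # undo the values already pushed for this row
            for j in 1:(i - 1)
                pop!(columns[j])
            end
            throw(ArgumentError("Error adding $value to column $i."))
        end
        i += 1
    end
    return columns
end

columns = Any[Float64[], Symbol[]]         # hypothetical two-column table
push_row_or_rollback!(columns, (1.0, :ok))

# The second value cannot be converted to Symbol, so the row is rolled back.
rolled_back = try
    push_row_or_rollback!(columns, (2.0, "not a symbol"))
    false
catch err
    err isa ArgumentError
end

println(rolled_back)           # true
println(map(length, columns))  # [1, 1]
```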

Gustavo Lacerda

Jun 10, 2014, 10:01:36 AM

to Keith Campbell, julia...@googlegroups.com

hey Keith,

Your solution is elegant because it delegates conversion to the column
push!, i.e. push!{S,T}(dv::DataArray{S,1},v::T)

I have tested it, and it works for me too. This is your code, so I
think you should get all the credit.

Gustavo

--
Gustavo Lacerda
http://www.optimizelife.com

Keith Campbell

Jun 10, 2014, 11:14:55 AM

to julia...@googlegroups.com, keith...@gmail.com, gus...@optimizelife.com

Thanks for the kind words. I'll put together a pull request.

Gustavo Lacerda

Jul 19, 2014, 1:53:58 AM

to Keith Campbell, julia...@googlegroups.com

hi Keith,

Are you still planning to do the pull request?

Gustavo

Keith Campbell

Jul 19, 2014, 11:02:18 AM

to julia...@googlegroups.com, keith...@gmail.com, gus...@optimizelife.com

(copied from email reply)

"Sorry! I thought you would be notified, but I guess the discussion was all on the list rather than in an Issue.

It was pull request #621, merged June 10.

You can see the code changes at:

https://github.com/JuliaStats/DataFrames.jl/commit/3cf97fffde38517783d41f431da12714766a51c9

"

A 2nd pull request relaxed the typing on the Associative argument:

https://github.com/JuliaStats/DataFrames.jl/commit/d5f657d6be4df63a76e61fbaa4f43c8a54df3132
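
For reference, this is roughly how the original question reads with row-wise push!. A sketch assuming a recent DataFrames.jl 1.x release on Julia 1.x, so the construction syntax differs from the v0.3-era `measures[:Mean] = Float64[]` used at the top of the thread; the values are the made-up ones from the first post:

```julia
using DataFrames

# Empty DataFrame with typed columns, as in the original question.
measures = DataFrame(Mean = Float64[], StdDev = Float64[], Category = Symbol[])

# Push a row as a tuple, in column order.
push!(measures, (1.0, 0.1, :Fake))

# Push a row as a dictionary keyed by column name, so order does not matter.
push!(measures, Dict(:Mean => 2.0, :StdDev => 0.2, :Category => :Other))

println(nrow(measures))  # 2
```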

Gustavo Lacerda

Jul 19, 2014, 11:38:00 AM

to julia...@googlegroups.com

oh, so why can't I see your method signature? I just did a fresh
Pkg.update() to be sure...

julia> methods(push!)
# 13 methods for generic function "push!":
push!(a::Array{Any,1},item) at array.jl:464
push!{T}(a::Array{T,1},item) at array.jl:453
push!(B::BitArray{1},item) at bitarray.jl:454
push!(s::IntSet,n::Integer) at intset.jl:32
push!(::EnvHash,k::String,v) at env.jl:114
push!(t::Associative{K,V},key,v) at dict.jl:241
push!(s::Set{T},x) at set.jl:18
push!(a::PyVector{T},item) at /Users/gustavolacerda/.julia/v0.3/PyCall/src/conversions.jl:276
push!(a::SynchronousStepCollection,args...) at /Users/gustavolacerda/.julia/v0.3/BinDeps/src/BinDeps.jl:106
push!(c::Choices,args...) at /Users/gustavolacerda/.julia/v0.3/BinDeps/src/BinDeps.jl:177
push!(A) at abstractarray.jl:1390
push!(A,a,b) at abstractarray.jl:1391
push!(A,a,b,c...) at abstractarray.jl:1392

Keith Campbell

Jul 19, 2014, 2:10:29 PM

to julia...@googlegroups.com, gus...@optimizelife.com

Are you using Julia 0.3, and did you do 'using DataFrames'?

I get 22 methods, including DataFrames methods, after 'using DataFrames':

julia> methods(push!)
# 22 methods for generic function "push!":
push!(a::Array{Any,1},item) at array.jl:464
push!{T}(a::Array{T,1},item) at array.jl:453
push!(B::BitArray{1},item) at bitarray.jl:454
push!(s::IntSet,n::Integer) at intset.jl:32
push!(::EnvHash,k::String,v) at env.jl:114
push!(t::Associative{K,V},key,v) at dict.jl:241
push!(s::Set{T},x) at set.jl:18
push!{T,E}(h::Histogram{T,1,E},x::Real,w::Real) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:112
push!{T,E}(h::Histogram{T,1,E},x::Real) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:122
push!{T,N}(h::Histogram{T,N,E},xs::NTuple{N,Real},w::Real) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:149
push!{T,N}(h::Histogram{T,N,E},xs::NTuple{N,Real}) at /home/keithc/.julia/v0.3/StatsBase/src/hist.jl:161
push!(dv::DataArray{T,1},v::NAtype) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:9
push!{S,T}(dv::DataArray{S,1},v::T) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:15
push!{T,R}(pdv::PooledDataArray{T,R,1},v::NAtype) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:123
push!{S,R,T}(pdv::PooledDataArray{S,R,1},v::T) at /home/keithc/.julia/v0.3/DataArrays/src/datavector.jl:128
push!(x::Index,nm::Symbol) at /home/keithc/.julia/v0.3/DataFrames/src/other/index.jl:66
push!(df::DataFrame,associative::Associative{Symbol,Any}) at /home/keithc/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:1038
push!(df::DataFrame,associative::Associative{K,V}) at /home/keithc/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:1056
push!(df::DataFrame,iterable) at /home/keithc/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:1076
push!(A) at abstractarray.jl:1390
push!(A,a,b) at abstractarray.jl:1391
push!(A,a,b,c...) at abstractarray.jl:1392

Gustavo Lacerda

Jul 19, 2014, 2:12:07 PM

to julia...@googlegroups.com

ah, I see it now! Thanks and sorry for the silly question.
