DataFrame vcat stack overflow


Guillaume Guy

Dec 22, 2014, 3:09:16 PM
to julia...@googlegroups.com
Dear Julia users:

Coming from an R background, I like to work with lists of data frames, which I can reduce with do.call('rbind', list_of_df).

In Julia, I attempted to use vcat for this purpose but I ran into trouble:

"
stack overflow
while loading In[29], in expression starting on line 1
"

The operation is essentially a vcat over a large vector v of 68K small (11x7) DataFrames. The code is attached.

Thanks for your help! 
code_snippet.jl
full.zip

Guillaume Guy

Dec 24, 2014, 3:08:08 PM
to julia...@googlegroups.com
Let me know if you need any additional info.

David van Leeuwen

Dec 25, 2014, 6:59:57 PM
to julia...@googlegroups.com
Hello Guillaume,


On Monday, December 22, 2014 9:09:16 PM UTC+1, Guillaume Guy wrote:
Dear Julia users:

Coming from a R background, I like to work with list of dataframes which i can reduce by doing do.call('rbind',list_of_df) 

After ~10 years of using R, I only recently learned of do.call().

In Julia, you would say:

vcat(dfs...)
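
As a minimal sketch of what the splat does, plain integer vectors stand in for the data frames here (an assumption made so the snippet needs no packages; the name parts is made up for illustration). Splatting is the closest counterpart to do.call: it hands each element of the collection to vcat as its own argument. A pairwise reduce over the same collection should give the same rows.

```julia
# Plain vectors stand in for data frames (illustrative assumption).
parts = Vector{Int}[[1, 2], [3], [4, 5, 6]]

# Splatting passes each element of `parts` as a separate argument,
# so this is the same call as vcat([1, 2], [3], [4, 5, 6]).
splatted = vcat(parts...)

# Folding pairwise concatenates two pieces at a time instead of
# building one long argument list.
folded = reduce(vcat, parts)
```

With DataFrames loaded, the same two calls should apply to dfs in place of parts.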

---david

Guillaume Guy

Dec 25, 2014, 7:06:23 PM
to julia...@googlegroups.com
Hi David:

That is where the stack overflow error is thrown.

I attached the code + the data in my first post for your reference.

Sean Garborg

Dec 31, 2014, 10:42:30 PM
to julia...@googlegroups.com
If you Pkg.update() and try again, you should be fine. DataFrames was overdue for a tagged release -- you'll get v0.6.0 which includes some updates to vcat. As a gut check, this works just fine:

using DataFrames
dfs = [DataFrame(Float64, 15, 15) for _=1:200_000]
vcat(dfs)

(If it doesn't for you, definitely file an issue.)

Happy New Year,
Sean

Guillaume Guy

Jan 2, 2015, 5:05:31 PM
to julia...@googlegroups.com
Sean:

I found the problem. Not sure if that is a "bug" per se.

Looking at one element of the Array (which is subsequently vcat-ed):


Note the NA in the equipment column. When I run my function (intermediary_point) on each row of my input DataFrame, equipment (a String column) becomes NA, of type NAtype. The resulting DataFrame (see above) then has an equipment column whose type is NAtype.

Anyway, you end up with a dfs in which some elements have column types like this:

7-element Array{Type{T<:Top},1}:
 UTF8String
 NAtype    
 UTF8String
 UTF8String
 Int64     
 Float64   
 Float64

and other elements with the correct types. vcat then throws a conversion error when it tries to convert the NAtype to String.

Is it a bug? Shouldn't vcat convert the NAtype to String?

Another question: how do I convert a column's type within an existing DataFrame? I'm looking for a Julia equivalent of R's as.factor or as.string. Alternatively, when running DataFrame(A=1:20,B=1:20), is there a way to specify what types A and B should have?

Thx! 

Sean Garborg

Jan 2, 2015, 8:17:52 PM
to julia...@googlegroups.com
Thanks for reporting -- it is a bug. Having an Array or DataArray with NAtype as its eltype is a little awkward. Here's why it's causing you trouble, and a couple of alternatives:

using DataFrames
nrows = 3
a = DataFrame(A = 1:nrows)

# Column :A is all NA for all of these cases
b1 = DataFrame(A = fill(NA, nrows))
b2 = DataFrame(A = DataArray(Int, nrows))
b3 = DataFrame(A = DataArray(None, nrows))

vcat(a, b1) # ERROR: no method matching convert(::Type{Int64}, ::DataArrays.NAtype)
vcat(a, b2) # okay
vcat(a, b3) # okay

It should probably work as is (if not, I guess the promotion rules should change, and the result should be of type Any or there should be a more informative error).

I opened an issue: https://github.com/JuliaStats/DataArrays.jl/issues/134, but given that most interested developers are focused on coming up with a replacement for DataArrays and NAtype, it may not get attention at the moment, so I'd avoid creating that ambiguous array if possible for now.



For your other question, conversion of columns, you'll generally use functions from Base Julia or DataArrays.jl to transform data however you like.

Categorical variables are (for the moment) represented using PooledDataArrays, so:
pdata(abstract_array) or convert(PooledDataArray, abstract_array)

And for strings:
map(string, abstract_array) or convert(some_string_type, abstract_array)
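
To make the point about Base functions concrete, here is a small sketch of choosing column types when building the DataFrame from the question. It assumes the keyword constructor keeps the element type of whatever vector it is given; the names a and b are made up for illustration.

```julia
using DataFrames

# Convert the ranges with Base functions before building the DataFrame.
a = map(float, 1:20)    # floating-point values instead of integers
b = map(string, 1:20)   # string values: "1", "2", ...

# Assumption: each column takes its element type from the vector passed in.
df = DataFrame(A = a, B = b)
```

The same map calls can be applied to a column pulled out of an existing DataFrame, with the result assigned back to that column.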

Guillaume Guy

Jan 3, 2015, 3:47:58 PM
to julia...@googlegroups.com
Perfect. Thanks! 