I also noticed that when I used DataFrames and DataArrays, performance was much worse than when I used plain Arrays.
I did not take notes, but for DataArrays the difference between indexing into the array and looping over it was marginal.
This was the reason I changed the whole code to use Arrays only, which reduced the runtime by a factor of 10.
Now the bottleneck is the handling of Strings. Can somebody give me some advice on how to get the best performance?
I am working with decision trees (so the work consists mostly of searching for subsets and splits).
What is the best way to handle Strings? Should I convert my data to integers (via a dictionary) and continue working with integer Arrays?
Or would PooledDataArrays help? If there is a tutorial somewhere on how to tackle character data, I would appreciate a link. It would also be nice if you could point me towards the implementation that will be closest to what v0.4 will offer.
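
To make the dictionary idea concrete, here is a minimal sketch of what I mean (the codes Dict and the encode helper are just illustrative names I made up, not existing library functions):

# illustrative sketch: map each distinct string to a small Int once,
# then do all subset/split work on the integer codes instead
codes = Dict{UTF8String,Int}()
encode(s) = get!(codes, s, length(codes) + 1)   # assign next free code on first sight

feat_small = UTF8String["abc", "def", "abc", "something"]
feat_int = [encode(s) for s in feat_small]      # => [1, 2, 1, 3]

# membership tests now compare Ints instead of UTF8Strings
matches = findin(feat_int, [encode("abc"), encode("something")])

As far as I understand, PooledDataArray does something similar internally (integer refs into a pool of levels), which is why I am asking whether it would help here.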
My issue is essentially that the code below runs terribly slowly (compared to similar code with Float64; see the sketch after the snippet).
n = 40000000
values = rand(n)

# a string vector that is mostly "abc", with a few other levels
feat_char = Array(UTF8String, n)
fill!(feat_char, "abc")
feat_char[2] = "def"
feat_char[6] = "something"
feat_char[end-2] = "something"

@time begin
    # indices of all elements equal to "abc" or "something"
    matches = findin(feat_char, ["abc", "something"])
    result = mean(values[matches])
end
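
For comparison, this is roughly what I mean by "similar code with Float64" (feat_num and the fill values are just placeholders); this version runs much faster for me:

# the same pattern on Float64 data instead of strings
feat_num = zeros(n)             # mostly 0.0, a few other values
feat_num[2] = 1.0
feat_num[6] = 2.0
feat_num[end-2] = 2.0

@time begin
    matches_num = findin(feat_num, [0.0, 2.0])
    result_num = mean(values[matches_num])
end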