Applying Mean/Min/Max/Sum (and similar) over conditional subsets of Vectors/Matrices / Is it possible without creating copies?


bernhard

Mar 30, 2015, 9:05:13 AM
to julia...@googlegroups.com
Hello all

I understand that indexing a vector with a boolean mask creates a copy. This does not seem optimal when, say, summing over all elements that satisfy a certain condition.
Is there a way to improve the performance of the toy examples below?

I have read up on various functions such as sub, filter, and subset, and on the ArrayViews and NumericExtensions packages, but I could not find anything that helps here.

Each of the timed lines below needs about 0.5 s on my computer.

Thank you in advance


using NumericExtensions

n = 40000000
rnd = rand(n)

# Each call below first materializes a copy of the matching elements:
@time m = mean(rnd[rnd .< 0.3])
@time m = sum(rnd[rnd .< 0.3])
@time m = sumsq(rnd[rnd .< 0.3])   # sum of squares, from NumericExtensions

bernhard

Mar 30, 2015, 9:32:59 AM
to julia...@googlegroups.com
After some thought I defined a custom function (looping over the array) for the mean.
This is a bit cumbersome, but fine given that it is about 3 times faster and allocates only 288 bytes (compared to 1 GB!) in the example below.

If anyone knows how to do this faster I would appreciate input.

Bernhard


@inbounds function mycustommean(x::Array{Float64,1}, thresh::Float64)
    # One pass: accumulate the sum and count of all elements below thresh.
    sum = 0.0
    count = 0
    for i in 1:length(x)
        if x[i] < thresh
            sum += x[i]
            count += 1
        end
    end
    return sum / count
end

@time a = [mean(rnd[rnd .< 0.3]) for i = 1:10]    # 5.7 seconds

m2 = mycustommean(rnd, 0.3)

@time @inbounds a2 = [mycustommean(rnd, 0.3) for i = 1:10]    # 1.9 seconds
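
The same loop adapts to the other reductions in the thread title; here is a sketch for the sumsq case (mycustomsumsq is just an illustrative name):

# Conditional sum of squares, same single-pass pattern as mycustommean.
function mycustomsumsq(x::Array{Float64,1}, thresh::Float64)
    s = 0.0
    for i in 1:length(x)
        if x[i] < thresh
            s += x[i] * x[i]   # accumulate the square instead of the value
        end
    end
    return s
end

@time a3 = [mycustomsumsq(rnd, 0.3) for i = 1:10]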

Tim Holy

Mar 30, 2015, 10:47:01 AM
to julia...@googlegroups.com
Nope, that's the best way to do it. Yet another illustration that it's trivial
to write code that beats any attempt to cleverly mix together library
functions.

I'd get rid of the @inbounds, though; it doesn't serve any purpose for a
function that runs a long loop. Inlining too much can actually slow things
down.
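
For illustration, the same function with the @inbounds dropped, as suggested; nothing else changes:

# mycustommean minus the @inbounds; the loop body is identical.
function mycustommean(x::Array{Float64,1}, thresh::Float64)
    sum = 0.0
    count = 0
    for i in 1:length(x)
        if x[i] < thresh
            sum += x[i]
            count += 1
        end
    end
    return sum / count
end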

Best,
--Tim

Milan Bouchet-Valat

Mar 30, 2015, 3:38:55 PM
to julia...@googlegroups.com
On Monday, March 30, 2015 at 09:46 -0500, Tim Holy wrote:
> Nope, that's the best way to do it. [...]
Writing the function is fine, but I wish we were able to handle this case via https://github.com/JuliaLang/julia/issues/8450

If rnd .< 0.3 returned a view computed on the fly instead of a copy, and rnd[rnd .< 0.3] consequently did the same, then mean(rnd[rnd .< 0.3]) could be made reasonably efficient.

With the latest syntax that was proposed, and without changing .<, it might look like this:
mean(rnd[(rnd < 0.3 for rnd)])

One could also imagine changing the behavior of filter in the same vein.
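
As a sketch of the same idea, a lazily filtered reduction can also be written with generator syntax, which avoids the temporary array entirely; note this form is not valid in 0.3 and assumes a Julia version with generator expressions (they landed in 0.5, and in 1.0+ mean lives in the Statistics stdlib):

# Lazily filtered reductions: no temporary copy of the matching elements.
using Statistics   # Julia 1.0+; in 0.5/0.6, mean is in Base

m = mean(x for x in rnd if x < 0.3)   # conditional mean
s = sum(x for x in rnd if x < 0.3)    # conditional sum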

Regards

bernhard

Mar 31, 2015, 12:40:07 PM
to julia...@googlegroups.com
I also note that when I use DataFrames & DataArrays, performance is much worse than with plain Arrays.
I did not take notes, but for DataArrays the difference between indexing and looping over the array was marginal.

This was the reason why I changed the whole code to use Arrays only. This reduced the runtime by a factor of 10.

Now the bottleneck is the handling of strings. Can somebody give me advice on how to get the best performance?
I am working with decision trees (thus searching for subsets and splits).
What is the best way to handle strings? Should I convert my data to integers (via a dictionary) and continue working with integer Arrays?
Or would pooled DataArrays help? If there is a tutorial somewhere on how to tackle character data, I would appreciate a link. Also, if you could point me towards the implementation closest to what v0.4 will offer, that would be nice.
My issue is essentially that the code below runs terribly slowly compared to similar code with Float64.

n = 40000000
vals = rand(n)                      # renamed from values, which shadows Base.values
feat_char = Array(UTF8String, n)    # Julia 0.3-style constructor
fill!(feat_char, "abc")
feat_char[2] = "def"
feat_char[6] = "something"
feat_char[end-2] = "something"

@time begin
    idx = findin(feat_char, ["abc", "something"])   # renamed from match, which shadows Base.match
    result = mean(vals[idx])
end

James Fairbanks

Apr 1, 2015, 10:31:41 AM
to julia...@googlegroups.com
Are the strings coming from a finite collection?
If so, the approach you describe, "convert my data to integers (via a dictionary) and continue working with integer Arrays", is probably the right implementation.
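
A minimal sketch of that encoding (the names encode_strings, codes, and pool are made up for this example):

# Map each distinct string to a small integer code, then work with the codes.
function encode_strings(v)
    pool = Dict{eltype(v),Int}()    # string => integer code
    codes = Array(Int, length(v))   # Julia 0.3-style constructor
    for i in 1:length(v)
        codes[i] = get!(pool, v[i], length(pool) + 1)
    end
    return codes, pool
end

codes, pool = encode_strings(feat_char)
# Subset tests now compare integers instead of strings:
idx = findin(codes, [pool["abc"], pool["something"]])
m = mean(vals[idx])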

On Tuesday, March 31, 2015 at 12:40:07 PM UTC-4, bernhard wrote:
> I also note that when I use DataFrames & DataArrays, performance is much worse than with plain Arrays. [...]

bernhard

Apr 1, 2015, 11:27:48 AM
to julia...@googlegroups.com
Yes, they are. As I am working with decision trees (iterating over subsets), I will not have more than (say) 30 distinct string values.
Thank you! I will take a look at your link. Milan already suggested PooledDataArrays (here: https://groups.google.com/forum/#!topic/julia-stats/RCSA2179NLo ).
I apologize for double posting; my first question here (about indexing) was already answered, so I thought asking in the stats group would be more appropriate.
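
A sketch of the PooledDataArray route, which stores each distinct string once plus an integer reference per element (the refs and pool field names follow DataArrays.jl of this era; treat the details as an assumption):

using DataArrays

pda = PooledDataArray(feat_char)                 # pool of distinct strings + integer refs
wanted = findin(pda.pool, ["abc", "something"])  # integer codes of the wanted levels
idx = findin(pda.refs, wanted)                   # integer comparisons only
m = mean(vals[idx])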