Bad performance of for loop

wolfgang

unread,

May 20, 2011, 11:48:53 AM5/20/11

to scala-user

Hello,

I evaluated Scala with a numerical program (basically calculating
inner
products of looong Double vectors) and found its performance
significantly
slower than that of the equivalent Java program (the elapsed time of
the
scala program was 4-5 times the elapsed time of the Java program).

So after some research (decompiling the generated class files back to
Java) I
found that the bad performance was caused by the for loops.
Substituting

for (i <- 0 until 1000000) {
//... s += v(i) * w(i)
}

with

var i = 0
while (i < 1000000) {
//...
i += 1
}

made the Scala program as fast as the Java program!

But using "while" instead of "for" is not really nice (the "i" is in
the
outer scope, it is easy to forget the "i += 1" or put it outside of
the
loop). So I tried something like:

def loop(from: Int, to: Int)(op: Int => Unit): Unit = {
var i = from
while (i < to) {
op(i)
i += 1
}
}

loop(0, 1000000) { i =>
{
//..
}
}

but the performance was bad again (maybe for another reason).

Is there anything like a "fast" for loop in Scala?
Are there any alternatives?

Regards,

Wolfgang Seifert

PS: The numerical program stems from the finite-difference-
approximation of
the heat equation and (as usual in numerical mathematics) computing
speed is crtitical.
The Java program (and Scala program with while) is in fact as fast as
the equivalent C/C++ program.

Jason Zaugg

unread,

May 20, 2011, 12:09:28 PM5/20/11

to wolfgang, scala-user

On Fri, May 20, 2011 at 5:48 PM, wolfgang <wolf.s...@web.de> wrote:
> Is there anything like a "fast" for loop in Scala?
> Are there any alternatives?

There are a bunch of old mailing list threads on the subject;
"Optimizing Simple Fors" [1] features an in depth (don't miss the
second page) discussion, with code, benchmarks, and compiler plugins,
between a range of for-loop luminaries.

If you want to live on the edge a bit, you can try ScalaCL [2], a
compiler plugin with both general purpose and GPU optimizations. I've
working on similar problems, and where necessary I just use while
loops.

-jason

[1] http://scala-programming-language.1934581.n4.nabble.com/optimizing-simple-fors-td2545502.html
[2] http://code.google.com/p/scalacl/

Lex

unread,

May 20, 2011, 12:56:59 PM5/20/11

to wolfgang, scala-user

There seems to be a strong reluctance to officially optimize
for-loops. I was holding my breath for almost 2 years now without any
progress. So if you are going to choose Scala then get ready to use
only while loops.

Ismael Juma

unread,

May 20, 2011, 1:05:59 PM5/20/11

to Lex, wolfgang, scala-user

On Fri, May 20, 2011 at 5:56 PM, Lex <lex...@gmail.com> wrote:
> There seems to be a strong reluctance to officially optimize
> for-loops. I was holding my breath for almost 2 years now without any
> progress. So if you are going to choose Scala then get ready to use
> only while loops.

Or use the compiler plugin that does that for you. I agree that we
should have it as part of the distribution (it's a bit funny that so
much work is being done to take advantage of multiple cores when we do
a bad job of taking advantage of a single core), but until then:

http://code.google.com/p/scalacl/wiki/Usage
http://code.google.com/p/scalacl/wiki/ScalaCLPlugin

Best,
Ismael

√iktor Ҡlang

unread,

May 20, 2011, 2:34:33 PM5/20/11

to Lex, wolfgang, scala-user

On Fri, May 20, 2011 at 6:56 PM, Lex <lex...@gmail.com> wrote:

There seems to be a strong reluctance to officially optimize
for-loops. I was holding my breath for almost 2 years now without any
progress. So if you are going to choose Scala then get ready to use
only while loops.

To be honest, there are no "for loops", there are only for-comprehensions.

--
Viktor Klang

Akka Tech Lead

Typesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang

Naftoli Gugenheim

unread,

May 20, 2011, 3:26:59 PM5/20/11

to Lex, wolfgang, scala-user

I'm not sure if it was a reluctance to optimize them or a reluctance to take resources away from other work.
In other words, if a patch was submitted would they take it?

Ruediger Keller

unread,

May 20, 2011, 3:53:48 PM5/20/11

to Naftoli Gugenheim, scala-user

2011/5/20 Naftoli Gugenheim <nafto...@gmail.com>:

> I'm not sure if it was a reluctance to optimize them or a reluctance to take
> resources away from other work.
> In other words, if a patch was submitted would they take it?

I don't think so. IIRC Paul already produced a patch that compiled for
comprehensions to while loops for the common cases. But it was not
accepted.

Although I don't remember the details, so maybe there is a chance for
acceptance of such a patch.

Regards,
Rüdiger

Daniel Sobral

unread,

May 20, 2011, 6:06:23 PM5/20/11

to Naftoli Gugenheim, Lex, wolfgang, scala-user

On Fri, May 20, 2011 at 16:26, Naftoli Gugenheim <nafto...@gmail.com> wrote:
> I'm not sure if it was a reluctance to optimize them or a reluctance to take
> resources away from other work.
> In other words, if a patch was submitted would they take it?

By optimizing for loops people really mean optimizing an inlining of
range's foreach. Range's foreach is pretty fast already, but it
invokes a Function1, which is what causes the whole trouble. Ideally,
JVM would optimize _that_, but it doesn't.

To make this more clear, consider this:

for { i <- 1 to 10 } println(i)
val range = 1 to 10; for { i <- range } println(i)
val range = 1 to 10; range.foreach(i => println(i))
val range = 1 to 10; val function = (i: Int) => println(i);
range.foreach(function)

These lines are pretty much equivalent, so what needs optimizing is
"range.foreach(function)".

Now, the concern that has been presented before is that Scala
shouldn't make Range special. An optimization should be general, and
not tied to a library class.

I can see at least two reasons for that. First, Scala should be
extensible through libraries, and by special-casing some classes one
decreases this. For instance, no one would be able to write a
Range-equivalent that performed the same. Second, range loops are very
imperative (particularly with foreach) -- they are mostly used to
handle indices into stuff. A functional style should rely instead on
mapping, zipping, etc. An optimization on range's foreach would _not_
bring any benefit to the functional style, and reduce the incentive to
optimize that.

--
Daniel C. Sobral

I travel to the future all the time.

Naftoli Gugenheim

unread,

May 20, 2011, 6:30:00 PM5/20/11

to Daniel Sobral, Lex, wolfgang, scala-user

On Fri, May 20, 2011 at 6:06 PM, Daniel Sobral <dcso...@gmail.com> wrote:

First, Scala should be
extensible through libraries, and by special-casing some classes one
decreases this. For instance, no one would be able to write a
Range-equivalent that performed the same.

I disagree. We're not talking about decreasing the performance of an extension library, so the extensibility of scala remains the same. Optimizing something built in may give that built-in an edge over something not built in, but it's not a competition. The non-built-in retains the same flexibility as without the optimization.

Rex Kerr

unread,

May 20, 2011, 7:51:43 PM5/20/11

to Daniel Sobral, scala-user

I think you've done a pretty good job of presenting the current reasoning, Daniel, but I think there's another argument to make.

On Fri, May 20, 2011 at 6:06 PM, Daniel Sobral <dcso...@gmail.com> wrote:

On Fri, May 20, 2011 at 16:26, Naftoli Gugenheim <nafto...@gmail.com> wrote:
> I'm not sure if it was a reluctance to optimize them or a reluctance to take
> resources away from other work.
> In other words, if a patch was submitted would they take it?

By optimizing for loops people really mean optimizing an inlining of
range's foreach.

Not if "people" includes me. I mean optimizing an inlining of Array foreach also. One can of course use range to index into the array if one must, but that moves from a very inferior solution to Java to a merely somewhat inferior solution for arrays.

Any real solution will do both. And, as far as I know, any real solution _only_ needs to do both since nothing else is fast enough to matter (i.e. you want to specialize the collection, but that's enough) or is already generic. The problem with array and range access is that specialization is not enough to avoid performance penalties that range from 30% to 3x (depending on what the JVM manages to inline).

Range's foreach is pretty fast already, but it
invokes a Function1, which is what causes the whole trouble. Ideally,
JVM would optimize _that_, but it doesn't.

The Sun JVM does somewhat. JRockit is considerably slower (~3x IIRC).

To make this more clear, consider this:

for { i <- 1 to 10 } println(i)
val range = 1 to 10; for { i <- range } println(i)
val range = 1 to 10; range.foreach(i => println(i))
val range = 1 to 10; val function = (i: Int) => println(i);
range.foreach(function)

These lines are pretty much equivalent, so what needs optimizing is
"range.foreach(function)".

Now, the concern that has been presented before is that Scala
shouldn't make Range special. An optimization should be general, and
not tied to a library class.

Agreed, in that an optimization should handle every relevant case. Optimizations need not worry about cases where that particular optimization would be irrelevant anyway (e.g. if we can't cover seek-and-read in a file, it doesn't matter).

I can see at least two reasons for that. First, Scala should be
extensible through libraries, and by special-casing some classes one
decreases this. For instance, no one would be able to write a
Range-equivalent that performed the same.

Granted, but with a sufficiently potent Range, there would be a vanishingly small use case for a thing that was just like Range except wasn't Range.

Second, range loops are very
imperative (particularly with foreach) -- they are mostly used to
handle indices into stuff. A functional style should rely instead on
mapping, zipping, etc. An optimization on range's foreach would _not_
bring any benefit to the functional style, and reduce the incentive to
optimize that.

Which is exactly why it needs to optimize Array also. If you cover Range and Array, everything else you're likely to run into is already an object, and that provides enough overhead that the additional overhead of a specialized Function1 call is minimal.

--Rex

Lex

unread,

May 20, 2011, 8:00:32 PM5/20/11

to Naftoli Gugenheim, Daniel Sobral, wolfgang, scala-user

I think the real reason for the lack of for-loop optimization is that
Scala tries very hard to sell functional style. Optimizing for-loops
would be a slippery slope towards a situation where imperative
approach is significantly faster and just as easy to use. That would
encourage imperative style rather than functional.

It makes absolutely no sense to me that this simple case hasn't been
optimized yet. And if something doesn't make sense, it usually
involves politics.

Kevin Wright

unread,

May 20, 2011, 8:06:19 PM5/20/11

to Lex, Naftoli Gugenheim, Daniel Sobral, wolfgang, scala-user

It's a cost/benefit thing, the extra maintainence that this specific optimisation would add just doesn't justify it.

Especially not when the while loop is a valid workaround and we'll be gaining stuff like virtualised Scala and fixnums in the JVM at some point in the future, which means the optimisation probably wouldn't have too long a lifespan anyway.

There are more profitable fish to be fried here.

--
Kevin Wright

gtalk / msn : kev.lee...@gmail.com

mail: kevin....@scalatechnology.com

vibe / skype: kev.lee.wright

quora: http://www.quora.com/Kevin-Wright
twitter: @thecoda

"My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger" ~ Dijkstra

Rex Kerr

unread,

May 20, 2011, 9:44:58 PM5/20/11

to Kevin Wright, scala-user

On Fri, May 20, 2011 at 8:06 PM, Kevin Wright <kev.lee...@gmail.com> wrote:

It's a cost/benefit thing, the extra maintainence that this specific optimisation would add just doesn't justify it.

Especially not when the while loop is a valid workaround and we'll be gaining stuff like virtualised Scala

The overloadable while loops seem...well...I'd like to see how they're implemented to be as fast as actual while loops (and immune to problems with multicasting and object creation and so on). All the easy implementations seem to me to either require underlying constructs that are not lightweight, or are a poor-man's version of lisp-style macros--but I don't know what Tiark is actually doing, so it could be great in ways that I haven't realized. The inline methods _would_ be great (it's shocking how much slower a max b and c.isNan are than math.max(a,b) and Double.isNaN(c)), but they're not really applicable here.

and fixnums in the JVM at some point in the future

If specialization is already not fast enough, how could fixnums (which have a little bit of overhead above regular primitives) solve the problem? That's a way to spend only a small penalty instead of a large one with non-specialized code.

, which means the optimisation probably wouldn't have too long a lifespan anyway.

There are more profitable fish to be fried here.

For most people, probably so. Since I spend about 30% of my time coding in Scala writing while loops required for adequate performance, it's a pretty valuable fish to me. (I'd probably be twice as fast to code for loops, if you count both the reduced boilerplate and the lower rate of errors. In Java I'm able to code these things significantly faster--but Scala still wins because I'm so much slower in Java with the other 70%.)

Unfortunately, since the speed is mission-critical for me, and the ScalaCL optimization plugin is not officially supported, I have not felt comfortable doing anything but writing the while loops by hand.

Anyway, if it's a matter of maintenance, I suggest not maintaining something else. For example, I think we can pretty safely throw out -optimise--just remove it, or don't maintain any of it--since this conversation almost never happens:

A: "Oh, wow, my code is so slow!"
B: "Turn on -optimise!"
A: "It's so much faster now, thanks!"

and this conversation happens every month or so:

A: "Oh, wow, my for loop is so slow!"
B: "Use a while loop!"
A: "It's so much faster now! Thanks, I guess, but it's so clunky!"

Since even Dalvik now has a passable JIT compiler which is getting better all the time, I think the for loop optimization would save way more time (both coder and computation) than the clever things that both the JIT and -optimise can do.

--Rex

wolfgang

unread,

May 21, 2011, 4:30:02 AM5/21/11

to scala-user

Following your suggestions I tried the ScalaCL compiler.
First it crashed, because I was using Scala 2.9.0.
So I went back to Scala 2.8.1 (had to change back some code) and it
worked.
And it is fast (decompiling shows me a Java for loop now).
But because ScalaCL is not part of the scala distribution ScalaCL is
not an option for me
(I do not want to have the same kind of problems - see crash above -
with every scala upgrade).

Following the discussion I see that there is really a need for a
"fast" for loop (as part of the scala distribution)
and only some "philosophical" arguments against it.
This is not improving the acceptance of Scala, because people either
have to use while loops (slow programming)
or use for loops (and have slow programs).

So I will stick to while loops (or Java?).

Wolfgang

On 21 Mai, 03:44, Rex Kerr <icho...@gmail.com> wrote:

Ruediger Keller

unread,

May 21, 2011, 7:12:45 AM5/21/11

to Rex Kerr, Kevin Wright, scala-user

I recall reading several times on the lists that using -optimize had
no positive effect at all, whereas I don't recall ever reading about
it having any positive effect. So what type of code is actually
optimized by it to a significant degree?

I just tried it a few times on a pet project of mine where I was doing
3D graphics with LWJGL/OpenGL which was *quite* slow at the beginning.
Using -optimize yielded no measurable improvement at all. Rewriting
all the nice for loops as ugly while loops made a HUGE difference. I
don't remember the exact numbers, but a factor of 10 is no
exaggeration, perhaps even more.

And waiting for more optimizations in the JVM isn't really an option.
Nobody knows if there will ever be optimizations that do this for us
and even if, how long will it take until we get them?

Regards,
Rüdiger

2011/5/21 Rex Kerr <ich...@gmail.com>:

Tiark Rompf

unread,

May 21, 2011, 10:00:03 AM5/21/11

to Rex Kerr, Kevin Wright, scala-user

With -optimize, scalac should compile simple for comprehensions to the equivalent of a while loop. If it does not it's a bug (either in the library or in the compiler) and it needs fixing. So if there is indeed a bug, gather solid performance numbers, submit a ticket and continue lobbying to get it fixed.

On May 21, 2011, at 3:44 AM, Rex Kerr wrote:

On Fri, May 20, 2011 at 8:06 PM, Kevin Wright <kev.lee...@gmail.com> wrote:
Especially not when the while loop is a valid workaround and we'll be gaining stuff like virtualised Scala

The overloadable while loops seem...well...I'd like to see how they're implemented to be as fast as actual while loops (and immune to problems with multicasting and object creation and so on). All the easy implementations seem to me to either require underlying constructs that are not lightweight, or are a poor-man's version of lisp-style macros--but I don't know what Tiark is actually doing, so it could be great in ways that I haven't realized.

We like to think so ;-) Totally off-topic but here is a little teaser of the things under development. Consider the following code and note how many times it creates or traverses arrays of size n (fill, map, sum):

def square(x: Rep[Double]) = x*x

def mean(xs: Rep[Array[Double]]) = xs.sum / xs.length

def variance(xs: Rep[Array[Double]]) = xs.map(square) / xs.length - square(mean(xs))

val fun = compile { n: Rep[Int] =>

val array1 = Array.fill(n) { i => 1 }

val array2 = Array.fill(n) { i => 2*i }

val array3 = Array.fill(n) { i => array1(i) + array2(i) }

val m = mean(array3)

val v = variance(array3)

println(m)

println(v)

}

fun(100)

The magic piece here is the 'compile' function. It takes a closure and at runtime generates, compiles and loads, a heavily optimized version of the code, inlining function calls and fusing loops and traversals. In the case above the generated code is this:

class Fun$1 extends ((Int)=>(Unit)) {

def apply(x0: Int): Unit = {

var x47 = 0

var x51 = 0

var x11 = 0

while (x11 < x0) {

val x44 = 2.0*x11

val x45 = 1.0+x44

val x50 = x45*x45

x47 += x45

x51 += x50

x11 += 1

}

val x48 = x47/x0

val x49 = println(x48)

val x52 = x51/x0

val x53 = x48*x48

val x54 = x52-x53

val x55 = println(x54)

x27

}

Note how all the intermediate arrays are optimized away and there is only a single loop to compute the mean and variance at the same time.

The caveat is that you cannot pass *any* code to 'compile' but only things that are 'virtualized' i.e. lifted to the domain of Rep[_] types. Of course the goal is to lift a reasonable share of the standard library but the most interesting use is for DSLs, which can hook into the compilation process and add both custom expression types and optimizations while maintaining the 'just a library' impression.

An alternative to calling 'compile' and then directly executing the result is to generate the code, save it away and use it later. This approach may be more suited for some DSLs, where you have a well delineated 'DSL program'. Plus, if you generate and run the code with Delite, it will also be parallelized and executed across multi-core CPUs and GPUs.

As I said this is all ongoing development work and it is way to early to make any commitments about releases or availability.

cheers,

- Tiark

some pointers:

Delite intro: http://stanford-ppl.github.com/Delite/

Stanford DSL class: http://www.stanford.edu/class/cs442/

GitHub: https://github.com/TiarkRompf/virtualization-lms-core

https://github.com/stanford-ppl/Delite/