Scala 2.9 slower?


Daniel Sobral

May 22, 2011, 2:40:04 PM
to scala-l...@googlegroups.com
I'm a bit concerned. After seeing that -optimize really does optimize
for comprehensions on 2.9.0, I then benchmarked the code, and
discovered it was about 2.5 times slower than 2.8.1. Then I compiled
without optimization, and got code almost 3 times slower than 2.8.1,
which, at least, shows that the optimization is working.

The code I used to benchmark is below. I tried some variations, some of
which improved the 2.9.0 numbers, but none that made it as fast as 2.8.1.
I know this is a pure micro-benchmark, and doesn't mean much in a larger
codebase, but _something_ seems to have gone the wrong way.

object OptTest {
  def main(args: Array[String]) {
    println("Time: " + bench)
  }

  def bench: Long = {
    val testSet = Array.fill(15000)(Array.fill(1000)(util.Random.nextInt))
    val optex = new OptEx

    // Warm-up
    var result = 0
    for (array <- testSet) result += optex.sum(array)

    // Benchmark
    val start = java.lang.System.currentTimeMillis
    for (array <- testSet) result += optex.sum(array)
    val stop = java.lang.System.currentTimeMillis
    println("Ensure nothing gets optimized away: " + result)

    stop - start
  }
}

class OptEx {
  def sum(l: Array[Int]) = {
    var acc = 0
    for (i <- 0 until l.length) acc += l(i)
    acc
  }
}
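For comparison, here is a sketch (mine, not from the thread) of the same sum written as a bare while loop; this form involves no Range and no closure, so its speed should not depend on how a given Scala version translates for comprehensions. The class name `OptExWhile` is made up for illustration:

```scala
// While-loop version of OptEx.sum: no Range object and no closure,
// so the JIT sees a plain counted loop regardless of compiler version.
class OptExWhile {
  def sum(l: Array[Int]): Int = {
    var acc = 0
    var i = 0
    while (i < l.length) {
      acc += l(i)
      i += 1
    }
    acc
  }
}
```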

--
Daniel C. Sobral

I travel to the future all the time.

Paul Phillips

May 22, 2011, 4:41:21 PM
to scala-l...@googlegroups.com, Daniel Sobral
On 5/22/11 11:40 AM, Daniel Sobral wrote:
> I'm a bit concerned. After seeing that -optimize really does optimize
> for comprehensions on 2.9.0, I then benchmarked the code, and
> discovered it was about 2.5 times slower than 2.8.1. Then I compiled
> without optimization, and got code almost 3 times slower than 2.8.1,
> which, at least, shows that the optimization is working.

My numbers are nothing like yours. What they said pretty unambiguously is that the optimization is not working. That it's madly pessimizing. This is with 150000 where you have 15000.

Times for unoptimised, 281 / 29 / trunk
Time: 113
Time: 204
Time: 182
Times for unoptimised, 281 / 29 / trunk
Time: 116
Time: 207
Time: 180
Times for unoptimised, 281 / 29 / trunk
Time: 122
Time: 206
Time: 182

I can only think you mixed up your outputs, because if it's twice as slow under -optimise for me I have trouble imagining what confounding variables would make it faster for you.

Times for -optimise, 281 / 29 / trunk
Time: 118
Time: 380
Time: 353
Times for -optimise, 281 / 29 / trunk
Time: 127
Time: 382
Time: 347
Times for -optimise, 281 / 29 / trunk
Time: 116
Time: 379
Time: 353

I'm on osx.

% java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

Josh Marcus

May 22, 2011, 5:11:35 PM
to scala-l...@googlegroups.com, Daniel Sobral

For what it's worth,

When I run benchmarks (against recent Sun Java on Linux) iterating over an array, my anecdotal observation is that substantial JVM HotSpot optimization (I *think* escape analysis) kicks in between 1.5 and 2 million elements.  I mention this because I wouldn't expect different-sized arrays under a certain threshold to have the same performance characteristics -- and also because I think people (e.g. me) sometimes expect optimizations like escape analysis to kick in way sooner than the JVM actually applies them.

--j

Paul Phillips

May 22, 2011, 9:21:10 PM
to scala-l...@googlegroups.com, Josh Marcus, Daniel Sobral
On 5/22/11 2:11 PM, Josh Marcus wrote:
> When I run benchmarks (against recent sun Java on Linux) iterating over
> an array, my anecdotal observation is that substantial jvm hotspot
> optimization (I *think* escape analysis) kicks in between 1.5 and 2
> million elements. I just mention that because I wouldn't expect
> different sized arrays under a certain threshold to have the same
> performance characteristics -- and also I think sometimes people (e.g.
> me) expect optimizations like escape analysis to kick in way sooner than
> the JVM actually applies them.

The numbers I posted used a higher number than he did, but my results
never varied, including running it as given.

My current supposition is that 2.8.1 is faster because it had
Range.ByOne and hotspot liked it better.

// this was in 2.8.1, is no longer with us
trait ByOne extends Range {
  override final def foreach[@specialized(Unit) U](f: Int => U) {
    if (length > 0) {
      val last = this.last
      var i = start
      while (i != last) {
        f(i)
        i += 1
      }
      f(i)
    }
  }
}

// 2.8.1 RichInt
def until(end: Int): Range with Range.ByOne = Range(start, end)
def until(end: Int, step: Int): Range = Range(start, end, step)
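For context on why the Range implementation matters here: the loop in OptEx.sum is syntactic sugar for a foreach call on the Range produced by `until`, so that foreach body is exactly what HotSpot has to inline. A minimal illustration of the desugaring (my sketch):

```scala
// `for (i <- 0 until n) body` is sugar for `(0 until n).foreach(i => body)`;
// both forms below compute the same sum through Range.foreach.
val l = Array(1, 2, 3, 4)

var acc1 = 0
for (i <- 0 until l.length) acc1 += l(i)

var acc2 = 0
(0 until l.length).foreach(i => acc2 += l(i))
```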

Trond Olsen

May 22, 2011, 9:37:38 PM
to scala-l...@googlegroups.com
Here's what I got using Java 1.6.24 on Windows 7 32-bit (3GB):

2.8.1-standard: 120
2.8.1-optimize: 124
2.9.0-standard: 137
2.9.0-optimize: 53

Paul Phillips

May 22, 2011, 10:43:24 PM
to scala-l...@googlegroups.com, Trond Olsen
On 5/22/11 6:37 PM, Trond Olsen wrote:
> Here's what I got using Java 1.6.24 on Windows 7 32-bit (3GB):
>
> 2.8.1-standard: 120
> 2.8.1-optimize: 124
> 2.9.0-standard: 137
> 2.9.0-optimize: 53

And it is all kinds of possible dcsobral was on windows too. Windows,
turning things on their head since 1882. Not that your results match
his OR mine. Perhaps we're running some kind of jvm implementation
rorschach test... "what does this image make you want to inline?"

Josh Marcus

May 23, 2011, 12:03:01 AM
to scala-l...@googlegroups.com, Daniel Sobral
To add another data point, I get performance numbers on Linux that are similar to Daniel's.  The code runs more quickly when compiled with "-optimise" under 2.9.0, but significantly more slowly than under 2.8.1.

Running:
  -- 150000 instead of 15000 (like Paul)
  -- Linux (ubuntu 11.04)
  --  Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
      Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Time for unoptimised 281: 212
Time for optimised 281: 211

Time for unoptimised 290: 406
Time for optimized 290: 351

--j

Adam Jorgensen

May 23, 2011, 12:19:36 AM
to scala-l...@googlegroups.com
Okay, so has anyone tried a long-running test then?

Something which will take an hour to complete?

Trond Olsen

May 23, 2011, 1:30:06 AM
to scala-l...@googlegroups.com
I couldn't be bothered to run this thing for ages, but some 100 iterations turned out these stats (while and for-loop compared):

// val testSet = Array.fill(50000)(Array.fill(1000)(util.Random.nextInt))

// 2.9.0 stats
scala benchmark.scala
        avgtime/no          mintime     avgmem/no         maxmem id
ok         73ms/100            72ms      194mb/300         195mb while-test
ok        416ms/100           411ms      202mb/999         210mb for-test

scala -optimise benchmark.scala
        avgtime/no          mintime     avgmem/no         maxmem id
ok         74ms/100            72ms      194mb/299         195mb while-test
ok        170ms/100           167ms      202mb/499         209mb for-test

// 2.8.1 stats
scala benchmark.scala
        avgtime/no          mintime     avgmem/no         maxmem id
ok         74ms/100            72ms      202mb/300         202mb while-test
ok        383ms/100           378ms      202mb/900         202mb for-test

scala -optimise benchmark.scala
        avgtime/no          mintime     avgmem/no         maxmem id
ok         74ms/100            72ms      198mb/300         199mb while-test
ok        387ms/100           379ms      199mb/901         199mb for-test


The benchmark is available at http://pastebin.com/Z6JM5EzE if anyone wants to try it out. Note: I only recently wrote it, so it may contain a bug or two. I can't make any guarantees that the measurement process doesn't affect the result either.

Trond Olsen

May 23, 2011, 1:36:23 AM
to scala-l...@googlegroups.com
Bork! I forgot to use for in the outer loop in for-test. It didn't affect the results much anyway -- just some 20ms extra. Updated benchmark at http://pastebin.com/wEfNtBkq.

Daniel Sobral

May 23, 2011, 9:05:39 AM
to scala-l...@googlegroups.com
Sorry for not having mentioned it, but I was running on Windows Vista
64, with 64-bit Java and no flags. I'm at work right now, so I can't get
the exact version of the JVM (I pasted it on IRC at one point, in case
anyone keeps a window logging everything there). I was definitely
seeing -optimize yield better results. I suspected ByOne as well, and
so also tried using "by 2" inside OptEx. It helped, but did not
overcome the difference.

Paul Phillips

May 23, 2011, 2:46:25 PM
to scala-l...@googlegroups.com, Josh Marcus, Daniel Sobral
These numbers may only serve to confuse things further, but they do nicely illustrate that numbers which aren't accompanied by at least all the factors covered here, plus the platform, jvm version, and who knows what else, are useless. Also, I took additional measures to stabilize the numbers and they still vary by nearly a factor of two in consecutive calls to the same method. A spread like this:

Times: 24.401 / 41.13 / 21.543

doesn't fill me with confidence that I have a clue what is really happening. Sometimes I think all we ever measure are hotspot quirks. Maybe there's some major flaw in the code (see below).

The difference with 32-bit is so big I'm thinking something is being emulated in the osx jvm.

-client -d32 gets nearly 3x faster under -optimise.
-client -d64 gets nearly 2x slower under -optimise.
-server gets about 1.5x slower under -optimise.

The numbers are basically stable across multiple runs, including the weird example above: here are a couple more of those from different runs.

Times: 26.103 / 41.483 / 21.169
Times: 23.68 / 42.071 / 21.27

All of them:

JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/281/bin/scala -cp ./target/281/client Test (compiled with )
Times: 147.247 / 146.642 / 150.934
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/281/bin/scala -cp ./target/281/client-opt Test (compiled with -optimise)
Times: 146.652 / 146.314 / 150.41
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/281/bin/scala -cp ./target/281/client64 Test (compiled with )
Times: 23.413 / 21.758 / 20.948
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/281/bin/scala -cp ./target/281/client64-opt Test (compiled with -optimise)
Times: 24.401 / 41.13 / 21.543
JAVA_OPTS="-Xmx1024m -server" ./scalas/281/bin/scala -cp ./target/281/server Test (compiled with )
Times: 24.435 / 25.972 / 22.151
JAVA_OPTS="-Xmx1024m -server" ./scalas/281/bin/scala -cp ./target/281/server-opt Test (compiled with -optimise)
Times: 24.427 / 21.995 / 22.372
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/29/bin/scala -cp ./target/29/client Test (compiled with )
Times: 169.916 / 186.567 / 187.333
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/29/bin/scala -cp ./target/29/client-opt Test (compiled with -optimise)
Times: 72.994 / 90.613 / 78.926
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/29/bin/scala -cp ./target/29/client64 Test (compiled with )
Times: 39.039 / 34.789 / 34.964
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/29/bin/scala -cp ./target/29/client64-opt Test (compiled with -optimise)
Times: 67.416 / 52.48 / 52.525
JAVA_OPTS="-Xmx1024m -server" ./scalas/29/bin/scala -cp ./target/29/server Test (compiled with )
Times: 42.018 / 41.98 / 33.762
JAVA_OPTS="-Xmx1024m -server" ./scalas/29/bin/scala -cp ./target/29/server-opt Test (compiled with -optimise)
Times: 70.827 / 76.681 / 69.961
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/trunk/bin/scala -cp ./target/trunk/client Test (compiled with )
Times: 171.41 / 179.602 / 171.946
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/trunk/bin/scala -cp ./target/trunk/client-opt Test (compiled with -optimise)
Times: 69.005 / 68.011 / 67.853
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/trunk/bin/scala -cp ./target/trunk/client64 Test (compiled with )
Times: 34.801 / 35.017 / 51.299
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/trunk/bin/scala -cp ./target/trunk/client64-opt Test (compiled with -optimise)
Times: 62.01 / 67.581 / 50.296
JAVA_OPTS="-Xmx1024m -server" ./scalas/trunk/bin/scala -cp ./target/trunk/server Test (compiled with )
Times: 44.47 / 32.772 / 32.562
JAVA_OPTS="-Xmx1024m -server" ./scalas/trunk/bin/scala -cp ./target/trunk/server-opt Test (compiled with -optimise)
Times: 67.066 / 67.597 / 50.858


Numbers collected with the following code; the spirit is the same as the original.


object Test {
  def main(args: Array[String]) {
    val s1 = if (args.length > 0) args(0).toInt else 15000
    val s2 = if (args.length > 1) args(1).toInt else 2000

    // warmup
    bench(100, 1000)
    bench(100, 1000)
    bench(100, 1000)
    bench(100, 1000)
    bench(100, 1000)

    val t1 = bench(s1, s2)
    val t2 = bench(s1, s2)
    val t3 = bench(s1, s2)

    println("Times: " + t1 + " / " + t2 + " / " + t3)
  }

  def bench(s1: Int, s2: Int): Double = {
    val testSet = new Array[Array[Int]](s1)
    var i = 0
    while (i < s1) {
      val arr = new Array[Int](s2)
      testSet(i) = arr
      var j = 0
      while (j < s2) {
        arr(j) = util.Random.nextInt()
        j += 1
      }
      i += 1
    }
    val optex = new OptEx

    // Warm-up
    var result = 0
    for (array <- testSet) result += optex.sum(array)

    // benchmark
    val start = System.nanoTime
    for (array <- testSet) result += optex.sum(array)
    println("Ensure nothing gets optimized away: " + result)
    val stop = System.nanoTime

    (stop - start).toDouble / 1000000
  }
}

class OptEx {
  def sum(l: Array[Int]): Int = {
    var acc = 0
    for (i <- 0 until l.length) acc += l(i)
    acc
  }
}
Rex Kerr

May 23, 2011, 3:16:30 PM
to scala-l...@googlegroups.com
Your benchmark is too small.  20 ms is pretty much nothing -- almost certainly a garbage collection or a compilation or something.  You can catch those with -XX:+PrintCompilation and -verbose:gc, but the real solution is to make the test harder.  You should loop over your arrays several times (e.g. 10).

I usually run tests with this level of inner-loop complexity at least a billion times (unless it proves arduously slow).  The inner loop here is only run 30M times by default.

Also, util.Random.nextInt() is an awfully expensive thing to do for a test like this; it's ~30x slower than a simple addition.  This means that you're spending almost all your time creating the test and almost none of it testing.
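One way to act on this advice (my sketch, not Rex's code): move the repetition inside the timed region, so the work scales by a `passes` factor without paying the Random setup cost again. The names `benchPasses` and `passes` are made up for illustration:

```scala
// The class under test, as in the original benchmark.
class OptEx {
  def sum(l: Array[Int]): Int = {
    var acc = 0
    for (i <- 0 until l.length) acc += l(i)
    acc
  }
}

// Harder benchmark: the timed region sums every array `passes` times.
// Returns (accumulated result, elapsed milliseconds); the result is
// returned so the work cannot be optimized away.
def benchPasses(testSet: Array[Array[Int]], passes: Int): (Int, Double) = {
  val optex = new OptEx
  var result = 0
  val start = System.nanoTime
  var p = 0
  while (p < passes) {
    for (array <- testSet) result += optex.sum(array)
    p += 1
  }
  val stop = System.nanoTime
  (result, (stop - start).toDouble / 1e6)
}
```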

  --Rex

Hilco Wijbenga

May 23, 2011, 5:29:55 PM
to scala-l...@googlegroups.com, Josh Marcus, Daniel Sobral
On 23 May 2011 11:46, Paul Phillips <pa...@improving.org> wrote:
:
</snip>
:

> object Test {
>  def main(args: Array[String]) {
>    val s1 = if (args.length > 0) args(0).toInt else 15000
>    val s2 = if (args.length > 1) args(1).toInt else 2000
>
>    // warmup
>    bench(100, 1000)
>    bench(100, 1000)
>    bench(100, 1000)
>    bench(100, 1000)
>    bench(100, 1000)

Are you sure this is enough to warm up the JVM? I would just run it a
few hundred times to be sure.

>    val t1 = bench(s1, s2)
>    val t2 = bench(s1, s2)
>    val t3 = bench(s1, s2)

This has no statistical significance. Why don't you run it a few
hundred times and determine the average and standard deviation? Then
you can create a 95% (or 99%) confidence interval and make a
statistically meaningful statement about your scenario.
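A minimal sketch of that suggestion (mine, not Hilco's code), using a normal-approximation 95% interval with z = 1.96; for small sample counts a t-distribution would be more appropriate:

```scala
// Mean, sample standard deviation, and an approximate 95% confidence
// interval for a series of timing measurements (in any unit).
def stats(times: Seq[Double]): (Double, Double, (Double, Double)) = {
  val n = times.size
  val mean = times.sum / n
  // Sample variance divides by n - 1 (Bessel's correction).
  val sd = math.sqrt(times.map(t => (t - mean) * (t - mean)).sum / (n - 1))
  val half = 1.96 * sd / math.sqrt(n) // normal approximation
  (mean, sd, (mean - half, mean + half))
}
```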

>    println("Times: " + t1 + " / " + t2 + " / " + t3)
>  }
>
>  def bench(s1: Int, s2: Int): Double = {

:
</snip>
:


>    // benchmark
>    val start = System.nanoTime
>    for (array <- testSet) result += optex.sum(array)
>    println("Ensure nothing gets optimized away: "+result)
>    val stop = System.nanoTime

Why do you include the println statement in your test? That introduces
all kinds of IO, kernel and thread interactions to your test. I would
swap the last two statements so you only measure the for loop.

>    (stop - start).toDouble / 1000000
>  }
> }

:
</snip>

Paul Phillips

May 23, 2011, 5:42:57 PM
to scala-l...@googlegroups.com, Hilco Wijbenga, Josh Marcus, Daniel Sobral
I was kicking someone else's can here, it's really not my scenario. As I
have often noted, measuring performance is not my thing: all signs point
toward this being a fine opportunity for someone for whom that is their
thing to make a contribution by offering concrete recommendations. (By
which I mean: recommendations about the finer points of benchmark
writing are less desirable than recommendations about why 2.9 is
whatever amount slower than 2.8 and what might be done about it.)

Daniel Sobral

May 23, 2011, 8:42:51 PM
to Hilco Wijbenga, scala-l...@googlegroups.com, Josh Marcus
On Mon, May 23, 2011 at 18:29, Hilco Wijbenga <hilco.w...@gmail.com> wrote:
> On 23 May 2011 11:46, Paul Phillips <pa...@improving.org> wrote:
> :
> </snip>
> :
>> object Test {
>>  def main(args: Array[String]) {
>>    val s1 = if (args.length > 0) args(0).toInt else 15000
>>    val s2 = if (args.length > 1) args(1).toInt else 2000
>>
>>    // warmup
>>    bench(100, 1000)
>>    bench(100, 1000)
>>    bench(100, 1000)
>>    bench(100, 1000)
>>    bench(100, 1000)
>
> Are you sure this is enough to warm up the JVM? I would just run it a
> few hundred times to be sure.

Note that OptEx is run 15000 times each time bench is called.

However, unless you can show that the observed slowdown from 2.8.1 to
2.9.0 is solely due to inadequacies of the benchmark code -- by showing
benchmark code that measures the same thing and shows no such slowdown
-- that is beside the point.

Ismael Juma

May 24, 2011, 6:12:29 AM
to scala-l...@googlegroups.com
On Mon, May 23, 2011 at 8:16 PM, Rex Kerr <ich...@gmail.com> wrote:
> Also, util.Random.nextInt() is an awfully expensive thing to do for a test
> like this; it's ~30x slower than a simple addition.  This means that you're
> spending almost all your time creating the test and almost none of it
> testing.

It's worth making it clear that Random.nextInt is not timed, so it's a
minor issue. A major issue with that benchmark is that println was
called before stop was measured (as was mentioned in another message).
Those lines should be swapped. Adding System.gc before timings is also
a good idea to reduce noise.
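Put together, the fixes suggested here (my sketch, not code from the thread) would give a timing skeleton like:

```scala
// Timing skeleton with the suggested fixes: GC before the timed region,
// clock read immediately after the work, println only afterwards.
def timeIt(work: () => Int): Double = {
  System.gc()                     // reduce GC noise inside the timing
  val start = System.nanoTime
  val result = work()
  val stop = System.nanoTime      // read the clock before any I/O
  println("Ensure nothing gets optimized away: " + result)
  (stop - start).toDouble / 1e6   // milliseconds
}
```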

Best,
Ismael

Johannes Rudolph

May 24, 2011, 6:32:24 AM
to scala-l...@googlegroups.com
There's a whole Java microbenchmarking framework available from Google:

http://code.google.com/p/caliper/

Perhaps it would make sense to set up an instance of that tool for
Scala (however, it seems to use AppEngine, so it's probably not
particularly light-weight).

--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Johannes Rudolph

May 24, 2011, 7:19:26 AM
to scala-l...@googlegroups.com
On Tue, May 24, 2011 at 12:32 PM, Johannes Rudolph
<johannes...@googlemail.com> wrote:
> There's a whole Java microbenchmarking framework available from Google:
>
> http://code.google.com/p/caliper/
>
> Perhaps it would make sense to setup an instance of that tool for
> Scala (however, it seems to use AppEngine, so it's probably not
> particularly light-weight).

Stupid me: Mathias already got that ready to use for Scala, so it
works without AppEngine as well, of course:

https://github.com/sirthias/scala-benchmarking-template

Daniel Kristensen

May 24, 2011, 5:04:18 PM
to scala-l...@googlegroups.com
On Tue, May 24, 2011 at 12:12 PM, Ismael Juma <ism...@juma.me.uk> wrote:
minor issue. A major issue with that benchmark is that println was
called before stop was measured (as was mentioned in another message).
Those lines should be swapped.

And stop should probably be included in the println too. Otherwise some optimizing entity between the Scala code and the metal is allowed to reorder the operations, unless I'm mistaken.

Cheers,
Daniel

Isaac Gouy

Jun 7, 2011, 1:23:58 PM
to scala-language


On May 24, 3:32 am, Johannes Rudolph <johannes.rudo...@googlemail.com>
wrote:
> There's a whole Java microbenchmarking framework available from Google:
>
> http://code.google.com/p/caliper/


Also - JavaStats 'Statistically Rigorous Java Performance Evaluation'
http://www.elis.ugent.be/JavaStats

"JavaStats is a tool, written in Python that allows to run Java
benchmarks automatically until a sufficiently narrow confidence
interval is obtained for a given performance metric."