The code I used to benchmark is below. I tried some variations, some of
which improved the 2.9.0 numbers, but none that made it as fast as 2.8.1.
I know this is a pure micro-benchmark and doesn't mean much in a larger
codebase, but _something_ seems to have gone the wrong way.
object OptTest {
  def main(args: Array[String]) {
    println("Time: "+bench)
  }

  def bench: Long = {
    val testSet = Array.fill(15000)(Array.fill(1000)(util.Random.nextInt))
    val optex = new OptEx
    // Warm-up
    var result = 0
    for (array <- testSet) result += optex.sum(array)
    // benchmark
    val start = java.lang.System.currentTimeMillis
    for (array <- testSet) result += optex.sum(array)
    val stop = java.lang.System.currentTimeMillis
    println("Ensure nothing gets optimized away: "+result)
    stop - start
  }
}

class OptEx {
  def sum(l: Array[Int]) = {
    var acc = 0
    for (i <- 0 until l.length) acc += l(i)
    acc
  }
}
--
Daniel C. Sobral
I travel to the future all the time.
My numbers are nothing like yours. What they said pretty unambiguously is that the optimization is not working. That it's madly pessimizing. This is with 150000 where you have 15000.
Times for unoptimised, 281 / 29 / trunk
Time: 113
Time: 204
Time: 182
Times for unoptimised, 281 / 29 / trunk
Time: 116
Time: 207
Time: 180
Times for unoptimised, 281 / 29 / trunk
Time: 122
Time: 206
Time: 182
I can only think you mixed up your outputs, because if it's twice as slow under -optimise for me, I have trouble imagining what confounding variables would make it faster for you.
Times for -optimise, 281 / 29 / trunk
Time: 118
Time: 380
Time: 353
Times for -optimise, 281 / 29 / trunk
Time: 127
Time: 382
Time: 347
Times for -optimise, 281 / 29 / trunk
Time: 116
Time: 379
Time: 353
I'm on osx.
% java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
For what it's worth,
when I run benchmarks (against recent Sun Java on Linux) iterating over an array, my anecdotal observation is that substantial JVM HotSpot optimization (I *think* escape analysis) kicks in between 1.5 and 2 million elements. I just mention that because I wouldn't expect different-sized arrays under a certain threshold to have the same performance characteristics -- and also I think sometimes people (e.g. me) expect optimizations like escape analysis to kick in way sooner than the JVM actually applies them.
--j
The numbers I posted used a higher number than he did, but my results
never varied, including running it as given.
My current supposition is that 2.8.1 is faster because it had
Range.ByOne and hotspot liked it better.
// this was in 2.8.1, is no longer with us
trait ByOne extends Range {
  override final def foreach[@specialized(Unit) U](f: Int => U) {
    if (length > 0) {
      val last = this.last
      var i = start
      while (i != last) {
        f(i)
        i += 1
      }
      f(i)
    }
  }
}

// 2.8.1 RichInt
def until(end: Int): Range with Range.ByOne = Range(start, end)
def until(end: Int, step: Int): Range = Range(start, end, step)
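For reference, the for (i <- 0 until l.length) loop in OptEx.sum is just sugar for a foreach call on the Range, so Range.foreach (the ByOne version above, in 2.8.1) is what this benchmark actually exercises. Roughly, with a hand-written while loop added as a baseline for comparison (the baseline is not part of the original code):

// What OptEx.sum desugars to: the Range built by 0 until l.length drives
// the loop through its foreach, so Range.foreach is what gets timed.
def sumDesugared(l: Array[Int]): Int = {
  var acc = 0
  (0 until l.length).foreach(i => acc += l(i))
  acc
}

// Hand-written while loop with no Range involved at all, as a baseline.
def sumWhile(l: Array[Int]): Int = {
  var acc = 0
  var i = 0
  while (i < l.length) { acc += l(i); i += 1 }
  acc
}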
And it is all kinds of possible dcsobral was on Windows too. Windows,
turning things on their head since 1882. Not that your results match
his OR mine. Perhaps we're running some kind of jvm implementation
rorschach test... "what does this image make you want to inline?"
Something which will take an hour to complete?
Times: 24.401 / 41.13 / 21.543
doesn't fill me with confidence I have a clue what is really happening. Sometimes I think all we ever measure are hotspot quirks. Maybe there's some major flaw in the code (see below.)
The difference with 32-bit is so big I'm thinking something is being emulated in the osx jvm.
-client -d32 gets nearly 3x faster under -optimise.
-client -d64 gets nearly 2x slower under -optimise.
-server gets about 1.5x slower under -optimise.
The numbers are basically stable across multiple runs, including the weird example above: here are a couple more of those from different runs.
Times: 26.103 / 41.483 / 21.169
Times: 23.68 / 42.071 / 21.27
All of them:
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/281/bin/scala -cp ./target/281/client Test (compiled with )
Times: 147.247 / 146.642 / 150.934
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/281/bin/scala -cp ./target/281/client-opt Test (compiled with -optimise)
Times: 146.652 / 146.314 / 150.41
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/281/bin/scala -cp ./target/281/client64 Test (compiled with )
Times: 23.413 / 21.758 / 20.948
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/281/bin/scala -cp ./target/281/client64-opt Test (compiled with -optimise)
Times: 24.401 / 41.13 / 21.543
JAVA_OPTS="-Xmx1024m -server" ./scalas/281/bin/scala -cp ./target/281/server Test (compiled with )
Times: 24.435 / 25.972 / 22.151
JAVA_OPTS="-Xmx1024m -server" ./scalas/281/bin/scala -cp ./target/281/server-opt Test (compiled with -optimise)
Times: 24.427 / 21.995 / 22.372
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/29/bin/scala -cp ./target/29/client Test (compiled with )
Times: 169.916 / 186.567 / 187.333
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/29/bin/scala -cp ./target/29/client-opt Test (compiled with -optimise)
Times: 72.994 / 90.613 / 78.926
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/29/bin/scala -cp ./target/29/client64 Test (compiled with )
Times: 39.039 / 34.789 / 34.964
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/29/bin/scala -cp ./target/29/client64-opt Test (compiled with -optimise)
Times: 67.416 / 52.48 / 52.525
JAVA_OPTS="-Xmx1024m -server" ./scalas/29/bin/scala -cp ./target/29/server Test (compiled with )
Times: 42.018 / 41.98 / 33.762
JAVA_OPTS="-Xmx1024m -server" ./scalas/29/bin/scala -cp ./target/29/server-opt Test (compiled with -optimise)
Times: 70.827 / 76.681 / 69.961
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/trunk/bin/scala -cp ./target/trunk/client Test (compiled with )
Times: 171.41 / 179.602 / 171.946
JAVA_OPTS="-Xmx1024m -client -d32" ./scalas/trunk/bin/scala -cp ./target/trunk/client-opt Test (compiled with -optimise)
Times: 69.005 / 68.011 / 67.853
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/trunk/bin/scala -cp ./target/trunk/client64 Test (compiled with )
Times: 34.801 / 35.017 / 51.299
JAVA_OPTS="-Xmx1024m -client -d64" ./scalas/trunk/bin/scala -cp ./target/trunk/client64-opt Test (compiled with -optimise)
Times: 62.01 / 67.581 / 50.296
JAVA_OPTS="-Xmx1024m -server" ./scalas/trunk/bin/scala -cp ./target/trunk/server Test (compiled with )
Times: 44.47 / 32.772 / 32.562
JAVA_OPTS="-Xmx1024m -server" ./scalas/trunk/bin/scala -cp ./target/trunk/server-opt Test (compiled with -optimise)
Times: 67.066 / 67.597 / 50.858
Numbers were collected with the following code; the spirit is the same as the original.
object Test {
  def main(args: Array[String]) {
    val s1 = if (args.length > 0) args(0).toInt else 15000
    val s2 = if (args.length > 1) args(1).toInt else 2000
    // warmup
    bench(100, 1000)
    bench(100, 1000)
    bench(100, 1000)
    bench(100, 1000)
    bench(100, 1000)
    val t1 = bench(s1, s2)
    val t2 = bench(s1, s2)
    val t3 = bench(s1, s2)
    println("Times: " + t1 + " / " + t2 + " / " + t3)
  }

  def bench(s1: Int, s2: Int): Double = {
    val testSet = new Array[Array[Int]](s1)
    var i = 0
    while (i < s1) {
      val arr = new Array[Int](s2)
      testSet(i) = arr
      var j = 0
      while (j < s2) {
        arr(j) = util.Random.nextInt()
        j += 1
      }
      i += 1
    }
    val optex = new OptEx
    // Warm-up
    var result = 0
    for (array <- testSet) result += optex.sum(array)
    // benchmark
    val start = System.nanoTime
    for (array <- testSet) result += optex.sum(array)
    println("Ensure nothing gets optimized away: "+result)
    val stop = System.nanoTime
    (stop - start).toDouble / 1000000
  }
}

class OptEx {
  def sum(l: Array[Int]): Int = {
    // same loop as in the original OptEx
    var acc = 0
    for (i <- 0 until l.length) acc += l(i)
    acc
  }
}
Are you sure this is enough to warm up the JVM? I would just run it a
few hundred times to be sure.
> val t1 = bench(s1, s2)
> val t2 = bench(s1, s2)
> val t3 = bench(s1, s2)
This has no statistical significance. Why don't you run it a few
hundred times and determine the average and standard deviation? Then
you can create a 95% (or 99%) confidence interval and make a
statistically meaningful statement about your scenario.
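Something along these lines would do it -- a rough sketch only, reusing the bench(s1, s2) above and a plain normal approximation for the interval:

// Rough sketch: run bench many times, report mean, standard deviation and
// a 95% confidence interval (normal approximation, z = 1.96).
def stats(runs: Int, s1: Int, s2: Int) {
  val samples = Array.fill(runs)(bench(s1, s2))
  val mean    = samples.sum / runs
  val stddev  = math.sqrt(samples.map(t => (t - mean) * (t - mean)).sum / (runs - 1))
  val err     = 1.96 * stddev / math.sqrt(runs)
  println("Mean: " + mean + " ms, stddev: " + stddev +
    ", 95% CI: [" + (mean - err) + ", " + (mean + err) + "]")
}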
> println("Times: " + t1 + " / " + t2 + " / " + t3)
> }
>
> def bench(s1: Int, s2: Int): Double = {
:
</snip>
:
> // benchmark
> val start = System.nanoTime
> for (array <- testSet) result += optex.sum(array)
> println("Ensure nothing gets optimized away: "+result)
> val stop = System.nanoTime
Why do you include the println statement in your test? That introduces
all kinds of IO, kernel and thread interactions to your test. I would
swap the last two statements so you only measure the for loop.
> (stop - start).toDouble / 1000000
> }
> }
:
</snip>
Note that OptEx is run 15000 times each time bench is called.
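With the defaults above, that is 15000 arrays of 2000 ints, i.e. about 30 million array reads and additions per timed pass.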
However, unless you can show that the observed slowdown from 2.8.1 to 2.9.0
is solely due to inadequacies of the benchmark code -- by showing
benchmark code that measures the same thing and shows no such slowdown
-- that is beside the point.
It's worth making it clear that Random.nextInt is not timed, so it's a
minor issue. A major issue with that benchmark is that println was
called before stop was measured (as was mentioned in another message).
Those lines should be swapped. Adding System.gc before timings is also
a good idea to reduce noise.
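Concretely, the tail of bench would then look something like this (just a sketch of the reordering being suggested):

// Force a GC before timing, and only print once stop has been read,
// so the println stays outside the measured region.
System.gc()
val start = System.nanoTime
for (array <- testSet) result += optex.sum(array)
val stop = System.nanoTime
println("Ensure nothing gets optimized away: "+result)
(stop - start).toDouble / 1000000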
Best,
Ismael
http://code.google.com/p/caliper/
Perhaps it would make sense to set up an instance of that tool for
Scala (however, it seems to use AppEngine, so it's probably not
particularly lightweight).
--
Johannes
-----------------------------------------------
Johannes Rudolph
http://virtual-void.net
Stupid me: Mathias already got that ready to use for Scala, so it
works without AppEngine as well, of course:
https://github.com/sirthias/scala-benchmarking-template
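For this thread's example, a benchmark written against that template would look roughly like the sketch below -- assuming the usual Caliper convention where SimpleBenchmark subclasses get their timeXyz(reps: Int) methods measured; the exact base class the template provides may differ.

import com.google.caliper.SimpleBenchmark

// Rough sketch only: Caliper measures methods whose names start with "time"
// and that take a repetition count; setUp() prepares the data once.
class SumBenchmark extends SimpleBenchmark {
  var data: Array[Int] = _

  override def setUp() {
    data = Array.fill(2000)(util.Random.nextInt())
  }

  def timeSum(reps: Int): Int = {
    val optex = new OptEx
    var result = 0
    var i = 0
    while (i < reps) { result += optex.sum(data); i += 1 }
    result
  }
}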