lessons learned optimizing Range.foreach, and yes, another (promising) attempt

Miguel Garcia

oläst,

16 dec. 2011 09:53:122011-12-16

till scala-i...@googlegroups.com

For those who haven't been following, here's a summary of factors heavily influencing Range.foreach performance (as far as I can remember):

(0) in all cases, we use -optimise.

(1) accessing a lazy val (e.g., Range.length) is a performance killer (may be prevents JIT compilation?)

(2) invoking the closure arg can be done with or without casting, the cast neither hurts nor improves performance significantly:

    (f.asInstanceOf[Int => Unit]).apply(i)
      vs
    f(i)

(3) it's desirable to perform that invocation only once, otherwise each occurrence can lead to inlining the loop body.

(4) keeping Range.foreach small helps, which means in practice pre-processing beforehand, not forgetting "corner cases" such as:

    Int.MinValue to Int.MaxValue by (Int.MinValue / 2).abs
    1 to 1 by Int.MaxValue
    0 to Int.MinValue by Int.MinValue

(5) Daniel Sobral has made available a very complete benchmarking harness.

And now an attempt based on the above (binary compatible at that). The number of iterations is precomputed at Range construction time, using the same algorithm as in NumericRange.count() (thus I believe all corner cases are handled properly). The value -1 signals an exception should be thrown at Range.foreach() time. Range.foreach() itself just performs that check, and then moves on to a while loop. Performance seems competitive, more benchmarking is welcome. Having said this, the main goal of this attempt was gathering evidence as to whether the factors mentioned above fully explain Range.foreach() performance. Regarding "few iterations" (ie < 500), that hypothesis seems corroborated. In case you have evidence to the contrary, a summary of the design criteria at play would also be welcome.

Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/

class Range(val start: Int, val end: Int, val step: Int)
extends collection.AbstractSeq[Int]
   with IndexedSeq[Int]
   with collection.CustomParallelizable[Int, ParRange]
   with Serializable
{
// has to be public, due to access in foreach (if private then foreach can't be inlined)
val numSteps = preproc(start, end, step, isInclusive)

private[this] def preproc(start: Long, end: Long, step: Long, isInclusive: Boolean): Int = {
    val upward = (start < end)
    if (step == 0) -1
    else if (start == end) if (isInclusive) 1 else 0
    else if (upward != (step > 0)) 0
    else {
      val diff      = end - start
      val jumps     = diff / step
      val remainder = diff % step
      val longCount = jumps + (
        if (!isInclusive && 0 == remainder) 0 else 1
      )

      /** The edge cases keep coming. Since e.g.
       *    Long.MaxValue + 1 == Long.MinValue
       * we do some more improbable seeming checks lest
       * overflow turn up as an empty range.
       */
      // The second condition contradicts an empty result.
      val isOverflow = (longCount == 0) && (((start + step) < end) == upward)

      if (longCount > scala.Int.MaxValue || longCount < 0L || isOverflow) -1
      else longCount.toInt
    }
}

/** @note Making foreach run as fast as a while loop is a challenge.
   */
@inline final override def foreach[U](f: Int => U) {
    var numIters = numSteps
    if(numIters == -1) {
      if (step == 0) throw new IllegalArgumentException("step cannot be 0.")
      else {
        val word = if (isInclusive) "to" else "until"
        val descr = List(start, word, end, "by", step) mkString " "

        throw new IllegalArgumentException(descr + ": seqs cannot contain more than Int.MaxValue elements.")
      }
    }
    var i = start
    while ( numIters > 0 ) {
      (f.asInstanceOf[Int => Unit]).apply(i)
      i += step
      numIters -= 1
    }
}

. . .

Ismael Juma

oläst,

16 dec. 2011 10:02:122011-12-16

till scala-i...@googlegroups.com

Hi Miguel,

Thanks for writing a summary of the findings so far. A comment below.

On Fri, Dec 16, 2011 at 2:53 PM, Miguel Garcia <mgarc...@yahoo.com> wrote:

@inline final override def foreach[U](f: Int => U) {
    var numIters = numSteps
    if(numIters == -1) {
      if (step == 0) throw new IllegalArgumentException("step cannot be 0.")
      else {
        val word = if (isInclusive) "to" else "until"
        val descr = List(start, word, end, "by", step) mkString " "

        throw new IllegalArgumentException(descr + ": seqs cannot contain more than Int.MaxValue elements.")
      }
    }

You want to move all of this to a separate method as we don't want to inline the cold path and we don't want it to inflate the size of foreach.

Best,

Ismael

Miguel Garcia

oläst,

16 dec. 2011 10:31:442011-12-16

till scala-i...@googlegroups.com

Ismael,

Right, and an existing method can be used (as shown below). There's a slight performance degradation however (accessing lazy vals, even in a cold path, apparently makes the JITer less aggressive, for some reason).

@inline final override def foreach[U](f: Int => U) {
var numIters = numSteps

if(numIters == -1) this.length // invoked for side-effect (throw IllegalArgumentException)

    var i = start
    while ( numIters > 0 ) {
      (f.asInstanceOf[Int => Unit]).apply(i)
      i += step
      numIters -= 1
    }
}

Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/

Ismael Juma

oläst,

16 dec. 2011 10:37:092011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 3:31 PM, Miguel Garcia <mgarc...@yahoo.com> wrote:

Right, and an existing method can be used (as shown below). There's a slight performance degradation however (accessing lazy vals, even in a cold path, apparently makes the JITer less aggressive, for some reason).

Are you accessing the lazy val in the benchmark? If not, that sounds suspicious.

@inline final override def foreach[U](f: Int => U) {
var numIters = numSteps

if(numIters == -1) this.length // invoked for side-effect (throw IllegalArgumentException)

Is `this.length` a method that is not inlined by scalac?

Best,

Ismael

Geoff Reedy

oläst,

16 dec. 2011 10:39:352011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 06:53:12AM -0800, Miguel Garcia said

> // has to be public, due to access in foreach (if private then foreach
> can't be inlined)
> val numSteps = preproc(start, end, step, isInclusive)

Should we automatically synthesize a public accessor for private members
referenced from an @inline method? I think it'd have to have a mangled
name though since subclasses can introduce a different private field
with the same name.

-- Geoff

Miguel Garcia

oläst,

16 dec. 2011 10:51:192011-12-16

till scala-i...@googlegroups.com

> Are you accessing the lazy val in the benchmark? If not, that sounds suspicious.

Range.length accesses the lazy val Range.numRangeElements . I'm using the range-bench micro-benchmark.

The `this.length` callsite in my version of Range.foreach isn't inlined, ICode after dce:

9:
    16    LOAD_LOCAL(variable $inlThis1)
    16    CALL_METHOD scala.collection.immutable.Range.length (dynamic)
    16    DROP INT
    16    JUMP 4

And, for those interested, full ICode after dce under -optimise for range-bench's foreachSum() (shown below) can be found at the end of this message.

def foreachSum(max: Int): Int = {
    var sum = 0
    1 to max foreach (sum += _)
    sum
}

Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/

def foreachSum(max: Int (INT)): Int {
locals: value max, variable $inlThis4, variable par13, variable par14, variable par21, value sum$1, variable v11, variable $inlThis1, variable loc31, variable loc21
startBlock: 1
blocks: [1,7,9,4,8,5,6]

1:
    15    NEW REF(class IntRef)
    15    DUP(REF(class IntRef))
    15    CONSTANT(0)
    15    CALL_METHOD scala.runtime.IntRef.<init> (static-instance)
    15    STORE_LOCAL(value sum$1)
    15    SCOPE_ENTER value sum$1
    16    NEW REF(class RichInt)
    16    DUP(REF(class Object))
    16    CONSTANT(1)
    16    CALL_METHOD scala.runtime.RichInt.<init> (static-instance)
    16    LOAD_LOCAL(value max)
    16    STORE_LOCAL(variable par13)
    16    CALL_METHOD scala.runtime.RichInt.self (dynamic)
    16    LOAD_LOCAL(variable par13)
    16    STORE_LOCAL(variable par21)
    16    STORE_LOCAL(variable par14)
    16    NEW REF(class Range$Inclusive)
    16    DUP(REF(class Object))
    16    LOAD_LOCAL(variable par14)
    16    LOAD_LOCAL(variable par21)
    16    CONSTANT(1)
    16    CALL_METHOD scala.collection.immutable.Range$Inclusive.<init> (static-instance)
    ?    DUP(REF(class Object))
    ?    STORE_LOCAL(variable $inlThis1)
    16    CALL_METHOD scala.collection.immutable.Range.numSteps (dynamic)
    ?    DUP(INT)
    ?    STORE_LOCAL(variable loc21)
    16    CONSTANT(-1)
    16    CJUMP (INT)NE ? 7 : 9

7:
    16    JUMP 4

9:
    16    LOAD_LOCAL(variable $inlThis1)
    16    CALL_METHOD scala.collection.immutable.Range.length (dynamic)
    16    DROP INT
    16    JUMP 4

4:
    16    LOAD_LOCAL(variable $inlThis1)
    16    CALL_METHOD scala.collection.immutable.Range.start (dynamic)
    16    STORE_LOCAL(variable loc31)
    16    JUMP 8

8:
    16    LOAD_LOCAL(variable loc21)
    16    CONSTANT(0)
    16    CJUMP (INT)LE ? 5 : 6

5:
    17    LOAD_LOCAL(value sum$1)
    17    LOAD_FIELD variable elem
    17    SCOPE_EXIT value sum$1
    17    RETURN(INT)

6:
    16    LOAD_LOCAL(variable loc31)
    16    STORE_LOCAL(variable v11)
    ?    LOAD_LOCAL(value sum$1)
    ?    LOAD_LOCAL(value sum$1)
    16    LOAD_FIELD variable elem
    16    LOAD_LOCAL(variable v11)
    16    CALL_PRIMITIVE(Arithmetic(ADD,INT))
    16    STORE_FIELD variable elem (dynamic)
    16    LOAD_LOCAL(variable loc31)
    16    LOAD_LOCAL(variable $inlThis1)
    16    CALL_METHOD scala.collection.immutable.Range.step (dynamic)
    16    CALL_PRIMITIVE(Arithmetic(ADD,INT))
    16    STORE_LOCAL(variable loc31)
    16    LOAD_LOCAL(variable loc21)
    16    CONSTANT(1)
    16    CALL_PRIMITIVE(Arithmetic(SUB,INT))
    16    STORE_LOCAL(variable loc21)
    16    JUMP 8

}

Paul Phillips

oläst,

16 dec. 2011 10:51:462011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 7:37 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> Are you accessing the lazy val in the benchmark? If not, that sounds
> suspicious.

I have observed the same thing. If there is a lazy val anywhere, good
luck not seeing that minimum 2x difference between while and foreach.
I don't know about everyone else, but I find there are only two
execution profiles among all plausible implementations: foreach is
about the same speed as while, or foreach is half the speed of while.
(OK, there are a bunch that are slower yet.) I would like to
understand that, because the consistency of the 2x factor tells me
it's some very specific, happens-or-doesn't-happen aspect. And the
lazy val is in my experimentation a lock to move the needle to
"happens".

Paul Phillips

oläst,

16 dec. 2011 10:55:172011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 7:51 AM, Paul Phillips <pa...@improving.org> wrote:
> I have observed the same thing. If there is a lazy val anywhere, good
> luck not seeing that minimum 2x difference between while and foreach.

Intuitively this is related to the fact that a lazy val involves going
through a monitor - that optimizations are precluded if something
downstream from the method requires coordination - but that's pure
speculation on my part.

Miguel Garcia

oläst,

16 dec. 2011 11:02:062011-12-16

till scala-i...@googlegroups.com

Geoff,

Inliner tries to make public in some cases, but not when the callee is a library method (`inc` below is the callee)

        def makePublic(f: Symbol): Boolean =
          inc.hasSourceFile && (f.isSynthetic || f.isParamAccessor) && {
            debuglog("Making not-private symbol out of synthetic: " + f)

            f setNotFlag Flags.PRIVATE
            true
          }

Help to improve the inliner is always welcome:
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/2011Q4/Inliner.pdf

Miguel

Ismael Juma

oläst,

16 dec. 2011 11:07:452011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 3:51 PM, Paul Phillips <pa...@improving.org> wrote:

I have observed the same thing. If there is a lazy val anywhere, good
luck not seeing that minimum 2x difference between while and foreach.

Interesting. Seems like we have to look at the assembler generated to understand the reason. Meanwhile, Miguel can just move the same cold path code that he had inlined into a separate method and there should be no performance difference in the micro-benchmark with less fragility in real-life code.

Best,

Ismael

Daniel Sobral

oläst,

16 dec. 2011 12:14:212011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 12:53, Miguel Garcia <mgarc...@yahoo.com> wrote:
>
> For those who haven't been following, here's a summary of factors heavily
> influencing Range.foreach performance (as far as I can remember):
>
> (0) in all cases, we use -optimise.

Actually, I got while-like performance without -optimise in the ByOne
case. Better than -optimise as often as not, in fact.

> (1) accessing a lazy val (e.g., Range.length) is a performance killer (may
> be prevents JIT compilation?)
>
> (2) invoking the closure arg can be done with or without casting, the cast
> neither hurts nor improves performance significantly:
>
>     (f.asInstanceOf[Int => Unit]).apply(i)
>       vs
>     f(i)
>
> (3) it's desirable to perform that invocation only once, otherwise each
> occurrence can lead to inlining the loop body.
>
> (4) keeping Range.foreach small helps, which means in practice
> pre-processing beforehand, not forgetting "corner cases" such as:
>
>     Int.MinValue to Int.MaxValue by (Int.MinValue / 2).abs
>     1 to 1 by Int.MaxValue
>     0 to Int.MinValue by Int.MinValue

Paul has also raised the hypothesis that using a literal increment
instead of adding step may also contribute to speed diferences. I can
see where this might make a difference to JIT: constant-increment
loops are so common it might special-case it.

I have not tested this hypothesis. I'll look into it.

On the other hand, it seems that subclassing Range gives benefits even
in the absence of changes to RichInt's type signature.

> (5) Daniel Sobral has made available a very complete benchmarking harness.

I'm taking pull requests to improve it further too. I have renamed the
repository, though. The new name is
https://github.com/dcsobral/scala-foreach-benchmark. Paul sent a pull
request to remove the hardcoded path I had in it, which made it really
easy to use.

--
Daniel C. Sobral

I travel to the future all the time.

Paul Phillips

oläst,

16 dec. 2011 12:41:242011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 9:14 AM, Daniel Sobral <dcso...@gmail.com> wrote:
> Actually, I got while-like performance without -optimise in the ByOne
> case. Better than -optimise as often as not, in fact.

Me too.

Paul Phillips

oläst,

16 dec. 2011 12:43:402011-12-16

till scala-i...@googlegroups.com

On Fri, Dec 16, 2011 at 9:14 AM, Daniel Sobral <dcso...@gmail.com> wrote:

> Paul has also raised the hypothesis that using a literal increment
> instead of adding step may also contribute to speed diferences. I can
> see where this might make a difference to JIT: constant-increment
> loops are so common it might special-case it.

It does.

http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques

Constants
Use constants when you can.
It's OK to store them in local variables; the SSA representation tends
to keep them visible.
It's OK to store them in static final fields, too.
Static final fields can hold non-primitive, non-string constants.
The class of a static final field value is also constant, as is the
array length (if it's an array).

[Ed: that's what pushed me toward implementing public static final fields.]

Loops
The server compiler likes a loop with an int counter (int i = 0), a
constant stride (i++), and loop-invariant limit (i <= n).
^-----

Loops over arrays work especially well when the compiler can relate
the counter limit to the length of the array(s).
For long loops over arrays, the majority of iterations are free of
individual range checks.
Loops are typically peeled by one iteration, to "shake out" tests
which are loop invariant but execute only on a non-zero tripcount.
Null checks are the key example.
If a loop contains a call, it is best if that call is inlined, so that
loop can be optimized as a whole.
A loop can have multiple exits. Any deoptimization point counts as a loop exit.
If your loop has a rare exceptional condition, consider exiting to
another (slower) loop when it happens.

Daniel Sobral

oläst,

16 dec. 2011 13:21:542011-12-16

till scala-i...@googlegroups.com

By the way, here's for foreach compares between 2.8.1 and 2.9.1:
http://microbenchmarks.appspot.com/run/dcso...@gmail.com/org.example.Benchmark/1195018.

Sébastien Bocq

oläst,

18 dec. 2011 12:15:152011-12-18

till scala-i...@googlegroups.com

2011/12/16 Miguel Garcia <mgarc...@yahoo.com>

Hi,

I played with these benchmarks and I get some strange results. Here is a slightly modified version of Miguel's Range:

class RangeOpt(val start: Int, val end: Int, val step: Int) extends IndexedSeq[Int]

with collection.CustomParallelizable[Int, ParRange]
with Serializable{

...

final def numSteps:Int = {

val numSteps = preproc(start, end, step, isInclusive)

if(numSteps == -1) {

          if (step == 0) throw new IllegalArgumentException("step cannot be 0.")
          else {
            val word = if (isInclusive) "to" else "until"
            val descr = List(start, word, end, "by", step) mkString " "

            throw new IllegalArgumentException(descr + ": seqs cannot contain more than Int.MaxValue elements.")
          }
        }

numSteps

}

@inline
final override def foreach[U](f: Int => U) {

    RangeOpt.applyEach(start, step, numSteps, f.asInstanceOf[Int => Unit])
}
}

object RangeOpt {

   final def applyEach(start:Int, step:Int, numSteps:Int, f:Int => Unit) {
     if (numSteps > 0) {
       f.apply(start)
       applyEach(start + step, step, numSteps -1, f)
     }
   }

   ...
}

Note that this version does not perform as well if "applyEach" is renamed to "foreach", which brings up factor number 6:
(6) don't overload methods, even in companion object.

So, I modified Daniel's benchmarking harness to experiment with both implementations (the source code is in attachment). What is intriguing is that the improvements with -optimize are not consistent. See the results below for instance (the benchmark names prefixed with "Scala" correspond to Miguel's version.)

----------------------------
with -optimize:
[info] 10000         ScalaForeachUnit      4473,26 =
[info] 10000           OptForeachUnit      4387,00 =
[info] 10000                WhileUnit      4757,30 =

without:
[info] 10000         ScalaForeachUnit      4563,64 =
[info] 10000           OptForeachUnit      6634,02 =
[info] 10000                WhileUnit      4760,56 =
----------------------------
with -optimize:
[info] 10000        ScalaForeachArray     12547,58 =
[info] 10000          OptForeachArray      6300,88 =
[info] 10000               WhileArray      4872,60 =

without:
[info] 10000        ScalaForeachArray      4906,10 =
[info] 10000          OptForeachArray     13547,56 =
[info] 10000               WhileArray      4889,45 =
----------------------------
with -optimize:
[info] 10000    ScalaForeachDumbPrime 192690649,50 =============================
[info] 10000      OptForeachDumbPrime 193040203,86 ==============================
[info] 10000           WhileDumbPrime 152969618,67 =======================

without:
[info] 10000    ScalaForeachDumbPrime 185219927,00 ==============================
[info] 10000      OptForeachDumbPrime 134526519,29 =====================
[info] 10000           WhileDumbPrime 153043985,67 ========================

Could someone double check my benchmarks please? Maybe I did something wrong. Right now, it looks like its rather complex to reason about optimizations and I'm not sure what to conclude about -optimize.

Cheers,
Sébastien

$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.4) (Gentoo build 1.6.0_22-b22)
OpenJDK Server VM (build 20.0-b11, mixed mode)

$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Core(TM)2 Duo CPU     T9400 @ 2.53GHz

sbt> run
...

with -optimize:

[info] length                benchmark           ns linear runtime
[info]     10         ScalaForeachUnit        16,37 =
[info]     10           OptForeachUnit        17,03 =
[info]     10                WhileUnit         5,54 =
[info]     10          ScalaForeachInt        64,66 =
[info]     10            OptForeachInt        66,45 =
[info]     10                 WhileInt         5,54 =
[info]     10          ScalaForeachDec        13,13 =
[info]     10            OptForeachDec        23,88 =
[info]     10      ScalaForeachReverse       180,43 =
[info]     10        OptForeachReverse       197,42 =
[info]     10                 WhileDec         9,63 =
[info]     10         ScalaForeachOpen        13,63 =
[info]     10           OptForeachOpen        12,65 =
[info]     10                WhileOpen         5,22 =
[info]     10        ScalaForeachStep2        15,61 =
[info]     10          OptForeachStep2        16,28 =
[info]     10               WhileStep2         4,35 =
[info]     10       ScalaForeachStepM2        21,00 =
[info]     10         OptForeachStepM2        19,97 =
[info]     10              WhileStepM2         8,24 =
[info]     10        ScalaForeachArray        13,51 =
[info]     10          OptForeachArray        13,48 =
[info]     10               WhileArray         6,13 =
[info]     10        ScalaForeachHeron       309,12 =
[info]     10          OptForeachHeron       311,35 =
[info]     10               WhileHeron       304,49 =
[info]     10    ScalaForeachDumbPrime       249,94 =
[info]     10      OptForeachDumbPrime       566,38 =
[info]     10           WhileDumbPrime       167,32 =
[info]     10 ScalaForeachEratosthenes       213,62 =
[info]     10   OptForeachEratosthenes       572,48 =
[info]     10        WhileEratosthenes        82,38 =
[info]    100         ScalaForeachUnit        55,75 =
[info]    100           OptForeachUnit        58,41 =
[info]    100                WhileUnit        52,36 =
[info]    100          ScalaForeachInt       525,09 =
[info]    100            OptForeachInt       517,02 =
[info]    100                 WhileInt        52,36 =
[info]    100          ScalaForeachDec        56,08 =
[info]    100            OptForeachDec        50,44 =
[info]    100      ScalaForeachReverse       655,78 =
[info]    100        OptForeachReverse       672,30 =
[info]    100                 WhileDec        52,83 =
[info]    100         ScalaForeachOpen        89,59 =
[info]    100           OptForeachOpen        54,16 =
[info]    100                WhileOpen        51,69 =
[info]    100        ScalaForeachStep2        37,99 =
[info]    100          OptForeachStep2        37,59 =
[info]    100               WhileStep2        23,27 =
[info]    100       ScalaForeachStepM2        41,80 =
[info]    100         OptForeachStepM2        40,78 =
[info]    100              WhileStepM2        27,89 =
[info]    100        ScalaForeachArray        52,92 =
[info]    100          OptForeachArray        52,76 =
[info]    100               WhileArray        41,95 =
[info]    100        ScalaForeachHeron      7042,35 =
[info]    100          OptForeachHeron      7011,71 =
[info]    100               WhileHeron      6962,55 =
[info]    100    ScalaForeachDumbPrime     22754,37 =
[info]    100      OptForeachDumbPrime     23438,61 =
[info]    100           WhileDumbPrime     17499,66 =
[info]    100 ScalaForeachEratosthenes      9095,57 =
[info]    100   OptForeachEratosthenes     10405,51 =
[info]    100        WhileEratosthenes      4555,72 =
[info]   1000         ScalaForeachUnit       455,48 =
[info]   1000           OptForeachUnit       460,48 =
[info]   1000                WhileUnit       468,61 =
[info]   1000          ScalaForeachInt      7969,47 =
[info]   1000            OptForeachInt      7967,97 =
[info]   1000                 WhileInt       469,89 =
[info]   1000          ScalaForeachDec       746,35 =
[info]   1000            OptForeachDec       478,50 =
[info]   1000      ScalaForeachReverse      8621,86 =
[info]   1000        OptForeachReverse      8734,51 =
[info]   1000                 WhileDec       467,75 =
[info]   1000         ScalaForeachOpen       696,64 =
[info]   1000           OptForeachOpen       445,77 =
[info]   1000                WhileOpen       467,34 =
[info]   1000        ScalaForeachStep2       401,41 =
[info]   1000          OptForeachStep2       242,81 =
[info]   1000               WhileStep2       260,99 =
[info]   1000       ScalaForeachStepM2       412,05 =
[info]   1000         OptForeachStepM2       244,01 =
[info]   1000              WhileStepM2       236,66 =
[info]   1000        ScalaForeachArray      1230,64 =
[info]   1000          OptForeachArray       548,91 =
[info]   1000               WhileArray       418,03 =
[info]   1000        ScalaForeachHeron     96283,41 =
[info]   1000          OptForeachHeron     96275,78 =
[info]   1000               WhileHeron     95684,05 =
[info]   1000    ScalaForeachDumbPrime   2160548,03 =
[info]   1000      OptForeachDumbPrime   1632831,97 =
[info]   1000           WhileDumbPrime   1793613,34 =
[info]   1000 ScalaForeachEratosthenes    655508,99 =
[info]   1000   OptForeachEratosthenes    498866,13 =
[info]   1000        WhileEratosthenes    347508,13 =
[info] 10000         ScalaForeachUnit      4473,26 =
[info] 10000           OptForeachUnit      4387,00 =
[info] 10000                WhileUnit      4757,30 =
[info] 10000          ScalaForeachInt     89291,52 =
[info] 10000            OptForeachInt     88926,04 =
[info] 10000                 WhileInt      4754,33 =
[info] 10000          ScalaForeachDec      6664,22 =
[info] 10000            OptForeachDec      4695,55 =
[info] 10000      ScalaForeachReverse     92604,12 =
[info] 10000        OptForeachReverse     95456,05 =
[info] 10000                 WhileDec      4498,38 =
[info] 10000         ScalaForeachOpen      6618,01 =
[info] 10000           OptForeachOpen      4374,62 =
[info] 10000                WhileOpen      4656,96 =
[info] 10000        ScalaForeachStep2      3356,29 =
[info] 10000          OptForeachStep2      2284,11 =
[info] 10000               WhileStep2      2633,47 =
[info] 10000       ScalaForeachStepM2      3376,44 =
[info] 10000         OptForeachStepM2      2278,21 =
[info] 10000              WhileStepM2      2364,73 =
[info] 10000        ScalaForeachArray     12547,58 =
[info] 10000          OptForeachArray      6300,88 =
[info] 10000               WhileArray      4872,60 =
[info] 10000        ScalaForeachHeron   1237844,10 =
[info] 10000          OptForeachHeron   1243683,07 =
[info] 10000               WhileHeron   1232865,41 =
[info] 10000    ScalaForeachDumbPrime 192690649,50 =============================
[info] 10000      OptForeachDumbPrime 193040203,86 ==============================
[info] 10000           WhileDumbPrime 152969618,67 =======================
[info] 10000 ScalaForeachEratosthenes 48321987,98 =======
[info] 10000   OptForeachEratosthenes 57257370,50 ========
[info] 10000        WhileEratosthenes 23461142,43 ===

without:
[info]
[info] length                benchmark           ns linear runtime
[info]     10         ScalaForeachUnit        14,72 =
[info]     10           OptForeachUnit        17,03 =
[info]     10                WhileUnit         5,54 =
[info]     10          ScalaForeachInt        65,75 =
[info]     10            OptForeachInt        66,18 =
[info]     10                 WhileInt         5,54 =
[info]     10          ScalaForeachDec        13,19 =
[info]     10            OptForeachDec        24,68 =
[info]     10      ScalaForeachReverse       195,60 =
[info]     10        OptForeachReverse       254,37 =
[info]     10                 WhileDec         9,63 =
[info]     10         ScalaForeachOpen        12,10 =
[info]     10           OptForeachOpen        12,65 =
[info]     10                WhileOpen         4,75 =
[info]     10        ScalaForeachStep2        15,61 =
[info]     10          OptForeachStep2        16,28 =
[info]     10               WhileStep2         4,36 =
[info]     10       ScalaForeachStepM2        18,09 =
[info]     10         OptForeachStepM2        19,97 =
[info]     10              WhileStepM2         8,24 =
[info]     10        ScalaForeachArray        13,49 =
[info]     10          OptForeachArray        81,70 =
[info]     10               WhileArray         6,12 =
[info]     10        ScalaForeachHeron       312,21 =
[info]     10          OptForeachHeron       308,31 =
[info]     10               WhileHeron       302,71 =
[info]     10    ScalaForeachDumbPrime       388,85 =
[info]     10      OptForeachDumbPrime       565,15 =
[info]     10           WhileDumbPrime       168,61 =
[info]     10 ScalaForeachEratosthenes       597,38 =
[info]     10   OptForeachEratosthenes       581,73 =
[info]     10        WhileEratosthenes        82,73 =
[info]    100         ScalaForeachUnit        55,28 =
[info]    100           OptForeachUnit        58,40 =
[info]    100                WhileUnit        52,36 =
[info]    100          ScalaForeachInt       491,82 =
[info]    100            OptForeachInt      1302,71 =
[info]    100                 WhileInt        52,38 =
[info]    100          ScalaForeachDec        56,37 =
[info]    100            OptForeachDec        50,47 =
[info]    100      ScalaForeachReverse       670,21 =
[info]    100        OptForeachReverse       963,09 =
[info]    100                 WhileDec        52,83 =
[info]    100         ScalaForeachOpen        53,07 =
[info]    100           OptForeachOpen       155,84 =
[info]    100                WhileOpen        51,67 =
[info]    100        ScalaForeachStep2        37,51 =
[info]    100          OptForeachStep2        37,61 =
[info]    100               WhileStep2        23,27 =
[info]    100       ScalaForeachStepM2        38,72 =
[info]    100         OptForeachStepM2        40,81 =
[info]    100              WhileStepM2        27,89 =
[info]    100        ScalaForeachArray        52,93 =
[info]    100          OptForeachArray       195,54 =
[info]    100               WhileArray        41,94 =
[info]    100        ScalaForeachHeron      7062,11 =
[info]    100          OptForeachHeron      7100,62 =
[info]    100               WhileHeron      6942,28 =
[info]    100    ScalaForeachDumbPrime     23718,27 =
[info]    100      OptForeachDumbPrime     17714,59 =
[info]    100           WhileDumbPrime     17425,67 =
[info]    100 ScalaForeachEratosthenes      9250,65 =
[info]    100   OptForeachEratosthenes      9221,41 =
[info]    100        WhileEratosthenes      4559,93 =
[info]   1000         ScalaForeachUnit       451,84 =
[info]   1000           OptForeachUnit       742,61 =
[info]   1000                WhileUnit       470,36 =
[info]   1000          ScalaForeachInt     10139,48 =
[info]   1000            OptForeachInt     21375,80 =
[info]   1000                 WhileInt       469,45 =
[info]   1000          ScalaForeachDec       479,39 =
[info]   1000            OptForeachDec       750,48 =
[info]   1000      ScalaForeachReverse     11174,16 =
[info]   1000        OptForeachReverse     10500,74 =
[info]   1000                 WhileDec       467,81 =
[info]   1000         ScalaForeachOpen       451,03 =
[info]   1000           OptForeachOpen       746,24 =
[info]   1000                WhileOpen       466,38 =
[info]   1000        ScalaForeachStep2       242,44 =
[info]   1000          OptForeachStep2       404,69 =
[info]   1000               WhileStep2       260,99 =
[info]   1000       ScalaForeachStepM2       241,94 =
[info]   1000         OptForeachStepM2       409,62 =
[info]   1000              WhileStepM2       236,79 =
[info]   1000        ScalaForeachArray       430,59 =
[info]   1000          OptForeachArray      1369,34 =
[info]   1000               WhileArray       417,23 =
[info]   1000        ScalaForeachHeron     96250,19 =
[info]   1000          OptForeachHeron     96819,11 =
[info]   1000               WhileHeron     95683,17 =
[info]   1000    ScalaForeachDumbPrime   2079398,86 =
[info]   1000      OptForeachDumbPrime   1600272,21 =
[info]   1000           WhileDumbPrime   1798492,10 =
[info]   1000 ScalaForeachEratosthenes    602405,93 =
[info]   1000   OptForeachEratosthenes    556142,71 =
[info]   1000        WhileEratosthenes    349392,10 =
[info] 10000         ScalaForeachUnit      4563,64 =
[info] 10000           OptForeachUnit      6634,02 =
[info] 10000                WhileUnit      4760,56 =
[info] 10000          ScalaForeachInt    107232,19 =
[info] 10000            OptForeachInt    210767,17 =
[info] 10000                 WhileInt      4757,47 =
[info] 10000          ScalaForeachDec      4707,63 =
[info] 10000            OptForeachDec      6663,45 =
[info] 10000      ScalaForeachReverse    117261,60 =
[info] 10000        OptForeachReverse    108942,68 =
[info] 10000                 WhileDec      4499,97 =
[info] 10000         ScalaForeachOpen      4484,43 =
[info] 10000           OptForeachOpen      6667,60 =
[info] 10000                WhileOpen      4659,20 =
[info] 10000        ScalaForeachStep2      2288,47 =
[info] 10000          OptForeachStep2      3367,85 =
[info] 10000               WhileStep2      2632,56 =
[info] 10000       ScalaForeachStepM2      2287,49 =
[info] 10000         OptForeachStepM2      2267,90 =
[info] 10000              WhileStepM2      2365,25 =
[info] 10000        ScalaForeachArray      4906,10 =
[info] 10000          OptForeachArray     13547,56 =
[info] 10000               WhileArray      4889,45 =
[info] 10000        ScalaForeachHeron   1236872,18 =
[info] 10000          OptForeachHeron   1244323,60 =
[info] 10000               WhileHeron   1233106,32 =
[info] 10000    ScalaForeachDumbPrime 185219927,00 ==============================
[info] 10000      OptForeachDumbPrime 134526519,29 =====================
[info] 10000           WhileDumbPrime 153043985,67 ========================
[info] 10000 ScalaForeachEratosthenes 35652739,80 =====
[info] 10000   OptForeachEratosthenes 52890626,64 ========
[info] 10000        WhileEratosthenes 23476727,32 ===

foreach-benchmarks.tgz

Miguel Garcia

oläst,

19 dec. 2011 06:02:592011-12-19

till scala-i...@googlegroups.com

Sébastien,

Your benchmarks point at something interesting, at this point I can only guess that (unsurprisingly) the time when jit-ing kicks in explains performance differences (as shown by -XX:+PrintCompilation, [1] ). That however does not explain why jit-ing kicks in earlier for different (functionally equivalent) versions of the same method.

Guessing some more, the inliner is right when inlining "inside closures" (i.e. the "loop body" for Range.foreach) but inlining in the prolog to such loop does not seem profitable (on the other hand, for the few experiments where I looked at this, jit-ing kicks in at about the same time for 'whileSum' and 'foreachSum' in the range-bench micro-benchmark).

What's clear is that when one goes for aggressive jit-ing of loops (via -XX:+AdvancedOpts ) then 'whileSum' wins whatever else one does.

Miguel

[1] http://q-redux.blogspot.com/2011/01/inspecting-hotspot-jvm-options.html

Sébastien Bocq

oläst,

19 dec. 2011 18:09:442011-12-19

till scala-i...@googlegroups.com

2011/12/19 Miguel Garcia <mgarc...@yahoo.com>

Hi Miguel,

Thanks for the insight. It is possible to approach more consistently the performance of "while" but not with "foreach". The original benchmark is contrived in some sense because it is misusing "foreach" for folding over a range by mutating a variable outside the scope of the "closure body". This might be a reason why optimizations get messed up. All it takes is to specialize "foldLeft" and I think it is worth it (complete benchmark is in attachment):

class RangeOpt extends ... {
...
final def foldLeftS[@specialized(Int) B](start:Int, step:Int, numSteps:Int)(z: B)(op: (B, Int) ⇒ B): B =
   // Inner method does not get specialized (bug SI-3457?) => moved it to companion
   RangeOpt.foldEachLeftS(start, step, numSteps)(z)(op)

}

object RangeOpt {
...
// Method does not get specialized if marked private (new bug?)
final def foldEachLeftS[@specialized(Int) B](start:Int, step:Int, numSteps:Int)(z: B)(op: (B, Int) ⇒ B): B =
     if (numSteps > 0)
       foldEachLeftS(start + step, step, numSteps - 1)(op(z, start))(op)
     else
       z
}

The timings below show that:
- Conversely to previous optimizations, "foldLeft" performs consistently with or without -optimise.
- Sometimes the JVM cannot make up its mind whether to optimise or not.
Note that -XX:+AggressivedOpts did not make a difference in these benchmarks.

- with -optimise (10000):
[info]         benchmark trial     us linear runtime
[info]   ScalaForeachInt     0 100,55 ==============
[info]   ScalaForeachInt     1 100,76 ==============
[info]   ScalaForeachInt     2 100,53 ==============
[info]     OptForeachInt     0 100,76 ==============
[info]     OptForeachInt     1 100,94 ==============
[info]     OptForeachInt     2 100,66 ==============
[info]       FoldLeftInt     0 198,38 ============================ // Not specialized
[info]       FoldLeftInt     1 198,01 ============================
[info]       FoldLeftInt     2 198,22 ============================
[info]      FoldLeftSInt     0   6,61 =                            // Specialized
[info]      FoldLeftSInt     1   6,62 =
[info]      FoldLeftSInt     2   6,61 =
[info]          WhileInt     0   4,75 =
[info]          WhileInt     1   4,75 =
[info]          WhileInt     2   4,76 =

[info] ScalaForeachStep2     0   3,36 =
[info] ScalaForeachStep2     1   3,36 =
[info] ScalaForeachStep2     2   3,36 =
[info]   OptForeachStep2     0   2,28 =
[info]   OptForeachStep2     1   2,29 =
[info]   OptForeachStep2     2   2,28 =
[info]     FoldLeftStep2     0 99,82 ==============
[info]     FoldLeftStep2     1 99,65 ==============
[info]     FoldLeftStep2     2 99,69 ==============
[info]    FoldLeftSStep2     0   3,36 =
[info]    FoldLeftSStep2     1   3,36 =
[info]    FoldLeftSStep2     2   3,36 =
[info]        WhileStep2     0   2,63 =
[info]        WhileStep2     1   2,63 =
[info]        WhileStep2     2   2,63 =

[info] ScalaForeachArray     0 13,54 =
[info] ScalaForeachArray     1 13,50 =
[info] ScalaForeachArray     2 13,52 =
[info]   OptForeachArray     0   6,04 =
[info]   OptForeachArray     1   6,02 =
[info]   OptForeachArray     2   6,06 =
[info]     FoldLeftArray     0 202,92 =============================
[info]     FoldLeftArray     1 207,59 ==============================
[info]     FoldLeftArray     2 206,65 =============================
[info]    FoldLeftSArray     0   4,89 =
[info]    FoldLeftSArray     1 23,88 ===
[info]    FoldLeftSArray     2 23,89 ===
[info]        WhileArray     0   4,87 =
[info]        WhileArray     1   4,87 =
[info]        WhileArray     2   4,86 =
[info]

- without:
[info]         benchmark trial     us linear runtime
[info]   ScalaForeachInt     0 101,05 ==============
[info]   ScalaForeachInt     1 100,41 =============
[info]   ScalaForeachInt     2 100,08 =============
[info]     OptForeachInt     0 206,24 ============================
[info]     OptForeachInt     1 208,59 =============================
[info]     OptForeachInt     2 213,98 =============================
[info]       FoldLeftInt     0 200,83 ===========================
[info]       FoldLeftInt     1 197,82 ===========================
[info]       FoldLeftInt     2 200,25 ===========================
[info]      FoldLeftSInt     0   6,62 =
[info]      FoldLeftSInt     1   4,57 =
[info]      FoldLeftSInt     2   4,56 =
[info]          WhileInt     0   4,75 =
[info]          WhileInt     1   4,75 =
[info]          WhileInt     2   4,76 =

[info] ScalaForeachStep2     0   2,30 =
[info] ScalaForeachStep2     1   2,30 =
[info] ScalaForeachStep2     2   2,30 =
[info]   OptForeachStep2     0   3,36 =
[info]   OptForeachStep2     1   3,37 =
[info]   OptForeachStep2     2   3,37 =
[info]     FoldLeftStep2     0 101,62 ==============
[info]     FoldLeftStep2     1 100,37 =============
[info]     FoldLeftStep2     2 99,08 =============
[info]    FoldLeftSStep2     0   3,36 =
[info]    FoldLeftSStep2     1   3,36 =
[info]    FoldLeftSStep2     2   3,36 =
[info]        WhileStep2     0   2,63 =
[info]        WhileStep2     1   2,63 =
[info]        WhileStep2     2   2,63 =

[info] ScalaForeachArray     0   4,92 =
[info] ScalaForeachArray     1   4,90 =
[info] ScalaForeachArray     2   4,90 =
[info]   OptForeachArray     0 13,54 =
[info]   OptForeachArray     1 13,54 =
[info]   OptForeachArray     2 13,54 =
[info]     FoldLeftArray     0 207,35 ============================
[info]     FoldLeftArray     1 215,38 ==============================
[info]     FoldLeftArray     2 206,75 ============================
[info]    FoldLeftSArray     0 23,88 ===
[info]    FoldLeftSArray     1   4,89 =
[info]    FoldLeftSArray     2 23,88 ===
[info]        WhileArray     0   4,87 =
[info]        WhileArray     1   4,86 =
[info]        WhileArray     2   4,86 =

I don't think it is worth investing more effort in optimizing "foreach" because it must perform I/O if it is used correctly, which will consume most of the time anyway. In some ways, it sounds natural that one can't beat imperative style on its own ground by abusing functional style. I haven't tried to convert the other tests but, for the greater good, optimizing idiomatic code paths against "while" loops encourages best coding practices.

Cheers,
Sébastien

foreach-benchmarks.tgz

Matthew Pocock

oläst,

20 dec. 2011 11:25:262011-12-20

till scala-i...@googlegroups.com

On 19 December 2011 23:09, Sébastien Bocq <sebasti...@gmail.com> wrote:

I don't think it is worth investing more effort in optimizing "foreach" because it must perform I/O if it is used correctly, which will consume most of the time anyway. In some ways, it sounds natural that one can't beat imperative style on its own ground by abusing functional style. I haven't tried to convert the other tests but, for the greater good, optimizing idiomatic code paths against "while" loops encourages best coding practices.

I may have misunderstood you, but if you are saying that there's no point optimizing foreach in the cases where people should be coding in a functionally-pure way but chose to use mutable state that the loop updates, and that having this case run slowly is how we should encourage them towards the FP-pure coding style, then I could not disagree more. If in the formative time with scala someone finds that this case is slow, their first inclination will be to switch to a language where it is fast, not find the 'right' way to do it in scala. If this becomes policy, then at the very least the compiler needs to spit out a warning pushing them towards the 'right' way. I do hope that I got the wrong end of the stick.

Matthew

Cheers,
Sébastien

--

Dr Matthew Pocock

Integrative Bioinformatics Group, School of Computing Science, Newcastle University

mailto: turingate...@gmail.com

gchat: turingate...@gmail.com

msn: matthew...@yahoo.co.uk

irc.freenode.net: drdozer

skype: matthew.pocock

tel: (0191) 2566550

mob: +447535664143

Miguel Garcia

oläst,

20 dec. 2011 11:33:372011-12-20

till scala-i...@googlegroups.com

Matthew,

There's value in making the compiler ever smarter about optimizing functional-style code,
and
there's value in writing C-style code for performance-critical sections.

Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/

Ismael Juma

oläst,

20 dec. 2011 11:35:172011-12-20

till scala-i...@googlegroups.com

On Mon, Dec 19, 2011 at 11:09 PM, Sébastien Bocq <sebasti...@gmail.com> wrote:

I don't think it is worth investing more effort in optimizing "foreach" because it must perform I/O if it is used correctly

This is false.

Best,

Ismael

Nils Kilden-Pedersen

oläst,

20 dec. 2011 11:55:102011-12-20

till scala-i...@googlegroups.com

Tony?

Ismael Juma

oläst,

20 dec. 2011 12:14:242011-12-20

till scala-i...@googlegroups.com

To elaborate, there are many scenarios in performance critical code where using foreach without performing I/O is desirable. For example, the inner loop of a machine learning algorithm where feature vectors are updated.

Best,

Ismael

Paul Phillips

oläst,

20 dec. 2011 14:22:112011-12-20

till scala-i...@googlegroups.com

On Tue, Dec 20, 2011 at 9:14 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> To elaborate, there are many scenarios in performance critical code where
> using foreach without performing I/O is desirable. For example, the inner
> loop of a machine learning algorithm where feature vectors are updated.

Or, look at every method implementation in TraversableLike.

Sébastien Bocq

oläst,

20 dec. 2011 20:42:082011-12-20

till scala-i...@googlegroups.com

2011/12/20 Paul Phillips <pa...@improving.org>

Here is my grouped answer:

@Matthew To rephrase it another way, I think there has been enough
time sinked in optimizing foreach (at least I'll stop trying it) and,
as shown in my latest benchmarks, idiomatic code deserves some love
too. I was not trying to win a language popularity contest here, I
just tried to help and then share my conclusions in all honesty. See
also paragraph below.

@Ismael Unless the syntactic-sugar for for-comprehensions changes in
the future, I'd use "while" in Scala to program imperatively at memory
level and mutate in place arrays or vectors in performance critical
code. It has a less nicer syntax than the "for" loop in Java but
hopefully the scope can be limited. The reason is that I don't
entertain the thought that "foreach" is equivalent to "for" in Java.
For example, using "for" in Java to compute the sum of the elements in
an array is fast and good style. In Scala, "foreach" is a higher-order
function tailored for I/O. You can do the same as in Java, by mutating
a data structure outside the scope of the effectful function closure,
but it is less efficient than its imperative counterpart at that job.
Fair enough. "foldLeft" is another higher-function like "foreach", but
this one is fit for the job and it can be easily optimized to perform
comparably to the imperative loop. Thinking this way, one can use the
right tool for the right job and make the most out of Scala.

@Paul Ah ah, it doesn't leave much choices, does it? Well, I'll answer
that implementing a collection in terms of "foreach" is a design
choice that is questionable at best. See this one for example:
def isEmpty: Boolean = {
        var result = true
        breakable {
          for (x <- this) {
            result = false
            break
          }
        }
        result
    }
It is much too complex for what it is and most of these methods are
overriden in subclasses anyway. Look at every overriden method
implementation in IterableLike for example. So, why bother?

Cheers!
Sébastien

Ismael Juma

oläst,

21 dec. 2011 03:38:522011-12-21

till scala-i...@googlegroups.com

Hi Sebastien,

On Wed, Dec 21, 2011 at 1:42 AM, Sébastien Bocq <sebasti...@gmail.com> wrote:

@Ismael Unless the syntactic-sugar for for-comprehensions changes in
the future, I'd use "while" in Scala to program imperatively at memory
level and mutate in place arrays or vectors in performance critical
code. It has a less nicer syntax than the "for" loop in Java but
hopefully the scope can be limited.

As someone who has to do this a lot, it's less than satisfactory.

In Scala, "foreach" is a higher-order function tailored for I/O.

You keep saying this. What is your source for such claim? I have been following the Scala community closely for years now and I don't remember Martin ever saying that.

Best,

Ismael

Erik Osheim

oläst,

21 dec. 2011 10:17:392011-12-21

till scala-i...@googlegroups.com

On Wed, Dec 21, 2011 at 08:38:52AM +0000, Ismael Juma wrote:

> On Wed, Dec 21, 2011 at 1:42 AM, S�bastien Bocq <sebasti...@gmail.com>wrote:
> > @Ismael Unless the syntactic-sugar for for-comprehensions changes in
> > the future, I'd use "while" in Scala to program imperatively at memory
> > level and mutate in place arrays or vectors in performance critical
> > code. It has a less nicer syntax than the "for" loop in Java but
> > hopefully the scope can be limited.
>
> As someone who has to do this a lot, it's less than satisfactory.

+100

-- Erik

Svara alla

Svara författaren

Vidarebefordra