In the experimental optimizer [1], closure-inlining handles methods that receive any number of anonymous closures. There's room for improvement, but the following already works:
(a) each closure may be applied multiple times, yet inlining it won't duplicate code, because each closure application is translated into the invocation of a final method. (A follow-up analysis could determine whether inlining pays off, but the VM is already quite good at this.) Want to know more details? This optimizer is documented!
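A hand-written sketch of the idea (the dlgt$1 name and the shape of the rewritten method are made up for illustration, not the optimizer's actual output): a closure applied twice collapses into two invocations of one final method, so no code is duplicated however many times the closure is applied.

```scala
object Sketch {
  // Before: a method that applies its closure argument twice.
  def twice(x: Int, f: Int => Int): Int = f(f(x))

  // Rough shape after closure-inlining: the closure body lands in a
  // single final method, and every application of the (now gone)
  // closure becomes a plain invocation of it.
  private final def dlgt$1(x: Int): Int = x + 1
  def twiceInlined(x: Int): Int = dlgt$1(dlgt$1(x))

  def main(args: Array[String]): Unit = {
    assert(twice(5, _ + 1) == twiceInlined(5))
    println(twiceInlined(5)) // prints 7
  }
}
```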
(b) Specialization results in closure bodies that expect unboxed arguments, while the FunctionX.apply() methods don't have primitives in their method descriptors. Provided the invoker has primitives to start with, the intermediaries in charge of unboxing/boxing are collapsed into oblivion by closure-inlining.
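The box/unbox hops can be seen from source (a plain-Scala sketch, not optimizer output): forcing a call through Function1's generic apply, whose erased descriptor is apply(Object)Object, boxes the Int on the way in and out, while the normal call resolves to the specialized apply$mcII$sp(I)I. It is the former path's intermediaries that closure-inlining collapses when the invoker already holds a primitive.

```scala
object BoxingDemo {
  def main(args: Array[String]): Unit = {
    val inc: Int => Int = _ + 1   // Function1 is @specialized on Int

    // Generic path: the erased apply(Object)Object is invoked, so 41
    // is boxed to java.lang.Integer going in and the result comes
    // back boxed.
    val viaGeneric: Any = inc.asInstanceOf[Any => Any](41)

    // Specialized path: resolves to apply$mcII$sp(I)I, primitives only.
    val viaSpecialized: Int = inc(41)

    assert(viaGeneric == viaSpecialized)
    println(viaSpecialized) // prints 42
  }
}
```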
(c) The experimental optimizer also elides all private class members that see no use (unless -YkeepUnusedPrivateClassMembers is given). Combined with closure-inlining, that results in 1.5 MiB worth of closure classes disappearing from scala-compiler.jar.
(d) Closure inlining is also possible for non-trivial closure bodies that declare local methods, local classes, etc. For example, the following results in just Test.class and Test$.class being emitted:
object Test {
  def main(args: Array[String]) {
    (1 to 10) foreach { a => def ma(x: Int) = { x }
      (11 to 20) foreach { b => def mb(y: Int) = { y }
        println(ma(a) + mb(b))
      }
    }
  }
}

To explore the resulting bytecode, javap -private is necessary. If using the -Ygen-javap option of scalac, the following change in JavapBytecodeWriter achieves the same:
javap(Seq("-c", "-private", classFile.name)) foreach (_.show())
Regarding "room for improvement": currently both method-inlining and closure-inlining focus on the targets of callsites that are statically known to be @inline, and moreover run without a type-flow analysis (one is available, and quite fast too, but initially only the low-hanging cases are considered). The reasoning behind this is to allow "optimization levels", for example: (0) no optimization; (1) intra-method only; (2) inter-procedural without type-flow analysis; (3) the whole thing.
Comments and suggestions are welcome. The "prototype" bootstraps, and thus should be able to cope with your benchmarks of choice. As for suggestions: one of the strengths of ASM is the ease of adding intra-method and intra-class optimizations, so please feel free to point out code snippets that could be improved.
A comment about Range.foreach. A previous design got the job done by invoking the closure twice (once in a loop, and one more time outside it). That design would have worked wonderfully with the experimental optimizer. Currently, Range.foreach() hands the closure over to Range.validateRangeBoundaries(). The experimental optimizer still stack-allocates the closure, at the cost of inlining both of those methods. Just an example of how code hand-tuned for the quirks of one optimizer might benefit from a more natural formulation.
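In source form, the current shape can be sketched like this (the names and the validation logic are simplified stand-ins, not the actual library code): the closure escapes into the helper before the loop runs, so keeping it stack-allocated forces the optimizer to inline both methods.

```scala
object RangeSketch {
  // Stand-in for Range.validateRangeBoundaries: the closure is handed
  // over before iteration, so it escapes foreachCurrent.
  private def validateBoundaries(start: Int, end: Int, f: Int => Unit): Unit =
    require(start <= end, "empty or reversed range")  // hypothetical check

  // Stand-in for the current Range.foreach.
  def foreachCurrent(start: Int, end: Int)(f: Int => Unit): Unit = {
    validateBoundaries(start, end, f)  // closure escapes into the helper
    var i = start
    while (i < end) { f(i); i += 1 }
  }

  def main(args: Array[String]): Unit = {
    var sum = 0
    foreachCurrent(1, 4)(sum += _)
    assert(sum == 6) // 1 + 2 + 3
    println(sum)
  }
}
```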
BTW, the "closureConversionMethodHandle()" utility in UnCurry is used by the experimental optimizer for closure-inlining, but it could also serve as a basis for MethodHandle experiments.
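For the curious, such experiments need nothing beyond JDK 7's java.lang.invoke. Here is a minimal, self-contained sketch (unrelated to closureConversionMethodHandle itself) that looks up a method standing in for a hoisted closure body and invokes it through a MethodHandle; invokeWithArguments is used because, unlike the signature-polymorphic invokeExact, it is an ordinary varargs method and thus easy to call from Scala.

```scala
import java.lang.invoke.{MethodHandles, MethodType}

object MHDemo {
  // Stand-in for a closure body hoisted into a method.
  def inc(x: Int): Int = x + 1

  def main(args: Array[String]): Unit = {
    val lookup = MethodHandles.lookup()
    val mt     = MethodType.methodType(classOf[Int], classOf[Int]) // (I)I
    // An object's methods live as instance methods on its module class,
    // so look up a virtual method and bind the singleton as receiver.
    val mh = lookup.findVirtual(MHDemo.getClass, "inc", mt).bindTo(MHDemo)
    val r  = mh.invokeWithArguments(Integer.valueOf(41))
    assert(r == 42)
    println(r) // prints 42
  }
}
```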
Miguel
http://lampwww.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/References
----------
[1] branch GenBCodeOpt at
https://github.com/magarciaEPFL/scala.git