In the experimental optimizer [1], closure-inlining handles methods that receive any number of anonymous closures. There's room for improvement, but the following already works:
(a) each closure may be applied multiple times, yet inlining it won't duplicate code, because each closure application is translated into the invocation of a final method. (A follow-up analysis could determine whether inlining pays off, but the VM is already quite good at this.) Want to know more details? This optimizer is documented!
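A hand-written sketch of the idea (the dlgt$1 name and the shape of the rewritten method are made up for illustration, not the optimizer's actual output): a closure applied twice collapses into two invocations of one final method, so no code is duplicated however many times the closure is applied.

```scala
object Sketch {
  // Before: a method that applies its closure argument twice.
  def twice(x: Int, f: Int => Int): Int = f(f(x))

  // Rough shape after closure-inlining: the closure body lands in a
  // single final method, and every application of the (now gone)
  // closure becomes a plain invocation of it.
  private final def dlgt$1(x: Int): Int = x + 1
  def twiceInlined(x: Int): Int = dlgt$1(dlgt$1(x))

  def main(args: Array[String]): Unit = {
    assert(twice(5, _ + 1) == twiceInlined(5))
    println(twiceInlined(5)) // prints 7
  }
}
```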
(b) Specialization results in closure bodies that expect unboxed arguments, while the FunctionX.apply() methods don't have primitives in their method descriptors. Provided the invoker has primitives to start with, the intermediaries in charge of unboxing/boxing are collapsed into oblivion by closure-inlining.
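The box/unbox hops can be seen from source (a plain-Scala sketch, not optimizer output): forcing a call through Function1's generic apply, whose erased descriptor is apply(Object)Object, boxes the Int on the way in and out, while the normal call resolves to the specialized apply$mcII$sp(I)I. It is the former path's intermediaries that closure-inlining collapses when the invoker already holds a primitive.

```scala
object BoxingDemo {
  def main(args: Array[String]): Unit = {
    val inc: Int => Int = _ + 1   // Function1 is @specialized on Int

    // Generic path: the erased apply(Object)Object is invoked, so 41
    // is boxed to java.lang.Integer going in and the result comes
    // back boxed.
    val viaGeneric: Any = inc.asInstanceOf[Any => Any](41)

    // Specialized path: resolves to apply$mcII$sp(I)I, primitives only.
    val viaSpecialized: Int = inc(41)

    assert(viaGeneric == viaSpecialized)
    println(viaSpecialized) // prints 42
  }
}
```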
(c) The experimental optimizer also elides all private class members that see no use (unless -YkeepUnusedPrivateClassMembers is given). Combined with closure-inlining, that results in 1.5 MiB worth of closure classes disappearing from scala-compiler.jar.
(d) Closure inlining is also possible for non-trivial closure bodies that declare local methods, local classes, etc. For example, the following results in just Test.class and Test$.class being emitted:
object Test {
  def main(args: Array[String]) {
    (1 to 10) foreach { a => def ma(x: Int) = { x }
      (11 to 20) foreach { b => def mb(y: Int) = { y }
        println(ma(a) + mb(b))
      }
    }
  }
}

To explore the resulting bytecode, javap -private is necessary. If using the -Ygen-javap option of scalac, the following change in JavapBytecodeWriter achieves the same:
javap(Seq("-c", "-private", classFile.name)) foreach (_.show())
Regarding "room for improvement": currently both method-inlining and closure-inlining focus on the targets of callsites that are statically known to be @inline, and moreover run without a type-flow analysis (one is available, and quite fast too, but initially only the low-hanging cases are considered). The reasoning behind this is to allow "optimization levels", for example: (0) no optimization; (1) intra-method only; (2) inter-procedural without type-flow analysis; (3) the whole thing.
Comments and suggestions are welcome. The "prototype" bootstraps, and thus should be able to cope with your benchmarks of choice. As for suggestions: one of the strengths of ASM is the ease of adding intra-method and intra-class optimizations, so please feel free to point out code snippets that could be improved.
A comment about Range.foreach. A previous design got the job done by invoking the closure twice (once in a loop, and one more time outside it). That design would have worked wonderfully with the experimental optimizer. Currently, Range.foreach() hands the closure over to Range.validateRangeBoundaries(). The experimental optimizer still stack-allocates the closure, at the cost of inlining both of those methods. Just an example of how code hand-tuned for the quirks of one optimizer might benefit from a more natural formulation.
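In source form, the current shape can be sketched like this (the names and the validation logic are simplified stand-ins, not the actual library code): the closure escapes into the helper before the loop runs, so keeping it stack-allocated forces the optimizer to inline both methods.

```scala
object RangeSketch {
  // Stand-in for Range.validateRangeBoundaries: the closure is handed
  // over before iteration, so it escapes foreachCurrent.
  private def validateBoundaries(start: Int, end: Int, f: Int => Unit): Unit =
    require(start <= end, "empty or reversed range")  // hypothetical check

  // Stand-in for the current Range.foreach.
  def foreachCurrent(start: Int, end: Int)(f: Int => Unit): Unit = {
    validateBoundaries(start, end, f)  // closure escapes into the helper
    var i = start
    while (i < end) { f(i); i += 1 }
  }

  def main(args: Array[String]): Unit = {
    var sum = 0
    foreachCurrent(1, 4)(sum += _)
    assert(sum == 6) // 1 + 2 + 3
    println(sum)
  }
}
```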
BTW, the "closureConversionMethodHandle()" utility in UnCurry is used by the experimental optimizer for closure-inlining, but it could also serve as a basis for MethodHandle experiments.
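For the curious, such experiments need nothing beyond JDK 7's java.lang.invoke. Here is a minimal, self-contained sketch (unrelated to closureConversionMethodHandle itself) that looks up a method standing in for a hoisted closure body and invokes it through a MethodHandle; invokeWithArguments is used because, unlike the signature-polymorphic invokeExact, it is an ordinary varargs method and thus easy to call from Scala.

```scala
import java.lang.invoke.{MethodHandles, MethodType}

object MHDemo {
  // Stand-in for a closure body hoisted into a method.
  def inc(x: Int): Int = x + 1

  def main(args: Array[String]): Unit = {
    val lookup = MethodHandles.lookup()
    val mt     = MethodType.methodType(classOf[Int], classOf[Int]) // (I)I
    // An object's methods live as instance methods on its module class,
    // so look up a virtual method and bind the singleton as receiver.
    val mh = lookup.findVirtual(MHDemo.getClass, "inc", mt).bindTo(MHDemo)
    val r  = mh.invokeWithArguments(Integer.valueOf(41))
    assert(r == 42)
    println(r) // prints 42
  }
}
```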
Miguel
http://lampwww.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/References
----------
[1] branch GenBCodeOpt at
https://github.com/magarciaEPFL/scala.git