starting to optimize the virtualizing pattern matcher

333 views
Skip to first unread message

Adriaan Moors

unread,
Nov 28, 2011, 11:18:51 AM11/28/11
to scala-i...@googlegroups.com
Dear All,

The time has come to start optimizing the virtualizing pattern matcher.

["what's that," you say? 
In a nutshell, it compiles pattern matches using a scheme similar to how we compile for-comprehensions: all you need is flatMap and orElse
it's also available in nightly builds of scalac -- unleash it using -Yvirtpatmat
I urgently need to write something up about this, but for now, might I suggest you browse its implementation -- it's actually fairly simple:


As it stands, a compiler compiled using -Yvirtpatmat, and compiling for -Yvirtpatmat,
takes between 37-55% longer to compile the library and the compiler (... than a compiler that's still compiling for -Yvirtpatmat, but whose execution uses the code generated by the regular pattern matcher)

So, how do we get this down to an acceptable percentage?

My intuition was to see how can we can avoid redundant condition-testing in matches like

// it's instructive to see what this is compiled to by the current pattern matcher -- use scalac -Xprint:explicitouter
def fastMatch() = {
  List(1, 3, 4, 7) match {
    case 1 :: 3 :: 4 :: 5 :: x => println("nope")
    case 1 :: 3 :: 4 :: 6 :: x => println("nope") // only need to test whether the previous case failed due to the third element not being 5, and whether third element is 6
    case 1 :: 3 :: 4 :: 7 :: x => 1
  }
}

So, I raced this method against the code currently generated by virtpatmat for the same match:

import improving.benchmark.Benchmark.Race // https://github.com/adriaanm/scala-improving/blob/master/src/main/scala/Benchmark.scala (code by Paul)

// race fastMatch, which is compiled using the current pattern matcher,
// against virtMatch, the same match compiled using -Yvirtpatmat,
// by running each method 100K times
Race(fastMatch, virtMatch)(100000).converge()

// --> First body 9% faster.

Ok, a 10% slowdown does not explain the 50% we get in "real life", but it seems substantial enough to spend some time chiselling at it.

Hence, the following experiments using hand-optimized versions of virtMatch:

// does it help to optimize orElse by storing the result in a mutable variable?
Race(fastMatch, virtMatch_var_orElse)(100000).converge() // --> First body 8% faster.

// does inlining runOrElse make a difference?
Race(fastMatch, virtMatch_inline_runOrElse)(100000).converge() // --> First body 8% faster.

okay, not much gain here...


// how about sharing? (i.e., to avoid re-computing already tested conditions)
Race(fastMatch, virtMatch_sharing)(100000).converge() // --> First body 3% faster.

That looks more promising! So, I think I'm going to implement this optimization first. (hand-coded version in the gist: https://gist.github.com/1400910)

The idea is to eliminate common subexpressions found in the conditions of the if-statements that make up the match.
I'd like to avoid duplicating code/generating jumps (which is what the current pattern matcher does).
Hence the idea of sharing the results of the tests rather than the code doing the testing.

The reason I'm posting is of course to hear your thoughts!
Suggestions for concrete benchmarks, a better benchmarking strategy, or optimizations most welcome!


cheers
adriaan

Matthew Pocock

unread,
Nov 28, 2011, 11:43:43 AM11/28/11
to scala-i...@googlegroups.com
Hi,

I may be being naive, but do your optimizations require that the unapply methods are side-effect-less, pure functions? If so, it sounds like your optimizations are those generally applicable to these cases - avoid executing code multiple times where you can remember the value from executing it once, and avoid executing code where the result isn't used. However, you don't have to look back far in the -users list to find examples of people doing stateful things in case statements. Without an effects system I'm not sure how far you can go with this while guaranteeing that you're not borking the semantics up in some exciting way. As usual with rewrite-based optimization, an effects system would be useful, even if it is somewhat conservative in what it declares effect-free.

Matthew
--
Dr Matthew Pocock
Integrative Bioinformatics Group, School of Computing Science, Newcastle University
skype: matthew.pocock
tel: (0191) 2566550

Adriaan Moors

unread,
Nov 28, 2011, 11:51:04 AM11/28/11
to scala-i...@googlegroups.com
Hi,

Indeed, unapplies can't be optimized this way -- for the reasons you state, and besides, even if they are pure, they might not respect the other laws these optimizations rely on.
I should have clarified this is only going to optimize matching on case classes (where we know the unapplies do the right thing), and possibly type tests/constant patterns.
Luckily, this is still a pretty common scenario. 

cheers
adriaan

ps: We're also devising another unapply scheme that might bring some of the benefit (mostly speed) of case class matching to user-defined unapplies, but that's a 2.0 feature ;-)

Josh Suereth

unread,
Nov 28, 2011, 12:17:34 PM11/28/11
to scala-i...@googlegroups.com
Now I see why you have the @ThisUnapplyIsPrettyOk annotation in your branch....

Tiark Rompf

unread,
Nov 28, 2011, 12:19:59 PM11/28/11
to scala-i...@googlegroups.com
I'd suggest trying to get rid of the Option objects altogether (either encode using null or use a separate bool flag).
- Tiark

Adriaan Moors

unread,
Nov 28, 2011, 1:29:38 PM11/28/11
to scala-i...@googlegroups.com
On Mon, Nov 28, 2011 at 6:19 PM, Tiark Rompf <tiark...@epfl.ch> wrote:
I'd suggest trying to get rid of the Option objects altogether (either encode using null or use a separate bool flag).
good call! I figured "how bad could that last option be," but you were right -- simply eliminating all traces of option brings the overhead down to 4%... 
I updated the gist with a version without option -- if you see a more efficient way, please let me know!

cheers
adriaan

Paul Phillips

unread,
Nov 28, 2011, 2:02:16 PM11/28/11
to scala-i...@googlegroups.com
On Mon, Nov 28, 2011 at 8:18 AM, Adriaan Moors <adriaa...@epfl.ch> wrote:
> // how about sharing? (i.e., to avoid re-computing already tested
> conditions)
> Race(fastMatch, virtMatch_sharing)(100000).converge() // --> First body 3%
> faster.
> That looks more promising!

Indeed, and type tests aren't that cheap, especially when compared to
a stack boolean.

if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
val o47: Option[AnyVal] = if
(x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
if (x1.isInstanceOf[collection.immutable.::[Int]]) {
val o47: Option[AnyVal] = if
(x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])
if (x1.isInstanceOf[collection.immutable.::[Int]])

At least record the result of every type test which must be executed.
It'll get a little fuzzier when you get into non-mandatory code paths;
but you could allocate almost-free "lazy val booleans" in the form of
tri-state ints.

Adriaan Moors

unread,
Nov 28, 2011, 2:11:26 PM11/28/11
to scala-i...@googlegroups.com


On Mon, Nov 28, 2011 at 8:02 PM, Paul Phillips <pa...@improving.org> wrote:
you could allocate almost-free "lazy val booleans" in the form of tri-state ints.
ah yes, that sounds pretty easy to do -- I'll implement this next and report on the ensuing race

Tiark Rompf

unread,
Nov 28, 2011, 3:10:45 PM11/28/11
to scala-i...@googlegroups.com
also, to make sure you are measuring just the pattern matching, it seems like a good idea to move the List(1, 3, 4, 7) object construction out of the benchmark loop and store it as a field of the enclosing object.

It'll be interesting to see how the tri-state boolean version turns out.

- Tiark

Jason Zaugg

unread,
Nov 28, 2011, 3:50:54 PM11/28/11
to scala-i...@googlegroups.com
That would be a nice general optimization for any lazy val that doesn't escape the scope of a method.

-jason

Adriaan Moors

unread,
Nov 30, 2011, 8:18:40 AM11/30/11
to scala-i...@googlegroups.com
the results for bootstrapping the compiler with the no-option virtual pattern matcher are in:

[strap.lib.timer: 1:55.194 sec]   115s ==> + 16%      (compiler compiled using -Yvirtpatmat compiling the library for -Yvirtpatmat)
[quick.lib.timer: 1:39.240 sec]   99s                         (compiler with normal pattern matching compiling the library for -Yvirtpatmat)
[strap.comp.timer: 1:39.152 sec]  99s  ==> + 26%   (compiler compiled using -Yvirtpatmat compiling the compiler for -Yvirtpatmat) 
[quick.comp.timer: 1:18.715 sec]  78s                     (compiler with normal pattern matching compiling the compiler for -Yvirtpatmat)

you can try this yourself using `ant strap-vpm` on https://github.com/adriaanm/scala-dev/tree/topic/virtpatmat-bootstrap

I'd be happy to have these results confirmed/refuted independently. If these are accurate, I think we're getting there faster than expected.
No-option for the win!

next step: tri-state ints for lazy val boolean of isInstanceOf's

On Mon, Nov 28, 2011 at 6:19 PM, Tiark Rompf <tiark...@epfl.ch> wrote:

Simon Ochsenreither

unread,
Nov 30, 2011, 9:36:47 AM11/30/11
to scala-i...@googlegroups.com, adriaa...@epfl.ch
Are these numbers also representative for “usual code”? A performance regression by 15-25% looks scary ...

Thanks!

martin odersky

unread,
Nov 30, 2011, 9:38:57 AM11/30/11
to scala-i...@googlegroups.com, adriaa...@epfl.ch
On Wed, Nov 30, 2011 at 3:36 PM, Simon Ochsenreither <simon.och...@googlemail.com> wrote:
Are these numbers also representative for “usual code”? A performance regression by 15-25% looks scary ...

I believe so. It's work in progress. And more optimizations are to come. In any case, we will not release a compiler with a comparable performance degradation.

Cheers

 -- Martin

Adriaan Moors

unread,
Nov 30, 2011, 9:48:37 AM11/30/11
to scala-i...@googlegroups.com
On Wed, Nov 30, 2011 at 3:36 PM, Simon Ochsenreither <simon.och...@googlemail.com> wrote:
Are these numbers also representative for “usual code”? A performance regression by 15-25% looks scary ...
as Martin said, this is work in progress (I started working on optimizing it about a week ago)

that said, I'm happy to provide support to anyone trying it out on "usual code", just add -Yvirtpatmat to the scalac args


cheers
adriaan

Daniel Sobral

unread,
Nov 30, 2011, 12:28:54 PM11/30/11
to scala-i...@googlegroups.com

Say, do these numbers include the suggested optimization of converting
Option tests into null tests?


--
Daniel C. Sobral

I travel to the future all the time.

Adriaan Moors

unread,
Nov 30, 2011, 5:10:19 PM11/30/11
to scala-i...@googlegroups.com
Say, do these numbers include the suggested optimization of converting Option tests into null tests?
yep, except that they're not null tests, but boolean tests (to distinguish failure and success) + unwrapped option (a successful null must not be confused with failure)

(the gist of the idea is in: https://github.com/adriaanm/scala-dev/commit/06906e245ad523da6af8185a8e0b582ccffc5747, but has been refined subsequently)

I've also implemented tri-state ints for lazy bools for caching repeated isInstanceOf tests:

[strap.lib.timer: 1:59.914 sec]     120s ==> +19%
[quick.lib.timer: 1:41.313 sec]     101s
[strap.comp.timer: 1:49.832 sec]    110s ==> +35%
[quick.comp.timer: 1:21.208 sec]    81s

so that's a bit of a bummer... the overheads went up since last time i measured (before this optimization)

I may have implemented the optimization poorly (see my first stab for yourself at https://github.com/adriaanm/scala-dev/commit/d5c8b82b6ec7897ac343ac6351eb0dd708fac65b), 
maybe I got the timing results wrong, or something else is going on -- will look into it...
on the other hand, it could be that the JVM is just good at instanceof tests :-)

adriaan

martin odersky

unread,
Nov 30, 2011, 5:16:26 PM11/30/11
to scala-i...@googlegroups.com
In my experience, the JVM is indeed surprisingly good at instance oftests. It seems like they feed right into the branch prediction logic of the processor. We tried at some point to "optimize" pattern matching by switching on a tag. It was a loss. A sequence of instanceof tests was faster.

Cheers

 -- Martin
 

Paul Phillips

unread,
Nov 30, 2011, 6:26:18 PM11/30/11
to scala-i...@googlegroups.com
On Wed, Nov 30, 2011 at 2:16 PM, martin odersky <martin....@epfl.ch> wrote:
> In my experience, the JVM is indeed surprisingly good at instance oftests.
> It seems like they feed right into the branch prediction logic of the
> processor. We tried at some point to "optimize" pattern matching by
> switching on a tag. It was a loss. A sequence of instanceof tests was
> faster.

They're fast, but no matter how fast they are they can't be faster
than reading a boolean on the stack without introducing some
additional factor. The use of $tag was trying to avoid the instanceof
tests entirely, and it brought overhead. Here the first instanceof
test is being performed, only successive identical ones are being
avoided, and the additional overhead is in the neighborhood of zero.

My guess is that the jvm is already caching and reusing the result of
the first instanceof test, and their cache is closer to the metal than
ours is.

martin odersky

unread,
Nov 30, 2011, 6:35:55 PM11/30/11
to scala-i...@googlegroups.com
Yes, I think that's plausible. -- Martin


Vlad Ureche

unread,
Nov 30, 2011, 6:47:55 PM11/30/11
to scala-i...@googlegroups.com
On Thu, Dec 1, 2011 at 12:26 AM, Paul Phillips <pa...@improving.org> wrote:


Could you test it with a list of lists? Maybe the JVM inlines the list into the pattern matcher, thus completely eliminating the tests.


Vlad

Ismael Juma

unread,
Dec 1, 2011, 3:43:48 AM12/1/11
to scala-i...@googlegroups.com
On Wed, Nov 30, 2011 at 10:10 PM, Adriaan Moors <adriaa...@epfl.ch> wrote:
on the other hand, it could be that the JVM is just good at instanceof tests :-)

It is pretty good:

"Profiling is performed at the bytecode level in the interpreter and tier one compiler. The compiler leans heavily on profile data to motivate optimistic optimizations."
"There is also a type profile for every checkcast, instanceof, and aastore. (Helps with generics.) "


Best,
Ismael

Tiark Rompf

unread,
Dec 1, 2011, 4:30:11 AM12/1/11
to scala-i...@googlegroups.com
I've also implemented tri-state ints for lazy bools for caching repeated isInstanceOf tests:

[strap.lib.timer: 1:59.914 sec]     120s ==> +19%
[quick.lib.timer: 1:41.313 sec]     101s
[strap.comp.timer: 1:49.832 sec]    110s ==> +35%
[quick.comp.timer: 1:21.208 sec]    81s

so that's a bit of a bummer... the overheads went up since last time i measured (before this optimization)

Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.

In the end i don't think we'll reach the performance baseline without generating jumps, so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.

- Tiark

Adriaan Moors

unread,
Dec 1, 2011, 4:41:00 AM12/1/11
to scala-i...@googlegroups.com
interesting! thanks for the pointer!

adriaan

Adriaan Moors

unread,
Dec 1, 2011, 4:44:57 AM12/1/11
to scala-i...@googlegroups.com
Could you test it with a list of lists? Maybe the JVM inlines the list into the pattern matcher, thus completely eliminating the tests.
I assume you're referring to my test case? The timing results I quoted come from compiling the compiler and the library with a compiler that was compiled to use this implementation of pattern matching, so I think they represent real-life performance reasonably well.

cheers
adriaan

Adriaan Moors

unread,
Dec 1, 2011, 5:27:23 AM12/1/11
to scala-i...@googlegroups.com
Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.
no, I only cached the isInstanceOf test, so yes, this could very well explain why the "optimization" didn't help
I'm not sure caching both would help, though, since other evidence suggests the JVM is simply good at this kind of caching/profiling

we could try caching the asInstanceOf anyway, but that brings me to your next point


In the end i don't think we'll reach the performance baseline without generating jumps,
so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.
The isInstanceOf caching experiment was just that, and, after sleeping on it for a night, I think I'll punt on it for now.

I agree we need a more sophisticated analysis that generates clean code -- preferably without jumps, though.

My current plan is to implement the sharing of the outcomes of common tests, which I described earlier. 
It's different from the isInstanceOf caching in that it is not lazy and that it bundles checks.

Consider a match with cases that share the same outer pattern. It compiles to something that looks like this:

// <vals> abbreviates the local variables that bind the information that follows from the tested condition (e.g., if(<typeTest>) val x1 = <casted>)

if(C1) { <vals_1> if(C2) { <vals_2> if(C3A) {matchRes = ....; keepGoing = false} // case_1

if(keepGoing) if(C1) { <vals_1'> if(C2) { <vals_2'> if(C3B) {matchRes = ....; keepGoing = false} // case_2


I propose to optimize this as follows:

// case_1
var C2satisfied = false
<hoisted vals_1 + vals_2, mutable -- only those referenced in later cases>
if(C1) { <vals_1_case_1> if(C2) { <vals_2_case_1> 
  C2satisfied = true
  <assign values to shared variables: since the tests if(C1) {<vals> ... if(C2) { <vals> ... will be dropped below, need to preserve any variables bound by them>
  if(C3A) {matchRes = ....; keepGoing = false}
}

// case_2
if(keepGoing && C2satisfied) if(C3B) {matchRes = ....; keepGoing = false} // replace old variable references by references to the shared variables


The advantage of this strategy, I think, is that it maintains the code-structure of the encoding (the size of the generated tree is linear in the size of the original matcher's tree)

if it turns out that is is inefficient to encode an automaton (that does not loop or backtrack [*]) by if-testing mutable flags, I think it would be an evolutionary step to use jumps


cheers
adriaan


[*] I still need to read up a bit on automata theory, please correct me if I'm wrong on this

Jason Zaugg

unread,
Dec 1, 2011, 5:44:43 AM12/1/11
to scala-i...@googlegroups.com
On Thu, Dec 1, 2011 at 11:27 AM, Adriaan Moors <adriaa...@epfl.ch> wrote:
Is the result of asInstanceOf also cached? if not the jvm will still need to do instance checks as part of asInstanceOf.
no, I only cached the isInstanceOf test, so yes, this could very well explain why the "optimization" didn't help
I'm not sure caching both would help, though, since other evidence suggests the JVM is simply good at this kind of caching/profiling

we could try caching the asInstanceOf anyway, but that brings me to your next point


In the end i don't think we'll reach the performance baseline without generating jumps,
so maybe now is a good opportunity to think about how we can get there in a clean, high-level way.
The isInstanceOf caching experiment was just that, and, after sleeping on it for a night, I think I'll punt on it for now.

I agree we need a more sophisticated analysis that generates clean code -- preferably without jumps, though.

Comment from the bleachers: it might be interesting to consider ways to split the generated code into smaller methods (as is done for guards at the moment.) AFAIK, large methods are less likely to receive attention from hotspot.

Earlier this year, Alexander converted [1] the pattern matches that implements type conformance in the IDEA plugin to polymorphic dispatch and saw big performance improvement. My hunch was that breaking it into smaller methods might have been the telling factor, rather than the difference between dispatching based on instanceOf vs invokeVirtual.

-jason

Adriaan Moors

unread,
Dec 1, 2011, 5:49:42 AM12/1/11
to scala-i...@googlegroups.com
aha! excellent point! now to get the size heuristics right... :-)

Adriaan Moors

unread,
Dec 14, 2011, 12:10:24 PM12/14/11
to scala-i...@googlegroups.com
following a suggestion by Tiark, I stumbled upon a significant source of overhead in -Yvirtpatmat... alternatives!

who would've thought, right? it turns out the compiler uses quite a few of those

a better implementation brought virtpatmat's overhead (as defined earlier in this thread) down to about 5%, when also using -Ydead-code, which seems essential (otherwise the overhead is closer to 11% when compiling the compiler)


so, my plan: 
1) optimize simple matches that consist of only constants, type tests and alternatives, probably even emitting a switch when there are only constants and alternatives, as is done by the current pattern matcher...

2) add some simple dead-code elimination to the virtpatmat code gen, so we don't have to run with -Ydead-code, which seems a bit expensive (have a look at the jump from 2min -> 3min when I switched to the current implementation of alternatives)


cheers
adriaan


martin odersky

unread,
Dec 14, 2011, 1:23:53 PM12/14/11
to scala-i...@googlegroups.com
On Wed, Dec 14, 2011 at 6:10 PM, Adriaan Moors <adriaa...@epfl.ch> wrote:
> following a suggestion by Tiark, I stumbled upon a significant source of
> overhead in -Yvirtpatmat... alternatives!
>
> who would've thought, right? it turns out the compiler uses quite a few of
> those
>
> a better implementation brought virtpatmat's overhead (as defined earlier in
> this thread) down to about 5%, when also using -Ydead-code, which seems
> essential (otherwise the overhead is closer to 11% when compiling the
> compiler)
>
> details are
> here: https://github.com/adriaanm/scala/blob/aa1ede90eaa234fb1099ed589a7ccc9344a66a6e/bootstrapping
>
> so, my plan:
> 1) optimize simple matches that consist of only constants, type tests and
> alternatives, probably even emitting a switch when there are only constants
> and alternatives, as is done by the current pattern matcher...
>
So, it was the Scanner that crawled, right? Yes, emitting switches is
essential. You can simply leave the match expression in the trees for
them; the backend will take care of the rest.

Cheers

-- Martin

Adriaan Moors

unread,
Dec 15, 2011, 5:57:27 AM12/15/11
to scala-i...@googlegroups.com
So, it was the Scanner that crawled, right?
indeed: parser, tailcalls, liftcode and cleanup jump out as having above average per-phase overhead


 
Yes, emitting switches is essential. You can simply leave the match expression in the trees for them; the backend will take care of the rest.
yep, doing that today, should be easy enough

cheers
adriaan 

Adriaan Moors

unread,
Dec 16, 2011, 1:59:56 PM12/16/11
to scala-i...@googlegroups.com
just a quick status update: -Ydead-code -Ycloselim now has the same running times as -Ydead-code -Ycloselim, or at least within margin of error:

[strap.lib.timer: 1:53.893 sec]
[quick.lib.timer: 1:53.159 sec]

[strap.comp.timer: 1:45.424 sec]
[quick.comp.timer: 1:43.849 sec]

below are the strap.comp per-phase timings in ms, with the percentages comparing to a compiler (quick.comp) running with the same switches, so performing the same work, but using normal pattern matching:

         icode 5979 115.49%
     tailcalls 781 113.02%
       uncurry 3231 108.61%
       flatten 548 108.09%
     refchecks 2858 108.01%
    lambdalift 1709 106.95%
  constructors 1112 106.21%
       erasure 9373 103.40%
         mixin 2267 103.37%
        parser 1906 102.03%
         typer 35624 100.69%
 explicitouter 2435 99.75%
superaccessors 1035 99.52%
       pickler 808 99.51%
           jvm 10733 98.46%
      closelim 5200 97.58%
           dce 13183 97.28%
    specialize 2754 95.03%

note the rows with < 100% :-)


next week I'll investigate the patterns in tailcalls and icode

cheers
adriaan


ps: I realise these statistics need to be taken with a grain of salt, i'll be performing more benchmarking on other projects, as well as micro-benchmarks, where statistics is easier to apply -- help with this would be greatly appreciated!
Reply all
Reply to author
Forward
0 new messages