parallelizing an optimization phase

Miguel Garcia

Jan 22, 2012, 6:27:05 AM1/22/12

After some performance improvements to the optimizer [1] [2], I wanted to try for real parallelizing one of the optimization phases (say, one of the "easier" ones). The hypothesis was that typer was going to be the single source of races. Thus I picked an optimization phase (dead code elimination) that does most of its work without invoking Tree.tpe (relatively speaking, at least). After synchronizing on global, the concurrent version takes on average 1/3 of the time of the sequential one (no matter how many more threads are used, due to contention on global).
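The shape of that experiment can be sketched roughly as follows (all names below are hypothetical stand-ins, not the actual compiler types): each IMethod-sized unit of work goes to a fixed thread pool, and only the rare typer-touching step takes the lock on the shared global.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object ParallelDceSketch {
  object global                                  // stand-in for the compiler's single shared lock
  final case class IMethod(name: String)         // stand-in for the icode method

  private val processed = new AtomicInteger(0)

  def analyze(m: IMethod): Unit = {
    val localWork = m.name.length                // race-free part: runs fully unsynchronized
    global.synchronized {
      // only the typer-touching steps (anything reaching Tree.tpe) would go here
    }
    processed.incrementAndGet()
  }

  def run(methods: Seq[IMethod], nThreads: Int): Int = {
    val pool = Executors.newFixedThreadPool(nThreads)
    methods.foreach(m => pool.execute(() => analyze(m)))
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    processed.get
  }
}
```

The contention mentioned above is visible in this shape: the more of `analyze` that sits inside the `global.synchronized` block, the closer the wall-clock time creeps back to the sequential run.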

In principle, that approach could also do the trick for Inliner (to recap, Inliner is the largest contributor to compilation time under -optimize), with the caveats that Inliner is definitely more typer-reliant and that perhaps the best speedup achievable would be 1/2. All that seems worth exploring in more depth. What you can find is:

  (a) background material at The Scala Compiler Corner [3]
  (b) sources for parallel DeadCodeElimination at

The idioms used in the parallel version really boil down to two:

  (c) making mutable shared state thread-confined (e.g., Linearizer and Peephole are now instance-level and thus not shared across tasks submitted to the Executor)

  (d) now comes the real synchronization. Say, Instruction.produced invokes typer (at least for some overrides). A working solution is:

    private def getProduced(i: Instruction): Int = {
      if (i.isInstanceOf[opcodes.CALL_METHOD]) {
        global synchronized i.produced // locked because CALL_METHOD.produced() invokes producedType
      } else i.produced
    }

That's all there is to it. It can be tricky because (thinking aloud) -Ydebug or -Ylog might result in typer being invoked (I haven't tried this). However, the same principles apply.
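For completeness, idiom (c) needs no locking at all. Here's a hedged sketch (the class below is an invented stand-in, not the optimizer's real Peephole) of moving a formerly phase-level singleton to one instance per task:

```scala
import scala.collection.mutable.ListBuffer

// Stand-in for a transformer with internal mutable state, like Peephole or Linearizer.
final class PeepholeSketch {
  private val seen = ListBuffer.empty[String]    // mutable, but confined to one instance
  def optimize(instr: String): String = { seen += instr; instr.trim }
}

// Each task constructs its own instance, so no two threads ever share `seen`.
def makeTask(instrs: List[String]): () => List[String] =
  () => { val p = new PeepholeSketch; instrs.map(p.optimize) }
```

Because the instance is created inside the task itself, its mutable buffer is published to exactly one thread and requires no synchronization.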

Perhaps this prototype can evolve into a parallel version of dce in trunk (contributions are welcome!).





Paul Phillips

Jan 25, 2012, 4:45:56 PM1/25/12
On Sun, Jan 22, 2012 at 3:27 AM, Miguel Garcia <> wrote:
> That's all there's to it. It can be tricky because (thinking aloud) -Ydebug
> or -Ylog might result in typer being invoked (I haven't tried this).
> However, the same principles apply.

Well, it's not supposed to. There was a time when you couldn't print
much of anything without hosing yourself. Now, AFAIK, you can quite
safely print symbols; printing trees may or may not be safe, not sure;
printing types is not safe, haven't looked into why.

Why can you print symbols?

scala> intp("scala.collection.parallel.mutable.ParMap").info.members
map (_.defString) >

def withDefaultValue: <?>
def withDefault: <?>

Because it is discriminating about it. Give the infos a little nudge,
see some more.

scala> intp("scala.collection.parallel.mutable.ParMap").info.members
foreach (

scala> intp("scala.collection.parallel.mutable.ParMap").info.members
map (_.defString) >
def withDefaultValue(d: <?>): scala.collection.parallel.mutable.ParMap[K,V]
def withDefault(d: <?>): scala.collection.parallel.mutable.ParMap[K,V]

All manifestations of the typer being invoked (or really, much of
anything being invoked) when calling toString should in my view be
considered bugs. If necessary there could be a flag somewhere which
is flipped into forcing mode when printing errors or wherever else it
is that one actually wants to influence the outcome by printing
something. But the default should be to affect nothing.
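The "flag flipped into forcing mode" idea can be sketched in miniature (all names here are invented; this is not how the compiler implements anything): a toString that prints a placeholder by default, mirroring the <?> markers above, and only completes the underlying info inside an explicit debug scope.

```scala
object ForcingFlag {
  @volatile private var forcing = false

  // Run `body` with forcing enabled, restoring the previous state afterwards.
  def withForcing[T](body: => T): T = {
    val saved = forcing
    forcing = true
    try body finally forcing = saved
  }

  // A lazily-completed "info": printing it must not trigger completion.
  final class Info(complete: () => String) {
    private var cached: Option[String] = None
    override def toString: String = cached.getOrElse {
      if (forcing) { val v = complete(); cached = Some(v); v }
      else "<?>"                                 // default: affect nothing
    }
  }
}
```

Under this scheme, only the error-reporting path (or wherever one actually wants printing to force things) opts in via `withForcing`; everywhere else, printing stays side-effect free.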

Miguel Garcia

Jan 26, 2012, 4:39:54 AM1/26/12

Got some (small) news on this. My initial reservations about a less-than-ideal speedup in 'parallel dce' were unfounded. As the chart shows, a single outlier is what makes the speedup sublinear. In detail, the smallest unit of work is an IMethod, and it so happens that a certain 'CommentFactory.parse0$1' takes 7 seconds to process. By that time all other threads are done with their thousands and thousands of IMethods and are sitting idle.
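A back-of-the-envelope bound makes the outlier effect concrete (the numbers below are illustrative, not measurements from this thread): since an IMethod is the smallest unit of work, parallel time cannot drop below the larger of the longest single method and total-work/threads.

```scala
// Lower bound on parallel wall-clock time with indivisible tasks.
def bestParallelTime(tasks: Seq[Double], threads: Int): Double =
  math.max(tasks.max, tasks.sum / threads)

val tasks      = Seq.fill(1000)(0.02) :+ 7.0   // many small methods plus one 7 s outlier
val sequential = tasks.sum                     // about 27 s of total work
val fourWay    = bestParallelTime(tasks, 4)    // 7.0 s: bounded by the outlier, not by 27/4
```

So even with unbounded threads, the speedup on such a workload caps out near 27/7, i.e. under 4x: the sublinear shape of the chart.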

I want to gather evidence (this time about Inliner) before claiming for sure the technique is a winner. We're almost there, however :)



Daniel Sobral

Jan 26, 2012, 8:03:15 AM1/26/12
How can I benchmark this particular method? There are some changes I'd
like to try out.

Oops, I gotta go. Let me quote SI-5045:

"When using a regular expression a user can state explicitly that they
want anchoring and the Regex library should not mandate that decision.
As an aside this often leads to people surrounding their patterns with
".*PTRN.*" which can turn a nice fast pattern into one with heavy
backtracking. See Mastering Regular Expressions by Friedl for

And now let me quote the code:

/** The start of a scaladoc code block */
protected val CodeBlockStart =
new Regex("""(.*)((?:\{\{\{)|(?:\u000E<pre(?:

/** The end of a scaladoc code block */
protected val CodeBlockEnd =
new Regex("""(.*)((?:\}\}\})|(?:\u000E</pre>\u000E))(.*)""")

Short of adopting the suggestion made in the ticket, replacing .* with
.*? at the start might result in big gains.
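The difference between the greedy and lazy leading group can be seen on a simplified CodeBlockStart-style pattern (the \u000E/<pre> alternative is dropped here for brevity): the greedy version scans to the end of the line and backtracks to the last {{{, while the lazy version stops at the first one.

```scala
val greedy = """(.*)(\{\{\{)(.*)""".r    // leading group grabs as much as it can
val lazyG  = """(.*?)(\{\{\{)(.*)""".r   // leading group grabs as little as it can

val line = "text {{{ a }}} more {{{ b"
val greedy(g, _, _) = line               // g == "text {{{ a }}} more "
val lazyG(l, _, _)  = line               // l == "text "
```

On lines with no {{{ at all, the greedy pattern is where the heavy backtracking kicks in, which is why the reluctant quantifier (or anchoring, per the ticket) helps.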

Daniel C. Sobral

I travel to the future all the time.

Miguel Garcia

Jan 26, 2012, 8:52:34 AM1/26/12

I guess there's an interesting performance story to be told for `CommentFactory$class.parse0$1()` (now I've got its name right). However, given that I'm not familiar with that code, all I've measured is the time it takes for DeadCodeElimination (rather, my parallel version of it) to analyze its ICode at compile time. I haven't looked into what that ICode can be, nor its runtime impact. As far as parallel-dce is concerned, it's already doing its job: speeding things up as much as possible.

