Miniboxing vs. scratchpads for avoiding combinatorial explosion of specialization

164 views
Skip to first unread message

Rex Kerr

unread,
Apr 12, 2013, 11:16:55 PM4/12/13
to scala-debate
Miniboxing (being worked on by Vlad Ureche) is a form of specialization where you pack primitives into a Long so that if you have N specialized arguments you only have 2^N instead of 9^N different versions to worry about; the compiler would then handle the boxing and unboxing from long.

Scratchpads (proposed by me under another name to Martin in a year or two ago (I think?) and re-introduced in the recent primitive-array-performance thread) are a form of specialization where you separate each argument of a function into a separate load function, which is stored, and only when the last item is loaded is the computation run (which is then stored in another variable for readout).  If you have N specialized arguments you only have 9N to worry about instead of 9^N.

Note that miniboxing is more compact up to N=6; from N=7 on, scratchpads win in that regard.

It occurred to me that the completely general form of scratchpads are unnecessary for a lot of the most common Function2 signatures; instead, you can get away with an Out (return type) whose value is (secretly) mangled by an In (argument type), as long as the Out can be initialized.  So I decided to how this would compare with miniboxing.

The test is to either sum an array of doubles or to count the NaNs in the array.  The baseline is using a primitive array by hand; the contenders are either the stock library sum/count methods, miniboxing which I've created by hand, or in-out scratchpads which I've created by hand.

The bottom line is that in this test, scratchpads _are as fast as doing it by hand_.  Miniboxing puts in a decent showing at 1.6x to 3.9x slower.  The library routines take a depressing 10-18x longer.  It's pretty much independent of array size (tiny arrays are hard to measure because of measurement overhead), but here's an example for arrays of 1000:

Array size 1000
Benchmark comparison (in 725.6 ms): sum prim vs lib
Significantly different (p ~= 0)
  Time ratio:    16.61871   95% CI 16.25970 - 16.97772   (n=20)
    First     919.8 ns   95% CI 905.5 ns - 934.2 ns
    Second    15.29 us   95% CI 15.06 us - 15.52 us

Benchmark comparison (in 659.4 ms): sum prim vs mini
Significantly different (p ~= 0)
  Time ratio:    3.91488   95% CI 3.88437 - 3.94538   (n=20)
    First     900.9 ns   95% CI 896.0 ns - 905.8 ns
    Second    3.527 us   95% CI 3.507 us - 3.547 us

Benchmark comparison (in 670.0 ms): sum prim vs io
Not significantly different (p ~= 0.9366)
  Time ratio:    0.99969   95% CI 0.99198 - 1.00740   (n=20)
    First     901.0 ns   95% CI 896.1 ns - 905.9 ns
    Second    900.7 ns   95% CI 895.8 ns - 905.6 ns

Benchmark comparison (in 602.6 ms): count prim vs lib
Significantly different (p ~= 0)
  Time ratio:    11.32286   95% CI 11.07644 - 11.56927   (n=20)
    First     918.9 ns   95% CI 907.4 ns - 930.5 ns
    Second    10.41 us   95% CI 10.22 us - 10.59 us

Benchmark comparison (in 622.2 ms): count prim vs mini
Significantly different (p ~= 0)
  Time ratio:    1.65292   95% CI 1.63907 - 1.66677   (n=20)
    First     919.9 ns   95% CI 915.0 ns - 924.8 ns
    Second    1.521 us   95% CI 1.511 us - 1.530 us

Benchmark comparison (in 681.4 ms): count prim vs io
Not significantly different (p ~= 0.8906)
  Time ratio:    1.00052   95% CI 0.99296 - 1.00808   (n=20)
    First     920.0 ns   95% CI 915.0 ns - 924.9 ns
    Second    920.4 ns   95% CI 915.5 ns - 925.3 ns

Rex Kerr

unread,
Apr 12, 2013, 11:21:23 PM4/12/13
to scala-debate
...er, hit send by mistake.  Anyway, I think miniboxing is quite hopeful, but I am not sure that we should discount the potential benefits of scratchpads either.

The benchmark code is appended below.  Note that I've used my benchmarking library for it.  Someone could convert this to Caliper etc. easily enough, I imagine.

  --Rex

object BoxTest {
  import java.lang.Double.{doubleToRawLongBits => d2b, longBitsToDouble => b2d}
  import java.lang.Double.{isNaN => nan}
  abstract class MiniFun2 {
    def apply(a: Long, b: Long): Long
  }
  def minifoldD(a: Array[Double], zero: Long, mf: MiniFun2) = {
    var i = 0
    var s = zero
    while (i < a.length) {
      s = mf(s, d2b(a(i)))
      i += 1
    }
    s
  }
  def minisum(a: Array[Double]): Double = {
    val mf = new MiniFun2 { def apply(a: Long, b: Long) = d2b(b2d(a)+b2d(b)) }
    b2d(minifoldD(a, d2b(0.0), mf))
  }
  def minicountnan(a: Array[Double]): Int = {
    val mf = new MiniFun2 {
      def apply(a: Long, b: Long) = a + (if (nan(b2d(b))) 1 else 0)
    }
    minifoldD(a, d2b(0.0), mf).toInt
  }

  trait In[@specialized A] { def apply(a: A): Unit }
  trait Out[@specialized B] { def init(zero: B): Unit; def get: B }
  def ifoldD[@specialized A](a: Array[A], fi: In[A]): fi.type = {
    var i = 0
    while (i < a.length) {
      fi(a(i))
      i += 1
    }
    fi
  }
  def iosum(a: Array[Double]): Double = {
    class anon extends In[Double] with Out[Double] {
      private[this] var acc: Double = _
      def apply(a: Double) { acc += a }
      def init(b: Double) { acc = b }
      def get = acc
    }
    val af = new anon
    af.init(0.0)
    ifoldD(a, af).get
  }
  def iocountnan(a: Array[Double]): Int = {
    class anon extends In[Double] with Out[Int] {
      private[this] var acc: Int = _
      def apply(a: Double) { acc += (if (nan(a)) 1 else 0) }
      def init(b: Int) { acc = b }
      def get = acc
    }
    val af = new anon
    af.init(0)
    ifoldD(a, af).get
  }

  def libsum(a: Array[Double]): Double = a.sum
  def libcountnan(a: Array[Double]): Int = a.count(x => nan(x))
  def primsum(a: Array[Double]): Double = {
    var i = 0
    var s = 0.0
    while (i < a.length) { s += a(i); i += 1 }
    s
  }
  def primcountnan(a: Array[Double]): Int = {
    var i,s = 0
    while (i < a.length) { s += (if (nan(a(i))) 1 else 0); i += 1 }
    s
  }

  import ichi.bench.Thyme
  def run(th: Thyme, n: Int) {
    val a = Array.tabulate(n){ i => if (i%3 == 0) Double.NaN else 1.1 }
    println("\n\nArray size "+n)
    th.pbenchOff("sum prim vs lib"){primsum(a)}{libsum(a)}
    println
    th.pbenchOff("sum prim vs mini"){primsum(a)}{minisum(a)}
    println
    th.pbenchOff("sum prim vs io"){primsum(a)}{iosum(a)}
    println
    th.pbenchOff("count prim vs lib"){primcountnan(a)}{libcountnan(a)}
    println
    th.pbenchOff("count prim vs mini"){primcountnan(a)}{minicountnan(a)}
    println
    th.pbenchOff("count prim vs io"){primcountnan(a)}{iocountnan(a)}
  }

  def main(args: Array[String]) {
    val th = new Thyme
    th.pbenchOff(){}{}  // Warmup
    for (n <- Seq(3, 10, 100, 1000, 10000)) run(th, n)
  }
}



Erik Osheim

unread,
Apr 13, 2013, 12:11:17 PM4/13/13
to Rex Kerr, scala-debate
On Fri, Apr 12, 2013 at 11:21:23PM -0400, Rex Kerr wrote:
> ...er, hit send by mistake. Anyway, I think miniboxing is quite hopeful,
> but I am not sure that we should discount the potential benefits of
> scratchpads either.
>
> The benchmark code is appended below. Note that I've used my benchmarking
> library for it. Someone could convert this to Caliper etc. easily enough,
> I imagine.

Also remember that with macros it's possible to generate specialized
extension methods without necessarily having a combinatorial explosion
in the standard library.

I do like the scratch pad pattern. We used it in GeoTrellis to get
"manual specialization" working during raster construction. I'm on the
fence about whether I think it's the future direction the standard
library should go (versus miniboxing, improved templating, or something
macro-based).

-- Erik

Rex Kerr

unread,
Apr 13, 2013, 12:39:52 PM4/13/13
to Erik Osheim, scala-debate
The advantage of scratch pads are not in simple cases like array traversal but in complex cases such as traversing a trie, where you have a huge amount of code for which the specialization is irrelevant.  Then you would like to separate the traversal code from the function evaluation code as much as possible; this is about as much as is possible.

Macros would work also, but if the function is heavily used you can still end up with more explosion than you'd like (especially the way they work now which would be like the bad old days of C++ templates before linkers got smart enough to merge duplicates).  Memoized quasiquoting would probably deliver the best of all worlds.

  --Rex

Vlad Ureche

unread,
Apr 13, 2013, 2:23:07 PM4/13/13
to Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate
Hi Rex,

Thanks for explaining the scratch pads idea, it's very interesting. The more I look at it, the more I think we're moving towards separating data and operations:
 - scratchpads are the boldest of all: they extract the accumulator in a separate class where it's used with primitive types, so they even separate data from methods, not only from objects
 - extension methods, if I understand correctly, leave data in the objects and move code outside
 - Greg's proposal of using method handles assumes there is no specialized data in the class (Greg, please correct me if I'm wrong, this is what I understood last time we discussed it at EPFL, you probably developed it more in the meantime)
 - miniboxing is the shyest one of all, it still encodes data in the class, but does so in a more uniform way, thus getting the slower 2^N explosion

A question we should ask ourselves is whether separating code and data is good or bad:
 - if we make a closed world assumption, where the program becomes stand-alone, LMS proved time and time again that it's best to separate data and operations
 - if we're in an open world, where classes may be extended from outside... we don't really know. We'd be giving up our nice modularity, but we'd be gaining performance. To me it seems all our efforts push in this direction, we only need to find a good tradeoff. Another question to ask ourselves is whether we should do it at compilation time, or let the virtual machine do it, since it can treat the world as closed and revise its decisions later if proven wrong?

As for the example you posted, I'm wondering if we could automate it, such that the user doesn't have to do the transformation by hand? Unfortunately I've got my hands full at the moment with miniboxing, but if someone would be willing to give it a shot, I'd love to see it developed further. (who knows, I may even try hacking something after I'm done with miniboxing)

Cheers,
Vlad

PS: Rex, can you also try miniboxing with integers or bytes? I think you'll be surprised by the results. :)
I've also been puzzled by the slowdown when using doubleToRawLongBits/floatToRawIntBits, I think it's related to the fact that they're treated as intrinsics in Hotspot, so they don't get the nice optimizations other methods get.

2013/4/13 Rex Kerr <ich...@gmail.com>
--
You received this message because you are subscribed to the Google Groups "scala-debate" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-debate...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Som Snytt

unread,
Apr 13, 2013, 2:46:48 PM4/13/13
to Rex Kerr, scala-debate
On Fri, Apr 12, 2013 at 8:16 PM, Rex Kerr <ich...@gmail.com> wrote:
proposed by me under another name

Which name were you using back then?

I kind of thought Rex Kerr must be an alias, because it sounds like a Hollywood leading man.

Just realized that in filing order, it becomes "Kerr, Rex" or "car wrecks."  Now I'm sure it's an alias.

I recently learned Charlton Heston was born John Carter.  Maybe that movie John Carter would have done better if it changed its name to Charlton Heston.

(Wide-ranging OT is allowed on scala-debate, right?)

After a recent quip about "name mangling," I got hooked on anagrams of personal names.  Mine is "Karma rewind", which explains everything.

Are there any Martin Odersky commits fighting bit rot with the commit message: "Amend Risky Rot"?

Most a propos: Dark Rye Monist

How does he dress?  Dorky Raiments

If he suffers to make Scala great, it's because he is one of the Martyred Ikons.

But I'd be curious whether this guy has committed to any foss projects: Dirk Monastery

Well, all of these aliases could be deemed: Tardy Monikers

(And that's only in the English of http://www.wordsmith.org/anagram/)

Rex Kerr

unread,
Apr 13, 2013, 2:58:26 PM4/13/13
to Som Snytt, scala-debate
On Sat, Apr 13, 2013 at 2:46 PM, Som Snytt <som....@gmail.com> wrote:
On Fri, Apr 12, 2013 at 8:16 PM, Rex Kerr <ich...@gmail.com> wrote:
proposed by me under another name

Which name were you using back then?

Hahaha.  That's what happens when I accidentally hit post--no proofreading for unclear references.

I am sure I called it something other than scratchpads back then.  Maybe I didn't even have a name.  I forget.

Maybe I forget my own name, also.

   --???

Rex Kerr

unread,
Apr 13, 2013, 4:04:59 PM4/13/13
to Vlad Ureche, Grzegorz Kossakowski, Erik Osheim, scala-debate
On Sat, Apr 13, 2013 at 2:23 PM, Vlad Ureche <vlad....@gmail.com> wrote:
Hi Rex,

Thanks for explaining the scratch pads idea, it's very interesting. The more I look at it, the more I think we're moving towards separating data and operations:
 - scratchpads are the boldest of all: they extract the accumulator in a separate class where it's used with primitive types, so they even separate data from methods, not only from objects

Well, I don't know if it's bold in that regard because it keeps data and operation together inside itself.  It wraps capability in a different way than your usual function, though--it pulls out the relevant data into a private copy so it can do the efficient thing.
 
 - extension methods, if I understand correctly, leave data in the objects and move code outside
 - Greg's proposal of using method handles assumes there is no specialized data in the class (Greg, please correct me if I'm wrong, this is what I understood last time we discussed it at EPFL, you probably developed it more in the meantime)
 - miniboxing is the shyest one of all, it still encodes data in the class, but does so in a more uniform way, thus getting the slower 2^N explosion

A question we should ask ourselves is whether separating code and data is good or bad:
 - if we make a closed world assumption, where the program becomes stand-alone, LMS proved time and time again that it's best to separate data and operations
 - if we're in an open world, where classes may be extended from outside... we don't really know. We'd be giving up our nice modularity, but we'd be gaining performance. To me it seems all our efforts push in this direction, we only need to find a good tradeoff. Another question to ask ourselves is whether we should do it at compilation time, or let the virtual machine do it, since it can treat the world as closed and revise its decisions later if proven wrong?

I don't see how this is any different than having functions.  It's incredibly powerful to be able to insert custom computations into some process.  This capacity becomes part of your interface, and you can subclass as much as you like as long as you respect the signature (and, hopefully, enough of the behavior to not seem too weird).
 

As for the example you posted, I'm wondering if we could automate it, such that the user doesn't have to do the transformation by hand?

Yeah, it's a straightforward transformation.  You probably couldn't safely do it on bare functions since you can't actually tell what the zero really is, but you could with slightly modified syntax.  I don't have time to work on this, but I can illustrate a possible way to do it:

  (a: A)(f: (A,B) => A): A

must be (by hand) rewritten as:
  (@Out[A] f: In[B]): f.type
which will give the same syntax as the above by using the (a: A) argument to initialize the anonymous In[B] with Out[A], and then calling f.get on the return value.
 
Unfortunately I've got my hands full at the moment with miniboxing, but if someone would be willing to give it a shot, I'd love to see it developed further. (who knows, I may even try hacking something after I'm done with miniboxing)

Cheers,
Vlad

PS: Rex, can you also try miniboxing with integers or bytes? I think you'll be surprised by the results. :)

No, no.  I'm well aware of how happy small integer types are to go into big integer types.  In fact, you can in special cases end up faster with bytes since you don't (necessarily) have to cast back after the JVM leaves your result as an int.

I actually have been meaning to propose miniboxing for under 32 bit values for a while now as a way to keep Function2 small enough and still enable fast operations on strings and bytes.
 
I've also been puzzled by the slowdown when using doubleToRawLongBits/floatToRawIntBits, I think it's related to the fact that they're treated as intrinsics in Hotspot, so they don't get the nice optimizations other methods get.

Maybe so.  I'm not sure.  It might just be register hopping.  If you do this:

      d += java.lang.Double.longBitsToDouble(i);
      s += java.lang.Double.doubleToRawLongBits(d);

you end up with this bytecode (in Hotspot, as of 6u25, which is the last easily available debug build I can find for +PrintOptoAssembly)

245       movq    [rsp + #0], R10    # MoveL2D_reg_stack
249       movsd   XMM0, [rsp + #0]    # double stk
24e       addsd   XMM2, XMM0
252       movd    R10,XMM2    # MoveD2L
257       addq    RAX, R10    # long

This is not exactly optimal--there's the stack operation which hardly helps (on the L2D side, not the D2L side), and then it's using two different registers.  I hope it's right that the transfer is worth the faster operation in appropriate registers.  gcc doesn't think so--it basically just skips 249 and uses the [rsp+#0] value directly in the addsd--but maybe it just doesn't do sufficiently aggressive register optimization.  If I had more time I'd try miniboxing in C++ and see if it can beat Java.

  --Rex

Vlad Ureche

unread,
Apr 13, 2013, 5:27:36 PM4/13/13
to Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate

2013/4/13 Rex Kerr <ich...@gmail.com>

 
Yeah, it's a straightforward transformation.  You probably couldn't safely do it on bare functions since you can't actually tell what the zero really is, but you could with slightly modified syntax.  I don't have time to work on this, but I can illustrate a possible way to do it:

  (a: A)(f: (A,B) => A): A

must be (by hand) rewritten as:
  (@Out[A] f: In[B]): f.type
which will give the same syntax as the above by using the (a: A) argument to initialize the anonymous In[B] with Out[A], and then calling f.get on the return value.

The In[A] with Out[B] seems a bit like what the Adobe guys did for GIL[1]. They pushed the type parameters as deep as possible in the methods, and what you're proposing is to push them even more, splitting a class Function2[T1, T2, R] into In1[T1] with In2[T2] with Out[R]. The question is whether you can do this generally, in an automated fashion, with just an annotation at most? For example, how would the compiler infer it needs to move the accumulator from the map code to the function?


I actually have been meaning to propose miniboxing for under 32 bit values for a while now as a way to keep Function2 small enough and still enable fast operations on strings and bytes.

(shameless advertising) That's more or less what I plan to present at ScalaDays, if my talk is accepted: an example with folding and Function2 that is optimized using the @minispec annotation. :D

 
 
I've also been puzzled by the slowdown when using doubleToRawLongBits/floatToRawIntBits, I think it's related to the fact that they're treated as intrinsics in Hotspot, so they don't get the nice optimizations other methods get.

Maybe so.  I'm not sure.  It might just be register hopping.  If you do this:

      d += java.lang.Double.longBitsToDouble(i);
      s += java.lang.Double.doubleToRawLongBits(d);

you end up with this bytecode (in Hotspot, as of 6u25, which is the last easily available debug build I can find for +PrintOptoAssembly)

245       movq    [rsp + #0], R10    # MoveL2D_reg_stack
249       movsd   XMM0, [rsp + #0]    # double stk
24e       addsd   XMM2, XMM0
252       movd    R10,XMM2    # MoveD2L
257       addq    RAX, R10    # long

This is not exactly optimal--there's the stack operation which hardly helps (on the L2D side, not the D2L side), and then it's using two different registers.  I hope it's right that the transfer is worth the faster operation in appropriate registers.  gcc doesn't think so--it basically just skips 249 and uses the [rsp+#0] value directly in the addsd--but maybe it just doesn't do sufficiently aggressive register optimization. 

That's just the tip of the iceberg. As I learned the hard way, intrinsics don't play well with the JVM the array optimizations. So getting a runtime support library that has good array performance is not easy. The paper where I described this is under submission, but I'll make it public as soon as I can.

 
If I had more time I'd try miniboxing in C++ and see if it can beat Java.

It does. (well, it's not exactly the same comparison, but the array overhead adds up in favor of C/C++)


[1] Lubomir Bourdev, Jaakko Järvi - Efficient run-time dispatching in generic programming with minimal code bloat

Rex Kerr

unread,
Apr 13, 2013, 6:06:55 PM4/13/13
to Vlad Ureche, Grzegorz Kossakowski, Erik Osheim, scala-debate
On Sat, Apr 13, 2013 at 5:27 PM, Vlad Ureche <vlad....@gmail.com> wrote:

2013/4/13 Rex Kerr <ich...@gmail.com>
 
Yeah, it's a straightforward transformation.  You probably couldn't safely do it on bare functions since you can't actually tell what the zero really is, but you could with slightly modified syntax.  I don't have time to work on this, but I can illustrate a possible way to do it:

  (a: A)(f: (A,B) => A): A

must be (by hand) rewritten as:
  (@Out[A] f: In[B]): f.type
which will give the same syntax as the above by using the (a: A) argument to initialize the anonymous In[B] with Out[A], and then calling f.get on the return value.

The In[A] with Out[B] seems a bit like what the Adobe guys did for GIL[1]. They pushed the type parameters as deep as possible in the methods, and what you're proposing is to push them even more, splitting a class Function2[T1, T2, R] into In1[T1] with In2[T2] with Out[R]. The question is whether you can do this generally, in an automated fashion, with just an annotation at most?
 

Yes, but the insight I had that prompted me to type this benchmark up now was that we _almost never use the full power of a Function2_.  It's almost always (A,B) => A or (A,B) => B, or some widened version which we could plug in instead.

These sort of accumulator/transformers have the form A => (B => A), where the B => A does all the interesting stuff and the first call is just used to set state.

So the general transform would be
  (a,b) => f(a,b)
to
  loadA(a) { _a = a }
  loadB(b) { _c = f(_a,b) }
  get = _c
but this wouldn't necessarily help if one method was required to use all three.  And normally you _are_ required to use all three because the mental model for functions is that they're stateless.

 
For example, how would the compiler infer it needs to move the accumulator from the map code to the function?

It couldn't.  That's why the signature has to change--as I said, it must be rewritten manually.  Well, of course you could write a macro/signal with an annotation to do it automatically:

  (@zero a: A)(@accum f:(A,B) => A): A

or something.  The specific transform isn't that hard:

  foo(a: A)(f: (A,B)=> A): A

  now:
  t = new Function2[A,B]{ apply(a: A, b: B) = f(a,b) }
  foo(a)(t)

  then:
  s = new Scratch(a) { load(b: B) { _a = f(_a,b) } }
  foo(s).get

but you'd have to allow a to be a var and make sure that all code involving a was separable in the right way.  I really think it'd be too fiddly to have the library code be identical--better off just defining the separate interface and let people do it that way.  Yes, you have to redo some work, but as a plus, you're better expressing what the requirements actually are of the computation.  That is, having a well-defined interface to side-effects can be powerful when it comes to splitting concerns.



I actually have been meaning to propose miniboxing for under 32 bit values for a while now as a way to keep Function2 small enough and still enable fast operations on strings and bytes.

(shameless advertising) That's more or less what I plan to present at ScalaDays, if my talk is accepted: an example with folding and Function2 that is optimized using the @minispec annotation. :D

Well, if I go (I submitted an abstract to give a more general performance-tuning talk), I will look forward to hearing that!

Incidentally, array-handling is another strong point of scratchpads.  The type is always correct, so it doesn't have to worry about oh-I-called-it-a-Long-but-it-was-a-Float-and-oh-where-do-I-put-the-spare-Byte-ugh.
 
  --Rex

Vlad Ureche

unread,
Apr 13, 2013, 6:47:30 PM4/13/13
to Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate
@Rex: Sorry for the duplicate email, I hit reply instead of reply to all.

2013/4/14 Rex Kerr <ich...@gmail.com>


Yes, but the insight I had that prompted me to type this benchmark up now was that we _almost never use the full power of a Function2_.  It's almost always (A,B) => A or (A,B) => B, or some widened version which we could plug in instead.

These sort of accumulator/transformers have the form A => (B => A), where the B => A does all the interesting stuff and the first call is just used to set state.

So the general transform would be
  (a,b) => f(a,b)
to
  loadA(a) { _a = a }
  loadB(b) { _c = f(_a,b) }
  get = _c
but this wouldn't necessarily help if one method was required to use all three.  And normally you _are_ required to use all three because the mental model for functions is that they're stateless.

I see. There's a very cool place we could put this pattern to good use: returning value classes with multiple arguments. I'm not sure to what extent this would improve the performance, as escape analysis probably eliminates the object creation when the code is inlined. But it may be worth the effort for the interpreter and if the method returning the value class is not inlined.

 
For example, how would the compiler infer it needs to move the accumulator from the map code to the function?

It couldn't.  That's why the signature has to change--as I said, it must be rewritten manually.  Well, of course you could write a macro/signal with an annotation to do it automatically:

  (@zero a: A)(@accum f:(A,B) => A): A

or something.  The specific transform isn't that hard:

  foo(a: A)(f: (A,B)=> A): A

  now:
  t = new Function2[A,B]{ apply(a: A, b: B) = f(a,b) }
  foo(a)(t)

  then:
  s = new Scratch(a) { load(b: B) { _a = f(_a,b) } }
  foo(s).get

but you'd have to allow a to be a var and make sure that all code involving a was separable in the right way.

Separating the code is what I fear the most, but it would be interesting to explore to what extent it can be done automatically.


I really think it'd be too fiddly to have the library code be identical--better off just defining the separate interface and let people do it that way.  Yes, you have to redo some work, but as a plus, you're better expressing what the requirements actually are of the computation.  That is, having a well-defined interface to side-effects can be powerful when it comes to splitting concerns.
Well, if I go (I submitted an abstract to give a more general performance-tuning talk), I will look forward to hearing that!

I'd love to brainstorm more on these topics, assuming my talk gets in.

 
Incidentally, array-handling is another strong point of scratchpads.  The type is always correct, so it doesn't have to worry about oh-I-called-it-a-Long-but-it-was-a-Float-and-oh-where-do-I-put-the-spare-Byte-ugh.

I agree a manual transformation is more powerful. It will always be, if you're willing to do it.
But getting 10x-22x performance boost with an annotation and without significant duplication is still a pretty good tradeoff, don't you think?

Cheers,
Vlad

Rex Kerr

unread,
Apr 13, 2013, 7:02:23 PM4/13/13
to Vlad Ureche, Grzegorz Kossakowski, Erik Osheim, scala-debate
On Sat, Apr 13, 2013 at 6:47 PM, Vlad Ureche <vlad....@gmail.com> wrote:
@Rex: Sorry for the duplicate email, I hit reply instead of reply to all.

2013/4/14 Rex Kerr <ich...@gmail.com>

Yes, but the insight I had that prompted me to type this benchmark up now was that we _almost never use the full power of a Function2_.  It's almost always (A,B) => A or (A,B) => B, or some widened version which we could plug in instead.

These sort of accumulator/transformers have the form A => (B => A), where the B => A does all the interesting stuff and the first call is just used to set state.

So the general transform would be
  (a,b) => f(a,b)
to
  loadA(a) { _a = a }
  loadB(b) { _c = f(_a,b) }
  get = _c
but this wouldn't necessarily help if one method was required to use all three.  And normally you _are_ required to use all three because the mental model for functions is that they're stateless.

I see. There's a very cool place we could put this pattern to good use: returning value classes with multiple arguments. I'm not sure to what extent this would improve the performance, as escape analysis probably eliminates the object creation when the code is inlined. But it may be worth the effort for the interpreter and if the method returning the value class is not inlined.

Yes.  I also have something I call "Muples" and "Fuples"--muples are mutable tuples, and fuples are basically the scratchpads I've been talking about except more general; they are Muples which can do an arbitrary transform on their data when you call the appropriate method.

I have found them less useful for multiple return values and more useful for carrying state, but if you've got them anyway of course they can deliver multiple return values without problems.

But you're right; the JVM is pretty good about object-elimination escape analysis and also very good at collecting short-lived objects, so without making a compiler plugin so I can use these without a lot of syntactic overhead, I found that it was so rarely worth it that it was better to not use them at all.
 

 
Incidentally, array-handling is another strong point of scratchpads.  The type is always correct, so it doesn't have to worry about oh-I-called-it-a-Long-but-it-was-a-Float-and-oh-where-do-I-put-the-spare-Byte-ugh.

I agree a manual transformation is more powerful. It will always be, if you're willing to do it.
But getting 10x-22x performance boost with an annotation and without significant duplication is still a pretty good tradeoff, don't you think?

Absolutely.  But when you get a 22x performance boost in some applications _and a 4x loss in others_ then one is a little less excited.  So I think you're going to have to leave Float/Double alone (or pack Float into Double).  They're really important for computations; I can't afford to have the default Function2 even 2x slower before I drop Scala and go back to Java for computationally heavy work.

Since almost everything I do is computationally intensive, this does not sound appealing to me.

  --Rex
 

Vlad Ureche

unread,
Apr 13, 2013, 8:13:35 PM4/13/13
to Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate

2013/4/14 Rex Kerr <ich...@gmail.com>


Absolutely.  But when you get a 22x performance boost in some applications _and a 4x loss in others_ then one is a little less excited.  So I think you're going to have to leave Float/Double alone (or pack Float into Double).  They're really important for computations; I can't afford to have the default Function2 even 2x slower before I drop Scala and go back to Java for computationally heavy work.

I think most programmers would be quite happy with _just 6x-22x_ speedup for the collection library. I might be wrong though, please chime in if you prefer it slow...
Jokes aside, I will try separating the float/double and see what performance numbers come out.

Cheers,
Vlad

Rex Kerr

unread,
Apr 14, 2013, 2:56:40 PM4/14/13
to Vlad Ureche, Grzegorz Kossakowski, Erik Osheim, scala-debate
Hi Vlad,

I guess I should be more clear: it's fine for collections to be miniboxed, as they are already really slow and any improvement would be welcome (and it would all be improvement, at least on 64 bit JVMs).  But it's not fine for that to be the general mechanism because FunctionN can be slowed down an unacceptable amount.  Of course you don't need to have miniboxed functions (outside of the context of miniboxed collections), but there's going to be a mismatch at the interface where your collections have the full capacity to have fast operations with Byte and Char but you can't afford to have your Function2 fully specialized three ways.

There are a variety of ways to solve this, but it does need to be solved before it's a regular feature.

  --Rex

Matthew Pocock

unread,
Apr 15, 2013, 5:55:31 AM4/15/13
to Rex Kerr, Vlad Ureche, Grzegorz Kossakowski, Erik Osheim, scala-debate
So, here's my take. Specialisation is clearly necessary to get acceptable performance but the combinatorix kills you. However, in my experience you very rarely want anything very complex. (A, A) => A, (A, B) => A, (A, B) => B covers virtually every use I've needed. In fact, (A, A) => A covers probably 90% of my use (percentages brought to you from the department of off-the-top-of-my-head statistics). Is there some low-hanging fruit way to capture these type correlations?

def f[@specialized(Double, Float, Int) A, @specialized(Double, Float, Int) B, @specialized(Double, Float, Int) @OneOf(A, B) C]
   (a: A, b: B): C

It feels like there's a thinko with specialization, rather than there being some technical issues that just need a little bit of working out. Things like the field duplication and the combinatorial explosions are symptoms of this. Something is deeply gnarley where specialised and non-specialized implementations can substitute for each-other's interfaces. Specialisation is much more like C++ templates than generics. You're declaring a whole family of types or methods, each with their own code and incompatible type signatures.

Anyway, it's Monday morning and I've not even had my first bottle of full-fat coke yet, and greater minds than mine have pondered these questions. In the absence of JVM support for specialization, and while the scalac compiler lacks sufficient unboxing voodoo, I wonder if we should take a moment to be clear about what we want from specialized code, in terms of performance, how you write specialized code, and how you invoke specialized code.

Cheers,

Matthew



--
You received this message because you are subscribed to the Google Groups "scala-debate" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-debate...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Dr Matthew Pocock
Turing ate my hamster LTD

Integrative Bioinformatics Group, School of Computing Science, Newcastle University

skype: matthew.pocock
tel: (0191) 2566550

Vlad Ureche

unread,
Apr 22, 2013, 7:33:48 AM4/22/13
to Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate

2013/4/14 Rex Kerr <ich...@gmail.com>

Hi Vlad,

I guess I should be more clear: it's fine for collections to be miniboxed, as they are already really slow and any improvement would be welcome (and it would all be improvement, at least on 64 bit JVMs).  But it's not fine for that to be the general mechanism because FunctionN can be slowed down an unacceptable amount.  Of course you don't need to have miniboxed functions (outside of the context of miniboxed collections), but there's going to be a mismatch at the interface where your collections have the full capacity to have fast operations with Byte and Char but you can't afford to have your Function2 fully specialized three ways.

There are a variety of ways to solve this, but it does need to be solved before it's a regular feature.

The more I think of this the more I agree with you. It'll probably need a double variant as well, or, as technically challenging as it is, to integrate with specialization.
I'm looking forward to Miguel's findings on runtime code generation, if that's not prohibitively expensive (which is what I found in my exploration) then we may go down that path.

Cheers,
Vlad

PS: Rex, sorry for the duplicate message, I hit reply instead of reply to all again.

Vlad Ureche

unread,
Apr 22, 2013, 7:40:31 AM4/22/13
to Matthew Pocock, Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate

2013/4/15 Matthew Pocock <turingate...@gmail.com>

So, here's my take. Specialisation is clearly necessary to get acceptable performance but the combinatorix kills you. However, in my experience you very rarely want anything very complex. (A, A) => A, (A, B) => A, (A, B) => B covers virtually every use I've needed. In fact, (A, A) => A covers probably 90% of my use (percentages brought to you from the department of off-the-top-of-my-head statistics). Is there some low-hanging fruit way to capture these type correlations?

def f[@specialized(Double, Float, Int) A, @specialized(Double, Float, Int) B, @specialized(Double, Float, Int) @OneOf(A, B) C]
   (a: A, b: B): C

It feels like there's a thinko with specialization, rather than there being some technical issues that just need a little bit of working out. Things like the field duplication and the combinatorial explosions are symptoms of this. Something is deeply gnarley where specialised and non-specialized implementations can substitute for each-other's interfaces. Specialisation is much more like C++ templates than generics. You're declaring a whole family of types or methods, each with their own code and incompatible type signatures.

I remember proposing this some time ago, but gave up on it after I got the idea of miniboxing:
https://groups.google.com/d/msg/scala-internals/57ObhiTjP50/HlNeI83T0qMJ
I guess we can always pick it up again.

I wonder if we should take a moment to be clear about what we want from specialized code, in terms of performance, how you write specialized code, and how you invoke specialized code.

Yes, please, let's do so. It seems to me there's no one-size-fits-all solution, so let's see what we're willing to trade off.

Cheers,
Vlad

Vlad Ureche

unread,
Apr 22, 2013, 7:41:50 AM4/22/13
to Matthew Pocock, Rex Kerr, Grzegorz Kossakowski, Erik Osheim, scala-debate
Reply all
Reply to author
Forward
0 new messages