--
You received this message because you are subscribed to the Google Groups "scala-debate" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-debate...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
proposed by me under another name
On Fri, Apr 12, 2013 at 8:16 PM, Rex Kerr <ich...@gmail.com> wrote:
proposed by me under another nameWhich name were you using back then?
Hi Rex,Thanks for explaining the scratch pads idea, it's very interesting. The more I look at it, the more I think we're moving towards separating data and operations:
- scratchpads are the boldest of all: they extract the accumulator in a separate class where it's used with primitive types, so they even separate data from methods, not only from objects
- extension methods, if I understand correctly, leave data in the objects and move code outside
- Greg's proposal of using method handles assumes there is no specialized data in the class (Greg, please correct me if I'm wrong, this is what I understood last time we discussed it at EPFL, you probably developed it more in the meantime)
- miniboxing is the shyest one of all, it still encodes data in the class, but does so in a more uniform way, thus getting the slower 2^N explosion
A question we should ask ourselves is whether separating code and data is good or bad:
- if we make a closed world assumption, where the program becomes stand-alone, LMS proved time and time again that it's best to separate data and operations
- if we're in an open world, where classes may be extended from outside... we don't really know. We'd be giving up our nice modularity, but we'd be gaining performance. To me it seems all our efforts push in this direction, we only need to find a good tradeoff. Another question to ask ourselves is whether we should do it at compilation time, or let the virtual machine do it, since it can treat the world as closed and revise its decisions later if proven wrong?
As for the example you posted, I'm wondering if we could automate it, such that the user doesn't have to do the transformation by hand?
Unfortunately I've got my hands full at the moment with miniboxing, but if someone would be willing to give it a shot, I'd love to see it developed further. (who knows, I may even try hacking something after I'm done with miniboxing)
Cheers,VladPS: Rex, can you also try miniboxing with integers or bytes? I think you'll be surprised by the results. :)
I've also been puzzled by the slowdown when using doubleToRawLongBits/floatToRawIntBits, I think it's related to the fact that they're treated as intrinsics in Hotspot, so they don't get the nice optimizations other methods get.
Yeah, it's a straightforward transformation. You probably couldn't safely do it on bare functions since you can't actually tell what the zero really is, but you could with slightly modified syntax. I don't have time to work on this, but I can illustrate a possible way to do it:
(a: A)(f: (A,B) => A): Amust be (by hand) rewritten as:(@Out[A] f: In[B]): f.typewhich will give the same syntax as the above by using the (a: A) argument to initialize the anonymous In[B] with Out[A], and then calling f.get on the return value.
I actually have been meaning to propose miniboxing for under 32 bit values for a while now as a way to keep Function2 small enough and still enable fast operations on strings and bytes.
Maybe so. I'm not sure. It might just be register hopping. If you do this:I've also been puzzled by the slowdown when using doubleToRawLongBits/floatToRawIntBits, I think it's related to the fact that they're treated as intrinsics in Hotspot, so they don't get the nice optimizations other methods get.
d += java.lang.Double.longBitsToDouble(i);
s += java.lang.Double.doubleToRawLongBits(d);
you end up with this bytecode (in Hotspot, as of 6u25, which is the last easily available debug build I can find for +PrintOptoAssembly)
245 movq [rsp + #0], R10 # MoveL2D_reg_stack
249 movsd XMM0, [rsp + #0] # double stk
24e addsd XMM2, XMM0
252 movd R10,XMM2 # MoveD2L
257 addq RAX, R10 # longThis is not exactly optimal--there's the stack operation which hardly helps (on the L2D side, not the D2L side), and then it's using two different registers. I hope it's right that the transfer is worth the faster operation in appropriate registers. gcc doesn't think so--it basically just skips 249 and uses the [rsp+#0] value directly in the addsd--but maybe it just doesn't do sufficiently aggressive register optimization.
If I had more time I'd try miniboxing in C++ and see if it can beat Java.
2013/4/13 Rex Kerr <ich...@gmail.com>
Yeah, it's a straightforward transformation. You probably couldn't safely do it on bare functions since you can't actually tell what the zero really is, but you could with slightly modified syntax. I don't have time to work on this, but I can illustrate a possible way to do it:
(a: A)(f: (A,B) => A): Amust be (by hand) rewritten as:(@Out[A] f: In[B]): f.typewhich will give the same syntax as the above by using the (a: A) argument to initialize the anonymous In[B] with Out[A], and then calling f.get on the return value.
The In[A] with Out[B] seems a bit like what the Adobe guys did for GIL[1]. They pushed the type parameters as deep as possible in the methods, and what you're proposing is to push them even more, splitting a class Function2[T1, T2, R] into In1[T1] with In2[T2] with Out[R]. The question is whether you can do this generally, in an automated fashion, with just an annotation at most?
For example, how would the compiler infer it needs to move the accumulator from the map code to the function?
I actually have been meaning to propose miniboxing for under 32 bit values for a while now as a way to keep Function2 small enough and still enable fast operations on strings and bytes.(shameless advertising) That's more or less what I plan to present at ScalaDays, if my talk is accepted: an example with folding and Function2 that is optimized using the @minispec annotation. :D
Yes, but the insight I had that prompted me to type this benchmark up now was that we _almost never use the full power of a Function2_. It's almost always (A,B) => A or (A,B) => B, or some widened version which we could plug in instead.
These sort of accumulator/transformers have the form A => (B => A), where the B => A does all the interesting stuff and the first call is just used to set state.So the general transform would be
(a,b) => f(a,b)toloadA(a) { _a = a }loadB(b) { _c = f(_a,b) }get = _cbut this wouldn't necessarily help if one method was required to use all three. And normally you _are_ required to use all three because the mental model for functions is that they're stateless.
For example, how would the compiler infer it needs to move the accumulator from the map code to the function?It couldn't. That's why the signature has to change--as I said, it must be rewritten manually. Well, of course you could write a macro/signal with an annotation to do it automatically:
(@zero a: A)(@accum f:(A,B) => A): Aor something. The specific transform isn't that hard:foo(a: A)(f: (A,B)=> A): Anow:t = new Function2[A,B]{ apply(a: A, b: B) = f(a,b) }foo(a)(t)then:s = new Scratch(a) { load(b: B) { _a = f(_a,b) } }foo(s).getbut you'd have to allow a to be a var and make sure that all code involving a was separable in the right way.
I really think it'd be too fiddly to have the library code be identical--better off just defining the separate interface and let people do it that way. Yes, you have to redo some work, but as a plus, you're better expressing what the requirements actually are of the computation. That is, having a well-defined interface to side-effects can be powerful when it comes to splitting concerns.
Well, if I go (I submitted an abstract to give a more general performance-tuning talk), I will look forward to hearing that!
Incidentally, array-handling is another strong point of scratchpads. The type is always correct, so it doesn't have to worry about oh-I-called-it-a-Long-but-it-was-a-Float-and-oh-where-do-I-put-the-spare-Byte-ugh.
@Rex: Sorry for the duplicate email, I hit reply instead of reply to all.
2013/4/14 Rex Kerr <ich...@gmail.com>
Yes, but the insight I had that prompted me to type this benchmark up now was that we _almost never use the full power of a Function2_. It's almost always (A,B) => A or (A,B) => B, or some widened version which we could plug in instead.
These sort of accumulator/transformers have the form A => (B => A), where the B => A does all the interesting stuff and the first call is just used to set state.So the general transform would be
(a,b) => f(a,b)toloadA(a) { _a = a }loadB(b) { _c = f(_a,b) }get = _cbut this wouldn't necessarily help if one method was required to use all three. And normally you _are_ required to use all three because the mental model for functions is that they're stateless.
I see. There's a very cool place we could put this pattern to good use: returning value classes with multiple arguments. I'm not sure to what extent this would improve the performance, as escape analysis probably eliminates the object creation when the code is inlined. But it may be worth the effort for the interpreter and if the method returning the value class is not inlined.
Incidentally, array-handling is another strong point of scratchpads. The type is always correct, so it doesn't have to worry about oh-I-called-it-a-Long-but-it-was-a-Float-and-oh-where-do-I-put-the-spare-Byte-ugh.
I agree a manual transformation is more powerful. It will always be, if you're willing to do it.
But getting 10x-22x performance boost with an annotation and without significant duplication is still a pretty good tradeoff, don't you think?
Absolutely. But when you get a 22x performance boost in some applications _and a 4x loss in others_ then one is a little less excited. So I think you're going to have to leave Float/Double alone (or pack Float into Double). They're really important for computations; I can't afford to have the default Function2 even 2x slower before I drop Scala and go back to Java for computationally heavy work.
--
You received this message because you are subscribed to the Google Groups "scala-debate" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-debate...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
There are a variety of ways to solve this, but it does need to be solved before it's a regular feature.Hi Vlad,I guess I should be more clear: it's fine for collections to be miniboxed, as they are already really slow and any improvement would be welcome (and it would all be improvement, at least on 64 bit JVMs). But it's not fine for that to be the general mechanism because FunctionN can be slowed down an unacceptable amount. Of course you don't need to have miniboxed functions (outside of the context of miniboxed collections), but there's going to be a mismatch at the interface where your collections have the full capacity to have fast operations with Byte and Char but you can't afford to have your Function2 fully specialized three ways.
So, here's my take. Specialisation is clearly necessary to get acceptable performance but the combinatorix kills you. However, in my experience you very rarely want anything very complex. (A, A) => A, (A, B) => A, (A, B) => B covers virtually every use I've needed. In fact, (A, A) => A covers probably 90% of my use (percentages brought to you from the department of off-the-top-of-my-head statistics). Is there some low-hanging fruit way to capture these type correlations?def f[@specialized(Double, Float, Int) A, @specialized(Double, Float, Int) B, @specialized(Double, Float, Int) @OneOf(A, B) C](a: A, b: B): CIt feels like there's a thinko with specialization, rather than there being some technical issues that just need a little bit of working out. Things like the field duplication and the combinatorial explosions are symptoms of this. Something is deeply gnarley where specialised and non-specialized implementations can substitute for each-other's interfaces. Specialisation is much more like C++ templates than generics. You're declaring a whole family of types or methods, each with their own code and incompatible type signatures.
I wonder if we should take a moment to be clear about what we want from specialized code, in terms of performance, how you write specialized code, and how you invoke specialized code.