I've seen the presentation at Scala Days, and I first wanted to say congratulations on your great work!
I've been looking at the documentation of the value plugin, and I've seen the limitation that only boxed values can be returned if your value class has more than one field. That is expected, but in practice it loses most of the benefits of multiple fields, as shown in this benchmark:
https://github.com/miniboxing/value-plugin/wiki/Benchmarks
But at the end, it's mentioned that in the future this could be solved by storing the fields in an object.
Could you explain a little bit how this would work? Is it really feasible? Are there any other disadvantages to this alternative approach? It would be amazing if this works.
Thanks,
Pablo
That, with the right compiler sugar, would be pretty close to returning unboxed value classes. And the overhead should be minimal.
Cheers,
Pablo
To view this discussion on the web visit https://groups.google.com/d/msgid/scala-miniboxing/7b97a4bb-b834-432e-9e30-d3d96e6bc430%40googlegroups.com.
--
You received this message because you are subscribed to the Google Groups "Scala Miniboxing Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-miniboxi...@googlegroups.com.
To post to this group, send email to scala-mi...@googlegroups.com.
Visit this group at http://groups.google.com/group/scala-miniboxing.
For more options, visit https://groups.google.com/d/optout.
I've been benchmarking this idea using ScalaMeter. I'm not an expert on microbenchmarks but I've done my best to have something realistic.
First, I compared two versions: (a) a function that takes a case class and returns a case class, and (b) setting the values on a global object, calling a function that reads them from that object and writes the output values back to it, and then reading the values after the call.
This case is promising, because even ignoring the memory overhead, it's 2 orders of magnitude faster for primitive values.
If we also use references instead of only primitive values, it's still better, but only by 1 order of magnitude. I'm really surprised by this, as I thought that copying a reference should be as fast as copying a primitive of the same size. Also, reading this reference is much slower than writing it. Is there any explanation for this?
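A minimal sketch of the two variants being compared, written in plain Java since nothing here is Scala-specific. It is not the benchmark code itself: `Pair`, `Storage`, `ReturnStyles` and the arithmetic are hypothetical stand-ins chosen only to show the shape of each call.

```java
// Hypothetical names throughout; this only illustrates the two call shapes.

// Variant (a): immutable holder, one allocation per call (like a case class).
final class Pair {
    final long a;
    final long b;
    Pair(long a, long b) { this.a = a; this.b = b; }
}

// Variant (b): one shared, mutable holder used for both arguments and results.
// A single global instance like this is only safe with a single thread.
final class Storage {
    long a;
    long b;
}

final class ReturnStyles {
    static final Storage GLOBAL = new Storage();

    // (a) takes an object and returns a freshly allocated object
    static Pair stepAlloc(Pair in) {
        return new Pair(in.a + in.b, in.a * in.b);
    }

    // (b) reads its inputs from GLOBAL and overwrites them with the outputs
    static void stepGlobal() {
        long a = GLOBAL.a;
        long b = GLOBAL.b;
        GLOBAL.a = a + b;
        GLOBAL.b = a * b;
    }

    public static void main(String[] args) {
        Pair p = stepAlloc(new Pair(3, 4));

        GLOBAL.a = 3;      // set the values on the global object
        GLOBAL.b = 4;
        stepGlobal();      // the callee reads them and writes the results back

        System.out.println(p.a + " " + p.b + " / " + GLOBAL.a + " " + GLOBAL.b);
    }
}
```

Variant (b) allocates nothing per call, which is where the speed and memory advantage would come from, at the price of the thread-safety problem discussed next.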
Then, the real problem. If we have to read a ThreadLocal to get the object, the results are much worse, and are even slower than creating new objects if we ignore the memory overhead. So, I think we need a solution that avoids ThreadLocals.
My first idea (not benchmarked) is to use a lock-free object pool to get the global object. Maybe this can be faster than ThreadLocal. For example, a linked list of global objects whose root is an atomic reference updated with compareAndSet.
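A rough sketch of such a pool, under a few assumptions: `PooledStorage` and `StoragePool` are hypothetical names, and the root uses `AtomicStampedReference` instead of a plain `AtomicReference`. That swap is deliberate: a linked list that reuses its own nodes is exposed to the ABA problem with a bare compareAndSet, and the stamp is one simple way to rule it out.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Hypothetical storage object; `next` is only meaningful while it sits in the pool.
final class PooledStorage {
    long a;
    long b;
    PooledStorage next;
}

// Lock-free LIFO pool (a Treiber stack) whose root is updated with compareAndSet.
final class StoragePool {
    private final AtomicStampedReference<PooledStorage> head =
            new AtomicStampedReference<>(null, 0);

    // Take an object from the pool, or create one if the pool is empty.
    PooledStorage acquire() {
        int[] stamp = new int[1];
        while (true) {
            PooledStorage top = head.get(stamp);
            if (top == null) {
                return new PooledStorage();
            }
            // The stamp changes on every update, so a stale `top.next` can never win.
            if (head.compareAndSet(top, top.next, stamp[0], stamp[0] + 1)) {
                top.next = null;
                return top;
            }
        }
    }

    // Give an object back so that another call (on any thread) can reuse it.
    void release(PooledStorage s) {
        int[] stamp = new int[1];
        while (true) {
            PooledStorage top = head.get(stamp);
            s.next = top;
            if (head.compareAndSet(top, s, stamp[0], stamp[0] + 1)) {
                return;
            }
        }
    }
}
```

Every acquire and release pays for at least one atomic operation on a shared location, so there is no guarantee this beats a ThreadLocal lookup; it would need to be measured.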
Another improvement would be to pass the global object as a parameter and return it. This way, if we have a long chain of calls that all need the global object, we only need to get it once.
I've also been thinking about this and I have many really useful use cases if we can make it work fast enough. I'll explain this in another mail if we can solve the speed problem.
Cheers,
Pablo
I've tried with a simple object pool based on AtomicReference, and it's as slow as using ThreadLocal.
I mean, it's still really, really fast, but object creation is incredibly fast on the JVM. The only reason for going with any of these implementations would be to save memory and GC work.
I've also tried the case class version with only 20MB of heap, and it spends 20% of the time in GC, and that's the best case because everything is garbage. In a more realistic scenario, all this garbage will be mixed with live objects and the GC overhead can be larger. Of course, with plenty of memory this is negligible.
Any other ideas for accessing the global objects in a thread-safe way?
Cheers,
Pablo
For example, the fastest benchmark, with values and the global object, takes only 0.36 CPU cycles per iteration, and that includes multiple additions and multiplications.
Also, the case class one takes around 36 cycles/iteration; it does the same operations and creates 2 objects.
So, I think I have to repeat them using megamorphic calls to benchmark it properly.
Cheers,
Pablo
I've updated the JVM (1.7.0_60-b19) and I don't see any change.
I've been thinking about increasing the reuse of the global object. Consider this situation: f1 and f3 are functions that use the global object, and f2 is a function that doesn't. f1 calls f2, and f2 calls f3.
In this case, we have to grab the global object twice, which is bad. If we provide a new annotation for f2 that adds an extra parameter carrying the global object, we can fully reuse it. This can be useful when creating libraries.
If f2 is used in another context, its caller would have to grab the global object even though f2 itself doesn't need it. That shouldn't be a problem, as the annotation is added by the user, who knows the context.
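To make the f1/f2/f3 case concrete, here is a small sketch of what such an annotation might expand to. Everything in it is an assumption for illustration: the names `Store`, `PassThrough` and `grab`, the ThreadLocal used to obtain the object, and the `lookups` counter, which exists only to show how many times the object is fetched.

```java
// Hypothetical storage object shared along a call chain.
final class Store {
    long a;
    long b;
}

final class PassThrough {
    static final ThreadLocal<Store> LOCAL = ThreadLocal.withInitial(Store::new);

    // Counts how often the object is fetched; illustration only, not thread-safe.
    static int lookups = 0;

    static Store grab() {
        lookups++;
        return LOCAL.get();
    }

    // Without the annotation: f2 knows nothing about the store,
    // so f3 has to grab it again.
    static void f1Plain() { Store s = grab(); s.a = 1; f2Plain(); }
    static void f2Plain() { f3Plain(); }
    static void f3Plain() { Store s = grab(); s.b = s.a + 1; }

    // With the annotation (hypothetical expansion): f2 receives the store
    // as an extra parameter and simply forwards it to f3.
    static void f1() { Store s = grab(); s.a = 1; f2(s); }
    static void f2(Store s) { f3(s); }
    static void f3(Store s) { s.b = s.a + 1; }

    public static void main(String[] args) {
        f1Plain();
        System.out.println("lookups without forwarding: " + lookups);
        lookups = 0;
        f1();
        System.out.println("lookups with forwarding: " + lookups);
    }
}
```

The forwarded version fetches the object once per call chain instead of once per function that uses it.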
Cheers,
Pablo
I was working on something similar to what motivated my first tests, more than one year ago, and I found an amazing way to speed things up that I think can make this a viable solution.
The problem before was that accessing a ThreadLocal every single time added some overhead compared to simple case class creation. On the other hand, reusing the Storage object was really fast, but passing it around for reuse was too complex to be useful.
The new idea is to profit from the fact that Thread.currentThread() is incredibly fast (just one mov instruction from a register, at least on OpenJDK) to create a cache that avoids the ThreadLocal most of the time.
The idea is to add a new field to the Storage object that points to its thread. Then, we create an array of Storage objects as a cache.
When trying to get a Storage, we first look at position (Thread.currentThread().getId() % arraySize) of the array, and we verify it belongs to the right thread by comparing currentThread with the new field in Storage. If we got the right one, we are done, and that's incredibly fast; otherwise, we get it from the ThreadLocal and put it in the array.
We can make other improvements, like adding more cache levels for collisions, or replacing % with a bit mask when the array size is a power of 2.
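The lookup described above could be sketched roughly as follows. This is a simplified illustration, not the code pushed to the benchmarks repository: `TStorage`, `SmartCache`, the cache size of 64 and the use of a mask instead of `%` are all assumptions.

```java
// Hypothetical storage object that remembers the thread that created it.
final class TStorage {
    final Thread owner = Thread.currentThread();
    long a;
    long b;
}

final class SmartCache {
    private static final int SIZE = 64;        // power of two, so % becomes a mask
    private static final int MASK = SIZE - 1;
    private static final TStorage[] CACHE = new TStorage[SIZE];

    // Slow path and source of truth: one TStorage per thread.
    private static final ThreadLocal<TStorage> LOCAL =
            ThreadLocal.withInitial(TStorage::new);

    static TStorage get() {
        Thread t = Thread.currentThread();
        // getId() is deprecated in recent JDKs in favour of threadId(); kept to match the text.
        int slot = (int) (t.getId() & MASK);
        TStorage s = CACHE[slot];
        // Fast path: the slot holds an object created by this very thread.
        if (s != null && s.owner == t) {
            return s;
        }
        // Miss or collision: fall back to the ThreadLocal and refresh the slot.
        s = LOCAL.get();
        CACHE[slot] = s;
        return s;
    }
}
```

The array is read and written without synchronization. That should be harmless here because `owner` is final and a thread only accepts an entry whose owner is itself, so a stale or colliding slot just sends it down the slow path. One caveat of this sketch is that the static array keeps a finished thread's object reachable until another thread overwrites the slot.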
I've pushed a new test to https://github.com/miniboxing/value-benchmarks and the results on my machine (increasing iterations x10 for more stability) are:
::Benchmark Case class::
Parameters( -> ()): 36.077858
--
::Benchmark MultivalueReturn::
Parameters( -> ()): 48.205917
--
::Benchmark Multivalue::
Parameters( -> ()): 49.684233
--
::Benchmark Multivalue Reuse::
Parameters( -> ()): 26.452018
--
::Benchmark Smart Cache::
Parameters( -> ()): 26.227872
There are minimal variations on different runs but they look really promising to me. Could you try it out on your machine to see if the results are consistent?
By the way, if anyone is interested, the real use case where I plan to apply this idea is a stack-based generic object pool that avoids allocating new objects in many situations, not just for return values. It tries to solve a problem similar to the one scala-offheap addresses, but in a different way.
Cheers,
Pablo
Good to see you can reproduce the results.
I also think that the JVM is avoiding the case class allocation. It's really simple to inline the function call, and then, using escape analysis, I guess it can stack allocate the objects.
But being realistic, that's going to be the case in many situations, so it's important that we can beat that case too if we want a generic solution.
Anyway, we should try to find out what's really going on, and write an additional test that really allocates objects to see the difference (use a megamorphic call to avoid inlining?).
But I think the most important aspects are avoiding GC and lowering memory consumption, which really matter on mobile, in Scala.JS and in many other situations.
About integrating this into the plugin, I think it's promising and that it's worth testing these ideas a little bit more to see where they lead us.
Cheers,
Pablo
Using SBT_OPTS="-XX:-DoEscapeAnalysis -XX:-EliminateAllocations" sbt run
::Benchmark Case class::
Parameters( -> ()): 88.859212
--
::Benchmark MultivalueReturn::
Parameters( -> ()): 52.769595
--
::Benchmark Multivalue::
Parameters( -> ()): 53.696932
--
::Benchmark Multivalue Reuse::
Parameters( -> ()): 29.47849
--
::Benchmark Smart Cache::
Parameters( -> ()): 28.498999
For comparison, here are the results without these options:
::Benchmark Case class::
Parameters( -> ()): 39.631732
--
::Benchmark MultivalueReturn::
Parameters( -> ()): 54.205114
--
::Benchmark Multivalue::
Parameters( -> ()): 54.940818
--
::Benchmark Multivalue Reuse::
Parameters( -> ()): 28.379325
--
::Benchmark Smart Cache::
Parameters( -> ()): 28.124978
My results vary a little if I run them multiple times because my computer is not 100% idle right now, so don't rely on the absolute values. But the proportions among the tests are consistent, and the Case class test gets a 2x speedup from stack allocation.
Still, it would be interesting to write a test that can reproduce this without the command line options. Probably the best option we have without introducing other overheads is to unroll the loop and call the function multiple times using different instances of a trait.
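One possible reading of that last idea, as a sketch only: several implementations of one interface are invoked through a single call site, so that the call site sees more than two receiver classes. HotSpot typically stops inlining such megamorphic calls, which in turn should keep escape analysis from removing the allocation. The names (`Vec`, `Step`, `Megamorphic`) and the four operations are hypothetical, and whether the JIT really behaves this way on a given JVM would have to be confirmed by measurement.

```java
// Hypothetical immutable result type, allocated on every call.
final class Vec {
    final long a;
    final long b;
    Vec(long a, long b) { this.a = a; this.b = b; }
}

// The "trait": one abstract method with several implementations.
interface Step {
    Vec apply(Vec in);
}

final class AddMul implements Step {
    public Vec apply(Vec in) { return new Vec(in.a + in.b, in.a * in.b); }
}

final class Swap implements Step {
    public Vec apply(Vec in) { return new Vec(in.b, in.a); }
}

final class Inc implements Step {
    public Vec apply(Vec in) { return new Vec(in.a + 1, in.b + 1); }
}

final class Dbl implements Step {
    public Vec apply(Vec in) { return new Vec(in.a * 2, in.b * 2); }
}

final class Megamorphic {
    // The single s.apply(v) call site below sees four receiver classes.
    static Vec run(Step[] steps, Vec start, int iterations) {
        Vec v = start;
        for (int i = 0; i < iterations; i++) {
            for (Step s : steps) {
                v = s.apply(v);
            }
        }
        return v;
    }

    public static void main(String[] args) {
        Step[] steps = { new AddMul(), new Swap(), new Inc(), new Dbl() };
        Vec r = run(steps, new Vec(1, 2), 1);
        System.out.println(r.a + " " + r.b);
    }
}
```

Unrolling the inner loop by hand into four separate calls would give each call site a single receiver class again, so in this sketch the implementations are deliberately funnelled through one call site.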