Returning unboxed value classes

sir...@gmail.com

unread,

Jul 4, 2014, 11:38:51 AM7/4/14

to scala-mi...@googlegroups.com

Hi,

I've seen the presentation at Scala Days, and I first wanted to say congratulations for your great work !

I've been looking at the documentation of the value plugin, and I've seen the limitation of returning only boxed values if your value class have more than one value. That is expected, but in practice, breaks most of the benefits of multiple fields, as shown in this benchmark:

https://github.com/miniboxing/value-plugin/wiki/Benchmarks

But at the end, it's mentioned that in the future this could be solved storing the fields in an object.

Could you explain a little bit how this would work? Is it really feasible? Any other disadvantages with this alternative approach? It would amazing if this works.

Thanks,
Pablo

sir...@gmail.com

unread,

Jul 4, 2014, 11:59:56 AM7/4/14

to scala-mi...@googlegroups.com, sir...@gmail.com

Just guessing, but, could this be implemented having a global thread local object for each value class, where you set it's value before returning, and get it's value at the caller site?

That, with the right compiler sugar, would be pretty close to returning unboxed value classes. And the overhead should be minimal.

Cheers,
Pablo

Vlad Ureche

unread,

Jul 4, 2014, 7:06:01 PM7/4/14

to sir...@gmail.com, scala-mi...@googlegroups.com

Hi Pablo,

Your explanation is perfect! I'm amazed how well you described the mechanism! :)

Indeed, returning unboxed multi-param value classes would be done with a thread-local object where the callee can store the fields so the caller can read them back. With enough compiler discipline (reading and nulling the fields) and an invariant that nothing is done after writing the fields / before reading them back, it should be possible to get the multi-slot return without affecting the language semantics.

Now, for the actual implementation, I expect it won't be as easy as it sounds, but it shouldn't be too hard either... Still, I would recommend benchmarking the transformation done by hand (as if you were the plugin) and then, if the numbers are good, we can encode the transformation in the plugin. The other way is usually time-consuming, especially when it comes to experimenting with different code patterns.

Cheers,
Vlad

--
You received this message because you are subscribed to the Google Groups "Scala Miniboxing Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-miniboxi...@googlegroups.com.
To post to this group, send email to scala-mi...@googlegroups.com.
Visit this group at http://groups.google.com/group/scala-miniboxing.

To view this discussion on the web visit https://groups.google.com/d/msgid/scala-miniboxing/7b97a4bb-b834-432e-9e30-d3d96e6bc430%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

sir...@gmail.com

unread,

Jul 5, 2014, 3:19:58 PM7/5/14

to scala-mi...@googlegroups.com, sir...@gmail.com

Hi Vlad,

I've been benchmarking this idea using ScalaMeter. I'm not an expert on microbenchmarks but I've done my best to have something realistic.

First, I've tried a function providing a case class and returning a case class VS setting the values on a global object, calling the function that retrieves the values from the object and set new ones for the output, and retrieving the values after the call.

This case is promising, because even ignoring the memory overhead, it's 2 orders of magnitud faster for primitive values.

If we also use references instead of only primitive values, it's still better but only by 1 order of magnitude. I'm really surprised by this, as I thought that copying a reference should be as fast as copying a primitive of the same size. Also, it's much slower reading this reference than writing it. Any explanation to this?

Then, the real problem. If we have to read a ThreadLocal to get the object, the results are much worst, and are even slower than creating new objects if we ignore the memory overhead. So, I think we need a solution to avoid ThreadLocals.

My first idea (not benchmarked) is to use a lockless object pool to get the global object. Maybe this can be faster than ThreadLocal. For example, a linked list of global objects that is updated using and atomic reference on the root and CompareAndSet.

Other improvement would be to provide the global object as a parameter and return it. This way, if we have a long list of calls that all need the global object, we only need to get it once.

I've also been thinking about this and I have many really useful use cases if we can make it work fast enough. I'll explain this in another mail if we can solve the speed problem.

Cheers,
Pablo

sir...@gmail.com

unread,

Jul 5, 2014, 4:15:25 PM7/5/14

to scala-mi...@googlegroups.com, sir...@gmail.com

Hi again,

I've tried with a simple object pool based on AtomicReference, and it's as slow as using ThreadLocal.

I mean, it's still really really fast, but object creation it's incredibly fast on the JVM. The only reason for going with any of these implementations could be to save memory and GC.

I've also tried the case class version with only 20MB of heap, and it expends 20% of the time in GC, and that's the best case because everything is garbage. In a more realistic scenario, all this garbage will be mixed with live object and the GC overhead can be larger. Of course, with plenty of memory this is negligible.

Any other ideas to access the global objects in a thread safe way?

Cheers,
Pablo

sir...@gmail.com

unread,

Jul 5, 2014, 6:15:14 PM7/5/14

to scala-mi...@googlegroups.com, sir...@gmail.com

I've looked at the results again, and probably the JVM it's optimizing more than I think, because things are too fast.

For example, the fastest benchmark with values and global object, takes only 0.36 cpu cycles per iteration, and that performs multiple additions and multiplications.

Also, the case clases one takes around 36 cycles/iteration, and does the same operations and creates 2 objects.

So, I think I have to repeat them using megamorphic calls to benchmark it properly.

Cheers,
Pablo

sir...@gmail.com

unread,

Jul 6, 2014, 9:44:34 AM7/6/14

to scala-mi...@googlegroups.com, sir...@gmail.com

Sorry for the long post, but I wanted to explain in detail everything I’ve been testing. I think it’s worth reading it, because the conclusions are promising.

I've fixed the benchmark. The problem was that the iterations were independent form each other and that allowed the JVM to optimize the primitive case in a way that was not realistic. I’ve introduced dependencies and now the results are not that surprising.

The difference between primitives and references dissapear and all the tests take around 30-100 cycles per iteration, which is reasonable. Also, the best performance, I get it using ThreadLocals and using the trick of proving the global object as parameter and returning it, so all tests are done with this.

Also, for the reasons explained later (generics), I use a single global object for any value class. In the example, it has a maximum of 5 primitives + 5 references. Any multivalue class would have to be mapped into that. For the same reasons, the input parameters are provided using the global object, not only the return value. The global object could have multiple sets of primitives and references to support multiple input parameters.

The test is simple, an input multivalue that provides 3 numbers, and an output multivalue that provides the sum and the product of the 3 numbers. To make each iteration dependent on the previous, the 3 input numbers are taken from the previous iteration (sum, prod, sum+prod). This operation is repeated 1M times.

Scalameter is taking care of warming up, repeating the operation multiple times and taking the median value. The results are really stable, and scale as expected for 10M and 100M, so I’m more or less confident.

This is run on a 2.6 Core2Duo with plenty of RAM, so that the Case Class benchmark is not affected by GC, as measured with jvisualvm almost 0% GC for long tests (100M) using 1GB of RAM.

The code: https://gist.github.com/siriux/f851fb292c14aaeadd07

Cleaned up results + speedup respect Case Class:

::Benchmark Case class::
Parameters( -> ()): 14.263

::Benchmark Multivalue::
Parameters( -> ()): 11.032 Speedup: 1.292x

::Benchmark Multivalue Reuse::
Parameters( -> ()): 6.348 Speedup: 2.246x

The multivalue case, is only reusing the global object in a single iteration. That’s the worst case where this is used on a single function call, with the rest of the code not using multivalues, in that case we need to grab the global object from the ThreadLocal every single iteration.

If we have a chain of calls where all of them use multivalues, the global object can be reused for all the chain. That’s really good for a functional style. It can be really useful for things like akka-streams, where the global object could be passed along the chain.

The last benchmark called Multivalue Reuse grabs the global object from TL only once for the 1M iterations. This provides a great speedup. This tries to reflect the case where your “loop” is aware of multivalues, and therefore, this global can be reused. An example of this would be something similar to a collection view, where you can apply multiple operations and then traverse the collection a single time. If this view is aware of multivalues, you can take advantage to reuse the global object without the TL tax.

To be able to implement this “collections” that are aware multivalues (views, akka-streams, …) we need a way to make this work with generics. The idea is to provide an annotation similar to miniboxing on the type (or integrate it with miniboxing) to generate a third implementation in addition to primitive and object for the multivalues. What this implementation will do is to provide a global object on the call, and get it from the return. Then it assumes the worst case, and copies everything from global to local variables on function returns, and back to global on function calls. That’s the reason why we need to pass the input parameters using global, because with generics we don’t have any other information. Assuming a maximum of 5 primitives + 5 references, I’ve written benchmarks for this.

Results for Generic:

::Benchmark Multivalue Generic Overhead 5::
Parameters( -> ()): 19.039 Speedup: 0.749x

::Benchmark Multivalue Generic Overhead 5 Reuse::
Parameters( -> ()): 10.672 Speedup: 1.336x

The generic single use case is a little bit slower, but I don’t think this will be very common. Usually, if you have generics, you will have something similar to a collection, and in this case you can reuse the global. In this case, we get a little speedup.

This can be further improved specializing for multiple levels of primitives + references. You can have a base generic of 2 + 2 that will be pretty fast for usual multivalues, then 5 + 5 as shown in the benchmark, and for example, 10 + 10. This would be applied only to a few classes, not all the collections, so I think it won’t be a problem.

Other improvement can be to provide an annotation that says that it’s safe to reuse the global object values (without copying back and forth to locals) because we are sure that none else uses it in the middle. If we use this to implement a map operation on a generic view, it would be incredibly performant and generic. Also, in this case we won’t need multiple levels of specialization.

In summary, I think a thread local global object, shared for all multivalue classes, that it’s used for input and output parameters can be used to speed up the execution, and take advantage of less memory consumption. Also, to really make this useful, we need support for unboxed multivalued generics. With the same technique, and a few optimizations, we can support them with better performance in most cases.

What do you think?

Cheers,
Pablo

Vlad Ureche

unread,

Jul 7, 2014, 6:00:54 PM7/7/14

to Pablo Guerrero, scala-mi...@googlegroups.com

Pablo,

That's a really neat analysis! I created the value-benchmarks repository based on your gist and you can commit anything you like to the miniboxing/* projects.

Here are some comments I came up with while trying to replicate the benchmarks:

1. I tried to rerun the benchmarks on my laptop. This is what i got on the i7-4702HQ processor running Oracle's Java 7:

::Benchmark Case class::
Parameters( -> ()): 4.893229
::Benchmark Multivalue::
Parameters( -> ()): 9.196792
::Benchmark Multivalue Reuse::
Parameters( -> ()): 3.881853

::Benchmark Multivalue Generic Overhead 5::

Parameters( -> ()): 17.412198

::Benchmark Multivalue Generic Overhead 5 Reuse::

Parameters( -> ()): 7.403543

Tbh, I'm a bit surprised at these numbers, since I was expecting to see a significant speedup when reusing the TL storage object.
I also tested the multivalue benchmark with passing value classes unboxed and then using the TL storage object for benchmarking:

::Benchmark MultivalueReturn::
Parameters( -> ()): 11.088242

I guess the take home message is that using a TL object for passing value classes is worth it, as it avoids the garbage collection overhead (a 20% improvement), but we need to be very careful to avoid having to call getStorage() which is expensive. Furthermore, I'm not sure how a single TL object would scale to passing multiple value classes as arguments -- we could have several fields and serialize the objects, but this doesn't seem like an elegant solution...

2. I tried using a SeparateJvmsExector, with completely crazy results:

::Benchmark Case class::
Parameters( -> ()): 0.00146
::Benchmark MultivalueReturn::
Parameters( -> ()): 0.0012
::Benchmark Multivalue::
Parameters( -> ()): 0.001719
::Benchmark Multivalue Reuse::
Parameters( -> ()): 0.003434

::Benchmark Multivalue Generic Overhead 5::

Parameters( -> ()): 0.001682

::Benchmark Multivalue Generic Overhead 5 Reuse::

Parameters( -> ()): 0.003471

Which was the megamorphic call you mentioned? Maybe that's causing such numbers, as the JVM is basically throwing away the loop after the first iteration:

0. warmup run running time: 4.140961 (covNoGC: NaN, covGC: NaN)
1. warmup run running time: 0.005453 (covNoGC: 1.4105, covGC: 1.4105)
2. warmup run running time: 0.003815 (covNoGC: 1.7262, covGC: 1.7262)
3. warmup run running time: 0.0031 (covNoGC: 1.9921, covGC: 1.9921)
4. warmup run running time: 0.002743 (covNoGC: 2.2259, covGC: 2.2259)
5. warmup run running time: 0.002684 (covNoGC: 2.4369, covGC: 2.4369)
6. warmup run running time: 0.002558 (covNoGC: 2.6307, covGC: 2.6307)
7. warmup run running time: 0.002721 (covNoGC: 2.8105, covGC: 2.8105)
8. warmup run running time: 0.002831 (covNoGC: 2.9790, covGC: 2.9790)
9. warmup run running time: 0.002771 (covNoGC: 3.1381, covGC: 3.1381)
10. warmup run running time: 0.002765 (covNoGC: 0.2817, covGC: 0.2817)
11. warmup run running time: 0.002727 (covNoGC: 0.1250, covGC: 0.1250)
12. warmup run running time: 0.002692 (covNoGC: 0.0505, covGC: 0.0505)

3. Regarding the specialization for value classes, I think this is a very nice idea: instead of boxing individually, we could have a "global box" that we use each time we operate with value classes.
There are 3 areas where specialization/miniboxing reap benefits:

- from avoiding object allocation/GC

- from direct access to values (vs indirect access through pointers in case of boxing)

- from aligned (cache-local) data structures

If I understood the "global box" approach correctly, it reaps the benefits from the 1st point, but not the 2nd and 3rd. It would be great for heap-constrained applications and would avoid GC cycles, but wouldn't produce the big speedups associated to the 2nd and 3rd points. I'm not aware of any transformation that deeply replaces value classes into the generated bytecode (please correct me if I'm wrong, but I think even .net boxes value classes when they are used in generic contexts) -- so any step in this direction would be awesome!

I think to validate the "global box" approach, we would need some more solid benchmarks and a recipe for handling multiple arguments of value class types. Do you have any ideas how we could do this?

Cheers,

Vlad

Cheers,
Pablo

--
You received this message because you are subscribed to the Google Groups "Scala Miniboxing Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-miniboxi...@googlegroups.com.
To post to this group, send email to scala-mi...@googlegroups.com.
Visit this group at http://groups.google.com/group/scala-miniboxing.

To view this discussion on the web visit https://groups.google.com/d/msgid/scala-miniboxing/a484a859-5335-44c6-bfe6-ce2508e44d61%40googlegroups.com.

sir...@gmail.com

unread,

Jul 8, 2014, 3:45:54 AM7/8/14

to scala-mi...@googlegroups.com, sir...@gmail.com

Hi Vlad,

On Tuesday, July 8, 2014 12:00:54 AM UTC+2, Vlad Ureche wrote:
> ::Benchmark Case class::
> Parameters( -> ()): 4.893229
> ::Benchmark Multivalue::
> Parameters( -> ()): 9.196792
> ::Benchmark Multivalue Reuse::
> Parameters( -> ()): 3.881853
> ::Benchmark Multivalue Generic Overhead 5::
>
>
> Parameters( -> ()): 17.412198
> ::Benchmark Multivalue Generic Overhead 5 Reuse::
> Parameters( -> ()): 7.403543
>
>
> Tbh, I'm a bit surprised at these numbers, since I was expecting to see a significant speedup when reusing the TL storage object.

I was using ScalaMeter 0.5-SNAPSHOT, a different version of sbt, running inside intellij idea. So I've rerun everything with the same parameters as you used, but I didn't see a big difference.

Then, I realize that today my results were not stable as yesterday, so I closed many things, and I've changed the aggregate function to min instead of median. Yesterday, when I was getting consistent results anyway it seemed a good idea, but I think using min is much more consistent.

Now I get consisten results almost al the time, but I cannot reproduce the difference with Case Class test. With min, Case Class is a little bit faster than Multivalue, but not double. Can you rerun it multiple times using min aggregator, to see if you can reproduce it?

Also, make sure you are creating garbage with the Case Class test. I'm seeing some GC detected. lines and I can see the garbage with jvisualvm. It's true that mi processor is old, but I don't see how could this benefit the Case Class test.

I'm using Macosx on a Core 2 Due 2.66GHz with java:
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

My results are:

::Benchmark Case class::
Parameters( -> ()): 13.079

::Benchmark MultivalueReturn::
Parameters( -> ()): 14.903

::Benchmark Multivalue::
Parameters( -> ()): 14.185

::Benchmark Multivalue Reuse::
Parameters( -> ()): 6.276

>
>
> I also tested the multivalue benchmark with passing value classes unboxed and then using the TL storage object for benchmarking:
>
> ::Benchmark MultivalueReturn::
> Parameters( -> ()): 11.088242
>

This is a little bit slower for me too, and you lose the possibility of generics.

>
>
>
> I guess the take home message is that using a TL object for passing value classes is worth it, as it >avoids the garbage collection overhead (a 20% improvement), but we need to be very careful to avoid >having to call getStorage() which is expensive. Furthermore, I'm not sure how a single TL object would >scale to passing multiple value classes as arguments -- we could have several fields and serialize the >objects, but this doesn't seem like an elegant solution...

My idea is to have multiple sets of storage in a single storage object. val11, val12, ... and a second set val21, val22, ... If you have 10 of this, it's not much memory overhead, and you can pass up to 10 multiple value class arguments. This arbitrary limitations are ugly, but I think they work in practice. Also you could transparently box arguments greater than 10.

It don't think this will have any impact on performance, as you use the same global object, and the alternative sets are only used when needed. If you pass more arguments, of course it will take longer, but it's the same if you have to do more object creations.

>
>
>
>
>
> 2. I tried using a SeparateJvmsExector, with completely crazy results:
>
> ::Benchmark Case class::
> Parameters( -> ()): 0.00146
> ::Benchmark MultivalueReturn::
> Parameters( -> ()): 0.0012
>
>
> ::Benchmark Multivalue::
> Parameters( -> ()): 0.001719
> ::Benchmark Multivalue Reuse::
> Parameters( -> ()): 0.003434
> ::Benchmark Multivalue Generic Overhead 5::
> Parameters( -> ()): 0.001682
> ::Benchmark Multivalue Generic Overhead 5 Reuse::
>
>
> Parameters( -> ()): 0.003471
>
>
> Which was the megamorphic call you mentioned? Maybe that's causing such numbers, as the JVM is
> basically throwing away the loop after the first iteration:

I tried using SeparateJvmsExecutor, but I get this exception:
java.lang.ClassNotFoundException: valium.benchmark.MultivalueBenchmark$$anonfun$1$$anonfun$apply$mcV$sp$1

The megamorphic call idea was to have multiple fCaseClass that are provided to a caller function. This can avoid some optimizations.

But I think that maybe it's better to really use the final result, for example printing it. So that it cannot avoid the loop.

>
> 3. Regarding the specialization for value classes, I think this is a very nice idea: instead of boxing individually, we could have a "global box" that we use each time we operate with value classes.
>
>
> There are 3 areas where specialization/miniboxing reap benefits:
>
> - from avoiding object allocation/GC
>
> - from direct access to values (vs indirect access through pointers in case of boxing)
>
> - from aligned (cache-local) data structures
>
>
> If I understood the "global box" approach correctly, it reaps the benefits from the 1st point, but not
> the 2nd and 3rd. It would be great for heap-constrained applications and would avoid GC cycles, but
> wouldn't produce the big speedups associated to the 2nd and 3rd points. I'm not aware of any
> transformation that deeply replaces value classes into the generated bytecode (please correct me if
> I'm wrong, but I think even .net boxes value classes when they are used in generic contexts) -- so any
> step in this direction would be awesome!
>

I think we can benefit from the 3rd too. You are reusing the same object all the time. Therefore, this object will be for sure on L1. Even better, you can reuse it for different types / multiple unboxed parameters.

Now, I realize that maybe you are talking about a data structure of unboxed values. For example an array of unboxed objects. This is one of the use cases I'm most interested in, and it's possible with this approach. Let's say you have a (direct) byte buffer with a list of structs, in a C fashion. Then you have a function that providing a byte buffer and an offset returns an unboxed multivalue class. Then, you can have a generic object with a foreach method that uses a byte buffer, the reading function and a callback to iterate the structure.

No object allocation is needed, and you are able to work with really memory compact structures in a functional way. Of course, this can be extended to maps, folds, ... and also to more complex structures, not just simple lists.

In this case, you also reap the cache benefits, because the structure is memory aligned and will be efficiently retrieved by the cache. As before, the global object will be in L1.

As an aside, for my use case, I have memory mapped files with complex structures I need to traverse in different ways and post process the results using some kind of data streams (maybe akka-streams). The result is transformed multiple times along different pipes. All of this creates many small objects that live only during a few function calls. In my case, the app is IO bounded, but anyway more than 30% of the CPU is GC because I want a small heap.

>
> I think to validate the "global box" approach, we would need some more solid benchmarks and a
> recipe for handling multiple arguments of value class types. Do you have any ideas how we could do
> this?
>

For the benchmarks, lets see if we can reach agreement on our results or discover the root of the differences (java version, OS, CPU???).

About the multiple arguments, I really don't see the problem, but maybe my previous explanation is wrong and I'm missing something.

If you have time, or just want to learn about it, have a look at how Rust manages the memory. It tries to be as fast as C, with a much higher level language and nice functional programming. It has many good ideas about ownership that are not applicable, but for example, all the values are passed by default as a copy, even structs, and this is very similar to what we try to do. Of course this is an oversimplification, and they can do many thing we can't, but the approach seems valid.

http://paulkoerbitz.de/posts/Understanding-Pointers-Ownership-and-Lifetimes-in-Rust.html

Cheers,
Pablo

sir...@gmail.com

unread,

Jul 8, 2014, 5:37:13 AM7/8/14

to scala-mi...@googlegroups.com, sir...@gmail.com

Hi again,

I've updated the jvm (1.7.0_60-b19) and I don't see any change.

I've been thinking about increasing the global object reuse. If you have this situation, f1 and f3, functions that use the global object, f2 a function that doesn't use it. Then you call f2 inside f1 and f3 inside f2.

In this case, we have to grab the global object twice, which is bad. If we provide a new annotation to f2 that add an extra parameter with the global object we can fully reuse it. This can be useful when creating libraries.

If f2 is used in other context, we would need to grab the global object even if we don't need it. That shouldn't be a problem as it's an annotation provided by the user, that knows the context.

Cheers,
Pablo

Alexandru Nedelcu

unread,

Jul 14, 2014, 9:49:53 AM7/14/14

to sir...@gmail.com, scala-mi...@googlegroups.com

On Sat, Jul 5, 2014 at 10:19 PM, <sir...@gmail.com> wrote:

Then, the real problem. If we have to read a ThreadLocal to get the object, the results are much worst, and are even slower than creating new objects if we ignore the memory overhead. So, I think we need a solution to avoid ThreadLocals.

My first idea (not benchmarked) is to use a lockless object pool to get the global object. Maybe this can be faster than ThreadLocal. For example, a linked list of global objects that is updated using and atomic reference on the root and CompareAndSet.

Intuitively, it makes sense for ThreadLocals to be slow, as access is backed by a generic map in which the keys are thread IDs and the values are boxed.

The problem with using an atomic reference (and any kind of global box) is also one of throughput - that atomic reference will be a point of contention and even though it may relieve some stress from the GC and even though it has the lock-free property, it does introduce contention in places where there should be no contention to speak of. And compare-and-sets operations are also fairly expensive even when uncontended - and if you try to optimize the algorithm with a biased fast path for the uncontended scenarios, you can end up decreasing throughput for the contended scenarios (i.e. what happens with biased locking, which in practice is an awful idea).

Also - to add to the discussion - when speaking of GC overhead, the instance's lifecycle does matter. GC overhead is less interesting for short-lived objects, since deallocation of short-lived objects happens in bulk and is concurrent/non-blocking since CMS, than it is for objects that are neither short lived or long lived or anything else that leads to more stop-the-world pauses. I also think that benchmarking for problematic optimizations should also be done on top of a better garbage collector, like the one provided by Azul Systems. I wonder if they provide an evaluation version.

--
Alexandru Nedelcu
www.bionicspirit.com

PGP Public Key:
https://bionicspirit.com/key.aexpk

Pablo Guerrero

unread,

Jul 14, 2014, 10:43:21 AM7/14/14

to Alexandru Nedelcu, scala-mi...@googlegroups.com

Hi Alexandru,

I actually tried with the atomic reference (explained in previous
mails) and it was slower than TL. Actually TL are incredibly fast, and
they are not backed by a generic map, but by a really tuned map that
it's much faster than I thought. Have a look at the implementation on
OpenJDK if you are curious, it's really good.

The "problem" is that creating an instance is so fast that everything
is slow relative to it. But as shown in my previous mails, this can be
improved a lot reusing the thread local for chained calls on the same
thread.

Also, if we can make it work with generics (with something similar to
the proposed mechanism) it would be useful for collections and
streams, where the potential for reuse is much bigger, and also the
gains from locality are more noticeable.

The problem with GC it's not collecting this short lived objects, that
it's almost free, but forcing it to run many more times than needed
that has some overhead. Also, if you application is concurrent, the
concurrent GC it's going to decrease throughput, as this CPU could be
used by other application thread.

Other problem is the interaction with normal longer garbage, that will
be promoted to tenure much faster, slowing everything down.

This is greatly increased if you program in functional style, creating
much more garbage than programming in imperative style.

And it's even worst if you don't have a large heap, for example in
mobile, in desktop (as you need to share with other apps), and even in
sever if you don't have dedicated machines for a single JVM.

That's why I think a solution to this problem, even if it's not faster
in all the cases without taking into account the GC, could greatly
benefit scala and the functional paradigm.

Cheers,
Pablo

Alexandru Nedelcu

unread,

Jul 14, 2014, 11:30:57 AM7/14/14

to Pablo Guerrero, scala-mi...@googlegroups.com

Hi Pablo,

On Mon, Jul 14, 2014 at 5:43 PM, Pablo Guerrero <sir...@gmail.com> wrote:

I actually tried with the atomic reference (explained in previous
mails) and it was slower than TL.

This conversation is really interesting, but to my shame I haven't read all of it.

Sorry if my points are redundant or wrong :-)

Other problem is the interaction with normal longer garbage, that will
be promoted to tenure much faster, slowing everything down.
This is greatly increased if you program in functional style, creating
much more garbage than programming in imperative style.

Not saying this is wrong, but having to spend more concurrent cycles for doing garbage collection, or stop the world freezes for the young generation (which are predictable) is less of a problem than stop-the-world full GC phases of the old generation, which severely limits parallelism and vertical scalability - the importance of this is subjective of course, but it's such a pain to work on soft real-time systems on top of the JVM precisely because of this. And persistent data-structures suffer from this primarily because when doing path copying, the garbage you end up with is not short-lived.

Anyway, the point that I wanted to get across that was kind of lost in translation is that compiler-related optimizations are harder to get rid of because of backwards compatibility and I think they should be avoided if a better GC solves it, as GCs do evolve with each Java version. Hence my opinion that benchmarks should also be done with a more capable GC. To me it's not very intuitive what should happen, benchmarking is a PITA and I fear optimizations that in certain scenarios are doing more harm than good and so it's good to test with multiple configurations.

Of course, returning unboxed value classes with multiple values would be extremely awesome if it worked well.

Pablo Guerrero

unread,

Jul 14, 2014, 11:48:00 AM7/14/14

to Alexandru Nedelcu, scala-miniboxing

Hi Alexandru,

I completely agree, with your point, but I don't think we are in this
situation here. This is not a general compiler optimization, it's more
like a new language feature that enables optimizations.

AFAIK this will only be enabled if you annotate your classes in a
certain way, so you should use this only if it makes sense in your use
case. For everyone else, this is transparent. Still, it's a lot of
work, and it should be done only if it's useful in some scenarios.

For me, the largest complain about the JVM is memory consumption. If
you want a decent performance (compared to C), you have to provide a
large amount of memory compared to what's really needed in C with
manual memory management. This works really well on dedicated servers,
but not so well on desktop or mobile. And for OpenJDK, it doesn't seem
that they are tuning GC for this case, they are usually more concerned
about server than coexisting with other desktop apps.

Cheers,
Pablo

sir...@gmail.com

unread,

Dec 9, 2015, 4:23:05 AM12/9/15

to Scala Miniboxing Mailing List, al...@bionicspirit.com, sir...@gmail.com

Hi all,

I was working on something similar to what motivated my first tests, more than one year ago, and I found an amazing way to speed things up that I think it can make it a viable solution.

The problems before where that accessing a ThreadLocal every single time was adding some overhead over the simple case class creation. On the other hand, reusing the Storage object was really fast, but passing it around for reuse was too complex to be useful.

The new idea is to profit from the fact that Thread.currentThread() is incredibly fast (just one mov instruction from a register, at least on OpenJDK) to create a cache that avoids the use of TL most of the time.

The idea is to add a new field to the storage object to point to it's thread. Then, we create an array of Storage objects as a cache.

When trying to get a Storage, we first look in the position (Thread.currentThread().getId() % arraySize) of the array, and we verify it's the right thread comparing the currentThread with the new field in Storage. If we got the right one, we are done and that's incredibly fast, otherwise, we get it from the TL and put it on the array.

We can make other improvements, like having more cache levels for collisions, replacing % with a mask with arrays that have a size of a power of 2, ...

I've pushed a new test to https://github.com/miniboxing/value-benchmarks and the results on my machine (increasing iterations x10 for more stability) are:

::Benchmark Case class::
Parameters( -> ()): 36.077858
--
::Benchmark MultivalueReturn::
Parameters( -> ()): 48.205917
--
::Benchmark Multivalue::
Parameters( -> ()): 49.684233
--
::Benchmark Multivalue Reuse::
Parameters( -> ()): 26.452018
--
::Benchmark Smart Cache::
Parameters( -> ()): 26.227872

There are minimal variations on different runs but they look really promising to me. Could you try it out on your machine to see if the results are consistent?

By the way, if anyone is interested, my real use case where I plan to implement this idea is to have a stack based generic object pool that avoids allocating new objects in many situations, not just return values. It tries to solve a similar problem as scala-offheap but in a different way.

Cheers,
Pablo

Vlad Ureche

unread,

Dec 17, 2015, 2:28:49 AM12/17/15

to Pablo Guerrero, Scala Miniboxing Mailing List, Alexandru Nedelcu

Hi Pablo,

Thanks for sharing these insights, that's a great idea! Here's what I see on my machine:

::Benchmark Case class::
Parameters( -> ()): 45.427086

::Benchmark MultivalueReturn::
Parameters( -> ()): 88.634071

::Benchmark Multivalue::
Parameters( -> ()): 74.241335

::Benchmark Multivalue Reuse::
Parameters( -> ()): 37.72842

::Benchmark Smart Cache::
Parameters( -> ()): 37.735088

The one question I can't really answer is why is the case class benchmark so fast? I guess the JVM does some escape analysis magic and eliminates the case class allocation.

Did you check this?

Do you think we should integrate this storage strategy in the value class plugin?

Cheers,

Vlad

--
You received this message because you are subscribed to the Google Groups "Scala Miniboxing Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-miniboxi...@googlegroups.com.
To post to this group, send email to scala-mi...@googlegroups.com.
Visit this group at http://groups.google.com/group/scala-miniboxing.

To view this discussion on the web visit https://groups.google.com/d/msgid/scala-miniboxing/16aced8d-df06-4fb4-824c-e247216bf8eb%40googlegroups.com.

sir...@gmail.com

unread,

Dec 17, 2015, 2:49:32 AM12/17/15

to Scala Miniboxing Mailing List, sir...@gmail.com, al...@bionicspirit.com

Hi Vlad,

Good to see you can reproduce the results.

I also think that the JVM is avoiding case class allocation. It's really simple to inline the function call, and then, using escape analysis I guess it can stack allocate the objects.

But being realist, that's going to be the case in many situations, so it's important that we can beat that case too if we want a generic solution.

Anyway, we should try to find what's really going on, and write an additional test that really allocates objects to see the difference (use a megamorphic call to avoid inlining?).

But the most important aspect I think is avoiding GC, and lower memory consumption, that are really important in mobile, Scala.JS and many other situations.

About integrating this into the plugin, I think is promising and that it's worth it to test this ideas a little bit more to see where they lead us.

Cheers,
Pablo

sir...@gmail.com

unread,

Dec 17, 2015, 3:16:04 AM12/17/15

to Scala Miniboxing Mailing List, sir...@gmail.com, al...@bionicspirit.com

I just did a really quick test avoiding escape analysis and stack allocation and the difference is huge !

Using SBT_OPTS="-XX:-DoEscapeAnalysis -XX:-EliminateAllocations" sbt run

::Benchmark Case class::
Parameters( -> ()): 88.859212
--
::Benchmark MultivalueReturn::
Parameters( -> ()): 52.769595
--
::Benchmark Multivalue::
Parameters( -> ()): 53.696932
--
::Benchmark Multivalue Reuse::
Parameters( -> ()): 29.47849
--
::Benchmark Smart Cache::
Parameters( -> ()): 28.498999

For comparison, here are the results with without this options:

::Benchmark Case class::
Parameters( -> ()): 39.631732
--
::Benchmark MultivalueReturn::
Parameters( -> ()): 54.205114
--
::Benchmark Multivalue::
Parameters( -> ()): 54.940818
--
::Benchmark Multivalue Reuse::
Parameters( -> ()): 28.379325
--
::Benchmark Smart Cache::
Parameters( -> ()): 28.124978

My results vary a little if I run them multiple times because my computer is not 100% idle right now, so don't take absolute values. But the proportions among tests are consistent and Case class test gets a x2 speedup because of stack allocation.

Still, trying to write a test that can reproduce this without the command line options would be interesting. Probably the best option we have without introducing other overheads is to unroll the loop and call the function multiple times using different instances of a trait.

Reply all

Reply to author

Forward