Refurbishment of benchmark: changes in work, discussion


Rüdiger Möller

Mar 9, 2014, 3:51:38 PM
to java-serializat...@googlegroups.com
I noticed horrendous run-to-run jitter and got significantly different results when measuring the same classes being serialized, but triggered and run from a separate program.
I found the following reasons:

* With some 40 serializers on board, HotSpot's inlining suffers because of the many subclasses.
* The benchmark makes many runs and takes the MINIMUM time. This invites randomness. It also hides a serializer's memory waste: a serializer might trigger a GC on every run, yet get lucky on one single run and post an unrepresentative score.
* Turbo Boost and power management are bad for benchmarking :-).

So the following changes reduced run-to-run jitter from +-10% to < 0.5%:
1) Increase warmup time, then take the average time over the benchmark run.
2) Run each benchmark in an isolated VM.
3) Turn off Turbo Boost.

Other changes:

- Slow serializers take forever to complete, so the bench now runs for a fixed time instead of a fixed number of iterations. This might skew results for very slow serializers, but if one comes in with a 20 times worse score, a bias of 1% isn't significant.
- Separated stats generation from the benchmark run, so you can run all benchmarks once and then create charts from the results without having to rerun the lengthy bench.
- Added a classification to each bench:

SerClz { FULL_SERIALIZER, FLAT_TREE_SER }
SerFormat { BINARY, BINARY_CROSSLANG, JSON, XML, MISC }
SerType { ZERO_KNOWLEDGE, CLASS_KNOWLEDGE, MANUAL_CLASS_SPECIFIC_OPT }

This way one can create charts/stats for, e.g., all cross-language serializers not requiring generation or preparation, or all XML serializers excluding manually optimized ones.
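In code, the classification could look roughly like the sketch below (class and field names are made up for illustration; only the enum values above are the real ones):

import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical class and field names; just the enum values come from the change above.
public class ClassificationSketch {
    enum SerClz { FULL_SERIALIZER, FLAT_TREE_SER }
    enum SerFormat { BINARY, BINARY_CROSSLANG, JSON, XML, MISC }
    enum SerType { ZERO_KNOWLEDGE, CLASS_KNOWLEDGE, MANUAL_CLASS_SPECIFIC_OPT }

    static class BenchInfo {
        final String name; final SerClz clz; final SerFormat format; final SerType type;
        BenchInfo(String name, SerClz clz, SerFormat format, SerType type) {
            this.name = name; this.clz = clz; this.format = format; this.type = type;
        }
    }

    // Example query: all cross-language serializers, excluding manually optimized variants.
    static List<BenchInfo> crossLangAutomatic(List<BenchInfo> all) {
        List<BenchInfo> result = new ArrayList<BenchInfo>();
        for (BenchInfo b : all) {
            if (b.format == SerFormat.BINARY_CROSSLANG
                    && b.type != SerType.MANUAL_CLASS_SPECIFIC_OPT) {
                result.add(b);
            }
        }
        return result;
    }
}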

Any suggestions or no-gos?

Rüdiger Möller

Mar 9, 2014, 3:56:43 PM
to java-serializat...@googlegroups.com
Addition: 
Maintaining name lists of >40 serializers redundantly in config files is a nightmare. Due to renamings, initially some 10 serializers were not found. I solved this by making the list in BenchmarkRunner the master and running all serializers (each in a separate VM) regardless.
Filtering is then applied when generating the stats. 
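Forking one JVM per serializer could look roughly like the sketch below; the entry point and the "--bench" flag are assumptions, not the project's actual command line:

import java.io.File;
import java.util.Arrays;
import java.util.List;

// Sketch: fork one JVM per serializer from a master list kept in code.
// The entry point and the "--bench" flag are illustrative assumptions.
public class ForkingRunnerSketch {
    static final List<String> SERIALIZERS = Arrays.asList(
            "kryo", "fst", "protostuff", "jackson/databind");

    public static void main(String[] args) throws Exception {
        String java = System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
        for (String name : SERIALIZERS) {
            Process p = new ProcessBuilder(
                    java, "-cp", System.getProperty("java.class.path"),
                    "serializers.BenchmarkRunner",   // assumed entry point
                    "--bench", name)                 // hypothetical flag
                    .inheritIO()
                    .start();
            int exit = p.waitFor();                  // a renamed/missing serializer just fails its own fork
            if (exit != 0) System.err.println(name + " failed with exit code " + exit);
        }
    }
}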

Nate

Mar 9, 2014, 4:15:56 PM
to java-serializat...@googlegroups.com
On Sun, Mar 9, 2014 at 8:51 PM, Rüdiger Möller <moru...@gmail.com> wrote:
* The benchmark makes many runs and takes the MINIMUM time. This invites randomness. It also hides a serializer's memory waste: a serializer might trigger a GC on every run, yet get lucky on one single run and post an unrepresentative score.

Memory usage and GC are complex enough that it's probably better not to include their effects in the results. Taking only the fastest run might not be best, but including the slowest runs in the average is probably worse. Maybe we should average the fastest 25% or so of runs?

-Nate

Sam Pullara

Mar 9, 2014, 4:23:42 PM
to java-serializat...@googlegroups.com
I totally disagree that the effects of GC should be removed. The amount of GC that a serializer produces is very important. Perhaps run them all with a very small and a very large heap, to be able to distinguish the ones that are memory efficient. Also, you can run with -verbose:gc on and look at how much memory each serializer actually churns through over the life of the test. We definitely shouldn't cover it up...
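For example, the per-run GC activity of a fork could be captured with the standard management beans -- a sketch, where runMeasuredIterations() is a placeholder for the actual benchmark loop:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: count GCs and GC time around the measured loop using the standard JMX beans.
public class GcStatsSketch {
    public static void main(String[] args) {
        long countBefore = 0, timeBefore = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            countBefore += gc.getCollectionCount();
            timeBefore += gc.getCollectionTime();
        }

        runMeasuredIterations();   // placeholder for the serialize/deserialize loop under test

        long countAfter = 0, timeAfter = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            countAfter += gc.getCollectionCount();
            timeAfter += gc.getCollectionTime();
        }
        System.out.println("GCs: " + (countAfter - countBefore)
                + ", GC time: " + (timeAfter - timeBefore) + " ms");
    }

    static void runMeasuredIterations() { /* benchmark loop goes here */ }
}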

Sam




Rüdiger Möller

Mar 9, 2014, 4:52:26 PM
to java-serializat...@googlegroups.com
Nate,

I agree that without running each benchmark in a separate VM, GC is problematic, as you might get the garbage of a previous benchmark. But that's not the case now. Additionally, warmup has been increased, so JIT compilation is not part of the measurement.

It's not only GC, it's also L1 cache hits. By chance, e.g. Object.identityHashCode might happen to hit a single cache line in an open-addressing hash map without collisions. This can easily give a 10-15% speedup, but it will happen only 1 out of 10,000 times.

I already tested with averaging, and it's not like everything is turned upside down. The order of the speed chart is pretty much the same, with some minor nano-shifts here and there (one of them in the right place, of course ;-) ). But it is now much more reproducible and realistic.

Why would you want to average over the best 25% of runs?

-ruediger

Nate

Mar 9, 2014, 5:50:36 PM
to java-serializat...@googlegroups.com
On Sun, Mar 9, 2014 at 9:52 PM, Rüdiger Möller <moru...@gmail.com> wrote:

GC and memory usage can be affected by the rest of an app and the environment the app runs in. The more we measure things that are likely to vary in the real environments where these libs will actually be used, the less useful the results are. If we can control the effects (e.g. separate VMs, large heap) and the results are relatively on par with what they were before, I'm OK with it.

Using some percentage of the results would be to avoid skewing the results with runs that aren't actually representative, for whatever reasons. There are probably more things that could negatively affect a run than could make it "lucky". Maybe we could toss both the best and worst x%.
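A trimmed mean along those lines is easy to compute once the per-run times are recorded; a rough sketch (hypothetical helper, not the current runner's code):

import java.util.Arrays;

// Sketch: average only the runs that survive trimming x% at each end.
public class TrimmedMean {
    static double trimmedMean(long[] nanosPerRun, double trimFraction) {
        long[] sorted = nanosPerRun.clone();
        Arrays.sort(sorted);
        int drop = (int) (sorted.length * trimFraction);   // e.g. 0.05 drops 5% at each end
        long sum = 0;
        int kept = 0;
        for (int i = drop; i < sorted.length - drop; i++) {
            sum += sorted[i];
            kept++;
        }
        return kept == 0 ? Double.NaN : (double) sum / kept;
    }
}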

-Nate

Rüdiger Möller

Mar 10, 2014, 6:00:06 AM
to java-serializat...@googlegroups.com
I had a look in VisualVM, and the relevant serializers (I did not check all of them, of course) do not run into full GCs, only minor GCs (NewGen), so it is pretty deterministic as long as each test has its own VM. If a serializer allocates more memory than others, it will take a small hit in the results (<5%), which reflects real behaviour in production more closely than taking a percentage of the best runs.

Regarding heap size: as no full GC is involved, heap size does not matter; what matters is the NewGen size. The VM makes a best guess for the NewGen size depending on the max heap size and applies some runtime heuristics. We could hard-set the NewGen size using VM options; if you prefer, we can also increase the heap size, but it won't make that much of a difference. With a small NewGen you get more frequent NewGen GCs of short duration; with a large NewGen you get fewer, but those few take longer (which in turn makes results more indeterministic if you do *not* average).

I think some of your bad gut feeling about this is mostly caused by historical VM behaviour. With isolated tests and modern VMs, the effects of NewGen GC on the results are not that harsh. Since allocation hurts L1 locality a lot, the good serializers are the ones with a low allocation rate anyway; in other words, memory hogs are slower even when factoring out GC :-).

I'll put in a switch to choose the measurement method (min, avg), so we can defer the decision. Note that determining the best 25% would require temporarily recording all result values.
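For the min/avg switch nothing needs to be recorded per run; a sketch of such an accumulator (illustrative only, not the benchmark's actual code):

// Sketch: min and average tracked incrementally, so no per-run values are stored.
public class RunStats {
    private long min = Long.MAX_VALUE;
    private long sum;
    private long count;

    public void record(long nanos) {
        if (nanos < min) min = nanos;
        sum += nanos;
        count++;
    }

    public long min() { return min; }
    public double avg() { return count == 0 ? Double.NaN : (double) sum / count; }
}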

Tatu Saloranta

Mar 10, 2014, 1:09:58 PM
to java-serializat...@googlegroups.com
On Sun, Mar 9, 2014 at 1:15 PM, Nate <nathan...@gmail.com> wrote:
I am torn here: for the longest time I have been an advocate of including some of the GC overhead, because in real usage it is a significant component. But I also realize that its relative importance varies a lot based on both the amount of serialization being done (is it a major component or a minor one) and heap size / general memory pressure.

I think that the use of percentiles would make sense, and that although I do realize why minimums were chosen (it filters out much of the white noise), it is probably not the way to go.
Perhaps a simple median would work?

-+ Tatu +-

Tatu Saloranta

Mar 10, 2014, 1:14:34 PM
to java-serializat...@googlegroups.com
On Mon, Mar 10, 2014 at 3:00 AM, Rüdiger Möller <moru...@gmail.com> wrote:
I had a look in VisualVM, and the relevant serializers do not run into full GCs, only minor GCs (NewGen), so it is pretty deterministic as long as each test has its own VM. [...] Since allocation hurts L1 locality a lot, the good serializers are the ones with a low allocation rate anyway; in other words, memory hogs are slower even when factoring out GC :-).


I agree.
 
I'll put in a switch to choose the measurement method (min, avg), so we can defer the decision. Note that determining the best 25% would require temporarily recording all result values.


One possibility would be to use a library like Yammer metrics, which does a good job of calculating percentiles without retaining all individual measurements. And given that the numbers here are still relatively stable (there is random fluctuation, but the workload is very steady), I think that would give solid numbers.
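A sketch of what that could look like, assuming the Metrics 3.x API (the successor to Yammer metrics); the measured call below is just a placeholder:

import com.codahale.metrics.Histogram;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.UniformReservoir;

// Sketch, assuming the Metrics 3.x API (formerly Yammer metrics): a histogram backed by a
// bounded reservoir yields percentiles without keeping every sample.
public class PercentileSketch {
    public static void main(String[] args) {
        Histogram times = new Histogram(new UniformReservoir());
        for (int i = 0; i < 100000; i++) {
            times.update(measureOneIterationNanos());
        }
        Snapshot s = times.getSnapshot();
        System.out.printf("median=%.0f p75=%.0f p99=%.0f%n",
                s.getMedian(), s.get75thPercentile(), s.get99thPercentile());
    }

    static long measureOneIterationNanos() {
        return System.nanoTime() % 1000;   // placeholder load, not a real serializer call
    }
}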

As to GC: I also feel we should get some indication, because it DOES matter a lot in production.
And since, as you correctly point out, it is the YoungGen size (relative to the working memory needed) that affects GC overhead, maybe it would be possible to:

1. Get the percentage of time spent in GC per serializer and include it in the results.
2. Run two different configurations (low memory, high memory) with different settings (say, 16 MB vs. 256 MB for the young gen); or maybe have one (high?) as the default and allow the alternate set to be run optionally.

-+ Tatu +-

 

Rüdiger Möller

Mar 10, 2014, 2:21:39 PM
to java-serializat...@googlegroups.com
I haven't seen a real argument besides vague complaints. If each test runs in *its own VM*, it is the serializer that creates 99.999% of the allocations. So taking the average is the most natural way of measuring real production performance. It is also the method delivering stable run-to-run results. Any method producing large run-to-run jitter mathematically cannot be a good method (and please don't ask for averaging multiple benchmark runs, that's nonsense).

Remember that the test case itself is not that representative (ASCII strings, no doubles, no primitive arrays, few object structures, no hash maps), so let's stop going overboard with percentiles (<= what's the reasoning behind that anyway?). The benchmark only gives a rough impression of performance anyway.

IMO the method producing the *most stable* results is the best. Any measurement producing high run-to-run jitter is just unusable. Outliers result from minor GCs and frequently from HashMap/collection growth. A serializer creating more outliers than others will do so in production as well, so why hide it? GC is not magic: the higher the allocation rate, the more frequent NewGen GC gets, it's that simple. If a serializer is hit by that, improve it; users will profit.

E.g. if a library makes a significant effort to reuse objects, that worsens its "best case" run (the minimum) but improves its average performance. It would be kind of crazy if I got better benchmark results by actually worsening the code.

Last but not least: the results did not change that much (the order in the charts has not changed significantly), they are just more reproducible.

And even more: results also change significantly depending on the CPU's L1 cache size. On my notebook (smaller L1 caches) the differences between serializers are smaller than on my desktop i7 3770 (I don't even dare to talk about AMD results :-) ).

I think we just have to accept the fact that benchmarks give an indication of performance but don't deliver hard facts.

-ruediger


Rüdiger Möller

Mar 10, 2014, 2:25:34 PM
to java-serializat...@googlegroups.com
From Nate:
>If we can control the effects (e.g. separate VMs, large heap) and the results are relatively on par with what they were before, I'm OK with it.

That's the case.



Tatu Saloranta

Mar 10, 2014, 2:47:42 PM
to java-serializat...@googlegroups.com
Just to make sure: by percentile I am including the median, and it should be used instead of a simple average if that is practical. If we use a simple lib (as I said, Yammer metrics does a really nice job and is simple to use), we could easily also include top and/or bottom percentiles, to give an idea of the variability.
It does not add a significant amount of overhead, as it does not store the full result set, and it is used very successfully by monitoring tools.

But I am OK with a simple average as the next step; this really isn't a big deal for me, especially given that the big thing to get right is proper JVM isolation.

-+ Tatu +-




Rüdiger Möller

Mar 10, 2014, 3:01:42 PM
to java-serializat...@googlegroups.com
Hm, why do you think the median is better? We can try, but I am not sure it won't jitter a lot; we'll have to try. Adding a lib may introduce bias, so I'd prefer to compute this directly.

Agreed, one big thing is the VM separation. The other is being able to chart by format/feature. Some samples (not the final version; we still have to discuss which charts get published in the end, the queries are configurable):




Tatu Saloranta

Mar 10, 2014, 3:10:11 PM
to java-serializat...@googlegroups.com
Just because the distribution of deviations is not symmetric: spikes are much, much slower, or only a little bit faster.
So the average will be higher than the median.

But I have to admit that whether there is a really measurable difference _in practice_ is a valid question... and with a long enough run and a stable load, it likely would not be a meaningful difference.
And typically the really slow iterations are the first N runs (or some short period of time), which are easy to exclude with a brief warm-up run.

So I think I am actually fine with just the average. :-)

-+ Tatu +-

Kannan Goundan

Mar 10, 2014, 4:34:15 PM
to java-serializat...@googlegroups.com
I've just repeatedly heard that for most benchmarks in our field the median is better. I don't understand enough about statistics to know why, so I don't care *that* much. I agree that the mean is easier, but the median isn't that tricky or expensive to compute either.

We're only doing like 500 runs, right?  Can't we just pre-allocate a 500-entry array, store the results in the array, then use the O(n) median algorithm when we're done?
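Something like this would do (a rough sketch with made-up names; sorting is O(n log n) rather than the O(n) selection algorithm, which hardly matters for ~500 values):

import java.util.Arrays;

// Sketch: record up to a fixed number of runs in a preallocated array, take the median at the end.
public class MedianRecorder {
    private final long[] samples;
    private int n;

    public MedianRecorder(int maxRuns) { samples = new long[maxRuns]; }

    public void record(long nanos) {
        if (n < samples.length) samples[n++] = nanos;   // no allocation on the hot path
    }

    public double median() {
        if (n == 0) return Double.NaN;
        long[] copy = Arrays.copyOf(samples, n);
        Arrays.sort(copy);
        return n % 2 == 1 ? copy[n / 2] : (copy[n / 2 - 1] + copy[n / 2]) / 2.0;
    }
}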

Kannan Goundan

Mar 10, 2014, 4:37:31 PM
to java-serializat...@googlegroups.com
First off, thanks Rüdiger for the benchmark improvements (if you look through the archives, we've all wanted this stuff to happen but none of us got around to it).

Being able to chart by format/feature is cool, but I've always wanted to let the viewer of the web page control the pruning. So our benchmark tool would actually output all the results in some JavaScript array. The resulting page would then contain checkboxes and dropdowns that let the viewer prune the graphs down to only the results they care about.

There are a bunch of client-side JS/HTML charting libraries.  I haven't used any of them yet, but these looked pretty:
- http://nvd3.org/
- http://code.shutterstock.com/rickshaw/examples/
- http://www.chartjs.org/

I don't have much experience with HTML/JS, so I've always held out hope that someone else would do this :-)  But the recent flurry of activity has gotten me motivated again and I'd be willing to take a crack at it once Rüdiger's changes are in.

Rüdiger Möller

Mar 10, 2014, 5:05:04 PM
to java-serializat...@googlegroups.com
Preview https://github.com/RuedigerMoeller/fast-serialization/wiki/TestPage

@Nate: don't read too much into it; I ran the tests with the trunk of fst, and that's the main reason for the kryo-fst deviation. Additionally, the run was done with 100 iterations. I'll move back to the previous settings for the final run.

I think I'll check out the median option.

@Kannan
I changed the test duration to be time-based, so there are more than 500 iterations for the fast serializers. The original setup was too long for the slow ones and too short for the fast ones. However, I'll compute the median incrementally.



Sam Pullara

Mar 10, 2014, 5:47:50 PM
to java-serializat...@googlegroups.com
I'm a big fan of using Yammer metrics to report on the runs.

Sam

Rüdiger Möller

Mar 10, 2014, 6:10:31 PM
to java-serializat...@googlegroups.com
Yeah, I also thought about a dynamic query system, but that would be quite some effort. Additionally, you'd need a hosting site.
Another chart lib would be interesting; currently I can only add 18 bars to a chart before it starts failing.

What's also an issue is the lack of descriptions. I always wonder what exactly the difference is between all those protostuff and smile/* jackson/* flavours.


-ruediger

Rüdiger Möller

Mar 10, 2014, 6:12:29 PM
to java-serializat...@googlegroups.com
Well, I am suspicious whether another library would skew the results with its own allocations and such.
Maybe just dump the test results to disk and compute the stats outside the bench VM.



Kannan Goundan

Mar 10, 2014, 6:18:08 PM
to java-serializat...@googlegroups.com
GitHub provides free static HTML hosting for every project.  If you commit to the "gh-pages" branch of a project, it'll show up at http://username.github.io/reponame

If you can get the info dumped into an HTML file, I'll try and see what I can do from there.  In particular we need the benchmark results for each serializer, along with the properties of each serializer.

Kannan Goundan

Mar 10, 2014, 6:23:58 PM
to java-serializat...@googlegroups.com
What about allowing benchmark runs to be configured with three options: max-iterations, min-iterations, max-duration?  (Tangent: in case something takes a really long time, 'min-iterations' should maybe take precedence over 'max-duration'.)

That way we always have a cap on the number of results and can preallocate an array of size 'max-iterations'. Recording a result then involves zero allocations, an array bounds check, and a small fixed number of memory operations.
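A sketch of that loop (names and defaults are made up for illustration):

import java.util.Arrays;

// Sketch: cap a run by iterations and wall-clock time (whichever comes first),
// with min-iterations taking precedence so very slow serializers still get a few samples.
public class CappedRunSketch {
    static long[] runBenchmark(Runnable oneIteration,
                               int minIterations, int maxIterations, long maxDurationNanos) {
        long[] results = new long[maxIterations];   // preallocated: recording allocates nothing
        long deadline = System.nanoTime() + maxDurationNanos;
        int i = 0;
        while (i < maxIterations && (i < minIterations || System.nanoTime() < deadline)) {
            long t0 = System.nanoTime();
            oneIteration.run();
            results[i++] = System.nanoTime() - t0;
        }
        return Arrays.copyOf(results, i);   // only the filled portion
    }
}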

Sam Pullara

Mar 10, 2014, 6:46:21 PM
to java-serializat...@googlegroups.com
Throwing this out there because I ran across it:


Sam

Rüdiger Möller

Mar 10, 2014, 6:55:20 PM
to java-serializat...@googlegroups.com
Hum, I don't know, this might raise discussions about bias. Let's keep this as straightforward as possible.
I think I can experimentally find the maximum array size and preallocate 4 times that. After forcing a GC it will move to old space and will not affect the benchmark runtime at all.
There are also incremental algorithms for the median, so no recording is needed. You know, the test already creates a lot of data; I can imagine adding more numbers would actually not help in understanding the results ;-)
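For reference, an exact running median can be kept with two heaps -- a sketch; note it still holds (boxed) copies of all values, so a constant-memory estimator such as P-square would be the truly "no recording" variant:

import java.util.Collections;
import java.util.PriorityQueue;

// Sketch: exact running median with two heaps. It still keeps (boxed) copies of all values,
// so it trades the preallocated array for heap nodes rather than eliminating storage.
public class StreamingMedian {
    private final PriorityQueue<Long> lower = new PriorityQueue<Long>(16, Collections.reverseOrder()); // max-heap
    private final PriorityQueue<Long> upper = new PriorityQueue<Long>();                                // min-heap

    public void add(long v) {
        if (lower.isEmpty() || v <= lower.peek()) lower.add(v); else upper.add(v);
        // rebalance so the heap sizes differ by at most one
        if (lower.size() > upper.size() + 1) upper.add(lower.poll());
        else if (upper.size() > lower.size() + 1) lower.add(upper.poll());
    }

    public double median() {   // assumes at least one value was added
        if (lower.size() == upper.size()) return (lower.peek() + upper.peek()) / 2.0;
        return lower.size() > upper.size() ? lower.peek() : upper.peek();
    }
}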

Kannan Goundan

Mar 10, 2014, 7:00:42 PM
to java-serializat...@googlegroups.com
Yeah, preallocating 4x would also work fine.

But to simplify my earlier email a little, I only meant to say that each benchmark should run up to X seconds or Y iterations, whichever comes first.  I don't think this introduces bias.

And yeah, the incremental median option is probably fine, but just recording the results might be even less intrusive. We don't have to actually output all the numbers, though; we just have to wait until the end of the test run before computing the median over the array.

Rüdiger Möller

Mar 10, 2014, 8:32:51 PM
to java-serializat...@googlegroups.com
see inline ..


On Monday, 10 March 2014 23:18:08 UTC+1, Kannan Goundan wrote:
GitHub provides free static HTML hosting for every project.  If you commit to the "gh-pages" branch of a project, it'll show up at http://username.github.io/reponame


Cool, I was not aware of that :). I get happier each day about having moved away from gcode.
 
If you can get the info dumped into an HTML file, I'll try and see what I can do from there.  In particular we need the benchmark results for each serializer, along with the properties of each serializer.


I am currently using a bash script to collect all results into a single stats file. Additionally, I have already written a "BenchmarkExporter" that grabs all registered serializers plus their properties. It will be easy to dump this into some txt/JSON/CSV.
However, I'll need some time, as my job doesn't leave much spare time. I think I'll be finished next weekend.
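Such a dump could be as simple as the sketch below (the Result fields and CSV columns are made up, not the exporter's actual format):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

// Sketch: dump results plus serializer properties to CSV for the stats/chart step.
// The Result fields and column names are hypothetical.
public class ResultExporterSketch {
    static class Result {
        String name; String format; String type;
        double createNs, serNs, deserNs; int sizeBytes;
    }

    static void exportCsv(List<Result> results, String file) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(file));
        try {
            out.println("name,format,type,create_ns,ser_ns,deser_ns,size_bytes");
            for (Result r : results) {
                out.printf("%s,%s,%s,%.1f,%.1f,%.1f,%d%n",
                        r.name, r.format, r.type, r.createNs, r.serNs, r.deserNs, r.sizeBytes);
            }
        } finally {
            out.close();
        }
    }
}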

 


David Yu

Mar 11, 2014, 12:30:01 AM
to java-serializat...@googlegroups.com
On Tue, Mar 11, 2014 at 6:10 AM, Rüdiger Möller <moru...@gmail.com> wrote:
What's also an issue is the lack of descriptions. I always wonder what exactly the difference is between all those protostuff and smile/* jackson/* flavours.
Jackson and Protostuff provide multiple formats/codecs (the latter leverages the former for some of its formats).
Protostuff's runtime means reflection.  Jackson's databind means reflection.
Jackson's afterburner means runtime bytecode generation.

Btw, the manual variants need to go (we have too many frameworks/variants already).
The libraries are most likely not gonna be used manually ... outside of their authors maybe.

Also, if we're gonna take memory/GC into account ... note that some of the libraries explicitly re-use their components while others do not.

For example in this benchmark, Protostuff has a distinct advantage over Protobuf because the former re-uses its buffer, while the latter creates one every iteration.

Another example is JBoss Marshalling: it creates a specific input/output for the purposes of this benchmark in order to re-use components/buffers and avoid slowdowns from GC. I don't blame the author who did this (I could easily have done the same), because he is simply trying to level the playing field.



--
When the cat is away, the mouse is alone.
- David Yu

Tatu Saloranta

Mar 11, 2014, 2:37:34 AM
to java-serializat...@googlegroups.com
On Mon, Mar 10, 2014 at 9:30 PM, David Yu <david....@gmail.com> wrote:



On Tue, Mar 11, 2014 at 6:10 AM, Rüdiger Möller <moru...@gmail.com> wrote:
What's also an issue is the lack of descriptions. I always wonder what exactly the difference is between all those protostuff and smile/* jackson/* flavours.
Jackson and Protostuff provide multiple formats/codecs (the latter leverages the former for some of its formats).

and "smile" is a binary data format (of 'binary json' genre).
 
Protostuff's runtime means reflection.  Jackson's databind means reflection.
Jackson's afterburner means runtime bytecode generation.

One simple thing to change here, wrt naming, would be to start with the data format name, not the library name.
This would be an opportunity to unify the naming as well. We could replace 'databind' / 'runtime' with 'auto' (or 'automatic'). I'm not sure whether the distinction between reflection and code generation needs to be included in the name, or just mentioned as a side note.
 

Btw, the manual variants need to go (we have too many frameworks/variants already).
The libraries are most likely not gonna be used manually ... outside of their authors maybe.


I was about to say that it's fine to leave them out of the standard results, but I find some (the ones for libs I work on) useful for sanity checking.
Others may occasionally find them useful as well -- if one wants XML output (for example), a manual variant using the streaming API does yield significant savings, and the ability to measure the amount of benefit can be useful.

But I think it is the exception, not the rule, and requiring a custom run is fine. I just wouldn't want to lose the manual codecs altogether (they can also serve as code samples).
 
Also, if we're gonna take memory/GC into account ... note that some of the libraries explicitly re-use their components while others do not.

For example in this benchmark, Protostuff has a distinct advantage over Protobuf because the former re-uses its buffer, while the latter creates one every iteration.

Another example is JBoss Marshalling: it creates a specific input/output for the purposes of this benchmark in order to re-use components/buffers and avoid slowdowns from GC. I don't blame the author who did this (I could easily have done the same), because he is simply trying to level the playing field.


That last one I am not sure about -- it is OK if such usage would be reasonable outside of the benchmark as well. But I don't like benchmark-only optimizations, so I hope this is just something an advanced user would do for her use case.

-+ Tatu +-

Rüdiger Möller

Mar 11, 2014, 6:22:22 AM
to java-serializat...@googlegroups.com
Thanks for the enlightenment (I just changed protostuff/protobuf to BINARY_CROSSLANG). Maybe each test should add a one-liner description. I already added a SerFormat to each test containing a feature description and a String.

Regarding manually optimized variants: I would leave them in the benchmarks but mask them out in all stats except one "manually optimized" chart. It's at least interesting to see how big the gap between manual and automatic serialization is.

Reuse is crucial for good performance. However, a lot of libraries do not support reuse of their streams (e.g. the Java built-in serialization). In FST, for example, this is actively supported, as the class registry and object-identity maps must be reset (which can be costly). So if a library supports reuse, it should be used in the bench; it's a very important feature, e.g. when doing high-performance messaging.

I had a look at JBoss yesterday; from what I remember, it just creates 3 configurations. You would also use it that way in a real-world app. The same applies to FstConfiguration or the Kryo object: they are usually initialized once per application or per thread. But I agree it's worth a look whenever a serializer starts performing extraordinarily fast.
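The usual reuse pattern looks roughly like this sketch (the codec interface and reset call are placeholders; the concrete API differs per library):

// Sketch: the usual reuse pattern, one codec/configuration per thread, reset between messages.
// The interface and reset call are placeholders; the concrete API differs per library.
public class ReusePatternSketch {
    interface ReusableCodec {
        byte[] write(Object o);
        Object read(byte[] bytes);
        void reset();   // must really clear class registry / identity maps
    }

    static final ThreadLocal<ReusableCodec> CODEC = new ThreadLocal<ReusableCodec>() {
        @Override protected ReusableCodec initialValue() { return createCodec(); }
    };

    static byte[] serialize(Object o) {
        ReusableCodec c = CODEC.get();
        c.reset();   // if reset does not fully clear state, it works in a benchmark but not in real apps
        return c.write(o);
    }

    static ReusableCodec createCodec() {
        throw new UnsupportedOperationException("library-specific setup goes here");
    }
}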

Rüdiger Möller

Mar 11, 2014, 6:30:30 AM
to java-serializat...@googlegroups.com

and "smile" is a binary data format (of 'binary json' genre).

I probably have not categorized all libs correctly. I'll do a dump this evening and post it, so anyone can spot wrong categorizations.
 
 
Protostuff's runtime means reflection.  Jackson's databind means reflection.
Jackson's afterburner means runtime bytecode generation.


I put them in the same category (ZERO_EFFORT). How the automatism is implemented is not important from a user's point of view. I also don't want to make the categorization too fine-grained.

 
One simple thing to change here, wrt naming, would be to start with the data format name, not the library name.
This would be an opportunity to unify the naming as well. We could replace 'databind' / 'runtime' with 'auto' (or 'automatic'). I'm not sure whether the distinction between reflection and code generation needs to be included in the name, or just mentioned as a side note.

Yes, this would be a very good thing to do. We should also denote in the name whether a serializer supports full object graphs or only cycle-free trees. I added this to the feature descriptions but did not unify the naming for now (maybe next iteration).

 
 

Btw, the manual variants need to go (we have too many frameworks/variants already).
The libraries are most likely not gonna be used manually ... outside of their authors maybe.


I was about to say that it's fine to leave them out of the standard results, but I find some (the ones for libs I work on) useful for sanity checking.


I agree; I'll just remove them from all charts except the "manual" chart, see the preliminary https://github.com/RuedigerMoeller/fast-serialization/wiki/TestPage

 
Another example is JBoss Marshalling: it creates a specific input/output for the purposes of this benchmark in order to re-use components/buffers and avoid slowdowns from GC. I don't blame the author who did this (I could easily have done the same), because he is simply trying to level the playing field.


That last one I am not sure about -- it is OK if such usage would be reasonable outside of the benchmark as well. But I don't like benchmark-only optimizations, so I hope this is just something an advanced user would do for her use case.

As said, being able to reuse streams/buffers is an important feature, and many libs do not support it (no reset method). However, one needs to verify that it actually works correctly. E.g. full serializers build up a mapping id => classname; if this is not cleared from run to run, it would work in a benchmark but not in real apps.

Nate

Mar 11, 2014, 10:11:08 AM
to java-serializat...@googlegroups.com
On Tue, Mar 11, 2014 at 11:22 AM, Rüdiger Möller <moru...@gmail.com> wrote:

It's great you are keeping the project alive! The organization is nice and easy to read. What is "cost of features"?

When choosing a serialization lib, if forward and/or backward compatibility is required, many libs are taken out of the equation. This is such a common requirement that it might be helpful to people if we had graphs for it. It's probably not important to actually exercise reading older or newer bytes; just running the serializers configured to do so is probably enough (though that makes it harder to verify).

Backward compatibility (reading old bytes with newer classes) with no forward compatibility is a common requirement and has different performance characteristics than doing both. I've never seen the need for only forward compatibility, so I'd vote for having two sections: backward and backward+forward. Then again, I don't have much free time these days so I'll take whatever you guys are willing to implement. :)

Compatibility restrictions vary between libs, e.g. backward compatibility as long as you don't remove fields, forward compatibility as long as you don't change field types, etc. These are lower-level details that I think are less important than comparing only the serializers that likely fit your needs, running in a mode similar to your needs.

FWIW, backward compatibility in Kryo is done by TaggedFieldSerializer, and backward+forward is done by CompatibleFieldSerializer (with pretty terrible overhead).

Cheers!
-Nate

Rüdiger Möller

Mar 11, 2014, 10:50:21 AM
to java-serializat...@googlegroups.com
Hi Nate,

Cost of features compares:
- manual
vs
- knowledge of classes in advance (kryo, fst: class registration, setReferences(false); protobuf, msgpack: pregenerated or precompiled schema)
vs
- zero knowledge (message format is detected at runtime, no class preregistration), but no shared refs
vs
- full graph, full serialization, zero knowledge/configuration in advance

I plan to add a short description to each test.

The versioning thing would also be interesting; however, I don't want to invest unlimited time for now ;-). I actually have not thought in depth about versioning. I'll add an EnumSet for feature annotations; library owners can then fill in the appropriate feature annotations. Once enough libs are annotated, we can easily add more charts. Kannan plans to add a dynamic query interface, so this might also reduce data and chart clutter.

Cheers,
rüdiger


Rüdiger Möller

Mar 12, 2014, 4:43:23 PM
to java-serializat...@googlegroups.com
Finished my work and added a pull request. The median is computed, but I still stick with the average. Looking into the data, one finds that the significant outliers are always produced by minor GCs.

Now consider a memory-wasting lib's results over 8 runs (numbers amplified to show the effect more drastically):

[100, 102, 105, 105, 106, 107, 50000, 50000]
(2 GCs)
Median: 105.5 ns, avg: 12,578 ns

Another lib with zero allocation has:

[102, 102, 105, 105, 106, 107, 108, 110]
Median: 105.5 ns, avg: 105.6 ns

I am dumping the median, Q1, Q3 and the average-minus-median difference during the run, so you can see the avg-median gap (3 series: init, write, read).

The good news is that the good performers are also the ones with efficient memory management, so from what I can see it does not make a difference in the order they come in.
The difference between the average and the median ("deviation") gives a pretty stable and solid indication of the amount of memory allocation happening in a library. Naturally the "read" test needs to create the objects it reads, so the deviation is larger there.
Note that I have updated fst to 1.42, which gives slightly better results. These are not caused by the measurement change but by some changes squeezing out some additional nanos :-).

current results:

