benchmark overhead


Nate

Apr 27, 2012, 2:53:17 PM
to java-serializat...@googlegroups.com
On Thu, Apr 26, 2012 at 8:00 PM, Nate <nathan...@gmail.com> wrote:
Here it is differently. Let's say that we know A is exactly 10x as fast as B (A is 100ms, B is 1000ms).

Now we're going to add an overhead of 250ms:
benchmark of A = 100ms + 250ms = 350ms
benchmark of B = 1000ms + 250ms = 1250ms
-> conclusion: "A is 3.57x as fast as B"

Let's instead add an overhead of 100ms:
benchmark of A = 100ms + 100ms = 200ms
benchmark of B = 1000ms + 100ms = 1100ms
-> conclusion: "A is 5.50x as fast as B"

Finally, add an overhead of 10000ms:
benchmark of A = 100ms + 10000ms = 10100ms
benchmark of B = 1000ms + 10000ms = 11000ms
-> conclusion: "A and B are about as fast"

As you can see, minimizing the overhead is important. We want to compare the serializers' code. Any overhead that we can avoid should be avoided, else it can dramatically affect the results.
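
The arithmetic above can be re-run with other numbers using a small sketch like the following. `OverheadRatio` and `apparentSpeedup` are made-up names for illustration, not part of the benchmark project; the only formula used is (B + overhead) / (A + overhead).

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the benchmark code.
final class OverheadRatio {
    /** Ratio a benchmark would report when both timings carry the same fixed overhead. */
    static double apparentSpeedup(double fastMs, double slowMs, double overheadMs) {
        return (slowMs + overheadMs) / (fastMs + overheadMs);
    }

    public static void main(String[] args) {
        // A = 100ms, B = 1000ms, as in the example above.
        double[] overheads = {0, 100, 250, 10000};
        for (double overhead : overheads) {
            System.out.printf(Locale.ROOT, "overhead %.0f ms -> A looks %.2fx as fast as B%n",
                    overhead, apparentSpeedup(100, 1000, overhead));
        }
    }
}
```

The true ratio (10x) only shows up when the overhead is zero; every added millisecond of shared overhead pulls the reported ratio toward 1.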

As described above, reducing overhead in microbenchmarks is essential for comparisons to be useful. I'd like to continue the discussion of overhead and how it could be reduced in our benchmarks.

1) MediaItemBenchmark defines a TestCase named "Serialize" that does:
      <start time>
      for (int i = 0; i < iterations; i++)
      {
              Object obj = transformer.forward(value);
              serializer.serialize(obj);
      }
      <end time>
The transformer.forward() call makes a copy of the value, which is then serialized. I think we should move making the copy outside of the timed code to reduce overhead. The new code would look like:
      Object[] objects = new Object[iterations];
      for (int i = 0; i < iterations; i++)
      {
              objects[i] = transformer.forward(value);
      }
      <start time>
      for (int i = 0; i < iterations; i++)
      {
              serializer.serialize(objects[i]);
      }
      <end time>
The same would be done to MediaStreamBenchmark.
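
For anyone who wants to try the idea outside the project, here is a minimal self-contained sketch of change #1. It is not the MediaItemBenchmark code: the class name `PreparedSerializeTiming`, the `copier` and `serializer` parameters, and the `sink` field are stand-ins I made up, and the real Transformer and Serializer types have their own interfaces.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Sketch only: copier stands in for transformer.forward(), serializer for serializer.serialize().
final class PreparedSerializeTiming {
    /** Written after the loop so the JIT cannot discard the serialize calls as dead code. */
    static volatile long sink;

    /** Returns nanoseconds spent in the serialize loop only; copies are made beforehand. */
    static long timeSerialize(UnaryOperator<Object> copier, Function<Object, byte[]> serializer,
                              Object value, int iterations) {
        // Untimed: create one fresh copy per iteration up front.
        Object[] objects = new Object[iterations];
        for (int i = 0; i < iterations; i++) {
            objects[i] = copier.apply(value);
        }

        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            totalBytes += serializer.apply(objects[i]).length;
        }
        long elapsed = System.nanoTime() - start;

        sink = totalBytes;
        return elapsed;
    }

    public static void main(String[] args) {
        int iterations = 100_000;
        long nanos = timeSerialize(
                v -> new StringBuilder((String) v).toString(),
                v -> ((String) v).getBytes(StandardCharsets.UTF_8),
                "media item", iterations);
        System.out.println("ns per serialize: " + (double) nanos / iterations);
    }
}
```

The cost is the `Object[]` holding every copy at once, which is why heap size comes up next.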

2) Change #1 will cause more RAM to be used. I suggest we raise the max heap size from 16mb to something much higher (256? 512?). GC could easily and silently ruin results, so we should try to avoid it. It is difficult enough to benchmark the serializers and get meaningful numbers; introducing such low memory constraints only complicates things unnecessarily.

3) I suggest we remove the test named "SerializeSameObject", since its sole purpose appears to be to serialize without the overhead that we just removed from "Serialize" with change #1.

4) We have 3 tests for deserialization: Deserialize, DeserializeAndCheck, and DeserializeAndCheckShallow. Deserialize just does deserialization, so that is good. DeserializeAndCheckShallow makes a shallow copy of the deserialized object. What use does this metric have? DeserializeAndCheck makes a deep copy of the deserialized object. I believe this is done because protobuf/activemq+alt does lazy deserialization. Making a deep copy is quite a bit of overhead to add to ALL the benchmarks just so protobuf/activemq+alt can be included! This is very important because totalTime is computed as timeSerialize + timeDeserializeAndCheck, so the overhead is skewing all of the results. Rather than negatively affect all other results, I suggest we remove both DeserializeAndCheck and DeserializeAndCheckShallow, then we ensure that protobuf/activemq+alt calls all the getters so all deserialization occurs (which isn't the same as a deep copy anyway). This is the only way to fairly benchmark protobuf/activemq+alt along with the rest of the serializers. We should prominently post a message (possibly with an asterisk in the results numbers and charts) that explains that protobuf/activemq+alt supports lazy deserialization, which is a really awesome feature no one else has, so deserialization times could be much lower when using that lib.
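
To illustrate what "calls all the getters so all deserialization occurs" could look like, here is a toy sketch. `LazyMessage` and `ForceDecode` are invented for this example and are not the protobuf/activemq+alt API; a real message would need one getter call per field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical lazily decoded message: the bytes are only turned into a String on first access.
final class LazyMessage {
    private final byte[] raw;
    private String title;   // decoded on demand
    int decodeCount;        // for illustration: how many times decoding actually ran

    LazyMessage(byte[] raw) {
        this.raw = raw;
    }

    String getTitle() {
        if (title == null) {
            title = new String(raw, StandardCharsets.UTF_8);
            decodeCount++;
        }
        return title;
    }
}

final class ForceDecode {
    /** Deserialize step for the benchmark: construct, then touch the getter so decoding is timed. */
    static LazyMessage deserializeFully(byte[] raw) {
        LazyMessage message = new LazyMessage(raw);
        message.getTitle();
        return message;
    }

    public static void main(String[] args) {
        byte[] raw = "media title".getBytes(StandardCharsets.UTF_8);
        System.out.println("lazy decodes:   " + new LazyMessage(raw).decodeCount);
        System.out.println("forced decodes: " + deserializeFully(raw).decodeCount);
    }
}
```

Touching the getters forces the work without allocating a second object graph, which is the difference from the deep copy done by DeserializeAndCheck.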

5) This item is related to the ongoing discussion about buffer reuse. Currently we allocate a new ByteArrayOutputStream in Serializer#outputStream. This allocation is overhead that could be avoided. Also, we allocate at a default size of 512, which causes additional overhead to grow the buffer for any object graph that serializes to more than this. Even for media.1.cks, there are 7 serializers whose output is > 512 bytes, which is extremely unfair.
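
One possible shape for avoiding both the per-call allocation and the growth step is sketched below. This is not the project's Serializer class: `ReusedBuffer` is a made-up name, and the 4096 initial size is only an assumed value that is larger than any output in the data set. It relies on `ByteArrayOutputStream.reset()`, which discards the accumulated bytes but keeps the already allocated buffer.

```java
import java.io.ByteArrayOutputStream;

// Sketch of a reusable, pre-sized output buffer (names and sizes are assumptions).
final class ReusedBuffer {
    // Assumed to be large enough that no serializer has to grow the buffer.
    private static final int INITIAL_SIZE = 4096;

    private final ByteArrayOutputStream out = new ByteArrayOutputStream(INITIAL_SIZE);

    /** Returns the same stream every time, emptied but with its backing array retained. */
    ByteArrayOutputStream outputStream() {
        out.reset();
        return out;
    }

    public static void main(String[] args) {
        ReusedBuffer buffer = new ReusedBuffer();
        byte[] payload = new byte[700];   // stand-in for serializer output larger than 512 bytes

        for (int i = 0; i < 3; i++) {
            ByteArrayOutputStream stream = buffer.outputStream();
            stream.write(payload, 0, payload.length);
            System.out.println("run " + i + " wrote " + stream.size() + " bytes");
        }
    }
}
```

Whether reuse is fair to every serializer is exactly the point under discussion; the sketch only shows that the mechanism itself is cheap to add.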

-Nate



Tatu Saloranta

Apr 27, 2012, 3:54:44 PM
to java-serializat...@googlegroups.com
I agree with this, assuming we do not require "POJO to X"
functionality (i.e. having to start with POJO).
It is not explicitly defined, but I assume this is the thinking
already, i.e. start with whatever item abstraction is used, which may
be POJO or package-generated class (for thrift, protobuf etc)

> 2) Change #1 will cause more RAM to be used. I suggest we raise the max heap
> size from 16mb to something much higher (256? 512?). GC could easily and

JVM default is 64 megs. Are we lowering it explicitly?

> silently ruin results, so we should try to avoid it. It is difficult enough
> to benchmark the serializers and get meaningful numbers, introducing such
> low memory constraints only complicates unnecessarily.

I am fine with this.

But I would want to include actual normal GC costs in measurements,
since that will be incurred for real production code as well. And
inclusion then rewards codecs that are careful with their temporary
object creation.
I mention this as I have seen some benchmarking articles recommend
trying to remove GC overhead, which I don't agree with, at least not
as general advice.

> 3) I suggest we remove the test named "SerializeSameObject", since its sole
> purpose appears to be to serialize without the overhead that we just removed
> from "Serialize" with change #1.

Agreed. I think this was due to some codec ("optimized" protobuf by
activemq) abusing knowledge that same item was being serialized.
However, I would probably vote for just removing such codecs, to avoid
having to create fresh items.

That is, I think "new vs same" should be collated into just one test, number.

>
> 4) We have 3 tests for deserialization: Deserialize, DeserializeAndCheck,
> and DeserializeAndCheckShallow. Deserialize just does deserialization, so
> that is good. DeserializeAndCheckShallow makes a shallow copy of the
> deserialized object. What use does this metric have? DeserializeAndCheck
> makes a deep copy of the deserialized object. I believe this is done because
> protobuf/activemq+alt does lazy deserialization. Making a deep copy is quite
> a bit of overhead to add to ALL the benchmarks just so protobuf/activemq+alt
> can be included! This is very important because totalTime is computed as
> timeSerialize + timeDeserializeAndCheck, so the overhead is skewing all of
> the results. Rather than negatively affect all other results, I suggest we
> remove both DeserializeAndCheck and DeserializeAndCheckShallow, then we
> ensure that protobuf/activemq+alt calls all the getters so all
> deserialization occurs (which isn't the same as a deep copy anyway). This is
> the only way to fairly benchmark protobuf/activemq+alt along with the rest
> of the serializers. We should prominently post a message (possibly with an
> asterisk in the results numbers and charts) that explains that
> protobuf/activemq+alt supports lazy deserialization, which is a really
> awesome feature no one else has, so deserialization times could be much
> lower when using that lib.

As I said, I think we could just rm that one codec. It's not faster
than standard one anyway (except when it can optimize out calls, which
is fishy).

One more suggestion: another way to reduce relative importance of
overhead would be to increase data size. And of two dimensions
available -- making value itself bigger; using sequences of items -- I
think latter would be easier to use. Either way, ability to generate
permutations would be very useful so that we can create bigger data
sets; and vary things like inclusion of non-ASCII characters.
And the case of single item can be thought of simply as a subset of
more general sequence test... so we might be able to automate tests of
1, 10, 100 (or such) item cases.
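
As a rough sketch of what generating such sequences could look like: the class `ItemSequences` and the use of plain strings as "items" are assumptions for illustration only; the real benchmark would build media items instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator: strings stand in for the benchmark's media items.
final class ItemSequences {
    /** Builds count distinct titles; every other one includes non-ASCII characters. */
    static List<String> titles(int count) {
        List<String> items = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            String base = "Media item " + i;
            items.add(i % 2 == 0 ? base : base + " \u00e9\u00f8\u4e2d");
        }
        return items;
    }

    public static void main(String[] args) {
        // The single-item case is just the first entry of the general sequence test.
        for (int count : new int[] {1, 10, 100}) {
            List<String> batch = titles(count);
            System.out.println(count + " items, last title has "
                    + batch.get(count - 1).length() + " chars");
        }
    }
}
```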

This has been talked about often, and I realize that if it was trivial
one of us would have done it. But I want to keep it in discussion, so
eventually someone is itching badly enough to do it. :-D

-+ Tatu +-

Nate

Apr 27, 2012, 7:29:02 PM
to java-serializat...@googlegroups.com

Yeah, the change would only be to create the objects outside the timed code, using the same transformer mechanism to create the objects.

>> 2) Change #1 will cause more RAM to be used. I suggest we raise the max heap
>> size from 16mb to something much higher (256? 512?). GC could easily and
>
> JVM default is 64 megs. Are we lowering it explicitly?

Yup, to 16mb.

>> silently ruin results, so we should try to avoid it. It is difficult enough
>> to benchmark the serializers and get meaningful numbers, introducing such
>> low memory constraints only complicates unnecessarily.
>
> I am fine with this.
>
> But I would want to include actual normal GC costs in measurements,
> since that will be incurred for real production code as well.

How could we measure GC costs? How could we quantify "normal"?

> And
> inclusion then rewards codecs that are careful with their temporary
> object creation.
> I mention this as I have seen some benchmarking articles recommend
> trying to remove GC overhead, which I don't agree with, at least not
> as general advice.

I think GC overhead has no place in a benchmark. There are already so many variables, we can't possibly recreate someone's production scenario. Sticking to just measuring the serializers is plenty hard enough! Performance could be degraded in so many ways, we should just try to show optimal performance.

>> 3) I suggest we remove the test named "SerializeSameObject", since its sole
>> purpose appears to be to serialize without the overhead that we just removed
>> from "Serialize" with change #1.
>
> Agreed. I think this was due to some codec ("optimized" protobuf by
> activemq) abusing knowledge that same item was being serialized.
> However, I would probably vote for just removing such codecs, to avoid
> having to create fresh items.
>
> That is, I think "new vs same" should be collated into just one test, number.

Ok, I'll go ahead and commit the changes for #1 and #3. I'll post a message to the list for review.

>>
>> 4) We have 3 tests for deserialization: Deserialize, DeserializeAndCheck,
>> and DeserializeAndCheckShallow. Deserialize just does deserialization, so
>> that is good. DeserializeAndCheckShallow makes a shallow copy of the
>> deserialized object. What use does this metric have? DeserializeAndCheck
>> makes a deep copy of the deserialized object. I believe this is done because
>> protobuf/activemq+alt does lazy deserialization. Making a deep copy is quite
>> a bit of overhead to add to ALL the benchmarks just so protobuf/activemq+alt
>> can be included! This is very important because totalTime is computed as
>> timeSerialize + timeDeserializeAndCheck, so the overhead is skewing all of
>> the results. Rather than negatively affect all other results, I suggest we
>> remove both DeserializeAndCheck and DeserializeAndCheckShallow, then we
>> ensure that protobuf/activemq+alt calls all the getters so all
>> deserialization occurs (which isn't the same as a deep copy anyway). This is
>> the only way to fairly benchmark protobuf/activemq+alt along with the rest
>> of the serializers. We should prominently post a message (possibly with an
>> asterisk in the results numbers and charts) that explains that
>> protobuf/activemq+alt supports lazy deserialization, which is a really
>> awesome feature no one else has, so deserialization times could be much
>> lower when using that lib.
>
> As I said, I think we could just rm that one codec. It's not faster
> than standard one anyway (except when it can optimize out calls, which
> is fishy).

Personally, I think the lazy deserialization is a neat feature. I guess I'm ok with removing it, though it would be nice if it could be included in a way that is fair to all the other serializers.

Should we go ahead and remove protobuf/activemq+alt, deser+shallow, and deser+deep?

> One more suggestion: another way to reduce relative importance of
> overhead would be to increase data size. And of two dimensions
> available -- making value itself bigger; using sequences of items -- I
> think latter would be easier to use. Either way, ability to generate
> permutations would be very useful so that we can create bigger data
> sets; and vary things like inclusion of non-ASCII characters.
> And the case of single item can be thought of simply as a subset of
> more general sequence test... so we might be able to automate tests of
> 1, 10, 100 (or such) item cases.
>
> This has been talked about often, and I realize that if it was trivial
> one of us would have done it. But I want to keep it in discussion, so
> eventually someone is itching badly enough to do it. :-D

True, if the overhead doesn't occur during a single serialization (eg buffer growing).

-Nate

Nate

Apr 27, 2012, 11:47:55 PM
to java-serializat...@googlegroups.com

David Yu

Apr 28, 2012, 12:58:44 AM
to java-serializat...@googlegroups.com
Since it seems there is not much interest in capturing the overhead when the message size > buffer size (Nate is trying so hard to hide it), then make sure that the default buffer size is large enough (for all datasets) that no serializer will need to resize/expand.
It's either every serializer gets overhead, or none at all.




--
When the cat is away, the mouse is alone.
- David Yu

Nate

Apr 28, 2012, 1:12:38 AM
to java-serializat...@googlegroups.com
Agreed. Holy shit we agreed on something...

Tatu Saloranta

May 1, 2012, 1:28:54 PM
to java-serializat...@googlegroups.com
I don't think we need to measure it separately. Just let it be there.
All VMs do a steady amount of young generation collections, with
predictable overhead (at least for tests, since we have steady
throughput).
All we are doing is removing other sources of garbage production and doing
about 100% codec work, which should give a reasonable representation of
its contribution to overall costs.

The only thing we may want to eliminate are Old generation
collections, since they are rare enough to be problematic from
measurement perspective.

Having said all that, maybe this is a trivial detail -- it is rare to
see more than a single-digit percentage of time spent on GC in production, when
memory size is not artificially limited to unrealistic limits (which I
think 16 megs is).

One more way to test this, then, would be to compare heap of 64 megs
with, say, 1 gig; and ensure most of it is for young gen (eden). Since
Eden area is split in two, it'd have at most about 24 megs in first
case, and about 500 megs in second case. Difference between results
should give some estimation of GC overhead.
There are other tools, too, if anyone is interested; but this should
give order-of-magnitude estimation. And difference really should be no
more than 5% or so, probably even less.
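
As one example of the "other tools": the JVM exposes accumulated collection time through the standard `java.lang.management` API, so a run can report how much of its wall-clock time went to GC. The sketch below is only an illustration; `GcShare` and the allocation loop are made up, and the numbers it prints depend entirely on the heap settings and collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative only: reports GC time accumulated while a stand-in workload runs.
final class GcShare {
    /** Sum of accumulated collection time in ms over all collectors; undefined (-1) values are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();

        long sink = 0;
        for (int i = 0; i < 2_000_000; i++) {
            byte[] garbage = new byte[256];   // stand-in for a codec's temporary objects
            sink += garbage.length;
        }

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long gcMs = totalGcMillis() - gcBefore;
        System.out.println("elapsed ms: " + elapsedMs + ", GC ms: " + gcMs + ", sink: " + sink);
    }
}
```

Running the same workload with a small and a large max heap and comparing the two reports gives the kind of order-of-magnitude estimate described above. Note that the JIT may optimize away some of the stand-in allocations, so a real codec workload is a better test.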

>
>>
>> And
>> inclusion then rewards codecs that are careful with their temporary
>> object creation.
>> I mention this as I have seen some benchmarking articles recommend
>> trying to remove GC overhead, which I don't agree with, at least not
>> as general advice.
>
>
> I think GC overhead has no place in a benchmark. There are already so many
> variables, we can't possibly recreate someone's production scenario.
> Sticking to just measuring the serializers is plenty hard enough!
> Performance could be degraded in so many ways, we should just try to show
> optimal performance.

I think you may not quite understand how GC works then.

What I say is the opposite: why try to eliminate GC when it is more work,
hard(er) to guarantee, and ultimately changes performance to be LESS
like actual real-life performance? Why would we try to spend more time
on making results less relevant? No java service runs without GC
activity, it is a fact of life.

If we want, we can start each test right after manual GC, to give
equal starting point. But beyond that, let's use moderate heap size
and stop worrying about GC part of the results.

-+ Tatu +-

Tatu Saloranta

May 1, 2012, 1:32:21 PM
to java-serializat...@googlegroups.com
On Fri, Apr 27, 2012 at 4:29 PM, Nate <nathan...@gmail.com> wrote:
>
>
> On Fri, Apr 27, 2012 at 12:54 PM, Tatu Saloranta <tsalo...@gmail.com>
...
>> As I said, I think we could just rm that one codec. It's not faster
>> than standard one anyway (except when it can optimize out calls, which
>> is fishy).
>
> Personally, I think the lazy deserialization is a neat feature. I guess I'm
> ok with removing it, though it would be nice if it could be included in a
> way that is fair to all the other serializers.

It is a nice feature if it is useful for some use case. I just do not
think it is relevant for the use case measured here, since (AFAIK) we are
trying to model a process of incoming variable messages and responses.
That we may be only using one item is a flaw in the test itself.

>
> Should we go ahead and remove protobuf/activemq+alt, deser+shallow, and
> deser+deep?

I think so.

>> One more suggestion: another way to reduce relative importance of
>> overhead would be to increase data size. And of two dimensions
>> available -- making value itself bigger; using sequences of items -- I
>> think latter would be easier to use. Either way, ability to generate
>> permutations would be very useful so that we can create bigger data
>> sets; and vary things like inclusion of non-ASCII characters.
>> And the case of single item can be thought of simply as a subset of
>> more general sequence test... so we might be able to automate tests of
>> 1, 10, 100 (or such) item cases.
>>
>> This has been talked about often, and I realize that if it was trivial
>> one of us would have done it. But I want to keep it in discussion, so
>> eventually someone is itching badly enough to do it. :-D
>
> True, if the overhead doesn't occur during a single serialization (eg buffer
> growing).

I think this is doable, since individual item size is 200 - 500 bytes;
and 100 of them (at most about 50,000 bytes) would still fit in a 64k block,
which is not an unreasonable default size.

-+ Tatu +-

Nate

May 1, 2012, 1:38:45 PM
to java-serializat...@googlegroups.com
On Tue, May 1, 2012 at 10:28 AM, Tatu Saloranta <tsalo...@gmail.com> wrote:
>> I think GC overhead has no place in a benchmark. There are already so many
>> variables, we can't possibly recreate someone's production scenario.
>> Sticking to just measuring the serializers is plenty hard enough!
>> Performance could be degraded in so many ways, we should just try to show
>> optimal performance.
>
> I think you may not quite understand how GC works then.
>
> What I say is the opposite: why try to eliminate GC when it is more work,
> hard(er) to guarantee, and ultimately changes performance to be LESS
> like actual real-life performance? Why would we try to spend more time
> on making results less relevant? No java service runs without GC
> activity, it is a fact of life.

I meant that we should try to minimize the effects of GC on the benchmarks. We shouldn't try to mimic Real Life environments, there are too many variables.

> If we want, we can start each test right after manual GC, to give
> equal starting point. But beyond that, let's use moderate heap size
> and stop worrying about GC part of the results.

We already do manual GC between tests.

It sounds like Tatu and I would like to see a larger max heap size (64mb or more) and Kannan and David would like to stay at 16mb. What to do?

-Nate

Tatu Saloranta

May 1, 2012, 9:09:27 PM
to java-serializat...@googlegroups.com
On Tue, May 1, 2012 at 10:38 AM, Nate <nathan...@gmail.com> wrote:
> On Tue, May 1, 2012 at 10:28 AM, Tatu Saloranta <tsalo...@gmail.com>
> wrote:
>>
>> > I think GC overhead has no place in a benchmark. There are already so
>> > many
>> > variables, we can't possibly recreate someone's production scenario.
>> > Sticking to just measuring the serializers is plenty hard enough!
>> > Performance could be degraded in so many ways, we should just try to
>> > show
>> > optimal performance.
>>
>> I think you may not quite understand how GC works then.
>>
>> What I say is the opposite: why try to eliminate GC when it is more work,
>> hard(er) to guarantee, and ultimately changes performance to be LESS
>> like actual real-life performance? Why would we try to spend more time
>> on making results less relevant? No java service runs without GC
>> activity, it is a fact of life.
>
>
> I meant that we should try to minimize the effects of GC on the benchmarks.
> We shouldn't try to mimic Real Life environments, there are too many
> variables.

Right. I think we agree in a round-about way then.
I don't want to add complexity that may or may not bring tests closer
to whatever our target prod environment would be. :)

>> If we want, we can start each test right after manual GC, to give
>> equal starting point. But beyond that, let's use moderate heap size
>> and stop worrying about GC part of the results.
>
> We already do manual GC between tests.
>
> It sounds like Tatu and I would like to see a larger max heap size (64mb or
> more) and Kannan and David would like to stay at 16mb. What to do?

Hmmh. What was the point of using a smaller heap again... This will
force more frequent minor garbage collections, increasing relative
time spent on GC, as shown by your results, right? From that
perspective, it would seem best to increase the heap until results do not
significantly change.

But as with many other things, I don't care enough. If others think 16
megs makes sense, there are more important things to consider.

-+ Tatu +-