Recents updates were pushed to remove shortcuts from kryo. 1. explicitly disables utf8 based from advanced knowledge of the content of the dataset. 2. uses the maximum possible buffer size (based from the biggest dataset) to avoid using an outputstream for flushing (shortcut is directly using its internal buffer since it knows everything fits). Other streaming libs (smile/json/etc) are not doing this. Previously with kryo v1, this could not be alleviated as it was based on ByteBuffer (crashes on buffer overflow). Although another lib, Wobly, also based on ByteBuffer (looks like kryo v1 from the outside, but it is mostly similar to protobuf with the size computation and schema evolution suppport) is capable of handling any dataset no matter how large.
Now that every serializer can handle overflows, its now possible to use a common buffer size. I've updated some libs to use it as a starting point. 512 is the default (was previously 500 on ByteArrayOutputStream), which can be changed via system properties. Binary (except java-built-in/scala built-in) and json data are less than 512 bytes on media.1.cks, so there's no flushing/etc in the benchmark. Eventually we'll be able to measure the behavior of serializers when data does not fit in the buffer.
The results are updated to the wiki with the latest changes. I've tried to include a single utf8 character in the dataset but couldn't figure out how to make cks accept it. Maybe on the next run we can include it in the default dataset (with the help of kannan).
The results have not changed much, kryo-manual is still fastest. I'm not even sure why those shortcuts were necessary. One thing I noticed is that wobly actually performs better when run on windows 7 (based from previous results).
Interestingly, with the shortcuts removed, smile/jackson/manual now seems to be on par with kryo.
-- When the cat is away, the mouse is alone. - David Yu
On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com> wrote: > Hi all,
> Recents updates were pushed to remove shortcuts from kryo. > 1. explicitly disables utf8 based from advanced knowledge of the content of > the dataset. > 2. uses the maximum possible buffer size (based from the biggest dataset) to > avoid using an outputstream for flushing (shortcut is directly using its > internal buffer since it knows everything fits). Other streaming libs > (smile/json/etc) are not doing this. > Previously with kryo v1, this could not be alleviated as it was based on > ByteBuffer (crashes on buffer overflow). > Although another lib, Wobly, also based on ByteBuffer (looks like kryo v1 > from the outside, but it is mostly similar to protobuf with the size > computation and schema evolution suppport) is capable of handling any > dataset no matter how large.
Ok. I did notice that Wobly is very fast as well, when added.
> Now that every serializer can handle overflows, its now possible to use a > common buffer size. > I've updated some libs to use it as a starting point. > 512 is the default (was previously 500 on ByteArrayOutputStream), which can > be changed via > system properties.
Good idea.
> Binary (except java-built-in/scala built-in) and json data are less than 512 > bytes on media.1.cks, so there's no flushing/etc in the benchmark. > Eventually we'll be able to measure the behavior of serializers when data > does not fit in the buffer.
> The results are updated to the wiki with the latest changes. > I've tried to include a single utf8 character in the dataset but couldn't > figure out how to make cks accept it. > Maybe on the next run we can include it in the default dataset (with the > help of kannan).
Yes, this seems like a good idea as well.
> The results have not changed much, kryo-manual is still fastest. I'm not > even sure why those shortcuts were necessary. > One thing I noticed is that wobly actually performs better when run on > windows 7 (based from previous results).
> Interestingly, with the shortcuts removed, smile/jackson/manual now seems to > be on par with kryo.
Another related thing is that a while ago I added alternate sequence testing. However, it is only supported by a subset of serializers, based on codecs that were easiest to change; some might need external framing. But it should be relatively easy to expand coverage. I was hoping to find that Avro was more efficient with longer sequences, although that did not seem to be the case for some reason. But I think this might also show some differences between other codecs; binary formats should benefit more due to size differences, for example.
One thing that would benefit sequence tests most however would be some way to generate variations of items; if this was possible, it would be possible to run tests with multiple sequence lengths without having to hand create different input sets.
On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com> wrote: > Hi all,
> Recents updates were pushed to remove shortcuts from kryo. > 1. explicitly disables utf8 based from advanced knowledge of the content > of the dataset. > 2. uses the maximum possible buffer size (based from the biggest dataset) > to avoid using an outputstream for flushing (shortcut is directly using its > internal buffer since it knows everything fits). Other streaming libs > (smile/json/etc) are not doing this. > Previously with kryo v1, this could not be alleviated as it was based > on ByteBuffer (crashes on buffer overflow).
The benchmark serialize method returns a byte[]. The most efficient way to do this with Kryo is to use Output by itself. Your changes cause the bytes to be written to a byte[] in Output, then unnecessarily copied to a ByteArrayOutputStream. The benchmark should not force a ByteArrayOutputStream to be used. https://github.com/eishay/jvm-serializers/commit/fb52f09d24503808024b... I request you revert the serialize() method to how it was before: public byte[] serialize (T content) { output.clear(); kryo.writeObject(output, content); return output.toBytes(); } This works in exactly the same was as the ByteArrayOutputStream used by Serializer#outputStream(), the only difference is it avoids copying around the bytes. Output even extends OutputStream.
> Now that every serializer can handle overflows, its now possible to use a > common buffer size. > I've updated some libs to use it as a starting point. > 512 is the default (was previously 500 on ByteArrayOutputStream), which > can be changed via > system properties.
Binary (except java-built-in/scala built-in) and json data are less than
> 512 bytes on media.1.cks, so there's no flushing/etc in the benchmark. > Eventually we'll be able to measure the behavior of serializers when data > does not fit in the buffer.
Your changes to set a buffer size would have no effect with a size smaller than the data, since Serializer reuses the ByteArrayOutputStream by calling reset(). IMO, this is how it should be, as we are measuring the serializers, not how long it takes to allocate a buffer to hold the serialized bytes.
The results are updated to the wiki with the latest changes.
I think we should use the latest Java to run the benchmark for the wiki.
On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote: > On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com> wrote:
>> Hi all,
>> Recents updates were pushed to remove shortcuts from kryo. >> 1. explicitly disables utf8 based from advanced knowledge of the content >> of the dataset. >> 2. uses the maximum possible buffer size (based from the biggest dataset) >> to avoid using an outputstream for flushing (shortcut is directly using its >> internal buffer since it knows everything fits). Other streaming libs >> (smile/json/etc) are not doing this. >> Previously with kryo v1, this could not be alleviated as it was based >> on ByteBuffer (crashes on buffer overflow).
> The benchmark serialize method returns a byte[]. The most efficient way to > do this with Kryo is to use Output by itself. Your changes cause the bytes > to be written to a byte[] in Output, then unnecessarily copied to a > ByteArrayOutputStream.
The benchmark should not force a ByteArrayOutputStream to be used.
Nope. Why don't you read #2 again. Notice that before the change, all the libs in the benchmark are able to handle even if the data is 100x the buffer, *except kryo, *which crashes (apparently kryo v2 has same problems with v1).
The point is that there should be no bias/shortcuts based from direct knowledge of the content and size dataset. That is why the buffer size will be provided by the benchmark, not the author. Future runs will have the option to use a dataset whose size exceeds the buffer size provided.
> output.clear(); > kryo.writeObject(output, content); > return output.toBytes(); > } > This works in exactly the same was as the ByteArrayOutputStream used by > Serializer#outputStream(), the only difference is it avoids copying around > the bytes. Output even extends OutputStream.
>> Now that every serializer can handle overflows, its now possible to use a >> common buffer size. >> I've updated some libs to use it as a starting point. >> 512 is the default (was previously 500 on ByteArrayOutputStream), which >> can be changed via >> system properties.
> Binary (except java-built-in/scala built-in) and json data are less than >> 512 bytes on media.1.cks, so there's no flushing/etc in the benchmark. >> Eventually we'll be able to measure the behavior of serializers when data >> does not fit in the buffer.
> Your changes to set a buffer size would have no effect with a size smaller > than the data, since Serializer reuses the ByteArrayOutputStream by calling > reset(). IMO, this is how it should be, as we are measuring the > serializers,
not how long it takes to allocate a buffer to hold the serialized bytes.
> The results are updated to the wiki with the latest changes.
> I think we should use the latest Java to run the benchmark for the wiki.
> -Nate
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com> wrote: > On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com> wrote:
>>> Hi all,
>>> Recents updates were pushed to remove shortcuts from kryo. >>> 1. explicitly disables utf8 based from advanced knowledge of the content >>> of the dataset. >>> 2. uses the maximum possible buffer size (based from the biggest >>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>> using its internal buffer since it knows everything fits). Other streaming >>> libs (smile/json/etc) are not doing this. >>> Previously with kryo v1, this could not be alleviated as it was based >>> on ByteBuffer (crashes on buffer overflow).
>> The benchmark serialize method returns a byte[]. The most efficient way >> to do this with Kryo is to use Output by itself. Your changes cause the >> bytes to be written to a byte[] in Output, then unnecessarily copied to a >> ByteArrayOutputStream.
> The benchmark should not force a ByteArrayOutputStream to be used.
> Nope. Why don't you read #2 again. > Notice that before the change, all the libs in the benchmark are able to > handle even if the data is 100x the buffer, *except kryo, *which crashes > (apparently kryo v2 has same problems with v1).
> The point is that there should be no bias/shortcuts based from direct > knowledge of the content and size dataset. > That is why the buffer size will be provided by the benchmark, not the > author. > Future runs will have the option to use a dataset whose size exceeds the > buffer size provided.
Use this... this.output = new Output(BUFFER_SIZE, -1); ...so Output grows its buffer unbounded. Next time if you are going to change my code, read the javadocs. Now, revert the changes you made to the serialize() method in the Kryo benchmark. Thanks.
On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote: > On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com> wrote:
>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>> Hi all,
>>>> Recents updates were pushed to remove shortcuts from kryo. >>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>> content of the dataset. >>>> 2. uses the maximum possible buffer size (based from the biggest >>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>> using its internal buffer since it knows everything fits). Other streaming >>>> libs (smile/json/etc) are not doing this. >>>> Previously with kryo v1, this could not be alleviated as it was >>>> based on ByteBuffer (crashes on buffer overflow).
>>> The benchmark serialize method returns a byte[]. The most efficient way >>> to do this with Kryo is to use Output by itself. Your changes cause the >>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>> ByteArrayOutputStream.
>> The benchmark should not force a ByteArrayOutputStream to be used.
>> Nope. Why don't you read #2 again. >> Notice that before the change, all the libs in the benchmark are able to >> handle even if the data is 100x the buffer, *except kryo, *which crashes >> (apparently kryo v2 has same problems with v1).
>> The point is that there should be no bias/shortcuts based from direct >> knowledge of the content and size dataset. >> That is why the buffer size will be provided by the benchmark, not the >> author. >> Future runs will have the option to use a dataset whose size exceeds the >> buffer size provided.
> Use this... > this.output = new Output(BUFFER_SIZE, -1); > ...so Output grows its buffer unbounded. Next time if you are going to > change my code, read the javadocs. Now, revert the changes you made to the > serialize() method in the Kryo benchmark. Thanks.
Now there is another issue with re-use. There is a difference between stateless and stateful re-use. For example MsgPack re-uses its Packer/Unpacker, but on every iteration it completely resets all of the changes made such that the second run will have exactly the same state like it was the first run.
The benchmark results are basically based from a single run. The only reason why we run iterations is to compute the average and use that as the basis for the one-single-run.
The problem with kryo is that the second run is completely different from the first run (stateful). The buffers have been resized (you could call this sampling. After the first run, it already knows the message can fit). The second run and onwards no longer deals with overflows.
Compare this with the rest of the libs, on every iteration, they all need to flush/resize/handle overflows, because either the state is new ... or its completely reset (MsgPack)
Therefore, kryo's re-use is not valid at all. So thank you for reminding me to refactor the code even further to remove your re-use. Next time, try not to blatantly take shortcuts.
> -Nate
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com> wrote: > On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com> wrote:
>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>> Hi all,
>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>> content of the dataset. >>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>> using its internal buffer since it knows everything fits). Other streaming >>>>> libs (smile/json/etc) are not doing this. >>>>> Previously with kryo v1, this could not be alleviated as it was >>>>> based on ByteBuffer (crashes on buffer overflow).
>>>> The benchmark serialize method returns a byte[]. The most efficient way >>>> to do this with Kryo is to use Output by itself. Your changes cause the >>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>> ByteArrayOutputStream.
>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>> Nope. Why don't you read #2 again. >>> Notice that before the change, all the libs in the benchmark are able to >>> handle even if the data is 100x the buffer, *except kryo, *which >>> crashes (apparently kryo v2 has same problems with v1).
>>> The point is that there should be no bias/shortcuts based from direct >>> knowledge of the content and size dataset. >>> That is why the buffer size will be provided by the benchmark, not the >>> author. >>> Future runs will have the option to use a dataset whose size exceeds the >>> buffer size provided.
>> Use this... >> this.output = new Output(BUFFER_SIZE, -1); >> ...so Output grows its buffer unbounded. Next time if you are going to >> change my code, read the javadocs. Now, revert the changes you made to the >> serialize() method in the Kryo benchmark. Thanks.
> Now there is another issue with re-use. > There is a difference between stateless and stateful re-use. > For example MsgPack re-uses its Packer/Unpacker, but on every iteration it > completely resets all of the changes made such that the second run will > have exactly the same state like it was the first run.
> The benchmark results are basically based from a single run. The only > reason why we run iterations is to compute the average and use that as the > basis for the one-single-run.
> The problem with kryo is that the second run is completely different from > the first run (stateful). The buffers have been resized (you could call > this sampling. After the first run, it already knows the message can fit). > The second run and onwards no longer deals with overflows.
> Compare this with the rest of the libs, on every iteration, they all need > to flush/resize/handle overflows, because either the state is new ... or > its completely reset (MsgPack)
> Therefore, kryo's re-use is not valid at all. > So thank you for reminding me to refactor the code even further to remove > your re-use. > Next time, try not to blatantly take shortcuts.
We've already been over this. So far, no one else has agreed with you. Tatu accepted the patch for reuse of the ByteArrayOutputStream. We are measuring the serializers, not the growing of the ByteArrayOutputStream.
On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote: > On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com> wrote:
>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>> Hi all,
>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>> content of the dataset. >>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>> libs (smile/json/etc) are not doing this. >>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>> ByteArrayOutputStream.
>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>> Nope. Why don't you read #2 again. >>>> Notice that before the change, all the libs in the benchmark are able >>>> to handle even if the data is 100x the buffer, *except kryo, *which >>>> crashes (apparently kryo v2 has same problems with v1).
>>>> The point is that there should be no bias/shortcuts based from direct >>>> knowledge of the content and size dataset. >>>> That is why the buffer size will be provided by the benchmark, not the >>>> author. >>>> Future runs will have the option to use a dataset whose size exceeds >>>> the buffer size provided.
>>> Use this... >>> this.output = new Output(BUFFER_SIZE, -1); >>> ...so Output grows its buffer unbounded. Next time if you are going to >>> change my code, read the javadocs. Now, revert the changes you made to the >>> serialize() method in the Kryo benchmark. Thanks.
>> Now there is another issue with re-use. >> There is a difference between stateless and stateful re-use. >> For example MsgPack re-uses its Packer/Unpacker, but on every iteration >> it completely resets all of the changes made such that the second run will >> have exactly the same state like it was the first run.
>> The benchmark results are basically based from a single run. The only >> reason why we run iterations is to compute the average and use that as the >> basis for the one-single-run.
>> The problem with kryo is that the second run is completely different from >> the first run (stateful). The buffers have been resized (you could call >> this sampling. After the first run, it already knows the message can fit). >> The second run and onwards no longer deals with overflows.
>> Compare this with the rest of the libs, on every iteration, they all need >> to flush/resize/handle overflows, because either the state is new ... or >> its completely reset (MsgPack)
>> Therefore, kryo's re-use is not valid at all. >> So thank you for reminding me to refactor the code even further to remove >> your re-use. >> Next time, try not to blatantly take shortcuts.
> We've already been over this.
So far, no one else has agreed with you. Tatu accepted the patch for reuse
> of the ByteArrayOutputStream.
This is unreleated to what we discussed before (plus your shortcuts went unnoticed at that time). This is about stateless/stateful re-use. What's been recently stated in this thread are facts. You don't even have a logical explanation to counter the previous statement.
If you're talking about buffer re-use, it is allowed (be it directly used or from external buffer cache). Just look at your code, nobody changed that.
We are measuring the serializers, not the growing of the
> ByteArrayOutputStream.
That's not even the issue. The issue is that kryo is having a different state from the first run. If you can completely reset the state of your Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
@Tatu, if you can chime in and share your thoughts on the previous statement, that be great.
> -Nate
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
On Fri, Apr 20, 2012 at 10:05 AM, David Yu <david.yu....@gmail.com> wrote:
> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>> Hi all,
>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>> content of the dataset. >>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>> libs (smile/json/etc) are not doing this. >>>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>>> ByteArrayOutputStream.
>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>> Nope. Why don't you read #2 again. >>>>> Notice that before the change, all the libs in the benchmark are able >>>>> to handle even if the data is 100x the buffer, *except kryo, *which >>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>> The point is that there should be no bias/shortcuts based from direct >>>>> knowledge of the content and size dataset. >>>>> That is why the buffer size will be provided by the benchmark, not the >>>>> author. >>>>> Future runs will have the option to use a dataset whose size exceeds >>>>> the buffer size provided.
>>>> Use this... >>>> this.output = new Output(BUFFER_SIZE, -1); >>>> ...so Output grows its buffer unbounded. Next time if you are going to >>>> change my code, read the javadocs. Now, revert the changes you made to the >>>> serialize() method in the Kryo benchmark. Thanks.
>>> Now there is another issue with re-use. >>> There is a difference between stateless and stateful re-use. >>> For example MsgPack re-uses its Packer/Unpacker, but on every iteration >>> it completely resets all of the changes made such that the second run will >>> have exactly the same state like it was the first run.
>>> The benchmark results are basically based from a single run. The only >>> reason why we run iterations is to compute the average and use that as the >>> basis for the one-single-run.
>>> The problem with kryo is that the second run is completely different >>> from the first run (stateful). The buffers have been resized (you could >>> call this sampling. After the first run, it already knows the message can >>> fit). The second run and onwards no longer deals with overflows.
>>> Compare this with the rest of the libs, on every iteration, they all >>> need to flush/resize/handle overflows, because either the state is new ... >>> or its completely reset (MsgPack)
>>> Therefore, kryo's re-use is not valid at all. >>> So thank you for reminding me to refactor the code even further to >>> remove your re-use. >>> Next time, try not to blatantly take shortcuts.
>> We've already been over this.
> So far, no one else has agreed with you. Tatu accepted the patch for reuse >> of the ByteArrayOutputStream.
> This is unreleated to what we discussed before (plus your shortcuts went > unnoticed at that time). > This is about stateless/stateful re-use. > What's been recently stated in this thread are facts. > You don't even have a logical explanation to counter the previous > statement.
> If you're talking about buffer re-use, it is allowed (be it directly used > or from external buffer cache). > Just look at your code, nobody changed that.
> We are measuring the serializers, not the growing of the >> ByteArrayOutputStream.
> That's not even the issue. The issue is that kryo is having a different > state from the first run. If you can completely reset the state of your > Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
I have no idea what you are talking about. Tatu has asked you many times to show the code you are talking about. Quit wasting our time.
On Sat, Apr 21, 2012 at 1:05 AM, David Yu <david.yu....@gmail.com> wrote:
> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>> Hi all,
>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>> content of the dataset. >>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>> libs (smile/json/etc) are not doing this. >>>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>>> ByteArrayOutputStream.
>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>> Nope. Why don't you read #2 again. >>>>> Notice that before the change, all the libs in the benchmark are able >>>>> to handle even if the data is 100x the buffer, *except kryo, *which >>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>> The point is that there should be no bias/shortcuts based from direct >>>>> knowledge of the content and size dataset. >>>>> That is why the buffer size will be provided by the benchmark, not the >>>>> author. >>>>> Future runs will have the option to use a dataset whose size exceeds >>>>> the buffer size provided.
>>>> Use this... >>>> this.output = new Output(BUFFER_SIZE, -1); >>>> ...so Output grows its buffer unbounded. Next time if you are going to >>>> change my code, read the javadocs. Now, revert the changes you made to the >>>> serialize() method in the Kryo benchmark. Thanks.
>>> Now there is another issue with re-use. >>> There is a difference between stateless and stateful re-use. >>> For example MsgPack re-uses its Packer/Unpacker, but on every iteration >>> it completely resets all of the changes made such that the second run will >>> have exactly the same state like it was the first run.
>>> The benchmark results are basically based from a single run. The only >>> reason why we run iterations is to compute the average and use that as the >>> basis for the one-single-run.
>>> The problem with kryo is that the second run is completely different >>> from the first run (stateful). The buffers have been resized (you could >>> call this sampling. After the first run, it already knows the message can >>> fit). The second run and onwards no longer deals with overflows.
>>> Compare this with the rest of the libs, on every iteration, they all >>> need to flush/resize/handle overflows, because either the state is new ... >>> or its completely reset (MsgPack)
>>> Therefore, kryo's re-use is not valid at all. >>> So thank you for reminding me to refactor the code even further to >>> remove your re-use. >>> Next time, try not to blatantly take shortcuts.
>> We've already been over this.
> So far, no one else has agreed with you. Tatu accepted the patch for reuse >> of the ByteArrayOutputStream.
> This is unreleated to what we discussed before (plus your shortcuts went > unnoticed at that time). > This is about stateless/stateful re-use. > What's been recently stated in this thread are facts. > You don't even have a logical explanation to counter the previous > statement.
> If you're talking about buffer re-use, it is allowed (be it directly used > or from external buffer cache). > Just look at your code, nobody changed that.
> We are measuring the serializers, not the growing of the >> ByteArrayOutputStream.
> That's not even the issue. The issue is that kryo is having a different > state from the first run. If you can completely reset the state of your > Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
> @Tatu, if you can chime in and share your thoughts on > the previous statement, that be great.
Put it this way. As I've said on the first post, we'll be able to measure the behavior/performance of serializers when data does not fit in the buffer.. With kryo's re-used output that internally resizes on the first run (test for correctness), it is basically exempted. The two solutions are: 1. Continue with re-use but use an outputstream, so it doesn't resize but flush (the current solution). 2. Skip the outputstream and allow it to resize, but use a new instance on every iteration. Without that, kryo is exempted.
>> -- >> You received this message because you are subscribed to the Google Groups >> "java-serialization-benchmarking" group. >> To post to this group, send email to >> java-serialization-benchmarking@googlegroups.com. >> To unsubscribe from this group, send email to >> java-serialization-benchmarking+unsubscribe@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/java-serialization-benchmarking?hl=en.
> -- > When the cat is away, the mouse is alone. > - David Yu
-- When the cat is away, the mouse is alone. - David Yu
On Sat, Apr 21, 2012 at 1:07 AM, Nate <nathan.sw...@gmail.com> wrote:
> On Fri, Apr 20, 2012 at 10:05 AM, David Yu <david.yu....@gmail.com> wrote:
>> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>> Hi all,
>>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>>> content of the dataset. >>>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>>> libs (smile/json/etc) are not doing this. >>>>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>>>> ByteArrayOutputStream.
>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>>> Nope. Why don't you read #2 again. >>>>>> Notice that before the change, all the libs in the benchmark are able >>>>>> to handle even if the data is 100x the buffer, *except kryo, *which >>>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>>> The point is that there should be no bias/shortcuts based from direct >>>>>> knowledge of the content and size dataset. >>>>>> That is why the buffer size will be provided by the benchmark, not >>>>>> the author. >>>>>> Future runs will have the option to use a dataset whose size exceeds >>>>>> the buffer size provided.
>>>>> Use this... >>>>> this.output = new Output(BUFFER_SIZE, -1); >>>>> ...so Output grows its buffer unbounded. Next time if you are going to >>>>> change my code, read the javadocs. Now, revert the changes you made to the >>>>> serialize() method in the Kryo benchmark. Thanks.
>>>> Now there is another issue with re-use. >>>> There is a difference between stateless and stateful re-use. >>>> For example MsgPack re-uses its Packer/Unpacker, but on every iteration >>>> it completely resets all of the changes made such that the second run will >>>> have exactly the same state like it was the first run.
>>>> The benchmark results are basically based from a single run. The only >>>> reason why we run iterations is to compute the average and use that as the >>>> basis for the one-single-run.
>>>> The problem with kryo is that the second run is completely different >>>> from the first run (stateful). The buffers have been resized (you could >>>> call this sampling. After the first run, it already knows the message can >>>> fit). The second run and onwards no longer deals with overflows.
>>>> Compare this with the rest of the libs, on every iteration, they all >>>> need to flush/resize/handle overflows, because either the state is new ... >>>> or its completely reset (MsgPack)
>>>> Therefore, kryo's re-use is not valid at all. >>>> So thank you for reminding me to refactor the code even further to >>>> remove your re-use. >>>> Next time, try not to blatantly take shortcuts.
>>> We've already been over this.
>> So far, no one else has agreed with you. Tatu accepted the patch for >>> reuse of the ByteArrayOutputStream.
>> This is unreleated to what we discussed before (plus your shortcuts went >> unnoticed at that time). >> This is about stateless/stateful re-use. >> What's been recently stated in this thread are facts. >> You don't even have a logical explanation to counter the previous >> statement.
>> If you're talking about buffer re-use, it is allowed (be it directly used >> or from external buffer cache). >> Just look at your code, nobody changed that.
>> We are measuring the serializers, not the growing of the >>> ByteArrayOutputStream.
>> That's not even the issue. The issue is that kryo is having a different >> state from the first run. If you can completely reset the state of your >> Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
> I have no idea what you are talking about.
So you're avoiding the argument now?
> Tatu has asked you many times to show the code you are talking about.
Again, what is mentioned in this thread is unrelated to the previous discussion.
>> @Tatu, if you can chime in and share your thoughts on >> the previous statement, that be great.
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
On Fri, Apr 20, 2012 at 10:32 AM, David Yu <david.yu....@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 1:05 AM, David Yu <david.yu....@gmail.com> wrote:
>> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>> Hi all,
>>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>>> content of the dataset. >>>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>>> libs (smile/json/etc) are not doing this. >>>>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>>>> ByteArrayOutputStream.
>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>>> Nope. Why don't you read #2 again. >>>>>> Notice that before the change, all the libs in the benchmark are able >>>>>> to handle even if the data is 100x the buffer, *except kryo, *which >>>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>>> The point is that there should be no bias/shortcuts based from direct >>>>>> knowledge of the content and size dataset. >>>>>> That is why the buffer size will be provided by the benchmark, not >>>>>> the author. >>>>>> Future runs will have the option to use a dataset whose size exceeds >>>>>> the buffer size provided.
>>>>> Use this... >>>>> this.output = new Output(BUFFER_SIZE, -1); >>>>> ...so Output grows its buffer unbounded. Next time if you are going to >>>>> change my code, read the javadocs. Now, revert the changes you made to the >>>>> serialize() method in the Kryo benchmark. Thanks.
>>>> Now there is another issue with re-use. >>>> There is a difference between stateless and stateful re-use. >>>> For example MsgPack re-uses its Packer/Unpacker, but on every iteration >>>> it completely resets all of the changes made such that the second run will >>>> have exactly the same state like it was the first run.
>>>> The benchmark results are basically based from a single run. The only >>>> reason why we run iterations is to compute the average and use that as the >>>> basis for the one-single-run.
>>>> The problem with kryo is that the second run is completely different >>>> from the first run (stateful). The buffers have been resized (you could >>>> call this sampling. After the first run, it already knows the message can >>>> fit). The second run and onwards no longer deals with overflows.
>>>> Compare this with the rest of the libs, on every iteration, they all >>>> need to flush/resize/handle overflows, because either the state is new ... >>>> or its completely reset (MsgPack)
>>>> Therefore, kryo's re-use is not valid at all. >>>> So thank you for reminding me to refactor the code even further to >>>> remove your re-use. >>>> Next time, try not to blatantly take shortcuts.
>>> We've already been over this.
>> So far, no one else has agreed with you. Tatu accepted the patch for >>> reuse of the ByteArrayOutputStream.
>> This is unreleated to what we discussed before (plus your shortcuts went >> unnoticed at that time). >> This is about stateless/stateful re-use. >> What's been recently stated in this thread are facts. >> You don't even have a logical explanation to counter the previous >> statement.
>> If you're talking about buffer re-use, it is allowed (be it directly used >> or from external buffer cache). >> Just look at your code, nobody changed that.
>> We are measuring the serializers, not the growing of the >>> ByteArrayOutputStream.
>> That's not even the issue. The issue is that kryo is having a different >> state from the first run. If you can completely reset the state of your >> Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
>> @Tatu, if you can chime in and share your thoughts on >> the previous statement, that be great.
> Put it this way. As I've said on the first post, we'll be able to > measure the behavior/performance of serializers when data does not fit in > the buffer.. With kryo's re-used output that internally resizes on the > first run (test for correctness), it is basically exempted.
For the last time, when data doesn't fit in the buffer, the buffer grows. Measuring this is WORTHLESS.
> The two solutions are: > 1. Continue with re-use but use an outputstream, so it doesn't resize but > flush (the current solution).
Serializer#serialize() returns a byte[]. It doesn't force usage of a ByteArrayOutputStream. Kryo can avoid using ByteArrayOutputStream by using its own Output class, which works in the same way.
> 2. Skip the outputstream and allow it to resize, but use a new instance > on every iteration. > Without that, kryo is exempted.
This is a community effort, so thankfully you alone do not get to decide what makes a library exempt.
On Fri, Apr 20, 2012 at 10:39 AM, David Yu <david.yu....@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 1:07 AM, Nate <nathan.sw...@gmail.com> wrote:
>> On Fri, Apr 20, 2012 at 10:05 AM, David Yu <david.yu....@gmail.com>wrote:
>>> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>>>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com>wrote:
>>>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>>>> content of the dataset. >>>>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>>>> libs (smile/json/etc) are not doing this. >>>>>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>>>>> ByteArrayOutputStream.
>>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>>>> Nope. Why don't you read #2 again. >>>>>>> Notice that before the change, all the libs in the benchmark are >>>>>>> able to handle even if the data is 100x the buffer, *except kryo, *which >>>>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>>>> The point is that there should be no bias/shortcuts based from >>>>>>> direct knowledge of the content and size dataset. >>>>>>> That is why the buffer size will be provided by the benchmark, not >>>>>>> the author. >>>>>>> Future runs will have the option to use a dataset whose size exceeds >>>>>>> the buffer size provided.
>>>>>> Use this... >>>>>> this.output = new Output(BUFFER_SIZE, -1); >>>>>> ...so Output grows its buffer unbounded. Next time if you are going >>>>>> to change my code, read the javadocs. Now, revert the changes you made to >>>>>> the serialize() method in the Kryo benchmark. Thanks.
>>>>> Now there is another issue with re-use. >>>>> There is a difference between stateless and stateful re-use. >>>>> For example MsgPack re-uses its Packer/Unpacker, but on every >>>>> iteration it completely resets all of the changes made such that the second >>>>> run will have exactly the same state like it was the first run.
>>>>> The benchmark results are basically based from a single run. The only >>>>> reason why we run iterations is to compute the average and use that as the >>>>> basis for the one-single-run.
>>>>> The problem with kryo is that the second run is completely different >>>>> from the first run (stateful). The buffers have been resized (you could >>>>> call this sampling. After the first run, it already knows the message can >>>>> fit). The second run and onwards no longer deals with overflows.
>>>>> Compare this with the rest of the libs, on every iteration, they all >>>>> need to flush/resize/handle overflows, because either the state is new ... >>>>> or its completely reset (MsgPack)
>>>>> Therefore, kryo's re-use is not valid at all. >>>>> So thank you for reminding me to refactor the code even further to >>>>> remove your re-use. >>>>> Next time, try not to blatantly take shortcuts.
>>>> We've already been over this.
>>> So far, no one else has agreed with you. Tatu accepted the patch for >>>> reuse of the ByteArrayOutputStream.
>>> This is unreleated to what we discussed before (plus your shortcuts went >>> unnoticed at that time). >>> This is about stateless/stateful re-use. >>> What's been recently stated in this thread are facts. >>> You don't even have a logical explanation to counter the previous >>> statement.
>>> If you're talking about buffer re-use, it is allowed (be it directly >>> used or from external buffer cache). >>> Just look at your code, nobody changed that.
>>> We are measuring the serializers, not the growing of the >>>> ByteArrayOutputStream.
>>> That's not even the issue. The issue is that kryo is having a different >>> state from the first run. If you can completely reset the state of your >>> Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
>> I have no idea what you are talking about.
> So you're avoiding the argument now?
Tatu has asked you many times to show the code you are talking about.
> Again, what is mentioned in this thread is unrelated to the previous > discussion.
And so you refuse to make it clear what you are talking about? When asked for clarification, you just continue your same drivel. I am all for discussing issues in an adult manner, but you continuously cast blame and make claims that I am trying to maliciously deceive. When you don't like the results, you make changes to Kryo's benchmarks that are suboptimal and you update the wiki before the group can discuss your changes. Your continued argumentativeness is petty and childish.
It seems this project can no longer be sanely managed as a group. I vote that we have a single developer that manages the main source repository. Contributions would occur as git forks. The developer managing the project would discuss changes with the group as needed. I nominate Kannan to manage the project.
On Sat, Apr 21, 2012 at 2:06 AM, Nate <nathan.sw...@gmail.com> wrote: > On Fri, Apr 20, 2012 at 10:32 AM, David Yu <david.yu....@gmail.com> wrote:
>> On Sat, Apr 21, 2012 at 1:05 AM, David Yu <david.yu....@gmail.com> wrote:
>>> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>>>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com>wrote:
>>>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>>>> content of the dataset. >>>>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>>>> libs (smile/json/etc) are not doing this. >>>>>>>>> Previously with kryo v1, this could not be alleviated as it was >>>>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>>>> The benchmark serialize method returns a byte[]. The most efficient >>>>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the >>>>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a >>>>>>>> ByteArrayOutputStream.
>>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>>>> Nope. Why don't you read #2 again. >>>>>>> Notice that before the change, all the libs in the benchmark are >>>>>>> able to handle even if the data is 100x the buffer, *except kryo, *which >>>>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>>>> The point is that there should be no bias/shortcuts based from >>>>>>> direct knowledge of the content and size dataset. >>>>>>> That is why the buffer size will be provided by the benchmark, not >>>>>>> the author. >>>>>>> Future runs will have the option to use a dataset whose size exceeds >>>>>>> the buffer size provided.
>>>>>> Use this... >>>>>> this.output = new Output(BUFFER_SIZE, -1); >>>>>> ...so Output grows its buffer unbounded. Next time if you are going >>>>>> to change my code, read the javadocs. Now, revert the changes you made to >>>>>> the serialize() method in the Kryo benchmark. Thanks.
>>>>> Now there is another issue with re-use. >>>>> There is a difference between stateless and stateful re-use. >>>>> For example MsgPack re-uses its Packer/Unpacker, but on every >>>>> iteration it completely resets all of the changes made such that the second >>>>> run will have exactly the same state like it was the first run.
>>>>> The benchmark results are basically based from a single run. The only >>>>> reason why we run iterations is to compute the average and use that as the >>>>> basis for the one-single-run.
>>>>> The problem with kryo is that the second run is completely different >>>>> from the first run (stateful). The buffers have been resized (you could >>>>> call this sampling. After the first run, it already knows the message can >>>>> fit). The second run and onwards no longer deals with overflows.
>>>>> Compare this with the rest of the libs, on every iteration, they all >>>>> need to flush/resize/handle overflows, because either the state is new ... >>>>> or its completely reset (MsgPack)
>>>>> Therefore, kryo's re-use is not valid at all. >>>>> So thank you for reminding me to refactor the code even further to >>>>> remove your re-use. >>>>> Next time, try not to blatantly take shortcuts.
>>>> We've already been over this.
>>> So far, no one else has agreed with you. Tatu accepted the patch for >>>> reuse of the ByteArrayOutputStream.
>>> This is unreleated to what we discussed before (plus your shortcuts went >>> unnoticed at that time). >>> This is about stateless/stateful re-use. >>> What's been recently stated in this thread are facts. >>> You don't even have a logical explanation to counter the previous >>> statement.
>>> If you're talking about buffer re-use, it is allowed (be it directly >>> used or from external buffer cache). >>> Just look at your code, nobody changed that.
>>> We are measuring the serializers, not the growing of the >>>> ByteArrayOutputStream.
>>> That's not even the issue. The issue is that kryo is having a different >>> state from the first run. If you can completely reset the state of your >>> Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
>>> @Tatu, if you can chime in and share your thoughts on >>> the previous statement, that be great.
>> Put it this way. As I've said on the first post, we'll be able to >> measure the behavior/performance of serializers when data does not fit in >> the buffer.. With kryo's re-used output that internally resizes on the >> first run (test for correctness), it is basically exempted.
> Kryo's Output class does EXACTLY the same thing that ByteArrayOutputStream > does:
> For the last time, when data doesn't fit in the buffer, the buffer grows.
Except that kryo doesn't because the growth is permanent once the first run is done. So ultimately, the buffer size being assigned by the benchmark is useless for kryo.
> Measuring this is WORTHLESS.
Really? Here's some sample data with the message size larger than buffer size (media.3.cks with 512 as the provided buffer size).
*====== when kryo internally resizes buffers after the first run, which also effectively ignores the buffer size provided by benchmark* * * ./run -trials=500 -include=java-manual,kryo,wobly data/media.3.cks Checking correctness... [done] create ser +same deser +shal +deep total size +dfl java-manual 136 7843 7684 3975 4039 4170 12013 1596 255 *kryo 134 3855 3708 4468 4480 4541 8396 1573 254* wobly 86 11227 11012 3560 3563 3617 14844 1604 275
// Here's the code: protected final Input input = new Input(BUFFER_SIZE); // as author suggested protected final Output output = new Output(BUFFER_SIZE, Integer.MAX_VALUE); // same as: new Output(BUFFER_SIZE, -1)
public T deserialize (byte[] array) { input.setBuffer(array); return kryo.readObject(input, type);
// Here's the code: protected final byte[] *buffer* = new byte[BUFFER_SIZE]; // buffer assigned by the benchmark, re-used
public T deserialize (byte[] array) { return kryo.readObject(new Input(array), type);
}
public byte[] serialize (T content) { Output output = new Output(*buffer*, Integer.MAX_VALUE); kryo.writeObject(output, content); return output.toBytes();
}
Finally, for a non-stream based serializer like wobly, one could improvise (retain state like kryo) and save the computed size of the message on the first run. The second run onwards, it already knows the size, so it doesn't have to compute anymore and directly serialize to a byte array (which could be like 50-100% speed increase for wobly).
*Now don't tell me its worthless because clearly, the results prove it.*
>> The two solutions are: >> 1. Continue with re-use but use an outputstream, so it doesn't resize >> but flush (the current solution).
> Serializer#serialize() returns a byte[]. It doesn't force usage of a > ByteArrayOutputStream. Kryo can avoid using ByteArrayOutputStream by using > its own Output class, which works in the same way.
>> 2. Skip the outputstream and allow it to resize, but use a new instance >> on every iteration. >> Without that, kryo is exempted.
> This is a community effort, so thankfully you alone do not get to decide > what makes a library exempt.
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
Unfortunately, I don't have the time to be a responsive project maintainer. I think, though, that we can get things back on track with our usual mailing-list-consensus-based ways once we get a little more process set up.
I think one issue is that the results page is very high stakes. For most people, the bar charts we publish are taken to be the final word on JVM serializer performance. Internet People cite them all the time.
<irrelevant-aside>I think this is unfortunate. As someone who spent weeks reworking the codebase, I got to see in great detail just how flawed these benchmarks are. Just the fact that we test a single, limited data value, yet provide results with four significant figures is absurd! This isn't anyone's fault; I'm just trying to point out the discrepancy with the accuracy of our results and (my perception of) how much The Internet bases decisions off our graphs. I think we're better than the other comparisons out there (like the lolgraph on MsgPack's website), but benchmarking is so hard to get right.</irrelevant-aside>
Anyway, the stakes are high enough that maybe we should push new results to a "staging" URL and give everyone a week to scrutinize them before publishing to the main URL.
Second, I agree that we should probably write up some "rules". When I first made a pass through the project I fixed the discrepancies that were obvious, but we're now dealing with more subtle stuff. Things that may seem right to you might not be to someone else, to the point that it seems like the other person is being malicious.
For example, this particular thread included (among other things) whether we should count buffer resizing in the serialization time. I can see Nate's high-level point about not counting things that may not happen in real usage. For a long-running process, maybe the buffer gets resized on the first few sends and then doesn't hit the limit ever again (though I haven't quite digested what's going on in David's last message...). But even if counting resize time is a bad idea, it's how we measure the other tools, so it's still not fair to publish results without fixing up the others.
So lets try writing everything down. When some code looks shady, figure out what written rule it breaks and point it out. If it's doing something questionable but there's no written rule to prevent it, lets then discuss and make a new rule. A side benefit is that we'll have a detailed documentation of our testing methodology that we can put on the wiki (so people don't have to rummage through the source code for this information).
To start, does anyone want to take a stab at formalizing the buffering rules?
On Fri, Apr 20, 2012 at 16:36, David Yu <david.yu....@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 2:06 AM, Nate <nathan.sw...@gmail.com> wrote:
>> On Fri, Apr 20, 2012 at 10:32 AM, David Yu <david.yu....@gmail.com>wrote:
>>> On Sat, Apr 21, 2012 at 1:05 AM, David Yu <david.yu....@gmail.com>wrote:
>>>> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>>>>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com>wrote:
>>>>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com>wrote:
>>>>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>>>>> content of the dataset. >>>>>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>>>>> libs (smile/json/etc) are not doing this. >>>>>>>>>> Previously with kryo v1, this could not be alleviated as it >>>>>>>>>> was based on ByteBuffer (crashes on buffer overflow).
>>>>>>>>> The benchmark serialize method returns a byte[]. The most >>>>>>>>> efficient way to do this with Kryo is to use Output by itself. Your changes >>>>>>>>> cause the bytes to be written to a byte[] in Output, then unnecessarily >>>>>>>>> copied to a ByteArrayOutputStream.
>>>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>>>>> Nope. Why don't you read #2 again. >>>>>>>> Notice that before the change, all the libs in the benchmark are >>>>>>>> able to handle even if the data is 100x the buffer, *except kryo, *which >>>>>>>> crashes (apparently kryo v2 has same problems with v1).
>>>>>>>> The point is that there should be no bias/shortcuts based from >>>>>>>> direct knowledge of the content and size dataset. >>>>>>>> That is why the buffer size will be provided by the benchmark, not >>>>>>>> the author. >>>>>>>> Future runs will have the option to use a dataset whose size >>>>>>>> exceeds the buffer size provided.
>>>>>>> Use this... >>>>>>> this.output = new Output(BUFFER_SIZE, -1); >>>>>>> ...so Output grows its buffer unbounded. Next time if you are going >>>>>>> to change my code, read the javadocs. Now, revert the changes you made to >>>>>>> the serialize() method in the Kryo benchmark. Thanks.
>>>>>> Now there is another issue with re-use. >>>>>> There is a difference between stateless and stateful re-use. >>>>>> For example MsgPack re-uses its Packer/Unpacker, but on every >>>>>> iteration it completely resets all of the changes made such that the second >>>>>> run will have exactly the same state like it was the first run.
>>>>>> The benchmark results are basically based from a single run. The >>>>>> only reason why we run iterations is to compute the average and use that as >>>>>> the basis for the one-single-run.
>>>>>> The problem with kryo is that the second run is completely different >>>>>> from the first run (stateful). The buffers have been resized (you could >>>>>> call this sampling. After the first run, it already knows the message can >>>>>> fit). The second run and onwards no longer deals with overflows.
>>>>>> Compare this with the rest of the libs, on every iteration, they all >>>>>> need to flush/resize/handle overflows, because either the state is new ... >>>>>> or its completely reset (MsgPack)
>>>>>> Therefore, kryo's re-use is not valid at all. >>>>>> So thank you for reminding me to refactor the code even further to >>>>>> remove your re-use. >>>>>> Next time, try not to blatantly take shortcuts.
>>>>> We've already been over this.
>>>> So far, no one else has agreed with you. Tatu accepted the patch for >>>>> reuse of the ByteArrayOutputStream.
>>>> This is unreleated to what we discussed before (plus your shortcuts >>>> went unnoticed at that time). >>>> This is about stateless/stateful re-use. >>>> What's been recently stated in this thread are facts. >>>> You don't even have a logical explanation to counter the previous >>>> statement.
>>>> If you're talking about buffer re-use, it is allowed (be it directly >>>> used or from external buffer cache). >>>> Just look at your code, nobody changed that.
>>>> We are measuring the serializers, not the growing of the >>>>> ByteArrayOutputStream.
>>>> That's not even the issue. The issue is that kryo is having a >>>> different state from the first run. If you can completely reset the state >>>> of your Input/Output (like MsgPack's Packer/Unpacker), there isn't a >>>> problem.
>>>> @Tatu, if you can chime in and share your thoughts on >>>> the previous statement, that be great.
>>> Put it this way. As I've said on the first post, we'll be able to >>> measure the behavior/performance of serializers when data does not fit in >>> the buffer.. With kryo's re-used output that internally resizes on the >>> first run (test for correctness), it is basically exempted.
>> Kryo's Output class does EXACTLY the same thing that >> ByteArrayOutputStream does:
>> For the last time, when data doesn't fit in the buffer, the buffer grows.
> Except that kryo doesn't because the growth is permanent once the first > run is done. > So ultimately, the buffer size being assigned by the benchmark is useless > for kryo.
>> Measuring this is WORTHLESS.
> Really? Here's some sample data with the message size larger than buffer > size (media.3.cks with 512 as the provided buffer size).
> *====== when kryo internally resizes buffers after the first run, which > also effectively ignores the buffer size provided by benchmark* > * > * > ./run -trials=500 -include=java-manual,kryo,wobly data/media.3.cks > Checking correctness... > [done] > create ser +same deser +shal > +deep total size +dfl > java-manual 136 7843 7684 3975 4039 > 4170 12013 1596 255 > *kryo 134 3855 3708 4468 4480 > 4541 8396 1573 254* > wobly 86 11227 11012 3560 3563 > 3617 14844 1604 275
> // Here's the code: > protected final Input input = new Input(BUFFER_SIZE); > // as author suggested > protected final Output output = new Output(BUFFER_SIZE, > Integer.MAX_VALUE); // same as: new Output(BUFFER_SIZE, -1)
> public T deserialize (byte[] array) { > input.setBuffer(array);
On Sat, Apr 21, 2012 at 3:22 PM, Kannan Goundan <kan...@cakoose.com> wrote: > Unfortunately, I don't have the time to be a responsive project > maintainer. I think, though, that we can get things back on track with our > usual mailing-list-consensus-based ways once we get a little more process > set up.
> I think one issue is that the results page is very high stakes. For most > people, the bar charts we publish are taken to be the final word on JVM > serializer performance. Internet People cite them all the time.
> <irrelevant-aside>I think this is unfortunate. As someone who spent weeks > reworking the codebase, I got to see in great detail just how flawed these > benchmarks are. Just the fact that we test a single, limited data value, > yet provide results with four significant figures is absurd! This isn't > anyone's fault; I'm just trying to point out the discrepancy with the > accuracy of our results and (my perception of) how much The Internet bases > decisions off our graphs. I think we're better than the other comparisons > out there (like the lolgraph on MsgPack's website),
Yep. Their dataset was crafted to make msgpack look good.
> but benchmarking is so hard to get right.</irrelevant-aside>
> Anyway, the stakes are high enough that maybe we should push new results > to a "staging" URL and give everyone a week to scrutinize them before > publishing to the main URL.
> Second, I agree that we should probably write up some "rules". When I > first made a pass through the project I fixed the discrepancies that were > obvious, but we're now dealing with more subtle stuff. Things that may > seem right to you might not be to someone else, to the point that it seems > like the other person is being malicious.
> For example, this particular thread included (among other things) whether > we should count buffer resizing in the serialization time. I can see > Nate's high-level point about not counting things that may not happen in > real usage. For a long-running process, maybe the buffer gets resized on > the first few sends and then doesn't hit the limit ever again (though I > haven't quite digested what's going on in David's last message...). But > even if counting resize time is a bad idea, it's how we measure the other > tools, so it's still not fair to publish results without fixing up the > others.
100% agree. I've been trying to point that out.
> So lets try writing everything down. When some code looks shady, figure > out what written rule it breaks and point it out. If it's doing something > questionable but there's no written rule to prevent it, lets then discuss > and make a new rule. A side benefit is that we'll have a detailed > documentation of our testing methodology that we can put on the wiki (so > people don't have to rummage through the source code for this information).
> To start, does anyone want to take a stab at formalizing the buffering > rules?
> On Fri, Apr 20, 2012 at 16:36, David Yu <david.yu....@gmail.com> wrote:
>> On Sat, Apr 21, 2012 at 2:06 AM, Nate <nathan.sw...@gmail.com> wrote:
>>> On Fri, Apr 20, 2012 at 10:32 AM, David Yu <david.yu....@gmail.com>wrote:
>>>> On Sat, Apr 21, 2012 at 1:05 AM, David Yu <david.yu....@gmail.com>wrote:
>>>>> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com>wrote:
>>>>>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>wrote:
>>>>>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com>wrote:
>>>>>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com >>>>>>>>>> > wrote:
>>>>>>>>>>> Hi all,
>>>>>>>>>>> Recents updates were pushed to remove shortcuts from kryo. >>>>>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the >>>>>>>>>>> content of the dataset. >>>>>>>>>>> 2. uses the maximum possible buffer size (based from the biggest >>>>>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly >>>>>>>>>>> using its internal buffer since it knows everything fits). Other streaming >>>>>>>>>>> libs (smile/json/etc) are not doing this. >>>>>>>>>>> Previously with kryo v1, this could not be alleviated as it >>>>>>>>>>> was based on ByteBuffer (crashes on buffer overflow).
>>>>>>>>>> The benchmark serialize method returns a byte[]. The most >>>>>>>>>> efficient way to do this with Kryo is to use Output by itself. Your changes >>>>>>>>>> cause the bytes to be written to a byte[] in Output, then unnecessarily >>>>>>>>>> copied to a ByteArrayOutputStream.
>>>>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>>>>>> Nope. Why don't you read #2 again. >>>>>>>>> Notice that before the change, all the libs in the benchmark are >>>>>>>>> able to handle even if the data is 100x the buffer, *except kryo, >>>>>>>>> *which crashes (apparently kryo v2 has same problems with v1).
>>>>>>>>> The point is that there should be no bias/shortcuts based from >>>>>>>>> direct knowledge of the content and size dataset. >>>>>>>>> That is why the buffer size will be provided by the benchmark, not >>>>>>>>> the author. >>>>>>>>> Future runs will have the option to use a dataset whose size >>>>>>>>> exceeds the buffer size provided.
>>>>>>>> Use this... >>>>>>>> this.output = new Output(BUFFER_SIZE, -1); >>>>>>>> ...so Output grows its buffer unbounded. Next time if you are going >>>>>>>> to change my code, read the javadocs. Now, revert the changes you made to >>>>>>>> the serialize() method in the Kryo benchmark. Thanks.
>>>>>>> Now there is another issue with re-use. >>>>>>> There is a difference between stateless and stateful re-use. >>>>>>> For example MsgPack re-uses its Packer/Unpacker, but on every >>>>>>> iteration it completely resets all of the changes made such that the second >>>>>>> run will have exactly the same state like it was the first run.
>>>>>>> The benchmark results are basically based from a single run. The >>>>>>> only reason why we run iterations is to compute the average and use that as >>>>>>> the basis for the one-single-run.
>>>>>>> The problem with kryo is that the second run is completely different >>>>>>> from the first run (stateful). The buffers have been resized (you could >>>>>>> call this sampling. After the first run, it already knows the message can >>>>>>> fit). The second run and onwards no longer deals with overflows.
>>>>>>> Compare this with the rest of the libs, on every iteration, they all >>>>>>> need to flush/resize/handle overflows, because either the state is new ... >>>>>>> or its completely reset (MsgPack)
>>>>>>> Therefore, kryo's re-use is not valid at all. >>>>>>> So thank you for reminding me to refactor the code even further to >>>>>>> remove your re-use. >>>>>>> Next time, try not to blatantly take shortcuts.
>>>>>> We've already been over this.
>>>>> So far, no one else has agreed with you. Tatu accepted the patch for >>>>>> reuse of the ByteArrayOutputStream.
>>>>> This is unreleated to what we discussed before (plus your shortcuts >>>>> went unnoticed at that time). >>>>> This is about stateless/stateful re-use. >>>>> What's been recently stated in this thread are facts. >>>>> You don't even have a logical explanation to counter the previous >>>>> statement.
>>>>> If you're talking about buffer re-use, it is allowed (be it directly >>>>> used or from external buffer cache). >>>>> Just look at your code, nobody changed that.
>>>>> We are measuring the serializers, not the growing of the >>>>>> ByteArrayOutputStream.
>>>>> That's not even the issue. The issue is that kryo is having a >>>>> different state from the first run. If you can completely reset the state >>>>> of your Input/Output (like MsgPack's Packer/Unpacker), there isn't a >>>>> problem.
>>>>> @Tatu, if you can chime in and share your thoughts on >>>>> the previous statement, that be great.
>>>> Put it this way. As I've said on the first post, we'll be able to >>>> measure the behavior/performance of serializers when data does not fit in >>>> the buffer.. With kryo's re-used output that internally resizes on >>>> the first run (test for correctness), it is basically exempted.
>>> Kryo's Output class does EXACTLY the same thing that >>> ByteArrayOutputStream does:
>>> For the last time, when data doesn't fit in the buffer, the buffer grows.
>> Except that kryo doesn't because the growth is permanent once the first >> run is done. >> So ultimately, the buffer size being assigned by the benchmark is useless >> for kryo.
>>> Measuring this is WORTHLESS.
>> Really? Here's some sample data with the message size larger than buffer >> size (media.3.cks with 512 as the provided buffer size).
>> *====== when kryo internally resizes buffers after the first run, which >> also effectively ignores the buffer size provided by benchmark* >> * >> * >> ./run -trials=500 -include=java-manual,kryo,wobly data/media.3.cks >> Checking correctness... >> [done] >> create ser +same deser +shal >> +deep total size +dfl >> java-manual 136 7843 7684 3975 4039 >> 4170 12013 1596 255 >> *kryo
On Fri, Apr 20, 2012 at 10:05 AM, David Yu <david.yu....@gmail.com> wrote:
> On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
>> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com> wrote:
>>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com>
>>>> wrote:
>>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> wrote:
>>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com>
>>>>>> wrote:
>>>>>>> Hi all,
>>>>>>> Recents updates were pushed to remove shortcuts from kryo.
>>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the
>>>>>>> content of the dataset.
>>>>>>> 2. uses the maximum possible buffer size (based from the biggest
>>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is directly
>>>>>>> using its internal buffer since it knows everything fits). Other streaming
>>>>>>> libs (smile/json/etc) are not doing this.
>>>>>>> Previously with kryo v1, this could not be alleviated as it was
>>>>>>> based on ByteBuffer (crashes on buffer overflow).
>>>>>> The benchmark serialize method returns a byte[]. The most efficient
>>>>>> way to do this with Kryo is to use Output by itself. Your changes cause the
>>>>>> bytes to be written to a byte[] in Output, then unnecessarily copied to a
>>>>>> ByteArrayOutputStream.
>>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
>>>>> Nope. Why don't you read #2 again.
>>>>> Notice that before the change, all the libs in the benchmark are able
>>>>> to handle even if the data is 100x the buffer, except kryo, which crashes
>>>>> (apparently kryo v2 has same problems with v1).
>>>>> The point is that there should be no bias/shortcuts based from direct
>>>>> knowledge of the content and size dataset.
>>>>> That is why the buffer size will be provided by the benchmark, not the
>>>>> author.
>>>>> Future runs will have the option to use a dataset whose size exceeds
>>>>> the buffer size provided.
>>>> Use this...
>>>> this.output = new Output(BUFFER_SIZE, -1);
>>>> ...so Output grows its buffer unbounded. Next time if you are going to
>>>> change my code, read the javadocs. Now, revert the changes you made to the
>>>> serialize() method in the Kryo benchmark. Thanks.
>>> Now there is another issue with re-use.
>>> There is a difference between stateless and stateful re-use.
>>> For example MsgPack re-uses its Packer/Unpacker, but on every iteration
>>> it completely resets all of the changes made such that the second run will
>>> have exactly the same state like it was the first run.
>>> The benchmark results are basically based from a single run. The only
>>> reason why we run iterations is to compute the average and use that as the
>>> basis for the one-single-run.
>>> The problem with kryo is that the second run is completely different from
>>> the first run (stateful). The buffers have been resized (you could call
>>> this sampling. After the first run, it already knows the message can fit).
>>> The second run and onwards no longer deals with overflows.
>>> Compare this with the rest of the libs, on every iteration, they all need
>>> to flush/resize/handle overflows, because either the state is new ... or its
>>> completely reset (MsgPack)
>>> Therefore, kryo's re-use is not valid at all.
>>> So thank you for reminding me to refactor the code even further to remove
>>> your re-use.
>>> Next time, try not to blatantly take shortcuts.
>> We've already been over this.
>> So far, no one else has agreed with you. Tatu accepted the patch for reuse
>> of the ByteArrayOutputStream.
> This is unreleated to what we discussed before (plus your shortcuts went
> unnoticed at that time).
> This is about stateless/stateful re-use.
> What's been recently stated in this thread are facts.
> You don't even have a logical explanation to counter the previous statement.
> If you're talking about buffer re-use, it is allowed (be it directly used or
> from external buffer cache).
> Just look at your code, nobody changed that.
>> We are measuring the serializers, not the growing of the
>> ByteArrayOutputStream.
> That's not even the issue. The issue is that kryo is having a different
> state from the first run. If you can completely reset the state of your
> Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
> @Tatu, if you can chime in and share your thoughts on
> the previous statement, that be great.
I don't feel I have the full picture, but for what it is worth here
are related thoughts.
- Sounds like handling of ByteArrayOutputStream (external to codecs)
was not problematic, and we can ignore it (originally I assumed you
meant that was problematic)
- I agree in that further runs should not be based on assuming/knowing
that further iterations provide same data.
I think our best chance is to focus on two things:
- creating "fully automatic" subset, which should avoid many of
disputed techniques
- trying to find a way to create permutations for different runs, so
that tests would exhibit some level of variation.
One more comment on "manual" tests (where I think all of us
occasionally get overzealous with optimizations, and/or disagree
most): these were, I think, originally created mostly because there
were no good fully-automated providers.
With XML, for example, such solutions tended to have
disproportionately high overhead -- but even there, adding JAXB at
this point would solve the issue, as it can use fastest parsers, and
does not have more than maybe 50% overhead (compared to XStream and
others that have steeper).
Put another way, I think manual variants are less necessary due to
coverage. In fact, moving tree-based and manual variants to completely
separate suite (but with comparable throughputs, so one can compare if
need be).
I know this does not help resolve the specific issue, but I feel that
we would do better if we stepped back and considered bigger picture.
And once we are done with "bigger" changes, we can solve detail
issues.
I don't mean to belittle the question of fairness -- which is
fundamental with comparison -- but sometimes best way to solve a
specific problem is not full frontal assault, but by outmaneuvering
the thing.
> On Fri, Apr 20, 2012 at 10:05 AM, David Yu <david.yu....@gmail.com> wrote:
> > On Fri, Apr 20, 2012 at 11:36 PM, Nate <nathan.sw...@gmail.com> wrote:
> >> On Thu, Apr 19, 2012 at 10:41 PM, David Yu <david.yu....@gmail.com> > wrote:
> >>> On Fri, Apr 20, 2012 at 11:21 AM, Nate <nathan.sw...@gmail.com> wrote:
> >>>> On Thu, Apr 19, 2012 at 6:35 PM, David Yu <david.yu....@gmail.com> > >>>> wrote:
> >>>>> On Fri, Apr 20, 2012 at 2:36 AM, Nate <nathan.sw...@gmail.com> > wrote:
> >>>>>> On Thu, Apr 19, 2012 at 3:49 AM, David Yu <david.yu....@gmail.com> > >>>>>> wrote:
> >>>>>>> Hi all,
> >>>>>>> Recents updates were pushed to remove shortcuts from kryo. > >>>>>>> 1. explicitly disables utf8 based from advanced knowledge of the > >>>>>>> content of the dataset. > >>>>>>> 2. uses the maximum possible buffer size (based from the biggest > >>>>>>> dataset) to avoid using an outputstream for flushing (shortcut is > directly > >>>>>>> using its internal buffer since it knows everything fits). Other > streaming > >>>>>>> libs (smile/json/etc) are not doing this. > >>>>>>> Previously with kryo v1, this could not be alleviated as it was > >>>>>>> based on ByteBuffer (crashes on buffer overflow).
> >>>>>> The benchmark serialize method returns a byte[]. The most efficient > >>>>>> way to do this with Kryo is to use Output by itself. Your changes > cause the > >>>>>> bytes to be written to a byte[] in Output, then unnecessarily > copied to a > >>>>>> ByteArrayOutputStream.
> >>>>>> The benchmark should not force a ByteArrayOutputStream to be used.
> >>>>> Nope. Why don't you read #2 again. > >>>>> Notice that before the change, all the libs in the benchmark are able > >>>>> to handle even if the data is 100x the buffer, except kryo, which > crashes > >>>>> (apparently kryo v2 has same problems with v1).
> >>>>> The point is that there should be no bias/shortcuts based from direct > >>>>> knowledge of the content and size dataset. > >>>>> That is why the buffer size will be provided by the benchmark, not > the > >>>>> author. > >>>>> Future runs will have the option to use a dataset whose size exceeds > >>>>> the buffer size provided.
> >>>> Use this... > >>>> this.output = new Output(BUFFER_SIZE, -1); > >>>> ...so Output grows its buffer unbounded. Next time if you are going to > >>>> change my code, read the javadocs. Now, revert the changes you made > to the > >>>> serialize() method in the Kryo benchmark. Thanks.
> >>> Now there is another issue with re-use. > >>> There is a difference between stateless and stateful re-use. > >>> For example MsgPack re-uses its Packer/Unpacker, but on every iteration > >>> it completely resets all of the changes made such that the second run > will > >>> have exactly the same state like it was the first run.
> >>> The benchmark results are basically based from a single run. The only > >>> reason why we run iterations is to compute the average and use that as > the > >>> basis for the one-single-run.
> >>> The problem with kryo is that the second run is completely different > from > >>> the first run (stateful). The buffers have been resized (you could > call > >>> this sampling. After the first run, it already knows the message can > fit). > >>> The second run and onwards no longer deals with overflows.
> >>> Compare this with the rest of the libs, on every iteration, they all > need > >>> to flush/resize/handle overflows, because either the state is new ... > or its > >>> completely reset (MsgPack)
> >>> Therefore, kryo's re-use is not valid at all. > >>> So thank you for reminding me to refactor the code even further to > remove > >>> your re-use. > >>> Next time, try not to blatantly take shortcuts.
> >> We've already been over this.
> >> So far, no one else has agreed with you. Tatu accepted the patch for > reuse > >> of the ByteArrayOutputStream.
> > This is unreleated to what we discussed before (plus your shortcuts went > > unnoticed at that time). > > This is about stateless/stateful re-use. > > What's been recently stated in this thread are facts. > > You don't even have a logical explanation to counter the previous > statement.
> > If you're talking about buffer re-use, it is allowed (be it directly > used or > > from external buffer cache). > > Just look at your code, nobody changed that.
> >> We are measuring the serializers, not the growing of the > >> ByteArrayOutputStream.
> > That's not even the issue. The issue is that kryo is having a different > > state from the first run. If you can completely reset the state of your > > Input/Output (like MsgPack's Packer/Unpacker), there isn't a problem.
> > @Tatu, if you can chime in and share your thoughts on > > the previous statement, that be great.
> I don't feel I have the full picture, but for what it is worth here > are related thoughts.
> - Sounds like handling of ByteArrayOutputStream (external to codecs) > was not problematic, and we can ignore it (originally I assumed you > meant that was problematic)
Yea, we basically had a misunderstanding.
> - I agree in that further runs should not be based on assuming/knowing > that further iterations provide same data.
Agreed.
> I think our best chance is to focus on two things:
> - creating "fully automatic" subset
I'm ok with that.
> , which should avoid many of > disputed techniques
- trying to find a way to create permutations for different runs, so
> that tests would exhibit some level of variation.
I'm ok with that as well. The real problem is someone actually implementing it hehe.
> One more comment on "manual" tests (where I think all of us > occasionally get overzealous with optimizations, and/or disagree > most): these were, I think, originally created mostly because there > were no good fully-automated providers.
I actually don't have problems with the optimizations when you told me fastjson was doing the same internally. So, its all good.
> With XML, for example, such solutions tended to have > disproportionately high overhead -- but even there, adding JAXB at > this point would solve the issue, as it can use fastest parsers, and > does not have more than maybe 50% overhead (compared to XStream and > others that have steeper).
> Put another way, I think manual variants are less necessary due to > coverage. In fact, moving tree-based and manual variants to completely > separate suite (but with comparable throughputs, so one can compare if > need be).
> I know this does not help resolve the specific issue, but I feel that > we would do better if we stepped back and considered bigger picture. > And once we are done with "bigger" changes, we can solve detail > issues.
> I don't mean to belittle the question of fairness -- which is > fundamental with comparison -- but sometimes best way to solve a > specific problem is not full frontal assault, but by outmaneuvering > the thing.
> -+ Tatu +-
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
On Fri, Apr 20, 2012 at 7:36 PM, David Yu <david.yu....@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 2:06 AM, Nate <nathan.sw...@gmail.com> wrote:
...
>> For the last time, when data doesn't fit in the buffer, the buffer grows.
> Except that kryo doesn't because the growth is permanent once the first run
> is done.
> So ultimately, the buffer size being assigned by the benchmark is useless
> for kryo.
>> Measuring this is WORTHLESS.
> Really? Here's some sample data with the message size larger than buffer
> size (media.3.cks with 512 as the provided buffer size).
> ====== when kryo internally resizes buffers after the first run, which also
> effectively ignores the buffer size provided by benchmark
> // Here's the code:
> protected final Input input = new Input(BUFFER_SIZE);
> // as author suggested
> protected final Output output = new Output(BUFFER_SIZE, Integer.MAX_VALUE);
> // same as: new Output(BUFFER_SIZE, -1)
> public T deserialize (byte[] array) {
> input.setBuffer(array);
> return kryo.readObject(input, type);
> }
This does seem odd. For deserialization, difference is significant but
not earth shattering. But serialization difference is huge.
What would explain this level of difference? Is there any possibility
it could be related to JVM warmup oddities -- I have observed that the
first entries tend to get preferentially treated (I guess it is due to
class unloading when speculative inlining was done due to assumption
of no sub-classing etc).
> On Fri, Apr 20, 2012 at 7:36 PM, David Yu <david.yu....@gmail.com> wrote:
> > On Sat, Apr 21, 2012 at 2:06 AM, Nate <nathan.sw...@gmail.com> wrote: > ... > >> For the last time, when data doesn't fit in the buffer, the buffer > grows.
> > Except that kryo doesn't because the growth is permanent once the first > run > > is done. > > So ultimately, the buffer size being assigned by the benchmark is useless > > for kryo.
> >> Measuring this is WORTHLESS.
> > Really? Here's some sample data with the message size larger than buffer > > size (media.3.cks with 512 as the provided buffer size).
> > ====== when kryo internally resizes buffers after the first run, which > also > > effectively ignores the buffer size provided by benchmark
> > // Here's the code: > > protected final Input input = new Input(BUFFER_SIZE); > > // as author suggested > > protected final Output output = new Output(BUFFER_SIZE, > Integer.MAX_VALUE); > > // same as: new Output(BUFFER_SIZE, -1)
> > public T deserialize (byte[] array) { > > input.setBuffer(array); > > return kryo.readObject(input, type); > > }
> This does seem odd. For deserialization, difference is significant but > not earth shattering. But serialization difference is huge. > What would explain this level of difference?
In the first code, the buffer doesn't resize (well, it actually resized only on the first non-measured run). That is the performance you expect from kryo when everything fits in the buffer.
The second code, that is the performance you get from kryo when message size exceeds buffer size.
Why the difference is that huge, I'm not really sure. First thing that comes to mind is how efficient the library handles buffer overflow.
Is there any possibility
> it could be related to JVM warmup oddities --
I've just pushed the code that above (clean state). Try it yourself. The wiki has been updated. I now reflects the change of kryo not using an outputstream to produce a byte array (as per nate).
> I have observed that the > first entries tend to get preferentially treated (I guess it is due to > class unloading when speculative inlining was done due to assumption > of no sub-classing etc).
> -+ Tatu +-
> -- > You received this message because you are subscribed to the Google Groups > "java-serialization-benchmarking" group. > To post to this group, send email to > java-serialization-benchmarking@googlegroups.com. > To unsubscribe from this group, send email to > java-serialization-benchmarking+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone. - David Yu
The discussion about buffer sizes has been going in circles. Let's step
back a bit and clearly define the issues and potential fixes. We should be
able to discuss without being impolite. No one is trying to be malicious or
misrepresent results. We seem to have a misunderstanding and we need to
focus on discussing it productively.
On Fri, Apr 20, 2012 at 7:36 PM, David Yu <david.yu....@gmail.com> wrote:
> Kryo's Output class does EXACTLY the same thing that ByteArrayOutputStream
>> does:
>> For the last time, when data doesn't fit in the buffer, the buffer grows.
> Except that kryo doesn't because the growth is permanent once the first
> run is done.
> So ultimately, the buffer size being assigned by the benchmark is useless
> for kryo.
My point is that the ByteArrayOutputStream provided by Serializer...
https://github.com/eishay/jvm-serializers/blob/kannan/tpc/src/seriali...<https://github.com/eishay/jvm-serializers/blob/kannan/tpc/src/seriali...>
...is reused in Serializer#outputStream and Serializer#outputStreamForList.
These methods are used by many serializers that need an OutputStream, eg
java-manual. Because the same ByteArrayOutputStream is reset() and reused,
the backing buffer will only grow on the first serialization, and will
never grow afterward. This is the same behavior as Kryo reusing an Output
instance. Note I am not yet judging whether this is right or wrong, just
pointing it out.
Hopefully now you understand why I disagreed with your changes, as you have
made Kryo allocate a new buffer each time, while java-manual and others
reuse the same buffer. You made these changes which put Kryo at a
disadvantage and updated the wiki before anyone could review them.
Here are the numbers for java-manual with and without buffer reuse, running
on Sun's Java 1.7.0_03:
run -trials=500 -include=java-manual data/media.3.cks
create ser +same deser +shal
+deep total size +dfl
java-manual WITHOUT buffer reuse: 80 3711 3725 1728
1797 1859 5570 1596 255
java-manual WITH buffer reuse: 82 3487 3409 1755 1854
1878 5364 1596 255
And here is the same for Kryo:
run -trials=500 -include=kryo data/media.3.cks
create ser +same deser +shal
+deep total size +dfl
Kryo WITHOUT buffer reuse: 80 3212 3120 2620
2681 2718 5930 1573 254
Kryo WITH buffer reuse: 81 2074 1985 2756 2725
3493 5566 1573 254
Interesting that the difference I see for Kryo is less pronounced than on
your machine, but still a pretty big difference.
> Measuring this is WORTHLESS.
> Really? Here's some sample data with the message size larger than buffer
> size (media.3.cks with 512 as the provided buffer size).
I understand that reusing or allocating a new buffer for each serialization
will have an affect on the results. What I am saying is that I would like
results for ALL serializers to avoid growing the buffer. In my mind, we
should be measuring the serializers' code and we should exclude as much
noise as possible. If we include growing of the buffer in our timings, it
makes the relative difference between timings smaller. For an extreme
example, if we added 10 seconds to all results, it would appear as if there
was very little difference between all the serializers.
Does everyone understand the buffering reuse issue?
Here are some options I think are worth discussing, numbered for
convenience:
1) Require serializers to start each serialization with a buffer of size
Serializer.BUFFER_SIZE. Each serialization will include any growing of the
buffer.
1a) What would be a reasonable size? FWIW, BufferedInputStream uses 8192.
2) Allow serializers to reuse the same buffer. This means that the first
serialization may grow the buffer, and subsequent serializations can reuse
this buffer which is known to be big enough.
2a) After serialization, most buffers allocate a new byte[] and copy out
the bytes, like ByteArrayOutputStream#toByteArray() and Kryo's
Output#toBytes(). A serializer could get a speed boost by avoiding this
allocation and copy by returning the backing byte[], since it knows EXACTLY
how big it should be beforehand. This seems somewhat sleazy, as it doesn't
parallel real world usage, where objects are extremely unlikely to all be
the same size.
3) Force serializers to serialize into a byte[] provided by the framework.
The buffer size would have to be large enough, but this isn't much of an
issue as it can be specified. The method would write bytes to the array
starting at zero and would return the number of bytes written, so it would
change from...
public abstract byte[] serialize(S content) throws Exception;
...to...
public abstract int serialize(S content, byte[] buffer) throws Exception;
3a) #3 is basically the same as #1, but with a buffer size known to be
large enough. I think I prefer #3, as #1 could silently grow the buffer and
negatively affect the results without anyone noticing.
On Sat, Apr 21, 2012 at 12:22 AM, Kannan Goundan <kan...@cakoose.com> wrote:
> I think one issue is that the results page is very high stakes. For most
> people, the bar charts we publish are taken to be the final word on JVM
> serializer performance. Internet People cite them all the time.
> [snip]
> Anyway, the stakes are high enough that maybe we should push new results
> to a "staging" URL and give everyone a week to scrutinize them before
> publishing to the main URL.
I agree. We can use a new wiki page "Tentative Results" (or whatever) for
discussion and review before updating the main page.
> Second, I agree that we should probably write up some "rules". When I
> first made a pass through the project I fixed the discrepancies that were
> obvious, but we're now dealing with more subtle stuff. Things that may
> seem right to you might not be to someone else, to the point that it seems
> like the other person is being malicious.
> For example, this particular thread included (among other things) whether
> we should count buffer resizing in the serialization time. I can see
> Nate's high-level point about not counting things that may not happen in
> real usage.
Close, but not my exact point. I'd rather exclude everything we can that is
outside of the serializer's code, since the serializer's code is really
what we are trying to measure.
> For a long-running process, maybe the buffer gets resized on the first few
> sends and then doesn't hit the limit ever again (though I haven't quite
> digested what's going on in David's last message...). But even if counting
> resize time is a bad idea, it's how we measure the other tools, so it's
> still not fair to publish results without fixing up the others.
> So lets try writing everything down. When some code looks shady, figure
> out what written rule it breaks and point it out. If it's doing something
> questionable but there's no written rule to prevent it, lets then discuss
> and make a new rule. A side benefit is that we'll have a detailed
> documentation of our testing methodology that we can put on the wiki (so
> people don't have to rummage through the source code for this information).
> To start, does anyone want to take a stab at formalizing the buffering
> rules?
I just posted some options for discussion on the buffer reuse issue.
On Thu, Apr 26, 2012 at 2:30 PM, Nate <nathan.sw...@gmail.com> wrote:
> Measuring this is WORTHLESS.
>> Really? Here's some sample data with the message size larger than buffer
>> size (media.3.cks with 512 as the provided buffer size).
> I understand that reusing or allocating a new buffer for each
> serialization will have an affect on the results. What I am saying is that
> I would like results for ALL serializers to avoid growing the buffer. In my
> mind, we should be measuring the serializers' code and we should exclude as
> much noise as possible. If we include growing of the buffer in our timings,
> it makes the relative difference between timings smaller. For an extreme
> example, if we added 10 seconds to all results, it would appear as if there
> was very little difference between all the serializers.
Here it is differently. Let's say that we know A is exactly 10x as fast as
B (A is 100ms, B is 1000ms).
Now we're going to add an overhead of 250ms:
benchmark of A = 100ms + 250ms = 350ms
benchmark of B = 1000ms + 250ms = 1250ms
-> conclusion: "A is 3.57x as fast as B"
Let's instead add an overhead of 100ms:
benchmark of A = 100ms + 100ms = 200ms
benchmark of B = 1000ms + 100ms = 1100ms
-> conclusion: "A is 5.50x as fast as B"
Finally, add an overhead of 10000ms:
benchmark of A = 100ms + 10000ms = 10100ms
benchmark of B = 1000ms + 10000ms = 11000ms
-> conclusion: "A and B are about as fast"
As you can see, minimizing the overhead is important. We want to compare
the serializers' code. Any overhead that we can avoid should be avoided,
else it can dramatically affect the results.
On Fri, Apr 27, 2012 at 5:30 AM, Nate <nathan.sw...@gmail.com> wrote:
> The discussion about buffer sizes has been going in circles. Let's step
> back a bit and clearly define the issues and potential fixes. We should be
> able to discuss without being impolite.
You started it.
> No one is trying to be malicious or misrepresent results.
Hopefully that is the truth. Whether it was malicious/intentional or not,
it was still a shortcut.
>>> For the last time, when data doesn't fit in the buffer, the buffer grows.
>> Except that kryo doesn't because the growth is permanent once the first
>> run is done.
>> So ultimately, the buffer size being assigned by the benchmark is useless
>> for kryo.
> My point is that the ByteArrayOutputStream provided by Serializer...
Note I am not yet judging whether this is right or wrong, just pointing it
> out.
> Hopefully now you understand why I disagreed with your changes, as you
> have made Kryo allocate a new buffer each time
Wrong.
What part of this don't you understand:
protected final byte[] *buffer* = new byte[BUFFER_SIZE];
public byte[] serialize (T content) {
Output output = new Output(*buffer*, Integer.MAX_VALUE);
kryo.writeObject(output, content);
return output.toBytes();
}
Is that "new buffer each time"? * *Please check the code first next time.
> , while java-manual and others reuse the same buffer.
There's a difference between buffer from OutputStream and the internal
buffer used by the library. (which tatu acknowledged)
Some of the libraries here don't even have the luxury of re-using the
internal buffer for each run.
Ultimately, they still use (or re-use) the internal buffer to flush to the
outputstream.
If the data is 5x bigger, they flush 5x.
If not using a stream, you resize/expand it x times (depends on the
algorithm) or simply pre-compute the data size.
Now kryo wants to take a shortcut and avoid all that by persisting the
resized buffers from the first run.
The stream-based and non-stream based serializers aren't gonna be happy
with that.
You made these changes which put Kryo at a disadvantage and updated the
> wiki before anyone could review them.
The wiki results are based from media.1.cks. No one is at a disadvantage
because everything fits it the buffer (well, except for
java-built-in/scala-built-in). Stop BSing.
> Here are the numbers for java-manual with and without buffer reuse,
> running on Sun's Java 1.7.0_03:
> run -trials=500 -include=java-manual data/media.3.cks
> Interesting that the difference I see for Kryo is less pronounced than on
> your machine, but still a pretty big difference.
I ran it on my 2.66ghz Core2Quad ubuntu dev machine. I imagine it will
only be much worse when run on slower machines (especially virtualized ones)
>> Measuring this is WORTHLESS.
>> Really? Here's some sample data with the message size larger than buffer
>> size (media.3.cks with 512 as the provided buffer size).
> I understand that reusing or allocating a new buffer for each
> serialization will have an affect on the results. What I am saying is that
> I would like results for ALL serializers to avoid growing the buffer.
Are we not publishing results from media.1.cks? (which fits inside 512)
*Just like media.2.cks*, *media.3.cks is also there to keep the libraries
honest*.
It also answers the question:
What if the data cannot fit in the buffer, how will the library behave?
We're not publishing results other than media.1.cks.
I don't see a problem here.
> In my mind, we should be measuring the serializers' code and we should
> exclude as much noise as possible. If we include growing of the buffer in
> our timings, it makes the relative difference between timings smaller. For
> an extreme example, if we added 10 seconds to all results, it would appear
> as if there was very little difference between all the serializers.
> Does everyone understand the buffering reuse issue?
> Here are some options I think are worth discussing, numbered for
> convenience:
> 1) Require serializers to start each serialization with a buffer of size
> Serializer.BUFFER_SIZE. Each serialization will include any growing of the
> buffer.
> 1a) What would be a reasonable size? FWIW, BufferedInputStream uses 8192.
> 2) Allow serializers to reuse the same buffer. This means that the first
> serialization may grow the buffer, and subsequent serializations can reuse
> this buffer which is known to be big enough.
> 2a) After serialization, most buffers allocate a new byte[] and copy out
> the bytes, like ByteArrayOutputStream#toByteArray() and Kryo's
> Output#toBytes(). A serializer could get a speed boost by avoiding this
> allocation and copy by returning the backing byte[], since it knows EXACTLY
> how big it should be beforehand. This seems somewhat sleazy, as it doesn't
> parallel real world usage, where objects are extremely unlikely to all be
> the same size.
> 3) Force serializers to serialize into a byte[] provided by the framework.
> The buffer size would have to be large enough, but this isn't much of an
> issue as it can be specified. The method would write bytes to the array
> starting at zero and would return the number of bytes written, so it would
> change from...
> public abstract byte[] serialize(S content) throws Exception;
> ...to...
> public abstract int serialize(S content, byte[] buffer) throws Exception;
> 3a) #3 is basically the same as #1, but with a buffer size known to be
> large enough. I think I prefer #3, as #1 could silently grow the buffer and
> negatively affect the results without anyone noticing.
> --
> You received this message because you are subscribed to the Google Groups
> "java-serialization-benchmarking" group.
> To post to this group, send email to
> java-serialization-benchmarking@googlegroups.com.
> To unsubscribe from this group, send email to
> java-serialization-benchmarking+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/java-serialization-benchmarking?hl=en.
-- When the cat is away, the mouse is alone.
- David Yu
On Thu, Apr 26, 2012 at 8:45 PM, David Yu <david.yu....@gmail.com> wrote:
> On Fri, Apr 27, 2012 at 5:30 AM, Nate <nathan.sw...@gmail.com> wrote:
>> The discussion about buffer sizes has been going in circles. Let's step
>> back a bit and clearly define the issues and potential fixes. We should be
>> able to discuss without being impolite.
> You started it.
Wow. We are trying to have an adult discussion here.
> Hopefully now you understand why I disagreed with your changes, as you
>> have made Kryo allocate a new buffer each time
> Wrong.
> What part of this don't you understand:
> protected final byte[] *buffer* = new byte[BUFFER_SIZE];
> public byte[] serialize (T content) {
> Output output = new Output(*buffer*, Integer.MAX_VALUE);
> kryo.writeObject(output, content);
> return output.toBytes();
> }
> Is that "new buffer each time"? * *Please check the code first next time.
Yes I did see you had changed the code again after your initial change.
> , while java-manual and others reuse the same buffer.
> There's a difference between buffer from OutputStream and the internal
> buffer used by the library. (which tatu acknowledged)
No, there is no difference. See the Serializer#serialize method declaration:
public abstract byte[] serialize(S content) throws Exception;
Internally, the code can do whatever it likes. If it wants to use
OutputStream from Serializer#outputStream(), it can do that. If it wants to
use its own buffer equivalent to a ByteArrayOutputStream, it can do that.
If Serializer#outputStream() is going to reuse the ByteArrayOutputStream,
then whatever internal buffer a serializer uses should also be allowed to
reuse its buffer.
Some of the libraries here don't even have the luxury of re-using the
> internal buffer for each run.
> Ultimately, they still use (or re-use) the internal buffer to flush to the
> outputstream.
> If the data is 5x bigger, they flush 5x.
This gives me the impression you misunderstand how the OutputStream from
Serializer#outputStream is used. When writing to the stream, the serializer
doesn't care if the data is 5x bigger than the buffer size. It does not
flush 5x, it just writes the data. Internally, the ByteArrayOutputStream
grows its byte[] during the first run and then it never grows again.
> If not using a stream, you resize/expand it x times (depends on the
> algorithm) or simply pre-compute the data size.
> Now kryo wants to take a shortcut and avoid all that by persisting the
> resized buffers from the first run.
There is no shortcut, Kryo's Output class works exactly the same as the
OutputStream from Serializer#outputStream.
> You made these changes which put Kryo at a disadvantage and updated the
>> wiki before anyone could review them.
> The wiki results are based from media.1.cks. No one is at a disadvantage
> because everything fits it the buffer (well, except for
> java-built-in/scala-built-in). Stop BSing.
Regarding your "Stop BSing" comment, if you continue to be impolite, I will
cease discussion with you and I expect others will do the same.
In your initial update to the wiki, you changed Kryo to allocate a new
buffer each time and you updated the wiki with the results. This is a
non-issue now anyway, as we will being using a "staging" wiki page so that
updates can be reviewed and discussed.
I understand that reusing or allocating a new buffer for each serialization
>> will have an affect on the results. What I am saying is that I would like
>> results for ALL serializers to avoid growing the buffer.
> Are we not publishing results from media.1.cks? (which fits inside 512)
> *Just like media.2.cks*, *media.3.cks is also there to keep the libraries
> honest*.
> It also answers the question:
> What if the data cannot fit in the buffer, how will the library behave?
> We're not publishing results other than media.1.cks.
> I don't see a problem here.
The current code does not answer that question, because most of the
serializers use Serializer#outputStream, which reuses the buffer. Your
changes have made Kryo not reuse the buffer, so it will grow each time, and
therefore it is not fair for media.3.cks or any other test where the output
is > 512. We don't publish those results, but it is still unfair.
Besides that, I don't believe this is an important question to ask. It only
adds overhead that skews the results.