Java UTF-8 encoding/decoding: possible performance improvements


Evan Jones

Dec 22, 2009, 12:26:19 PM
to Protocol Buffers
I've done some quick and dirty benchmarking of Java string encoding/
decoding to/from UTF-8 for an unrelated project, but I've realized
that these performance improvements could be added to protobufs. The
"easy" way to do UTF-8 conversions is the way CodedInputStream/
CodedOutputStream does it: using String.getBytes() and new String().
It turns out that using the java.nio.charset.CharsetDecoder/
CharsetEncoder *can* be faster. However, to make it faster the objects
need to be reused, due to the cost of allocating temporary buffers and
objects.

Before I attempt to make any improvements, I want to see if anyone
(Kenton primarily) has opinions on whether these make sense. They would
add ~100 lines of code to replace something which is now a few lines
of code, and it is a small improvement (approximately 40% less time
per encode/decode, on a list of 1400 strings in different languages).
I haven't tried adding this to protobufs yet, so final performance
improvements are unknown:


Problem 1: A Java protobuf string is stored as a String instance. It
typically gets converted to UTF-8 *twice*: Once in getSerializedSize()
via a call to CodedOutputStream.computeStringSize, then again in
writeTo().

Solution: Cache the byte[] version of String fields. This would
increase the memory size of each message (an additional pointer per
string, plus the space for the byte[]), but would HALVE the number of
conversions. I suspect this will be a fair bit faster. If added, it
should only be added for the SPEED generated messages.
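Roughly, the caching idea looks like this (an illustrative stand-in class, not the actual generated code):

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of a SPEED-generated message with one string field.
final class ExampleMessage {
    private final String name;
    private byte[] nameUtf8Cache;  // filled by getSerializedSize(), reused by writeTo()

    ExampleMessage(String name) { this.name = name; }

    int getSerializedSize() {
        // The first conversion happens here and is remembered...
        nameUtf8Cache = name.getBytes(StandardCharsets.UTF_8);
        return nameUtf8Cache.length;  // real code would add tag/length bytes
    }

    byte[] writeTo() {
        // ...so serialization reuses the bytes instead of converting again.
        return nameUtf8Cache != null
                ? nameUtf8Cache
                : name.getBytes(StandardCharsets.UTF_8);
    }
}
```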


Problem 2: Using the NIO encoders/decoders can be faster than
String.getBytes, but only if it is used >= 4 times. If used only once,
it is worse. The same is approximately true about decoding. Lame
results: http://evanjones.ca/software/java-string-encoding.html

Solution 1: Add a custom encoder/decoder to CodedOutputStream,
allocated as needed. This could be *bad* for applications that call
Message.toByteString or .toByteArray frequently for messages with few
strings, since that creates and throws away a single CodedOutputStream
instance.

Solution 2: Add a custom encoder/decoder per thread via a ThreadLocal.
This requires fetching the ThreadLocal, which is slightly expensive,
and adds some per-thread memory overhead (~4kB, tunable). However, the
allocations are done ONCE per thread, which should be significantly
better.
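A minimal sketch of the ThreadLocal approach (Solution 2), using current java.nio APIs; the class and method names are just illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Utf8 {
    // One reusable encoder per thread: the allocation happens ONCE per thread.
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newEncoder());

    static byte[] encode(String s) {
        try {
            // Fetch the thread's encoder and reuse it for this conversion.
            ByteBuffer buf = ENCODER.get().reset().encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```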


--
Evan Jones
http://evanjones.ca/

Kenton Varda

Dec 22, 2009, 7:59:29 PM
to Evan Jones, Protocol Buffers
These ideas sound good to me.

On Tue, Dec 22, 2009 at 9:26 AM, Evan Jones <ev...@mit.edu> wrote:
Problem 1: A Java protobuf string is stored as a String instance. It
typically gets converted to UTF-8 *twice*: Once in getSerializedSize()
via a call to CodedOutputStream.computeStringSize, then again in
writeTo().

Solution: Cache the byte[] version of String fields. This would
increase the memory size of each message (an additional pointer per
string, plus the space for the byte[]), but would HALVE the number of
conversions. I suspect this will be a fair bit faster. If added, it
should only be added for the SPEED generated messages.

I wonder if we can safely discard the cached byte array during serialization on the assumption that most messages are serialized only once?
 
Problem 2: Using the NIO encoders/decoders can be faster than
String.getBytes, but only if it is used >= 4 times. If used only once,
it is worse. The same is approximately true about decoding. Lame
results: http://evanjones.ca/software/java-string-encoding.html

Solution 1: Add a custom encoder/decoder to CodedOutputStream,
allocated as needed. This could be *bad* for applications that call
Message.toByteString or .toByteArray frequently for messages with few
strings, since that creates and throws away a single CodedOutputStream
instance.

Solution 2: Add a custom encoder/decoder per thread via a ThreadLocal.
This requires fetching the ThreadLocal, which is slightly expensive,
and adds some per-thread memory overhead (~4kB, tunable). However, the
allocations are done ONCE per thread, which should be significantly
better.

Fetching a threadlocal should just be a pointer dereference on any decent threading implementation.  Is it really that expensive in Java?

Solution 3:  Maintain a private freelist of encoder objects within CodedOutputStream.  Allocate one the first time a string is encoded on a particular stream object, and return it to the freelist on flush() (which is always called before discarding the stream unless an exception interrupts serialization).  It may make sense for the freelist to additionally be thread-local to avoid locking, but if it's only one lock per serialization maybe it's not a big deal?
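Sketched out, the freelist could be as simple as this (illustrative names, not the real CodedOutputStream internals):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

// Solution 3 sketch: a stream acquires an encoder lazily the first time
// it encodes a string, and returns it to the freelist on flush().
final class EncoderPool {
    private static final ArrayDeque<CharsetEncoder> FREE = new ArrayDeque<>();

    static synchronized CharsetEncoder acquire() {
        CharsetEncoder enc = FREE.poll();
        return enc != null ? enc : StandardCharsets.UTF_8.newEncoder();
    }

    static synchronized void release(CharsetEncoder enc) {
        FREE.push(enc);  // one lock acquisition per serialization
    }
}
```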

David Yu

Dec 22, 2009, 10:06:04 PM
to Evan Jones, Protocol Buffers

I noticed this as well before ... the solution could be applied to *all* generated messages for efficiency.
There should be a writeByteArray(int fieldNumber, byte[] value) in CodedOutputStream so that the cached bytes of strings would
be written directly.  The ByteString would not help, it adds more memory since it creates a copy of the byte array.



Problem 2: Using the NIO encoders/decoders can be faster than
String.getBytes, but only if it is used >= 4 times. If used only once,
it is worse. The same is approximately true about decoding. Lame
results: http://evanjones.ca/software/java-string-encoding.html

Solution 1: Add a custom encoder/decoder to CodedOutputStream,
allocated as needed. This could be *bad* for applications that call
Message.toByteString or .toByteArray frequently for messages with few
strings, since that creates and throws away a single CodedOutputStream
instance.

Solution 2: Add a custom encoder/decoder per thread via a ThreadLocal.
This requires fetching the ThreadLocal, which is slightly expensive,
and adds some per-thread memory overhead (~4kB, tunable). However, the
allocations are done ONCE per thread, which should be significantly
better.


--
Evan Jones
http://evanjones.ca/






--
When the cat is away, the mouse is alone.
- David Yu

Kenton Varda

Dec 22, 2009, 10:14:05 PM
to David Yu, Evan Jones, Protocol Buffers
On Tue, Dec 22, 2009 at 7:06 PM, David Yu <david....@gmail.com> wrote:
There should be a writeByteArray(int fieldNumber, byte[] value) in CodedOutputStream so that the cached bytes of strings would
be written directly.  The ByteString would not help, it adds more memory since it creates a copy of the byte array.

We could cache the bytes as a ByteString.  Converting a String to a ByteString does not require a redundant copy, as ByteString has methods for this.

I think it would be better to do it this way because, in the long run, we actually want to extend ByteString to allow avoiding copies in some cases.  For example, if you are serializing a message to a ByteString (you called toByteString()) or parsing from a ByteString, then handling "bytes" fields shouldn't require any copy.  Instead, it should be possible to construct a ByteString which is a substring of some other ByteString in O(1) time, as well as concatenate ByteStrings in O(1) time.

So this way, if the size-computation step converted the String to a ByteString and cached that, no further copy of the bytes would ever be needed in many cases.

David Yu

Dec 22, 2009, 11:18:17 PM
to Kenton Varda, Evan Jones, Protocol Buffers
Cool.
Btw, the ByteString's snippet is:
 return new ByteString(text.getBytes("UTF-8"));

Another improvement would be avoiding the lookup and instead cache the Charset.forName("UTF-8") object and use it.
I believe you Google guys have also been evangelizing this :-) (PDF from http://code.google.com/p/guava-libraries/)

Kenton Varda

Dec 22, 2009, 11:21:21 PM
to David Yu, Evan Jones, Protocol Buffers
I tried doing that at one point and found that it was *much slower* -- apparently String.getBytes("UTF-8") is highly-optimized, whereas creating a Charset object (even statically) and using that is not.  :/

David Yu

Dec 22, 2009, 11:41:39 PM
to Kenton Varda, Evan Jones, Protocol Buffers
Just checked the code ... and you're absolutely right.
java.lang.StringCoding (line 277) creates an unnecessary copy of the char array which makes it slow.  I'm not sure, but it might be a Sun JDK 6 bug.

Evan Jones

Dec 23, 2009, 10:44:38 AM
to Kenton Varda, David Yu, Protocol Buffers
On Dec 22, 2009, at 19:59 , Kenton Varda wrote:
> I wonder if we can safely discard the cached byte array during
> serialization on the assumption that most messages are serialized
> only once?

This is a good idea, and it seems to me that this should definitely be
possible. It would need to be done somewhat carefully, since Message
objects are supposed to be thread safe, but I don't think this is
particularly hard. This would result in the first call to the
combination of getSerializedSize() and writeTo() only serializing
strings once, and each subsequent call would also serialize the string
once. The advantage would be nearly no "permanent" extra memory
overhead, beyond perhaps one extra field for messages that contain
strings. The disadvantage is that you serialize strings each time you
serialize the message.

The only additional tricky part: on subsequent serializations, it
would be useful to know the serialized size of the string, in order to
serialize the string directly into the output buffer, rather than
needing to create a temporary byte[] array to get the length. This is
also a solvable problem, and my lame benchmarks suggest this is only a
small improvement anyway.


On Dec 22, 2009, at 22:06 , David Yu wrote:
> I noticed this as well before ... the solution could be applied to
> *all* generated messages for efficiency.

I don't think other types have this "double encoding" overhead, but I
could be wrong. This is not a problem for nested messages, since the
call to Message.getSerializedSize() ends up
calling .getSerializedSize() on the sub-message, so the size will be
calculated without actually serializing the sub message.

While caching the serialized representation of the entire message
would be faster for applications that call message.writeTo() multiple
times on a single message, there is a significant memory cost to doing
that automatically. Personally, I think this is the sort of thing that
applications should do themselves, if they want it, and not something
that should be part of the core library.


On Dec 22, 2009, at 19:59 , Kenton Varda wrote:
> Fetching a threadlocal should just be a pointer dereference on any
> decent threading implementation. Is it really that expensive in Java?

It used to be very expensive. I think it does the "right thing" now,
but I'll measure to make sure. It's still more expensive than accessing
a local field.


> Solution 3: Maintain a private freelist of encoder objects within
> CodedOutputStream. Allocate one the first time a string is encoded
> on a particular stream object, and return it to the freelist on
> flush() (which is always called before discarding the stream unless
> an exception interrupts serialization). It may make sense for the
> freelist to additionally be thread-local to avoid locking, but if
> it's only one lock per serialization maybe it's not a big deal?

I would guess that this might be more expensive than the ThreadLocal,
but I don't know that for sure. It would avoid the "one encoder/
decoder per thread" overhead. Do you think it is worth it?

Evan

Kenton Varda

Dec 23, 2009, 4:59:17 PM
to Evan Jones, David Yu, Protocol Buffers
On Wed, Dec 23, 2009 at 7:44 AM, Evan Jones <ev...@mit.edu> wrote:
On Dec 22, 2009, at 19:59 , Kenton Varda wrote:
I wonder if we can safely discard the cached byte array during serialization on the assumption that most messages are serialized only once?

This is a good idea, and it seems to me that this should definitely be possible. It would need to be done somewhat carefully, since Message objects are supposed to be thread safe, but I don't think this is particularly hard.

Right.  Assuming pointer reads and writes are atomic -- a reasonable assumption in general, but we can guarantee it with "volatile" -- then it is safe for one thread to set the cached value to null even if another thread is reading it simultaneously.  Either the other thread will successfully get the pointer and be able to use it, or it will get null and have to rebuild it.
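Concretely, the safe race could look something like this (an illustrative holder class, not the actual generated code):

```java
import java.nio.charset.StandardCharsets;

final class CachedStringField {
    private final String value;
    private volatile byte[] cache;  // pointer reads/writes are atomic

    CachedStringField(String value) {
        this.value = value;
        this.cache = value.getBytes(StandardCharsets.UTF_8);
    }

    // Either this thread wins the race and gets the cached bytes, or it
    // sees null and rebuilds them; both outcomes are correct, no locking.
    byte[] takeBytes() {
        byte[] b = cache;  // single volatile read into a local
        cache = null;      // discard, assuming most messages serialize once
        return b != null ? b : value.getBytes(StandardCharsets.UTF_8);
    }
}
```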
 
The only additional tricky part: on subsequent serializations, it would be useful to know the serialized size of the string, in order to serialize the string directly into the output buffer, rather than needing to create a temporary byte[] array to get the length. This is also a solvable problem, and my lame benchmarks suggest this is only a small improvement anyway.

We could cache the size separately, but since it would only be useful in the case of multiple serializations, it's probably not worthwhile.  Who serializes the same immutable message multiple times?
 
Solution 3:  Maintain a private freelist of encoder objects within CodedOutputStream.  Allocate one the first time a string is encoded on a particular stream object, and return it to the freelist on flush() (which is always called before discarding the stream unless an exception interrupts serialization).  In may make sense for the freelist to additionally be thread-local to avoid locking, but if it's only one lock per serialization maybe it's not a big deal?

I would guess that this might be more expensive than the ThreadLocal, but I don't know that for sure. It would avoid the "one encoder/decoder per thread" overhead. Do you think it is worth it?

If a simple ThreadLocal is faster, use that.  I was just thinking that if ThreadLocal lookup is slow, then it would be useful to only have to do that lookup once per CodedOutputStream object -- then subsequent uses are just dereferencing a field.

Evan Jones

Jan 17, 2010, 12:33:49 PM
to Kenton Varda, Protocol Buffers
I've implemented a rough prototype of an optimization to the Java
implementation that serializes Java strings once when optimize_for ==
SPEED, rather than twice. Brief overview:

* Add a private volatile byte[] cache for each string field.
* When computing the serialized size, serialize the string and store
it in the cache.
* When serializing, use the cache if available, then set the cache to
null.

I used the ProtoBench.java program included in the SVN repository,
using the messages included in the repository. Brief summary of results:

* Serializing a protocol buffer more than once is possibly slightly
slower (~2-10%). I'm guessing the reason is that since it already has
the message length, the extra conditionals and fields for string
caching just get in the way.
* Serializing a protocol buffer once, with the length prepended, is
significantly faster (~33% for SpeedMessage1, with 10 strings, ~80%
for SpeedMessage2, with lots of strings; the benchmark measures the
time to create the protocol buffer, so the improvement is probably
actually larger).
* Classes are a bit bigger (SpeedMessage1.class with 8 strings:
24802 -> 25577)
* Size messages are unchanged.
* Messages without strings are unchanged.


Pros:
* Faster serialization of messages that contain strings.

Cons:
* More code (extra fields; conditional checking of the cache)
* More RAM (extra fields)
* Some changes to CodedOutputStream (more code to handle cached
strings).

Does this seem like a worthwhile optimization? If so, I'll clean up my
patch a bit and submit it for code review. Thanks,

Kenton Varda

Jan 19, 2010, 9:19:10 PM
to Evan Jones, Protocol Buffers
I think the 30-80% speed boost would be well worth the extra code size / memory overhead.  Please send me the patch!

Evan Jones

Apr 28, 2010, 2:17:45 PM
to Protocol Buffers
Evan Jones wrote:
> Problem 2: Using the NIO encoders/decoders can be faster than
> String.getBytes, but only if it is used >= 4 times. If used only
> once, it is worse. The same is approximately true about decoding.
> Lame results: http://evanjones.ca/software/java-string-encoding.html

I'm revisiting this old issue, thanks to being reminded about it by an
earlier message. I've tested this with "recent" JVMs, and it still
seems to hold true: using the NIO encoders and decoders can be faster
than using String.getBytes(). My numbers show that encoding is
approximately 40% faster, while decoding shows a smaller improvement.
See the following for details on this microbenchmark:

http://evanjones.ca/software/java-string-encoding.html


This surprises me, since it suggests that Sun/Oracle could replace
their implementation of String.getBytes() with something similar to my
code, and get a performance improvement. In fact, with privileged
access to the internals of a String, they should be able to do even
better.

I've integrated this change into protobuf and with the microbenchmark
in the protobuf source tree, it shows a performance improvement of
~20% (numbers below). I'll send a code review with the code shortly,
once I've cleaned it up a bit, in case anyone wants to look at it.


Pros:
+ Faster protocol buffer encoding.

Cons:
- Far more code to handle encoding (like an extra hundred lines or so)
which means there could be bugs.
- Extra memory (~2 kB per thread for encoding)


I'm unsure if the benefits outweigh the costs, particularly since the
fact this appears faster seems to be a surprising result, and I
wouldn't be shocked to find that future JVM / JDK releases could make
this "optimization" useless. I'll leave that decision to the protocol
buffer maintainers.

Evan



Results: All the serialize results are better. The deserialized
results are unchanged (as expected).


SpeedMessage1 (which is small):
Original: Serialize to byte string: 12360424 iterations in 29.83s;
90.097984MB/s
Optimized: Serialize to byte string: 15911623 iterations in 29.997s;
115.337776MB/s

SpeedMessage2 (larger):
Serialize to byte string: 33482 iterations in 29.754s; 90.757484MB/s
Serialize to byte string: 40381 iterations in 30.031s; 108.44853MB/s


Raw results on my Macbook (Core2 Duo CPU):

ORIGINAL
Benchmarking benchmarks.GoogleSpeed$SpeedMessage1 with file
google_message1.dat
Serialize to byte string: 12360424 iterations in 29.83s; 90.097984MB/s
Serialize to byte array: 12244951 iterations in 30.303s; 87.86307MB/s
Serialize to memory stream: 8699469 iterations in 23.732s; 79.70642MB/s
Serialize to /dev/null with FileOutputStream: 6075179 iterations in
27.764s; 47.578636MB/s
Serialize to /dev/null reusing FileOutputStream: 6975006 iterations in
29.99s; 50.571175MB/s
Serialize to /dev/null with FileChannel: 10375092 iterations in
29.864s; 75.54034MB/s
Serialize to /dev/null reusing FileChannel: 11166943 iterations in
30.699s; 79.09427MB/s
Deserialize from byte string: 14463117 iterations in 30.06s;
104.618355MB/s
Deserialize from byte array: 14436567 iterations in 30.007s;
104.61074MB/s
Deserialize from memory stream: 6221772 iterations in 28.024s;
48.274624MB/s

Benchmarking benchmarks.GoogleSpeed$SpeedMessage2 with file
google_message2.dat
Serialize to byte string: 33482 iterations in 29.754s; 90.757484MB/s
Serialize to byte array: 33103 iterations in 29.517s; 90.45062MB/s
Serialize to memory stream: 28872 iterations in 29.939s; 77.77786MB/s
Serialize to /dev/null with FileOutputStream: 32934 iterations in
29.927s; 88.756MB/s
Serialize to /dev/null reusing FileOutputStream: 32979 iterations in
29.887s; 88.99622MB/s
Serialize to /dev/null with FileChannel: 32447 iterations in 29.921s;
87.46108MB/s
Serialize to /dev/null reusing FileChannel: 32585 iterations in
29.903s; 87.88594MB/s
Deserialize from byte string: 38388 iterations in 29.879s;
103.620544MB/s
Deserialize from byte array: 38677 iterations in 29.866s; 104.446075MB/s
Deserialize from memory stream: 37879 iterations in 29.954s;
101.990585MB/s

OPTIMIZED
Benchmarking benchmarks.GoogleSpeed$SpeedMessage1 with file
google_message1.dat
Serialize to byte string: 15911623 iterations in 29.997s; 115.337776MB/s
Serialize to byte array: 16152646 iterations in 30.008s; 117.041954MB/s
Serialize to memory stream: 14859367 iterations in 29.551s;
109.33597MB/s
Serialize to /dev/null with FileOutputStream: 7224915 iterations in
29.954s; 52.446056MB/s
Serialize to /dev/null reusing FileOutputStream: 7479144 iterations in
30.081s; 54.062305MB/s
Serialize to /dev/null with FileChannel: 12730586 iterations in
30.025s; 92.193504MB/s
Serialize to /dev/null reusing FileChannel: 14024645 iterations in
30.399s; 100.31538MB/s
Deserialize from byte string: 14390338 iterations in 29.958s;
104.44631MB/s
Deserialize from byte array: 14496442 iterations in 30.142s;
104.57414MB/s
Deserialize from memory stream: 6653401 iterations in 29.989s;
48.24104MB/s

Benchmarking benchmarks.GoogleSpeed$SpeedMessage2 with file
google_message2.dat
Serialize to byte string: 40381 iterations in 30.031s; 108.44853MB/s
Serialize to byte array: 39447 iterations in 29.678s; 107.20024MB/s
Serialize to memory stream: 33647 iterations in 30.029s; 90.36951MB/s
Serialize to /dev/null with FileOutputStream: 39071 iterations in
30.142s; 104.543945MB/s
Serialize to /dev/null reusing FileOutputStream: 39321 iterations in
29.969s; 105.82024MB/s
Serialize to /dev/null with FileChannel: 38460 iterations in 29.955s;
103.55149MB/s
Serialize to /dev/null reusing FileChannel: 38690 iterations in
29.969s; 104.12209MB/s
Deserialize from byte string: 37440 iterations in 29.312s;
103.016495MB/s
Deserialize from byte array: 38208 iterations in 29.862s; 103.193375MB/s
Deserialize from memory stream: 37315 iterations in 29.944s;
100.505554MB/s

--
Evan Jones
http://evanjones.ca/

Kenton Varda

May 3, 2010, 1:10:03 PM
to Evan Jones, Protocol Buffers
Interesting.  Since this seems like a JVM implementation issue, I wonder if the results are different on Dalvik (Android).  Also, the extra code sounds undesirable for lite mode, but my guess is that you had to place this code inside CodedOutputStream which is shared by lite mode.  So yeah, there are some down sides.  Not sure how I feel about it.

Evan Jones

May 3, 2010, 9:11:04 PM
to Kenton Varda, Protocol Buffers
On May 3, 2010, at 13:10 , Kenton Varda wrote:
> Interesting. Since this seems like a JVM implementation issue, I
> wonder if the results are different on Dalvik (Android). Also, the
> extra code sounds undesirable for lite mode, but my guess is that
> you had to place this code inside CodedOutputStream which is shared
> by lite mode. So yeah, there are some down sides. Not sure how I
> feel about it.

Yes, I actually changed ByteString, since ByteString.copyFromUtf8 is
how protocol buffers get UTF-8 encoded strings at this point.

I can understand that viewpoint. As I said, I'm somewhat surprised
that this is faster, as I don't think there is a good reason for it.
It also uglifies the code a fair amount. I'm somewhat tempted to file
a JDK bug on this, but its unclear to me if that is actually a useful
process.

What I should probably do is look at the JDK source, and figure out
*why* my code actually seems faster ...

Evan

Evan Jones

May 3, 2010, 9:16:39 PM
to Protocol Buffers
On May 3, 2010, at 21:11 , Evan Jones wrote:
> Yes, I actually changed ByteString, since ByteString.copyFromUtf8 is
> how protocol buffers get UTF-8 encoded strings at this point.

Although now that I think about it, I think it might be possible to
enable this only for SPEED messages, if that might make it
interesting. It may require some gross stuff in ByteString, though, in
order to be able to create ByteStrings using a different code path,
without including the .class files in the lite jar.

My guess: this probably isn't worth an extra 10-20% performance
improvement. I'll try to clean up and maintain the patch, so speed
freak types can always patch their own implementations.

Kenton Varda

May 7, 2010, 6:54:55 PM
to Evan Jones, Protocol Buffers
Yeah I don't think we should add a way to inject decoders into ByteString...

I'd be very interested to hear why the JDK is not optimal here.

Evan Jones

May 7, 2010, 9:21:25 PM
to Kenton Varda, Protocol Buffers
On May 7, 2010, at 18:54 , Kenton Varda wrote:
> I'd be very interested to hear why the JDK is not optimal here.

I dug into this. I *think* the problem is that the JDK ends up
allocating a huge temporary array for the UTF-8 data. Hence, the
garbage collection cost is higher for the JDK's implementation, rather
than my implementation. Basically the code does this:


* allocate a new byte[] array that is string length * max bytes per
character ( = 4 for the UTF encoder)
* use the java.nio.charset.CharsetEncoder to encode the char[] into
the byte[] (wrapped in CharBuffer / ByteBuffer).
* copy the exact number of bytes out of the byte[] into a new byte[],
and return that.

The only "trick" the JDK gets to use that "normal" Java code can't is
that they can access the string's char[] buffer directly, whereas I
need to copy it out into a char[] array.
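That sequence, reproduced from outside the JDK (so with the extra char[] copy the JDK itself avoids), looks roughly like:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class JdkStyle {
    // Mirrors the String.getBytes("UTF-8") steps described above.
    static byte[] encode(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        char[] chars = s.toCharArray();  // the JDK reads the String's char[] directly
        // 1. worst-case allocation: length * maxBytesPerChar
        byte[] tmp = new byte[(int) (chars.length * enc.maxBytesPerChar())];
        ByteBuffer out = ByteBuffer.wrap(tmp);
        // 2. encode through the CharsetEncoder
        enc.encode(CharBuffer.wrap(chars), out, true);
        enc.flush(out);
        // 3. copy the exact number of bytes into the result
        return Arrays.copyOf(tmp, out.position());
    }
}
```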


Hence, I think what is happening is that the JDK allocates 4-5 times
as much memory per encode than I do. In the cases where the data is
ASCII, my code is faster, since it allocates exactly the right amount
of space, and doesn't need an extra copy. When the data is not ASCII,
my code may still be faster, since it doesn't overallocate quite as
much (in exchange my code does many copies).


Conclusion: there is a legitimate reason for this code to be faster
than the JDK's code. But it still may not be worth including this
patch in the main line protocol buffer code.

Kenton Varda

May 17, 2010, 3:38:34 PM
to Evan Jones, Protocol Buffers
I see.  So in fact your code is quite possibly slower in non-ASCII cases?  In fact, it sounds like having even one non-ASCII character would force extra copies to occur, which I would guess would defeat the benefit, but we'd need benchmarks to tell for sure.

Evan Jones

May 17, 2010, 7:10:02 PM
to Kenton Varda, Protocol Buffers
On May 17, 2010, at 15:38 , Kenton Varda wrote:
> I see. So in fact your code is quite possibly slower in non-ASCII
> cases? In fact, it sounds like having even one non-ASCII character
> would force extra copies to occur, which I would guess would defeat
> the benefit, but we'd need benchmarks to tell for sure.

Yes. I've been playing with this a bit in my spare time since the last
email, but I don't have any results I'm happy with yet. Rough notes:

* Encoding is (quite a bit?) faster than String.getBytes() if you
assume one byte per character.
* If you "guess" the number of bytes per character poorly and have to do
multiple allocations and copies, the regular Java version will win. If
you get it right (even if you first guess 1 byte per character) it
looks like it can be slightly faster or on par with the Java version.
* Re-using a temporary byte[] for string encoding may be faster than
String.getBytes(), which effectively allocates a temporary byte[] each
time.


I'm going to try to rework my code with a slightly different policy:

a) Assume 1 byte per character and attempt the encode. If we run out
of space:
b) Use a shared temporary buffer and continue the encode. If we run
out of space:
c) Allocate a worst case 4 byte per character buffer and finish the
encode.


This should be much better than the JDK version for ASCII, a bit
better for "short" strings that fit in the shared temporary buffer,
and not significantly worse for the rest, but I'll need to test it to
be sure.

This is sort of just a "fun" experiment for me at this point, so who
knows when I may get around to actually "finishing" this.

Christopher Smith

May 17, 2010, 10:59:27 PM
to Evan Jones, Kenton Varda, Protocol Buffers
This does somewhat suggest that it might be worthwhile to specifically
tag a field as ASCII-only. There are enough cases of this that it
could be a huge win.
--
Sent from my mobile device

Chris

Kenton Varda

May 18, 2010, 12:33:09 AM
to Christopher Smith, Evan Jones, Protocol Buffers
What if you did a fast scan of the bytes first to see if any are non-ASCII?  Maybe only do this fast scan if the data is short enough to fit in L1 cache?
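The scan itself would be trivial (a hypothetical helper, not anything in protobuf today):

```java
final class AsciiScan {
    // Fast pre-scan: a single branchy pass over the chars, cheap when the
    // string is short enough to stay in L1 cache.
    static boolean isAscii(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) return false;  // non-ASCII found
        }
        return true;
    }
}
```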

Christopher Smith

May 18, 2010, 2:01:10 AM
to Kenton Varda, Evan Jones, Protocol Buffers
That seems simple enough and likely to produce a net win often enough.

--Chris

Evan Jones

May 30, 2010, 2:32:50 PM
to Kenton Varda, Christopher Smith, Protocol Buffers
On May 18, 2010, at 0:33 , Kenton Varda wrote:
> What if you did a fast scan of the bytes first to see if any are non-
> ASCII? Maybe only do this fast scan if the data is short enough to
> fit in L1 cache?

I didn't try this exact idea, but my guess is that it may not be a
win: to get "fast" access to the char[] in the string, you need to
copy them into a separate char[] array. If you are doing this much
work, you might as well encode them into UTF-8 while you are at it.

Conclusion: 10% performance win for ASCII (garbage collection
savings); no win for general UTF-8 text. Not worth it for protocol
buffers, but I'll try digging into the decoding.


The approach I found worked the best:

1. Copy the string into a pre-allocated and re-used char[] array. This
is needed since the JDK does not permit access to the String's
char[], to enforce immutability. This is a performance "loss" vs. the
JDK, which can access the char[] directly.

2. Encode into a pre-allocated byte[] array. This will completely
handle short strings. For long strings, you end up needing to allocate
additional temporary space. This is "better" than the JDK, which
allocates a new temporary buffer = 4 * str.length().

3. Allocate the final byte[] array, and System.arraycopy into it. This
is the same as the JDK.

Conclusion: This is only better than the JDK in that it reduces
allocation and garbage collection. It is worse than the JDK because it
requires a copy from the String into another char[]. On my tests with
ASCII-only data, it ends up ~10% faster. In my tests with UTF-8 data,
it ends up about the same speed. In other words: this probably isn't
worth it for Protocol Buffers: this is a small performance
improvement, but complicates the code significantly.

However: for applications that can encode the string directly to the
output buffer, this custom code can be significantly faster. Since
protocol buffers need to encode to another buffer first to get the
string length, though, that advantage does not apply here.


I may spend some time poking around at string *decoding*, because
there we should be able to decode directly from the input buffer,
saving a copy and an allocation of a temporary byte[] array. Unclear
if this will actually be significantly faster, but it might be
slightly faster.

David Dabbs

May 31, 2010, 2:25:31 PM
to Evan Jones, Kenton Varda, Christopher Smith, Protocol Buffers

The approach I found worked the best:

1. Copy the string into a pre-allocated and re-used char[] array. This
is needed since the JDK does not permit access to the String's
char[], to enforce immutability. This is a performance "loss" vs. the
JDK, which can access the char[] directly....


Evan,

you may access a String's internals via reflection in a "safe," albeit
potentially implementation-specific way. See class code below.
As long as your java.lang.String uses "value" for the char[] and
"offset" for the storage offset, this should work.
No sun.misc.Unsafe used. Only tested/used on JDK6.


David


/**************************************************************************/
import java.lang.reflect.*;

public final class ReflectionUtils {

    /**
     * There is no explicit Modifier constant for package-private, so 0 is used.
     */
    public static final int MODIFIER_PACKAGE_PRIVATE = 0x00000000;

    /** Field object for accessing String::value character storage. */
    public static final Field STRING_VALUE_FIELD =
            getFieldAccessible(String.class, "value");

    /** Field object for accessing String::offset, the first index used in
     *  the value[] char storage. */
    public static final Field STRING_OFFSET_FIELD =
            getFieldAccessible(String.class, "offset");

    /**
     * Package-private String constructor which shares the value array for
     * speed.
     *
     * Use when a number of String objects share the same char[] storage.
     *
     * See String(int offset, int count, char value[]).
     */
    public static final Constructor<?> STRING_PP_CTOR =
            getConstructor(String.class, MODIFIER_PACKAGE_PRIVATE,
                    int.class, int.class, char[].class);

    /**
     * To avoid violating final field semantics, take care to only _read_
     * the char[] value returned.
     */
    public static char[] getChars(final String s) {
        try {
            // use reflection to read the char[] value from the string. . .
            return (char[]) STRING_VALUE_FIELD.get(s);
        } catch (Throwable t) {
            return null;
        }
    }

    public static String sharedString(final char[] chars, final int offset,
            final int length) {
        try {
            return (String) STRING_PP_CTOR.newInstance(offset, length, chars);
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        } catch (Throwable t) {
            t.printStackTrace();
        }
        return null;
    }

    public static Field getFieldAccessible(final Class<?> clazz,
            final String fieldName) {
        Field fld = null;
        try {
            fld = clazz.getDeclaredField(fieldName);
            fld.setAccessible(true);
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        } catch (SecurityException e) {
            e.printStackTrace();
        }
        return fld;
    }

    public static Constructor<?> getConstructor(final Class<?> clazz,
            final int searchModifier, final Class<?>... paramTypes) {

        if (clazz == null) {
            throw new IllegalArgumentException("A class parameter is required");
        }

        try {
            //
            // There is no explicit Modifier accessor constant for
            // package-private, so 0 is used.
            //
            for (Constructor<?> ctor : clazz.getDeclaredConstructors()) {

                if (searchModifier == (ctor.getModifiers() &
                        (Modifier.PUBLIC | Modifier.PRIVATE | Modifier.PROTECTED))) {
                    //
                    // access modifier matches. . .
                    //
                    final Class<?>[] parameterTypes = ctor.getParameterTypes();
                    if (parameterTypes.length == paramTypes.length) {
                        //
                        // same number of parameters. . .
                        //
                        for (int i = 0; i < parameterTypes.length; i++) {
                            if (!parameterTypes[i].equals(paramTypes[i])) {
                                // one parameter not of correct type, so bail. . .
                                // note ctor variable used as success marker below
                                ctor = null;
                                break;
                            }
                        }
                        if (ctor != null) {
                            // all ctor parameter types match, so call
                            // ctor.setAccessible(true) and return it
                            ctor.setAccessible(true);
                            return ctor;
                        }
                    }
                }
            }
        } catch (Throwable t) {
            // will return null below
        }
        return null; // error or ctor not found
    }

    /**
     * The AccessibleObject class is the base class for Field, Method and
     * Constructor objects. It provides the ability to flag a reflected
     * object as suppressing default Java language access control checks
     * when it is used. The access checks--for public, default (package)
     * access, protected, and private members--are performed when Fields,
     * Methods or Constructors are used to set or get fields, to invoke
     * methods, or to create and initialize new instances of classes,
     * respectively.
     *
     * <p>Setting the <tt>accessible</tt> flag in a reflected object
     * permits sophisticated applications with sufficient privilege, such
     * as Java Object Serialization or other persistence mechanisms, to
     * manipulate objects in a manner that would normally be prohibited.
     *
     * @see AccessibleObject
     **/
    public static boolean setAccessible(final AccessibleObject o) {
        try {
            o.setAccessible(true);
            return true;
        } catch (Throwable /*SecurityException*/ e) {
            return false;
        }
    }
}

Evan Jones

unread,
May 31, 2010, 5:31:57 PM5/31/10
to David Dabbs, Protocol Buffers
On May 31, 2010, at 14:25 , David Dabbs wrote:
> you may access a String's internals via reflection in a "safe," albeit
> potentially implementation-specific way. See class code below.
> As long as your java.lang.String uses "value" for the char[] and
> "offset" for the storage offset, this should work.
> No sun.misc.Unsafe used. Only tested/used on JDK6.

Good idea! Unfortunately, this isn't much faster for small strings. It
is faster if you just get the value char[] array. However, when I
modified my implementation to get both the char[] value and int
offset, it ended up being about the same speed for my test data set,
which is composed mostly of "short" UTF-8 and ASCII strings.
Unfortunately, a correct implementation will need to get both values.
Since this is also somewhat "dangerous," it doesn't seem like a great
idea for my data.

At any rate: I'll try to find some time to prepare a protocol buffer
patch with my "encode to a temporary ByteBuffer" trick, which does
make things a bit faster. I won't necessarily advocate for this patch
to be included, but after having spent this much time on this stuff,
I'll certainly try to maintain the patch for a while, in case others
are interested.

David Dabbs

unread,
Jun 1, 2010, 2:29:41 AM6/1/10
to Evan Jones, Protocol Buffers
Even with the extra call to access the offset, I would think there would be
some advantage to not making the data copies, which generate garbage cruft.

Am interested in your patch whenever it surfaces.

I seem to remember you saying that using an Encoder/Decoder didn't pay off
when the number of strings to en/decode was small.
Did the same hold true when using a ThreadLocal?


David

Evan



Evan Jones

unread,
Jun 1, 2010, 6:31:50 AM6/1/10
to David Dabbs, Protocol Buffers
On Jun 1, 2010, at 2:29 , David Dabbs wrote:
> Even with the extra call to access the offset, I would think there
> would be some advantage to not making the data copies, which
> generate garbage cruft.

However, the way I am doing it doesn't generate any garbage: I keep a
temporary char[] buffer around to use with String.getChars(). The
cost is copying chars vs. using reflection to access two fields. With
the small strings I tested (average ~30 bytes per string), the copy is
a bit cheaper than the reflection access. I assume that for larger
strings, the reflection approach will probably be better.

Which reminds me: I really need to test this with larger strings to
make sure it isn't dog slow in that case.

> I seem to remember you saying that using an Encoder/Decoder didn't
> pay off when the number of strings to en/decode was small. Did the
> same hold true when using a ThreadLocal?

From memory, the ThreadLocal appears to be very cheap, and not make
much performance difference, but I should double check this as well.
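The ThreadLocal pattern under discussion looks roughly like the sketch below (class name hypothetical). Each thread gets its own coder because CharsetEncoder/CharsetDecoder are stateful and not thread-safe; ThreadLocal.get() itself is just a cheap per-thread map lookup.

```java
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of per-thread coder reuse (anonymous-subclass
// style, as one would have written it on JDK6-era code bases).
public final class ThreadLocalCoders {
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            new ThreadLocal<CharsetEncoder>() {
                @Override protected CharsetEncoder initialValue() {
                    return StandardCharsets.UTF_8.newEncoder();
                }
            };

    private static final ThreadLocal<CharsetDecoder> DECODER =
            new ThreadLocal<CharsetDecoder>() {
                @Override protected CharsetDecoder initialValue() {
                    return StandardCharsets.UTF_8.newDecoder();
                }
            };

    public static CharsetEncoder utf8Encoder() {
        CharsetEncoder e = ENCODER.get();
        e.reset(); // clear any state left over from the previous use
        return e;
    }

    public static CharsetDecoder utf8Decoder() {
        CharsetDecoder d = DECODER.get();
        d.reset();
        return d;
    }
}
```

Within a single thread, repeated calls return the same encoder instance, so the setup cost is amortized across every string that thread encodes.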

Kenton Varda

unread,
Jun 3, 2010, 6:12:39 PM6/3/10
to David Dabbs, Evan Jones, Christopher Smith, Protocol Buffers
Please don't use reflection to reach into private internals of classes you don't maintain.  We have "public" and "private" for a reason.  Furthermore, this access may throw a SecurityException if a SecurityManager is in use.