Before I attempt to make any improvements, I want to see if anyone
(Kenton primarily) has any opinions on whether these make sense. They
would add ~100 lines of code to replace something that is now a few
lines of code, for a modest improvement (approximately 40% less time
per encode/decode, on a list of 1400 strings in different languages).
I haven't tried adding this to protobufs yet, so the final performance
improvements are unknown:
Problem 1: A Java protobuf string is stored as a String instance. It
typically gets converted to UTF-8 *twice*: Once in getSerializedSize()
via a call to CodedOutputStream.computeStringSize, then again in
writeTo().
Solution: Cache the byte[] version of String fields. This would
increase the memory size of each message (an additional pointer per
string, plus the space for the byte[]), but would HALVE the number of
conversions. I suspect this will be a fair bit faster. If added, it
should only be added for the SPEED generated messages.
Problem 2: Using the NIO encoders/decoders can be faster than
String.getBytes, but only if it is used >= 4 times. If used only once,
it is worse. The same is approximately true about decoding. Lame
results: http://evanjones.ca/software/java-string-encoding.html
Solution 1: Add a custom encoder/decoder to CodedOutputStream,
allocated as needed. This could be *bad* for applications that call
Message.toByteString or .toByteArray frequently for messages with few
strings, since that creates and throws away a single CodedOutputStream
instance.
Solution 2: Add a custom encoder/decoder per thread via a ThreadLocal.
This requires fetching the ThreadLocal, which is slightly expensive,
and adds some per-thread memory overhead (~4 kB, tunable). However,
the allocations are done ONCE per thread, which should be
significantly better.
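As a rough sketch of Solution 2 (the class and method names below are hypothetical illustrations, not actual protobuf code), the per-thread encoder might look like this:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Sketch of a per-thread UTF-8 encoder. The CharsetEncoder is allocated
// once per thread; only the output byte[] is allocated per call.
// All names here are hypothetical, not the actual protobuf API.
final class ThreadLocalUtf8 {
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            new ThreadLocal<CharsetEncoder>() {
                @Override protected CharsetEncoder initialValue() {
                    return Charset.forName("UTF-8").newEncoder();
                }
            };

    static byte[] encode(String s) {
        CharsetEncoder encoder = ENCODER.get();  // one allocation per thread
        encoder.reset();
        // maxBytesPerChar() bounds the worst-case output size.
        ByteBuffer out = ByteBuffer.allocate(
                (int) (s.length() * encoder.maxBytesPerChar()));
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isError() || encoder.flush(out).isError()) {
            throw new IllegalArgumentException("invalid string");
        }
        byte[] bytes = new byte[out.position()];
        out.flip();
        out.get(bytes);
        return bytes;
    }
}
```

The ThreadLocal fetch cost mentioned above is the price paid on every call to encode().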
--
Evan Jones
http://evanjones.ca/
There should be a writeByteArray(int fieldNumber, byte[] value) in CodedOutputStream so that the cached bytes of strings could
be written directly. A ByteString would not help; it adds more memory since it creates a copy of the byte array.
This is a good idea, and it seems to me that this should definitely be
possible. It would need to be done somewhat carefully, since Message
objects are supposed to be thread safe, but I don't think this is
particularly hard. This would result in the first call to the
combination of getSerializedSize() and writeTo() only serializing
strings once, and each subsequent call would also serialize the string
once. The advantage would be nearly no "permanent" extra memory
overhead, beyond perhaps one extra field for messages that contain
strings. The disadvantage is that you serialize strings each time you
serialize the message.
The only additional tricky part: on subsequent serializations, it
would be useful to know the serialized size of the string, in order to
serialize the string directly into the output buffer, rather than
needing to create a temporary byte[] array to get the length. This is
also a solvable problem, and my lame benchmarks suggest this is only a
small improvement anyway.
On Dec 22, 2009, at 22:06, David Yu wrote:
> I noticed this as well before ... the solution could be applied to
> *all* generated messages for efficiency.
I don't think other types have this "double encoding" overhead, but I
could be wrong. This is not a problem for nested messages, since the
call to Message.getSerializedSize() ends up
calling .getSerializedSize() on the sub-message, so the size will be
calculated without actually serializing the sub message.
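To illustrate why nested messages avoid the double encoding, the recursive size computation follows roughly this pattern (a simplified sketch; generated protobuf Java code memoizes the size in a field, but the names below are illustrative):

```java
// Simplified sketch: a parent message asks its sub-message for its size;
// nothing is actually encoded during size computation, and the result is
// memoized so writeTo() can reuse it. Names are illustrative, not the
// actual generated protobuf code.
abstract class SketchMessage {
    private int memoizedSize = -1;

    final int getSerializedSize() {
        if (memoizedSize == -1) {
            memoizedSize = computeSize();  // computed once, reused by writeTo()
        }
        return memoizedSize;
    }

    /** Sums field sizes, recursing into sub-messages via their own
     *  getSerializedSize(). No bytes are produced here. */
    abstract int computeSize();
}
```

Strings are the exception: computing a string field's size requires actually performing the UTF-8 conversion, which is what creates the double-encoding problem.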
While caching the serialized representation of the entire message
would be faster for applications that call message.writeTo() multiple
times on a single message, there is a significant memory cost to doing
that automatically. Personally, I think this is the sort of thing that
applications should do themselves, if they want it, and not something
that should be part of the core library.
On Dec 22, 2009, at 19:59, Kenton Varda wrote:
> Fetching a threadlocal should just be a pointer dereference on any
> decent threading implementation. Is it really that expensive in Java?
It used to be very expensive. I think it does the "right thing" now,
but I'll measure to make sure. It's still more expensive than
accessing a local field.
> Solution 3: Maintain a private freelist of encoder objects within
> CodedOutputStream. Allocate one the first time a string is encoded
> on a particular stream object, and return it to the freelist on
> flush() (which is always called before discarding the stream unless
> an exception interrupts serialization). It may make sense for the
> freelist to additionally be thread-local to avoid locking, but if
> it's only one lock per serialization maybe it's not a big deal?
I would guess that this might be more expensive than the ThreadLocal,
but I don't know that for sure. It would avoid the "one encoder/
decoder per thread" overhead. Do you think it is worth it?
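A minimal sketch of that freelist (hypothetical names; the 16-entry bound and the one-lock-per-acquire/release design are arbitrary illustrations of the idea, not a protobuf implementation):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayDeque;

// Sketch of Solution 3: a shared freelist of UTF-8 encoders. acquire()
// would be called the first time a stream encodes a string; release()
// would be called from flush(). Names are hypothetical.
final class EncoderFreelist {
    private static final ArrayDeque<CharsetEncoder> FREE =
            new ArrayDeque<CharsetEncoder>();

    static CharsetEncoder acquire() {
        synchronized (FREE) {             // one lock per serialization
            CharsetEncoder e = FREE.poll();
            if (e != null) {
                return e;
            }
        }
        // Freelist empty: allocate a fresh encoder.
        return Charset.forName("UTF-8").newEncoder();
    }

    static void release(CharsetEncoder e) {
        e.reset();
        synchronized (FREE) {
            if (FREE.size() < 16) {       // bound the list to limit memory
                FREE.push(e);
            }
        }
    }
}
```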
Evan
On Dec 22, 2009, at 19:59, Kenton Varda wrote:
> This is a good idea, and it seems to me that this should definitely be possible. It would need to be done somewhat carefully, since Message objects are supposed to be thread safe, but I don't think this is particularly hard.
I wonder if we can safely discard the cached byte array during serialization on the assumption that most messages are serialized only once?
* Add a private volatile byte[] cache for each string field.
* When computing the serialized size, serialize the string and store
it in the cache.
* When serializing, use the cache if available, then set the cache to
null.
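For a single string field, the steps above could look roughly like this (hypothetical names, not the actual generated protobuf code):

```java
import java.nio.charset.Charset;

// Sketch of the use-once string cache: computing the serialized size
// fills the cache, and the write that usually follows consumes it.
// Names are hypothetical illustrations of the scheme, not generated code.
final class CachedStringField {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    private final String name;
    // volatile so concurrent readers see either null or a fully built array.
    private volatile byte[] nameBytesCache;

    CachedStringField(String name) {
        this.name = name;
    }

    int computeNameSize() {
        byte[] bytes = name.getBytes(UTF8);
        nameBytesCache = bytes;  // cache for the writeTo() that follows
        return bytes.length;
    }

    byte[] takeNameBytes() {
        byte[] bytes = nameBytesCache;
        if (bytes == null) {
            // writeTo() without a preceding size computation: encode now.
            return name.getBytes(UTF8);
        }
        // Discard, on the assumption most messages serialize only once.
        nameBytesCache = null;
        return bytes;
    }
}
```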
I used the ProtoBench.java program included in the SVN repository,
using the messages included in the repository. Brief summary of results:
* Serializing a protocol buffer more than once is possibly slightly
slower (~2-10%). I'm guessing the reason is that since it already has
the message length, the extra conditionals and fields for string
caching just get in the way.
* Serializing a protocol buffer once, with the length prepended, is
significantly faster (~33% for SpeedMessage1, with 10 strings, ~80%
for SpeedMessage2, with lots of strings; the benchmark measures the
time to create the protocol buffer, so the improvement is probably
actually larger).
* Classes are a bit bigger (SpeedMessage1.class with 8 strings:
24802 -> 25577 bytes)
* Size messages are unchanged.
* Messages without strings are unchanged.
Pros:
* Faster serialization of messages that contain strings.
Cons:
* More code (extra fields; conditional checking of the cache)
* More RAM (extra fields)
* Some changes to CodedOutputStream (more code to handle cached
strings).
Does this seem like a worthwhile optimization? If so, I'll clean up my
patch a bit and submit it for code review. Thanks,
I didn't try this exact idea, but my guess is that it may not be a
win: to get "fast" access to the chars in the string, you need to
copy them into a separate char[] array. If you are doing that much
work, you might as well encode them into UTF-8 while you are at it.
Conclusion: 10% performance win for ASCII (garbage collection
savings); no win for general UTF-8 text. Not worth it for protocol
buffers, but I'll try digging into the decoding.
The approach I found worked the best:
1. Copy the string into a pre-allocated and re-used char[] array. This
is needed since the JDK does not permit access to the String's
char[], to enforce immutability. This is a performance "loss" vs. the
JDK, which can access the char[] directly.
2. Encode into a pre-allocated byte[] array. This will completely
handle short strings. For long strings, you end up needing to allocate
additional temporary space. This is "better" than the JDK, which
allocates a new temporary buffer = 4 * str.length().
3. Allocate the final byte[] array, and System.arraycopy into it. This
is the same as the JDK.
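The three steps might be sketched as follows (a hypothetical class; the buffer sizes are arbitrary, and long strings fall back to the JDK here for brevity):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Sketch of the three-step encode with pre-allocated, reused buffers.
// Hypothetical class, not the patch itself; not thread safe (one
// instance per thread or per stream is assumed).
final class ReusedBufferEncoder {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    private final CharsetEncoder encoder = UTF8.newEncoder();
    private final char[] chars = new char[1024];                  // step 1 buffer
    private final ByteBuffer bytes = ByteBuffer.allocate(4096);   // step 2 buffer

    byte[] encode(String s) {
        if (s.length() > chars.length) {
            // Long string: this sketch just falls back to the JDK.
            return s.getBytes(UTF8);
        }
        s.getChars(0, s.length(), chars, 0);   // step 1: copy out of the String
        bytes.clear();
        encoder.reset();
        encoder.encode(CharBuffer.wrap(chars, 0, s.length()), bytes, true);
        encoder.flush(bytes);
        byte[] result = new byte[bytes.position()];  // step 3: final alloc + copy
        bytes.flip();
        bytes.get(result);
        return result;
    }
}
```

The reuse of `chars` and `bytes` across calls is what provides the garbage-collection savings described above; the final arraycopy in step 3 is unavoidable if the caller needs an exact-length byte[].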
Conclusion: This is only better than the JDK in that it reduces
allocation and garbage collection. It is worse than the JDK because it
requires a copy from the String into another char[]. On my tests with
ASCII-only data, it ends up ~10% faster. In my tests with UTF-8 data,
it ends up about the same speed. In other words: this probably isn't
worth it for Protocol Buffers: this is a small performance
improvement, but complicates the code significantly.
However: for applications that can encode the string directly to the
output buffer, this custom code can be significantly faster. Since
protocol buffers need to encode to another buffer first to get the
string length, though, this advantage is lost.
I may spend some time poking around at string *decoding*, because
there we should be able to decode directly from the input buffer,
saving a copy and an allocation of a temporary byte[] array. Unclear
if this will actually be significantly faster, but it might be
slightly faster.
> 1. Copy the string into a pre-allocated and re-used char[] array. This
> is needed since the JDK does not permit access to the String's
> char[], to enforce immutability. This is a performance "loss" vs. the
> JDK, which can access the char[] directly....
Evan,
you may access a String's internals via reflection in a "safe," albeit
potentially implementation-specific way. See class code below.
As long as your java.lang.String uses "value" for the char[] and
"offset" for the storage offset, this should work.
No sun.misc.Unsafe used. Only tested/used on JDK6.
David
/***************************************************************************/
import java.lang.reflect.*;

public final class ReflectionUtils {
    /** There is no explicit Modifier constant for package-private, so 0 is used. */
    public static final int MODIFIER_PACKAGE_PRIVATE = 0x00000000;

    /** Field object for accessing String::value character storage. */
    public static final Field STRING_VALUE_FIELD =
            getFieldAccessible(String.class, "value");

    /** Field object for accessing String::offset, the first index used in
     *  the value[] char storage. */
    public static final Field STRING_OFFSET_FIELD =
            getFieldAccessible(String.class, "offset");

    /**
     * Package-private String constructor which shares the value array for
     * speed.
     *
     * Use when a number of String objects share the same char[] storage.
     *
     * See String(int offset, int count, char value[]).
     */
    public static final Constructor<?> STRING_PP_CTOR = getConstructor(
            String.class, MODIFIER_PACKAGE_PRIVATE,
            int.class, int.class, char[].class);

    /**
     * To avoid violating final field semantics, take care to only _read_
     * the char[] value returned.
     */
    public static char[] getChars(final String s) {
        try {
            // Use reflection to read the char[] value from the string.
            return (char[]) STRING_VALUE_FIELD.get(s);
        } catch (Throwable t) {
            return null;
        }
    }

    public static String sharedString(final char[] chars, final int offset,
            final int length) {
        try {
            return (String) STRING_PP_CTOR.newInstance(offset, length, chars);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        return null;
    }

    public static Field getFieldAccessible(final Class<?> clazz,
            final String fieldName) {
        Field fld = null;
        try {
            fld = clazz.getDeclaredField(fieldName);
            fld.setAccessible(true);
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        } catch (SecurityException e) {
            e.printStackTrace();
        }
        return fld;
    }

    public static Constructor<?> getConstructor(final Class<?> clazz,
            final int searchModifier, final Class<?>... paramTypes) {
        if (clazz == null) {
            throw new IllegalArgumentException("A class parameter is required");
        }
        try {
            // There is no explicit Modifier constant for package-private,
            // so 0 is used.
            for (Constructor<?> ctor : clazz.getDeclaredConstructors()) {
                if (searchModifier == (ctor.getModifiers()
                        & (Modifier.PUBLIC | Modifier.PRIVATE | Modifier.PROTECTED))) {
                    // Access modifier matches. . .
                    final Class<?>[] parameterTypes = ctor.getParameterTypes();
                    if (parameterTypes.length == paramTypes.length) {
                        // Same number of parameters. . .
                        for (int i = 0; i < parameterTypes.length; i++) {
                            if (!parameterTypes[i].equals(paramTypes[i])) {
                                // One parameter not of the correct type, so bail.
                                // Note: ctor variable used as success marker below.
                                ctor = null;
                                break;
                            }
                        }
                        if (ctor != null) {
                            // All ctor parameter types match.
                            ctor.setAccessible(true);
                            return ctor;
                        }
                    }
                }
            }
        } catch (Throwable t) {
            // Will return null below.
        }
        return null; // error or ctor not found
    }

    /**
     * The AccessibleObject class is the base class for Field, Method and
     * Constructor objects. Setting the accessible flag in a reflected
     * object permits sophisticated applications with sufficient privilege,
     * such as Java Object Serialization or other persistence mechanisms,
     * to manipulate objects in a manner that would normally be prohibited.
     *
     * @see AccessibleObject
     */
    public static boolean setAccessible(final AccessibleObject o) {
        try {
            o.setAccessible(true);
            return true;
        } catch (Throwable /*SecurityException*/ e) {
            return false;
        }
    }
}
Good idea! Unfortunately, this isn't much faster for small strings. It
is faster if you just get the value char[] array. However, when I
modified my implementation to get both the char[] value and int
offset, it ended up being about the same speed for my test data set,
which is composed mostly of "short" UTF-8 and ASCII strings.
Unfortunately, a correct implementation will need to get both values.
Since this is also somewhat "dangerous," it doesn't seem like a great
idea for my data.
At any rate: I'll try to find some time to try and prepare a protocol
buffer patch with my "encode to a temporary ByteBuffer" trick, which
does make things a bit faster. I won't necessarily advocate this patch
to be included, but after having wasted this much time on this stuff,
I'll certainly try to maintain the patch for a while, in case others
are interested.
Am interested in your patch whenever it surfaces.
I seem to remember you saying that using an Encoder/Decoder didn't pay off
when the number of strings to en/decode was small.
Did the same hold true when using a ThreadLocal?
David
Evan
However, the way I am doing it doesn't generate any garbage: I keep a
temporary char[] buffer around to use with String.getChars(). The
cost is copying chars vs. using reflection to access two fields. With
the small strings I tested (average ~30 bytes per string), the copy is
a bit cheaper than the reflection access. I assume that for larger
strings, the reflection approach will probably be better.
Which reminds me: I really need to test this with larger strings to
make sure it isn't dog slow in that case.
> I seem to remember you saying that using an Encoder/Decoder didn't
> pay off when the number of strings to en/decode was small. Did the
> same hold true when using a ThreadLocal?
From memory, the ThreadLocal appears to be very cheap and doesn't make
much of a performance difference, but I should double-check this as well.