Research into speeding up encoding UTF8. I've already bumped up french by 70%.

491 views
Skip to first unread message

Kevin Burton

unread,
Sep 30, 2015, 7:01:01 PM9/30/15
to mechanical-sympathy
I'm trying to find any research done on improving the speed of UTF8 encoders.

I was looking at Jackson and the JDK UTF8 encoders and found a few novel strategies to improve performance.

My prototype using jmh as a benchmark harness and some initial sample documents from Wikipedia show speedups of 70% for French, 30% for German, and 15% for Korean.

I think I can get Korean up by 400% though... I'm working on two algorithms and I have only implemented the first. 

However, I can't seem to find any research in this area.  You would think that there would be a massive amount considering UTF8 drives a huge portion of the web.


Roman Leventov

unread,
Sep 30, 2015, 8:39:37 PM9/30/15
to mechanica...@googlegroups.com
There are some small interesting optimizations in Guava's Utf8 class: https://github.com/google/guava/blob/cb0cf6a40e815e662fb63438dbbfebaf6d9dcb80/guava/src/com/google/common/base/Utf8.java#L53-L103

(Haven't looked into JDK or Jackson source, probably they have these optimizations too)

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin Burton

unread,
Sep 30, 2015, 10:10:54 PM9/30/15
to mechanical-sympathy

Looks like these are more util methods for UTF8... It might be that most people don't care but in our situation encoding UTF8 is something we do often so should benefit our stack.  

Jay Askren

unread,
Oct 1, 2015, 11:13:50 AM10/1/15
to mechanical-sympathy

Nitsan Wakart

unread,
Oct 2, 2015, 10:08:49 AM10/2/15
to mechanica...@googlegroups.com
can you share the benchmark code?



Scott Carey

unread,
Oct 3, 2015, 3:01:07 AM10/3/15
to mechanical-sympathy
I wrote some UTF8 encoding / decoding based on a state machine a while back, using principles like this:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/  but never completed it.  I think its in some never applied patch to Apache Avro, somewhere, as it was not a big enough difference in speed versus the additional complexity of code in the project.  If I recall correctly, it helped quite a bit more one direction than the other.   Also, at the time passing the string "UTF-8" to the jre was faster than passing a Charset constant, because the former was better optimized and created less garbage.  I don't know if Java 8 has cleaned that up or not.

It could probably go even faster using Unsafe.

Kevin Burton

unread,
Oct 3, 2015, 4:00:42 PM10/3/15
to mechanical-sympathy
Yeah.  Unsafe should definitely give a boost. I'm actually about to create another thread about this now about how the JVM doesn't use Unsafe where it could.

My changes were more algorithmic, but I have to admit that the JVM is improving the code over and above what the algorithmic complexity suggests.

Specifically, some code should be 3x slower than I had assumed but it's really, in practice, only 20% slower.  I assume this is due to the Java compiler either doing very clever things, or very foolish things sometimes.

I might have to dive into the bytecode to see what it's actually doing but of course that makes the project I'm working on much harder.

On Saturday, October 3, 2015 at 12:01:07 AM UTC-7, Scott Carey wrote:
I wrote some UTF8 encoding / decoding based on a state machine a while back, using principles like this:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/  but never completed it.  I think its in some never applied patch to Apache Avro, somewhere, as it was not a big enough difference in speed versus the additional complexity of code in the project.  If I recall correctly, it helped quite a bit more one direction than the other.   Also, at the time passing the string "UTF-8" to the jre was faster than passing a Charset constant, because the former was better optimized and created less garbage.  I don't know if Java 8 has cleaned that up or not.


I agree about your feature creep and complexity here.  My encoder is alredy getting somewhat complex.  The Jackson encoder is also equally complex. 

Christoph Engelbert

unread,
Oct 4, 2015, 1:43:05 PM10/4/15
to mechanical-sympathy
Remember that Java 9 might change some internals to the String class: http://openjdk.java.net/jeps/254

Nitsan Wakart

unread,
Oct 8, 2015, 7:07:07 AM10/8/15
to mechanica...@googlegroups.com
Here's a benchmark you can use:
It also demonstrates the custom encoder I wrote which indeed uses Unsafe (and can indeed break in future should String internals change).
Using a JMH benchmark should also enable you to use the perfasm profiler to determine hotspots etc.
Have fun storming the castle :-)



On Saturday, October 3, 2015 10:00 PM, Kevin Burton <burto...@gmail.com> wrote:


Yeah.  Unsafe should definitely give a boost. I'm actually about to create another thread about this now about how the JVM doesn't use Unsafe where it could.

My changes were more algorithmic, but I have to admit that the JVM is improving the code over and above what the algorithmic complexity suggests.

Specifically, some code should be 3x slower than I had assumed but it's really, in practice, only 20% slower.  I assume this is due to the Java compiler either doing very clever things, or very foolish things sometimes.

I might have to dive into the bytecode to see what it's actually doing but of course that makes the project I'm working on much harder.

On Saturday, October 3, 2015 at 12:01:07 AM UTC-7, Scott Carey wrote:
I wrote some UTF8 encoding / decoding based on a state machine a while back, using principles like this:

http://bjoern.hoehrmann.de/ utf-8/decoder/dfa/  but never completed it.  I think its in some never applied patch to Apache Avro, somewhere, as it was not a big enough difference in speed versus the additional complexity of code in the project.  If I recall correctly, it helped quite a bit more one direction than the other.   Also, at the time passing the string "UTF-8" to the jre was faster than passing a Charset constant, because the former was better optimized and created less garbage.  I don't know if Java 8 has cleaned that up or not.


I agree about your feature creep and complexity here.  My encoder is alredy getting somewhat complex.  The Jackson encoder is also equally complex. 
Reply all
Reply to author
Forward
0 new messages