protobuf not handling special characters between Java server and C++ client


Hitesh Jethwani

Jan 25, 2011, 3:27:43 PM
to Protocol Buffers
We have a Java web server. When it sends a string containing a
character such as É, that character is received as multiple bytes on
the C++ client end. As a result, something like SÉBASTIEN gets
displayed as S<2 funny characters>BASTIEN. My assumption about what
happens internally: since Java strings are UTF-16 by default, such
characters get written out across multiple bytes, and the C++ side
then handles each of those bytes as an individual character, hence the
increased string length and the invalid display.
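For illustration, a minimal standalone Java sketch (nothing
protobuf-specific) of the byte counts involved; 'É' is U+00C9, which
UTF-8 encodes as the two bytes 0xC3 0x89:

    import java.nio.charset.StandardCharsets;

    public class Utf8Demo {
        public static void main(String[] args) {
            String s = "SÉBASTIEN"; // 9 characters
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            // Prints "9 chars, 10 UTF-8 bytes": 'É' alone takes two bytes.
            System.out.println(s.length() + " chars, " + utf8.length + " UTF-8 bytes");
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(hex); // 53 C3 89 42 41 53 54 49 45 4E
        }
    }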
Also, if on the C++ client I run MultiByteToWideChar(CP_UTF8, ...),
the data gets converted correctly into a wide-character string. The
problem is that I want to avoid using wide characters on the C++ end.
Running WideCharToMultiByte(CP_ACP, ...) on that wide-character string
afterwards resolves my issue, but I want to avoid this double
conversion.
As may be evident from the above, I am new to Java and protobuf. Any
help on this is appreciated.

Evan Jones

Jan 25, 2011, 4:26:39 PM
to Hitesh Jethwani, Protocol Buffers
On Jan 25, 2011, at 15:27, Hitesh Jethwani wrote:
> As may be evident from above I am naive at Java and Protobuf. Any
> help on this is appreciated.


The Java protocol buffer API encodes strings as UTF-8 on the wire.
Since C++ strings have no built-in Unicode support, what you get on
the other end is the raw UTF-8 encoded data. You'll need to use some
Unicode API to process it in whatever way your application requires. I
suggest ICU:

http://site.icu-project.org/

Hope this helps,

Evan

--
http://evanjones.ca/

Hitesh Jethwani

Jan 25, 2011, 11:53:06 PM
to Protocol Buffers
Thanks for pointing that out, Evan.
> The Java protocol buffer API encodes strings as UTF-8 on the wire.
> Since C++ strings have no built-in Unicode support, what you get on
> the other end is the raw UTF-8 encoded data.
I was of the opinion that UTF-8 encodes each character using 8 bits,
i.e., a single byte, so I am not sure why the raw encoded data
represents the character using 2 bytes instead of one. Also, on the
Java end, if on the stream writer I use something like
writer.write(new String(msg.getBytes(), "UTF8").getBytes()) instead of
simply writer.write(msg.getBytes()), I see the characters as expected
on the C++ client. However, I believe this messes up the protobuf
headers, so on the C++ side I receive only a partial file, up to the
entry that contains one such character.

> encoded data. You'll need to use some Unicode API to process it in
> whatever way your application requires. I suggest ICU:
Trying this out now. Will post an update shortly.

Thanks for the prompt response.

Hitesh Jethwani

Jan 25, 2011, 11:57:44 PM
to Protocol Buffers
> I was of the opinion that UTF-8 encodes each character using 8 bits,
> i.e., a single byte.
My understanding of UTF-8 was clearly wrong. I just did some reading
again: it encodes each character as a variable number of bytes and can
use up to 4 bytes to represent a single character.
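A quick Java sketch of those variable widths (the last sample is
U+1D11E, a character outside the basic plane, written as a surrogate
pair):

    import java.nio.charset.StandardCharsets;

    public class Utf8Widths {
        public static void main(String[] args) {
            // UTF-8 uses 1 to 4 bytes per code point, depending on its value.
            String[] samples = {"A", "É", "€", "\uD834\uDD1E"}; // U+0041, U+00C9, U+20AC, U+1D11E
            for (String s : samples) {
                System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
            }
        }
    }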

> Also, on the Java end, if on the stream writer I use something like
> writer.write(new String(msg.getBytes(), "UTF8").getBytes()) instead of
> simply writer.write(msg.getBytes()), I see the characters as expected
> on the C++ client. However, I believe this messes up the protobuf
> headers, so on the C++ side I receive only a partial file, up to the
> entry that contains one such character.

Still not sure on the above though.

Kenton Varda

Jan 26, 2011, 12:20:05 AM
to Hitesh Jethwani, Protocol Buffers
On Tue, Jan 25, 2011 at 8:57 PM, Hitesh Jethwani <hjeth...@gmail.com> wrote:
> Also, on the Java end, if on the stream writer I use something like
> writer.write(new String(msg.getBytes(), "UTF8").getBytes()) instead of
> simply writer.write(msg.getBytes()), I see the characters as expected
> on the C++ client. However, I believe this messes up the protobuf
> headers, so on the C++ side I receive only a partial file, up to the
> entry that contains one such character.
>
> Still not sure on the above though.

The reason this appears to work is that String.getBytes() encodes using the platform's default charset, which in your case is evidently ISO-8859-1. This encoding represents each character as exactly one byte, and can only represent character codes U+0000 through U+00FF. Since you are decoding the bytes as UTF-8 and then encoding them as ISO-8859-1, and since the character 'É' happens to fall within the ISO-8859-1 range, you effectively re-encoded this character as a single byte. On the C++ side, the protobuf library does not verify that the parsed bytes are actually valid UTF-8 (except in debug mode); it just passes them through. So the string you see there includes the 'É' character as one byte.

However, you end up getting a parse error because the length of the string (in bytes) no longer matches the length recorded in the encoded message: that length was originally computed with 'É' represented as two bytes, but the re-encoded string carries it as only one.

In general, decoding arbitrary bytes (like a protobuf) as if they were UTF-8 will lose information, so converting bytes -> UTF-8 -> bytes will corrupt the bytes.
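To see this concretely, here is a small standalone Java sketch of that
lossy round trip (no protobuf involved):

    import java.io.UnsupportedEncodingException;

    public class LossyRoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // 'É' as it appears on the wire in a protobuf string field: two UTF-8 bytes.
            byte[] wire = {(byte) 0xC3, (byte) 0x89};

            // Decode as UTF-8, then re-encode as ISO-8859-1 (the default charset
            // in this scenario): the character collapses back to a single byte, 0xC9.
            byte[] reEncoded = new String(wire, "UTF-8").getBytes("ISO-8859-1");

            System.out.println(wire.length + " bytes -> " + reEncoded.length + " byte");
            // Any protobuf length prefix computed for the two-byte form no longer
            // matches, which is why parsing stops at the first such string.
        }
    }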

Hitesh Jethwani

Jan 26, 2011, 3:43:06 AM
to Protocol Buffers
> The reason this appears to work is that String.getBytes() encodes
> using the platform's default charset, which in your case is evidently
> ISO-8859-1.

Thanks a lot for the above. I just want to summarize my understanding.
C++ needs to explicitly decode the UTF-8 encoded string; only then
will it interpret the characters properly.
I can use the ICU library Evan mentioned above. I also observed that
MultiByteToWideChar(CP_UTF8, ...) helps me with this.
I cannot use wide strings or ICU data structures, though, as I need to
keep the data in char format: char is what our DB libraries use to
communicate with stored procedures.
When I then run WideCharToMultiByte(CP_ACP, ...) on the result, it
converts the wide string to an ANSI (here ISO-8859-1) string which can
be stored in char.
For now I am fairly confident that the Java server will always return
characters that can be represented in ISO-8859-1 (this is a migration
project from a C++ server, with no protobuf involved, to a Java
server, and this issue was never faced earlier).
Can we encode the protobuf data in ISO-8859-1 from the server end
itself?
(I understand that in the long run we need to migrate to DB libraries
that support Unicode and change the client code completely to work
with wide characters.)

Evan Jones

Jan 26, 2011, 1:31:16 PM
to Hitesh Jethwani, Protocol Buffers
On Jan 26, 2011, at 3:43, Hitesh Jethwani wrote:
> Can we encode the protobuf data in ISO-8859-1 from the server end
> itself?

Yes. In this case you need to use the protocol buffer "bytes" type
instead of the protocol buffer "string" type, since what you want to
exchange from program to program is ISO-8859-1 bytes (bytes), not
Unicode text (string).

On the Java side, you'll need to use
ByteString.copyFrom(myStringObject, "ISO-8859-1") to make a ByteString
out of a Java String.
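A sketch of what the Java side could look like. The message and field
names here are hypothetical; only ByteString.copyFrom(String, String)
is the actual protobuf API:

    import com.google.protobuf.ByteString;
    import java.io.UnsupportedEncodingException;

    public class BytesFieldExample {
        // Assumes a hypothetical .proto definition such as:
        //   message Person { optional bytes name = 1; }
        public static Person buildPerson(String name) throws UnsupportedEncodingException {
            return Person.newBuilder()
                    // Store the raw ISO-8859-1 bytes; protobuf passes them through untouched.
                    .setName(ByteString.copyFrom(name, "ISO-8859-1"))
                    .build();
        }
    }

On the C++ side, the generated accessor for a bytes field hands you
the raw bytes as a std::string, so they can go straight into your
char-based DB layer.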
