BinaryToJsonString mishandling strings containing UTF8 multibyte characters


Ron

Nov 24, 2015, 2:42:22 PM
to Protocol Buffers
Hi,

When using BinaryToJsonString or BinaryToJsonStream, I run into a problem whenever a message contains a string with multibyte characters.
After some debugging, the place where things start to go wrong appears to be ReadCodePoint (in json_escaping.cc): the first byte of the multibyte character is read from the string as a char and assigned to a variable of type uint32. Where char is signed, a lead byte (for example 0xD7) holds a negative value, and converting it directly to an unsigned 4-byte type sign-extends it (0xD7 becomes 0xFFFFFFD7) instead of yielding the byte value. The if-else statements that follow inspect that value to determine the length of the multibyte character, so none of them match as intended. From there things go wrong and the string isn't serialized and just gets dropped...

For now, as a temporary workaround, I cast the value returned by StringPiece's operator[] to uint8 before the assignment into uint32, but any advice or a more permanent solution would be appreciated.
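To make the conversion issue concrete, here is a minimal standalone sketch. It is not the code from json_escaping.cc; the function names are made up for illustration, and signed char is used explicitly so the behavior does not depend on the compiler's default for char.

```cpp
#include <cstdint>

// Hypothetical helper: widens a signed byte directly to 32 bits.
// A negative input wraps modulo 2^32, so the upper 24 bits end up set.
std::uint32_t WidenDirectly(signed char c) {
  return static_cast<std::uint32_t>(c);
}

// Hypothetical helper: goes through an 8-bit unsigned type first,
// so only the byte value (0..255) survives the widening.
std::uint32_t WidenViaUint8(signed char c) {
  return static_cast<std::uint32_t>(static_cast<std::uint8_t>(c));
}
```

With the byte 0xD7 stored in a signed char (value -41), WidenDirectly yields 0xFFFFFFD7 while WidenViaUint8 yields 0xD7, which is why the intermediate cast to uint8 works around the problem.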

Thanks,
Ron

Feng Xiao

Nov 24, 2015, 3:51:55 PM
to Ron, Protocol Buffers
Could you provide a sample input that will fail for this reason?
 


Ron

Nov 25, 2015, 3:46:58 AM
to Protocol Buffers, ronnnnn...@gmail.com
Sure.

For example, I defined the following message in the proto file:
message Person
{
  string first_name = 1;
  string last_name = 2;
}


When I set the first_name field to "Ron" both binary serialization and JSON serialization work fine.


But when I set it to "רון" (as UTF-8), the serialization to binary is correct (shown here as base64):
CgbXqNeV158=

... while BinaryToJsonString mishandles the value and ultimately replaces it with an empty string in the JSON representation:
{ "firstName": "" }


This example will probably serialize correctly with compilers that treat char as unsigned by default, but with compilers that treat char as signed (such as Microsoft's), I think you should get the same incorrect result I pasted above.
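For reference, the base64 above decodes to the bytes 0A 06 D7 A8 D7 95 D7 9F: the tag for field 1, the length 6, and three 2-byte UTF-8 sequences, each starting with the lead byte 0xD7. The following is a rough sketch of how a lead-byte check can fail on a sign-extended value. It is an illustrative assumption, not the actual logic of ReadCodePoint, and Utf8SequenceLength is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper: returns the UTF-8 sequence length implied by a lead
// byte that has already been widened to 32 bits, or 0 if the value is not
// a valid lead byte.
int Utf8SequenceLength(std::uint32_t lead) {
  if (lead < 0x80) return 1;                   // 0xxxxxxx: single-byte (ASCII)
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx: two-byte sequence
  if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx: three-byte sequence
  if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx: four-byte sequence
  return 0;                                    // continuation byte or out of range
}
```

Given 0xD7 this returns 2, but given the sign-extended 0xFFFFFFD7 it falls through every range and returns 0, which is consistent with the string being rejected and dropped.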

Feng Xiao

Nov 25, 2015, 1:56:51 PM
to Ron, Protocol Buffers
Thanks for the explanation. Could you help file a bug for this on the protobuf GitHub site? If you know of a solution, you are also welcome to send us a pull request.

Ron

Nov 26, 2015, 3:51:07 AM
to Protocol Buffers, ronnnnn...@gmail.com

On Wednesday, November 25, 2015 at 8:56:51 PM UTC+2, Feng Xiao wrote:
Thanks for the explanation. Could you help file a bug for this on the protobuf GitHub site? If you know of a solution, you are also welcome to send us a pull request.

Zachary Deretsky

Apr 5, 2016, 2:23:17 PM
to Protocol Buffers, ronnnnn...@gmail.com
Ron,
could you post an example and some explanation of how to (de)serialize proto3 to JSON using
LIBPROTOBUF_EXPORT util::Status BinaryToJsonString(
    TypeResolver* resolver,
    const string& type_url,
    const string& binary_input,
    string* json_output,
    const JsonOptions& options);

How do I create a TypeResolver, and what is type_url?

I am asking because you seem to be the only one with expertise on the subject.
Thank you, Zach. 

Ron Ben-Yosef

Apr 10, 2016, 3:57:47 PM
to Protocol Buffers, ronnnnn...@gmail.com
type_url should identify the type of the protobuf message you're transcoding. By default, the URL of a specific message type looks like type.googleapis.com/<package_name>.<message_name>. I'd imagine the prefix can be changed to something other than type.googleapis.com, but I can't say for sure since I haven't tried it.
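As a small illustration of that format, the sketch below assembles a type URL from a prefix and a fully qualified message name. MakeTypeUrl is a hypothetical helper rather than part of the protobuf API, and tutorial.Person is an assumed example name.

```cpp
#include <string>

// Hypothetical helper: joins a URL prefix and a fully qualified message name
// (package and message, e.g. "tutorial.Person") into a type URL.
std::string MakeTypeUrl(const std::string& prefix, const std::string& full_name) {
  return prefix + "/" + full_name;
}
```

For example, MakeTypeUrl("type.googleapis.com", "tutorial.Person") gives "type.googleapis.com/tutorial.Person".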

A TypeResolver instance can be created with the function NewTypeResolverForDescriptorPool, declared in type_resolver_util.h.

NewTypeResolverForDescriptorPool takes a URL prefix and a pointer to a DescriptorPool. If the generated code for the relevant message type has been compiled into your binary, its descriptor should be in the generated descriptor pool, so you should just use that. Otherwise, you can build the descriptor and the pool from a FileDescriptorProto.


Usage might look something like this:
...


#include <iostream>
#include <string>

#include <google/protobuf/message.h>
#include <google/protobuf/descriptor.h>
#include <google/protobuf/util/json_util.h>
#include <google/protobuf/util/type_resolver_util.h>

using namespace google::protobuf;
using namespace google::protobuf::util;

...

void foo(const Message& msg)
{
  ...

  std::string json_output;

  // generated_pool() already returns a pointer to the pool of compiled-in descriptors.
  TypeResolver* resolver = NewTypeResolverForDescriptorPool("type.googleapis.com", DescriptorPool::generated_pool());

  // The type URL is the resolver's prefix followed by the full message name.
  Status status = BinaryToJsonString(resolver, "type.googleapis.com/" + msg.GetTypeName(), msg.SerializeAsString(), &json_output);

  std::cout << json_output;

  // Release the resolver when done.
  delete resolver;

  ...
}


...


I hope this helps.


Ron

Zachary Deretsky

Apr 12, 2016, 8:20:12 PM
to Protocol Buffers
Thanks Ron, now I can generate JSON, but it is invalid for maps. I just posted an example with my code.
Regards, Zach.