Issue 196 in protobuf: Python: Ascii output is not assured to be in utf-8

prot...@googlecode.com

Jun 5, 2010, 9:35:23 PM

to prot...@googlegroups.com

Status: New
Owner: ken...@google.com
Labels: Type-Defect Priority-Medium

New issue 196 by ken.fukushima: Python: Ascii output is not assured to be
in utf-8
http://code.google.com/p/protobuf/issues/detail?id=196

What steps will reproduce the problem?
1. In Python, set a string field to a unicode value that includes non-ASCII
characters
2. Dump the value using text_format.PrintMessage
3. Parse it to a new protocol buffer using text_format.Merge
4. _Tokenizer.ConsumeString fails with UnicodeDecodeError.

What is the expected output? What do you see instead?
The library should be able to parse a message that it printed itself.

What version of the product are you using? On what operating system?
2.3.0

Please provide any additional information below.
The problem is that text_format.PrintMessage outputs a unicode value as is,
without encoding it in UTF-8, whereas text_format.Merge assumes its input is
encoded in UTF-8.
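
The mismatch described above can be illustrated without protobuf at all. The following is only a minimal Python 3 sketch, not the library's code: the sample value is hypothetical, and Latin-1 is used purely as a stand-in for "output that is not UTF-8". The original report concerned Python 2 and protobuf 2.3.0, where the details differ.

```python
# Hypothetical field value containing one non-ASCII character.
text = "caf\u00e9"

# A writer that encodes explicitly as UTF-8 satisfies a reader that
# assumes UTF-8 input, so the value round-trips.
utf8_bytes = text.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")

# A writer that emits anything else (Latin-1 here, as a stand-in) breaks
# the reader's assumption.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The point is only that the printer and the parser have to agree on one encoding; which side should change is what the rest of this thread discusses.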

prot...@googlecode.com

Jul 2, 2010, 8:27:09 PM

to prot...@googlegroups.com

Updates:
Status: Accepted

Comment #1 on issue 196 by ken...@google.com: Python: Ascii output is not
assured to be in utf-8
http://code.google.com/p/protobuf/issues/detail?id=196

(No comment was entered for this change.)

prot...@googlecode.com

Dec 6, 2010, 9:57:52 AM

to prot...@googlegroups.com

Updates:
Status: Fixed
Labels: FixedIn-2.4.0

Comment #2 on issue 196 by liuj...@google.com: Python: Ascii output is not
assured to be in utf-8
http://code.google.com/p/protobuf/issues/detail?id=196

text_format.PrintMessage now has an "as_utf8" parameter, which I believe
fixes this.

prot...@googlecode.com

Dec 7, 2010, 10:25:25 PM

to prot...@googlegroups.com

Updates:
Status: Accepted
Labels: -FixedIn-2.4.0

Comment #3 on issue 196 by ken...@google.com: Python: Ascii output is not
assured to be in utf-8
http://code.google.com/p/protobuf/issues/detail?id=196

Jisi, I'm not convinced that this is fixed. The as_utf8 parameter simply
prevents the printer from escaping character codes >= 128. The bug report
seems like it may actually be a problem in the parser. Also, round trips
should work correctly even if as_utf8 is not used. We should investigate
further, and make sure we have test cases that print and then parse a
message containing Unicode characters, both with and without as_utf8 enabled.
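
To make the two printing modes concrete, here is a rough, self-contained sketch of the round trip being asked for. It is a simplified model and not the text_format implementation: print_string and parse_string are hypothetical helpers, the octal escape form is an assumption made for illustration, and only the as_utf8 flag name comes from this thread.

```python
def print_string(value, as_utf8):
    """Render a string value as a quoted literal (simplified model)."""
    parts = []
    for ch in value:
        if ch in ('"', "\\"):
            # Quote and backslash are always backslash-escaped.
            parts.append("\\" + ch)
        elif " " <= ch <= "~":
            # Printable ASCII is emitted unchanged.
            parts.append(ch)
        elif ord(ch) >= 128 and as_utf8:
            # With as_utf8, code points >= 128 are left unescaped.
            parts.append(ch)
        else:
            # Otherwise each UTF-8 byte becomes a three-digit octal escape.
            parts.extend("\\%03o" % b for b in ch.encode("utf-8"))
    return '"' + "".join(parts) + '"'


def parse_string(literal):
    """Inverse of print_string: rebuild the bytes, then decode as UTF-8."""
    body = literal[1:-1]
    raw = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1]
            if nxt in "01234567":
                # Three-digit octal escape for a single byte.
                raw.append(int(body[i + 1:i + 4], 8))
                i += 4
            else:
                # Escaped quote or backslash.
                raw.extend(nxt.encode("utf-8"))
                i += 2
        else:
            raw.extend(ch.encode("utf-8"))
            i += 1
    return raw.decode("utf-8")


# Hypothetical sample value with non-ASCII characters, quotes and a newline.
sample = 'h\u00e9llo "w\u00f6rld"\n'
escaped = print_string(sample, as_utf8=False)
unescaped = print_string(sample, as_utf8=True)
```

In this model the round trip holds in both modes because the parser always reconstructs bytes and decodes them as UTF-8, whether or not the printer escaped them. That is the property the requested test cases would need to check against the real printer and parser.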

prot...@googlecode.com

Dec 8, 2010, 6:09:54 AM

to prot...@googlegroups.com

Comment #4 on issue 196 by liuj...@google.com: Python: Ascii output is not
assured to be in utf-8
http://code.google.com/p/protobuf/issues/detail?id=196

I see. This was actually fixed by another CL in our internal branch
(s/14267948) that always encodes unicode string fields to UTF-8.
testRoundTripExoticAsOneLine() already covers the as_utf8=False round trip
case, so I'll add another test case for as_utf8=True.

prot...@googlecode.com

Dec 14, 2010, 10:45:43 PM

to prot...@googlegroups.com

Updates:
Status: Fixed
Labels: FixedIn-2.4.0

Comment #5 on issue 196 by ken...@google.com: Python: Ascii output is not
assured to be in utf-8
http://code.google.com/p/protobuf/issues/detail?id=196

Cool, it sounds like it is indeed fixed.
