Protocol buffer unicode python vs. c++

saad...@gmail.com

Mar 20, 2009, 2:16:50 PM
to Protocol Buffers
I am not a very experienced programmer, but I will try to explain what's
happening:

I have a database of book titles in protocol buffer format. The message
Title has fields like:
optional string title = 1;
optional string description = 2;
optional string isbn = 3;
...
...

When I convert my MySQL data to pb, I use Python and store it using
title_str_pb = title.SerializeToString()

When I read back the titles in python, everything works fine. Like:
t = Title()
t.ParseFromString(title_str_pb)
title = t.title
description = t.description

But now I want to use this protocol buffer data in C++, like:
Title t;
t.ParseFromString(title_str_pb);

and I get this error:
Encountered string containing invalid UTF-8 data while parsing
protocol buffer. Strings must contain only UTF-8; use the 'bytes' type
for raw bytes.

I changed the string type to the bytes type, but I still get the same
error.

I have a million book records stored in pb format. I don't want to
lose my data. Can somebody help, please? As an alternative, I will
restore my data using Python. But I want to use it in C++.
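
Editor's note: the error quoted above says a string field holds bytes that are not valid UTF-8. As a minimal sketch of what that check amounts to, the Python 3 snippet below tests whether a byte string decodes as UTF-8. The helper name `is_valid_utf8` and the sample title are hypothetical, and Latin-1 is used only as an example of a non-UTF-8 encoding; the thread does not say how the original data was encoded.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical title containing a non-ASCII character (e with acute accent).
title = "Caf\u00e9 Cookbook"

latin1_bytes = title.encode("latin-1")  # b'Caf\xe9 Cookbook'
utf8_bytes = title.encode("utf-8")      # b'Caf\xc3\xa9 Cookbook'

print(is_valid_utf8(latin1_bytes))  # False: a lone 0xE9 is not valid UTF-8
print(is_valid_utf8(utf8_bytes))    # True
```

Pure-ASCII titles pass this check in either encoding, so only records with non-ASCII characters would be expected to trigger the error.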

Kenton Varda

Mar 20, 2009, 2:38:04 PM
to saad...@gmail.com, Protocol Buffers
If you changed all the "string" types to "bytes" instead, then you should not see that error.  Are you sure you did that?  If so, can you write a small demo program which produces this error, even when the protobuf type contains no "string" fields, and send it to me?

saad...@gmail.com

Mar 20, 2009, 3:09:12 PM
to Protocol Buffers
Oh, sorry for the confusion. When I change it to bytes, it does not give
the error, but the value is gibberish, containing some special
characters and the values of 2-3 fields together.

I don't know what the problem is, but I found a solution. Here is
what I am doing:
I searched online and read some Python docs, then wrote another
Python script that processes each protobuf record like this:

t = Title()
t.ParseFromString(title_str_pb)

t.title = t.title.encode('utf-8')
t.description = t.description.encode('utf-8')
t.isbn = t.isbn.encode('utf-8')
...
...

and then writing it back to my database:
title_str_pb = t.SerializeToString()

and now when I open it in C++, it does not give any error.

So I think that when I was adding the original data, I should have
called .encode('utf-8') on all the Python strings.

Is there anything I am missing, or an easier way to do it?
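
Editor's note: the repair described above re-encodes every field as UTF-8. The Python 3 sketch below shows one way that step could look for raw bytes of unknown origin: bytes that are already valid UTF-8 are kept, and anything else is decoded with a fallback encoding first. The helper `to_utf8` is hypothetical, and the Latin-1 fallback is an assumption made for illustration, since the thread never states which encoding the MySQL data used.

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw re-encoded as UTF-8.

    Bytes that are already valid UTF-8 are returned unchanged; anything
    else is decoded with the (assumed) fallback encoding first.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


print(to_utf8(b"Caf\xe9"))      # b'Caf\xc3\xa9'
print(to_utf8(b"Caf\xc3\xa9"))  # b'Caf\xc3\xa9' (already UTF-8, unchanged)
```

Checking for valid UTF-8 first avoids double-encoding records that were stored correctly, which matters when only part of a large data set is affected.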


On Mar 20, 11:38 pm, Kenton Varda <ken...@google.com> wrote:
> If you changed all the "string" types to "bytes" instead, then you should
> not see that error. Are you sure you did that? If so, can you write a
> small demo program which produces this error, even when the protobuf type
> contains no "string" fields, and send it to me?
>

Kenton Varda

Mar 20, 2009, 3:18:46 PM
to saad...@gmail.com, Petar Petrov, Protocol Buffers
Hmm, that sounds very odd.  I think protocol buffers should be taking care of this automatically.  Can you give us an example of the "gibberish" and what you expected it to look like?

saad...@gmail.com

Mar 20, 2009, 3:39:59 PM
to Protocol Buffers
With the string type, it gives the following error:
libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
string containing invalid UTF-8 data while parsing protocol buffer.
Strings must contain only UTF-8; use the 'bytes' type for raw bytes.

When I change it to bytes, it shows:
cout << t.title() << endl;
� x� � 5image_url� English edit home movies, and share MP3s in one

instead of showing:
cout << t.title() << endl;
How To Use A Chafing Dish (1912)



On Mar 21, 12:18 am, Kenton Varda <ken...@google.com> wrote:
> Hmm, that sound very odd. I think protocol buffers should be taking care of
> this automatically. Can you give us an example of the "gibberish" and what
> you expected it to look like?
>

saad...@gmail.com

Mar 20, 2009, 3:45:47 PM
to Protocol Buffers
This is the pb_string before calling encode('utf-8')
\nJhow-to-use-a-chafing-dish\x12 How To Use A Chafing Dish (1912)\x1a
\x18Sarah Tyson Heston Rorer"\n143687825X*
\r97814368782582\tPaperback8\xd8\x0f@\x01R\x14Kessinger PublishingX|h
\xd0\x08p\x89\x07x\xc7\x01\x82\x015img-436878258.jpg
\x92\x01\x07English

This is the pb_string after calling encode('utf-8')
\nJhow-to-use-a-chafing-dish\x12 How To Use A Chafing Dish (1912)\x1a
\x18Sarah Tyson Heston Rorer\x1a\x18Sarah Tyson Heston
Rorer"\n143687825X*\r97814368782582\tPaperback8\xd8\x0f@\x01R
\x14Kessinger PublishingX|b\x00h\xd0\x08p\x89\x07x
\xc7\x01\x82\x015img-436878258.jpg\x92\x01\x07English
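
Editor's note: dumps like the two above are easier to compare field by field. The Python 3 sketch below walks the top-level fields of a serialized message using only the standard library. It is a simplified illustration of the protobuf wire format, not the library's own parser: the helpers `read_varint` and `walk_fields` are hypothetical names, and only the varint and length-delimited wire types are handled. The sample bytes are fragments copied from the dump (fields 2, 3, and 7).

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def walk_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for each top-level field.

    Handles only wire type 0 (varint) and wire type 2 (length-delimited).
    """
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# Fragments taken from the dump above.
sample = (b"\x12 How To Use A Chafing Dish (1912)"
          b"\x1a\x18Sarah Tyson Heston Rorer"
          b"8\xd8\x0f")

for number, wire_type, value in walk_fields(sample):
    print(number, wire_type, value)
# 2 2 b'How To Use A Chafing Dish (1912)'
# 3 2 b'Sarah Tyson Heston Rorer'
# 7 0 2008
```

Read this way, the second dump differs from the first in that the field 3 value appears twice and an empty length-delimited field 12 (`b\x00`) has been added.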