string vs. bytes

edan

May 10, 2009, 9:08:52 AM

to prot...@googlegroups.com

I have some fields that may contain non-UTF8 data.
I understand that I just need to change their type from "string" to "bytes" and it should just work, transparently.
I have a few fields that probably will only contain ASCII i.e. legal UTF8, but I'm not 100% sure.
I am tempted to just turn them all to "bytes".
But this raises the question - what is the "string" type useful for, and why shouldn't I just always use "bytes" to be sure, all the time, and not bother with "string" at all?
Does "string" add anything besides validation that only valid UTF8 is passing over the wire? Is there really a big benefit to this behavior? Or is there some other advantage that I'll miss out on by changing all my "string"s to "bytes"?

Thanks
--edan

Henner Zeller

May 10, 2009, 12:59:28 PM

to edan, prot...@googlegroups.com

On Sun, May 10, 2009 at 6:08 AM, edan <eda...@gmail.com> wrote:
> I have some fields that may contain non-UTF8 data.
> I understand that I just need to change their type from "string" to "bytes"
> and it should just work, transparently.

Yes. They're the same on the wire.
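
To make "the same on the wire" concrete, here is a minimal Python sketch of how a length-delimited field is laid out: a varint key built from the field number and wire type 2, a varint length, then the raw payload bytes. This is hand-rolled for illustration and is not the protobuf library; the function names are made up for this example.

```python
LENGTH_DELIMITED = 2  # wire type shared by string, bytes, and embedded messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_length_delimited(field_number: int, payload: bytes) -> bytes:
    """Key (field number + wire type), then length, then the payload as-is."""
    key = encode_varint((field_number << 3) | LENGTH_DELIMITED)
    return key + encode_varint(len(payload)) + payload


def encode_string_field(field_number: int, text: str) -> bytes:
    # A 'string' field carries the UTF-8 encoding of the text.
    return encode_length_delimited(field_number, text.encode("utf-8"))


def encode_bytes_field(field_number: int, blob: bytes) -> bytes:
    # A 'bytes' field carries the blob unchanged.
    return encode_length_delimited(field_number, blob)
```

Both helpers end in the same call, so for a given payload the encoded field is byte-for-byte identical. The only difference between the two types is what the payload is allowed to contain.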

> I have a few fields that probably will only contain ASCII i.e. legal UTF8,
> but I'm not 100% sure.
> I am tempted to just turn them all to "bytes".
> But this begs the question - what is the "string" type useful for, and why
> shouldn't I just always use "bytes" to be sure, all the time, and not bother
> with "string" at all?
> Does "string" add anything besides validation that only valid UTF8 is
> passing over the wire? Is there really a big benefit to this behavior? Or
> is there some other advantage that I'll miss out on by changing all my
> "string"s to "bytes"?

If you use the C++ API there is not much difference, since both types
are represented as std::string in the API. It makes a big difference
for the Java API (and Python?), which have a native type for Unicode
text. In Java, if you deal with a protocol buffer 'string' field, the
generated API will return a java.lang.String, while for 'bytes' it will
return a ByteString. ByteString can hold arbitrary bytes, while the
native Java String is meant for Unicode text, which is encoded as UTF-8
on the wire. So while 'ByteString' is more flexible, 'String' is more
convenient to deal with within Java code because all string
manipulation libraries can handle it.

So the benefit is a more convenient API in the generated Java code.
It also serves as documentation: if you use 'string' you emphasize that
a field only contains readable text, while 'bytes' might contain any
binary blob.
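
The text-versus-blob distinction can be seen in any language with separate text and byte types. The following is a small sketch in plain Python, with no protobuf involved; the variable names and the `describe` helper are made up for illustration.

```python
text = "héllo"               # text: a sequence of Unicode characters
raw = text.encode("utf-8")   # the UTF-8 bytes a 'string' field would carry
blob = bytes([0xFF, 0x00, 0xFE])  # arbitrary binary data, not valid UTF-8


def describe(value) -> str:
    """Report whether a value is text or raw bytes, and its length."""
    kind = "text" if isinstance(value, str) else "bytes"
    return f"{kind} of length {len(value)}"


# Text is measured in characters, bytes in octets: 'é' takes two bytes in UTF-8.
text_summary = describe(text)
raw_summary = describe(raw)

# Text-oriented operations apply directly to the text type.
shouted = text.upper()

# The blob has no meaning as text; a strict decode refuses it.
try:
    blob.decode("utf-8")
    blob_is_text = True
except UnicodeDecodeError:
    blob_is_text = False
```

Here `text` has five characters while `raw` has six bytes, and `blob` cannot be decoded at all, which is the kind of value that belongs in a 'bytes' field.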

-h

dan.schm...@gmail.com

May 12, 2009, 9:47:29 AM

to Protocol Buffers

I am having a very similar problem. Just installed the 2.0.3 version
and now all my serialisations complain.

libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
string containing invalid UTF-8 data while parsing protocol buffer.
Strings must contain only UTF-8; use the 'bytes' type for raw bytes.

Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
string is an array of those. So, what does it need? Would I be better
off serialising to a stream like the CodedStream?

I am very confused on the issue. I have the horrible feeling now that
I'm losing efficiency because serialising to string might mean that
I'm losing my raw data.

Otherwise, the word ERROR in the output might be a bit too
strong.

If anybody can clarify, I'd be very grateful.

Dan

Henner Zeller

May 12, 2009, 11:52:50 AM

to dan.schm...@gmail.com, Protocol Buffers

Hi,

On Tue, May 12, 2009 at 6:47 AM, <dan.schm...@gmail.com> wrote:
>
> I am having a very similar problem. Just installed the 2.0.3 version
> and now all my serialisations complain.
>
> libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
> string containing invalid UTF-8 data while parsing protocol buffer.
> Strings must contain only UTF-8; use the 'bytes' type for raw bytes.
>
> Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
> string is an array of those.

The Protocol Buffers 'bytes' type translates into std::string in C++. And an
array of chars is an array of bytes, so you're all fine.

Kenton Varda

May 12, 2009, 12:26:57 PM

to dan.schm...@gmail.com, Protocol Buffers

Protocol Buffers has a "bytes" type. That's what it's talking about. Just change "string" to "bytes" in your .proto file. (They work exactly the same in C++ but are different in Java and Python.)
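
If it is unclear which fields need to move from "string" to "bytes", one option is to test sample values for UTF-8 validity before editing the .proto file. This is a rough sketch in plain Python: the helper names and the sample dictionary layout are hypothetical, and Python's strict decoder may not match the library's own check in every edge case.

```python
from typing import Dict, List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict rules."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def fields_needing_bytes(samples: Dict[str, List[bytes]]) -> List[str]:
    """Names of fields with at least one sample value that is not valid UTF-8.

    `samples` maps a field name to example raw values seen for that field.
    """
    return sorted(
        name
        for name, values in samples.items()
        if not all(is_valid_utf8(value) for value in values)
    )
```

For example, `fields_needing_bytes({"name": [b"ok"], "payload": [b"\xff\xfe"]})` returns `["payload"]`. Pure ASCII always passes, since ASCII is a subset of UTF-8, but Latin-1 text such as `b"caf\xe9"` does not.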

dan.schm...@gmail.com

May 12, 2009, 1:43:03 PM

to Protocol Buffers

Thanks very much for the answers guys. Most illustrative. The error
messages did in fact disappear with that simple change in all my proto
files.

Still, now that this error has shown in the code I have, I keep
wondering whether the fact that I'm serialising to string is
inefficient. What would be the case for using serialisation to a
stream then?

Thanks again for the help.

Dan

On May 12, 5:26 pm, Kenton Varda <ken...@google.com> wrote:
> Protocol Buffers has a "bytes" type. That's what it's talking about. Just
> change "string" to "bytes" in your .proto file. (They work exactly the same
> in C++ but are different in Java and Python.)
>

Kenton Varda

May 12, 2009, 3:19:36 PM

to dan.schm...@gmail.com, Protocol Buffers

The serialized message is just an array of bytes. We use std::string as an efficient container for these bytes, but it is still just storing bytes. std::string, unlike Java's String, only contains bytes, not unicode characters. So, there is no performance penalty. In fact, serializing to a string is typically much faster than serializing to an abstract stream, especially with v2.1.0, since the code does not need to perform bounds checks (since it pre-allocates a string that is guaranteed to be large enough). The only case where you would not want to serialize to a string is if your message is very big, since some memory allocators do not behave well when allocating large contiguous blocks of memory. In this case, using streams allows the message to be written one piece at a time.
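
The trade-off between one contiguous buffer and a stream can be sketched as follows. This is plain Python standing in for the idea only: `pieces` is a hypothetical list of already-encoded chunks, the function names are made up, and nothing here reflects the library's actual implementation or its C++ performance.

```python
import io


def serialize_to_buffer(pieces):
    """Copy all pieces into one pre-sized contiguous buffer."""
    total = sum(len(piece) for piece in pieces)  # size is known before writing
    buf = bytearray(total)                       # one allocation, large enough
    offset = 0
    for piece in pieces:
        buf[offset:offset + len(piece)] = piece  # buffer never has to grow
        offset += len(piece)
    return bytes(buf)


def serialize_to_stream(pieces, stream):
    """Write pieces one at a time; no single large allocation is needed."""
    written = 0
    for piece in pieces:
        written += stream.write(piece)
    return written


pieces = [b"ab", b"", b"cde"]
contiguous = serialize_to_buffer(pieces)

out = io.BytesIO()
streamed_count = serialize_to_stream(pieces, out)
```

Both paths produce the same bytes. The first needs the whole message in memory as one block, while the second can hand each piece to a file or socket as soon as it is ready.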

dan.schm...@gmail.com

May 22, 2009, 6:38:41 AM

to Protocol Buffers

Thanks a lot for that response, and sorry for taking this long to
reply.

We've got small messages just now, so we're going to stick with the
serialisation to strings.

Dan

On May 12, 8:19 pm, Kenton Varda <ken...@google.com> wrote:
> The serialized message is just an array of bytes. We use std::string as an
> efficient container for these bytes, but it is still just storing bytes.
> std::string, unlike Java's String, only contains bytes, not unicode
> characters. So, there is no performance penalty. In fact, serializing to a
> string is typically much faster than serializing to an abstract stream,
> especially with v2.1.0, since the code does not need to perform bounds
> checks (since it pre-allocates a string that is guaranteed to be large
> enough). The only case where you would not want to serialize to a string is
> if your message is very big, since some memory allocators do not behave well
> when allocating large contiguous blocks of memory. In this case, using
> streams allows the message to be written one piece at a time.
>
