My co-worker, Theron Stanford, discovered that protobuf-perlxs doesn't correctly
preserve the UTF-8 flag on deserialized string fields. The protobuf spec
(http://code.google.com/apis/protocolbuffers/docs/proto.html#scalar) says that
"A string must always contain UTF-8 encoded or 7-bit ASCII text", so we believe
that it's appropriate to upgrade non-UTF-8 Perl strings to UTF-8 before
serializing, and to set the UTF-8 flag after deserializing.

We wrote a test case (utf8test.tgz) that tests all the types of string fields
(required, optional, repeated) and all the different accessors (set_*, add_*,
copy_from, to_hashref) on UTF-8, Latin-1, and plain ASCII strings.

Currently, 8 of the tests (all the ones with UTF-8 input) fail. We've created a
patch (protobuf-perlxs-1.1-utf8_fix.patch) that does the following:
- for repeated and non-repeated string fields, and for copying to a string
  field from a hashref, if the input is a non-UTF-8 string, copy it, call
  sv_utf8_upgrade on the copy, and serialize from the upgraded string;
- for all string fields, when deserializing, call SvUTF8_on to set the UTF-8 flag.
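
For anyone unfamiliar with what the upgrade step does to the bytes, here is a rough standalone C sketch. It is not taken from the patch and does not use the Perl API: latin1_to_utf8 is a hypothetical helper that only mimics, for Latin-1 input on an ASCII platform, the re-encoding that sv_utf8_upgrade performs on a non-UTF-8 string. The caller is assumed to supply an output buffer of at least 2*len + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Perl or protobuf-perlxs): re-encode a
 * Latin-1 byte string as UTF-8. Bytes below 0x80 are copied unchanged;
 * bytes 0x80..0xFF become a two-byte UTF-8 sequence.
 * `out` must have room for 2*len + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                                   /* plain ASCII */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    out[n] = '\0';
    return n;
}
```

Pure ASCII input comes out byte-for-byte identical, which is why only strings with bytes above 0x7F need the copy-and-upgrade treatment before serializing.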
Any comments, suggestions, complaints? Anyone else using protobuf-perlxs?
--
Jeremy Leader
jle...@oversee.net

hi jeremy and theron,
thanks for the patch, i will get it into the next release. i have a few other issues to fix, and have been meaning to add support for extensions. i hope to get this out this month.
-dave