Suppressing invalid UTF-8 data warnings?


Florian Suri-Payer

Sep 5, 2024, 5:03:03 PM
to Protocol Buffers
Hi,

I've been using protobuf 3.5.1 in C++ with a message type that has the following map field: `map<string, MyObject> txns = 1`

It is my understanding that `string` and `bytes` are represented the same way in C++ protobuf; for map keys, however, only `string` is allowed. I'm using the key field to send around transaction digests, which are byte strings consisting of cryptographic hashes. As far as I can tell, it makes no difference whether I use `string` or `bytes` (the decoding works), yet I keep getting the error:
 
 `String field 'pequinstore.proto.MergedSnapshot.MergedTxnsEntry.key' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.`

I understand the error is complaining about my digests possibly not being UTF-8, but I'm unsure if I actually need to be concerned about it; I have not noticed any problems with parsing. Is there a way to suppress this error?

Or, if this is a serious error that could lead to non-deterministic behavior, do you have a suggested workaround? There is a lot of existing code that uses the map structure akin to an STL map, so I'd like to avoid re-factoring the protobuf into a repeated field if possible. 

Thanks,
Florian

Em Rauch

Sep 5, 2024, 5:19:00 PM
to Protocol Buffers
Using non-UTF-8 data in a string field should be understood as incorrect, but realistically it will work today as long as your messages are only ever used by C++ Protobuf on the current release of protobuf, and only ever with the binary wire format (not textproto, JSON encoding, etc.).

Today the malformed UTF-8 enforcement exists to different degrees in the different languages (and even depends on the syntax of the .proto file), but it's not semantically intended that a `string` field should be used for non-UTF-8 data in any language. It should be assumed that a serialized message with a map<string, ?> where the keys are non-UTF-8 may start to fail parsing in some future release of Protobuf.
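
To see why raw digests trip this check, here is a minimal stand-alone UTF-8 validity test. It is only a sketch: `IsValidUtf8` is a hypothetical helper written for illustration, not protobuf's actual implementation, and protobuf's own check may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of protobuf): returns true if `s` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated sequences,
// overlong encodings, surrogates, and code points above U+10FFFF.
bool IsValidUtf8(const std::string& s) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const std::uint32_t kMinCodePoint[] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t i = 0;
  const std::size_t n = s.size();
  while (i < n) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) {  // Plain ASCII byte.
      ++i;
      continue;
    }
    std::size_t len;
    std::uint32_t cp;
    if ((c & 0xE0) == 0xC0) {
      len = 2;
      cp = c & 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3;
      cp = c & 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4;
      cp = c & 0x07;
    } else {
      return false;  // Stray continuation byte or invalid lead byte.
    }
    if (i + len > n) return false;  // Sequence is cut off.
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cc = static_cast<unsigned char>(s[i + k]);
      if ((cc & 0xC0) != 0x80) return false;  // Not a continuation byte.
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < kMinCodePoint[len]) return false;         // Overlong encoding.
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // UTF-16 surrogate.
    if (cp > 0x10FFFF) return false;                   // Beyond Unicode.
    i += len;
  }
  return true;
}
```

A cryptographic digest is effectively random bytes, so it will almost always contain a byte sequence that a check like this rejects.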

Unfortunately bytes as a map key isn't allowed due to obscure technical concerns related to some non-C++ languages and the JSON representation, and we don't have an immediate plan to relax that.

Realistically your options are:
- Keep doing what you're doing: only ever keep these messages in C++ and binary wire encoding, ignore the warnings, and know that it might stop working in a future release of protobuf.
- Make your key data valid UTF-8 strings instead (e.g., use a base64 encoding of the digest instead of the raw digest bytes).
- Use a repeated field of a message with a key field and a value field instead of a map, and use your own struct as the in-memory representation when processing (move the data into/out of an STL map at the parse/serialization boundaries instead).
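
For the base64 option, a minimal encoder sketch follows. `Base64Encode` is a hypothetical helper written for illustration (protobuf's public API is not assumed to provide one); it emits standard RFC 4648 base64 with `=` padding, which is pure ASCII and therefore always valid UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes arbitrary bytes as RFC 4648 base64 so the
// result can be used safely as a `string` map key.
std::string Base64Encode(const std::string& bytes) {
  static const char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve(((bytes.size() + 2) / 3) * 4);
  std::size_t i = 0;
  // Full 3-byte groups become 4 output characters each.
  while (i + 3 <= bytes.size()) {
    const std::uint32_t n =
        (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16) |
        (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8) |
        static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 2]));
    out.push_back(kAlphabet[(n >> 18) & 63]);
    out.push_back(kAlphabet[(n >> 12) & 63]);
    out.push_back(kAlphabet[(n >> 6) & 63]);
    out.push_back(kAlphabet[n & 63]);
    i += 3;
  }
  // A trailing group of 1 or 2 bytes is padded with '='.
  const std::size_t rem = bytes.size() - i;
  if (rem == 1) {
    const std::uint32_t n =
        static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16;
    out.push_back(kAlphabet[(n >> 18) & 63]);
    out.push_back(kAlphabet[(n >> 12) & 63]);
    out += "==";
  } else if (rem == 2) {
    const std::uint32_t n =
        (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16) |
        (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8);
    out.push_back(kAlphabet[(n >> 18) & 63]);
    out.push_back(kAlphabet[(n >> 12) & 63]);
    out.push_back(kAlphabet[(n >> 6) & 63]);
    out.push_back('=');
  }
  return out;
}
```

The trade-off is that keys grow by about a third, and any code that needs the raw digest back has to apply a matching decoder.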

Sorry there's not a more trivial fix available for this use case!

Florian Suri-Payer

Sep 6, 2024, 11:43:28 AM
to Protocol Buffers
Thank you for the detailed answer Em, I really appreciate it!

Good to know the warning can probably be ignored for now. I've opted to do the repeated option for now to avoid my logs being drowned in the warnings... I take it there is no way to suppress warnings?
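
For anyone following along, the repeated-field approach can be sketched roughly as below. The plain structs are hypothetical stand-ins for generated protobuf classes (real generated code exposes accessors rather than public members), and the names `TxnEntry`, `ToMap`, and `FromMap` are made up for illustration. The point is only that the conversion to and from an STL map happens at the parse/serialize boundary, so the rest of the code keeps using a map.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for generated message classes. The assumed schema:
//   message TxnEntry { optional bytes digest = 1; optional MyObject txn = 2; }
//   message MergedSnapshot { repeated TxnEntry txns = 1; }
// Using `bytes` for the digest avoids the UTF-8 check entirely.
struct MyObject {
  int payload;
};
struct TxnEntry {
  std::string digest;  // Raw digest bytes, not necessarily UTF-8.
  MyObject txn;
};
struct MergedSnapshot {
  std::vector<TxnEntry> txns;
};

// After parsing: move the repeated entries into an STL map for processing.
// If a digest appears more than once, the later entry overwrites the earlier.
std::map<std::string, MyObject> ToMap(const MergedSnapshot& msg) {
  std::map<std::string, MyObject> out;
  for (const TxnEntry& entry : msg.txns) {
    out[entry.digest] = entry.txn;
  }
  return out;
}

// Before serializing: flatten the STL map back into repeated entries.
MergedSnapshot FromMap(const std::map<std::string, MyObject>& txns) {
  MergedSnapshot msg;
  msg.txns.reserve(txns.size());
  for (const auto& kv : txns) {
    TxnEntry entry;
    entry.digest = kv.first;
    entry.txn = kv.second;
    msg.txns.push_back(entry);
  }
  return msg;
}
```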

Best,
Florian

Em Rauch

Sep 10, 2024, 5:11:39 PM
to Protocol Buffers
I think if you use a proto2 syntax message it will not actually perform this check as of today (only proto3 syntax files do).

If that's not right, I unfortunately suspect the only way around it would be to vendor the protobuf runtime into your codebase and comment out the check / log if it's bothering you.

Florian Suri-Payer

Sep 13, 2024, 1:01:35 PM
to Protocol Buffers
I was already using `syntax = "proto2";`
I had gone ahead with the refactor, so I think it's fine.

Thanks again Em,
Florian