--
You received this message because you are subscribed to the Google Groups "bazel-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-dev/CAOu%2B0LUKBUfoE-i_H2VjFcYFjKp0aeEXcAWWDbwCr5RAW4RRXw%40mail.gmail.com.
Good thing you are keeping an eye on this; I would never have noticed. I will be very careful with future encoding-related PRs. Rolling back now. It looks like it was cherry-picked into 6.1. @Keerthana Kumar, can we undo that?
Thanks for catching this one! Sorry for LGTM'ing that PR too fast; I should have asked the experts for their opinions first (and now I know who the experts on encoding problems are).
Hello,
> My understanding is that the way Bazel thinks about character encoding is that it parses all of its input as ISO 8859-1 and emits all of its output as ISO 8859-1. This is, of course, broken, but in practice, it works reasonably well because UTF-8 survives being parsed and then re-encoded as ISO 8859-1 intact.
While it's true that UTF-8 bytes can be round-tripped through ISO 8859-1 decoding and encoding, that's not what happens with Bazel 6.0 on Windows.
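For reference, the round-trip property itself can be checked with a small sketch. Python stands in for Bazel's Java code here, and `round_trip_latin1` is a hypothetical helper name, not anything from the Bazel sources:

```python
def round_trip_latin1(raw: bytes) -> bytes:
    """Decode bytes as ISO 8859-1, then encode the result as ISO 8859-1 again."""
    return raw.decode("iso-8859-1").encode("iso-8859-1")


# UTF-8 bytes of a name containing an umlaut and a Greek letter.
utf8_name = "ö_λ.txt".encode("utf-8")

# ISO 8859-1 maps every byte value 0x00..0xFF to exactly one code point,
# so any byte string, UTF-8 or not, comes back unchanged.
assert round_trip_latin1(utf8_name) == utf8_name
assert round_trip_latin1(bytes(range(256))) == bytes(range(256))
```

The property only holds when the input really is raw bytes reinterpreted as ISO 8859-1; it says nothing about strings that are already correctly decoded.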
My use case is a cc_test with data = glob(["data/**"]). The glob ends up picking up a file with special characters (in my case, both "ö" and some Greek letters).
The glob input doesn't seem to do "ISO 8859-1 input"; the Java code actually has the correct filename in the Java-internal UTF-16 representation. Then, upon writing the manifest file, the "ö" survives the encoding as ISO 8859-1 (as the single byte 0xF6, not as the UTF-8 sequence 0xC3 0xB6), but the Greek letter is destroyed (replaced with a question mark, if I remember correctly).
What happens next is that build_runfiles_windows.cc reads the file and calls blaze_util::CstringToWstring(line) to convert it from UTF-8 to UTF-16. But the string isn't UTF-8, so this fails.
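A rough sketch of these two steps, in Python rather than the actual Java and C++ code. `write_manifest_entry` is a hypothetical stand-in for the manifest writer, and `errors="replace"` is an assumption chosen to mimic the question-mark substitution described above:

```python
def write_manifest_entry(filename: str) -> bytes:
    """Hypothetical stand-in: encode a correctly decoded name as ISO 8859-1.

    Characters outside ISO 8859-1 are replaced with '?' (assumed behavior).
    """
    return filename.encode("iso-8859-1", errors="replace")


entry = write_manifest_entry("ö_λ.txt")
# "ö" becomes the single byte 0xF6; the Greek letter is lost.
assert entry == b"\xf6_?.txt"

# A reader that expects UTF-8 then rejects the line, because 0xF6 is not
# a valid UTF-8 lead byte.
try:
    entry.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok
```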
Also, since UTF-8 isn't fixed-width, the "space_pos" ends up being wrong if there are any valid multi-byte UTF-8 sequences in the string.
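To illustrate the offset problem, here is a minimal sketch. It assumes the separator position is computed as a byte offset on the UTF-8 line and then applied to the UTF-16 string; the helper `space_positions` and the sample manifest line are made up for this example:

```python
def space_positions(line: str):
    """Return the separator position in UTF-8 bytes and in UTF-16 code units."""
    byte_pos = line.encode("utf-8").index(b" ")
    prefix = line[: line.index(" ")]
    utf16_pos = len(prefix.encode("utf-16-le")) // 2
    return byte_pos, utf16_pos


# Made-up manifest line: "<runfiles path> <absolute path>".
line = "data/ö.txt C:/src/data/ö.txt"
byte_pos, utf16_pos = space_positions(line)

# "ö" takes two bytes in UTF-8 but one code unit in UTF-16,
# so the two offsets differ by one.
assert (byte_pos, utf16_pos) == (11, 10)

# Using the byte offset on the wide string cuts one unit too far
# and drags the separator into the first field.
assert line[:byte_pos] == "data/ö.txt "
assert line[:utf16_pos] == "data/ö.txt"
```

The slices on `line` stand in for slicing a UTF-16 string, which is equivalent here because every character in the sample is in the Basic Multilingual Plane.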
Finally the string gets used with the Windows API which expects UTF-16. As a result, on Windows with bazel 6.0.0, build_runfiles_windows.cc fails if a filename picked up by a glob contains any non-ASCII characters. With my patch, a cc_test with a glob can successfully access a file containing both umlauts and Greek characters.
If bazel wanted to consistently use ISO 8859-1 everywhere, then why is blaze_util::CstringToWstring expecting UTF-8 input? And how would access to Greek filenames work if those characters don't survive being stored in the text file?
Daniel
What happens when Java writes the file:
It looks like https://github.com/bazelbuild/bazel/pull/15846 breaks that invariant? For example, if the byte 0xf6 ("ö" in ISO 8859-1) is on the input, it would be emitted as 0xc3 0xb6 in the runfiles manifest. Equivalently, if 0xc3 0xb6 ("ö" in UTF-8) is on the input, it would mean "Ã¶" in ISO 8859-1 and would therefore be emitted as 0xc3 0x83 0xc2 0xb6 in UTF-8.
Am I missing something here?
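The byte sequences in this example can be checked with a short sketch. It assumes the input is parsed as ISO 8859-1 and the manifest is then written as UTF-8; `parse_then_emit` is a hypothetical name and Python only stands in for the Java code:

```python
def parse_then_emit(raw: bytes) -> bytes:
    """Parse input bytes as ISO 8859-1, then emit the result as UTF-8 (assumed pipeline)."""
    return raw.decode("iso-8859-1").encode("utf-8")


# ISO 8859-1 input: the single byte for "ö" becomes a two-byte UTF-8 sequence.
assert parse_then_emit(b"\xf6") == b"\xc3\xb6"

# UTF-8 input: each of the two bytes is treated as its own ISO 8859-1
# character and re-encoded, which double-encodes the name.
assert parse_then_emit(b"\xc3\xb6") == b"\xc3\x83\xc2\xb6"

# Pure ASCII is unaffected, which is why the problem is easy to miss.
assert parse_then_emit(b"data/a.txt") == b"data/a.txt"
```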
(#15846 doesn't seem to contain test cases, which might have been enough to allay my concerns.)
--
Lukács T. Berki | Software Engineer | lbe...@google.com |
Google Germany GmbH | Erika-Mann-Str. 33 | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891