UTF-8 in runfiles manifests

172 views
Skip to first unread message

Lukács T. Berki

unread,
Feb 21, 2023, 9:10:25 AM2/21/23
to Xudong Yang, Pedro Liberal Fernandez, dan...@danielgrunwald.de, bazel-dev, Nathan Harmata
Hey Xudong and Pedro and Daniel,

I saw https://github.com/bazelbuild/bazel/pull/15846 flying by and I have questions!

My understanding is that the way Bazel thinks about character encoding is that it parses all of its input as ISO 8859-1 and emits all of its output as ISO 8859-1. This is, of course, broken, but in practice, it works reasonably well because UTF-8 survives being parsed and then re-encoded as ISO 8859-1 intact.

It looks like https://github.com/bazelbuild/bazel/pull/15846 breaks that invariant? For example, if the byte 0xd6 ("ö" in ISO 8859-1) is on the input, it would be emitted as 0xc3 0xb6 in the runfiles manifest. Equivalently, if the input is 0xc3 0xb6 ("ö" in UTF-8) is on the input, it would mean "ö" in ISO 8859-1, and would therefore be emitted as 0xc3 0x83 0xc2 0xb6 in UTF-8.

Am I missing something here?

(#15846 doesn't seem to contain test cases, which may have been enough to allay my concerns)

--
Lukács T. Berki | Software Engineer | lbe...@google.com | 

Google Germany GmbH | Erika-Mann-Str. 33  | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891

Tony Aiuto

unread,
Feb 21, 2023, 9:58:05 AM2/21/23
to Lukács T. Berki, Xudong Yang, Pedro Liberal Fernandez, dan...@danielgrunwald.de, bazel-dev, Nathan Harmata
Yes.  That fix looks wrong. We should never be encoding in UTF8 when writing files from starlark or from Java because that breaks the magic and variant. I've got a lot of study of this done from rules_pkg but I haven't put it together in any useful document for anyone else to read yet. I think the best course is to rollback that change and I'd be happy to help next week when I'm back.

--
You received this message because you are subscribed to the Google Groups "bazel-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-dev/CAOu%2B0LUKBUfoE-i_H2VjFcYFjKp0aeEXcAWWDbwCr5RAW4RRXw%40mail.gmail.com.

Yun Peng

unread,
Feb 22, 2023, 4:14:54 AM2/22/23
to Pedro Liberal Fernandez, Tony Aiuto, Lukács T. Berki, Xudong Yang, dan...@danielgrunwald.de, bazel-dev, Nathan Harmata, Keerthana Kumar
Thanks for catching this one! Sorry for LGTM that PR too fast, should have asked for opinions from experts first (and now I know who are the experts for encoding problems).

> It looks like it was cherry-picked into 6.1. @Keerthana Kumar Can we undo that?


On Wed, Feb 22, 2023 at 9:58 AM Pedro Liberal Fernandez <p...@google.com> wrote:
Good thing you are keeping an eye, I would have never noticed. I will be very careful with future encoding related PRs. Rolling back now.

It looks like it was cherry-picked into 6.1. @Keerthana Kumar Can we undo that?

Lukács T. Berki

unread,
Feb 22, 2023, 4:22:25 AM2/22/23
to Yun Peng, Pedro Liberal Fernandez, Tony Aiuto, Xudong Yang, dan...@danielgrunwald.de, bazel-dev, Nathan Harmata, Keerthana Kumar
On Wed, Feb 22, 2023 at 10:14 AM Yun Peng <pcl...@google.com> wrote:
Thanks for catching this one! Sorry for LGTM that PR too fast, should have asked for opinions from experts first (and now I know who are the experts for encoding problems).
I'd say "my job" but this time, it was pure luck: it was the first cherrypick to 6.1 and as such, it caught my eye.

Lukács T. Berki

unread,
Feb 27, 2023, 4:04:41 AM2/27/23
to Daniel Grunwald, Xudong Yang, Pedro Liberal Fernandez, bazel-dev, Nathan Harmata
Hey,

Urgh, this is complicated :( TL;DR: I stand by my decision that the best thing to do was to roll that change back, then maybe re-submit something similar once there is someone who understands the whole story in its fullness.

What happens on Unixes is that the JNI file system interface takes the bytes in the name of each file and assumes that each is a Unicode character :


And since Bazel prints everything on its output in ISO8859-1 encoding and ISO8859-1 corresponds to Unicode in the 0x0000-0x00ff range, no matter what encoding file names use, this works out in the sense that one gets bytes in the same encoding on the output of Bazel as they were in the file system.

On Windows, Bazel uses the regular java.io.File#list():


which I don't know how it handles files with more complicated characters; if memory serves well, Win32 has both a Unicode and a non-Unicode API for listing directories, but that's about the extent of my knowledge.

re: why blaze_util::CstringToWstring expects UTF-8 input, I have no idea, but it's clear that your change (unfortunately) had wider ramifications than either your or its reviewers have anticipated. I'm sorry.



On Thu, Feb 23, 2023 at 5:47 PM Daniel Grunwald <dan...@danielgrunwald.de> wrote:

Hello,

> My understanding is that the way Bazel thinks about character encoding is that it parses all of its input as ISO 8859-1 and emits all of its output as ISO 8859-1. This is, of course, broken, but in practice, it works reasonably well because UTF-8 survives being parsed and then re-encoded as ISO 8859-1 intact.

While that's true that UTF-8 bytes can be roundtripped through ISO 8859-1 decoding+encoding, that's not what happens with bazel 6.0 on Windows.

My use case is a cc_test with data = glob(["data/**"]). The glob ends up picking up a file with special characters (in my case, both "ö" and some greek letters).
The glob input doesn't seem to do "ISO 8859-1 input"; the Java code actually has the correct filename in the Java-internal UTF-16 representation.

Then upon writing the manifest file, the "ö" survives the encoding as ISO 8859-1 (as "ö", not as "ö"), but the Greek letter is destroyed (replaced with a question mark if I remember correctly).

What happens next is that build_runfiles_windows.cc reads the file and calls blaze_util::CstringToWstring(line) to convert it from UTF-8 to UTF-16. But the string isn't UTF-8, so this fails.
Also, since UTF-8 isn't fixed-width, the "
space_pos" ends up being wrong if there's any valid UTF-8 sequences in the string.
Finally the string gets used with the Windows API which expects UTF-16.

As a result, on Windows with bazel 6.0.0, build_runfiles_windows.cc fails if a filename picked up by a glob contains any non-ASCII characters. With my patch, a cc_test with a glob can successfully access a file containing both umlauts and Greek characters.

If bazel wanted to consistently use ISO 8859-1 everywhere, then why is blaze_util::CstringToWstring expecting UTF-8 input? And how would access to Greek filenames work if those characters don't survive being stored in the text file?


Daniel

What happens when Java writes the file:


It looks like https://github.com/bazelbuild/bazel/pull/15846 breaks that invariant? For example, if the byte 0xd6 ("ö" in ISO 8859-1) is on the input, it would be emitted as 0xc3 0xb6 in the runfiles manifest. Equivalently, if the input is 0xc3 0xb6 ("ö" in UTF-8) is on the input, it would mean "ö" in ISO 8859-1, and would therefore be emitted as 0xc3 0x83 0xc2 0xb6 in UTF-8.

Am I missing something here?

(#15846 doesn't seem to contain test cases, which may have been enough to allay my concerns)

--
Lukács T. Berki | Software Engineer | lbe...@google.com | 

Google Germany GmbH | Erika-Mann-Str. 33  | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891

Jeremy Volkman

unread,
Feb 27, 2023, 9:40:33 AM2/27/23
to Lukács T. Berki, Daniel Grunwald, Xudong Yang, Pedro Liberal Fernandez, bazel-dev, Nathan Harmata
This all sounds similar to an issue that still exists when Bazel tries to extract certain archives onto filesystems that expect utf-8 path names.


--
You received this message because you are subscribed to the Google Groups "bazel-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-dev+...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages