Making `char` a consistently unsigned type in Chromium

190 views
Skip to first unread message

dan...@chromium.org

unread,
Jan 14, 2022, 11:44:33 AM1/14/22
to chromium-dev
Hello,

In chromium today, `char` is sometimes signed, and sometimes unsigned. In particular, on ARM platforms our `char` is unsigned; otherwise it is signed.

Because the vast majority of our code is shared with ARM builds, it already must work with unsigned chars. But the presence of signed chars causes us some problems:

1) Because a char can be signed, comparing to >= 0 is valid. But on ARM, it's not valid. Clang can provide warnings for us when we make unsigned comparisons like this, via -Wtautological-unsigned-zero-compare. But it's not possible to turn it on while we have both signed and unsigned chars, as it's both correct to warn on ARM and incorrect to warn on Intel for the same code.

Yes, we could disable the warning on ARM, but then we exclude the check from ARM-specific code, and this warning is important. It is the only way to automatically catch bugs of the form

  for (size_t i = container.size() - 1; i >= 0; ++i)

which is something we can expect to encounter as we attempt to improve our signed correctness in order to raise our per-allocation-limit above 2GB.

2) FFI is more complicated when char is both signed and unsigned. While C/C++ have the ambiguously-signed `char`, a language like Rust does not, so we need to choose a type to convert to. If char can be signed, then there are many values which we must handle as errors when converting to a u8, for instance. Java also has unsigned chars, though we only use Java on ARM, where we have unsigned chars, so we currently avoid this problem in FFI there.

This means changing all our chars to be consistently signed would make FFI worse on ARM, whereas going the other way, to unsigned, makes FFI better elsewhere.

3) Code we import from Google internals also expects char to be unsigned, so this would make the build/runtime environment more consistent for that code as well.

Therefore, I am proposing to consistently build chrome with unsigned chars across all platforms. On clang that is -funsigned-char. On MSVC that is /J. It builds everywhere, and needed only 1 change in breakpad to do so.

Have I missed anything?

Thanks,
Dana

Alexei Svitkine

unread,
Jan 14, 2022, 11:50:40 AM1/14/22
to Dana Jansens, chromium-dev
Based on your summary, this seems like a good improvement. Does the change pass all the trybots already?

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CAHtyhaQ_AWdBB%3Dzggk2fkNhoBEd7_7b4mCAjKUdY3-U1_3qscQ%40mail.gmail.com.

dan...@chromium.org

unread,
Jan 14, 2022, 11:53:58 AM1/14/22
to Alexei Svitkine, chromium-dev
On Fri, Jan 14, 2022 at 11:49 AM Alexei Svitkine <asvi...@chromium.org> wrote:
Based on your summary, this seems like a good improvement. Does the change pass all the trybots already?

K. Moon

unread,
Jan 14, 2022, 1:02:32 PM1/14/22
to dan...@chromium.org, Alexei Svitkine, chromium-dev
Just a small note: Java's char type is more like char16_t; Java's byte type is the closest analogue to C++'s char, and is signed.

dan...@chromium.org

unread,
Jan 14, 2022, 1:36:22 PM1/14/22
to K. Moon, Alexei Svitkine, chromium-dev
On Fri, Jan 14, 2022 at 1:00 PM K. Moon <km...@chromium.org> wrote:
Just a small note: Java's char type is more like char16_t; Java's byte type is the closest analogue to C++'s char, and is signed.

Thanks for the clarification. I see that String supports working with bytes or char and requires you to explicitly choose when accessing the data.

Regardless, there's no change for us on any platform that uses Java.

Jeremy Roman

unread,
Jan 14, 2022, 1:37:07 PM1/14/22
to km...@chromium.org, dan...@chromium.org, Alexei Svitkine, chromium-dev
Does this potentially cause problems linking with third-party or system libraries which were built with the platform convention for char signedness? This switch doesn't affect name mangling, so it seems like it could cause us to undetectably do signed/unsigned conversions at the linking boundary.

dan...@chromium.org

unread,
Jan 14, 2022, 1:46:41 PM1/14/22
to Jeremy Roman, K Moon, Alexei Svitkine, chromium-dev
On Fri, Jan 14, 2022 at 1:35 PM Jeremy Roman <jbr...@chromium.org> wrote:
Does this potentially cause problems linking with third-party or system libraries which were built with the platform convention for char signedness? This switch doesn't affect name mangling, so it seems like it could cause us to undetectably do signed/unsigned conversions at the linking boundary.

It would not change linking. The same bits would be given to said library as before. If a library explicitly wanted "-1" as an input with a char type, for example, no code change is required to pass it that.

For the most part, the compiler switch won't do anything, and we'd have to change our code to pass different values to said library before we had a problem (for eg passing 255 when it expects a positive signed char). But thank you, there is one thing that does change as a result, std::numeric_limits<char> will be different. It does not appear outside of tests at this time. It may be that a template makes use of it, then passes something based on that to a linked-in library (glibc/pthread/??) that wants a signed char, and something goes wrong. It does feel like a bit of a stretch, but maybe there's a way to look for potential problems there.

K. Moon

unread,
Jan 14, 2022, 1:51:01 PM1/14/22
to dan...@chromium.org, Alexei Svitkine, chromium-dev
On Fri, Jan 14, 2022 at 10:34 AM <dan...@chromium.org> wrote:
On Fri, Jan 14, 2022 at 1:00 PM K. Moon <km...@chromium.org> wrote:
Just a small note: Java's char type is more like char16_t; Java's byte type is the closest analogue to C++'s char, and is signed.

Thanks for the clarification. I see that String supports working with bytes or char and requires you to explicitly choose when accessing the data.

Sorry, one more nit.java.lang.String is more like std::u16string: It's logically always UTF-16 (although there's an optimization in recent versions of Java to internally store the data as byte[] if 8 bits is enough for the actual contents, similar to WTF::String.) The APIs that take byte are doing charset conversion (there's no assumption about what encoding each byte is using), and aren't commonly used outside of those use cases. (Like most more modern languages, Java makes a clear distinction in general between binary and character I/O.)

Regardless, there's no change for us on any platform that uses Java.

I think the only places it would matter would be FFI (using JNI, which I think uses an explicit typedef anyway) or IPC (and hopefully Mojo is insensitive to this change).

dan...@chromium.org

unread,
Jan 14, 2022, 2:10:24 PM1/14/22
to Jeremy Roman, K Moon, Alexei Svitkine, chromium-dev
On Fri, Jan 14, 2022 at 1:45 PM <dan...@chromium.org> wrote:
On Fri, Jan 14, 2022 at 1:35 PM Jeremy Roman <jbr...@chromium.org> wrote:
Does this potentially cause problems linking with third-party or system libraries which were built with the platform convention for char signedness? This switch doesn't affect name mangling, so it seems like it could cause us to undetectably do signed/unsigned conversions at the linking boundary.

It would not change linking. The same bits would be given to said library as before. If a library explicitly wanted "-1" as an input with a char type, for example, no code change is required to pass it that.

For the most part, the compiler switch won't do anything, and we'd have to change our code to pass different values to said library before we had a problem (for eg passing 255 when it expects a positive signed char). But thank you, there is one thing that does change as a result, std::numeric_limits<char> will be different. It does not appear outside of tests at this time. It may be that a template makes use of it, then passes something based on that to a linked-in library (glibc/pthread/??) that wants a signed char, and something goes wrong. It does feel like a bit of a stretch, but maybe there's a way to look for potential problems there.

I also went looking for use of CHAR_MAX. There's non-zero uses of that.

I see one place that may change, in std::moneypunct_byname, as it calls localeconv() and compares the frac_digits struct field against CHAR_MAX. If localeconv() returns CHAR_MAX as a special signal, then that function would disagree and get it wrong. The documentation does not suggest anywhere that localeconv() does, however, and we make no use of std::moneypunct_byname() in the entire repository.

Using a "char" as a "signed int" in APIs is not common but there may be other places that do something weird. The same bits will cross over as before, though. I believe things could only break when comparing to a constant that has changed.

Mark Mentovai

unread,
Jan 14, 2022, 4:07:07 PM1/14/22
to Dana Jansens, chromium-dev
This kind of thing gives me the heebie-jeebies. The signedness of unadorned char is ABI, inherent to a system, and to a large extent, unyielding. It’s quite possible to be able to squeak by with a change like this without causing problems at module boundaries because it’s actually quite rare to pass a scalar char into a function, or return one from a function, in the C tradition. I wouldn’t bet the farm on it, though.

For whatever it’s worth, the assertion that char is already unsigned on ARM (also bug 1287175) is incomplete. As a property of the ABI, the OS needs to be considered, not just the CPU. On Apple platforms (specifically, mac-arm64 and ios-arm64), char is, in fact, signed. I mention this in case any of the analysis that brought you to this point was based on the notion that anything we build for an ARM system must already be safe and could get a pass, or in case you tested by asking for -funsigned-char on a variety of x86_64 platforms but not on any arm64 platform. Since that’s not true, at the very least Apple-specific code can’t get a pass on these grounds.

(The fact that Apple went against the ARM-universe grain and declared char to be signed in their various ARM ABIs is a hint that there may be very sharp edges on their platforms in the form of ingrained expectations that unadorned char is signed, and rather than sanding them down, they opted to keep char signed. But it’s also possible that without being shackled by the pre-ARMv4 legacy of inefficient signed loads, they were able to go their own way favoring consistency across their platforms. More: an interesting historical note on why char is usually signed on ARM. This brings up another point, though: a platform’s choice of char signedness may have been made with efficiency in mind, and although this probably isn’t the case for anything we target today, it’s at least worth the inquiry.)

Another caution that comes to mind that I haven’t seen mentioned yet: we may have third-party code that uses unadorned char coupled with something that tells it what the properties of that char are. This might take the form of a macro, which could appear in a .gn file or in a checked-in header such as autoconf configure output. These would need to be updated.

More inline.

Hello,

In chromium today, `char` is sometimes signed, and sometimes unsigned. In particular, on ARM platforms our `char` is unsigned; otherwise it is signed.

Because the vast majority of our code is shared with ARM builds, it already must work with unsigned chars. But the presence of signed chars causes us some problems:

1) Because a char can be signed, comparing to >= 0 is valid. But on ARM, it's not valid. Clang can provide warnings for us when we make unsigned comparisons like this, via -Wtautological-unsigned-zero-compare. But it's not possible to turn it on while we have both signed and unsigned chars, as it's both correct to warn on ARM and incorrect to warn on Intel for the same code.

Yes, we could disable the warning on ARM, but then we exclude the check from ARM-specific code, and this warning is important. It is the only way to automatically catch bugs of the form

  for (size_t i = container.size() - 1; i >= 0; ++i)

which is something we can expect to encounter as we attempt to improve our signed correctness in order to raise our per-allocation-limit above 2GB.

How prevalent is the case where we test an unadorned char for ≥0? Given the fact that we can’t make any assumptions about signedness, such code smells broken regardless, and having it be flagged by -Wtautological-unsigned-zero-compare where char is unsigned sounds correct. Can we fix those cases and enable -Wtautological-unsigned-zero-compare independently of changing char’s signedness contrary to some ABI definitions?

In fact, I think that this is already the case with the clang that we use today (crclang++ is Chrome’s clang++):

mark@arm-and-hammer zsh% crclang++ --version
clang version 14.0.0 (https://github.com/llvm/llvm-project/ ed3a4a4948dedd264732105e910405138c9c0f3b)
Target: arm64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /chrome/chrome/src/third_party/llvm-build/Release+Asserts/bin
mark@arm-and-hammer zsh% cat t_tautological_unsigned_zero_compare.cc
bool IsZeroOrPositive(char c) {
  return c >= 0;
}
mark@arm-and-hammer zsh% crclang++ -std=c++17 -funsigned-char -Wall -Wtautological-unsigned-zero-compare -Werror -c -o /dev/null t_tautological_unsigned_zero_compare.cc
mark@arm-and-hammer zsh% crclang++ -std=c++17 -funsigned-char -Wall -Wtautological-unsigned-zero-compare -Wtautological-unsigned-char-zero-compare -Werror -c -o /dev/null t_tautological_unsigned_zero_compare.cc
t_tautological_unsigned_zero_compare.cc:2:12: error: result of comparison of char expression >= 0 is always true, since char is interpreted as unsigned [-Werror,-Wtautological-unsigned-char-zero-compare]
  return c >= 0;
         ~ ^  ~
1 error generated.

It is possible to turn this warning on selectively, and to keep it away from unadorned char until we’re ready for that.
 
2) FFI is more complicated when char is both signed and unsigned. While C/C++ have the ambiguously-signed `char`, a language like Rust does not, so we need to choose a type to convert to. If char can be signed, then there are many values which we must handle as errors when converting to a u8, for instance. Java also has unsigned chars, though we only use Java on ARM, where we have unsigned chars, so we currently avoid this problem in FFI there.

This means changing all our chars to be consistently signed would make FFI worse on ARM, whereas going the other way, to unsigned, makes FFI better elsewhere.

I would argue that these problems can appear at any language/runtime boundary, and the correct solution isn’t to abuse the meaning of a type on either side of any fence, but to be more careful at the boundaries. In no case do I really find it compelling to bend the ABI-defined meaning of a fundamental type in our primary language, for the benefit of the few transitions we have into another.

3) Code we import from Google internals also expects char to be unsigned, so this would make the build/runtime environment more consistent for that code as well.

I’m pretty sure that this isn’t universal. For example, -Wno-char-subscripts and a bug caused by mixing -funsigned-char settings.

Therefore, I am proposing to consistently build chrome with unsigned chars across all platforms. On clang that is -funsigned-char. On MSVC that is /J. It builds everywhere, and needed only 1 change in breakpad to do so.

Have I missed anything?

Thanks,
Dana

dan...@chromium.org

unread,
Jan 14, 2022, 6:14:08 PM1/14/22
to Mark Mentovai, chromium-dev, nicola...@chromium.org
Thank you for your thoughts Mark, I really appreciate all the time and thought you put into this. Replies below!

+nicolascapens for a question as well.

On Fri, Jan 14, 2022 at 4:05 PM Mark Mentovai <ma...@chromium.org> wrote:
This kind of thing gives me the heebie-jeebies. The signedness of unadorned char is ABI, inherent to a system, and to a large extent, unyielding. It’s quite possible to be able to squeak by with a change like this without causing problems at module boundaries because it’s actually quite rare to pass a scalar char into a function, or return one from a function, in the C tradition. I wouldn’t bet the farm on it, though.

I think that if we linked more binaries into chrome it would be a lot riskier, but our binary ABI surface is quite small, as we compile almost everything from source, including libc++. 

For whatever it’s worth, the assertion that char is already unsigned on ARM (also bug 1287175) is incomplete. As a property of the ABI, the OS needs to be considered, not just the CPU. On Apple platforms (specifically, mac-arm64 and ios-arm64), char is, in fact, signed. I mention this in case any of the analysis that brought you to this point was based on the notion that anything we build for an ARM system must already be safe and could get a pass, or in case you tested by asking for -funsigned-char on a variety of x86_64 platforms but not on any arm64 platform. Since that’s not true, at the very least Apple-specific code can’t get a pass on these grounds.

Ah thank you, I did not realize this. I also learnt that not all Android builds are ARM from this conversation, as there is x86 Android too. Thus on Android as well, chars can be both signed and unsigned, and our compiled code works with both. I did not assume anything about any platform when thinking about and trying this change out. I can only say that all our tests pass, and that code inspection via code search doesn't give any obvious problems.
 
(The fact that Apple went against the ARM-universe grain and declared char to be signed in their various ARM ABIs is a hint that there may be very sharp edges on their platforms in the form of ingrained expectations that unadorned char is signed, and rather than sanding them down, they opted to keep char signed. But it’s also possible that without being shackled by the pre-ARMv4 legacy of inefficient signed loads, they were able to go their own way favoring consistency across their platforms. More: an interesting historical note on why char is usually signed on ARM. This brings up another point, though: a platform’s choice of char signedness may have been made with efficiency in mind, and although this probably isn’t the case for anything we target today, it’s at least worth the inquiry.)

That's an interesting point. I will try a pinpoint on Windows (a historically signed-char platform): https://pinpoint-dot-chromeperf.appspot.com/job/1563aaf10a0000 Infra seems quite broken today, so I will do another if this fails. :)
 
Another caution that comes to mind that I haven’t seen mentioned yet: we may have third-party code that uses unadorned char coupled with something that tells it what the properties of that char are. This might take the form of a macro, which could appear in a .gn file or in a checked-in header such as autoconf configure output. These would need to be updated.

Good idea! I don't see any defines that mention "char" or "sign". Code seems to rely on __CHAR_UNSIGNED__ instead which will do the correct thing.
 
More inline.

Hello,

In chromium today, `char` is sometimes signed, and sometimes unsigned. In particular, on ARM platforms our `char` is unsigned; otherwise it is signed.

Because the vast majority of our code is shared with ARM builds, it already must work with unsigned chars. But the presence of signed chars causes us some problems:

1) Because a char can be signed, comparing to >= 0 is valid. But on ARM, it's not valid. Clang can provide warnings for us when we make unsigned comparisons like this, via -Wtautological-unsigned-zero-compare. But it's not possible to turn it on while we have both signed and unsigned chars, as it's both correct to warn on ARM and incorrect to warn on Intel for the same code.

Yes, we could disable the warning on ARM, but then we exclude the check from ARM-specific code, and this warning is important. It is the only way to automatically catch bugs of the form

  for (size_t i = container.size() - 1; i >= 0; ++i)

which is something we can expect to encounter as we attempt to improve our signed correctness in order to raise our per-allocation-limit above 2GB.

How prevalent is the case where we test an unadorned char for ≥0? Given the fact that we can’t make any assumptions about signedness, such code smells broken regardless, and having it be flagged by -Wtautological-unsigned-zero-compare where char is unsigned sounds correct. Can we fix those cases and enable -Wtautological-unsigned-zero-compare independently of changing char’s signedness contrary to some ABI definitions?

I am not 100% since errors stop the build, but I don't think there's a ton of cases. But they exist in both first-party and third-party code, and we can't turn on the warning with them there. We could try splitting a few files out into source sets to avoid the warning, but that is pretty messy.

But also why are they wrong, given char can be signed? Fixing them to check only when appropriate is possible, and would require an ifdef or if constexpr(std::is_unsigned<char>::value) branch where one side compares to 0 and one doesn't. Just removing the check seems wrong if char is signed, perhaps you could convince me otherwise? Or would we fix them by using `signed char` instead of `char`? Or writing `unsigned char` in that place? That feels worse to me than changing all our chars, but if not to you then I'd like to understand why.

There is both:
-Wtautological-unsigned-zero-compare 
-Wtautological-unsigned-char-zero-compare 

However the first includes chars, and does not interact with the second on ARM. My first attempt was -Wtautological-unsigned-zero-compare  -Wno-tautological-unsigned-char-zero-compare but the former fires regardless. We could attempt to change the warnings in Clang, for sure, to improve this. I did think bringing some consistent behaviour to our code would be preferable if possible.

In fact, I think that this is already the case with the clang that we use today (crclang++ is Chrome’s clang++):

mark@arm-and-hammer zsh% crclang++ --version
clang version 14.0.0 (https://github.com/llvm/llvm-project/ ed3a4a4948dedd264732105e910405138c9c0f3b)
Target: arm64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /chrome/chrome/src/third_party/llvm-build/Release+Asserts/bin
mark@arm-and-hammer zsh% cat t_tautological_unsigned_zero_compare.cc
bool IsZeroOrPositive(char c) {
  return c >= 0;
}
mark@arm-and-hammer zsh% crclang++ -std=c++17 -funsigned-char -Wall -Wtautological-unsigned-zero-compare -Werror -c -o /dev/null t_tautological_unsigned_zero_compare.cc
mark@arm-and-hammer zsh% crclang++ -std=c++17 -funsigned-char -Wall -Wtautological-unsigned-zero-compare -Wtautological-unsigned-char-zero-compare -Werror -c -o /dev/null t_tautological_unsigned_zero_compare.cc
t_tautological_unsigned_zero_compare.cc:2:12: error: result of comparison of char expression >= 0 is always true, since char is interpreted as unsigned [-Werror,-Wtautological-unsigned-char-zero-compare]
  return c >= 0;
         ~ ^  ~
1 error generated.

It is possible to turn this warning on selectively, and to keep it away from unadorned char until we’re ready for that.

Here -Wtautological-unsigned-zero-compare also warns against char when built for ARM: https://godbolt.org/z/v43GMEjhG (with or without -funsigned-char).

I did not know it would not warn for char on linux, with -funsigned-char. There's something weird going on between the warnings and -funsigned-char, perhaps.

2) FFI is more complicated when char is both signed and unsigned. While C/C++ have the ambiguously-signed `char`, a language like Rust does not, so we need to choose a type to convert to. If char can be signed, then there are many values which we must handle as errors when converting to a u8, for instance. Java also has unsigned chars, though we only use Java on ARM, where we have unsigned chars, so we currently avoid this problem in FFI there.

This means changing all our chars to be consistently signed would make FFI worse on ARM, whereas going the other way, to unsigned, makes FFI better elsewhere.

I would argue that these problems can appear at any language/runtime boundary, and the correct solution isn’t to abuse the meaning of a type on either side of any fence, but to be more careful at the boundaries. In no case do I really find it compelling to bend the ABI-defined meaning of a fundamental type in our primary language, for the benefit of the few transitions we have into another.

Primarily I view this as an improvement for our C++, to have a consistent representation of the type. I do see this as a knock-on benefit however. And I understand you are concerned about the interaction with binary dependencies.
 
3) Code we import from Google internals also expects char to be unsigned, so this would make the build/runtime environment more consistent for that code as well.

I’m pretty sure that this isn’t universal. For example, -Wno-char-subscripts and a bug caused by mixing -funsigned-char settings.

It is not universal, but it is the default and has been there for a long time. I gave up on finding the origin. The latter link from 2016 states that there is effort to reduce the cases that diverge from the norm. I think it's fair to expect code coming to us here would be expecting char to be unsigned. I also wondered if maybe in ObjectiveC or Darwin builds it would be different, but I don't see that. One of the handful appearances of -fsigned-char is swiftshader, a Chromium library, but we don't use the same in Chromium on ARM, perhaps +nicolascapens could you speak to why it's compiled with signed-char? That may be instructive, as it can compile successfully with -funsigned-char. Flatbuffers went the other way and made ARM use -fsigned-char in order to get their code to compile. These few external things seem to be the odd ones out to me.

dan...@chromium.org

unread,
Jan 18, 2022, 11:00:44 AM1/18/22
to Nicolas Capens, Mark Mentovai, chromium-dev, nicola...@chromium.org
I spent the weekend thinking more about what could go wrong with binary libraries that assume char is signed.

First, I believe they would have to assume char is _always_ signed. Otherwise, they can't use scalar chars to encode negative values, as that would not be portable. So it's unlikely that (and would be a bug if?) Linux libraries do this. Mac did choose to make char always signed, as you pointed out though, Mark.

So, what can go wrong? The bits themselves don't change, and the behaviour of the library doesn't change, so a bug would have to be introduced in code that we compile interpreting the bits differently. That sort of interpretation is exactly the thing this change is about actually. Here's an example of a bug we could introduce:

char c = library_function();
if (c < 0) handle_error(); else success();

If this library assumed char is always signed, and used a negative char to report an error, we would no longer catch the errors. This wouldn't be portable, but is theoretically possible. If we compile with -Wtautological-unsigned-char-zero-compare this would produce an error so that we could find the bug, and write `signed char` instead of `char` to correctly interpret the char's bits, and remove the error.

Thus, I would propose that we make the codebase clean for -Wtautological-unsigned-char-zero-compare and introduce that flag along with -funsigned-char on non-Windows (if anyone knows an equivalent MSVC flag please let me know). This would cover the most suspicious platform, which is Mac.

There's still a chance to miss errors if there was a set of possible return values, encoded in a `char`, such that for example < 10 is an error, including negative values, and >= 10 is success. This seems really unlikely though?

I tried adding -Wtautological-unsigned-char-zero-compare and there were 2 errors generated when building the chrome binary: a comparison in net/ which is fine with unsigned char, and a comparison in dead code in openscreen (which would have been fine with unsigned).

I can try building Mac on bots once the openscreen fix rolls. Mark, do you think this would increase your confidence sufficiently in terms of correctness? (I am re-collecting pinpoint results still.)

Thanks,
Dana

On Tue, Jan 18, 2022 at 10:13 AM Nicolas Capens <ca...@google.com> wrote:
Dana pointed out that SwiftShader in Google3 uses -fsigned-char. We should be able to remove that: b/422548590.

On Fri, Jan 14, 2022 at 9:16 PM Nicolas Capens <ca...@google.com> wrote:
Where did you find SwiftShader using -fsigned-char? I wouldn't be surprised if we did, but I couldn't immediately find it. Keep in mind that SwiftShader is used in Android and Google3 as well so we have to be careful when any code assumes signedness. Char being signed is the default on a lot of platforms as far as I know, and we use unsigned char for this reason in a lot of places, but I don't know offhand if we assume just char to be signed anywhere.

dan...@chromium.org

unread,
Jan 19, 2022, 1:53:49 PM1/19/22
to Thomas Sepez, Nicolas Capens, Mark Mentovai, chromium-dev, nicola...@chromium.org
The codebase now compiles everywhere with -Wtautological-unsigned-char-zero-compare now, I had to remove some dead third-party code, and fix one more (benign) comparison in //net. 

Pinpoint usually results in a regression if you sneeze near it, but the results look like noise on both Mac and Windows. I ran each config twice which makes that more clear (set "base" as the reference column).

Sort order is an interesting question, and I am not sure how to look for those yet.

Is this becoming increasingly convincing that it is safe to do though?

On Tue, Jan 18, 2022 at 12:59 PM Thomas Sepez <tse...@google.com> wrote:
Well, sort order would change, for example, if you coded your own compare function, no? 

Jeremy Roman

unread,
Jan 19, 2022, 4:57:35 PM1/19/22
to dan...@chromium.org, Thomas Sepez, Nicolas Capens, Mark Mentovai, chromium-dev, nicola...@chromium.org
I'm not going to strongly resist this but it seems like the shortest path is not to change the definition of char on various platforms to be inconsistent with the platform ABI, but instead to just use some combination of std::is_unsigned_v<char> / std::is_signed_v<char> and changing variables to be expressly "signed char" or "unsigned char", rather than baking a hidden assumption that we're writing in a modified dialect of C++.

I have to imagine any Google code we import will need to be made portable for other reasons. I was actually surprised to learn the prod environment doesn't match the usual x86 ABI; do you know why that was done?

dan...@chromium.org

unread,
Jan 20, 2022, 9:37:14 AM1/20/22
to Jeremy Roman, Thomas Sepez, Nicolas Capens, Mark Mentovai, chromium-dev, nicola...@chromium.org
On Wed, Jan 19, 2022 at 4:56 PM Jeremy Roman <jbr...@chromium.org> wrote:
I'm not going to strongly resist this but it seems like the shortest path is not to change the definition of char on various platforms to be inconsistent with the platform ABI, but instead to just use some combination of std::is_unsigned_v<char> / std::is_signed_v<char> and changing variables to be expressly "signed char" or "unsigned char", rather than baking a hidden assumption that we're writing in a modified dialect of C++.

I have to imagine any Google code we import will need to be made portable for other reasons. I was actually surprised to learn the prod environment doesn't match the usual x86 ABI; do you know why that was done?

I got some help, in order to track this down to 1999 in cl/6855.

A strong reason for using unsigned chars, and it seems the reason it was done then, was so you could pull a `char` out of a std::string (which forces unqualified char), and use it as an index. UTF-8 strings include values above 127, which means they start to have negative values in a signed char (though they are not meant to be negative), and require careful casting to unsigned char in order to use them correctly as an offset or index. Otherwise you get a memory safety bug.
 
For example:
std::string url = "https://foo.bar.baz";
if (lookup_structure[url[0]] == something)  // explodes if lookup_structure is a vector and char is signed.

I will note that blink's String uses LChar for 8-bit chars, which is an unsigned char. But we can't do similar with std::string, unless we -funsigned-char.

K. Moon

unread,
Jan 20, 2022, 11:12:50 AM1/20/22
to dan...@chromium.org, Jeremy Roman, Thomas Sepez, Nicolas Capens, Mark Mentovai, chromium-dev, nicola...@chromium.org
As someone who used to write a lot of string manipulation code in plain C back in the day, I personally would always cast to unsigned char before trying to use a character in a lookup table. Maybe I'm just more used to thinking at that level, though.

Mark Mentovai

unread,
Jan 20, 2022, 11:32:40 AM1/20/22
to Nicolas Capens, Dana Jansens, Jeremy Roman, Thomas Sepez, chromium-dev, Nicolas Capens, Nico Weber
It already does, with -Wchar-subscripts, which is part of -Wall.

mark@arm-and-hammer zsh% crclang++ -x c++ -std=c++17 -Wall -Werror -S -c - -o/dev/null <<< 'const int ints[256] = {}; int F(char c) { return ints[c]; }'
<stdin>:1:54: error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
const int ints[256] = {}; int F(char c) { return ints[c]; }
                                                     ^~
1 error generated.


This warning appears even with -funsigned-char.

It looks like we have this warning enabled for the most part, except for third_party/minizip on NaCl.

I agree that char array subscripts are a bad and nonportable practice and should be avoided. The cast to unsigned at the few points of use is defined and probably typically appropriate.

On Thu, Jan 20, 2022 at 11:23 AM Nicolas Capens <ca...@google.com> wrote:
 
A strong reason for using unsigned chars, and it seems the reason it was done then, was so you could pull a `char` out of a std::string (which forces unqualified char), and use it as an index.

This seems like something the compiler should warn against. Even if Chromium forces char to be unsigned, it doesn't encourage writing portable code if we allow this kind of usage.

dan...@chromium.org

unread,
Jan 20, 2022, 2:41:46 PM1/20/22
to Mark Mentovai, Nicolas Capens, Jeremy Roman, Thomas Sepez, chromium-dev, Nicolas Capens, Nico Weber
On Thu, Jan 20, 2022 at 11:31 AM Mark Mentovai <ma...@chromium.org> wrote:
It already does, with -Wchar-subscripts, which is part of -Wall.

mark@arm-and-hammer zsh% crclang++ -x c++ -std=c++17 -Wall -Werror -S -c - -o/dev/null <<< 'const int ints[256] = {}; int F(char c) { return ints[c]; }'
<stdin>:1:54: error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
const int ints[256] = {}; int F(char c) { return ints[c]; }
                                                     ^~
1 error generated.


This warning appears even with -funsigned-char.

It looks like we have this warning enabled for the most part, except for third_party/minizip on NaCl.

This is curious since third-party code doesn't switch on NaCl, and it's not present in minizip. It seems that we're fine to remove that so we'd have Wchar-subscripts everywhere.
 

I agree that char array subscripts are a bad and nonportable practice and should be avoided. The cast to unsigned at the few points of use is defined and probably typically appropriate.

This is great, as it's a place where changing signed to unsigned would have changed behaviour (like comparisons), and this means we don't use chars to index negatively into arrays (which would break on Android ARM anyhow).

It seems to me like the place we're landing is that:
a) Using chars in a way that involves negative values is not portable so code should not be doing it anyway.
b) We can't find any code that relies on such non-portable/incorrect behaviour.
c) If such a bug (non-portable code) exists, -funsigned-char might increase its visibility beyond various ARM targets.
d) Roughly all our code has to be portable between signed/unsigned char right now. Mac is the exception since Mac ARM is signed at the moment.
e) There's no observable performance impact for using funsigned-char

I believe making our platforms consistent reduces our risks bugs. Certainly adding more warnings will help to reduce our risks.

I wonder if this has been convincing enough for others? Specifically, Mark, you expressed the greatest worry. Could we try to make this change?

Thanks,
Dana

Nicolas Capens

unread,
Jan 21, 2022, 12:52:44 AM1/21/22
to Dana Jansens, Mark Mentovai, chromium-dev, nicola...@chromium.org
Where did you find SwiftShader using -fsigned-char? I wouldn't be surprised if we did, but I couldn't immediately find it. Keep in mind that SwiftShader is used in Android and Google3 as well so we have to be careful when any code assumes signedness. Char being signed is the default on a lot of platforms as far as I know, and we use unsigned char for this reason in a lot of places, but I don't know offhand if we assume just char to be signed anywhere.

On Fri, Jan 14, 2022 at 6:12 PM <dan...@chromium.org> wrote:

Nicolas Capens

unread,
Jan 21, 2022, 12:52:46 AM1/21/22
to Dana Jansens, Mark Mentovai, chromium-dev, nicola...@chromium.org
Dana pointed out that SwiftShader in Google3 uses -fsigned-char. We should be able to remove that: b/422548590.

On Fri, Jan 14, 2022 at 9:16 PM Nicolas Capens <ca...@google.com> wrote:

Nicolas Capens

unread,
Jan 21, 2022, 12:53:13 AM1/21/22
to Dana Jansens, Jeremy Roman, Thomas Sepez, Mark Mentovai, chromium-dev, nicola...@chromium.org, Nico Weber
 
A strong reason for using unsigned chars, and it seems the reason it was done then, was so you could pull a `char` out of a std::string (which forces unqualified char), and use it as an index.

This seems like something the compiler should warn against. Even if Chromium forces char to be unsigned, it doesn't encourage writing portable code if we allow this kind of usage.

On Thu, Jan 20, 2022 at 9:36 AM <dan...@chromium.org> wrote:

dan...@chromium.org

unread,
Jan 25, 2022, 11:52:34 AM1/25/22
to Thomas Sepez, Nicolas Capens, Mark Mentovai, chromium-dev, Nicolas Capens
On Tue, Jan 18, 2022 at 12:59 PM Thomas Sepez <tse...@google.com> wrote:
Well, sort order would change, for example, if you coded your own compare function, no? 

I've attempted to look for these, in some fashion. It's quite hard to be exhaustive here. I will note, as I think you were implying, std::sort on std::string will act on its contents as if the chars are unsigned: https://godbolt.org/z/axj369scz, but when char is unsigned, std::string sort order does not change but char sort order does: https://godbolt.org/z/M3x6fj47s

So if we compared `char` in a custom comparator, then it would sort in the same order as std::string with -funsigned. Without -funsigned-char, a custom char comparison would be different from std::string sort order, but only on the platforms where char is signed. If we rely on such sort order then this presents yet another case of non-portable code.

There are no tests that demonstrate a reliance on such non-portable code.

I tried looking at use of sort() with a custom comparator but there's too many. So I tried searching for `char` and sort(`. SkString doesn't provide ordered comparisons.

I also tried to look for char fields that are not pointers/strings, which could be part of an ordered comparison, though didn't find any issue. All such chars do not appear to be used for arbitrary string characters (which is good since they can not hold an entire UTF-8 character), and have a small set of defined values, being used as a replacement for an 8-bit bool (example) or enum (example).

I did come across a potential use of a char as an array index which would be incorrect (a memory bug) with signed char, though as fragile as it is, it is seemingly correct as the caller uses a uint8_t in both cases. And another which indexes by a `char` incorrectly, presumably because the values happen to never be negative, else it's currently a memory bug (and perhaps exploitable if this code is not dead).

So, in looking for issues, I found more support for -funsigned-char than for the status quo, I believe.

At this point, I am fairly convinced that we should give this a try. I will put up the CL for review, and hope to hear more if others are not yet convinced and have more concerns.

I really appreciate all the thoughts on things to look out for, which have helped me increase my confidence in this change.

Cheers,
Dana


On Tue, Jan 18, 2022, 8:00 AM <dan...@chromium.org> wrote:

Mark Mentovai

unread,
Jan 25, 2022, 5:03:15 PM1/25/22
to Dana Jansens, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
Hi, Dana. Sorry for the delay in responding.

The sort of inter-module problem I’m most concerned about is where a char crosses a module boundary into a system library that we don’t control. The surface area for system libraries beyond our control is quite large on most of our platforms.

A char passed directly across this boundary may pose a problem if it’s carried in a wider type. This can happen for ABI reasons, such as when a system library accepts a char argument that’s passed in a native-width register. These cases will typically appear innocuous, though, since the bit pattern in the least significant byte should still match everyone’s expectations. More concerning are cases like char *strchr(char const *, int). The fact that the “needle” argument is conveyed across the boundary in an int means that the input to that function is much more likely to be treated differently when called with an unsigned or signed char argument. Now, in the case of strchr, it’s not actually damaging at all, because the definition of strchr requires the int argument to be converted to char on the other end. I’m hard-pressed to think of a case in the standard C library where this would actually be a problem. This is part of the “maybe it will squeak by” comment I made in my first response on this thread.

But I’ve really only given this serious thought with respect to the standard C library, and we do consume much more than that on most systems. There are many higher-level platform-specific libraries that we talk to. Often, there’s also a higher-level character typedef to wrap characters in these cases, and so long as everyone uses that consistently, there’s no problem. But this would be the most mysterious aspect that the proposed change might have, and it’s where most of my heebie-jeebies are concentrated.

I’m most comfortable with cranking up the warnings on questionable uses of unadorned char. How many of the concerning cases that you’ve found can we cover with warnings that work within the ABI, without trying to twist it?

Mark

dan...@chromium.org

unread,
Jan 25, 2022, 5:25:47 PM1/25/22
to Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
On Tue, Jan 25, 2022 at 5:01 PM Mark Mentovai <ma...@chromium.org> wrote:
Hi, Dana. Sorry for the delay in responding.

The sort of inter-module problem I’m most concerned about is where a char crosses a module boundary into a system library that we don’t control. The surface area for system libraries beyond our control is quite large on most of our platforms.

A char passed directly across this boundary may pose a problem if it’s carried in a wider type. This can happen for ABI reasons, such as when a system library accepts a char argument that’s passed in a native-width register. These cases will typically appear innocuous, though, since the bit pattern in the least significant byte should still match everyone’s expectations. More concerning are cases like char *strchr(char const *, int). The fact that the “needle” argument is conveyed across the boundary in an int means that the input to that function is much more likely to be treated differently when called with an unsigned or signed char argument. Now, in the case of strchr, it’s not actually damaging at all, because the definition of strchr requires the int argument to be converted to char on the other end. I’m hard-pressed to think of a case in the standard C library where this would actually be a problem. This is part of the “maybe it will squeak by” comment I made in my first response on this thread.

But I’ve really only given this serious thought with respect to the standard C library, and we do consume much more than that on most systems. There are many higher-level platform-specific libraries that we talk to. Often, there’s also a higher-level character typedef to wrap characters in these cases, and so long as everyone uses that consistently, there’s no problem. But this would be the most mysterious aspect that the proposed change might have, and it’s where most of my heebie-jeebies are concentrated.

I’m most comfortable with cranking up the warnings on questionable uses of unadorned char. How many of the concerning cases that you’ve found can we cover with warnings that work within the ABI, without trying to twist it?

Thanks for the idea Mark, but I don't know that it would be a reasonable warning to add. An array dereference is intentionally allowed to be negative, so using a signed value is valid.

int p[4] = [1, 2, 3, 4];
int* p2 = p + 2;
char index = -1;
int beforep2 = p2[index];

Again, if the `index` is a `char` this is not portable code. You'd get a different runtime behaviour on unsigned-char platforms, with *(p2+255) instead *(p2-1). But on a given platform, it's not illegal... The only warning I could imagine is warning against indexing by a `char` at all, though that's also a valid thing to do.

So, should we make our code consistent across platforms instead? I think so. I also tried google-searching for examples of -funsigned-char causing a bug, but the most relevant thing was this very thread and associated bug.

I am less worried about introducing bugs inside binary libraries that we link in - as behaviour only changes there if we start passing in different values as a result of this change. As unsigned wrap-around is well-defined, even subtracting chars and passing them in works as before. Rather, bugs would more likely come from us interpreting chars differently, specifically with ordered comparisons. operator== will not change, only > and < change, and only in our own compiled code.

We can't point at a single example of a bug, with a fair bit of honest searching. And we know of another large codebase which does this, across all its platforms. So I think it's worth a try?

Cheers,
Dana

K. Moon

unread,
Jan 25, 2022, 7:17:00 PM1/25/22
to dan...@chromium.org, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
If I recall correctly, p2[-1] would be undefined behavior in standard C/C++, as for an array of size N, indices are only valid for 0 to N+1.

K. Moon

unread,
Jan 25, 2022, 7:17:56 PM1/25/22
to dan...@chromium.org, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
Correction: 0 to N (meant to number these 1 to N, then decided starting at 0 would be less confusing for the reader).

K. Moon

unread,
Jan 25, 2022, 7:23:47 PM1/25/22
to dan...@chromium.org, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
On closer reading, your example appears to be fine. Feel free to disregard, but it seems like the sort of thing there'd be a diagnostic for, as it seems like there are better alternatives in most cases.

dan...@chromium.org

unread,
Jan 26, 2022, 9:05:43 AM1/26/22
to K. Moon, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
On Tue, Jan 25, 2022 at 7:22 PM K. Moon <km...@chromium.org> wrote:
On closer reading, your example appears to be fine. Feel free to disregard, but it seems like the sort of thing there'd be a diagnostic for, as it seems like there are better alternatives in most cases.

The alternative would probably just be to use pointer arithmetic directly instead of operator[]. i.e. `p += negative_index; *p` which is exactly what p[negative_index] is defined to be.

K. Moon

unread,
Jan 26, 2022, 11:14:13 AM1/26/22
to danakj chromium, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
A couple of other options that come to mind:
  1. p[2 - 1] (index arithmetic instead of pointer arithmetic)
  2. *(p2 - 1)
Even though p2[-1] (and -1[p2]) are semantically the same as *(p2 - 1), I would consider them stylistically distinct. Clang's AST preserves the distinction, so it's possible to build diagnostics around this. There appear to be very few cases of the pattern "[-" in the code base, so it seems unusual anyway.

dan...@chromium.org

unread,
Jan 26, 2022, 11:16:24 AM1/26/22
to K. Moon, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
On Wed, Jan 26, 2022 at 11:12 AM K. Moon <km...@chromium.org> wrote:
A couple of other options that come to mind:
  1. p[2 - 1] (index arithmetic instead of pointer arithmetic)
  2. *(p2 - 1)
Even though p2[-1] (and -1[p2]) are semantically the same as *(p2 - 1), I would consider them stylistically distinct. Clang's AST preserves the distinction, so it's possible to build diagnostics around this. There appear to be very few cases of the pattern "[-" in the code base, so it seems unusual anyway.

Right, but to be clear the problem isnt indexing by a negative constant, it's indexing by a `char` that happens to sometimes be signed, and thus if it gets large enough (which happens for real-world encodings for strings), it becomes negative.

K. Moon

unread,
Jan 26, 2022, 11:23:11 AM1/26/22
to dan...@chromium.org, Mark Mentovai, Thomas Sepez, Nicolas Capens, chromium-dev, Nicolas Capens
But we're assuming -Wchar-subscripts, though, right?
Reply all
Reply to author
Forward
0 new messages