[llvm-dev] [RFC] Coding Standards: "prefer `int` for regular arithmetic, use `unsigned` only for bitmask and when you intend to rely on wrapping behavior."

Mehdi AMINI via llvm-dev

unread,

Jun 8, 2019, 1:18:07 PM6/8/19

to llvm-dev

Hi,

The LLVM coding style does not specify anything about the use of signed/unsigned integer, and the codebase is inconsistent (there is a majority of code that is using unsigned index in loops today though).

I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only for bitmask and when you intend to rely on wrapping behavior, see: https://reviews.llvm.org/D63049

A lot has been written about this, good references are [unsigned: A Guideline for Better Code] https://www.youtube.com/watch?v=wvtFGa6XJDU) and [Garbage In, Garbage Out: Arguing about Undefined Behavior...](https://www.youtube.com/watch?v=yG1OZ69H_-o), as well as this panel discussion:

- https://www.youtube.com/watch?v=Puio5dly9N8#t=12m12s

- https://www.youtube.com/watch?v=Puio5dly9N8#t=42m40s

Other coding guidelines already embrace this:

- http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-signed

- http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-unsigned

- https://google.github.io/styleguide/cppguide.html#Integer_Types

It is rare that overflowing (and wrapping) an unsigned integer won't trigger a program bug when the overflow was not intentionally handled. Using signed arithmetic means that you can actually trap on over/underflow and catch these bugs (when using fuzzing for instance).

Chandler explained this nicely in his CPPCon 2016 talk "Garbage In, Garbage Out: Arguing about Undefined Behavior...". I encourage to watch the full talk but here is one relevant example: https://youtu.be/yG1OZ69H_-o?t=2006 , and here he mentions how Google experimented with this internally: https://youtu.be/yG1OZ69H_-o?t=2249

Unsigned integer also have a discontinuity right to the left of zero. Suppose A, B and C are small positive integers close to zero, say all less than a hundred or so. Then given:

A + B > C

and knowing elementary school algebra, one can rewrite that as:

A > B - C

But C might be greater than B, and the subtraction would produce some huge number. This happens even when working with seemingly harmless numbers like A=2, B=2, and C=3.

--

Mehdi

Tim Northover via llvm-dev

unread,

Jun 8, 2019, 2:12:41 PM6/8/19

to Mehdi AMINI, llvm-dev

On Sat, 8 Jun 2019 at 18:18, Mehdi AMINI via llvm-dev
<llvm...@lists.llvm.org> wrote:
> The LLVM coding style does not specify anything about the use of signed/unsigned integer, and the codebase is inconsistent (there is a majority of code that is using unsigned index in loops today though).
>
> I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only for bitmask and when you intend to rely on wrapping behavior, see: https://reviews.llvm.org/D63049

I'd prefer us to have something neater than static_cast<int> for the
loop problem before we made that change. Perhaps add an ssize (or
equivalent) method to all of our internal data structures? They're a
lot more common than std::* containers.

But it's not something that would keep me up at night.

Cheers.

Tim.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Chandler Carruth via llvm-dev

unread,

Jun 9, 2019, 2:05:00 AM6/9/19

to Tim Northover, llvm-dev

On Sat, Jun 8, 2019 at 11:12 AM Tim Northover via llvm-dev <llvm...@lists.llvm.org> wrote:

On Sat, 8 Jun 2019 at 18:18, Mehdi AMINI via llvm-dev
<llvm...@lists.llvm.org> wrote:
> The LLVM coding style does not specify anything about the use of signed/unsigned integer, and the codebase is inconsistent (there is a majority of code that is using unsigned index in loops today though).
>
> I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only for bitmask and when you intend to rely on wrapping behavior, see: https://reviews.llvm.org/D63049

I'd prefer us to have something neater than static_cast<int> for the
loop problem before we made that change.

Not sure this is much of a problem in LLVM-land... We already encourage a loop style that avoids the classical problem:

```

for (int Index = 0, Size = MyVector.size(); Index < Size; ++Index) {

// ...

}

```

Alternatively, LLVM supports a form I like even more:

```

for (int Index : llvm::seq<int>(0, MyVector.size())) {

// ...

}

```

If passing the explicit type is an issue, we could fix that?

FWIW, I'd also be fine changing the return value ef `size()` for LLVM containers, and given the variability of standard library container performance, I'm generally a fan of LLVM code consistently using its own containers.

-Chnadler

Michael Kruse via llvm-dev

unread,

Jun 10, 2019, 12:25:54 PM6/10/19

to Tim Northover, llvm-dev

Am Sa., 8. Juni 2019 um 13:12 Uhr schrieb Tim Northover via llvm-dev
<llvm...@lists.llvm.org>:

> I'd prefer us to have something neater than static_cast<int> for the
> loop problem before we made that change. Perhaps add an ssize (or
> equivalent) method to all of our internal data structures? They're a
> lot more common than std::* containers.

+1

Since C++20 is also introducing ssize [1] members, this makes a lot of
sense to me. Using it would help avoiding an unsigned comparison as in

if (IndexOfInterestingElement >= Container.size())
...

to sneak in from the start.

Michael

[1] http://wg21.link/p1227r1

James Henderson via llvm-dev

unread,

Jun 10, 2019, 12:59:16 PM6/10/19

to Michael Kruse, llvm-dev

Maybe it's just because I work in code around the binary file formats almost exclusively, but unsigned (or more often uint64_t) is FAR more common than int everywhere I go. I don't have time right now to read up on the different links you provided, and I expect this is covered in them, but it also seems odd to me to use int in a loop when indexing in a container (something that can't always be avoided), given the types of size() etc.

Jake Ehrlich via llvm-dev

unread,

Jun 10, 2019, 1:16:07 PM6/10/19

to James Henderson, llvm-dev

I'm in the same situation James is in and thus have the same bias but I'll +1 that comment nevertheless. I think I prefer using size_t or the uintX_t types where applicable. Only when I need a signed value do I use one.

Reid Kleckner via llvm-dev

unread,

Jun 10, 2019, 1:30:15 PM6/10/19

to Jake Ehrlich, llvm-dev

+1 / me too

I don't think there's a problem here that needs to be fixed with a coding standard.

Aaron Ballman via llvm-dev

unread,

Jun 10, 2019, 1:32:13 PM6/10/19

to Jake Ehrlich, llvm-dev

On Mon, Jun 10, 2019, 7:16 PM Jake Ehrlich via llvm-dev <llvm...@lists.llvm.org> wrote:

I'm in the same situation James is in and thus have the same bias but I'll +1 that comment nevertheless. I think I prefer using size_t or the uintX_t types where applicable. Only when I need a signed value do I use one.

+1 to prefering unsigned types.

~Aaron

Mehdi AMINI via llvm-dev

unread,

Jun 10, 2019, 5:04:14 PM6/10/19

to Aaron Ballman, llvm-dev

On Mon, Jun 10, 2019 at 10:32 AM Aaron Ballman via llvm-dev <llvm...@lists.llvm.org> wrote:

On Mon, Jun 10, 2019, 7:16 PM Jake Ehrlich via llvm-dev <llvm...@lists.llvm.org> wrote:
I'm in the same situation James is in and thus have the same bias but I'll +1 that comment nevertheless. I think I prefer using size_t or the uintX_t types where applicable. Only when I need a signed value do I use one.

+1 to prefering unsigned types.

I'd appreciate if you guys could provide rational that address the extensive arguments and opinion provided in the C++ community that I tried to summarize in the link above.

Otherwise I don't know what to take out of unmotivated "+1".

--

Mehdi

Tim Northover via llvm-dev

unread,

Jun 10, 2019, 5:45:06 PM6/10/19

to Mehdi AMINI, llvm-dev, Aaron Ballman

On Mon, 10 Jun 2019 at 22:04, Mehdi AMINI via llvm-dev
<llvm...@lists.llvm.org> wrote:
> I'd appreciate if you guys could provide rational that address the extensive arguments and opinion provided in the C++ community that I tried to summarize in the link above.
> Otherwise I don't know what to take out of unmotivated "+1".

Yeah, I want to make it clear that I am in favour of "int" as a
default, even if preparatory changes are needed to get there. My main
reservation was over the loops

for (int i = 0; i != static_cast<int>(X.size()); ++i)

required to satisfy compilers when *::size returns unsigned. I dislike the

for (int i = 0, Size = X.size(); i != Size; ++i)

form that has traditionally been suggested because it's still horribly verbose.

I hadn't even heard of the llvm::seq construct Chandler mentioned
though, and I think it addresses all my worries. It looks really neat
to me, and I'm not bothered by the duplicated "int".

So my current position is that I'm all in favour of switching to int,
with llvm::seq as the suggested alternative in for loops.

Cheers.

Tim.

Aaron Ballman via llvm-dev

unread,

Jun 11, 2019, 3:37:17 AM6/11/19

to Mehdi AMINI, llvm-dev

Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later, but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway. I have yet to be convinced by the c++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rarionale.

~Aaron

Mehdi AMINI via llvm-dev

unread,

Jun 11, 2019, 3:51:40 AM6/11/19

to Aaron Ballman, llvm-dev

On Tue, Jun 11, 2019 at 12:37 AM Aaron Ballman <aaron....@gmail.com> wrote:

Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later,

There is no hurry right now!

but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway.

As far as I can tell, the same arguments applies to "unsigned" I believe: it does not represent the full extent of size_t on every platform either! Yet we're still using it as an index, including to index into objects/array.

I have yet to be convinced by the c++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rarionale.

Note that this is not about "switching" anything: there is no standard in LLVM right now (as far as I know) and the codebase is inconsistent (I am using `int` in general for a while I believe).

Aaron Ballman via llvm-dev

unread,

Jun 11, 2019, 4:01:44 AM6/11/19

to Mehdi AMINI, llvm-dev

On Tue, Jun 11, 2019, 9:51 AM Mehdi AMINI <joke...@gmail.com> wrote:

On Tue, Jun 11, 2019 at 12:37 AM Aaron Ballman <aaron....@gmail.com> wrote:
Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later,

There is no hurry right now!

but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway.

As far as I can tell, the same arguments applies to "unsigned" I believe: it does not represent the full extent of size_t on every platform either! Yet we're still using it as an index, including to index into objects/array.

I have yet to be convinced by the c++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rarionale.

Note that this is not about "switching" anything: there is no standard in LLVM right now (as far as I know) and the codebase is inconsistent (I am using `int` in general for a while I believe).

Glad to hear there's no plan to change the codebase, but then I fail to see the purpose of the coding rule. Won't it boil down to "use the correct datatype for what you're doing?"

What I mostly want to avoid is code that uses signed values for inherently unsigned operations like comparing against the size of an object. Then we start wanting ssize() functions or using casts for confused reasons.

~Aaron

Mehdi AMINI via llvm-dev

unread,

Jun 11, 2019, 4:10:50 AM6/11/19

to Aaron Ballman, llvm-dev

On Tue, Jun 11, 2019 at 1:01 AM Aaron Ballman <aaron....@gmail.com> wrote:

On Tue, Jun 11, 2019, 9:51 AM Mehdi AMINI <joke...@gmail.com> wrote:

On Tue, Jun 11, 2019 at 12:37 AM Aaron Ballman <aaron....@gmail.com> wrote:
Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later,

There is no hurry right now!

but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway.

As far as I can tell, the same arguments applies to "unsigned" I believe: it does not represent the full extent of size_t on every platform either! Yet we're still using it as an index, including to index into objects/array.

I have yet to be convinced by the c++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rarionale.

Note that this is not about "switching" anything: there is no standard in LLVM right now (as far as I know) and the codebase is inconsistent (I am using `int` in general for a while I believe).

Glad to hear there's no plan to change the codebase

Just like every guideline, new code should follow unless there is a good reason not to (the wording is "prefer").

, but then I fail to see the purpose of the coding rule. Won't it boil down to "use the correct datatype for what you're doing?"

How do you define what is the correct datatype?

What I mostly want to avoid is code that uses signed values for inherently unsigned operations like comparing against the size of an object.

I believe `if (index < v.size() - 1)` can be very surprising when v.empty().

--

Mehdi

Zachary Turner via llvm-dev

unread,

Jun 11, 2019, 12:45:30 PM6/11/19

to Mehdi AMINI, llvm-dev, Aaron Ballman

I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".

On the other hand, a large number of people need to work in this codebase every day, and multiplied over the same time period, my belief is that having the code make sense and be simple has a higher net value.

It simply doesn't make sense (conceptually) to use a signed type for domains that are inherently unsigned, like the size of an object. IMO, we should revisit this if and when the deficiencies in the C++ Standard are addressed.

David Greene via llvm-dev

unread,

Jun 11, 2019, 12:59:14 PM6/11/19

to James Henderson via llvm-dev

James Henderson via llvm-dev <llvm...@lists.llvm.org> writes:

> Maybe it's just because I work in code around the binary file formats
> almost exclusively, but unsigned (or more often uint64_t) is FAR more
> common than int everywhere I go. I don't have time right now to read
> up on the different links you provided, and I expect this is covered
> in them, but it also seems odd to me to use int in a loop when
> indexing in a container (something that can't always be avoided),
> given the types of size() etc.

The reason to use int as an index in loops is for its algebraic
properties, which allows some optimizations (vectorization, for
example), where unsigned would cause problems.

Basically, the optimizer can assume overflow is undefined behavior and
optimize accordingly.

+10000 for preferring signed unless there's a good reason for unsigned.

-David

Zachary Turner via llvm-dev

unread,

Jun 11, 2019, 3:00:15 PM6/11/19

to David Greene, James Henderson via llvm-dev

I would argue that unless you've benchmarked a piece of code found this to be a problem, then writing code with the express purpose of having it be more optimizable at the expense of readability and ease of understanding is a classic example of premature optimization. And if you have, then you can used signed integers in that specific piece of code, and leave a comment about why it should not be changed.

Mehdi AMINI via llvm-dev

unread,

Jun 11, 2019, 3:25:04 PM6/11/19

to Zachary Turner, David Greene, James Henderson via llvm-dev

I agree that readability, maintainability, and ability to debug/find issues are key.

I haven't found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned *increases* my mental load when reading code.

--

Mehdi

Michael Kruse via llvm-dev

unread,

Jun 11, 2019, 3:26:20 PM6/11/19

to Zachary Turner, llvm-dev, Aaron Ballman

Am Di., 11. Juni 2019 um 11:45 Uhr schrieb Zachary Turner via llvm-dev
<llvm...@lists.llvm.org>:

>
> I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".

Doesn't it make it already worth it?

> On the other hand, a large number of people need to work in this codebase every day, and multiplied over the same time period, my belief is that having the code make sense and be simple has a higher net value.
>
> It simply doesn't make sense (conceptually) to use a signed type for domains that are inherently unsigned, like the size of an object. IMO, we should revisit this if and when the deficiencies in the C++ Standard are addressed.

The underlying problem is that the C family of languages mixes two
orthogonal properties: value range and overflow behavior. There is no
unsigned type with undefined wraparound. So the question becomes: What
property is more important to reflect? Do we want catch unintended
wraparound behavior using a sanitizer/make optimizations based on it?
Do we need the additional range provided by an unsigned type? As
Chandler says in one of his talks linked earlier: "If you need more
bits, use more bits" (such as int64_t).

Michael

Zachary Turner via llvm-dev

unread,

Jun 11, 2019, 3:59:55 PM6/11/19

to Michael Kruse, llvm-dev, Aaron Ballman

On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:

I agree that readability, maintainability, and ability to debug/find issues are key.
I haven't found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned *increases* my mental load when reading code.

I'm on the other end. I'm always reading the code wondering "is this going to warn?" "Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value i'm storing it in (signed), so an overflow could happen even if there were no error. What then?"

On Tue, Jun 11, 2019 at 12:26 PM Michael Kruse <llv...@meinersbur.de> wrote:

Am Di., 11. Juni 2019 um 11:45 Uhr schrieb Zachary Turner via llvm-dev
<llvm...@lists.llvm.org>:
>
> I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".

Doesn't it make it already worth it?

vector.size() returns a size_t, which on 64-bit platforms can represent types values larger than those that can fit into an int64_t. So to turn your argument around, since it's theoretically possible to have a vector with more items than an int64_t can represent, isn't it already worth it to use size_t, which is an unsigned type?

Andrew Kelley via llvm-dev

unread,

Jun 11, 2019, 4:10:40 PM6/11/19

to llvm...@lists.llvm.org

On 6/11/19 3:59 PM, Zachary Turner via llvm-dev wrote:
> On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com
> <mailto:joke...@gmail.com>> wrote:
>
> I agree that readability, maintainability, and ability to debug/find
> issues are key.
> I haven't found myself in a situation where unsigned was helping my
> readability: on the opposite actually I am always wondering where is
> the expecting wrap-around behavior and that is one more thing I have
> to keep in mind when I read code that manipulate unsigned. So YMMV
> but using unsigned *increases* my mental load when reading code.
>
> I'm on the other end. I'm always reading the code wondering "is this
> going to warn?" "Why could a container ever have a negative number of
> elements?" "The maximum value representable by the return type
> (unsigned) is larger than that of the value i'm storing it in (signed),
> so an overflow could happen even if there were no error. What then?"

This is why the Zig frontend has the arithmetic operators +,/,*,- assert
that overflow does not occur for *both* signed and unsigned integers.
The wrapping operators +%,/%,*%,-% are available when one needs a
guaranteed twos complement wraparound.

This gives the best of both worlds, where you can use proper types
matching the range of values, yet have the desired performance
optimizations (and sanitation checks) that only signed integers have in
C/C++.

In fact, in Zig, unsigned integers are actually more optimizable than
signed integers! https://godbolt.org/z/QICZyy

Andrew

signature.asc

Jake Ehrlich via llvm-dev

unread,

Jun 11, 2019, 4:23:04 PM6/11/19

to Zachary Turner, llvm-dev, Aaron Ballman

This whole debate seems kind of odd to me. I don't know that cases where it isn't clear what type to use come up that often. If a value can truly never be negative you should use an unsigned value. If a value can be negative, you should use a signed value. Anecdotal evidence in my case is that the vast majority of values are unsigned by this rule.

Is there a reason to use a signed value when you know a value will never be negative? Trapping on overflow doesn't seem motivated to me to me since I'm not aware of anything that does that. UBSan also checks for overflow in unsigned types by default as well so you can still check for that issue.

I'm not going to go watch the YouTube videos but the ES.102 lacks merit. On systems I work with the bug they mention wouldn't be caught the way they say. They also use subtraction (a rare operation IMO) as a motivating example and arbitrarily declare large values to be less obvious bugs than negative values without evidence to this.

ES.101 is valid but is not a reason to prefer signed to unsigned values in any context. I've also run into a number of instances of signed shifts being used and the interplay between negation and bitwise operators being used. Not that those are common but it's just to say that exceptions exist even to that rule.

Michael Spencer via llvm-dev

unread,

Jun 11, 2019, 4:44:44 PM6/11/19

to Zachary Turner, llvm-dev, Aaron Ballman

sequence containers (like vector) cannot hold more items than a ptrdiff_t (a signed type) can represent due to requirements based on std::distance.

I don't see a use for unsigned types outside of bit manipulation.

- Michael Spencer

Quentin Colombet via llvm-dev

unread,

Jun 11, 2019, 5:33:25 PM6/11/19

to Zachary Turner, llvm-dev, Aaron Ballman

On Jun 11, 2019, at 12:59 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:

On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:
I agree that readability, maintainability, and ability to debug/find issues are key.
I haven't found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned *increases* my mental load when reading code.

I'm on the other end. I'm always reading the code wondering "is this going to warn?" "Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value i'm storing it in (signed), so an overflow could happen even if there were no error. What then?”

+1

On Tue, Jun 11, 2019 at 12:26 PM Michael Kruse <llv...@meinersbur.de> wrote:
Am Di., 11. Juni 2019 um 11:45 Uhr schrieb Zachary Turner via llvm-dev
<llvm...@lists.llvm.org>:
>
> I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".

Doesn't it make it already worth it?
vector.size() returns a size_t, which on 64-bit platforms can represent types values larger than those that can fit into an int64_t. So to turn your argument around, since it's theoretically possible to have a vector with more items than an int64_t can represent, isn't it already worth it to use size_t, which is an unsigned type?

Aaron Ballman via llvm-dev

unread,

Jun 12, 2019, 2:55:20 AM6/12/19

to Zachary Turner, llvm-dev

On Tue, Jun 11, 2019, 9:59 PM Zachary Turner <ztu...@roblox.com> wrote:

On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:
I agree that readability, maintainability, and ability to debug/find issues are key.
I haven't found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned *increases* my mental load when reading code.
I'm on the other end. I'm always reading the code wondering "is this going to warn?" "Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value i'm storing it in (signed), so an overflow could happen even if there were no error. What then?"

Strong +1 to this.

~Aaron

Renato Golin via llvm-dev

unread,

Jun 12, 2019, 4:01:51 AM6/12/19

to Aaron Ballman, LLVM Dev

+1 to both points here.

Hubert Tong via llvm-dev

unread,

Jun 12, 2019, 10:18:45 AM6/12/19

to Renato Golin, LLVM Dev, Aaron Ballman

On Wed, Jun 12, 2019 at 4:01 AM Renato Golin via llvm-dev <llvm...@lists.llvm.org> wrote:

+1 to both points here.

On Wed, 12 Jun 2019, 07:55 Aaron Ballman via llvm-dev, <llvm...@lists.llvm.org> wrote:

On Tue, Jun 11, 2019, 9:59 PM Zachary Turner <ztu...@roblox.com> wrote:
On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:
I agree that readability, maintainability, and ability to debug/find issues are key.
I haven't found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned *increases* my mental load when reading code.
I'm on the other end. I'm always reading the code wondering "is this going to warn?"

Indeed, using comparison against 0u from the "obvious context" has saved me from warnings.

"Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value i'm storing it in (signed), so an overflow could happen even if there were no error. What then?"

Strong +1 to this.

clang::ASTContext::toCharUnitsFromBits(int64_t) makes me cringe. It tends to be called with uint64_t values like the Width field in the instance of clang::TypeInfo.

Krzysztof Parzyszek via llvm-dev

unread,

Jun 12, 2019, 11:05:11 AM6/12/19

to Michael Kruse, Zachary Turner, llvm...@lists.llvm.org, Aaron Ballman

> The underlying problem is that the C family of languages mixes two orthogonal properties: value range and overflow behavior. There is no unsigned type with undefined wraparound. So the question becomes: What property is more important to reflect? Do we want catch unintended wraparound behavior using a sanitizer/make optimizations based on it?

That's a valid argument, but I suspect that the vast majority of loops using an unsigned induction variable, start at an explicit 0 and go up by 1. Such loops cannot overflow, so there is nothing to catch there.

--
Krzysztof Parzyszek kpar...@quicinc.com LLVM compiler development

Mehdi AMINI via llvm-dev

unread,

Jun 12, 2019, 11:50:54 AM6/12/19

to Krzysztof Parzyszek, llvm...@lists.llvm.org, Aaron Ballman

On Wed, Jun 12, 2019 at 8:05 AM Krzysztof Parzyszek via llvm-dev <llvm...@lists.llvm.org> wrote:

> The underlying problem is that the C family of languages mixes two orthogonal properties: value range and overflow behavior. There is no unsigned type with undefined wraparound. So the question becomes: What property is more important to reflect? Do we want catch unintended wraparound behavior using a sanitizer/make optimizations based on it?

That's a valid argument, but I suspect that the vast majority of loops using an unsigned induction variable, start at an explicit 0 and go up by 1. Such loops cannot overflow, so there is nothing to catch there.

This isn't entirely true: some comparison can exist in the code `if (idx - 1 < N) { ....}` which does not give the same result when `idx` is unsigned and zero.

Just like Michael mentioned: there are two aspects of unsigned, and the wrap-around behavior is the one causing bugs. After being bitten by unsigned wrap-around twice a year, and because debugging these isn't fun (and correctness is important and hard), my code reading (and others, as I sourced initially) has adjusted to be suspicious and careful around unsigned: if I see unsigned I need to check more invariants when I read the code.

Obviously this is a tradeoff: losing the "this can't be negative" property enforced in the type system is sad, but I haven't encountered any bugs or issue caused by using int in place of unsigned (while the opposite is true).

(any bug caused "negative indexing of a container" would be as much of a bug with unsigned, please watch Chandler's talk).

--

Mehdi

David Greene via llvm-dev

unread,

Jun 12, 2019, 12:44:33 PM6/12/19

to Mehdi AMINI, James Henderson via llvm-dev

Mehdi AMINI <joke...@gmail.com> writes:

> I agree that readability, maintainability, and ability to debug/find
> issues are key. I haven't found myself in a situation where unsigned
> was helping my readability: on the opposite actually I am always
> wondering where is the expecting wrap-around behavior and that is one
> more thing I have to keep in mind when I read code that manipulate
> unsigned. So YMMV but using unsigned *increases* my mental load when
> reading code.

My experience is the same. "unsigned" to me gives no useful information
and actually requires more mental thought to look for pitfalls.

David Greene via llvm-dev

unread,

Jun 12, 2019, 12:54:16 PM6/12/19

to Jake Ehrlich via llvm-dev, Aaron Ballman

Jake Ehrlich via llvm-dev <llvm...@lists.llvm.org> writes:

> This whole debate seems kind of odd to me. I don't know that cases
> where it isn't clear what type to use come up that often. If a value
> can truly never be negative you should use an unsigned value. If a
> value can be negative, you should use a signed value. Anecdotal
> evidence in my case is that the vast majority of values are unsigned
> by this rule.
>
> Is there a reason to use a signed value when you know a value will
> never be negative?

Since this thread is really long, I want to make sure to address this
specific point even though it's been covered elsewhere.

One reason to prefer signed is optimization. The compiler simply cannot
optimize code with unsigned as well as it can with signed, because of
unsigned's breaking of standard integer algebra. This affects
everything from simple expression simplification to vectorization and
parallelization. Using unsigned can have serious performance
consequences. Because of the nature of the work I do, I see it all the
time.

Some have said this is premature optimization but to me there is no
additional mental load with signed. In fact it's less for me than
unsigned because of the mental gymnastics I have to go through to verify
code that uses unsigned.

-David

David Greene via llvm-dev

unread,

Jun 12, 2019, 12:57:05 PM6/12/19

to Krzysztof Parzyszek via llvm-dev, Aaron Ballman

Krzysztof Parzyszek via llvm-dev <llvm...@lists.llvm.org> writes:

>> The underlying problem is that the C family of languages mixes two
>> orthogonal properties: value range and overflow behavior. There is
>> no unsigned type with undefined wraparound. So the question becomes:
>> What property is more important to reflect? Do we want catch
>> unintended wraparound behavior using a sanitizer/make optimizations
>> based on it?
>
> That's a valid argument, but I suspect that the vast majority of loops
> using an unsigned induction variable, start at an explicit 0 and go up
> by 1. Such loops cannot overflow, so there is nothing to catch there.

I have in fact seen loops that rely on overflow of induction variables,
as crazy as that sounds.

But again, for loops it's not (directly) about overflow as much as it is
about optimization. You're leaving a lot of potential performance on
the floor with unsigned, precisely because compilers have to guard
against overflowing induction variables.

-David

Quentin Colombet via llvm-dev

unread,

Jun 12, 2019, 1:12:28 PM6/12/19

to David Greene, Jake Ehrlich via llvm-dev, Aaron Ballman

> On Jun 12, 2019, at 9:54 AM, David Greene via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Jake Ehrlich via llvm-dev <llvm...@lists.llvm.org> writes:
>
>> This whole debate seems kind of odd to me. I don't know that cases
>> where it isn't clear what type to use come up that often. If a value
>> can truly never be negative you should use an unsigned value. If a
>> value can be negative, you should use a signed value. Anecdotal
>> evidence in my case is that the vast majority of values are unsigned
>> by this rule.
>>
>> Is there a reason to use a signed value when you know a value will
>> never be negative?
>
> Since this thread is really long, I want to make sure to address this
> specific point even though it's been covered elsewhere.
>
> One reason to prefer signed is optimization.

FWIW. If you care about optimization, signed size_t is probably the way to go in general.

Int type will incur a sign extension for any address accesses on 64-bit platform (32-bit to 64-bit extension).
Unsigned on the other hand creates zero extension which are most of the time free.

Thus, unsigned is sometimes better for codegen than signed, in particular in a compiler code base where vectorization is not really a thing.

Anyway, it seems to me that there are enough people on both sides of the fence that this shouldn’t be in the coding standard.

My 2c.

Quentin

Stephen Checkoway via llvm-dev

unread,

Jun 12, 2019, 1:29:42 PM6/12/19

to Mehdi AMINI, llvm...@lists.llvm.org

Hi Mehdi,

Drive by comment, please feel free to disregard.

> On Jun 12, 2019, at 11:50, Mehdi AMINI via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> (any bug caused "negative indexing of a container" would be as much of a bug with unsigned,

Signed comparison with a maximum size or number of things is a classic software vulnerability. E.g.,

#define MAX_WIDGETS 200
struct widget widgets[MAX_WIDGETS];
...
if (count >= MAX_WIDGETS)
return -1;
/* now do something with widgets[count] */

With unsigned integers, there would be no bug. There are simple bugs using either signed or unsigned integers so I don't find that dispositive.

> please watch Chandler's talk).

It was a great talk! I'm glad you pointed it out, but I'm not convinced any of the examples he discusses impact loop counters, which seems to be one of the main motivations for this. Let's go through them.

1. Shifting by more than the bit width <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=24m40s> doesn't involve signed values.

2. Signed left shift vs. multiplication <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=27m48s> isn't relevant for loops.

3. The example from Dietz et al. <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=33m15s> boils down to computing 16u + (n-1) * 8u. As far as I can tell, the signedness of the quantities involved does not impact the computation. See <https://godbolt.org/z/HVA6xD>.

4. Using a 32-bit, unsigned integer to index an array <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=39m22s>. Here, using a signed 32-bit integer lets the compiler promote it to a signed 64-bit integer and the compiler can produce slightly better code. Using an appropriately sized integer for the index, namely size_t, has the same effect except with unusual ABIs like x32*. ssize_t would also be fine (and better than size_t for x32).

5. C++ doesn't have a supported arithmetic right shift <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=51m49s>. Great point, but not relevant here.

6. UB twitter's favorite example: memcpy(0, 0, 0) <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=55m0s>. Again, not relevant here.

Up thread, Chandler suggests
> for (int Index = 0, Size = MyVector.size(); Index < Size; ++Index)
or
> for (int Index : llvm::seq<int>(0, MyVector.size()))

I believe int can produce better code than unsigned int here, but neither seem correct. Surely both should be size_t (again, ssize_t would be fine). I'm not sure if LLVM containers ever contain more than INT_MAX elements. If not, then int is fine but since it'll produce the same code as using a correctly sized integer, I don't see the point. (As an aside, the talk by Jon Kalb you linked to <https://www.youtube.com/watch?v=wvtFGa6XJDU&t=4m45s> says that if you need one extra bit, you'll probably need two extra bits. "A sneeze puts you over." But that's another 2 billion more entries in your data structure so I'm not sure I buy that argument.)

One of the guides you linked to <https://google.github.io/styleguide/cppguide.html#Integer_Types>, says "When appropriate, you are welcome to use standard types like size_t and ptrdiff_t." Looping over structures from 0 to some size expressed as a size_t seems like an appropriate time to use size_t to me.

I haven't yet heard or read a compelling argument for using an integer type smaller than the register width for arithmetic, which int usually is these days. I can't think of a good reason to use unsigned ever though.

* Example provided by Rich Felker <https://twitter.com/RichFelker/status/1138166935575826433>.

Cheers,

Steve

--
Stephen Checkoway

Michael Kruse via llvm-dev

unread,

Jun 12, 2019, 6:08:44 PM6/12/19

to Zachary Turner, llvm-dev, Aaron Ballman

Am Di., 11. Juni 2019 um 14:59 Uhr schrieb Zachary Turner <ztu...@roblox.com>:
>> > I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".
>>
>> Doesn't it make it already worth it?
>
> vector.size() returns a size_t, which on 64-bit platforms can represent types values larger than those that can fit into an int64_t. So to turn your argument around, since it's theoretically possible to have a vector with more items than an int64_t can represent, isn't it already worth it to use size_t, which is an unsigned type?

According to your definition above, this is not a "real problem".
64-bit CPUs do not support using the full 64-bit virtual address
space. Even if one does in the future, PTRDIFF_MAX (and as a
consequence std::distance that a std::vector must support) effectively
limits the max allocation size that can be worked with without
undefined behavior.

To summarize, with signed sizes we could fix a few production-bugs a
year, while unsigned sizes only have a theoretical advantage in
production.

Chandler Carruth via llvm-dev

unread,

Jun 12, 2019, 9:21:22 PM6/12/19

to Renato Golin, LLVM Dev, Aaron Ballman

FWIW, the talks linked by Mehdi really do talk about these things and why I don't think the really are the correct trade-off.

Even if you imagine an unsigned type that doesn't allow wrapping, I think this is a really bad type. The problem is that you have made the most common value of the type (zero in every study I'm aware of) be a boundary condition. Today, it wraps to a huge value if you cross it. Afterward, it would trap. Both are super surprising.

Another way of looking at the same lens: do you subtract these values? Should `a + (b - c)` be the same as `(a + b) - c`? You either need a signed type or wrapping to have reasonable answers here. And if you solve this with wrapping, then it makes any attempt to write assertions or other checks in the same type system very difficult. The fact that you write an assert to check for "did I accidentally go past zero?" by conjuring some "it's probably too large" value and then comparing if it is *greater* than that is ... extraordinarily confusing.

Meanwhile, with signed types, it is quite easy to write asserts that check for non-negative values in the correct places. They are easy to read and produce easily understood errors. The boundary conditions are uncommon.

Even on the C++ standards committee, there is remarkably strong consensus that in the *absence* of unsigned types coming back from `.size()` methods and such, we should be using signed types for the reasons above.

The fact that we have unsigned `size_t` in a bunch of places is, IMO, a concern and it is important to have good ways of avoiding warnings. But I think we have so very many ways that don't require us to just use unsigned types everywhere and deal with the above issues:

- Change the return types of our containers `size()` methods.

- Add a `ssize()` method. (This is the direction the committee is moving AFAICT, but they are constrained by a powerful desire to break zero code, where as LLVM's containers have much more API freedom.)

- Use idioms like the one I suggested with `llvm::seq`.

Any or all of these seem significantly preferable to the readability concerns I outline above, at least to me. This is why I am still *strongly* in favor of signed types and assertions around value at known points where the value should obey that assertion.

-Chandler

Stefan Teleman via llvm-dev

unread,

Jun 13, 2019, 12:59:07 PM6/13/19

to Chandler Carruth, LLVM Dev, Aaron Ballman

On Wed, Jun 12, 2019 at 9:21 PM Chandler Carruth via llvm-dev
<llvm...@lists.llvm.org> wrote:
>
> FWIW, the talks linked by Mehdi really do talk about these things and why I don't think the really are the correct trade-off.
>
> Even if you imagine an unsigned type that doesn't allow wrapping, I think this is a really bad type. The problem is that you have made the most common value of the type (zero in every study I'm aware of) be a boundary condition. Today, it wraps to a huge value if you cross it. Afterward, it would trap. Both are super surprising.

[ ... ]

> Any or all of these seem significantly preferable to the readability concerns I outline above, at least to me. This is why I am still *strongly* in favor of signed types and assertions around value at known points where the value should obey that assertion.

Have there been any documented cases in LLVM where a for() loop with
an unsigned int induction variable has wrapped around to 0? In other
words, is there any container - either LLVM or C++ Standard Library -
that ended up storing more than UINT_MAX or ULLONG_MAX elements?

I'm looking at these values in <limits.h>:

#define UINT_MAX 4294967295U
#define ULLONG_MAX 18446744073709551615ULL

and I am having a really hard time imagining a llvm::SmallVector<Foo>
storing 18446744073709551615ULL + 1ULL Foo elements. But i'm happy to
be proven wrong.

As far as the C++ Standard Library is concerned, all the containers
implement std::<container-type>::max_size(), which is of type
std::size_t and is always - and intentonally - smaller than either
UINT_MAX or ULLONG_MAX.

So I'm not even sure how an unsigned induction variable testing for
std::vector<Foo>::size() or a std::string::size() - for example -
would ever end up wrapping around to 0. The container will blow up
when its number of elements attempts to exceed its max_size().

Plus, it's not that hard to write

std::vector<Foo> FooVector;
for (unsigned I = 0; I < ${SOMETHING} && I < FooVector.max_size(); ++I) {
}

if unsigned's wrap-around is a material concern.

Maybe the compiler should just warn when sizeof(unsigned) <
sizeof(std::container<Foo>::max_size()). I think that would be enough
of a hint, and in the vast majority of cases it will be moot anyway.

Just my 0.02.

--
Stefan Teleman
stefan....@gmail.com

Jameson Nash via llvm-dev

unread,

Jun 13, 2019, 1:17:52 PM6/13/19

to Stefan Teleman, LLVM Dev, Aaron Ballman

> Should `a + (b - c)` be the same as `(a + b) - c`? You either need a signed type or wrapping to have reasonable answers here

Depending on what "reasonable" means here, only wrapping (unsigned in C) gets you this commutative property. For a signed value with C, it's possible for one of these to be undefined behavior, while the other returns a reasonable value. For instance, `a == b == c == std:: numeric_limits<typeof(a)>::min()` is probably unusual as a value, but could be used as a sentinel (perhaps to represent an infinite or empty set). Of course, the unsigned result might just be nonsense.

Anyways, I don't have a strong opinion either way, since I think they both can have surprises.

One other occasional benefit to using unsigned that can be surprising is that power-of-two division is slightly cheaper (since it doesn't need to handle negative numbers):

(ssize_t)x / 2

shrq $63, %rax

leaq (%rax,%rdi), %rax

sarq %rax

(size_t)x / 2

shrq %rdi

> is there any container

I'd posit that UINT_MAX is uncommon, but pretty easy to exceed (although it needs a 64-bit machine to represent it). For example, anything that might need to handle the return value of `MemoryBuffer::getFile` could come across a file that's larger than 2GB.

Adrien Guinet via llvm-dev

unread,

Jun 18, 2019, 7:49:19 AM6/18/19

to llvm...@lists.llvm.org

My 2 cents on this.

I watched Chandler's talk, and it looks like there are two statements that are a bit
contradictory one to another:

https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2820s "it's fairly easy to get enough bits
w/o using the sign bits"
versus
https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2846s "using size_t takes more space in data
structures"

Why not using size_t for indexes/length, while storing uint32_t integers and cast them
back to size_t when necessary (i.e. for space reasons)?

About checking for overflows, let's take the various examples here:
https://godbolt.org/z/TIGiVe, which are various ways to load a byte at index "n+10" where
n is user-controlled. In this case, we need to check that "n+10" does not overflow.
The first thing is that, IMHO, overflow checks are tricky to get right for both the signed
and unsigned case (and has led to numerous vulnerabilities). But that's a personal
opinion. If we talk about code size (and performance), the load3 function confirms what
Chandler showed in his talk (using uint32_t can be a performance issue on some 64-bit
systems, do to the enforced wrapping that has to be done on 32bits integers). The other
load* functions all emit the intended "mov eax, DWORD PTR [rdi+40+rsi*4]". Notable
differences arise in load2, which needs to sign extend esi to rsi. In load1, there is also
an extra instruction to set the value of (2**63-12).

One conclusion is that, at least on x86-64, using size_t for these kind of overflow checks
is the most efficient. But I might miss something / do something wrong in this analysis,
and I'd like another POV on this!

One argument for the usage of signed integers (actually integer types that don't allow
wrapping) is indeed the ability to automatically insert these overflow checks at compile time.

On 6/8/19 7:17 PM, Mehdi AMINI via llvm-dev wrote:
> Hi,
>
> The LLVM coding style does not specify anything about the use of signed/unsigned integer,
> and the codebase is inconsistent (there is a majority of code that is using unsigned index
> in loops today though).
>
> I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only
> for bitmask and when you intend to rely on wrapping behavior,
> see: https://reviews.llvm.org/D63049
>
> A lot has been written about this, good references are [unsigned: A Guideline for Better
> Code] https://www.youtube.com/watch?v=wvtFGa6XJDU) and [Garbage In, Garbage Out: Arguing
> about Undefined Behavior...](https://www.youtube.com/watch?v=yG1OZ69H_-o), as well as this
> panel discussion:
> - https://www.youtube.com/watch?v=Puio5dly9N8#t=12m12s
> - https://www.youtube.com/watch?v=Puio5dly9N8#t=42m40s
>
> Other coding guidelines already embrace this:
>
> - http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-signed
> - http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-unsigned
> - https://google.github.io/styleguide/cppguide.html#Integer_Types
>
> It is rare that overflowing (and wrapping) an unsigned integer won't trigger a program bug
> when the overflow was not intentionally handled. Using signed arithmetic means that you
> can actually trap on over/underflow and catch these bugs (when using fuzzing for instance).
>
> Chandler explained this nicely in his CPPCon 2016 talk "Garbage In, Garbage Out: Arguing
> about Undefined Behavior...". I encourage to watch the full talk but here is one relevant
> example: https://youtu.be/yG1OZ69H_-o?t=2006 , and here he mentions how Google
> experimented with this internally: https://youtu.be/yG1OZ69H_-o?t=2249
>
> Unsigned integer also have a discontinuity right to the left of zero. Suppose A, B and C
> are small positive integers close to zero, say all less than a hundred or so. Then given:
> A + B > C
> and knowing elementary school algebra, one can rewrite that as:
> A > B - C
> But C might be greater than B, and the subtraction would produce some huge number. This
> happens even when working with seemingly harmless numbers like A=2, B=2, and C=3.
>
> --
> Mehdi

Reply all

Reply to author

Forward