I'd prefer us to have something neater than static_cast<int> for the
loop problem before we make that change. Perhaps add an ssize (or
equivalent) method to all of our internal data structures? They're a
lot more common than std::* containers.
But it's not something that would keep me up at night.
Cheers.
Tim.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
On Sat, 8 Jun 2019 at 18:18, Mehdi AMINI via llvm-dev
<llvm...@lists.llvm.org> wrote:
> The LLVM coding style does not specify anything about the use of signed/unsigned integer, and the codebase is inconsistent (there is a majority of code that is using unsigned index in loops today though).
>
> I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only for bitmask and when you intend to rely on wrapping behavior, see: https://reviews.llvm.org/D63049
> I'd prefer us to have something neater than static_cast<int> for the
> loop problem before we make that change.
+1
Since C++20 is also introducing ssize [1] members, this makes a lot of
sense to me. Using it would help prevent an unsigned comparison such as

if (IndexOfInterestingElement >= Container.size())
  ...

from sneaking in from the start.
Michael
I'm in the same situation James is in and thus have the same bias but I'll +1 that comment nevertheless. I think I prefer using size_t or the uintX_t types where applicable. Only when I need a signed value do I use one.
On Mon, Jun 10, 2019, 7:16 PM Jake Ehrlich via llvm-dev <llvm...@lists.llvm.org> wrote:
> I'm in the same situation James is in and thus have the same bias but I'll +1 that comment nevertheless. I think I prefer using size_t or the uintX_t types where applicable. Only when I need a signed value do I use one.
+1 to preferring unsigned types.
Yeah, I want to make it clear that I am in favour of "int" as a
default, even if preparatory changes are needed to get there. My main
reservation was over the loops
for (int i = 0; i != static_cast<int>(X.size()); ++i)
required to satisfy compilers when *::size returns unsigned. I dislike the
for (int i = 0, Size = X.size(); i != Size; ++i)
form that has traditionally been suggested because it's still horribly verbose.
I hadn't even heard of the llvm::seq construct Chandler mentioned
though, and I think it addresses all my worries. It looks really neat
to me, and I'm not bothered by the duplicated "int".
So my current position is that I'm all in favour of switching to int,
with llvm::seq as the suggested alternative in for loops.
Cheers.
Tim.
Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later,
but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway.
I have yet to be convinced by the C++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rationale.
On Tue, Jun 11, 2019 at 12:37 AM Aaron Ballman <aaron....@gmail.com> wrote:
> Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later,
There is no hurry right now!
> but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway.
As far as I can tell, the same argument applies to "unsigned" I believe: it does not represent the full extent of size_t on every platform either! Yet we're still using it as an index, including to index into objects/arrays.
> I have yet to be convinced by the C++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rationale.
Note that this is not about "switching" anything: there is no standard in LLVM right now (as far as I know) and the codebase is inconsistent (I have been using `int` in general for a while, I believe).
On Tue, Jun 11, 2019, 9:51 AM Mehdi AMINI <joke...@gmail.com> wrote:
> On Tue, Jun 11, 2019 at 12:37 AM Aaron Ballman <aaron....@gmail.com> wrote:
>> Sorry for the brevity, I am currently travelling and responding on a cell phone. I won't be able to give you a full accounting until later,
> There is no hurry right now!
>> but 1) I don't see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won't use this convention consistently anyway.
> As far as I can tell, the same argument applies to "unsigned" I believe: it does not represent the full extent of size_t on every platform either! Yet we're still using it as an index, including to index into objects/arrays.
>> I have yet to be convinced by the C++ community's very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rationale.
> Note that this is not about "switching" anything: there is no standard in LLVM right now (as far as I know) and the codebase is inconsistent.
Glad to hear there's no plan to change the codebase, but then I fail to see the purpose of the coding rule. Won't it boil down to "use the correct datatype for what you're doing?"
What I mostly want to avoid is code that uses signed values for inherently unsigned operations like comparing against the size of an object.
Doesn't it make it already worth it?
> On the other hand, a large number of people need to work in this codebase every day, and multiplied over the same time period, my belief is that having the code make sense and be simple has a higher net value.
>
> It simply doesn't make sense (conceptually) to use a signed type for domains that are inherently unsigned, like the size of an object. IMO, we should revisit this if and when the deficiencies in the C++ Standard are addressed.
The underlying problem is that the C family of languages mixes two
orthogonal properties: value range and overflow behavior. There is no
unsigned type with undefined wraparound. So the question becomes: Which
property is more important to reflect? Do we want to catch unintended
wraparound behavior using a sanitizer / make optimizations based on it?
Do we need the additional range provided by an unsigned type? As
Chandler says in one of his talks linked earlier: "If you need more
bits, use more bits" (such as int64_t).
Michael
I agree that readability, maintainability, and ability to debug/find issues are key. I haven't found myself in a situation where unsigned was helping my readability: on the contrary, I am always wondering where the expected wrap-around behavior is, and that is one more thing I have to keep in mind when I read code that manipulates unsigned. So YMMV, but using unsigned *increases* my mental load when reading code.
On Tue, 11 Jun 2019 at 11:45, Zachary Turner via llvm-dev
<llvm...@lists.llvm.org> wrote:
>
> I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".
Doesn't it make it already worth it?
On Jun 11, 2019, at 12:59 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:
> I agree that readability, maintainability, and ability to debug/find issues are key. I haven't found myself in a situation where unsigned was helping my readability: on the contrary, I am always wondering where the expected wrap-around behavior is, and that is one more thing I have to keep in mind when I read code that manipulates unsigned. So YMMV, but using unsigned *increases* my mental load when reading code.
I'm on the other end. I'm always reading the code wondering "is this going to warn?" "Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value i'm storing it in (signed), so an overflow could happen even if there were no error. What then?”
On Tue, Jun 11, 2019 at 12:26 PM Michael Kruse <llv...@meinersbur.de> wrote:
> On Tue, 11 Jun 2019 at 11:45, Zachary Turner via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>
> I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".
> Doesn't it make it already worth it?
vector.size() returns a size_t, which on 64-bit platforms can represent values larger than those that can fit into an int64_t. So to turn your argument around: since it's theoretically possible to have a vector with more items than an int64_t can represent, isn't it already worth it to use size_t, which is an unsigned type?
On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:
> I agree that readability, maintainability, and ability to debug/find issues are key. I haven't found myself in a situation where unsigned was helping my readability: on the contrary, I am always wondering where the expected wrap-around behavior is, and that is one more thing I have to keep in mind when I read code that manipulates unsigned. So YMMV, but using unsigned *increases* my mental load when reading code.
I'm on the other end. I'm always reading the code wondering "is this going to warn?" "Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value I'm storing it in (signed), so an overflow could happen even if there were no error. What then?"
+1 to both points here.

On Wed, 12 Jun 2019, 07:55 Aaron Ballman via llvm-dev, <llvm...@lists.llvm.org> wrote:
> On Tue, Jun 11, 2019, 9:59 PM Zachary Turner <ztu...@roblox.com> wrote:
>> On Tue, Jun 11, 2019 at 12:24 PM Mehdi AMINI <joke...@gmail.com> wrote:
>>> I agree that readability, maintainability, and ability to debug/find issues are key. I haven't found myself in a situation where unsigned was helping my readability: on the contrary, I am always wondering where the expected wrap-around behavior is, and that is one more thing I have to keep in mind when I read code that manipulates unsigned. So YMMV, but using unsigned *increases* my mental load when reading code.
>> I'm on the other end. I'm always reading the code wondering "is this going to warn?" "Why could a container ever have a negative number of elements?" "The maximum value representable by the return type (unsigned) is larger than that of the value I'm storing it in (signed), so an overflow could happen even if there were no error. What then?"
> Strong +1 to this.
> The underlying problem is that the C family of languages mixes two orthogonal properties: value range and overflow behavior. There is no unsigned type with undefined wraparound. So the question becomes: What property is more important to reflect? Do we want catch unintended wraparound behavior using a sanitizer/make optimizations based on it?
That's a valid argument, but I suspect that the vast majority of loops using an unsigned induction variable start at an explicit 0 and go up by 1. Such loops cannot overflow, so there is nothing to catch there.
> On Jun 12, 2019, at 9:54 AM, David Greene via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Jake Ehrlich via llvm-dev <llvm...@lists.llvm.org> writes:
>
>> This whole debate seems kind of odd to me. I don't know that cases
>> where it isn't clear what type to use come up that often. If a value
>> can truly never be negative you should use an unsigned value. If a
>> value can be negative, you should use a signed value. Anecdotal
>> evidence in my case is that the vast majority of values are unsigned
>> by this rule.
>>
>> Is there a reason to use a signed value when you know a value will
>> never be negative?
>
> Since this thread is really long, I want to make sure to address this
> specific point even though it's been covered elsewhere.
>
> One reason to prefer signed is optimization.
FWIW, if you care about optimization, size_t is probably the way to go in general.
An int index will incur a sign extension for any address access on a 64-bit platform (32-bit to 64-bit extension).
Unsigned, on the other hand, creates zero extensions, which are most of the time free.
Thus, unsigned is sometimes better for codegen than signed, in particular in a compiler code base where vectorization is not really a thing.
Anyway, it seems to me that there are enough people on both sides of the fence that this shouldn’t be in the coding standard.
My 2c.
Quentin
Drive by comment, please feel free to disregard.
> On Jun 12, 2019, at 11:50, Mehdi AMINI via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> (any bug causing "negative indexing of a container" would be as much of a bug with unsigned,
Signed comparison with a maximum size or number of things is a classic software vulnerability. E.g.,

#define MAX_WIDGETS 200
struct widget widgets[MAX_WIDGETS];
...
if (count >= MAX_WIDGETS)
    return -1;
/* now do something with widgets[count]; if count is a signed int, a
   negative value passes the check above and indexes out of bounds */

With unsigned integers, there would be no bug. There are simple bugs using either signed or unsigned integers, so I don't find that dispositive.
> please watch Chandler's talk).
It was a great talk! I'm glad you pointed it out, but I'm not convinced any of the examples he discusses impact loop counters, which seems to be one of the main motivations for this. Let's go through them.
1. Shifting by more than the bit width <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=24m40s> doesn't involve signed values.
2. Signed left shift vs. multiplication <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=27m48s> isn't relevant for loops.
3. The example from Dietz et al. <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=33m15s> boils down to computing 16u + (n-1) * 8u. As far as I can tell, the signedness of the quantities involved does not impact the computation. See <https://godbolt.org/z/HVA6xD>.
4. Using a 32-bit, unsigned integer to index an array <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=39m22s>. Here, using a signed 32-bit integer lets the compiler promote it to a signed 64-bit integer and the compiler can produce slightly better code. Using an appropriately sized integer for the index, namely size_t, has the same effect except with unusual ABIs like x32*. ssize_t would also be fine (and better than size_t for x32).
5. C++ doesn't have a supported arithmetic right shift <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=51m49s>. Great point, but not relevant here.
6. UB twitter's favorite example: memcpy(0, 0, 0) <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=55m0s>. Again, not relevant here.
Up thread, Chandler suggests
> for (int Index = 0, Size = MyVector.size(); Index < Size; ++Index)
or
> for (int Index : llvm::seq<int>(0, MyVector.size()))
I believe int can produce better code than unsigned int here, but neither seems correct. Surely both should be size_t (again, ssize_t would be fine). I'm not sure if LLVM containers ever contain more than INT_MAX elements. If not, then int is fine, but since it'll produce the same code as using a correctly sized integer, I don't see the point. (As an aside, the talk by Jon Kalb you linked to <https://www.youtube.com/watch?v=wvtFGa6XJDU&t=4m45s> says that if you need one extra bit, you'll probably need two extra bits. "A sneeze puts you over." But that's another 2 billion more entries in your data structure, so I'm not sure I buy that argument.)
One of the guides you linked to <https://google.github.io/styleguide/cppguide.html#Integer_Types>, says "When appropriate, you are welcome to use standard types like size_t and ptrdiff_t." Looping over structures from 0 to some size expressed as a size_t seems like an appropriate time to use size_t to me.
I haven't yet heard or read a compelling argument for using an integer type smaller than the register width for arithmetic, which int usually is these days. I can't think of a good reason to use unsigned ever though.
* Example provided by Rich Felker <https://twitter.com/RichFelker/status/1138166935575826433>.
Cheers,
Steve
--
Stephen Checkoway
According to your definition above, this is not a "real problem".
64-bit CPUs do not support using the full 64-bit virtual address
space. Even if one does in the future, PTRDIFF_MAX (and as a
consequence std::distance that a std::vector must support) effectively
limits the max allocation size that can be worked with without
undefined behavior.
To summarize: with signed sizes we could fix a few production bugs a
year, while unsigned sizes only have a theoretical advantage in
production.
Have there been any documented cases in LLVM where a for() loop with
an unsigned int induction variable has wrapped around to 0? In other
words, is there any container - either LLVM or C++ Standard Library -
that ended up storing more than UINT_MAX or ULLONG_MAX elements?
I'm looking at these values in <limits.h>:
#define UINT_MAX 4294967295U
#define ULLONG_MAX 18446744073709551615ULL
and I am having a really hard time imagining an llvm::SmallVector<Foo>
storing 18446744073709551615ULL + 1ULL Foo elements. But I'm happy to
be proven wrong.
As far as the C++ Standard Library is concerned, all the containers
implement std::<container-type>::max_size(), which is of type
std::size_t and is always - and intentionally - smaller than either
UINT_MAX or ULLONG_MAX.
So I'm not even sure how an unsigned induction variable testing for
std::vector<Foo>::size() or a std::string::size() - for example -
would ever end up wrapping around to 0. The container will blow up
when its number of elements attempts to exceed its max_size().
Plus, it's not that hard to write
std::vector<Foo> FooVector;
for (unsigned I = 0; I < ${SOMETHING} && I < FooVector.max_size(); ++I) {
}
if unsigned's wrap-around is a material concern.
Maybe the compiler should just warn when sizeof(unsigned) <
sizeof(std::container<Foo>::max_size()). I think that would be enough
of a hint, and in the vast majority of cases it will be moot anyway.
Just my 0.02.
--
Stefan Teleman
stefan....@gmail.com
> is there any container
I watched Chandler's talk, and it looks like there are two statements that are a bit
contradictory to one another:
https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2820s "it's fairly easy to get enough bits
w/o using the sign bits"
versus
https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2846s "using size_t takes more space in data
structures"
Why not use size_t for indexes/lengths, while storing uint32_t integers and casting them
back to size_t when necessary (i.e. for space reasons)?
About checking for overflows, let's take the various examples here:
https://godbolt.org/z/TIGiVe, which are various ways to load a byte at index "n+10" where
n is user-controlled. In this case, we need to check that "n+10" does not overflow.
The first thing is that, IMHO, overflow checks are tricky to get right for both the signed
and unsigned cases (and have led to numerous vulnerabilities). But that's a personal
opinion. If we talk about code size (and performance), the load3 function confirms what
Chandler showed in his talk (using uint32_t can be a performance issue on some 64-bit
systems, due to the enforced wrapping that has to be done on 32-bit integers). The other
load* functions all emit the intended "mov eax, DWORD PTR [rdi+40+rsi*4]". Notable
differences arise in load2, which needs to sign-extend esi into rsi. In load1, there is also
an extra instruction to set the value of (2**63-12).
One conclusion is that, at least on x86-64, using size_t for this kind of overflow check
is the most efficient. But I might be missing something / doing something wrong in this
analysis, and I'd like another POV on this!
One argument for the usage of signed integers (actually integer types that don't allow
wrapping) is indeed the ability to automatically insert these overflow checks at compile time.
On 6/8/19 7:17 PM, Mehdi AMINI via llvm-dev wrote:
> Hi,
>
> The LLVM coding style does not specify anything about the use of signed/unsigned integer,
> and the codebase is inconsistent (there is a majority of code that is using unsigned index
> in loops today though).
>
> I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only
> for bitmask and when you intend to rely on wrapping behavior,
> see: https://reviews.llvm.org/D63049
>
> A lot has been written about this; good references are [unsigned: A Guideline for Better
> Code](https://www.youtube.com/watch?v=wvtFGa6XJDU) and [Garbage In, Garbage Out: Arguing
> about Undefined Behavior...](https://www.youtube.com/watch?v=yG1OZ69H_-o), as well as this
> panel discussion:
> - https://www.youtube.com/watch?v=Puio5dly9N8#t=12m12s
> - https://www.youtube.com/watch?v=Puio5dly9N8#t=42m40s
>
> Other coding guidelines already embrace this:
>
> - http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-signed
> - http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-unsigned
> - https://google.github.io/styleguide/cppguide.html#Integer_Types
>
> It is rare that overflowing (and wrapping) an unsigned integer won't trigger a program bug
> when the overflow was not intentionally handled. Using signed arithmetic means that you
> can actually trap on over/underflow and catch these bugs (when using fuzzing for instance).
>
> Chandler explained this nicely in his CPPCon 2016 talk "Garbage In, Garbage Out: Arguing
> about Undefined Behavior...". I encourage you to watch the full talk, but here is one relevant
> example: https://youtu.be/yG1OZ69H_-o?t=2006 , and here he mentions how Google
> experimented with this internally: https://youtu.be/yG1OZ69H_-o?t=2249
>
> Unsigned integers also have a discontinuity right to the left of zero. Suppose A, B and C
> are small positive integers close to zero, say all less than a hundred or so. Then given:
> A + C > B
> and knowing elementary school algebra, one can rewrite that as:
> A > B - C
> But C might be greater than B, and the subtraction would wrap around to some huge number. This
> happens even when working with seemingly harmless numbers like A=2, B=2, and C=3.
>
> --
> Mehdi