Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

sorting 5million minute data: choose std::list or std::vector

25 views
Skip to first unread message

MM

unread,
Feb 2, 2011, 9:53:07 AM2/2/11
to
I have about 5million rows like these (day, minute, and 4 doubles)

YYYYMMDD, HHMM, double, double, double, double

which I parse from a file and store into

struct one_line {
boost::uint32_t day;
boost::uint32_t minute;
double d1, d2, d3, d4;
};

I may need to sort the container once the data is parsed, sort based off the
day and minute.

Would sorting a std::list<one_line> be faster than std::vector<one_line>?
Is the best to use std::map<Key, Value> where Key is
std::pair<boost::uint32_t ,boost::uint32_t> (day and minute) ?

regards,

Leigh Johnston

unread,
Feb 2, 2011, 10:09:31 AM2/2/11
to

std::list and std::map are both node based containers which involve a
separate allocation for each element whilst std::vector involves a
single allocation for all the elements it contains; this may be a
consideration when you have a large number of elements.

As far as sorting speeds are concerned it can depend on the size of the
element in question (whether swapping std::list node pointers is faster
than value swapping vector elements); for example if the element type
was "int" rather than your struct sorting std::vector could be an order
of magnitude faster than sorting std::list. You should run a test
benchmark to determine what is best.

std::set may be more appropriate than std::map; std::multiset may be
more appropriate than std::set if more than one row can have the same
"day" and "minute" values.

/Leigh

Leigh Johnston

unread,
Feb 2, 2011, 10:31:07 AM2/2/11
to

On VC++ I find sorting a std::vector of your struct to be twice as fast
as sorting a std::list of your struct.

/Leigh

MM

unread,
Feb 2, 2011, 10:51:04 AM2/2/11
to

"Leigh Johnston" <le...@i42.co.uk> wrote in message
news:t9qdnZxybYck5dTQ...@giganews.com...
strange, I guess it depends on how unordered the list was.
both sorts are Nlog(N) , aren't they?
I'd think moving pointers is faster than value moving 4 doubles and 2 ints?

thanks,


Leigh Johnston

unread,
Feb 2, 2011, 10:57:02 AM2/2/11
to

Yes they both have the same complexity but constant factors can come
into play. :)

My test was of 5000000 records of random data (very unordered).

If you keep increasing the size of the struct eventually std::list::sort
should overtake std::sort of std::vector.

/Leigh

Pete Becker

unread,
Feb 2, 2011, 11:38:22 AM2/2/11
to

The structure copies can be done with a single call to memcpy; the
pointer copies have to be done separately.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Juha Nieminen

unread,
Feb 2, 2011, 12:18:25 PM2/2/11
to
MM <finju...@gmail.com> wrote:
> Would sorting a std::list<one_line> be faster than std::vector<one_line>?

Is it really that hard to test it?

Juha Nieminen

unread,
Feb 2, 2011, 12:21:06 PM2/2/11
to
MM <finju...@gmail.com> wrote:
> strange, I guess it depends on how unordered the list was.
> both sorts are Nlog(N) , aren't they?
> I'd think moving pointers is faster than value moving 4 doubles and 2 ints?

A doubly-linked list takes significantly more memory than a vector
(especially for such a relatively small element type), and the elements
of the list will usually end up being significantly more dispersed in
RAM while the elements in a vector will be at contiguous memory locations.
This means that sorting (or even just traversing) the list will usually
cause significantly more cache misses than a vector.

Bo Persson

unread,
Feb 2, 2011, 2:35:44 PM2/2/11
to

Note that you have to update 6 pointers to move a node in a doubly
linked list. :-)

Often std::vector has an edge by being more cache friendly. The
elements are adjacent. And smaller than list nodes.

Bo Persson


Jorgen Grahn

unread,
Feb 3, 2011, 3:24:15 PM2/3/11
to

AOL. Test it and report your results here. We don't really know what
will work best --or if the difference is big enough to matter.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Jorgen Grahn

unread,
Feb 3, 2011, 3:26:13 PM2/3/11
to
On Wed, 2011-02-02, MM wrote:
> I have about 5million rows like these (day, minute, and 4 doubles)
>
> YYYYMMDD, HHMM, double, double, double, double
>
> which I parse from a file and store into
>
> struct one_line {
> boost::uint32_t day;
> boost::uint32_t minute;
> double d1, d2, d3, d4;
> };

Nitpick: don't call HHMM 'minute', or you'll be surprised one day when
you forget that 102 doesn't mean 102 minutes but 62 minutes.

Yannick Tremblay

unread,
Feb 4, 2011, 9:39:22 AM2/4/11
to
In article <slrnikm3r3.1...@frailea.sa.invalid>,

Jorgen Grahn <grahn...@snipabacken.se> wrote:
>On Wed, 2011-02-02, MM wrote:
>> I have about 5million rows like these (day, minute, and 4 doubles)
>>
>> YYYYMMDD, HHMM, double, double, double, double
>>
>> which I parse from a file and store into
>>
>> struct one_line {
>> boost::uint32_t day;
>> boost::uint32_t minute;
>> double d1, d2, d3, d4;
>> };
>
>Nitpick: don't call HHMM 'minute', or you'll be surprised one day when
>you forget that 102 doesn't mean 102 minutes but 62 minutes.

Additional nitpick: don't use bit specified integer unless you *need*
to know the size of the integers in bits. Normal "unsigned integer"
would have been perfectly fine here.

James Kanze

unread,
Feb 4, 2011, 12:08:41 PM2/4/11
to
On Feb 4, 2:39 pm, ytrem...@nyx.net (Yannick Tremblay) wrote:
> In article <slrnikm3r3.15e.grahn+n...@frailea.sa.invalid>,
> Jorgen Grahn <grahn+n...@snipabacken.se> wrote:

Normal int would have been perfectly fine here. On the other
hand, if he's got 5 million of them, short or even signed char
might be more appropriate (although in practice, alignment
requirements will probably mean that he won't gain anything by
using a type smaller than int).

--
James Kanze

Leigh Johnston

unread,
Feb 4, 2011, 12:19:58 PM2/4/11
to

Eschewing unsigned integral types is sometimes irrational. *If* it
makes no sense to have negative days or minutes then why use a signed
integral type?

/Leigh

Juha Nieminen

unread,
Feb 5, 2011, 4:24:12 AM2/5/11
to
Leigh Johnston <le...@i42.co.uk> wrote:
> Eschewing unsigned integral types is sometimes irrational. *If* it
> makes no sense to have negative days or minutes then why use a signed
> integral type?

Because if you are calculating the difference of two dates, you will
have to deal with the wrap-around problem if you are using unsigned
types, usually by explicitly casting the integrals being compared to
signed (which kind of defeats the whole purpose of them being unsigned
in the first place).

There might not be negative days in dates, but there may well be when
calculating differences between dates.

Leigh Johnston

unread,
Feb 5, 2011, 7:57:15 AM2/5/11
to

Which is why is said "*If*" in my reply. On the implementation I use
static_cast<int> has no overhead and unlike some people I don't mind the
verbosity of the C++ style casts.

From my code:

typedef int pixel_pos;
typedef std::size_t pixel_size;
struct dimensions
{
dimensions() : cx(0), cy(0) {}
dimensions(pixel_size aCx, pixel_size aCy) : cx(aCx), cy(aCy) {}
pixel_size cx;
pixel_size cy;
friend dimensions operator*(const dimensions& lhs, std::size_t amount)
{ return dimensions(lhs.cx * amount, lhs.cy * amount); }
};
struct coordinate
{
coordinate() : x(0), y(0) {}
coordinate(pixel_pos aX, pixel_pos aY) : x(aX), y(aY) {}
pixel_pos x;
pixel_pos y;
};
struct rect : coordinate, dimensions
{
rect() {}
rect(const coordinate& aCoordinate, const dimensions& aDimensions) :
coordinate(aCoordinate), dimensions(aDimensions) {}
operator RECT() const
{
RECT ret = { x, y, x + static_cast<pixel_pos>(cx), y +
static_cast<pixel_pos>(cy) };
return ret;
}
};


...
...

bool underMouse = aCursorPosition.x >= iconPos.x &&
aCursorPosition.y >= iconPos.y &&
aCursorPosition.x < iconPos.x + static_cast<pixel_pos>(iconSize.cx)
&& aCursorPosition.y < iconPos.y + static_cast<pixel_pos>(iconSize.cy);


/Leigh

gwowen

unread,
Feb 7, 2011, 7:05:51 AM2/7/11
to
On Feb 4, 5:19 pm, Leigh Johnston <le...@i42.co.uk> wrote:

> Eschewing unsigned integral types is sometimes irrational.  *If* it
> makes no sense to have negative days or minutes then why use a signed
> integral type?

There are sometimes good reasons. Unsigned overflow has defined
semantics; signed does not. If the bare silicon of the target
platform's registers does not follow the defined unsigned semantics,
using unsigned can result in a lot of unnecessary code generation
(checking for overflow and implementing correct behaviour)

I've written code for compilers/platforms for which

int find(char foo,const char arr[],size_t sz){
for(int q=0;q<sz;++sz){
if(arr[q] == foo) return q;
}
return sz;
}

produces *considerably* smaller code than

int find(char foo,const char arr[],size_t sz){
for(unsigned q=0;q<sz;++sz){
if(arr[q] == foo) return q;
}
return sz;
}

Of course, that latter has a larger domain of defined behaviour, but
in reality you may well know that you're never going to hit either
limit.

Leigh Johnston

unread,
Feb 7, 2011, 8:06:57 AM2/7/11
to

Presumably you meant "++q" not "++sz"? In which case on VC++ I get:

00000 33 c0 xor eax, eax
00002 38 14 08 cmp BYTE PTR [eax+ecx], dl
00005 74 0b je SHORT $LN5@find1
00007 40 inc eax
00008 83 f8 06 cmp eax, 6
0000b 72 f5 jb SHORT $LL4@find1
0000d b8 06 00 00 00 mov eax, 6
00012 c3 ret 0

versus

00000 33 c0 xor eax, eax
00002 38 14 08 cmp BYTE PTR [eax+ecx], dl
00005 74 0b je SHORT $LN5@find2
00007 40 inc eax
00008 83 f8 06 cmp eax, 6
0000b 72 f5 jb SHORT $LL4@find2
0000d b8 06 00 00 00 mov eax, 6
00012 c3 ret 0

i.e. no difference between signed and unsigned versions.

Your use of an "int" return type in the second version looks dubious.

/Leigh

gwowen

unread,
Feb 7, 2011, 8:18:55 AM2/7/11
to
On Feb 7, 1:06 pm, Leigh Johnston <le...@i42.co.uk> wrote:

>> I've written code for compilers/platforms for which [snip]
> [Your correct code fix snipped]


> In which case on VC++ I get:

Then I think we can safely say that VC++ was not the platform to which
I was referring can't we? Sheesh, some people are so keen to be
thought of as smarter than everyone else in the whole world, they end
up looking like idiots.

Stick to arguing with Paul, at least you *are* probably smarter than
him.

Leigh Johnston

unread,
Feb 7, 2011, 8:42:22 AM2/7/11
to
On 07/02/2011 13:18, gwowen wrote:
> On Feb 7, 1:06 pm, Leigh Johnston<le...@i42.co.uk> wrote:
>
>>> I've written code for compilers/platforms for which [snip]
>> [Your correct code fix snipped]
>> In which case on VC++ I get:
>
> Then I think we can safely say that VC++ was not the platform to which
> I was referring can't we? Sheesh, some people are so keen to be
> thought of as smarter than everyone else in the whole world, they end
> up looking like idiots.
>

No we can safely assume that there is no difference because your
examples were naive (q being promoted to unsigned in the comparison q<sz
in your "int" version).

If instead we compare the more realistic:

typedef int rubbish_size_type;

rubbish_size_type rubbish_find(char foo,const char
arr[],rubbish_size_type sz){
for(rubbish_size_type q=0;q<sz;++q){


if(arr[q] == foo) return q;
}
return sz;
}

versus

typedef std::size_t sensible_size_type;

sensible_size_type sensible_find(char foo,const char
arr[],sensible_size_type sz){
for(sensible_size_type q=0;q<sz;++q){


if(arr[q] == foo) return q;
}
return sz;
}

we get:

00000 33 c0 xor eax, eax
00002 38 14 08 cmp BYTE PTR [eax+ecx], dl

00005 74 0b je SHORT $LN5@rubbish_fi


00007 40 inc eax
00008 83 f8 06 cmp eax, 6

0000b 7c f5 jl SHORT $LL4@rubbish_fi


0000d b8 06 00 00 00 mov eax, 6
00012 c3 ret 0

versus

00000 33 c0 xor eax, eax
00002 38 14 08 cmp BYTE PTR [eax+ecx], dl

00005 74 0b je SHORT $LN5@sensible_f


00007 40 inc eax
00008 83 f8 06 cmp eax, 6

0000b 72 f5 jb SHORT $LL4@sensible_f


0000d b8 06 00 00 00 mov eax, 6
00012 c3 ret 0

Here the only difference is intel opcode jl versus intel opcode jb which
IICR have the same performance characteristics.

[troll snipped]

/Leigh

gwowen

unread,
Feb 7, 2011, 8:47:50 AM2/7/11
to
On Feb 7, 1:42 pm, Leigh Johnston <le...@i42.co.uk> wrote:

> Here the only difference is intel opcode jl versus intel opcode jb which
> IICR have the same performance characteristics.

I'm not talking about an intel processor, so the effacacy of intel
opcodes is utterly irrelevant.

Leigh Johnston

unread,
Feb 7, 2011, 8:58:00 AM2/7/11
to

It was simply a balanced, pertinent counter-example to your claim of
multiple platforms you have worked on producing *considerably* more code
for the unsigned version. Like it or not the Intel platform is quite
pervasive (not some esoteric platform that nobody cares much about).

Can you produce some assembler backing up your claim (using my version
of the functions)?

/Leigh

gwowen

unread,
Feb 7, 2011, 9:20:42 AM2/7/11
to
On Feb 7, 1:58 pm, Leigh Johnston <le...@i42.co.uk> wrote:
> It was simply a balanced, pertinent counter-example to your claim of
> multiple platforms

If you think "one platform agrees with me" provides a counter
example .

> Like it or not the Intel platform is quite pervasive (not some esoteric platform that nobody cares much about).

I know that. In fact, that was entirely my point. I was talking
about an esoteric platform - specifically a custom DSP Asic who's
default signed integer arithmetic saturated (because on embedded DSPs,
that is almost always what you want signed arithmetic to do - clip
your signal rather than wrap it around).

So every time you do an unsigned addition on that processor, the C
standard insists that you check for overflow and correct the results
if overflow is about to occur. If you use signed arithmetic, the
compiler just loads the registers and calls the appropriate
instruction. (In fact, if memory recalls, unsigned arithmetic was done
using a subroutine, signed arithmetic was a single opcode, although I
may have misremembered that).

As to specific cases where the compiler can use the "overflow is
undefined" to optimize, with assembler generated, your attention is
drawn to http://www.airs.com/blog/archives/120

Leigh Johnston

unread,
Feb 7, 2011, 9:42:42 AM2/7/11
to

The implementation specific behaviour of some esoteric platform (or any
specific platform for that matter) should not necessarily direct us as
to what constitutes good C++ coding practice.

Unsigned integral types exist and eschewing them is *usually* irrational
but *not always* (as is the case with your esoteric platform). For
example std::size_t is an unsigned integral type and is quite pervasive
in the C++ Standard Library (see std::allocator).

/Leigh

gwowen

unread,
Feb 7, 2011, 10:09:22 AM2/7/11
to
On Feb 7, 2:42 pm, Leigh Johnston <le...@i42.co.uk> wrote:

> The implementation specific behaviour of some esoteric platform (or any
> specific platform for that matter) should not necessarily direct us as
> to what constitutes good C++ coding practice.
>

Given that all I originally said was "There are sometimes good
reasons" -- the thing which you claimed to have disproved with a
counterexample [not sure how you disprove a "sometimes" with a
counterexample, but hey ho] -- its good to see you've come around.

Unsigned integral types - I use them a lot, they're handy, especially
when I'm writing code for Intel and other general purpose processors
[though I bet there are some GPUs that work like DSPs for similar
reasons]. Eschewing (good word, by the way) them without a good reason
is pretty silly, but as I originally -- and correctly, as you now
agree -- said *THERE ARE SOMETIMES GOOD REASONS*.

Leigh Johnston

unread,
Feb 7, 2011, 10:47:54 AM2/7/11
to
On 07/02/2011 15:09, gwowen wrote:

My original reply stated more or less what you claim I *now* agree with:

"Eschewing unsigned integral types is sometimes irrational."

The only difference being that you are using the word "silly" and I am
using the stronger word "irrational". :)

/Leigh

gwowen

unread,
Feb 7, 2011, 11:50:27 AM2/7/11
to
On Feb 7, 3:47 pm, Leigh Johnston <le...@i42.co.uk> wrote:

> My original reply stated more or less what you claim I *now* agree with:

So what was your "counterexample" meant to demonstrate?

Leigh Johnston

unread,
Feb 7, 2011, 11:57:01 AM2/7/11
to

It was simply to add some balance to your initial reply in which you
mention compilers/platforms (plural) producing *considerably* smaller
code when not using unsigned which I initially viewed as argument.

/Leigh

Juha Nieminen

unread,
Feb 7, 2011, 12:01:11 PM2/7/11
to
Leigh Johnston <le...@i42.co.uk> wrote:
> Which is why is said "*If*" in my reply. On the implementation I use
> static_cast<int> has no overhead and unlike some people I don't mind the
> verbosity of the C++ style casts.

It's not about verbosity. It's about clarity and the maning of your
code. If you say that a variable is unsigned and then later you cast it
to signed, you *lied* in your variable declaration. That can only cause
confusion.

Leigh Johnston

unread,
Feb 7, 2011, 12:06:59 PM2/7/11
to

Wrong. The cast is sometimes required when using the unsigned variable
in an expression involving signed variables (as clearly indicated in the
example code I posted); a cast is not required in other contexts. It
seems to me that the only confusion is your understanding of the issue.

/Leigh

Juha Nieminen

unread,
Feb 7, 2011, 12:19:33 PM2/7/11
to

But that's exactly why you need to make the original variable signed:
Because it can be used in signed expressions. It makes no sense to make
it unsigned when it will be used as a signed value anyways.

Whenever you need to cast an unsigned variable to a signed one in an
expression, that's a sign of bad design. You are using the variable in
a way that contradicts its original declaration.

Bo Persson

unread,
Feb 7, 2011, 12:21:17 PM2/7/11
to

And now we know one platform (singular) where the code is not
considerably smaller?


Bo Persson


Leigh Johnston

unread,
Feb 7, 2011, 12:33:22 PM2/7/11
to

Wrong; it is not a sign of bad design at all. If a variable is used as
an unsigned value more often than as a signed value and it logically
makes sense for it to be unsigned (e.g. it is never negative) then
making the variable unsigned makes perfect sense.

1) Unsigned integral types exist for a reason. Types are used to build
abstractions and having a richer set of building blocks is beneficial.
2) Casts exist for a reason. Type conversion is unavoidable (if
avoiding insanity).

You might be in the "use int everywhere" camp but I am not. Perhaps
Java is a more suitable language for you rather than C++.

/Leigh

Kaba

unread,
Feb 7, 2011, 4:58:35 PM2/7/11
to
Leigh Johnston wrote:
> Wrong; it is not a sign of bad design at all. If a variable is used as
> an unsigned value more often than as a signed value and it logically
> makes sense for it to be unsigned (e.g. it is never negative) then
> making the variable unsigned makes perfect sense.

--8x--

> You might be in the "use int everywhere" camp but I am not. Perhaps
> Java is a more suitable language for you rather than C++.

For what it's worth, here's my take on this. This is a piece of text I
have written earlier (explaining the form). I am in the 'use int
everywhere' camp:) Ideas can't and shouldn't be forced, but they can be
shared; the following text reflects how I reason my position, not that
the other camps are wrong.

In this article I present two problems in using unsigned
integers in a program. These are the _zero boundary problem_,
and the _extended positive range problem_.

Zero boundary problem
---------------------

Most of the time we are assuming that, away from boundaries caused
by the finite representation, an integer object works like an element
of the integer ring ZZ. In addition, it seems plausible that most
integer calculations are concentrated around a neighborhood of zero.

The _zero-boundary problem_ (of unsigned integers) is that the zero
is on the boundary beyond which the assumption of working with integers
falls apart. Thus, the probability of introducing errors in computations
increases greatly. Furthermore, those errors can not be catched, since
every value is legal.

### Looping with unsigned integers

Looping over values from n to zero backwards demonstrates the zero
boundary problem with unsigned integers:

for (unsigned int i = n;i >= 0;--i)
{
// Do something.
}

Since there are no negative values to fail the loop test,
this loop never finishes. In contrast, with signed integers the problem
does not exist, since the boundaries are located far away from zero:

for (int i = n;i >= 0;--i)
{
// Do something.
}

Extended positive range problem
-------------------------------

Conversion between an unsigned integer and a signed integer
is an information destroying process in either direction.
The only way to avoid this is to use one or the other
consistently.

If the normal arithmetic is the intention, then a signed integer
represents a more general concept than an unsigned integer: the former
covers both the negative and positive integers, whereas the latter
only covers non-negative integers. In programming terms, it is
possible to create a program using signed integers alone, however,
the same can't be said about the unsigned integers. Therefore, if
only one type is to be used consistently, the choice should be a
signed integer. However, let us do some more analysis.

Since any non-trivial program must use signed integers, the use of
unsigned integers eventually leads to an unsigned-signed
conversion. In particular, because `std::size_t` in the Standard
Library is an unsigned integer, there are few programs that can
escape the unsigned-signed conversions.

Despite these conversions, programs still seem to work. The reason
for this, I reflect, is that the unsigned integers are normally
not taken into the extended positive range they allow.
The _extended positive range problem_ is that if the unsigned integers
are taken to their extended positive range, then the signed-unsigned
conversions become a reality. Ensuring correctness under such a threat
is hard, if not practically impossible.

Conclusion
----------

Using unsigned integers to model integers decreases the probability
that a program is correct.

--
http://kaba.hilvi.org

Leigh Johnston

unread,
Feb 7, 2011, 5:39:55 PM2/7/11
to

It is debatable if the probability of introducing errors increases
greatly when using unsigned integral types; this is certainly an
assertion on your part, do you have any evidence to back it up?

>
> ### Looping with unsigned integers
>
> Looping over values from n to zero backwards demonstrates the zero
> boundary problem with unsigned integers:
>
> for (unsigned int i = n;i>= 0;--i)
> {
> // Do something.
> }

for (unsigned int i = n + 1;i-- > 0;)
{
// Do something.
}

Yes the horse named std::size_t has indeed already bolted. Stable door.

>
> Despite these conversions, programs still seem to work. The reason
> for this, I reflect, is that the unsigned integers are normally
> not taken into the extended positive range they allow.
> The _extended positive range problem_ is that if the unsigned integers
> are taken to their extended positive range, then the signed-unsigned
> conversions become a reality. Ensuring correctness under such a threat
> is hard, if not practically impossible.

You ensure correctness as you would ensure correctness using any other
language feature.

>
> Conclusion
> ----------
>
> Using unsigned integers to model integers decreases the probability
> that a program is correct.
>

Bugs can be created using any language feature therefore it is, IMO,
wrong to eschew a particular language feature based on the possibility
of bug creation unless there is clear evidence that the language feature
in question has a high probability of bug creation (for some definition
of "high").

What you have presented is little more than conjecture IMO.

/Leigh

Kaba

unread,
Feb 7, 2011, 6:47:09 PM2/7/11
to
Leigh Johnston wrote:
> On 07/02/2011 21:58, Kaba wrote:
> > The _zero-boundary problem_ (of unsigned integers) is that the zero
> > is on the boundary beyond which the assumption of working with integers
> > falls apart. Thus, the probability of introducing errors in computations
> > increases greatly. Furthermore, those errors can not be catched, since
> > every value is legal.
>
> It is debatable if the probability of introducing errors increases
> greatly when using unsigned integral types; this is certainly an
> assertion on your part, do you have any evidence to back it up?

Consider any function taking in an unsigned integer (e.g.
std::vector::resize()). It is possible that the computation of a new
size, as a bug, ends up being a negative number, which is then wrapped
to a large positive number. This bug can't be catched, since every value
is legal.

Something similar can happen with signed integers: a large negative
number wraps to large positive number. But this is much less probable:
the action is concentrated around zero.

Correctness is the most important property of a program. In the
described case using a signed integer _brings visible_ an often
occurring bug.

The problem of unchecked bugs gets worse with increasing levels of
abstraction. In the worst case you go through a deep layer of functions,
with an unexplained crash, although the actual problem was that an
argument of the top-most function did not fulfill its precondition (e.g.
size < 0). It's quite comparable to the thousand-lines template errors
when using Boost.Spirit.

> > ### Looping with unsigned integers
> >
> > Looping over values from n to zero backwards demonstrates the zero
> > boundary problem with unsigned integers:
> >
> > for (unsigned int i = n;i>= 0;--i)
> > {
> > // Do something.
> > }
>
> for (unsigned int i = n + 1;i-- > 0;)
> {
> // Do something.
> }

Nice:) However, the point is that to carry out a task you have to be
careful to avoid a bug. That's something that should not be left to
humans (because by Murphy's law...).

> > Despite these conversions, programs still seem to work. The reason
> > for this, I reflect, is that the unsigned integers are normally
> > not taken into the extended positive range they allow.
> > The _extended positive range problem_ is that if the unsigned integers
> > are taken to their extended positive range, then the signed-unsigned
> > conversions become a reality. Ensuring correctness under such a threat
> > is hard, if not practically impossible.
>
> You ensure correctness as you would ensure correctness using any other
> language feature.

I don't think it can be afforded, because no one can put the effort to
check each unsigned-signed conversion for trouble. The practical way,
which I also use, is to close the eyes and avoid exercising a program to
its extremes (e.g. 2^32 - 1 objects).

Interestingly, the 64-bit integers help with the extended range problem
_in practice_: the extreme values are then often avoided naturally (e.g.
no one creates an std::vector with 2^64 elements). But the zero-boundary
problem remains.

> > Conclusion
> > ----------
> >
> > Using unsigned integers to model integers decreases the probability
> > that a program is correct.
> >
>
> Bugs can be created using any language feature therefore it is, IMO,
> wrong to eschew a particular language feature based on the possibility
> of bug creation unless there is clear evidence that the language feature
> in question has a high probability of bug creation (for some definition
> of "high").

The possibility of a bug creation is not a problem. The problem is that
using unsigned integers prevents me from catching bugs, and that way
decreases my confidence on program correctness.

> What you have presented is little more than conjecture IMO.

Critical thinking appreciated.

--
http://kaba.hilvi.org

James Kanze

unread,
Feb 7, 2011, 8:16:01 PM2/7/11
to

Two reasons, really. The first is simple: the *normal* integral
type in C++ is plain int. You don't use anything else unless
plain int won't do the job. You don't use unsigned for the same
reason you don't use long, or short, or whatever. The second is
more fundamental: C++ doesn't have an unsigned type with
appropriate semantics which would be appropriate for days or
minutes, at least in the general case (where you might have to
take the difference between days or minutes). That's not to say
that you should never use unsigned---there are cases where you
want its special semantics. And there are doubtlessly cases
where it really doesn't matter---things like serial numbers, for
example. (But there again, unless there is some reason why int
isn't appropriate, then it should be int. Anything other than
int tells the reader that you need its special
features---smaller size, bigger range, etc.)

--
James Kanze

James Kanze

unread,
Feb 7, 2011, 8:20:10 PM2/7/11
to
On Feb 7, 1:18 pm, gwowen <gwo...@gmail.com> wrote:
> On Feb 7, 1:06 pm, Leigh Johnston <le...@i42.co.uk> wrote:

[...]


> Stick to arguing with Paul, at least you *are* probably smarter than
> him.

Ouch. Now that is hitting low.

(Seriously, there is one machine currently being sold today
where the compiler has an option to turn off standard compliant
behavior of unsigned integers, because of the runtime overhead.
But it's not common enough for it to be an issue for most
people. And from other postings in the past, I think Leigh's in
the camp that only Wintel matters.)

--
James Kanze

Leigh Johnston

unread,
Feb 7, 2011, 8:26:12 PM2/7/11
to

Nonsense; see posts else-thread.

/Leigh

Juha Nieminen

unread,
Feb 8, 2011, 1:10:21 AM2/8/11
to
Leigh Johnston <le...@i42.co.uk> wrote:
> Wrong; it is not a sign of bad design at all. If a variable is used as
> an unsigned value more often than as a signed value and it logically
> makes sense for it to be unsigned (e.g. it is never negative) then
> making the variable unsigned makes perfect sense.

What is "more often"? 50%? Do you count the individual instances where
it's being used in the program and change the type accordingly? What if
the percentage changes later?

The major problem with using unsigned types in variables which are being
used in signed expressions is that it's easy to forget the cast, which
easily introduces bugs which might not be immediately obvious.

The most typical case where this introduces bugs is where the unsigned
type is mixed with signed types on an expression that ends up being of a
floating point type. If you forget the cast, a result which should be in
the order of tens might end up to be wrongly in the order of billions.
The problem is that this might not be apparent immediately, only with
certain values (eg. if you are calculating coordinates based on some
input data or whatever).

You might think, for example, "the width and height of an image are
never negative, hence it makes sense to make then unsigned". However,
if the width and height are used to calculate screen coordinates of
objects (which is rather typical in many situations), you will end up
with the useless casts and a high risk of forgetting a cast and potentially
have your coordinates 2 billion pixels to the right and down.

In these cases it actually makes more sense to have the width and height
as signed integers.

And yes, I have experience on this precise problem.

> Perhaps
> Java is a more suitable language for you rather than C++.

Where did that come from?

Paul

unread,
Feb 8, 2011, 1:47:59 AM2/8/11
to

"James Kanze" <james...@gmail.com> wrote in message
news:9aaad3a2-2b6a-461d...@p16g2000vbo.googlegroups.com...

> On Feb 7, 1:18 pm, gwowen <gwo...@gmail.com> wrote:
>> On Feb 7, 1:06 pm, Leigh Johnston <le...@i42.co.uk> wrote:
>
> [...]
>> Stick to arguing with Paul, at least you *are* probably smarter than
>> him.
>
> Ouch. Now that is hitting low.
>

Look 3 idiots together! ^_^

Juha Nieminen

unread,
Feb 8, 2011, 2:00:41 AM2/8/11
to
Paul <pchr...@yahoo.co.uk> wrote:
> Look 3 idiots together! ^_^

Honestly, what do you think you are going to achieve with that kind of
attitude?

Paul

unread,
Feb 8, 2011, 2:16:14 AM2/8/11
to

"Juha Nieminen" <nos...@thanks.invalid> wrote in message
news:4d50ea19$0$2852$7b1e...@news.nbl.fi...

Or is it 4? �_�

Richard Kettlewell

unread,
Feb 8, 2011, 6:54:55 AM2/8/11
to
Kaba <no...@here.com> writes:

> ### Looping with unsigned integers
>
> Looping over values from n to zero backwards demonstrates the zero
> boundary problem with unsigned integers:
>
> for (unsigned int i = n;i >= 0;--i)
> {
> // Do something.
> }
>
> Since there are no negative values to fail the loop test,
> this loop never finishes.

While agreeing that unsigned can booby-trapped in some cases, I don't
think this is one of them: a decent compiler will spot the trivial
comparison and generate a warning.

--
http://www.greenend.org.uk/rjk/

Leigh Johnston

unread,
Feb 8, 2011, 9:21:52 AM2/8/11
to
On 08/02/2011 06:10, Juha Nieminen wrote:
> Leigh Johnston<le...@i42.co.uk> wrote:
>> Wrong; it is not a sign of bad design at all. If a variable is used as
>> an unsigned value more often than as a signed value and it logically
>> makes sense for it to be unsigned (e.g. it is never negative) then
>> making the variable unsigned makes perfect sense.
>
> What is "more often"? 50%? Do you count the individual instances where
> it's being used in the program and change the type accordingly? What if
> the percentage changes later?

Don't be a retard. By more often I simply meant that if you used an
incorrect abstraction then you would probably end up with more
"instances" of it being used in a context which requires a cast.
However the actual number is an irrelevant metric yes.

>
> The major problem with using unsigned types in variables which are being
> used in signed expressions is that it's easy to forget the cast, which
> easily introduces bugs which might not be immediately obvious.

You can create a bug using any language feature. Fix and recompile.

>
> The most typical case where this introduces bugs is where the unsigned
> type is mixed with signed types on an expression that ends up being of a
> floating point type. If you forget the cast, a result which should be in
> the order of tens might end up to be wrongly in the order of billions.
> The problem is that this might not be apparent immediately, only with
> certain values (eg. if you are calculating coordinates based on some
> input data or whatever).

You can create a bug using any language feature. Fix and recompile.

>
> You might think, for example, "the width and height of an image are
> never negative, hence it makes sense to make then unsigned". However,
> if the width and height are used to calculate screen coordinates of
> objects (which is rather typical in many situations), you will end up
> with the useless casts and a high risk of forgetting a cast and potentially
> have your coordinates 2 billion pixels to the right and down.

You can create a bug using any language feature. Fix and recompile.

>
> In these cases it actually makes more sense to have the width and height
> as signed integers.

In your opinion.

>
> And yes, I have experience on this precise problem.

So do I. I don't have any problem with casts; they are a perfectly
acceptable part of a statically typed language such as C++.

>
>> Perhaps
>> Java is a more suitable language for you rather than C++.
>
> Where did that come from?

As you seem to be in the "use int everywhere" camp Java may be a more
appropriate language for you.

/Leigh

Kaba

unread,
Feb 8, 2011, 10:24:19 AM2/8/11
to

Yep. So instead:

unsigned int a = f();

for (unsigned int i = n;i >= a;--i)
{
// Do something.
}

The loop works except when a = 0.

--
http://kaba.hilvi.org

Richard Kettlewell

unread,
Feb 8, 2011, 11:03:49 AM2/8/11
to
Kaba <no...@here.com> writes:
> Richard Kettlewell wrote:

>> While agreeing that unsigned can booby-trapped in some cases, I don't
>> think this is one of them: a decent compiler will spot the trivial
>> comparison and generate a warning.
>
> Yep. So instead:
>
> unsigned int a = f();
>
> for (unsigned int i = n;i >= a;--i)
> {
> // Do something.
> }
>
> The loop works except when a = 0.

Much better l-)

--
http://www.greenend.org.uk/rjk/

Martin

unread,
Feb 9, 2011, 8:36:05 PM2/9/11
to
On Feb 7, 3:58 pm, Kaba <n...@here.com> wrote:
> Leigh Johnston wrote:
[snip]

>
> Zero boundary problem
> ---------------------
>
> Most of the time we are assuming that, away from boundaries caused
> by the finite representation, an integer object works like an element
> of the integer ring ZZ. In addition, it seems plausible that most
> integer calculations are concentrated around a neighborh