Ill-formed code or a compiler bug (mingw-w64 g++)?

K. Frank

unread,

Nov 6, 2011, 9:58:04 AM11/6/11

to

Hello Group!

We've been discussing over on the mingw64 list:

http://sourceforge.net/mailarchive/message.php?msg_id=28355621

some unexpected behavior in a 64-bit test program when built under the
64-bit
mingw-w64 version of g++.

We're trying to decide whether the test program is ill-formed, and
therefore
that the compiler is off the hook, or whether we might have stumbled
across
a compiler bug.

Some context: The test program is logically incorrect in that the
code is not
expected to do what it was nominally intended to do. That's not the
point.
Rather, the question is whether the code is legal, well-formed c++,
and whether
the compiler is generating the correct output. It seems that the key
point is
the comparisons between 32-bit and 64-bit integral types (the variable
pos, an
unsigned, and std::string::npos, a size_t) in the test code.

Here is the code:

====================

#include <iostream>
#include <streambuf>
#include <string>

#include <cstdio>

class StdoutStream : public std::basic_streambuf<char>
{
public:
StdoutStream(std::ostream &stream) : m_stream (stream) {
m_old_buf = stream.rdbuf();
stream.rdbuf (this);
}
~StdoutStream() {
// output anything that is left
if (!m_string.empty()) printf ("%s", m_string.c_str());
m_stream.rdbuf (m_old_buf);
}

protected:
virtual int_type overflow(int_type v) {
if (v == '\n') {
printf ("%s", m_string.c_str());
m_string.erase (m_string.begin(), m_string.end());
}
else m_string += v;
return v;
}

virtual std::streamsize xsputn (const char *p, std::streamsize n) {
m_string.append (p, p + n);
unsigned pos = 0;
// size_t pos = 0; // this is the correct declaration
while (pos != std::string::npos) { // 32-bit unsigned compared
against 64-bit size_t
// while (pos != ((unsigned) -1)) { // this works
printf ("xsputn: top of loop, pos = %d\n", pos);
pos = m_string.find ('\n');
if (pos != std::string::npos) { // 32-bit unsigned compared
against 64-bit size_t
// if (pos != ((unsigned) -1)) { // this works
std::string tmp(m_string.begin(), m_string.begin() + pos + 1);
printf ("%s", tmp.c_str());
m_string.erase (m_string.begin(), m_string.begin() + pos + 1);
}
printf ("xsputn: bottom of loop, pos = %d\n", pos);
}
printf ("xsputn: before return, n = %d, pos = %d\n", n, pos);
return n;
}

private:
std::ostream &m_stream;
std::streambuf *m_old_buf;
std::string m_string;
};

int main (int argc, char *argv[]) {

printf ("hello...\n");

StdoutStream *out = new StdoutStream (std::cout);

if (out) { // to avoid unused variable warning
// prints neither message to text1
// std::cout << "Message 1 (with endl)..." << std::endl;
// std::cout << "Message 2 (with endl)..." << std::endl;

// prints only "Message 1" to text1
try {
std::cout << "Message 1 (with '\\n')...\n";
std::cout << "Message 2 (with '\\n')...\n";
}
catch (...) { // no exception thrown -- not the ptoblem
printf ("exception caught...\n");
}

// prints both messages to text1
// std::cout << "Message 1 (with two '\\n's)...\nMessage 2 (with
two '\\n's)...\n";
}

printf ("goodbye!\n");

return 0;
}

====================

It compiles without error:

Here is the unexpected output when it runs:

C:\>stdout_stream_error
hello...
xsputn: top of loop, pos = 0
Message 1 (with '\n')...
xsputn: bottom of loop, pos = 24
xsputn: top of loop, pos = 24
goodbye!

Note, in particular, that the second time through the loop, the "top
of loop"
message is printed, but the "bottom of loop" message is not.
Furthermore,
it does not appear that the loop (and xsputn function) is exited by
virtue of
an exception being thrown, as the catch block is main does not appear
to
be being executed.

I believe (that because of the logically incorrect 32-bit / 64-bit
comparison)
that the code should enter an infinite loop.

When compiled using a 32-bit version of g++ (so that the logic error
goes
away), the program gives the following desired (and seemingly correct)
output:

C:\>stdout_stream_error
hello...
xsputn: top of loop, pos = 0
Message 1 (with '\n')...
xsputn: bottom of loop, pos = 24
xsputn: top of loop, pos = 24
xsputn: bottom of loop, pos = -1
xsputn: before return, n = 25, pos = -1
xsputn: top of loop, pos = 0
Message 2 (with '\n')...
xsputn: bottom of loop, pos = 24
xsputn: top of loop, pos = 24
xsputn: bottom of loop, pos = -1
xsputn: before return, n = 25, pos = -1
goodbye!

Would anyone have some ideas about what might be going on here?

I apologize for the slightly complicated test program. I haven't been
able to reproduce
the unexpected behavior in a simpler setting without the
basic_streambuf and xsputn
stuff. (An attempt to keep the loop structure and the 32-bit / 64-bit
comparisons, but
get rid of the basic_streambuf-derived class produces code that enters
an infinite loop.)

Thanks in advance for your insights.

K. Frank

Paavo Helde

unread,

Nov 6, 2011, 10:30:03 AM11/6/11

to

"K. Frank" <kfran...@gmail.com> wrote in news:44424c25-fe00-4531-9e07-
69cdf8...@du8g2000vbb.googlegroups.com:

> Hello Group!
>
> We've been discussing over on the mingw64 list:
>
> http://sourceforge.net/mailarchive/message.php?msg_id=28355621
>
> some unexpected behavior in a 64-bit test program when built under the
> 64-bit
> mingw-w64 version of g++.
>
> We're trying to decide whether the test program is ill-formed, and
> therefore
> that the compiler is off the hook, or whether we might have stumbled
> across
> a compiler bug.

Your program is ill-formed.

>
> Some context: The test program is logically incorrect in that the
> code is not
> expected to do what it was nominally intended to do. That's not the
> point.
> Rather, the question is whether the code is legal, well-formed c++,
> and whether
> the compiler is generating the correct output. It seems that the key
> point is
> the comparisons between 32-bit and 64-bit integral types (the variable
> pos, an
> unsigned, and std::string::npos, a size_t) in the test code.
>

> virtual std::streamsize xsputn (const char *p, std::streamsize n) {
> m_string.append (p, p + n);
> unsigned pos = 0;
> // size_t pos = 0; // this is the correct declaration
> while (pos != std::string::npos) { // 32-bit unsigned compared
> against 64-bit size_t
> // while (pos != ((unsigned) -1)) { // this works
> printf ("xsputn: top of loop, pos = %d\n", pos);
> pos = m_string.find ('\n');
> if (pos != std::string::npos) { // 32-bit unsigned compared
> against 64-bit size_t

if find() does not find '\n', it returns npos (-1 casted to
std::string::size_type). This will be truncated to 0xffffffff when
assigned to unsigned pos.

> // if (pos != ((unsigned) -1)) { // this works
> std::string tmp(m_string.begin(), m_string.begin() + pos + 1);

Here, if 'pos+1' were evaluated first, it would overwrap and yield zero,
which is a legal op for unsigned ints. However, as m_string.begin()
returns a 64-bit iterator and the + operator associates left-to-right,
all the additions are done in 64-bit, so you attempt to construct a 4G
tmp string, which will most probably cause UB.

If you put some parens around 'pos+1' here and a couple of lines later,
then you indeed get a well-formed infinite loop program.

Cheers
Paavo

Paavo Helde

unread,

Nov 6, 2011, 10:37:21 AM11/6/11

to

"K. Frank" <kfran...@gmail.com> wrote in news:44424c25-fe00-4531-9e07-
69cdf8...@du8g2000vbb.googlegroups.com:

> // size_t pos = 0; // this is the correct declaration

Actually the correct declaration is

std::string::size_type pos = 0;

One should make a shorter typedef for that type if it is used much.

Cheers
Paavo

K. Frank

unread,

Nov 6, 2011, 11:00:04 AM11/6/11

to

Hi Paavo!

Thanks very much. I believe that your analysis is correct.

Some comments below...

On Nov 6, 10:30 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> "K. Frank" <kfrank2...@gmail.com> wrote in news:44424c25-fe00-4531-9e07-
> 69cdf896e...@du8g2000vbb.googlegroups.com:
>
> > Hello Group!
> ...

> > We're trying to decide whether the test program is ill-formed, and
> > therefore
> > that the compiler is off the hook, or whether we might have stumbled
> > across
> > a compiler bug.
>
> Your program is ill-formed.

> ...

> if find() does not find '\n', it returns npos (-1 casted to
> std::string::size_type). This will be truncated to 0xffffffff when
> assigned to unsigned pos.
>
> > // if (pos != ((unsigned) -1)) { // this works
> > std::string tmp(m_string.begin(), m_string.begin() + pos + 1);
>
> Here, if 'pos+1' were evaluated first, it would overwrap and yield zero,
> which is a legal op for unsigned ints. However, as m_string.begin()
> returns a 64-bit iterator and the + operator associates left-to-right,
> all the additions are done in 64-bit, so you attempt to construct a 4G
> tmp string, which will most probably cause UB.

I think you're right.

Let me see if I can restate your argument:

pos = m_string.find ('\n');
if (pos != std::string::npos) {

std::string tmp(m_string.begin(), m_string.begin() + pos +
1);

1) pos = m_string.find ('\n');

m_string.find ('\n') either returns a legal position in m_stirng or
npos.
However, because of the 32-bit / 64-bit mismatch, pos is converted to
a
32-bit -1, which is neither npos, nor a legal position in m-string.

2) if (pos != std::string::npos) {

Because the 32-bit pos is not actually equal to npos, this if
statement
does not protect us from executing the next line. (But, so far,
everything
is legal.)

3) std::string tmp(m_string.begin(), m_string.begin() + pos + 1);

As soon as we create the iterator "m_string.begin() + pos" (or perhaps
as
soon as we use it) we have undefined behavior because the iterator
points
neither to a position within m_string nor to "one past the end."

I would say that the undefined behavior is due to indexing "into"
m_string past its end, rather than trying to instantiate a 4GB string.
(I would think that the latter would be legal, but should throw a
bad_alloc, or something.)

> If you put some parens around 'pos+1' here and a couple of lines later,
> then you indeed get a well-formed infinite loop program.

Yes, I agree. To analyze a bit further, it's a bit of a happenstance
that
adding the parentheses "(pos + 1)" makes the code legal.

pos is still a wacky value -- an unsigned 32-bit -1. But as you point
out
"(pos + 1)" is now zero, and "m_string.begin + 0" is a legal iterator
into
m-string, so no undefined behavior occurs. (But if pos had been some
other
wacky value, such as an unsigned 32-bit -2, then adding the
parentheses would
not have eliminated the undefined behavior.)

Does this all sound right?

> Cheers
> Paavo

Thanks for clearing this up.

K. Frank

Paavo Helde

unread,

Nov 6, 2011, 2:22:32 PM11/6/11

to

"K. Frank" <kfran...@gmail.com> wrote in news:7c78b247-fbca-4f00-bf66-
e36e8a...@n13g2000vbv.googlegroups.com:

The latter depends on the size of m_string, to be pedantic. In principle
you can have strings over 4G in length in a 64-bit program.

> 2) if (pos != std::string::npos) {
>
> Because the 32-bit pos is not actually equal to npos, this if
> statement
> does not protect us from executing the next line. (But, so far,
> everything
> is legal.)

Yep

>
> 3) std::string tmp(m_string.begin(), m_string.begin() + pos + 1);
>
> As soon as we create the iterator "m_string.begin() + pos" (or perhaps
> as
> soon as we use it) we have undefined behavior because the iterator
> points
> neither to a position within m_string nor to "one past the end."

Yes, that's correct. However, technically one can construct iterators
with any random value in most mainstream implementations, so unless the
implementation takes extra efforts to detect UB here this step would
usually not cause real problems. UB anyway, of course.

>
> I would say that the undefined behavior is due to indexing "into"
> m_string past its end, rather than trying to instantiate a 4GB string.
> (I would think that the latter would be legal, but should throw a
> bad_alloc, or something.)

Allocating 4G is perfectly OK in a 64-bit program and should succeed on a
decent computer. The real problem is that the code is trying to
initialize it by copying some 4G of data from the memory which does not
exist (yes,this is basically the same as "indexing into m_string past its
end").

>
>> If you put some parens around 'pos+1' here and a couple of lines
later,
>> then you indeed get a well-formed infinite loop program.
>
> Yes, I agree. To analyze a bit further, it's a bit of a happenstance
> that
> adding the parentheses "(pos + 1)" makes the code legal.
>
> pos is still a wacky value -- an unsigned 32-bit -1. But as you point
> out
> "(pos + 1)" is now zero, and "m_string.begin + 0" is a legal iterator
> into
> m-string, so no undefined behavior occurs. (But if pos had been some
> other
> wacky value, such as an unsigned 32-bit -2, then adding the
> parentheses would
> not have eliminated the undefined behavior.)

Yes, that's right. However note that npos is very strictly defined by the
standard so that adding 1 to it is guaranteed to yield zero. I'm not sure
if this is intentional or not, in some algorithms it comes handy and
allows to skip an extra check (in the expense of obcuring the code so I'm
not convinced this is such a good idea).

>
> Does this all sound right?

Yep

Cheers
Paavo