Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Unexpected results from a mul instruction

36 views
Skip to first unread message

Juckett

unread,
Oct 18, 2009, 10:44:33 PM10/18/09
to

I've been trying to track down why some alignment unit tests were
failing in an optimized build and have come across something strange
(tested in both VS2005 and VS2008 express). I've simpified the
optimized code to reproduce the bug.

Basically I have two functions that round the address of a local
variable "input" up to the next multiple of 13. The C form would be
something like "output = ( (input + 12) * 13 ) / 13". The asm looks a
bit odd due to division being optimized out into a large multiple and
a shift.

The function "Test4()" gives the expected output and "Test5()" does
not. The only difference is that Test4 loads the offset address of
"input" into a register before multiplying it by eax. Test5 just
multiplies directly using an offset from the EBP register. The results
of these multiplies are different. The generated optimized C++ code
that lead me here is similar to Test5.

I'm not really an assembly expert so maybe I am just missing something
obvious. I've also had a friend run the executable to confirm the
output.

I'm getting the following output from the below program:

test4
input = 0x0012fe40
output = 0x0012fe41
test5
input = 0x0012fe40
output = 0x004011a2

-----------------------------------------------------------------------------------------


#include <stdio.h>
#include <conio.h>


void Test4()
{
unsigned int input = 0;
unsigned int output = 0;
__asm
{
mov eax,4EC4EC4Fh
lea ecx,input+12
mul ecx
shr edx,2
mov eax,13
mul edx
mov output,eax
}
printf("input = 0x%08x\n", &input);
printf("output = 0x%08x\n", output);
}

void Test5()
{
unsigned int input = 0;
unsigned int output = 0;
__asm
{
mov eax,4EC4EC4Fh
mul input+12
shr edx,2
mov eax,13
mul edx
mov output,eax
}
printf("input = 0x%08x\n", &input);
printf("output = 0x%08x\n", output);
}


void main()
{
printf("test4\n");
Test4();

printf("test5\n");
Test5();


// pause at end of program
_getch();
}

Dick Wesseling

unread,
Oct 19, 2009, 2:59:10 AM10/19/09
to

In article <4adbd291$0$4969$9a6e...@unlimited.newshosting.com>,
Juckett <ryanj...@MUNGED.microcosmotalk.com> writes:

> The function "Test4()" gives the expected output and "Test5()" does
> not. The only difference is that Test4 loads the offset address of
> "input" into a register before multiplying it by eax. Test5 just
> multiplies directly using an offset from the EBP register.

It does not. It multiplies indirectly, The _value_ stored at the
address input+12 is used as an operand, not the address itself.

Juckett

unread,
Oct 19, 2009, 3:00:00 AM10/19/09
to

After some more learning and investigation, I hd some of thaat
assembly wrong. After fixing it up, it ran as expected, so now i'm
wondering if I hit some sort of bug in the optimizer? I'm also only to
repro the issue in the VS2005 compiler and not VS2008 express which is
where I started having the problem initially (in much more compilcated
code).

This probably isn't an x86 issue anymore, but if anyone wants to see
the potentially incorrect assembly from VS2005 and the program that
generated it, look below.

===================

This has the fixed Test4 and Test5 functionas as well as the Test3
function which is the C code generating bad results when optimizations
are on.

I get this output:

test3
input = 0x0012ff38
output = 0x008b1b2a <--- not the input rounded up to multiple of
13
test4
input = 0x0012ff34
output = 0x0012ff2b
test5
input = 0x0012ff34
output = 0x0012ff38

This is the disassembly for the inlines Test3 function

0040100D lea eax,[esp+0Ch]
00401011 push eax
00401012 push offset string "input = 0x%08x\n" (40B69Ch)
00401017 mov dword ptr [esp+14h],3
0040101F call printf (401107h)
00401024 mov eax,4EC4EC4Fh
00401029 mul eax,dword ptr [esp+20h]
0040102D shr edx,2
00401030 imul edx,edx,0Dh
00401033 push edx
00401034 push offset string "output = 0x%08x\n" (40B6B0h)
00401039 call printf (401107h)


---------------------------------------------------


#include <stdio.h>
#include <conio.h>

void Test3()
{
unsigned int val = 3;
unsigned int input = (unsigned int)&val;
unsigned int output = ( ((input + 12) / 13) * 13 );
printf("input = 0x%08x\n", input);


printf("output = 0x%08x\n", output);
}


void Test4()
{
unsigned int val = 4;
unsigned int input = (unsigned int)&val;


unsigned int output = 0;
__asm
{
mov eax,4EC4EC4Fh

lea ecx,val


mul ecx
shr edx,2
mov eax,13
mul edx
mov output,eax
}
printf("input = 0x%08x\n", &input);
printf("output = 0x%08x\n", output);
}

void Test5()
{
unsigned int val = 5;
unsigned int input = (unsigned int)&val;


unsigned int output = 0;
__asm
{
mov eax,4EC4EC4Fh
mul input

shr edx,2
mov eax,13
mul edx
mov output,eax
}
printf("input = 0x%08x\n", &input);
printf("output = 0x%08x\n", output);
}

void main()
{
printf("test3\n");
Test3();

James Harris

unread,
Oct 19, 2009, 3:53:30 AM10/19/09
to

On 19 Oct, 03:44, Juckett <ryanjuck...@MUNGED.microcosmotalk.com>
wrote:

> I've been trying to track down why some alignment unit tests were
> failing in an optimized build and have come across something strange
> (tested in both VS2005 and VS2008 express). I've simpified the
> optimized code to reproduce the bug.
>
> Basically I have two functions that round the address of a local
> variable "input" up to the next multiple of 13. The C form would be

> something like "output =3D ( (input + 12) * 13 ) / 13". The asm looks a


> bit odd due to division being optimized out into a large multiple and
> a shift.

I presume you mean

output =3D ((input + 12) / 13) * 13

>
> The function "Test4()" gives the expected output and "Test5()" does
> not. The only difference is that Test4 loads the offset address of
> "input" into a register before multiplying it by eax. Test5 just
> multiplies directly using an offset from the EBP register. The results
> of these multiplies are different. The generated optimized C++ code
> that lead me here is similar to Test5.
>
> I'm not really an assembly expert so maybe I am just missing something
> obvious. I've also had a friend run the executable to confirm the
> output.

I'm not familiar with the assembler format used below but if

lea ecx, input + 12

has the contents of the variable "input" substituted in the expression
as (using square brackets to indicate memory referencing)

mov edx, [input]
lea ecx, [edx + 12]

it will add 12 to the value of input which is what's wanted. This is
the same as

mov edx, [input]
add edx, 12

For Test5, however,

mul input + 12

cannot be a single instruction to do what I think you want - i.e. to
add 12 to a value and then multiply by it. Perhaps it is being taken
as

mul [input + 12]

- i.e. multiply by the contents of the 32-bit word which is 12 bytes
away from the input variable. This would be equivalent to

mov edx, [input + 12]
mul edx

To fix Test5 perhaps you need something to replace the single "mul
input+12" instruction such as

mov edx, input
add edx, 12
mul edx

but the exact form may vary as I am guessing as to your assembler
format. The assembler listing - if you can produce one from VS (Visual
Studio?) - would show what opcodes are being generated.

The bottom line is you cannot add 12 and multiply in one instruction
so "mul input+12" doesn't look right. (And input should be at location
"input", not at input + 12.)

James

>
> I'm getting the following output from the below program:
>
> test4

> input =3D 0x0012fe40
> output =3D 0x0012fe41
> test5
> input =3D 0x0012fe40
> output =3D 0x004011a2
>
> -------------------------------------------------------------------------=


----------------
>
> #include <stdio.h>
> #include <conio.h>
>
> void Test4()
> {

> =A0 =A0 =A0 =A0 unsigned int input =3D 0;
> =A0 =A0 =A0 =A0 unsigned int output =3D 0;
> =A0 =A0 =A0 =A0 __asm
> =A0 =A0 =A0 =A0 {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov eax,4EC4EC4Fh
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 lea ecx,input+12
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mul ecx
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 shr edx,2
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov eax,13
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mul edx
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov output,eax
> =A0 =A0 =A0 =A0 }
> =A0 =A0 =A0 =A0 printf("input =3D 0x%08x\n", &input);
> =A0 =A0 =A0 =A0 printf("output =3D 0x%08x\n", output);
>
> }
>
> void Test5()
> {
> =A0 =A0 =A0 =A0 unsigned int input =3D 0;
> =A0 =A0 =A0 =A0 unsigned int output =3D 0;
> =A0 =A0 =A0 =A0 __asm
> =A0 =A0 =A0 =A0 {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov eax,4EC4EC4Fh
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mul input+12
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 shr edx,2
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov eax,13
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mul edx
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov output,eax
> =A0 =A0 =A0 =A0 }
> =A0 =A0 =A0 =A0 printf("input =3D 0x%08x\n", &input);
> =A0 =A0 =A0 =A0 printf("output =3D 0x%08x\n", output);
>
> }
>
> void main()
> {
> =A0 =A0 =A0 =A0 printf("test4\n");
> =A0 =A0 =A0 =A0 Test4();
>
> =A0 =A0 =A0 =A0 printf("test5\n");
> =A0 =A0 =A0 =A0 Test5();
>
> =A0 =A0 =A0 =A0 // pause at end of program
> =A0 =A0 =A0 =A0 _getch();
>
> }
>
>

Juckett

unread,
Oct 19, 2009, 4:51:11 AM10/19/09
to

On Oct 18, 11:59=A0pm, f...@MUNGED.microcosmotalk.com (Dick Wesseling)
wrote:
> In article <4adbd291$0$4969$9a6e1...@unlimited.newshosting.com>,

> =A0 =A0 =A0 =A0 Juckett <ryanjuck...@MUNGED.microcosmotalk.com> writes:
>
> > The function "Test4()" gives the expected output and "Test5()" does
> > not. The only difference is that Test4 loads the offset address of
> > "input" into a register before multiplying it by eax. Test5 just
> > multiplies directly using an offset from the EBP register.
>
> It does not. It multiplies indirectly, The _value_ stored at the
> address input+12 is used as an operand, not the address itself.

Hi Dick,

Thanks for the reply. From what I can tell, the visual studio
optimizer just isnt understanding that. It's taking a temporary value
set to the address and then sending that address to the mul
instruction expecting the address and not the dereferenced value to be
used.

I've simplified it down to the following causing in correct outputs in
a fully optimized build. Time to find a work around :(

#include <stdio.h>
#include <conio.h>

void main()
{
char buffer[ 256 ] =3D { 0 };

unsigned input =3D (unsigned)buffer;
unsigned output =3D ( ((input + 12) / 13) * 13 );

printf("input =3D 0x%08x\n", input);


printf("output =3D 0x%08x\n", output);

// pause at end of program
_getch();
}

Juckett

unread,
Oct 19, 2009, 10:13:28 AM10/19/09
to

Hi James,

As far as I can tell you are correct. If you are interested in the
actual generated visual studio assembly, I submitted a detailed bug
report of it at
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=498462
(expand the details for the actual and desired disassembly output). It
seems to be related to dividing a stack address by a non-power of two
number which is probably a rare operation (I only had it happen
because of some extensive alignment unit tests - I doubt anyone would
actually need alignment to a multiple of 13).

Helge Kruse

unread,
Oct 19, 2009, 10:13:55 AM10/19/09
to

"Juckett" <ryanj...@MUNGED.microcosmotalk.com> wrote in message
news:4adc287f$0$5108$9a6e...@unlimited.newshosting.com...


> Thanks for the reply. From what I can tell, the visual studio
> optimizer just isnt understanding that. It's taking a temporary value
> set to the address and then sending that address to the mul
> instruction expecting the address and not the dereferenced value to be
> used.

What do you want to achieve with the calculation? You take the address of an
variable called buffer and calculate something.

> void main()
> {
> char buffer[ 256 ] = { 0 };
>
> unsigned input = (unsigned)buffer;
> unsigned output = ( ((input + 12) / 13) * 13 );
>

You told the compiler to use the address with the assignment of an array to
an integer. Since that does not work out of the box, you use a cast. I can't
see what the compiler should not understand.

Regards,
Helge


Juckett

unread,
Oct 19, 2009, 2:15:22 PM10/19/09
to

> What do you want to achieve with the calculation? You take the address of an
> variable called buffer and calculate something.

It is just rounding the addbress of buffer up to a byte alignment of
13 (not a very practicle thing to do, but that is what it is doing).

> You told the compiler to use the address with the assignment of an array to
> an integer. Since that does not work out of the box, you use a cast. I can't
> see what the compiler should not understand.

Right, the compiler should understand it and I don't see any
assumptions like aliasing that are being broken. As I mentioned
previously, it is generating code (when full optimizations are on) to
use a value in the buffer when rounding it up rather than trying to
round up the address of the buffer itself.

The dissambler outputs a line:
00401030 mul eax,dword ptr [esp+20h]

It should actually output two instructions likes so for the correct
result:
lea ecx,dword ptr [esp+20h]
mul eax,ecx

Terje Mathisen

unread,
Oct 20, 2009, 7:15:16 AM10/20/09
to

Juckett wrote:
>> What do you want to achieve with the calculation? You take the address of an
>> variable called buffer and calculate something.
>
> It is just rounding the addbress of buffer up to a byte alignment of
> 13 (not a very practicle thing to do, but that is what it is doing).

Your C code is broken!

Start by splitting the problem into three parts:

a) Take the address of your buffer and convert that value to an intptr_t.

unsigned pointer2unsigned(void *p)
{
unsigned n = (unsigned) ((intptr_t) p);
return n;
}

b) Round up the result of (a) to the nearest multiple of 13

unsigned next13multiple(unsigned n)
{
return ((n+12)/13)*13;
}

c) Convert the resulting integer back to a pointer.

Test each part in isolation and verify that yes, it does work correctly
for all possible 32-bit inputs!

When you do that, the (b) function will probably look very close to this
when compiled as a standalone function:

mov eax,[esp+4]
mov edx,(0x800000000 + 12)/13 // Reciprocal multiplier, rounded up
add eax,12
mul edx
mov eax,13
shr edx,3
mul edx

When inlined, the MOV EAX,[] and ADD EAX,12 parts can be combined into a
LEA EAX,[EBX+12] (assuming EBX contains your buffer variable), i.e.
pretty much identical to what you've shown us.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Oct 20, 2009, 4:33:04 PM10/20/09
to

Terje Mathisen wrote:
> When you do that, the (b) function will probably look very close to this
> when compiled as a standalone function:
>
> mov eax,[esp+4]
> mov edx,(0x800000000 + 12)/13 // Reciprocal multiplier, rounded up
> add eax,12
> mul edx
> mov eax,13
> shr edx,3
> mul edx

I have to comment to my own post here:

13 is a pretty nice multiplier, particularly when the intermediate
result is already scaled by 8, which we can take advantage of:

mov eax,[esp+4]
mov edx,(0x800000000 + 12)/13 // Reciprocal multiplier, rounded up
add eax,12
mul edx

;; EDX contains (n+12)/13 * 8 (plus some garbage)
mov eax,edx
and edx,0xfffffff8 ;; Gets rid of the garbage
shr eax,3
add edx,eax
lea eax,[edx+eax*4]

Depending upon the speed of your integer MUL opcode, this combination of
shift, mask, lea might be faster!

Juckett

unread,
Oct 20, 2009, 7:22:32 PM10/20/09
to

> Your C code is broken!

I would love if that was the case, but I'm not seeing where that would
be. Maybe I'm missing something?

> a) Take the address of your buffer and convert that value to an intptr_t.
>
> unsigned pointer2unsigned(void *p)
> {

> =A0 =A0 =A0unsigned n =3D (unsigned) ((intptr_t) p);
> =A0 =A0 =A0return n;
>
> }

I do this step in the following line (from the posted code). I don't
go through an intptr_t, but on Win32 that shouldn't matter right?

unsigned input =3D (unsigned)buffer;

> b) Round up the result of (a) to the nearest multiple of 13
>
> unsigned next13multiple(unsigned n)
> {

> =A0 =A0 =A0return ((n+12)/13)*13;
>
> }
>

I do this step in the following line (from the posted code). Looks to
match exactly.

unsigned output =3D ( ((input + 12) / 13) * 13 );


> c) Convert the resulting integer back to a pointer.

This step I don't even do in the code I posted because I am already at
a point where I can validate the results.

I'll add the intptr_t cast in the test later tonight when I get a
chance, but I'd be surprissed if that is required to make the compiler
to generate functioning code for this one case. The code works with
other multiples and with global buffers (which are forced to load into
a register).

- Ryan

Terje Mathisen

unread,
Oct 21, 2009, 1:47:25 AM10/21/09
to

Juckett wrote:
>> Your C code is broken!
>
> I would love if that was the case, but I'm not seeing where that would
> be. Maybe I'm missing something?

Are you sure that you're actually taking the address of your buffer, and
not loading the contents of the first element? I don't remember the
exact code anymore.

Anyway, you should definitely single-step through the resulting code, in
disassembly mode.

Re rounding up: You can also take the address of the 12th element and
truncate:

char *round13(char *buffer)
{
unsigned n = (unsigned) ((intptr_t) &buffer[12]);
n /= 13;
n *= 13;
return (char *) n;

Richard Russell

unread,
Oct 21, 2009, 5:56:11 AM10/21/09
to

On 21 Oct, 06:47, Terje Mathisen wrote:
> Juckett wrote:
> >> Your C code is broken!
>
> > I would love if that was the case, but I'm not seeing where that would
> > be. Maybe I'm missing something?
>
> Are you sure that you're actually taking the address of your buffer, and
> not loading the contents of the first element? I don't remember the
> exact code anymore.

Surely, if the fault was in the C code, then switching compiler
optimization on and off wouldn't alter the result? Also, if you
compile the test program (in the bug report submitted to Microsoft)
under gcc it works correctly, even with full optimizations enabled.
It's hard to imagine more compelling evidence that the compiler is at
fault.

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Richard Russell

unread,
Oct 21, 2009, 5:57:04 AM10/21/09
to

On 20 Oct, 21:33, Terje Mathisen
<Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
> mov eax,edx
> =A0 =A0 =A0and edx,0xfffffff8 ;; Gets rid of the garbage
> =A0 =A0 =A0shr eax,3
> =A0 =A0 =A0add edx,eax
> =A0 =A0 =A0lea eax,[edx+eax*4]

>
> Depending upon the speed of your integer MUL opcode, this combination of
> shift, mask, lea might be faster!

GCC does the multiply-by-13 thus:

8D 04 52 lea eax,[edx*3]
8D 04 82 lea eax,[edx+eax*4]

Terje Mathisen

unread,
Oct 21, 2009, 10:40:48 AM10/21/09
to

Richard Russell wrote:
> GCC does the multiply-by-13 thus:
>
> 8D 04 52 lea eax,[edx*3]

More probably it does

LEA EAX,[EDX+EDX*2]

which gives the same result. :-)

Terje Mathisen

unread,
Oct 21, 2009, 10:41:25 AM10/21/09
to

Richard Russell wrote:
> Surely, if the fault was in the C code, then switching compiler
> optimization on and off wouldn't alter the result? Also, if you
> compile the test program (in the bug report submitted to Microsoft)
> under gcc it works correctly, even with full optimizations enabled.
> It's hard to imagine more compelling evidence that the compiler is at
> fault.

Sorry, I just didn't understand the connection between the source code
and the asm, it simply didn't make sense.

A compiler bug is very believable indeed, I've found a few in my time. :-(

Richard Russell

unread,
Oct 21, 2009, 3:30:18 PM10/21/09
to

On 21 Oct, 15:40, Terje Mathisen
<Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
> More probably it does
>
> =A0 =A0 =A0 =A0 LEA EAX,[EDX+EDX*2]

>
> which gives the same result. :-)

Many assemblers (including NASM and my own) accept the instruction as
I listed it:

lea eax,[edx*3]

I don't see any point in decomposing it into the form you use. It may
more accurately represent what is going on 'under the hood' but
assembly-language (as opposed to machine-code) is meant to make the
code more human-readable!

Juckett

unread,
Oct 22, 2009, 5:43:46 AM10/22/09
to

If anyone is interested, microsoft tested and confimed the bug on
their end in VS2008. They also said that it looks to already be fixed
in the recently released VS2010 Beta 2.

- Ryan

Helge Kruse

unread,
Oct 22, 2009, 8:52:18 AM10/22/09
to

"Juckett" <ryanj...@MUNGED.microcosmotalk.com> wrote in message

news:4ae02952$0$5650$9a6e...@unlimited.newshosting.com...


>
> If anyone is interested, microsoft tested and confimed the bug on
> their end in VS2008. They also said that it looks to already be fixed
> in the recently released VS2010 Beta 2.

Is VS2005 affected too? I fear it would not be fixed there.

Helge


Richard Russell

unread,
Oct 22, 2009, 11:27:29 AM10/22/09
to

On 22 Oct, 10:43, Juckett <ryanjuck...@MUNGED.microcosmotalk.com>
wrote:

> If anyone is interested, microsoft tested and confimed the bug on
> their end in VS2008. They also said that it looks to already be fixed
> in the recently released VS2010 Beta 2.

Being a natural sceptic, I worry about their comment that "I haven't
tracked down what change fixed the problem". There must be a
possibility that it hasn't actually been fixed, but that a change
elsewhere means that your particular test case no longer provokes it.

Kevin G. Rhoads

unread,
Oct 22, 2009, 4:06:44 PM10/22/09
to

>Surely, if the fault was in the C code, then switching compiler
>optimization on and off wouldn't alter the result?

If the HLL source is correct, changing optimization levels should
be OK except when there is a compiler bug.

If the HLL source is NOT correct, then changing optimization
could interact with the bug in weird ways. Once the source
has an error, all further suppositions need to be checked.

0 new messages