migrating nasm code to x86_64

Phil Carmody

Oct 18, 2008, 7:11:20 AM

I've got some old nasm code that I wrote for x86/linux (and x86/cygwin),
and I'm trying to compile it for a new core2 box. Of course it's barfing
on all kinds of stuff:

<<<
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:

push ebp
mov ebp, esp
push edi
push esi
push ebx
>>>

All of the pushes produce this:

<<<
factorial_m_p3.asm:18: error: instruction not supported in 64-bit mode
factorial_m_p3.asm:20: error: instruction not supported in 64-bit mode
factorial_m_p3.asm:21: error: instruction not supported in 64-bit mode
factorial_m_p3.asm:22: error: instruction not supported in 64-bit mode
>>>

I don't want to simply fork the file into an x86 and an x86_64 version
with different register names, but can do that if it's the simplest thing
to do. How have other people coped with these kinds of issues? Some
macros, perhaps for preamble and postamble? Macros just for register
names? Any advice would be welcome.

Phil
--
The fact that a believer is happier than a sceptic is no more to the
point than the fact that a drunken man is happier than a sober one.
The happiness of credulity is a cheap and dangerous quality.
-- George Bernard Shaw (1856-1950), Preface to Androcles and the Lion

Alexei A. Frounze

Oct 18, 2008, 7:35:11 AM

On Oct 18, 3:11 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:

> I've got some old nasm code that I wrote for x86/linux (and x86/cygwin),
> and I'm trying to compile it for a new core2 box. Of course it's barfing
> on all kinds of stuff:
>
> <<<
> cglobal p3_bl_1prime_multi2
> p3_bl_1prime_multi2:
>
> push ebp
> mov ebp, esp
> push edi
> push esi
> push ebx
>
>
>
> As all the pushes do this:
>
> <<<
> factorial_m_p3.asm:18: error: instruction not supported in 64-bit mode
> factorial_m_p3.asm:20: error: instruction not supported in 64-bit mode
> factorial_m_p3.asm:21: error: instruction not supported in 64-bit mode
> factorial_m_p3.asm:22: error: instruction not supported in 64-bit mode
>
>
>
> I don't want to simply fork the file into an x86 and an x86_64 version
> with different register names, but can do that if it's the simplest thing
> to do. How have other people coped with these kinds of issues? Some
> macros, perhaps for preamble and postamble? Macros just for register
> names? Any advice would be welcome.

In 64-bit mode a single PUSH instruction can only push 8 or 2 bytes,
not 4. You could indeed use some special macros for register names,
e.g. _rax, which would expand to eax or rax depending on the CPU mode.
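
A minimal sketch of such register-name macros, written with NASM's own preprocessor. It assumes a NASM version that defines the standard __BITS__ macro; the alias names (_rax, _rbp, ...) and the label save_frame_example are made up for illustration:

```nasm
; Hypothetical register aliases that follow the current mode.
; __BITS__ is NASM's standard macro for the active BITS setting.
%if __BITS__ == 64
  %define _rax rax
  %define _rbx rbx
  %define _rsi rsi
  %define _rdi rdi
  %define _rbp rbp
  %define _rsp rsp
%else
  %define _rax eax
  %define _rbx ebx
  %define _rsi esi
  %define _rdi edi
  %define _rbp ebp
  %define _rsp esp
%endif

section .text
save_frame_example:
        push    _rbp            ; push rbp in 64-bit mode, push ebp in 32-bit mode
        mov     _rbp, _rsp
        push    _rbx
        ; ... routine body goes here ...
        pop     _rbx
        pop     _rbp
        ret
```

The same source should then assemble with either -f elf32 or -f elf64, as long as the body only uses the aliases where a full-width register is wanted.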

Alex

Frank Kotler

Oct 18, 2008, 1:38:08 PM

Phil Carmody wrote:
> I've got some old nasm code that I wrote for x86/linux (and x86/cygwin),
> and I'm trying to compile it for a new core2 box. Of course it's barfing
> on all kinds of stuff:
>
> <<<
> cglobal p3_bl_1prime_multi2
> p3_bl_1prime_multi2:
>
> push ebp
> mov ebp, esp
> push edi
> push esi
> push ebx
>
> As all the pushes do this:
>
> <<<
> factorial_m_p3.asm:18: error: instruction not supported in 64-bit mode
> factorial_m_p3.asm:20: error: instruction not supported in 64-bit mode
> factorial_m_p3.asm:21: error: instruction not supported in 64-bit mode
> factorial_m_p3.asm:22: error: instruction not supported in 64-bit mode
>
> I don't want to simply fork the file into an x86 and an x86_64 version
> with different register names, but can do that if it's the simplest thing
> to do. How have other people coped with these kinds of issues? Some
> macros, perhaps for preamble and postamble? Macros just for register
> names? Any advice would be welcome.

If you just want to get a 32-bit executable (from tools that default to
64 bits), try adding "-melf_i386" to the command line to ld. If you want
a 64-bit executable, you'll have to change the register names... and
probably other things. This 64-bit shit - completely different ABI - is
a "nail in the coffin of assembly language" IMHO.

Best,
Frank

Herbert Kleebauer

Oct 18, 2008, 2:09:28 PM

Frank Kotler wrote:

> probably other things. This 64-bit shit - completely different ABI - is
> a "nail in the coffin of assembly language" IMHO.

Is there any free place left where you can insert an additional nail?

Chuck Crayne

Oct 18, 2008, 2:18:42 PM

On Sat, 18 Oct 2008 14:11:20 +0300
Phil Carmody <thefatphi...@yahoo.co.uk> wrote:

> Some
> macros, perhaps for preamble and postamble? Macros just for register
> names? Any advice would be welcome.

You can use the __OUTPUT_FORMAT__ macro, either in open code, or
within macros, to handle such things. However, that's just the tip of
the iceberg. All address variables must be 64-bit. The Linux syscalls
have been renumbered. And if you call any library functions, the
parameters are passed in registers instead of on the stack.
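
A small sketch of switching on __OUTPUT_FORMAT__, here used to hide where the first integer argument of a C-callable routine lives. The macro name LOAD_ARG0 and the routine twice are hypothetical, and anything other than elf64 is simply assumed to be a 32-bit cdecl target:

```nasm
; Pick the argument-fetch sequence from the output format.
%ifidn __OUTPUT_FORMAT__, elf64
  %macro LOAD_ARG0 0
        mov     eax, edi        ; System V x86-64: first integer argument is in rdi/edi
  %endmacro
%else
  %macro LOAD_ARG0 0
        mov     eax, [esp+4]    ; 32-bit cdecl: first argument sits just above the return address
  %endmacro
%endif

section .text
global twice
twice:                          ; C prototype assumed: int twice(int x)
        LOAD_ARG0               ; must run before anything else is pushed
        add     eax, eax        ; return 2*x in eax
        ret
```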

--
Chuck
http://www.pacificsites.com/~ccrayne/charles.html

H. Peter Anvin

Oct 18, 2008, 2:56:52 PM

to Frank Kotler

Frank Kotler wrote:
>
> If you just want to get a 32-bit executable (from tools that default to
> 64 bits), try adding "-melf_i386" to the command line to ld. If you want
> a 64-bit executable, you'll have to change the register names... and
> probably other things. This 64-bit shit - completely different ABI - is
> a "nail in the coffin of assembly language" IMHO.
>

Well, there are plenty of proper, and improper, uses for assembly
language. Trying to write portable applications is obviously not one of
them. However, there are plenty of cases where you can get a speedup of
many times by writing core routines in assembly, and then it's probably
worth writing for each target architecture -- and x86-32 and x86-64 are
different target architectures (as is x86-16; it is probably more
different from the other two than they are from each other).

-hpa

Phil Carmody

Oct 18, 2008, 6:21:33 PM

Frank Kotler <fbko...@verizon.net> writes:
> Phil Carmody wrote:
>> I've got some old nasm code that I wrote for x86/linux (and x86/cygwin),
>> and I'm trying to compile it for a new core2 box. Of course it's
>> barfing on all kinds of stuff:
>>

>> push ebp

>> factorial_m_p3.asm:18: error: instruction not supported in 64-bit mode
>>

>> Any advice would be welcome.
>
> If you just want to get a 32-bit executable (from tools that default
> to 64 bits), try adding "-melf_i386" to the command line to ld. If you
> want a 64-bit executable, you'll have to change the register
> names... and probably other things. This 64-bit shit - completely
> different ABI - is a "nail in the coffin of assembly language" IMHO.

Fortunately I couldn't give a flying aardvark about my routines
calling any other routines - I'm writing optimised inner loops,
nothing more.

From the small straw poll it seems that just having a macro which
expands to different register names for the two architectures is
the simplest way to go. Fortunately that's easy for my inner loops,
as it's 99% FPU number-crunching, and only rarely do I need integer
registers, or memory access.

Possibly more fortunately, my actual source code isn't even written
in assembly, I wrote a script to turn slightly higher level code
into assembly (do you remember I mentioned that about 2 years ago?).
So I could just implement it in my pre-processor, and not bloat any
source code.

I guess I should familiarise myself with the linux/x86_64 C
calling convention. Then again, I'm prepared to just use global
variables for simplicity. Re-entrancy is not an issue at all.

H. Peter Anvin

Oct 19, 2008, 12:15:33 AM

to Phil Carmody

Phil Carmody wrote:
>
> I guess I should familiarise myself with the linux/x86_64 C
> calling convention. Then again, I'm prepared to just use global
> variables for simplicity. Re-entrancy is not an issue at all.
>

A very simple way to deal with these kinds of issues across multiple
ABIs is to just pass in a structure as the sole argument. It means
relatively minimal differences between ABIs.
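
A rough sketch of the single-structure approach, assuming the Linux (System V) calling conventions on both targets. The struct layout, the routine name scale and the ARGP alias are invented for the example:

```nasm
; Hypothetical C side:
;   struct args { double fp0; double fres; };
;   void scale(struct args *a);        /* a->fres *= a->fp0 */

section .text
global scale
scale:
%if __BITS__ == 64
  %define ARGP rdi              ; System V x86-64: the pointer is already in rdi
%else
        mov     edx, [esp+4]    ; 32-bit cdecl: fetch the pointer from the stack
  %define ARGP edx              ; edx is a caller-saved scratch register
%endif
        fld     qword [ARGP+8]  ; load a->fres
        fmul    qword [ARGP+0]  ; multiply by a->fp0
        fstp    qword [ARGP+8]  ; store the product back into a->fres
        ret
```

Only the few lines that locate the pointer differ between the two builds; everything after that addresses fields relative to ARGP.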

Other than that, you can, as you say, use the preprocessor; either your
own or NASM's.

-hpa

Nathan...@gmail.com

Oct 19, 2008, 2:57:00 PM

On Oct 18, 7:11 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:
>
> p3_bl_1prime_multi2:
>

I can't get my aardvark to "sit" but it is really good at math:

; For Windows:
; nasm -f win32 --prefix _ -o sayprimes.obj sayprimes.asm
; gcc -o sayprimes.exe sayprimes.obj
;
; For Linux:
; nasm -f elf32 -o sayprimes.o sayprimes.asm
; gcc -o sayprimes sayprimes.o

section .data
ispmsg: db ' %d', 10, 0

section .text
global main
extern printf

main:

        push ecx
        finit
        xor ecx, ecx
mloop:
        push ecx
        call isprime
        inc ecx
        cmp ecx, 2147483647
        jle mloop
        pop ecx
        xor eax, eax
        ret

isprime:

        push ebp
        mov ebp, esp

        sub esp, 8
        push ecx
        push edx

        mov dword [ebp-8], 1
        cmp dword [ebp+8], 2
        jge next
        mov dword [ebp-8], 0
        jmp sayprime

next:
        jne nextt
        jmp sayprime

nextt:
        mov eax, [ebp+8]
        and eax, 1
        cmp eax, 0
        je nexttt

        fild dword [ebp+8]
        fsqrt
        fistp dword [ebp-4]

        mov ecx, 3
        jmp ccheck
floop:
        mov eax, [ebp+8]
        cdq
        idiv ecx
        cmp edx, 0
        jne check
        mov dword [ebp-8], 0
        jmp sayprime
check:
        add ecx, 2
ccheck:
        cmp ecx, [ebp-4]
        jle floop
        jmp sayprime

nexttt:
        mov dword [ebp-8], 0

sayprime:
        xor eax, eax
        cmp eax, [ebp-8]
        je goback
        mov eax, [ebp+8]
        push eax
        push ispmsg
        call printf
        add esp, 8

goback:
        pop edx
        pop ecx
        mov esp, ebp
        pop ebp
        ret 4

Nathan.

Phil Carmody

Oct 20, 2008, 4:14:26 AM

"H. Peter Anvin" <h...@zytor.com> writes:
> Phil Carmody wrote:
>>
>> I guess I should familiarise myself with the linux/x86_64 C calling
>> convention. Then again, I'm prepared to just use global
>> variables for simplicity. Re-entrancy is not an issue at all.
>
> A very simple way to deal with these kinds of issues across multiple
> ABIs is to just pass in a structure as the sole argument. It means
> relatively minimal differences between ABIs.

Pointers were the final problem I had.
My pre-processor needed to cope with code that had to dereference
pointers (of unknown size), sometimes in order to grab another
pointer (of unknown size).

Another issue I now have is that my tight little loops no longer
fit into a Jcc short <Imm8>, as the code has become a tad bloated,

e.g. (from objdump, with unresolved addresses):

  ce:   8b 35 00 00 00 00       mov    0x0,%esi
  d4:   83 c6 04                add    $0x4,%esi
->
  dd:   48 8b 34 25 00 00 00    mov    0x0,%rsi
  e4:   00
  e5:   48 83 c6 04             add    $0x4,%rsi

 10d:   31 d2                   xor    %edx,%edx
 10f:   31 c9                   xor    %ecx,%ecx
 111:   31 c0                   xor    %eax,%eax
->
 123:   48 31 d2                xor    %rdx,%rdx
 126:   48 31 c9                xor    %rcx,%rcx
 129:   48 31 c0                xor    %rax,%rax

I presume that there's not really much point in size-optimizing
such small things any more. Everything's in the L1 code cache,
that's surely the most important thing.

> Other than that, you can, as you say, use the preprocessor; either
> your own or NASM's.

Both, in the end. Quite a mess, to be honest. I really don't
like the multi-headed monster the x86 has turned into. It's a
shame that my asm is still faster than my C, otherwise I'd just
stick to C. My Alpha and Power C code usually run at near-as-damn-it
100%, with no visible chance to fill the pipelines any more even if I
were to hit assembly, so I've got into the habit of not needing
assembly language.

The preprocessor source now looks like this:
"""
segment .text

;;; Multifactorial with a multiply for 2 different numbers
;;; assume initial value is fres0 fres1
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:

$_STACK_FRAME_SAVE_$

xor $GP1$, $GP1$ ; hack to work on 64 archs
mov $GP1/32$, DWORD [p3bl_pitch] ; interleaving of the helpers
mov $GP4$, PWORD [p3bl_itom] ; doubles or floats depending on factorial.h
imul $GP1$, MULTIPLIER_STEP

; we always index forward by 2 numbers, so simply move the pointer
add $GP4$, $GP1$
"""

Where all those /\$GP(\d+)(?:\/(\d+))?\$/ are turned into general purpose
registers, either full width, or of the appropriate size, done in my
pre-processor script.

As you can see, I couldn't find a quick way of loading a 32-bit
value [p3bl_pitch] so that if you're on a 64-bit architecture,
the top 32 bits would be blanked. Nasm doesn't recognize that
mov[zs]x r32, r/m32
is just an ordinary 32-bit move with no extension.

Is there a macro which would compare operand sizes, and rewrite such
mov[zs]x's?

The output of the above chunk from the preprocessor is:
"""
segment .text

;;; Multifactorial with a multiply for 2 different numbers
;;; assume initial value is fres0 fres1
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:

push RBP
mov RBP, RSP
push RBX
; ignore r12-15

xor RBX, RBX ; hack to work on 64 archs
mov EBX, DWORD [p3bl_pitch] ; interleaving of the helpers
mov RSI, PWORD [p3bl_itom] ; doubles or floats depending on factorial.h
imul RBX, MULTIPLIER_STEP

; we always index forward by 2 numbers, so simply move the pointer
add RSI, RBX
"""
on x86_64 and
"""
segment .text

;;; Multifactorial with a multiply for 2 different numbers
;;; assume initial value is fres0 fres1
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:

push EBP
mov EBP, ESP
push EDI
push ESI
push EBX

xor EBX, EBX ; hack to work on 64 archs
mov EBX, DWORD [p3bl_pitch] ; interleaving of the helpers
mov ESI, PWORD [p3bl_itom] ; doubles or floats depending on factorial.h
imul EBX, MULTIPLIER_STEP
"""
on x86_32.

PWORD (a 'pointer word') will be expanded into DWORD or QWORD using
simple nasm macros in a machine-specific include file.
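
A guess at what such a machine-specific include could contain; the use of __BITS__ to choose between the two definitions is an assumption, not something the post states:

```nasm
; Sketch of a per-target include file defining the 'pointer word' size.
%if __BITS__ == 64
  %define PWORD qword           ; pointers are 8 bytes on x86_64
%else
  %define PWORD dword           ; pointers are 4 bytes on x86_32
%endif

; With that in place, a line such as
;       mov RSI, PWORD [p3bl_itom]
; reads as "mov RSI, qword [p3bl_itom]" in the 64-bit build, and the
; 32-bit counterpart "mov ESI, PWORD [p3bl_itom]" as a dword load.
```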

The source then continues as before, mostly obsessed with FPU stuff,
where basically my preprocessor lets me name the values on the stack,
and follows them as they move around:
"""
fld 1 as 1.
fld QWORD [p3bf_fp0] as fp
1. /= fp as fpr
fld QWORD [p3bf_fres1] as fres1'
fld fp as fp'
fp' -= fres1' as fres1
...
fucomi fres0 ; flags set
fxch fres0
sete $GP0/8$ ; hit or not
fres0 *= MULTIPLIER_TYPE [$GP4$]
"""
becomes:
"""
fld1 ; st0=1.=st0
fld QWORD [p3bf_fp0] ; st0=fp, 1.=st1
fdiv st1, st0 ; st0=fp, fpr=st1
fld QWORD [p3bf_fres1] ; st0=fres1', fp, fpr=st2
fld st1 ; st0=fp', fres1', fp, fpr=st3
fsub st0, st1 ; st0=fres1, fres1', fp, fpr=st3
...
fucomi st2 ; st0=1., fres1, fres0, fpr=st3
fxch st2 ; st0=fres0, fres1, 1., fpr=st3
sete AL ; hit or not
fmul MULTIPLIER_TYPE [RSI] ; st0=fres0, fres1, 1., fpr=st3
"""
or
"""
fld1 ; st0=1.=st0
fld QWORD [p3bf_fp0] ; st0=fp, 1.=st1
fdiv st1, st0 ; st0=fp, fpr=st1
fld QWORD [p3bf_fres1] ; st0=fres1', fp, fpr=st2
fld st1 ; st0=fp', fres1', fp, fpr=st3
fsub st0, st1 ; st0=fres1, fres1', fp, fpr=st3
...
fucomi st2 ; st0=1., fres1, fres0, fpr=st3
fxch st2 ; st0=fres0, fres1, 1., fpr=st3
sete AL ; hit or not
fmul MULTIPLIER_TYPE [ESI] ; st0=fres0, fres1, 1., fpr=st3
"""
on the two architectures.

So basically I've got something which just about works, but I've
wasted too much time on this already, so I'll just try to make do
with that.

Phil Carmody

Oct 20, 2008, 4:27:12 AM

Nathan...@gmail.com writes:
> On Oct 18, 7:11 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
> wrote:
>>
>> p3_bl_1prime_multi2:
>>
>
> I can't get my aardvark to "sit" but it is really good at math:
>
> ; For Windows:
> ; nasm -f win32 --prefix _ -o sayprimes.obj sayprimes.asm
> ; gcc -o sayprimes.exe sayprimes.obj
> ;
> ; For Linux:
> ; nasm -f elf32 -o sayprimes.o sayprimes.asm
> ; gcc -o sayprimes sayprimes.o
>
> section .data
> ispmsg: db ' %d', 10, 0
>
> section .text
> global main
> extern printf
>
> main:
>
> push ecx
> finit

Is that necessary any more? I thought that went out of the window
with the 486?

> xor ecx, ecx
> mloop:
> push ecx
> call isprime
> inc ecx
> cmp ecx, 2147483647
> jle mloop

For that, you want a sieve. Eratosthenes (with a wheel of 6) will do.

Chuck Crayne

Oct 20, 2008, 2:24:09 PM

On Mon, 20 Oct 2008 11:14:26 +0300
Phil Carmody <thefatphi...@yahoo.co.uk> wrote:

> Another issue I now have is that my tight little loops now no longer
> fit into a Jcc short <Imm8>, as the code has become a tad bloated

For those cases, as in your example, where the values will fit into a
32-bit word, you can continue to use the 32 bit register names, as the
hardware will automatically clear the high order 32 bits.
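
Applied to the fragment from the earlier post, that might look like the sketch below (meant to be assembled with -f elf64). The data definition is only a stand-in so the fragment is self-contained, and the label load_pitch is made up:

```nasm
bits 64

section .data
p3bl_pitch:     dd 3                    ; stand-in 32-bit value for the example

section .text
load_pitch:
        mov     ebx, dword [p3bl_pitch] ; writing ebx also zeroes bits 63..32 of rbx,
                                        ; so no "xor rbx, rbx" is needed beforehand
        xor     eax, eax                ; 31 c0: 2 bytes, and it clears all of rax,
                                        ; unlike the 3-byte 48 31 c0 "xor rax, rax"
```

The byte counts are the ones visible in the objdump excerpt quoted above.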

--
Chuck
http://www.pacificsites.com/~ccrayne/charles.html

Alexei A. Frounze

Oct 20, 2008, 5:31:20 PM

On Oct 20, 12:14 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:
...

> Another issue I now have is that my tight little loops now no longer
> fit into a Jcc short <Imm8>, as the code has become a tad bloated

Look closely at the Jcc instructions. There's a Jcc rel16/32 form with
encoding 00Fh, 08Xh, rel16/32, where X is the condition.
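
In NASM source the two forms can be requested explicitly with the short and near keywords. A small sketch with a made-up label, valid for 32-bit and 64-bit code:

```nasm
section .text
loop_top:
        dec     ecx
        jnz     short loop_top  ; 75 rel8: 2 bytes, target must lie within -128..+127 bytes
        jnz     near loop_top   ; 0F 85 rel32: 6 bytes, no practical range limit
```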

Alex

H. Peter Anvin

Oct 20, 2008, 6:31:46 PM

Phil Carmody wrote:
>>
>> push ecx
>> finit
>
> Is that necessary any more? I thought that went out of the window
> with the 486?
>

It is necessary if you want to reset the contents of the FPU; however,
the process initial conditions already have the FPU reset so it
shouldn't be necessary.

-hpa

Phil Carmody

Oct 21, 2008, 4:27:26 AM

It looks like when I just use Jcc <label>, NASM makes it Imm8 if
it can do (with -O, presumably). That's good enough for me.

Oh - a big thank you to the NASM crew - debian stable doesn't
incorporate NASM 2, so I had to install my own, and it configured
and compiled with nary a glitch. I particularly like the fact that
-Wall remains silent.

Phil

Nathan...@gmail.com

Oct 21, 2008, 4:27:19 PM

On Oct 20, 4:27 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:
>

> > push ecx
> > finit
>
> Is that necessary any more? I thought that went out of the window
> with the 486?
>

I always thought it was the preceding "wait" that the 486 made
obsolete. It is interesting to note that Nasm has inserted this
instruction (without my "asking" for it) into the binary:

51 PUSH ECX
9B WAIT
DBE3 FINIT

Hey Herbert! Nasm is not 1-to-1!!! How much more "evidence" do I
need to gather to prove that "NASM is not an assembler" by your
standards? :)

Nathan.

Rod Pemberton

Oct 22, 2008, 3:52:37 AM

<Nathan...@gmail.com> wrote in message
news:84d6af50-1a37-4ac4...@26g2000hsk.googlegroups.com...

> It is interesting to note that Nasm has inserted this
> instruction (without my "asking" for it) into the binary:
>
> 51 PUSH ECX
> 9B WAIT
> DBE3 FINIT

DBE3 is not FINIT. It's FNINIT:

51 PUSH ECX
9B WAIT

DBE3 FNINIT

9BDBE3 is FINIT.

It appears you "asked" for it...
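
For reference, a sketch of the two mnemonics side by side in NASM source, using the encodings given above (the label is made up):

```nasm
section .text
fpu_reset_example:
        fninit                  ; DB E3: initialise the FPU without a preceding WAIT
        finit                   ; 9B DB E3: WAIT followed by the same DB E3
        ret
```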

Rod Pemberton

Nathan...@gmail.com

Oct 22, 2008, 9:21:38 PM

On Oct 22, 3:52 am, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> <NathanCBa...@gmail.com> wrote in message

Must be an 'artifact' of Ollydbg's disassembly.

Nathan.
