<<<
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:
push ebp
mov ebp, esp
push edi
push esi
push ebx
>>>
As all the pushes do this:
<<<
factorial_m_p3.asm:18: error: instruction not supported in 64-bit mode
factorial_m_p3.asm:20: error: instruction not supported in 64-bit mode
factorial_m_p3.asm:21: error: instruction not supported in 64-bit mode
factorial_m_p3.asm:22: error: instruction not supported in 64-bit mode
>>>
I don't want to simply fork the file into an x86 and an x86_64 version
with different register names, but can do that if it's the simplest thing
to do. How have other people coped with these kinds of issues? Some
macros, perhaps for preamble and postamble? Macros just for register
names? Any advice would be welcome.
Phil
--
The fact that a believer is happier than a sceptic is no more to the
point than the fact that a drunken man is happier than a sober one.
The happiness of credulity is a cheap and dangerous quality.
-- George Bernard Shaw (1856-1950), Preface to Androcles and the Lion
In 64-bit mode you can only push 8 or 2 bytes, not 4 with a single
PUSH instruction. You could indeed use some special macros for
register names, e.g. _rax that would expand to different things in
different CPU modes (eax or rax).
Alex
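A minimal sketch of the register-name macros suggested above, assuming NASM 2.x and its standard __BITS__ macro. The alias names (_ax, _bx, _bp, _sp) and the routine name frame_demo are made up for illustration and are not from the original code.

```nasm
; Register aliases that follow the current assembly mode.
; _ax/_bx/_bp/_sp are hypothetical names; __BITS__ is NASM's own macro.
%if __BITS__ == 64
  %define _ax rax
  %define _bx rbx
  %define _bp rbp
  %define _sp rsp
%else
  %define _ax eax
  %define _bx ebx
  %define _bp ebp
  %define _sp esp
%endif

section .text
global frame_demo               ; hypothetical routine name
frame_demo:
        push    _bp             ; 8-byte push in 64-bit mode, 4-byte in 32-bit
        mov     _bp, _sp
        push    _bx             ; callee-saved under both the i386 and x86-64 ABIs
        xor     eax, eax        ; return 0
        pop     _bx
        pop     _bp
        ret
```

The same source should then assemble with either -f elf32 or -f elf64, since the output format sets the default mode that __BITS__ reports.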
If you just want to get a 32-bit executable (from tools that default to
64 bits), try adding "-melf_i386" to the command line to ld. If you want
a 64-bit executable, you'll have to change the register names... and
probably other things. This 64-bit shit - completely different ABI - is
a "nail in the coffin of assembly language" IMHO.
Best,
Frank
> probably other things. This 64-bit shit - completely different ABI - is
> a "nail in the coffin of assembly language" IMHO.
Is there any free place left where you can insert an additional nail?
> Some
> macros, perhaps for preamble and postamble? Macros just for register
> names? Any advice would be welcome.
You can use the __OUTPUT_FORMAT__ macro, either in open code, or
within macros, to handle such things. However, that's just the tip of
the iceberg. All address variables must be 64 bit. The Linux syscalls
have been renumbered. And if you call any library functions, the
parameters are passed in registers, instead of on the stack.
--
Chuck
http://www.pacificsites.com/~ccrayne/charles.html
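To make the renumbered syscalls concrete, here is a small sketch that switches on __OUTPUT_FORMAT__ in open code. It assumes Linux and the format names elf32 and elf64; the message, the labels, and the build lines in the comments are illustrative only.

```nasm
; Assumed build lines:
;   nasm -f elf64 hello.asm && ld -o hello hello.o
;   nasm -f elf32 hello.asm && ld -m elf_i386 -o hello hello.o
section .data
msg:    db 'hello', 10
msglen  equ $ - msg

section .text
global _start
_start:
%ifidn __OUTPUT_FORMAT__, elf64
        mov     eax, 1          ; write is syscall 1 on x86-64 Linux
        mov     edi, 1          ; fd 1 = stdout
        mov     rsi, msg        ; addresses are 64 bits wide
        mov     edx, msglen
        syscall
        mov     eax, 60         ; exit is syscall 60 on x86-64 Linux
        xor     edi, edi
        syscall
%else                           ; assumption: anything else is 32-bit Linux
        mov     eax, 4          ; write is syscall 4 on i386 Linux
        mov     ebx, 1          ; fd 1 = stdout
        mov     ecx, msg
        mov     edx, msglen
        int     0x80
        mov     eax, 1          ; exit is syscall 1 on i386 Linux
        xor     ebx, ebx
        int     0x80
%endif
```

Note that both the numbers and the argument registers change, which is why a register-name macro alone does not cover syscalls.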
Well, there are plenty of proper, and improper, uses for assembly
language. Trying to write portable applications is obviously not one of
them. However, there are plenty of cases where you can get a many times
speedup by writing core routines in assembly, and then it's probably
worth writing for each target architecture -- and x86-32 and x86-64 are
different target architectures (as is x86-16; it is probably more
different from the other two than they are from each other).
-hpa
Fortunately I couldn't give a flying aardvark about my routines
calling any other routines - I'm writing optimised inner loops,
nothing more.
From the small straw poll it seems that just having a macro which
expands to different register names for the two architectures is
the simplest way to go. Fortunately that's easy for my inner loops,
as it's 99% FPU number-crunching, and only rarely do I need integer
registers, or memory access.
Possibly more fortunately, my actual source code isn't even written
in assembly, I wrote a script to turn slightly higher level code
into assembly (do you remember I mentioned that about 2 years ago?).
So I could just implement it in my pre-processor, and not bloat any
source code.
I guess I should familiarise myself with the Linux/x86_64 C
calling convention. Then again, I'm prepared to just use global
variables for simplicity. Re-entrancy is not an issue at all.
A very simple way to deal with these kinds of issues across multiple
ABIs is to just pass in a structure as the sole argument. It means
relatively minimal differences between ABIs.
Other than that, you can, as you say, use the preprocessor; either your
own or NASM's.
-hpa
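A rough sketch of the single-structure approach, assuming the cdecl convention on 32-bit and the System V AMD64 convention on 64-bit Linux (Win64 would pass the pointer in rcx instead). The structure layout, the LOAD_ARG and _ptr macros, and the routine name add_pair are all hypothetical.

```nasm
; Matching C side (assumed):
;   struct params { double x, y, sum; };
;   void add_pair(struct params *p);
struc params
  .x:   resq 1                  ; double
  .y:   resq 1                  ; double
  .sum: resq 1                  ; double, written by the routine
endstruc

%if __BITS__ == 64
  %define _ptr rax
  %define LOAD_ARG mov rax, rdi       ; SysV AMD64: first argument is in rdi
%else
  %define _ptr eax
  %define LOAD_ARG mov eax, [esp+4]   ; cdecl: first argument is above the return address
%endif

section .text
global add_pair
add_pair:
        LOAD_ARG                        ; the only ABI-specific line
        fld     qword [_ptr + params.x]
        fadd    qword [_ptr + params.y]
        fstp    qword [_ptr + params.sum]
        ret
```

Keeping the fields to fixed-size types such as doubles means the offsets are the same on both targets; pointer fields would change the layout.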
I can't get my aardvark to "sit" but it is really good at math:
; For Windows:
; nasm -f win32 --prefix _ -o sayprimes.obj sayprimes.asm
; gcc -o sayprimes.exe sayprimes.obj
;
; For Linux:
; nasm -f elf32 -o sayprimes.o sayprimes.asm
; gcc -o sayprimes sayprimes.o
section .data
ispmsg: db ' %d', 10, 0
section .text
global main
extern printf
main:
push ecx
finit
xor ecx, ecx
mloop:
push ecx
call isprime
inc ecx
cmp ecx, 2147483647
jle mloop
pop ecx
xor eax, eax
ret
isprime:
push ebp
mov ebp, esp
sub esp, 8
push ecx
push edx
mov dword [ebp-8], 1
cmp dword [ebp+8], 2
jge next
mov dword [ebp-8], 0
jmp sayprime
next:
jne nextt
jmp sayprime
nextt:
mov eax, [ebp+8]
and eax, 1
cmp eax, 0
je nexttt
fild dword [ebp+8]
fsqrt
fistp dword [ebp-4]
mov ecx, 3
jmp ccheck
floop:
mov eax, [ebp+8]
cdq
idiv ecx
cmp edx, 0
jne check
mov dword [ebp-8], 0
jmp sayprime
check:
add ecx, 2
ccheck:
cmp ecx, [ebp-4]
jle floop
jmp sayprime
nexttt:
mov dword [ebp-8], 0
sayprime:
xor eax, eax
cmp eax, [ebp-8]
je goback
mov eax, [ebp+8]
push eax
push ispmsg
call printf
add esp, 8
goback:
pop edx
pop ecx
mov esp, ebp
pop ebp
ret 4
Nathan.
Pointers were the final problem I had.
My pre-processor needed to cope with code that had to dereference
pointers (of unknown size), sometimes in order to grab another
pointer (of unknown size).
Another issue I now have is that my tight little loops now no longer
fit into a Jcc short <Imm8>, as the code has become a tad bloated
e.g. (from objdump, with unresolved addresses)
ce: 8b 35 00 00 00 00 mov 0x0,%esi
d4: 83 c6 04 add $0x4,%esi
->
dd: 48 8b 34 25 00 00 00 mov 0x0,%rsi
e4: 00
e5: 48 83 c6 04 add $0x4,%rsi
10d: 31 d2 xor %edx,%edx
10f: 31 c9 xor %ecx,%ecx
111: 31 c0 xor %eax,%eax
->
123: 48 31 d2 xor %rdx,%rdx
126: 48 31 c9 xor %rcx,%rcx
129: 48 31 c0 xor %rax,%rax
I presume that there's not really much point in size-optimizing
such small things any more. Everything's in the L1 code cache,
that's surely the most important thing.
> Other than that, you can, as you say, use the preprocessor; either
> your own or NASM's.
Both, in the end. Quite a mess, to be honest. I really don't
like the multi-headed monster the x86 has turned into. It's a
shame that my asm is still faster than my C, otherwise I'd just
stick to C. My Alpha and Power C code usually runs at
near-as-damn-it 100%, with no visible chance to fill the pipelines any
more even if I were to hit assembly, so I've got into the habit
of not needing assembly language.
The preprocessor source now looks like this:
"""
segment .text
;;; Multifactorial with a multiply for 2 different numbers
;;; assume initial value is fres0 fres1
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:
$_STACK_FRAME_SAVE_$
xor $GP1$, $GP1$ ; hack to work on 64 archs
mov $GP1/32$, DWORD [p3bl_pitch] ; interleaving of the helpers
mov $GP4$, PWORD [p3bl_itom] ; doubles or floats depending on factorial.h
imul $GP1$, MULTIPLIER_STEP
; we always index forward by 2 numbers, so simply move the pointer
add $GP4$, $GP1$
"""
Where all those /\$GP(\d+)(?:\/(\d+))?\$/ are turned into general purpose
registers, either full width, or of the appropriate size, done in my
pre-processor script.
As you can see, I couldn't find a quick way of loading a 32-bit
value [p3bl_pitch] so that if you're on a 64-bit architecture,
the top 32 bits would be blanked. Nasm doesn't recognize that
mov[zs]x r32, r/m32
is just an ordinary 32-bit move with no extension.
Is there a macro which would compare operand sizes, and rewrite such
mov[zs]x's?
The output of the above chunk from the preprocessor is:
"""
segment .text
;;; Multifactorial with a multiply for 2 different numbers
;;; assume initial value is fres0 fres1
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:
push RBP
mov RBP, RSP
push RBX
; ignore r12-15
xor RBX, RBX ; hack to work on 64 archs
mov EBX, DWORD [p3bl_pitch] ; interleaving of the helpers
mov RSI, PWORD [p3bl_itom] ; doubles or floats depending on factorial.h
imul RBX, MULTIPLIER_STEP
; we always index forward by 2 numbers, so simply move the pointer
add RSI, RBX
"""
on x86_64 and
"""
segment .text
;;; Multifactorial with a multiply for 2 different numbers
;;; assume initial value is fres0 fres1
cglobal p3_bl_1prime_multi2
p3_bl_1prime_multi2:
push EBP
mov EBP, ESP
push EDI
push ESI
push EBX
xor EBX, EBX ; hack to work on 64 archs
mov EBX, DWORD [p3bl_pitch] ; interleaving of the helpers
mov ESI, PWORD [p3bl_itom] ; doubles or floats depending on factorial.h
imul EBX, MULTIPLIER_STEP
"""
on x86_32.
PWORD (a 'pointer word') will be expanded into DWORD or QWORD using
simple nasm macros in a machine-specific include file.
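For illustration, one way such a PWORD macro might look, here folded into a single conditional block rather than the separate machine-specific include files described above. The _ax alias, the load_itom label, and the .bss reservation are assumptions made to keep the sketch self-contained.

```nasm
; PWORD follows the pointer size of the current mode.
%if __BITS__ == 64
  %define PWORD QWORD
  %define _ax   rax             ; hypothetical full-width register alias
%else
  %define PWORD DWORD
  %define _ax   eax
%endif

section .bss
p3bl_itom: resb 8               ; room for a pointer of either size

section .text
global load_itom                ; hypothetical routine name
load_itom:
        mov     _ax, PWORD [p3bl_itom]  ; loads 4 or 8 bytes to match the register
        ret
```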
The source then continues as before, mostly obsessed with FPU stuff,
where basically my preprocessor lets me name the values on the stack,
and follows them as they move around:
"""
fld 1 as 1.
fld QWORD [p3bf_fp0] as fp
1. /= fp as fpr
fld QWORD [p3bf_fres1] as fres1'
fld fp as fp'
fp' -= fres1' as fres1
...
fucomi fres0 ; flags set
fxch fres0
sete $GP0/8$ ; hit or not
fres0 *= MULTIPLIER_TYPE [$GP4$]
"""
becomes:
"""
fld1 ; st0=1.=st0
fld QWORD [p3bf_fp0] ; st0=fp, 1.=st1
fdiv st1, st0 ; st0=fp, fpr=st1
fld QWORD [p3bf_fres1] ; st0=fres1', fp, fpr=st2
fld st1 ; st0=fp', fres1', fp, fpr=st3
fsub st0, st1 ; st0=fres1, fres1', fp, fpr=st3
...
fucomi st2 ; st0=1., fres1, fres0, fpr=st3
fxch st2 ; st0=fres0, fres1, 1., fpr=st3
sete AL ; hit or not
fmul MULTIPLIER_TYPE [RSI] ; st0=fres0, fres1, 1., fpr=st3
"""
or
"""
fld1 ; st0=1.=st0
fld QWORD [p3bf_fp0] ; st0=fp, 1.=st1
fdiv st1, st0 ; st0=fp, fpr=st1
fld QWORD [p3bf_fres1] ; st0=fres1', fp, fpr=st2
fld st1 ; st0=fp', fres1', fp, fpr=st3
fsub st0, st1 ; st0=fres1, fres1', fp, fpr=st3
...
fucomi st2 ; st0=1., fres1, fres0, fpr=st3
fxch st2 ; st0=fres0, fres1, 1., fpr=st3
sete AL ; hit or not
fmul MULTIPLIER_TYPE [ESI] ; st0=fres0, fres1, 1., fpr=st3
"""
on the 2 arch's.
So basically I've got something which just about works, but I've
wasted too much time on this already, so I'll just try to make do
with that.
Is that necessary any more? I thought that went out of the window
with the 486?
> xor ecx, ecx
> mloop:
> push ecx
> call isprime
> inc ecx
> cmp ecx, 2147483647
> jle mloop
For that, you want a sieve. Eratosthenes (with a wheel of 6) will do.
> Another issue I now have is that my tight little loops now no longer
> fit into a Jcc short <Imm8>, as the code has become a tad bloated
For those cases, as in your example, where the values will fit into a
32-bit word, you can continue to use the 32 bit register names, as the
hardware will automatically clear the high order 32 bits.
--
Chuck
http://www.pacificsites.com/~ccrayne/charles.html
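A short sketch of that point in 64-bit mode: a write to a 32-bit register zeroes the upper half of the full register, so no separate xor is needed before a 32-bit load. The value of MULTIPLIER_STEP, the contents of p3bl_pitch, and the routine name are placeholders, not values from the thread.

```nasm
bits 64
default rel

%define MULTIPLIER_STEP 8       ; hypothetical value

section .data
p3bl_pitch: dd 3                ; hypothetical contents

section .text
global scaled_pitch             ; hypothetical routine name
scaled_pitch:
        mov     eax, dword [p3bl_pitch]   ; writing EAX clears bits 63..32 of RAX
        imul    rax, rax, MULTIPLIER_STEP ; safe to use the full register now
        xor     ecx, ecx                  ; 31 C9: clears all of RCX in 2 bytes
        xor     rdx, rdx                  ; 48 31 D2: same effect on RDX, 1 byte longer
        ret
```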
Look closely at the Jcc instructions. There's a Jcc rel16/32 form with
encoding 00Fh, 08Xh, rel16/32, where X is the condition.
Alex
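A small sketch of the two Jcc forms, using JNZ as the condition. The routine and label names are invented; the opcode bytes in the comments are the standard encodings for JNZ.

```nasm
bits 64
section .text
global count_down               ; hypothetical routine name
count_down:
        mov     ecx, 10
.top:
        dec     ecx
        jnz     short .top      ; 75 cb: rel8 form, an error if the target is out of range
        mov     ecx, 10
.again:
        dec     ecx
        jnz     near .again     ; 0F 85 cd: rel32 form, 6 bytes
        ret
```

Leaving off the short/near keyword lets the assembler choose, which depends on the optimisation setting of the NASM version in use.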
It is necessary if you want to reset the contents of the FPU; however,
the process initial conditions already have the FPU reset so it
shouldn't be necessary.
-hpa
It looks like when I just use Jcc <label>, NASM makes it Imm8 if
it can do (with -O, presumably). That's good enough for me.
Oh - a big thank you to the NASM crew - debian stable doesn't
incorporate NASM 2, so I had to install my own, and it configured
and compiled with nary a glitch. I particularly like the fact that
-Wall remains silent.
Phil
I always thought it was the preceding "wait" that the 486 made
obsolete. It is interesting to note that Nasm has inserted this
instruction (without my "asking" for it) into the binary:
51 PUSH ECX
9B WAIT
DBE3 FINIT
Hey Herbert! Nasm is not 1-to-1!!! How much more "evidence" do I
need to gather to prove that "NASM is not an assembler" by your
standards? :)
Nathan.
DBE3 is not FINIT. It's FNINIT:
51 PUSH ECX
9B WAIT
DBE3 FNINIT
9BDBE3 is FINIT.
It appears you "asked" for it...
Rod Pemberton
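The difference is easy to check by assembling both mnemonics and looking at the listing (for example with nasm -l); a minimal sketch:

```nasm
bits 32
section .text
        fninit                  ; DB E3: initialise without checking pending exceptions
        finit                   ; 9B DB E3: WAIT prefix followed by FNINIT
        wait                    ; 9B on its own
```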
Must be an 'artifact' of Ollydbg's disassembly.
Nathan.