thanks
dan
This is actually quite easy, the hard part is to do it perfectly. :-)
You must start by scaling your input number by a suitable power of 10,
such that the resulting number is in the [1.0 - 10.0> range.
The proper way to do this is to start by taking the binary exponent,
multiply this number by log10(2) to get an estimated power of ten (e10).
Calculate 10^abs(e10), and compare this number with the input value.
If it is too big, adjust it.
If the input value was less than 1.0, multiply by the scale factor,
otherwise divide.
At this point the remaining code should be trivial, but you might like a
little tip:
To generate successive decimal digits from a number like 5.4567823, you
can just make a loop which extracts the integer portion, and then
subtracts that value, something like:
next:
FIST [digit] ; scaled_number
FILD [digit] ; floor(number) scaled_number
FSUBP st(1),st ; remainder
FMUL [ten]
mov eax,[digit]
add al,'0'
mov [edi],al
dec ecx
jnz next
This is probably faster than a BCD conversion!
Even better is to use the code above to extract N-digit blocks, and then
use integer only code to convert this to decimal digits.
A properly prescaled (in fp?) value can be converted to decimal using
nothing but LEA, AND and SHR operations, so this can be _very_ quick.
:-)
My best code uses less than 75 cycles worst case on a plain Pentium to
convert a 32-bit unsigned value to ascii, with a 3 to 5 digit loop this
can be even more efficient.
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
It's not very elegant, but works for all cases.
-----------------------------------------------
;
; Convert Numbers to Display Format Using 8087
;
; The 80387 will store integers in Packed Decimal format. This
routine
; multiplies a number by a power of ten equal to the number of decimal
; places, stores the rounded packed decimal value and converts the
; result to ASCII, inserting the decimal point and sign.
;
; The 80387 is reset in this routine.
;
; Entry Point: OCONVERT - Number already on 8087 stack top
;
; Calling Sequence: Number to convert is on stack top
; DX - Format, DH = Decimal Places, DL = Total
Length
; including sign and decimal point.
;
; Return: OUTA - ASCII Number right adjusted (Format
length)
; BX = Offset of first character in OUTA
; Carry - Return code (Won't fit in format)
;
PUSH SI
PUSH ES
MOV BX,DS
MOV ES,BX
XOR BX,BX ;Clear BX
MOV CX,BX ; and CX
MOV DI,OFFSET OUTA ;Place to store converted number
ADD DI,19 ;Point to end of output area
MOV BYTE PTR [DI+1],'$' ;Make sure $ delimiter in OUTA
PUSH DI ;Save end of area address
PUSH DX ;Save Format
MOV CL,DH ;Decimal Places to CX
CMP CL,0 ;Test if Integer
JE OC2 ;JMP if Integer
OC1: FIMUL TEN ;Scale by mulitplying by power of 10
LOOP OC1
OC2: FBSTP BCDWORK ;Store Packed Decimal
MOV SI,OFFSET BCDWORK ;Work Area
FINIT ;Make Certain Store Finished and reset
8087
MOV CL,DL ;Use length for loop count
OC3: CLD ;Make sure load direction is correct
LODSB ;Get byte = two digits
MOV AH,AL ;Copy into AH
SHR AH,4 ;Shift high order to numeric nibble
AND AX,0F0FH ;Put Zone on Digits
ADD AX,3030H
STD ;Store in Reverse Order
STOSB ;Store character
LOOP OC4 ;Loop Until Done
JMP SHORT OC7 ;Go insert sign
OC4: MOV AL,AH ;Store other digit
STOSB
LOOP OC3 ;Loop
OC7: CMP BYTE PTR BCDWORK+9,0 ;Test Sign Position
MOV AL," " ;Assume Plus
JE OC8 ;Jump if Positive
MOV AL,"-" ;Get Minus Sign
OC8: STOSB ;Store Sign
CLD ;DI points to position below sign
CMP DH,0 ;Test integer
JNE OC9 ;Insert decimal point if not integer
INC DI ;Point to sign position
MOV BX,DI
JMP SHORT LZ ;Remove Leading Zeros
; Insert decimal point
OC9: MOV CL,DL ;Total length to CX
SUB CL,DH ;Minus decimal places
INC CX ;Plus 1 for sign
MOV DX,DI ;Save sign position
MOV SI,DI
INC SI ;SI point to sign
REP MOVSB
MOV AL,"."
STOSB
MOV BX,DX ;BX = sign position
; Remove leading zeros
LZ: MOV AL,[BX] ;Get Sign
MOV AH,' ' ;Blank in AH
OC10: CMP [BX]+1,BYTE PTR '0' ;Leading zero?
JNE OC11 ;No. End.
CMP [BX]+2,BYTE PTR BYTE PTR '.' ;Is it next to decimal point
JE OC11 ;Leave leading zero in front of decimal
MOV [BX],AH ;Store blank
INC BX ;Next position
JMP SHORT OC10 ;Loop
OC11: MOV [BX],AL ;Store sign
POP DX ;Format
POP BX ;Address of end of area
CMP [BX],BYTE PTR ' ' ;Were all zeros removed
JNE OC12
MOV [BX],BYTE PTR '0'
OC12: MOV DH,0 ;Remove decimal places
SUB BX,DX ;Address of number
INC BX ; at format length
POP ES
POP SI
RET
OCONVERT ENDP
James E. Morrison
astr...@i84.net
astrolabe web pages at: http://myhouse.com/mc/planet/astrodir/astrolab.htm
;
Daniel wrote in message <7ah9bm$mbe$1...@winter.news.rcn.net>...
>Can I have some pointers on how to convert floating point numbers
>(32/64/80 bit) to bcd (which will then go to ascii for display).
>I would prefer just to know the maths behind this rather than see
>someone elses code. I have worked out a way to convert the part to the
>right of the binary point but not when this is then scaled by
>2^something.
>
>thanks
>dan
Terje,
it would be interesting to see your extremely fast integer to ASCII
conversion. The fastest I managed is about 20 cycles longer on a
PentiumMMX (see below).
-- Norbert
ulong2ascii converts a 32-bit unsigned integer into a decimal ASCIIZ string
(a null terminated string as used by the C language). A two part algorithm is
used. Using multiplication with the reciprocal, the integer is divided by
1e9. The quotient represents the most significant decimal digit. The
remainder, which is a number < 1e9, is then converted into a scaled integer
representation where 2^28 represents 1. In each step, the integer part is
extracted and forms the next decimal digit and the remaining fraction is then
multiplied by 10 in preparation for the extarction of the next decimal digit.
Define a symbol FASTMUL if the target CPU has a fast integer multiply (5
cycles or less), e.g. K6 and PentiumII. For CPUs with slow integer multiply
such as PentiumMMX, undefine FASTMUL to replace one of the multiplies with a
table look up. Note that the table lookup might cause cache misses.
This routine has been timed at 91 cycles on PentiumII with FASTMUL defined,
and at 98 cycles on a PentiumMMX with FASTMUL undefined, and cache hits during
table lookup. Timing is independent of the magnitude of the input.
This code has been tested exhaustively against the sprintf routines of both
the Microsoft Visual C and the Watcom C compiler.
; ; ulong2ascii ; ; input: ; eax = unsigned long to be converted to ASCIIZ
string ; edi = pointer to character buffer which receives result (at least
11 chars) ; ; output: ; none (buffer filled with numbers) ; ; destroys: ;
eax, ebx, ecx, edx, esi, edi ; eflags ;
MACRO ulong2ascii
.DATA
IFNDEF FASTMUL
cvttab DD 0000000000 ; 0 * 1e9
DD 1000000000 ; 1 * 1e9
DD 2000000000 ; 2 * 1e9
DD 3000000000 ; 3 * 1e9
DD 4000000000 ; 4 * 1e9
ENDIF
.CODE
mov ecx,eax ; save original argument
mov esi,89705f41h ; 1e-9*2^61 rounded
mul esi ; divide by 1e9 by mult. with recip.
add eax,80000000h ; round division result
mov esi,0abcc7712h ; 2^28/1e8 * 2^30 rounded up
adc edx,0 ; EDX<31:29> = argument / 1e9
mov eax,ecx ; restore original argument
shr edx,29 ; leading decimal digit, 0...4
mov ecx,8 ; produce eight more digits
mov ebx,edx ; flags whether non-zero digit seen yet
or edx,'0' ; convert digit to ASCII
mov [edi],dl ; store out to memory
cmp ebx,1 ; first digit nonzero ? CY=0 : CY=1
sbb edi,-1 ; incr. pointer if first digit non-zero
IFDEF FASTMUL
imul ebx,1000000000 ; multiply quotient digit by divisor
sub eax,ebx ; remainder after first digit
ELSE
sub eax,[cvttab+ebx*4] ; subtract quotient digit * divisor
ENDIF
mul esi ; convert number < 1e9
shld edx,eax, 2 ; into fraction such
inc edx ; that 1.0 = 2^28
mov eax,edx ; save result
shr eax,28 ; next digit
and edx,0fffffffh ; fraction part
or ebx,eax ; any non-zero yet ?
or eax,'0' ; convert digit to ASCII
$cvt_loop:
mov [edi],al ; store digit out to memory
add edx,edx ; 2*fraction
cmp ebx,1 ; any non-zero digit seen ? CY=0 : CY=1
lea edx,[edx*4+edx] ; 10*fraction, new digit EAX<31:28>,
; new fraction EAX<27:0>
sbb edi,-1 ; incr. ptr if any non-zero digit seen
mov eax,edx ; save result
shr eax,28 ; next digit = integer part
and edx,0fffffffh ; fraction part
or ebx,eax ; any non-zero digit yet ?
or eax,'0' ; convert digit to ASCII
dec ecx ; one more digit
jnz $cvt_loop ; until all nine digits done
mov [edi],al ; store last digit out to memory
mov byte ptr [edi+1],ah ; place string end marker
ENDM
>
> --
> - <Terje.M...@hda.hydro.com>
> Using self-discipline, see http://www.eiffel.com/discipline
> "almost all programming can be viewed as an exercise in caching"
>
-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
That's still very nice code, my version is slightly faster because I use
an initial split on 100,000.
After this I can use properly pipelined code to extract digits from both
of those 5-digit blocks in parallel!
Otherwise I use exactly the same kind of thinking as in your version.
Here's both a C and an inline asm version of my algorithm:
You'll notice that the asm version uses 4 MUL/IMUL operations, which are
responsible for the major part of the cycles on a Pentium, but which
also makes it very quick on a PII.
You'll also notice that except for the MULs, every single cycle have a
pair of fast instructions, with no lost slots of stalls. :-)
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
char *dtoa_c(char *buf, unsigned long n)
{
unsigned _int64 l64, h64;
unsigned long l, h, i;
l64 = n;
l64 *= 2814749767;
l64 += n >> 1;
h = (unsigned long) (l64 >> 48);
l = n - h * 100000;
l64 = l;
h64 = h;
h64 *= 429497;
l64 *= 429497;
buf[0] = (char) (h64 >> 32) + '0';
buf[5] = (char) (l64 >> 32) + '0';
h = (unsigned long) h64;
l = (unsigned long) l64;
h = (h + 15) >> 4;
l = (l + 15) >> 4;
for (i = 1; i < 5; i++) {
h *= 10;
l *= 10;
buf[i] = (char) (h >> 28) + '0';
buf[i+5] = (char) (l >> 28) + '0';
h &= 0x0fffffff;
l &= 0x0fffffff;
}
buf[10] = '\0';
return buf;
}
char *dtoa(char *buf, unsigned long n)
{
__asm {
mov ebx,[n]
mov eax,2814749767
mul ebx
shr ebx,1
xor ecx,ecx
mov edi,[buf]
add eax,ebx
adc ecx,edx
mov eax,100000
shr ecx,16 // ECX = high part
mov ebx,[n]
imul eax,ecx // High part * 100k
sub ebx,eax // Low part
mov eax,429497
mul ecx
mov ecx,eax
add dl,'0'
mov eax,429497
mov [edi],dl
mul ebx
mov ebx,eax
add ecx,7
shr ecx,3
add dl,'0'
mov [edi+5],dl
add ebx,7
shr ebx,3
lea ecx,[ecx+ecx*4]
mov edx,ecx
and ecx,0fffffffh
shr edx,28
lea ebx,[ebx+ebx*4]
add dl,'0'
mov eax,ebx
shr eax,28
mov [edi+1],dl
and ebx,0fffffffh
add al,'0'
mov [edi+6],al
lea ecx,[ecx+ecx*4]
lea ebx,[ebx+ebx*4]
mov edx,ecx
mov eax,ebx
and ecx,07ffffffh
shr edx,27
and ebx,07ffffffh
shr eax,27
add dl,'0'
add al,'0'
mov [edi+2],dl
mov [edi+7],al
lea ecx,[ecx+ecx*4]
lea ebx,[ebx+ebx*4]
mov edx,ecx
mov eax,ebx
and ecx,03ffffffh
shr edx,26
and ebx,03ffffffh
shr eax,26
add dl,'0'
add al,'0'
mov [edi+3],dl
mov [edi+8],al
lea ecx,[ecx+ecx*4]
shr ecx,25
lea ebx,[ebx+ebx*4]
shr ebx,25
add cl,'0'
add bl,'0'
mov [edi+10],ah
mov [edi+4],cl
mov [edi+9],bl
}
return buf;
}
Very nice. It's seems you saved cycles by
1. Unrolling the loop (I loose one cycle per loop iteration, or 9 cycles)
2. not compressing out leading zeros (I loose one cycle per loop iteration,
or another 9 cycles)
This seems to make up for the majority of the difference in the speed of
our codes on a Pentium. The parallelization (i.e. working on two blocks in
parallel) doesn't seem to buy much on a Pentium, but should be great on PII,
or K7. But it makes the problem of compressing out leading zeros harder.
What would your solution look like if you had to compress out leading zeros?
Of course, in the context of the original posters question, one wouldn't
want to compress out the leading zeros.
-- Norbert
This actually saved more since I work on two digits/iteration, and can
skip the multiplication by 2 (shift), leaving only a LEA R,[R+R*4] to do
the multiplication by 5.
> 2. not compressing out leading zeros (I loose one cycle per loop iteration,
> or another 9 cycles)
This might cost more, since you'll have at least one branch mispredict
on any loop.
>
> This seems to make up for the majority of the difference in the speed of
> our codes on a Pentium. The parallelization (i.e. working on two blocks in
> parallel) doesn't seem to buy much on a Pentium, but should be great on PII,
> or K7.
No, the pairing makes most of a difference on a Pentium, since there's
no auto-parallelization on this cpu. The P6-family cpus are capable of
figuring out at least some independent operations by itself.
OTOH, on a PII with the 3-4 cycle MUL, the dual-issue or better on all
the remaining parts of the code makes a relatively bigger difference.
The nice symmetry of my solution just feels "right". :-)
> But it makes the problem of compressing out leading zeros harder.
> What would your solution look like if you had to compress out leading zeros?
Do that as a separate step at the end. Besides, when I'm formatting
large numbers I like to insert the proper spacer between every three
digits, so I have to do a second step anyway.
> Of course, in the context of the original posters question, one wouldn't
> want to compress out the leading zeros.
Right.
Terje
I agree with this analysis. Just for the fun of it, I will unroll my loop
and take out the leading zero suppression, and measure it again.
> > This seems to make up for the majority of the difference in the speed of
> > our codes on a Pentium. The parallelization (i.e. working on two blocks in
> > parallel) doesn't seem to buy much on a Pentium, but should be great on PII,
> > or K7.
>
> No, the pairing makes most of a difference on a Pentium, since there's
> no auto-parallelization on this cpu. The P6-family cpus are capable of
> figuring out at least some independent operations by itself.
I think we are talking about two different things here. Yes, Pentium is
an in-order machine with strict rules on dual decode/execute (what decodes
together executes together), while P6 is an OOO machine with decoupled
decode and execute.
I understood your original mail to claim that the major benefit is from
splitting 10000:100000 instead of 1e9:100000000, and a quick comparison
of the two codes didn't bear that out, instead revealing that my code is
slower mostly due to the loop and the suppression of leading zeros. I
agree that UV-pairing is important on Pentium, and I have strived (striven?)
to get as much pairing going in my code as possible.
As I said, I will remove the loop and the zero suppression and then time
the code again, so we have an apples to apples comparison. Might take a
while though since I am off to the Intel Developer Conference.
-- Norbert
> ju...@my-dejanews.com wrote:
> > Very nice. It's seems you saved cycles by
> > 2. not compressing out leading zeros (I loose one cycle per loop iteration,
> > or another 9 cycles)
> This might cost more, since you'll have at least one branch mispredict
> on any loop.
OK, so we have a problem, we've only solved one in a family of problems - lovely though your code might be for the one under scrutiny. (but TANSTATFC [*])
So, in the case of needing to strip leading zeroes:
If we have control over the interface, could we not specify that the procedure
returned a pointer to the first non-zero character.
Say we set up two registers (EBP & ESI appear usable) with the address of
the first character in the output buffer.
At each stage in the unrolled loop, can we perform a test on the zero-ness
of the digit, and if zero, increment the pointer by 1. At the end we need
to recombine the two pointers (if the [0-5] range is 5, use the one in the
[5-10] range).
This would elide the branch mis-predict, which I seem to remember is worth
avoiding on all recent Pentia.
But can it be done? I forget the 'test for zero'-like bit twiddles now.
If there were a register shortage, apart from the final stage of the unroll,
these two 'pointers' could be xh and xl 8-bit registers, in EBX or EDX, and
one or both of these could be replaced by ESI or EBP.
FatPhil
* I dunno, with favourable alignment, perhaps
replace 8
"add ?l, '0'"
with 2
"add [edi+?], 30303030h" equivalents such as
"mov e?x, [edi+?];...; add e?x, 30303030h;...; mov [edi+?], e?x"
--
Phil Carmody
Not speaking for or on behalf of Nokia Telecommunications.
In C++, the phrase "client code shouldn't and doesn't care about these
parts" is spelled "private," and privates are best hidden (except in
some Scandinavian countries with more liberal laws). Herb Sutter 1997
OK, as promised I took my code, unrolled the loop, and removed the suppression
of leading zeroes. This also allowed me to restrict the registers destroyed
to EAX, ECX, and EDX. This is nice because under MSVC, these register don't
need to be preserved across function calls. I'll append the new code below. I
timed both my new code and Terje's code on a PentiumMMX, a K6-2, and a PII.
Here are the measured latencies in cycles:
Terje's code Norbert's code Norbert's code
FASTMUL undefined FASTMUL defined
PentiumMMX 71 76 84
K6-2 46 57 55
PII 46 62 63
As you can see, Terje's code is the hands-down winner. Did anybody expect
anything else :-) As I anticipated, the larger part of the overhead in
running my original code on the Pentium versus Terje's code was due to
the use of a loop in my code versus Terje's unrolled code, and the absence
of leading zero suppression in Terje's code.
After I modified my code to allow for an apples-to-apples comparison, I am
getting pretty close to Terje's code on the Pentium (98->76 cycles). However,
Terje's code has two independent computational stream, allowing better
instruction scheduling. My new code suffers from AGIs on Pentium. The
explicit parallelism of Terje's approach also helps in a significant way
on the K6-2 and PII which are out of order machines. While my new code
allows for some instruction level parallelism, it is still essentially a
single dependency chain (which limits the parallelism an OOO machine can
extract), versus two such chains in Terje's code.
Finally, Terje made the observation that multiplies by 10 to extract the
quotient digits can be simplified to multiplies by 5 (with an adjustment
of the shift count).
Terje, if you should ever be interested in a code optimization job here
in Silicon Valley, drop me a line at ju...@acm.org :-)
-- Norbert
FASTMUL TEXTEQU <1>
;;---------------------------------------------------------------------------
-- ;; ULONG2ASCII converts a 32-bit unsigned integer into a decimal ASCIIZ
string ;; (a null terminated string as used by the C language). Leading zeros
are not ;; suppressed, i.e. the output is identical to the output of
printf("%010lu"). ;; A two part algorithm is used. Using multiplication with
the reciprocal, the ;; integer is divided by 1e9. The quotient represents the
most significant ;; decimal digit. The remainder, which is a number < 1e9, is
then converted ;; into a scaled integer representation where 2^28 represents
1. In each step, ;; the integer part is extracted and forms the next decimal
digit and the ;; remaining fraction is then multiplied by 10 in preparation
for the ;; extraction of the next decimal digit. ;; ;; Define a symbol
FASTMUL if the target CPU has a fast integer multiply (5 ;; cycles or less),
e.g. K6 and PentiumII. For CPUs with slow integer multiply ;; such as
PentiumMMX, undefine FASTMUL to replace one of the multiplies with ;; a table
look up. Note that the table lookup might cause cache misses. ;; ;; This
routine has been timed at 76 cycles on a PentiumMMX with FASTMUL ;;
undefined, and cache hits during table lookup. Timing is independent of the
;; magnitude of the input. ;; ;; This code has been tested exhaustively
against the sprintf routine of the ;; Microsoft Visual C compiler. ;; ;; IN:
EAX unsigned long to be converted ;; EDI pointer to resulting string (at
least 11 characters) ;; DESTROYS: EAX, ECX, EDX
;;---------------------------------------------------------------------------
--
ULONG2ASCII MACRO
.DATA
align 4
IFNDEF FASTMUL
cvttab DD 0000000000 ; 0 * 1e9
DD 1000000000 ; 1 * 1e9
DD 2000000000 ; 2 * 1e9
DD 3000000000 ; 3 * 1e9
DD 4000000000 ; 4 * 1e9
ENDIF
.CODE
MOV ECX, EAX ; save original argument MOV EDX, 89705F41h ; 1e-9*2^61
rounded MUL EDX ; divide by 1e9 by mult. with recip. ADD EAX, 80000000h
; round division result ADC EDX, 0 ; EDX<31:29> = argument / 1e9 MOV
EAX, ECX ; restore original argument SHR EDX, 29 ; leading decimal digit,
0...4 MOV ECX, EDX ; save first quotient digit OR EDX, '0' ; convert
digit to ASCII MOV [EDI], DL ; store out to memory MOV EDX, 0ABCC7712h
; 2^28/1e8 * 2^30 rounded up IFDEF FASTMUL IMUL ECX, 1000000000 ; multiply
1st quotient digit by divisor SUB EAX, ECX ; remainder after first digit
ELSE SUB EAX, cvttab[ECX*4] ; subtract quotient digit * divisor ENDIF MUL
EDX ; convert number < 1e9 SHLD EDX, EAX, 2 ; into fraction such INC
EDX ; that 1.0 = 2^28 MOV EAX, EDX ; save result
SHR EDX, 28 ; next digit AND EAX, 0FFFFFFFh ; fraction part OR EDX,
'0' ; convert digit to ASCII ADD EAX, EAX ; 2*fraction MOV [EDI+1], DL
; store digit out to memory LEA EAX, [EAX*4+EAX] ; 10*fraction, new digit
EAX<31:28>, new fraction EAX<27:0> MOV EDX, EAX ; save result SHR EAX,
28 ; next digit = integer part AND EDX, 0FFFFFFFh ; fraction part OR
EAX, '0' ; convert digit to ASCII ADD EDX, EDX ; 2*fraction MOV
[EDI+2], AL ; store digit out to memory LEA EDX, [EDX*4+EDX] ;
10*fraction, new digit EDX<31:28>, new fraction EDX<27:0> MOV EAX, EDX ;
save result
SHR EDX, 28 ; next digit AND EAX, 0FFFFFFFh ; fraction part OR EDX,
'0' ; convert digit to ASCII ADD EAX, EAX ; 2*fraction MOV [EDI+3], DL
; store digit out to memory LEA EAX, [EAX*4+EAX] ; 10*fraction, new digit
EAX<31:28>, new fraction EAX<27:0> MOV EDX, EAX ; save result SHR EAX,
28 ; next digit = integer part AND EDX, 0FFFFFFFh ; fraction part OR
EAX, '0' ; convert digit to ASCII ADD EDX, EDX ; 2*fraction MOV
[EDI+4], AL ; store digit out to memory LEA EDX, [EDX*4+EDX] ;
10*fraction, new digit EDX<31:28>, new fraction EDX<27:0> MOV EAX, EDX ;
save result
SHR EDX, 28 ; next digit AND EAX, 0FFFFFFFh ; fraction part OR EDX,
'0' ; convert digit to ASCII ADD EAX, EAX ; 2*fraction MOV [EDI+5], DL
; store digit out to memory LEA EAX, [EAX*4+EAX] ; 10*fraction, new digit
EAX<31:28>, new fraction EAX<27:0> MOV EDX, EAX ; save result SHR EAX,
28 ; next digit = integer part AND EDX, 0FFFFFFFh ; fraction part OR
EAX, '0' ; convert digit to ASCII ADD EDX, EDX ; 2*fraction MOV
[EDI+6], AL ; store digit out to memory LEA EDX, [EDX*4+EDX] ;
10*fraction, new digit EDX<31:28>, new fraction EDX<27:0> MOV EAX, EDX ;
save result
SHR EDX, 28 ; next digit = integer part AND EAX, 0FFFFFFFh ; fraction
part OR EDX, '0' ; convert digit to ASCII ADD EAX, EAX ; 2*fraction
MOV [EDI+7], DL ; store digit out to memory LEA EAX, [EAX*4+EAX] ;
10*fraction, new digit EAX<31:28>, new fraction EAX<27:0> MOV EDX, EAX ;
save result SHR EAX, 28 ; next digit = integer part AND EDX, 0FFFFFFFh
; fraction part OR EAX, '0' ; convert digit to ASCII ADD EDX, EDX ;
2*fraction MOV [EDI+8], AL ; store digit out to memory LEA EAX,
[EDX*4+EDX] ; 10*fraction, new digit EAX<31:28>, new fraction EAX<27:0> SHR
EAX, 28 ; next digit = integer part MOV [EDI+10], AH ; place string end
marker OR EAX, '0' ; convert digit to ASCII MOV [EDI+9], AL ; store
last digit out to memory
ULONG2ASCII ENDM
> Terje's code Norbert's code Norbert's code
> FASTMUL undefined FASTMUL defined
>
> PentiumMMX 71 76 84
> K6-2 46 57 55
> PII 46 62 63
>
>;;---------------------------------------------------------------------------
Have you considered using table lookups ?
I think , this should be doable in ~20-30 cycles.
You'd need >10K for the tables , but if you are doing _lots_ of conversions,
this should be significantly faster.
(If not , who cares anyway ?)
-- qscgz
Nice! :-)
> running my original code on the Pentium versus Terje's code was due to
> the use of a loop in my code versus Terje's unrolled code, and the absence
> of leading zero suppression in Terje's code.
>
> After I modified my code to allow for an apples-to-apples comparison, I am
> getting pretty close to Terje's code on the Pentium (98->76 cycles). However,
> Terje's code has two independent computational stream, allowing better
> instruction scheduling. My new code suffers from AGIs on Pentium. The
Indeed. This was the main idea behind the split on 100,000:
To allow fully independent instruction scheduling.
Secondly, with just 4+1 pairs of digits, using unrolled code was a lot
more natural.
> explicit parallelism of Terje's approach also helps in a significant way
> on the K6-2 and PII which are out of order machines. While my new code
> allows for some instruction level parallelism, it is still essentially a
> single dependency chain (which limits the parallelism an OOO machine can
> extract), versus two such chains in Terje's code.
>
> Finally, Terje made the observation that multiplies by 10 to extract the
> quotient digits can be simplified to multiplies by 5 (with an adjustment
> of the shift count).
This also depends on having unrolled code, so it was a nice bonus from
the approach I selected.
> Terje, if you should ever be interested in a code optimization job here
> in Silicon Valley, drop me a line at ju...@acm.org :-)
I've gotten quite a few offers over the years, and done some work via
email/net, but I like it here in Oslo, Norway. :-)
Terje
Well, I am not personally a big friend of table based methods for anything
unless the tables are very small. More often than not I have found that the
memory accesses come back to bite you.
In this case, I am not quite sure how a table lookup based algorithm would
work. Could you please outline what the contents of the tables and the code
would look like?
-- Norbert
decimal(eax)=decimal(al)+256*decimal(ah)+256^2*decimal(a2)+256^3*decimal(a3)
I've been thinking at something like this:
(not yet optimized ; untested )
mov ebx,0 \ mov bl,al \ movq mm0,[t0+ebx*8]
mov bl,ah \ paddb mm0,[t1+ebx*8]
shr eax,16 \ mov bl,al \ paddb mm0,[t2+ebx*8]
mov bl,ah \ paddb mm0,[t3+ebx*8] \ movq mm1,[t4+ebx*8]
movq [t],mm0 \ movq [t+8],mm1
mov bx,[t+8] \ mov ax,[t5+ebx*2] \ mov [t+8],ax \ mov bx,[t6+ebx*2]
add bx,[t+6] \ mov ax,[t5+ebx*2] \ mov [t+6],ax \ mov bx,[t6+ebx*2]
add bx,[t+4] \ mov ax,[t5+ebx*2] \ mov [t+4],ax \ mov bx,[t6+ebx*2]
add bx,[t+2] \ mov ax,[t5+ebx*2] \ mov [t+2],ax \ mov bx,[t6+ebx*2]
add bx,[t+0] \ mov ax,[t5+ebx*2] \ mov [t+0],ax \ mov bx,[t6+ebx*2]
;adjusting 2 decimal digits at once
ret
t0:db 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,1, .. 0,0,0,0,0,2,5,5 ; 8*256 bytes
t1:db 0,0,0,0,0,0,0,0, 0,0,0,0,0,2,5,6, .. 0,0,0,6,5,2,8,0
t2:db 0,0,0,0,0,0,0,0, 0,0,0,6,5,5,3,6, .. 1,6,7,1,1,6,8,0
t3:db 0,0,0,0,0,0,0,0, 1,6,7,7,7,2,1,6, .. 7,8,1,9,0,0,8,0
t4:db 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0, .. 0,0,0,0,0,0,4,2
t5:db '000102030405060708091011..40' , ..,'1011..50..,..,40 ; 40*2*40 bytes
t6:db 0,0, 0,0, ..,.., 0,4 ; 40*2*40 bytes
t:db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ;-------------------
r:db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ~16K
optimistically counting ~25 cycles on a PMMX , when optimized
and parallellized .
-- qscgz
Bjorn Stromvall
---
mov ecx,eax ; save original argument
mov edx,89705f41h ; 1e-9*2^61 rounded
mul edx ; divide by 1e9 by mult. with recip.
add eax,eax ; round division result
inc edi
adc edx,0 ; EDX<31:29> = argument / 1e9
mov eax,ecx ; restore original argument
shr edx,29 ; leading decimal digit, 0...4
mov ecx,'0' ; convert digit to ASCII
or ecx,edx
mov [edi+10-1],dh ; place string end marker
;
sub eax,[cvttab+edx*4] ; subtract quotient digit * divisor
mov edx,0abcc7712h ; 2^28/1e8 * 2^30 rounded up
mul edx ; convert number < 1e9
shr eax,30
mov [edi-1],cl ; store out to memory
lea ecx,[4*edx+1]
lea edx,[4*edx+1]
add edx,eax ; that 1.0 = 2^28
add eax,ecx ; that 1.0 = 2^28
REPT 7
shr eax,28 ; next digit
and edx,0fffffffh ; fraction part
or eax,"0"
add edx,edx ; 2*fraction
mov [edi],al ; store digit out to memory
inc edi
lea eax,[edx*4+edx] ; 10*fraction, new digit EAX<31:28>, new fract.
EAX<27:0>
lea edx,[edx*4+edx] ; 10*fraction, new digit EDX<31:28>, new fract.
EDX<27:0>
ENDM
shr eax,28 ; next digit
and edx,0fffffffh ; fraction part
or eax,"0"
mov [edi],al ; store digit out to memory
lea eax,[edx*4+edx] ; 10*fraction, new digit EAX<31:28>, new fract.
EAX<27:0>
shr eax,27 ; next digit = integer part
or eax,'0' ; convert digit to ASCII
mov [edi+1],al ; store last digit out to memory
Very nice! I have to study this a bit more closely to see what are
all the things you did. I will also stick this into my timing frame
work and measure it across all the platform, then post an updated
comparison.
Looks like I am a bit rusty as far as code optimizations are concerned.
I worked in CPU design the past five years and have only recently
switched back to software. It's also hard to remember what is optimal
across a larger group of different CPUs. Unfortunately, I don't have
that much time to look at each piece of code. I recall the good old
days when I was still a student and could puzzle around for hours on
just one tiny piece of code. Well, enough excuses :-)
Isn't it amazing how the competitive spirit in this newsgroup can
whittle down almost any problem quickly?
-- Norbert
> ---
>
> mov ecx,eax ; save original argument
> mov edx,89705f41h ; 1e-9*2^61 rounded
>
> mul edx ; divide by 1e9 by mult. with recip.
>
> add eax,eax ; round division result
> inc edi
>
> adc edx,0 ; EDX<31:29> = argument / 1e9
> mov eax,ecx ; restore original argument
>
> shr edx,29 ; leading decimal digit, 0...4
> mov ecx,'0' ; convert digit to ASCII
>
> or ecx,edx
> mov [edi+10-1],dh ; place string end marker
>
> ;
>
> sub eax,[cvttab+edx*4] ; subtract quotient digit * divisor
> mov edx,0abcc7712h ; 2^28/1e8 * 2^30 rounded up
>
> mul edx ; convert number < 1e9
>
> shr eax,30
> mov [edi-1],cl ; store out to memory
>
> lea ecx,[4*edx+1]
> lea edx,[4*edx+1]
>
> add edx,eax ; that 1.0 = 2^28
> add eax,ecx ; that 1.0 = 2^28
>
> REPT 7
> shr eax,28 ; next digit
> and edx,0fffffffh ; fraction part
>
> or eax,"0"
> add edx,edx ; 2*fraction
>
> mov [edi],al ; store digit out to memory
> inc edi
>
> lea eax,[edx*4+edx] ; 10*fraction, new digit EAX<31:28>, new fract.
> EAX<27:0>
> lea edx,[edx*4+edx] ; 10*fraction, new digit EDX<31:28>, new fract.
> EDX<27:0>
> ENDM
>
> shr eax,28 ; next digit
> and edx,0fffffffh ; fraction part
>
> or eax,"0"
>
> mov [edi],al ; store digit out to memory
> lea eax,[edx*4+edx] ; 10*fraction, new digit EAX<31:28>, new fract.
> EAX<27:0>
>
> shr eax,27 ; next digit = integer part
>
> or eax,'0' ; convert digit to ASCII
>
> mov [edi+1],al ; store last digit out to memory
The "trick" is basically to make the "inner loop" as fast as possible,
taking
lea AGI and pairing into consideration. I also used lea with a correctly
shifted ASCII "0" in most of the ASCII convertions...
(This can also be used for the last 2 ACSII conversions in Terje's code
to save a cycle on a PMMX).
Timing is indicated as being simply 1 cycle/pair (9 cycles for the muls)
New code as follows; input:eax=number, edi=buffer, modified=eax,ecx,edx
BTW, I have no idea why things get posted twice, will try to look into
this...
Bjorn Stromvall
---
mov ecx,eax ; save original argument
mov edx,89705f41h ; 1e-9*2^61 rounded ;2^61/1e9
mul edx ; divide by 1e9 by mult. with recip.
add eax,eax ; round division result
;
adc edx,0 ; EDX<31:29> = argument / 1e9
mov eax,ecx ; restore original argument
shr edx,29 ; leading decimal digit, 0...4
mov ecx,'0' ; convert digit to ASCII
or ecx,edx
mov [edi+10],dh ; place string end marker
;
;
sub eax,[cvttab+edx*4] ; subtract quotient digit * divisor
mov edx,0abcc7712h ; 2^28/1e8 * 2^30 rounded up ;2^58/1e8
mul edx ; convert number < 1e9
shr eax,30
mov [edi],cl ; store out to memory
lea ecx,[4*edx+1]
lea edx,[4*edx+1]
add edx,eax ; that 1.0 = 2^28
add eax,ecx ; that 1.0 = 2^28
shr eax,28 ; next digit
and edx,0fffffffh ; fraction part
or eax,"0"
nop
lea ecx,[edx*4+edx]
lea edx,[edx*4+edx]
shr ecx,27 ; next digit
and edx,07ffffffh ; fraction part
mov [edi+1],al ; store digit out to memory
or ecx,"0"
CNTR = 0
REPT 6
lea eax,[edx*4+edx+("0" SHL (26-CNTR))]
lea edx,[edx*4+edx]
shr eax,26-CNTR
and edx,(1 SHL (26-CNTR))-1
mov [edi+CNTR+3],al
nop
CNTR = CNTR+1
ENDM
lea eax,[edx*4+edx+("0" SHL 20)]
;
shr eax,20
mov [edi+2],cl ; store `special' digit out to memory
mov [edi+9],al ; store last digit out to memory
;
Terje Bjorn (ver 1) Bjorn (ver 2) Norbert unrolled
P55C 71 64 56 (62 w/o NOPs) 76
K6-2 46 60 51 (47 w/o NOPs) 57
PentiumII 46 88 71 (72 w/o NOPs) 62
I am puzzled about the PII results, as Bjorn's code seems to run
significantly slower than my unrolled version from which it is
derived. I used the same measuring framework I used the last time
(using RDTSC to measure elapsed time, and CPUID to serialize the
machine to make sure we catch all the execution cycles on the OOO
processors). I subtract out the overhead measured with no code
between the start clock and stop clock code. So, I am reasonably
sure that the numbers are for real. In case someone would like to
double-check the PII latencies, the PII I used is a Deschutes, 333
MHz. I bought it immediately when they became available last year
(in order to reverse engineer the FXSAVE/FXRSTOR instructions :-)
Is it possible that the PII has problems with the excessive number
of LEA instructions in Bjorn's code? One major difference between
my code and Bjorn's is that he cut a cycle from the dependency chain
for each digit by duplicating an LEA, instead of doing one LEA and
duplicating the result as my original code did.
Terje's code still has a bit of an advantage on the OOO processors
as it exposes more parallelism. One minor drawback is that his code
uses EBX in addition to EAX, ECX, and EDX used my the other codes.
I think we have pretty much beaten this horse to death now. But if
anybody thinks he can make the code faster still, I'd be happy to
run it through it's paces ...
-- Norbert
In article <7bp65h$rg0$1...@winter.news.rcn.net>,
<al...@umea.mail.telia.com> wrote:
> Well, after wasting(?) some more time I saved another 8 cycles in the
> "inner loop" for 54 cycles on a PentiumMMX. Since this is the only
> machine I have available now & I only considered optimizing for it, it
> may very well not be optimal for a K6/P6 (remove the nops perhaps...)...
>
> The "trick" is basically to make the "inner loop" as fast as possible,
> taking
> lea AGI and pairing into consideration. I also used lea with a correctly
> shifted ASCII "0" in most of the ASCII convertions...
> (This can also be used for the last 2 ACSII conversions in Terje's code
> to save a cycle on a PMMX).
>
> Timing is indicated as being simply 1 cycle/pair (9 cycles for the muls)
>
> New code as follows; input:eax=number, edi=buffer, modified=eax,ecx,edx
>
> BTW, I have no idea why things get posted twice, will try to look into
> this...
>
> Bjorn Stromvall
>
> ---
>
> mov ecx,eax ; save original argument
> mov edx,89705f41h ; 1e-9*2^61 rounded ;2^61/1e9
>
> mul edx ; divide by 1e9 by mult. with recip.
>
> add eax,eax ; round division result
> ;
>
> adc edx,0 ; EDX<31:29> = argument / 1e9
> mov eax,ecx ; restore original argument
>
> shr edx,29 ; leading decimal digit, 0...4
> mov ecx,'0' ; convert digit to ASCII
>
> or ecx,edx
> mov [edi+10],dh ; place string end marker
>
> ;
> ;
>
> sub eax,[cvttab+edx*4] ; subtract quotient digit * divisor
> mov edx,0abcc7712h ; 2^28/1e8 * 2^30 rounded up ;2^58/1e8
>
> mul edx ; convert number < 1e9
>
> shr eax,30
> mov [edi],cl ; store out to memory
>
> lea ecx,[4*edx+1]
> lea edx,[4*edx+1]
>
> add edx,eax ; that 1.0 = 2^28
> add eax,ecx ; that 1.0 = 2^28
>
> shr eax,28 ; next digit
> and edx,0fffffffh ; fraction part
>
> or eax,"0"
> nop
>
> lea ecx,[edx*4+edx]
> lea edx,[edx*4+edx]
>
> shr ecx,27 ; next digit
> and edx,07ffffffh ; fraction part
>
> mov [edi+1],al ; store digit out to memory
> or ecx,"0"
>
> CNTR = 0
> REPT 6
> lea eax,[edx*4+edx+("0" SHL (26-CNTR))]
> lea edx,[edx*4+edx]
>
> shr eax,26-CNTR
> and edx,(1 SHL (26-CNTR))-1
>
> mov [edi+CNTR+3],al
> nop
> CNTR = CNTR+1
> ENDM
>
> lea eax,[edx*4+edx+("0" SHL 20)]
> ;
>
> shr eax,20
> mov [edi+2],cl ; store `special' digit out to memory
>
> mov [edi+9],al ; store last digit out to memory
> ;
As I said, it's optimized for the PMMX, so it might not
perform too well on P6 machines... It would be nice to get
an expert opinion on why it is so slow, is it just too
many P0 uops?
BTW, Norbert, I'm not sure where your extra 2 clocks come from.
I thought you added something like
mov eax,[]
mov edi,[]
initially, but this should just take 1 clock on a PMMX.
Perhaps cache bank conflicts (or even AGI stalls)? The
mov edi,[] could also be put in THE empty slot (now there's
only one!). Or I could just be missing something, it won't
be the first time!
Note that the table has changed & the mul operand is stored here too.
The code now modifies all registers, not ideal for HLL... The basic
idea was to get rid of the second multiply and get all digits using
the first one & double (almost) percision multiply by 5 initially.
On a wider machine with more registers(renaming) it could be slightly
faster by allowing for more simultaneous leas, but since this could
be the reason why at least the P6 dislikes my code, I won't try it...
Bjorn Stromvall
---
mul [mulr] ; div by 1e9 by mul with recip. ;2^61/1e9
shr eax,3 ; lo>>3
;
mov ebx,eax
lea ebp,[edx+4*edx+2] ; *5 hi + fix
shr ebx,32-3-1 ; rounding for 1st digit
lea ecx,[eax+4*eax] ; *5 lo>>3
shr ecx,32-3 ; *5 lo>>32
mov [edi+10],bh ; place string end marker
adc ebp,ecx ; *5 finished & fixed in ebp rounded
add edx,ebx ; edx <31:29> = argument / 1e9 rounded
mov eax,ebp ; for 3rd digit
mov ecx,ebp ; 2nd digit
and eax,(1 SHL 28)-1; for 3rd digit
lea esi,[ebp*4+ebp] ; 4th..10th
shr edx,29 ; leading decimal digit, 0...4
and esi,(1 SHL 27)-1
lea ebx,[eax*4+eax] ; 3rd digit
add edx,"0" ; 1st ACSII digit
lea eax,[esi*4+esi+("0" SHL 26)]
lea esi,[esi*4+esi] ; 4th..10th (skipped one mask step...)
shr eax,26
and esi,(1 SHL 26)-1
mov ebp,[cvttab2+4*edx-4*"0"] ; รท10* 1st digit
mov [edi+3],al
lea eax,[esi*4+esi+("0" SHL 25)]
lea esi,[esi*4+esi]
shr eax,25
and esi,(1 SHL 25)-1
shr ebx,27 ; 3rd digit
mov [edi+4],al
lea eax,[esi*4+esi+("0" SHL 24)]
lea esi,[esi*4+esi]
shr eax,24
and esi,(1 SHL 24)-1
sub ecx,ebp ; fix 2nd digit
mov [edi+5],al
lea eax,[esi*4+esi+("0" SHL 23)]
lea esi,[esi*4+esi]
shr eax,23
and esi,(1 SHL 23)-1
shr ecx,28 ; 2nd digit
mov [edi+6],al
lea eax,[esi*4+esi+("0" SHL 22)]
lea esi,[esi*4+esi]
shr eax,22
and esi,(1 SHL 22)-1
or ebx,"0"
mov [edi+7],al
lea eax,[esi*4+esi+("0" SHL 21)]
lea esi,[esi*4+esi]
shr eax,21
and esi,(1 SHL 21)-1
mov [edi],dl ; store 1st digit out to memory
mov [edi+8],al
lea eax,[esi*4+esi+("0" SHL 20)]
mov [edi+2],bl ; store 3rd digit out to memory
shr eax,20
or ecx,"0"
mov [edi+9],al ; store last digit out to memory
mov [edi+1],cl ; store 2nd digit out to memory
mulr dd 89705F41h ; 1e-9*2^61 rounded ;2^61/1e9
cvttab2 dd 00 SHL 28
dd 10 SHL 28
dd 20 SHL 28
dd 30 SHL 28
dd 40 SHL 28