[++] Re: itoa() revisited

Terje Mathisen

unread,

Jan 3, 1998, 3:00:00 AM1/3/98

to

DOSTERT Patrick wrote:
>
> for those looking for a fast integer to ascii routine
> here it is
>
> I know it looks like a mess, that's because it is :-)
>
> anyway, it is written with NASM (get it, you'll love it)
> an is supposed to be used from a C-prog
>
> declared as " void itoaAsm( int n, char* p); "
>
> it takes a signed 32bit integer and a pointer to some memory location where to put the result
>
> it returns nothing, just fills the given memory loc. with the ascii representation of the given int, null terminated
>
> it's fast !
>
> any comments welcome ( any really )

> _itoaAsm:

[snipped a _very_ long unrolled asm section]

OK, your code is basically an unrolled series of compare & subtract, for
all possible digit positions.

This means that it will be maximally slow for a number like
1.999.999.999, so if you want fast code, it is better to look for an
algorithm which is both fast and independent of the input values.

The best I've been able to come up with is to use code that first splits
the number into two parts, modulo 100.000, and then converts both parts
to ascii simultaneously.

The conversions are handled by scaling each number into a 4:28 scaled
integer, where themost significant decimal digit ends up in the top 4
bits, and the remainder is the last 4 digits.

This code works on the full unsigned range, i.e. 0 to 0xffffffff, and it
uses 3 or 4 MUL operations, and _no_ DIVs at all! :-)

I don't remember exactly what the running time is, but it has to be less
than 100 cycles worst case on a Pentium, probably something like 70. On
a PPro it is very nice, since MULs are much faster on these cpus!

The posted compare & subtract code will probably be _very_ bad on a
PPro, since you'll get at least one mis-predicted branch on every digit
position.

OTOH, if most numbers are quite small, then it won't be too bad anyway,
and the branch misses will still be slightly better than the usual
DIV-based library code.

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,

Jan 3, 1998, 3:00:00 AM1/3/98

to

Mark Gibson wrote:
> I'm looking for a fast set of itoa and atoi conversion routines
> that will work on large integers (128-bits or about 36 decimal digits).
> If you or anyone else knows where I can find such routines in x86
> assembler, please let me know.

That's easy:

For itoa128() I'd start by splitting the number into 9-digit groups,
using long division by 1.000.000.000.

This will give you one to four such groups which can then be trivially
converted with ltoa() and then concatenated, or you can just use

sprintf(buf,"%l%09l%09l%09l", t3, t2, t1, t0);

and then possibly strip off leading zeroes.

For atoi128() you just work in the opposite direction: Split the input
number into 9-digit groups and convert them into (unsigned?) longs.

Finally you'll have to use a little asm to multiply each part by 1e9
before adding it to the next-most significant part.

DOSTERT Patrick

unread,

Jan 4, 1998, 3:00:00 AM1/4/98

to

On 3 Jan 1998 20:48:10 GMT, Terje Mathisen <Terje.M...@hda.hydro.com> wrote:

>[snipped a _very_ long unrolled asm section]
>

unrolling the loops doubled the speed :-)

>This means that it will be maximally slow for a number like
>1.999.999.999, so if you want fast code, it is better to look for an
>algorithm which is both fast and independent of the input values.
>

it takes 38 cycles for 1 digit numbers and 5 to 20 additionnal cycles per digit > 1

1.999.999.999 needs +- 195

complete timings:

999999999 190 555555555 135
99999999 169 55555555 121
9999999 151 5555555 110
999999 134 555555 100
99999 113 55555 86
9999 93 5555 73
999 77 555 63
99 56 55 49
9 38 5 38
888888888 182 111111111 71
88888888 162 11111111 64
8888888 145 1111111 61
888888 129 111111 59
88888 109 11111 53
8888 90 1111 49
888 75 111 47
88 55 11 41
8 38 1 38

obviously it's hard to predict but it is much better then divs

>The best I've been able to come up with is to use code that first splits
>the number into two parts, modulo 100.000, and then converts both parts
>to ascii simultaneously.
>
>The conversions are handled by scaling each number into a 4:28 scaled
>integer, where themost significant decimal digit ends up in the top 4
>bits, and the remainder is the last 4 digits.
>
>This code works on the full unsigned range, i.e. 0 to 0xffffffff, and it
>uses 3 or 4 MUL operations, and _no_ DIVs at all! :-)
>

now that sounds interesting ! ( as usual, it comes from you)

but could you give some more details ? maybe the complete code ?

cu
P

" Do it only with the Best "

Terje Mathisen

unread,

Jan 6, 1998, 3:00:00 AM1/6/98

to

DOSTERT Patrick wrote:
>=20

> On 3 Jan 1998 20:48:10 GMT, Terje Mathisen <Terje.M...@hda.hydro.com> wrote:

>=20

> >[snipped a _very_ long unrolled asm section]
> >

>=20

> unrolling the loops doubled the speed :-)

How did you measure this?

A single run calibrated with RDTSC, or multiple runs in a big loop?

If the latter, did you vary the input number semi-randomly between each
call?

Your unrolled code is a near-perfect example of why it is very hard to
benchmark library code like itoa(): When called with a new number, almost
all the branches will be mis-predicted, while a bunch of calls will indeed
give near-perfect prediction.

I believe the only (or at least main) reason you got that 2X speed was due
to better branch prediction.

> >This means that it will be maximally slow for a number like
> >1.999.999.999, so if you want fast code, it is better to look for an
> >algorithm which is both fast and independent of the input values.
> >

>=20

> it takes 38 cycles for 1 digit numbers and 5 to 20 additionnal cycles per digit > 1

>=20
> 1.999.999.999 needs +- 195
>=20
> complete timings:
>=20

> 999999999 190 555555555 135
> 99999999 169 55555555 121
> 9999999 151 5555555 110
> 999999 134 555555 100
> 99999 113 55555 86
> 9999 93 5555 73
> 999 77 555 63
> 99 56 55 49
> 9 38 5 38
> 888888888 182 111111111 71
> 88888888 162 11111111 64
> 8888888 145 1111111 61
> 888888 129 111111 59
> 88888 109 11111 53
> 8888 90 1111 49
> 888 75 111 47
> 88 55 11 41
> 8 38 1 38

>=20

> obviously it's hard to predict but it is much better then divs

That is right, but with the bug unroll, you're going to use up an
appreciable part of both the code cache, and more importantly, the branch
target buffer.

>=20

> >The best I've been able to come up with is to use code that first splits
> >the number into two parts, modulo 100.000, and then converts both parts
> >to ascii simultaneously.
> >
> >The conversions are handled by scaling each number into a 4:28 scaled
> >integer, where themost significant decimal digit ends up in the top 4
> >bits, and the remainder is the last 4 digits.
> >
> >This code works on the full unsigned range, i.e. 0 to 0xffffffff, and it
> >uses 3 or 4 MUL operations, and _no_ DIVs at all! :-)
> >

>=20

> now that sounds interesting ! ( as usual, it comes from you)

>=20

> but could you give some more details ? maybe the complete code ?

OK, here's a copy of a post I made last year to a list server:

Hello!

This list seems to have fallen asleep, so I thought I'd write something
about some fast code I've been working on recently:

It started with the work I did on finding reciprocals giving exact DIV
results using a much faster MUL instead.

Starting with this, I have found a way to convert any (unsigned) 32-bit
value to ascii which is close to an order of magnitude faster than the
obvious DIV/MOD approach:

I first multiply the input number by the scaled reciprocal of 100,000:

2^48/100000 (rounded up)

The top 16 bits of the result (h) is exactly the same as would result
from a division by 100000, i.e. the top 5 digits of the input number.

I get the remainder, i.e. the bottom 5 digits, by multiplying back by
100000 and subtracting.

By multiplying (h) by ceil(2^32/100000) =3D 429497, the high part of the
result will be the first digit, and the remainder is the next 4
fractional bits.

Similarly I multiply (l) by 429497, which also results in a digit in the
high part (EDX) and a fraction in the low part.

After getting the first and sixth digits this way, I divide both
remainders by 8 (SHR) (while rounding up), and start on the final 8
digits:

Each digit is generated by multiplying the current value by 5 (using
LEA), taking the top 4 significant bits as the result, and then masking
off those bits.

I have measured this code as taking less than 50 cycles on a PPro, and
probably around 80 on a Pentium.

Both numbers compare very favourably to the 40 cycles taken for a single
DIV!

(It would probably be a win though to use a single test&branch to
specialcase input numbers below 100000, since these could be handled with
just a single MUL for setup, probably resulting in Pentium times in the
30-50 cycles range.)

Terje

PS. I used brute force to prove that this works: I tried all 2^32
possible input values, used atoi() to convert back to binary, and
compared the result with the original number.

I have both a (semi-)portable C version, depending only upon a 64-bit
integer type, and the optimized asm version. The asm code you will notice
has zero branches! :-)

PPS. Here is the actual code, (c) Terje Mathisen 1997:

char *dtoa_c(char *buf, unsigned long n)
{
unsigned _int64 l64, h64;
unsigned long l, h, i;

l64 =3D n;

l64 *=3D 2814749767;
l64 +=3D n >> 1;
h =3D (unsigned long) (l64 >> 48);
l =3D n - h * 100000;

l64 =3D l;
h64 =3D h;

h64 *=3D 429497;
l64 *=3D 429497;

buf[0] =3D (char) (h64 >> 32) + '0';
buf[5] =3D (char) (l64 >> 32) + '0';
h =3D (unsigned long) h64;
l =3D (unsigned long) l64;

h =3D (h + 15) >> 4;
l =3D (l + 15) >> 4;

for (i =3D 1; i < 5; i++) {
h *=3D 10;
l *=3D 10;
buf[i] =3D (char) (h >> 28) + '0';
buf[i+5] =3D (char) (l >> 28) + '0';
h &=3D 0x0fffffff;
l &=3D 0x0fffffff;
}
buf[10] =3D '\0';

return buf;
}

char *dtoa(char *buf, unsigned long n)
{
__asm {
mov ebx,[n]
mov eax,2814749767

mul ebx

shr ebx,1
xor ecx,ecx

mov edi,[buf]
add eax,ebx

adc ecx,edx
mov eax,100000

shr ecx,16 // ECX =3D high part
mov ebx,[n] // Retrieve org. number

imul eax,ecx // High part * 100k

sub ebx,eax // Remainder =3D Low part
mov eax,429497

mul ecx

mov ecx,eax
add dl,'0'

mov eax,429497
mov [edi],dl

mul ebx

mov ebx,eax
add ecx,7

shr ecx,3
add dl,'0'

mov [edi+5],dl
add ebx,7

shr ebx,3
lea ecx,[ecx+ecx*4]

mov edx,ecx
and ecx,0fffffffh

shr edx,28
lea ebx,[ebx+ebx*4]

add dl,'0'
mov eax,ebx

shr eax,28
mov [edi+1],dl

and ebx,0fffffffh
add al,'0'

mov [edi+6],al
lea ecx,[ecx+ecx*4]

lea ebx,[ebx+ebx*4]
mov edx,ecx

mov eax,ebx
and ecx,07ffffffh

shr edx,27
and ebx,07ffffffh

shr eax,27
add dl,'0'

add al,'0'
mov [edi+2],dl

mov [edi+7],al
lea ecx,[ecx+ecx*4]

lea ebx,[ebx+ebx*4]
mov edx,ecx

mov eax,ebx
and ecx,03ffffffh

shr edx,26
and ebx,03ffffffh

shr eax,26
add dl,'0'

add al,'0'
mov [edi+3],dl

mov [edi+8],al
lea ecx,[ecx+ecx*4]

shr ecx,25
lea ebx,[ebx+ebx*4]

shr ebx,25
add cl,'0'

add bl,'0'
mov [edi+10],ah

mov [edi+4],cl
mov [edi+9],bl
}

return buf;
}

--=20

DOSTERT Patrick

unread,

Jan 7, 1998, 3:00:00 AM1/7/98

to

Ok, ok, I'am convinced
I promise to never again produce bad branching code :-)

>> unrolling the loops doubled the speed :-)
>
>How did you measure this?
>

I used:

t = 0;
for(i = 0; i < 125000; i++)
{
for(n = 1; n < 100000000; n *= 11)
{
t += itoaAsm(-n,p);
t += itoaAsm( n,p);
}
}
printf("%f\n",(float)t / 2000000);

it gives nice different length numbers
with the return value added to t beeing the rdtsc count of the routine

>> but could you give some more details ? maybe the complete code ?
>
>OK, here's a copy of a post I made last year to a list server:
>

GREAT !!!
now that's code

I guess I'll need a weekend or two to get through this...
(I particularly enjoyed the "I used brute force..." part)

now something for the records:

my very first attempt at ascii-conversion and mmx (or the other way around)

this routine performs unbelievably bad
it has terrible timing/pipelining/you name it...
needs 280+ cycles (150+ for fbstp, 50 for emms, ...)
works only with 8 or less digit numbers

but it is mmx based !

which I belief doesn't happen at lot around here
anyone out there actually using mmx ? what for ?

anyway, it's a perfect example where mmx should be avoided at all costs

enjoy ! it's short, easy, mmx, commented (!) and SLOW

feel free to comment (on mmx, not the routine :-)

cu
P

" Do it only with the Best "

p.s. I used to read (and still do) Agner Fog's pentium optimization text
is there something similar dealing with mmx ? maybe an update ? anything ?
(yes, I do have the Intel manuals, but they are as usual :-)

--- begin cut here ---

BITS 32
GLOBAL _FnAsm

%define Value 8
%define Result 12

SECTION .text

_FnAsm:
push ebp
mov ebp,esp ; stack frame
pusha ; only eax, ecx and edx are used but, oh well...

fild dword [ebp+Value] ; number to convert
fbstp [temp] ; to BCD (150+ cycles, avoid at all costs)

movq mm0,[temp] ; load BCD value
movq mm1,mm0 ; copy
psrlq mm0,4 ; put the relevant bits in place
pand mm0,[cte0F] ; forget garbage
pand mm1,[cte0F] ; more to forget
punpcklbw mm1,mm0 ; combine both parts (interesting instruction)
paddd mm1,[cte30] ; convert to ascii (30h = '0')
mov eax,[ebp+Result] ; where to put the result-string
movq [eax],mm1 ; return the result (more or less)
emms ; really bad, up to 50 cycles
mov ecx,[eax] ; wrong byte order...
mov edx,[eax+4]
bswap ecx ; ...exchange it
bswap edx ; I know that's bad but the whole function is slow, so no need to optimize here
mov [eax],edx
mov [eax+4],ecx
mov byte [eax+8],0 ; end with 0 byte

popa ; done
mov esp,ebp
pop ebp
ret ; get out of here, NOW

SECTION .data

cte0F: dd 0x0F0F0F0F,0x0F0F0F0F
cte30: dd 0x30303030,0x30303030

SECTION .bss

temp: resb 32

--- end cut here ---