e.g. extract( 12345, 2) = 4
Yours
P
I would think the simplest thing to do would be to perform the
appropriate number of divisions by 10, then look at the remainder.
Given your example, you do two divisions
division quotient remainder
-------- -------- ---------
12345/10 1234 5
1234/10 123 4 <- your answer
123/10 12 3
etc.
Notice that extract(12345, 6) gives zero, which is OK, but you'll have
to handle negative numbers as a special case.
I don't know if you consider this a (pseudo-)ASCII conversion or not,
but it seems to me about the only way...
Regards,
-=Dave
Just my (10-010) cents
I can barely speak for myself, so I certainly can't speak for B-Tree.
Change is inevitable. Progress is not.
SECTION .text
_getDigit:
push ebp
mov ebp,esp
push ebx
push ecx
mov eax,10
mov ebx,[ebp + 8] ; number
mov ecx,ascii ; pointer to asciiconversion-table
sub eax,[ebp + 12] ; 10 - digitWanted
; (1 <= digitWanted <= 10(most-significant))
jz count
while:
sub ebx,[ecx] ; number = number - ascii[edx]
jge while ; number >= ascii[edx] ?
add ebx,[ecx]
add ecx,4 ; next value in ascii-table
dec eax ; next digit
ja while ; until found digitWanted
count:
inc eax ; count number of "sub"s...
sub ebx,[ecx]
jae count
dec eax ; result in eax :)
pop ecx
pop edx
mov esp,ebp
pop ebp
ret
SECTION .data
ascii: dd 1000000000,100000000,10000000,1000000,100000,10000,1000,100,10,1
that's what I'm currently using... :)
it works
it's fast
and it definitely beats anything involving "div" (counter example welcome)
but, it doesn't run in constant time !
getting digit 10 is way faster than getting digit 3 for example
constant time is, of course, no requirement
but, since there is such a difference between 1 and 10
I was guessing that there's probably some way
to speed up 1 (maybe slow down 10) so the _overall_ performance improves
it's surely possible to gain a cycle or two by some instruction
reordering or a couple of well placed nops...
but I was more thinking along the lines of "a better algorithm"...
any input welcome...
Yours
P
There's always at least one more way to do it! :-)
This particular procedure cries out for a table lookup and a couple of
multiplications:
(I assume that the number is a 32-bit unsigned value?)
The obvious approach would be to load into a fp register, multiply by a
suitable reciprocal value, remove the integer part, and then multiply
the fraction by 10 to extract the digit.
I'm certain that it's possible to do this exactly because you have 53
mantissa bits in a fp value, which is a lot more than 32, so even
numbers like 999,999,999 will be converted properly, you just have to
make sure that the scale factors are rounded the proper way.
Anyway, the above is probably the fastest way on a Pentium.
On a P6 we have a fast 32*32->64 MUL opcode, so we can save on the
conversion costs:
; number in EAX, digit # in EBX
mul scale_factor[ebx*4]
mov cl,shift_table[ebx]
shrd eax,edx,cl
; At this point EAX contains the fractional bits (rounded up) left
behind
; after scaling the input number by so much as to get rid of the digits
in
; front of the one we want!
mov edx,10
mul edx
The most significant decimal digit have now made it all the way into
EDX, we're done!
Total time: About 10 cycles.
:-)
Terje
PS. The values needed in the two tables should be obvious, right?
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
[ ... ]
> mul scale_factor[ebx*4]
> mov cl,shift_table[ebx]
[ more code elided ... ]
> Total time: About 10 cycles.
...plus a couple hundred cycles to read the data in from the tables,
unless they happen to already be in the cache.
> PS. The values needed in the two tables should be obvious, right?
Hmm...there's a story about a professor who's giving a talk about a
new proof he's come up with, and in the middle of it he says something
like "from the preceding it is obvious that..." and one of the people
in the audience says "are you sure it's obvious?". The professor
looks at the board again, wrinkles his forehead, starts to think
about, wonders out of the room, finally comes back in about a half
hour later, and says "Yes, it IS obvious."
--
Later,
Jerry.
The universe is a figment of its own imagination.
FLD [value]
FBSTP [result] ; 10-bytes (index into 18 nibbles of BCD)
First, two cache misses will cost from 50-100+ cycles, depending upon
the cpu/bus speed ratio.
Second, since these two tables will be _very_ short (10*4=40 and 10*1=10
bytes, total = 50 bytes) I can safely assume that they will be in cache.
The only alternative would be that this code is used so seldom that the
tables are flushed, but since this is presumably very speedcritical
code, this should not happen.
>
> > PS. The values needed in the two tables should be obvious, right?
>
> Hmm...there's a story about a professor who's giving a talk about a
> new proof he's come up with, and in the middle of it he says something
> like "from the preceding it is obvious that..." and one of the people
> in the audience says "are you sure it's obvious?". The professor
> looks at the board again, wrinkles his forehead, starts to think
> about, wonders out of the room, finally comes back in about a half
> hour later, and says "Yes, it IS obvious."
Actually, I only needed about 5 minutes to decide against calculating
all the table values and including them, I'll have to leave some fun
stuff to you guys (&girls?) right? :-)
Anyway, it really is quite obvious:
In fp, I would want to multiply by 0.1 to extract the first (lowest)
digit, 0.01 to get the next etc:
Start with 123456, extract 3rd digit:
123456 * 0.001 = 123.456
Extract fractional part:
0.456
Multiply by 10
4.56
Extract integer part:
4
This is the required answer.
Using integer-only code I'll do the same, the only "problem" is that I
cannot multiply by 0.001.
Instead I'll multiply by 2^32 * 0.001, shifted left 9 bits to get 32
significant bits in the scale factor, and then shift right by the same 9
bits afterwards.
The only trick is that the scale factor must be rounded up to make sure
that the result will be correct after the final truncation:
65536*65536*512*0.001 = 2199023255.552
This means that the corresponding table entry becomes
ceil(2199023255.552) = 2199023256
The shift factors are the largest power of two which is smaller than the
corresponding power of 10:
10 8 3
100 64 6
1000 512 9
10000 8192 13
...
Terje
>[cut]
>Instead I'll multiply by 2^32 * 0.001, shifted left 9 bits to get 32
>significant bits in the scale factor, and then shift right by the same 9
>bits afterwards.
>
>The only trick is that the scale factor must be rounded up to make sure
>that the result will be correct after the final truncation:
>
> 65536*65536*512*0.001 = 2199023255.552
>
>This means that the corresponding table entry becomes
>
> ceil(2199023255.552) = 2199023256
>
>The shift factors are the largest power of two which is smaller than the
>corresponding power of 10:
>
>
> 10 8 3
> 100 64 6
> 1000 512 9
> 10000 8192 13
>...
>
>Terje
1: 3435973837 3
2: 2748779070 6
3: 2199023256 9
4: 3518437209 13
5: 2814749768 16
6: 2251799814 19
7: 3602879702 23
8: 2882303762 26
9: 2305843010 29
10: 3689348815 33
works perfectly for 1 - 9
but, what do you do with that last entry ?
Yours
P
Good one, that is a special case. :-)
According to the algorithm I gave to calculate the factors, the number
should have been
10: 7378697629.483821 34
To avoid having to branch or otherwise specialcase the top digit, we'll
have to scale the number back to where the shift count becomes less than
32:
10: 922337203.6854776 31
This means that the final table entry should be 922337204
We'll have to verify that this works, but this is fortunately quite
easy:
I'll just try all the boundary cases, if it works for those, then it'll
also work for the numbers in between:
999,999,999 * 922337204 / 2^63 = 0.0999999...
1,000,000,000 -> 0.10000000...
1,999,999,999 -> 0.19999999...
2,000,000,000 -> 0.20000000...
2,999,999,998 -> 0.29999999...
2,999,999,999 -> 0.30000000...
3,999,999,998 -> 0.39999999...
3,999,999,999 -> 0.40000000...
Oops! It will break for two possible input numbers: 3e9-1 and 4e9-1.
The reason for the problem is fairly obvious: The scale factor has fewer
significant bits than the input number, this is an absolute no-no.
The problem can be fixed in many ways, here's a few:
By ignoring it, i.e. just state that this algorithm only works with
positive signed numbers (i.e. less than 2^31).
By increasing the size of the scale factor.
The latter is what I used for the code I wrote to use reciprocal
multiplication as an exact replacement for integer division by a
constant.
It means that the scale factors have to be 33-bit numbers, either by
having an extra top bit or by half a bit at the bottom end:
Top bit added:
; Input value in ECX, digit in EBX
mov eax,scale_factor[ebx*4]
mul ecx
add edx,ecx ; 65-bit result in carry:edx:eax
This approach requires the top bit of the scale factor to be set, but
this will not work with a shift that's less than 34 bits, so I'll have
to use a pair of shift operations, after first rotating any carry into
the top position.
Bottom bit added:
; Input value in ECX, digit in EBX
mov eax,scale_factor[ebx*4]
mul ecx
shr ecx,1
add eax,ecx
adc edx,0
For this to work, the bottom bit has to be set, but this is much easier
to manage since I can just remove any trailing zero bits in the scale
factor, before shifting it right one more time.
Another way to visualize it is that the fractional part of the scale
factor has to be less than 0.5, so that it will become 0.5 after
rounding up to the nearest half bit.
...
I know that these scale factors will get rid of all the digits above the
position we're interested in, but I am not sure that it will also work
after scaling the fractional part back into the integer position, since
this requires 3 more correct bits.
I have written a program to generate the factors and test them all, it
seems like my gut feeling was right, there isn't enough significant bits
to handle the subsequent multiplication by 10.
It might be possible to come up with a 64-bit scale factor instead,
where 33 to 64 bits where used, and where the interesting result ends up
in the middle 32 bits of the 96-bit product, obviating the need for a
64-bit shift operation.
This is starting to get a bit complicated though. :-)
mov eax,scale_factor_low[ebx*4]
mul ecx
mov esi,eax
mov edi,edx
mov eax,scale_factor_high[ebx*4]
mul ecx
add eax,edi
; adc edx,0 ; Can be skipped!
; 96-bit result in EDX:EAX:ESI, with implied decimal point between EDX
and EAX
At this point it might be just as fast to go back to the fp approach.
:-(
... Much later ...
I've implemented the 64/96 bit approach as well, and it still won't
work, due to the loss of precision during the shift. :-(
Currently I'm trying a final tweak, which is sort of a hybrid between
the 32 and 64 bit versions, where I use 64-bit scale factor, with a
final shift, and where I also increment the 32-bit (middle) part of the
96-bit product before multiplying by 10 to get the desired digit.
So far, it seems to work? :-)
I started testing at 0xffffffff, and have currently made it down to
4.2e9, so this will have to run overnight to make an exhaustive test of
all input numbers and all digits.
Terje
PS. The real cowardly way would be to go back to using my old ascii
conversion function, which blindly handles all 10 digits, and then
extract the one we want. After all, that function does run in about 50
clock cycles, so it is still comparable to a single integer DIV
operation.
>PatD wrote:
>>
[cut: old stuff]
>> 10: 3689348815 33
>>
>> but, what do you do with that last entry ?
>
>Good one, that is a special case. :-)
>
[cut: some table discussion]
>
>The problem can be fixed in many ways, here's a few:
>
>By ignoring it, i.e. just state that this algorithm only works with
>positive signed numbers (i.e. less than 2^31).
>
LOL..., good one :)
>By increasing the size of the scale factor.
>
[cut: some good read that leads to the following :)]
>
>; 96-bit result in EDX:EAX:ESI, with implied decimal point between EDX
>and EAX
>
>At this point it might be just as fast to go back to the fp approach.
>:-(
>
hm...
>... Much later ...
>
[cut: current project]
>
>I started testing at 0xffffffff, and have currently made it down to
>4.2e9, so this will have to run overnight to make an exhaustive test of
>all input numbers and all digits.
>
>Terje
all night ?
what are you running this on ?
>
>PS. The real cowardly way would be to go back to using my old ascii
>conversion function, which blindly handles all 10 digits, and then
>extract the one we want. After all, that function does run in about 50
>clock cycles, so it is still comparable to a single integer DIV
>operation.
that's an option...
here's another one:
( declared as: "extern unsigned int getDigit(unsigned int n, unsigned int d);" )
BITS 32
GLOBAL _getDigit
SECTION .text
_getDigit:
push ebp
mov ebp,esp
push ebx
push ecx
push edx
mov eax,[ebp + 8]
mov ebx,[ebp + 12]
mul dword [tbl + 4 * ebx - 4]
mov cl,[shift + ebx - 1]
shrd eax,edx,cl
mov ebx,10
mul ebx
mov eax,edx
pop edx
pop ecx
pop ebx
mov esp,ebp
pop ebp
ret
SECTION .data
tbl: dd 0xcccccccd,0xa3d70a3e,0x83126e98,0xd1b71759,0xa7c5ac48,0x8637bd06,0xd6bf94d6,0xabcc7712,0x89705f42,0xdbe6ff
shift: db 3,6,9,13,16,19,23,26,29,25
it may look like cheating, but, it actually works :)
and...
it's about 6 times faster than the bbt
we're not worthy... :)
Yours
P
oops... two things I forgot to mention:
(they were supposed to go into a "p.s.", but, after reviewing the msg so far... oh well :)
1.
it still fails for some really extreme input like 3999999999...
then again, with that particular value, you get 2 more wrong digits ("5" (!) and "9")
still, it's pretty save to assume that this simply won't ever happen
so, it works for at least 99.99% of all possible values, even 4294967295
should someone come up with a "final" solution, it's welcome
until then, I can perfectly do with the function as shown above :)
2.
just in case anyone's wondering: bbt = best before terje (happens pretty often around here btw)
Yours
P
" Fast enough, isn't "
That is of course exactly the (not quite perfect) algorithm I first
suggested.
> 1.
> it still fails for some really extreme input like 3999999999...
Right.
> should someone come up with a "final" solution, it's welcome
> until then, I can perfectly do with the function as shown above :)
The theoretical base is fairly clear: To do the intial division by
reciprocal multiplication exactly, the scale factor must have 33
significant bits, i.e. 1 more than the length of the inputs.
To also guarantee that the the next digit is right requires another 3
bits, for a total of 36 bits.
> 2.
> just in case anyone's wondering: bbt = best before terje (happens pretty often around here btw)
Thanks. :-)
Terje