
Integer roots game

486 views

minf...@arcor.de

Dec 6, 2022, 9:40:35 AM
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)

\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;

CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt

\ results:
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
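
For readers playing along outside Forth, here is a minimal Python sketch of the same Heron/Newton loop. The names x0/x1/ct mirror the locals above; the u < 2 guard is an addition, since the Forth version as posted would divide by zero for u = 1.

```python
def usqrt(u):
    """Heron/Newton integer square root, mirroring the Forth USQRT above.

    Start from x0 = u // 2 and iterate x1 = (u // x0 + x0) // 2
    while the estimate still decreases. Returns (root, iterations)."""
    if u < 2:
        return u, 0                      # guard added: Forth version assumes u >= 2
    x0, ct = u // 2, 0
    while True:
        x1 = (u // x0 + x0) // 2
        if x1 >= x0:                     # estimate stopped decreasing
            return x0, ct
        x0, ct = x1, ct + 1

for u in (20, 200, 2000, 20000):
    root, ct = usqrt(u)
    print(f"root:{root} iterations: {ct}")
```

The printed lines reproduce the first rows of the transcript above.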

Hans Bezemer

Dec 6, 2022, 10:52:08 AM
On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
> Inspired by another thread I came up with the iterative code shown below.
> The challenge: find a shorter AND faster version of USQRT!

Done:
4 3
14 4
44 6
141 8
447 9
1414 11
4472 13
14142 14

: sqrt-rem ( n -- sqrt rem)
>r 0 1 begin dup r@ > except 4 * repeat
begin \ find a power of 4 greater than TORS
dup 1 > \ compute while greater than unity
while
2/ 2/ swap over over + negate r@ + \ integer divide by 4
dup 0< if drop 2/ else rdrop >r 2/ over + then swap
repeat drop r> ( sqrt rem)
;
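
The same division-free idea reads roughly like this in Python - a sketch of the classic digit-by-digit method, not a line-by-line transcription of the 4tH code above (EXCEPT and the return-stack juggling are replaced by plain variables):

```python
def sqrt_rem(n):
    """Digit-by-digit integer square root with remainder.

    Find a power of 4 above n, then decide one result bit per step,
    subtracting the candidate square. Uses only shifts, adds, compares."""
    bit = 1
    while bit <= n:                  # smallest power of 4 greater than n
        bit <<= 2
    root = 0
    while bit > 1:
        bit >>= 2
        if n >= root + bit:          # this result bit is a 1
            n -= root + bit
            root = (root >> 1) + bit
        else:                        # this result bit is a 0
            root >>= 1
    return root, n                   # root*root + rem == original n
```

Note the invariant in the return comment: the leftover n is exactly the remainder.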

BTW - fewer iterations doesn't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64-bit architecture.

Hans Bezemer

Hans Bezemer

Dec 6, 2022, 11:25:11 AM
On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
> Inspired by another thread I came up with the iterative code shown below.
> The challenge: find a shorter AND faster version of USQRT!
I'll give you that: it is FAST! However, this one is even faster (on my system):

: usqrt
>r 0 r@ 2/ ( ct x0 R: u)
begin
r@ over / over + 2/ swap over over < ( ct x1 x0 f)
while
drop swap 1+ swap
repeat rdrop nip nip
;
time pp4th -x usqrt.4th

Yours:
real 0m0,762s
user 0m0,761s
sys 0m0,000s

Mine:
real 0m0,682s
user 0m0,682s
sys 0m0,000s

Hans Bezemer

minf...@arcor.de

Dec 6, 2022, 11:28:21 AM
Nice, although it has more loops and more operations in the main loop,
but that doesn't spoil the cake.
BTW what is that non-standard word EXCEPT (guessing: 0= WHILE)?

Hans Bezemer

Dec 6, 2022, 11:37:07 AM
On Tuesday, December 6, 2022 at 5:28:21 PM UTC+1, minf...@arcor.de wrote:
> BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE) ?
You guessed correctly, Sir. I always hated "0= WHILE" and "0= IF", so I made
myself a few alternatives.

Hans Bezemer

minf...@arcor.de

Dec 6, 2022, 11:46:25 AM
Even nicer, and also faster here (I have implemented only TOS-caching, but not register locals).
BTW over over is 2dup. For perfectionists: u2/ u/ and u< could double the range.

dxforth

Dec 6, 2022, 11:53:29 AM
On 7/12/2022 3:28 am, minf...@arcor.de wrote:
> ...
> Nice, although it has more loops and more operations in the main loop,

This one duplicates your logic.

: SQRT ( +n -- +root )
0 over 2/ >r
BEGIN
over r@ / r@ + 2/
dup r@ <
WHILE
rdrop >r 1+
REPEAT drop
." root:" r> . ." iterations: " . drop ;

Anton Ertl

Dec 6, 2022, 12:21:48 PM
"minf...@arcor.de" <minf...@arcor.de> writes:
>Inspired by another thread I came up with the iterative code shown below.
>The challenge: find a shorter AND faster version of USQRT!
>Have fun!
>( please don't short-circuit through FSQRT or such ;-)
>
>\ code:
>: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
> 0 to ct u 2/ to x0
> BEGIN u x0 / x0 + 2/ to x1
> x1 x0 <
> WHILE x1 to x0
> ct 1+ to ct
> REPEAT
> ." root:" x0 . ." iterations: " ct . ;

This is the specialization of the Newton-Raphson method for the square
root, see
<https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>

For the initial value I would use

1 bits/cell u clz - 2/ lshift

where clz ( x -- u ) is the count-leading-zeros function that many
architectures have built-in. Newton-Raphson profits a lot from a good
initial value.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net

Hans Bezemer

Dec 6, 2022, 12:24:41 PM
On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
> > Nice, although it has more loops and more operations in the main loop,
> This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)

Hans Bezemer

none albert

Dec 6, 2022, 12:35:24 PM
In article <3c089374-8f18-4d6b...@googlegroups.com>,
My take on this.

Newton is only crazy fast if the start value is good.
The following takes 10 iterations away:

\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP RDROP ;

S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK

(That is MAX-INT, in case you had not noticed.)

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

none albert

Dec 6, 2022, 12:38:46 PM
In article <2022Dec...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>"minf...@arcor.de" <minf...@arcor.de> writes:
>>Inspired by another thread I came up with the iterative code shown below.
>>The challenge: find a shorter AND faster version of USQRT!
>>Have fun!
>>( please don't short-circuit through FSQRT or such ;-)
>>
>>\ code:
>>: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
>> 0 to ct u 2/ to x0
>> BEGIN u x0 / x0 + 2/ to x1
>> x1 x0 <
>> WHILE x1 to x0
>> ct 1+ to ct
>> REPEAT
>> ." root:" x0 . ." iterations: " ct . ;
>
>This is the specialization of the Newton-Raphson method for the square
>root, see
><https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
>
>For the initial value I would use
>
>1 bits/cell u clz - 2/ lshift
>
>where clz ( x -- u ) is the count-leading-zeros function that many
>architectures have built-in. Newton-Raphson profits a lot from a good
>initial value.

Where clz is available you should use it. In this context I consider it
cheating, though.
If clz is not available in hardware, it takes as many iterations
(or more) as you save on the Newton iteration.

>
>- anton

Marcel Hendrix

Dec 6, 2022, 1:01:52 PM
On Tuesday, December 6, 2022 at 6:35:24 PM UTC+1, none albert wrote:
[..]
> Newton is only crazy fast if the start value is good.
> The following takes 10 iterations away:
>
> \ For N return FLOOR of the square root of n.
> : SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
> BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
> NIP REPEAT DROP RDROP ;
>
> S[ ] OK -1 1 RSHIFT
> S[ 9223372036854775807 ] OK SQRT
> S[ 3037000499 ] OK
>
> (that is MAX-INT , if you had not noticed.
[..]

FORTH> : SQRT DUP >R 10 RSHIFT 1024 MAX BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE NIP REPEAT DROP -R ;
Redefining `SQRT` ok
FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok

-marcel

minf...@arcor.de

Dec 6, 2022, 1:05:13 PM
Anton Ertl schrieb am Dienstag, 6. Dezember 2022 um 18:21:48 UTC+1:
> "minf...@arcor.de" <minf...@arcor.de> writes:
> >Inspired by another thread I came up with the iterative code shown below.
> >The challenge: find a shorter AND faster version of USQRT!
> >Have fun!
> >( please don't short-circuit through FSQRT or such ;-)
> >
> >\ code:
> >: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
> > 0 to ct u 2/ to x0
> > BEGIN u x0 / x0 + 2/ to x1
> > x1 x0 <
> > WHILE x1 to x0
> > ct 1+ to ct
> > REPEAT
> > ." root:" x0 . ." iterations: " ct . ;
> This is the specialization of the Newton-Raphson method for the square
> root, see
> <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
>

Correct. Another algorithm would include tabulation and interpolation. Probably faster
but eats more space.

If anyone is interested, there is a practical story behind the game: flow measurement
through orifices where dp behaves about quadratic with the flow. For instance the basic
principles are explained here:
https://www.efunda.com/formulae/fluids/calc_orifice_flowmeter.cfm

minf...@arcor.de

Dec 6, 2022, 4:05:11 PM
Cute. But all those algorithms use slow division in the loop.

Here is a version ( in pseudo-Forth ;-) } using only shifts and additions:
: USQRT {: u | T H B S -- uroot32 :}
0 to H $8000 to B 15 to S
BEGIN
H 2* B + S lshift to T
T u < IF
H B + to H
u T - to u
THEN
S 1- to S
B 2/ to B
B 0=
UNTIL
." root:" H . ;

BTW (not shown above) for syntactic sugar I regularly use
:= += -/ *= /=
with locals. Makes code more readable.

Anton Ertl

Dec 6, 2022, 5:14:13 PM
Here's a LOG2 definition (originally by Wil Baden
<wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):

: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;

With this function the initial value becomes:

1 u log2 2/ lshift

It will be interesting to see at what values the crossover is between saving iterations for Newton, and the additional cost of log2.
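
A Python rendering of the binary search may help; it assumes values below 2**64, matching the masks above:

```python
def log2(n):
    """Floor of log2 by binary search for the top set bit, as in the
    LOG2 above: probe the upper half, and when it is non-empty add the
    half-width to the result and shift the value down. Returns -1 for 0.
    Assumes 0 <= n < 2**64, like the 64-bit cell it mimics."""
    result = -1
    for width in (32, 16, 8, 4, 2, 1):
        if n >> width:               # anything set in the upper half?
            result += width
            n >>= width
    return result + n                # n is now 0 or 1
```

The n >> width test replaces the explicit AND masks; for 64-bit values the two are equivalent.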

dxforth

Dec 6, 2022, 7:09:23 PM
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.

minf...@arcor.de

Dec 6, 2022, 7:49:48 PM
Lol!! Tell that to Heron of Alexandria, it is his root-finding method. His biggest drawback
is that he was born before Charles Moore ... by about 2000 years ... :o)

dxforth

Dec 6, 2022, 8:22:16 PM
On 7/12/2022 11:49 am, minf...@arcor.de wrote:
> dxforth schrieb am Mittwoch, 7. Dezember 2022 um 01:09:23 UTC+1:
>> On 7/12/2022 4:24 am, Hans Bezemer wrote:
>>> On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
>>>>> Nice, although it has more loops and more operations in the main loop,
>>>> This one duplicates your logic.
>>> That one shaves another 8/100 of a second from my version. ;-)
>> Locals are supposed to reduce 'stack juggling' and simplify code but in this
>> instance it was the juggling between variables that proved extraneous.
>
> Lol!! Tell that Heron of Alexandria, it is his root-finding method.

Maybe but it was your implementation. So did the non-locals version
prove to be "shorter AND faster"? That was your challenge - no?

minf...@arcor.de

Dec 7, 2022, 4:20:46 AM
Admitted. I won't yawn at that old "shun locals" talk again. If the proposal
just duplicates Heron's logic, it is merely equivalent IMO. If speed depends on the
compiler used, it is not really better - lxf even seems capable of generating the same
code for both versions.

OTOH Marcel, Albert and Anton accelerated the simple algorithm by better start values.

Yesterday I found a different algorithm in the net that avoids slow divisions and should
therefore be even faster. The ARM code at the end of the document is a gem:
http://www.azillionmonkeys.com/qed/ulerysqroot.pdf

I didn't do timings but the other code that I presented in the thread was a linear translation
of the C version to Forth. I checked it with gforth.

P.S. pray forgive that while doing this I did not kick out those nasty locals. ;o)


dxforth

Dec 7, 2022, 5:47:33 AM
On 7/12/2022 8:20 pm, minf...@arcor.de wrote:
> dxforth schrieb am Mittwoch, 7. Dezember 2022 um 02:22:16 UTC+1:
>> On 7/12/2022 11:49 am, minf...@arcor.de wrote:
>>> dxforth schrieb am Mittwoch, 7. Dezember 2022 um 01:09:23 UTC+1:
>>>> On 7/12/2022 4:24 am, Hans Bezemer wrote:
>>>>> On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
>>>>>>> Nice, although it has more loops and more operations in the main loop,
>>>>>> This one duplicates your logic.
>>>>> That one shaves another 8/100 of a second from my version. ;-)
>>>> Locals are supposed to reduce 'stack juggling' and simplify code but in this
>>>> instance it was the juggling between variables that proved extraneous.
>>>
>>> Lol!! Tell that Heron of Alexandria, it is his root-finding method.
>> Maybe but it was your implementation. So did the non-locals version
>> prove to be "shorter AND faster"? That was your challenge - no?
>
> Admitted that, I won't yawn at that old "shun locals" talk again. If the proposal
> just duplicates Heron's logic, it is just equivalent IMO. If speed depends on the
> used compiler, it is not better really -

If the improvement is consistent across compilers, how is it not better?
You can say you don't care - but that's a different argument.

> lxf seems even capable to generate the same
> code for both versions.

According to Jeff Fox, Moore considered optimizing locals a backwards solution
to the problem of the user failing to optimize the stack.

https://groups.google.com/g/comp.lang.forth/c/liQl9h9OzgY/m/PfO98kkYEXYJ

Anton Ertl

Dec 7, 2022, 6:13:57 AM
"minf...@arcor.de" <minf...@arcor.de> writes:
>Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
>is that he was born before Charles Moore .. about 2000 years ... :o)

But the question is how Heron described it. Modern descriptions of
course use modern notation and terminology, but the original notation
and terminology are often quite different.

Gerry Jackson

Dec 7, 2022, 6:36:24 AM
In ANS/2012 Forth, solutions that use 2/ on an unsigned number with the
ms bit set won't work for unsigned square root as 2/ leaves the ms bit
unchanged.

Here's a recursive solution I wrote a few months ago based on:
https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations

I've put stack comments in so it can be related to the original in
Wikipedia.

: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
r> u> if 1- then ( -- u1 )
;

Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.

--
Gerry

minf...@arcor.de

Dec 7, 2022, 7:07:18 AM
Anton Ertl schrieb am Mittwoch, 7. Dezember 2022 um 12:13:57 UTC+1:
> "minf...@arcor.de" <minf...@arcor.de> writes:
> >Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
> >is that he was born before Charles Moore .. about 2000 years ... :o)
> But the question is how Heron described it. Modern descriptions of
> course use medern notation and terminology, but the original notation
> and terminology is often quite different.

To my amateurish knowledge, mathematics at that time (Pythagoreans put aside)
was mostly influenced by physics (i.e. mechanics, astronomy, etc.) and geometry
(i.e. in architecture).

Therefore I guess that Heron's recursion (probably invented much earlier in Egypt)
was described as an algorithmic recipe using geometric elements and short text
annotations in Greek.

Certainly from our modern notations a form using variables comes closer to the original
than stack machine code.

minf...@arcor.de

Dec 7, 2022, 7:13:46 AM
Gerry Jackson schrieb am Mittwoch, 7. Dezember 2022 um 12:36:24 UTC+1:
> Here's a recursive solution I wrote a few months ago based on:
> https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations
>
> I've put stack comments in so it can be related to the original in
> Wikipedia.
>
> : usqrt ( u -- u1 )
> dup 2 u< if exit then
> dup >r 2 rshift recurse
> 2* ( -- u sm )
> 1+ dup ( -- u la la )
> dup * ( -- u la la^2 )
> r> u> if 1- then ( -- u1 )
> ;
>
> Worked OK for me, I wasn't concerned about speed so don't know how it
> compares in that respect. Incidentally this provides an example of using
> the stack vs locals.

Cool ! I really like this one ! :-)

Hans Bezemer

Dec 7, 2022, 7:33:02 AM
On Wednesday, December 7, 2022 at 10:20:46 AM UTC+1, minf...@arcor.de wrote:
> Yesterday I found a different algorithm in the net that avoids slow divisions and should
> therefore be even faster. The ARM code at the end of the document is a gem:
> http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
Quick hack:

cell-bits 2 / 1- constant (bshift)

: sqrt
>r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
begin
tuck 1 lshift r@ + over lshift ( nh bs t R: v b)
r> over r@ > ( nh bs t b f R: v)
if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
rot 1- rot 1 rshift >r swap r@ 0= ( bs nh R: v b)
until rdrop rdrop nip
;

Works on both 32- and 64-bit. It isn't surprisingly fast - as a matter of fact, I've seen implementations of binary algorithms that were faster.

Hans Bezemer

minf...@arcor.de

Dec 7, 2022, 8:01:06 AM
Yes, it is a mixed thing in Forth. Lots of time spent in bringing arguments to the top.
The ultra-compact ARM code used registers.

Readability: algorithm and stack movements are like mashed carrots and potatoes, difficult to separate.
In my Forth

: USQRT32 {: u | T H B:$8000 S:15 == H :}
BEGIN
H 2* B + S lshift := T
T u < IF B += H T -= u THEN
1 -= S B 2/ := B
B ~UNTIL ;

CR 200_000_000 USQRT32 .( 200mil-root: ) .

Yes, this is slower and thus a game loser. ;-)

Hans Bezemer

Dec 7, 2022, 8:24:51 AM
On Wednesday, December 7, 2022 at 2:01:06 PM UTC+1, minf...@arcor.de wrote:
> : USQRT32 {: u | T H B:$8000 S:15 == H :}
Oops - 32-bit only! It's actually two loops - one to figure out the first candidate for a binary square
and the second one to figure out the square itself. That's where S and B come from - if you offer
MAX-N, the first possible candidate (in 32 bits) is SQRT(MAX-N) ~ 2^15. So S and B are related.

Therefore, you can derive B from S at the start of the program - and maybe one could eliminate
a lot of these vars from the algorithm - with some speed penalty of course. But I haven't delved
that deep (yet).

S seems to be equal to half the number of bits of MAX-N, rounded down. On this basis you can
calculate the number of bits for S for a specific architecture.

BTW, this might be an interesting one for this discussion.
https://sourceforge.net/p/forth-4th/wiki/Inside%20Zen%20FSQRT/

It explains how (on 32 bit) an FSQRT is made by:
1. Cranking up the mantissa to the max;
2. Halving the exponent;
3. Calculating a good approximation now the (integer) mantissa is close to MAX-N.

In 32-bit it required about 3-4 integer iterations and 3-4 floating point iterations - if I remember
correctly.
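
The recipe can be sketched numerically. This is an illustration of the idea only: the seed constants 0.41731 + 0.59016*m are the classic linear mantissa approximation, an assumption here, not necessarily what Zen FSQRT actually uses.

```python
import math

def fsqrt(x):
    """Sketch of the mantissa/exponent FSQRT recipe:
    split x = m * 2**e, make the exponent even, seed the root of the
    mantissa with a linear approximation, then polish with Newton steps."""
    if x <= 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e, 0.5 <= m < 1
    if e & 1:                            # make the exponent even
        m *= 2.0
        e -= 1
    # Linear seed (assumed constants), good to a few percent over [0.5, 2)
    y = math.ldexp(0.41731 + 0.59016 * m, e // 2)
    for _ in range(4):                   # each step roughly doubles the digits
        y = 0.5 * (y + x / y)
    return y
```

Four Newton steps take the few-percent seed down to machine precision, consistent with the "3-4 iterations" remembered above.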

HB

Hans Bezemer

Dec 7, 2022, 8:45:00 AM
On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
> Worked OK for me, I wasn't concerned about speed so don't know how it
> compares in that respect. Incidentally this provides an example of using
> the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s.
Yours is 0.2s faster on my combo - even with all the function call overhead.

HB

Gerry Jackson

Dec 7, 2022, 10:30:20 AM
I've just realised it can be made slightly quicker by using the well-formed
flag trick in the last line to eliminate the IF ... THEN

: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
\ r> u> if 1- then ( -- u1 )
r> u> + ( -- u1 )
;

I believe 4th uses 1 for TRUE so the last line for 4th would be:
r> u> -

I'm a bit dubious about whether a marginal saving like that is worthwhile.


--
Gerry

Message has been deleted

Gerry Jackson

Dec 7, 2022, 4:07:31 PM
On 07/12/2022 18:12, minf...@arcor.de wrote:
> \ with microscopic change & tested
> : USQRT
> dup 2 u< IF EXIT THEN
> dup >r 2 rshift recurse
> 2* dup 1+ dup * r> u< - ;
>

Sorry but that doesn't seem to work for all cases e.g. using your USQRT

25 usqrt . \ displays 4

Using this test program on GForth
: x ( n -- ) \ get square roots from 0 to n-1
-1 swap 0
do
i usqrt 2dup <
if cr i 3 .r ." : " nip dup then \ Display next square number
. \ Followed by square roots of integers up to next square
loop drop
;

\ With my USQRT
101 x
0: 0
1: 1 1 1
4: 2 2 2 2 2
9: 3 3 3 3 3 3 3
16: 4 4 4 4 4 4 4 4 4
25: 5 5 5 5 5 5 5 5 5 5 5
36: 6 6 6 6 6 6 6 6 6 6 6 6 6
49: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
64: 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
81: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
100: 10 ok

Whereas using your version we get

101 x
0: 0
1: 1 1 1
4: 2 2 2 2 2 2
10: 3 3 3 3 3 3
16: 4 4 4 4 4 4 4 4 4 4
26: 5 5 5 5 5 5 5 5 5 5 5 5 5 5
40: 6 6 6 6 6 6 6 6 6 6
50: 7 7 7 7 7 7 7 7 7 7 7 7 7 7
64: 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
82: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ok

I'm tired and it's too late to start thinking about it tonight.

--
Gerry

Anton Ertl

Dec 7, 2022, 5:47:01 PM
dxforth <dxf...@gmail.com> writes:
>Locals are supposed to reduce 'stack juggling' and simplify code but in this
>instance it was the juggling between variables that proved extraneous.

I have collected some of the versions posted here into

http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:

\ derived from original locals version
: USQRT1 {: u | x0 x1 -- uroot ; positive integer square root :}
u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
REPEAT
x0 ;

\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;

\ An attempt to write code for VFX's current limitations: Try to do
\ everything on the data stack with only one stack item at basic block
\ bounaries. Keep u on the return stack, so we only do a read from
\ there. Unfortunately, the data stack depth changes here.
: usqrt9 ( u -- root )
dup >r 2/ begin ( x0 R:u )
r@ over / over + 2/ ( x0 x1 R:u )
2dup > while
nip repeat
r> 2drop ;

\ Maybe we can do better by not changing the stack depth; however, we
\ use 2 stack items in the basic blocks in the loop.
: usqrta ( u -- root )
dup >r 2/ r@ begin ( x0 u R:u )
over / over + 2/ ( x0 x1 R:u )
2dup > while
nip r@ repeat
r> 2drop ;

Let's first look at the inner loops as produced by VFX64:

usqrt1 usqrt4 usqrt9 usqrta
MOV RAX, [RDI+10] MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, RBX
CQO MOV RAX, RBX MOV RAX, RDX CQO
IDIV QWord [RDI+-08] CQO CQO IDIV QWord [RBP]
ADD RAX, [RDI+-08] IDIV RCX IDIV RBX ADD RAX, [RBP]
SAR RAX, # 1 MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1
MOV [RDI+-10], RAX ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP]
MOV RDX, [RDI+-10] SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX
CMP RDX, [RDI+-08] MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E410D
JNL 004E3F00 CMP RDX, RCX MOV [RBP], RBX MOV RDX, [RSP]
MOV RDX, [RDI+-10] LEA RBP, [RBP+-08] MOV RBX, RAX MOV [RBP], RBX
MOV [RDI+-08], RDX MOV [RBP], RBX JLE/N004E4090 MOV RBX, RDX
JMP 004E3ED3 MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E40E3
JNL 004E4028 JMP 004E4064
LEA RSP, [RSP+08]
PUSH RBX
MOV RBX, [RBP]
LEA RBP, [RBP+08]
JMP 004E3FEA

Here the locals code looks quite good, but it has a lot of memory
accesses. 9 and a look indeed better than 4. Let's see how they
perform:

for i in 1 4 9 a; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

1 4 9 a
187.7 257.1 191.3 179.7 instructions per invocation (average from u=2..1e7)
165.9 141.2 136.3 161.9 cycles per invocation on Rocket Lake
169.5 105.0 156.8 163.0 cycles per invocation on Zen3

The Zen3 result is quite surprising: while usqrt4 executes the most
instructions (as expected), it takes far fewer cycles on Zen3 than all
the other versions.

The Rocket Lake result is not as great for usqrt4, and usqrt9 beats it
by a little.

I currently have no explanation why the cycles results are like this.

dxforth

Dec 7, 2022, 7:10:06 PM
On 8/12/2022 2:30 am, Gerry Jackson wrote:
> ...
> I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
> ...
> I'm a bit dubious about whether a marginal saving like that is worthwhile.

Especially when an optimizing compiler is compelled to generate the 'well-formed'
flag. I don't think it was ever a given that 'calculate' was better than 'test
and branch'. Each situation according to its merits.

dxforth

Dec 7, 2022, 8:24:55 PM
On 8/12/2022 9:06 am, Anton Ertl wrote:
> dxforth <dxf...@gmail.com> writes:
>> Locals are supposed to reduce 'stack juggling' and simplify code but in this
>> instance it was the juggling between variables that proved extraneous.
>
> I have collected some of the versions posted here into
>
> http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
>
> I then created some versions from it that have the stack effect ( +n1
> -- n2 ) and that do not count the iterations:
>
> ...
>
> \ derived from dxforths locals-less version
> : USQRT4 ( +n -- +root )
> dup 2/ >r
> BEGIN ( +n r:x0 )
> dup r@ / r@ + 2/ ( +n x1 R:x0 )
> dup r@ <
> WHILE
> r> drop >r
> REPEAT
> 2drop r> ;

With no count to maintain, it can be simplified to:

: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;

SpainHackForth

Dec 8, 2022, 5:09:40 AM
I wrote an assembler word for Mecrisp based on this article… I think it was number 4… I tried number 9 and it was not very precise… https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi

Heinrich Hohl

Dec 8, 2022, 5:12:36 AM
On Tuesday, December 6, 2022 at 11:14:13 PM UTC+1, Anton Ertl wrote:
> Here's a LOG2 definition (originally by Wil Baden
> <wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):
>
> : log2 ( n -- log2 )
> -1 swap ( log2 n)
> dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
> dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
> dup $000000000000ff00 and if swap 8 + swap 8 rshift then
> dup $00000000000000f0 and if swap 4 + swap 4 rshift then
> dup $000000000000000c and if swap 2 + swap 2 rshift then
> dup $0000000000000002 and if swap 1 + swap 1 rshift then
> + ;

Hello Anton,

did Wil Baden explain in detail how the algorithm works?

The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.

Henry

Hans Bezemer

Dec 8, 2022, 12:12:21 PM
On Thursday, December 8, 2022 at 11:12:36 AM UTC+1, Heinrich Hohl wrote:
> did Wil Baden explain in detail how the algorithm works?
>
> The routine works properly and the code is clear, but the algorithm
> behind it is not easy to understand. I would be grateful if you have
> any reference with more information on this Log2 algorithm.

Maybe this helps:
decimal $ffffffff00000000 2 base ! u. 1111111111111111111111111111111100000000000000000000000000000000 ok
decimal $00000000ffff0000 2 base ! u. 11111111111111110000000000000000 ok
decimal $000000000000ff00 2 base ! u. 1111111100000000 ok
decimal $00000000000000f0 2 base ! u. 11110000 ok
decimal $000000000000000c 2 base ! u. 1100 ok
decimal $0000000000000002 2 base ! u. 10

If the first mask checks out (AND returns a non-zero value) we know the top bit is in the upper 32 bits.
So we update our count (bits are numbered 0 - 63, so we start off with -1) by adding 32 to it, shift
the whole shebang 32 bits to the right, and test the upper half of the remaining 32 bits.

If AND had returned zero, the top bit is not in the upper 32 bits - so we leave the value be and continue
with the upper half of the low 32 bits. Same thing here - if zero, we leave it be; if not, we add 16 to
the count, shift the whole darn thing 16 bits to the right, and repeat the entire procedure for the
upper half of the remaining 16 bits (which is 8 bits).

The last bit left is either 0 or 1. So we add it to our count. Let's assume we did "*0001" - then
that bit is added to -1, and it reports that bit zero is set. If we did "*0000" then zero is added
to -1 - signalling that NO bit is set at all.

It's quite simple and elegant when you think of it.

Hans Bezemer

Heinrich Hohl

Dec 8, 2022, 12:44:37 PM
On Thursday, December 8, 2022 at 6:12:21 PM UTC+1, the.bee...@gmail.com wrote:
> It's quite simple and elegant when you think of it.
>
> Hans Bezemer

Hello Hans,

thank you for the explanations. Yes, the algorithm is tricky but also elegant.
This is basically a binary search for the topmost nonzero bit.

I had a look at "Bit Twiddling Hacks" and found similar algorithms.
https://graphics.stanford.edu/~seander/bithacks.html

"Count the consecutive zero bits (trailing) on the right by binary search"
is almost the same algorithm, except that it searches for the lowermost nonzero bit.

Henry

P Falth

Dec 8, 2022, 12:54:15 PM
This is exactly what happens with my code generator. The "optimized" version is
5% larger (but just 3 bytes) and, more surprisingly, has a 30% longer runtime!

Gerry's original recursive version is anyway 5 times faster than your version!

Could be that division is slow on my processor, an E5-4657L v2.

BR
Peter

Anton Ertl

Dec 8, 2022, 1:26:55 PM
Heinrich Hohl <hheinri...@gmail.com> writes:
>On Tuesday, December 6, 2022 at 11:14:13 PM UTC+1, Anton Ertl wrote:
>> Here's a LOG2 definition (originally by Wil Baden
>> <wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):
>>
>> : log2 ( n -- log2 )
>> -1 swap ( log2 n)
>> dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
>> dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
>> dup $000000000000ff00 and if swap 8 + swap 8 rshift then
>> dup $00000000000000f0 and if swap 4 + swap 4 rshift then
>> dup $000000000000000c and if swap 2 + swap 2 rshift then
>> dup $0000000000000002 and if swap 1 + swap 1 rshift then
>> + ;
>
>Hello Anton,
>
>did Wil Baden explain in detail how the algorithm works?

I only have found it in a reply by me, and what I kept from his
posting does not include an explanation. You could take the
Message-Id <wilbadenE...@netcom.com> and look it up.
al.howardknight.net does not have it (probably too old); maybe Google
Groups can find it (but the last few times I tried to find Message-Ids
with Google Groups, I was unsuccessful).

>The routine works properly and the code is clear, but the algorithm
>behind it is not easy to understand. I would be grateful if you have
>any reference with more information on this Log2 algorithm.

Hans Bezemer explains it.

A similar way to do it, but without ifs (derived from the C code in
Gforth):

: log2a ( n -- log2 )
dup 0= swap ( log2' n )
dup $ffffffff u> 32 and rot over + -rot rshift
dup $0000ffff u> 16 and rot over + -rot rshift
dup $000000ff u> 8 and rot over + -rot rshift
dup $0000000f u> 4 and rot over + -rot rshift
dup $00000003 u> 2 and rot over + -rot rshift
$00000001 u> 1 and + ;

\ checked against Gforth's builtin log2
: check
100000000 0 do i log2 i log2a <> if cr i hex. i log2 . i log2a . then loop ;

VFX64 translates this into 47 instructions (163 bytes) without a
single memory reference.
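The branch-free version translates to C just as directly (a sketch; the name and the 64-bit cell width are assumptions, not the Gforth source):

```c
#include <stdint.h>
#include <assert.h>

/* Branch-free topmost-bit search: each comparison yields 0 or 1,
   which is scaled to the shift amount, added to the result, and
   used to shift the value down.  Returns -1 for x == 0, matching
   the "dup 0= swap" initialization above. */
static int log2_branchless(uint64_t x)
{
    int r = -(x == 0);
    int s;
    s = (x > 0xffffffffu) << 5; r += s; x >>= s;
    s = (x > 0x0000ffffu) << 4; r += s; x >>= s;
    s = (x > 0x000000ffu) << 3; r += s; x >>= s;
    s = (x > 0x0000000fu) << 2; r += s; x >>= s;
    s = (x > 0x00000003u) << 1; r += s; x >>= s;
    return r + (x > 0x00000001u);
}
```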

Anton Ertl

unread,
Dec 8, 2022, 1:52:02 PM12/8/22
to
dxforth <dxf...@gmail.com> writes:
>With no count to maintain, it can be simplified to:
>
>: USQRT4 ( +n -- +root )
> dup 2/ dup
> BEGIN drop
> 2dup / over + 2/
> 2dup >
> WHILE
> swap
> REPEAT
> rot 2drop ;

The version above is called USQRTB (or B) below. It inspired me to:

: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;

The idea here is to never move a value on the critical path to memory
in VFX (i.e., the critical path value should be on the TOS at the
basic block boundary).
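For reference, the iteration all of these variants implement is Newton's method on f(x) = x² − u; the USQRTC structure, sketched in C (function name is mine, not VFX output):

```c
#include <stdint.h>
#include <assert.h>

/* Integer Newton iteration for the square root: x1 = (u/x0 + x0)/2,
   starting from u/2 and stopping as soon as the estimate stops
   decreasing ("2dup <= until" in the Forth version).  Inputs below
   2 are their own roots. */
static int64_t usqrt_newton(int64_t u)
{
    if (u < 2)
        return u;
    int64_t x0 = u / 2;
    int64_t x1 = (u / x0 + x0) / 2;
    while (x1 < x0) {
        x0 = x1;
        x1 = (u / x0 + x0) / 2;
    }
    return x0;
}
```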

Here are the numbers, I'll show the code in a later posting:

for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

1 4 9 a b c
187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3

The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
174 cycles, but here I show a good result.

Anyway, the idea behind USQRTC seems to work well: It wins in the
number of instructions, and the number of cycles on both CPUs.

P Falth

unread,
Dec 8, 2022, 3:04:27 PM12/8/22
to
But its name is wrong. It is not unsigned. It requires a signed positive number,
as is also seen from the original stack comment.

try -1 usqrtc

BR
Peter

minf...@arcor.de

unread,
Dec 8, 2022, 3:35:38 PM12/8/22
to
IMO a 5 times speed difference is too much to be explained by a single division.
Memory and CPU cache effects seem more plausible. Recursion retakes
items from a stack that is probably still cached, because those items were
stored there only recently in a preceding cycle.

Marcel Hendrix

unread,
Dec 8, 2022, 4:22:30 PM12/8/22
to
On Thursday, December 8, 2022 at 7:52:02 PM UTC+1, Anton Ertl wrote:
[..]
FORTH> in ( #100,000,000 iterations )
303700049900000000 7.753 seconds elapsed. minforth
303700049900000000 7.039 seconds elapsed. dxforth #1
303700049900000000 4.424 seconds elapsed. Albert
303700049900000000 0.681 seconds elapsed. Anton #1
303700049900000000 6.229 seconds elapsed. Gerry
303700049900000000 6.540 seconds elapsed. Anton #2
303700049900000000 6.208 seconds elapsed. Anton #3
303700049900000000 6.554 seconds elapsed. dxforth #2
303700049900000000 6.497 seconds elapsed. Anton #3 ok
303700049900000000 0.036 seconds elapsed. iForth ok
FORTH> .TICKER-INFO
AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz

Locals cost something, but a smart algorithm obviously wins.

The last result is because some inlined words evaluate their arguments.

-marcel

minf...@arcor.de

unread,
Dec 8, 2022, 5:21:39 PM12/8/22
to
I think many Forths do some form of register caching for top stack element(s).
But keeping locals in registers seems still unpopular in Forth. This could explain
those many memory accesses by VFX Forth for the locals version, and the
incurred speed penalty.

Anton Ertl

unread,
Dec 8, 2022, 5:33:09 PM12/8/22
to
P Falth <peter....@gmail.com> writes:
>But its name is wrong. It is not unsigned. It requires a signed positive number
>as seen also by the original stack comment .

Yes, that can be fixed with

: usqrtd ( u -- root )
dup 1 rshift dup
begin ( u x0 x1 )
nip ( u x1 )
2dup u/ over + 1 rshift ( u x1 x2 )
2dup u<= until
drop nip ;

but not every Forth system has U/ and U<= (and VFX64 has U/, but does
not inline it).

Anton Ertl

unread,
Dec 8, 2022, 5:59:36 PM12/8/22
to
Here is the promised code for the inner loop on VFX64:

usqrt4 usqrt9 usqrtb usqrtc
MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, [RBP+08] MOV RAX, [RBP+08]
MOV RAX, RBX MOV RAX, RDX CQO CQO
CQO CQO IDIV QWord [RBP] IDIV RBX
IDIV RCX IDIV RBX ADD RAX, [RBP] ADD RAX, RBX
MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1 SAR RAX, # 1
ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP] CMP RBX, RAX
SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX MOV [RBP], RBX
MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E418D MOV RBX, RAX
CMP RDX, RCX MOV [RBP], RBX MOV RDX, RBX JNLE 004E41E2
LEA RBP, [RBP+-08] MOV RBX, RAX MOV RBX, [RBP]
MOV [RBP], RBX JLE/N004E4090 MOV [RBP], RDX
MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E4162
JNL 004E4028 JMP 004E4064
LEA RSP, [RSP+08]
PUSH RBX
MOV RBX, [RBP]
LEA RBP, [RBP+08]
JMP 004E3FEA

For space reasons I left out USQRT1 and USQRTA (relatively slow versions)
here; you can look at them in an earlier posting.

The memory references in USQRTC are to different memory locations: one is
only read (the value U), and another is only written (to a different
location, which is then not used in the next iteration; it's only
relevant when leaving the loop).

USQRT9 also seems to have this property, yet is significantly slower;
the reasons for that are unclear to me.

USQRT4 PUSHes a value (x1) to the return stack towards the end of an
iteration, and loads it (as x0) at the start of the next iteration.
This does not seem to hurt much, though, and I don't know why.

USQRTB moves a value to data stack RAM in the penultimate instruction
of the iteration and loads it again in the IDIV instruction.

dxforth

unread,
Dec 8, 2022, 7:09:46 PM12/8/22
to
On 9/12/2022 5:41 am, Anton Ertl wrote:
> dxforth <dxf...@gmail.com> writes:
>> With no count to maintain, it can be simplified to:
>>
>> : USQRT4 ( +n -- +root )
>> dup 2/ dup
>> BEGIN drop
>> 2dup / over + 2/
>> 2dup >
>> WHILE
>> swap
>> REPEAT
>> rot 2drop ;
>
> The version above is called USQRTB (or B) below. It inspired me to:
>
> : usqrtc ( u -- root )
> dup 2/ dup
> begin ( u x0 x1 )
> nip ( u x1 )
> 2dup / over + 2/ ( u x1 x2 )
> 2dup <= until
> drop nip ;

One less branch. Neat.

Anton Ertl

unread,
Dec 9, 2022, 5:33:55 AM12/9/22
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>dxforth <dxf...@gmail.com> writes:
>>Locals are supposed to reduce 'stack juggling' and simplify code but in this
>>instance it was the juggling between variables that proved extraneous.
>
>I have collected some of the versions posted here into
>
>http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

After establishing that USQRTD has the fastest loop, I now add the
better initial value. I benchmark albert@cherry.(none)'s cheap, but
less accurate initial value as well as my more precise, but more
expensive initial value. The values tested are for 2..10^7 (like in
the other tests).

First, the results:

c d 5 e
141.1 104.5 77.0 63.8 instructions/invocation
136.3 46.7 46.0 45.9 cycles/invocation Rocket Lake
95.8 40.0 32.5 32.5 cycles/invocation Zen3

c uses the original initial value: u 2/
d uses my log2-based initial value (see below)
5 and e use albert@cherry.(none)'s initial value;
5 is albert@cherry.(none)'s code
e uses his initial value with the rest from c.

Concerning usqrtd, I could not use exactly what I outlined earlier, but
had to adapt it slightly to guarantee an initial value that's larger
than the root (required by the terminating condition of the loop):

: log2 ( n -- log2 )
dup 0= swap ( log2' n )
dup $ffffffff u> 32 and rot over + -rot rshift
dup $0000ffff u> 16 and rot over + -rot rshift
dup $000000ff u> 8 and rot over + -rot rshift
dup $0000000f u> 4 and rot over + -rot rshift
dup $00000003 u> 2 and rot over + -rot rshift
$00000001 u> 1 and + ;

: usqrtd ( u -- root )
2 over log2 2/ lshift
dup begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;

: uSQRT5 DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP R> DROP ;

: usqrte ( u -- root )
dup 10 rshift 1024 max
dup begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;

It's obvious that, for the range of numbers benchmarked, the increased
precision of USQRTD does not compensate for its increased cost. I
also benchmarked with u=1000000000002..1000010000000, with the
following results:

c d 5 e
226.9 102.7 201.4 149.9 instructions/invocation
270.7 45.3 142.2 142.6 cycles/invocation Rocket Lake
170.5 39.5 102.1 101.8 cycles/invocation Zen3

So for larger u, the higher precision of the log2 approach pays off.
OTOH, if you knew that u is in that range, you could use the following
cheap initial value:

DUP 20 RSHIFT $100000 MAX

and win over USQRTD again.

Marcel Hendrix

unread,
Dec 9, 2022, 1:09:51 PM12/9/22
to
On Friday, December 9, 2022 at 11:33:55 AM UTC+1, Anton Ertl wrote:
> Concerning usqrtd, I could not use exactly what I outlined earlier, but
> had to adapt it slightly to guarantee an initial value that's larger
> than the root (required by the terminating condition of the loop):

iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications). For your version of
the benchmark implementation, I didn't want to mention that
I had to use "log2 1+".

Therefore the sqrt with log2 initial value wins by several
streetlengths. (Although the SQRT based on FSQRT might
still be faster).

-marcel

Anton Ertl

unread,
Dec 9, 2022, 1:48:45 PM12/9/22
to
Marcel Hendrix <m...@iae.nl> writes:
>iForth has the log2 method implemented as a table (it is very
>frequently used in my type of applications).

Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2. In (development) Gforth it's

55CD95A2B62A: test r8,r8
55CD95A2B62D: jz $55CD95A2B645
55CD95A2B62F: bsr r8,r8
55CD95A2B633: xor r8,$3F
55CD95A2B637: movsx rax,r8
55CD95A2B63A: mov r8d,$3F
55CD95A2B640: sub r8,rax
55CD95A2B643: jmp $55CD95A2B64C
55CD95A2B645: mov r8,$FFFFFFFF
55CD95A2B64C: add r13,$08
55CD95A2B650: mov rcx,-$08[r13]
55CD95A2B654: jmp ecx

but I can see several ways to improve on this gcc output (bsr itself
is almost there).

Given that Gforth has LOG2 and iforth has LOG2, there is a slight
chance of standardization.

lehs

unread,
Dec 9, 2022, 5:11:51 PM12/9/22
to
tisdag 6 december 2022 kl. 18:21:48 UTC+1 skrev Anton Ertl:
> "minf...@arcor.de" <minf...@arcor.de> writes:
> >Inspired by another thread I came up with the iterative code shown below.
> >The challenge: find a shorter AND faster version of USQRT!
> >Have fun!
> >( please don't short-circuit through FSQRT or such ;-)
> >
> >\ code:
> >: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
> > 0 to ct u 2/ to x0
> > BEGIN u x0 / x0 + 2/ to x1
> > x1 x0 <
> > WHILE x1 to x0
> > ct 1+ to ct
> > REPEAT
> > ." root:" x0 . ." iterations: " ct . ;
> This is the specialization of the Newton-Raphson method for the square
> root, see
> <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
>
> For the initial value I would use
>
> 1 bits/cell u clz - 2/ lshift
>
> where clz ( x -- u ) is the count-leading-zeros function that many
> architectures have built-in. Newton-Raphson profits a lot from a good
> initial value.
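The quoted initial value, sketched in C with GCC/Clang's __builtin_clzll (an assumption; other compilers spell the intrinsic differently), produces a power of two within a factor of √2 of the true root:

```c
#include <stdint.h>
#include <assert.h>

/* "1 bits/cell u clz - 2/ lshift": 1 shifted left by half the bit
   length of u.  Requires u != 0, since __builtin_clzll(0) is
   undefined. */
static uint64_t sqrt_seed(uint64_t u)
{
    int bits = 64 - __builtin_clzll(u);   /* "bits/cell u clz -" */
    return 1ull << (bits / 2);            /* "2/ lshift" applied to 1 */
}
```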
In Android gforth lg2 is slightly faster than log2, especially for smaller u. The idea of lg2 is to assume that u is exponentially distributed rather than uniformly distributed.

: lg2 \ u -- n
locals| u |
0 -1
begin 2* dup u and
while 1 under+
repeat drop ;

Marcel Hendrix

unread,
Dec 9, 2022, 5:12:45 PM12/9/22
to
On Friday, December 9, 2022 at 7:48:45 PM UTC+1, Anton Ertl wrote:
> Marcel Hendrix <m...@iae.nl> writes:
> >iForth has the log2 method implemented as a table (it is very
> >frequently used in my type of applications).

Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...

> Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
> BMI1 or ABM extensions are present) that allow a fast implementation
> of log2.

FORTH> : test 7 log2 . ; ok
FORTH> see test
Flags: ANSI
$0133DC40 : test
$0133DC4A mov rbx, 7 d#
$0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC5B bsr rax, rbx
$0133DC5F mov rbx, rax
$0133DC62 push rbx
$0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
$0133DC68 ;

-marcel

Anton Ertl

unread,
Dec 9, 2022, 5:27:13 PM12/9/22
to
bsr's result is not defined if the source=0, so I would write
something like

bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest

Marcel Hendrix

unread,
Dec 9, 2022, 6:32:47 PM12/9/22
to
On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
[..]
> bsr's result is not defined if the source=0, so I would write
> something like
>
> bsr dest, source
> bne nonzero
> mov dest, -1
> nonzero:
> # do something with dest

The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...

Or do you mean that bsr with 0 does not work the same on all
architectures/chip generations?

FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok

-marcel

dxforth

unread,
Dec 9, 2022, 10:05:10 PM12/9/22
to
On 10/12/2022 10:32 am, Marcel Hendrix wrote:
> On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
> [..]
>> bsr's result is not defined if the source=0, so I would write
>> something like
>>
>> bsr dest, source
>> bne nonzero
>> mov dest, -1
>> nonzero:
>> # do something with dest
>
> The log2 of 0 is undefined anyway, so I prefer to let the user
> of log2 decide how it should be handled...

The implementer you mean? Several options spring to mind:

a) THROW - clumsy but gets user's attention; always works
b) 0 - can be useful
c) -1 - throwing a spanner in the works in the hope user notices

Anton Ertl

unread,
Dec 10, 2022, 11:49:17 AM12/10/22
to
Marcel Hendrix <m...@iae.nl> writes:
>On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
>[..]
>> bsr's result is not defined if the source=0, so I would write
>> something like
>>
>> bsr dest, source
>> bne nonzero
>> mov dest, -1
>> nonzero:
>> # do something with dest
>
>The log2 of 0 is undefined anyway, so I prefer to let the user
>of log2 decide how it should be handled...

What happens in programming is that some user writes a program that just
uses the result as-is, and it happens to work as intended even for input
0. And if you want to provide a high-quality experience, your
programming system had better be consistent in what it returns in that
case.

Gforth returns -1 for 0 LOG2.

One benefit of returning -1 is that you can get a count of leading
zeros with

( x ) bits/cell 1- swap log2 -
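As a C sketch (the helper names are mine), the identity reads clz(x) = (bits/cell − 1) − log2(x), and the −1 convention makes it hold for x = 0 as well:

```c
#include <stdint.h>
#include <assert.h>

/* Portable log2 returning -1 for 0, and the leading-zero count
   derived from it exactly as in the Forth line above
   ("bits/cell 1- swap log2 -", with 64-bit cells). */
static int ilog2(uint64_t x)
{
    int r = -1;
    while (x) { r++; x >>= 1; }
    return r;
}

static int clz64(uint64_t x)
{
    return 63 - ilog2(x);   /* clz(0) comes out as 64 */
}
```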

>Or do you mean that bsr with 0 does not work the same on all
>architectures/chip generations?

Intel certainly reserves the right to do it that way by saying that
the result is undefined in that case. However, for the same reason as
discussed above, they (and AMD) had better stick with the behaviour of
their original implementation, or their shiny new processor might get
a reputation for incompatibility (yes, Intel may blame the
application, but the customer does not care who is to blame).

So they might just as well document the actual behaviour.

>FORTH> : test 0 log2 . ; ok
>FORTH> see test ( on AMD )
>Flags: ANSI
>$0133DC40 : test
>$0133DC4A xor rbx, rbx
>$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
>$0133DC57 bsr rax, rbx
>$0133DC5B mov rbx, rax
>$0133DC5E push rbx
>$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
>FORTH> test -1 ok

It seems that on your hardware (Zen3 IIRC) bsr leaves the destination
register unchanged when the source is 0, and iForth is one application
that relies on this behaviour in order to return -1 (otherwise the
instruction at $0133DC4D would be superfluous); I just tried in on a
Skylake, and unsurprisingly it behaves the same way. If you don't
want to rely on Intel and AMD sticking to the same behaviour, you can
change the central instructions as shown above or as

mov rax, -1
bsr rbx,rbx
cmove rbx, rax

Interestingly, AMD has added an instruction lzcnt with the ABM
extension since the K10 (2007), and Intel has picked it up in the BMI1
extension since the Haswell (2013). lzcnt delivers a complementary
result (e.g., all-bits-set produces 0, and 0 produces 64), but in
particular it is also defined for 0. It is faster than bsr on the AMD
CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
bsr and 1 bsr every 4 cycles throughput). On the Intel CPUs listed
they both have 3 cycles latency and 1 instruction/cycle throughput.
I wonder what makes bsr so hard to implement on the AMD CPUs that they
added lzcnt.

Of course for log2 you would need something like:

lzcnt rax, rbx
mov rbx, 63
sub rbx, rax

P Falth

unread,
Dec 10, 2022, 5:37:35 PM12/10/22
to
My E5 v2 CPU is the generation before Haswell and does not have the lzcnt opcode.
But it does not flag an error if lzcnt is executed; instead it executes bsr and
gives a wrong result! This is also stated in the Intel manual.

BR
Peter

dxforth

unread,
Dec 10, 2022, 6:56:52 PM12/10/22
to
On 11/12/2022 2:37 am, Anton Ertl wrote:
>
> Gforth returns -1 for 0 LOG2.
>
> One benefit of returning -1 is that you can get a count of leading
> zeros with
>
> ( x ) bits/cell 1- swap log2 -

OTOH simplest implementations return 0:

: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;

code LOG2 ( u1 -- u2 )
bx pop ax ax xor 1 $: bx shr 2 $ jz ax inc 1 $ ju 2 $: 1push
end-code

To return something else requires it be special-cased - which can be
left to the application.

Heinrich Hohl

unread,
Dec 11, 2022, 3:34:15 AM12/11/22
to
On Sunday, December 11, 2022 at 12:56:52 AM UTC+1, dxforth wrote:
> OTOH simplest implementations return 0:
>
> : LOG2 ( u1 -- u2 ) \ Rick C.
> 0 begin swap u2/ dup while swap 1+ repeat drop ;
> To return something else requires it be special-cased - which can be
> left to the application.

You can rewrite this as:

: LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;

Same number of operations, but u1 = 0 returns u2 = -1.

For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
As Anton stated, -1 has certain advantages. It also has the advantage that we can use
0< to check whether the output is correct or illegal.

u2 = 0 should allow us to conclude that u1 = 1.

Henry

Anton Ertl

unread,
Dec 11, 2022, 4:35:48 AM12/11/22
to
P Falth <peter....@gmail.com> writes:
>My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
>But it does not flag an error if executed, instead it executes bsr! and give a wrong
>result. This is also stated in the Intel manual.

Interesting. Looking at the encoding, lzcnt is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.

lehs

unread,
Dec 11, 2022, 4:37:55 AM12/11/22
to
I can't prove it, but it seems that beginning with a value greater than the square root and terminating when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.

: sqrtf \ u -- n
locals| u |
u sqrtc~
begin dup u over / + 2/
tuck - 2 <
until ;

minf...@arcor.de

unread,
Dec 11, 2022, 5:39:01 AM12/11/22
to
Since the steam has gone out of square roots, here's what I have for cube roots.
Seldom used for volumetric calculations. Optimization ideas?

: UCBRT64 {: x | Y B -- cbrt ; integer cube root :}
0 to Y
63 BEGIN dup 0 >=
WHILE
Y 2* to Y
Y 1+ Y * 3 * 1+ to B
x over rshift B >= IF
B over lshift negate x + to x
Y 1+ to Y
THEN 3 -
REPEAT drop
Y ;

cr 0 ucbrt64 . .( 0? )
cr 1 ucbrt64 . .( 1? )
cr 2 ucbrt64 . .( 1? )
cr 7 ucbrt64 . .( 1? )
cr 8 ucbrt64 . .( 2? )
cr 9 ucbrt64 . .( 2? )
cr 999 ucbrt64 . .( 9? )
cr 1000 ucbrt64 . .( 10? )
cr 1001 ucbrt64 . .( 10? )
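For comparison, the same shift-and-subtract scheme in C (a sketch; the name is illustrative). Three bits of the argument are consumed per step, and B = 3·Y·(Y+1)+1 is how much the cube grows when the new root bit is set, since (y+1)³ − y³ = 3y(y+1)+1:

```c
#include <stdint.h>
#include <assert.h>

/* Bit-by-bit integer cube root: for each group of three argument
   bits, double the candidate root y; setting its new low bit would
   grow y^3 by b = 3*y*(y+1) + 1 (scaled by the current shift), so
   set the bit exactly when that much remains in x. */
static uint64_t ucbrt64(uint64_t x)
{
    uint64_t y = 0;
    for (int s = 63; s >= 0; s -= 3) {
        y += y;
        uint64_t b = 3 * y * (y + 1) + 1;
        if ((x >> s) >= b) {
            x -= b << s;
            y++;
        }
    }
    return y;
}
```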

Kerr-Mudd, John

unread,
Dec 11, 2022, 6:48:36 AM12/11/22
to
On Sun, 11 Dec 2022 09:28:07 GMT
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


[log2 to give a first estimate for a sqrt program]

> P Falth <peter....@gmail.com> writes:
> >My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
> >But it does not flag an error if executed, instead it executes bsr! and give a wrong
> >result. This is also stated in the Intel manual.
>
> Interesting. Looking at the encoding, lzcount is encoded as repne
> bsr, and the repne prefix normally is ignored except for the string
> instructions. Apparently nobody encoded bsr with a repne prefix,
> otherwise they could not have encoded lzcnt in this way.
>

I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!

xpost to ala, as it's more about asm.


--
Bah, and indeed Humbug.

Kerr-Mudd, John

unread,
Dec 11, 2022, 6:55:29 AM12/11/22
to
/alt/.lang.asm
doh!

dxforth

unread,
Dec 11, 2022, 9:35:16 PM12/11/22
to
On 11/12/2022 7:34 pm, Heinrich Hohl wrote:
> On Sunday, December 11, 2022 at 12:56:52 AM UTC+1, dxforth wrote:
>> OTOH simplest implementations return 0:
>>
>> : LOG2 ( u1 -- u2 ) \ Rick C.
>> 0 begin swap u2/ dup while swap 1+ repeat drop ;
>> To return something else requires it be special-cased - which can be
>> left to the application.
>
> You can rewrite this as:
>
> : LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;
>
> Same number of operations, but u1 = 0 returns u2 = -1.

On VFX it's even smaller (12 instructions vs. 17) - though this may be a compiler
quirk. I doubt I could re-code the 8086 version to give a -1 result in the
same number of instructions/bytes, let alone make it smaller.

> For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
> is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
> As Anton stated, -1 has certain advantages. It also has the advantage that we can use
> 0< to check whether the output is correct or illegal.
>
> u2 = 0 should allow us to conclude that u1 = 1.

I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.

Heinrich Hohl

unread,
Dec 12, 2022, 6:04:58 AM12/12/22
to
On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
> I would be inclined to test the input rather than the output. Testing
> the output is a good indicator as any that it wasn't 'the best solution'.

Absolutely correct. What I want to say is that for u1=0 the output should
either be undefined, or it should be u2=-1. Not any other value.

Steve

unread,
Dec 12, 2022, 7:43:37 AM12/12/22
to
Hi,

It seems strange to use such an encoding?

Cheers,

Steve N,

Kerr-Mudd, John

unread,
Dec 12, 2022, 7:57:38 AM12/12/22
to
> Hi,
>
> It seems strange to use such an encoding?
>

Maybe so, but it would've saved a loop.

This was my poor attempt:

Log2b: ; ax<-ln2(ax) ; uses cx [14]
mov cx,16 ; bits per word
push cx
repnz shr ax,1
inc cx ; 1 too many
pop ax
sub ax,cx ;
; ln(0)=-1 fix
jnz .ln0nofix
dec ax
.ln0nofix:
ret


Ok, we can use 'bsr' to find the leading bit these days.

Log2c: ; ax<-ln2(ax) ; uses bx [7]

mov bx,-1
xchg ax,bx
bsr ax,bx ; cpu 386




for rep lods, perhaps size coders use
rep lodsb
rather than
mov bx,cx
mov al,[si+bx]

lehs

unread,
Dec 12, 2022, 11:03:21 AM12/12/22
to
No, it doesn't work, but

: lgf \ u -- n
0 swap
begin 1 rshift dup
while 1 under+
repeat drop ;

: sqrtc~ \ u -- n 2^(1+½lg2 u)
lgf 2/ 1+ 1 swap lshift ;

: sqrtf \ u -- n almost as fast as float
1- locals| u |
u sqrtc~
begin dup u over u/ + u2/ tuck u> 0=
until ;

false [if]
: sqrtf \ u -- floor a few percent faster
0 d>f fsqrt f>d drop ;
[then]

dxforth

unread,
Dec 12, 2022, 10:52:47 PM12/12/22
to
It should be u2=-1 only until such time as someone comes along and says
u2=0 is useful. I don't have an integer example but I do have an f/p
example:

\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok

minf...@arcor.de

unread,
Dec 13, 2022, 2:16:54 AM12/13/22
to
0e flog has to be a NAN.
If any, -INF as approximation, but never zero.

Heinrich Hohl

unread,
Dec 13, 2022, 4:14:23 AM12/13/22
to
On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
Looks correct at first sight, but at closer inspection we find:

10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01

The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.

However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.

minf...@arcor.de

unread,
Dec 13, 2022, 6:40:21 AM12/13/22
to
Before starting with negative 0.0e shouldn't it suffice to say that any
similarities between integer and fp logarithm stop for values below 1 ?

dxforth

unread,
Dec 13, 2022, 8:14:51 PM12/13/22
to
On 13/12/2022 8:14 pm, Heinrich Hohl wrote:
> On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
> Looks correct at first sight, but at closer inspection we find:
>
> 10^2 = 100
> 10^1 = 10
> 10^0 = 1
> 10^-1 = 0.1
> 10^-2 = 0.01
>
> The result 0 is not part of this series. 0 does not have an exponent.
> For x --> 0, the exponent grows towards - infinity.
>
> However, in a simple FP package it may be convenient to represent
> f=0.0 as significand = 0 and exponent = 0.

Indeed in every f/p package, since:

0e fs. 0.E0 ok

To print the exponent it has to be extracted and FLOG is handy for that.
While FLOG isn't guaranteed to work on 0e, having it return 0 is better
than some other value.

In the case of LOG2 I don't see sufficient argument for it returning -1.
If least effort and least consequences from misuse be one's aim, then
returning 0 looks good to me.

dxforth

unread,
Dec 13, 2022, 8:22:37 PM12/13/22
to
On 13/12/2022 6:16 pm, minf...@arcor.de wrote:
>
> 0e flog has to be a NAN.
> If any, -INF as approximation, but never zero.

Only in the cult of IEEE. Mathematics simply calls it 'undefined'.
