A/B registers in Forth

Marcel Hendrix

Oct 12, 2008, 5:20:46 AM

Last Wednesday, the German IRC channel ( #forth-ev ) had an interesting
discussion on the usefulness of A and B registers in Forth.

I was very sceptical, but decided to do an experiment. In iForth32 there
are no free registers for A and B, and in iForth64 I did not want to invest
the time for a full implementation. Therefore I decided to emulate A and B
with USERs. Given the implementation of USER in iForth, this should produce
representative results. Also, a USER variable does not create problems
with task switching or with FOREIGN, SYSCALL and CALLBACK words.

The results are surprising. For the simple read-process-write loop,
iForth64 is always about 1.5 times faster, iForth32 2 times. As shown
by PROCESS-STACK2, the success is not simply due to the use of @+ and
!+ instructions.

For iForth32 on the Intel CPU/Windows a strange and surprising
dependency on the size of the memory buffers was noted. I have no idea
what causes it.

-marcel

-- ---------------------------------------------------------------------------
( *
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Do A/B registers bring something?
* CATEGORY : Experiment
* AUTHOR : Marcel Hendrix
* LAST CHANGE : October 12, 2008, Marcel Hendrix
* )

0 [IF]
Try: TEST-A/B ( iForth64, AMD X2 3 GHz, Linux )
Testing with size = 256
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.153 seconds elapsed.
process (A/B) : 0.107 seconds elapsed. ok

FORTH> test-a/b
process (direct) : 0.148 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.164 seconds elapsed.
process (stack1) : 0.205 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.094 seconds elapsed. ok

FORTH> test-A/B ( iForth32, Intel PIV 3 GHz, Windows )
Testing with size = 65536
process (direct) : 0.408 seconds elapsed.
process (locals) : 0.555 seconds elapsed.
process (params) : 0.425 seconds elapsed.
process (stack1) : 0.403 seconds elapsed.
process (stack2) : 0.432 seconds elapsed.
process (A/B) : 0.227 seconds elapsed. ok

Testing with size = 256
process (direct) : 0.223 seconds elapsed.
process (locals) : 0.405 seconds elapsed.
process (params) : 0.230 seconds elapsed.
process (stack1) : 0.221 seconds elapsed.
process (stack2) : 0.175 seconds elapsed.
process (A/B) : 0.228 seconds elapsed. ok

FORTH> test-a/b ( iForth32, AMD X2 3 GHz, Linux )
Testing with size = 65536
process (direct) : 0.146 seconds elapsed.
process (locals) : 0.200 seconds elapsed.
process (params) : 0.172 seconds elapsed.
process (stack1) : 0.188 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.121 seconds elapsed. ok

FORTH> test-a/b
Testing with size = 256
process (direct) : 0.145 seconds elapsed.
process (locals) : 0.201 seconds elapsed.
process (params) : 0.171 seconds elapsed.
process (stack1) : 0.188 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.121 seconds elapsed. ok
[THEN]

\ : !+ ( n addr -- addr+ ) TUCK ! CELL+ ;
\ : @+ ( addr -- addr+ n ) DUP CELL+ SWAP @ ;

USER _A : A@ _A @ ; : A! _A ! ; : A@+ _A @+ SWAP _A ! ; : A!+ _A !+ _A ! ;
USER _B : B@ _B @ ; : B! _B ! ; : B@+ _B @+ SWAP _B ! ; : B!+ _B !+ _B ! ;

0 [IF] #256
[ELSE] #65536
[THEN] CONSTANT size

CREATE area1 PRIVATE size CELLS ALLOT
CREATE area2 PRIVATE size CELLS ALLOT

: PROCESS-DIRECT ( -- )
    #256 0 DO  area1 I CELLS + @
               7 * #33 + #6996 MAX
               area2 I CELLS + !
         LOOP ;

: PROCESS-LOCALS ( -- )
    area1 area2 LOCALS| B@ A@ |
    #256 0 DO  A@ I CELLS + @
               7 * #33 + #6996 MAX
               B@ I CELLS + !
         LOOP ;

: PROCESS-PARAMS ( -- )
    area1 area2 PARAMS| A@ B@ |
    #256 0 DO  A@ I CELLS + @
               7 * #33 + #6996 MAX
               B@ I CELLS + !
         LOOP ;

: PROCESS-STACK1 ( -- )
    area1 area2
    #256 0 DO  OVER I CELLS + @
               7 * #33 + #6996 MAX
               OVER I CELLS + !
         LOOP
    2DROP ;

: PROCESS-STACK2 ( -- )
    area1 area2
    #256 0 DO  SWAP @+
               7 * #33 + #6996 MAX
               ROT !+
         LOOP
    2DROP ;

: PROCESS-A/B ( -- )
    area1 A!
    area2 B!
    #256 0 DO  A@+
               7 * #33 + #6996 MAX
               B!+
         LOOP ;

: TEST-A/B ( -- )
#100000 LOCAL #times
CR ." Testing with size = " size .
CR ." process (direct) : " TIMER-RESET #times 0 ?DO PROCESS-DIRECT LOOP .ELAPSED
CR ." process (locals) : " TIMER-RESET #times 0 ?DO PROCESS-LOCALS LOOP .ELAPSED
CR ." process (params) : " TIMER-RESET #times 0 ?DO PROCESS-PARAMS LOOP .ELAPSED
CR ." process (stack1) : " TIMER-RESET #times 0 ?DO PROCESS-STACK1 LOOP .ELAPSED
CR ." process (stack2) : " TIMER-RESET #times 0 ?DO PROCESS-STACK2 LOOP .ELAPSED
CR ." process (A/B) : " TIMER-RESET #times 0 ?DO PROCESS-A/B LOOP .ELAPSED ;

: .ABOUT CR ." Try: TEST-A/B" ;

.ABOUT

roger...@gmail.com

Oct 12, 2008, 10:40:07 AM

marcel, great post.

i was wondering if the implementations of !+ and @+ in your system are
colon or code definitions. (i'm guessing code because that would
explain them getting 2nd place.)

-Roger

Marcel Hendrix

Oct 12, 2008, 11:48:27 AM

"roger...@gmail.com" <roger...@gmail.com> writes Re: A/B registers in Forth
[..]

> i was wondering if the implementations of !+ and @+ in your system are
> colon or code definitions. (i'm guessing code because that would
> explain them getting 2nd place.)

The !+ and @+ are compiler macros. The A@+ and B!+ are colon definitions,
but the iForth compiler treats colon definitions differently from what you
may be used to. (The final generated code for !+ and B!+ etc. is roughly
equal.) The code for process-stack2 has 8 push/pop instructions, the code
for process-a/b only 6 (iForth64).

-marcel

Bernd Paysan

Oct 12, 2008, 11:43:44 AM

Marcel Hendrix wrote:
> \ : !+ ( n addr -- addr+ ) TUCK ! CELL+ ;
> \ : @+ ( addr -- addr+ n ) DUP CELL+ SWAP @ ;
>
>
> USER _A : A@ _A @ ; : A! _A ! ; : A@+ _A @+ SWAP _A ! ; : A!+ _A !+ _A ! ;
> USER _B : B@ _B @ ; : B! _B ! ; : B@+ _B @+ SWAP _B ! ; : B!+ _B !+ _B ! ;

This looks wrong. Shouldn't

: A@+ _A @ @+ swap _A ! ;

and the same with A!+ (and B..)? And shouldn't the process-xxx words
use "size 0 DO"?

After fixing that, bigFORTH gives the following results (Phenom 2.5GHz,
user-variable used):

Testing with size = 256

process (direct) : 0,274289 sec
process (locals) : 0,296699 sec
process (params) : 0,297403 sec
process (stack1) : 0,274833 sec
process (stack2) : 0,217762 sec
process (A/B) : 0,260369 sec

Testing with size = 65536

process (direct) : 0,275094 sec
process (locals) : 0,307683 sec
process (params) : 0,307260 sec
process (stack1) : 0,288170 sec
process (stack2) : 0,236329 sec
process (A/B) : 0,274088 sec

You probably should calculate something meaningful so that simple mistakes
don't happen. I'm not sure what the difference between params and locals is
in iForth; since there are no params in bigFORTH, I just used locals there,
too (explains the almost identical timing). I then moved params to be the
first test, to exclude problems from PowerNow (but the Phenom is not as
sluggish as the Athlon64 here). The "direct" run became a bit faster by
doing so, even though the params/locals didn't show any difference.

To speed this up, I did the following: bigFORTH already reserves one
register for the object pointer - let that be A. B will be defined as user
variable, but the access words written as code macros (that's reasonable
for a system that doesn't have an analytical compiler). Results:

Phenom 2.5GHz, Linux:

Testing with size = 65536

process (params) : 0,296900 sec
process (direct) : 0,251538 sec
process (locals) : 0,294768 sec
process (stack1) : 0,275718 sec
process (stack2) : 0,220397 sec
process (A/B) : 0,145040 sec

Testing with size = 256

process (params) : 0,301286 sec
process (direct) : 0,246388 sec
process (locals) : 0,297584 sec
process (stack1) : 0,273907 sec
process (stack2) : 0,218305 sec
process (A/B) : 0,146147 sec

Now this shows an actual benefit, too. I'm almost competitive with iForth32;
fix the bug, and I'm better.

File for bigFORTH:
-----------------------------ab-test.fs-------------------------------
Code A@ ax push op ax mov Next end-code macro :ax 0 T&P
Code A! ax op mov ax pop Next end-code macro 0 :ax T&P
Code A@+ ax push op ) ax mov cell # op add next end-code macro :ax 0 T&P
Code A!+ ax op ) mov cell # op add ax pop next end-code macro 0 :ax T&P

USER _B
Code B@ ax push user' _B UP D) ax mov Next end-code macro :ax 0 T&P
Code B! ax user' _B UP D) mov ax pop Next end-code macro 0 :ax T&P
Code B@+ ax push user' _B UP D) dx mov dx ) ax mov
cell # dx add dx user' _B UP D) mov next end-code macro :ax 0 T&P
Code B!+ user' _B UP D) dx mov ax dx ) mov
cell # dx add dx user' _B UP D) mov ax pop next end-code macro :ax 0 T&P

1 [IF] #256
[ELSE] #65536
[THEN] CONSTANT size

CREATE area1 size CELLS ALLOT
CREATE area2 size CELLS ALLOT

: PROCESS-DIRECT ( -- )
    size 0 DO  area1 I CELLS + @
               7 * #33 + #6996 MAX
               area2 I CELLS + !
         LOOP ;

: PROCESS-LOCALS ( -- )
    area1 area2 LOCALS| B@ A@ |
    size 0 DO  A@ I CELLS + @
               7 * #33 + #6996 MAX
               B@ I CELLS + !
         LOOP ;

: PROCESS-PARAMS ( -- )
    area1 area2 { A@ B@ }
    size 0 DO  A@ I CELLS + @
               7 * #33 + #6996 MAX
               B@ I CELLS + !
         LOOP ;

: PROCESS-STACK1 ( -- )
    area1 area2
    size 0 DO  OVER I CELLS + @
               7 * #33 + #6996 MAX
               OVER I CELLS + !
         LOOP
    2DROP ;

: PROCESS-STACK2 ( -- )
    area1 area2
    size 0 DO  SWAP @+ swap
               7 * #33 + #6996 MAX
               ROT !+
         LOOP
    2DROP ;

: PROCESS-A/B ( -- )
    area1 A!
    area2 B!
    size 0 DO  A@+
               7 * #33 + #6996 MAX
               B!+
         LOOP ;

: TEST-A/B ( -- )
    #25600000 size / { #times }
    CR ." Testing with size = " size .
    CR ." process (params) : " !time #times 0 ?DO PROCESS-PARAMS LOOP .time
    CR ." process (direct) : " !time #times 0 ?DO PROCESS-DIRECT LOOP .time
    CR ." process (locals) : " !time #times 0 ?DO PROCESS-LOCALS LOOP .time
    CR ." process (stack1) : " !time #times 0 ?DO PROCESS-STACK1 LOOP .time
    CR ." process (stack2) : " !time #times 0 ?DO PROCESS-STACK2 LOOP .time
    CR ." process (A/B) : " !time #times 0 ?DO PROCESS-A/B LOOP .time ;

: .ABOUT CR ." Try: TEST-A/B" ;

.ABOUT
-----------------------------ab-test.fs-------------------------------

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Marcel Hendrix

Oct 12, 2008, 12:37:17 PM

Bernd Paysan <bernd....@gmx.de> writes Re: A/B registers in Forth

> Marcel Hendrix wrote:
>> \ : !+ ( n addr -- addr+ ) TUCK ! CELL+ ;
>> \ : @+ ( addr -- addr+ n ) DUP CELL+ SWAP @ ;

>> USER _A : A@ _A @ ; : A! _A ! ; : A@+ _A @+ SWAP _A ! ; : A!+ _A !+ _A ! ;
>> USER _B : B@ _B @ ; : B! _B ! ; : B@+ _B @+ SWAP _B ! ; : B!+ _B !+ _B ! ;

> This looks wrong. Shouldn't

> : A@+ _A @ @+ swap _A ! ;

> and the same with A!+ (and B..)?

Blush -- Yes:

USER _A : A@ _A @ ; : A! _A ! ; : A@+ A@ @+ SWAP A! ; : A!+ A@ !+ A! ;
USER _B : B@ _B @ ; : B! _B ! ; : B@+ B@ @+ SWAP B! ; : B!+ B@ !+ B! ;

I hope it is correct now. It makes the process-a/b advantage (nearly) evaporate.
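
A quick sanity check of the corrected words, as a minimal sketch in plain ANS Forth. Assumptions: VARIABLEs stand in for iForth's USERs (so no multitasking), @+ and !+ are the colon definitions from the commented-out lines of the original post rather than compiler macros, and the buffer names src, dst and the word COPY3 are made up for the illustration.

```forth
\ Sketch only: A/B emulated with VARIABLEs instead of USERs.
: !+ ( n addr -- addr+ )  TUCK ! CELL+ ;
: @+ ( addr -- addr+ n )  DUP CELL+ SWAP @ ;

VARIABLE _A  : A@ _A @ ;  : A! _A ! ;  : A@+ A@ @+ SWAP A! ;  : A!+ A@ !+ A! ;
VARIABLE _B  : B@ _B @ ;  : B! _B ! ;  : B@+ B@ @+ SWAP B! ;  : B!+ B@ !+ B! ;

CREATE src  10 , 20 , 30 ,       \ hypothetical test data
CREATE dst  3 CELLS ALLOT

\ Copy three cells from src to dst through A (read) and B (write).
: COPY3 ( -- )  src A!  dst B!  3 0 DO  A@+ B!+  LOOP ;

COPY3
dst @ .  dst CELL+ @ .  dst 2 CELLS + @ .   \ prints 10 20 30
```

With the first version of A@+ ( _A @+ SWAP _A ! ) the fetch goes to the cell _A itself instead of the cell A points at, so a check like this fails immediately.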

> And shouldn't the process-xxx words
> use "size 0 DO"?

I don't see why? You mean because #times is effectively a constant?

> After fixing that, bigFORTH gives the following results (Phenom 2.5GHz,
> user-variable used):

[..]

> You probably should calculate something meaningful so that simple mistakes
> don't happen. I'm not sure what the difference between params and locals is
> in iForth; since there are no params in bigFORTH, I just used locals there,
> too (explains the almost identical timing).

[..]

One can take the address of a LOCAL, but not of PARAM, so a PARAM can be
a register. And the stack order is reversed to be ready for Forth20xx.

> To speed this up, I did the following: bigFORTH already reserves one
> register for the object pointer - let that be A. B will be defined as user
> variable, but the access words written as code macros (that's reasonable
> for a system that doesn't have an analytical compiler). Results:

[..]

> Testing with size = 256
> process (params) : 0,301286 sec
> process (direct) : 0,246388 sec
> process (locals) : 0,297584 sec
> process (stack1) : 0,273907 sec
> process (stack2) : 0,218305 sec
> process (A/B) : 0,146147 sec

> Now this shows an actual benefit, too. I'm almost competitive with iForth32;
> fix the bug, and I'm better.

It is well-known that hand-assembly is supposed to speed up Forth 10-fold or more :-)

What is the difference between a Phenom 2,5G and an X2 3G?

-marcel

-- -------------------------------------------------------------

Try: TEST-A/B ( iForth64, AMD X2 3 GHz, Linux )
Testing with size = 65536
process (direct) : 0.148 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.140 seconds elapsed. ok

Testing with size = 256
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.153 seconds elapsed.
process (A/B) : 0.131 seconds elapsed. ok

FORTH> test-A/B ( iForth32, Intel PIV 3 GHz, Windows )
Testing with size = 65536
process (direct) : 0.403 seconds elapsed.
process (locals) : 0.548 seconds elapsed.
process (params) : 0.421 seconds elapsed.
process (stack1) : 0.395 seconds elapsed.
process (stack2) : 0.422 seconds elapsed.
process (A/B) : 0.375 seconds elapsed. ok

Testing with size = 256
process (direct) : 0.222 seconds elapsed.
process (locals) : 0.400 seconds elapsed.
process (params) : 0.228 seconds elapsed.
process (stack1) : 0.220 seconds elapsed.
process (stack2) : 0.176 seconds elapsed.
process (A/B) : 0.237 seconds elapsed. ok

FORTH> test-a/b ( iForth32, AMD X2 3 GHz, Linux )
Testing with size = 65536
process (direct) : 0.146 seconds elapsed.
process (locals) : 0.200 seconds elapsed.
process (params) : 0.172 seconds elapsed.
process (stack1) : 0.188 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.139 seconds elapsed. ok

FORTH> test-a/b
Testing with size = 256
process (direct) : 0.145 seconds elapsed.
process (locals) : 0.201 seconds elapsed.
process (params) : 0.172 seconds elapsed.
process (stack1) : 0.187 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.139 seconds elapsed. ok

Bernd Paysan

Oct 12, 2008, 1:04:44 PM

Marcel Hendrix wrote:
>> And shouldn't the process-xxx words
>> use "size 0 DO"?
>
> I don't see why? You mean because #times is effectively a constant?

I thought the idea was to test small vs. large arrays (and actually access
the larger arrays, and not only the first 256 cells). In my modified code,
#times is no longer a constant.
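
The scaling in the modified TEST-A/B can be sketched on its own. This is only an illustration: the word names TOTAL-CELLS and ITERATIONS are made up here, the figure 25600000 is the one used in the ab-test.fs listing, and cells of at least 32 bits are assumed.

```forth
\ Sketch: keep the total number of processed cells roughly constant,
\ so small and large buffers do comparable amounts of work.
25600000 CONSTANT TOTAL-CELLS

: ITERATIONS ( size -- #times )  TOTAL-CELLS SWAP / ;

256   ITERATIONS .    \ prints 100000
65536 ITERATIONS .    \ prints 390 (integer division, so slightly fewer cells overall)
```

This only pays off if the process-xxx loops really run over "size" cells; with a fixed #256 loop count the larger buffer is never touched beyond its first 256 cells.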

> One can take the address of a LOCAL, but not of PARAM, so a PARAM can be
> a register. And the stack order is reversed to be ready for Forth20xx.

Ok, but then one can't take the address of a local in bigFORTH (though it
lives in memory).

> It is well-known that hand-assembly is supposed to speed up Forth 10-fold
> or more :-)

It's hand assembly of something that's supposed to be a primitive (A/B).

> What is the difference between a Phenom 2,5G and an X2 3G?

Most of the time the Phenom is clock-for-clock identical with the Athlon64,
unless the L3 cache gives a benefit or you use SSE instructions (there it
can be up to twice as fast).

Marcel Hendrix

Oct 12, 2008, 2:10:55 PM

Bernd Paysan <bernd....@gmx.de> writes Re: A/B registers in Forth

> Marcel Hendrix wrote:
[..]

>> One can take the address of a LOCAL, but not of PARAM, so a PARAM can be
>> a register. And the stack order is reversed to be ready for Forth20xx.

> Ok, but then one can't take the address of a local in bigFORTH (though it
> lives in memory).

I found it useful, as for OS calls and such one can pass the address of a
local if the code wants a pointer to a variable. I regret this choice now,
as it prevents the compiler from doing a few optimizations.

>> It is well-known that hand-assembly is supposed to speed up Forth 10-fold
>> or more :-)

> It's hand assembly of something that's supposed to be a primitive (A/B).

In that case, below is how I would do it. Only for iForth64
(it's non-portable code). The advantage is back.

>> What is the difference between a Phenom 2,5G and an X2 3G?

> Most of the time the Phenom is clock-for-clock identical with the Athlon64,
> unless the L3 cache gives a benefit or you use SSE instructions (there it
> can be up to twice as fast).

Tempting. That would mean 16 Gflops for iForth's matrix multiply primitives.

-marcel

-- ----------------------------------------------------
ALSO ASSEMBLER
: A@+ ( -- u )
0 1 IN/OUT
POSTPONE ASM{
[r15 [ _A UP - ] LITERAL +] -> rax mov,
[rax] -> rbx mov,
[rax =CELL +] -> rax lea,
rax -> [r15 [ _A UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: B!+ ( u -- )
1 0 IN/OUT
POSTPONE ASM{
[r15 [ _B UP - ] LITERAL +] -> rax mov,
rbx -> [rax] mov,
[rax =CELL +] -> rax lea,
rax -> [r15 [ _B UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
PREVIOUS

Try: TEST-A/B ( iForth64, AMD X2 3 GHz, Linux )
Testing with size = 256
Using in-line code.
process (direct) : 0.148 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.205 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.096 seconds elapsed. ok

Stephen Pelc

Oct 12, 2008, 2:49:55 PM

On Sun, 12 Oct 2008 11:20:46 +0200, m...@iae.nl (Marcel Hendrix) wrote:

>Last Wednesday, the German IRC channel ( #forth-ev ) had an interesting
>discussion on the usefulness of A and B registers in Forth.

...

>The results are surprising. For the simple read-process-write loop,
>iForth64 is always about 1.5 times faster, iForth32 2 times. As shown
>by PROCESS-STACK2, the success is not simply due to the use of @+ and
>!+ instructions.

The code example is by permission of Gary Bergstrom - it's beautiful!
I do hope that Gary will publish the details somewhere.

My view is that the A/B registers provide a degree of persistence
and the performance gain is due to the reduction in stack shuffling
at boundaries because there's less to be shuffled.
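
The persistence argument can be illustrated with a small sketch in plain ANS Forth. It is not taken from any of the systems discussed: A is emulated with a VARIABLE, and the names SUM3, SUM6, SUM3S and SUM6S are invented for the example.

```forth
\ Sketch only: A emulated with a VARIABLE.
VARIABLE _A
: A!   ( addr -- )  _A ! ;
: A@+  ( -- x )     _A @ DUP CELL+ _A !  @ ;

\ With A: the helper takes no address parameter, A persists across the call.
: SUM3   ( -- n )       A@+ A@+ +  A@+ + ;
: SUM6   ( addr -- n )  A!  SUM3 SUM3 + ;

\ Without A: the address is threaded through every word boundary.
: @+     ( addr -- addr+ n )  DUP CELL+ SWAP @ ;
: SUM3S  ( addr -- addr+ n )  @+ >R  @+ >R  @+  R> +  R> + ;
: SUM6S  ( addr -- n )        SUM3S >R  SUM3S NIP  R> + ;

CREATE data  1 , 2 , 3 , 4 , 5 , 6 ,
data SUM6  .    \ prints 21
data SUM6S .    \ prints 21
```

The stack-only version has to carry the running address in and out of SUM3S and park partial sums on the return stack; that is the shuffling at word boundaries that goes away when the address lives in A.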

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Don Seglio

Oct 12, 2008, 5:53:54 PM

Interesting subject, any more references?

Cecil

Bernd Paysan

Oct 12, 2008, 6:21:01 PM

Stephen Pelc wrote:
> My view is that the A/B registers provide a degree of persistence
> and the performance gain is due to the reduction in stack shuffling
> at boundaries because there's less to be shuffled.

Let's suppose you actually have locals in registers, and your optimizing
compiler really produces good code: What would the actual benefit be? For a
system like bigFORTH, which has a limited optimizer, the benefit is
obvious.

There's another point with the A/B X/Y proposal: With the register pressure
we have on the most popular architecture, I don't want to have four of them
dedicated; two is already too much ;-). My suggestion would be to have both
postincrement and offset calculation, but only two, with one of them
preferred (when you need only one, use A).

Marcel Hendrix

Oct 13, 2008, 1:10:32 PM

Bernd Paysan <bernd....@gmx.de> writes Re: A/B registers in Forth

[..]

> There's another point with the A/B X/Y proposal: With the register pressure
> we have on the most popular architecture, I don't want to have four of them
> dedicated; two is already too much ;-). My suggestion would be to have both
> postincrement and offset calculation, but only two, with one of them
> preferred (when you need only one, use A).

Why not have them all? Only one of them needs to really be a register,
all others could be USERs. Everyone benefits.

Far more important is that the names A, B, X and Y are very generic
and could cause awkward clashes with existing code. And second, modern
hardware has 16 double-precision FPU registers lying around doing nothing.
(SSE2). Why not have fA, fB? They would be hugely more useful for DSP work
than iA and iB.

-marcel

Bernd Paysan

Oct 13, 2008, 4:59:49 PM

Marcel Hendrix wrote:
> Why not have them all? Only one of them needs to really be a register,
> all others could be USERs. Everyone benefits.

But that one that is a register should have both auto-increment and index
addressing - autoincrement for DSP work, indexing for C stackframes or
object oriented extensions.
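
As a rough sketch of what "one register, two access modes" could look like at the source level, in plain ANS Forth. The register is emulated with a VARIABLE, and the name A[]@ is invented here for the indexed fetch; none of this is an existing wordset.

```forth
\ Sketch only: one address "register" with two ways to use it.
VARIABLE _A
: A!    ( addr -- )  _A ! ;
: A@+   ( -- x )     _A @ DUP CELL+ _A !  @ ;    \ post-increment fetch
: A[]@  ( ix -- x )  CELLS _A @ + @ ;            \ indexed fetch, A unchanged

CREATE rec  11 , 22 , 33 , 44 ,
rec A!
A@+ .       \ prints 11, A now points at the 22
A@+ .       \ prints 22, A now points at the 33
0 A[]@ .    \ prints 33
1 A[]@ .    \ prints 44
```

Post-increment suits streaming through a buffer; the indexed form suits field access relative to a base address, as with a stack frame or an object.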

> Far more important is that the names A, B, X and Y are very generic
> and could cause awkward clashes with existing code.

Not really. Existing code that defines A, B, X, or Y is just fine. It can't
use them, but it won't harm (unless redefined warnings harm).

> And second, modern
> hardware has 16 double-precision FPU registers lying around doing nothing.
> (SSE2). Why not have fA, fB? They would be hugely more useful for DSP work
> than iA and iB.

Another, entirely different topic...

William James

Oct 14, 2008, 5:07:59 AM

On Oct 12, 11:37 am, m...@iae.nl (Marcel Hendrix) wrote:
> Bernd Paysan <bernd.pay...@gmx.de> writes Re: A/B registers in Forth

>
> > Marcel Hendrix wrote:
> >> \ : !+ ( n addr -- addr+ ) TUCK ! CELL+ ;
> >> \ : @+ ( addr -- addr+ n ) DUP CELL+ SWAP @ ;
> >> USER _A : A@ _A @ ; : A! _A ! ; : A@+ _A @+ SWAP _A ! ; : A!+ _A !+ _A ! ;
> >> USER _B : B@ _B @ ; : B! _B ! ; : B@+ _B @+ SWAP _B ! ; : B!+ _B !+ _B ! ;
> > This looks wrong. Shouldn't
> > : A@+ _A @ @+ swap _A ! ;
> > and the same with A!+ (and B..)?
>
> Blush -- Yes:

Another example lending weight to the
argument that using Forth increases the
likelihood of producing incorrect, unreadable, and unmaintainable
code.

Forth code is easy for a computer to parse; it is difficult for a
human to comprehend.

William James

Oct 14, 2008, 5:33:57 AM

What a mass of redundancy!

Why don't you use a high-level language
that encourages factoring?

Ruby:

$size = 200_000

def gen_with_Array
  Array.new( $size ){ rand }
end
def gen_from_range
  (1 .. $size ).map{ rand }
end
def gen_by_appending
  a = []
  $size.times{ a << rand }
end

[
  [ 'Array', :gen_with_Array ],
  [ 'range', :gen_from_range ],
  [ 'append', :gen_by_appending ]
].each{|str,sym| t = Time.now
  send( sym )
  puts "method (#{ str }): #{ Time.now - t }"
}

--- output ---
method (Array): 0.48
method (range): 0.581
method (append): 0.661

Bernd Paysan

Oct 14, 2008, 7:30:46 AM

William James wrote:
> What a mass of redundancy!
>
> Why don't you use a high-level language
> that encourages factoring?

No need for that, Forth provides everything you need, Marcel just didn't use
it. This is to quickly benchmark low-level functions.

Marcel's actual fault was that his benchmark code did not provide some
testable results, and therefore a small omission in the code resulted in
incorrect behavior - but that will happen with any language. If you don't
test, better assume it's broken.

ste...@mpeforth.com

Oct 14, 2008, 9:24:35 AM

On Oct 12, 11:21 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> There's another point with the A/B X/Y proposal: With the register pressure
> we have on the most popular architecture, I don't want to have four of them
> dedicated; two is already too much ;-). My suggestion would be to have both
> postincrement and offset calculation, but only two, with one of them
> preferred (when you need only one, use A).

On most systems you already use X and Y - they are the registers for the
USER area and the locals frame pointer. So you are really down to two
extra registers, A and B. As Marcel has shown, keeping A and B as
variables may still improve performance.

Stephen

ste...@mpeforth.com

Oct 14, 2008, 9:32:01 AM

On Oct 13, 6:10 pm, m...@iae.nl (Marcel Hendrix) wrote:
> Far more important is that the names A, B, X and Y are very generic
> and could cause awkward clashes with existing code.

It was a virtual machine description. There are dozens of CPUs
with registers called A, B, X and Y! If you want to call them
rA, rB and so on in your virtual machine assembler, you're
welcome to do so. Nowhere in the wordset I described are there
words called A or B!

Stephen

Bernd Paysan

Oct 14, 2008, 11:00:25 AM

ste...@mpeforth.com wrote:
> On most systems you already use X and Y - they are the registers for the
> USER area and the locals frame pointer. So you are really down to two
> extra registers, A and B. As Marcel has shown, keeping A and B as
> variables may still improve performance.

But actually you don't need A and B or X and Y at the same time. You
want two registers and two different ways to access them: Either with
immediate offset or with postincrement. If I get this (two registers, two
ways of accessing them), I can have both of them in registers even on x86 -
op and up are already registers in bigFORTH.

The suggestion is that you have >X >Y and X> Y> words, which push the old
value on the stack, and within such usage, you can't access e.g. locals or
the user area, because they will use the same registers (in bigFORTH, I use
the return stack pointer for locals, frame pointers are for wimps ;-).

Marcel Hendrix

Oct 14, 2008, 1:42:08 PM

William James <w_a_...@yahoo.com> writes Re: A/B registers in Forth

>On Oct 12, 4:20=A0am, m...@iae.nl (Marcel Hendrix) wrote:

>> : PROCESS-DIRECT ( -- )
>> =A0 =A0 =A0 =A0 #256 0 DO =A0area1 I CELLS + @
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A07 * #33 + =A0#6996 MAX
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0area2 I CELLS + !
[..]

> What a mass of redundancy!

> Why don't you use a high-level language
> that encourages factoring?

You mean, like the monkey that coded your newsreader?

-marcel

Ed

Oct 15, 2008, 10:16:38 PM

The naming and stack diagrams for some of the X/Y operations
don't seem right. Was this a typo? Are the X/Y operations
available for apps programmers to use?

--

Are the new regs meant to be added onto the existing VM (ANS)
- or do they represent a new Forth engine? Given that A/B has the
potential to rewrite the way Forth is programmed, it may be
better to have a new standard, e.g. Forth II. This would avoid
splitting ANS Forth into the "haves" and "have nots", and opens
the door to other changes not possible with the current model.

In the meantime, it would be nice to see more code examples
along with properly implemented forths to test the effectiveness
of the scheme. Perhaps issuing a draft spec for the new regs
would help kick things off.

Marcel Hendrix

Oct 16, 2008, 1:30:10 AM

Here's an addition for the USER A/B code that tests X/Y with indexed
addressing. The advantage is now only there in iForth64 (because it has
better USERs).

I take no responsibility for the correctness of this example. Most of it
is mindless copy and paste. Just meant as my $0.02.

-marcel
-- ----------------------------------------
(*

* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments

* DESCRIPTION : Do X/Y registers bring something?

* CATEGORY : Experiment
* AUTHOR : Marcel Hendrix

* LAST CHANGE : Wednesday, October 15, 2008, 22:59 PM, Marcel Hendrix
*)

NEEDS -miscutil
NEEDS -assemble

REVISION -xy "--- X/Y register test Version 1.00 ---"

PRIVATES

DOC
(*
FORTH> test-x/y ( iForth32 PIV 3 GHz, Windows )
Testing with size = 256
Using in-line code.
process (direct) : 0.222 seconds elapsed.
process (locals) : 0.402 seconds elapsed.
process (params) : 0.232 seconds elapsed.
process (stack1) : 0.222 seconds elapsed.
process (stack2) : 0.175 seconds elapsed.
process (A/B) : 0.481 seconds elapsed. ok

Testing with size = 256
Not using in-line code.
process (direct) : 0.223 seconds elapsed.
process (locals) : 0.400 seconds elapsed.
process (params) : 0.232 seconds elapsed.
process (stack1) : 0.219 seconds elapsed.
process (stack2) : 0.177 seconds elapsed.
process (X/Y) : 0.339 seconds elapsed. ok

FORTH> test-x/y ( iForth64 AMD X2 3 GHz, Linux )
Testing with size = 256
Using in-line code.
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.189 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (X/Y) : 0.140 seconds elapsed. ok

Testing with size = 256
Not using in-line code.
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.189 seconds elapsed.
process (params) : 0.164 seconds elapsed.
process (stack1) : 0.205 seconds elapsed.
process (stack2) : 0.155 seconds elapsed.
process (X/Y) : 0.213 seconds elapsed. ok
*)
ENDDOC

FALSE =: il-code? PRIVATE

USER _X
USER _Y

il-code? [IF]
\ Inlined code
64BIT? [IF]
ALSO ASSEMBLER
: X@ ( -- u )

0 1 IN/OUT
POSTPONE ASM{

[r15 [ _X UP - ] LITERAL +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: X! ( u -- )

1 0 IN/OUT
POSTPONE ASM{

rbx -> [r15 [ _X UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []X@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
[r15 [ _X UP - ] LITERAL +] -> rax mov,
[rax +rbx *8 0 +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []X! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
[r15 [ _X UP - ] LITERAL +] -> rax mov,
rcx -> [rax +rbx *8 0 +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: Y@ ( -- u )

0 1 IN/OUT
POSTPONE ASM{

[r15 [ _Y UP - ] LITERAL +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: Y! ( u -- )

1 0 IN/OUT
POSTPONE ASM{

rbx -> [r15 [ _Y UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []Y@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
[r15 [ _Y UP - ] LITERAL +] -> rax mov,
[rax +rbx *8 0 +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []Y! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
[r15 [ _Y UP - ] LITERAL +] -> rax mov,
rcx -> [rax +rbx *8 0 +] mov,

}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
PREVIOUS

[ELSE]
ALSO ASSEMBLER
: X@ ( -- u )

0 1 IN/OUT
POSTPONE ASM{

ebp -> ebx mov,
[ _X UP - ] LITERAL d# -> bx mov,
[ebx] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: X! ( u -- )

1 0 IN/OUT
POSTPONE ASM{

ebp -> eax mov,
[ _X UP - ] LITERAL d# -> ax mov,
ebx -> [eax] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []X@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _X UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
[eax ebx*4 0 +] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []X! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _X UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
ecx -> [eax ebx*4 0 +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: Y@ ( -- u )

0 1 IN/OUT
POSTPONE ASM{

ebp -> ebx mov,
[ _Y UP - ] LITERAL d# -> bx mov,
[ebx] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: Y! ( u -- )

1 0 IN/OUT
POSTPONE ASM{

ebp -> eax mov,
[ _Y UP - ] LITERAL d# -> ax mov,
ebx -> [eax] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []Y@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _Y UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
[eax ebx*4 0 +] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY

: []Y! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _Y UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
ecx -> [eax ebx*4 0 +] mov,

}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
PREVIOUS

[THEN]
[ELSE]
: X@ _X @ ; : X! _X ! ; : []X@ CELLS X@ + @ ; : []X! CELLS X@ + ! ;
: Y@ _Y @ ; : Y! _Y ! ; : []Y@ CELLS Y@ + @ ; : []Y! CELLS Y@ + ! ;
[THEN]

( n -- ) 4 MAX #20 MIN 2^x =: size

CR .( size = ) size DEC.

CREATE area1 PRIVATE size CELLS ALLOT
CREATE area2 PRIVATE size CELLS ALLOT

: PROCESS-DIRECT ( -- )
    size 0 DO  area1 I CELL[] @
               7 * #33 + #6996 MAX
               area2 I CELL[] !
         LOOP ; PRIVATE

: PROCESS-LOCALS ( -- )
    area1 area2 LOCALS| Y@ X@ |
    size 0 DO  X@ I CELL[] @
               7 * #33 + #6996 MAX
               Y@ I CELL[] !
         LOOP ; PRIVATE

: PROCESS-PARAMS ( -- )
    area1 area2 PARAMS| X@ Y@ |
    size 0 DO  X@ I CELL[] @
               7 * #33 + #6996 MAX
               Y@ I CELL[] !
         LOOP ; PRIVATE

: PROCESS-STACK1 ( -- )
    area1 area2
    size 0 DO  OVER I CELL[] @
               7 * #33 + #6996 MAX
               OVER I CELL[] !
         LOOP
    2DROP ; PRIVATE

: PROCESS-STACK2 ( -- )
    area1 area2
    size 0 DO  SWAP @+
               7 * #33 + #6996 MAX
               ROT !+
         LOOP
    2DROP ; PRIVATE

: PROCESS-X/Y ( -- )
    X@ Y@
    area1 X!
    area2 Y!
    size 0 DO  I []X@
               7 * #33 + #6996 MAX
               I []Y!
         LOOP
    Y! X! ; PRIVATE

: TEST-X/Y ( -- )
#25600000 size / LOCAL #times
CR ." Testing with size = " size DEC.
CR il-code? IF ." Using " ELSE ." Not using " ENDIF ." in-line code."
CR ." process (direct) : " TIMER-RESET #times 0 DO PROCESS-DIRECT LOOP .ELAPSED
CR ." process (locals) : " TIMER-RESET #times 0 DO PROCESS-LOCALS LOOP .ELAPSED
CR ." process (params) : " TIMER-RESET #times 0 DO PROCESS-PARAMS LOOP .ELAPSED
CR ." process (stack1) : " TIMER-RESET #times 0 DO PROCESS-STACK1 LOOP .ELAPSED
CR ." process (stack2) : " TIMER-RESET #times 0 DO PROCESS-STACK2 LOOP .ELAPSED
CR ." process (X/Y) : " TIMER-RESET #times 0 DO PROCESS-X/Y LOOP .ELAPSED ;

:ABOUT CR ." Try: TEST-X/Y" ;

.ABOUT -xy CR
DEPRIVE

(* End of Source *)

Peter Fälth

Oct 17, 2008, 6:04:36 AM

On Oct 12, 8:49 pm, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Sun, 12 Oct 2008 11:20:46 +0200, m...@iae.nl (Marcel Hendrix) wrote:
> >Last Wednesday, the German IRC channel ( #forth-ev ) had an interesting
> >discussion on the usefulness of A and B registers in Forth.
> ...
> >The results are surprising. For the simple read-process-write loop,
> >iForth64 is always about 1.5 times faster, iForth32 2 times. As shown
> >by PROCESS-STACK2, the success is not simply due to the use of @+ and
> >!+ instructions.
>
> See also:
> http://www.ddj.com/embedded/210603608
> http://www.complang.tuwien.ac.at/anton/euroforth/ef08/papers/pelc.pdf
>

I find this interesting, and I am always interested in testing new
developments, but on a closer look it needs some more specification.

First: what is the scope of the registers?
If X is the user pointer, it is global (per thread).
Y, being a frame pointer, should be local.
But what about A and B? Are they global or local?
If global, should they be saved at a function call?
Who does the saving, the caller or the callee?
The example treats them as global and uses them to pass arguments
(but does not save and restore the old state).
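
To make the "who saves" question concrete, here is a minimal sketch in
portable ANS Forth of the callee-saves choice, the same bracketing that
PROCESS-X/Y in Marcel's source does with X@ Y@ ... Y! X!. It is only an
illustration: A is emulated with an ordinary VARIABLE rather than a CPU
register, and the names (A), A@, A!, @A+ and SUM-CELLS are made up for
this example, not words from iForth or VFX.

```forth
\ Assumption: A is emulated with a VARIABLE, not a real register.
VARIABLE (A)
: A@   ( -- addr )  (A) @ ;
: A!   ( addr -- )  (A) ! ;
: @A+  ( -- x )     A@ @  1 CELLS (A) +! ;  \ fetch via A, then advance A one cell

\ Callee-saves: the word that clobbers A restores it before returning.
: SUM-CELLS ( addr n -- sum )
    A@ >R                      \ save the caller's A on the return stack
    SWAP A!                    \ A now points at the first cell
    0 SWAP 0 ?DO  @A+ +  LOOP  \ accumulate n cells
    R> A! ;                    \ restore the caller's A
```

With this discipline a caller can keep its own value in A across the
call; with caller-saves the A@ >R ... R> A! pair would instead move
around each call site that cares.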

Second: if this is to be used, there needs to be a standard wordset.
Yours and Marcel's already differ ( >A vs A!). I like yours better.

A side note: you write that the stack frame makes calling C more
efficient.
How is this achieved?
I call Windows syscalls on x86, and there a stack frame is not needed
(the same goes for Linux syscalls).

Regards
Peter Fälth

> The code example is by permission of Gary Bergstrom - it's beautiful!
> I do hope that Gary will publish the details somewhere.
>
> My view is that the A/B registers provide a degree of persistence
> and the performance gain is due to the reduction in stack shuffling
> at boundaries because there's less to be shuffled.
>
> Stephen
>
> --

> Stephen Pelc, stephen...@mpeforth.com

ste...@mpeforth.com

Oct 17, 2008, 7:48:50 AM

On Oct 17, 11:04 am, Peter Fälth <peter.fa...@tin.it> wrote:
> I find this interesting, and I am always interested in testing new
> developments, but on a closer look it needs some more specification.

It's a version of the Forth VM. At present, most Forths keep the
USER pointer in a register, so let's use X for that. X is a global
CPU register like the A and B registers. It's the Forth implementation's
responsibility to save and restore them as required - just the same
as for systems that cache TOS in a CPU register.

> First: what is the scope of the registers?

Global

> If X is the user pointer it is global (per thread)
> Y being a frame pointer should be local

No, the tasker is responsible for state save/restore.

> But what about A and B? Are they global or local?

Global.

> If global, should they be saved at a function call?
> Who does the saving, the caller or the callee?

Implementation issues!

> The example treats them as global and uses them to pass arguments.
> (but does not save and restore the old state)

Deliberately so. A and B should probably be saved/restored
by the tasker.
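
As a rough sketch of what "saved/restored by the tasker" could look like,
the following ANS Forth fragment copies emulated A and B registers to and
from a per-task save area at a task switch. Everything here is an
assumption made for illustration: the registers are plain VARIABLEs, and
TASK-AREA, SAVE-AB, RESTORE-AB and SWITCH-AB are hypothetical names, not
part of any real tasker.

```forth
\ Assumption: A and B are emulated with VARIABLEs.
VARIABLE (A)   VARIABLE (B)

\ Hypothetical per-task save area: cell 0 holds A, cell 1 holds B.
: TASK-AREA  ( "name" -- )  CREATE 0 , 0 , ;

: SAVE-AB    ( task -- )  (A) @ OVER !   (B) @ SWAP CELL+ ! ;
: RESTORE-AB ( task -- )  DUP @ (A) !    CELL+ @ (B) ! ;

\ What a cooperative tasker might do when switching from OLD to NEW.
: SWITCH-AB  ( old new -- )  SWAP SAVE-AB  RESTORE-AB ;
```

Application words then never save A or B themselves; the cost is two
extra cells per task and a few moves per switch.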

> Second: if this is to be used, there needs to be a standard wordset.
> Yours and Marcel's already differ ( >A vs A!). I like yours better.
>
> A side note: you write that the stack frame makes calling C more
> efficient. How is this achieved?

The point is not about calling C from Forth, but to make a two-stack
VM capable of executing compiled C well.

Stephen

Brad Eckert

Oct 17, 2008, 12:14:08 PM

On Oct 17, 4:48 am, step...@mpeforth.com wrote:
> The point is not about calling C from Forth, but to make a two-stack
> VM capable of executing compiled C well.

Should C's parameter stack be a separate frame stack or can it be the
return stack? I have a VM that uses A as the memory address for @/!.
It seems to me that an instruction with the effect "A=RP+literal"
would be useful for supporting C.

-Brad

ste...@mpeforth.com

Oct 17, 2008, 1:08:42 PM

On Oct 17, 5:14 pm, Brad Eckert <nospaambr...@tinyboot.com> wrote:
> Should C's parameter stack be a separate frame stack or can it be the
> return stack? I have a VM that uses A as the memory address for @/!.
> It seems to me that an instruction with the effect "A=RP+literal"
> would be useful for supporting C.

The point of the X and Y registers is to provide fetch/store opcodes
that support (X/Y+lit) directly. AFAIR profiling in the OTA machine,
which had a C compiler, indicated that these were a good trade off
for code density. Base+offset addressing is also needed for Forth
USER variable access and locals.

My preference would be to use Y as the C frame pointer, and to pass
input arguments on the data stack. They can always be popped into
the frame if required. Similarly, return values would appear on the
data stack.
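
The convention described above can be sketched in portable ANS Forth as
follows. This is an illustration under stated assumptions, not the OTA
machine's design: Y is emulated with the VARIABLE _Y, frames are assumed
to grow downward inside a small fixed area, []Y@ and []Y! follow the
portable definitions in Marcel's source, and FRAME-CELLS, frame-area,
+FRAME, -FRAME and C-F are hypothetical names.

```forth
\ Assumption: Y is emulated with a VARIABLE; frames grow downward.
16 CONSTANT FRAME-CELLS
CREATE frame-area  FRAME-CELLS CELLS ALLOT
VARIABLE _Y   frame-area FRAME-CELLS CELLS + _Y !   \ no frame allocated yet

: []Y@   ( ix -- u )  CELLS _Y @ + @ ;   \ fetch from (Y + ix cells)
: []Y!   ( u ix -- )  CELLS _Y @ + ! ;   \ store to   (Y + ix cells)
: +FRAME ( n -- )     CELLS NEGATE _Y +! ;  \ allocate an n-cell frame
: -FRAME ( n -- )     CELLS _Y +! ;         \ release it again

\ Roughly: int f(int a, int b) { int t = a + b; return t * a; }
: C-F ( a b -- r )
    3 +FRAME
    1 []Y!  0 []Y!             \ pop b, then a, from the data stack into the frame
    0 []Y@ 1 []Y@ +  2 []Y!    \ t = a + b
    2 []Y@ 0 []Y@ *            \ return value is left on the data stack
    3 -FRAME ;
```

For example, 3 4 C-F leaves 21 on the data stack and _Y back where it
started.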

Stephen

Brad Eckert

Oct 17, 2008, 3:31:42 PM

On Oct 17, 10:08 am, step...@mpeforth.com wrote:
> The point of the X and Y registers is to provide fetch/store opcodes
> that support (X/Y+lit) directly. AFAIR profiling in the OTA machine,
> which had a C compiler, indicated that these were a good trade off
> for code density. Base+offset addressing is also needed for Forth
> USER variable access and locals.

Is X the same as UP?

> My preference would be to use Y as the C frame pointer, and to pass
> input arguments on the data stack. They can always be popped into
> the frame if required. Similarly, return values would appear on the
> data stack.

Wouldn't it be easier to pass input arguments on the stack frame,
since they are easily addressed by Y?

-Brad

ste...@mpeforth.com

Oct 18, 2008, 12:21:15 PM

> Wouldn't it be easier to pass input arguments on the stack frame,
> since they are easily addressed by Y?

That's just a C compiler design issue!

Stephen