I was very sceptical, but decided to do an experiment. In iForth32 there
are no free registers for A and B, and in iForth64 I did not want to invest
the time for a full implementation. Therefore I decided to emulate A and B
with USERs. Given the implementation of USER in iForth, this should produce
representative results. Also, USER variables do not create problems
with task switching or with FOREIGN, SYSCALL and CALLBACK words.
The results are surprising. For the simple read-process-write loop,
iForth64 is always about 1.5 times faster, iForth32 2 times. As shown
by PROCESS-STACK2, the success is not simply due to the use of @+ and
!+ instructions.
For iForth32 on the Intel CPU/Windows a strange and surprising
dependency on the size of the memory buffers was noted. I have no idea
what causes it.
-marcel
-- ---------------------------------------------------------------------------
( *
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Do A/B registers bring something?
* CATEGORY : Experiment
* AUTHOR : Marcel Hendrix
* LAST CHANGE : October 12, 2008, Marcel Hendrix
* )
0 [IF]
Try: TEST-A/B ( iForth64, AMD X2 3 GHz, Linux )
Testing with size = 256
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.153 seconds elapsed.
process (A/B) : 0.107 seconds elapsed. ok
FORTH> test-a/b
process (direct) : 0.148 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.164 seconds elapsed.
process (stack1) : 0.205 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.094 seconds elapsed. ok
FORTH> test-A/B ( iForth32, Intel PIV 3 GHz, Windows )
Testing with size = 65536
process (direct) : 0.408 seconds elapsed.
process (locals) : 0.555 seconds elapsed.
process (params) : 0.425 seconds elapsed.
process (stack1) : 0.403 seconds elapsed.
process (stack2) : 0.432 seconds elapsed.
process (A/B) : 0.227 seconds elapsed. ok
Testing with size = 256
process (direct) : 0.223 seconds elapsed.
process (locals) : 0.405 seconds elapsed.
process (params) : 0.230 seconds elapsed.
process (stack1) : 0.221 seconds elapsed.
process (stack2) : 0.175 seconds elapsed.
process (A/B) : 0.228 seconds elapsed. ok
FORTH> test-a/b ( iForth32, AMD X2 3 GHz, Linux )
Testing with size = 65536
process (direct) : 0.146 seconds elapsed.
process (locals) : 0.200 seconds elapsed.
process (params) : 0.172 seconds elapsed.
process (stack1) : 0.188 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.121 seconds elapsed. ok
FORTH> test-a/b
Testing with size = 256
process (direct) : 0.145 seconds elapsed.
process (locals) : 0.201 seconds elapsed.
process (params) : 0.171 seconds elapsed.
process (stack1) : 0.188 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.121 seconds elapsed. ok
[THEN]
\ : !+ ( n addr -- addr+ ) TUCK ! CELL+ ;
\ : @+ ( addr -- addr+ n ) DUP CELL+ SWAP @ ;
USER _A : A@ _A @ ; : A! _A ! ; : A@+ _A @+ SWAP _A ! ; : A!+ _A !+ _A ! ;
USER _B : B@ _B @ ; : B! _B ! ; : B@+ _B @+ SWAP _B ! ; : B!+ _B !+ _B ! ;
0 [IF] #256
[ELSE] #65536
[THEN] CONSTANT size
CREATE area1 PRIVATE size CELLS ALLOT
CREATE area2 PRIVATE size CELLS ALLOT
: PROCESS-DIRECT ( -- )
#256 0 DO area1 I CELLS + @
7 * #33 + #6996 MAX
area2 I CELLS + !
LOOP ;
: PROCESS-LOCALS ( -- )
area1 area2 LOCALS| B@ A@ |
#256 0 DO A@ I CELLS + @
7 * #33 + #6996 MAX
B@ I CELLS + !
LOOP ;
: PROCESS-PARAMS ( -- )
area1 area2 PARAMS| A@ B@ |
#256 0 DO A@ I CELLS + @
7 * #33 + #6996 MAX
B@ I CELLS + !
LOOP ;
: PROCESS-STACK1 ( -- )
area1 area2
#256 0 DO OVER I CELLS + @
7 * #33 + #6996 MAX
OVER I CELLS + !
LOOP
2DROP ;
: PROCESS-STACK2 ( -- )
area1 area2
#256 0 DO SWAP @+
7 * #33 + #6996 MAX
ROT !+
LOOP
2DROP ;
: PROCESS-A/B ( -- )
area1 A!
area2 B!
#256 0 DO A@+
7 * #33 + #6996 MAX
B!+
LOOP ;
: TEST-A/B ( -- )
#100000 LOCAL #times
CR ." Testing with size = " size .
CR ." process (direct) : " TIMER-RESET #times 0 ?DO PROCESS-DIRECT LOOP .ELAPSED
CR ." process (locals) : " TIMER-RESET #times 0 ?DO PROCESS-LOCALS LOOP .ELAPSED
CR ." process (params) : " TIMER-RESET #times 0 ?DO PROCESS-PARAMS LOOP .ELAPSED
CR ." process (stack1) : " TIMER-RESET #times 0 ?DO PROCESS-STACK1 LOOP .ELAPSED
CR ." process (stack2) : " TIMER-RESET #times 0 ?DO PROCESS-STACK2 LOOP .ELAPSED
CR ." process (A/B) : " TIMER-RESET #times 0 ?DO PROCESS-A/B LOOP .ELAPSED ;
: .ABOUT CR ." Try: TEST-A/B" ;
.ABOUT
I was wondering whether the implementations of !+ and @+ in your system are
colon or code definitions. (I'm guessing code, because that would
explain them getting 2nd place.)
-Roger
The !+ and @+ are compiler macros. The A@+ and B!+ are colon definitions,
but the iForth compiler treats colon definitions differently from what you
may be used to. (The final generated code for !+ and B!+ etc. is roughly
equal.) The code for process-stack2 has 8 push/pop instructions, the code
for process-a/b only 6 (iForth64).
-marcel
This looks wrong. Shouldn't it be
: A@+ _A @ @+ swap _A ! ;
and the same with A!+ (and B..)? And shouldn't the process-xxx words
use "size 0 DO"?
After fixing that, bigFORTH gives the following results (Phenom 2.5GHz,
user-variable used):
Testing with size = 256
process (direct) : 0,274289 sec
process (locals) : 0,296699 sec
process (params) : 0,297403 sec
process (stack1) : 0,274833 sec
process (stack2) : 0,217762 sec
process (A/B) : 0,260369 sec
Testing with size = 65536
process (direct) : 0,275094 sec
process (locals) : 0,307683 sec
process (params) : 0,307260 sec
process (stack1) : 0,288170 sec
process (stack2) : 0,236329 sec
process (A/B) : 0,274088 sec
You probably should calculate something meaningful so that simple mistakes
don't happen. I'm not sure what the difference between params and locals is
in iForth; since there are no params in bigFORTH, I just used locals there,
too (explains the almost identical timing). I then moved params to be the
first test, to exclude problems from PowerNow (but the Phenom is not as
sluggish as the Athlon64 here). The "direct" run became a bit faster by
doing so, even though the params/locals didn't show any difference.
To speed this up, I did the following: bigFORTH already reserves one
register for the object pointer - let that be A. B will be defined as a user
variable, but the access words written as code macros (that's reasonable
for a system that doesn't have an analytical compiler). Results:
Phenom 2.5GHz, Linux:
Testing with size = 65536
process (params) : 0,296900 sec
process (direct) : 0,251538 sec
process (locals) : 0,294768 sec
process (stack1) : 0,275718 sec
process (stack2) : 0,220397 sec
process (A/B) : 0,145040 sec
Testing with size = 256
process (params) : 0,301286 sec
process (direct) : 0,246388 sec
process (locals) : 0,297584 sec
process (stack1) : 0,273907 sec
process (stack2) : 0,218305 sec
process (A/B) : 0,146147 sec
Now this shows an actual benefit, too. I'm almost competitive with iForth32;
fix the bug, and I'm better.
File for bigFORTH:
-----------------------------ab-test.fs-------------------------------
Code A@ ax push op ax mov Next end-code macro :ax 0 T&P
Code A! ax op mov ax pop Next end-code macro 0 :ax T&P
Code A@+ ax push op ) ax mov cell # op add next end-code macro :ax 0 T&P
Code A!+ ax op ) mov cell # op add ax pop next end-code macro 0 :ax T&P
USER _B
Code B@ ax push user' _B UP D) ax mov Next end-code macro :ax 0 T&P
Code B! ax user' _B UP D) mov ax pop Next end-code macro 0 :ax T&P
Code B@+ ax push user' _B UP D) dx mov dx ) ax mov
cell # dx add dx user' _B UP D) mov next end-code macro :ax 0 T&P
Code B!+ user' _B UP D) dx mov ax dx ) mov
cell # dx add dx user' _B UP D) mov ax pop next end-code macro :ax 0 T&P
1 [IF] #256
[ELSE] #65536
[THEN] CONSTANT size
CREATE area1 size CELLS ALLOT
CREATE area2 size CELLS ALLOT
: PROCESS-DIRECT ( -- )
size 0 DO area1 I CELLS + @
7 * #33 + #6996 MAX
area2 I CELLS + !
LOOP ;
: PROCESS-LOCALS ( -- )
area1 area2 LOCALS| B@ A@ |
size 0 DO A@ I CELLS + @
7 * #33 + #6996 MAX
B@ I CELLS + !
LOOP ;
: PROCESS-PARAMS ( -- )
area1 area2 { A@ B@ }
size 0 DO A@ I CELLS + @
7 * #33 + #6996 MAX
B@ I CELLS + !
LOOP ;
: PROCESS-STACK1 ( -- )
area1 area2
size 0 DO OVER I CELLS + @
7 * #33 + #6996 MAX
OVER I CELLS + !
LOOP
2DROP ;
: PROCESS-STACK2 ( -- )
area1 area2
size 0 DO SWAP @+ swap
7 * #33 + #6996 MAX
ROT !+
LOOP
2DROP ;
: PROCESS-A/B ( -- )
area1 A!
area2 B!
size 0 DO A@+
7 * #33 + #6996 MAX
B!+
LOOP ;
: TEST-A/B ( -- )
#25600000 size / { #times }
CR ." Testing with size = " size .
CR ." process (params) : " !time #times 0 ?DO PROCESS-PARAMS
LOOP .time
CR ." process (direct) : " !time #times 0 ?DO PROCESS-DIRECT
LOOP .time
CR ." process (locals) : " !time #times 0 ?DO PROCESS-LOCALS
LOOP .time
CR ." process (stack1) : " !time #times 0 ?DO PROCESS-STACK1
LOOP .time
CR ." process (stack2) : " !time #times 0 ?DO PROCESS-STACK2
LOOP .time
CR ." process (A/B) : " !time #times 0 ?DO PROCESS-A/B
LOOP .time ;
: .ABOUT CR ." Try: TEST-A/B" ;
.ABOUT
-----------------------------ab-test.fs-------------------------------
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
> Marcel Hendrix wrote:
>> \ : !+ ( n addr -- addr+ ) TUCK ! CELL+ ;
>> \ : @+ ( addr -- addr+ n ) DUP CELL+ SWAP @ ;
>> USER _A : A@ _A @ ; : A! _A ! ; : A@+ _A @+ SWAP _A ! ; : A!+ _A !+ _A ! ;
>> USER _B : B@ _B @ ; : B! _B ! ; : B@+ _B @+ SWAP _B ! ; : B!+ _B !+ _B ! ;
> This looks wrong. Shouldn't
> : A@+ _A @ @+ swap _A ! ;
> and the same with A!+ (and B..)?
Blush -- Yes:
USER _A : A@ _A @ ; : A! _A ! ; : A@+ A@ @+ SWAP A! ; : A!+ A@ !+ A! ;
USER _B : B@ _B @ ; : B! _B ! ; : B@+ B@ @+ SWAP B! ; : B!+ B@ !+ B! ;
I hope it is correct now. It makes the process-a/b advantage (nearly) evaporate.
> And shouldn't the process-xxx words
> use "size 0 DO"?
I don't see why? You mean because #times is effectively a constant?
> After fixing that, bigFORTH gives the following results (Phenom 2.5GHz,
> user-variable used):
[..]
> You probably should calculate something meaningful so that simple mistakes
> don't happen. I'm not sure what the difference between params and locals is
> in iForth; since there are no params in bigFORTH, I just used locals there,
> too (explains the almost identical timing).
[..]
One can take the address of a LOCAL, but not of PARAM, so a PARAM can be
a register. And the stack order is reversed to be ready for Forth20xx.
> To speed this up, I did the following: bigFORTH already reserves one
> register for the object pointer - let that be A. B will be defined as user
> variable, but the access words written as code macros (that's reasonable
> for a system that doesn't have an analytical compiler). Results:
[..]
> Testing with size = 256
> process (params) : 0,301286 sec
> process (direct) : 0,246388 sec
> process (locals) : 0,297584 sec
> process (stack1) : 0,273907 sec
> process (stack2) : 0,218305 sec
> process (A/B) : 0,146147 sec
> Now this shows an actual benefit, too. I'm almost competitive with iForth32;
> fix the bug, and I'm better.
It is well-known that hand-assembly is supposed to speed up Forth 10-fold or more :-)
What is the difference between a Phenom 2,5G and an X2 3G?
-marcel
-- -------------------------------------------------------------
Try: TEST-A/B ( iForth64, AMD X2 3 GHz, Linux )
Testing with size = 65536
process (direct) : 0.148 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.140 seconds elapsed. ok
Testing with size = 256
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.153 seconds elapsed.
process (A/B) : 0.131 seconds elapsed. ok
FORTH> test-A/B ( iForth32, Intel PIV 3 GHz, Windows )
Testing with size = 65536
process (direct) : 0.403 seconds elapsed.
process (locals) : 0.548 seconds elapsed.
process (params) : 0.421 seconds elapsed.
process (stack1) : 0.395 seconds elapsed.
process (stack2) : 0.422 seconds elapsed.
process (A/B) : 0.375 seconds elapsed. ok
Testing with size = 256
process (direct) : 0.222 seconds elapsed.
process (locals) : 0.400 seconds elapsed.
process (params) : 0.228 seconds elapsed.
process (stack1) : 0.220 seconds elapsed.
process (stack2) : 0.176 seconds elapsed.
process (A/B) : 0.237 seconds elapsed. ok
FORTH> test-a/b ( iForth32, AMD X2 3 GHz, Linux )
Testing with size = 65536
process (direct) : 0.146 seconds elapsed.
process (locals) : 0.200 seconds elapsed.
process (params) : 0.172 seconds elapsed.
process (stack1) : 0.188 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.139 seconds elapsed. ok
FORTH> test-a/b
Testing with size = 256
process (direct) : 0.145 seconds elapsed.
process (locals) : 0.201 seconds elapsed.
process (params) : 0.172 seconds elapsed.
process (stack1) : 0.187 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.139 seconds elapsed. ok
I thought the idea was to test small vs. large arrays (and actually access
the larger arrays, and not only the first 256 cells). In my modified code,
#times is no longer a constant.
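The scaling can be sketched in standard Forth as follows. This is only an illustration: the names TOTAL-CELLS and REPEATS are made up here, and the only figure taken from the thread is 25600000 (256 cells times 100000 runs, as used in the modified TEST-A/B), chosen so that every buffer size does roughly the same total amount of work.

```forth
DECIMAL
\ Hypothetical helper names; only the constant 25600000 comes from the thread.
25600000 CONSTANT TOTAL-CELLS          \ 256 cells * 100000 runs

\ Number of outer-loop runs so that size * runs stays near TOTAL-CELLS.
: REPEATS ( size -- n )  TOTAL-CELLS SWAP / ;

256   REPEATS .   \ prints 100000
65536 REPEATS .   \ prints 390
```

With a fixed run count and a fixed "#256 0 DO" inner loop, both buffer sizes would touch exactly the same 256 cells, so the size setting could not influence the timing through the loop itself.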
> One can take the address of a LOCAL, but not of PARAM, so a PARAM can be
> a register. And the stack order is reversed to be ready for Forth20xx.
Ok, but then one can't take the address of a local in bigFORTH (though it
lives in memory).
> It is well-known that hand-assembly is supposed to speed up Forth 10-fold
> or more :-)
It's hand assembly of something that's supposed to be a primitive (A/B).
> What is the difference between a Phenom 2,5G and an X2 3G?
Most of the time the Phenom is clock for clock identical with the Athlon64,
unless the L3 cache gives a benefit or you use SSE instructions (there it
can be up to twice as fast).
> Marcel Hendrix wrote:
[..]
>> One can take the address of a LOCAL, but not of PARAM, so a PARAM can be
>> a register. And the stack order is reversed to be ready for Forth20xx.
> Ok, but then one can't take the address of a local in bigFORTH (though it
> lives in memory).
I found it useful, as for OS calls and such one can pass the address of a
local if the code wants a pointer to a variable. I regret this choice now,
as it prevents the compiler from doing a few optimizations.
>> It is well-known that hand-assembly is supposed to speed up Forth 10-fold
>> or more :-)
> It's hand assembly of something that's supposed to be a primitive (A/B).
In that case, below is how I would do it. Only for iForth64
(it's non-portable code). The advantage is back.
>> What is the difference between a Phenom 2,5G and an X2 3G?
> Most of the time the Phenom is clock for clock identical with the Athlon64,
> unless the L3 cache gives a benefit or you use SSE instructions (there it
> can be up to twice as fast).
Tempting. That would mean 16 Gflops for iForth's matrix multiply primitives.
-marcel
-- ----------------------------------------------------
ALSO ASSEMBLER
: A@+ ( -- u )
0 1 IN/OUT
POSTPONE ASM{
[r15 [ _A UP - ] LITERAL +] -> rax mov,
[rax] -> rbx mov,
[rax =CELL +] -> rax lea,
rax -> [r15 [ _A UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: B!+ ( u -- )
1 0 IN/OUT
POSTPONE ASM{
[r15 [ _B UP - ] LITERAL +] -> rax mov,
rbx -> [rax] mov,
[rax =CELL +] -> rax lea,
rax -> [r15 [ _B UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
PREVIOUS
Try: TEST-A/B ( iForth64, AMD X2 3 GHz, Linux )
Testing with size = 256
Using in-line code.
process (direct) : 0.148 seconds elapsed.
process (locals) : 0.188 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.205 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (A/B) : 0.096 seconds elapsed. ok
>Last Wednesday, the German IRC channel ( #forth-ev ) had an interesting
>discussion on the usefulness of A and B registers in Forth.
...
>The results are surprising. For the simple read-process-write loop,
>iForth64 is always about 1.5 times faster, iForth32 2 times. As shown
>by PROCESS-STACK2, the success is not simply due to the use of @+ and
>!+ instructions.
See also:
http://www.ddj.com/embedded/210603608
http://www.complang.tuwien.ac.at/anton/euroforth/ef08/papers/pelc.pdf
The code example is by permission of Gary Bergstrom - it's beautiful!
I do hope that Gary will publish the details somewhere.
My view is that the A/B registers provide a degree of persistence
and the performance gain is due to the reduction in stack shuffling
at boundaries because there's less to be shuffled.
Stephen
--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
Cecil
Let's suppose you actually have locals in registers, and your optimizing
compiler really produces good code: What would the actual benefit be? For a
system like bigFORTH, which has a limited optimizer, the benefit is
obvious.
There's another point with the A/B X/Y proposal: With the register pressure
we have on the most popular architecture, I don't want to have four of them
dedicated; two is already too much ;-). My suggestion would be to have both
postincrement and offset calculation, but only two, with one of them
preferred (when you need only one, use A).
> There's another point with the A/B X/Y proposal: With the register pressure
> we have on the most popular architecture, I don't want to have four of them
> dedicated; two is already too much ;-). My suggestion would be to have both
> postincrement and offset calculation, but only two, with one of them
> preferred (when you need only one, use A).
Why not have them all? Only one of them needs to really be a register,
all others could be USERs. Everyone benefits.
Far more important is that the names A, B, X and Y are very generic
and could cause awkward clashes with existing code. And second, modern
hardware has 16 double-precision FPU registers lying around doing nothing.
(SSE2). Why not have fA, fB? They would be hugely more useful for DSP work
than iA and iB.
-marcel
But the one that is a register should have both auto-increment and index
addressing - autoincrement for DSP work, indexing for C stackframes or
object oriented extensions.
> Far more important is that the names A, B, X and Y are very generic
> and could cause awkward clashes with existing code.
Not really. Existing code that defines A, B, X, or Y is just fine. It can't
use them, but it won't harm (unless redefined warnings harm).
> And second, modern
> hardware has 16 double-precision FPU registers lying around doing nothing.
> (SSE2). Why not have fA, fB? They would be hugely more useful for DSP work
> than iA and iB.
Another, entirely different topic...
Another example lending weight to the argument that using Forth
increases the likelihood of producing incorrect, unreadable, and
unmaintainable code.
Forth code is easy for a computer to parse; it is difficult for a
human to comprehend.
What a mass of redundancy!
Why don't you use a high-level language that encourages factoring?
Ruby:
$size = 200_000
def gen_with_Array
  Array.new( $size ){ rand }
end
def gen_from_range
  (1 .. $size ).map{ rand }
end
def gen_by_appending
  a = []
  $size.times{ a << rand }
end
[
  [ 'Array', :gen_with_Array ],
  [ 'range', :gen_from_range ],
  [ 'append', :gen_by_appending ]
].each{|str,sym| t = Time.now
  send( sym )
  puts "method (#{ str }): #{ Time.now - t }"
}
--- output ---
method (Array): 0.48
method (range): 0.581
method (append): 0.661
No need for that, Forth provides everything you need, Marcel just didn't use
it. This is to quickly benchmark low-level functions.
Marcel's actual fault was that his benchmark code did not provide some
testable results, and therefore a small omission in the code resulted in
incorrect behavior - but that will happen with any language. If you don't
test, better assume it's broken.
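A minimal sketch of what a testable version could look like, in plain standard Forth. Assumptions: VARIABLEs stand in for the USER variables, the buffer is filled with made-up data (index times 10), and TRANSFORM, FILL-AREA1, CHECK and VERIFY are hypothetical names that do not appear in the original benchmark.

```forth
DECIMAL
256 CONSTANT size
CREATE area1 size CELLS ALLOT
CREATE area2 size CELLS ALLOT

VARIABLE _A  VARIABLE _B                 \ stand-ins for the USER variables
: A!  ( addr -- )  _A ! ;
: B!  ( addr -- )  _B ! ;
: A@+ ( -- x )     _A @ DUP CELL+ _A !  @ ;      \ fetch via A, then advance A
: B!+ ( x -- )     _B @ TUCK !  CELL+ _B ! ;     \ store via B, then advance B

\ The per-cell computation used by all PROCESS-xxx words in the thread.
: TRANSFORM ( n -- n' )  7 * 33 +  6996 MAX ;

\ Made-up input data: cell i holds i*10, so both outcomes of MAX occur.
: FILL-AREA1 ( -- )  size 0 DO  I 10 *  area1 I CELLS + !  LOOP ;

: PROCESS-A/B ( -- )
  area1 A!  area2 B!
  size 0 DO  A@+ TRANSFORM B!+  LOOP ;

\ True only if every cell of area2 equals TRANSFORM of the matching area1 cell.
: CHECK ( -- flag )
  TRUE
  size 0 DO
    area1 I CELLS + @ TRANSFORM
    area2 I CELLS + @ =  AND
  LOOP ;

: VERIFY ( -- )  CHECK 0= ABORT" area2 does not match TRANSFORM of area1" ;

FILL-AREA1
area2 size CELLS ERASE                   \ start from a known-wrong output
PROCESS-A/B VERIFY
```

With definitions shaped like the original ones ( _A @+ SWAP _A ! and _B !+ _B ! ), the loop should only ever touch the variables themselves and never write to area2, so a check of this kind would abort instead of reporting a timing.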
On most systems you already use X and Y - they are the registers for
the USER area and the locals frame pointer. So you are really down to
two extra registers, A and B. As Marcel has shown, keeping A and B as
variables may still improve performance.
Stephen
It was a virtual machine description. There are dozens of CPUs
with registers called A, B, X and Y! If you want to call them
rA, rB and so on in your virtual machine assembler, you're
welcome to do so. Nowhere in the wordset I described are there
words called A or B!
Stephen
But actually you don't need A and B or X and Y at the same time. You
want two registers and two different ways to access them: Either with
immediate offset or with postincrement. If I get this (two registers, two
ways of accessing them), I can have both of them in registers even on x86 -
op and up are already registers in bigFORTH.
The suggestion is that you have >X >Y and X> Y> words, which push the old
value on the stack, and within such usage, you can't access e.g. locals or
the user area, because they will use the same registers (in bigFORTH, I use
the return stack pointer for locals, frame pointers are for wimps ;-).
>On Oct 12, 4:20 am, m...@iae.nl (Marcel Hendrix) wrote:
>> : PROCESS-DIRECT ( -- )
>>         #256 0 DO  area1 I CELLS + @
>>                    7 * #33 +  #6996 MAX
>>                    area2 I CELLS + !
[..]
> What a mass of redundancy!
> Why don't you use a high-level language
> that encourages factoring?
You mean, like the monkey that coded your newsreader?
-marcel
The naming and stack diagrams for some of the X/Y operations
don't seem right. Was this a typo? Are the X/Y operations
available for apps programmers to use?
--
Are the new regs meant to be added onto the existing VM (ANS)
- or do they represent a new forth engine? Given A/B has the
potential to re-write the way forth is programmed, it may be
better to have a new standard e.g. Forth II. This would avoid
splitting ANS forth into the "haves" and "have nots", and opens
the door to other changes not possible with the current model.
In the meantime, it would be nice to see more code examples
along with properly implemented forths to test the effectiveness
of the scheme. Perhaps issuing a draft spec for the new regs
would help kick things off.
I take no responsibility for the correctness of this example. Most of it
is mindless copy and paste. Just meant as my $0.02.
-marcel
-- ----------------------------------------
(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Do X/Y registers bring something?
* CATEGORY : Experiment
* AUTHOR : Marcel Hendrix
* LAST CHANGE : Wednesday, October 15, 2008, 22:59 PM, Marcel Hendrix
*)
NEEDS -miscutil
NEEDS -assemble
REVISION -xy "--- X/Y register test Version 1.00 ---"
PRIVATES
DOC
(*
FORTH> test-x/y ( iForth32 PIV 3 GHz, Windows )
Testing with size = 256
Using in-line code.
process (direct) : 0.222 seconds elapsed.
process (locals) : 0.402 seconds elapsed.
process (params) : 0.232 seconds elapsed.
process (stack1) : 0.222 seconds elapsed.
process (stack2) : 0.175 seconds elapsed.
process (X/Y) : 0.481 seconds elapsed. ok
Testing with size = 256
Not using in-line code.
process (direct) : 0.223 seconds elapsed.
process (locals) : 0.400 seconds elapsed.
process (params) : 0.232 seconds elapsed.
process (stack1) : 0.219 seconds elapsed.
process (stack2) : 0.177 seconds elapsed.
process (X/Y) : 0.339 seconds elapsed. ok
FORTH> test-x/y ( iForth64 AMD X2 3 GHz, Linux )
Testing with size = 256
Using in-line code.
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.189 seconds elapsed.
process (params) : 0.165 seconds elapsed.
process (stack1) : 0.204 seconds elapsed.
process (stack2) : 0.154 seconds elapsed.
process (X/Y) : 0.140 seconds elapsed. ok
Testing with size = 256
Not using in-line code.
process (direct) : 0.147 seconds elapsed.
process (locals) : 0.189 seconds elapsed.
process (params) : 0.164 seconds elapsed.
process (stack1) : 0.205 seconds elapsed.
process (stack2) : 0.155 seconds elapsed.
process (X/Y) : 0.213 seconds elapsed. ok
*)
ENDDOC
FALSE =: il-code? PRIVATE
USER _X
USER _Y
il-code? [IF]
\ Inlined code
64BIT? [IF]
ALSO ASSEMBLER
: X@ ( -- u )
0 1 IN/OUT
POSTPONE ASM{
[r15 [ _X UP - ] LITERAL +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: X! ( u -- )
1 0 IN/OUT
POSTPONE ASM{
rbx -> [r15 [ _X UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []X@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
[r15 [ _X UP - ] LITERAL +] -> rax mov,
[rax +rbx *8 0 +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []X! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
[r15 [ _X UP - ] LITERAL +] -> rax mov,
rcx -> [rax +rbx *8 0 +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: Y@ ( -- u )
0 1 IN/OUT
POSTPONE ASM{
[r15 [ _Y UP - ] LITERAL +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: Y! ( u -- )
1 0 IN/OUT
POSTPONE ASM{
rbx -> [r15 [ _Y UP - ] LITERAL +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []Y@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
[r15 [ _Y UP - ] LITERAL +] -> rax mov,
[rax +rbx *8 0 +] -> rbx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []Y! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
[r15 [ _Y UP - ] LITERAL +] -> rax mov,
rcx -> [rax +rbx *8 0 +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
PREVIOUS
[ELSE]
ALSO ASSEMBLER
: X@ ( -- u )
0 1 IN/OUT
POSTPONE ASM{
ebp -> ebx mov,
[ _X UP - ] LITERAL d# -> bx mov,
[ebx] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: X! ( u -- )
1 0 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _X UP - ] LITERAL d# -> ax mov,
ebx -> [eax] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []X@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _X UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
[eax ebx*4 0 +] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []X! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _X UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
ecx -> [eax ebx*4 0 +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: Y@ ( -- u )
0 1 IN/OUT
POSTPONE ASM{
ebp -> ebx mov,
[ _Y UP - ] LITERAL d# -> bx mov,
[ebx] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: Y! ( u -- )
1 0 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _Y UP - ] LITERAL d# -> ax mov,
ebx -> [eax] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []Y@ ( ix -- u )
1 1 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _Y UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
[eax ebx*4 0 +] -> ebx mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
: []Y! ( u ix -- )
2 0 IN/OUT
POSTPONE ASM{
ebp -> eax mov,
[ _Y UP - ] LITERAL d# -> ax mov,
[eax] -> eax mov,
ecx -> [eax ebx*4 0 +] mov,
}ASM ADJUST-STACK ; IMMEDIATE COMPILE-ONLY
PREVIOUS
[THEN]
[ELSE]
: X@ _X @ ; : X! _X ! ; : []X@ CELLS X@ + @ ; : []X! CELLS X@ + ! ;
: Y@ _Y @ ; : Y! _Y ! ; : []Y@ CELLS Y@ + @ ; : []Y! CELLS Y@ + ! ;
[THEN]
( n -- ) 4 MAX #20 MIN 2^x =: size
CR .( size = ) size DEC.
CREATE area1 PRIVATE size CELLS ALLOT
CREATE area2 PRIVATE size CELLS ALLOT
: PROCESS-DIRECT ( -- )
size 0 DO area1 I CELL[] @
7 * #33 + #6996 MAX
area2 I CELL[] !
LOOP ; PRIVATE
: PROCESS-LOCALS ( -- )
area1 area2 LOCALS| Y@ X@ |
size 0 DO X@ I CELL[] @
7 * #33 + #6996 MAX
Y@ I CELL[] !
LOOP ; PRIVATE
: PROCESS-PARAMS ( -- )
area1 area2 PARAMS| X@ Y@ |
size 0 DO X@ I CELL[] @
7 * #33 + #6996 MAX
Y@ I CELL[] !
LOOP ; PRIVATE
: PROCESS-STACK1 ( -- )
area1 area2
size 0 DO OVER I CELL[] @
7 * #33 + #6996 MAX
OVER I CELL[] !
LOOP
2DROP ; PRIVATE
: PROCESS-STACK2 ( -- )
area1 area2
size 0 DO SWAP @+
7 * #33 + #6996 MAX
ROT !+
LOOP
2DROP ; PRIVATE
: PROCESS-X/Y ( -- )
X@ Y@
area1 X!
area2 Y!
size 0 DO I []X@
7 * #33 + #6996 MAX
I []Y!
LOOP
Y! X! ; PRIVATE
: TEST-X/Y ( -- )
#25600000 size / LOCAL #times
CR ." Testing with size = " size DEC.
CR il-code? IF ." Using " ELSE ." Not using " ENDIF ." in-line code."
CR ." process (direct) : " TIMER-RESET #times 0 DO PROCESS-DIRECT LOOP .ELAPSED
CR ." process (locals) : " TIMER-RESET #times 0 DO PROCESS-LOCALS LOOP .ELAPSED
CR ." process (params) : " TIMER-RESET #times 0 DO PROCESS-PARAMS LOOP .ELAPSED
CR ." process (stack1) : " TIMER-RESET #times 0 DO PROCESS-STACK1 LOOP .ELAPSED
CR ." process (stack2) : " TIMER-RESET #times 0 DO PROCESS-STACK2 LOOP .ELAPSED
CR ." process (X/Y) : " TIMER-RESET #times 0 DO PROCESS-X/Y LOOP .ELAPSED ;
:ABOUT CR ." Try: TEST-X/Y" ;
.ABOUT -xy CR
DEPRIVE
(* End of Source *)
I find this interesting and I am always interested in testing new
developments, but looking closely, some more specification is needed.
First. What is the scope of the registers?
If X is the user pointer it is global (per thread).
Y, being a frame pointer, should be local.
But what about A and B? Are they global or local?
If global, should they be saved at a function call?
Who does the saving, caller or callee?
The example treats them as global and uses them to pass arguments.
(but does not save and restore the old state)
Second. If this is to be used there needs to be a standard wordset.
Already yours and Marcel's differ ( >A vs A!). I like yours better.
A side note. You write that the stack frame makes C calling more
efficient.
How is this achieved?
I call Windows syscalls on x86; there a stack frame is not needed
(same for Linux syscalls).
Regards
Peter Fälth
> The code example is by permission of Gary Bergstrom - it's beautiful!
> I do hope that Gary will publish the details somewhere.
>
> My view is that the A/B registers provide a degree of persistence
> and the performance gain is due to the reduction in stack shuffling
> at boundaries because there's less to be shuffled.
>
> Stephen
>
> --
> Stephen Pelc, stephen...@mpeforth.com
It's a version of the Forth VM. At present, most Forths keep the
USER pointer in a register, so let's use X for that. X is a global
CPU register like the A and B registers. It's the Forth implementation's
responsibility to save and restore them as required - just the same
as for systems that cache TOS in a CPU register.
> First. What is the scope of the registers?
Global
> If X is the user pointer it is global (per thread)
> Y being a frame pointer should be local
No, the tasker is responsible for state save/restore.
> But what about A and B? Are they global or local?
Global.
> If global, should they be saved at a function call?
> Who does the saving, caller or callee?
Implementation issues!
> The example treats them as global and uses them to pass arguments.
> (but does not save and restore the old state)
Deliberately so. A and B should probably be saved/restored
by the tasker.
> Second. If this is to be used there needs to be a standard wordset.
> already yours and Marcel's differ ( >A vs A!). I like yours better.
>
> A side note. You write that the stack frame makes C calling more
> efficient. How is this achieved?
The point is not about calling C from Forth, but to make a two-stack
VM capable of executing compiled C well.
Stephen
Should C's parameter stack be a separate frame stack or can it be the
return stack? I have a VM that uses A as the memory address for @/!.
It seems to me that an instruction with the effect "A=RP+literal"
would be useful for supporting C.
-Brad
The point of the X and Y registers is to provide fetch/store opcodes
that support (X/Y+lit) directly. AFAIR profiling in the OTA machine,
which had a C compiler, indicated that these were a good trade off
for code density. Base+offset addressing is also needed for Forth
USER variable access and locals.
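To make the base+offset idea concrete, here is a rough sketch in standard Forth. The word names Y+@ and Y+! are invented for this illustration and are not taken from the proposal; a VARIABLE stands in for the Y register, and the offset is passed on the stack rather than encoded as a literal inside an opcode.

```forth
DECIMAL
VARIABLE _Y                          \ stands in for the Y register
: Y!   ( addr -- )      _Y ! ;       \ set the base address
: Y+@  ( offset -- x )  _Y @ + @ ;   \ fetch from base + offset (hypothetical name)
: Y+!  ( x offset -- )  _Y @ + ! ;   \ store to base + offset (hypothetical name)

CREATE frame 3 CELLS ALLOT           \ a three-cell "stack frame"
frame Y!
11 0 CELLS Y+!   22 1 CELLS Y+!   33 2 CELLS Y+!
1 CELLS Y+@ .                        \ prints 22
```

The same two access words would serve for USER variables (base = user area) and for locals or a C frame (base = frame pointer); in a VM with dedicated opcodes the offset would presumably be an immediate operand.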
My preference would be to use Y as the C frame pointer, and to pass
input arguments on the data stack. They can always be popped into
the frame if required. Similarly, return values would appear on the
data stack.
Stephen
Is X the same as UP?
> My preference would be to use Y as the C frame pointer, and to pass
> input arguments on the data stack. They can always be popped into
> the frame if required. Similarly, return values would appear on the
> data stack.
Wouldn't it be easier to pass input arguments on the stack frame,
since they are easily addressed by Y?
-Brad
That's just a C compiler design issue!
Stephen