
Stack Sizes


Walter Banks   Oct 3, 2016, 8:22:29 AM
I am currently working on a stack processor ISA.

What better place to ask about stacks?

How many? A single stack, or two or three?
What depth?
Why?


w..

foxaudio...@gmail.com   Oct 3, 2016, 10:00:53 AM
In this forum you could get more questions than answers I believe.

For the record, the Forth VM has 2 stacks: a parameter stack and a return
stack. If you are planning to program this machine in Forth you really
don't need more than that; however, there are people who prefer to add
stacks for other purposes, for example floating-point math.

The thing that fascinates people new to Forth is that stack depth
actually tends to be smaller (in general) in Forth programs than in more
conventional languages, because the programmer uses the parameter stack
much the way an assembly programmer uses registers. The practical
consequence is that Forth hardware can get away with as few as 18 stack
elements for each stack.
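
(As an illustration of that register-like style, not from the original
post: a second-order polynomial evaluated entirely on the two stacks in
standard Forth. The data stack never gets deeper than the four inputs,
and the return stack holds only two extra cells.)

---8<---
: poly2 ( x a b c -- a*x*x+b*x+c )   \ Horner form: (a*x + b)*x + c
   >r >r          \ park b and c on the return stack
   over *         \ a*x
   r> +           \ a*x + b
   *              \ (a*x + b)*x
   r> + ;         \ ... + c
3 2 5 7 poly2 .   \ prints 40
---8<---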

You can study some of the current Forth style stack machines like B16 or J1 to
see what others have done.

Now, a small stack will fail if you write recursive routines, but for
everyday problem solving in Forth it is manageable.
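
(A quick, hypothetical illustration of that failure mode in standard
Forth: the textbook recursive Fibonacci pushes one return address and a
cell or two of data per level, so both stacks grow roughly linearly with
n, and a 16- or 32-element hardware stack overflows quickly.)

---8<---
: fib ( n -- fib[n] )
   dup 2 < if exit then       \ fib(0)=0  fib(1)=1
   dup  1- recurse            \ fib(n-1), original n still below it
   swap 2 - recurse + ;       \ + fib(n-2)
25 fib .                      \ roughly 25 nested calls on the return stack
---8<---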

So the questions:

What language will you use for this machine?

Will you need recursion?

If it's only for Algol-style languages, 1 stack and a base-pointer
register for stack frames works just fine IMHO.

For Forth, 2 very fast hardware stacks of modest size get you what you
need.

YMMV

BF

rickman   Oct 3, 2016, 10:21:45 AM
Yes.

There are many answers to your question, and they all relate to what you
want to do with the processor. It has been stated many times that for
most apps an 8-element stack is enough. But if you wish to support
recursion, a million elements may not be enough.

Some designs use one stack, others use as many as four. Again, it all
depends.

--

Rick C

Walter Banks   Oct 3, 2016, 11:06:59 AM
I have a couple of metrics that are useful. We ran several hundred actual
applications through an analysis package we wrote and found the maximum
stack depth in the real applications to be 13 (in about a half dozen of
the applications).

We did a recursion study of many applications with an instrumented
compiler. In the same group of programs, and a whole bunch more, we found
that recursion in real code was needed very rarely. Most code that is
written with recursion, either through a chain of functions or a
recursive function, can be optimized by a reasonable compiler so that it
does not need recursion.

I don't have any specific metrics but I haven't seen any expressions
that have required more than 4 levels of data stack to process.
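
(A worked illustration, not from the original post: a*b + c*d evaluated
purely on a data stack peaks at three cells, which is typical of why
expressions rarely need more than 4 levels.)

    a            depth 1
    a b          depth 2
    a*b          depth 1
    a*b c        depth 2
    a*b c d      depth 3
    a*b c*d      depth 2
    a*b+c*d      depth 1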

I guess that it depends on what?

w..

Walter Banks   Oct 3, 2016, 11:14:51 AM
I have looked at what others have done but have very little information
on how they made their choices. For example, does anyone have any
regression tests that illustrate the issues of stacks?

I guess the question is what counts as a modest hardware size. How big is
needed to accommodate most applications, given that a user or compiler
makes a reasonable attempt to write good code?

Most massively recursive pathological cases can generally be optimized
away, for example.

w..

Cecil Bayona   Oct 3, 2016, 11:35:23 AM
C. H. Ting's ep16 Forth CPU is available from forth.org; it contains VHDL
source, a custom version of eForth that is resident, and documentation
on both the VHDL code and the eForth code. The special eForth version
is easily modifiable to work for other CPUs.
--
Cecil - k5nwa

Bernd Linsel   Oct 3, 2016, 12:20:13 PM
On 10/3/2016 5:14 PM, Walter Banks wrote:
>
> I have looked at what others have done but have very little information
> on how they made their choices. For example, does anyone have any
> regression tests that illustrates the issues of stacks?
>
> I guess what is modest hardware size. How big is needed to accommodate most
> applications given that a user or compiler makes a reasonable attempt to
> write good code.
>
> Most massively recursion pathological cases can generally be optimized
> away for example.

Perhaps you should read P. Koopman's excellent book on stack computers.
It is available online at
https://users.ece.cmu.edu/~koopman/stack_computers/ .

Regards,
Bernd

Cecil Bayona   Oct 3, 2016, 12:30:29 PM
The ep16 and ep32 from C. H. Ting use 32 levels for each of the stacks,
more than enough without taxing the resources. I'm in the process of
designing my own CPU; so far it looks like I will be using 256 levels for
each stack. I want it to handle a modest level of recursion.

--
Cecil - k5nwa

Anton Ertl   Oct 3, 2016, 12:40:31 PM
For Forth, a data stack with 32 cells will still occasionally overflow
on naturally recursive tasks. If you can work around that (either the
programmer, the compiler, or the run-time system avoids or deals with
the stack overflow), then 32 cells is plenty, 16 cells won't cause
many overflows, and Chuck Moore decided to go with 6 in IIRC the C18;
I'm not sure what the number is on the GA parts.

For Forth, you want a data stack and a return stack; if you have FP,
also an FP stack. For locals you may be able to make do with the
return stack, although alignment concerns are an issue if you support
FP locals; in that case, possibly a separate locals stack.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2016: http://www.euroforth.org/ef16/

rickman   Oct 3, 2016, 3:22:54 PM
On 10/3/2016 11:06 AM, Walter Banks wrote:
> On 2016-10-03 10:21 AM, rickman wrote:
>> On 10/3/2016 8:22 AM, Walter Banks wrote:
>>> I am currently working on a stack processor ISA.
>>>
>>> What better place to ask about stacks.
>>>
>>> How many? single stack or two or three. What depth? Why?
>>
>> Yes.
>>
>> There are many answers to your question and they all related to what
>> you want to do with the processor. It has been stated many times
>> that for most apps, an 8 element stack is enough. But if you wish to
>> support recursion a million elements may not be enough.
>>
>> Some designs use one stack, others use as many as four. Again, it
>> all depends.
>>
>
> I have a couple metrics that are useful. We ran several hundred actual
> applications through an analysis package we wrote and found the maximum
> stack depth in the real applications to be 13 (In about a half dozen of
> the applications)

You never said what language you are using. Should I assume Forth? I
know in C programs the stack usage can be many, many kilobytes if not
megabytes.


> We did a recursion study of many applications with an instrumented
> compiler. In the same group of programs and a whole bunch more we found
> that recursion in real code was needed very rarely. Most code that is
> written with recursion either through a chain of functions or a
> recursive function can be optimized by a reasonable compiler to not need
> recursion.

I've never met a compiler I thought was truly reasonable.


> I don't have any specific metrics but I haven't seen any expressions
> that have required more than 4 levels of data stack to process.
>
> It guess that it depends on what?

Yes, it definitely depends on "what".

--

Rick C

rickman   Oct 3, 2016, 3:27:56 PM
I used 256-word stacks in my CPU design, but this makes the stack
pointers longer than desirable and slowed the design. Actually, it was
the flags on the stack addresses that were the long pole in the tent,
but they had to go through the entire address carry chain and so were
limited by the stack pointer length. I probably should have tossed the
flags.

I used 256-word stacks because they shared a single dual-port block RAM
of 512 words. If I go back to this design I will likely use a larger
hunk of memory and do the context-rotation thing to run multiple virtual
machines, where the stack RAM holds N smaller stacks for N virtual CPUs.
RAM is cheap on FPGAs, usually.

--

Rick C

Chris Curl   Oct 3, 2016, 4:53:45 PM
This is good information ... in my little implementation, I am (was) using
1K (256 values ... it is a 32-bit implementation) for each stack. It looks
like I can back that off considerably. I am going to change them to be 64
cells each. From the sound of it, even 64 is probably overkill.

Re the original question, assuming they are easy to configure, why not
start with sizes that seem reasonable and tweak them as issues arise?

I imagine that is what you are doing ... looking for insight into what
reasonable starting sizes would be, from people who have "been there and
done that" ...

Cheers, Chris

Walter Banks   Oct 3, 2016, 5:54:30 PM
This is a commercial processor for which I am trying to replace
cut-and-try with design. It is pretty easy at this point to change the
stack size(s), but various choices have implementation ramifications. The
underlying processor core ISA has been running since late 2013.

w..

Walter Banks   Oct 3, 2016, 6:04:24 PM
On 2016-10-03 3:22 PM, rickman wrote:
> On 10/3/2016 11:06 AM, Walter Banks wrote:
>> On 2016-10-03 10:21 AM, rickman wrote:
>>> On 10/3/2016 8:22 AM, Walter Banks wrote:
>>>> I am currently working on a stack processor ISA.
>>>>
>>>> What better place to ask about stacks.
>>>>
>>>> How many? single stack or two or three. What depth? Why?
>>>
>>> Yes.
>>>
>>> There are many answers to your question and they all related to
>>> what you want to do with the processor. It has been stated many
>>> times that for most apps, an 8 element stack is enough. But if
>>> you wish to support recursion a million elements may not be
>>> enough.
>>>
>>> Some designs use one stack, others use as many as four. Again,
>>> it all depends.
>>>
>>
>> I have a couple metrics that are useful. We ran several hundred
>> actual applications through an analysis package we wrote and found
>> the maximum stack depth in the real applications to be 13 (In
>> about a half dozen of the applications)
>
> You never said what language you are using. Should I assume Forth? I
> know in C programs the stack usage can be many, many kilobytes if not
> megabytes.

I haven't said what language for a variety of reasons. C, for example, in
many implementations doesn't always have a data stack to work with, and
doesn't always use one when it does.

Forth is in some ways a special case in terms of language
implementation. The ISA I am working on has some similarities to some
Forth ISAs.


>> We did a recursion study of many applications with an instrumented
>> compiler. In the same group of programs and a whole bunch more we
>> found that recursion in real code was needed very rarely. Most
>> code that is written with recursion either through a chain of
>> functions or a recursive function can be optimized by a reasonable
>> compiler to not need recursion.
>
> I've never met a compiler I thought was truly reasonable.

:) Probably the best argument for compiled code is that compilers get to
start over every time they are called to make code-generation choices,
while hand code evolves, living with some historical artifacts over
time.

w..

Walter Banks   Oct 3, 2016, 6:12:12 PM
Both are good references. I have read Koopman's book and have some
other Ting material. I will check out your Ting link in another post.

One more question: have there been single-stack Forth implementations?
What are the issues?

w..

Cecil Bayona   Oct 3, 2016, 6:52:33 PM
Others here may know of some, but Forth basically needs two stacks as a
minimum; when implementing Forth on CPU chips that have only one stack,
the other stack is simulated using a dedicated register.

A single stack has issues in that subroutine call information gets in
the way of data items on the stack; that is why two stacks are needed as
a minimum. Early stack computers had only one stack, and that didn't
work out well.
--
Cecil - k5nwa

Paul Rubin   Oct 3, 2016, 7:06:25 PM
Walter Banks <wal...@bytecraft.com> writes:
> Both are good references, I have read Koopman's book and have a some
> other Ting material. I will check out your Ting link in another post.

You might also look at Eric LaForest's thesis "Second-Generation Stack
Computer Architecture":

http://uwaterloo.ca/independent-studies/sites/ca.independent-studies/files/uploads/files/eric_laforest_thesis.pdf

Paul Rubin   Oct 3, 2016, 7:28:29 PM
Walter Banks <wal...@bytecraft.com> writes:
> I guess what is modest hardware size. How big is needed to accommodate
> most applications given that a user or compiler makes a reasonable
> attempt to write good code.

What do you mean by "most applications"? Applications that run on
workstations, mobile phones, internet servers etc. usually consume far
more resources than the applications that run on stack processors from
the Forth world. Large Forth applications (there are a few) usually run
on large machines (PC's etc.) so maybe aren't relevant to your question.

Forth programs on stack machines or microcontrollers generally avoid
complicated control flow and tend to use between a few and some dozens
of stack cells, plus up to a few K of memory cells. The GA144 cpus iirc
have 10 data stack levels (T and S registers plus an 8-level circular
buffer) and 9 return stack levels (R register plus 8-level circular
buffer) but they are very small cpus. The Novix used off-chip stack
memory with 256 elements for each stack, plus a few on-chip registers
for the top levels. The J1 appears to have 32 levels for each stack:
http://excamera.com/sphinx/fpga-j1.html

If you can say the application area your processor is targeted at, you
might get better advice. If it's a general purpose processor that has
to run mainstream software, you probably want a conventional
architecture.

The main reason other than recreation I can think of to make a stack CPU
is you want some tiny sequencer in a little corner of a bigger FPGA
application. In that case you probably want to size the CPU to handle
the surrounding application's specific requirements. In the designs
I've seen, you can choose the stack size by changing a couple lines of
Verilog.

Rod Pemberton   Oct 3, 2016, 11:40:21 PM
On Mon, 3 Oct 2016 08:22:26 -0400
Walter Banks <wal...@bytecraft.com> wrote:

> I am currently working on a stack processor ISA.
>
> What better place to ask about stacks.

Just over a year ago we had a discussion on stack size here.
(See "Metrics on stack usage" on c.l.f. by Anton Ertl)

> How many? single stack or two or three.

You'll need one stack for C or two stacks for Forth.

> What depth?

You'll need about 32 for Forth at a minimum.

> Why?

Philip Koopman did a simulation of register spills versus stack size in
1990 for Forth. "The right answer, then, to how big stacks should be is
16 or 32 elements, no more" said Philip Koopman in "Design Tradeoffs in
Stack Computers," Forth Dimensions, Volume 11 Number 6, March/April
1990.

http://users.ece.cmu.edu/~koopman/forth/4thdim_11_6.pdf


Rod Pemberton

hughag...@gmail.com   Oct 4, 2016, 12:47:44 AM
On Monday, October 3, 2016 at 5:22:29 AM UTC-7, Walter Banks wrote:
> I am currently working on a stack processor ISA.

What is the point of this project?

If you want to program in Forth, you could buy RACE chips from Testra and use my MFX that I wrote in 1994 --- or use their on-chip Forth compiler that they wrote in MFX --- if you are going to delve into writing your own primitives in assembly-language, then you need MFX, but the on-chip Forth is easier to use because it is interactive (MFX is a cross-compiler).

But, you don't want to program in Forth! My understanding from your emails to me is that you consider Forth to be an ugly language, and you are well aware that it is very difficult to hire Forth programmers who are any good (there are only a handful on the whole planet) --- your only interest in a stack-processor is so it can be a target for your C compiler that you wrote years ago (there are millions of C programmers on the planet). You think that a stack-processor will use fewer FPGA resources than a register-based processor --- but, you already tried this experiment and it was a failure.

So, once again, what is the point? Why are you still interested in this stuff?

Your company has a huge investment in the MC6808 processor. I already told you how this design could be tweaked slightly so that it would make a good target for Forth, but your C compiler would still generate code for it with only a minor modification. A hybrid C/Forth processor! My expectation was that new programs could be written in Forth (because Forth is interactive and easier to program in), and they could call C functions that have already been written, so there would be a smooth transition from C to Forth. You were totally uninterested. You were worried that the minor modification to your C compiler would break existing C programs (it wouldn't, unless C programs were accessing the return-address, which is highly non-standard). You also agreed with me that nobody other than myself would ever want to program the processor in Forth --- all C programmers hate Forth and will refuse to program in Forth, or will purposely write the most ugly and inefficient Forth code possible in order to make Forth look bad.

In my design, I introduced a new stack used for return-addresses, which would be an 8-bit register, but I also said that this could be made 5-bit (32 deep) if necessary. The stack used for parameters would be 16-bit --- it already is 16-bit on the MC6808 --- Forth doesn't need such a deep data-stack, but C does, and the design is supposed to support both Forth and C so it is necessary to have the big stack in order for your legacy C programs to continue to work. So, that answers your question that you are now asking yet again --- this question has already been answered thoroughly, not just by me, but by everybody --- you need two stacks for Forth (rather than the one that C needs), but they can be pretty small (16 deep for each is a minimum, although 32 deep for the return-stack is somewhat safer, and 128 deep if you are going to support local variables).

This design was the only idea I had --- it seemed like a killer idea, considering your already heavy investment in the MC6808 --- if you didn't like it though, then I don't have any other ideas for you, except perhaps to start from scratch with the Stundurd or something similar which is a pretty big undertaking.

Jan Coombs   Oct 4, 2016, 3:31:05 AM
On Mon, 3 Oct 2016 18:20:21 +0200
Bernd Linsel <b...@gmx.com> wrote:

> Perhaps you should read P. Koopman's excellent book on stack
> computers. It is online available at
> https://users.ece.cmu.edu/~koopman/stack_computers/ .

Good advice; the book is perhaps the only easily locatable
reference for this type of information, and a good read.

From section 6.4[1] it seemed to me that a depth of 12 items
for stack caches is a sweet spot. The fill/spill rate here is
mostly under 1%.

So if you fetch 4 instructions per memory cycle then fill/spill
memory access will peak at one per 25 instruction fetches.
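
(Spelling that estimate out, assuming the ~1% figure is per executed
instruction:

    0.01 spills/instruction x 4 instructions/fetch cycle
        = 0.04 spill accesses per fetch cycle
        ~ 1 spill access per 25 instruction fetches )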

A return stack cache of this size has similar memory usage.

Jan Coombs
--
[1] 6.4 STACK MANAGEMENT ISSUES
https://users.ece.cmu.edu/~koopman/stack_computers/sec6_4.html

Anton Ertl   Oct 4, 2016, 5:18:12 AM
Jan Coombs <jenfhaom...@murmic.plus.com> writes:
>On Mon, 3 Oct 2016 18:20:21 +0200
>Bernd Linsel <b...@gmx.com> wrote:
>
>> Perhaps you should read P. Koopman's excellent book on stack
>> computers. It is online available at
>> https://users.ece.cmu.edu/~koopman/stack_computers/ .
>
>Good advice, the book is perhaps the easily only locatable
>reference for this type of information, and a good read.

Another reference is my first stack caching paper [ertl95pldi]. You
can find more data in
<http://www.complang.tuwien.ac.at/misc/stack-caching-data/>.

@InProceedings{ertl95pldi,
author = "M. Anton Ertl",
title = "Stack Caching for Interpreters",
crossref = "sigplan95",
pages = "315--327",
url = "http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",
abstract = "An interpreter can spend a significant part of its
execution time on arguments of virtual machine
instructions. This paper explores two methods to
reduce this overhead for virtual stack machines by
caching top-of-stack values in (real machine)
registers. The {\em dynamic method} is based on
having, for every possible state of the cache, one
specialized version of the whole interpreter; the
execution of an instruction usually changes the
state of the cache and the next instruction is
executed in the version corresponding to the new
state. In the {\em static method} a state machine
that keeps track of the cache state is added to the
compiler. Common instructions exist in specialized
versions for several states, but it is not necessary
to have a version of every instruction for every
cache state. Stack manipulation instructions are
optimized away."
}

@Proceedings{sigplan95,
booktitle = "SIGPLAN Conference on Programming Language
Design and Implementation (PLDI'95)",
title = "SIGPLAN Conference on Programming Language
Design and Implementation (PLDI'95)",
year = "1995",
key = "PLDI '95"
}

minf...@arcor.de   Oct 4, 2016, 8:00:09 AM
That "sweet spot" stack depth of 12 is an average stack spill figure of otherwise uncomparable programs. In other words it is not very useful.

My own real world experience is that it might be sufficient for simple control programs. On the other hand for number crunching (e.g. controller algorithms involving linear system math) or Forth systems without locals, stack depths of say 32 are a bare design minimum.

Walter Banks   Oct 4, 2016, 8:49:52 AM
On 2016-10-04 3:31 AM, Jan Coombs wrote:
> On Mon, 3 Oct 2016 18:20:21 +0200
> Bernd Linsel <b...@gmx.com> wrote:
>
>> Perhaps you should read P. Koopman's excellent book on stack
>> computers. It is online available at
>> https://users.ece.cmu.edu/~koopman/stack_computers/ .
>
> Good advice, the book is perhaps the easily only locatable
> reference for this type of information, and a good read.
>
> From section 6.4[1] it seemed to me that a depth of 12 items
> for stack caches is a sweet spot. The fill/spill rate here is
> mostly under 1%.
>
> So if you fetch 4 instructions per memory cycle then fill/spill
> memory access will peak at one per 25 instruction fetches.
>
> A return stack cache of this size has similar memory usage.

These stack sizes are very close to our metrics derived from several
hundred real (shipped) applications running on a variety of embedded
systems.

w..

Walter Banks   Oct 4, 2016, 10:36:40 AM
The application area is embedded systems in general. Like most
processors in this area, it will tend to be used in specific application
areas rather than as a general-purpose processor. The application area it
is most likely to be used in is distributed control and instrumentation.
It is being implemented with some configurations having multiple
processors. The basic processor will be running at about 150 MIPS.

I have worked on quite a few of these developments over the last 30
years or so. We have prototypes running on FPGAs. A lot of the
motivation for starting this thread is to get a different perspective on
stack usage and design outside of the metrics I have for C- and asm-based
applications used in similar areas of embedded systems.

I have been making a point of staying away from implementation languages
for a very specific reason. This processor will in all likelihood be
programmed in a variant of C. As several people here have pointed out, C
requires "...", which is true when you look at generic C history. Most
certainly the open-source tools have a lineage going back to the open C
compiler tools. To make a point: mapping locals onto a stack frame on a
PDP-11 was probably a good way to implement locals on that platform, but
as I have shown in many compilers, it is not the only way. It doesn't
have to be that way. C is a language that describes a problem, and the
compiler writers (me) have to map the problem effectively onto the
target. I have targeted C compilers to embedded processors, Forth,
functional languages, and even in one case a form of state machine. We
have adapted the language to support event-driven applications. Forth has
a much larger component of implementation built into the actual
application code.

This brings me back to the actual questions. My role in the tool
development for this project is more like a Forth programmer's than a C
programmer's. The code-creation trials and tribulations of this user
group are very close to the choices I am making in writing the optimizer
for this processor's code generator.

To address one last point you made: I agree that the underlying Verilog
or VHDL implementing the ISA is flexible and easily changed. We have
changed it many times over the last year or so to understand the impact
of changes on code generation.

w..

Walter Banks   Oct 4, 2016, 10:38:59 AM
On 2016-10-04 4:46 AM, Anton Ertl wrote:
> Jan Coombs <jenfhaom...@murmic.plus.com> writes:
>> On Mon, 3 Oct 2016 18:20:21 +0200
>> Bernd Linsel <b...@gmx.com> wrote:
>>
>>> Perhaps you should read P. Koopman's excellent book on stack
>>> computers. It is online available at
>>> https://users.ece.cmu.edu/~koopman/stack_computers/ .
>>
>> Good advice, the book is perhaps the easily only locatable
>> reference for this type of information, and a good read.
>
> Another reference is my first stack caching paper [ertl95pldi]. You
> can find more data in
> <http://www.complang.tuwien.ac.at/misc/stack-caching-data/>.
>
> @InProceedings{ertl95pldi,
> author = "M. Anton Ertl",
> title = "Stack Caching for Interpreters",
> crossref = "sigplan95",
> pages = "315--327",
> url = "http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",


Thanks for the reference.

w..


Walter Banks   Oct 4, 2016, 11:22:16 AM
On 2016-10-04 12:47 AM, hughag...@gmail.com wrote:
> On Monday, October 3, 2016 at 5:22:29 AM UTC-7, Walter Banks wrote:
>> I am currently working on a stack processor ISA.
>
> What is the point of this project?
>
> If you want to program in Forth, you could buy RACE chips from Testra
> and use my MFX that I wrote in 1994 --- or use their on-chip Forth
> compiler that they wrote in MFX --- if you are going to delve into
> writing your own primitives in assembly-language, then you need MFX,
> but the on-chip Forth is easier to use because it is interactive (MFX
> is a cross-compiler).

For an implement-once, mass-distribution product, how is this important?

>
> But, you don't want to program in Forth! My understanding from your
> emails to me is that you consider Forth to be an ugly language, and
> you are well aware that it is very difficult to hire Forth
> programmers who are any good (there are only a handful on the whole
> planet) --- your only interest in a stack-processor is so it can be a
> target for your C compiler that you wrote years ago (there are
> millions of C programmers on the planet). You think that a
> stack-processor will use less FPGA resources that a register-based
> processor --- but, you already tried this experiment and it was a
> failure.

It was? That's news to me. Do the design math.

>
> So, once again, what is the point? Why are you still interested in
> this stuff?
>
Do you want to define "stuff" so I actually know what you are talking
about? What design information do you have that determines stack size?

>
> This design was the only idea I had --- it seemed like a killer idea,
> considering your already heavy investment in the MC6808 --- if you
> didn't like it though, then I don't have any other ideas for you,
> except perhaps to start from scratch with the Stundurd or something
> similar which is a pretty big undertaking.
>

There are a lot of strawman arguments in here that are untrue or
misleading. The views on Forth you attribute to me are certainly
incorrect, as is the role of the MC6808. To be clear, I will use (and
have used) any language that is appropriate to describe an application
area. My job, as always, is to translate the application into executable
code. Most of our C compilers are not written in C (all but two).

The MC6808 and the follow-on 6808-family developments (68HC08, 68S08 and
RS08) were projects we did most of the work on 35 years ago.

Since then we implemented about 30 C compilers and about a dozen other
compilers to support various other languages.

Since then I have personally been part of the design of about a dozen
other commercial ISA's in diversified application areas. Most of my work
has been in data flow modelling in ISA design.

w..

Walter Banks   Oct 4, 2016, 11:31:02 AM
Thanks for reminding me of this paper. There is some interesting
information in it. I looked at it two or three years ago looking for
ISA approaches to stack processors. The whole design process for stack
processors has matured and can now be quite a bit more bit-efficient.
Koopman's comments on stack size are backed by some real insight.

Thanks.

w..

hughag...@gmail.com   Oct 4, 2016, 2:44:50 PM
On Tuesday, October 4, 2016 at 8:22:16 AM UTC-7, Walter Banks wrote:
> On 2016-10-04 12:47 AM, hughag...@gmail.com wrote:
> > On Monday, October 3, 2016 at 5:22:29 AM UTC-7, Walter Banks wrote:
> >> I am currently working on a stack processor ISA.
> >
> > What is the point of this project?
> >
> > If you want to program in Forth, you could buy RACE chips from Testra
> > and use my MFX that I wrote in 1994 --- or use their on-chip Forth
> > compiler that they wrote in MFX --- if you are going to delve into
> > writing your own primitives in assembly-language, then you need MFX,
> > but the on-chip Forth is easier to use because it is interactive (MFX
> > is a cross-compiler).
>
> For a implement once, mass distribution how is this important?

Interactive design gets a program debugged and tested faster --- this might get you to market first --- you can't actually have mass distribution until you have a working board, otherwise you waste a truckload of money, but being second to market with a working board is also a sure way to waste a truckload of money.

After you have gotten a toe-hold on the market, you most likely will redesign your board to use a less-expensive chip --- when you do this, you will likely rewrite in assembly language because a lot of those small chips (the PIC18, 80c320, etc.) don't really support high-level languages very well.

The nice thing about the RACE (the chip previously known as MiniForth) is that you can develop your program using the interactive on-chip Forth system and get your board working with minimal sweat-equity invested. Later on you can rewrite a lot of the Forth code in assembly-language for the second version, which will be less expensive to manufacture and have higher performance, but will involve more sweat-equity.

I wrote the RACE assembler/simulator as part of MFX, and I can tell you right now that learning to use it is difficult (only one person other than myself has ever done this) --- it is an extremely low-level assembly-language (as compared to PIC18 or 80c320 assembly-language) --- this is not something that you want to delve into until you can be reasonably sure that there is money to be made.

If there is no money to be made, you might as well devote your life to playing Go --- that is also mind-blowingly difficult --- it is fun though, and you get to meet interesting people from around the world, neither of which is true of becoming the world's third RACE assembly-language programmer.

Hans Bezemer   Oct 4, 2016, 2:47:10 PM
rickman wrote:
> There are many answers to your question and they all related to what you
> want to do with the processor. It has been stated many times that for
> most apps, an 8 element stack is enough. But if you wish to support
> recursion a million elements may not be enough.
>
> Some designs use one stack, others use as many as four. Again, it all
> depends.
True. 4tH has got a single, shared return/data stack of 512 elements. I
chose that configuration because you can have programs which need
quite some return stack depth but few data elements - or the
inverse. "Stack full" means the stack pointers cross.

I've got an FP package that uses the data stack (shared) and one that
has its own stack. On top of that, I've got a small lib that allows me to
create my own stacks:

---8<---
: stack dup ! ; ( stack --)
: a@ @ @ ; ( stack -- n)
: >a 1 cells over +! @ ! ; ( n stack --)
: a> dup a@ -1 cells rot +! ; ( stack -- n)
: adepth dup @ swap - ; ( stack -- n)
---8<---
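
(A hypothetical usage sketch, not from the original post, assuming a
Forth with CREATE and ALLOT alongside the lib above:)

---8<---
create mystack  33 cells allot    \ one pointer cell plus room for 32 items
mystack stack                     \ initialise: the pointer cell points to itself
10 mystack >a   20 mystack >a     \ push two items
mystack a> .    mystack a> .      \ pops and prints 20 then 10
---8<---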

On average, the standard stack is more than sufficient - unless you want to
do something crazy like the Ackermann function (unoptimized).
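
(For the curious, a hypothetical unoptimized Ackermann in standard Forth;
even small arguments drive the recursion, and therefore both stacks, very
deep:)

---8<---
: ack ( m n -- u )
   over 0= if  nip 1+  exit then             \ ack(0,n) = n+1
   dup  0= if  drop 1- 1 recurse  exit then  \ ack(m,0) = ack(m-1,1)
   over 1-  rot rot  1- recurse  recurse ;   \ ack(m,n) = ack(m-1, ack(m,n-1))
2 3 ack .                                    \ prints 9
---8<---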

The only place I was really cramming as much as possible onto this poor
stack was the 4tH preprocessor doing nested macros. I even had to offload
the data elements to a separate stack.

Hans Bezemer

hughag...@gmail.com   Oct 4, 2016, 2:59:22 PM
When I designed the Stundurd, I was working on the assumption that memory-access is significantly slower than register-access, and hence the key to boosting performance was to hold data in registers as much as possible. I allow zero, one or two of the top data-stack items to be held in registers. All of the machine-code primitives have multiple versions, which expect 0, 1 or 2 items in registers and leave 0, 1 or 2 items in registers. The compiler has to be smart about compiling the correct versions of the primitives so they match up --- a primitive that leaves some number of top items in registers has to be followed by a primitive that expects the same number of top items in registers --- the goal is minimize how many stack-juggling primitives get compiled (primitives that move data-stack items into or out of registers, but don't do any useful work).

All in all, I thought my design was pretty cool. A lot of people (including Walter Banks) told me that my design was obsolete however. I was thinking about processors as they existed in the mid-1990s (when the MiniForth/RACE was developed), but that modern processors are different --- nowadays an FPGA will have all of its RAM inside of the chip, and hence there is no difference in speed between "memory" and "registers" which are actually the same thing.

Anton Ertl's paper (as much as I can tell from the abstract) seems to be promoting the same idea that I was working on --- why doesn't anybody tell him that this idea is obsolete and hasn't been meaningful since the mid-1990s? --- IIRC, he was one of the many people who slammed my Stundurd design as being 20 years obsolete from day one.

Walter Banks   Oct 4, 2016, 3:41:00 PM
My understanding of 4tH is that it compiles to an intermediate H-code
that can be interpreted / compiled on a variety of targets, if I have
been looking at the correct website. I am an old UCSD Pascal fan as well.

One and two stacks are what we have been doing most of our simulations
with. At one point I quickly looked at the implications of three stacks
(return, and two data); the idea was that we could code up a tree-type
expression evaluation.

What I found, and it may be the mix of regression tests we created from
applications, was that we generated slightly faster and smaller code with
a single stack than we could with two stacks.

Our three-stack solution had no advantages over the two- or single-stack
solutions.

One interesting thing that has come out of this thread is that there is
some consistency around about 12 levels needed for subroutine return
(totally consistent with our metric of 13 levels for actual applications).

Data stacks have produced a different set of comments. In part it is the
roles that data stacks are used for: local variables, expression
evaluation, and saving machine state for interrupts or event-programming
applications.

Local variables are the wild card in this. They can in some languages be
quite large.

Expression evaluation is a more manageable problem to evaluate. I had
expected that data stack requirements for Forth applications could
easily be larger than those of other languages and approaches to
expression evaluation.

I may not be looking in the right place, but essentially all the Forth
code that I have looked at had similar metrics to code created by other
approaches. 4 or 5 data stack levels in applications was about the limit
we have seen for expression evaluation. If the 8-level comment you made
above is mostly about expression evaluation, then that is close to the 4
or 5 level metrics we keep seeing in our tests.

What I haven't seen is anyone pointing to an application area in which
any particular language gains performance by using more stack space,
which has been a surprise.


w..



Walter Banks   Oct 4, 2016, 3:53:28 PM
On 2016-10-04 2:44 PM, hughag...@gmail.com wrote:
> After you have gotten a toe-hold on the market, you most likely will
> redesign your board to use a less-expensive chip --- when you do
> this, you will likely rewrite in assembly language because a lot of
> those small chips (the PIC18, 80c320, etc.) don't really support
> high-level languages very well.

I doubt very much that you can implement any application in asm, for any
compiler we have written, that I cannot code in the high-level language
with equal or smaller code size and timing.

One of the tests we do is to make sure that code fragments map onto the
ISA of the processor. This means that any asm program can be written in
the HLL and produce the same code, but compilers use the whole
instruction set and do better program- and data-flow accounting than
people, opening up opportunities for better implementation.

w..

Rod Pemberton   Oct 4, 2016, 4:15:03 PM
On Tue, 4 Oct 2016 10:36:36 -0400
Walter Banks <wal...@bytecraft.com> wrote:

> As several people here have pointed out C requires "..."

Many C compilers have limitations and poorly implemented features.
IIRC, PCC didn't implement structs correctly, and GCC doesn't implement
'register' declared variables correctly. GCC implements all sorts of
extensions to the language. TenDRA was supposedly very pedantic, but
it didn't implement tail calls. And, so on ...


Rod Pemberton

Rod Pemberton   Oct 4, 2016, 4:15:17 PM
On Tue, 4 Oct 2016 15:53:19 -0400
Walter Banks <wal...@bytecraft.com> wrote:

> On 2016-10-04 2:44 PM, hughag...@gmail.com wrote:
> > After you have gotten a toe-hold on the market, you most likely will
> > redesign your board to use a less-expensive chip --- when you do
> > this, you will likely rewrite in assembly language because a lot of
> > those small chips (the PIC18, 80c320, etc.) don't really support
> > high-level languages very well.
>
> I doubt very much that you can implement any application in asm for
> any compiler we have written that I cannot code the same application
> in the high level language in equal or smaller code and timing.
>

Not to start a flame war, but this is not a conclusion that assembly
and Forth programmers generally accept. They are generally of the
belief that they can outdo any compiler. They don't grasp how powerful
the optimization is on modern C compilers.


Rod Pemberton

rickman   Oct 4, 2016, 7:27:28 PM
On 10/4/2016 2:46 PM, Hans Bezemer wrote:
> rickman wrote:
>> There are many answers to your question and they all related to what you
>> want to do with the processor. It has been stated many times that for
>> most apps, an 8 element stack is enough. But if you wish to support
>> recursion a million elements may not be enough.
>>
>> Some designs use one stack, others use as many as four. Again, it all
>> depends.
> True. 4tH has got a single, shared return/data stack of 512 elements. I
> chose for thaty configuration because you can have programs which need
> quite some return stack depth, but few data elements - or the
> inverse. "Stack full" means the stack pointers cross.

I think you mean you have two separate stacks sharing one block of
memory, one growing up and the other growing down. That's not really a
single stack. There are two stack pointers.


> I got an FP package that uses the datastack (shared) and one that has its
> own stack. On top of that, I got a small lib that allows me to create my
> own stacks:
>
> ---8<---
> : stack dup ! ; ( stack --)
> : a@ @ @ ; ( stack -- n)
> : >a 1 cells over +! @ ! ; ( n stack --)
> : a> dup a@ -1 cells rot +! ; ( stack -- n)
> : adepth dup @ swap - ; ( stack -- n)
> ---8<---
>
> On average, the standard stack is more than sufficient - unless you want to
> do something crazy like the Ackermann function (unoptimized).
>
> The only place I was really cramming as much on this poor stack was the 4tH
> preprocessor doing nested macros. I even had to off load the data elements
> to a separate stack.
>
> Hans Bezemer
>


--

Rick C

Anton Ertl   Oct 5, 2016, 7:26:57 AM
hughag...@gmail.com writes:
>When I designed the Stundurd, I was working on the assumption that
>memory-access is significantly slower than register-access, and hence
>the key to boosting performance was to hold data in registers as much
>as possible. I allow zero, one or two of the top data-stack items to be
>held in registers. All of the machine-code primitives have multiple
>versions, which expect 0, 1 or 2 items in registers and leave 0, 1 or 2
>items in registers. The compiler has to be smart about compiling the
>correct versions of the primitives so they match up --- a primitive
>that leaves some number of top items in registers has to be followed by
>a primitive that expects the same number of top items in registers

That's what gforth-fast is doing (although it uses only 0 or 1 items
on many architectures, and 0-3 on PowerPC and ARM64).

>All in all, I thought my design was pretty cool. A lot of people
>(including Walter Banks) told me that my design was obsolete however. I
>was thinking about processors as they existed in the mid-1990s (when
>the MiniForth/RACE was developed), but that modern processors are
>different --- nowadays an FPGA will have all of its RAM inside of the
>chip, and hence there is no difference in speed between "memory" and
>"registers" which are actually the same thing.
>
>Anton Ertl's paper (as much as I can tell from the abstract) seems to
>be promoting the same idea that I was working on --- why doesn't
>anybody tell him that this idea is obsolete and hasn't been meaningful
>since the mid-1990s?

Maybe you should check whether the claims are true. Of course, for an
FPGA-based CPU, you can make registers as slow as RAM if you want, but
why would you do that, and why would you use a register machine for
implementing Forth on an FPGA in the first place?

Anyway, on hard CPUs there is a significant difference between
register access and L1 cache access (and the fairy tale about
memory/cache accesses being as fast as register accesses has already
been told in the 1990s).

In particular, if you store to memory and then load from the same
spot, as you do if you keep TOS in memory, this takes at least 4 cycles
on a modern Intel CPU, whereas just keeping it in a register costs 0
cycles. And similarly for other hard CPUs.

If you already keep the TOS in a register, then adding the option to
allow 0 buys a little bit, optionally using 2 buys some more, and
optionally using 3 buys very little more. Here are timing results for
a Cortex A53 executing ARM64 code:

sieve bubble matrix fib fft version
0.390 0.490 0.270 0.520 0.260 2016-04-27 (0-1 registers)
0.330 0.370 0.250 0.480 0.290 2010-05-27 (0-3 registers)

For more results, take a look at

@InProceedings{ertl&gregg05,
author = {M. Anton Ertl and David Gregg},
title = {Stack Caching in {Forth}},
crossref = {euroforth05},
pages = {6--15},
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz},
pdfurl = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26gregg05.pdf},
OPTnote = {not refereed},
abstract = {Stack caching speeds Forth up by keeping stack items
in registers, reducing the number of memory accesses
for stack items. This paper describes our work on
extending Gforth's stack caching implementation to
support more than one register in the canonical
state, and presents timing results for the resulting
Forth system. For single-representation stack
caches, keeping just one stack item in registers is
usually best, and provides speedups up to a factor
of 2.84 over the straight-forward stack
representation. For stack caches with multiple stack
representations, using the one-register
representation as canonical representation is
usually optimal, resulting in an overall speedup of
up to a factor of 3.80 (and up to a factor of 1.53
over single-representation stack caching).}
}

@Proceedings{euroforth05,
title = {21st EuroForth Conference},
booktitle = {21st EuroForth Conference},
year = {2005},
key = {EuroForth'05},
editor = {M. Anton Ertl},
url = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/proceedings.pdf}
}

rickman   Oct 5, 2016, 10:16:02 AM
The claim that block RAM is as fast as registers is not completely true.
The clock setup and clock-to-data delays are close enough, but registers
(even if implemented in LUTs) are read asynchronously, while block RAM
is synchronous on reads as well as writes. This means you have to jump
through hoops to read a block RAM in the same clock cycle in which the
address is presented.

There are some tricks you can play to work around that, but they are
messy, and one adds half a clock cycle to the output delay (clocking the
RAM on the falling edge rather than the rising edge). The other one I
sometimes use is to treat the register in the read path as an address
latch. It is updated on every clock cycle, but only used if the
instruction is a read. In my CPU design this means it mirrors the top
of the return stack, since that is where addresses are used rather than
on the data stack.


> Anyway, on hard CPUs there is a significant difference between
> register access and L1 cache access (and the fairy tale about
> memory/cache accesses being as fast as register accesses has already
> been told in the 1990s).
>
> In particular, if you store to memory and then load from the same
> spot, as you do if keep TOS in memory, this takes at least 4 cycles on
> a modern Intel CPU, whereas just keeping it in a register costs 0
> cycles. And similar for other hard CPUs.

But that will vary across CPUs. If the comparison is for embedded
systems you need to consider other CPUs and caches. I believe some of
the lower end ARMs, CM3/CM4, run the caches as fast as registers, but
I'm not sure.
Rick C

Hans Bezemer   Oct 5, 2016, 12:58:24 PM
rickman wrote:
> I think you mean you have two separate stacks sharing one block of
> memory, one growing up and the other growing down. That's not really a
> single stack. There are two stack pointers.
Absolutely correct. But since there is no way to tell whether an item at
a certain location at any particular moment is a data or a return item, I
consider it a shared stack. All a matter of definition, I guess. But you
know what I meant and that's the important point ;-)

Hans Bezemer

rickman   Oct 5, 2016, 1:17:53 PM
I'm a bit lost now... Why can't you tell what's data and what's return
addresses? Isn't that what the stack pointers are for?

I guess you are really saying the same thing as before, the two stacks
share the same block of memory so any memory location can be either data
or return address... but it should be easy to tell which it is.

To my thinking the stack pointers define the stack, so having two
independent pointers means two stacks.

But I'm glad you mentioned this. In FPGAs it is common to use a block
of RAM as two RAMs since they are dual-ported. Sharing one block RAM
between two stacks is common. It didn't occur to me until now to use
opposite-direction stack pointers to maximize the available space.

This does preclude Chuck's trick of making the stack wrap around so it
can be used as a ring buffer. But then the size has to be pretty small
to be useful. I believe the F18A has a 10-deep data stack: top- and
second-on-stack registers plus an 8-deep ring buffer. The return stack
is just the top of stack and an 8-deep ring buffer.

--

Rick C

Cecil Bayona   Oct 5, 2016, 1:18:04 PM
If one has a 1K-deep block of RAM and the stack pointers are 9 bits, with
the tenth address bit tied high for one stack and low for the other, the
two stacks will never run into each other; they will each wrap around
within their own 512-entry space.
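
(A minimal sketch of the pointer arithmetic for that first scheme, not
from the original post, assuming 10-bit cell addresses into the shared 1K
stack RAM; the word name is made up:)

---8<---
\ advance a stack pointer one cell, wrapping inside its own 512-entry half;
\ the half-select bit (value 512) picks which stack and is preserved
: half+ ( sp -- sp' )   dup 512 and   swap 1+ 511 and   or ;
---8<---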

Another scheme is that one stack starts at the top of RAM and grows down,
while the other starts at the bottom and grows upwards. All is OK as long
as the combined depth of both stacks is not 1024 or greater. This has
more flexibility, as one stack can be way bigger than the other as
needed. In this case the stack pointers are 10 bits.

In both cases one would buffer the stacks with registers for the top of
stack. I like one for the program stack and two levels of buffering for
the data stack, so that the top 3 data stack levels are ready to be used
with no delays.
--
Cecil - k5nwa

hughag...@gmail.com   Oct 5, 2016, 3:26:51 PM
On Tuesday, October 4, 2016 at 12:41:00 PM UTC-7, Walter Banks wrote:
> On 2016-10-04 2:46 PM, Hans Bezemer wrote:
> > On average, the standard stack is more than sufficient - unless you
> > want to do something crazy like the Ackermann function
> > (unoptimized).
> >
> > The only place I was really cramming as much on this poor stack was
> > the 4tH preprocessor doing nested macros. I even had to off load the
> > data elements to a separate stack.
>
> My understanding of 4TH is it compiles to an intermediate H code that
> can interpreted / compiled on a variety of targets if I have been
> looking at the correct website. I am an old UCSD pascal fan as well.

This is what Java does with its JVM --- define a VM that is simulated on a variety of different target processors --- I don't think this is a good idea because you are interpreting opcodes at run-time when this work should be done at compile-time (JIT supposedly works around this problem, but it is complicated, and you are still doing work at run-time that could have been done at compile-time).

I've pretty much moved on from the Stundurd idea. My new idea is the "hugh-niversal assembler." This is an assembly-language for a hypothetical Forth-centric processor. Different versions of the assembler will generate machine-code for different target processors. The Forth compiler will be a traditional STC Forth that uses a built-in Forth assembler to generate the code. By switching out the assembler however, the same compiler can be retargeted to different targets without needing to be rewritten.

In some cases, the hypothetical processor will be less powerful than the actual processor, in which case the actual processor will not be used to its full potential. For example, the actual processor may have more registers than the hypothetical processor, in which case these extra registers won't get used (an example would be the PIC24 with 20 registers).

In some cases, the hypothetical processor will be more powerful than the actual processor, in which case some instructions will assemble into multiple machine-code instructions, or possibly into a subroutine call. For example, the actual processor may have fewer registers than the hypothetical processor, in which case some registers will have to be memory variables (an example would be the MC6812 with 4 registers).

The hypothetical processor is a pretty close match to the 16-bit x86, so anybody who has written an MS-DOS Forth compiler using an 8086 Forth assembler should be able to easily port it over to use the hugh-niversal Forth assembler. The 80186 was a terrible micro-controller 25 years ago, and it is totally obsolete now --- that was because it had too much interrupt latency, and it was slow as molasses --- these problems aren't inherent in the design however; the hugh-niversal assembler should generate code for the PIC24 that is almost as fast as code written in PIC24 assembly-language.

Here is an excerpt from my document:
----------------------------------------------------------------------------
Anton Ertl's PAF (Portable Assembler for Forth) has major problems:

1.) His PAF is for 32-bit processors. This makes no sense, as the ARM is used in maybe 99% of all 32-bit micro-controllers (the MIPS, ColdFire, etc. make up the remainder) --- he should just write his Forth in ARM assembly-language if he wants portability in the 32-bit world.

2.) He believes that the register set size can be abstracted away. This is not true --- the number of registers is the one thing that has to be fixed --- a big part of designing any Forth system is dedicating the registers to specific uses, and the designer needs to know how many registers he has in order to do this.

3.) He considers abstracting away the register set size to be his primary goal --- to accomplish this, he is removing features from Forth that get in the way of this misguided goal --- when (if) he gets PAF to work, his Forth will be crippled.

4.) He still doesn't want lambda functions to be added to Forth --- so for all of his effort, the final result will still just be a typical lame Forth --- the lack of lambda functions has always been Forth's greatest weakness.

5.) He seems to have no concern at all for efficiency. Gforth was utterly useless for micro-controller programming --- his goal with Gforth was portability, but Gforth was so slow that it was never used in any commercial application, so there are no programs to port --- PAF will be more of the same.

The idea for HUGHasm came from PAF, but HUGHasm is different in every way --- HUGHasm is a design that makes sense in the real-world of commercial micro-controllers; it is not an academic exercise like PAF.
----------------------------------------------------------------------------

> One and two stacks have been what we have been doing most of our
> simulations with. I quickly at one point looked at the implications of
> three stacks (return, and two data) The idea that we could code up a
> tree type expression evaluation.
>
> What I found and it maybe the mix of regression tests we created from
> applications was we generated slightly faster and smaller code a single
> stack than we could with two stacks.
>
> Our three stack solution had no advantages over the two or single stack
> solution.

I have Straight Forth that has multiple stacks, but it is for 64-bit desktop-computers; it is not for micro-controllers. Here is an excerpt from my document:

----------------------------------------------------------------------------
In a function, there are several places to hold data:
single-stack --- This is called the "data-stack" in ANS-Forth etc.

double-stack --- This is where double-precision numbers, ratios and BFIs are held.

float-stack --- This is where floats are held.

string-stack --- This is where strings are held (strings can also be held on the single-stack).

return-stack --- This is used for temporarily holding data inside of functions. This can't be used for passing data into functions.

user registers --- These are the VEC PTR CNT and REG registers. PTR CNT and VEC are used by HOFs for iteration, and REG is general-purpose.

local variables --- These are used for temporarily holding data inside of functions. They can't be used for passing data into functions, although they can be used for passing data into and out of quotations.

The single-stack is the fundamental working-storage for Straight Forth; the same as in ANS-Forth and all other Forths. We have DUP OVER SWAP ROT etc. for juggling the values on the single-stack. We have + - * / etc. for doing arithmetic on single-precision numbers on the single-stack.

Double-precision numbers and ratios are held on the double-stack. We have DDUP DOVER DSWAP DROT etc. for juggling the values on the double-stack. We have D+ D- D* D/ etc. for doing arithmetic on double-precision numbers on the double-stack. We have RAT+ RAT- RAT* RAT/ etc. for doing arithmetic on ratios on the double-stack. Mixed-precision arithmetic will move data from one stack to another. For example, S>D pops a value from the single-stack and pushes it to the double-stack as a double-precision number. Similarly, M* pops two values from the single-stack and pushes their product to the double-stack, and M/ pops the numerator from the double-stack and pops the denominator from the single-stack and pushes the quotient to the single-stack. Also, S>RAT pops a value from the single-stack and pushes it to the double-stack as a ratio, whereas RAT>S pops a ratio from the double-stack and pushes it as a quotient to the single-stack. All the numbers on the single-stack (in the Fixed version) are effectively ratios that always have a denominator of 2^32 (unity).
----------------------------------------------------------------------------

> One interesting thing that has come out of this thread is there is some
> consistency with about 12 levels needed for subroutine return. (Totally
> consistent with our metrics of 13 levels for actual applications)

I said 16, but only because 8 seemed to be too small --- 12 should be okay.

In Straight Forth the float-stack is only 8 deep --- this was done because the most likely target processor will be the 64-bit x86 that has an 8-deep float-stack in hardware --- this is pretty small, so the programmer is encouraged to hold floats in local variables rather than leave them on the float-stack (note that in Straight Forth any data type can be stored in local variables).

The other stacks are plenty big, so the programmer never has to worry about overflow. The data-stack is gigantic --- this is mostly done to allow C functions to be called.

Straight Forth is for desktop computers. A micro-controller Forth would not have so many stacks. I think a single and a double stack would be sufficient for data (micro-controllers don't work with strings or floats very much). A 16-bit micro-controller is going to work with double-precision integers or ratios though (because 16-bit integers don't provide enough precision) --- so a double-stack is worthwhile --- maybe not though, as all the data coming in from ADC ports or going out DAC ports is 16-bit (maybe only 12-bit).

I don't really have enough experience with micro-controllers to design a Forth standard for micro-controllers --- also, I'm dubious that a standard is needed, because the programs aren't going to be portable anyway due to being tightly tied to the I/O --- Mark Wills thinks that I/O can be abstracted out and standardized, and he has more experience with micro-controllers than I do, so maybe he is right.

> What I haven't seen is anyone pointing to an application area that any
> particular language that gains performance by using more stack space
> which has been a surprise.

I agree that there aren't many applications that benefit from having big stacks --- certainly not in micro-controllers --- maybe occasionally in desktop-computer programs that do recursive-descent searches.

My N-Queens program dumps all of the data onto the data-stack --- this requires a big stack --- a better technique though would be to dump this into a linked-list.

Also, in the novice-package I have functions that take variable numbers of parameters on the data-stack (with a zero sentinel underneath the valid data) --- these may work with dozens of data --- once again though, it could be argued that this data should be in a linked-list.

rickman

unread,
Oct 5, 2016, 4:08:01 PM10/5/16
to
Not sure why having the top three items of the stack available is
important. The CPU I designed had a register for the top of stack and
used the block RAM for the next on stack. The block RAM looks like a
register, so for all intents and purposes this was functionally two
registers holding the top two items on the stack. This arrangement
allows all the more common Forth words to be implemented in one clock
cycle. The only one I can think of off the top of my head that can't
be supported by this is ROT, which according to Koopman is 13th in
dynamic frequency of use (averaging about 2%) and not in the list of
the top 18 in static frequency of use, so it is not a terrible thing
to construct from primitives.
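
In rough C the arrangement might look like the sketch below (an
illustrative model with made-up names, not the actual HDL, and with no
under/overflow checks): TOS in a register, NOS in the block-RAM cell at
ram[sp], every common word touching at most those two in one "clock",
and ROT composed from one-cycle primitives.

#include <stdint.h>

#define STACK_WORDS 32
static uint16_t tos;                  /* top of stack, held in a register     */
static uint16_t ram[STACK_WORDS];     /* rest of the data stack, in block RAM */
static int      sp = -1;              /* ram[sp] is next-on-stack (NOS)       */
static uint16_t rstack[STACK_WORDS];  /* return stack                         */
static int      rp = -1;

/* each of these touches only TOS and ram[sp]: one clock in this model */
static void lit(uint16_t x) { ram[++sp] = tos; tos = x; }
static void dup_(void)      { ram[++sp] = tos; }
static void drop_(void)     { tos = ram[sp--]; }
static void swap_(void)     { uint16_t t = tos; tos = ram[sp]; ram[sp] = t; }
static void over_(void)     { uint16_t n = ram[sp]; ram[++sp] = tos; tos = n; }
static void plus_(void)     { tos = (uint16_t)(tos + ram[sp--]); }
static void to_r(void)      { rstack[++rp] = tos; tos = ram[sp--]; }   /* >R */
static void r_from(void)    { ram[++sp] = tos; tos = rstack[rp--]; }   /* R> */

/* ROT needs the third cell, so it is composed from one-cycle primitives
   ( a b c -- b c a ), costing four clocks in this model:                    */
static void rot_(void)      { to_r(); swap_(); r_from(); swap_(); }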

Why do you feel it useful to cache three stack items on chip? Unless
the stack was in external RAM and the three cached stack items were used
as a flexible FIFO to minimize memory accesses, I don't see the utility.
Is this to accommodate combining instructions for parallel execution?

--

Rick C

Cecil Bayona

unread,
Oct 5, 2016, 4:44:00 PM10/5/16
to


On 10/5/2016 2:26 PM, hughag...@gmail.com wrote:

> I've pretty much moved on from the Stundurd idea. My new idea is the "hugh-niversal assembler." This is an assembly-language for a hypothetical Forth-centric processor. Different versions of the assembler will generate machine-code for different target processors. The Forth compiler will be a traditional STC Forth that uses a built-in Forth assembler to generate the code. By switching out the assembler however, the same compiler can be retargeted to different targets without needing to be rewritten.
>
> In some cases, the hypothetical processor will be less powerful than the actual processor, in which case the actual processor will not be used to its full potential. For example, the actual processor may have more registers than the hypothetical processor, in which case these extra registers won't get used (an example would be the PIC24 with 20 registers).
>
> In some cases, the hypothetical processor will be more powerful than the actual processor, in which case some instructions will assemble into multiple machine-code instructions, or possibly into a subroutine call. For example, the actual processor may have fewer registers than the hypothetical processor, in which case some registers will have to be memory variables (an example would be the MC6812 with 4 registers).
>
> The hypothetical processor is a pretty close match to the 16-bit x86, so anybody who has written an MS-DOS Forth compiler using an 8086 Forth assembler should be able to easily port it over to use the hugh-niversal Forth assembler. The 80186 was a terrible micro-controller 25 years ago, and it is totally obsolete now --- that was because it had too much interrupt latency, and it was slow as molasses --- these problems aren't inherent in the design however; the hugh-niversal assembler should generate code for the PIC24 that is almost as fast as code written in PIC24 assembly-language.
>
> Here is an excerpt from my document:

> The idea for HUGHasm came from PAF, but HUGHasm is different in every way --- HUGHasm is a design that makes sense in the real-world of commercial micro-controllers; it is not an academic exercise like PAF.
> ----------------------------------------------------------------------------

Interesting, on and off I have worked on a project using a funky virtual
machine but somewhat different from your idea.

The Forth compiler generates an internal form, basically
micro-primitives. It takes several micro-primitives to do simple
operations such as DUP, SWAP, + etc.; many of those micro-primitives
translate to a single instruction on a decent CPU.

Here is where I take a turn in a different direction: before the
assembler generates the code for the CPU, an optimizer steps in and
looks for patterns of micro-primitives. If a match is found, a simpler
pattern replaces the old pattern; these substitutions are directed by a
template file that specifies how to optimize the code. After the
optimizer is finished, the code is handed over to the assembler, which
generates actual machine code for a specific CPU.
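
To make the shape of that pass concrete, here is a small C sketch of a
template-driven peephole step (the names, the enum of micro-primitives,
and the template format are all invented for illustration; this is not
the actual tool). The pattern/replacement pairs would come from the
template file, and the pass rescans until nothing matches before
handing the stream to the assembler.

#include <stddef.h>
#include <string.h>

typedef enum { M_LIT0, M_PUSH_TOS, M_POP_TOS, M_FETCH, M_STORE, M_ADD } micro_t;

typedef struct {
    micro_t pat[4]; size_t pat_len;   /* pattern, as loaded from the template file */
    micro_t rep[4]; size_t rep_len;   /* replacement                               */
} template_t;

/* example template: "push TOS to memory, then immediately pop it back" is a no-op */
static const template_t templates[] = {
    { .pat = { M_PUSH_TOS, M_POP_TOS }, .pat_len = 2, .rep_len = 0 },
};

static size_t optimize(micro_t *code, size_t len)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (size_t i = 0; i < len; i++) {
            for (size_t t = 0; t < sizeof templates / sizeof templates[0]; t++) {
                const template_t *tp = &templates[t];
                if (i + tp->pat_len <= len &&
                    memcmp(&code[i], tp->pat, tp->pat_len * sizeof(micro_t)) == 0) {
                    /* splice: replacement in, tail moved up */
                    memmove(&code[i + tp->rep_len], &code[i + tp->pat_len],
                            (len - i - tp->pat_len) * sizeof(micro_t));
                    memcpy(&code[i], tp->rep, tp->rep_len * sizeof(micro_t));
                    len -= tp->pat_len - tp->rep_len;
                    changed = 1;
                }
            }
        }
    }
    return len;   /* the assembler consumes code[0..len) next */
}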

I have experimented with limited versions and it works quite well. The
assembler, which translates the optimized micro-primitives, has
intimate knowledge of the CPU and generates quite good code, and the
optimizer has improved the micro-primitives and simplified complex
operations. When changing to a new CPU one changes the assembler, and
may optionally change some of the templates to take special advantage
of the target CPU, but overall few changes take place.

It seems complicated, and the optimizer is the bear in this setup, but
it can be quite fast, and once done it's set. The payoff is the
generated code, which tends to be quite good, so running efficiency is
the target of the scheme.

Over the years I have had computers come and go, so some of my test
code was lost, but I want to resurrect it again. I have not forgotten
the basics of what was done, and now I'm getting close to settling
down on my Forth software setup, so it will soon be time to give it a
go again.

The main thrust is for use with embedded CPUs of the non-ARM variety;
for ARM I leave it to MPE. Lately I have developed a resurgent
interest in old CPUs (CDP1802, 6502, 6809, Z80) and new FPGA ones;
right now it's CDP1802 time. I'm not a fan of Intel 386 assembler, so
for PCs I'll stick to my MPE VFX 7 Pro compiler.

--
Cecil - k5nwa

Cecil Bayona

unread,
Oct 5, 2016, 4:50:58 PM10/5/16
to
No parallel execution yet, but just because.

TOS and the RAM itself are enough for many operations, but there are
also a bunch that reach for the third item; with two registers and
RAM, the top 3 stack items are accessible with no speed penalty, so it
makes things a little faster. Not much, but it doesn't cost much to
have TOS and two lower levels available.
--
Cecil - k5nwa

alex

unread,
Oct 5, 2016, 5:07:29 PM10/5/16
to
On 05/10/16 20:26, hughag...@gmail.com wrote:
> On Tuesday, October 4, 2016 at 12:41:00 PM UTC-7, Walter Banks wrote:
>> On 2016-10-04 2:46 PM, Hans Bezemer wrote:
>>> On average, the standard stack is more than sufficient - unless you
>>> want to do something crazy like the Ackermann function
>>> (unoptimized).
>>>
>>> The only place I was really cramming as much on this poor stack was
>>> the 4tH preprocessor doing nested macros. I even had to off load the
>>> data elements to a separate stack.
>>
>> My understanding of 4TH is it compiles to an intermediate H code that
>> can interpreted / compiled on a variety of targets if I have been
>> looking at the correct website. I am an old UCSD pascal fan as well.
>
> This is what Java does with its JVM --- define a VM that is simulated
> on a variety of different target processors --- I don't think this
> is a good idea because you are interpreting opcodes at run-time
> when this work should be done at compile-time (JIT supposedly works
> around this problem, but it is complicated, and you are still doing
> work at run-time that could have been done at compile-time).
>

JITs are valuable because they allow compiling work to be done at run
time that isn't possible at compile time. For instance, an array's
shape may not be known at compile time but instead driven by external
data; SQL statements can be JIT-compiled, with optimisations made at
runtime based on the input data; JIT parallelization can adapt to
unknown workloads or unknown compute & processor availability; etc.



hughag...@gmail.com

unread,
Oct 5, 2016, 5:31:56 PM10/5/16
to
On Tuesday, October 4, 2016 at 8:22:16 AM UTC-7, Walter Banks wrote:
> On 2016-10-04 12:47 AM, hughag...@gmail.com wrote:
> > So, once again, what is the point? Why are you still interested in
> > this stuff?
> >
> Want to define "stuff" so I actually know what you are talking about.

Here "stuff" means "stack-processors."

> There are a lot of strawman arguments in here that are both untrue or
> misleading. My views on forth you expressed are certainly incorrect, and
> the role of the MC6808. To be clear to I will (and have) used any
> language that is appropriate to describe an application area. My job as
> always is to translate the application into executable code. Most of our
> C compilers are not written in C. (All but two)

My understanding is that you think of Forth as being intermediate code that a C compiler would generate for a stack-processor.

Your interest in stack-processors is that you believe they use less resources in an FPGA than register-processors use.

You don't see any point in hand-coding Forth code, for the same reason that you don't see any point in hand-coding assembly-language code --- you believe that a modern C compiler can generate this low-level code as well or better than it can be hand-written --- you assume that all programmers will want to program in C rather than hand-write assembly-language or Forth code.

Am I reading you wrong? Are these assumptions incorrect? Have you ever written any Forth programs?

I saw another design once, for the F3C processor, that was based on these ideas. The designer wrote the "assembler" for his processor in Java (he just modified an assembler written for another processor to work with his F3C instead). He was assuming that a compiler (possibly a C compiler written in Java, or maybe an ANTLR compiler for some C-like language) would generate the assembly-language code and feed it into this assembler --- he had no intention of writing a Forth compiler --- it was obvious that he had never seen a Forth compiler and didn't know that any such thing exists. His F3C design made no effort to make function calls fast or small, so he apparently didn't know that Forth code is typically heavily factored.

AFAIK, Rickman is the same --- he doesn't program in Forth himself (he has never posted any Forth code on comp.lang.forth and he appears to know nothing about the subject) --- his only interest in stack-processors is that he believes they use less resources in an FPGA than register-processors use.

Apparently many FPGA guys have read Koopman's book and have become interested in stack-processors because of Koopman's idea that stack-processors use less resources in an FPGA than register-processors use --- none of these guys have ever programmed in Forth, nor do they have any intention of ever doing so --- they think of Forth as a kind of assembly-language that could, like any other assembly-language, be used as the back-end for a C compiler.

My Straight Forth has two built-in data-structures, which are the array and the chain (an ordered container keyed on a string) --- note that C only has the array. By comparison, ANS-Forth doesn't have any data-structures --- because of this, a lot of people think of ANS-Forth as being an assembly-language --- all assembly-languages lack support for data-structures, but instead require that all data-structures be hand-written. For example, an array would have you multiply the index by the element-size and then add the result to the base address --- in x86 assembly-language you get an addressing-mode that does all of this in one instruction, but in most assembly-languages you need multiple instructions --- so, ANS-Forth could be considered to be a lower-level assembly-language than x86 assembly-language.
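
For what it's worth, the address arithmetic being described is just the following (a trivial C rendering, with an invented helper name; x86 folds the scale and add into a single addressing mode, while most other instruction sets spell it out step by step):

#include <stdint.h>
#include <stddef.h>

/* base + index * element-size: the whole of "array support" at this level */
static inline void *array_ref(void *base, size_t index, size_t elem_size)
{
    return (uint8_t *)base + index * elem_size;
}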

Elizabeth Rather says:
--------------------------------------------------------
Virtually every Forth application needs some kind of array structures. The reason that some general-purpose one might be "little used" is because it's so easy to make a version that does *exactly* what the application needs rather than some generic definition. ... The objective is to master the toolset and be able to think clearly about exactly what kind of arrays will be useful in your application and then build exactly that kind.
--------------------------------------------------------

Elizabeth Rather is the reason why people think of Forth as being an assembly-language --- they don't want to program in Forth --- they suppose that a C compiler might be able to generate Forth code though, which would be useful if the target is a stack-processor.

rickman

unread,
Oct 5, 2016, 5:36:52 PM10/5/16
to
I'm trying to understand what "bunch" of operations use three stack
items. The only common one that comes to mind is ROT. There is not
much reason to optimize infrequently used operations.

If you are designing a MISC, the cost of the extra register may be
more than you think. The register itself is not so much, but the data
path connecting
the registers and functional units blossoms dramatically when adding
registers. In FPGAs and most ASICs, there are no "busses". Instead
every data source on a bus is an input to a mux. If controlled
separately, this mux must be duplicated for every register written to.
The logic required can be order n^2.

--

Rick C

Cecil Bayona

unread,
Oct 5, 2016, 6:14:19 PM10/5/16
to
I'm thinking ahead to an optimizing compiler that will combine
multiple instructions into one; to that end, reaching into the third
item happens more often than in a normal Forth program.
--
Cecil - k5nwa

hughag...@gmail.com

unread,
Oct 5, 2016, 7:32:42 PM10/5/16
to
On Wednesday, October 5, 2016 at 1:44:00 PM UTC-7, Cecil - k5nwa wrote:
> Interesting, on and off I have worked on a project using a funky virtual
> machine but somewhat different from your idea.
>
> The Forth compiler generates an internal form, basically
> micro-primitives, it takes several micro-primitives do do simple
> operations such a DUP, SWAP, + etc. many of those micro-primitives
> translate to a single instruction on a decent CPU.
>
> Here is where I take a turn in a different direction, before the
> assembler generates the code for the CPU an optimizer steps in and looks
> for patterns of micro-primitives, if a match is found then a simpler
> pattern replaces the old pattern, these substitutions are directed by a
> template file that specifies how to optimize the code. After the
> optimizer is finished the code is handed over to the assembler that
> generates actual machine code for a specific CPU.

That doesn't diverge very much from my design.

I was thinking of having multiple versions of each instruction, each with a different configuration in regard to how many of the top stack items (0, 1 or 2) are expected in registers and how many (0, 1 or 2) are returned in registers.

You are thinking of having a single version of each instruction, and then having a peephole optimizer in the assembler fix them up internally so they pass values in registers rather than push a value to memory and immediately pop it out of memory.

Either way, this is peephole optimization --- I have the compiler doing it --- you have the assembler doing it.

Your method might be better. For one thing, the target processor may not have very many registers, in which case you make the assembler only support 0 or 1 top items in registers, rather than 0, 1 or 2.

OTOH, your method puts the complexity in the assembler rather than the compiler --- this is a disadvantage because the assembler has to be rewritten for every new target --- if the complexity is in the compiler, then it gets written once, but doesn't have to get rewritten again.

SwiftForth code is always good for a laugh:

: init-list ( node -- node ) 0 over ! ; ok
see init-list
478F0F 4 # EBP SUB 83ED04
478F12 EBX 0 [EBP] MOV 895D00
478F15 0 # EBX MOV BB00000000
478F1A 4 # EBP SUB 83ED04
478F1D EBX 0 [EBP] MOV 895D00
478F20 4 [EBP] EBX MOV 8B5D04
478F23 0 [EBP] EAX MOV 8B4500
478F26 EAX 0 [EBX] MOV 8903
478F28 4 [EBP] EBX MOV 8B5D04
478F2B 8 # EBP ADD 83C508
478F2E RET C3 ok
see over
405C5F 4 # EBP SUB 83ED04
405C62 EBX 0 [EBP] MOV 895D00
405C65 4 [EBP] EBX MOV 8B5D04
405C68 RET C3 ok
see !
40572F 0 [EBP] EAX MOV 8B4500
405732 EAX 0 [EBX] MOV 8903
405734 4 [EBP] EBX MOV 8B5D04
405737 8 # EBP ADD 83C508
40573A RET C3 ok

All that SwiftForth does is inline primitives' code rather than compile a CALL to it --- here a literal number (0) is pasted in, then OVER is pasted in, then ! is pasted in --- there is no peephole-optimization being done.

rickman

unread,
Oct 5, 2016, 8:01:46 PM10/5/16
to
On 10/5/2016 7:32 PM, hughag...@gmail.com wrote:
> On Wednesday, October 5, 2016 at 1:44:00 PM UTC-7, Cecil - k5nwa
> wrote:
>> Interesting, on and off I have worked on a project using a funky
>> virtual machine but somewhat different from your idea.
>>
>> The Forth compiler generates an internal form, basically
>> micro-primitives, it takes several micro-primitives do do simple
>> operations such a DUP, SWAP, + etc. many of those micro-primitives
>> translate to a single instruction on a decent CPU.
>>
>> Here is where I take a turn in a different direction, before the
>> assembler generates the code for the CPU an optimizer steps in and
>> looks for patterns of micro-primitives, if a match is found then a
>> simpler pattern replaces the old pattern, these substitutions are
>> directed by a template file that specifies how to optimize the
>> code. After the optimizer is finished the code is handed over to
>> the assembler that generates actual machine code for a specific
>> CPU.
>
> That doesn't diverge very much from my design.
>
> I was thinking of having multiple versions of each instruction, each
> with a different configuration in regard to how many of the top stack
> items (0, 1 or 2) are expected in registers and how many (0, 1 or 2)
> are returned in registers.

That would need to be supported by the CPU directly, no? In essence the
CPU would have to know how many stack items are in registers and call
the appropriate microcode to execute the appropriate primitive.
Otherwise how would the compiler know what to expect when a word is
compiled into another word?

Given that any word can be called by a number of other words, how can a
compiler know what is happening at run time?

I'm not entirely clear on the context of this use of stack caching.
Normally this is a technique used to facilitate the use of multiple
registers in a standard CPU. The trick is to know when to adjust the
stack item count in registers and when such an adjustment would be
counter productive. It would not be unreasonable to keep in registers
more items than typically needed by any one word and only adjust when
required. So hysteresis could facilitate this. If a stack push is
needed when the register count is already at max does a flush of n
items. A stack pop when the register usage count is at the minimum does
a cache fill. But how can a standard CPU know when to do this? Either
the CPU needs to have hardware support for this, which standard CPUs
don't have, or extra instructions are required for every stack operation
to check the register usage count.

Am I missing something? If stack caching doesn't always keep the same
number of items in registers, how can a standard CPU keep track of when
to fill and flush?

--

Rick C

Elizabeth D. Rather

unread,
Oct 5, 2016, 8:48:08 PM10/5/16
to
Source, destination, count? There's a bunch of words (moves, compares)
that take those three. Not a huge bunch, though.

> If you are designing a MISC, the extra register may be more than you
> think. The register itself is not so much, but the data path connecting
> the registers and functional units blossoms dramatically when adding
> registers. In FPGAs and most ASICs, there are no "busses". Instead
> every data source on a bus is an input to a mux. If controlled
> separately, this mux must be duplicated for every register written to.
> The logic required can be order n^2.
>

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Cecil Bayona

unread,
Oct 5, 2016, 9:36:06 PM10/5/16
to


On 10/5/2016 6:32 PM, hughag...@gmail.com wrote:
> On Wednesday, October 5, 2016 at 1:44:00 PM UTC-7, Cecil - k5nwa wrote:
>> Interesting, on and off I have worked on a project using a funky virtual
>> machine but somewhat different from your idea.
>>
>> The Forth compiler generates an internal form, basically
>> micro-primitives, it takes several micro-primitives do do simple
>> operations such a DUP, SWAP, + etc. many of those micro-primitives
>> translate to a single instruction on a decent CPU.
>>
>> Here is where I take a turn in a different direction, before the
>> assembler generates the code for the CPU an optimizer steps in and looks
>> for patterns of micro-primitives, if a match is found then a simpler
>> pattern replaces the old pattern, these substitutions are directed by a
>> template file that specifies how to optimize the code. After the
>> optimizer is finished the code is handed over to the assembler that
>> generates actual machine code for a specific CPU.
>
> That doesn't diverge very much from my design.
>
> I was thinking of having multiple versions of each instruction, each with a different configuration in regard to how many of the top stack items (0, 1 or 2) are expected in registers and how many (0, 1 or 2) are returned in registers.
>
> You are thinking of having a single version of each instruction, and then having a peephole optimizer in the assembler fix them up internally so they pass values in registers rather than push a value to memory and immediately pop it out of memory
>
> Either way, this is peephole optimization --- I have the compiler doing it --- you have the assembler doing it.
>
> Your method might be better. For one thing, the target processor may not have very many registers, in which case you make the assembler only support 0 or 1 top items in registers, rather than 0, 1 or 2.
>
> OTOH, your method puts the complexity in the assembler rather than the compiler --- this is a disadvantage because the assembler has to be rewritten for every new target --- if the complexity is in the compiler, then it gets written once, but doesn't have to get rewritten again.
>

The optimizer is not built into the assembler; it's a separate entity.
It remains the same no matter what chip you use, and it uses an
external file with templates: if you want to add additional
optimization you add it to the templates; the optimizer itself is not
changed.

Basically the compiler bundles a word definition and all that it
contains and sends it to the optimizer, which rearranges the code and
sends it to the assembler; the assembler generates the code. It knows
the internal details needed to generate efficient code, but it still
is fairly simple. Code that is inline between code definitions is
bundled and sent to the optimizer before work begins again on the next
word definition.

Anyway, I had it working very well many years ago, and it supported a
multitude of CPUs, but that was a long time ago and it is lost; I
would like to re-create it in the not too distant future. The part
that is fairly difficult is the optimizer: it is basically a very
powerful and flexible macro processor, which is difficult to do
efficiently. I used to use a balanced linked list to speed up the
searching.

--
Cecil - k5nwa

Cecil Bayona

unread,
Oct 5, 2016, 9:49:15 PM10/5/16
to
It will try to combine multiple instructions and generate a better
sequence, hence sometimes having access to the 3rd stack item comes in
handy. In a Z80 it would use indexed access, but on a stack machine
that may not be possible. The Forth CPU I'm thinking of creating would
have two index registers to access general memory but not the stacks,
which are a separate memory space, but I may change my mind on that
issue; nothing is final until one is six feet under.

The FPGA I will use has 21K 4-input LUTs, which should be plenty for
what I'm thinking about; a somewhat similar implementation (minus the
index registers) that I have been working on uses 28% of a 5K LUT
device.
--
Cecil - k5nwa

hughag...@gmail.com

unread,
Oct 6, 2016, 12:28:43 AM10/6/16
to
On Wednesday, October 5, 2016 at 6:36:06 PM UTC-7, Cecil - k5nwa wrote:
> On 10/5/2016 6:32 PM, hughag...@gmail.com wrote:
> > Your method might be better. For one thing, the target processor may not have very many registers, in which case you make the assembler only support 0 or 1 top items in registers, rather than 0, 1 or 2.
> >
> > OTOH, your method puts the complexity in the assembler rather than the compiler --- this is a disadvantage because the assembler has to be rewritten for every new target --- if the complexity is in the compiler, then it gets written once, but doesn't have to get rewritten again.
> >
>
> The optimizer is not built into the assembler its a separate entity, it
> remains the same no matter what chip you use, it uses an external file
> with templates, if you want to add additional optimization you add it to
> the templates, the optimizer itself is not changed.

If the optimizer is not built into the assembler, and only the assembler contains target-specific information, then we haven't diverged at all --- you already are using the same method that I'm using.

As I said above though, the disadvantage of this is that the number of registers in the hypothetical processor is fixed --- this will be the same no matter how many registers the target processor has.

I also said, in that excerpt:
"[Anton Ertl] believes that the register set size can be abstracted away. This is not true --- the number of registers is the one thing that has to be fixed --- a big part of designing any Forth system is dedicating the registers to specific uses, and the designer needs to know how many registers he has in order to do this."

I said:
"I was thinking of having multiple versions of each instruction, each with a different configuration in regard to how many of the top stack items (0, 1 or 2) are expected in registers and how many (0, 1 or 2) are returned in registers."

Michael Barry said that having all of these primitives is not necessary. He recommended that the primitives do this at run-time. He said that the top value of the data-stack would always be held in a register (the TOS register) and the second value of the data-stack may or may not be held in a register (the SOS register). There would be a flag that would indicate TRUE if SOS was popped or FALSE if the second value was in memory:
1.) If a primitive (a "consumer") needs SOS then it checks the flag at the beginning and pops SOS from memory if the flag is FALSE.
2.) If a primitive (a "producer") needs SOS in memory then it checks the flag at the beginning and pushes SOS to memory if the flag is TRUE.
3.) Either way, the primitive sets the flag at the end to indicate if SOS is popped or if the second value is in memory.

A "producer" is a primitive that leaves more data on the data-stack (LIT and OVER etc.) and a "consumer" is a primitive that leaves less data on the data-stack (+ and DROP etc.).

I got the terms "producer" and "consumer" from ISYS Forth (for the Apple-IIc) which did peephole-optimization. The programmer (I think his name was Robert Illyes) said that Forth code mostly alternates producers and consumers, which is why the stack never gets very deep. His code optimizer allowed the Y:A register-pair to be used to pass the TOS from one primitive to the next when possible, although all of the stack was in memory nominally. I won't go into all the details here because the 65c02 is not something that people care about nowadays. I will say that his code optimizer was quite good! My own 65c02 cross-compiler was essentially just ISYS Forth except that it ran on an MS-DOS machine and was written in UR/Forth and generated 65c02 machine-code, but it was compatible with ISYS in every way.

Michael Barry got this idea of the SOS-valid flag from some old processor from the 1960s (maybe a Burroughs). Unfortunately I can't find the thread in which he discussed this. I did find a thread in which Bernd Linsel supported Michael Barry's idea:
https://groups.google.com/forum/#!searchin/comp.lang.forth/Michael$20optimization%7Csort:date/comp.lang.forth/n5mkONylBBs/RdMW_nOkCwAJ
In this thread I said:
On Sunday, October 11, 2015 at 10:14:30 PM UTC-7, hughag...@gmail.com wrote:
> On Sunday, October 11, 2015 at 9:12:03 AM UTC-7, Bernd Linsel wrote:
> > Your idea of having several stack configurations (SOS valid or not)
> > in order to simplify compiler optimization is not bad, although, as
> > you mention in 19.5. Michael Barry's suggestion, the processor itself
> > can decide if it is necessary to huff or puff before a primitive, as
> > it always can track the validity of SOS.
>
> Yes. I'm really thinking that I should just go with Michael's suggestion and get rid of all the redundancy. Right now I'm holding off until I can talk to hardware designer and find out if it is possible for microcode to branch. On the MiniForth I was told that branching was very expensive to implement, and I was required to write all of the primitives without any branching --- I'm assuming the same limitation for the Stundurd --- the MiniForth was 20 years ago though, and this limitation may no longer be an issue.

Here is section 19.5 from the Stundurd document that Bernd Linsel is referencing:
============================================================================
section 19.5) an interesting suggestion

Michael Barry says this:

----------------------------------------------------------------------------
In the simplest FMITE, I believe there are two internal states for the parameter stack: huffed and puffed. If the FMITE keeps an internal flag that keeps track of this, it can insert a huff or puff automatically BEFORE the currently decoded instruction, but only if the current state is wrong for that particular instruction. Latency would be a maximum of a single huff or puff, and would usually be zero. By putting this atomic latency in its own clock slot (BEFORE the current instruction, not AFTER), the maximum clock speed wouldn't hardly suffer at all. All of your mnemonics that have a suffix like 01, 12, 21, could be consolidated into a single op-code and mnemonic with no suffix, because the cpu would know by the current state and the op-code it has just decoded whether or not it will have to insert a one cycle delay to huff or puff. HUFF01, HUFF12, and PUFF21 could be completely eliminated from the op-code matrix, because they would be internalized, and only used as dictated by the current instruction stream.

There would be no need to worry about leaving the processor in any given huff/puff state after any instruction or exit (or interrupt, for that matter), because any necessary state changes would be handled by the cpu when the following instruction is fetched and decoded. I believe that some of the Burroughs mainframes were doing this 50 years ago, so it's a proven strategy. The best analogy that comes to mind is a manual transmission vs. an automatic: your assembly code examples expend a lot of effort to keep all the huffers and puffers in harmony by manually clutching and shifting gears (states) as needed. You don't have to upshift or downshift an automatic, you just put it in "drive" and let it decide the best gear for the circumstances. In my experience, it usually makes the most efficient choice for the given road (instruction stream), unless it's broken. And you don't have to worry about spilling your coffee or dropping your cigarette by having to reach for the gearshift in traffic. It should also be possible to get the compiler to recognize and favor more efficient sequences (like NIP DROP instead of DROP DROP, so the cpu doesn't have to work so hard automatically huffing and puffing).

----------------------------------------------------------------------------

Michael Barry is a professional auto-mechanic, which is presumably why his explanation concluded with an automobile analogy. As an historical note, Henry Ford originally wanted an automatic transmission because he didn't believe that the average American was capable of learning how to operate a clutch, and he expected that if only a manual transmission were offered then automobiles would only be purchased by "gear heads" (the term for technology enthusiasts in the early 20th century), whereas the common people would continue to use horses. Unfortunately, his engineers informed him that the automatic-transmission technology was not ready for prime-time yet, so he had to go with a manual transmission. The people managed, although 50 years later when automatic-transmissions were introduced everybody jumped on the technology as they apparently had had quite enough of spilling their coffee and dropping their cigarettes. Michael programs as a hobby, mostly in 6502 assembly language for retro computers from the 1980s.

His idea is pretty good. There would be a one-bit register flag (let's call it SV for "SOS valid") that would be set to TRUE if SOS has the second-on-stack value in it or set to FALSE if SOS is invalid. Each primitive would set SV appropriately when it concluded. Most of the primitives would also examine SV before beginning, and either huff a datum into SOS or puff a datum out of SOS depending upon which they need. Some primitives, such as LIT-PLUS, don't care if SOS is valid or not because they only work with TOS and so they would ignore SV and leave it as-is for the next primitive to worry about.

The advantage of this scheme is that there would be fewer primitives needed to do the same thing as currently being done. Right now, there is a lot of redundancy such as SWAP12 and SWAP22 that essentially do the same thing, but are only different in regard to whether they expect SOS to be valid or not. All of these redundant primitives could be replaced with a single primitive, such as SWAP for SWAP12 and SWAP22. We still have 64 primitives total available though, so a lot of what currently are macros could become primitives --- for example, we could have a TUCK primitive (currently we have TUCK12 and TUCK22 macros, each expanding to two primitives). Also, a single primitive only takes up one byte, so it could fit in the skip-area of a skippy primitive along with another primitive, reducing the need to put a BRANCH in the skip-area and the code elsewhere. The overall program would be smaller --- reducing bloat could be an issue for big programs (we only have 32KB of non-volatile memory, and only 16KB of that is accessible by CALL) --- also, reducing bloat increases speed because less time is spent fetching and decoding opcodes.

The disadvantage is that most of the primitives would be slower because they have to test SV before beginning and also set SV at the end. This likely won't make much of a difference in an FPGA. In an emulation, however, it could make a big difference. The AVR is not too bad, because the T flag could be used as SV, but most processors (such as the MSP-430 or PIC24) would require a 16-bit register to be used for SV, which is wasteful of resources (it might be possible to use the Carry flag, although you would have to be careful not to accidentally clobber it in your NEXT code). Another disadvantage is that this scheme requires primitives to be able to branch internally (depending upon the state of SV at the beginning). My experience with the MiniForth is that primitives can't branch internally --- I assume that this is very expensive to implement in an FPGA --- Mark Wills says that I'm thinking in terms of 20-year-old technology and that modern FPGAs are much more powerful and don't have this limitation. I'm not familiar enough with FPGAs to know if this is true or not (maybe it is only true for big expensive FPGAs but not for small inexpensive FPGAs).

Right now, I'm sticking with the simple plan of having the huffing and puffing figured out at compile-time (the cross-compiler generates either SWAP12 or SWAP22 as appropriate). Michael's idea with this being done at run-time is something to consider though --- maybe a second version of the processor can do this.
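
As a software emulation, the run-time SV scheme might look something like the minimal C sketch below (the names are illustrative, not from the Stundurd document): TOS is always cached, SOS is cached only while sv is set, and each primitive huffs or puffs at entry as needed and records the state it leaves.

#include <stdint.h>

static uint16_t tos, sos;     /* cached top two data-stack cells        */
static int      sv;           /* nonzero: sos holds the second cell     */
static uint16_t mem[64];      /* in-memory remainder of the data stack  */
static int      dsp = -1;

static void huff(void) { if (!sv) { sos = mem[dsp--]; sv = 1; } }  /* fetch SOS */
static void puff(void) { if (sv)  { mem[++dsp] = sos; sv = 0; } }  /* spill SOS */

static void prim_plus(void)        /* consumer: a b -- a+b   */
{
    huff();
    tos = (uint16_t)(tos + sos);
    sv = 0;                        /* the second cell is back in memory now */
}

static void prim_drop(void)        /* consumer: a --         */
{
    huff();
    tos = sos;
    sv = 0;
}

static void prim_lit(uint16_t x)   /* producer: -- x         */
{
    puff();                        /* make room below the old top       */
    sos = tos;  sv = 1;
    tos = x;
}

static void prim_over(void)        /* producer: a b -- a b a */
{
    huff();
    uint16_t a = sos;
    puff();                        /* old second sinks into memory      */
    sos = tos;  sv = 1;            /* old top becomes the new second    */
    tos = a;
}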

hughag...@gmail.com

unread,
Oct 6, 2016, 3:05:30 AM10/6/16
to
Maybe so.

My goal with the ITC version of Straight Forth is to generate code that is twice as fast as SwiftForth code (about half as fast as VFX code), which is the minimum needed for it to be useful for writing commercial software. If it is not this fast, I may delve into JIT. I'm expecting that basic compile-time peephole optimization should be adequate though.

JIT is pretty complicated --- I'm hoping to avoid that!

Andrew Haley

unread,
Oct 6, 2016, 3:17:08 AM10/6/16
to
alex <al...@rivadpm.com> wrote:

> JITs are valuable because they allow compiling work to be done at
> run time that isn't possible at compile time. For instance, an
> array's shape may not be known at compile time, but driven by
> external data; JIT compiling SQL statements, where optimisations can
> be made at runtime based on the input data; JIT parallelization for
> unknown workloads or unknown compute & processor availability; etc.

The other crucial thing is that a JIT can optimize only along hot code
paths, ignoring everything else. This makes it possible to inline
extremely deep, eliminating call overhead, removing dynamic memory
allocation, and eliminating many branches.

Andrew.

Anton Ertl

unread,
Oct 6, 2016, 4:43:07 AM10/6/16
to
rickman <gnu...@gmail.com> writes:
>On 10/5/2016 5:59 AM, Anton Ertl wrote:
>> Anyway, on hard CPUs there is a significant difference between
>> register access and L1 cache access (and the fairy tale about
>> memory/cache accesses being as fast as register accesses has already
>> been told in the 1990s).
>>
>> In particular, if you store to memory and then load from the same
>> spot, as you do if keep TOS in memory, this takes at least 4 cycles on
>> a modern Intel CPU, whereas just keeping it in a register costs 0
>> cycles. And similar for other hard CPUs.
>
>But that will vary across CPUs. If the comparison is for embedded
>systems you need to consider other CPUs and caches. I believe some of
>the lower end ARMs, CM3/CM4, run the caches as fast as registers, but
>I'm not sure.

And the fairy tale is still told and believed today. It must be a
very attractive fantasy to be so persistent.

What is the reality: First of all, ARM is a load/store architecture,
so unless the loads and stores take 0 cycles (they don't, see below)
on these machines, keeping the TOS in memory will cost more than
keeping it in a register. Likewise for other stack items, although
there are tradeoffs to consider there that complicate the analysis.

Now, concerning the instruction timings, the "Cortex-M3 Technical
Reference Manual"
<http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337h/CHDDIGAC.html>
says:

|Operation Description Assembler Cycles
|Load Word LDR Rd, [Rn, <op2>] 2[c]
|Store Word STR Rd, [Rn, <op2>] 2[c]
|
|[c] Neighboring load and store single instructions can pipeline their
|address and data phases. This enables these instructions to complete
|in a single execution cycle.

I very much doubt that [c] applies if you have a store followed by a
load to the same address, so I would guess that this sequence costs at
least 4 cycles. For concrete data, measure it. I would not be
surprised if additional latency would become visible when the load
reads data very soon after the store has written it.
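
A crude way to see the effect on a real part is to time the same loop
twice, once with the top-of-stack cell kept in a local (so it stays in
a register) and once forced through memory, using whatever cycle
counter the target provides (the DWT cycle counter on a Cortex-M3, for
instance). A sketch only, not a rigorous benchmark:

#include <stdint.h>

uint32_t tos_in_register(uint32_t n)
{
    uint32_t tos = 0;                /* lives in a register for the whole loop */
    for (uint32_t i = 0; i < n; i++)
        tos += i;
    return tos;
}

uint32_t tos_in_memory(uint32_t n)
{
    volatile uint32_t tos = 0;       /* volatile forces a store and a load per use */
    for (uint32_t i = 0; i < n; i++)
        tos = tos + i;               /* load, add, store back to the same address */
    return tos;
}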

Walter Banks

unread,
Oct 6, 2016, 8:49:04 AM10/6/16
to
On 2016-10-05 5:31 PM, hughag...@gmail.com wrote:
> On Tuesday, October 4, 2016 at 8:22:16 AM UTC-7, Walter Banks wrote:
>> On 2016-10-04 12:47 AM, hughag...@gmail.com wrote:
>>> So, once again, what is the point? Why are you still interested
>>> in this stuff?
>>>
>> Want to define "stuff" so I actually know what you are talking
>> about.
>
> Here "stuff" means "stack-processors."
>
>> There are a lot of strawman arguments in here that are both untrue
>> or misleading. My views on forth you expressed are certainly
>> incorrect, and the role of the MC6808. To be clear to I will (and
>> have) used any language that is appropriate to describe an
>> application area. My job as always is to translate the application
>> into executable code. Most of our C compilers are not written in
>> C. (All but two)
>
> My understanding is that you think of Forth as being intermediate
> code that a C compiler would generate for a stack-processor.

No, I think that the Forth community has a knowledge base about
programming stack processors that is useful to understand when
implementing code generators for compilers for stack machines.

For what it is worth, I wrote a test code generator at one time that
generated Forth, as part of a project looking at code reliability and
stack-processor code generation sequences. At various other times we
have generated C, FORTRAN, Pascal, Lisp, and TRAC from C source.

>
> Your interest in stack-processors is that you believe they use less
> resources in an FPGA than register-processors use.

That is probably true, but our use of FPGAs is mostly for ISA
prototypes on a wide range of architectures. We have worked on RISC,
conventional register, and microcoded ISAs, each for their own reasons.

>
> You don't see any point in hand-coding Forth code, for the same
> reason that you don't see any point in hand-coding assembly-language
> code

Far from what I believe.

> --- you believe that a modern C compiler can generate this low-level
> code as well or better than it can be hand-written

It has been demonstrably true for at least 20 years that a good modern
C compiler can equal or better hand-written assembler in code size
and/or execution time.


> --- you assume that all programmers will want to program in C rather
> than hand-write assembly-language or Forth code.

Not true. I assume that programmers will program in the language
appropriate for the job, with the tools that they are familiar with,
on the available platform that suits the application.


> Am I reading you wrong? Are these assumptions incorrect?

Some are as I have pointed out.

> Have you ever written any Forth programs?
Yes.

w..

jim.bra...@ieee.org

unread,
Oct 6, 2016, 3:08:49 PM10/6/16
to
On Monday, October 3, 2016 at 7:22:29 AM UTC-5, Walter Banks wrote:
> I am currently working on a stack processor ISA.
>
> What better place to ask about stacks.
>
> How many? single stack or two or three.
> What depth?
> Why?
>
>
> w..

]> I am currently working on a stack processor ISA.

Checked out the ByteCraft website, where there is a 2009 article on avoiding having a stack. I agree that non-recursive routines can have their data "frame" allocated in global memory, which can be cached just as well as a software stack. However, doesn't the number of bits needed to specify a location in global memory exceed that needed when using a stack or a frame pointer? That is, shouldn't code size be smaller when using the data stack mechanism rather than global or even frame offsets?
(Has anyone ever proposed compressing Forth address literals?)

As to how many stacks: multiple data stacks make it easier to do a multiple-issue-per-clock stack machine. (Warning: bright idea ahead.) Each stack has a single write port and multiple read ports (a common feature of FPGA RAMs). Thus an ALU for each stack, which can read from any of the stacks and write results to its own stack.

Jim Brakefield

Paul Rubin

unread,
Oct 6, 2016, 4:08:05 PM10/6/16
to
Walter Banks <wal...@bytecraft.com> writes:
> At various other times we have generated C, FORTRAN, Pascal, lisp,
> trac from C source.

Trac, you mean Calvin Mooers' old language? #(ps, #(rs)) ?

Wow! That is cool. Trac is even more unusual than Forth ;-)

alex

unread,
Oct 6, 2016, 4:47:33 PM10/6/16
to
On 06/10/16 20:08, jim.bra...@ieee.org wrote:
> On Monday, October 3, 2016 at 7:22:29 AM UTC-5, Walter Banks wrote:
>> I am currently working on a stack processor ISA.
>>
>> What better place to ask about stacks.
>>
>> How many? single stack or two or three.
>> What depth?
>> Why?
>>
>>
>> w..
>
> ]> I am currently working on a stack processor ISA.
>
> Checked out the ByteCraft website where there is a 2009 article on avoiding having a stack.

There are many systems that don't have stacks supported by the processor
hardware. The IBM S/360 and its successors are one example. Its
instruction set has no equivalent of the x86 PUSH and POP. The standard
calling convention (the S-type linkage) uses doubly linked lists. The
caller provides a register save area for the callee's use.

Of course, one could create a synthetic stack by using a register
pointer to memory, but it would require the register to be additionally
incremented or decremented after a load or store.

Walter Banks

unread,
Oct 6, 2016, 7:05:14 PM10/6/16
to
It is always about tradeoffs for dynamic memory. Compiled stacks
require larger address space, but address space generally comes in 8
bits, and 8 bits for dynamic space in most embedded applications is
rarely enough.

What is interesting is that the price can be much less depending on
the data flow analysis of the application, and dynamic memory for a
compiled stack may be just as likely to be accessed by an index
register as it is to be directly accessed.

Two things have come out of our metrics related to this. Most
(actually essentially all) recursion gets optimized out of embedded
applications. The second thing is that the run-time overhead of
procedural languages like C to tear down stack frames at run time is
0.5% to 1%, often more, versus none for compiled stacks. The last
point (I know, more than two) is that dynamic array support on the
stack is notoriously weak in many ISAs, but in a compiled stack it has
the same support as normal globals.
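
As a hand-done illustration of what a compiled stack amounts to (the
compiler derives this from whole-program call-graph analysis; the
functions and the overlay below are made up): non-recursive routines
get their locals at fixed addresses, and routines that can never be
active at the same time share the same storage, so there is no frame
setup or teardown at run time.

#include <stdint.h>

/* foo() and bar() are never active at the same time in this program's
   call graph, so their "frames" may occupy the same storage.           */
static union {
    struct { int16_t x, y; }   foo_frame;
    struct { int16_t acc[4]; } bar_frame;
} overlay;

static int16_t foo(int16_t a)
{
    overlay.foo_frame.x = a;                  /* no SP adjustment, no frame setup */
    overlay.foo_frame.y = (int16_t)(a * 3);
    return (int16_t)(overlay.foo_frame.x + overlay.foo_frame.y);
}

static int16_t bar(void)
{
    int16_t s = 0;
    for (int i = 0; i < 4; i++) {
        overlay.bar_frame.acc[i] = (int16_t)(i * i);
        s = (int16_t)(s + overlay.bar_frame.acc[i]);
    }
    return s;
}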

> (has anyone ever proposed compressing Forth address literals?)

I have looked at compressed constants and code at the ISA level in a
couple of processor projects; in both cases, for various reasons, it
wasn't a win.

w..



Walter Banks

unread,
Oct 6, 2016, 7:11:47 PM10/6/16
to
Yup, that one. We never released it; I didn't want to be sued.
Also targeted SAM at one point. (SAM: Same As Mooers :) It was
functionally the same, but with quite different syntax.

w..

rickman

unread,
Oct 6, 2016, 9:57:53 PM10/6/16
to
On 10/6/2016 3:08 PM, jim.bra...@ieee.org wrote:
> On Monday, October 3, 2016 at 7:22:29 AM UTC-5, Walter Banks wrote:
>> I am currently working on a stack processor ISA.
>>
>> What better place to ask about stacks.
>>
>> How many? single stack or two or three. What depth? Why?
>>
>>
>> w..
>
> ]> I am currently working on a stack processor ISA.
>
> Checked out the ByteCraft website where there is a 2009 article on
> avoiding having a stack. Agree that non-recursive routines can have
> their data "frame" allocated in global memory. Which can be cached
> just as well as a software stack. However, the number of bits needed
> to specify a location in global memory exceeds that needed when using
> a stack or when using a frame pointer? That is: code size should be
> smaller for using the data stack mechanism rather than global or even
> frame offsets? (has anyone ever proposed compressing Forth address
> literals?)

I used a sort of compression in my instruction set for all literals. An
instruction that uses an address contains a short literal in the
instruction. If the literal required is too long to fit in the
instruction a literal extension can be prepended as an additional
instruction. So literals take up space in instruction sized pieces. Is
that what you mean?
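
Something along these lines, roughly (a C decode sketch with invented
field widths and opcode values, not the actual encoding): a 16-bit
instruction carries a short literal, and an optional prepended
extension word supplies the upper bits.

#include <stdint.h>

#define OP_LITX 0x1F        /* assumed opcode of the "literal extension" word */

typedef struct { uint8_t op; uint32_t literal; } decoded_t;

/* Decode one logical instruction starting at pc; returns 16-bit words used. */
static int decode(const uint16_t *pc, decoded_t *out)
{
    uint32_t ext = 0;
    int n = 0;
    if ((pc[0] >> 11) == OP_LITX) {              /* optional prepended extension  */
        ext = pc[0] & 0x07FFu;                   /* its 11 bits become upper bits */
        n++;
    }
    out->op      = (uint8_t)(pc[n] >> 11);             /* 5-bit opcode            */
    out->literal = (ext << 11) | (pc[n] & 0x07FFu);    /* short literal, extended */
    return n + 1;
}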


> As to how many stacks: multiple data stacks makes it easier to do a
> multiple issue per clock stack machine. (warning: bright idea ahead)
> Each stack has a single write port and multiple read ports (a common
> feature of FPGA RAMs). Thus an ALU for each stack which can read
> from any of the stacks and write results to its stack.
>
> Jim Brakefield
>


--

Rick C

jim.bra...@ieee.org

unread,
Oct 6, 2016, 10:21:14 PM10/6/16
to
]> I used a sort of compression in my instruction set for all literals. An
]> instruction that uses an address contains a short literal in the
]> instruction. If the literal required is too long to fit in the
]> instruction a literal extension can be prepended as an additional
]> instruction. So literals take up space in instruction sized pieces. Is
]> that what you mean?

Yes, this is my preference. People with more design experience may prefer variable-length instructions with various literal/immediate sizes (there exist ISAs of both types).

At a more abstract level, considering the information content of the instruction stream, global and non-recursive local variables can be accessed via a base and offset, where the offset is generally shorter than a full memory address. If there is a frame pointer, then the offset is likely to be even shorter. If the Forth PICK instruction is used for this purpose (which requires that the locations of locals on the stack are known to the compiler(1)), then a ~dozen PICK offsets can be folded into the op-code space. The inverse of the PICK instruction (e.g. TO) is not needed to update locals; just PICK the new value(s) and prune the stack on return.

(1) Personally, I like the idea of keeping parameters and locals on the data stack. However, one can also have a PICK from the return stack, or from a frame pointer, or from some other stack.

Jim Brakefield

rickman

unread,
Oct 7, 2016, 12:10:47 AM10/7/16
to
Most of this is beyond what I wish to address in a CPU in an FPGA. I
see them as simple beasts not unlike Chuck's F18 or Bernd Paysan's b16.
After seeing the J1 I realized I had too much in my CPU design. The J1
is amazingly simple and yet effective. At some point I want to try
some alternatives that I think may be useful: a register/stack hybrid,
giving a limited range of access into the stack for operands. The
trick is to provide enough that it is useful without complicating the
hardware. The initial tests I did seemed to show improvement in
instruction counts and consequently speed as well.

--

Rick C

Paul Rubin

unread,
Oct 7, 2016, 2:14:53 AM10/7/16
to
rickman <gnu...@gmail.com> writes:
> A register/stack hybrid by giving a limited range of access into the
> stack for operands.

This sounds like a register window architecture, e.g. SPARC. They were
in vogue in workstation processors for a while, but then went out of
favor, I think because of difficulties making them superscalar or
something like that. For a small fpga softcore maybe that doesn't
matter.

rickman

unread,
Oct 7, 2016, 2:26:54 AM10/7/16
to
Lol, no, I don't think not being "superscalar" matters to a MISC CPU.
Not familiar with register windows, but that sounds a bit like it, but
put on top of a stack. The idea is to do away with most of the stack
manipulations (OVER, DUP, etc.) by implementing them with small
offsets and stack pointer adjustments other than +1/0/-1.

One of the issues in stack machines is dealing with more than a couple
or three stack items. I think a small offset into the stack will make
this work a lot better.

--

Rick C

Lars Brinkhoff

unread,
Oct 7, 2016, 2:37:26 AM10/7/16
to
Paul Rubin <no.e...@nospam.invalid> writes:
> This sounds like a register window architecture, e.g. SPARC. They
> were in vogue in workstation processors for a while, but then went out
> of favor

Not completely. The newish Xtensa architecture has an option for
register windows.

Paul Rubin

unread,
Oct 7, 2016, 2:57:57 AM10/7/16
to
rickman <gnu...@gmail.com> writes:
> One of the issues in stack machines is dealing with more than a couple
> or three stack items. I think a small offset into the stack will make
> this work a lot better.

The SPARC had a 24-register window, where each level of call got 8 local
registers and could also see the local sets of the two higher calls. It
seemed like a good idea at the time. I never dealt with it directly so
I don't know what nonobvious issues it had, if any.

https://en.wikipedia.org/wiki/SPARC

Albert van der Horst

unread,
Oct 7, 2016, 3:13:49 AM10/7/16
to
In article <87k2dkw...@nightsong.com>,
The most prestigious recent example of a register window is the Itanium.
It is not clear to me whether they fell prey to the success of Intel/Windows
or whether there are legitimate technical difficulties.

So much is sure: you don't hear a lot about them.
I had the opportunity to lay hands on transputer and Dec Alpha
hardware when they were on their way back.
I'd love to try Forth on a register window machine like the Itanium.

Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Lars Brinkhoff

unread,
Oct 7, 2016, 3:35:57 AM10/7/16
to
Albert van der Horst writes:
> I'd love to try Forth on a register window machine like the Itanium.

It shouldn't be hard to get an ESP8266, but I don't know if its CPU
implements the register window option. The ESP32 seems to have it.

Anton Ertl

unread,
Oct 7, 2016, 3:38:59 AM10/7/16
to
Paul Rubin <no.e...@nospam.invalid> writes:
>rickman <gnu...@gmail.com> writes:
>This sounds like a register window architecture, e.g. SPARC. They were
>in vogue in workstation processors for a while, but then went out of
>favor, I think because of difficulties making them superscalar or
>something like that.

Their inclusion in IA-64 is counterevidence for this theory.

My guess is that it did not buy enough in the benchmarks, in
particular SPEC CPU (but I have my doubts that the benchmarks are
representative of real applications in this area).

Anton Ertl

unread,
Oct 7, 2016, 3:56:10 AM10/7/16
to
rickman <gnu...@gmail.com> writes:
>On 10/5/2016 7:32 PM, hughag...@gmail.com wrote:
>> I was thinking of having multiple versions of each instruction, each
>> with a different configuration in regard to how many of the top stack
>> items (0, 1 or 2) are expected in registers and how many (0, 1 or 2)
>> are returned in registers.
>
>That would need to be supported by the CPU directly, no? In essence the
>CPU would have to know how many stack items are in registers and call
>the appropriate microcode to execute the appropriate primitive.
>Otherwise how would the compiler know what to expect when a word is
>compiled into another word?

If you do it in the compiler (static stack caching), you typically
have a canonical stack state, and force the stack representation into
this state at the start of a definition and at other places where you
would otherwise get problems. Gforth returns to the canonical state
(one stack item in a register) at every basic block boundary (branch,
branch target). For straight-line code, the compiler always knows in
which state the previous VM instruction ended and can compile the
appropriate version of the current VM instruction. For more on that,
read [ertl&gregg05].

If you do it in the VM interpreter (dynamic stack caching), every VM
instruction knows at which state it ends, and jumps to the version of
the next VM instruction for that state (similar to what you describe
for a microcoded CPU). In that case the compiler does not need to be
aware of stack caching. For more on that, read [ertl95pldi].
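
As a rough illustration of the dynamic method, here is a made-up toy
interpreter in C (not Gforth's actual code; the opcodes and names are
invented). There are two cache states: S0 (everything in memory) and S1
(the top item cached in the C variable `tos`); each instruction exists once
per state and jumps to the copy of the next instruction that matches the
state it ends in.

/* Toy interpreter sketch of dynamic stack caching with two cache states. */
#include <stdio.h>

enum { OP_LIT, OP_ADD, OP_PRINT, OP_HALT };

int main(void)
{
    int prog[] = { OP_LIT, 3, OP_LIT, 4, OP_ADD, OP_PRINT, OP_HALT };
    int stack[64];
    int *sp = stack;       /* points to the first free cell */
    int *ip = prog;
    int tos = 0;           /* only meaningful in state S1 */

s0: /* state S0: nothing cached */
    switch (*ip++) {
    case OP_LIT:   tos = *ip++;                      goto s1;
    case OP_ADD:   tos = sp[-2] + sp[-1]; sp -= 2;   goto s1;
    case OP_PRINT: printf("%d\n", sp[-1]); sp--;     goto s0;
    case OP_HALT:  return 0;
    }

s1: /* state S1: the top of stack lives in `tos` */
    switch (*ip++) {
    case OP_LIT:   *sp++ = tos; tos = *ip++;         goto s1;
    case OP_ADD:   tos += *--sp;                     goto s1; /* result stays in the register */
    case OP_PRINT: printf("%d\n", tos);              goto s0;
    case OP_HALT:  return 0;
    }
    return 0;
}

Note how OP_ADD in state S1 never writes its result to memory; that is where
the savings come from. The static method keeps the same bookkeeping, but in
the compiler instead of in duplicated interpreter code.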

@InProceedings{ertl95pldi,
author = "M. Anton Ertl",
title = "Stack Caching for Interpreters",
crossref = "sigplan95",
pages = "315--327",
url = "http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",
abstract = "An interpreter can spend a significant part of its
execution time on arguments of virtual machine
instructions. This paper explores two methods to
reduce this overhead for virtual stack machines by
caching top-of-stack values in (real machine)
registers. The {\em dynamic method} is based on
having, for every possible state of the cache, one
specialized version of the whole interpreter; the
execution of an instruction usually changes the
state of the cache and the next instruction is
executed in the version corresponding to the new
state. In the {\em static method} a state machine
that keeps track of the cache state is added to the
compiler. Common instructions exist in specialized
versions for several states, but it is not necessary
to have a version of every instruction for every
cache state. Stack manipulation instructions are
optimized away."
}
@Proceedings{sigplan95,
booktitle = "SIGPLAN Conference on Programming Language
Design and Implementation (PLDI'95)",
title = "SIGPLAN Conference on Programming Language
Design and Implementation (PLDI'95)",
year = "1995",
key = "PLDI '95"
}

@InProceedings{ertl&gregg05,
author = {M. Anton Ertl and David Gregg},
title = {Stack Caching in {Forth}},
crossref = {euroforth05},
pages = {6--15},
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz},
pdfurl = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26gregg05.pdf},
OPTnote = {not refereed},
abstract = {Stack caching speeds Forth up by keeping stack items
in registers, reducing the number of memory accesses
for stack items. This paper describes our work on
extending Gforth's stack caching implementation to
support more than one register in the canonical
state, and presents timing results for the resulting
Forth system. For single-representation stack
caches, keeping just one stack item in registers is
usually best, and provides speedups up to a factor
of 2.84 over the straight-forward stack
representation. For stack caches with multiple stack
representations, using the one-register
representation as canonical representation is
usually optimal, resulting in an overall speedup of
up to a factor of 3.80 (and up to a factor of 1.53
over single-representation stack caching).}
}
@Proceedings{euroforth05,
title = {21st EuroForth Conference},
booktitle = {21st EuroForth Conference},
year = {2005},
key = {EuroForth'05},
editor = {M. Anton Ertl},
url = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/proceedings.pdf}
}

Paul Rubin

unread,
Oct 7, 2016, 4:09:20 AM10/7/16
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> Their inclusion in IA-64 is counterevidence for this theory.
> My guess is that it did not buy enough in the benchmarks,

SPARC was one of the faster processors around in its early days, but
other designs got way ahead of it with comparable die sizes and
processes. I have the impression that the register window scheme was
the culprit one way or another. Today you can buy expensive recent
SPARC chips that are still way slower than Intel, Power8, etc. I don't
know how Itanic does compared to those. But I figure there's reasons
nobody uses it.

Andrew Haley

unread,
Oct 7, 2016, 9:18:58 AM10/7/16
to
The decision to go with register windows was based on a bunch of
benchmarks which seemed to indicate a performance improvement over
something more conventional. But the benchmarks weren't
representative enough, and the fixed size of the windows was
problematic because registers are left unused. In the end, register
windows are a loss: sophisticated register allocators can generate
better code which uses fewer physical registers.

Andrew.

Walter Banks

unread,
Oct 7, 2016, 10:52:57 AM10/7/16
to
Better code, yes, from register allocators. Part of this is a delusion as
well. The goal was often faster context switches, and for various
reasons that didn't materialize. RAM fragmentation became a problem.
For all the promising generalizations and promises, register windows
didn't work very well when it came to the details. The idea has been around
since at least the 70's with the TI9900.

w..

Anton Ertl

unread,
Oct 7, 2016, 12:50:41 PM10/7/16
to
Paul Rubin <no.e...@nospam.invalid> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Their inclusion in IA-64 is counterevidence for this theory.
>> My guess is that it did not buy enough in the benchmarks,
>
>SPARC was one of the faster processors around in its early days, but
>other designs got way ahead of it with comparable die sizes and
>processes. I have the impression that the register window scheme was
>the culprit one way or another.

And at some point the 68000 fell back and the 88000 fell back, and at
some point MIPS fell back, and at some point Alpha fell back, and at
some point Motorola's PowerPCs fell back, and they all had no register
windows. OTOH, Fujitsu's SPARC64 CPUs were doing comparatively well
AFAIK, to the point that Oracle sold them and did SPEC CPU submissions
with them (compare
<https://www.spec.org/cpu2006/results/res2011q2/cpu2006-20110328-15288.html>
and
<https://www.spec.org/cpu2006/results/res2011q2/cpu2006-20110328-15270.html>).

Intel CPUs now have ~200 or so physical registers, and accessing them
is more complicated than register window access (admittedly, they
would have to do register window access in addition to the register
renaming), yet they manage to get this to fly, so I expect that Intel
would also get register windows to fly if they put their minds to it.

I think the problem with most RISC architectures was that they were
relatively easy to develop at first, and their designers were not
prepared for dealing with the complexity that superscalar out-of-order
designs entail. So the designs were delayed, canceled, or slow.
Alpha with their VAX legacy were better prepared for that, but
apparently eventually ran out of money because of low revenue.

>I don't
>know how Itanic does compared to those. But I figure there's reasons
>nobody uses it.

IA-64 implementations have been relatively slow AFAICT; they have not
submitted results since 2010, so it's unclear how well the recent ones
do (it seems both Intel and HP only do these for the existing customer
base these days), which is a pity, because Poulson looks promising (at
least it should do quite a bit better than the earlier ones). The
IA-64 problem seems to be overall complexity and lack of commitment.
If Intel had been committed to it just as it was to IA-32 in the dark
386 times, they would have worked out how to make it fly over the
course of a few generations, but they just did minor updates of the
McKinley for a decade; of course, unlike for IA-32, there was little
software base for IA-64, and therefore little incentive to keep on
improving it.

Anyway, if anybody wants to play with IA-64, I looked around some time
ago, and you can get a second-hand Itanium II system complete with
memory for relatively little (IIRC starting at around EUR200).

Anton Ertl

unread,
Oct 7, 2016, 1:30:09 PM10/7/16
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Paul Rubin <no.e...@nospam.invalid> wrote:
>> The SPARC had a 24-register window, where each level of call got 8 local
>> registers and could also see the local sets of the two higher calls.

Not quite. When shifting the window, it is shifted by 16 registers.
You typically shift in one direction on call, and in the other on
return. So the callee (B) sees 8 of the registers that the caller (A)
saw, and the caller uses these to pass parameters; if B wants to
make a call itself, it puts the parameters in the 8 registers at the
other end of the window; the 8 registers in between can only be used
for local data. The remaining 8 (or 7?) registers were not managed in
windows.
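
For anyone who wants to see the overlap spelled out, here is a simplified C
model (mine, not the SPARC manual's semantics: no global registers, no
overflow/underflow traps, no spilling to memory). Architectural indices 0-7
are the outs, 8-15 the locals, 16-23 the ins; a save shifts the window by 16,
so the caller's outs alias the callee's ins.

/* Simplified model of SPARC-style register windows. */
#include <stdio.h>

#define NWINDOWS 8
#define WINSHIFT 16

static int rf[NWINDOWS * WINSHIFT];   /* the physical windowed registers */
static int cwp = 0;                   /* current window pointer */

static int *reg(int idx)              /* architectural -> physical mapping */
{
    return &rf[(cwp * WINSHIFT + idx) % (NWINDOWS * WINSHIFT)];
}

static void save(void)    { cwp = (cwp + NWINDOWS - 1) % NWINDOWS; }
static void restore(void) { cwp = (cwp + 1) % NWINDOWS; }

int main(void)
{
    *reg(0) = 42;      /* caller puts an argument in its first out register */
    save();            /* the call shifts the window by 16 */
    printf("callee reads %d from its first in register\n", *reg(16));
    restore();         /* the return shifts it back; the caller's registers are intact */
    return 0;
}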

>The decision to go with register windows was based on a bunch of
>benchmarks which seemed to indicate a performance improvement over
>something more conventional. But the benchmarks weren't
>representative enough

Representative of what? They were not representative of the SPEC CPU
benchmarks as typically used; but the SPEC CPU benchmarks as typically
used are not at all representative of most applications, in particular
wrt call overhead. So it might be that the benchmarks that showed
register windows to be a win are more representative of applications
than those that didn't.

>and the fixed size of the windows was
>problematic because registers are left unused. In the end, register
>windows are a loss: sophisticated register allocators can generate
>better code which uses fewer physical registers.

Except that these sophisticated register allocators never materialized
in the real world. When I compile Gforth on MIPS or Alpha and use
more than one register for stack items, gcc spills some VM registers;
now gcc's register allocation is ancient, but last I heard, someone
tried replacing it with something more sophisticated, and found that
the result produced slower code.

Ok, Gforth may be a special case, so let's look at other stuff:
Applications that are dynamically linked (SPEC CPU benchmarks usually
are statically linked for performance) and use polymorphic calls (no
interprocedural register allocation for that, but fortunately rare in
SPEC CPU); static compiler sophistication will not help for these
cases.

The IA-64 designers had too much rather than too little faith in
sophisticated compilers, yet they included the register stack
(something like register windows, but without fixed shifts, avoiding
the waste "problem"); so they certainly did not buy the "sophisticated
register allocators can generate better code" argument.

Cecil Bayona

unread,
Oct 7, 2016, 1:33:02 PM10/7/16
to
In the early 70's I designed and built my own CPU. It had register
windows: 32 32-register windows. The purpose was speedy interrupt
handling: no registers needed saving to answer an interrupt, which made
it fast and predictable. It could also be used in ordinary programming, as
the call and return instructions had an option to change register banks.
It was handy in a few places, but it was mainly an interrupt feature.
--
Cecil - k5nwa

jim.bra...@ieee.org

unread,
Oct 7, 2016, 1:51:19 PM10/7/16
to
]>The ESP32 seems to have it.

FYI
The Xtensa ISA reference manual shows register windows as optional.
Also, via the ENTRY instruction, register windows can have 4, 8 or 12 registers as opposed to the 8 register window on SPARC.

Cecil Bayona

unread,
Oct 7, 2016, 2:21:10 PM10/7/16
to
That CPU family is insane, with so many features; if I remember correctly you can buy a
board for under $10. I skimmed over some of the literature and it seems
the whole family is designed so you can have a custom CPU made with the
features and even extra instructions one wants.

Any Forth available (dream on) or even a C compiler?
--
Cecil - k5nwa

Lars Brinkhoff

unread,
Oct 7, 2016, 2:41:00 PM10/7/16
to
Cecil Bayona <cba...@cbayona.com> writes:
> That CPU family is insane, so many features, if I remember you can buy
> a board for under $10.

Yes, it seems interesting!

> Any Forth available (dream on)

I know about three:

https://github.com/niclash/forthright
https://github.com/zeroflag/punyforth
https://github.com/CraigLindley/ESP8266Forth

> or even a C compiler?

https://github.com/jcmvbkbc/gcc-xtensa

Albert van der Horst

unread,
Oct 7, 2016, 2:44:33 PM10/7/16
to
In article <nt8p26$cuc$1...@dont-email.me>,
Cecil Bayona <cba...@cbayona.com> wrote:
>That CPU family is insane, with so many features; if I remember correctly you can buy a
>board for under $10. I skimmed over some of the literature and it seems
>the whole family is designed so you can have a custom CPU made with the
>features and even extra instructions one wants.

I puzzled for 5 minutes to find out what CPU family you could be talking
about. Then I scrolled down. Please don't top post.

>
>Any Forth available (dream on) or even a C compiler?
>
>On 10/7/2016 2:35 AM, Lars Brinkhoff wrote:
>> Albert van der Horst writes:
>>> I'd love to try Forth on a register window machine like the Itanium.
>>
>> It shouldn't be hard to get an ESP8266, but I don't know if its CPU
>> implements the register window option. The ESP32 seems to have it.
>>
>
>--
>Cecil - k5nwa

Rod Pemberton

unread,
Oct 7, 2016, 11:04:40 PM10/7/16
to
It would be useful if you could move the pointer from its position at the
top of the stack, while guaranteeing that items now above the new
pointer location were not corrupted. E.g., if you wanted to swap the
3rd and 4th items, you could:

1) change data stack pointer to point to 3rd item
2) "protect" the items between new and old position, i.e.,
"guard" the 1st and 2nd items
3) SWAP, which would swap the 3rd and 4th items
4) restore data stack pointer
5) remove "guard" protecting 1st and 2nd items from being overwritten

I noticed this type of situation come up both in Forth code and C code
converted to Forth. This would make all data stack operations
independent of the current stack position. I'm just not sure how to do
it more effectively than moving items to the return stack. Guarding
the data would be the hard part in software, I think, without needing
to use two block moves of data, which would slow this down. Swapping
the stack pointer is quick and easy.
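
As a rough sketch of steps 1-5 in C (the data stack modeled as a plain array
growing upward; the names are invented for illustration, this is not from any
existing Forth implementation):

/* Toy sketch of the "move the pointer, guard the cells above" idea. */
#include <stdio.h>

int main(void)
{
    int stack[16] = { 40, 30, 20, 10 };   /* 10 is TOS, 40 is the 4th item */
    int *base = stack;
    int *sp = stack + 4;                  /* first free cell */

    /* steps 1+2: retarget the pointer two cells down; the two cells between
       the new and old pointer are the "guarded" region that the following
       operations must not touch */
    int *saved_sp = sp;
    sp -= 2;                              /* the old 3rd item now looks like TOS */

    /* step 3: an ordinary SWAP on what now looks like the top two items */
    int t = sp[-1];
    sp[-1] = sp[-2];
    sp[-2] = t;

    /* steps 4+5: restore the pointer; the guarded 1st and 2nd items were
       never written, so nothing has to be copied back */
    sp = saved_sp;

    for (int i = (int)(sp - base) - 1; i >= 0; i--)
        printf("%d ", base[i]);           /* prints: 10 20 40 30 */
    printf("\n");
    return 0;
}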


Rod Pemberton

jim.bra...@ieee.org

unread,
Oct 8, 2016, 12:56:11 AM10/8/16
to
]> 3) SWAP, which would swap the 3rd and 4th items

Only way to do such a generalized SWAP in one clock cycle is to have even and odd location stack memory pairs? For an FPGA implementation, this means either two LUT RAM register files or two block RAM stack memories?
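
A toy C model of the even/odd banking idea (my reading of it, not a real
design): adjacent stack cells always differ in their low address bit, so the
two cells involved in a SWAP of neighbouring items land in different banks
and each bank sees exactly one read and one write.

/* Toy model of an even/odd-banked stack memory. */
#include <stdio.h>

#define BANKCELLS 16
static int even_bank[BANKCELLS], odd_bank[BANKCELLS];
static int depth = 0;                  /* number of items on the stack */

static int *cell(int i)                /* the low address bit selects the bank */
{
    return (i & 1) ? &odd_bank[i >> 1] : &even_bank[i >> 1];
}

static void push(int x) { *cell(depth++) = x; }

/* swap the items d and d+1 cells below TOS (d = 0 swaps TOS and NOS) */
static void swap_at(int d)
{
    int i = depth - 1 - d, j = i - 1;
    int a = *cell(i), b = *cell(j);    /* one read per bank */
    *cell(i) = b;                      /* one write per bank */
    *cell(j) = a;
}

int main(void)
{
    push(40); push(30); push(20); push(10);   /* 10 is TOS, 40 is the 4th item */
    swap_at(2);                               /* swap the 3rd and 4th items */
    for (int i = depth - 1; i >= 0; i--)
        printf("%d ", *cell(i));              /* prints: 10 20 40 30 */
    printf("\n");
    return 0;
}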

Jim Brakefield

rickman

unread,
Oct 8, 2016, 1:12:02 AM10/8/16
to
If there were no registers to save on a context switch, where did you
save the register window pointer? Was this inherent in the interrupt
you were running at the time? Where did you save the info on which
interrupt you were running?
--

Rick C

Walter Banks

unread,
Oct 8, 2016, 7:23:54 AM10/8/16
to
This is an accurate description of a very real problem. In both code
generators and ISA projects I have worked on, dealing with the data stack
is a problem because of fundamental assumptions. A lot of the ISA's out
there make the assumption that if a stack pointer is created that looks a
lot like an index register with a rich collection of math operations,
then everything will just work out. As your simple example points out,
there are really two problems.

1) There is a real need for two pointers for a stack: one, the simple
pointer that indexes where data is pushed onto and popped off the stack,
and a second that carries with it the TOP and NEXT items to operate on.
Combining the two, as most systems do, is always a compromise.

2) Operations on the stack have different ISA requirements than index
register operations on global variables. This difference starts with the
role of the accumulator: is the primary accumulator the TOP of the stack
or a separate AC? In your example above the AC needs to be TOP.

From metrics (assuming appropriate code generation) we have shown that
the TOP can always be the system AC without penalty, but for stack
operations an AC separate from the top of stack can be expensive.

Going back to your example, there is the overhead of moving the register
window, by whatever mechanism, somewhere onto the stack. ISA
developers have tried a lot of approaches. Stack indexing - address
overhead. Pointer-based addressing - set-up overhead. Ivan Godard's belt -
reducing the offset address size by retaining most intermediate
results.

w..





Andrew Haley

unread,
Oct 9, 2016, 9:34:43 AM10/9/16
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Paul Rubin <no.e...@nospam.invalid> wrote:
>>> The SPARC had a 24-register window, where each level of call got 8 local
>>> registers and could also see the local sets of the two higher calls.
>
> Not quite. When shifting the window, it is shifted by 16 registers.
> You typically shift in one direction on call, and in the other on
> return. So the callee (B) sees 8 of the registers that the caller (A)
> saw, and the caller uses these to pass parameters; if B wants to
> make a call itself, it puts the parameters in the 8 registers at the
> other end of the window; the 8 registers in between can only be used
> for local data. The remaining 8 (or 7?) registers were not managed in
> windows.
>
>>The decision to go with register windows was based on a bunch of
>>benchmarks which seemed to indicate a performance improvement over
>>something more conventional. But the benchmarks weren't
>>representative enough
>
> Representative of what?

AIUI, of much of the real-world code which is run on SPARC processors.

> They were not representative of the SPEC CPU benchmarks as typically
> used; but the SPEC CPU benchmarks as typically used are not at all
> representative of most applications, in particular wrt call
> overhead. So it might be that the benchmarks that showed register
> windows to be a win are more representative of applications than
> those that didn't.

:-)

>>and the fixed size of the windows was problematic because registers
>>are left unused. In the end, register windows are a loss:
>>sophisticated register allocators can generate better code which
>>uses fewer physical registers.
>
> Except that these sophisticated register allocators never
> materialized in the real world.

I guess that depends on whether your glass is half empty or half full.
There's no getting around register allocation being NP-complete, but
IME approximate algorithms run in reasonable time on reasonably small
programs. With some notable exceptions, of course.

> When I compile Gforth on MIPS or Alpha and use more than one
> register for stack items, gcc spills some VM registers; now gcc's
> register allocation is ancient, but last I heard, someone tried
> replacing it with something more sophisticated, and found that the
> result produced slower code.

That was some time ago. GCC's old register allocator has now
successfully been replaced.

> Ok, Gforth may be a special case,

Probably, for reasons discussed here before.

> so let's look at other stuff: Applications that are dynamically
> linked (SPEC CPU benchmarks usually are statically linked for
> performance) and use polymorphic calls (no interprocedural register
> allocation for that, but fortunately rare in SPEC CPU); static
> compiler sophistication will not help for these cases.

Well, sure: an optimizing compiler can't optimize beyond its
"optimization horizon": the amount of code that it can see. That's
obvious, surely. You can't optimize what you can't see.

> The IA-64 designers had too much rather than too little faith in
> sophisticated compilers, yet they included the register stack
> (something like register windows, but without fixed shifts, avoiding
> the waste "problem"); so they certainly did not buy the "sophisticated
> register allocators can generate better code" argument.

The register stack isn't on the face of it such a terrible idea, but
it's certainly questionable whether it's the best way to use the
silicon. But the IA-64 does not, as you note, have fixed-size
windows, which are the real Achilles' heel of the SPARC. And the
reason that architecture hasn't much been imitated.

Andrew.

Anton Ertl

unread,
Oct 9, 2016, 12:29:14 PM10/9/16
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>Paul Rubin <no.e...@nospam.invalid> wrote:
>>>> The SPARC had a 24-register window, where each level of call got 8 local
>>>> registers and could also see the local sets of the two higher calls.
>>
>> Not quite. When shifting the window, it is shifted by 16 registers.
>> You typically shift in one direction on call, and in the other on
>> return. So the callee (B) sees 8 of the registers that the caller (A)
>> saw, and the caller uses these to pass parameters; if B wants to
>> make a call itself, it puts the parameters in the 8 registers at the
>> other end of the window; the 8 registers in between can only be used
>> for local data. The remaining 8 (or 7?) registers were not managed in
>> windows.
>>
>>>The decision to go with register windows was based on a bunch of
>>>benchmarks which seemed to indicate a performance improvement over
>>>something more conventional. But the benchmarks weren't
>>>representative enough
>>
>> Representative of what?
>
>AIUI, of much of the real-world code which is run on SPARC processors.

Do you have any evidence for this claim?

>>>and the fixed size of the windows was problematic because registers
>>>are left unused. In the end, register windows are a loss:
>>>sophisticated register allocators can generate better code which
>>>uses fewer physical registers.
>>
>> Except that these sophisticated register allocators never
>> materialized in the real world.
>
>I guess that depends on whether your glass is half empty or half full.
>There's no getting around register allocation being NP-complete, but
>IME approximate algorithms run in reasonable time on reasonably small
>programs.

Sure, but that does nothing to address the issues that register
windows address. A function that calls another function and has, say,
6 registers live across the call, needs to save 6 registers and
restore 6 registers somewhere (either save callee-saved registers on
entry and exit, or save caller-saved registers around the call).
Register windows reduce that overhead. In theory, one can use
interprocedural register allocation to reduce these overheads. In
practice, one does not do that (except when compiling SPEC CPU
benchmarks).

>GCC's old register allocator has now
>successfully been replaced.

Since which version?

>> so let's look at other stuff: Applications that are dynamically
>> linked (SPEC CPU benchmarks usually are statically linked for
>> performance) and use polymorphic calls (no interprocedural register
>> allocation for that, but fortunately rare in SPEC CPU); static
>> compiler sophistication will not help for these cases.
>
>Well, sure: an optimizing compiler can't optimize beyond its
>"optimization horizon": the amount of code that it can see. That's
>obvious, surely. You can't optimize what you can't see.

But register windows work even for such calls.

>> The IA-64 designers had too much rather than too little faith in
>> sophisticated compilers, yet they included the register stack
>> (something like register windows, but without fixed shifts, avoiding
>> the waste "problem"); so they certainly did not buy the "sophisticated
>> register allocators can generate better code" argument.
>
>The register stack isn't on the face of it such a terrible idea, but
>it's certainly questionable whether it's the best way to use the
>silicon. But the IA-64 does not, as you note, have fixed-size
>windows, which are the real Achilles' heel of the SPARC.

I doubt that it's the Achilles' heel. Sure, variable shifts would
have made better use of the physical registers, but as long as the
number of overflows/underflows is small, that makes little difference.
And the number of physical registers was selected such that the number
of overflows was small in practice.

The Achilles' heel of SPARC was slow implementations.

>And the
>reason that architecture hasn't much been imitated.

I think the paper that claimed that interprocedural register
allocation makes register windows unnecessary played a big role.

hughag...@gmail.com

unread,
Oct 9, 2016, 7:23:01 PM10/9/16
to
On Friday, October 7, 2016 at 7:52:57 AM UTC-7, Walter Banks wrote:
> The idea has been around
> since at least the 70's with the TI9900.

Mark Wills is our big TI9900 guy around here.

Is the TI9900 still used at all? I have read that it was the best processor of its era --- I have also read however that the TI99/4A was the worst computer of its era --- somewhat like putting a race-car engine in a go-kart.

All of this talk about register allocation is somewhat interesting, but I'm taking a much simpler approach in Straight Forth. Note that a "number" in Straight Forth is a 64-bit fixed-point with an assumed unity of 2^32, which is why I refer to a "fractional part" later. Here is an excerpt from my document:

-----------------------------------------------------------------------------

In Straight Forth we have four user registers: PTR CNT REG and ACC. Note that PTR and CNT can only hold integers (the fractional part is zero'd) --- PTR and CNT are typically used together to describe an array and the term PC refers to this register pair.
...
REG and ACC are general-purpose registers (I haven't studied cmForth, but have read that it had a register like this too).

PTR and CNT can't have a fractional part, so they aren't general-purpose. For the most part, PTR and CNT are used for passing data into and back out of HOFs. A HOF is a "higher-order function" that executes an xt (typically a quotation). In functions that don't call any HOFs, the PTR and CNT user registers can be used for other purposes --- this isn't idiomatic however --- iteration is done in HOFs and so calling HOFs is pretty common (the PTR and CNT user registers were provided to help boost the speed of iteration).

The user registers are always saved and restored upon entrance and exit of any function or quotation. It is possible for a calling function to pass data into a sub-function in the user registers, but it is not possible for the sub-function to return data to the calling function in the user registers. The sub-function saves the user registers upon entrance but it doesn't initialize them, so they still have the calling function's values inside of the sub-function. The sub-function restores the user registers upon exit though, so when the calling function resumes the user register values are unchanged. The user registers are like local variables (rather than global variables) in that they can be used in a function and they won't get clobbered by a sub-function. The user registers are different from local variables however in that they can be used to pass data into sub-functions (by comparison, sub-functions don't have access to calling functions' local variables).

The exception to the above rule is that a HOF (higher-order function) does not save and restore PTR and CNT (it does save and restore REG and ACC). A HOF is defined with HOF: rather than colon, but is otherwise just like a colon function. A HOF can pass data back up to the calling function in PTR and CNT. In Straight Forth it is idiomatic that all iteration should be done inside of HOFs. The programmer generally does not use BEGIN WHILE REPEAT anywhere except inside of HOFs. A function that calls a HOF can use REG and ACC like local variables (REG and ACC won't get clobbered by the HOF), but PTR and CNT will be used for passing data into and out of the HOF (they will get clobbered by the HOF). If the HOF doesn't pass any data back up to the calling function in PTR and CNT, but just restores PTR and CNT, then the HOF can be defined with colon rather than with HOF: because colon restores these automatically.

When calling a HOF, PC is first set to describe the data-structure that will be traversed. Some HOFs traverse the entire data-structure, in which case PC will be restored. Other HOFs traverse the data-structure until the executed xt returns a TRUE, at which time they early-exit. In this case, the HOF will return on the data-stack some information about what caused the early-exit. The HOF will also return PC set to describe the remainder of the data-structure that hasn't been traversed yet. Oftentimes the HOF will be called repeatedly until the entire data-structure has been traversed. Because the HOF returns PC set to describe the remainder, the HOF can be called again using this PC to search the remainder of the data-structure --- this can be done repeatedly until the entire data-structure has been searched.

A novice-level Straight Forth programmer should only use the built-in HOFs; he should not write his own HOFs and he should not use BEGIN WHILE REPEAT (iteration is a common source of bugs, which is why novices are discouraged from using BEGIN WHILE REPEAT). An intermediate-level Straight Forth programmer may write his own HOFs and use BEGIN WHILE REPEAT inside of them, but should not use BEGIN WHILE REPEAT anywhere except inside of his HOFs, and he should not write very many HOFs. It is idiomatic for all Straight Forth programmers from novice to advanced to program in a novice-level manner (using the built-in HOFs) as much as possible because this makes the programs easy to read by others who may be novice-level.

The built-in data-structures are the array and the chain (an ordered container with a string as a key). These should be adequate for 99% of all Straight Forth programs. In rare circumstances it will be necessary to use some other data-structure, in which case HOFs will have to be written to support that data-structure (a HOF is used to traverse a data-structure).

Straight Forth programmers should not introduce new data-structures unless necessary, because doing so makes their programs hard to read, but they should stick with novice-level techniques (using the two built-in data-structures) as much as possible --- the programmers should not strive to prove that they are intermediate-level or (woo hoo!) advanced-level --- the goal with Straight Forth is to write programs that do something useful and which are easy to read by everybody including novices.

rickman

unread,
Oct 10, 2016, 1:49:27 AM10/10/16
to
On 10/9/2016 7:22 PM, hughag...@gmail.com wrote:
> On Friday, October 7, 2016 at 7:52:57 AM UTC-7, Walter Banks wrote:
>> The idea has been around since at least the 70's with the TI9900.
>
> Mark Wills is our big TI9900 guy around here.
>
> Is the TI9900 still used at all? I have read that it was the best
> processor of its era --- I have also read however that the TI99/4A
> was the worst computer of its era --- somewhat like putting a
> race-car engine in a go-kart.

Its era included the PDP-11 computer line and LSI-11 computer chips. I
think trying to claim some sort of superiority of one line of computer
designs over another is a bit pointless. How do you measure "best"?
The TI-990 was a reasonably popular mini-computer line, but never had
overwhelming sales. The TMS9900 chip came in a humongous 64 pin package
and required a 4 phase clock, not entirely different from many of the
first CPUs like the 8080. But the 9900 had exactly one significant
successor, the 9995 which had an 8 bit data bus. All in all this
product line didn't get much traction other than in TI's own computers.

I can't imagine you can buy any 9900 type chips other than as HDL in an
FPGA. I've got a 9900 computer board I modified to use 2716 EPROMs
rather than the old 2708 EPROMs. I also built a 9995 board with some 8
kB of fast RAM and one EPROM. Both are still around somewhere, haven't
touched them in years.

--

Rick C

Andrew Haley

unread,
Oct 10, 2016, 4:26:01 AM10/10/16
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>>The decision to go with register windows was based on a bunch of
>>>>benchmarks which seemed to indicate a performance improvement over
>>>>something more conventional. But the benchmarks weren't
>>>>representative enough
>>>
>>> Representative of what?
>>
>>AIUI, of much of the real-world code which is run on SPARC processors.
>
> Do you have any evidence for this claim?

No.

>>>>and the fixed size of the windows was problematic because registers
>>>>are left unused. In the end, register windows are a loss:
>>>>sophisticated register allocators can generate better code which
>>>>uses fewer physical registers.
>>>
>>> Except that these sophisticated register allocators never
>>> materialized in the real world.
>>
>>I guess that depends on whether your glass is half empty or half full.
>>There's no getting around register allocation being NP-complete, but
>>IME approximate algorithms run in reasonable time on reasonably small
>>programs.
>
> Sure, but that does nothing to address the issues that register
> windows address. A function that calls another function and has, say,
> 6 registers live across the call, needs to save 6 registers and
> restore 6 registers somewhere (either save callee-saved registers on
> entry and exit, or save caller-saved registers around the call).
> Register windows reduce that overhead. In theory, one can use
> interprocedural register allocation to reduce these overheads.

Sure, but it's certainly not free with register windows either. Of
course if every call fits neatly in the register window that's a win,
but if they don't that's a loss.

> In practice, one does not do that (except when compiling SPEC CPU
> benchmarks).
>
>>GCC's old register allocator has now
>>successfully been replaced.
>
> Since which version?

I can't remember, and I don't have time to look it up right now.

>>> so let's look at other stuff: Applications that are dynamically
>>> linked (SPEC CPU benchmarks usually are statically linked for
>>> performance) and use polymorphic calls (no interprocedural register
>>> allocation for that, but fortunately rare in SPEC CPU); static
>>> compiler sophistication will not help for these cases.
>>
>>Well, sure: an optimizing compiler can't optimize beyond its
>>"optimization horizon": the amount of code that it can see. That's
>>obvious, surely. You can't optimize what you can't see.
>
> But register windows work even for such calls.

Sure, up to their intrinsic limitations caused by the fixed-size
windows.

>>The register stack isn't on the face of it such a terrible idea, but
>>it's certainly questionable whether it's the best way to use the
>>silicon. But the IA-64 does not, as you note, have fixed-size
>>windows, which are the real Achilles' heel of the SPARC.
>
> I doubt that it's the Achilles' heel. Sure, variable shifts would
> have made better use of the physical registers, but as long as the
> number of overflows/underflows is small, that makes little difference.

Indeed, but that's the big assumption: everything fits nicely in the
available register sets. When that happens, SPARC is fantastic: no
spills and fills, super-fast calls, etc. Even with somewhat inferior
manufacturing technology, if register windows are a substantial
architectural advantage they should have won, but didn't.

> And the number of physical registers was selected such that the
> number of overflows was small in practice.

That's true iff you're prepared to dedicate a lot of silicon to
register sets rather than something else. It's not free.

> The Achilles' heel of SPARC was slow implementations.

That too. In particular, the TI Viking was a profound disappointment
and came at a time when SPARC was under real threat from Intel et al.

Andrew.

Albert van der Horst

unread,
Oct 10, 2016, 6:28:09 AM10/10/16
to
In article <4f3794f4-58f2-47b7...@googlegroups.com>,
<hughag...@gmail.com> wrote:
<SNIP Ti 9900>
>-----------------------------------------------------------------------------
>
>In Straight Forth we have four user registers: PTR CNT REG and ACC. Note that
>PTR and CNT can only hold integers (the fractional part is zero'd) --- PTR and
>CNT are typically used together to describe an array and the term PC refers
>to this register pair.
<SNIP implementation details>

I can see that you've done a clever implementation (or is it the design phase?),
but not so much the motivation behind it.

>
>A novice-level Straight Forth programmer should only use the built-in HOFs;
>he should not write his own HOFs and he should not use BEGIN WHILE REPEAT
>(iteration is a common source of bugs, which is why novices are discouraged
<SNIP explanation>

If the motivation is to allow simplification on behalf of novices, you
would do well to give some simple examples of original Forth
versus Straight Forth.

Groetjes Albert

Cecil Bayona

unread,
Oct 10, 2016, 11:21:30 AM10/10/16
to
The device was slow by later standards but it never stood a chance;
programming-wise it was decent.

TI used every trick possible to cripple the CPU in the TI99/4;
unfortunately that is a horrible market ploy that most likely backfired
on them. Most likely designers who saw the performance of the TI99/4
were turned off before they looked at the device. You can still find the
chips and the TI99/4 on eBay for all it's worth.

Actually those EPROMs you mentioned are rather expensive; the much
larger sizes are way cheaper. The 1Mb 28c1000 parts are available
inexpensively and are prime parts from distributors. I recently bought
some 27C256 and 29C256 devices for CDP 1802 and Z80 projects, and they
cost more than the 28c1000 parts, which are 4X bigger and way faster. As
soon as I can find some decent wire-wrap 32 pin sockets I will buy some
of those chips and save some money on future projects even if I only use
part of the devices.

--
Cecil - k5nwa

Cecil Bayona

unread,
Oct 10, 2016, 11:29:20 AM10/10/16
to
I bought a SPARC machine a long time ago on eBay and was disappointed by
its performance; somehow I expected better. It was a CPU from the early
90's, so I did not use it much and got rid of it. I did not program it
in any fashion, and installed one of the BSD Linux flavors on it; at the
time I considered it a waste of money and time. At that time an old
Apple PowerPC just blew it away, also running BSD Linux.

--
Cecil - k5nwa

Anton Ertl

unread,
Oct 10, 2016, 1:01:41 PM10/10/16
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>I guess that depends on whether your glass is half empty or half full.
>>>There's no getting around register allocation being NP-complete, but
>>>IME approximate algorithms run in reasonable time on reasonably small
>>>programs.
>>
>> Sure, but that does nothing to address the issues that register
>> windows address. A function that calls another function and has, say,
>> 6 registers live across the call, needs to save 6 registers and
>> restore 6 registers somewhere (either save callee-saved registers on
>> entry and exit, or save caller-saved registers around the call).
>> Register windows reduce that overhead. In theory, one can use
>> interprocedural register allocation to reduce these overheads.
>
>Sure, but it's completely not free with regster windows either. Of
>course if every call fits neatly in the register window that's a win,
>but if they don't that's a loss.

If a function needs more registers, it's certainly not worse off than
if the machine has no register windows: When it calls another
function, it shifts the window before the call and shifts it back
after, and presto, 16 registers are saved that would have required 16
stores before the call and 16 loads after; apart from that, the
register allocator has to spill registers just as it would without
register windows; the register windows certainly are a win here.

>>>Well, sure: an optimizing compiler can't optimize beyond its
>>>"optimization horizon": the amount of code that it can see. That's
>>>obvious, surely. You can't optimize what you can't see.
>>
>> But register windows work even for such calls.
>
>Sure, up to their intrinsic limitations caused by the fixed-size
>windows.

What do you have in mind?

>>>The register stack isn't on the face of it such a terrible idea, but
>>>it's certainly questionable whether it's the best way to use the
>>>silicon. But the IA-64 does not, as you note, have fixed-size
>>>windows, which are the real Achilles' heel of the SPARC.
>>
>> I doubt that it's the Achilles' heel. Sure, variable shifts would
>> have made better use of the physical registers, but as long as the
>> number of overflows/underflows is small, that makes little difference.
>
>Indeed, but that's the big assumption

Not at all, because the physical register size can be (and has been)
chosen such that overflows/underflows are rare.

Apart from that, if the fixed shift had been such a big problem, they
could have added variable shifts in later revisions of the
architecture (in particular in the move to 64 bits), but they chose
not to; so apparently it was not a big problem, much less an Achilles'
heel.

>Even with somewhat inferior
>manufacturing technology, if register windows are a substantial
>architectural advantage they should have won, but didn't.

Superior computer architecture always wins? Maybe on another world.
In this world, architectures with lots of software win.

Anyway, back to register windows:

Are they a substantial advantage for statically-linked SPEC CPU
benchmarks? Obviously not, otherwise more architectures would have
adopted them.

Are they a small advantage for real-world applications (dynamically
linked, with polymorphic calls)? I think so, and apparently the IA-64
architects thought so, too.

Do they justify the additional silicon? If it was a justifiable use
of silicon in 1986, it certainly is justifiable now.

Do they justify the additional design complexity (with possible
knock-on effects on clock rate and time-to-market)? The SPARC,
AMD29K, and IA-64 architects thought so, the others didn't. It's
probably not a clear-cut issue.

>> And the number of physical registers was selected such that the
>> number of overflows was small in practice.
>
>That's true iff you're prepared to dedicate a lot of silicon to
>register sets rather than something else. It's not free.

Somehow the CPU people ran out of things that they need silicon for
some years ago, so now they give us more than one core per CPU chip.
Interestingly, SPARC implementations are pretty far along on the
multi-core game: SPARC64 X (2012) has 16 cores (with 2 threads each)
on a chip, i.e., 32 instances of the register set; actually, already
the UltraSPARC T1 from 2005 has 8 cores with 4 threads each, again
needing 32 instances of the register set, and the SPARC T3 and T5 have
16 cores with 8 threads each, for 128 instances of the register set.

Would they have gone for more cores and more threads if they did not
have such big register sets? Maybe, but no other server CPUs with
more threads come to my mind, so probably not.

If register windows were really as much of a problem as you make them out
to be, they could have been obsoleted over time: First, let compilers
produce code without window shifts, then reduce the number of physical
registers to the allowed minimum, and eventually remove register
windows support, especially in the 32-bit to 64-bit transition; look
at what ARM did with shifts and conditional instructions in the move
to 64 bits (or AMD with segments in the move to 64 bits).

Lars Brinkhoff

unread,
Oct 10, 2016, 2:23:06 PM10/10/16
to
Anton Ertl wrote:
> Do [register windows] justify the additional design complexity (with
> possible knock-on effects on clock rate and time-to-market). The
> SPARC, AMD29K, and IA-64 architects thought so, the others didn't.

Also the Xtensa architecture, which can shift its 16-register window by
4, 8, or 12. Tensilica was founded in 1997.

Bernd Paysan

unread,
Oct 10, 2016, 9:02:05 PM10/10/16
to
Am Mon, 03 Oct 2016 07:00:39 -0700 schrieb foxaudioresearch:

> You can study some of the current Forth style stack machines like B16 or
> J1 to see what others have done.

b16's stack sizes are simply configurable in powers of two. I used
different sizes for different projects; but either 8 or 16 usually were
"big enough" for my projects.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o ID: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
http://bernd-paysan.de/