
Forth CPUs with NO Interrupts


rickman

Aug 26, 2015, 12:53:18 AM

The register thread made me realize that one of the greatest
accomplishments of the GA144 is not any of the typically quoted features
of the inter-processor communications or the small CPU size. It is the
fact that programming the processor is made so much simpler by not
having *interrupts*. Interrupts make the design of any processor
significantly more complex, not to mention the software.

I think there was a discussion in one of the FPGA groups where a stack
CPU design was criticized for not including a semaphore instruction
which allowed inter-process communications to be implemented without
tons of work. The "forth" solution would be to just not allow
interrupts and to not have multiprocessing!

I still see the GA144 as a sort of FPGA that supports software on each
of the logic blocks. This may be one of those things that is just a bit
ahead of its time. When technology reaches a point where a GA256K is
available, maybe we will know how to properly utilize vast arrays of
processors. While there are many flaws in the GA144 system level
design, I think much of the criticism leveled toward it is due to
potential users viewing it through conventional CPU glasses.

--

Rick

Anton Ertl

Aug 26, 2015, 5:03:31 AM

rickman <gnu...@gmail.com> writes:
>I think there was a discussion in one of the FPGA groups where a stack
>CPU design was criticized for not including a semaphore instruction
>which allowed inter-process communications to be implemented without
>tons of work. The "forth" solution would be to just not allow
>interrupts and to not have multiprocessing!

The classical Forth solution is to have cooperative round-robin
multi-tasking, and there you know where the task switches can happen
(at PAUSE and I/O words), and everything that does not contain such
words is executed atomically (apart from interrupts); semaphores, if
necessary, can be built out of simpler words. Interrupts were
supported, but typically just activated a task that did the actual
work when the round-robin scheduler came to it.
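
For concreteness, here is a minimal sketch of a counting semaphore built
out of simpler words, assuming only a conventional cooperative PAUSE; the
names SEMAPHORE, WAIT and SIGNAL are illustrative, not any particular
system's API. Because a task can only lose the CPU at PAUSE, the check
followed by the decrement in WAIT needs no atomic instruction:

  : SEMAPHORE ( n "name" -- )  CREATE , ;     \ cell holds the count
  : SIGNAL    ( sem -- )  1 SWAP +! ;         \ release one unit
  : WAIT      ( sem -- )                      \ acquire one unit
     BEGIN DUP @ 0> 0= WHILE PAUSE REPEAT     \ yield until count > 0
     -1 SWAP +! ;

  \ usage:  1 SEMAPHORE UART-FREE  ...  UART-FREE WAIT ... UART-FREE SIGNAL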

Of course, this simplicity is no longer possible when we want multiple
cores to work. You can look at Chuck Moore's recent work to see his
solution for the resulting problems: No shared memory, communication
and synchronization through links with almost no buffering. Not sure
if that's enough in general, though. Anyway, other processors don't
have semaphore instructions, either, they have stuff from which
semaphores and other synchronization primitives can be built.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2015: http://www.rigwit.co.uk/EuroForth2015/

Mark Wills

Aug 26, 2015, 7:33:44 AM

On Wednesday, 26 August 2015 05:53:18 UTC+1, rickman wrote:
> The register thread made me realize that one of the greatest
> accomplishments of the GA144 is not any of the typically quoted features
> of the inter-processor communications or the small CPU size. It is the
> fact that programming the processor is made so much simpler by not
> having *interrupts*. Interrupts make the design of any processor
> significantly more complex, not to mention the software.

I can imagine that interrupts do indeed make the hardware side
more complex. Regarding the software side, you might be interested
in the approach taken by Texas Instruments in the 1970s with their
990/x line of mini computers.

The 990/x minis had a CPU that was originally built from discrete
logic on plug-in boards, like most of the minis back then (PDPs etc.).
Later on it was integrated into a single chip in the form of the
TMS9900 and its descendants.

The TMS9900 has an unusual architecture: it has 16 general-purpose
registers, but the registers do not reside on the chip. They live in
regular RAM and can be anywhere. The CPU itself has only three
on-board registers: PC (program counter), ST (status register) and
WP (workspace pointer).

Using the LWPI (Load Workspace Pointer) instruction, you can have
as many sets of 16 registers as you have memory for. They can even
overlap if that's what you want to do. Certain instructions, for
example BLWP (Branch and Load Workspace Pointer), allow you to
specify an address to branch to and, critically, an address in RAM
from which the 16 registers will be taken. Thusly,
subroutines can each have their own set of registers - meaning the
registers effectively become local variables.

Interrupts on the 9900 are programmed using vectors. Some locations
in memory are reserved for vectors. So, when a level 4 interrupt
(for example) comes in, the processor reads the level 4 vector.
In that vector are two pointers: The address of an interrupt
service routine, and the address of the registers for that ISR.
Thusly, interrupts are very fast indeed - no registers need to
be pushed to stacks in order to save their contents - the ISR
(every ISR) can have its own register workspace. Same for subroutines.

Even better, when an interrupt occurs, or a subroutine is called
using BLWP, three registers in the *new* workspace are loaded with:

Previous contents of program counter (for returning from ISR or SR)
Previous contents of status register
Previous address of register set

So, having serviced your interrupt, or finished your subroutine,
you can return, seamlessly, back to your main program. The entire
previous context of the machine is restored instantly. The contents
of the registers stay where they were: in RAM.
Using this technique, routines can call routines, that call routines
that call routines, each having their own register set, and if
an interrupt arrives while it's all happening, the programmer makes
precisely no effort at all. Just let it happen. It'll all unwind
itself.

It's a brilliant architecture (in practice, the implementation
used in the 9900 could have been better; it was massively improved in
the TMS9995) that actually makes the silicon a lot simpler.
Move the registers off the chip into memory. You won't get
the fastest performance using that technique (you have to go
out on the address and data busses (obviously) to fetch your
registers), but you can get a nice simple design and programming
it is like falling off a log!

Mark

rickman

Aug 26, 2015, 8:46:14 AM

I still have a TMS9995 CPU card I built many years ago. I remember that
I got it to work, but never made it do anything useful. I think I was
going to connect it to an LCD from a UPS truck. Not sure why I didn't,
I think it had to do with getting my hands on a Heathkit LSI-11 which
made the TMS devices fade into the background, lol.

--

Rick

rickman

Aug 26, 2015, 8:50:25 AM

On 8/26/2015 4:51 AM, Anton Ertl wrote:
> rickman <gnu...@gmail.com> writes:
>> I think there was a discussion in one of the FPGA groups where a stack
>> CPU design was criticized for not including a semaphore instruction
>> which allowed inter-process communications to be implemented without
>> tons of work. The "forth" solution would be to just not allow
>> interrupts and to not have multiprocessing!
>
> The classical Forth solution is to have cooperative round-robin
> multi-tasking, and there you know where the task switches can happen
> (at PAUSE and I/O words), and everything that does not contain such
> words is executed atomically (apart from interrupts); semaphores, if
> necessary, can be built out of simpler words. Interrupts were
> supported, but typically just activated a task that did the actual
> work when the round-robin scheduler came to it.
>
> Of course, this simplicity is no longer possible when we want multiple
> cores to work. You can look at Chuck Moore's recent work to see his
> solution for the resulting problems: No shared memory, communication
> and synchronization through links with almost no buffering. Not sure
> if that's enough in general, though. Anyway, other processors don't
> have semaphore instructions, either, they have stuff from which
> semaphores and other synchronization primitives can be built.

I don't recall the particulars, but the guy complaining about this CPU
design was saying you can't construct a useful semaphore without a *lot*
of headache unless you design an atomic test-modify-write instruction or
similar. I suppose you could disable interrupts while you do this in
multiple instructions, but I think he was saying this becomes unwieldy
and greatly increases the interrupt response time as well as the
response jitter.
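
For illustration: with a hypothetical atomic read-and-set primitive, call
it TAS! ( addr -- old ), which fetches a cell and stores TRUE into it in
one indivisible memory cycle, a spinlock is only a few lines of Forth.
Without such an instruction, the fetch and the store would have to be
bracketed by disabling and re-enabling interrupts, which is exactly the
overhead and jitter he was complaining about.

  \ TAS! is assumed, not a real word: atomically fetch the old value,
  \ then store TRUE, in one memory cycle.
  : LOCK   ( addr -- )  BEGIN DUP TAS! 0= UNTIL DROP ;  \ spin until acquired
  : UNLOCK ( addr -- )  FALSE SWAP ! ;
  \ usage:  VARIABLE GUARD   GUARD LOCK  ...critical section...  GUARD UNLOCK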

His rant seemed to be that it is not practical to give such CPUs an OS.

--

Rick

Walter Banks

Aug 26, 2015, 11:42:00 AM

On 26/08/2015 12:53 AM, rickman wrote:
> The register thread made me realize that one of the greatest
> accomplishments of the GA144 is not any of the typically quoted
> features of the inter-processor communications or the small CPU size.
> It is the fact that programming the processor is made so much simpler
> by not having *interrupts*. Interrupts make the design of any
> processor significantly more complex, not to mention the software.
>

Interrupt context switching has always been a software nightmare. There
have been many approaches, from hardware that saves the registers to
hardware that saves nothing. No matter what the approach, it always
comes down to a trade-off between two metrics: minimum switch latency
vs. minimum execution cycles.

In one processor design project we chose to run all interrupts to
completion, so that interrupts could only happen at the pleasure of
the application developer. This minimizes the amount of context
needed and minimizes the average cycle count. We compensated for switch
delays by being able to precompute event-driven I/O transfers in hardware.


w..

Syd Rumpo

Aug 26, 2015, 12:12:05 PM

On 26/08/2015 05:53, rickman wrote:
> The register thread made me realize that one of the greatest
> accomplishments of the GA144 is not any of the typically quoted features
> of the inter-processor communications or the small CPU size. It is the
> fact that programming the processor is made so much simpler by not
> having *interrupts*. Interrupts make the design of any processor
> significantly more complex, not to mention the software.
>
<snip>

For me, one of the attractions of a stack machine is that interrupts can
be very low latency - there's nothing to stack, it's already there.
(Ok, probably need to stack a mode register or two.)

But I take the point - one or more simpler processors per task is good.

Cheers
--
Syd

Mux

Aug 26, 2015, 12:48:51 PM

That's kinda why I brought up the part about registers. In a 'pure' stack machine architecture an interrupt does indeed become just a call. As long as it doesn't allow for re-entrant interrupts and leaves the state of the stack where it was before the entry point, it's literally just a single subroutine call with the caveat that when you use encoded (i.e. 5:5:5) instructions, you can only service the interrupt when your instruction register has been fully emptied.

My 'C4' Forth core, which has only 16 instructions, basically deals with interrupts by loading the IR with 'dup dup xor call' instead of the next instruction word from memory, which effectively executes a subroutine call to address zero. Since reset *also* starts at address zero, there's a check to see how it got there: either from a fresh reset (the interrupt request is cleared on reset) or from an interrupt, in which case the interrupt request bit is set.

While it's a simple solution with minimal overhead, NOT having interrupts and having multiple cores makes things more interesting and also does away with the whole register state stuff.

-Y

Elizabeth D. Rather

Aug 26, 2015, 1:23:06 PM

On 8/25/15 10:51 PM, Anton Ertl wrote:
> rickman <gnu...@gmail.com> writes:
>> I think there was a discussion in one of the FPGA groups where a stack
>> CPU design was criticized for not including a semaphore instruction
>> which allowed inter-process communications to be implemented without
>> tons of work. The "forth" solution would be to just not allow
>> interrupts and to not have multiprocessing!
>
> The classical Forth solution is to have cooperative round-robin
> multi-tasking, and there you know where the task switches can happen
> (at PAUSE and I/O words), and everything that does not contain such
> words is executed atomically (apart from interrupts); semaphores, if
> necessary, can be built out of simpler words. Interrupts were
> supported, but typically just activated a task that did the actual
> work when the round-robin scheduler came to it.
>
> Of course, this simplicity is no longer possible when we want multiple
> cores to work. You can look at Chuck Moore's recent work to see his
> solution for the resulting problems: No shared memory, communication
> and synchronization through links with almost no buffering. Not sure
> if that's enough in general, though. Anyway, other processors don't
> have semaphore instructions, either, they have stuff from which
> semaphores and other synchronization primitives can be built.

GA's concept is that the multiple cores are tiny and numerous, and in
effect replace the need for any multitasking within a core.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

m...@iae.nl

Aug 26, 2015, 2:17:33 PM

On Wednesday, August 26, 2015 at 1:33:44 PM UTC+2, Mark Wills wrote:

Thanks a lot for this very clear explanation!

It reads very much like the transputer architecture (although
the 'puter additionally had hardware floating-point AND shallow
stacks).

-marcel

PS: Why not have 16 extra cores, with limited functionality,
for interrupts? They would only need specific instructions,
and maybe it would be OK if they could only address say
1K of windowable memory. They could even be revamped 9900s :-)

hughag...@gmail.com

Aug 26, 2015, 3:50:37 PM

The TI9900 is pretty cool --- way ahead of its time --- similar to the Sun Sparc with that movable set of pseudo-registers.

In regard to interrupts, the MiniForth didn't have them --- it polled I/O as part of NEXT and would call an ISR if needed --- the primitives were generally short and fast because they didn't do any branching internally, so I/O didn't get lost (multiplication and division had to be factored though, as those were too slow otherwise).

rickman

Aug 26, 2015, 6:28:24 PM

On 8/26/2015 3:50 PM, hughag...@gmail.com wrote:
>
> The TI9900 is pretty cool --- way ahead of its time --- similar to
> the Sun Sparc with that movable set of pseudo-registers.
>
> In regard to interrupts, the MiniForth didn't have them --- it polled
> I/O as part of NEXT and would call an ISR if needed --- the
> primitives were generally short and fast because they didn't do any
> branching internally, so I/O didn't get lost (multiplication and
> division had to be factored though, as those were too slow
> otherwise).

That sounds like the MiniForth *did* have interrupts. I can't imagine it
would be polling I/O in the NEXT routine using Forth instructions; that
would be an enormous overhead. More likely the hardware monitored
interrupt requests, and when NEXT happened the interrupts were recognized.

Did this thing really have a NEXT routine? That alone sounds
inefficient. NEXT is basically the fetching and execution of the next
Forth instruction, which really should be done in hardware, just as any
CPU would do.

--

Rick

rickman

Aug 26, 2015, 6:39:06 PM

You only have nothing to stack if you have no data that needs to be
preserved between invocations. I think most interrupt processes have
data that needs to be preserved.

--

Rick

Ala'a Alawi

Aug 27, 2015, 11:37:14 AM

Not a Forth CPU, but it does not have interrupts IIRC: the P8X32A from BASIC Stamp creator Parallax

https://www.parallax.com/sites/default/files/downloads/P8X32A-Propeller-Datasheet-v1.4.0_0.pdf

with 8 cores and a somewhat interesting design to look at.

Ala'a

hughag...@gmail.com

Aug 28, 2015, 2:44:36 AM

On Wednesday, August 26, 2015 at 4:33:44 AM UTC-7, Mark Wills wrote:
> I can imagine that interrupts do indeed make the hardware side
> more complex.

The point I was making in regard to the MiniForth, is that you don't need to have interrupts that can occur after ANY machine-instruction anywhere. You can just have one machine-instruction that can interrupt itself if an interrupt is necessary --- so all the complexity is buried in that one machine-instruction. In the MiniForth that instruction was NEXT because it gets executed very frequently. Also, the general-purpose registers (such as the carry bit) never carry data from one primitive to the next, but are only used internally within each primitive, so they don't have to be saved and restored because they are guaranteed to be invalid at the time that NEXT executes.

I don't know anything about HDL, so I can't really comment on how much complexity was reduced with this method --- I had no involvement in writing the HDL code for the MiniForth (or in writing the HDL itself, which was done before I even started work there).

I wrote the assembler/simulator. I remember that up to five instructions could be packed into a single opcode, and all of them would execute concurrently (each opcode took one clock cycle). The job of the assembler was to rearrange the instructions so as to pack as many into each opcode as possible (minimizing how many slots got filled with a NOP), while still guaranteeing that the generated code did the same thing as it would if one instruction were in each opcode and the instructions were in the same order as they had been in the source-code.

It was all about knowing which registers each instruction depended upon and which registers it affected, and making sure that no instruction that needed certain registers got compiled ahead of the instruction that set those registers --- this was similar to the "out of order execution" of the x86 processors, except that I was doing it at compile-time rather than run-time.

The NEXT instruction always parallelized with (got packed together with) other instructions, so it took zero clock cycles (the only exception was the NOOP primitive, because there weren't any other instructions to parallelize NEXT with). NEXT accessed memory, so it couldn't be packed together with any other instruction that accessed memory, but there were always instructions nearby that just accessed registers, and NEXT would get packed in with them. The MiniForth was a Harvard architecture, so by "memory" I mean the data memory (both data and Forth code) --- there was a separate memory for the machine-code.

> Interrupts on the 9900 are programmed using vectors. Some locations
> in memory are reserved for vectors. So, when a level 4 interrupt
> (for example) comes in, the processor reads the level 4 vector.
> In that vector are two pointers: The address of an interrupt
> service routine, and the address of the registers for that ISR.
> Thusly, interrupts are very fast indeed - no registers need to
> be pushed to stacks in order to save their contents - the ISR
> (every ISR) can have its own register workspace. Same for subroutines.
>
> Even better, when an interrupt occurs, or a subroutine is called
> using BLWP, three registers in the *new* workspace are loaded with:
>
> Previous contents of program counter (for returning from ISR or SR)
> Previous contents of status register
> Previous address of register set
>
> So, having serviced your interrupt, or finished your subroutine,
> you can return, seamlessly, back to your main program. The entire
> previous context of the machine is restored instantly. The contents
> of the registers stay where they were: in RAM.
> Using this technique, routines can call routines, that call routines
> that call routines, each having their own register set, and if
> an interrupt arrives while it's all happening, the programmer makes
> precisely no effort at all. Just let it happen. It'll all unwind
> itself.

The MiniForth had sixteen 16-bit pseudo-registers that were in memory. They were at a fixed location (address zero) however, so they couldn't be relocated like in your TI9900. Having a base pointer for the registers is actually a pretty cool idea though --- we should have done that in the MiniForth.

Actually, the pseudo-registers weren't even added to the MiniForth design until the very end of the project. This was after the entire motion-control program had been written. The pseudo-registers were added because the next project was a C compiler and I said that I needed pseudo-registers because LCC (I was planning on retargeting LCC to generate code for the MiniForth) expected to have a set of registers available.

The C compiler project died though, mostly because I didn't know very much about LCC and Testra wasn't willing to pay me to learn --- they expected me to already be an expert in LCC, similar to how I was already an expert in UR/Forth when I started working there --- I can't be an expert in everything though, and I'm not going to spend my free time learning something that I have no interest in outside of my job. Also, I was quite dubious that the MiniForth would be competitive running C programs against mainstream processors (such as the MC6812) that were designed specifically to support C --- the MiniForth could blow away mainstream processors (such as the MC6812) in running Forth programs --- everybody wants to program in C rather than in Forth though, so there was no interest in the MiniForth processor anywhere (afaik, the motion-control boards are the only application that it has ever been used for).

MFX would have been a lot different if I had known at the beginning that pseudo-registers were going to be included in the design. All I was told at the beginning was that the goal was to port the motion-control program over from 80c320 UR/Forth to the MiniForth --- UR/Forth expected a basic stack-machine (it didn't have local variables), so the MiniForth was a basic stack-machine (only after I got the pseudo-registers did I add local variables, but they weren't used in the motion-control program that was already written).

Mux

Aug 28, 2015, 9:28:35 PM

It's not that much that it's more difficult from an HDL perspective but rather the CPU you're implementing. The more complicated your architecture, the more overhead you'll have to implement it.

Most stack computers have no condition flags to save, so all it really becomes is saving the next PC and CALLing the ISR. Most register CPUs have a lot more to do, but a lot of that is still left up to the programmer and only the bare minimum is performed. Of course, with superscalar and OoO operation, you'll need to flush the entire pipeline and whatnot, which will take a LOT of cycles. At a minimum, something like a 6502 pushes the status register and the next PC before calling the ISR. A special RTI (return from interrupt) differs from an RTS in that it does the extra pull.

For an interpreted software Forth implementation, I don't see anything wrong with polling at NEXT. In fact, I'm pretty sure that's what microcoded machines do as well :-)
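
To make "polling at NEXT" concrete, here is a rough model in high-level
Forth of one step of an address interpreter, assuming an ITC-style thread
where each cell holds an xt; IP, INT-PENDING and INT-XT are names made up
for the sketch, not part of any real kernel:

  VARIABLE IP            \ points at the next cell of the thread
  VARIABLE INT-PENDING   VARIABLE INT-XT
  : NEXT-STEP ( -- )
     INT-PENDING @ IF                     \ poll once per primitive
        0 INT-PENDING !  INT-XT @ EXECUTE
     THEN
     IP @ @  IP @ CELL+ IP !  EXECUTE ;   \ fetch the xt, advance IP, run it

In this model the poll costs one extra fetch and a conditional branch per
primitive.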

As I alluded to earlier, my C4 core checks for interrupts once the IR is empty and then, rather than loading the next set of instructions from the PC, replaces it with an ISR call.

-Y

pbj...@gmail.com

Aug 30, 2015, 10:01:59 PM

On Wednesday, August 26, 2015 at 2:53:18 PM UTC+10, rickman wrote:
<snip>
> It is the
> fact that programming the processor is made so much simpler by not
> having *interrupts*. Interrupts make the design of any processor
> significantly more complex, not to mention the software.
>
> I think there was a discussion in one of the FPGA groups where a stack
> CPU design was criticized for not including a semaphore instruction
> which allowed inter-process communications to be implemented without
> tons of work. The "forth" solution would be to just not allow
> interrupts and to not have multiprocessing!
<snip>

Seeing that the Parallax Propeller forums are down at the moment: I know that my Tachyon Forth, implemented on the 8-core 32-bit Propeller, works well without "interrupts" :) There are some who want interrupts for the next-generation Propeller yet somehow forget that each core is a CPU that can be dedicated to the task that would normally be handled by the interrupt. One of those typical interrupt jobs is the UART, but with the Propeller, for instance, I can dedicate a core just to serial receive and/or transmit, including the bit-level implementation right through to handshaking and buffering etc. So the application is not "interrupted" for anything but only needs to access the buffers in the common hub RAM.

In fact cores are so much better than interrupts, as they can be dedicated to the task at hand without distraction. For instance, I am developing a 32-channel serial async driver, with any mix of rx and tx (32 total), that is designed to work at 115,200 baud, although the transmitters are quite happy working even at 1M baud, all in one core without any dedicated hardware UARTs etc. I've spent less than a day on this and already the transmitter section is working very well without any jitter, and now I am down to testing the receiver section while I get sidetracked and program one of the cores as a resident logic analyser.
wOOt!
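
As a sketch of why a dedicated core can replace a UART: the transmit side
of async serial is just a timed loop, something along these lines, where
TX-LOW, TX-HIGH and BIT-DELAY stand in for whatever pin and timing words
the target provides (a generic illustration, not actual Tachyon code):

  : SEND-BYTE ( c -- )
     TX-LOW  BIT-DELAY                           \ start bit
     8 0 DO
        DUP 1 AND IF TX-HIGH ELSE TX-LOW THEN    \ LSB first
        BIT-DELAY  2/                            \ shift to the next bit
     LOOP  DROP
     TX-HIGH  BIT-DELAY ;                        \ stop bit

The receive side is the same idea run backwards: wait for the start edge,
then sample at bit-time intervals.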

Interrupts are great, that is, if you only have a single core, which is not great at all these days. Multicore should not be thought of as parallel processing; even back in the days of the ILLIAC IV, IIRC, that was shown to be quite difficult or impossible to program. Multi-threading we understand, though. Rather, the cores are used, IMO, much as people are used within an organization: each is specialized and looks after what they do best, and everything runs smoothly.

Interrupts? ah fooie!

(looks like the Propeller forums are back online, time to have some more fun)

rickman

Aug 30, 2015, 10:45:24 PM

On 8/30/2015 10:01 PM, pbj...@gmail.com wrote:
>
> Interrupts are great, that is if you only have a single core, which
> is not great at all these days. Multicore should not be thought of as
> parallel processing, as if even from the days of ILLIAC IV IIRC this
> was shown to be quite difficult/impossible to program.
> Multi-threading we understand though. Rather the cores are used IMO
> much as people are used within an organization, each specialized and
> looking after what they do best and everything runs smoothly.
>
> Interrupts? ah fooie!
>
> (looks like the Propeller forums are back online, time to have some
> more fun)

I agree that multicore is a great way to program. Interrupts are a
terrible kludge created by the focus on building CPUs of ever increasing
speed rather than using multiple cores assigned to a limited number of
tasks.

But none of that precludes the need for semaphores. How does the
Parallax handle that? Are there single instructions for semaphore
operations?

--

Rick

hughag...@gmail.com

Aug 30, 2015, 11:06:03 PM

On Sunday, August 30, 2015 at 7:01:59 PM UTC-7, pbj...@gmail.com wrote:
> In fact cores are so much better than interrupts as they can be dedicated to the task at hand without distraction. For instance I am developing a 32 channel serial asynch with any mix of rx and tx (32 total) that is designed to work at 115,200 baud although the transmitters are quite happy working even at 1M baud, all in one core without any dedicated hardware UARTs etc. I've only spent less than a day on this and already the transmitter section is working very well without any jitter etc and now I am down to testing the receiver section while I get sidetracked and program one of the cores as a resident logic analyser.
> wOOt!
>
> Interrupts are great, that is if you only have a single core, which is not great at all these days. Multicore should not be thought of as parallel processing, as if even from the days of ILLIAC IV IIRC this was shown to be quite difficult/impossible to program. Multi-threading we understand though. Rather the cores are used IMO much as people are used within an organization, each specialized and looking after what they do best and everything runs smoothly.
>
> Interrupts? ah fooie!

I agree that having more than one processor makes things easier. The MiniForth had an 8032 on board with some 64MB of RAM (not a typo: 64MB not 64KB). The 8032 had a UART built in, which the MiniForth lacked, so the 8032 buffered the files from the desktop-computer and then fed the data to the MiniForth through parallel I/O --- this way the MiniForth could do the laser etching, but not have to store the big files in memory (it had limited memory) --- the timing was pretty precise on the motion-control, so you really can't have the processor interrupted by a UART while it is in the middle of etching.

I've been thinking about upgrading the FMITE so that it could be multi-core. It is pretty small, so on a big FPGA there would be room for more than one of them. I would need semaphores so they don't step on each other's toes when accessing common RAM. Can you describe what is needed here? I think I need a primitive that can read and modify a memory location in one step.

Reinhardt Behm

Aug 30, 2015, 11:09:43 PM

It has something called a "lock" that can be used much like a semaphore,
but I have only once had the need for it.
Since each core has its own RAM and there is controlled access to common
(hub) RAM, there is not much need for semaphores. You usually use hub RAM
for some kind of mailbox or implement simple ring buffers, which work
without the need for synchronization.
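
A one-cell mailbox in hub RAM can be as simple as the following sketch,
assuming a single writer, a single reader, that zero is never a valid
message, and that a cell store completes atomically; no lock is needed:

  VARIABLE MAILBOX   0 MAILBOX !
  : POST ( msg -- )  BEGIN MAILBOX @ 0= UNTIL  MAILBOX ! ;      \ wait for empty, then fill
  : TAKE ( -- msg )  BEGIN MAILBOX @ ?DUP UNTIL  0 MAILBOX ! ;  \ wait for a message, then clear

Each side simply polls from its own core, which is cheap when that core
has nothing else to do.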

One of the big advantages of this architecture is the clearly predictable
timing, including for access to common RAM.
For example, if you write the software for some strange protocol (I did it
for ARINC-429: serial, 32-bit, 100 kbit/s, used in avionics), it will run on
one CPU independent of what is going on in the rest. If you need more of the
same interfaces, just dedicate another CPU (or CPUs). With conventional
interrupt-driven single-core CPUs you would have to think about mutual
interference of the timing.
I also found that such programs get much simpler because each CPU just
does _one_ task.

--
Reinhardt

rickman

Aug 30, 2015, 11:20:59 PM

It seems a bit odd to have the complexity of a CPU in an FPGA to handle
the relatively simple task of motion control. Why not use a simpler CPU
like the 8032 to control dedicated hardware for motion control? FPGAs
are all about application dedicated hardware. Much easier to get the
synchronization and precise control that way.

--

Rick

Bob

Aug 31, 2015, 8:52:25 PM

On 08/30/2015 11:06 PM, hughag...@gmail.com wrote:
> I agree that having more than one processor makes things easier.
> The MiniForth had an 8032 on board with some 64MB of RAM (not a
> typo: 64MB not 64KB). The 8032 had a UART built in, which the
> MiniForth lacked, so the 8032 buffered the files from the
> desktop-computer and then fed the data to the MiniForth through
> parallel I/O --- this way the MiniForth could do the laser
> etching, but not have to store the big files in memory (it had
> limited memory) --- the timing was pretty precise on the
> motion-control, so you really can't have the processor
> interrupted by a UART while it is in the middle of etching.
>
> I've been thinking about upgrading the FMITE so that it could
> be multi-core. It is pretty small, so on a big FPGA there
> would be room for more than one of them. I would need semaphores
> so they don't step on each other's toes when accessing common
> RAM. Can you describe what is needed here? I think I need a
> primitive that can read and modify a memory location in one step.
>

A primitive that can do three things in order in one step all at
once is impossible.

The 8086 LOCK prefix causes the CPU hardware to assert a pin from
the start of the read to the end of the write, telling the RAM
arbitrator not to let any other CPU interfere. That is equivalent
to disabling interrupts on a single CPU.

A totally different approach to communication between processes
is to use a linked list of small message buffers. One process
uses the head pointer to take messages. The other process uses
the tail pointer to send messages. Neither steps on the other's
pointer. This works without disabling interrupts on a single CPU,
and it works without locking hardware on multiple CPUs with
shared RAM.

The RAM needs to correctly complete one read or write at a time. It
cannot mix bits from two simultaneous transfers. If you have that,
you have all you need. You may want more, but you have what you
need.
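
Here is a minimal sketch of that idea in Forth, expressed as a
power-of-two ring buffer rather than a linked list (one producer, one
consumer; HEAD is written only by the receiver and TAIL only by the
sender, and cell accesses are assumed to complete one at a time, which
is exactly the guarantee described above):

  16 CONSTANT #SLOTS                     \ must be a power of two
  CREATE MSGS  #SLOTS CELLS ALLOT
  VARIABLE HEAD  0 HEAD !                \ receiver's index
  VARIABLE TAIL  0 TAIL !                \ sender's index
  : WRAP     ( i -- i' )  [ #SLOTS 1- ] LITERAL AND ;
  : SEND?    ( -- f )    TAIL @ 1+ WRAP  HEAD @ <> ;   \ room for one more?
  : SEND     ( msg -- )  TAIL @ CELLS MSGS + !
                         TAIL @ 1+ WRAP TAIL ! ;
  : RECEIVE? ( -- f )    HEAD @  TAIL @ <> ;           \ anything waiting?
  : RECEIVE  ( -- msg )  HEAD @ CELLS MSGS + @
                         HEAD @ 1+ WRAP HEAD ! ;

Note that SEND stores the message before it advances TAIL, and RECEIVE
reads the message before it advances HEAD, so neither side ever sees a
slot that has been claimed but not yet filled.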

odu_bob

Mark Wills

Sep 1, 2015, 4:39:28 AM

On Tuesday, 1 September 2015 01:52:25 UTC+1, Bob wrote:
>
> A primitive that can do three things in order in one step all at
> once is impossible.
>

Is it? What about the Forth primitive + ?

In a VHDL implementation I have, that instruction:

* Pops first operand off the stack (adjusting stack pointer)
* Pops second operand off the stack (adjusting stack pointer)
* Adds them
* Pushes the result to the stack (adjusting the stack pointer)

In a single clock cycle. That's at least 4 operations.

dunno

Sep 1, 2015, 1:58:31 PM

Why are you adjusting SP three times? It's needed only once.

--
dunno

Bob

Sep 1, 2015, 6:52:03 PM

All at ONCE? At one time: read, add and write, all at ONE time?
If they all happen at once there is no order; they happen at
the same time.

For organizing communication between multiple processors, the
+ primitive doesn't ensure that CPU A won't read or write between
the reads and writes of CPU B executing the + primitive.

Several posts back the 8-core Propeller was mentioned, and then
Hugh said he might need a primitive for multiple FMITEs.

Again, a primitive that can be used to organize multiple
CPUs that simultaneously read and write is impossible. It
is a different problem from one CPU with an
uninterruptible primitive. A different CPU is running
and can execute reads and writes independent of the
primitive's read/write on the other CPU.

The shared RAM must ensure that a read or write happens
correctly. Another read or write request that arrives
at the same time must either be processed later or handled
at the same time. At the same time is possible if the address is
different, both are reads, or only one is a write. In
the last case the read can correctly deliver the data
from before or after the write. If both are writes, the
RAM can retain either of the writes, but mixing the bits
would be bad.

If your shared RAM can't do that, it gets much harder.
That is a class of problem difficulty that I describe
as "Knuth has a solution, don't bother to try to solve
it again".

hughag...@gmail.com

Sep 1, 2015, 7:32:36 PM

I was assuming that the primitive would lock out the other processors from the common-memory during its execution --- I never said anything about simultaneous access to the common-memory --- that is obviously impossible.

Anyway, I already figured out what my primitive needs to do, so never mind my question about what is needed --- now I wish that I hadn't asked.

Bernd Paysan

Sep 1, 2015, 7:38:06 PM

Bob wrote:

>> In a VHDL implementation I have, that instruction:
>>
>> * Pops first operand off the stack (adjusting stack pointer)
>> * Pops second operand off the stack (adjusting stack pointer)
>> * Adds them
>> * Pushes the result to the stack (adjusting the stack pointer)
>>
>> In a single clock cycle. That's at least 4 operations.
>>
> All at ONCE? at one time read add and write all at ONE time?
> If they all happen at once there is no order, they happen at
> the same time.

I suppose that the stack pointer is in a register, and the VHDL compiler
optimizes a sequence of +1+1-1 into +1. The stack itself might be built
out of registers, too, so there are two read ports and one write port to
it. That can't be done in RAM.

If the author kept TOS in a register, the rest can be done in a single-
ported RAM. That's what I do in my b16. But then, I know what I'm doing.
I also do only one stack pointer update per instruction, removing an
opportunity to show off how well it optimizes for the HDL compiler.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o ID: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
http://bernd-paysan.de/

rickman

Sep 1, 2015, 10:09:17 PM

On 9/1/2015 7:38 PM, Bernd Paysan wrote:
> Bob wrote:
>
>>> In a VHDL implementation I have, that instruction:
>>>
>>> * Pops first operand off the stack (adjusting stack pointer)
>>> * Pops second operand off the stack (adjusting stack pointer)
>>> * Adds them
>>> * Pushes the result to the stack (adjusting the stack pointer)
>>>
>>> In a single clock cycle. That's at least 4 operations.
>>>
>> All at ONCE? at one time read add and write all at ONE time?
>> If they all happen at once there is no order, they happen at
>> the same time.
>
> I suppose that the stack pointer is in a register, and the VHDL compiler
> optimizes a sequence of +1+1-1 into +1. The stack itself might be out of
> registers, too, so there are two read and one write port to it. Can't be
> done in RAM.

What can't be done in RAM? Have you never heard of multiple port RAM?
Ram is just registers connected to the outside world via multiplexors.

--

Rick

Bernd Paysan

Sep 2, 2015, 8:12:51 AM

rickman wrote:
> What can't be done in RAM? Have you never heard of multiple port RAM?
> Ram is just registers connected to the outside world via multiplexors.

No, RAM is not registers connected to the outside world via multiplexers.
That's a register file, RAM is way denser, and it's connected to the outside
world via bit lines and sense amplifiers.

There are two bit lines per bit, inverted to each other. You can do dual-
ported RAM (by doubling the number of bit lines), but that means you can
either read two entries in a go, or read one, and write one. That's what
most FPGAs have inside; while you theoretically can have more ports (the
more you have, the less dense is your RAM), you won't see that in an FPGA.

So two reads, one write isn't possible in one cycle on an FPGA. Just put
the top of stack into a register, and the number of ports to stack RAM can
go down to one.

rickman

Sep 2, 2015, 12:40:31 PM

On 9/2/2015 8:12 AM, Bernd Paysan wrote:
> rickman wrote:
>> What can't be done in RAM? Have you never heard of multiple port RAM?
>> Ram is just registers connected to the outside world via multiplexors.
>
> No, RAM is not registers connected to the outside world via multiplexers.
> That's a register file, RAM is way denser, and it's connected to the outside
> world via bit lines and sense amplifiers.
>
> There are two bit lines per bit, inverted to each other. You can do dual-
> ported RAM (by doubling the number of bit lines), but that means you can
> either read two entries in a go, or read one, and write one. That's what
> most FPGAs have inside; while you theoretically can have more ports (the
> more you have, the less dense is your RAM), you won't see that in an FPGA.
>
> So two reads, one write isn't possible in one cycle on an FPGA. Just put
> the top of stack into a register, and the number of ports to stack RAM can
> go down to one.

You get hung up in meaningless distinctions while missing the point.
"Bit lines" *are* multiplexors using a wire OR.

You DO see dual port RAMs in FPGAs.... in fact it is hard to find RAMs
that AREN'T dual port. By using two such RAMs in parallel you can have
three ports implementing 2 reads and 1 write in any clock cycle. I have
done this.

--

Rick

Mux

Sep 2, 2015, 1:40:44 PM

Leaving aside the fact that block RAMs are 'free' and don't take up logic resources, I use 16-deep by 16-bit bidirectional shift registers. They're generally cheaper and faster than using block RAM but take, of course, more logic resources. It all depends on what you need. With that, I only need two control bits: enable and direction. Additionally, all of these work in parallel, so an add basically does all of the logic in a single clock. No fiddling with stack pointers, adders and whatnot.

-Mux

rickman

Sep 2, 2015, 4:57:46 PM

I am speculating, but if you are using the Xilinx SRL (if that is the
right name), they are implemented as the LUT latches with counters to
work as a shift register. They don't have full data paths to shift data
between the latches, and using counters saves power too. If you are
implementing a proper shift register using FPGA registers, you are using
a 2-to-1 mux on every bit of every register to get bi-directionality.

BTW, blockrams are far from "free". They are just a different color of
money.

--

Rick

hughag...@gmail.com

Sep 2, 2015, 5:14:13 PM

On Sunday, August 30, 2015 at 8:20:59 PM UTC-7, rickman wrote:
> On 8/30/2015 11:06 PM, hughag...@gmail.com wrote:
> > On Sunday, August 30, 2015 at 7:01:59 PM UTC-7, pbj...@gmail.com
> > wrote:
> >> Interrupts? ah fooie!
> >
> > I agree that having more than one processor makes things easier. The
> > MiniForth had an 8032 on board with some 64MB of RAM (not a typo:
> > 64MB not 64KB). The 8032 had a UART built in, which the MiniForth
> > lacked, so the 8032 buffered the files from the desktop-computer and
> > then fed the data to the MiniForth through parallel I/O --- this way
> > the MiniForth could do the laser etching, but not have to store the
> > big files in memory (it had limited memory) --- the timing was pretty
> > precise on the motion-control, so you really can't have the processor
> > interrupted by a UART while it is in the middle of etching.

> It seems a bit odd to have the complexity of a CPU in an FPGA to handle
> the relatively simple task of motion control. Why not use a simpler CPU
> like the 8032 to control dedicated hardware for motion control? FPGAs
> are all about application dedicated hardware. Much easier to get the
> synchronization and precise control that way.

Rickman obviously has no experience with motion-control. This is actually a difficult task. The tool has to accelerate and decelerate smoothly. The goal is to minimize harmonic resonance. There is a lot of math involved. A lot of motion-control programs solve this problem simply by putting a big delay (like 20 milliseconds) between every change of direction for the tool. This can't be done in a laser-etcher however. The tool (a mirror that directs the laser beam down onto the work surface) has to move at a steady speed so the laser will burn into the surface to a steady depth and width. If there were variations in speed at the joints where the lines change direction, then the laser would burn a blotch into the work surface at those joints.

Testra used an 80c320 originally, and over time they converted all of the Forth code into assembly-language. It was too slow however (mostly because 16-bit multiplication was slow) --- this is why the MiniForth was invented --- to be a replacement for the 80c320 providing better speed (note that the 80c320 is more than three times faster than the 8032 that Rickman recommends).

I don't actually have any experience with motion-control. The entire time that I worked at Testra, I never saw the source-code for the motion-control program, as that was highly proprietary. All I did was write MFX, the development system for the MiniForth --- the motion-control program is the only program that has ever been written in MFX, and I didn't see the source-code --- at least I admit that I don't know anything about motion-control however, which is a lot different than Rickman who describes motion-control as a "relatively simple task" and yet knows less about the subject than I do.

hughag...@gmail.com

Sep 2, 2015, 5:23:16 PM

On Sunday, August 30, 2015 at 7:01:59 PM UTC-7, pbj...@gmail.com wrote:
> Seeing that the Parallax Propeller forums are down at the moment I know that my Tachyon Forth implemented on the 8 core 32-bit Propeller works well without "interrupts" ...
>
> In fact cores are so much better than interrupts as they can be dedicated to the task at hand without distraction. For instance I am developing a 32 channel serial asynch with any mix of rx and tx (32 total) that is designed to work at 115,200 baud although the transmitters are quite happy working even at 1M baud, all in one core without any dedicated hardware UARTs etc. I've only spent less than a day on this and already the transmitter section is working very well without any jitter etc and now I am down to testing the receiver section while I get sidetracked and program one of the cores as a resident logic analyser.
> wOOt!

Every time that somebody writes a Forth program that works, this is another nail in the coffin of Forth-200x --- I'm assuming here that any Forth program that works, will necessarily be written in a non-standard Forth (because Standard Forth is a bug-ridden mess designed by an idiot sales-person).

ANS-Forth was just a marketing gimmick from Elizabeth Rather intended to fool the world into believing that Forth Inc. sets the standard for the industry and that all Forth programmers are dependent upon Forth Inc. for leadership --- Forth-200x is just more of the same, as it is mandated to be 100% compatible with ANS-Forth --- the Forth-200x committee relies entirely upon every Forth programmer being a failure at programming, and there is nothing that they hate and fear more than a Forth programmer who is capable of writing programs that work.

Bernd Paysan

Sep 2, 2015, 5:35:56 PM

rickman wrote:
> You get hung up in meaningless distinctions while missing the point.
> "Bit lines" *are* multiplexors using a wire OR.

No, a multiplexer is a logic gate. RAM is deliberately minimized special-
purpose circuitry, even with dynamic elements (the bit line is precharged,
so it doesn't have a pullup like a wired or, and it's used for both
directions, reading and writing; you have to be careful with your design
that writing actually flips a bit - it's at the border between analog and
digital design).

> You DO see dual port RAMs in FPGAs.... in fact it is hard to find RAMs
> that AREN'T dual port.

I did in fact write that.

> By using two such RAMs in parallel you can have
> three ports implementing 2 reads and 1 write in any clock cycle. I have
> done this.

Sounds like a stupid idea. By using some flip-flops for TOS (16 for 16
bits), you can use an entire RAM block for something else. And you reduce
your cycle time, because the read/write time to the TOS element in RAM is no
longer in the critical path.

Mux

Sep 2, 2015, 5:55:10 PM

> I am speculating, but if you are using the Xilinx SRL (if that is the
> right name), they are implemented as the LUT latches with counters to
> work as a shift register. They don't have full data paths to shift data
> between the latches and using counters saves power too. If you are
> implementing a proper shift register using FPGA registers, you are using
> a 2 to 1 mux on every bit of every register to get bi-directionality.
>

Yeah, but since they're wedged in between latches/registers (for lack of a better word), the performance you get is higher. For a register file you need an NxM multiplexer, which usually comes out slower. Add to that the fact that you now need a bi-directional counter to address into it, and your critical path has become even slower.

> BTW, blockrams are far from "free". They are just a different color of
> money.

Oh, I'm sure :-)

-Y

rickman

Sep 2, 2015, 6:02:56 PM

No, I am not an expert in motion control. My point was about the
division of labor between the fast processor and the slow. It seemed a
bit odd to design a custom processor and then to program it in a custom
language rather than to use an FPGA in the manner in which they work
best. Of course I don't know the details of this application. That is
why I *asked*. I think the answer I got was a more rude version of, "I
don't know."

Ok, I got it.

--

Rick

rickman

Sep 2, 2015, 6:09:21 PM

On 9/2/2015 5:35 PM, Bernd Paysan wrote:
> rickman wrote:
>> You get hung up in meaningless distinctions while missing the point.
>> "Bit lines" *are* multiplexors using a wire OR.
>
> No, a multiplexer is a logic gate. RAM is deliberately minimized special-
> purpose circuitry, even with dynamic elements (the bit line is precharged,
> so it doesn't have a pullup like a wired or, and it's used for both
> directions, reading and writing; you have to be careful with your design
> that writing actually flips a bit - it's at the border between analog and
> digital design).

I'm not going to argue such a silly topic. I don't care how you
implement a multiplexor, it is still a mux.


>> You DO see dual port RAMs in FPGAs.... in fact it is hard to find RAMs
>> that AREN'T dual port.
>
> I did in fact write that.
>
>> By using two such RAMs in parallel you can have
>> three ports implementing 2 reads and 1 write in any clock cycle. I have
>> done this.
>
> Sounds like a stupid idea. By using some flip-flops for TOS (16 for 16
> bits), you can use an entire RAM block for something else. And you reduce
> your cycle time, because the read/write time to the TOS element in RAM is no
> longer in the critical path.

Ok, so you acknowledge that you can do all the operations for an ADD out
of RAM at the same time and that you can have more than 2 port RAM in an
FPGA?

You may think it a stupid idea, but "stupid is as stupid does".
Different requirements for different projects.

--

Rick

rickman

Sep 2, 2015, 6:11:58 PM

On 9/2/2015 5:55 PM, Mux wrote:
>> I am speculating, but if you are using the Xilinx SRL (if that is the
>> right name), they are implemented as the LUT latches with counters to
>> work as a shift register. They don't have full data paths to shift data
>> between the latches and using counters saves power too. If you are
>> implementing a proper shift register using FPGA registers, you are using
>> a 2 to 1 mux on every bit of every register to get bi-directionality.
>>
>
> Yeah, but since they're wedged inbetween latches / registers (for lack of a better word), the performance you get is higher. For a register file you need an NxM multiplexer which usually comes out slower. Add to that the fact that you now have need a bi-directional counter to address into that and your critical path has become even slower.

And you only get them on Xilinx parts. Xilinx <> FPGA

I don't optimize until I know I have an issue that requires
optimization. Maybe the optimization is minimizing LUT usage?


>> BTW, blockrams are far from "free". They are just a different color of
>> money.
>
> Oh, I'm sure :-)
>
> -Y
>


--

Rick

Mux

Sep 2, 2015, 7:20:31 PM

> And you only get them on Xilinx parts. Xilinx <> FPGA
>
> I don't optimize until I know I have an issue that requires
> optimization. Maybe the optimization is minimizing LUT usage?

Could be. I'm using Altera and instantiating them individually. Either way, performance analysis comes out a few tens of MHz higher. On Altera devices, it synthesizes to 1 LE/bit.

-Y

Bernd Paysan

Sep 2, 2015, 8:07:18 PM

rickman wrote:
> I'm not going to argue such a silly topic. I don't care how you
> implement a multiplexor, it is still a mux.

You are just clueless, and at the same time you want to be right. Have you
graduated from kindergarten yet?

Your description of what a RAM is was simply very naive and inaccurate.
Neither is an SRAM cell a register, nor is a bit line a multiplexer:
functionally, it does only part of the job of both multiplexing and
demultiplexing (for a write, you need a demultiplexer). The other parts of
the job are done by the word line and the sense amplifiers+precharge
drivers. And if you want to have the edge-triggered RAM block (as you get
in the FPGA case), you need registers to store the inputs. That's all
"hidden" from your naive view in the block RAM macro.

As an FPGA-only developer, you probably think you don't need to know what it
is.

> Ok, so you acknowledge that you can do all the operations for an ADD out
> of RAM at the same time and that you can have more than 2 port RAM in an
> FPGA?

You need to use two RAM blocks to get more than 2 ports... Using two RAM
blocks for what can be done with one is a waste of resources. Having to
jump through the hoops of connecting two RAM blocks just to get one port
more is a waste of programming time.

> You may think it a stupid idea, but "stupid is as stupid does".
> Different requirements for different projects.

What's your requirement that drives you to use two block RAMs where one would
be sufficient? You always can do things with more resources, slower.
Requirements usually press you towards doing things with less resources,
faster. Unless you don't have any requirements, then you can waste
resources as much as you like.

IMHO, wasting resources by itself is stupid.

The code doesn't become more complicated with a TOS in registers, either.
It's not a premature optimization; it's something that goes without
thinking. We have cached the top of stack in many software Forth
implementations for ages, and it not only makes the Forth faster, but the
code easier to write.

It's probably like using a big-block 400 cubic inch V8 motor to go 65 mph.
You can do that with the 600cm³ (less than 40 cubic inches) two-cylinder
motor of a 2CV.

As a Forth programmer, I hate bloat just for the sake of bloat.

rickman

Sep 2, 2015, 8:43:03 PM

Something is wrong. Altera parts don't have shift registers built into
their LEs. At least they didn't and I am pretty sure Xilinx has a
patent on it, good for a few more years I expect... maybe not. Which
part are you using? One of the latest and greatest? Or are you saying
1 LE/bit/sr depth?

--

Rick

rickman

Sep 2, 2015, 8:49:17 PM

On 9/2/2015 8:07 PM, Bernd Paysan wrote:
> rickman wrote:
>> I'm not going to argue such a silly topic. I don't care how you
>> implement a multiplexor, it is still a mux.
>
> You are just clueless, and at the same time you want to be right. Have you
> graduated from kindergarten yet?

Whatever... I'm not interested in splitting hairs on this. The point
is you said, "Can't be done in RAM." because you have a preconceived
idea of what a RAM is. It can be, and *is* done with RAM.


> What's your requirement that drives you to use two block RAMs where one would
> be sufficient?

How about a design that is out of logic and has free ram blocks? Or one
that needs a larger register file than is practical in distributed logic?


> It's probably like using a big block 400 cubic inch V8 motor to go 65mhp.
> You can do that with the 600cm³ (less than 40 cubic inches) two-cylinder
> motor of a 2CV.

Not if you are towing a 5000 lb boat.

Why are we even discussing this? I feel this is a waste of bandwidth.
We both have more useful things to do... I'm going to do them. :)

--

Rick

Mux

Sep 3, 2015, 1:34:26 AM

It's not built-in. I basically have an array of registers that does this for a pop:

T <= reg[0];       // the second item becomes the new top of stack
reg[0] <= reg[1];  // everything below shifts up one place
reg[1] <= reg[2];
...
reg[n] <= 16'd0;   // the bottom entry is zero-filled

and the code below for a push. I'm actually using a loop construct so I don't have to write it out each time, but still..

T <= alu_result;   // the new value (here an ALU result) becomes the top
reg[0] <= T;       // the old top and everything below shift down
reg[1] <= reg[0];
...

All that this does is place a single 2-to-1 mux between each register based on the direction of shift, so it should (all) pretty much go down to routing. A 16x16 stack translates to 256 LE's on as low as an ancient EPC10k10 FPGA.

-Y

Bernd Paysan

Sep 3, 2015, 8:30:30 AM

rickman wrote:

> On 9/2/2015 8:07 PM, Bernd Paysan wrote:
>> rickman wrote:
>>> I'm not going to argue such a silly topic. I don't care how you
>>> implement a multiplexor, it is still a mux.
>>
>> You are just clueless, and at the same time you want to be right. Have
>> you graduated from kindergarten yet?
>
> Whatever... I'm not interested in splitting hairs on this. The point
> is you said, "Can't be done in RAM." because you have a preconceived
> idea of what a RAM is. It can be, and *is* done with RAM.

I have the idea of a real RAM because I know what a real RAM is. It can't be
done with one block RAM. It can be done with two, yes, but that's a waste of
resources.

>> What's your requirement that drives you to use two block RAMs where one
>> would be sufficient?
>
> How about a design that is out of logic and has free ram blocks? Or one
> that needs a larger register file than is practical in distributed logic?

The logic to access the two RAMs costs more LEs than putting TOS into
registers.

If you want to implement a 3-register RISC architecture, your approach is
the only way to go. But you don't, you implement a stack CPU.
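
To make concrete what I mean -- TOS in a register, everything below it in a single block RAM -- here is a minimal sketch. The names and the registered NOS mirror are mine, invented for illustration; whether your tools map the array onto a block RAM depends on the read style and the device:

module tos_plus_ram_stack #(
    parameter WIDTH = 16
) (
    input  wire             clk,
    input  wire             push,       // din becomes TOS, old TOS goes under
    input  wire             pop,        // NOS becomes TOS
    input  wire [WIDTH-1:0] din,
    output reg  [WIDTH-1:0] T,          // top of stack, always a register
    output reg  [WIDTH-1:0] N           // next on stack, mirrors ram[sp]
);
    reg [4:0]       sp = 0;             // index of the current NOS (32 deep)
    reg [WIDTH-1:0] ram [0:31];         // everything below the TOS

    always @(posedge clk) begin
        if (push) begin
            ram[sp + 1'b1] <= T;        // old TOS becomes the new NOS
            sp <= sp + 1'b1;
            N  <= T;                    // keep the NOS mirror in step
            T  <= din;
        end else if (pop) begin
            T  <= N;                    // old NOS becomes the new TOS
            N  <= ram[sp - 1'b1];       // fetch the element below it
            sp <= sp - 1'b1;
        end
    end
endmodule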

>> It's probably like using a big-block 400 cubic inch V8 motor to go 65 mph.
>> You can do that with the 600cm³ (less than 40 cubic inches) two-cylinder
>> motor of a 2CV.
>
> Not if you are towing a 5000 lb boat.

What you are doing is actually slower...

> Why are we even discussing this? I feel this is a waste of bandwidth.
> We both have more useful things to do... I'm going to do them. :)

Probably. Your design approach just makes me cringe. I think you really
can learn something here. But I think you don't want to.

rickman

unread,
Sep 3, 2015, 12:59:38 PM9/3/15
to
On 9/3/2015 1:34 AM, Mux wrote:
> On Wednesday, September 2, 2015 at 5:43:03 PM UTC-7, rickman wrote:
>> On 9/2/2015 7:20 PM, Mux wrote:
>>>> And you only get them on Xilinx parts. Xilinx <> FPGA
>>>>
>>>> I don't optimize until I know I have an issue that requires
>>>> optimization. Maybe the optimization is minimizing LUT usage?
>>>
>>> Could be. I'm using altera and instantiating them individually.
>>> Either way, performance analysis comes out a few 10's of mhz
>>> higher. On Altera devices, it synthesizes to 1 LE/bit.
>>
>> Something is wrong. Altera parts don't have shift registers built
>> into their LEs. At least they didn't and I am pretty sure Xilinx
>> has a patent on it, good for a few more years I expect... maybe
>> not. Which part are you using? One of the latest and greatest?
>> Or are you saying 1 LE/bit/sr depth?
>>
>> --
>>
>> Rick
>
> It's not built-in. I basically have an array of registers that does
> this for a pop:
>
> T <= reg[0]; reg[0] <= reg[1]; reg[1] <= reg[2]; .... reg[n] <=
> 16'd0;
>
> and the code below for a push. I'm actually using a loop construct so
> I don't have to write it out each time, but still..
>
> T <= alu_result; reg[0] <= T; reg[1] <= reg[0]; ....
>
> All that this does is place a single 2-to-1 mux between each register
> based on the direction of shift, so it should all pretty much go down to
> routing. A 16x16 stack translates to 256 LEs even on something as low-end
> as the ancient EPF10K10 FPGA.

Yes and very inefficient in terms of logic utilization. A high price to
pay for performance which may or may not pan out. I find the block RAMs
are not in any of the critical timing paths of any CPU I've designed.

--

Rick

Mux

unread,
Sep 3, 2015, 6:54:22 PM9/3/15
to
> Yes and very inefficient in terms of logic utilization. A high price to
> pay for performance which may or may not pan out. I find the block RAMs
> are not in any of the critical timing paths of any CPU I've designed.
>
> --
>
> Rick

How is that inefficient if it's one LE per bit? As far as performance goes, it's well into the 200 MHz range, which is fast enough for me on a cheap FPGA. I'm not looking into spending $$$ to get FPGAs in the GHz range. Anything that runs at about 50-100 MHz will do nicely for me.

-Y

rickman

unread,
Sep 3, 2015, 7:54:26 PM9/3/15
to
Not one LE per bit, one LE per bit per register. If you are working in
a part where you have lots of LEs to use, fine. My CPU design uses 600
LUTs total plus 1 block RAM for both stacks (not 1 each) and whatever
data/program memory you need to use. Using 256 LUTs for the stack
registers seems like a poor tradeoff to me, but it all depends on what
you have lots of.

Does your entire CPU run at 200 MHz? The decode must be very simple. I
take it you have a zero compare instruction rather than flags? What
about detecting carry or overflow? Those are the speed critical paths
in my design. In a 15 year old Altera FPGA it ran at 50 some MHz. I
only ever had one application for it though, so I spent a little time
looking at improvements, but never worked with it further.

--

Rick

Mux

unread,
Sep 4, 2015, 12:13:53 AM9/4/15
to
> Not one LE per bit, one LE per bit per register. If you are working in
> a part where you have lots of LEs to use, fine. My CPU design uses 600
> LUTs total plus 1 block RAM for both stacks (not 1 each) and whatever
> data/program memory you need to use. Using 256 LUTs for the stack
> registers seems like a poor tradeoff to me, but it all depends on what
> you have lots of.
>
Correct. One LE per bit per register. The EPF10K10 has 576 LEs, and with both the data and return stacks being 8 deep (256 LEs) and the remaining logic coming in at around 100 LEs, that leaves ample space for other things. The 10K10 only has 6144 bits of memory, which I need for line buffers and palettes. So yeah, it's a trade-off.

> Does your entire CPU run at 200 MHz? The decode must be very simple. I
> take it you have a zero compare instruction rather than flags? What
> about detecting carry or overflow? Those are the speed critical paths
> in my design. In a 15 year old Altera FPGA it ran at 50 some MHz. I
> only ever had one application for it though, so I spent a little time
> looking at improvements, but never worked with it further.

Yup. The instructions are 4-bit each, which means I can fetch 4 of them at once. A '0=' instruction is used for checking for zero (obviously) and consists of nothing more than a 16-way OR rather than a compare. No overflow or carry. Registered performance hovers somewhere around 200 MHz, with a tpd hovering around ~125 MHz. Nice thing is that it also has a single interrupt source (like I stated earlier) as well as a pre-fetch for the next 4 instructions from memory, based on some of the work by Eric LaForest.
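
In Verilog terms the 0= datapath really is just a couple of lines, something like this (a sketch, not my exact source; T is the 16-bit TOS register, and I'm assuming a Forth-style all-ones true flag):

wire        t_nz  = |T;              // 16-way reduction OR, no comparator
wire [15:0] t_eq0 = {16{~t_nz}};     // what 0= writes back to TOS: -1 if T was 0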

On a somewhat different note, I don't particularly care for the hostilities in this and other threads, or any on CLF. Not singling out anyone in particular, but the bitch-fests that I've seen on this group far outweigh the actual useful posts where people actually try to help one another.

Just an observation...

-Y

Jan Coombs <Jan-54 >

unread,
Sep 4, 2015, 4:48:33 AM9/4/15
to
On Thu, 3 Sep 2015 21:13:50 -0700 (PDT)
Mux <Yvo....@gmail.com> wrote:


> >
> Correct. One LE per bit per register. [...]

Thanks for the details, I'm very interested in practical work
on FPGA stack engines, especially using tiny low power devices.


> On a somewhat different note, I don't particularly care for the
> hostilities in this and other threads, or any on CLF. Not
> singling out anyone in particular, but the bitch-fests that
> I've seen on this group far outweigh the actual useful posts
> where people actually try to help one another.
>
> Just an observation...

It's a big problem, with noise often outweighing signal. I try
to avoid responding when threads get set in treacle.

Jan Coombs
--
email valid, else fix dots and hyphen
jan4clf2014@murrayhyphenmicroftdotcodotuk

rickman

unread,
Sep 4, 2015, 11:47:15 AM9/4/15
to
On 9/4/2015 12:13 AM, Mux wrote:
>> Not one LE per bit, one LE per bit per register. If you are
>> working in a part where you have lots of LEs to use, fine. My CPU
>> design uses 600 LUTs total plus 1 block RAM for both stacks (not 1
>> each) and whatever data/program memory you need to use. Using 256
>> LUTs for the stack registers seems like a poor tradeoff to me, but
>> it all depends on what you have lots of.
>>
> Correct. One LE per bit per register. The EPF10K10 has 576 LEs, and
> with both the data and return stacks being 8 deep (256 LEs) and the
> remaining logic coming in at around 100 LEs, that leaves ample space
> for other things. The 10K10 only has 6144 bits of memory, which I need
> for line buffers and palettes. So yeah, it's a trade-off.
>
>> Does your entire CPU run at 200 MHz? The decode must be very
>> simple. I take it you have a zero compare instruction rather than
>> flags? What about detecting carry or overflow? Those are the
>> speed critical paths in my design. In a 15 year old Altera FPGA it
>> ran at 50 some MHz. I only ever had one application for it though,
>> so I spent a little time looking at improvements, but never worked
>> with it further.
>
> Yup. The instructions are 4-bit each, which means I can fetch 4 of
> them at once. A '0=' instruction is used for checking for zero
> (obviously) and consists of nothing more than a 16-way OR rather
> than a compare. No overflow or carry. Registered performance hovers
> somewhere around 200 MHz, with a tpd hovering around ~125 MHz.

You lost me there. What do you mean tpd? Are you talking about the
useful I/O rate?


> Nice
> thing is that it also has a single interrupt source (like I stated
> earlier) as well as a pre-fetch for the next 4 instructions from
> memory, based on some of the work by Eric LaForest.

Is your memory external? I've never gone that route since my uses for
internal CPUs are always 100% internal. I typically don't have room on
the board for an external memory or the design is a retrofit where there
was none in the first place. Prefetching doesn't speed up a CPU running
from internal memory since the block RAM is nearly as fast as a register.

BTW, what are your 4 bit instructions? I've never been able to convince
myself such a small instruction is the way to go. Just as stack ISAs
generate a fair percentage of code to thrash the stack, small
instructions exacerbate this with the very few instructions possible.
I'm wondering if VLIW isn't the way to go for stack machines. But then
the tools get more complex to optimize the parallelism or the user has
to figure it all out themselves.


> On a somewhat different note,I don't particularly care for the
> hostilities in this- and other threads, or any on CLF. Not singling
> out anyone in particular but the bitch-fests that I've seen on this
> group far outweigh the actual useful posts where people actually try
> to help one-another.
>
> Just an observation...

I hope you aren't referring to my posts. I try to avoid confrontation,
but know I push Hugh's buttons sometimes. I also get into "discussions"
with Bernd sometimes that are less than productive. I was surprised
that a recent interaction with Hugh started to go in a semi-useful
direction for a moment. If I am crossing lines I would like for someone
to tell me, but I expect it is hard to know where the line is in this
group. At least it isn't s.e.d where anything goes, including
references to Nazis, lol.

--

Rick

Mux

unread,
Sep 4, 2015, 11:49:31 AM9/4/15
to
> > On a somewhat different note, I don't particularly care for the
> > hostilities in this and other threads, or any on CLF. Not
> > singling out anyone in particular, but the bitch-fests that
> > I've seen on this group far outweigh the actual useful posts
> > where people actually try to help one another.
> >
> > Just an observation...
>
> It's a big problem, with noise often outweighing signal. I try
> to avoid responding when threads get set in treacle.
>
> Jan Coombs
> --

I think a bigger problem is that it sends a message to people new to Forth that this is / can be a pretty hostile group, which turns a lot of people off. If we ever want to engage and encourage people to pick up Forth, we need to be more welcoming and open to new ideas without being prejudiced.

-Y

Mux

unread,
Sep 4, 2015, 12:20:12 PM9/4/15
to
> Is your memory external? I've never gone that route since my uses for
> internal CPUs are always 100% internal. I typically don't have room on
> the board for an external memory or the design is a retrofit where there
> was none in the first place. Prefetching doesn't speed up a CPU running
> from internal memory since the block RAM is nearly as fast as a register.
>
Memory is external: 64Kx16 static RAM at 10 ns, which allows me to do pre-fetching if the last instruction isn't referencing memory. Check out Eric LaForest's paper for more info.

> BTW, what are your 4 bit instructions? I've never been able to convince
> myself such a small instruction is the way to go. Just as stack ISAs
> generate a fair percentage of code to thrash the stack, small
> instructions exacerbate this with the very few instructions possible.

Yup. It's taking MISC to an extreme, but it kinda works. See if I can remember... PC@, LIT, +, AND, XOR, 2/, 0=, DUP, DROP, OVER, PUSH, POP, @, !, JUMP, CALL.

From Eric's paper, PC@ is an instruction that literally fetches the next 4 instructions. While that does mean an extra clock for every 4 instructions, most of the time the prefetcher manages to avoid it. Given that you generally need more stack operations, this increases how often the prefetcher is able to kick in. As Eric mentioned, it's a really elegant solution, as it gets automatically shifted in (all zeroes) and thus becomes the '5th' instruction, which is why it's coded as 0. Ting's 4-bit CPU and others use a 2-bit instruction counter and a NOP, which means you have to NOP out your IR and stall rather than early-out.

The interrupt mechanism, which kicks in when a PC@ is about to execute, is basically a mux that gates in either { DUP, DUP, XOR, CALL } or the next instruction word from memory; the injected sequence calls address 0, which has code that determines how it got there (i.e. interrupt bit set -> interrupt).
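
Roughly, the fetch side with the interrupt injection looks like this. Only PC@ being encoded as 0 is real; the other opcode values, the slot ordering and the signal names below are placeholders I've made up for the sketch, not a paste of my source:

module fetch_sketch (
    input  wire        clk,
    input  wire        irq_pending,
    input  wire [15:0] mem_data,         // the next four packed 4-bit slots
    output wire [3:0]  op                // the slot currently executing
);
    localparam [3:0] OP_PCFETCH = 4'h0;  // PC@ really is opcode 0
    localparam [3:0] OP_XOR  = 4'h4,     // placeholder encodings, not the
                     OP_DUP  = 4'h7,     //   real opcode map
                     OP_CALL = 4'hF;
    // { DUP, DUP, XOR, CALL } packed with the first instruction in the low
    // nibble: DUP DUP XOR leaves a 0 on the stack, CALL then calls address 0.
    localparam [15:0] IRQ_WORD = {OP_CALL, OP_XOR, OP_DUP, OP_DUP};

    reg [15:0] ir = 16'h0000;            // instruction register, 4 slots
    assign op = ir[3:0];

    always @(posedge clk) begin
        if (op == OP_PCFETCH)
            ir <= irq_pending ? IRQ_WORD : mem_data;   // fetch or inject IRQ
        else
            ir <= {OP_PCFETCH, ir[15:4]}; // advance to the next slot; the
                                          // zeros that shift in become the
                                          // implicit '5th' PC@
    end
endmodule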

Jump is conditional and works closely with 0= to mask off addresses. A side effect is that you can never jump to zero, which is fine. Exit is encoded as { POP, JUMP }.

Call, on the other hand, IS able to jump to zero, and therefore allows me to generate a call without referencing memory, with reset/IRQ hardwired to the same address.

One last thing: LIT is the only instruction that references memory, so jump/call first put the address on the data stack. I mainly did this to simplify stuff.
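
Putting the last few paragraphs together, the PC side is roughly this shape. Again a from-memory sketch, not a paste: op, T and pc are assumed to be in scope, the OP_* names refer to the opcode list further up, and whether LIT/PC@ really advance the word address like this is my assumption:

always @(posedge clk) begin
    case (op)
        OP_PCFETCH,
        OP_LIT:  pc <= pc + 1'b1;     // both consume a word from the
                                      //   instruction stream (assumption)
        OP_JUMP: if (|T) pc <= T;     // a target masked to zero simply
                                      //   falls through
        OP_CALL: pc <= T;             // CALL can reach address 0, which is
                                      //   shared by reset and the IRQ path
        default: ;                    // everything else leaves pc alone
    endcase
end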

4 bits is really pushing it, but it was an interesting exercise/experiment to see what you can do. Code density is reasonably high in places but, as you stated already, you end up with more instructions to do the same work. To that end, I'm currently working on a new 5-bit CPU :-)

Btw, if anyone's interested, I can post my Verilog code on OpenCores.

> I'm wondering if VLIW isn't the way to go for stack machines. But then
> the tools get more complex to optimize the parallelism or the user has
> to figure it all out themselves.
>

The thing with VLIW is that you have a lot of unused units sitting idle, which takes up space. I remember working on VMLabs' Nuon, which had a theoretical '28 instructions per clock' spread out over 4 cores. It died a horrible death, but it was a pretty neat architecture at the time.

>
> > On a somewhat different note, I don't particularly care for the
> > hostilities in this and other threads, or any on CLF. Not singling
> > out anyone in particular, but the bitch-fests that I've seen on this
> > group far outweigh the actual useful posts where people actually try
> > to help one another.
> >
> > Just an observation...
>
> I hope you aren't referring to my posts. I try to avoid confrontation,
> but know I push Hugh's buttons sometimes. I also get into "discussions"
> with Bernd sometimes that are less than productive. I was surprised
> that a recent interaction with Hugh started to go in a semi-useful
> direction for a moment. If I am crossing lines I would like for someone
> to tell me, but I expect it is hard to know where the line is in this
> group. At least it isn't s.e.d where anything goes, including
> references to Nazis, lol.
> --

Posted about this earlier. Just feel this group needs to be less egotistic and more welcoming. Most of the discussions gravitate towards shouting matches.

-Yvo

rickman

unread,
Sep 4, 2015, 12:52:49 PM9/4/15
to
I'm not sure why that would be. Perhaps I am not familiar with some of
the VLIW architectures, but to me it allows maximum parallelism. I
worked on an array processor which had four ALUs (two add/sub and two
mul plus a div/sq root which was multicycle so they didn't count it in
the four) and a storage move processor (SMP, very fancy DMA) all
controlled by 100+ bit control words. They called it "microcode" but it
was the assembly language for this machine and the only way to program
it. These days it would be called a VLIW machine, no? The point was to
have total control over all the operations of the machine to get maximum
parallelism with no restrictions. There were no idle units unless you
just plain had no use for them and every CPU has that all the time.


> Posted about this earlier. Just feel this group needs to be less
> egotistic and more welcoming. Most of the discussions gravitate
> towards shouting matches.

I WELCOME YOUR IDEAS! ;)

--

Rick

Mux

unread,
Sep 4, 2015, 1:20:43 PM9/4/15
to
> I'm not sure why that would be. Perhaps I am not familiar with some of
> the VLIW architectures, but to me it allows maximum parallelism. I
> worked on an array processor which had four ALUs (two add/sub and two
> mul plus a div/sq root which was multicycle so they didn't count it in
> the four) and a storage move processor (SMP, very fancy DMA) all
> controlled by 100+ bit control words. They called it "microcode" but it
> was the assembly language for this machine and the only way to program
> it. These days it would be called a VLIW machine, no? The point was to
> have total control over all the operations of the machine to get maximum
> parallelism with no restrictions. There were no idle units unless you
> just plain had no use for them and every CPU has that all the time.
>
Yeah, that's VLIW alright. The Nuon had 4 CPUs and relied heavily on DMA, as they only had tiny caches (3 CPUs had 4K and 1 of them had 8K) with external SDRAM for filling them. Each VLIW word was 128 bits wide and could process 7 instructions in parallel, which, times 4, gives you a theoretical 28 instructions/clock :-)

Anyway, you can read more about it here:

http://cdn.arstechnica.net/wp-content/uploads/2015/06/1383659488_NUON-Architecture-26-OEM.pdf

> I WELCOME YOUR IDEAS! ;)
>

Great! I'm all for healthy discussions and what not as long as they're constructive :-)

hughag...@gmail.com

unread,
Sep 4, 2015, 2:15:59 PM9/4/15
to
According to Google, treacle is defined as: "cloying sentimentality or flattery."

I used to say that ANS-Forth was a castle built on sand --- but now I would describe it as a castle built on treacle!

The Elizabeth Rather cult were the architects of their own ruin, and for mortar they used treacle...

rickman

unread,
Sep 4, 2015, 2:32:36 PM9/4/15
to
Mux, how would you recommend handling Beetlejuice?

--

Rick

Mux

unread,
Sep 4, 2015, 2:36:42 PM9/4/15
to
> Mux, how would you recommend handling Beetlejuice?

I might be somewhat out of the loop.. Who's considered Beetlejuice?

-Mux

rickman

unread,
Sep 4, 2015, 5:46:26 PM9/4/15
to
On 9/4/2015 2:36 PM, Mux wrote:
>> Mux, how would you recommend handling Beetlejuice?
>
> I might be somewhat out of the loop.. Who's considered Beetlejuice?

It's my pet name for Hugh. One way to get him to post is to mention his
name. Not that I am trying to do that...

--

Rick