I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.
Does anyone here have a feel for the correspondence between Forth source primitives and generated code?
This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.
> I've been thinking about Novix style processors like the
> RTX2000. There are many Forth sequences that can be compacted into
> one instruction, so with a good optimizer the chip can execute
> several Forth (source) primitives in one machine cycle. I suspect
> though that such optimization opportunities are the exception rather
> than the rule.
It happens a lot because many phrases are things like OVER + and
R> DROP . Also, ; pairs with just about everything. Novix had more
of these phrases than RTX2000 because of the way its encoding was
done.
On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.
Besides allowing several Forth commands inside one 16 bit RTX2000 word, a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one 16 bit RTX2000 word! Imagine what additional possibilities you can have with 32 bit words!
On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!
Hmmm... I'm not following you. What is the function/significance of
this special bit? Where is it set? In the call instruction or the
return instruction?
On Monday, October 22, 2012 3:10:29 PM UTC-4, M.R.W Wills wrote:
> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
>> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!
> Hmmm... I'm not following you. What is the function/significance of this special bit? Where is it set? In the call instruction or the return instruction?
There is no special return instruction. Return is achieved by one bit only:
For all instructions except Subroutine Calls or Branch instructions, bit 5 of the instruction code represents the Subroutine Return Bit. If this bit is set to 1, a Return is performed whereby the return address is popped from the Return Stack.
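A minimal sketch of that behaviour in Python, assuming a word-addressed program counter. Only the bit position (bit 5) comes from the description above; the `is_call_or_branch` flag is a hypothetical stand-in for the real instruction-class decode, which is not spelled out here.

```python
RETURN_BIT = 1 << 5  # bit 5, per the description above


def has_return(instr: int, is_call_or_branch: bool) -> bool:
    """True if this 16-bit instruction word also performs a return.

    `is_call_or_branch` is passed in by the caller because the class
    encoding is not described here (hypothetical placeholder).
    """
    if is_call_or_branch:
        return False  # bit 5 is not a return bit in calls and branches
    return bool(instr & RETURN_BIT)


def next_pc(instr: int, pc: int, rstack: list, is_call_or_branch: bool = False) -> int:
    """Pop the return address if the return bit is set, else fall through."""
    if has_return(instr, is_call_or_branch):
        return rstack.pop()
    return pc + 1  # assumption: one instruction per address
```

The point is that the return costs no extra instruction word: it rides along with whatever the rest of the word does.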
On Mon, 22 Oct 2012 12:10:28 -0700 (PDT), Mark Wills wrote:
> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
>> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!
> Hmmm... I'm not following you. What is the function/significance of
> this special bit? Where is it set? In the call instruction or the
> return instruction?
There is no return instruction, but any instruction may have its return bit
set. : DOUBLE DUP + ; may be one instruction word consisting of DUP, plus, and
the return bit.
> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
>> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!
> Hmmm... I'm not following you. What is the function/significance of
> this special bit? Where is it set? In the call instruction or the
> return instruction?
<this post turned out to be a bit longer than I expected...>
Stack machines are inherently pretty simple because of the limited nature of registers and the limited instruction set typically implemented. I did one about ten years ago and found it really consists of three execution units, all of which can operate in parallel.
1) The Instruction Unit contains the instruction memory with the address generation (instruction fetch) as well as the decode (although the decode could also be spread among the units).
2) The Address Unit contains the return stack with stack pointer and the various logic that manipulates elements on the stack, such as auto-increment of addresses and loop counter functions. The address for main memory is provided by this unit in my machine.
3) The Data Unit contains the data stack with stack pointer, the ALU and any other special logic for operations on the data stack. I also included the memory interface in this unit even though the address comes from the Address Unit.
Each of these three units is present in any dual stack CPU design. Any given instruction may involve any combination of the three units or may leave some idle. If instructions leave any unit idle it can be combined with another instruction that uses just that unit in a compatible way. For example, as others have indicated, any instruction that is not using the Address Unit, e.g. ADD, SUB, etc. can execute an instruction that only uses the Address Unit, e.g. RET, LOOP, etc. Both instructions have to be using the Instruction Unit in a compatible way which is typically not hard since most instructions are doing the equivalent of NEXT (or NOP in other terms).
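A rough sketch of that compatibility rule in Python. The table of which unit each instruction uses is made up for illustration (it is not the actual opcode map of my design); only the rule itself, no two instructions may claim the same unit and at most one may do something other than NEXT in the Instruction Unit, follows the description above.

```python
# name: (uses Data Unit, uses Address Unit, Instruction Unit action)
# Hypothetical entries, chosen only to illustrate the rule.
OPS = {
    "ADD":  (True,  False, "next"),
    "SUB":  (True,  False, "next"),
    ">R":   (True,  True,  "next"),
    "RET":  (False, True,  "return"),
    "LOOP": (False, True,  "branch"),
}


def can_combine(a: str, b: str) -> bool:
    """True if instructions a and b could share one instruction word."""
    data_a, addr_a, iu_a = OPS[a]
    data_b, addr_b, iu_b = OPS[b]
    if data_a and data_b:
        return False  # both need the Data Unit
    if addr_a and addr_b:
        return False  # both need the Address Unit
    # The Instruction Unit can only do one non-NEXT thing per cycle.
    return iu_a == "next" or iu_b == "next"
```

With this table, `can_combine("ADD", "RET")` holds, while `can_combine(">R", "RET")` does not because both touch the return stack.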
The only issue with such combining of instructions is the instruction encoding. I encoded to minimize the amount of program space needed which resulted in a minimum width instruction optimized per Koopman's data for instruction frequency. I looked at separating the opcodes for each unit in essence making the MISC equivalent of a VLIW processor, if you can have such a thing... lol I couldn't quite squeeze the instruction into a 9 bit word which is a memory width commonly available in FPGAs. I am now looking at using an FPGA that only supports 16 bit wide memory and don't like the idea of multiplexing the instructions, that just adds a level of logic to the timing path. A wider instruction might just minimize the decode logic and provide for more instruction parallelism at the same time.
Another way of paralleling instructions is to just pick unused opcodes and implement the most common parallel instructions. This won't improve timing of the design and may worsen it a bit, but will provide for fewer instructions in a program.
The obvious one, return as a separate bit, in parallel with everything, can only be used with about half the instructions in my CPU design. The return stack is used in a lot of them. But if the bit is free...
Info that would be VERY useful to me is frequency of use of instructions in combination. This could be pulled from existing code by measuring how often instructions are found adjacent to each other. This may not be a perfect measure, but it would be a great start. Can anyone generate a metric on this similar to Koopman's data on instruction use?
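Something along these lines would be a starting point, as a Python sketch. It only counts static adjacency in source text, and it assumes whitespace-separated tokens with no comments or strings to skip. The `primitives` set is whatever list of words you care about, not a standard one.

```python
from collections import Counter


def pair_frequencies(source: str, primitives: set) -> Counter:
    """Count how often two primitives appear next to each other in source."""
    tokens = source.split()
    pairs = Counter()
    for a, b in zip(tokens, tokens[1:]):
        if a in primitives and b in primitives:
            pairs[(a, b)] += 1
    return pairs


if __name__ == "__main__":
    src = ": A OVER + ; : B R> DROP ; : C OVER + SWAP ;"
    prims = {"OVER", "+", "R>", "DROP", "SWAP", ";"}
    for pair, n in pair_frequencies(src, prims).most_common():
        print(n, *pair)
```

Static counts say nothing about how often a pair actually executes, so a dynamic trace would be the better measure if one is available.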
On Monday, October 22, 2012 10:53:24 AM UTC-5, Brad Eckert wrote:
> Hi All,
> I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.
> Does anyone here have a feel for the correspondence between Forth source primitives and generated code?
> This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.
Are you sure you're not confusing machine cycles with address fetches? Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (i.e., they execute in sequence). So it saves memory space but doesn't really execute 3 times faster. Maybe I'm not understanding what you mean by machine cycle.

I'm a huge fan of James Bowman's J1, but I don't believe it packs instructions in memory at all. It simply executes them very fast through the use of dual port memory. I think of the B16 and J1 as two sides of the same coin. The B16 is very memory efficient, but slower than the J1. The J1 is very fast, but not as memory efficient as the B16. The B16 is programmed directly in Forth, using about 32 built-in words. The J1 is similar but I believe has a few more possible Forth words as primitive instructions.
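To make the fetch-versus-cycle distinction concrete, here is a small Python sketch. The layout (three 5-bit slots in the low 15 bits of a 16-bit word) is an assumption for illustration only, based on "something like 3 instructions" and "about 32 built-in words"; it is not taken from the B16 documentation.

```python
SLOT_BITS = 5            # assumption: 32 opcodes -> 5 bits per slot
SLOT_SHIFTS = (10, 5, 0)  # assumption: three slots, highest first


def unpack_slots(word: int) -> list:
    """Split one memory word into its packed opcodes."""
    mask = (1 << SLOT_BITS) - 1
    return [(word >> shift) & mask for shift in SLOT_SHIFTS]


def cost(words: list) -> tuple:
    """Return (memory fetches, execution cycles) for a packed program.

    Each word costs one fetch, but every slot still takes its own cycle.
    """
    fetches = len(words)
    cycles = sum(len(unpack_slots(w)) for w in words)
    return fetches, cycles
```

Under these assumptions, two words cost 2 fetches but 6 cycles: the packing saves memory bandwidth, not execution time.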
On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:
> Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.
The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.
On Monday, October 22, 2012 4:14:19 PM UTC-5, visua...@rocketmail.com wrote:
> On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:
> > Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.
> The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.
Ok, sorry. I guess the compiler must pick instructions that have no interdependency to pack together. I'm not familiar with the RTX2000 itself.
On Monday, October 22, 2012 5:26:20 PM UTC-4, daveyrotten wrote:
> On Monday, October 22, 2012 4:14:19 PM UTC-5, visua...@rocketmail.com wrote:
> > On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:
> > > Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.
> > The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.
> Ok, sorry. I guess the compiler must pick instructions that have no interdependency to pack together. I'm not familiar with the RTX2000 itself.
> I've been thinking about Novix style processors like the RTX2000.
> There are many Forth sequences that can be compacted into one
> instruction, so with a good optimizer the chip can execute several
> Forth (source) primitives in one machine cycle. I suspect though
> that such optimization opportunities are the exception rather than the
> rule.
The question for both you (and Rick) is if you create new, faster, more
powerful, multiple operation instructions, how do you ensure they are used?
Without an optimizer, it's likely the instruction will have a low
instruction frequency. I.e., a person is unlikely to use it. In which
case, there is no point in using or implementing it.
(This is repeated later in a reply to Rick.)
> Does anyone here have a feel for the correspondence between Forth
> source primitives and generated code?
Forths built using "primitives" or low-level words generally need 30 to
40 or so of them. I kept track of how many are needed for certain
Forths. There are a few posts by me to c.l.f. with counts and specific
words used.
> On 10/22/2012 3:10 PM, Mark Wills wrote:
> > On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
> >> Despite allowing several Forth commands inside one RTX2000 16 bit
> >> word a great advantage is the ability to set a special Return from
> >> Subroutine bit - that is a real speed accelerator - so a whole
> >> executable program may need only one RTX2000 16 bit word!
> >> Imagine what additional possibilities you can have with 32 bit words!
> > Hmmm... I'm not following you. What is the function/significance of
> > this special bit? Where is it set? In the call instruction or the
> > return instruction?
> <this post turned out to be a bit longer than I expected...>
I'm versed in (old) microprocessor design. So, that's easily taken care of:
[SNIP]
> The only issue with such combining of instructions is the instruction
> encoding. I encoded to minimize the amount of program space needed
> which resulted in a minimum width instruction optimized per Koopman's
> data for instruction frequency. I looked at separating the opcodes for
> each unit in essence making the MISC equivalent of a VLIW processor, if
> you can have such a thing... lol I couldn't quite squeeze the
> instruction into a 9 bit word which is a memory width commonly available
> in FPGAs. [...]
Forths built using "primitives" generally need 30 to 40 or so, i.e.,
5 or 6 bits. Why do you need 9 bits (512)? I'd guess that you're encoding
things other than just the instruction, e.g., control-bits, offsets, modes,
etc.
Koopman's and Ertl's instruction frequency data is basically the same.
Using their data is a good choice though.
(This is also posted earlier in the thread:)
The question for both you (and Brad) is if you create new, faster, more
powerful, multiple operation instructions, how do you ensure they are used?
Without an optimizer, it's likely the instruction will have a low
instruction frequency. I.e., a person is unlikely to use it. In which
case, there is no point in using or implementing it.
> The obvious one, return as a separate bit, in parallel with everything,
> can only be used with about half the instructions in my CPU design. The
> return stack is used in a lot of them. But if the bit is free...
Years ago, there was a processor that used a few bits, like two, for
conditional execution of each instruction. I don't recall what it was, or
if it was a Forth processor. It might've been a bit-slice design...
> Info that would be VERY useful to me is frequency of use of instructions
> in combination. This could be pulled from existing code by measuring
> how often instructions are found adjacent to each other. This may not
> be a perfect measure, but it would be a great start. Can anyone
> generate a metric on this similar to Koopman's data on instruction use?
Anton Ertl also has instruction frequency data. I don't recall if he showed
combinations or not. I know he or someone created the concept of "super
operators" for Forth, which I think is what you're asking about. If he
doesn't respond, I'll attempt to locate for you what I previously found.
On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.
The trick is to make the right use of the available bits.
The RTX2000 uses two important tricks to make it run fast:
First of all, one bit switches between two types of commands: general commands and subroutine calls. This bit could be bit zero. If bit zero is zero, the other bits form the address of the subroutine to be called. That's obvious, because you don't need odd addresses for subroutines. But the RTX2000 uses bit 15 and shifts the address to fit.
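Both variants can be sketched in Python as follows. The bit-zero variant follows the description above directly. For the bit-15 variant, the polarity (bit 15 clear means call) and the one-bit shift of an even byte address are assumptions made for the sketch.

```python
def encode_call_bit0(addr: int) -> int:
    """Bit-zero scheme: an even 16-bit address is itself the call word."""
    if addr & 1 or not 0 <= addr < 0x10000:
        raise ValueError("call target must be an even 16-bit address")
    return addr  # bit 0 == 0 marks the word as a call


def decode_bit0(word: int) -> tuple:
    """Classify a word under the bit-zero scheme."""
    if word & 1 == 0:
        return ("call", word)
    return ("op", word >> 1)  # remaining 15 bits are the general command


def encode_call_bit15(addr: int) -> int:
    """Bit-15 scheme: shift the address so bit 15 is free as the flag."""
    if addr & 1 or not 0 <= addr < 0x10000:
        raise ValueError("call target must be an even 16-bit address")
    return addr >> 1  # assumption: bit 15 clear means call


def decode_bit15(word: int) -> tuple:
    """Classify a word under the bit-15 scheme."""
    if word & 0x8000 == 0:
        return ("call", (word & 0x7FFF) << 1)
    return ("op", word & 0x7FFF)
```

Either way a call needs no opcode field at all, which leaves 15 bits for the target address.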
Second, the aforementioned trick with the return from subroutine bit.
Both of these improve speed directly.
The remaining bits should be used to encode all the Forth primitives which are needed. I am sure you don't need usage frequency data, because the Forth OS itself needs all of them.
If there are enough bits, it will be possible to use these to run several primitives in one clock cycle - RISCs normally use only one clock cycle per command, the RTX2000 method allows up to four commands run per one clock cycle - it's some kind of Super-RISC. These commands running in one clock cycle should generate the next level of primitives. And of course there is the possibility to use some bits to switch between different kinds of decoding. The more bits, the more possibilities there are, and more commands can be made to run in one clock cycle, not necessarily in parallel.
On Oct 22, 6:48 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:
> Years ago, there was a processor that used a few bits, like two, for
> conditional execution of each instruction. I don't recall what it was, or
> if it was a Forth processor. It might've been a bit-slice design...
Hugh Aguilar <hughaguila...@yahoo.com> wrote:
> On Oct 22, 6:48 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
> wrote:
>> Years ago, there was a processor that used a few bits, like two, for
>> conditional execution of each instruction. I don't recall what it was, or
>> if it was a Forth processor. It might've been a bit-slice design...
> Isn't that the way that the ARM works?
It was, but it's mostly been dropped in ARM 64 because "Benchmarking
shows that modern branch predictors work well enough that predicated
execution of instructions does not offer sufficient benefit to justify
its significant use of opcode space, and its implementation cost in
advanced implementations."
On Mon, 22 Oct 2012 08:53:22 -0700 (PDT), Brad Eckert
<hwfw...@gmail.com> wrote:
>I've been thinking about Novix style processors like the RTX2000. There are
> many Forth sequences that can be compacted into one instruction, so with a
> good optimizer the chip can execute several Forth (source) primitives in
> one machine cycle. I suspect though that such optimization opportunities are
> the exception rather than the rule.
rickman <gnu...@gmail.com> writes:
>Info that would be VERY useful to me is frequency of use of instructions
>in combination. This could be pulled from existing code by measuring
>how often instructions are found adjacent to each other. This may not
>be a perfect measure, but it would be a great start. Can anyone
>generate a metric on this similar to Koopman's data on instruction use?
Andrew Haley <andre...@littlepinkcloud.invalid> wrote:
> Hugh Aguilar <hughaguila...@yahoo.com> wrote:
>>> Years ago, there was a processor that used a few bits, like two, for
>>> conditional execution of each instruction.
>> Isn't that the way that the ARM works?
> It was, but it's mostly been dropped in ARM 64 because "Benchmarking
> shows that modern branch predictors work well enough that predicated
> execution of instructions does not offer sufficient benefit to justify
> its significant use of opcode space, and its implementation cost in
> advanced implementations."
FWIW, the Propeller CPU has conditional execution too. I did a fair amount
of hand-coding of assembly for it, and came away not liking its instruction
set very much at all. MIPS is my all-time favorite, but I'd even take 32-bit
x86 over Propeller.
> "rickman"<gnu...@gmail.com> wrote in message
> news:k64bj1$m4n$1@dont-email.me...
>> On 10/22/2012 3:10 PM, Mark Wills wrote:
>>> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
>>>> Despite allowing several Forth commands inside one RTX2000 16 bit
>>>> word a great advantage is the ability to set a special Return from
>>>> Subroutine bit - that is a real speed accelerator - so a whole
>>>> executable program may need only one RTX2000 16 bit word!
>>>> Imagine what additional possibilities you can have with 32 bit words!
>>> Hmmm... I'm not following you. What is the function/significance of
>>> this special bit? Where is it set? In the call instruction or the
>>> return instruction?
>> <this post turned out to be a bit longer than I expected...>
> I'm versed in (old) microprocessor design. So, that's easily taken care of:
> [SNIP]
>> The only issue with such combining of instructions is the instruction
>> encoding. I encoded to minimize the amount of program space needed
>> which resulted in a minimum width instruction optimized per Koopman's
>> data for instruction frequency. I looked at separating the opcodes for
>> each unit in essence making the MISC equivalent of a VLIW processor, if
>> you can have such a thing... lol I couldn't quite squeeze the
>> instruction into a 9 bit word which is a memory width commonly available
>> in FPGAs. [...]
> Forth's built using "primitives" generally need 30 to 40 or so, i.e.,
> 5-bits. Why do you need 9-bits (512)? I'd guess that you're encoding
> things other than just the instruction, e.g., control-bits, offsets, modes,
> etc.
I was constrained by the memories available in FPGAs. Many allow somewhat flexible word widths of 1, 2, 4, 8, 9, 16 and 18 bits, some even 32 and 36 bits. Multiplexing multiple instructions in one word has a downside in that it requires extra levels of logic in the instruction decode path which affects *all* instructions. I wanted to avoid that.
So I started with a 4 bit instruction and found that rather limiting, mainly in the impact on performance since most code is around twice as long as it could be with larger instructions. Literals (both data and address) were especially problematic. Looking at Koopman's data it was clear that anything which could optimize the address fields of calls and other instructions, including immediate data, would be a boon.
So I tried 8 bit words and used a variable bit width instruction with the remaining bits as immediate data. This was combined with a data extension scheme similar to that used by the Transputer. They used 4 bit instructions with 4 bits of immediate data which would be shifted into larger words. Since the literal would be the most commonly used instruction I gave it a one bit opcode with 7 bits of immediate data. The first invocation of a literal instruction pushes the 7 bit data, sign extended, onto the return stack. Each subsequent invocation of the literal instruction shifts 7 more bits into the top of return stack. Calls and Jumps have a four bit field which is combined with the top of return stack if a literal has been pushed, or just sign extended if not.
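The literal part of that scheme, sketched in Python. The class and method names are invented for the sketch, Python integers stand in for the (extensible) machine word so no width masking is shown, and the call/jump field is left out because how it is "combined" is not spelled out above.

```python
def sign_extend7(x: int) -> int:
    """Interpret the low 7 bits of x as a two's complement value."""
    x &= 0x7F
    return x - 0x80 if x & 0x40 else x


class LiteralBuilder:
    """Hypothetical model of the literal-prefix behaviour."""

    def __init__(self):
        self.rstack = []
        self.literal_open = False  # was the previous instruction a literal?

    def literal(self, data7: int) -> None:
        if self.literal_open:
            # Subsequent literal: shift 7 more bits into top of return stack.
            self.rstack[-1] = (self.rstack[-1] << 7) | (data7 & 0x7F)
        else:
            # First literal: push the sign-extended 7-bit value.
            self.rstack.append(sign_extend7(data7))
            self.literal_open = True

    def other(self) -> None:
        """Any non-literal instruction ends the current literal."""
        self.literal_open = False
```

Small values cost one byte, and each extra byte adds 7 bits, so short addresses and constants stay cheap.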
There remains some 16 opcodes for the various instructions for manipulating data. The 8 bit machine was used in one design.
I considered a 9 bit version to fully utilize the block RAM in most FPGAs. The immediate data fields were extended by one bit, which I think was significant for jumps and calls (5 bits vs. 4). In the case of general opcodes the extra bit could be used to provide 32 instructions rather than 16, but I didn't feel this gave much benefit and it complicated the instruction decode, which was already more complex than I preferred. Another alternative was to use the extra bit to flag a combined Return instruction. I found it could only be used with about half the opcodes I was using because of conflicts.
> Koopman's and Ertl's instruction frequency data is basically the same.
> Using their data is a good choice though.
It is a LOT better than no data at all which is what I have otherwise.
> (This is also posted earlier in the thread:)
> The question for both you (and Brad) is if you create new, faster, more
> powerful, multiple operation instructions, how do you ensure they are used?
> Without an optimizer, it's likely the instruction will have a low
> instruction frequency. I.e., a person is unlikely to use it. In which
> case, there is no point in using or implementing it.
Who is this "person"? My design was for me and if I generated enough code to analyze statistically, I would pick the instructions to combine from analyzing my code. It's not like I am selling this design for others to use... not that I would mind sharing, but you would have to read my mind for many of the details. One person asked and I gave him my block diagram with labeled control points and my opcode cheat sheet. He couldn't make heads or tails out of it... lol
>> The obvious one, return as a separate bit, in parallel with everything,
>> can only be used with about half the instructions in my CPU design. The
>> return stack is used in a lot of them. But if the bit is free...
> Years ago, there was a processor that used a few bits, like two, for
> conditional execution of each instruction. I don't recall what it was, or
> if it was a Forth processor. It might've been a bit-slice design...
I've heard of that as well as other "unique" features. An ancient Univac machine had a bit in the address field that flagged indirect. The address fetched had the same bit... they had to add an indirect counter to get out of the infinite loops that could happen.
>> Info that would be VERY useful to me is frequency of use of instructions
>> in combination. This could be pulled from existing code by measuring
>> how often instructions are found adjacent to each other. This may not
>> be a perfect measure, but it would be a great start. Can anyone
>> generate a metric on this similar to Koopman's data on instruction use?
> Anton Ertl also has instruction frequency data. I don't recall if he showed
> combinations or not. I know he or someone created the concept of "super
> operators" for Forth, which I think is what you're asking about. If he
> doesn't respond, I'll attempt to locate for you what I previously found.
That would be greatly interesting. I'm surprised I didn't notice this before. I wish there was some market for a machine like this, but then others would have done this before me. I know Bernd would have been all over this years ago if the market existed as well as others.
It seems that if the CPU isn't pipelined and blazing fast, it isn't interesting to most FPGA users. They prefer very high performance CPUs that access MBs of external memory and use 1000's of LUTs. My design has an extensible word size, but will likely never have a C compiler for it.
That reminds me of the ZPU. It is a stack machine designed to run C. It was also designed to be as tiny as possible in the minimal configuration with other versions running faster but using more resources. They ported the gcc tools for it. I think the minimal version is slightly smaller than my design, but very slow, maybe 10x... or would that be 10/?
> rickman<gnu...@gmail.com> writes:
>> Info that would be VERY useful to me is frequency of use of instructions
>> in combination. This could be pulled from existing code by measuring
>> how often instructions are found adjacent to each other. This may not
>> be a perfect measure, but it would be a great start. Can anyone
>> generate a metric on this similar to Koopman's data on instruction use?
> "Brad Eckert"<hwfw...@gmail.com> wrote in message
> news:e1479cfa-a969-40ab-b20c-096d82601657@googlegroups.com...
>> I've been thinking about Novix style processors like the RTX2000.
>> There are many Forth sequences that can be compacted into one
>> instruction, so with a good optimizer the chip can execute several
>> Forth (source) primitives in one machine cycle. I suspect though
>> that such optimization opportunities are the exception rather than the
>> rule.
> The question for both you (and Rick) is if you create new, faster, more
> powerful, multiple operation instructions, how do you ensure they are used?
> Without an optimizer, it's likely the instruction will have a low
> instruction frequency. I.e., a person is unlikely to use it. In which
> case, there is no point in using or implementing it.
> (This is repeated later in a reply to Rick.)
We aren't talking about Forth coding really. We are talking about the assembly language for a machine. I don't think instructions will go unused just because they are mapped to Forth in a more complicated way than 1 to 1 (or 1/2 to 1).
>> Does anyone here have a feel for the correspondence between Forth
>> source primitives and generated code?
> Generally, Forth's built using "primitives" or low-level words generally
> need 30 to 40 or so. I kept track of how many are needed for certain
> Forths. There are a few posts by me to c.l.f. with counts and specific
> words used.
Don't confuse Forth low level primitives (which are really HLL primitives selected to be convenient for the programmer writing a Forth) and assembly language which has to be selected in part based on what is practical and efficient to implement. Chuck's machine only uses 32 opcodes and you can get by with as few as 16.
On 10/22/2012 10:44 PM, visualfo...@rocketmail.com wrote:
> On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
>> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.
> The trick is to make the right use of the available bits.
> The RTX2000 uses two important tricks to make it run fast:
> First of all, one bit switches between two types of commands: general commands and subroutine calls. This bit could be bit zero. If bit zero is zero, the other bits form the address of the subroutine to be called. That's obvious, because you don't need odd addresses for subroutines. But the RTX2000 uses bit 15 and shifts the address to fit.
> Second, the aforementioned trick with the return from subroutine bit.
> These both accomplish speed first hand.
> The remaining bits should be used to decode all Forth primitives which are needed. I am sure you don't need a frequency of calling, because the Forth OS itself needs all of them.
With a 16 bit or larger instruction, you have enough bits to "encode" each execution unit in a dual stack machine separately. Then you don't need a separate bit for the return operation. It can't be used in parallel with any other Instruction Unit operation or a Return Stack operation since both of these are used to do a return. It can only be used in parallel with a purely Data Stack operation.
The next time I look at a design on an FPGA I will take a look at a 16 or 18 bit instruction word that encodes the execution units operations separately. Part of the problem is that 16 is too many! What to do with the remainder?
> If there are enough bits, it will be possible to use these to run several primitives in one clock cycle - RISCs normally use only one clock cycle per command, the RTX2000 method allows up to four commands run per one clock cycle - it's some kind of Super-RISC. These commands running in one clock cycle should generate the next level of primitives. And of course there is the possibility to use some bits to switch between different kinds of decoding.. The more bits, the more possibilities there are, and more commands can be made to run in one clock cycle, not necessarily in parallel.
Primitives can only be run together if they don't conflict. That's why I want to look at which primitives occur together in the code.
On Tuesday, October 23, 2012 5:47:32 PM UTC-4, rickman wrote:
> > On 10/22/2012 10:44 PM, visualforth.com wrote:
> > The RTX2000 uses two important tricks to make it run fast:
> > Second, the aforementioned trick with the return from subroutine bit.
> With a 16 bit or larger instruction, you have enough bits to "encode" each
> execution unit in a dual stack machine separately. Then you don't need a
> separate bit for the return operation. It can't be used in parallel with any
> other Instruction Unit operation or a Return Stack operation since both of
> these are used to do a return. It can only be used in parallel with a purely
> Data Stack operation.
On 10/23/2012 7:00 PM, visualfo...@rocketmail.com wrote:
> On Tuesday, October 23, 2012 5:47:32 PM UTC-4, rickman wrote:
>>> On 10/22/2012 10:44 PM, visualforth.com wrote:
>>> The RTX2000 uses two important tricks to make it run fast:
>>> Second, the aforementioned trick with the return from subroutine bit.
>> With a 16 bit or larger instruction, you have enough bits to "encode" each
>> execution unit in a dual stack machine separately. Then you don't need a
>> separate bit for the return operation. It can't be used in parallel with any
>> other Instruction Unit operation or a Return Stack operation since both of
>> these are used to do a return. It can only be used in parallel with a purely
>> Data Stack operation.
> If you say so ....
I don't understand. Are you agreeing or passively saying you don't agree?