RTX2000 optimization
Newsgroups: comp.lang.forth
From: Brad Eckert <hwfw...@gmail.com>
Date: Mon, 22 Oct 2012 08:53:22 -0700 (PDT)
Local: Mon, Oct 22 2012 11:53 am
Subject: RTX2000 optimization
Hi All,

I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.

Does anyone here have a feel for the correspondence between Forth source primitives and generated code?

This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.


 
Newsgroups: comp.lang.forth
From: Andrew Haley <andre...@littlepinkcloud.invalid>
Date: Mon, 22 Oct 2012 11:21:12 -0500
Local: Mon, Oct 22 2012 12:21 pm
Subject: Re: RTX2000 optimization

Brad Eckert <hwfw...@gmail.com> wrote:
> Hi All,

> I've been thinking about Novix style processors like the
> RTX2000. There are many Forth sequences that can be compacted into
> one instruction, so with a good optimizer the chip can execute
> several Forth (source) primitives in one machine cycle. I suspect
> though that such optimization opportunities are the exception rather
> than the rule.

It happens a lot because many phrases are things like OVER + and
R> DROP .  Also, ; pairs with just about everything.  Novix had more
of these phrases than RTX2000 because of the way its encoding was
done.

Andrew.


 
Newsgroups: comp.lang.forth
From: visualfo...@rocketmail.com
Date: Mon, 22 Oct 2012 09:56:29 -0700 (PDT)
Local: Mon, Oct 22 2012 12:56 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
> Hi All,

> I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.

> Does anyone here have a feel for the correspondence between Forth source primitives and generated code?

> This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

Besides allowing several Forth commands inside one 16-bit RTX2000 word, a great advantage is the ability to set a special Return from Subroutine bit. That is a real speed accelerator, so a whole executable program may need only one 16-bit RTX2000 word! Imagine what additional possibilities you could have with 32-bit words!

 
Newsgroups: comp.lang.forth
From: Mark Wills <forthfr...@gmail.com>
Date: Mon, 22 Oct 2012 12:10:28 -0700 (PDT)
Local: Mon, Oct 22 2012 3:10 pm
Subject: Re: RTX2000 optimization
On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:

> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!

Hmmm... I'm not following you. What is the function/significance of
this special bit? Where is it set? In the call instruction or the
return instruction?

 
Newsgroups: comp.lang.forth
From: visualfo...@rocketmail.com
Date: Mon, 22 Oct 2012 12:55:11 -0700 (PDT)
Local: Mon, Oct 22 2012 3:55 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 3:10:29 PM UTC-4, M.R.W Wills wrote:
> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
> > Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!

> Hmmm... I'm not following you. What is the function/significance of this special bit? Where is it set? In the call instruction or the return instruction?

There is no special return instruction. Return is achieved by one bit only:

For all instructions except Subroutine Calls or Branch instructions, bit 5 of the instruction code represents the Subroutine Return Bit. If this bit is set to 1, a Return is performed whereby the return address is popped from the Return Stack.

Source: Intersil HS-RTX2010RH Data Sheet March 2000 File Number 3961.3, p. 28
http://www.intersil.com/content/dam/Intersil/documents/fn39/fn3961.pdf

See also: "Subroutine Return Bit", ibid., p. 31; Harris RTX2000 Advance Information, May 1988, p. 16.
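
As a rough illustration of the data sheet sentence quoted above, the following Python sketch tests and sets bit 5 of a 16-bit instruction word. The function names are made up for this example, the "bit 15 clear means call" test is borrowed from a later description in this thread, and branch detection is left to the caller because the branch encoding is not given here. The word 0xA000 used in the note below is an arbitrary placeholder, not a real RTX2000 opcode.

```python
RETURN_BIT = 1 << 5   # bit 5: Subroutine Return Bit (per the data sheet quote)
CALL_BIT = 1 << 15    # assumption: bit 15 clear marks a subroutine call


def is_call(word: int) -> bool:
    """Toy test for a subroutine call: bit 15 is clear."""
    return (word & CALL_BIT) == 0


def returns_after(word: int, is_branch: bool = False) -> bool:
    """True if this instruction also performs a subroutine return.

    Calls and branches never carry the return bit; for them bit 5 is
    simply part of the address or offset field.
    """
    if is_call(word) or is_branch:
        return False
    return (word & RETURN_BIT) != 0


def with_return(word: int) -> int:
    """Fold a trailing ';' into a non-call instruction by setting bit 5."""
    if is_call(word):
        raise ValueError("a call cannot carry the return bit")
    return word | RETURN_BIT
```

With these definitions, with_return(0xA000) gives 0xA020, and returns_after() reports True for that word but False for a call word that merely happens to have bit 5 set in its address field.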


 
Newsgroups: comp.lang.forth
From: Coos Haak <chfo...@hccnet.nl>
Date: Mon, 22 Oct 2012 22:02:13 +0200
Local: Mon, Oct 22 2012 4:02 pm
Subject: Re: RTX2000 optimization
Op Mon, 22 Oct 2012 12:10:28 -0700 (PDT) schreef Mark Wills:

> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
>> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!

> Hmmm... I'm not following you. What is the function/significance of
> this special bit? Where is it set? In the call instruction or the
> return instruction?

There is no return instruction, but any instruction may have its return bit
set.   : SQR DUP + ;  may be one instruction word consisting of DUP, plus and
the return bit.

--
Coos

CHForth, 16 bit DOS applications
http://home.hccnet.nl/j.j.haak/forth.html


 
Newsgroups: comp.lang.forth
From: rickman <gnu...@gmail.com>
Date: Mon, 22 Oct 2012 16:50:33 -0400
Local: Mon, Oct 22 2012 4:50 pm
Subject: Re: RTX2000 optimization
On 10/22/2012 3:10 PM, Mark Wills wrote:

> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
>> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!

> Hmmm... I'm not following you. What is the function/significance of
> this special bit? Where is it set? In the call instruction or the
> return instruction?

<this post turned out to be a bit longer than I expected...>

Stack machines are inherently pretty simple because of the limited
nature of registers and the limited instruction set typically
implemented.  I did one about ten years ago and found it really consists
of three execution units, all of which can operate in parallel.

1) The Instruction Unit contains the instruction memory with the address
generation (instruction fetch) as well as the decode (although the
decode could also be spread among the units).

2) The Address Unit contains the return stack with stack pointer and the
various logic that manipulates elements on the stack, such as
auto-increment of addresses and loop counter functions.  The address for
main memory is provided by this unit in my machine.

3) The Data Unit contains the data stack with stack pointer, the ALU and
any other special logic for operations on the data stack.  I also
included the memory interface in this unit even though the address comes
from the Address Unit.

Each of these three units is present in any dual stack CPU design.  Any
given instruction may involve any combination of the three units or may
leave some idle.  If instructions leave any unit idle it can be combined
with another instruction that uses just that unit in a compatible way.
For example, as others have indicated, any instruction that is not using
the Address Unit, e.g. ADD, SUB, etc. can execute an instruction that
only uses the Address Unit, e.g. RET, LOOP, etc.  Both instructions have
to be using the Instruction Unit in a compatible way which is typically
not hard since most instructions are doing the equivalent of NEXT (or
NOP in other terms).
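
The combining rule in the paragraph above can be sketched in Python. This is only a toy model: the Op class, the unit assignments for ADD, SUB, RET and >R, and the "flow" labels are assumptions chosen to mirror the examples in the post, not a description of any real CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    uses_data: bool       # needs the Data Unit (data stack, ALU)
    uses_address: bool    # needs the Address Unit (return stack)
    flow: str = "next"    # what the Instruction Unit does; "next" is the default fetch


# Hypothetical unit usage, loosely following the examples in the post.
ADD = Op("ADD", uses_data=True, uses_address=False)
SUB = Op("SUB", uses_data=True, uses_address=False)
RET = Op("RET", uses_data=False, uses_address=True, flow="return")
TO_R = Op(">R", uses_data=True, uses_address=True)


def can_combine(a: Op, b: Op) -> bool:
    """Two operations may share one instruction if they do not compete for a unit."""
    if a.uses_data and b.uses_data:
        return False
    if a.uses_address and b.uses_address:
        return False
    # The Instruction Unit is compatible when at least one side only does NEXT.
    return a.flow == "next" or b.flow == "next"
```

Under these assumptions can_combine(ADD, RET) holds, while ADD with SUB (both need the Data Unit) and >R with RET (both need the Address Unit) do not combine.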

The only issue with such combining of instructions is the instruction
encoding.  I encoded to minimize the amount of program space needed
which resulted in a minimum width instruction optimized per Koopman's
data for instruction frequency.  I looked at separating the opcodes for
each unit in essence making the MISC equivalent of a VLIW processor, if
you can have such a thing... lol  I couldn't quite squeeze the
instruction into a 9 bit word which is a memory width commonly available
in FPGAs.  I am now looking at using an FPGA that only supports 16 bit
wide memory and don't like the idea of multiplexing the instructions,
that just adds a level of logic to the timing path.  A wider instruction
might just minimize the decode logic and provide for more instruction
parallelism at the same time.

Another way of paralleling instructions is to just pick unused opcodes
and implement the most common parallel instructions.  This won't improve
timing of the design and may worsen it a bit, but will provide for fewer
instructions in a program.

The obvious one, return as a separate bit, in parallel with everything,
can only be used with about half the instructions in my CPU design.  The
return stack is used in a lot of them.  But if the bit is free...

Info that would be VERY useful to me is frequency of use of instructions
in combination.  This could be pulled from existing code by measuring
how often instructions are found adjacent to each other.  This may not
be a perfect measure, but it would be a great start.  Can anyone
generate a metric on this similar to Koopman's data on instruction use?
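
A first cut at such a measurement could look like the Python sketch below. It is deliberately naive and the names are invented for this example: it splits source text on whitespace, folds case, and counts adjacent token pairs, so it also counts colon, semicolon and definition names, and it makes no attempt to skip comments or strings. It measures static adjacency in source text, not how often a pair is executed.

```python
from collections import Counter


def pair_counts(source: str) -> Counter:
    """Count adjacent token pairs in whitespace-separated Forth source."""
    tokens = source.upper().split()
    return Counter(zip(tokens, tokens[1:]))


# Small made-up sample; real input would be a body of existing Forth code.
sample = ": A OVER + ; : B R> DROP ; : C OVER + SWAP ;"
for (first, second), n in pair_counts(sample).most_common(3):
    print(first, second, n)
```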

Rick


 
Newsgroups: comp.lang.forth
From: daveyrotten <danw8...@gmail.com>
Date: Mon, 22 Oct 2012 13:54:33 -0700 (PDT)
Local: Mon, Oct 22 2012 4:54 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 10:53:24 AM UTC-5, Brad Eckert wrote:
> Hi All,

> I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.

> Does anyone here have a feel for the correspondence between Forth source primitives and generated code?

> This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

Are you sure you're not confusing machine cycles with address fetches?  Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (i.e., they execute in sequence). So it saves memory space but doesn't really execute 3 times faster. Maybe I'm not understanding what you mean by machine cycle.

I'm a huge fan of James Bowman's J1, but I don't believe it packs instructions in memory at all. It simply executes them very fast through the use of dual-port memory. I think of the B16 and J1 as two sides of the same coin. The B16 is very memory efficient, but slower than the J1. The J1 is very fast, but not as memory efficient as the B16.  The B16 is programmed directly in Forth, using about 32 built-in words. The J1 is similar, but I believe it has a few more Forth words available as primitive instructions.

 
Newsgroups: comp.lang.forth
From: visualfo...@rocketmail.com
Date: Mon, 22 Oct 2012 14:14:19 -0700 (PDT)
Local: Mon, Oct 22 2012 5:14 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:
> Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.

The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.

 
Newsgroups: comp.lang.forth
From: daveyrotten <danw8...@gmail.com>
Date: Mon, 22 Oct 2012 14:26:19 -0700 (PDT)
Local: Mon, Oct 22 2012 5:26 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 4:14:19 PM UTC-5, visua...@rocketmail.com wrote:
> On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:

> > Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.

> The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.

Ok, sorry. I guess the compiler must pick instructions that have no interdependency to pack together. I'm not familiar with the RTX2000 itself.

 
Newsgroups: comp.lang.forth
From: visualfo...@rocketmail.com
Date: Mon, 22 Oct 2012 16:00:03 -0700 (PDT)
Local: Mon, Oct 22 2012 7:00 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 5:26:20 PM UTC-4, daveyrotten wrote:
> On Monday, October 22, 2012 4:14:19 PM UTC-5, visua...@rocketmail.com wrote:
> > On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:
> > > Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.

> > The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.

> Ok, sorry. I guess the compiler must pick instructions that have no interdependency to pack together. I'm not familiar with the RTX2000 itself.

I am. I did several designs, 1988-1995:
http://www.somersetweb.com/BruehlConsult/Projects/RTX2000-MINI.html
http://www.somersetweb.com/BruehlConsult/Projects/mc-RISC-EMUF.html
http://www.somersetweb.com/BruehlConsult/Projects/S5-4MB-RTX2000.html
http://www.somersetweb.com/BruehlConsult/Projects/S5-F-1MBd-RTX2000.html

 
Newsgroups: comp.lang.forth
From: "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
Date: Mon, 22 Oct 2012 21:51:24 -0400
Local: Mon, Oct 22 2012 9:51 pm
Subject: Re: RTX2000 optimization
"Brad Eckert" <hwfw...@gmail.com> wrote in message

news:e1479cfa-a969-40ab-b20c-096d82601657@googlegroups.com...

> I've been thinking about Novix style processors like the RTX2000.
> There are many Forth sequences that can be compacted into one
> instruction, so with a good optimizer the chip can execute several
> Forth (source) primitives in one machine cycle. I suspect though
> that such optimization opportunities are the exception rather than the
> rule.

The question for both you (and Rick) is if you create new, faster, more
powerful, multiple operation instructions, how do you ensure they are used?
Without an optimizer, it's likely the instruction will have a low
instruction frequency.  I.e., a person is unlikely to use it.  In which
case, there is no point in using or implementing it.
(This is repeated later in a reply to Rick.)

> Does anyone here have a feel for the correspondence between Forth
> source primitives and generated code?

Forths built from "primitives" (low-level words) generally need 30 to 40
or so.  I kept track of how many are needed for certain Forths.  There are
a few posts by me to c.l.f. with counts and specific words used.

Rod Pemberton


 
Newsgroups: comp.lang.forth
From: "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
Date: Mon, 22 Oct 2012 21:52:27 -0400
Local: Mon, Oct 22 2012 9:52 pm
Subject: Re: RTX2000 optimization
"rickman" <gnu...@gmail.com> wrote in message

news:k64bj1$m4n$1@dont-email.me...

> On 10/22/2012 3:10 PM, Mark Wills wrote:
> > On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
> >> Despite allowing several Forth commands inside one RTX2000 16 bit
> >> word a great advantage is the ability to set a special Return from
> >> Subroutine bit - that is a real speed accelerator - so a whole
> >> executable program may need only one RTX2000 16 bit word!
> >> Imagine what additional possibilities you can have with 32 bit words!

> > Hmmm... I'm not following you. What is the function/significance of
> > this special bit? Where is it set? In the call instruction or the
> > return instruction?

> <this post turned out to be a bit longer than I expected...>

I'm versed in (old) microprocessor design.  So, that's easily taken care of:
[SNIP]

> The only issue with such combining of instructions is the instruction
> encoding.  I encoded to minimize the amount of program space needed
> which resulted in a minimum width instruction optimized per Koopman's
> data for instruction frequency.  I looked at separating the opcodes for
> each unit in essence making the MISC equivalent of a VLIW processor, if
> you can have such a thing... lol  I couldn't quite squeeze the
> instruction into a 9 bit word which is a memory width commonly available
> in FPGAs. [...]

Forths built using "primitives" generally need 30 to 40 or so, i.e.,
5 or 6 bits.  Why do you need 9 bits (512 codes)?  I'd guess that you're
encoding things other than just the instruction, e.g., control bits,
offsets, modes, etc.

Koopman's and Ertl's instruction frequency data is basically the same.
Using their data is a good choice though.

(This is also posted earlier in the thread:)
The question for both you (and Brad) is if you create new, faster, more
powerful, multiple operation instructions, how do you ensure they are used?
Without an optimizer, it's likely the instruction will have a low
instruction frequency.  I.e., a person is unlikely to use it.  In which
case, there is no point in using or implementing it.

> The obvious one, return as a separate bit, in parallel with everything,
> can only be used with about half the instructions in my CPU design.  The
> return stack is used in a lot of them.  But if the bit is free...

Years ago, there was a processor that used a few bits, like two, for
conditional execution of each instruction.  I don't recall what it was, or
if it was a Forth processor.  It might've been a bit-slice design...

> Info that would be VERY useful to me is frequency of use of instructions
> in combination.  This could be pulled from existing code by measuring
> how often instructions are found adjacent to each other.  This may not
> be a perfect measure, but it would be a great start.  Can anyone
> generate a metric on this similar to Koopman's data on instruction use?

Anton Ertl also has instruction frequency data.  I don't recall if he showed
combinations or not.  I know he or someone created the concept of "super
operators" for Forth, which I think is what you're asking about.  If he
doesn't respond, I'll attempt to locate for you what I previously found.

Rod Pemberton


 
Newsgroups: comp.lang.forth
From: visualfo...@rocketmail.com
Date: Mon, 22 Oct 2012 19:44:15 -0700 (PDT)
Local: Mon, Oct 22 2012 10:44 pm
Subject: Re: RTX2000 optimization

On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
> Hi All,

> I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.

> Does anyone here have a feel for the correspondence between Forth source primitives and generated code?

> This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

The trick is to make the right use of the available bits.

The RTX2000 uses two important tricks to make it run fast:
First of all, one bit switches between two types of commands: general commands and subroutine calls. This bit could be bit zero. If bit zero is zero, the other bits form the address of the subroutine to be called. That's obvious, because you don't need odd addresses for subroutines. But the RTX2000 uses bit 15 and shifts the address to fit.
Second, the aforementioned trick with the return from subroutine bit.

Both of these tricks gain speed directly.
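
The call encoding described above can be modelled in a few lines of Python. This is a toy sketch based only on the description in this post: bit 15 clear marks a call, and the remaining 15 bits hold the target address shifted right by one, since subroutines start at even addresses. The function names and the exact range check are assumptions for illustration.

```python
CALL_FLAG = 1 << 15  # per the post: bit 15 separates calls from everything else


def encode_call(byte_address: int) -> int:
    """Pack an even 16-bit byte address into a call word with bit 15 clear."""
    if not 0 <= byte_address <= 0xFFFE or byte_address % 2:
        raise ValueError("call target must be an even address below 0x10000")
    return byte_address >> 1


def decode(word: int):
    """Return ('call', byte_address) or ('other', word)."""
    if word & CALL_FLAG:
        return ("other", word)
    return ("call", (word & 0x7FFF) << 1)
```

In this model a call costs no opcode bits beyond the single flag: encode_call(0x1234) yields 0x091A, and decode(0x091A) recovers the target 0x1234.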

The remaining bits should be used to encode all the Forth primitives that are needed. I am sure you don't need call-frequency data, because the Forth system itself needs all of them.

If there are enough bits, they can be used to run several primitives in one clock cycle. RISCs normally use only one clock cycle per command; the RTX2000 method allows up to four commands to run in one clock cycle, so it is a kind of Super-RISC. These commands running in one clock cycle should form the next level of primitives. And of course there is the possibility of using some bits to switch between different kinds of decoding. The more bits, the more possibilities there are, and the more commands can be made to run in one clock cycle, not necessarily in parallel.


 
Newsgroups: comp.lang.forth
From: Hugh Aguilar <hughaguila...@yahoo.com>
Date: Mon, 22 Oct 2012 22:03:13 -0700 (PDT)
Local: Tues, Oct 23 2012 1:03 am
Subject: Re: RTX2000 optimization
On Oct 22, 6:48 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:

> Years ago, there was a processor that used a few bits, like two, for
> conditional execution of each instruction.  I don't recall what it was, or
> if it was a Forth processor.  It might've been a bit-slice design...

Isn't that the way that the ARM works?

 
Newsgroups: comp.lang.forth
From: Andrew Haley <andre...@littlepinkcloud.invalid>
Date: Tue, 23 Oct 2012 03:19:41 -0500
Local: Tues, Oct 23 2012 4:19 am
Subject: Re: RTX2000 optimization

Hugh Aguilar <hughaguila...@yahoo.com> wrote:
> On Oct 22, 6:48 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
> wrote:
>> Years ago, there was a processor that used a few bits, like two, for
>> conditional execution of each instruction.  I don't recall what it was, or
>> if it was a Forth processor.  It might've been a bit-slice design...

> Isn't that the way that the ARM works?

It was, but it's mostly been dropped in ARM 64 because "Benchmarking
shows that modern branch predictors work well enough that predicated
execution of instructions does not offer sufficient benefit to justify
its significant use of opcode space, and its implementation cost in
advanced implementations."

Andrew.


 
Newsgroups: comp.lang.forth
From: stephen...@mpeforth.com (Stephen Pelc)
Date: Tue, 23 Oct 2012 10:52:34 GMT
Local: Tues, Oct 23 2012 6:52 am
Subject: Re: RTX2000 optimization
On Mon, 22 Oct 2012 08:53:22 -0700 (PDT), Brad Eckert

<hwfw...@gmail.com> wrote:
>I've been thinking about Novix style processors like the RTX2000. There are
> many Forth sequences that can be compacted into one instruction, so with a
> good optimizer the chip can execute several Forth (source) primitives in
> one machine cycle. I suspect though that such optimization opportunities are
> the exception rather than the rule.

See:
  http://www.complang.tuwien.ac.at/anton/euroforth/ef04/pelc-bailey04.pdf
This machine has been run in an FPGA.

  http://www-users.cs.york.ac.uk/~chrisb/main-pages/fpga/fpga-research-...

>Does anyone here have a feel for the correspondence between Forth source
>primitives and generated code?

Many address generating phrases such as "dup 8 + @" can be collapsed
to single instructions. However, the result is not a Forth purist's
CPU.
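
To make the idea concrete, here is a minimal peephole pass in Python that collapses the phrase DUP <literal> + @ into one fused operation. The fused name DUP-FETCH-OFFSET is hypothetical and does not come from any real instruction set; the sketch only shows the pattern-matching step, not code generation.

```python
def _literal(token):
    """Return the integer value of a numeric token, or None if it is not a number."""
    try:
        return int(token)
    except ValueError:
        return None


def fuse_indexed_fetch(tokens):
    """Rewrite DUP <literal> + @ as one (hypothetical) DUP-FETCH-OFFSET operation."""
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 4]
        offset = _literal(window[1]) if len(window) == 4 else None
        if (offset is not None and window[0].upper() == "DUP"
                and window[2] == "+" and window[3] == "@"):
            # Four source primitives become a single fused operation.
            out.append(("DUP-FETCH-OFFSET", offset))
            i += 4
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For the token list of "dup 8 + @ swap" this returns the fused operation with offset 8 followed by swap; sequences that do not match the pattern pass through unchanged.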

Stephen

--
Stephen Pelc, stephen...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads


 
Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Tue, 23 Oct 2012 12:44:15 GMT
Local: Tues, Oct 23 2012 8:44 am
Subject: Re: RTX2000 optimization

rickman <gnu...@gmail.com> writes:
>Info that would be VERY useful to me is frequency of use of instructions
>in combination.  This could be pulled from existing code by measuring
>how often instructions are found adjacent to each other.  This may not
>be a perfect measure, but it would be a great start.  Can anyone
>generate a metric on this similar to Koopman's data on instruction use?

http://www.complang.tuwien.ac.at/forth/peep/

in particular:

http://www.complang.tuwien.ac.at/forth/peep/sorted

There is also a later paper that shows different kinds of data:

http://www.complang.tuwien.ac.at/anton/euroforth/ef01/gregg01.pdf

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/


 
van...@vsta.org  
Newsgroups: comp.lang.forth
From: van...@vsta.org
Date: 23 Oct 2012 17:54:46 GMT
Local: Tues, Oct 23 2012 1:54 pm
Subject: Re: RTX2000 optimization

Andrew Haley <andre...@littlepinkcloud.invalid> wrote:
> Hugh Aguilar <hughaguila...@yahoo.com> wrote:
>>> Years ago, there was a processor that used a few bits, like two, for
>>> conditional execution of each instruction.
>> Isn't that the way that the ARM works?
> It was, but it's mostly been dropped in ARM 64 because "Benchmarking
> shows that modern branch predictors work well enough that predicated
> execution of instructions does not offer sufficient benefit to justify
> its significant use of opcode space, and its implementation cost in
> advanced implementations."

FWIW, the Propeller CPU has conditional execution too.  I did a fair amount
of hand-coding of assembly for it, and came away not liking its instruction
set very much at all.  MIPS is my all-time favorite, but I'd even take 32-bit
x86 over Propeller.

--
Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html


 
rickman  
Newsgroups: comp.lang.forth
From: rickman <gnu...@gmail.com>
Date: Tue, 23 Oct 2012 17:19:01 -0400
Local: Tues, Oct 23 2012 5:19 pm
Subject: Re: RTX2000 optimization
On 10/22/2012 9:52 PM, Rod Pemberton wrote:

I was constrained by the memories available in FPGAs.  Many allow
somewhat flexible word widths of 1, 2, 4, 8, 9, 16 and 18 bits, some
even 32 and 36 bits.  Multiplexing multiple instructions in one word has
a downside in that it requires extra levels of logic in the instruction
decode path which affects *all* instructions.  I wanted to avoid that.

So I started with a 4 bit instruction and found that rather limiting,
mainly in the impact on performance since most code is around twice as
long as it could be with larger instructions.  Literals (both data and
address) were especially problematic.  Looking at Koopman's data it was
clear that anything which could optimize the address fields of calls and
other instructions, including immediate data, would be a boon.

So I tried 8 bit words and used a variable bit width instruction with the
remaining bits as immediate data.  This was combined with a data
extension scheme similar to that used by the Transputer.  They used 4
bit instructions with 4 bits of immediate data which would be shifted
into larger words.  Since this would be the most commonly used
instruction I gave it one bit with 7 bit immediate data.  The first
invocation of a literal instruction pushes the 7 bit data, sign
extended, onto the return stack.  Each subsequent invocation of the
literal instruction shifts 7 more bits into the top of return stack.
Calls and Jumps have a four bit field which is combined with the top of
return stack if a literal has been pushed, or just sign extended if not.
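
A rough sketch of the literal extension scheme described above, modelling only the value being built up (not the return stack it lives on). The choice of the top bit as the literal flag and all function names are assumptions for illustration; the actual opcode layout is not given in this post.

```python
LIT_FLAG = 0x80   # assumed: top bit of the 8 bit word marks a literal

def sign_extend7(x):
    """Interpret a 7 bit field as a two's-complement value."""
    return x - 128 if x & 0x40 else x

def encode_literal(value):
    """Split value into the fewest 7 bit chunks, most significant first."""
    chunks = [value & 0x7F]
    rest = value >> 7          # arithmetic shift keeps the sign
    # Stop once the remaining bits are pure sign extension of the top chunk.
    while not ((rest == 0 and not chunks[0] & 0x40) or
               (rest == -1 and chunks[0] & 0x40)):
        chunks.insert(0, rest & 0x7F)
        rest >>= 7
    return [LIT_FLAG | c for c in chunks]

def decode_literal(opcodes):
    """First literal sign-extends; each later one shifts in 7 more bits."""
    value = None
    for op in opcodes:
        assert op & LIT_FLAG, "not a literal instruction"
        field = op & 0x7F
        value = sign_extend7(field) if value is None else (value << 7) | field
    return value

for n in (5, -1, 64, 1000, -65):
    ops = encode_literal(n)
    print(n, [hex(op) for op in ops], decode_literal(ops))
```

Values from -64 to 63 fit in a single byte; anything larger costs one more byte per extra 7 bits, which is why small literals and short call offsets are so much cheaper in this kind of encoding.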

There remains some 16 opcodes for the various instructions for
manipulating data.  The 8 bit machine was used in one design.

I considered a 9 bit version to fully utilize the block RAM in most
FPGAs.  The immediate data fields were extended by one bit, which I think
was significant for jumps and calls (5 bits vs. 4).  In the case of
general opcodes the extra bit could be used to provide 32 instructions
rather than 16, but I didn't feel this gave much benefit and complicated
the instruction decode which was already more complex than I preferred.
Another alternative was to use the extra bit to flag a combined Return
instruction.  I found it could only be used with about half the opcodes
I was using because of conflicts.

> Koopman's and Ertl's instruction frequency data is basically the same.
> Using their data is a good choice though.

It is a LOT better than no data at all which is what I have otherwise.

> (This is also posted earlier in the thread:)
> The question for both you (and Brad) is if you create new, faster, more
> powerful, multiple operation instructions, how do you ensure they are used?
> Without an optimizer, it's likely the instruction will have a low
> instruction frequency.  I.e., a person is unlikely to use it.  In which
> case, there is no point in using or implementing it.

Who is this "person"?  My design was for me and if I generated enough
code to analyze statistically, I would pick the instructions to combine
from analyzing my code.  It's not like I am selling this design for
others to use... not that I wouldn't mind sharing, but you have to read
my mind for much of the details.  One person asked and I gave him my
block diagram with labeled control points and my opcode cheat sheet.  He
couldn't make heads or tails out of it... lol

>> The obvious one, return as a separate bit, in parallel with everything,
>> can only be used with about half the instructions in my CPU design.  The
>> return stack is used in a lot of them.  But if the bit is free...

> Years ago, there was a processor that used a few bits, like two, for
> conditional execution of each instruction.  I don't recall what it was, or
> if it was a Forth processor.  It might've been a bit-slice design...

I've heard of that as well as other "unique" features.  An ancient
Univac machine had a bit in the address field that flagged indirect.
The address fetched had the same bit... they had to add an indirect
counter to get out of the infinite loops that could happen.

>> Info that would be VERY useful to me is frequency of use of instructions
>> in combination.  This could be pulled from existing code by measuring
>> how often instructions are found adjacent to each other.  This may not
>> be a perfect measure, but it would be a great start.  Can anyone
>> generate a metric on this similar to Koopman's data on instruction use?

> Anton Ertl also has instruction frequency data.  I don't recall if he showed
> combinations or not.  I know he or someone created the concept of "super
> operators" for Forth, which I think is what you're asking about.  If he
> doesn't respond, I'll attempt to locate for you what I previously found.

That would be greatly interesting.  I'm surprised I didn't notice this
before.  I wish there was some market for a machine like this, but then
others would have done this before me.  I know Bernd, as well as others,
would have been all over this years ago if the market existed.

It seems that if the CPU isn't pipelined and blazing fast, it isn't
interesting to most FPGA users.  They prefer very high performance CPUs
that access MBs of external memory and use 1000's of LUTs.  My design
has an extensible word size, but will likely never have a C compiler for
it.

That reminds me of the ZPU.  It is a stack machine designed to run C.
It was also designed to be as tiny as possible in the minimal
configuration with other versions running faster but using more
resources.  They ported the gcc tools for it.  I think the minimal
version is slightly smaller than my design, but very slow, maybe 10x...
or would that be 10/?

Rick


 
rickman  
Newsgroups: comp.lang.forth
From: rickman <gnu...@gmail.com>
Date: Tue, 23 Oct 2012 17:32:14 -0400
Local: Tues, Oct 23 2012 5:32 pm
Subject: Re: RTX2000 optimization
On 10/23/2012 8:44 AM, Anton Ertl wrote:

Thank you.  I'll take a look at this data.

Rick


 
rickman  
Newsgroups: comp.lang.forth
From: rickman <gnu...@gmail.com>
Date: Tue, 23 Oct 2012 17:36:45 -0400
Local: Tues, Oct 23 2012 5:36 pm
Subject: Re: RTX2000 optimization
On 10/22/2012 9:51 PM, Rod Pemberton wrote:

We aren't talking about Forth coding really.  We are talking about the
assembly language for a machine.  I don't think instructions will go
unused just because they are mapped to Forth in a more complicated way
than 1 to 1 (or 1/2 to 1).

>> Does anyone here have a feel for the correspondence between Forth
>> source primitives and generated code?

> Generally, Forths built using "primitives" or low-level words
> need 30 to 40 or so.  I kept track of how many are needed for certain
> Forths.  There are a few posts by me to c.l.f. with counts and specific
> words used.

Don't confuse Forth low level primitives (which are really HLL
primitives selected to be convenient for the programmer writing a Forth)
and assembly language which has to be selected in part based on what is
practical and efficient to implement.  Chuck's machine only uses 32
opcodes and you can get by with as few as 16.

Rick


 
rickman  
Newsgroups: comp.lang.forth
From: rickman <gnu...@gmail.com>
Date: Tue, 23 Oct 2012 17:47:25 -0400
Local: Tues, Oct 23 2012 5:47 pm
Subject: Re: RTX2000 optimization
On 10/22/2012 10:44 PM, visualfo...@rocketmail.com wrote:

> On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
>> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

> The trick is to make the right use of the available bits.

> The RTX2000 uses two important tricks to make it run fast:
> First of all, one bit switches between two types of commands: general commands and subroutine calls. This bit could be bit zero. If bit zero is zero, the other bits form the address of the subroutine to be called. That's obvious, because you don't need odd addresses for subroutines. But the RTX2000 uses bit 15 and shifts the address to fit.
> Second, the aforementioned trick with the return from subroutine bit.

> These both accomplish speed first hand.

> The remaining bits should be used to decode all Forth primitives which are needed. I am sure you don't need a frequency of calling, because the Forth OS itself needs all of them.

With a 16 bit or larger instruction, you have enough bits to "encode"
each execution unit in a dual stack machine separately.  Then you don't
need a separate bit for the return operation.  It can't be used in
parallel with any other Instruction Unit operation or a Return Stack
operation since both of these are used to do a return.  It can only be
used in parallel with a purely Data Stack operation.
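
To make the conflict rule concrete, here is a small sketch that checks which operations could share a cycle with a return. The opcode table and the unit names are illustrative assumptions, not taken from the RTX2000 or from my design.

```python
# Units an operation may occupy in a dual-stack machine (names assumed).
IU, RS, DS = "instruction unit", "return stack", "data stack"

# Illustrative opcode table, not taken from any real chip.
UNITS_USED = {
    "+":    {DS},
    "dup":  {DS},
    "drop": {DS},
    ">r":   {DS, RS},
    "r>":   {DS, RS},
    "call": {IU, RS},
    "jump": {IU},
}

RETURN_NEEDS = {IU, RS}   # a return pops the return stack into the PC

def can_fold_return(op):
    """A return can share the cycle only if the op leaves IU and RS free."""
    return UNITS_USED[op].isdisjoint(RETURN_NEEDS)

foldable = sorted(op for op in UNITS_USED if can_fold_return(op))
print("return can be combined with:", foldable)
```

With this toy table only the pure Data Stack operations qualify, which matches the "about half the opcodes" experience mentioned earlier in the thread only loosely, since the real ratio depends on the actual opcode set.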

The next time I look at a design on an FPGA I will take a look at a 16
or 18 bit instruction word that encodes the execution units operations
separately.  Part of the problem is that 16 is too many!  What to do
with the remainder?

> If there are enough bits, it will be possible to use these to run several primitives in one clock cycle - RISCs normally use only one clock cycle per command, the RTX2000 method allows up to four commands run per one clock cycle - it's some kind of Super-RISC. These commands running in one clock cycle should generate the next level of primitives. And of course there is the possibility to use some bits to switch between different kinds of decoding. The more bits, the more possibilities there are, and more commands can be made to run in one clock cycle, not necessarily in parallel.

Primitives can only be run together if they don't conflict.  That's why
I want to look at which primitives occur together in the code.
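
The measurement I have in mind is just a count of adjacent pairs. A minimal sketch, assuming the code is already available as a flat list of primitive names; the token stream below is made up and stands in for real compiled code.

```python
from collections import Counter

def pair_frequencies(tokens):
    """Count each ordered pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy token stream standing in for real compiled Forth code (made up).
code = "dup @ swap dup @ + swap drop dup @".split()

freq = pair_frequencies(code)
for pair, n in freq.most_common(3):
    print(pair, n)
```

On real code the count should probably be taken per definition, so that the last primitive of one word and the first of the next are not treated as a pair.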

Rick


 
visualfo...@rocketmail.com  
Newsgroups: comp.lang.forth
From: visualfo...@rocketmail.com
Date: Tue, 23 Oct 2012 16:00:55 -0700 (PDT)
Local: Tues, Oct 23 2012 7:00 pm
Subject: Re: RTX2000 optimization

On Tuesday, October 23, 2012 5:47:32 PM UTC-4, rickman wrote:
> > On 10/22/2012 10:44 PM, visualforth.com wrote:
> > The RTX2000 uses two important tricks to make it run fast:
> > Second, the aforementioned trick with the return from subroutine bit.
> With a 16 bit or larger instruction, you have enough bits to "encode" each
> execution unit in a dual stack machine separately. Then you don't need a
> separate bit for the return operation. It can't be used in parallel with any
> other Instruction Unit operation or a Return Stack operation since both of
> these are used to do a return. It can only be used in parallel with a purely
> Data Stack operation.

If you say so ....

 
rickman  
Newsgroups: comp.lang.forth
From: rickman <gnu...@gmail.com>
Date: Tue, 23 Oct 2012 21:07:11 -0400
Local: Tues, Oct 23 2012 9:07 pm
Subject: Re: RTX2000 optimization
On 10/23/2012 7:00 PM, visualfo...@rocketmail.com wrote:

> On Tuesday, October 23, 2012 5:47:32 PM UTC-4, rickman wrote:
>>> On 10/22/2012 10:44 PM, visualforth.com wrote:
>>> The RTX2000 uses two important tricks to make it run fast:
>>> Second, the aforementioned trick with the return from subroutine bit.

>> With a 16 bit or larger instruction, you have enough bits to "encode" each
>> execution unit in a dual stack machine separately. Then you don't need a
>> separate bit for the return operation. It can't be used in parallel with any
>> other Instruction Unit operation or a Return Stack operation since both of
>> these are used to do a return. It can only be used in parallel with a purely
>> Data Stack operation.

> If you say so ....

I don't understand.  Are you agreeing or passively saying you don't agree?

Rick


 