Xilinx microblaze vs. picoblaze

emanuel stiebler

Oct 15, 2002, 1:14:20 AM

Hi,

Does anybody out here have some insight into why the 8-bit picoblaze (up to 112 MHz)
is clocked lower than the 32-bit microblaze (up to 150 MHz)?

cheers

Falk Brunner

Oct 15, 2002, 12:35:35 PM

"emanuel stiebler" <e...@ecubics.com> wrote in message
news:3DABA42C...@ecubics.com...

> Hi,
>
> anybody out here has some insight, why the 8 bit picoblaze (up 112 MHz)

112 MHz (one hundred twelve megahertz???)
In what device? The fastest Virtex-II??
My experience is somewhere around 50 MHz in a -5/-6 Spartan-II(E).

> is clocked lower than the 32 bit microblaze (up 150 MHz) ?

Looks like the microblaze is more pipelined than the picoblaze (which has
some pipelining too). Picoblaze (Hello Ken ;-) was developed with minimum
size as top priority; speed was second (AFAIK).

--
MfG
Falk

Symon

Oct 15, 2002, 12:51:42 PM

Dear Emanuel,
Microblaze is heavily pipelined so can be clocked faster than
picoblaze. OTOH, Picoblaze uses far fewer FPGA resources.
HTH, Syms.

emanuel stiebler <e...@ecubics.com> wrote in message news:<3DABA42C...@ecubics.com>...

emanuel stiebler

Oct 15, 2002, 1:24:40 PM

Falk Brunner wrote:
>
> "emanuel stiebler" <e...@ecubics.com> wrote in message
> news:3DABA42C...@ecubics.com...
> > Hi,
> >
> > anybody out here has some insight, why the 8 bit picoblaze (up 112 MHz)
>
> 112 MHz (one hundred twelve megahertz???)
> In what device? the fastest Virtex-II??

http://www.xilinx.com/ipcenter/processor_central/picoblaze/index.htm

That was even a typo. It runs at 116 MHz ;-)
at least in the press release ...

> My experience is somewhere 50 MHz in a -5/-6 Spartan-II(E).

There are two "picoblazes". One for the Spartan, one for the Virtex.
Pretty different beasts.

> > is clocked lower than the 32 bit microblaze (up 150 MHz) ?
>
> Looks like the microblaze is more pipelined than the picoblaze (which has
> some pipelining too). Picoblaze (Hello Ken ;-) was developed for minimum
> size on top priority, speed was second (AFAIK)

AFAIRC, the microblaze still needs only two clocks per instruction ...

cheers

Falk Brunner

Oct 15, 2002, 2:06:43 PM

"emanuel stiebler" <e...@ecubics.com> wrote in message
news:3DAC4F58...@ecubics.com...

> > My experience is somewhere 50 MHz in a -5/-6 Spartan-II(E).
>
> There are two "picoblazes". One for the Spartan, one for the Virtex.
> Pretty different beasts.

?? R u sure?
I think there are NOT two picoblaze versions, since Spartan-II (why the hell
is everyone talking about Spartan when it is actually Spartan-II, which, your
honour, is a BIG difference???) is practically identical to Virtex.

> > size on top priority, speed was second (AFAIK)
>
> AFAIRC, the microblaze still needs only two clock per instruction ...

PICOBLAZE (TPFKAKCPSM, decode THIS ;-)) executes EVERY instruction in two
clock cycles; this was done for simplicity (resource usage). But still much
better than the original stupid 12-clocks-per-machine-cycle design of the 8051...
I don't know about microblaze.

--
MfG
Falk

Goran Bilski

Oct 15, 2002, 2:03:16 PM

Hi,

MicroBlaze is more pipelined and is more floorplanned than PicoBlaze.
The 150 MHz for MicroBlaze is on a V2Pro and 116MHz for PicoBlaze is on VII-6.
MicroBlaze runs at 135 MHz on a VII-6.

PicoBlaze is optimized for area, and more pipelining has a big area cost in
processor design.

There is also not that big a difference in performance between different data sizes.
The carry chain is pretty fast once you start using it.
A 64-bit MicroBlaze would probably run at 100 MHz and an 8-bit MicroBlaze would
run just a little faster than the 32-bit version, since the control decoding will
probably be the limiting factor.

By the way, for most instructions MicroBlaze needs only 1 clock/instruction.

Göran

emanuel stiebler

Oct 15, 2002, 2:48:01 PM

Goran Bilski wrote:
>
> There is also not that big difference on performance with different datasizes.
> The carry-chain is pretty fast as soon you starting to use it.
> A 64-bit MicroBlaze would probably run at 100 MHz and a 8 bit MicroBlaze would
> just run a little faster than the 32-bit version since the control decoding will
> probably be the limiting factor.
>
> By the way for most instruction MicroBlaze needs only 1 clock/instruction.

That was the answer I was looking for, even though I didn't make
it clear what exactly my question was ;-)

So, we can't go faster on the current Xilinx parts than around 100
"MIPS",
regardless of whether it is an 8-, 16-, 32- or 64-bit processor, right?

And the problem is really in the instruction decoding, i.e. the control path.

Thanks

Nicholas C. Weaver

Oct 15, 2002, 2:53:35 PM

In article <3DAC62E1...@ecubics.com>,

emanuel stiebler <e...@ecubics.com> wrote:
>That was the answer I was looking for, even that I didn't make
>it clear what exactly my question was ;-)
>
>So, we can't go faster at the actual Xilinx parts than around 100
>"MIPS",
>independent, if it is a 8,16,32,64 bit processor, right ?
>
>And the problem is really in the instruction decoding, control path.

Well, you CAN, but you are going to have to go multithreaded. If your
critical path is one 32-bit add + bypassing, yeah, you really can't go
any faster. The only way to pipeline this (or the instruction decode,
if that is the critical factor) is to use an interleaving,
multithreading strategy (aka $C$-slow a microprocessor).
--
Nicholas C. Weaver nwe...@cs.berkeley.edu

Goran Bilski

Oct 15, 2002, 3:38:59 PM

Hi,

If your definition of MIPS is the peak rate of one instruction per clock cycle, the 150
MHz MicroBlaze has 150 MIPS.
It is also possible to build a 200+ MIPS processor, if you want to optimize for
MIPS.
A heavily pipelined processor without any forwarding in the pipeline could easily run at
200+ MHz.
Would that processor be more efficient than MicroBlaze? I don't think so; the number
of stalls due to pipeline hazards would actually give it lower performance than
MicroBlaze.
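
The clock-versus-stalls trade-off can be put in numbers with a rough sketch. The stall rates below are made-up figures chosen only to illustrate the argument; they are not measurements of MicroBlaze or of any real core, and `effective_mips` is a hypothetical helper.

```python
def effective_mips(clock_mhz: float, base_cpi: float, stalls_per_instr: float) -> float:
    """Sustained millions of instructions per second once stall cycles are counted."""
    return clock_mhz / (base_cpi + stalls_per_instr)


# Hypothetical core with forwarding: lower clock, few hazard stalls.
forwarding_core = effective_mips(clock_mhz=150.0, base_cpi=1.0, stalls_per_instr=0.1)

# Hypothetical deeper core without forwarding: higher clock, many hazard stalls.
deep_core = effective_mips(clock_mhz=200.0, base_cpi=1.0, stalls_per_instr=0.6)

print(f"150 MHz with forwarding:    {forwarding_core:.1f} MIPS")
print(f"200 MHz without forwarding: {deep_core:.1f} MIPS")
```

With these assumed stall rates the 200 MHz core ends up slower in sustained throughput than the 150 MHz one; with a lower stall rate the result flips, so the conclusion depends entirely on the workload.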

Processors in FPGAs have to be handled more delicately than ASIC processors, because
forwarding in the pipeline could easily remove all the benefit gained from more pipeline stages. In
an FPGA a mux costs as much as an ALU, which is not the case for ASIC or custom design.

Another approach is to rely on advanced compiler techniques for handling all the
pipeline hazards, but it would make it almost impossible to program the processor in
assembler, since the user would have to do the handling.
I personally don't think that this approach would gain that much more performance than
MicroBlaze, and you have to spend a lot of resources on the compiler which could be
used for other stuff.

Another approach is to add multi-threading capabilities but I think that
multi-processing is better for FPGA than multi-threading.

Göran Bilski

Nicholas C. Weaver

Oct 15, 2002, 4:14:45 PM

In article <3DAC6ED3...@Xilinx.com>,
Goran Bilski <Goran....@Xilinx.com> wrote:

>Another approach is to add multi-threading capabilities but I think that
>multi-processing is better for FPGA than multi-threading.

I disagree, based on the following observation: A 4-5 stage pipeline
is going to have 2-3 levels of FPGA logic between each pipeline stage,
suggesting that there are plentiful registers which can be exploited
if the design is retimed and interleaved.

Actual experience: Leon I (synthesized Sparc), Virtex:

Initially: 23 MHz
Retimed: 25 MHz
2-slow retimed: 46 MHz (so each thread at 23 MHz)
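
The arithmetic behind those three figures, as a small sketch (only the clock numbers come from the list above; the helper name is made up for illustration):

```python
def per_thread_mhz(clock_mhz: float, threads: int) -> float:
    """Each interleaved thread gets one issue slot every `threads` clock cycles."""
    return clock_mhz / threads


baseline_mhz = 23.0   # original single-threaded design
two_slow_mhz = 46.0   # 2-slow retimed design, two threads interleaved

per_thread = per_thread_mhz(two_slow_mhz, threads=2)
aggregate_gain = two_slow_mhz / baseline_mhz

print(f"per-thread rate: {per_thread} MHz, aggregate gain: {aggregate_gain}x")
```

So each thread runs at the original single-thread rate while the core as a whole does twice the work per second.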

Rick Filipkiewicz

Oct 15, 2002, 4:30:48 PM

Goran Bilski wrote:

> Hi,
>
> MicroBlaze is more pipelined and is more floorplanned than PicoBlaze.
> The 150 MHz for MicroBlaze is on a V2Pro and 116MHz for PicoBlaze is on VII-6.
> MicroBlaze runs at 135 MHz on a VII-6.
>

Aha! There we have it: caught on this very NG, a Xilinx person admitting that
floorplanning is important :-)).

Goran Bilski

Oct 15, 2002, 5:12:34 PM

Hi,

Yes, but I don't use the floorplanner.
I added all placement constraints (RLOC) in my VHDL code.
This gives me a more stable and reliable timing.

I have actually done a test where I remove all RLOC from my code and let par do the
placement.
The result is within 5% of the handplaced version.

Göran

Ray Andraka

Oct 15, 2002, 6:32:38 PM

That is lower than we typically see for data path designs. For heavily arithmetic
designs in Virtex, VirtexE and SpartanII, we've consistently seen better than 30%
improvement, and frequently better than 60% doing the same experiment. Of course it
also depends on what you are setting the constraints as. The routing tools only work
as hard as they have to to meet your constraints, so if the target is low compared with
what could be realized, the results are going to be skewed making floorplanning not
look like a big win. For VirtexII, the same applies, except non-arithmetic logic is
quite a bit faster than the carry chains so the routing limit is artificially low if
there is arithmetic covered by the same constraint. 135MHz is not a very high target
for VirtexII.

Goran Bilski wrote:

> Hi,
>
> Yes, but I don't use the floorplanner.
> I added all placement constraints (RLOC) in my VHDL code.
> This gives me a more stable and reliable timing.
>
> I have actually done a test where I remove all RLOC from my code and let par do the
> placement.
> The result is within 5% of the handplaced version.
>
> Göran

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email r...@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

Goran Bilski

Oct 15, 2002, 7:35:40 PM

Hi Ray,

Are you implying something ;-)

I could do a better job on BRAM placement, but since any number of BRAMs can be connected to
MicroBlaze, it would force me to do a lot of different floorplans
depending on the number of BRAMs.
The BRAM is in the critical path.

Most requests on MicroBlaze are NOT about performance but more about functionality, and that
is where I spend most of my time now.

Göran

Ray Andraka

Oct 15, 2002, 8:31:05 PM

No, only that the savings is not representative of what can typically be achieved in a
pipelined datapath design through floorplanning. The placement of the BRAMs and multipliers
is certainly a driver. Most of our stuff is tolerant to additional pipelining so we will
typically surround the BRAMs with additional pipeline stages, which is something that doesn't
work too well with a simple microprocessor.

Hal Murray

Oct 16, 2002, 12:28:54 AM

>Processors in FPGAs have to be handled more delicately than ASIC processors, because
>forwarding in the pipeline could easily remove all the benefit gained from more pipeline stages. In
>an FPGA a mux costs as much as an ALU, which is not the case for ASIC or custom design.
>
>Another approach is to rely on advanced compiler techniques for handling all the
>pipeline hazards, but it would make it almost impossible to program the processor in
>assembler, since the user would have to do the handling.
>I personally don't think that this approach would gain that much more performance than
>MicroBlaze, and you have to spend a lot of resources on the compiler which could be
>used for other stuff.

This seems like an interesting opportunity for an open source project.

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Hal Murray

Oct 16, 2002, 12:41:35 AM

>Another approach is to add multi-threading capabilities but I think that
>multi-processing is better for FPGA than multi-threading.

Why?

If I understand what multi-threading means, the idea is to interleave
alternate cycles of two execution streams in order to reduce the
losses due to stalls.

It looks like it "just" requires an extra address bit (odd/even cycle)
to the register file and the same bit selects between pairs of special
registers like the PC.
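
As a toy model of that idea (not MicroBlaze's actual register file; the 32 registers per thread and the 4-byte PC step are assumptions, and `TwoThreadState` is a made-up name):

```python
class TwoThreadState:
    """Architectural state for two threads interleaved on odd/even cycles."""

    def __init__(self, regs_per_thread: int = 32):
        self.regs_per_thread = regs_per_thread
        # One physical register file, twice as deep: the thread bit is the top address bit.
        self.regfile = [0] * (2 * regs_per_thread)
        # Special registers such as the PC are simply duplicated, one per thread.
        self.pc = [0, 0]

    def _index(self, cycle: int, reg: int) -> int:
        thread_bit = cycle & 1  # even cycles belong to thread 0, odd cycles to thread 1
        return thread_bit * self.regs_per_thread + reg

    def read(self, cycle: int, reg: int) -> int:
        return self.regfile[self._index(cycle, reg)]

    def write(self, cycle: int, reg: int, value: int) -> None:
        self.regfile[self._index(cycle, reg)] = value

    def fetch_pc(self, cycle: int) -> int:
        """Return the PC of the thread that owns this cycle, then advance it."""
        thread_bit = cycle & 1
        pc = self.pc[thread_bit]
        self.pc[thread_bit] += 4
        return pc


state = TwoThreadState()
state.write(cycle=0, reg=5, value=11)   # thread 0 writes its r5
state.write(cycle=1, reg=5, value=22)   # thread 1 writes its own r5
print(state.read(cycle=2, reg=5), state.read(cycle=3, reg=5))
```

The datapath between the registers is untouched in this model; only the storage that must survive from one instruction to the next is indexed by the thread bit.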

Are you telling me that the ALU and instruction decoding is small enough
so that I might just as well build two copies of the whole CPU?

Goran Bilski

Oct 16, 2002, 11:08:34 AM

Hi,

Sort of.

The complete decoding and the ALU is around 10-13% of the design.
The actual instruction decoding is less than 5%.

Making it multithreaded, as I understand it, means having more than one instruction
stream in the pipeline.
What is the benefit unless you double the pipeline and have two data pipelines?
Almost nothing.

So with two threads in MicroBlaze, to double the pipeline is to double the size
of MicroBlaze.
You also have to double the instruction fetching data throughput in order to keep
the two streams busy.
That would put a big burden on the bus infrastructure and external memory
interface, which suddenly has to double its performance.
The doubling of the pipeline and the added control handling WILL also lower the
maximum clock frequency of MicroBlaze.

Compare this with two MicroBlazes, which can each have their own separate instruction
fetching and both run at the maximum clock frequency.

I would say that multiprocessing, which is easier to do and gives more performance,
is a better choice.
Say you suddenly would like to have 5 threads instead of 2. That is a major
change of the multithreading MicroBlaze and almost impossible to get the
instruction fetching to keep up. With multiprocessing, just add another 3
MicroBlazes and you're done.

BUT there is always a catch and that is how you write programs for these systems.

Göran

Nicholas C. Weaver

Oct 16, 2002, 12:03:31 PM

In article <3DAD80F2...@Xilinx.com>,

Goran Bilski <Goran....@Xilinx.com> wrote:
>Hi,
>
>Sort of.
>
>The complete decoding and the ALU is around 10-13% of the design.
>The actual instruction decoding is less than 5%.
>
>Make it multithreading as I understand is to have more than 1 instructions
>streams in the pipeline.
>What is the benefit unless you double the pipeline and have two data pipelines?
>Almost nothing

Uhh, you don't double the pipelines, you take the single pipeline,
double up the registers IN them, and then move the registers to
rebalance all the pipeline stages, as now you have 2x the registers
through any feedback loop, allowing you to up the clock frequency a lot.

If you do this to every register in the core (and tweak the RF), a
multithreaded design just sort of "drops out" automatically.

You can even write a tool to do that automatically.

What happens in the end is you take advantage of the two threads to
up the clock substantially. Each individual thread is now a little
slower, but the throughput for the 2 threads is now substantially
higher. You use more pipelining and more power, and you may or may
not end up thrashing the caches, but it does work.

I can send you a paper submission and a thesis chapter draft on the
subject if you want.

>So with two threads in MicroBlaze, to double the pipeline is to
>double the size of MicroBlaze. You also have to double the
>instruction fetching data throughput in order to get the two streams
>busy. That would put a big burden on the bus infrastructure and
>external memory interface which suddenly has to double it's
>performance. The doubling of the pipeline and added control handling
>WILL also lower the maximum clock frequency of MicroBlaze.

You don't need to double the external memory interface if you share the
cache; this is especially true on workloads where the threads are
related. The external memory interface is now at 2x the CLOCK, but you
could slow it down from there and arbitrate between the two streams of
execution.

You also probably want to make the feeding of interrupts a little
different, so you can designate one thread as receiving the
interrupts.

>Say you suddenly would like to have 5 threads instead of 2. That is a major
>change of the multithreading MicroBlaze and almost impossible to get the
>instruction fetching to keep up. With multiprocessing, just add another 3
>MicroBlazes and you're done.

What you do is you have a 1-thread and a 2-thread version (going
beyond 2 threads seems to be less effective, maybe 3 depending on the
architecture). From the exterior, however, they still look normal.
You can still tile that like any other core to create a multiprocessor
machine.

>BUT there is always a catch and that is how you write programs for these
>systems.

"one thread for I/O, one thread for processing" does come up in some
cases.


Goran Bilski

Oct 16, 2002, 12:24:15 PM

Hi,

"Nicholas C. Weaver" wrote:

Please do.

If you double all the registers in the data pipeline, haven't you doubled the
pipeline?
Or is all functionality between the pipestages shared?

Nicholas C. Weaver

Oct 16, 2002, 1:27:12 PM

In article <3DAD92AF...@Xilinx.com>,

Goran Bilski <Goran....@Xilinx.com> wrote:
>> I can send you a paper submission and a thesis chapter draft on the
>> subject if you want.
>>
>
>Please do.

Done.

>If you double all the registers in the data pipeline, hasn't you doubled the
>pipeline?
>Or is all functionality between the pipestages shared?

The functions between the pipeline stages remain unchanged, so one is
pipelining the computation on a finer grain. This can work fairly
well for FPGAs as the ratio of LUTs to FFs is 1/1, but that is usually
not the case in logic except for highly aggressive designs.

Ray Andraka

Oct 16, 2002, 1:41:15 PM

Goran Bilski wrote:

>
>
> Please do.
>
> If you double all the registers in the data pipeline, hasn't you doubled the
> pipeline?
> Or is all functionality between the pipestages shared?

Yes, you've doubled the pipeline by doubling the registers. In a non-pipelined
design, you presumably have several layers of LUTs between registers. Adding pipeline
registers is nearly free in the FPGA, since they are there but unused in slices used
for combinatorial LUTs. By adding the pipeline stages you can crank up the clock
considerably if you've balanced the delays between the registers. That in turn lets
you have more than one thread in process at any given moment. All the functionality
goes through the same path, it is just that the path has more registers in it, so that
the next sample can be put into the pipeline before the previous one comes out.
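
A back-of-the-envelope model of that effect, as a sketch only: it assumes the combinatorial delay splits perfectly evenly across stages and that every register adds a fixed overhead (setup plus clock-to-Q). The 8 ns and 1 ns figures are invented for illustration, not taken from any device data sheet.

```python
def fmax_mhz(logic_delay_ns: float, stages: int, reg_overhead_ns: float) -> float:
    """Clock limit when `logic_delay_ns` of logic is split evenly over `stages` stages."""
    period_ns = logic_delay_ns / stages + reg_overhead_ns
    return 1000.0 / period_ns


unpipelined = fmax_mhz(logic_delay_ns=8.0, stages=1, reg_overhead_ns=1.0)
two_stage = fmax_mhz(logic_delay_ns=8.0, stages=2, reg_overhead_ns=1.0)

print(f"1 stage : {unpipelined:.1f} MHz")
print(f"2 stages: {two_stage:.1f} MHz total, {two_stage / 2:.1f} MHz per thread")
```

Under these assumptions the clock nearly doubles, while each of the two interleaved threads sees slightly less than the unpipelined rate because of the added register overhead.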

Goran Bilski

Oct 16, 2002, 2:06:47 PM

You can't just double the number of pipestages in a processor without major impact.
For streaming pipelines, which is what hardware pipelines are, I agree, but for a processor
that can't be done.

Göran

Goran Bilski

Oct 16, 2002, 2:09:55 PM

Hi,

"Nicholas C. Weaver" wrote:

> In article <3DAD92AF...@Xilinx.com>,
> Goran Bilski <Goran....@Xilinx.com> wrote:
> >> I can send you a paper submission and a thesis chapter draft on the
> >> subject if you want.
> >>
> >
> >Please do.
>
> Done.
>

Thanks

>
> >If you double all the registers in the data pipeline, hasn't you doubled the
> >pipeline?
> >Or is all functionality between the pipestages shared?
>
> The functions between the pipeline stages remain unchanged, so one is
> pipelining the computation on a finer grain. This can work fairly
> well for FPGAs as the ratio of LUTs to FFs is 1/1, but that is usually
> not the case in logic except for highly aggressive designs.
>

The problem with sharing functionality is that you need muxes to select which thread
to use.
These muxes will in most cases be larger than the function itself and will
definitely decrease the clock frequency of the processor.

I still believe that multi-threading is useful for custom/ASIC implementation but
multi-processing works better in FPGA.

Göran

Jim Granville

Oct 16, 2002, 2:31:40 PM

Nicholas C. Weaver wrote:
>
> In article <3DAD80F2...@Xilinx.com>,
> Goran Bilski <Goran....@Xilinx.com> wrote:
> >So with two threads in MicroBlaze, to double the pipeline is to
> >double the size of MicroBlaze. You also have to double the
> >instruction fetching data throughput in order to get the two streams
> >busy. That would put a big burden on the bus infrastructure and
> >external memory interface which suddenly has to double it's
> >performance. The doubling of the pipeline and added control handling
> >WILL also lower the maximum clock frequency of MicroBlaze.
>
> You don't need to double the external memory interface if you share the
> cache, this is especially true on workloads where the threads are
> related. The external memory interface is now 2x the CLOCK, but you
> could slow it down from there and arbitrate between the two streams of
> execution.

Suppose the memory interface was optimised for synchronous/burst FLASH,
- doesn't this multi threading fight a little with that interface ?

Or do you always need a multi-layer memory interface ?

> You also probably want to make the feeding of interrupts a little
> different, so you can designate one thread as receiving the
> interrupts.

That sounds like a good idea, plus it implies SW access to thread steering,
so small routines can also be tagged for a 'fast thread'.

Is this the same scheme Intel has coming, where they claim 25% higher
performance (if the SW supports it :)

-jg

Nicholas C. Weaver

Oct 16, 2002, 2:35:27 PM

In article <3DADAAB7...@Xilinx.com>,
Goran Bilski <Goran....@Xilinx.com> wrote:

>You can't just double the number of pipestage for a processor without
>major impacts. For streaming pipeline which hardware pipelines are I
>agree but for processor that can't be done.

Uhh, yes it can.

Double all the pipeline stages, double the register file, rebalance
the delays now that you have more pipelining, and out drops a 2-thread
multithreaded architecture. Each single thread now runs slower, but
aggregate throughput (sum of the two threads) is increased.

It is so obvious yet unintuitive that nobody has actually DONE it
before. :)

Ken McElvain

Oct 16, 2002, 2:39:34 PM

One big advantage of multi threading is that the pipeline
interlocks can be eliminated if the number of threads is larger
than the longest feedback path in the pipeline. For example,
a branch instruction does not have to stall waiting for
conditions from the preceding comparison. This yields
some boost in the total performance.

Permanent state such as condition codes and register files has to be
expanded into larger memories with part of the index being the current
thread id, but other registers mostly do not have to be modified. Given
the distributed ram capabilities in Xilinx parts, this is pretty
cheap.

The first place I saw this was the CDC 6600 IO processors, which
I believe ran 16 threads.

- Ken

Nicholas C. Weaver

Oct 16, 2002, 3:02:38 PM

In article <3DADB0...@designtools.co.nz>,
Jim Granville <jim.gr...@designtools.co.nz> wrote:

>Nicholas C. Weaver wrote:
>> You don't need to double the external memory interface if you share the
>> cache, this is especially true on workloads where the threads are
>> related. The external memory interface is now 2x the CLOCK, but you
>> could slow it down from there and arbitrate between the two streams of
>> execution.
>
>Suppose the memory interface was optimised for synchronous/burst FLASH,
>- doesn't this multi threading fight a little with that interface ?

A bit, but not too bad. You want to redo the memory interface anyway
to make the outside look like a normal, single threaded part for
convenience.

>> You also probably want to make the feeding of interrupts a little
>> different, so you can designate one thread as receiving the
>> interrupts.
>
>That sounds a good idea, plus it infers SW access to thread steering,
>so small routines can also be tagged for a 'fast thread'.
>
>Is this the same scheme Intel has coming, where they claim 25% higher
>performance ( if the SW supports it :)

It's not coming, it's present in the P4 Xeons, and present but untested
and disabled in the desktop P4s. You want better software or at least
scheduling to use it RIGHT, but it is there.

Same concept (run 2 threads as 2 separate virtual processors on the
same shared hardware), largely the same programmer implications (same
working set good, different working sets BAAAAD, which is just the
opposite of an SMP), totally different implementation:

The Intel/UWashington (Hyperthreading/SMT) approach is to issue from 2
instruction streams into a superscalar core, as any one stream
actually has pretty crappy utilization of the functional units. Intel
doesn't even increase the physical registers (which are much more
numerous than the architectural registers).

$C$-slow multithreading is to double up the pipeline, issuing from two
threads on an even/odd basis, so each thread now runs a little slower,
but the finer pipelining allows the system to run faster.

Goran Bilski

Oct 16, 2002, 2:56:14 PM

Hi,

I agree that you can, if you also double the clock frequency of the pipeline,
creating parts of the normal clock.
What I meant was keeping the same clock and just adding more pipestages.

I have finally got the idea of multithreading, but it is not as easy to implement,
since you need to find a good middle point in each
pipestage that can divide the pipestage into equal parts.

The control path also needs to be split into subparts, and you also need to find
good points to break it up.
The processor also definitely needs a cache which you can run at double
speed, or which has more ports, in order to get the data for each thread.
I think that would be the largest obstacle for multithreading MicroBlaze: the
number of ports to the BRAM is finite (2) and in my implementation the BRAM is
almost in the critical path already.

Göran Bilski

Nicholas C. Weaver

Oct 16, 2002, 3:15:51 PM

In article <3DADB64E...@Xilinx.com>,

Goran Bilski <Goran....@Xilinx.com> wrote:
>Hi,
>
>I agree that you can if you also double the clock frequency of the pipeline,
>creating parts of the normal clock.
>What I meant was the keeping the same clock and just adding more pipestages.

Yeah, that's a loss generally. You need to add more stages around all
feedback loops to really improve things, and that is what the
multithreading allows.

>I have finally got the idea of multithreading but it not as easy to
>implement since you need to find a good middle point in each
>pipestage that can divide the pipestage into equal parts. The
>control path is also needed to split into subparts and you also need
>to find good points to break it up.

You can actually write a tool to do it all automatically, especially
if you are willing to say "screw initial conditions".

Hal Murray

Oct 16, 2002, 3:25:12 PM

>The problem to share functionality is that you need muxes to select which thread
>to use.
>These muxes will be in most cases larger than the function itself and will
>definitely decrease the clock frequency of the processor.

I don't think we are on the same wavelength yet.

The idea is not to add those muxes, but to have two threads
running through the same heavily pipelined structure on
alternating cycles.

The extra pipeline registers are free on most FPGAs. They
let you run with a faster clock rate. But if you only have
one thread you will waste many of those new cycles on stalls.
You can get back those cycles if you let another thread use
them.

Only around the edges do you need new muxes - a wider register
file and new mux/enables on registers like the PC.

> I still believe that multi-threading is useful for custom/ASIC implementation but
> multi-processor works better in FPGA:

This isn't rocket science. All we need is to compare 2x the LUTs used
by a single threaded design with a multi threaded design. (Scaled by
clock speed if that changes.)
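
A sketch of that comparison. All LUT counts and clock rates below are placeholders invented to show the bookkeeping; nobody in this thread has reported such numbers for MicroBlaze.

```python
def throughput_per_lut(aggregate_mhz: float, luts: int) -> float:
    """Aggregate issue rate (all threads or cores summed) divided by area in LUTs."""
    return aggregate_mhz / luts


# Hypothetical figures only.
one_core = throughput_per_lut(aggregate_mhz=100.0, luts=1000)
two_cores = throughput_per_lut(aggregate_mhz=2 * 100.0, luts=2 * 1000)
two_thread_core = throughput_per_lut(aggregate_mhz=180.0, luts=1150)

print(f"one core       : {one_core:.3f} MHz/LUT")
print(f"two cores      : {two_cores:.3f} MHz/LUT")
print(f"two-thread core: {two_thread_core:.3f} MHz/LUT")
```

Duplicating the core leaves throughput per LUT unchanged by construction, so the multithreaded design wins on this metric only if its clock gain outweighs the extra LUTs it needs.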

Hal Murray

Oct 16, 2002, 3:30:49 PM

> BUT there is always a catch and that is how you write programs
> for these systems.

Standard programming problem. People are getting pretty good
at it. Yes, there are lots of applications where it doesn't work.

If you can't take advantage of multi-threading then you wouldn't
be able to use multi-processing either.

rickman

Oct 16, 2002, 4:01:34 PM

Hal Murray wrote:
>
> >Processors in FPGAs have to be handled more delicately than ASIC processors, because
> >forwarding in the pipeline could easily remove all the benefit gained from more pipeline stages. In
> >an FPGA a mux costs as much as an ALU, which is not the case for ASIC or custom design.
> >
> >Another approach is to rely on advanced compiler techniques for handling all the
> >pipeline hazards, but it would make it almost impossible to program the processor in
> >assembler, since the user would have to do the handling.
> >I personally don't think that this approach would gain that much more performance than
> >MicroBlaze, and you have to spend a lot of resources on the compiler which could be
> >used for other stuff.
>
> This seems like an interesting opportunity for an open source project.

Aren't there already CPUs in FPGA open source projects?

http://www.fpgacpu.org/

http://www.opencores.org/

The list is getting pretty long.

--

Rick "rickman" Collins

rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Ray Andraka

Oct 16, 2002, 4:08:05 PM

No, you don't need muxes. The process is time division multiplexed in the same
hardware. It is just a matter of keeping track where the pieces of each thread are
in relation to one another. As long as all parts have the same depth through
hardware loops you get that depth worth of multithreading. The only place muxes may
be needed is for selecting outside inputs.

Goran Bilski wrote:

> Hi,
>
> "Nicholas C. Weaver" wrote:
>
> The problem to share functionality is that you need muxes to select which thread
> to use.
> These muxes will be in most cases larger than the function itself and will
> definitely decrease the clock frequency of the processor.
>

> Göran
>
> > --
> > Nicholas C. Weaver nwe...@cs.berkeley.edu

--

Ray Andraka

Oct 16, 2002, 4:09:30 PM

It can, if each phase of the pipeline is assigned to a different thread.

Ray Andraka

Oct 16, 2002, 4:20:21 PM

We do it all the time in our DSP designs. Granted, they are not typically
microprocessors in the traditional sense, but the fact is it is doable.

Basically what happens is that if you have a single thread, you have extra
clock cycles between each instruction to allow time for the pipeline
propagation. Let's take a really simple case where you are just pulling data
out of a register, conditionally adding 1 to it and putting it back. In a
non-pipelined version you can increment a particular value on every clock. If
you add pipelining to the data path, the result of the first increment is not
available in the memory to be used again for N clocks where N is the depth of
the pipeline. The pipelining does allow you to run the clock faster, but it
actually slows the processing of that one memory location since the process
has to be stalled until the memory is updated. However, you can use the clock
cycles in between to increment other locations in memory so long as you don't
use any values that have been updated before they become available again. In
a sense then, you can partition the memory into N areas, each of which is
accessed only on 1 in N cycles. Each of the cycles is then a different
'thread' running on the same processor. The result is no one thread is any
faster than the unpipelined processor (it is actually a bit slower because you
add set-up and clock-Q times plus slack for each register you add to the
pipeline), but because the path is pipelined you can have more than one thread
being operated on at a time (each with a one clock skew in relation to the
previous). This is extendable to a general purpose processor as long as the
pipeline depth is consistent for all instructions.
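
A minimal sketch of the interleaving described above, written as a Python model rather than HDL. The function name, the round-robin issue order, and the "stale read" counter are illustrative assumptions, not taken from any actual design; a stale read marks a point where real hardware would have to stall.

```python
from collections import deque

def run_interleaved(depth, n_threads, cycles):
    """Model a read/increment/write-back loop with a `depth`-clock pipeline.

    One thread issues per clock (round-robin), each owning one counter.
    Returns (final_counters, stale_reads), where a stale read is a read of
    a counter that still has an update in flight.
    """
    counters = [0] * n_threads
    in_flight = deque()              # (write_back_cycle, thread, new_value)
    pending = [0] * n_threads        # updates in flight per thread
    stale_reads = 0

    for cycle in range(cycles):
        # Write back every result whose pipeline delay has elapsed.
        while in_flight and in_flight[0][0] <= cycle:
            _, done_thread, value = in_flight.popleft()
            counters[done_thread] = value
            pending[done_thread] -= 1

        thread = cycle % n_threads
        if pending[thread]:
            stale_reads += 1         # a real design would have to stall here
        in_flight.append((cycle + depth, thread, counters[thread] + 1))
        pending[thread] += 1

    # Drain the pipeline after the last issue.
    for _, done_thread, value in in_flight:
        counters[done_thread] = value
    return counters, stale_reads

print(run_interleaved(depth=4, n_threads=4, cycles=16))
print(run_interleaved(depth=4, n_threads=1, cycles=16))
```

With four threads on a 4-deep pipeline every increment lands and no read is stale. With a single thread, 15 of the 16 reads see a value that is still in flight, and the counter only advances once every four clocks.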

"Nicholas C. Weaver" wrote:

--

Ray Andraka

Oct 16, 2002, 4:24:28 PM

As I recall, you were getting speeds of ~135 MHz in V2. You should be able to get a
fully pipelined processor using BRAMs up to 200 MHz or so without any big problems.
In VirtexII, the carry chains will be the limiting factor, not the BRAM if you do it
right. In Virtex and VirtexE, you can double the width of the BRAM and then use
registers to assemble consecutive accesses. It does get a bit messy in that case
because it introduces pipeline misses.

Goran Bilski wrote:

--

Ray Andraka

Oct 16, 2002, 4:25:45 PM

The SRL16's make the permanent state really easy to store too.

Ken McElvain wrote:

--

Goran Bilski

Oct 16, 2002, 4:33:39 PM

Hal Murray wrote:

> > BUT there is always a catch and that is how you write programs
> > for these systems.
>
> Standard programming problem. People are getting pretty good
> at it. Yes, there are lots of applications where it doesn't work.
>
> If you can't take advantage of multi-threading then you wouldn't
> be able to use multi-processing either.
>

Is that true?

Don't you actually need two threads in order to use multi-threading, whereas
multiprocessor parallelism can be more fine-grained?

Example: code where the inner loop has a function call in which some operations take place.

Is it easier to thread that function or just place the function in another
processor?

Isn't how data is moved between two processors/threads the more crucial issue?

How do you actually move data between two threads in the same processor?

Göran

Goran Bilski

Oct 16, 2002, 5:06:22 PM

Yes, but I also have an embedded multiplier which is already using the registered
output.
I can't add another pipestage in that path since there is nowhere to insert it.
Then I would have to add a special arrangement in the control logic for instructions
that use the multiplier.

I have painfully discovered that minor tweaks in the control logic can easily make it the
critical path.
I think that it is possible to have two threads in MicroBlaze, but I am not convinced that it
would give me more performance than two separate MicroBlazes. The overall area will be
less than two MicroBlazes but not far from it.

MicroBlaze has 700 LUTs and 500 DFFs. Double the number of flip-flops and the slice
count will go up.
(There are also a lot of places in Virtex and VirtexE where I have used all the
inputs/outputs of a slice, and even if there is a free DFF, it can't be reached.)
It will not double the size, but a significant increase will occur.

The multiprocessor approach makes it much easier to add 10 extra processors (threads).

Göran

Ray Andraka

Oct 16, 2002, 5:26:33 PM

Easiest to do via memory or register file. Thread timing has to make sure the value
is available before using it.

Goran Bilski wrote:

>
> Isn't it how data is move between two processor/threads that is more crucial?
>
> How does you actually move data between two threads in the same processor?
>
> Göran

--

Nicholas C. Weaver

Oct 16, 2002, 5:48:48 PM

In article <3DAC6ED3...@Xilinx.com>,
Goran Bilski <Goran....@Xilinx.com> wrote:

>Another approach is to rely on advanced compiler techniques for
>handling all the pipeline hazards but it would make it almost
>impossible to program the processor in assembler since the user has
>to do the handling. I personally don't think that this approach
>would gain that much more performance than MicroBlaze and you have to
>spend a lot of resources on the compiler which could be used for
>other stuff.

MIPS: Machine without Interlocking Pipeline Stages.

Ray Andraka

Oct 16, 2002, 6:42:11 PM

It won't quite double performance, but it also should not take significantly more area; if it
does, you either aren't doing it right or the design is already heavily pipelined. The gain is
not raw performance, it is a gain of performance/area. Two separate instances will provide more
MIPS than one dual-threaded machine, but at the cost of more area.

Normally, the pipeline stages should be inserted so that their input comes from the LUT in
the same slice, so it is not an issue if you used up all the inputs. The only blocking
input in that case is the SR input if you are using a CLB RAM or SRL16. The control logic
can usually be pipelined similarly, but it may require a fresh start at the design rather
than patching the existing one.

Jan Gray

Oct 16, 2002, 6:47:46 PM

Nicholas wrote:
> MIPS: Machine without Interlocking Pipeline Stages.

MIPS: "Microprocessor without Interlocked Pipeline Stages"

Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
would necessarily have a 1 cycle ALU-to-use delay. That is rather more
challenging for the code scheduler to address.

My (unpublished) V-II architectural studies back up what Goran has
been writing. Furthermore, to 2-thread any such machine would increase the
area intolerably, because so much area is tied up in register files (which
would have to double in size). Also, if you grow the area of the processor,
it will slow down because of increased interconnect delays. A compact
processor is a fast processor.

Goran wrote:
> MicroBlaze has 700 LUTS and 500 DFFs. Double the number of flipflops and
> the slices count will go up.

Is that an implementation improvement over the old 900+ LUTs figure, or did
the old 900 LUTs figure include other non-core resources? Can you say what
changed?

(SPRAM reg files? (The following is fast enough for ~150 MHz operation:
Register the result in a write-back register (in FFs) on CLK rising edge;
present reg file write address and write-back data to reg file SPRAMs while
CLK high; write results to SPRAMs on CLK falling edge; present reg file read
address while CLK low; mux SPRAM outputs with immediate and/or forwarded
result; and register in operand registers.))

Jan Gray, Gray Research LLC

Goran Bilski

Oct 16, 2002, 7:14:06 PM

Hi Jan,

Jan Gray wrote:

> Nicholas wrote:
> > MIPS: Machine without Interlocking Pipeline Stages.
>
> MIPS: "Microprocessor without Interlocked Pipeline Stages"
>
> Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
> have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
> would necessarily a 1 cycle ALU-to-use delay. That is rather more
> challenging for the code scheduler to address.
>
> My (unpublished) V-II architectural studies concur back up what Goran has
> been writing. Furthermore, to 2-thread any such machine would increase the
> area intolerably, because so much area is tied up in register files (which
> would have to double in size). Also, if you grow the area of the processor,
> it will slow down because of increased interconnect delays. A compact
> processor is a fast processor.
>
> Goran wrote:
> > MicroBlaze has 700 LUTS and 500 DFFs. Double the number of flipflops and
> > the slices count will go up.
>
> Is that an implementation improvement over the old 900+ LUTs figure, or did
> the old 900 LUTs figure include other non-core resources? Can you say what
> changed?
>

Ooops, I did it again!!
Error on my side, I looked at the report file for the core and took the number
of IO instead of LUTs.

>
> (SPRAM reg files? (The following is fast enough for ~150 MHz operation:
> Register the result in a write-back register (in FFs) on CLK rising edge;
> present reg file write address and write-back data to reg file SPRAMs while
> CLK high; write results to SPRAMs on CLK falling edge; present reg file read
> address while CLK low; mux SPRAM outputs with immediate and/or forwarded
> result; and register in operand registers.))
>

(Of 900 LUTs, 256 of them are the register file => around 30%)
It's something that I have thought of, but it would not leave much room for any
logic handling of the register addresses or operations on the register output.
Since I am using an SRL16 as the instruction prefetch buffer and the output delay of
an SRL16 is around 2 ns, I can't add much logic to the register addresses before
they have to go to the register file.
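
A rough worked version of that timing argument, as a sketch only. It assumes the read address must settle within the low half of a ~150 MHz clock, as in the half-cycle scheme quoted above, and it ignores routing delay and setup time, which would shrink the budget further. The function name is hypothetical.

```python
def half_cycle_budget_ns(clock_mhz, source_delay_ns):
    """Time left in half a clock period after the address source's output delay.

    Assumption: routing and setup time are ignored, so this is an upper bound.
    """
    period_ns = 1000.0 / clock_mhz
    return period_ns / 2 - source_delay_ns

# ~150 MHz clock, ~2 ns SRL16 output delay (both figures from the posts above)
print(round(half_cycle_budget_ns(150, 2.0), 2), "ns left for address logic")
```

At 150 MHz the half period is about 3.33 ns, so roughly 1.33 ns remains for any logic between the SRL16 and the register file address inputs.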

But if I get some spare time (hahahaha) this is something that I would try to do
in order to get the MicroBlaze size down.
Jan, you might be able to do a clean-room implementation of a MicroBlaze where
area is everything.
If you have any spare time ;-)

Hal Murray

Oct 16, 2002, 7:51:03 PM

>> >Another approach is to rely on advanced compiler techniques for handling all the
>> >pipeline hazards but it would make it almost impossible to program the processor in
>> >assembler since the user has to do the handling.
>> >I personally don't think that this approach would gain that much more performance than
>> >MicroBlaze and you have to spend a lot of resources on the compiler which could be
>> >used for other stuff.
>>
>> This seems like an interesting opportunity for an open source project.
>
>Aren't there already CPUs in FPGA open source projects?
>
>http://www.fpgacpu.org/
>
>http://www.opencores.org/
>
>The list is getting pretty long.

I was thinking of the compiler rather than the hardware.

The idea is to use one thread rather than multiple, and make the
compiler smart enough to understand the pipeline delays, and either
automatically insert noops or slap your wrist if you do something
bad.
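
A minimal sketch of the "automatically insert noops" half of that idea, assuming a toy three-address instruction format and a fixed result delay. The tuple layout, the `insert_nops` name and the `delay` parameter are all hypothetical; a real scheduler would first try to reorder independent instructions into the gaps instead of padding.

```python
def insert_nops(program, delay):
    """Pad a straight-line program so no instruction reads a result too early.

    program: list of (dest, src1, src2, ...) register-name tuples.
    delay:   number of extra instruction slots a result needs before use.
    """
    scheduled = []
    ready_at = {}    # register -> first slot index at which it may be read
    for dest, *srcs in program:
        earliest = max((ready_at.get(s, 0) for s in srcs), default=0)
        while len(scheduled) < earliest:
            scheduled.append(("nop",))
        scheduled.append((dest, *srcs))
        ready_at[dest] = len(scheduled) + delay
    return scheduled

prog = [("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r1", "r5"),   # uses r1 immediately
        ("r6", "r2", "r3")]   # independent of the above
for insn in insert_nops(prog, delay=1):
    print(insn)
```

With a one-slot ALU-to-use delay, a single no-op is placed between the first two instructions. The third instruction is independent, so a smarter pass could move it into that slot and avoid the padding altogether.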

Think of it as microcode rather than "normal" (whatever that
means) RISC type code. You have to get your head around it,
but once you get in the right mode it's not that hard. Maybe
I was lucky to have a good mentor at the right time.

Tim

Oct 16, 2002, 7:54:30 PM

Various people wrote:
> > > MIPS: Machine without Interlocking Pipeline Stages.
> >
> > MIPS: "Microprocessor without Interlocked Pipeline Stages"

From long ago, when life was simpler.

MIPS means 'MIPS', and has plenty of interlocking. Some
implementations also have the HACF instruction, which we
all challenge Goran to implement. (Halt and Catch Fire)

Goran Bilski

Oct 16, 2002, 8:28:53 PM

I always try to get in an instruction that always produces the value 42.
But for some reason I can never get it past management.

Göran

Ken McElvain

Oct 16, 2002, 10:22:04 PM

Nicholas C. Weaver wrote:

Sorry, it was done a long time ago. Try to find some info on
the CDC 6600 IO processors. They ran 16 threads in a very deep
pipeline.

Nicholas C. Weaver

Oct 16, 2002, 10:25:45 PM

In article <3DADEB1E...@andraka.com>,
Ray Andraka <r...@andraka.com> wrote:

>Normally, the pipeline stages should be inserted so that their input
>comes from the LUT in the same slice, so it is not an issue if you
>used up all the inputs. The only blocking input in that case is the
>SR input if you are using a CLB RAM or SRL16. The control logic can
>usually be pipelined similarly, but it may require a fresh start at
>the design rather than patching the existing one.

One observation: if you want to C-slow the clock enable, you
want to loop it through LUT logic anyway; otherwise you get
interference between the two threads.

Same actually goes for the reset as well.

Nicholas C. Weaver

Oct 16, 2002, 10:28:51 PM

In article <3DAE1E8...@synplicity.com>,

Ken McElvain <k...@synplicity.com> wrote:
>> Uhh, yes it can.
>>
>> Double all the pipeline stages, double the register file, rebalance
>> the delays now that you have more pipelining, and out drops a 2-thread
>> multithreaded architecture. Each single thread now runs slower, but
>> aggregate throughput (sum of the two threads) is increased.
>>
>> It is so obvious yet unintuitive that nobody has actually DONE it
>> before. :)
>>
>
>Sorry, it was done a long time ago. Try to find some info on
>the CDC 6600 IO processors. They ran 16 threads in a very deep
>pipeline.

Those machines, and also HEP and Tera, didn't have any bypassing. The
proposed multithreaded approach I'm talking about keeps the bypassing
but doubles the pipelining.

The closest, "interleaved multithreading", kept some of the bypassing,
but never took advantage of the fact that the bypass feedback loops now have
more registers in them, which lets you raise the clock rate by finer pipelining.

Jan Gray

Oct 16, 2002, 10:53:49 PM

"One can also build a simple barrel processor (say 4 threads (slots) x 32
regs = 128 entries of 32-bits = 2 16-bit ports on a single 256x16 BRAM,
triple cycled, or two BRAMs double cycled) and switch threads on each
cycle. Then you can have a 4-deep pipeline without need for any result
forwarding muxes (by the time you read an operand on thread[i], you have
already retired that thread's previous result to the register file).

This seems to me to be a perfectly simple and practical basis to issue
instructions faster than the ALU + result forwarding mux + operand register
recurrence critical path. Unfortunately single-thread performance is not so
hot, but in workloads such as "network processing", who cares?

This idea was taken to sublime levels in the 20-stage pipelined 5-threaded 1
GHz MicroUnity MediaProcessor (which would have needed some result
forwarding, but not 18 stages worth)."
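
The register file arithmetic above can be sketched as follows. The sizes come from the post; the address layout (thread number in the upper bits, low and high halves in adjacent 16-bit words) and the simple `depth > threads` forwarding test are illustrative assumptions, not a description of any real core.

```python
THREADS = 4
REGS_PER_THREAD = 32
REG_BITS = 32
BRAM_WORD_BITS = 16          # one 256x16 block RAM

def phys_reg(thread, reg):
    """Index of a thread's register in the shared 128-entry register file."""
    return thread * REGS_PER_THREAD + reg

def bram_words(thread, reg):
    """The two 16-bit word addresses assumed to hold one 32-bit register."""
    base = phys_reg(thread, reg) * (REG_BITS // BRAM_WORD_BITS)
    return base, base + 1

def needs_forwarding(pipeline_depth, threads=THREADS):
    """Simplified rule: a thread issues once every `threads` clocks, so its
    previous result is already written back unless the pipe is deeper."""
    return pipeline_depth > threads

print(phys_reg(3, 31), bram_words(3, 31))
print(THREADS * REGS_PER_THREAD * REG_BITS, "bits =", 256 * 16, "bits")
print(needs_forwarding(4), needs_forwarding(20, threads=5))
```

Under these assumptions the 4-thread, 4-deep case needs no forwarding, while a 20-stage, 5-thread pipeline still needs some, which matches the remark about the MicroUnity design.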

Hal Murray

Oct 17, 2002, 1:15:00 AM

>> If you can't take advantage of multi-threading then you wouldn't
>> be able to use multi-processing either.

>Is that true?

>Don't you need to actually have two threads in order to use the multi-threading
>but multiprocessor parallelism can be more fine grain.

From the software view, a multi-threaded CPU is just like an SMP.
It's just time multiplexed rather than replicated in space.

>ex. A code where the inner loop has a function call where some
> operations take place.

> Is it easier to thread that function or just place the function
> in another processor?

There is a separate issue of whether you are doing fine grained
(dozen instructions?) or coarse grained (millisecond) switching.

Executing a function on another processor/thread doesn't do any
good unless you have something else to do while it runs.

Perhaps the sort of example you are looking for on fine grained
work is fortran loops. If you are going around a loop 100 times
and you have 2 processors, you could get one to do the odd slots
and let the other do the even slots. Lots of compiler work to find
where you can do it. It obviously doesn't work (at least not in
the simple minded way) if one iteration refers to the results
of the previous iteration.
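
A small sketch of that odd/even split, using Python's standard `concurrent.futures` as a stand-in for two processors or hardware threads. The function names are hypothetical, and in CPython the global interpreter lock means this shows the partitioning, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def square_slots(data, out, start, step):
    """Handle iterations start, start+step, ... of an independent loop."""
    for i in range(start, len(data), step):
        out[i] = data[i] * data[i]

def parallel_square(data, workers=2):
    out = [0] * len(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Worker 0 takes the even slots, worker 1 the odd slots.
        futures = [pool.submit(square_slots, data, out, w, workers)
                   for w in range(workers)]
        for f in futures:
            f.result()           # re-raise any exception from a worker
    return out

print(parallel_square(list(range(10))))
```

The split works because each iteration touches only its own slot. A running sum such as `out[i] = out[i-1] + data[i]` could not be divided this way, since each iteration needs the result of the one before it.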

The classic coarse grained example is reading your mail while
PAR or simulation is crunching away on the other CPU.

> Isn't it how data is move between two processor/threads that is
> more crucial?

It's not a big deal on coarse grained work since it doesn't happen
very often.

For fine grained work the details (hardware and software) are probably
more important than we can evaluate without a solid design to discuss.
The straw man is that all the data goes through main memory to
get from one CPU to the other one, but since that probably hits
in the cache it shouldn't take long. Adding hacks to copy
registers from the other (logical) CPU might be a significant
help.

I'm not really a wizard on this stuff. There are probably
many relevant PhD theses out there.

Nicholas C. Weaver

Oct 17, 2002, 1:49:11 AM

In article <uqshqkr...@corp.supernews.com>,
Hal Murray <hmu...@suespammers.org> wrote:

>>Don't you need to actually have two threads in order to use the
>>multi-threading but multiprocessor parallelism can be more fine
>>grain.

>From the software view, a multi-threaded CPU is just like an SMP.
>It's just time multiplexed rather than replicated in space.

The big difference is the interference effects.

On a shared memory SMP, if the two processes share a common memory
working set, you have coherency misses where writes cause ping-ponging
of memory ownership and generally reduce memory performance
considerably.

In a multithreaded architecture with a cache, the misses occur for the
exact opposite reason: the two tasks use different working sets,
thrashing the cache.

In the P4 style, there is yet another interference effect: the two
threads, when competing for the same functional units, will slow each
other down.

>The classic coarse grained example is reading your mail while
>PAR or simulation is crunching away on the other CPU.

Or simply having a separate thread for the kernel on a web server.

>> Isn't it how data is move between two processor/threads that is
>> more crucial?
>
>It's not a big deal on coarse grained work since it doesn't happen
>very often.
>
>For fine grained work the details (hardware and software) are probably
>more important than we can evaluate without a solid design to discuss.
>The straw man is that all the data goes through main memory to
>get from one CPU to the other one, but since that probably hits
>in the cache it shouldn't take long. Adding hacks to copy
>registers from the other (logical) CPU might be a significant
>help.

Only by a couple of cycles, but it IS useful; some of the RAW work has
benefited greatly from the fast move between processors.

>I'm not really a wizard on this stuff. There are probably
>many relevant PhD thesis out there.

Although a lot can be answered by just thinking about things.

I would like to see an actual study of where the P4 multithreading has
interference effects.

Ray Andraka

Oct 17, 2002, 2:11:03 AM

That is exactly what I was trying to get at.

Jan Gray wrote:

--

Rob Finch

Oct 17, 2002, 2:19:01 AM

Multi-threaded = Is this a barrel processor ?

If this is a cpu implemented in an FPGA the need is probably not so much for
performance (which markets well 'gee whiz factor') but rather for
functionality in a small size.

I've tried taking the "hands on approach" experimenting with several design
options and come to the following conclusions:
- designing a processor to run in an FPGA is different than designing the
processor for custom logic. Methods used to gain performance in processors
designed for custom logic simply don't work well in an FPGA. In an FPGA
twice as big = half as fast, which is not the case for a processor built out
of custom logic.
- it's probably not a good idea to build an FPGA processor with more than a
three stage pipeline, all the additional routing, register bypass
multiplexing, and control logic makes the design bigger and consequently
slower. There is no real difference in performance for a more complex
design, it is simply more complicated.
- it is possible to build a really fast barrel cpu, but most of the
performance gain is lost trying to interface it to a memory system
- assuming that the processor has to be interfaced to *external* memory (a
requirement of many real world apps), this is a significant design
consideration. The memory space and bandwidth has to be shared with the FPGA
application in many circumstances. This bandwidth limit puts a limit on the
performance needed of the FPGA cpu.
- using BRAMs for caches is much slower than using them as raw ram
resources, they were not designed to be used as caches. They lack cache tag
comparators, and a way to clear the cache or cache lines in a simple manner.
They could also use more ports. Recognizing that this is not likely to
change also puts a limit on the performance. With a little bit of work, I
can make a cpu much faster than the cache, but there is no point in doing
this.
- the design of the whole system is important. It is way too easy to get
caught up in the process of designing a really fast component and then
realize that the rest of the system can't keep up with it anyway.
- if a high performance processor is required, use a real one. Don't try to
build one in an FPGA because this is not their strong point.

Sample Stats (SpartanII - slowest speed grade)
Sparrow2 processor (version1 32 bits, 32 reg ) - no pipeline, executes most
instructions as one single long instruction cycle. Is simple, but has a lot
of functionality including 32 bit barrel shifter and hardware multiply. Runs
at 25 MHz (same speed as external memory)

Sparrow2 processor (version2) - three stage pipeline with register
bypassing - same functionality as above. Runs at 40+MHz using an instruction
cache. The design is significantly larger and more complex however.

Which one is better ?

Rob

BTW: I have a relatively small and fast 6502 compatible core available at
www.birdcomputer.ca (lacks docs yet - on the way)

Edwin Naroska

Oct 17, 2002, 4:37:18 AM

to Nicholas C. Weaver

Hi,

Nicholas C. Weaver wrote:
> In article <3DADAAB7...@Xilinx.com>,
> Goran Bilski <Goran....@Xilinx.com> wrote:
>
>
>>You can't just double the number of pipestage for a processor without
>>major impacts. For streaming pipeline which hardware pipelines are I
>>agree but for processor that can't be done.
>
>

> Uhh, yes it can.
>
> Double all the pipeline stages, double the register file, rebalance
> the delays now that you have more pipelining, and out drops a 2-thread
> multithreaded architecture. Each single thread now runs slower, but
> aggregate throughput (sum of the two threads) is increased.
>
> It is so obvious yet unintuitive that nobody has actually DONE it
> before. :)

I think there has been a lot of research done in this field.
Perhaps not for FPGAs as a target but I remember reading some
research papers that talked about different kinds of hardware
support for multithreading.

E.g., one paper talked about a design with a very deep pipeline
(I think it had more than 10 stages; unfortunately, I forgot
paper title and author names). This architecture
was capable of running N (N = number of stages) threads
in parallel. However, each thread submitted only one
instruction every N-th cycle. Using this approach there
is no need for bypassing (no RAW hazards), there are
no control hazards (OK, if memory is fast enough), ...
In fact, each thread submits a new instruction AFTER
its previous instruction reached the end of the pipeline.
As a result, each thread executes at 1/N-th of the clock
rate, but without any problems caused by RAW or control
hazards. Further, N threads are running in parallel.

For this architecture the pipeline registers are independent
of the number of threads. However, each thread needs
its own register set (i.e., total registers = N * registers
per thread). This can be handled more efficiently
using register renaming...

I am not sure, but this approach sounds similar to what
I've read in this thread (?)

--
Edwin

rickman

Oct 17, 2002, 5:32:06 PM

There once was a time when I understood pretty much most if not all of
processor design that was available without an NDA. However that was
when I was in school and a lot has happened since then.

I have not heard of this threading concept before, but it makes sense at
a hardware level. I am not sure I understand the software
implications. It seems to me that the multiple instruction streams must
be completely independent and, in fact, you must have a separate PC for
each stream, no? So how do you run something like this? I guess there
must be instructions for starting and stopping each separate thread in
the processor. Or do all the threads start following reset??? How
would the OS manage that?

Hal Murray

Oct 17, 2002, 8:37:07 PM

>I have not heard of this threading concept before, but it makes sense at
>a hardware level. I am not sure I understand the software
>implications. It seems to me that the multiple instruction streams must
>be completely independant and infact, you must have a separate PC for
>each stream, no? So how do you run something like this? I guess there
>must be instructions for starting and stopping each separate thread in
>the processor. Or do all the threads start following reset??? How
>would the OS manage that?

It's just like SMP. The hardware is time interleaved rather than replicated.

rickman

Oct 17, 2002, 9:43:11 PM

Hal Murray wrote:
>
> >I have not heard of this threading concept before, but it makes sense at
> >a hardware level. I am not sure I understand the software
> >implications. It seems to me that the multiple instruction streams must
> >be completely independant and infact, you must have a separate PC for
> >each stream, no? So how do you run something like this? I guess there
> >must be instructions for starting and stopping each separate thread in
> >the processor. Or do all the threads start following reset??? How
> >would the OS manage that?
>
> It's just like SMP. The hardware is time interleaved rather than replicated.

Yes, I understand that. But when you replicate the hardware, you
replicate *all* of the hardware. I have not heard anyone say that there
are multiple copies of the program counter (PC). If you are working off
one PC how do you switch between the threads on a clock cycle basis?
Additionally, how do you start up a thread? When this multiple thread
CPU starts following reset, each of the threads will need to be running
*something*. How is that managed? Do they all boot the BIOS or
whatever startup code you have?

Ray Andraka

Oct 18, 2002, 12:36:11 AM

The PC just needs to have more than one register in its feedback. The SRL16's are
great for this. We do this type of thing all the time with counters in our designs
so that one incrementer and an SRL16 make up a set of time multiplexed
counters all on a common path. The trick is to be careful that all the loops are
pipelined to the correct level. You can reset one counter without resetting the
rest, branch on each one independently etc.
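
A behavioural sketch of that arrangement, with a Python deque standing in for the SRL16 delay line. The class and argument names are made up for illustration; the point is only that N counters (or PCs) share one incrementer, and whichever one is at the head of the delay line on a given clock can be reset or loaded without touching the others.

```python
from collections import deque

class MultiplexedCounters:
    """N counters sharing one incrementer via an N-deep delay line."""

    def __init__(self, n):
        self.delay_line = deque([0] * n)     # models the SRL16 feedback path

    def clock(self, reset=False, load=None):
        """Advance one clock; returns the value of the counter now at the head."""
        value = self.delay_line.popleft()
        if reset:
            nxt = 0                  # reset only this slot's counter
        elif load is not None:
            nxt = load               # e.g. a branch target for a PC
        else:
            nxt = value + 1          # the single shared incrementer
        self.delay_line.append(nxt)
        return value

c = MultiplexedCounters(3)
print([c.clock() for _ in range(6)])                  # two passes over 3 counters
print(c.clock(reset=True), c.clock(), c.clock())      # reset counter 0 only
print([c.clock() for _ in range(3)])                  # counter 0 restarted, others kept going
```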

rickman wrote:

> Yes, I understand that. But when you replicate the hardware, you
> replicate *all* of the hardware. I have not heard anyone say that there
> are multiple copies of the program counter (PC). If you are working off
> one PC how do you switch between the threads on a clock cycle basis?
> Additionally, how do you start up a thread? When this multiple thread
> CPU starts following reset, each of the threads will need to be running
> *something*. How is that managed? Do they all boot the BIOS or
> whatever startup code you have?

--

Hal Murray

Oct 18, 2002, 4:37:25 AM

>> It's just like SMP. The hardware is time interleaved rather than replicated.

>Yes, I understand that. But when you replicate the hardware, you
>replicate *all* of the hardware. I have not heard anyone say that there
>are multiple copies of the program counter (PC). If you are working off
>one PC how do you switch between the threads on a clock cycle basis?
>Additionally, how do you start up a thread? When this multiple thread
>CPU starts following reset, each of the threads will need to be running
>*something*. How is that managed? Do they all boot the BIOS or
>whatever startup code you have?

As far as I know, we are just waving our hands here. Nobody has worked
out and tested anything. I've been assuming that registers like the PC
would get duplicated since that's the only thing that makes sense.

How things get started is normally one of those system-specific parts.
(aka hacks and kludges) You could do things like smash both PCs to 0
and then read a status register that contained a bit to tell you which
CPU you are. Typical SMP initialization code lets CPU 0 do a lot of the
work while the rest keep out of the way. But often they each have to
set up the local caches. It's a delicate dance, and yes, it is often
relegated to the BIOS.

Jan Gray

Oct 18, 2002, 3:59:26 PM

"rickman" <spamgo...@yahoo.com> wrote

> Yes, I understand that. But when you replicate the hardware, you
> replicate *all* of the hardware. I have not heard anyone say that there
> are multiple copies of the program counter (PC). If you are working off
> one PC how do you switch between the threads on a clock cycle basis?
> Additionally, how do you start up a thread? When this multiple thread
> CPU starts following reset, each of the threads will need to be running
> *something*. How is that managed? Do they all boot the BIOS or
> whatever startup code you have?

Yes, there are multiple PCs. For example, in LUT RAM. Each PC is
initialized with some reset vector value. They can all be the same reset
vector, or one PC can be a distinguished reset vector, which releases a
software semaphore-like-entity after the single boot thread has done
one-time initialization of the rest of the system. Each thread will know
its identity, for example by starting the PCs at different addresses, or by
preloading the thread# in a particular register, for each thread, in the
barrel register file.

An RTOS would presumably be employed to schedule logical threads/tasks to
'physical' threads.

Most state (reg file, PC, poss. PSR or other special regs) would be
replicated per thread; however, most function units and other parts of the
datapath would be shared (time multiplexed) amongst the threads.
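
One way to picture the single-boot-thread variant is the toy model below. It assumes strict round-robin thread slots, that thread 0 is the distinguished boot thread, and that the other threads spin on a release flag; `boot` and `init_steps` are hypothetical names, and no real RTOS or core is being described.

```python
def boot(n_threads, init_steps):
    """Return the clock cycle at which each hardware thread leaves reset code.

    Thread 0 spends `init_steps` of its own slots on one-time initialization,
    then releases a flag; every other thread polls the flag in its slot.
    """
    released = False
    work_left = init_steps
    start_cycle = [None] * n_threads
    cycle = 0
    while None in start_cycle:
        tid = cycle % n_threads          # barrel: one thread slot per clock
        if tid == 0 and not released:
            work_left -= 1               # boot thread does one init step
            if work_left <= 0:
                released = True          # "semaphore" released
                start_cycle[0] = cycle
        elif released and start_cycle[tid] is None:
            start_cycle[tid] = cycle     # spinning thread sees the flag
        cycle += 1
    return start_cycle

print(boot(n_threads=4, init_steps=3))
```

Because thread 0 only gets one slot in four, three initialization steps take until cycle 8, and the remaining threads start on the slots immediately after.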

Rick Filipkiewicz

Oct 20, 2002, 8:18:15 AM

Jan Gray wrote:

> Nicholas wrote:
> > MIPS: Machine without Interlocking Pipeline Stages.
>
> MIPS: "Microprocessor without Interlocked Pipeline Stages"
>
> Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
> have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
> would necessarily a 1 cycle ALU-to-use delay. That is rather more
> challenging for the code scheduler to address.
>

In fact it was the R2000/3000 [and its IDT 30xx derivatives] that relied
strictly on compiler scheduling to handle the, then, 1 clock load-to-use delay.
When the R4000 with its 8 stage pipeline came out the MIPS architects had
decided that 2 clocks was too much (and I suppose they wanted to be able to run
legacy MIPS-I code) and so they put in a load interlock. In fact they added
something called a pipeline `slip' to handle this where part of the pipe stops &
the rest keeps going, instead of a crude stall. This interlock was kept even
when the non-multiprocessor R4000 variant was re-engineered as the 5-stage R4600
and then on to the R5000, RM52xx, RM70xx etc.

I'll check up with our compiler person just how hard a 1 clock ALU-to-use delay
would be for gcc's scheduler to handle.

Jan Gray

Oct 25, 2002, 9:05:08 PM

Earlier I wrote:
"One can also build a simple barrel processor...
Then you can have a 4-deep pipeline without need for any result
forwarding muxes ..."

See slides 2-7 in this MIT architecture lecture by Asanovic:
http://abp.lcs.mit.edu/6.823/lectures/lecture23.pdf.

Alisha Pal

Feb 18, 2022, 2:56:01 AM

There was a time when I knew pretty much everything about CPU (https://cloudplay.fm/cpu-cores-vs-threads/) design that wasn't covered by an NDA. Details (hardware and software) are definitely more significant than we can evaluate without a strong design to discuss for fine-grained work.

Hassan Iqbal

Feb 20, 2022, 5:13:31 PM

On Friday, 18 February 2022 at 07:56:01 UTC, Alisha Pal wrote:
> There was a time when I knew pretty much everything about CPU (https://cloudplay.fm/cpu-cores-vs-threads/) design that wasn't covered by an NDA. Details (hardware and software) are definitely more significant than we can evaluate without a strong design to discuss for fine-grained work.

The OP is 20 years old. As of today, PicoBlaze is no longer supported by Xilinx.