Xilinx microblaze vs. picoblaze


emanuel stiebler

Oct 15, 2002, 1:14:20 AM
Hi,

does anybody out here have some insight into why the 8-bit PicoBlaze (up to 112 MHz)
is clocked lower than the 32-bit MicroBlaze (up to 150 MHz)?

cheers

Falk Brunner

Oct 15, 2002, 12:35:35 PM
"emanuel stiebler" <e...@ecubics.com> wrote in message
news:3DABA42C...@ecubics.com...

> Hi,
>
> does anybody out here have some insight into why the 8-bit PicoBlaze (up to 112 MHz)

112 MHz (one hundred twelve megahertz???)
In what device? The fastest Virtex-II??
My experience is somewhere around 50 MHz in a -5/-6 Spartan-II(E).

> is clocked lower than the 32-bit MicroBlaze (up to 150 MHz)?

Looks like the MicroBlaze is more pipelined than the PicoBlaze (which has
some pipelining too). PicoBlaze (Hello Ken ;-) was developed with minimum
size as the top priority; speed came second (AFAIK).

--
Regards
Falk

Symon

Oct 15, 2002, 12:51:42 PM
Dear Emanuel,
MicroBlaze is heavily pipelined, so it can be clocked faster than
PicoBlaze. OTOH, PicoBlaze uses far fewer FPGA resources.
HTH, Syms.

emanuel stiebler <e...@ecubics.com> wrote in message news:<3DABA42C...@ecubics.com>...

emanuel stiebler

Oct 15, 2002, 1:24:40 PM
Falk Brunner wrote:
>
> "emanuel stiebler" <e...@ecubics.com> wrote in message
> news:3DABA42C...@ecubics.com...
> > Hi,
> >
> > does anybody out here have some insight into why the 8-bit PicoBlaze (up to 112 MHz)
>
> 112 MHz (one hundred twelve megahertz???)
> In what device? The fastest Virtex-II??

http://www.xilinx.com/ipcenter/processor_central/picoblaze/index.htm

That was even a typo. It runs at 116 MHz ;-)
at least in the press release ...

> My experience is somewhere around 50 MHz in a -5/-6 Spartan-II(E).

There are two "PicoBlazes": one for the Spartan, one for the Virtex.
Pretty different beasts.

> > is clocked lower than the 32-bit MicroBlaze (up to 150 MHz)?
>
> Looks like the MicroBlaze is more pipelined than the PicoBlaze (which has
> some pipelining too). PicoBlaze (Hello Ken ;-) was developed with minimum
> size as the top priority; speed came second (AFAIK)

AFAIR, the MicroBlaze still needs only two clocks per instruction ...

cheers

Falk Brunner

Oct 15, 2002, 2:06:43 PM
"emanuel stiebler" <e...@ecubics.com> wrote in message
news:3DAC4F58...@ecubics.com...

> > My experience is somewhere around 50 MHz in a -5/-6 Spartan-II(E).
>
> There are two "picoblazes". One for the Spartan, one for the Virtex.
> Pretty different beasts.

?? Are you sure?
I think there are NOT two PicoBlaze versions, since Spartan-II (why the hell
is everyone talking about Spartan when it is actually Spartan-II, which,
honestly, is a BIG difference???) is practically identical to Virtex.

> > size on top priority, speed was second (AFAIK)
>
> AFAIR, the MicroBlaze still needs only two clocks per instruction ...

PICOBLAZE (TPFKAKCPSM, decode THIS ;-)) executes EVERY instruction in two
clock cycles; this was done for simplicity (resource usage). But it is still much
better than the original stupid 12-clocks-per-instruction design of the 8051 ...
I don't know about MicroBlaze.

--
Regards
Falk


Goran Bilski

Oct 15, 2002, 2:03:16 PM
Hi,

MicroBlaze is more pipelined and more heavily floorplanned than PicoBlaze.
The 150 MHz for MicroBlaze is on a V2Pro, and the 116 MHz for PicoBlaze is on a VII-6.
MicroBlaze runs at 135 MHz on a VII-6.

PicoBlaze is optimized for area, and more pipelining has a big area cost in
processor design.

There is also not that big a difference in performance with different data sizes.
The carry chain is pretty fast as soon as you start to use it.
A 64-bit MicroBlaze would probably run at 100 MHz, and an 8-bit MicroBlaze would
run just a little faster than the 32-bit version, since the control decoding will
probably be the limiting factor.

By the way, for most instructions MicroBlaze needs only 1 clock/instruction.

Göran

emanuel stiebler

Oct 15, 2002, 2:48:01 PM
Goran Bilski wrote:
>
> There is also not that big a difference in performance with different data sizes.
> The carry chain is pretty fast as soon as you start to use it.
> A 64-bit MicroBlaze would probably run at 100 MHz, and an 8-bit MicroBlaze would
> run just a little faster than the 32-bit version, since the control decoding will
> probably be the limiting factor.
>
> By the way, for most instructions MicroBlaze needs only 1 clock/instruction.

That was the answer I was looking for, even though I didn't make
it clear what exactly my question was ;-)

So we can't go faster than around 100 "MIPS" on the current Xilinx parts,
independent of whether it is an 8-, 16-, 32-, or 64-bit processor, right?

And the problem is really in the instruction decoding, the control path.

Thanks

Nicholas C. Weaver

Oct 15, 2002, 2:53:35 PM
In article <3DAC62E1...@ecubics.com>,

emanuel stiebler <e...@ecubics.com> wrote:
>That was the answer I was looking for, even though I didn't make
>it clear what exactly my question was ;-)
>
>So we can't go faster than around 100 "MIPS" on the current Xilinx parts,
>independent of whether it is an 8-, 16-, 32-, or 64-bit processor, right?
>
>And the problem is really in the instruction decoding, the control path.

Well, you CAN, but you are going to have to go multithreaded. If your
critical path is one 32-bit add plus bypassing, yeah, you really can't go
any faster. The only way to pipeline this (or the instruction decode,
if that is the critical factor) is to use an interleaving,
multithreading strategy (a.k.a. C-slowing a microprocessor).
--
Nicholas C. Weaver nwe...@cs.berkeley.edu

Goran Bilski

Oct 15, 2002, 3:38:59 PM
Hi,

If your definition of MIPS assumes a maximum of one instruction per clock cycle, the 150
MHz MicroBlaze has 150 MIPS.
It is also possible to do a 200+ MIPS processor, if you want to optimize for
MIPS.
A heavily pipelined processor without any forwarding in the pipeline could easily run at
200+ MHz.
Would that processor be more efficient than MicroBlaze? I don't think so; the number
of stalls due to pipeline hazards will actually give it lower performance than
MicroBlaze.
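The trade-off described here can be put into a quick back-of-the-envelope model (the 0.5 stall cycles per instruction for the no-forwarding design is an illustrative assumption, not a measured figure):

```python
# A faster clock buys nothing once pipeline stalls raise the average
# cycles-per-instruction (CPI). Numbers are illustrative assumptions.

def effective_mips(clock_mhz: float, stalls_per_instr: float) -> float:
    """Sustained MIPS = clock / average cycles per instruction."""
    return clock_mhz / (1.0 + stalls_per_instr)

# 150 MHz core with forwarding: essentially 1 clock/instruction.
forwarded = effective_mips(150, 0.0)      # 150.0

# Hypothetical 200 MHz core without forwarding: assume hazards cost
# an average of 0.5 extra cycles per instruction.
unforwarded = effective_mips(200, 0.5)    # ~133.3

print(forwarded, unforwarded)
```

With those assumed numbers, the faster-clocked but stall-prone design ends up slower overall, which is the point being made.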

Processors in FPGAs have to be handled more delicately than ASIC processors, because
forwarding in the pipeline can easily remove all the benefits gained by more pipeline stages. In
an FPGA a mux costs as much as an ALU, which is not the case for ASIC or custom design.

Another approach is to rely on advanced compiler techniques for handling all the
pipeline hazards, but that would make it almost impossible to program the processor in
assembler, since the user has to do the handling.
I personally don't think that this approach would gain that much more performance than
MicroBlaze, and you would have to spend a lot of resources on the compiler which could be
used for other stuff.

Another approach is to add multi-threading capabilities, but I think that
multi-processing is better for FPGAs than multi-threading.

Göran Bilski

Nicholas C. Weaver

Oct 15, 2002, 4:14:45 PM
In article <3DAC6ED3...@Xilinx.com>,
Goran Bilski <Goran....@Xilinx.com> wrote:

>Another approach is to add multi-threading capabilities but I think that
>multi-processing is better for FPGA than multi-threading.

I disagree, based on the following observation: A 4-5 stage pipeline
is going to have 2-3 levels of FPGA logic between each pipeline stage,
suggesting that there are plentiful registers which can be exploited
if the design is retimed and interleaved.

Actual experience: Leon I (synthesized Sparc), Virtex:

Initially: 23 MHz
Retimed: 25 MHz
2-slow retimed: 46 MHz (so each thread at 23 MHz)
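Those Leon numbers, restated as a sketch of the C-slow trade-off:

```python
# 2-slow retiming doubles the threads in flight, so aggregate
# throughput nearly doubles even though each individual thread runs
# no faster than the original design.
retimed_mhz = 25.0       # single-threaded, retimed
two_slow_mhz = 46.0      # 2-slow retimed clock
threads = 2

per_thread_mhz = two_slow_mhz / threads          # 23.0 per thread
aggregate_speedup = two_slow_mhz / retimed_mhz   # vs. retimed design

print(per_thread_mhz, aggregate_speedup)
```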

Rick Filipkiewicz

Oct 15, 2002, 4:30:48 PM

Goran Bilski wrote:

> Hi,
>
> MicroBlaze is more pipelined and is more floorplanned than PicoBlaze.
> The 150 MHz for MicroBlaze is on a V2Pro and 116MHz for PicoBlaze is on VII-6.
> MicroBlaze runs at 135 MHz on a VII-6.
>

Aha! There we have it, caught on this very NG: a Xilinx person admitting that
floorplanning is important :-)).

Goran Bilski

Oct 15, 2002, 5:12:34 PM
Hi,

Yes, but I don't use the floorplanner.
I added all placement constraints (RLOC) in my VHDL code.
This gives me a more stable and reliable timing.

I have actually done a test where I removed all the RLOCs from my code and let par do the
placement.
The result is within 5% of the handplaced version.

Göran

Ray Andraka

Oct 15, 2002, 6:32:38 PM
That is lower than we typically see for datapath designs. For heavily arithmetic
designs in Virtex, VirtexE and SpartanII, we've consistently seen better than 30%
improvement, and frequently better than 60%, doing the same experiment. Of course it
also depends on what you are setting the constraints to. The routing tools only work
as hard as they have to in order to meet your constraints, so if the target is low compared with
what could be realized, the results are going to be skewed, making floorplanning not
look like a big win. For VirtexII the same applies, except non-arithmetic logic is
quite a bit faster than the carry chains, so the routing limit is artificially low if
there is arithmetic covered by the same constraint. 135 MHz is not a very high target
for VirtexII.

Goran Bilski wrote:

> Hi,
>
> Yes, but I don't use the floorplanner.
> I added all placement constraints (RLOC) in my VHDL code.
> This gives me a more stable and reliable timing.
>
> I have actually done a test where I removed all the RLOCs from my code and let par do the
> placement.
> The result is within 5% of the handplaced version.
>
> Göran

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email r...@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759


Goran Bilski

Oct 15, 2002, 7:35:40 PM
Hi Ray,

Are you implying something ;-)

I could do a better job on BRAM placement, but since the number of BRAMs connected to
MicroBlaze can vary, it would force me to do a lot of different floorplans
depending on the number of BRAMs.
The BRAM is in the critical path.

Most requests on MicroBlaze are NOT about performance but more about functionality, and that
is where I spend most of my time now.

Göran

Ray Andraka

Oct 15, 2002, 8:31:05 PM
No, only that the savings are not representative of what can typically be achieved in a
pipelined datapath design through floorplanning. The placement of the BRAMs and multipliers
is certainly a driver. Most of our stuff is tolerant of additional pipelining, so we will
typically surround the BRAMs with additional pipeline stages, which is something that doesn't
work too well with a simple microprocessor.

Hal Murray

Oct 16, 2002, 12:28:54 AM

>Processors in FPGAs have to be handled more delicately than ASIC processors, because
>forwarding in the pipeline can easily remove all the benefits gained by more pipeline stages. In
>an FPGA a mux costs as much as an ALU, which is not the case for ASIC or custom design.
>
>Another approach is to rely on advanced compiler techniques for handling all the
>pipeline hazards, but that would make it almost impossible to program the processor in
>assembler, since the user has to do the handling.
>I personally don't think that this approach would gain that much more performance than
>MicroBlaze, and you would have to spend a lot of resources on the compiler which could be
>used for other stuff.

This seems like an interesting opportunity for an open source project.


--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Hal Murray

Oct 16, 2002, 12:41:35 AM
>Another approach is to add multi-threading capabilities but I think that
>multi-processing is better for FPGA than multi-threading.

Why?

If I understand what multi-threading means, the idea is to interleave
alternate cycles of two execution streams in order to reduce the
losses due to stalls.

It looks like it "just" requires an extra address bit (odd/even cycle)
to the register file and the same bit selects between pairs of special
registers like the PC.

Are you telling me that the ALU and instruction decoding are small enough
that I might just as well build two copies of the whole CPU?
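This scheme can be sketched as a behavioural model (Python standing in for RTL; the class name and register count are hypothetical):

```python
# Hal's idea: the thread id is one extra register-file address bit,
# and the same bit selects between duplicated special registers like
# the PC. Names and sizes below are illustrative assumptions.

class TwoThreadRegFile:
    NREGS = 32

    def __init__(self):
        # One flat memory; thread t, register r lives at t*NREGS + r,
        # i.e. the thread id is just the top address bit.
        self.mem = [0] * (2 * self.NREGS)
        self.pc = [0, 0]                  # one PC per thread

    def read(self, thread, reg):
        return self.mem[thread * self.NREGS + reg]

    def write(self, thread, reg, value):
        self.mem[thread * self.NREGS + reg] = value

rf = TwoThreadRegFile()
rf.write(0, 5, 111)       # thread 0's r5
rf.write(1, 5, 222)       # thread 1's r5: same architectural register,
                          # different physical word, so no collision
print(rf.read(0, 5), rf.read(1, 5))
```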

Goran Bilski

Oct 16, 2002, 11:08:34 AM
Hi,

Sort of.

The complete decoding and the ALU are around 10-13% of the design.
The actual instruction decoding is less than 5%.

Making it multithreaded, as I understand it, means having more than one instruction
stream in the pipeline.
What is the benefit, unless you double the pipeline and have two data pipelines?
Almost nothing.

So with two threads in MicroBlaze, doubling the pipeline means doubling the size
of MicroBlaze.
You also have to double the instruction-fetch throughput in order to keep
the two streams busy.
That would put a big burden on the bus infrastructure and the external memory
interface, which suddenly has to double its performance.
The doubling of the pipeline and the added control handling WILL also lower the
maximum clock frequency of MicroBlaze.

Compare this with two MicroBlazes, each of which can have its own separate instruction
fetching, both running at the maximum clock frequency.

I would say that multiprocessing, which is easier to do and gives more performance,
is the better choice.
Say you suddenly would like to have 5 threads instead of 2. That is a major
change to a multithreading MicroBlaze, and it is almost impossible to get the
instruction fetching to keep up. With multiprocessing, you just add another 3
MicroBlazes and you're done.

BUT there is always a catch, and that is how you write programs for these systems.

Göran

Nicholas C. Weaver

Oct 16, 2002, 12:03:31 PM
In article <3DAD80F2...@Xilinx.com>,

Goran Bilski <Goran....@Xilinx.com> wrote:
>Hi,
>
>Sort of.
>
>The complete decoding and the ALU are around 10-13% of the design.
>The actual instruction decoding is less than 5%.
>
>Making it multithreaded, as I understand it, means having more than one instruction
>stream in the pipeline.
>What is the benefit, unless you double the pipeline and have two data pipelines?
>Almost nothing

Uhh, you don't double the pipelines; you take the single pipeline,
double up the registers IN it, and then move the registers to
rebalance all the pipeline stages, as you now have 2x the registers
through any feedback loop, allowing you to up the clock frequency a lot.

If you do this to every register in the core (and tweak the RF), a
multithreaded design just sort of "drops out" automatically.

You can even write a tool to do that automatically.

What happens in the end is you take advantage of the two threads to
up the clock substantially. Each individual thread is now a little
slower, but the throughput for the 2 threads is substantially
higher. You use more pipelining and more power, and you may or may
not end up thrashing the caches, but it does work.

I can send you a paper submission and a thesis chapter draft on the
subject if you want.

>So with two threads in MicroBlaze, to double the pipeline is to
>double the size of MicroBlaze. You also have to double the
>instruction fetching data throughput in order to keep the two streams
>busy. That would put a big burden on the bus infrastructure and
>external memory interface which suddenly has to double its
>performance. The doubling of the pipeline and added control handling
>WILL also lower the maximum clock frequency of MicroBlaze.

You don't need to double the external memory interface if you share the
cache; this is especially true on workloads where the threads are
related. The external memory interface is now at 2x the CLOCK, but you
could slow it down from there and arbitrate between the two streams of
execution.

You also probably want to make the feeding of interrupts a little
different, so you can designate one thread as receiving the
interrupts.

>Say you suddenly would like to have 5 threads instead of 2. That is a major
>change of the multithreading MicroBlaze and almost impossible to get the
>instruction fetching to keep up. With multiprocessing, just add another 3
>MicroBlazes and you're done.

What you do is have a 1-thread and a 2-thread version (going
beyond 2 threads seems to be less effective, maybe 3 depending on the
architecture). From the exterior, however, they still look normal.
You can still tile them like any other core to create a multiprocessor
machine.

>BUT there is always a catch and that is how you write programs for these
>systems.

"one thread for I/O, one thread for processing" does come up in some
cases.


>Göran
>
>Hal Murray wrote:
>
>> >Another approach is to add multi-threading capabilities but I think that
>> >multi-processing is better for FPGA than multi-threading.
>>
>> Why?
>>
>> If I understand what multi-threading means, the idea is to interleave
>> alternate cycles of two execution streams in order to reduce the
>> losses due to stalls.
>>
>> It looks like it "just" requires an extra address bit (odd/even cycle)
>> to the register file and the same bit selects between pairs of special
>> registers like the PC.
>>
>> Are you telling me that the ALU and instruction decoding are small enough
>> that I might just as well build two copies of the whole CPU?
>>

Goran Bilski

Oct 16, 2002, 12:24:15 PM
Hi,

"Nicholas C. Weaver" wrote:

Please do.

If you double all the registers in the data pipeline, haven't you doubled the
pipeline?
Or is all the functionality between the pipestages shared?

Nicholas C. Weaver

Oct 16, 2002, 1:27:12 PM
In article <3DAD92AF...@Xilinx.com>,

Goran Bilski <Goran....@Xilinx.com> wrote:
>> I can send you a paper submission and a thesis chapter draft on the
>> subject if you want.
>>
>
>Please do.

Done.

>If you double all the registers in the data pipeline, haven't you doubled the
>pipeline?
>Or is all the functionality between the pipestages shared?

The functions between the pipeline stages remain unchanged, so one is
pipelining the computation at a finer grain. This can work fairly
well for FPGAs, as the ratio of LUTs to FFs is 1/1, but that is usually
not the case in logic except for highly aggressive designs.

Ray Andraka

Oct 16, 2002, 1:41:15 PM

Goran Bilski wrote:

>
>
> Please do.
>
> If you double all the registers in the data pipeline, haven't you doubled the
> pipeline?
> Or is all the functionality between the pipestages shared?

Yes, you've doubled the pipeline by doubling the registers. In a non-pipelined
design, you presumably have several layers of LUTs between registers. Adding pipeline
registers is nearly free in the FPGA, since they are there but unused in slices used
for combinatorial LUTs. By adding the pipeline stages you can crank up the clock
considerably if you've balanced the delays between the registers. That in turn lets
you have more than one thread in process at any given moment. All the functionality
goes through the same path, it is just that the path has more registers in it, so that
the next sample can be put into the pipeline before the previous one comes out.

Goran Bilski

Oct 16, 2002, 2:06:47 PM

You can't just double the number of pipestages for a processor without major impacts.
For a streaming pipeline, which is what hardware pipelines are, I agree, but for a processor
that can't be done.

Göran

Goran Bilski

Oct 16, 2002, 2:09:55 PM
Hi,

"Nicholas C. Weaver" wrote:

> In article <3DAD92AF...@Xilinx.com>,
> Goran Bilski <Goran....@Xilinx.com> wrote:
> >> I can send you a paper submission and a thesis chapter draft on the
> >> subject if you want.
> >>
> >
> >Please do.
>
> Done.
>

Thanks

>
> >If you double all the registers in the data pipeline, haven't you doubled the
> >pipeline?
> >Or is all the functionality between the pipestages shared?
>
> The functions between the pipeline stages remain unchanged, so one is
> pipelining the computation at a finer grain. This can work fairly
> well for FPGAs, as the ratio of LUTs to FFs is 1/1, but that is usually
> not the case in logic except for highly aggressive designs.
>

The problem with sharing functionality is that you need muxes to select which thread
to use.
These muxes will in most cases be larger than the function itself and will
definitely decrease the clock frequency of the processor.

I still believe that multi-threading is useful for custom/ASIC implementations, but
multi-processing works better in an FPGA.

Göran

Jim Granville

Oct 16, 2002, 2:31:40 PM
Nicholas C. Weaver wrote:
>
> In article <3DAD80F2...@Xilinx.com>,
> Goran Bilski <Goran....@Xilinx.com> wrote:
> >So with two threads in MicroBlaze, to double the pipeline is to
> >double the size of MicroBlaze. You also have to double the
> >instruction fetching data throughput in order to keep the two streams
> >busy. That would put a big burden on the bus infrastructure and
> >external memory interface which suddenly has to double its
> >performance. The doubling of the pipeline and added control handling
> >WILL also lower the maximum clock frequency of MicroBlaze.
>
> You don't need to double the external memory interface if you share the
> cache; this is especially true on workloads where the threads are
> related. The external memory interface is now at 2x the CLOCK, but you
> could slow it down from there and arbitrate between the two streams of
> execution.

Suppose the memory interface was optimised for synchronous/burst FLASH
- doesn't this multi-threading fight a little with that interface?

Or do you always need a multi-layer memory interface?



> You also probably want to make the feeding of interrupts a little
> different, so you can designate one thread as receiving the
> interrupts.

That sounds like a good idea, plus it implies SW access to thread steering,
so small routines can also be tagged for a 'fast thread'.

Is this the same scheme Intel has coming, where they claim 25% higher
performance (if the SW supports it :)

-jg

Nicholas C. Weaver

Oct 16, 2002, 2:35:27 PM
In article <3DADAAB7...@Xilinx.com>,
Goran Bilski <Goran....@Xilinx.com> wrote:

>You can't just double the number of pipestages for a processor without
>major impacts. For a streaming pipeline, which is what hardware pipelines are, I
>agree, but for a processor that can't be done.

Uhh, yes it can.

Double all the pipeline stages, double the register file, rebalance
the delays now that you have more pipelining, and out drops a 2-thread
multithreaded architecture. Each single thread now runs slower, but
aggregate throughput (sum of the two threads) is increased.

It is so obvious yet unintuitive that nobody has actually DONE it
before. :)

Ken McElvain

Oct 16, 2002, 2:39:34 PM
One big advantage of multi-threading is that the pipeline
interlocks can be eliminated if the number of threads is larger
than the longest feedback path in the pipeline. For example,
a branch instruction does not have to stall waiting for the
condition from the preceding comparison. This yields
some boost in the total performance.

Permanent state such as condition codes and register files has to be
expanded into larger memories, with part of the index being the current
thread id, but other registers mostly do not have to be modified. Given
the distributed RAM capabilities in Xilinx parts, this is pretty
cheap.

The first place I saw this was in the CDC 6600 I/O processors, which
I believe ran 16 threads.

- Ken
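Ken's condition can be seen in a toy barrel model (the depth and cycle count are arbitrary assumptions):

```python
# With as many threads as pipeline stages, issued round-robin, a
# thread's previous result has always written back before that thread
# issues again, so no interlock or forwarding logic is needed.

DEPTH = 4                       # pipeline stages == number of threads
regs = [0] * DEPTH              # one accumulator per thread
pipe = [None] * DEPTH           # in-flight (thread, value) pairs

for cycle in range(4 * DEPTH):
    done = pipe.pop()           # last stage: writeback
    if done is not None:
        t, v = done
        regs[t] = v
    t = cycle % DEPTH           # round-robin issue, one thread per cycle
    # Dependent instruction regs[t] += 1: it reads the value this
    # thread wrote back DEPTH cycles ago, guaranteed to be visible.
    pipe.insert(0, (t, regs[t] + 1))

print(regs)                     # each thread has retired 3 increments
```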

Nicholas C. Weaver

Oct 16, 2002, 3:02:38 PM
In article <3DADB0...@designtools.co.nz>,
Jim Granville <jim.gr...@designtools.co.nz> wrote:

>Nicholas C. Weaver wrote:
>> You don't need to double the external memory interface if you share the
>> cache; this is especially true on workloads where the threads are
>> related. The external memory interface is now at 2x the CLOCK, but you
>> could slow it down from there and arbitrate between the two streams of
>> execution.
>
>Suppose the memory interface was optimised for synchronous/burst FLASH,
>- doesn't this multi threading fight a little with that interface ?

A bit, but not too bad. You want to redo the memory interface anyway
to make the outside look like a normal, single-threaded part for
convenience.

>> You also probably want to make the feeding of interrupts a little
>> different, so you can designate one thread as receiving the
>> interrupts.
>
>That sounds like a good idea, plus it implies SW access to thread steering,
>so small routines can also be tagged for a 'fast thread'.
>
>Is this the same scheme Intel has coming, where they claim 25% higher
>performance (if the SW supports it :)

It's not coming, it's present in the P4 Xeons, and present but untested
and disabled in the desktop P4s. You want better software, or at least
better scheduling, to use it RIGHT, but it is there.

Same concept (run 2 threads as 2 separate virtual processors on the
same shared hardware), largely the same programmer implications (same
working set good, different working sets BAAAAD, which is just the
opposite of an SMP), totally different implementation:

The Intel/UWashington (Hyperthreading/SMT) approach is to issue from 2
instruction streams into a superscalar core, as any one stream
actually has pretty crappy utilization of the functional units. Intel
doesn't even increase the physical registers (which are much more
numerous than the architectural registers).

C-slow multithreading is to double up the pipeline, issuing from two
threads on an even/odd basis, so each thread now runs a little slower,
but the finer pipelining allows the system to run faster.

Goran Bilski

Oct 16, 2002, 2:56:14 PM
Hi,

I agree that you can, if you also double the clock frequency of the pipeline,
effectively creating subphases of the normal clock.
What I meant was keeping the same clock and just adding more pipestages.

I have finally got the idea of the multithreading, but it is not as easy to implement,
since you need to find a good middle point in each
pipestage that can divide the pipestage into equal parts.

The control path also needs to be split into subparts, and you also need to find
good points to break it up.
The processor also definitely needs a cache which you can run at double
speed, or with more ports, in order to get the data for each thread.
I think that would be the largest obstacle for a multithreaded MicroBlaze: the
number of ports on the BRAM is finite (2), and in my implementation the BRAM is
already almost in the critical path.


Göran Bilski

Nicholas C. Weaver

Oct 16, 2002, 3:15:51 PM
In article <3DADB64E...@Xilinx.com>,

Goran Bilski <Goran....@Xilinx.com> wrote:
>Hi,
>
>I agree that you can, if you also double the clock frequency of the pipeline,
>effectively creating subphases of the normal clock.
>What I meant was keeping the same clock and just adding more pipestages.

Yeah, that's a loss generally. You need to add more stages around all
the feedback loops to really improve things, and that is what the
multithreading allows.

>I have finally got the idea of the multithreading, but it is not as easy to
>implement, since you need to find a good middle point in each
>pipestage that can divide the pipestage into equal parts. The
>control path also needs to be split into subparts, and you also need
>to find good points to break it up.

You can actually write a tool to do it all automatically, especially
if you are willing to say "screw initial conditions".

Hal Murray

Oct 16, 2002, 3:25:12 PM
>The problem with sharing functionality is that you need muxes to select which thread
>to use.
>These muxes will in most cases be larger than the function itself and will
>definitely decrease the clock frequency of the processor.

I don't think we are on the same wavelength yet.

The idea is not to add those muxes, but to have two threads
running through the same heavily pipelined structure on
alternating cycles.

The extra pipeline registers are free on most FPGAs. They
let you run with a faster clock rate. But if you only have
one thread you will waste many of those new cycles on stalls.
You can get back those cycles if you let another thread use
them.

Only around the edges do you need new muxes - a wider register
file and new mux/enables on registers like the PC.


> I still believe that multi-threading is useful for custom/ASIC implementations, but
> multi-processing works better in an FPGA.

This isn't rocket science. All we need is to compare 2x the LUTs used
by a single-threaded design with a multithreaded design (scaled by
clock speed if that changes).

Hal Murray

Oct 16, 2002, 3:30:49 PM
> BUT there is always a catch and that is how you write programs
> for these systems.

Standard programming problem. People are getting pretty good
at it. Yes, there are lots of applications where it doesn't work.

If you can't take advantage of multi-threading then you wouldn't
be able to use multi-processing either.

rickman

Oct 16, 2002, 4:01:34 PM
Hal Murray wrote:
>
> >Processors in FPGAs have to be handled more delicately than ASIC processors, because
> >forwarding in the pipeline can easily remove all the benefits gained by more pipeline stages. In
> >an FPGA a mux costs as much as an ALU, which is not the case for ASIC or custom design.
> >
> >Another approach is to rely on advanced compiler techniques for handling all the
> >pipeline hazards, but that would make it almost impossible to program the processor in
> >assembler, since the user has to do the handling.
> >I personally don't think that this approach would gain that much more performance than
> >MicroBlaze, and you would have to spend a lot of resources on the compiler which could be
> >used for other stuff.
>
> This seems like an interesting opportunity for an open source project.

Aren't there already CPUs in FPGA open source projects?

http://www.fpgacpu.org/

http://www.opencores.org/

The list is getting pretty long.

--

Rick "rickman" Collins

rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Ray Andraka

Oct 16, 2002, 4:08:05 PM
No, you don't need muxes. The processing is time-division multiplexed in the same
hardware. It is just a matter of keeping track of where the pieces of each thread are
in relation to one another. As long as all parts have the same depth through the
hardware loops, you get that depth worth of multithreading. The only place muxes may
be needed is for selecting outside inputs.

Goran Bilski wrote:

> Hi,
>
> "Nicholas C. Weaver" wrote:
>
> The problem to share functionality is that you need muxes to select which thread
> to use.
> These muxes will be in most cases larger than the function itself and will
> definitely decrease the clock frequency of the processor.
>

> Göran
>

--

Ray Andraka

Oct 16, 2002, 4:09:30 PM
It can, if each phase of the pipeline is assigned to a different thread.

Ray Andraka

Oct 16, 2002, 4:20:21 PM
We do it all the time in our DSP designs. Granted, they are not typically
microprocessors in the traditional sense, but the fact is it is doable.

Basically what happens is that if you have a single thread, you have extra
clock cycles between each instruction to allow time for the pipeline
propagation. Let's take a really simple case where you are just pulling data
out of a register, conditionally adding 1 to it and putting it back. In a
non-pipelined version you can increment a particular value on every clock. If
you add pipelining to the data path, the result of the first increment is not
available in the memory to be used again for N clocks, where N is the depth of
the pipeline. The pipelining does allow you to run the clock faster, but it
actually slows the processing of that one memory location, since the process
has to be stalled until the memory is updated. However, you can use the clock
cycles in between to increment other locations in memory, so long as you don't
use any values that have been updated before they become available again. In
a sense, then, you can partition the memory into N areas, each of which is
accessed only on 1 in N cycles. Each of the cycles is then a different
'thread' running on the same processor. The result is that no one thread is any
faster than on the unpipelined processor (it is actually a bit slower because you
add set-up and clock-to-Q times plus slack for each register you add to the
pipeline), but because the path is pipelined you can have more than one thread
being operated on at a time (each with a one-clock skew in relation to the
previous). This is extendable to a general-purpose processor as long as the
pipeline depth is consistent for all instructions.
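[Editor's note: the memory-partitioning scheme above can be sketched as a toy
software simulation. This is purely illustrative; the variable names and
structure are assumptions for the sketch, not code from any real core.]

```python
# Toy model of time-division multithreading on an N-deep pipeline.
# N in-flight slots each belong to a different "thread"; thread t only
# touches its own memory partition, so a value it reads is never still
# in flight when it issues again.
N = 4                      # pipeline depth = number of threads
mem = [0] * 16             # shared memory, partitioned by thread
pipeline = [None] * N      # (address, value-in-flight) pairs; None = bubble

for cycle in range(32):
    # retire: the value leaving the last stage is written back to memory
    done = pipeline.pop()
    if done is not None:
        addr, val = done
        mem[addr] = val
    # issue: thread (cycle % N) reads one of *its own* locations,
    # increments it, and sends it down the pipeline
    thread = cycle % N
    addr = thread          # each thread uses address == its own ID here
    pipeline.insert(0, (addr, mem[addr] + 1))

# drain the pipeline at the end
for done in pipeline:
    if done is not None:
        addr, val = done
        mem[addr] = val

print(mem[:N])             # -> [8, 8, 8, 8]: each thread incremented 8 times
```

Each thread issues once every N cycles, exactly when its previous result has
retired, so no stalls and no hazard muxes are needed.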

"Nicholas C. Weaver" wrote:

--

Ray Andraka

unread,
Oct 16, 2002, 4:24:28 PM10/16/02
to
As I recall, you were getting speeds of ~135 MHz in V2. You should be able to get a
fully pipelined processor using BRAMs up to 200 MHz or so without any big problems.
In Virtex-II, the carry chains will be the limiting factor, not the BRAM, if you do it
right. In Virtex and VirtexE, you can double the width of the BRAM and then use
registers to assemble consecutive accesses. It does get a bit messy in that case
because it introduces pipeline misses.


Goran Bilski wrote:

--

Ray Andraka

unread,
Oct 16, 2002, 4:25:45 PM10/16/02
to
The SRL16s make the permanent state really easy to store, too.

Ken McElvain wrote:

--

Goran Bilski

unread,
Oct 16, 2002, 4:33:39 PM10/16/02
to

Hal Murray wrote:

> > BUT there is always a catch and that is how you write programs
> > for these systems.
>
> Standard programming problem. People are getting pretty good
> at it. Yes, there are lots of applications where it doesn't work.
>
> If you can't take advantage of multi-threading then you wouldn't
> be able to use multi-processing either.
>

Is that true?

Don't you need to actually have two threads in order to use multi-threading,
whereas multiprocessor parallelism can be more fine-grained?

For example:
code where the inner loop has a function call where some operations take place.

Is it easier to thread that function or just place the function on another
processor?

Isn't it how data is moved between two processors/threads that is more crucial?

How do you actually move data between two threads in the same processor?

Göran

Goran Bilski

unread,
Oct 16, 2002, 5:06:22 PM10/16/02
to
Yes, but I also have an embedded multiplier which is already using the registered
output.
I can't add another pipe stage in that path since there is nowhere to insert it.
Then I have to add special arrangements in the control logic if the instruction is
using the multiplier.

I have painfully discovered that minor tweaks in the control logic can easily make it the
critical path.
I think it is possible to have two threads in MicroBlaze, but I am not convinced that it
would give me more performance than two separate MicroBlazes. The overall area will be
less than two MicroBlazes, but not far from it.

MicroBlaze has 700 LUTs and 500 DFFs. Double the number of flip-flops and the slice
count will go up.
(There are also a lot of places in Virtex and VirtexE where I have used all the ins/outs for a
slice, and even if there is a free DFF, it can't be reached.)
It will not double the size, but a significant increase will occur.

The multiprocessor approach makes it much easier to add 10 extra processors (threads).


Göran

Ray Andraka

unread,
Oct 16, 2002, 5:26:33 PM10/16/02
to

Easiest to do via memory or register file. Thread timing has to make sure the value
is available before using it.

Goran Bilski wrote:

>
> Isn't it how data is move between two processor/threads that is more crucial?
>
> How does you actually move data between two threads in the same processor?
>
> Göran
>
> >
> > --
> > The suespammers.org mail server is located in California. So are all my
> > other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
> > commercial e-mail to my suespammers.org address or any of my other addresses.
> > These are my opinions, not necessarily my employer's. I hate spam.

--

Nicholas C. Weaver

unread,
Oct 16, 2002, 5:48:48 PM10/16/02
to
In article <3DAC6ED3...@Xilinx.com>,
Goran Bilski <Goran....@Xilinx.com> wrote:

>Another approach is to rely on advanced compiler techniques for
>handling all the pipeline hazards but it would make it almost
>impossible to program the processor in assembler since the user has
>to do the handling. I personally don't think that this approach
>would gain that much more performance than MicroBlaze and you have to
>spend a lot of resources on the compiler which could be used for
>other stuff.

MIPS: Machine without Interlocking Pipeline Stages.

Ray Andraka

unread,
Oct 16, 2002, 6:42:11 PM10/16/02
to
It won't quite double performance, but it also should not be significantly larger in area;
otherwise you either aren't doing it right or it is already heavily pipelined. The gain is not raw
performance, it is a gain in performance/area. Two separate instances will provide more
MIPS than one dual-threaded machine, but at the cost of more area.

Normally, the pipeline stages should be inserted so that their input comes from the LUT in
the same slice, so it is not an issue if you used up all the inputs. The only blocking
input in that case is the SR input if you are using a CLB RAM or SRL16. The control logic
can usually be pipelined similarly, but it may require a fresh start at the design rather
than patching the existing one.

Jan Gray

unread,
Oct 16, 2002, 6:47:46 PM10/16/02
to
Nicholas wrote:
> MIPS: Machine without Interlocking Pipeline Stages.

MIPS: "Microprocessor without Interlocked Pipeline Stages"

Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
would necessarily have a 1 cycle ALU-to-use delay. That is rather more
challenging for the code scheduler to address.

My (unpublished) V-II architectural studies back up what Goran has
been writing. Furthermore, to 2-thread any such machine would increase the
area intolerably, because so much area is tied up in register files (which
would have to double in size). Also, if you grow the area of the processor,
it will slow down because of increased interconnect delays. A compact
processor is a fast processor.

Goran wrote:
> MicroBlaze has 700 LUTS and 500 DFFs. Double the number of flipflops and
the slices count will go up.

Is that an implementation improvement over the old 900+ LUTs figure, or did
the old 900 LUTs figure include other non-core resources? Can you say what
changed?

(SPRAM reg files? (The following is fast enough for ~150 MHz operation:
Register the result in a write-back register (in FFs) on CLK rising edge;
present reg file write address and write-back data to reg file SPRAMs while
CLK high; write results to SPRAMs on CLK falling edge; present reg file read
address while CLK low; mux SPRAM outputs with immediate and/or forwarded
result; and register in operand registers.))
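[Editor's note: the half-cycle schedule above can be modeled phase by phase as a
toy simulation. This is a hypothetical sketch of the scheme Jan describes; the
class and method names are illustrative, not from any real core.]

```python
# Toy phase model of a double-pumped LUT-RAM (SPRAM) register file:
# result captured in a write-back FF on the rising edge, committed to the
# SPRAM on the falling edge, read presented while CLK is low.
class DoublePumpedRegFile:
    def __init__(self, nregs=32):
        self.spram = [0] * nregs   # LUT-RAM contents
        self.wb = None             # write-back register (flip-flops)

    def clk_rising(self, addr, data):
        # CLK rising edge: latch the result into the write-back register
        self.wb = (addr, data)

    def clk_falling(self):
        # write address/data were presented while CLK was high;
        # the SPRAM commits the write on the falling edge
        if self.wb is not None:
            a, d = self.wb
            self.spram[a] = d

    def read(self, addr, forward=None):
        # read address presented while CLK is low; SPRAM output is muxed
        # with an immediate or forwarded result before the operand registers
        return forward if forward is not None else self.spram[addr]

rf = DoublePumpedRegFile()
rf.clk_rising(addr=3, data=7)   # result written back in the first half-cycle...
rf.clk_falling()
print(rf.read(3))               # -> 7: visible to the read in the second half
```

The point of the double-pumping is that a single-ported LUT RAM behaves like a
write-then-read dual-ported register file within one clock cycle.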

Jan Gray, Gray Research LLC


Goran Bilski

unread,
Oct 16, 2002, 7:14:06 PM10/16/02
to
Hi Jan,

Jan Gray wrote:

> Nicholas wrote:
> > MIPS: Machine without Interlocking Pipeline Stages.
>
> MIPS: "Microprocessor without Interlocked Pipeline Stages"
>
> Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
> have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
> would necessarily have a 1 cycle ALU-to-use delay. That is rather more
> challenging for the code scheduler to address.
>
> My (unpublished) V-II architectural studies back up what Goran has
> been writing. Furthermore, to 2-thread any such machine would increase the
> area intolerably, because so much area is tied up in register files (which
> would have to double in size). Also, if you grow the area of the processor,
> it will slow down because of increased interconnect delays. A compact
> processor is a fast processor.
>
> Goran wrote:
> > MicroBlaze has 700 LUTS and 500 DFFs. Double the number of flipflops and
> the slices count will go up.
>
> Is that an implementation improvement over the old 900+ LUTs figure, or did
> the old 900 LUTs figure include other non-core resources? Can you say what
> changed?
>

Ooops, I did it again!!
Error on my side, I looked at the report file for the core and took the number
of IO instead of LUTs.


>
> (SPRAM reg files? (The following is fast enough for ~150 MHz operation:
> Register the result in a write-back register (in FFs) on CLK rising edge;
> present reg file write address and write-back data to reg file SPRAMs while
> CLK high; write results to SPRAMs on CLK falling edge; present reg file read
> address while CLK low; mux SPRAM outputs with immediate and/or forwarded
> result; and register in operand registers.))
>

(Of 900 LUTs, 256 of them are the register file => around 30%)
It's something that I have thought of, but it would not leave much room for any
logic handling of the register addresses or operations on the register output.
Since I am using an SRL16 as the instruction prefetch buffer and the output delay of
an SRL16 is around 2 ns, I can't add much logic to the register addresses before
they have to go to the register file.

But if I get some spare time (hahahaha), this is something that I would try to do
in order to get the MicroBlaze size down.
Jan, you might be able to do a clean-room implementation of a MicroBlaze where
area is everything.
If you have any spare time ;-)

Hal Murray

unread,
Oct 16, 2002, 7:51:03 PM10/16/02
to

>> >Another approach is to rely on advanced compiler techniques for handling all the
>> >pipeline hazards but it would make it almost impossible to program the processor in
>> >assembler since the user has to do the handling.
>> >I personally don't think that this approach would gain that much more performance than
>> >MicroBlaze and you have to spend a lot of resources on the compiler which could be
>> >used for other stuff.
>>
>> This seems like an interesting opportunity for an open source project.
>
>Aren't there already CPUs in FPGA open source projects?
>
>http://www.fpgacpu.org/
>
>http://www.opencores.org/
>
>The list is