TBUFs in Virtex and later chips, going out of fashion, what instead

Neil Franklin

Mar 18, 2001, 6:02:36 PM
In the earlier Xilinx chips (3000, 4000, 5200) there are always 2
TBUFs per CLB of 2 LUT+FF, and both TBUF lines can be read.

In Virtex and Spartan-II it is down to 2 TBUFs for the 4 LUT+FF of a
CLB, so only one slice can be routed to them, and there is only 1 line
for reading back from the TBUF lines.

In Virtex-II there are only 2 TBUFs for 8 LUT+FF per CLB, which
makes it impossible even to connect the outputs of the 4-bit-wide 2 slices
of a single carry chain to a bus. The data sheet does not give the
number of readbacks.
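
The ratios behind that trend can be tabulated with a short Python sketch. It only restates the per-CLB counts given in the post; the family labels and the helper name `tbufs_per_lut` are illustrative, not taken from any data sheet.

```python
from fractions import Fraction

# (family label, TBUFs per CLB, LUT+FF pairs per CLB), as stated in the post
FAMILIES = [
    ("XC3000/4000/5200", 2, 2),
    ("Virtex/Spartan-II", 2, 4),
    ("Virtex-II", 2, 8),
]

def tbufs_per_lut(tbufs, luts):
    """TBUFs available for each LUT+FF pair within one CLB."""
    return Fraction(tbufs, luts)

ratios = {name: tbufs_per_lut(t, l) for name, t, l in FAMILIES}

for name, ratio in ratios.items():
    print(f"{name:20s} {ratio} TBUF per LUT+FF")
```

Each generation halves the ratio, which is why a 4-bit carry-chain segment can no longer drive a bus from the TBUFs of its own CLB.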

From this I get the impression that Xilinx regards TBUF buses as going
out of fashion. After all, the TBUFs' cost in chip space is next to nothing
relative to the many PIPs (about 900 per CLB in Virtex).


In the Jan Gray RISC processors TBUFs are used to implement processor
internal data buses in no space. I have the same type of situation,
with many data producing elements to be selected from. TBUFs seem to
be ideal _horizontal_ wide AND-ORs (vertical is being used for the bits,
because of the carry chains).

So I have a question: What is the Xilinx-suggested replacement for TBUFs?
Is one supposed to use MUXes implemented in the CLBs? Is there another
trick I have not yet stumbled over?


Note that I need to use Spartan-II, as Spartan is too small and JBits
only runs on Virtex and Spartan-II anyway.


--
Neil Franklin, ne...@franklin.ch.remove http://neil.franklin.ch/
Hacker, Unix Guru, El Eng HTL/FH/BSc, Sysadmin, Roleplayer, LARPer

Jan Gray

Mar 19, 2001, 9:28:08 AM
Indeed, the halving of TBUFs/LUT in Virtex, and again in V-II, makes my
datapaths larger/less functional per LUT, compared with XC4000. (Consider
what happens to the result mux in the xr16 CPU datapath schematic S3 on p. 9
of www.fpgacpu.org/papers/xsoc-series-drafts.pdf, for example. Not to
mention the "zero cost" <<2, <<4, <<8, >>2, >>4, >>8 shifters and bus
byte/word/longword resizers you can build with spare columns of TBUFs.)

Reoptimizing for Virtex has been a chore. (Another setback in Virtex vs.
4000 was the loss of independent clock inversion on LUT RAM WCLKs and LUT FF
CLKs, but that's another story.)

But for V-II there seems to be no practical alternative but to a) use
(waste) LUTs and their interconnect to build these horizontal muxes, and/or
b) recode your design to help your technology mappers merge some of the
muxes into other logic.

Regarding (b), using the Virtex-style carry logic (including MULT_AND), it
seems possible to build these "free mux" structures:
1) o[i] = addsub ? (a[i] + b[i]) : (a[i] - b[i])
2) o[i] = add ? (a[i] + b[i]) : c[i]
3) o[i] = addb ? (a[i] + b[i]) : (a[i] + c[i])
4) o[i] = addsub ? (addand ? a[i]+b[i] : a[i]-b[i]) : (addand ? a[i]&b[i] : a[i]^b[i])

See http://www.fpgacpu.org/log/nov00.html#001112 for details.
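
As a rough illustration of structures (1) and (3), here is a minimal Python bit-level model. It assumes the usual behaviour of the carry primitives (MUXCY passes the carry-in when the LUT output is 1 and takes its DI input otherwise; XORCY XORs the LUT output with the carry-in) and feeds a[i] to DI. The function names are hypothetical, and this is a sketch of the idea rather than the mapping described on the linked page.

```python
def bits(x, n):
    """Split x into n bits, LSB first."""
    return [(x >> i) & 1 for i in range(n)]

def carry_chain(lut_out, di, carry_in):
    """Ripple through one MUXCY/XORCY pair per bit; returns the result as an integer."""
    result, carry = 0, carry_in
    for i, (s, d) in enumerate(zip(lut_out, di)):
        result |= (s ^ carry) << i      # XORCY: LUT output XOR carry-in
        carry = carry if s else d       # MUXCY: pass the carry or take DI
    return result

def add_or_sub(a, b, addsub, n=4):
    """Structure (1): addsub ? a + b : a - b, modulo 2**n."""
    sub = 1 - addsub
    # The LUT inverts b when subtracting; the chain's carry-in supplies the +1.
    lut = [ai ^ bi ^ sub for ai, bi in zip(bits(a, n), bits(b, n))]
    return carry_chain(lut, bits(a, n), carry_in=sub)

def add_b_or_c(a, b, c, addb, n=4):
    """Structure (3): addb ? a + b : a + c, modulo 2**n."""
    other = b if addb else c            # per bit, a 4-input LUT function
    lut = [ai ^ oi for ai, oi in zip(bits(a, n), bits(other, n))]
    return carry_chain(lut, bits(a, n), carry_in=0)
```

In both cases the mux is absorbed into the LUT that already computes the propagate term, so it costs no extra LUT per bit.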

Synthesis tools get (1) (usually) but (as far as I know) miss the others.

Consider case (2). An add followed by a mux would seem to be a pretty common
circuit structure, and therefore important to optimize. Surely using the
single-LUT-per-bit construction is a no-brainer, right? Not so fast! There
are some tools issues.

If you inefficiently implement this as two LUTs:
t = a[i] + b[i];
o[i] = add ? t : c[i]
then trce will "see" that the latency from c[i] to o[i] is Tilo. Good.

But if you implement it in one LUT as
o[i] = add ? (a[i] + b[i]) : c[i]
e.g.
o[i] = add&(a[i]^b[i]) + ~add&c[i]
along with the appropriate configuration of MULT_AND, MUXCY, and XORCY, then
(if I recall correctly) trce will also find false ripple-carry paths from
c[i], e.g. from c[0] to o[31], which would therefore interfere with correct
static timing analysis and with timing driven placement and routing. Oops!
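
The one-LUT-per-bit version of case (2) can be sketched as a Python model. It assumes MULT_AND computes add & a[i] and drives the MUXCY DI input, that the LUT output drives the MUXCY select and the XORCY, and that the chain's carry-in is tied to 0. The function name is hypothetical.

```python
def add_or_select(a, b, c, add, n=4):
    """Structure (2): add ? (a + b) : c, one LUT plus carry logic per bit."""
    result, carry = 0, 0                      # carry-in of bit 0 tied to 0
    for i in range(n):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        lut = (ai ^ bi) if add else ci        # 4-input LUT: add, a[i], b[i], c[i]
        mult_and = add & ai                   # MULT_AND: AND of two LUT inputs
        result |= (lut ^ carry) << i          # XORCY
        carry = carry if lut else mult_and    # MUXCY: select = LUT output, DI = MULT_AND
    return result
```

With add false, MULT_AND forces every DI to 0, so the carry never leaves 0 and each XORCY simply passes c[i] through.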

Therefore, I would like to see two tools enhancements to enable correct
inference of add/mux in one LUT per bit:

a) Xilinx should enhance trce to do a more precise analysis around
ripple-carry structures, e.g. to rule out the false path from c[i] through
the carry chain to o[i+1]...o[n]. Here with 'add' feeding MULT_AND, there
is no carry-out if 'add' is false, and also, c[i] does not influence the
carry-out when 'add' is true, and thus the MUXCY carry-out does not depend
upon c[i].
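
That the path from c[i] into the carry chain is false can be checked exhaustively with a small Python model, under the same assumptions about MULT_AND and MUXCY as in the argument above (names are hypothetical). The path exists topologically, since c[i] enters the LUT that drives the MUXCY select, but no value of c ever changes a carry.

```python
from itertools import product

N = 4  # word width used for the exhaustive check

def carries(a, b, c, add):
    """Carry-out of every MUXCY stage for structure (2), LSB first."""
    carry, result = 0, []
    for i in range(N):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        lut = (ai ^ bi) if add else ci   # LUT output drives the MUXCY select
        di = add & ai                    # MULT_AND output drives the MUXCY DI
        carry = carry if lut else di
        result.append(carry)
    return result

def carry_depends_on_c():
    """True if some choice of c changes any carry for fixed a, b, add."""
    for a, b, add in product(range(1 << N), range(1 << N), (0, 1)):
        seen = {tuple(carries(a, b, c, add)) for c in range(1 << N)}
        if len(seen) != 1:
            return True
    return False
```

A timing analyzer that only sums delays along the topology cannot see this, which is the enhancement being asked for in (a).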

b) Xilinx should lobby its synthesis partners to infer add/mux structures
like (2)-(4) when possible. Or encourage a user-directive to force it.

If both (a) and (b) were done, then Xilinx customers (synthesis users and
RPM builders alike) would probably enjoy somewhat smaller and faster results
in the devices they're already using.

This add/mux inference digression aside, abundant TBUFs were useful and will
be missed. But I suppose that any FPGA feature that HDL synthesis users and
tools do not take good advantage of, is not long for this world.

Jan Gray, Gray Research LLC

Ray Andraka

Mar 19, 2001, 10:08:37 AM

Jan Gray wrote:
>
> Indeed, the halving of TBUFs/LUT in Virtex, and again in V-II, make my
> datapaths larger/less functional per LUT, compared with XC4000. (Consider
> what happens to the result mux in the xr16 CPU datapath schematic S3 on p. 9
> of www.fpgacpu.org/papers/xsoc-series-drafts.pdf, for example. Not to
> mention the "zero cost" <<2, <<4, <<8, >>2, >>4, >>8 shifters and bus
> byte/word/longword resizers you can build with spare columns of TBUFs.)
>
> Reoptimizing for Virtex has been a chore. (Another setback in Virtex vs.
> 4000 was the loss of independent clock inversion on LUT RAM WCLKs and LUT FF
> CLKs, but that's another story.)

Another drawback to the Virtex is that you no longer get the carry chain for
free for functions where you are only interested in the carry out. As a result,
something like a saturating limiter that was able to be implemented in one
column of CLBs in 4K, now takes two columns of slices with the LUTs in the first
used as pass-throughs to the carry chain :-(

>
> But for V-II there seems to be no practical alternative but to a) use
> (waste) LUTs and their interconnect to build these horizontal muxes, and/or
> b) recode your design to help your technology mappers merge some of the
> muxes into other logic.

I haven't looked at it closely, but it seems to me that you might be able to use
the horizontal OR chains for this. Have you investigated it?

>
> Regarding (b), using the Virtex-style carry logic (including MULT_AND), it
> seems possible to build these "free mux" structures:
> 1) o[i] = addsub ? (a[i] + b[i]) : (a[i] - b[i])
> 2) o[i] = add ? (a[i] + b[i]) : c[i]
> 3) o[i] = addb ? (a[i] + b[i]) : (a[i] + c[i])
> 4) o[i] = addsub ? (addand ? a[i]+b[i] : a[i]-b[i]) : (addand ? a[i]&b[i]
> : a[i]^b[i])
>
> See http://www.fpgacpu.org/log/nov00.html#001112 for details.
>
> Synthesis tools get (1) (usually) but (as far as I know) miss the others.
>
> Consider case (2). An add followed by a mux would seem to be a pretty common
> circuit structure, and therefore important to optimize. Surely using the
> single-LUT-per-bit construction is a no-brainer, right? Not so fast! There
> are some tools issues.

Jan, you are correct. The tools do not properly infer this (as well as certain
adds/counters with resets if they are anything but a dirt simple
adder/increment). This, and ability to direct placement are some reasons I
often use instantiated circuits within a generate instead of the more readable
inferred logic.

>
> If you inefficiently implement this as two LUTs:
> t = a[i] + b[i];
> o[i] = add ? t : c[i]
> then trce will "see" that the latency from c[i] to o[i] is Tilo. Good.
>
> But if you implement it in one LUT as
> o[i] = add ? (a[i] + b[i]) : c[i]
> e.g.
> o[i] = add&(a[i]^b[i]) + ~add&c[i]
> along with the appropriate configuration of MULT_AND, MUXCY, and XORCY, then
> (if I recall correctly) trce will also find false ripple-carry paths from
> c[i], e.g. from c[0] to o[31], which would therefore interfere with correct
> static timing analysis and with timing driven placement and routing. Oops!

Yep. TRCE doesn't do anything in the way of analyzing the logic in the
circuit. It just adds delays between FFs. If you are careful with the
constraints, you can block the false path, but it usually doesn't warrant the
effort or the added potential for accidentally ignoring a valid path.


>
> Therefore, I would like to see two tools enhancements to enable correct
> inference of add/mux in one LUT per bit:
>
> a) Xilinx should enhance trce to do a more precise analysis around
> ripple-carry structures, e.g. to rule out the false path from c[i] through
> the carry chain to o[i+1]...o[n]. Here with 'add' feeding MULT_AND, there
> is no carry-out if 'add' is false, and also, c[i] does not influence the
> carry-out when 'add' is true, and thus the MUXCY carry-out does not depend
> upon c[i].
>
> b) Xilinx should lobby its synthesis partners to infer add/mux structures
> like (2)-(4) when possible. Or encourage a user-directive to force it.
>
> If both (a) and (b) were done, then Xilinx customers (synthesis users and
> RPM builders alike) would probably enjoy somewhat smaller and faster results
> in the devices they're already using.
>
> This add/mux inference digression aside, abundant TBUFs were useful and will
> be missed. But I suppose that any FPGA feature that HDL synthesis users and
> tools do not take good advantage of, is not long for this world.
>
> Jan Gray, Gray Research LLC

--
-Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email r...@andraka.com
http://www.andraka.com

Austin Franklin

Mar 19, 2001, 10:37:37 AM
> In the earlier Xilinx chips (3000, 4000, 5200) there is always 2
> TBUFs per CLB of 2 LUT+FF. And both TBUF lines can be read.
>
> In Virtex and Spartan-II it is down to 2 TBUFS for the 4 LUT+FF of an
> CLB, so only one slice can be routed to them, and even only 1 line for
> reading back from TBUF lines.
>
> In Virtex-II there is even only 2 TBUFs for 8 LUT+FF per CLB. Which
> makes even connecting the output of the 4-wide 2 slices of an single
> carry chain to an bus impossible. The data sheet does not give the
> amount of readbacks.
>
> From this I get the impression, that Xilinx regards TBUF buses as going
> out of fashion. After all, the TBUFs cost in chip space is next to nothing
> relative to them many PIPs (about 900 per CLB in Virtex).

I believe a lot of this has to do with HDLs. I know that most all the
people I know using HDLs for Xilinx design don't even know what a TBUF is,
or even how they would use it. I also think the tools, tutorials, classes
etc. poorly support using them.

Neil Franklin

Mar 19, 2001, 4:19:01 PM
Ray Andraka <r...@andraka.com> writes:

> Jan Gray wrote:
> >
> > But for V-II there seems to be no practical alternative but to a) use
> > (waste) LUTs and their interconnect to build these horizontal muxes, and/or
> > b) recode your design to help your technology mappers merge some of the
> > muxes into other logic.
>
> I haven't looked at it closely, but it seems to me that you might be able to use
> the horizontal OR chains for this. Have you investigated it?

I went quickly to the data sheet. If I get it correctly, there is only
one ORCY per slice (after the G MUXCY). So if you are using 4bits per
CLB row (given by carry logic) and want one horizontal mux per bit,
you have only 1/2 the needed ORCYs.

Dammit. Using LUTs as (F1-in&F2-enable)|(F3-in&F4-enable) and then
ORCYing them would have solved the problem. And LUTs with 3 inputs
for logic and one enable and ORCY would be ideal.

Xilinx: more ORCYs in Virtex-III, please.


> > Regarding (b), using the Virtex-style carry logic (including MULT_AND), it
> > seems possible to build these "free mux" structures:
> > 1) o[i] = addsub ? (a[i] + b[i]) : (a[i] - b[i])
> > 2) o[i] = add ? (a[i] + b[i]) : c[i]
> > 3) o[i] = addb ? (a[i] + b[i]) : (a[i] + c[i])
> > 4) o[i] = addsub ? (addand ? a[i]+b[i] : a[i]-b[i]) : (addand ? a[i]&b[i]
> > : a[i]^b[i])

Sorry, I do not speak Verilog. Here is my MUXCY based layout:

     2input  4input  8input

       7       67     4567    each digit is one LUT (F1&F2)|(F3&F4) + MUXCY OR
       6       67     4567    a digit is the number of the data path bit that is processed
       5       45     4567    2x2 digits is a Virtex CLB
       4       45     4567
       3       23     0123
       2       23     0123
       1       01     0123
       0       01     0123

Java/JBits code 4-enables type 4:1 Mux:

for (int Col = MuxCol; Col < MuxCol+2; Col++) {
  for (int Row = MuxRow; Row < MuxRow+MuxBits/2; Row++) {
    /* LUT AND OR AND, 1&2|3&4, 8888|F000 = F888 */
    int Mux2i[] = Util.IntToIntArray(0xF888, 16);
    Fpga.set(Row, Col, LUT.SLICE0_F, Util.InvertIntArray(Mux2i));
    Fpga.set(Row, Col, LUT.SLICE0_G, Util.InvertIntArray(Mux2i));
    /* wide OR with LUT=0 -> !BX (=0) and LUT=1 -> 1 */
    Fpga.set(Row, Col, S0Control.XCarrySelect.XCarrySelect,
             S0Control.XCarrySelect.LUT_CONTROL);
    Fpga.set(Row, Col, S0Control.YCarrySelect.YCarrySelect,
             S0Control.YCarrySelect.LUT_CONTROL);
    Fpga.set(Row, Col, S0Control.AndMux.AndMux, S0Control.AndMux.ONE);
    Fpga.set(Row, Col, S0Control.Cin.Cin, S0Control.Cin.BX);
    Fpga.set(Row, Col, S0Control.BxInvert, S0Control.ON);
  }
}
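
The logic that this configuration is meant to implement can be sketched in Python. The sketch assumes F1 is the least significant LUT address bit (which is what makes 1&2 come out as 8888 and 3&4 as F000) and models the carry chain only as described in the code comment: a stage outputs 1 when its LUT function is 1 and otherwise passes the incoming value on. It does not model JBits or the bitstream polarity, and all names are hypothetical.

```python
def lut_init(func, n_inputs=4):
    """Truth table as an integer; bit k holds func(F1..F4), with F1 = bit 0 of k."""
    init = 0
    for k in range(1 << n_inputs):
        inputs = [(k >> j) & 1 for j in range(n_inputs)]
        init |= func(*inputs) << k
    return init

def and_or(f1, f2, f3, f4):
    """Two data inputs (F1, F3), each gated by its own enable (F2, F4)."""
    return (f1 & f2) | (f3 & f4)

INIT = lut_init(and_or)   # evaluates to 0xF888

def lut(init, f1, f2, f3, f4):
    """Look up one output bit in a 16-entry truth table."""
    return (init >> (f1 | f2 << 1 | f3 << 2 | f4 << 3)) & 1

def mux_bit(data, enables):
    """Select one data bit with one-hot enables: two inputs per LUT,
    LUT outputs ORed together along the carry chain."""
    carry = 0                                   # chain starts at 0
    for k in range(0, len(data), 2):
        hit = lut(INIT, data[k], enables[k], data[k + 1], enables[k + 1])
        carry = 1 if hit else carry             # OR stage
    return carry
```

For example, `mux_bit([0, 1, 0, 0], [0, 1, 0, 0])` selects input 1 of a 4:1 mux using two chained LUTs, which is the F and G pair of one slice in the layout above.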

I have already employed the jump/increment program counter trick with
the carry logic increment controlled by MULT_AND. Actually read about it
in November, forgot it, reinvented it, and now re-recognized it.


> > This add/mux inference digression aside, abundant TBUFs were useful and will
> > be missed. But I suppose that any FPGA feature that HDL synthesis users and
> > tools do not take good advantage of, is not long for this world.

So I suppose the line "Consequently, the Virtex routing architecture
and its place-and-route software were defined in a single optimization
process" can be translated as: we the chip designers do not support
Assembler^WJBits fossils^Wprogrammers. :-(

Juri Kanevski

Mar 20, 2001, 5:37:22 AM

Neil Franklin wrote:
>
> In the earlier Xilinx chips (3000, 4000, 5200) there is always 2
> TBUFs per CLB of 2 LUT+FF. And both TBUF lines can be read.
>
> In Virtex and Spartan-II it is down to 2 TBUFS for the 4 LUT+FF of an
> CLB, so only one slice can be routed to them, and even only 1 line for
> reading back from TBUF lines.
>
> In Virtex-II there is even only 2 TBUFs for 8 LUT+FF per CLB. Which
> makes even connecting the output of the 4-wide 2 slices of an single
> carry chain to an bus impossible. The data sheet does not give the
> amount of readbacks.
>
> From this I get the impression, that Xilinx regards TBUF buses as going
> out of fashion. After all, the TBUFs cost in chip space is next to nothing
> relative to them many PIPs (about 900 per CLB in Virtex).
>
>

> So I have a question: What is the Xilinx-suggested replacement for TBUFs?
> Is one supposed to use MUXes implemented in the CLBs? Is there an other
> trick I have not yet stumbled over?

I think that Xilinx found a compromise between functionality and cost.
Because of these TBUFs and tristate wires, Xilinx's devices were, I suppose,
about twice as costly as Altera's, owing to their heavier technology.

Therefore the only solution is to try to do without large multiplexers
and shared buses. Doing without large shared buses on the chip is a
stable trend, now and in the future, because such buses do not support
increasing clock frequencies and they incur high switching energy.

A.Ser.

Austin Lesea

Mar 20, 2001, 11:10:28 AM
All,

As the chips scale, driving a 1" long wire became slower and slower (relative to
the whole picture). Recently we removed TBUF's and replaced them with mux's in
a few thousand designs, and without re-placing and rerouting, the designs
were all faster. They also took up more area (the mux's). By removing TBUF's
we gain that area back (TBUF's have to be huge to drive the long wires), so that
we are now still more area efficient, and higher speed than before. With PAR, the
designs are all "better" than before.

As always, removing something is always a tough decision, and we made sure we
had a way of succeeding before we asked everyone to give them up forever (which
may happen in the future).

They were great while they lasted,

Austin

TBUF: RIP 200?

Rick Collins

Mar 20, 2001, 11:46:25 AM
Austin Lesea wrote:
>
> All,
>
> As the chips scale, driving a 1" long wire became slower and slower (relative to
> the whole picture). Recently we removed TBUF's and replaced them with mux's in
> a few thousands of designs, and without replacing and rerouting, the designs
> were all faster. They also took up more area (the mux's). By removing TBUF's
> we gain that area back (TBUF's have to be huge to drive the long wires), so that
> now are still more area efficient, and higher speed than before. With PAR, the
> designs are all "better" than before.
>
> As always, removing something is always a tough decision, and we made sure we
> had a way of succeeding before we asked everyone to give them up forever (which
> may happen in the future).
>
> They were great while they lasted,
>
> Austin
>
> TBUF: RIP 200?

When you say that the muxes were added, are you saying that there are
"special" muxes available? Or are you referring to the LUTs in the CLBs?

I guess I need to read the VII data sheet in detail and learn the
architecture. It seems to be the chip to beat these days.


--

Rick "rickman" Collins

rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Kolja Sulimma

Mar 20, 2001, 12:54:39 PM
As I understand it, Austin is stating that the area that was saved by removing
the TBUFs is now used for extra CLBs (or for reducing the ASIC cost per CLB).
With 90% of the FPGA area being in the interconnect (Andre DeHon's estimate),
I doubt that you gain much of a CLB (including its interconnect) from removing
a TBUF.

Someone stated in this group that the TBUFs are not really TBUFs anymore
internally but are realized in some other patented manner. Maybe that's the
reason. (The TBUFs' worst case timing got MUCH faster with Virtex when compared
to the carry chain, for example.) But of course, the longlines get faster when
fewer TBUFs are connected to them.

But remember: marketing arguments are often more important than technical
aspects. Because of the huge amount of area dedicated to routing in an FPGA, it
is better to waste CLBs than to waste routing, and therefore poorly routable
FPGAs are much more area efficient than highly routable designs. But they got
bad press for bad routability, and therefore greatly improved routability. The
new trade-off may be more expensive to manufacture, but it's easier to use and
fewer people complain.

But wait: everybody feels happier if he has a reason to complain, so they
discounted the equivalent gate count metric, and now we have a reason to
complain again.

CU,
Kolja

Austin Lesea

Mar 20, 2001, 6:32:03 PM
All Mux's are special :)

The replacement was with CLB's (LUT's and special CLB logic), which may implement the
F5, F6, F7 or even F8 in Virtex II, so they can be very efficient.

The fact that you could retarget the synthesis, and use even more powerful structures
(e.g. horizontal cascade carry) makes things even faster with less area.

The TBUF's occupy a large area in interconnect (metal wires) and buffers, and mux
trees to get in and out. They are these strange warts that hang off the CLB and run
really slowly now that everything has scaled so well into the ultra deep sub-micron
world.

Austin

Ray Andraka

Mar 20, 2001, 9:23:19 PM
The problem with the CLB muxes is that they don't match the pitch of the carry
chain, so if I have a set of counters for instance that I want to address to
read onto a common bus, it is a royal pain in the patoot without the Tbufs.
It's a good thing there is more routing in these chips!

For example, in Virtex, if I want to use the F6 mux, which would be the one I
need to mux between two slices, I need to use the F5 muxes to feed the F6.
Since the F5 selects between the F and G LUTs, this structure is broken for
connecting to carry chain logic.

--

Jan Gray

Mar 21, 2001, 1:40:20 AM
"Ray Andraka" <r...@andraka.com> wrote in message
news:3AB811ED...@andraka.com...

> The problem with the CLB muxes is that they don't match the pitch of the carry
> chain, so If I have a set of counters for instance that I want to address to
> read onto a common bus, it is a royal pain in the patoot without the Tbufs.
> It's a good thing there is more routing in these chips!

Right. The various MUXFx's are of little or no use in my adder-heavy
datapaths (e.g. a <96-slice, >80 TBUF, 16-bit RISC CPU for Virtex).

For such designs, which extensively use TBUFs for wide/bus-oriented muxes,
it is hard for me to believe that the area, delay, and power of a
*dedicated* *full-custom* TBUF are inferior to the area, delay, and power of
the same topology using inter-CLB interconnect, plus CLB-local interconnect,
plus 4-LUTs.

Instead I like to believe that the *weighted proportion* of CLBs that use
TBUFs in the mix of test designs (that I assume Xilinx uses to validate and
optimize its next generations of tools and silicon) is low, so that across
the spectrum of designs, TBUF resources are almost always wasted silicon.
This is especially plausible if most such test designs use HDL synthesis,
which do not infer many TBUFs. If so, then a good architecture design
system might indeed determine that the optimal number of TBUFs per CLB is
... epsilon (arbitrarily close to zero).


Anyway, V-II is here, so how shall we make the most of it? As I wrote last
time, one important family of circuits (particularly add followed by mux)
can be done properly in just one LUT per bit (in Virtex and Virtex-II), if
Xilinx enhances TRCE (and timing-driven PAR) to address the stated false
path situation.

THIS IS IMPORTANT, because if TRCE is NOT so enhanced, this important
circuit trick is almost unusable, because TRCE (and presumably timing-driven
PAR) think each such mux delay is a whole adder delay.

Austin or Peter, what do you think? [By the way, thanks: it is great to
hear the Xilinx perspective. Imagine the deafening silence otherwise. See
also the cluetrain manifesto (cluetrain.com) thesis #84 :-)]

Rick Collins

Mar 21, 2001, 3:10:29 AM
Kolja Sulimma wrote:
> But wait: Everybody feels happier if he has a reason to complain so they discounted
> the equivalent gate count metric, and now we have a reason to complain again.

You may be on to something! I know that I hate having to back calculate
the gate counts to figure out how to compare today's chips to
yesterday's chip (that I am designing in tomorrow!).

Neil Franklin

Mar 21, 2001, 1:43:36 PM
Austin Lesea <austin...@xilinx.com> writes:

> Juri Kanevski wrote:
>
> > Neil Franklin wrote:
> > >
> > > In Virtex and Spartan-II it is down to 2 TBUFS for the 4 LUT+FF of an
> > > CLB, so only one slice can be routed to them, and even only 1 line for
> > > reading back from TBUF lines.
> > >
> > > In Virtex-II there is even only 2 TBUFs for 8 LUT+FF per CLB. Which
> > >

> > > So I have a question: What is the Xilinx-suggested replacement for TBUFs?
> > > Is one supposed to use MUXes implemented in the CLBs? Is there an other
> > > trick I have not yet stumbled over?
> >

> > And therefore the only solution is
> > to try to do without large multiplexors and shared busses.

Unless I have got many outputs (different logic units) that need
to be selected to go into one input (register file). Then I need a
mux. Can't avoid it.


> > The way to do without large shared busses in the chip
> > is the stable tendency now and in the future

That is sensible anyway, simply from contention/parallelism
viewpoint. But that can be had by partitioning the TBUF lines (on
Virtex and later chips, not on 3000/4000/5200).


[rearranged to here]

> As the chips scale, driving a 1" long wire became slower and slower (relative to
> the whole picture).

Assuming a chip-wide bus. With a bus just 5 to 10 CLBs wide (in
XC2S200) they would be quite a bit less than 1" (1/8 to 1/4").


> Recently we removed TBUF's and replaced them with mux's in
> a few thousands of designs, and without replacing and rerouting, the designs
> were all faster.

This surprises me, given more "wire" (to the mux and away). OTOH if the
mux is in between the target and its nearest source then the length may not
be much more. And TBUF section PIPs may not be much faster than GRM PIPs.

Were these comparisons on Virtex-II or also on plain Virtex and
Spartan-II (which is what I am using)?


> TBUF: RIP 200?

That was what I was suspecting. Thanks, good to get it from the horse's mouth.

Hal Murray

Mar 22, 2001, 2:33:00 AM
>As the chips scale, driving a 1" long wire became slower and slower (relative to
>the whole picture). Recently we removed TBUF's and replaced them with mux's in
>a few thousands of designs, and without replacing and rerouting, the designs
>were all faster. They also took up more area (the mux's). By removing TBUF's
>we gain that area back (TBUF's have to be huge to drive the long wires), so that
>now are still more area efficient, and higher speed than before. With PAR, the
>designs are all "better" than before.

I'm missing something. Why are muxes better/faster?

I see why driving a 1" long wire is tough, but I don't see why
the driver after a mux is different from a TBUF.

In one case you have to turn the driver on. In the other case
you have to get half way across the chip, then through a mux.
I'm assuming that getting the go signal to the tbuf is about
as hard as getting the select signals to a mux.


----

In either case, a critical parameter is how well the TBUF/mux
density lines up with a counter. Assume I have a counter
that uses the "obvious" fast carry logic. I need to get
that on the bus and to load other counters and registers
from that bus.

How many bits per CLB does that good counter use?

If I'm using TBUFs, I need that many TBUFs per CLB.

If I'm using muxes, I need enough routing to get all the
registers that drive the bus into the mux. I haven't tried
to build anything like that. It might be simpler if the mux
is distributed or some trick like that.

--
These are my opinions, not necessarily my employer's. I hate spam.

Ray Andraka

Mar 22, 2001, 8:20:51 AM
If Xilinx extended the horizontal OR chains in V2 so that there were at least 4
per CLB, we'd be all set.

--

Austin Lesea

Mar 22, 2001, 11:22:14 AM
Hal,

The mux's don't have to drive across the whole chip. Maybe they need to drive to one
of the CLB's that may be reached by the directs, doubles (or hexes), making the
amount of logic that can be reached quickly, large.

This discussion reminds me of the days when people still argued that assembly
language programming was "better" than any C compiler ever could be.

Just like what happened to compiled languages, we are already there with compiled
HDL's: if it isn't fast enough after the best the tools can do, you go in and "code"
by hand the pieces that were not implemented with the best possible efficiency (if
there are any).

Or, you realize that you didn't take advantage of the parallelism, or some feature of
the part, and you recode your HDL.

With > 95% of all FPGA designs being done in HDL's, we think we are spending time and
silicon on what makes that flow the fastest and best possible.

Austin

Austin Lesea

Mar 22, 2001, 11:24:41 AM
Ray,

Horizontal Cascade Carry. Virtex II has so many new features, some tend to get lost.

Austin


Simon Bacon

Mar 22, 2001, 5:19:05 PM

"Austin Lesea" <austin...@xilinx.com> wrote in message
news:3ABA26B6...@xilinx.com...

>
> With > 95% of all FPGA designs being done in HDL's, we think we are spending time and
> silicon on what makes that flow the fastest and best possible.

Well, picking an HDL:

DataOut <= Reg0Val when Addr=REG0 else (others=>'Z');
DataOut <= Reg1Val when Addr=REG1 else (others=>'Z');
DataOut <= Reg2Val when Addr=REG2 else (others=>'Z');

and the Verilog metaphor is similar. This is HDL, easy to
understand, and a good mapping to TBUFs.
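
The equivalence between such a tri-state bus and the AND-OR mux that replaces it can be sketched behaviourally in Python. This is only a model: `Z` stands for high impedance, the resolution rule is a simplification of real multi-driver resolution, and all names are hypothetical.

```python
Z = None  # stand-in for a high-impedance (undriven) value

def tbuf(value, enable):
    """A tri-state buffer drives its value when enabled, else floats."""
    return value if enable else Z

def resolve(drivers):
    """Resolve a shared line: Z if nobody drives, error on conflicting drivers."""
    active = [d for d in drivers if d is not Z]
    if not active:
        return Z
    if len(set(active)) > 1:
        raise ValueError("bus contention")
    return active[0]

def tbuf_bus(regs, addr):
    """One TBUF per register, enabled when the address matches."""
    return resolve([tbuf(v, addr == i) for i, v in enumerate(regs)])

def and_or_mux(regs, addr, width=8):
    """The LUT-based replacement: gate each register with its decode, then OR."""
    mask = (1 << width) - 1
    out = 0
    for i, v in enumerate(regs):
        out |= v & (mask if addr == i else 0)
    return out
```

The two agree whenever exactly one register is addressed. They differ when none is: the bus floats, while the AND-OR form returns all zeros.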

Of course, ASICs do not usually support T/S lines. Even the
Xilinx implementation is an emulation in Virtex, though the
3K and 4K had the real thing.


muza...@dspia.com

Mar 22, 2001, 8:13:19 PM
"Simon Bacon" <sim...@tile.demon.co.cut_this.uk> wrote:

>
>"Austin Lesea" <austin...@xilinx.com> wrote in message
>news:3ABA26B6...@xilinx.com...
>>
>> With > 95% of all FPGA designs being done in HDL's, we think we are spending time and
>> silicon on what makes that flow the fastest and best possible.
>
>Well, picking an HDL:
>
> DataOut <= Reg0Val when Addr=REG0 else (others=>'z');
> DataOut <= Reg1Val when Addr=REG1 else (others=>'z');
> DataOut <= Reg2Val when Addr=REG2 else (others=>'z');
>
>and the Verilog metaphor is similar. This is HDL, easy to
>understand, and a good mapping to TBUFs.
>
>Of course, ASICs do not usually support T/S lines.

By ASIC I take it you mean Standard cell based ASIC; if that's the
case, this has not been my experience. Almost all SC libraries have
T/S buffers and inverters. Some even have T/S DFFs and T/S bus
holders. Of course with a full custom ASIC, there is no limit on what
you can do.

> Even the
>Xilinx implementation is an emulation in Virtex, though the
>3K and 4K had the real thing.

I think the problems with T/S busses are well known. Difficult to time,
potential for electrical conflicts, potential for very high power
consumption etc. Being more of an analog feature, one has to spend
extra effort with RTL verification. You might even have to spice some
paths god forbid ;-)

Muzaffer

Hal Murray

Mar 23, 2001, 12:39:11 AM

>The mux's don't have to drive across the whole chip. Maybe they need to drive
> to one of the CLB's that may be reached by the directs, doubles (or hexes),
> making the amount of logic that can be reached quickly, large.

I'm still not getting it.

My picture is that I have a register on the left side of the chip
and another register on the right side. Sprinkled in the middle
are various registers and counters.

I want to be able to load the register on the left from any of the
others, including the one on the right. Same in the other direction.

This is easy to understand with TBUFs. I might not like the answers,
but it's easy to floor plan things so all the routing is as good as
it will get. The pattern is clock, enable, tbuf drives longline,
setup, clock.

This fits well with a style that thinks of the problem as a
specialized CPU-like gizmo that gets driven by a microcoded
engine. Most of the changes go into the microcode. The instruction
format includes a field to specify which register drives the bus
and another field for which register gets loaded. (Occasional
hacks are handy to load several registers on the same cycle.)


Are you using a mux in front of each register you want to load, or
a shared mux feeding all the registers?

If I assume a mux in the middle, then the nasty case is clock,
routing-to-mux, delay through mux, delay through long-line like
routing to get all over the chip, setup, clock.

If I assume a mux in front of each register, the pattern is
the same but some things have been pushed around. I'd guess
the timing will be similar.


Perhaps your test cases go faster because you didn't put registers
that far apart so you don't have to route across the whole chip.

If so, then the problem isn't TBUFs, it's the long line across
the whole chip that they are driving. I could easily see that
FPGAs are now big enough that having a bus span the whole chip
isn't useful very often.

Did you try your test cases on a smaller chip?


Perhaps TBUFs feeding not-so-long lines that only go part way
across the chip would be interesting (aka faster). I'm assuming
there would be a pair of good TBUFs at the segment junction that
I could use to connect them and/or put a register there to
pipeline things.

My straw man (not much thought) is that the right length
would be to match the register transfer timing to what it
takes to get through a (pipelined) RAM or multiply stage.

Or maybe short segments collapse to something mux-like if
you have lots of local/medium routing resources.
