40% fewer SEUs in V4: another good reason to choose Xilinx

Austin Lesea

May 6, 2005, 5:12:14 PM

All,

Latest update on atmospheric upsets:

http://tinyurl.com/c9y5l

Virtex 4 memory cells are almost twice as hard to upset as Virtex II.

We promised to reduce our susceptibility to atmospheric upsets, and we
are fulfilling that promise.

Not all semi companies have made this choice: it is hard to do, and
increases area.

I know of work being done at Intel and Cypress to improve this, but
nowhere else.

It is highly unlikely that competing 90nm FPGA companies have done
anything at all (except get a lot worse).

The ASIC vendors (ASSP, hardened solutions, etc.) have not made this
choice either (as it would really blow up their area). Thus, 90nm ASIC
technology has a typical SRAM FIT rate of 5,000 FIT/Mb (from neutron
data error rate specifications for a typical 90nm SRAM ASIC cell), as
compared to our less than 250 FIT/Mb.
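For readers unfamiliar with the unit: one FIT is one failure per 10^9 device-hours, so a FIT/Mb figure scales with the amount of memory. A minimal sketch of the conversion, using the two rates quoted above. The 10 Mb memory size is a made-up illustration, not a figure for any real part.

```python
HOURS_PER_YEAR = 24 * 365


def mean_years_between_upsets(fit_per_mb, megabits):
    """Mean time between upsets, in years, for a memory of the given size.

    1 FIT = 1 failure per 1e9 device-hours.
    """
    failures_per_hour = fit_per_mb * megabits / 1e9
    return 1.0 / failures_per_hour / HOURS_PER_YEAR


# Hypothetical 10 Mb memory (illustrative size only).
asic_years = mean_years_between_upsets(5000, 10)  # ASIC SRAM rate quoted above
v4_years = mean_years_between_upsets(250, 10)     # upper bound quoted for V4
print(f"ASIC SRAM: {asic_years:.1f} years, V4: {v4_years:.1f} years")
```

Whatever memory size is assumed, the ratio between the two results stays at 5000/250 = 20.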

The ASIC DFFs, logic, etc. are also fantastic neutron detectors: the
resulting hardness of the Virtex 4 is on par with, or better than, a
full custom 90nm ASIC doing the same task!

Unfortunately, no data is available on ASICs, as the vendors just don't
know. To test, one would have to place the part in a neutron beam while
it is running, which is rather hard to do with a complete system ...

Caveat Emptor!

Virtex 4, on the other hand, combines this with built-in ECC for the
BRAM and built-in FRAME_ECC for the configuration, which allows for
selecting whatever level of system hardness to soft errors is desired.

Austin

Ben Twijnstra

May 6, 2005, 6:47:52 PM

Hi Austin,

I'm really happy for you.

Are there any V4s without the money-eating ECC stuff for us terrestrials?

Ben

Peter Alfke

May 6, 2005, 7:41:21 PM

Nice try!
ECC at the 64-bit parallel level eats only 8 extra bits, and our
BlockRAMs had those traditional parity bits all the time. No extra
storage cost. Just some clever partitioning...
"The best things in life are (almost) free"
Peter Alfke

Thomas Rudloff

May 6, 2005, 8:08:16 PM

Hi Peter,

I learned about SEUs that you can design in redundancy (three times the
logic, if you can convince your compiler not to remove the redundant
logic). This will keep the user logic safe. But is there a way to keep
the configuration safe, since an upset there changes logic and routing?

Regards,
Thomas

austin

May 6, 2005, 8:32:13 PM

And,

The frame_ecc is 12 bits per 1312, or less than 1% overhead.
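The arithmetic behind the two overhead figures in this thread can be checked in a few lines. This sketch treats "12 bits per 1312" as 12 check bits out of 1312 bits; whether the 1312 includes the check bits is not stated here, and either reading stays under 1%.

```python
def ecc_overhead(check_bits, bits):
    """Check bits as a fraction of the bits they are counted against."""
    return check_bits / bits


frame_ecc = ecc_overhead(12, 1312)  # FRAME_ECC figure from this post
bram_ecc = ecc_overhead(8, 64)      # BRAM ECC: 8 extra bits per 64 data bits
print(f"FRAME_ECC: {frame_ecc:.2%}, BRAM ECC: {bram_ecc:.2%}")
```

As noted earlier in the thread, the BRAM figure costs no additional storage in practice because the 8 bits are the parity bits the BlockRAM already had.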

Austin

austin

May 6, 2005, 8:46:51 PM

Thomas,

Yes. The Xilinx TMR (XTMR) tool automatically converts the design, as
designed and placed, into a full TMR design, taking advantage of our
structure so that no single config bit can upset the function.
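The voting principle behind TMR can be sketched in software. This is only an illustration of 2-of-3 majority voting; it says nothing about how the XTMR tool transforms a netlist, and `checksum` is a hypothetical stand-in for user logic.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three integer words."""
    return (a & b) | (a & c) | (b & c)


def tmr(compute, x, upset_copy=None, upset_mask=0):
    """Run `compute` three times and vote; optionally corrupt one replica."""
    results = [compute(x) for _ in range(3)]
    if upset_copy is not None:
        results[upset_copy] ^= upset_mask  # simulate an upset in one replica
    return majority(*results)


def checksum(x):
    """Hypothetical user logic, for illustration only."""
    return (x * 31 + 7) & 0xFF


clean = tmr(checksum, 5)
hit = tmr(checksum, 5, upset_copy=1, upset_mask=0x10)
print(clean == hit == checksum(5))
```

A single corrupted replica is outvoted by the other two; two replicas corrupted in the same bit would not be.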

FRAME_ECC allows a design to do redundancy in time (RIT).

Calculate what you need, check if an error has occurred; if not, go on.
If an error has occurred, fix the error, step back, recalculate.

Repeat.
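The calculate / check / fix / step back / recalculate loop can be sketched as follows. The callbacks `error_detected` and `repair` are hypothetical placeholders for whatever the system uses to read the error status and correct the bit; they are not a real FRAME_ECC interface.

```python
def run_with_rit(step, state, error_detected, repair, max_retries=3):
    """Run one step; if an error is flagged, repair, roll back, and redo it."""
    for _ in range(max_retries + 1):
        checkpoint = state       # "step back" point (state is immutable here)
        result = step(state)
        if not error_detected():
            return result        # no upset seen during the step: go on
        repair()                 # fix the flipped bit
        state = checkpoint       # step back, then recalculate
    raise RuntimeError("error persisted after retries")


# Simulated run: the first attempt is hit by an upset, the second is clean.
pending = [True, False]
repairs = []
result = run_with_rit(
    step=lambda s: s + 1,
    state=41,
    error_detected=lambda: pending.pop(0),
    repair=lambda: repairs.append("fixed"),
)
print(result, repairs)
```

The cost of this scheme is time rather than area: a step that is hit simply runs twice.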

Between XTMR, which allows you to choose only those critical areas that
need triplication for redundancy in space (RIS), and FRAME_ECC, which
enables redundancy in time, an arbitrarily safe system can be implemented.

For example:

Simplest - do nothing. With an effective system FIT rate of 20 FIT/Mb
of config memory, this may be so far down in the noise, it isn't an issue.

Next step - when the FRAME_ECC indicates an error, reconfigure the chip.
This creates some unavailability, but is able to keep any errors from
propagating any further. Or back up, and recalculate the result after
flipping the bit back (RIT).

Little better - when an error is detected, correct it. Since from 1 in
10 to 1 in 80 flips actually hits something that matters (real data from
real customers), there is only a 1% to 10% chance that a flip could ever
cause an error. And since you fix it in less than 200 ms (for the largest
part), the probability that something critical changed in that 200 ms,
and that it mattered, is even tinier (like maybe one in a thousand chance).
And, if you add RIT to this, it is even more bulletproof.

Even better - since this is a system that requires a hot spare (at this
point, we are talking about 99.9995% available systems where the hard
fail rate kills you first) you detect a soft error, and switch to the
redundant unit immediately while you fix the bit, and do a system recheck.

Best - triplicate critical elements AND have a hot standby that can be
switched to in case of soft error detect.
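To put rough numbers on the options above, the sketch below applies the 1-in-10 and 1-in-80 derating factors to a raw upset rate, and converts the 99.9995% availability figure into downtime. The raw rate of 250 FIT/Mb and the 10 Mb configuration size are assumptions chosen for illustration, not measured values for any part.

```python
def functional_fit(raw_fit_per_mb, config_mb, derating):
    """Upsets that actually disturb the design, in FIT.

    derating = fraction of configuration bit flips that matter.
    """
    return raw_fit_per_mb * config_mb * derating


def downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * 365 * 24 * 60


RAW = 250       # FIT/Mb, assumed raw rate (illustration only)
CONFIG_MB = 10  # hypothetical configuration size (illustration only)

worst = functional_fit(RAW, CONFIG_MB, 1 / 10)  # 1 in 10 flips matters
best = functional_fit(RAW, CONFIG_MB, 1 / 80)   # 1 in 80 flips matters
downtime = downtime_minutes_per_year(0.999995)
print(f"functional failures: {best:.1f} to {worst:.1f} FIT")
print(f"99.9995% availability allows {downtime:.2f} minutes of downtime per year")
```

With these assumptions, a scheme that reconfigures on every detected flip reacts 10 to 80 times more often than the design actually fails.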

All of the above are enabled in V4 -- it is up to you to set your FIT
rate goals, and then fulfill them. Can't do that with the competition --
they just don't have all the options we do. For example, a complete
reconfig takes them down, but we can reconfig while still operating, and
fix the flipped bit back.

Austin

Piotr Wyderski

May 6, 2005, 9:00:14 PM

Austin Lesea wrote:

> The ASIC DFF's, logic, etc. are also a fantastic neutron detector: the
> resulting hardness of the Virtex 4 is on par with, and better than a
> full custom 90nm ASIC doing the same task!

BTW, is it possible to order a special, rad-hard version of
a modern medium-complexity FPGA chip, say, comparable
with Cyclone 1C3? Would it mean a complete redesign of
the chip internals or is it relatively simple?

Best regards
Piotr Wyderski

austin

May 6, 2005, 10:17:53 PM

Piotr,

Very observant question.

For atmospheric upsets, it is a relatively easy process to change all
memory cells to SERT or DICE single-upset-hardened cells, with an
increase in area as you go from 6T cells to 12T and 16T cells. In the
ASMBL columnar architecture this is actually trivial to do. But who
will pay for this?

Without the ASMBL architecture, it requires a complete relayout.

If there are ways to design that result in the desired system FIT rate,
one must compare the costs of the extra logic with the costs of
hardening the design (hard IP vs. soft IP).

I believe the answer is a judicious combination of both: make the basic
FIT rate better, and also provide some degree of hardening without
incurring too much cost.

Austin

Ben Twijnstra

May 7, 2005, 11:46:06 AM

Hi Peter Alfke,

> Nice try!
> ECC at the 64-bit parallel level eats only 8 extra bits, and our
> BlockRAMs had those traditional parity bits all the time. No extra
> storage cost. Just some clever partitioning...

There are additional bit lanes in Altera devices too.

So what does this add then? Did you add optional hard ECC
generation/detection blocks to these 9th/18th bits? Or does the user have
to code this him/herself?

If it's an optional hard macro we're looking at 2 configurable muxes and an
ECC generator on the input side, and 2 configurable muxes and an ECC
checker on the output side for every set of 9 bits.

Also, do the V4s run continuous config sanity checks like Altera's devices?

Best regards,

Ben

austin

May 7, 2005, 12:06:23 PM

Ben,

See below,

Austin

Ben Twijnstra wrote:

> Hi Peter Alfke,
>
>
>>Nice try!
>>ECC at the 64-bit parallel level eats only 8 extra bits, and our
>>BlockRAMs had those traditional parity bits all the time. No extra
>>storage cost. Just some clever partitioning...
>
>
> There are additional bit lanes in Altera devices too.

To do what?

>
> So what does this add then? Did you add optional hard ECC
> generation/detection blocks to these 9th/18th bits? Or does the user have
> to code this him/herself?

We have hard ECC, a 72/64 code, that can be instantiated to provide
single-bit error correction and double-bit error detection with no soft
IP required.
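For readers who have not met a 72/64 code before, here is a minimal sketch of a textbook extended Hamming code with 64 data bits and 8 check bits, which gives single-error correction and double-error detection. The bit layout is the standard textbook one and is an assumption; nothing here describes the actual check matrix used in the Virtex 4 hard block.

```python
def _data_positions(n_data=64):
    """Codeword positions (1-based) that are not powers of two."""
    positions, p = [], 1
    while len(positions) < n_data:
        if p & (p - 1):              # non-zero for non-powers of two
            positions.append(p)
        p += 1
    return positions


DATA_POS = _data_positions()             # 64 data positions within 1..71
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]    # 7 Hamming check bits


def _syndrome(word):
    """XOR of the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(data):
    """Return a 72-bit codeword; bit 0 holds the overall parity."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = _syndrome(word)
    for p in PARITY_POS:                 # set check bits so the syndrome is 0
        if s & p:
            word |= 1 << p
    if bin(word).count("1") % 2:         # make the total parity even
        word |= 1
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'double'."""
    s = _syndrome(word)
    odd = bin(word).count("1") % 2 == 1
    if odd:                              # single-bit error at position s
        word ^= 1 << s
        status = "corrected"
    elif s:                              # even parity, bad syndrome
        return None, "double"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


data = 0x0123456789ABCDEF
cw = encode(data)
print(decode(cw ^ (1 << 37)))                 # one flipped bit
print(decode(cw ^ (1 << 3) ^ (1 << 50))[1])   # two flipped bits
```

The eighth check bit (overall parity) is what separates a correctable single error from an uncorrectable double error; with only the seven Hamming bits, a double error would be silently miscorrected.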

>
> If it's an optional hard macro we're looking at 2 configurable muxes and an
> ECC generator on the input side, and 2 configurable muxes and an ECC
> checker on the output side for every set of 9 bits.
>
> Also, do the V4s run continuous config sanity checks like Altera's devices?

We allow the customer to decide what they want to do: they can do just
a check, or a check and correct, or nothing at all. They pay the least
possible because we only harden what we need to enable this feature, not
the whole thing. What A offers is an "oh no!" bit: if it is set, you
have no recourse but to reconfigure and start over. That is all A
allows the customer to know, nothing more.

The same IP also allows the customer to flip bits so that they can see
what effect NSEUs would have without having to go to a neutron beam
(which is very expensive and time consuming).
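The idea of a bit-flip campaign can be illustrated with a toy model: flip each configuration bit in turn, check whether the design still behaves, and count the flips that matter. The "configuration" here is just a byte string and `design_ok` is a hypothetical check supplied by the user; a real campaign would inject the flips through the device's configuration logic.

```python
def injection_campaign(config, design_ok):
    """Flip each bit of `config` in turn; count flips that break the design."""
    critical = 0
    total = len(config) * 8
    for bit in range(total):
        trial = bytearray(config)
        trial[bit // 8] ^= 1 << (bit % 8)  # inject a single-bit upset
        if not design_ok(bytes(trial)):
            critical += 1
    return critical, total


# Toy model: only the first byte of the "configuration" is used by the design.
golden = bytes([0xA5, 0x00, 0x00, 0x00])
critical, total = injection_campaign(golden, lambda cfg: cfg[0] == golden[0])
print(f"{critical} of {total} single-bit flips affect the design")
```

The ratio of critical flips to total flips is the derating factor for that design, which is how figures such as "1 in 10 to 1 in 80" can be obtained without a neutron beam.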

Ben Twijnstra

May 7, 2005, 1:57:40 PM

Hi austin,

>> There are additional bit lanes in Altera devices too.
> To do what?

Oh, for 9-bit video data, or parity checking, or ECC, whatever you like.

>> So what does this add then? Did you add optional hard ECC
>> generation/detection blocks to these 9th/18th bits? Or does the user have
>> to code this him/herself?

> We have hard ECC, 72/64 code, that can be instantiated to provide single
> bit error correction, and double bit error detection with no soft IP
> required.

That's exactly what I wanted to know. So, to summarize:

If activated, a 64-bit write to a BRAM will use 8 additional bits for
error-checking and recovery. The read and write ports have optional
dedicated hard logic that, when enabled, generate and check ECC data.

By the way, does this ECC stuff work on narrower RAM widths?

>> Also, do the V4s run continuous config sanity checks like Altera's
>> devices?
> We allow the customer to decide what they want to do: they can do just
> a check, or a check and correct, or nothing at all. They pay the least
> possible because we only harden what we need to enable this feature, not
> the whole thing. What A offers is a "oh no!" bit: if it is set, you
> have no recourse but to reconfigure and start over. That is all A
> allows the customer to know, nothing more.

In A, the config error pin will allow you to take any external action.
Rebooting the device is the most common application, but more elaborate
schemes are possible. The internal logic is also able to respond to a
config error. Then again, since the configuration cannot be trusted
anymore, it would be best to bring the circuit offline as quickly as
possible.

The 'reloading-while-running' feature in X is cool, but if I were an FPGA
and I knew I couldn't be trusted anymore, Asimov's first law would kick in
and I'd disable myself ASAP (i.e. after sticking a Post-It to my forehead
indicating that a service technician should look me over because I went
crazy).

> The same IP also allows the customer to flip bits so that they can see
> what effect NSEUs would have without having to go to a neutron beam
> (which is very expensive, and time consuming).

Very nice idea indeed. After getting the first documentation about A's
sanity checking we actually had to go to a nuclear lab to test the feature
(the lab was also quite interested in the feature). We didn't do any
quantitative testing (how could we, as humble end users), we just stuck the
PCB in a high-intensity neutron beam and waited. And waited. And waited.
But, in the end we found out that it did work ;-)

Best regards,

Ben

austin

May 7, 2005, 2:43:34 PM

Ben,

See below,

Austin

Ben Twijnstra wrote:

> Hi austin,
>
>
>
>>>There are additional bit lanes in Altera devices too.
>>
>>To do what?
>
>
> Oh, for 9-bit video data, or parity checking, or ECC, whatever you like.
>

Yes, we have an extra bit for every 8 bits as well. Most folks just use
it for parity.

>
>>>So what does this add then? Did you add optional hard ECC
>>>generation/detection blocks to these 9th/18th bits? Or does the user have
>>>to code this him/herself?
>
>
>>We have hard ECC, 72/64 code, that can be instantiated to provide single
>>bit error correction, and double bit error detection with no soft IP
>>required.
>
>
> That's exactly what I wanted to know. So, to summarize:
>
> If activated, a 64-bit write to a BRAM will use 8 additional bits for
> error-checking and recovery. The read and write ports have optional
> dedicated hard logic that, when enabled, generate and check ECC data.
>

Yup.

> By the way, does this ECC stuff work on narrower RAM widths?
>

Nope. The customer has to instantiate whatever external muxes they
need in order to use the ECC with other widths. We felt that this extra
muxing was trivial for the customer, whereas if we had to do it, it would
make the block less useful and bigger for all the customers who don't
want or need ECC. Given that the FIT/Mb rate of the BRAM is already 6 to
8 times better than commercial SRAM, many customers evaluate the risk and
decide to use simple parity rather than ECC.

>
>>>Also, do the V4s run continuous config sanity checks like Altera's
>>>devices?
>>
>>We allow the customer to decide what they want to do: they can do just
>>a check, or a check and correct, or nothing at all. They pay the least
>>possible because we only harden what we need to enable this feature, not
>>the whole thing. What A offers is a "oh no!" bit: if it is set, you
>>have no recourse but to reconfigure and start over. That is all A
>>allows the customer to know, nothing more.
>
>
> In A, the config error pin will allow you to take any external action.
> Rebooting the device is the most common application, but more elaborate
> schemes are possible. Also, the internal logic is also able to respond to a
> config error. Then again, since the configuration cannot be trusted
> anymore, it would be best to bring the circuit offline as quickly as
> possible.

I'm sure A is so smart (sarcasm), and knows exactly what to do for their
customers. We, on the other hand, do not presume to tell the customer
what they must do. Since only 1 in 10 to 1 in 100 bit flips actually
does anything at all, there is a 90% to 99% chance that the FPGA is still
able to decide what to do. In fact, if you triplicate a "sanity check"
monitor, and allow it to make the decisions, you do not have to tear
down the whole chip for every hit. That takes very little extra logic.

You see, A's "oh no" bit will trip 10 to 100 times more often than an
actual functional failure: why take the system down 100 times more
often than you really need to? Not very bright. Running around saying
"I've been hit, I've been hit ...." Instead, we offer that you can
decide if you should flip just that one bit, and just continue on from
there.

If it is a video, voice, or packet application, what risk was taken? A
bad pixel? A pop or click? One bad packet? Those things happen all the
time for other reasons than SEU. No interruption. A's solution can not
do that.

"Help me, Help me! I've been hit, and I don't know where! I might be
dying (but I am probably OK), but you can't trust me anymore."

I much prefer a more elegant solution: "Bit XYZ has flipped, do you
want to flip it back?"

>
> The 'reloading-while-running' feature in X is cool, but if I were an FPGA
> and I knew I couldn't be trusted anymore, Asimov's first law would kick in
> and I'd disable myself ASAP (i.e. after sticking a Post-It to my forehead
> indicating that a service technician should look me over because I went
> crazy).

Yes, I know, it is all A has to sell, so make sure there is lots of FUD
associated with the X solution (since it can't be matched by A).

I think it quite nice that their "solution" to SEUs is their hardcopy:
less competition for FPGA vendors! Gartner-Dataquest removes all
hardcopy revenue from A's balance sheet when comparing them with other
FPGA vendors now. Their sales may be increasing, but their FPGA market
share is decreasing. Too bad they just don't seem to be interested in
playing with us anymore. No MGTs in S2, No processor. S2: 2 many
upsets, 2 hot, 2 slow, 2 noisy; 2 little, 2 late, 2 bad.

>
>
>>The same IP also allows the customer to flip bits so that they can see
>>what effect NSEUs would have without having to go to a neutron beam
>>(which is very expensive, and time consuming).
>
>
> Very nice idea indeed. After getting the first documentation about A's
> sanity checking we actually had to go to a nuclear lab to test the feature
> (the lab was also quite interested in the feature). We didn't do any
> quantitative testing (how could we, as humble end users), we just stuck the
> PCB in a high-intensity neutron beam and waited. And waited. And waited.
> But, in the end we found out that it did work ;-)

Does it? How do you really know? They could count ten errors, and then
say "I've been hit" and you would never know the difference.

How do you know that the checker wasn't hit? Do they provide a heartbeat
so you are sure the checker is checking? We do.

I say, have them prove that every single bit can be tracked.

Upset rates are different for LUT, DFF, RAM, config. Do you know what
is checked? On V4, it is very clear what is being monitored. And you
know what is happening all the time.

If you are really as paranoid as you claim (WCGW, WGW, AATWPM - what can
go wrong, will go wrong, and at the worst possible moment), I would
think not even knowing what is checked, and what flipped would drive you
crazy.

Paul Leventis (at home)

May 17, 2005, 3:49:18 AM

Hi Austin,

> For atmospheric upsets, it is a relatively easy process to change all
> memory cells to SERT or DICE single upset hardened cells, with an increase
> in area as you go from 6T cells to 12T and 16T cells in the ASMBL columnar
> architecture which is actually trivial to do. Without the ASMBL
> architecture, it requires a complete relayout.

No offence, but this sounds like bull to me. So you are claiming that since
you have columns of blocks (er, ASMBL architecture), you can suddenly
tolerate changing the fundamental layout of your configuration RAM cells
without touching anything else? This would imply that not only are your
various blocks floorplanned as columns, but that the memory cells sprinkled
throughout those blocks also line up perfectly and that no other circuitry
would need to be adjusted.

Regards,

Paul Leventis
Altera Corp.

Austin Lesea

May 17, 2005, 10:50:17 AM

Paul,

I understand how frustrated you are.

We are 40% better in SEUs than V2 Pro (or V2).

You folks must be really scrambling since you did absolutely nothing to
reduce your SEU FIT rate (by using 90nm 6T cells for config).

Enjoy your ??? FIT/Mb 90nm 6T config memory.

Compared to S2, V4 is probably at least twice as good, perhaps even
three times better. Actel will probably hire IRoC again to test us
both. It will be fun to see that report!

Unfortunately, since you do not support customer readback, we can't test
your part in the neutron beam, as we would not be able to count all the
upsets or see where they actually occur. Not knowing must really
be a pain for you guys. No way to really know if ICDES has accomplished
anything at all.

Separate FIT rates for config and BRAM are a requirement for our
customers, as is having a number of techniques that can be used to
mitigate the SEU issue, and a design flow to achieve any desired system
FIT rate.

I'd like to see your numbers for config and BRAM, as we are very
satisfied with our improvements.

It will be fun to watch as this sinks in the minds of the customers out
there ....

Sorry you can not say "we are just like Xilinx" anymore. I was glad to
do all the work, but I am afraid that we will derive all the benefits
now that we thought through all of the issues.

Austin