Best Async FIFO Implementation

Davy

Oct 16, 2005, 10:15:33 AM

Hi all,

Does there exist a best implementation of Asynchronous FIFO?

Any suggestions will be appreciated!
Best regards,
Davy

Sylvain Munaut

Oct 16, 2005, 10:36:37 AM

I guess it depends on what you're looking for.
At minimum, it should *work* ...
Beyond that, the rest is a compromise between resources, speed, features
(such as almost-empty/almost-full flags), and perhaps reliability.

Sylvain

Peter Alfke

Oct 16, 2005, 2:33:17 PM

All members of the Virtex-4 family from Xilinx have a
(hard-coded=full-custom) FIFO controller in each of their BlockRAMs. It
accepts different clocks for read and write (called "asynchronous
operation") at any frequency up to 500 MHz. Capacity is 18 Kbits, the
width is 4 to 36 bits, and the depth accordingly ranges from 4K down to
512 addresses (depth and width can easily be expanded with additional
BlockRAMs).
There is an EMPTY and a FULL flag, and also an ALMOST EMPTY and an
ALMOST FULL flag, both fully programmable (with 1-address granularity).

I designed the crucial asynchronous empty arbitration logic, and it
works perfectly: We tested it by writing data at ~200 MHz into the
FIFO and reading it out at ~500 MHz, and the asynchronous empty-detect
logic had worked flawlessly for all of those >10^14 operations when we
stopped the test after a week.
No real FIFO application will probably ever go empty 200 million times
a second...
The high performance is due to very fast and compact full-custom logic,
and our long experience in analyzing and dealing with the effects of
metastability.

Peter Alfke, Xilinx Applications (posting from home)
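
The width/depth trade-off described in the post above can be tabulated with a short sketch. The helper name `fifo_depth` is hypothetical, and the code assumes that the 9-, 18- and 36-bit widths use the parity bits of the BlockRAM (18 x 1024 bits in total), while the 4-bit width uses only the 16 x 1024 data bits. Those assumptions are not stated in the thread; they are chosen because they reproduce the 4K-to-512 depth range quoted above.

```python
# Hypothetical helper: depth of a single 18 Kbit BlockRAM FIFO for a given width.
# Assumption: widths 9/18/36 include parity bits (18 * 1024 bits usable),
# width 4 uses data bits only (16 * 1024 bits usable).
BITS_WITH_PARITY = 18 * 1024
BITS_DATA_ONLY = 16 * 1024


def fifo_depth(width: int) -> int:
    """Return the number of addressable locations for one BlockRAM."""
    if width not in (4, 9, 18, 36):
        raise ValueError("this sketch only covers widths 4, 9, 18 and 36")
    usable_bits = BITS_DATA_ONLY if width == 4 else BITS_WITH_PARITY
    return usable_bits // width


for width in (4, 9, 18, 36):
    print(f"{width:2d} bits wide -> {fifo_depth(width)} locations deep")
```

Wider or deeper FIFOs would then be built from several BlockRAMs side by side or in series, as the post mentions.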

Jim Granville

Oct 16, 2005, 4:06:39 PM

Peter Alfke wrote:
> All members of the Virtex-4 family from Xilinx have a
> (hard-coded=full-custom) FIFO controller in each of their BlockRAMs. It
> accepts different clocks for read and write (called "asynchronous
> operation") at any frequency up to 500 MHz. Capacity is 18 Kbits, the
> width is 4 to 36 bits, and the depth is accordingly from 4K to 512
> addresses (depth and width can easily be expanded with additional
> BlockRAMs)
> There is an EMPTY and a FULL flag, and also an ALMOST EMPTY and an
> ALMOST FULL flag, both fully programmable (with 1-address granularity).
>
> I designed the crucial asynchronous empty arbitration logic, and it
> works perfectly: We tested it by writing data at ~200 MHz into the
> FIFO and reading it out at ~500 MHz, and the asynchronous empty-detect
> logic had worked flawlessly for all of those >10^14 operations when we
> stopped the test after a week.

Why stop after 1 week? Sounds like the sort of app that would be nice to
have spinning in the corner of the lab forever...
Did you also test the full detect, or is that expected to be the same
by symmetry?

> No real FIFO application will probably ever go empty 200 million times
> a second...
> The high performance is due to very fast and compact full-custom logic,
> and our long experience in analyzing and dealing with the effects of
> metastability.

So does that mean devices without this full-custom logic can expect
lower performance, and if so, how much lower?
[e.g. Spartan-3 / 3E?]

-jg

Peter Alfke

Oct 16, 2005, 6:30:22 PM

Hi, Jim..
We stopped after a week because we were satisfied. In one week, we
proved 10^14; it would take 10 weeks to prove 10^15, and 2 years to
prove 10^16. Diminishing returns... But we definitely did NOT stop
because we found an error. No cheating on my watch!

For some strange reason (fixed in "Virtex-5") there is a
one-clock-pulse latency for FULL. I suggest using ALMOST FULL instead.
FULL is not as important as EMPTY, since a properly designed system
should never overflow the FIFO, whereas it might be nice to empty it
completely. (I often use the savings-account analogy).

Yes, using the fabric to implement the FIFO controller might limit the
speed to 250 MHz.
The reasons for the "hard" FIFO controller were:
Higher performance, guaranteed reliable operation without user
involvement, and saving fabric resources as well as power consumption.
The same reasoning will be used for future "hard" subfunctions. It's
the best way to increase speed, functionality, and user-friendliness.
How else can we improve by a factor of 2 or even more?
Peter Alfke
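
The diminishing-returns arithmetic in the post above can be checked with a few lines. This is only a sketch: it assumes one empty event per write at the ~200 MHz write rate mentioned earlier in the thread, and the function names are made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600         # 604,800 seconds
EMPTY_EVENTS_PER_SECOND = 200_000_000    # assumption: one empty event per ~200 MHz write


def events_in(weeks):
    """Empty events accumulated after the given number of weeks."""
    return EMPTY_EVENTS_PER_SECOND * SECONDS_PER_WEEK * weeks


def weeks_needed(events):
    """Weeks of continuous testing needed to accumulate that many events."""
    return events / (EMPTY_EVENTS_PER_SECOND * SECONDS_PER_WEEK)


print(f"one week gives {events_in(1):.2e} events")
for target in (1e15, 1e16):
    print(f"{target:.0e} events need about {weeks_needed(target):.1f} weeks")
```

One week gives roughly 1.2 x 10^14 events, so each further factor of ten costs ten times the test time, which matches the rounded "10 weeks" and "2 years" figures in the post.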

mindenpilot

Oct 16, 2005, 11:17:07 PM

In most datacomm applications, filling a buffer can be caused by network
congestion, so to prevent dropped packets, you'd want to correctly detect
FIFO full, and backpressure accordingly.

"Peter Alfke" <al...@sbcglobal.net> wrote in message
news:1129501822.6...@f14g2000cwb.googlegroups.com...

Peter Alfke

Oct 16, 2005, 11:27:07 PM

The Virtex-4 has a FULL flag that is synchronous with the write clock
(obviously, the read clock does not care), but the FULL flag is
activated one clock period late. (The EMPTY flag, synchronous with the
read clock, does not have this latency; it gets activated by the same
clock edge that reads the last valid data. Doing that right and fast is
the art of asynchronous FIFO design...)
I claim that it is easy to use the ALMOST FULL flag, since the exact
max capacity of a FIFO is not critical. Set it for 1020 for a 1024-deep
FIFO, and you will never be bothered by the latency; you actually get
an early warning...
Peter Alfke
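
A small behavioral sketch of the point above, in Python rather than HDL. It models a writer that pushes one word per write clock until it sees a flag that arrives one write clock late. Everything here is an assumption made for illustration: the function name `run_writer`, the choice to apply the same one-clock latency to ALMOST FULL, the stalled reader (no reads at all), and counting a write into a full FIFO as a lost word.

```python
def run_writer(depth: int, stop_at: int, cycles: int) -> tuple[int, int]:
    """Simulate a continuous writer gated by a flag that is one clock late.

    stop_at is the fill level at which the flag condition becomes true:
    stop_at == depth models FULL, a smaller value models ALMOST FULL.
    Returns (words stored, words lost).
    """
    count = 0         # words currently held in the FIFO
    lost = 0          # writes attempted while the FIFO was already full
    flag = False      # flag value the writer sees at the coming clock edge
    pending = False   # true condition after the previous edge, not yet visible
    for _ in range(cycles):
        write = not flag      # writer decides using the flag it can see now
        flag = pending        # the flag catches up one write clock late
        if write:
            if count < depth:
                count += 1
            else:
                lost += 1     # FIFO was already full: this word is lost
        pending = count >= stop_at
    return count, lost


print("gate on FULL        :", run_writer(1024, 1024, 2000))
print("gate on ALMOST FULL :", run_writer(1024, 1020, 2000))
```

Under these assumptions, gating on FULL loses exactly one word, while gating on ALMOST FULL at 1020 stops the writer a few locations short of capacity with nothing lost.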

Alex Shot

Oct 17, 2005, 10:58:22 AM

Xilinx's asynchronous FIFOs have a depth of (power of 2) - 1 bytes.
According to my analysis of Xilinx's application notes, the reason is
that the full flag may actually be generated one write clock period
later than expected. To prevent overflow, the FIFO depth is
decreased by 1.
Alex

Peter Alfke

Oct 17, 2005, 1:17:54 PM

Let me correct this:
The addressing depth of Virtex-4 FIFOs is 512, 1024, 2048, or 4096
locations. The word "byte" is meaningful only for the 2048 x 9
configuration.
The FULL flag goes active one write clock cycle after the FIFO has been
filled. That means, in a continuous write situation, the last written
entry will be lost. That's why I recommend using the ALMOST FULL flag
instead of the FULL flag.
EMPTY does not have this problem. It goes active on the same read clock
edge that is reading the last piece of data out of the FIFO. EMPTY
then goes inactive again after a data entry has been written into the
FIFO and the internal signal has been re-synchronized to the read
clock, which takes a few read clock cycles.
This asymmetric behavior assures that the EMPTY flag is appropriately
extremely fast in stopping any further erroneous reads, but is more
"relaxed" in allowing the reading to restart again. Note that this read
latency only occurs after the FIFO had gone empty.
If anybody has questions about the Virtex-4 FIFO, I am the right person
to ask. I have designed FIFOs, on and off, for over 35 years...
Peter Alfke, Xilinx Applications
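
The asymmetric EMPTY behavior described above (fast to assert, relaxed to release) can be illustrated with a read-clock-domain model. This is a sketch only: the class name `EmptyFlagModel` is hypothetical, and the three-stage synchronizer delay is an assumed stand-in for the "few read clock cycles" mentioned in the post, not a documented Virtex-4 figure.

```python
from collections import deque


class EmptyFlagModel:
    """Read-side view of an asynchronous FIFO, stepped once per read clock."""

    def __init__(self, sync_stages: int = 3):
        self.visible = 0                       # words the read side knows about
        self.pipe = deque([0] * sync_stages)   # writes still crossing clock domains
        self.empty = True

    def read_clock(self, writes_this_cycle: int = 0, read_enable: bool = False) -> bool:
        """Advance one read clock edge; return True if a word was actually read."""
        # New writes enter the synchronizer; the oldest entry becomes visible.
        self.pipe.append(writes_this_cycle)
        self.visible += self.pipe.popleft()
        # The reader acts on the EMPTY value it saw before this edge.
        did_read = read_enable and not self.empty
        if did_read:
            self.visible -= 1
        # EMPTY updates on this same edge, so the last read raises it at once.
        self.empty = self.visible == 0
        return did_read


fifo = EmptyFlagModel()
fifo.read_clock(writes_this_cycle=1)
for edge in range(1, 4):
    fifo.read_clock()
    print(f"read edge {edge} after the write: empty={fifo.empty}")
print("read succeeded:", fifo.read_clock(read_enable=True), "empty:", fifo.empty)
```

In this model a write only releases EMPTY after the synchronizer delay, while the read that takes the last word raises EMPTY on the same edge, so a following read attempt is blocked.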

Dave Pollum

Oct 17, 2005, 3:32:48 PM

So Peter, what do those of us with lowly Spartan-II FPGAs do if we
want, say, a 16x9 FIFO?
-Dave

Peter Alfke

Oct 17, 2005, 5:24:28 PM

Spartan is much cheaper and does not have all the bells and whistles of
Virtex.
So you have to knit your own, or grab a core.
Complexity depends on synchronous/asynchronous choice, and on max clock
rate.
At a slow speed, you can time share read and write.
Many choices...
Peter Alfke

Peter C. Wallace

Oct 17, 2005, 10:20:50 PM

A very small 16-by-N sync FIFO is easy in the Spartan-II using N SRL16Es
(and a 5-bit counter).

Peter Wallace
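
A behavioral sketch of the shift-register FIFO idea above, written in Python rather than HDL. The class name `Srl16Fifo` and the mapping of the counter to the read tap (tap address = count - 1) are assumptions about how such a design is typically wired, not details given in the post. The list stands in for N parallel SRL16Es that all shift together.

```python
class Srl16Fifo:
    """16-deep synchronous FIFO built around an addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.shift = [0] * self.DEPTH   # shift[0] is the newest word
        self.count = 0                  # 0..16, hence the 5-bit counter

    def clock(self, wr: bool = False, din: int = 0, rd: bool = False):
        """One clock edge. Returns the word read, or None if nothing was read."""
        do_rd = rd and self.count > 0
        # A write is accepted if there is room, or if a read frees a slot.
        do_wr = wr and (self.count < self.DEPTH or do_rd)
        # The oldest word sits at tap count - 1 (assumed addressing scheme).
        dout = self.shift[self.count - 1] if do_rd else None
        if do_wr:
            self.shift = [din] + self.shift[:-1]   # every SRL16E shifts by one
        self.count += int(do_wr) - int(do_rd)      # up on write, down on read
        return dout


fifo = Srl16Fifo()
for value in (10, 20, 30):
    fifo.clock(wr=True, din=value)
print([fifo.clock(rd=True) for _ in range(4)])
```

Because both ports share one clock, no synchronizers are needed, which is why this approach suits synchronous FIFOs only.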

raul

Oct 19, 2005, 5:04:38 PM

For simulation, are the Xilinx FIFO models any faster than before?
Just recently I had to write fully-synchronous FIFO models to
accelerate the simulations and achieved 100X (one hundred times)
improvement.

RAUL

Peter Alfke

Oct 19, 2005, 6:34:19 PM

Simulating asynchronous clocking must be very difficult and time
consuming (I dare not use the word "impossible" for fear of being
flamed). How do you cover all clock phase relationships, down to the
femtosecond level? Synchronizers operate with that kind of timing
resolution.
Peter Alfke, speaking for himself.

raul

Oct 20, 2005, 1:10:28 PM

Event-based simulation allows you to have very fine resolutions. Just
make sure that all your signals crossing clock domains are flopped and
that there are no Clock-to-Q delays involved in your model. I have run
the fast FIFO models in ModelSim PE 6.1a and Veritak 1.75A and they
have identical behavior to the Xilinx models.

Peter Alfke

Oct 20, 2005, 5:04:01 PM

Raul, this may just reveal my ignorance, but anyhow:

How do you model metastability, which needs sub-femtosecond resolution?
How do you model that an asynchronous FIFO generates its EMPTY flag in
time, even under the most adverse timing conditions between the two
incoming clocks?
Those have been things that kept me awake at night :-(

Peter Alfke

Kim Enkovaara

Oct 21, 2005, 4:59:28 AM

Usually in RTL simulations you don't even want to model things like that.
The most important thing is to get fast simulation times for the whole design.
And at least in the past, Xilinx models were overly complex for pure RTL
simulations, and you usually needed your own simulation models to get the speed.

The correctness of the async fifos must come from the design, reviews
etc. It's impossible to simulate all the cases.

Of course, with netlist simulations timing-accurate models are needed,
but that is a small part of simulations. That is usually done to check
timing constraints and synthesis bugs (if formal verification tools are
not part of the user's toolset). Asynch portions are almost impossible to
simulate. Nowadays there are also formal tools that check clock domain
crossing correctness etc. Those tools can even inject errors during
simulation that could be caused by metastability (the places are found by the
formal portion).

--Kim

Peter Alfke

Oct 21, 2005, 10:21:05 AM

Kim, thank you for that clarification. That means I was right in
considering any simulation of metastability-causing asynchronous
clocking impossible. There is no substitute for creativity, circuit
analysis, some deep thinking, and experimentation. All of that we have
done to verify the metastable behavior of our flip-flops, and to verify
the behavior of our asynchronous FIFO in Virtex-4.
Obviously, one can always simulate the effect that a given metastable
delay has on the rest of the circuitry, but one cannot simulate the
origin of the metastable delay.
Peter Alfke, Xilinx Applications

raul

Oct 21, 2005, 11:14:23 AM

Hi,

There is no need to simulate metastability. The RTL simulations are
functional. All conditions of empty and full have been verified with
directed and random behavior over long simulations with clocks sliding
past each other. The FIFOs are as asymmetrical as 128 bits in and 16
bits out and with clocks as different as 37.125 MHz and 100 MHz.

The simulations have been proven correct in the lab on Virtex-2 Xilinx
FPGAs running for several hours with real data.

ModelSim PE's code profiler said that time was being spent mostly in
the Xilinx FIFOs.

RAUL
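
To get a feel for what "clocks sliding past each other" covers, the sketch below enumerates the phase offsets at which edges of a 37.125 MHz clock land inside a 100 MHz clock period. It assumes ideal, jitter-free clocks at exactly those frequencies that start aligned; the function name `distinct_phases` is made up for illustration. Real oscillators drift and jitter, so hardware sees a continuum of phases rather than this finite set.

```python
from fractions import Fraction


def distinct_phases(f_a_hz: Fraction, f_b_hz: Fraction) -> list:
    """Phases (fractions of clock B's period) hit by the edges of clock A."""
    ratio = f_b_hz / f_a_hz          # period of A measured in periods of B
    phases = set()
    k = 0
    while True:
        phase = (k * ratio) % 1      # where edge k of A falls within B's period
        if phase in phases:          # the pattern has started to repeat
            break
        phases.add(phase)
        k += 1
    return sorted(phases)


phases = distinct_phases(Fraction(37_125_000), Fraction(100_000_000))
step_ps = float((phases[1] - phases[0]) * 10_000)   # 100 MHz period = 10,000 ps
print(f"{len(phases)} distinct phases, about {step_ps:.2f} ps apart")
```

With these ideal clocks the pattern repeats after 297 edges, giving a phase step of roughly 34 ps. That is ample for functional checks of the flags, but far coarser than the femtosecond-level resolution raised earlier in the thread for metastability.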

raul...@gmail.com

Oct 23, 2005, 11:19:12 AM

Hi Jim,

Xilinx synchronization FIFO problems show up in just a few minutes. There
used to be a problem with the previous release of their core generator;
the latest one works fine in the lab. Just make sure you always have
the latest version of the core generator to avoid headaches in the lab
that would not show up in simulation.

RAUL

Gabor

Oct 24, 2005, 11:07:21 AM

I think the thread was about async FIFOs. For those, SRL16s are of no
use. Your best bet is the COREgen FIFO using distributed memory for
shallow (x15 or x31) FIFOs or block memory for deeper ones.

cli...@sunburst-design.com

Nov 10, 2005, 7:18:16 PM

Hi, Davy -

You may want to browse a number of papers on my web page for coding
guidelines and coding styles related to multi-clock design and
asynchronous FIFO design.

At the web page: www.sunburst-design.com/papers

Look for the San Jose SNUG 2001 paper:
Synthesis and Scripting Techniques for Designing Multi-Asynchronous
Clock Designs

Look for the San Jose SNUG 2002 paper:
Simulation and Synthesis Techniques for Asynchronous FIFO Design

Look for the second San Jose SNUG 2002 paper (co-authored with Peter
Alfke of Xilinx):
Simulation and Synthesis Techniques for Asynchronous FIFO Design with
Asynchronous Pointer Comparisons

Peter likes the second FIFO style better but the asynchronous nature of
the design does not lend itself well to timing analysis and DFT.

I prefer the more synchronous style of the first FIFO paper.

I hope to have another FIFO paper on my web page soon that uses Peter's
clever quadrant-based full-empty detection with a more synchronous
coding style.

We spend hours covering multi-clock and Async FIFO design in my
Advanced Verilog Class. These are non-trivial topics that are poorly
covered in undergraduate training. I have had engineers email me to
tell me that their manager told them to run all clock-crossing signals
through a pair of flip-flops and everything should work! WRONG!

Regards - Cliff Cummings
Verilog & SystemVerilog Guru
www.sunburst-design.com
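
One way to see why a bare pair of flip-flops is not enough for a multi-bit value such as a FIFO pointer: if several bits change at once, each bit may be captured as either its old or its new value. The sketch below enumerates the possible captured values for a binary count and for the same count in Gray code, a commonly used remedy. The function names are illustrative, and the model (each changing bit resolving independently) is a simplifying assumption.

```python
from itertools import product


def to_gray(n: int) -> int:
    """Convert a binary count to its reflected Gray code."""
    return n ^ (n >> 1)


def possible_samples(old: int, new: int, width: int) -> set:
    """Values a synchronizer might capture while old changes to new.

    Assumes each changing bit independently resolves to old or new.
    """
    changing = [b for b in range(width) if ((old ^ new) >> b) & 1]
    results = set()
    for choice in product((False, True), repeat=len(changing)):
        value = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                value ^= 1 << bit     # this bit was captured after it changed
        results.add(value)
    return results


print("binary 7 -> 8:", sorted(possible_samples(7, 8, 4)))
print("Gray   7 -> 8:", sorted(possible_samples(to_gray(7), to_gray(8), 4)))
```

For the binary step from 7 to 8 all four bits change, so any 4-bit value could be captured. In Gray code only one bit changes, so the captured pointer is either the old value or the new one, both of which are safe for full/empty comparison.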