Example of multi-clock design?

ÉRDI Gergő

Feb 14, 2021, 7:50:48 AM

to clash-l...@googlegroups.com

Hi,

I think I need a circuit with a slow domain containing the CPU, a fast
domain containing the video signal generator, and a shared (and
access-arbitrated) block RAM in between. If it helps, I can make the
fast/slow ratio an integer.

I have absolutely zero idea where to start on this, where's good
documentation? Moreover, is there an example Clash project that shows how
to do something like this?

Thanks,
Gergo

--

.--= ULLA! =-----------------.
\ http://gergo.erdi.hu \
`---= ge...@erdi.hu =-------'
"He'll be the first against gcc -Wall when the revolution comes."

Peter Lebbing

Feb 14, 2021, 10:35:01 AM

to clash-l...@googlegroups.com

Hi Gergő,

On Sun, 14 Feb 2021, ÉRDI Gergő wrote:
> I have absolutely zero idea where to start on this, where's good
> documentation?

Frankly, I don't know. Designing a system with multiple clocks has its
intricacies, especially related to meta-stability of signals. The good
news is that Clash already has a synchronizer primitive that uses a
multi-clock blockRAM to synchronize two domains:
Clash.Explicit.Prelude.asyncFIFOSynchronizer. When you want to
synchronize signals more than a bit wide at the full speed of the slower
domain, that's about your only option, as far as I'm aware.

> Moreover, is there an example Clash project that shows how to do
> something like this?

While the tutorial has a bit on multi-clock design that essentially
creates the asyncFIFOSynchronizer from the ground up, that's pretty
low-level. But it's informative, if you somewhat grok the paper it is
based on (it's linked).

I just wrote a little variation on the Blinker example, with two clock
domains. Instead of a pattern of 8 LEDs, it's two patterns of 4 LEDs,
one at 15 MHz, the other at 10 MHz. Just for show, I synchronise the
button that controls the mode from the 15 MHz domain to the 10 MHz
domain, as if it were synchronised to the 15 MHz domain instead of an
asynchronous and probably bouncy button :-D. It's just one bit wide, so
dualFlipFlopSynchronizer is fine to quell all but the worst
meta-stability (in practice, dualFlipFlopSynchronizer is stable).
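To make the behaviour of the two-register chain concrete, here is a minimal list-based sketch in plain Haskell. It is not the Clash primitive: `register1` and `dualFF` are hypothetical names, each list element stands for the value seen at one edge of the destination clock, and metastability itself is not modelled. All it shows is that the synchronised signal starts at its reset value and then follows the input two destination cycles late.

```haskell
-- Plain-Haskell model, not the Clash primitive: each list element is the
-- value present at one rising edge of the destination clock.

-- One register: outputs its reset value first, then the input one cycle late.
register1 :: a -> [a] -> [a]
register1 resetVal xs = resetVal : xs

-- Two registers back to back, both clocked by the destination domain.
dualFF :: a -> [a] -> [a]
dualFF resetVal = register1 resetVal . register1 resetVal
```

With a reset value of 1 (the same initial value as in the `key10` line of the example below), `take 5 (dualFF 1 [0,0,1,1,0])` gives `[1,1,0,0,1]`.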

I deviate from the tutorial Blinker a bit. Most importantly, I think the
clock domains are named really poorly in the tutorial. We have a 50 MHz
clock called DomInput and a 20 MHz clock called Dom50. That name is
based on the period, but to pick two clocks where the frequency of the
one is the period of the other and vice versa is just asking for
confusion, IMO. Plus, I personally think of clocks having a frequency
primarily. So if you give me a domain named Dom50, I am inclined to
assume it runs at 50 MHz, or kHz, or something. So I picked something
other than 20 MHz for my generated clocks.
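The period/frequency mix-up is easy to check with a little arithmetic. The helpers below are hypothetical, written in plain Haskell for this note only (Clash has its own `hzToPeriod`); they use integer division, so periods that are not a whole number of picoseconds are truncated.

```haskell
-- Period in picoseconds of a clock with the given frequency in Hz.
-- (Hypothetical helper for this note; not part of Clash.)
periodPs :: Integer -> Integer
periodPs hz = 1000000000000 `div` hz

-- Period in nanoseconds, the unit the "Dom50" name refers to.
periodNs :: Integer -> Integer
periodNs hz = periodPs hz `div` 1000
```

`periodNs 50000000` is 20 and `periodNs 20000000` is 50, which is exactly the swap that makes the names confusing: the 50 MHz input has a 20 ns period, and the 20 MHz "Dom50" clock has a 50 ns period.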

Furthermore, I wanted to make some more types explicit, and changed the
stuff related to the size of the counter to better suit my preferences.

The result is this:

--8<---------------cut here---------------start------------->8---
module MultiBlinker where

import Clash.Prelude

import qualified Clash.Explicit.Prelude as CEP
import Clash.Intel.ClockGen

createDomain vSystem{vName="DomInput", vPeriod=hzToPeriod 50e6}
createDomain vSystem{vName="Dom15", vPeriod=hzToPeriod 15e6}
createDomain vSystem{vName="Dom10", vPeriod=hzToPeriod 10e6}

topEntity
  :: Clock DomInput
  -> Signal DomInput Bool
  -> Signal Dom15 Bit
  -> ( Signal Dom15 (BitVector 4)
     , Signal Dom10 (BitVector 4)
     )

topEntity clk rst key15 = (leds15, leds10)
 where
  leds15 = exposeClockResetEnable blinker pll15Out rst15Sync enableGen key15
  leds10 = exposeClockResetEnable blinker pll10Out rst10Sync enableGen key10
  key10 = CEP.dualFlipFlopSynchronizer pll15Out pll10Out rst10Sync enableGen 1
            key15
  (pll15Out,pll15Stable) =
    altpll @Dom15 (SSymbol @"altpll15") clk (unsafeFromLowPolarity rst)
  rst15Sync =
    resetSynchronizer pll15Out (unsafeFromLowPolarity pll15Stable) enableGen
  (pll10Out,pll10Stable) =
    altpll @Dom10 (SSymbol @"altpll10") clk (unsafeFromLowPolarity rst)
  rst10Sync =
    resetSynchronizer pll10Out (unsafeFromLowPolarity pll10Stable) enableGen
{-# ANN topEntity
  (Synthesize
    { t_name = "blinker"
    , t_inputs = [PortName "CLOCK_50", PortName "KEY0", PortName "KEY1"]
    , t_output = PortProduct ""
        [ PortName "LEDH"
        , PortName "LEDL"
        ]
    }) #-}

blinker
  :: ( HiddenClockResetEnable dom
     , KnownNat n)
  => Signal dom Bit
  -> Signal dom (BitVector n)
blinker =
    mealy
      -- 15 MHz: update period 333 ms
      -- 10 MHz: update period 500 ms
      (blinkerT @5000000)
      (1, False, 0)
  . isRising 1

blinkerT
  :: forall d n
   . ( KnownNat d
     , KnownNat n
     , 1 <= d)
  => (BitVector n, Bool, Index d)
  -> Bool
  -> ((BitVector n, Bool, Index d), BitVector n)
blinkerT (leds,mode,cntr) key1R = ((leds',mode',cntr'),leds)
 where
  cntr' = satSucc SatWrap cntr

  mode' | key1R = not mode
        | otherwise = mode

  leds' | cntr == 0 = if mode then complement leds
                      else rotateL leds 1
        | otherwise = leds
--8<---------------cut here---------------end--------------->8---

It is untested! Current master seems to have an issue. It generates VHDL
with Clash 1.2.5, but I glanced through the generated files and I
strongly suspect we set INCLK0_INPUT_FREQUENCY wrongly in the PLL qsys
files. I'm filing bugs.

I also made no effort to correctly write the LED output port. I think
the original Blinker should just synthesise directly for some Altera dev
boards, but I just made up the pin names for the output LEDs, so it
won't.

I hope it clarifies things, even though it is not complete.

HTH,

Peter.

Gergő Érdi

Feb 15, 2021, 3:03:56 AM

to CLaSH - Hardware Description Language

Hi Peter,

Thanks for the useful links & example, I'll play around with it to get
a better idea of the basics. I have also found
https://www.edn.com/synchronizer-techniques-for-multi-clock-domain-socs-fpgas/
which seems like a good introduction to the topic. However, I still
can't quite wrap my head around how memory access would work.

Suppose I have two addressing signals `Signal fast (Maybe (addr, Maybe
dat))` and `Signal slow (Maybe (addr, Maybe dat))`, and I want to
connect them to a shared synchronous block RAM (with some static
arbitration, i.e. either the fast one always takes precedence, or the
slow one always takes precedence). I imagine it is simpler to turn the
`slow` one into a `fast` one that just happens to change less often;
but then, the result is also in `fast` (albeit, again, changing less
frequently). In a single domain, the basic assumption about
synchronous block RAM is that the read result is available in the
period just after the read address changes; what does this idea look
like when the block RAM "lives in" the `fast` domain, and its result
is used in the `slow` domain?
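For reference, the single-domain assumption described above can be written down as a small list-based model in plain Haskell. This is only a sketch: `Request` and `syncRam` are hypothetical names, the memory is a `Data.Map`, never-written addresses read as `Nothing`, and it deliberately says nothing about the cross-domain case the question is about.

```haskell
import qualified Data.Map as Map

-- One request per clock cycle: Nothing = idle, Just (a, Nothing) = read a,
-- Just (a, Just d) = write d to a.
type Request addr dat = Maybe (addr, Maybe dat)

-- Single-clock model of a synchronous RAM. The output list is one element
-- longer than the input: element i+1 is the response to request i, which
-- models the one-cycle read latency. Unwritten addresses read as Nothing.
syncRam :: Ord addr => [Request addr dat] -> [Maybe dat]
syncRam = (Nothing :) . go Map.empty
  where
    go _   []         = []
    go mem (req : rs) = case req of
      Nothing           -> Nothing          : go mem rs
      Just (a, Nothing) -> Map.lookup a mem : go mem rs
      Just (a, Just d)  -> Nothing          : go (Map.insert a d mem) rs
```

A write to address 3 followed by a read of address 3 yields the written value one cycle after the read request: `syncRam [Just (3, Just 'x'), Just (3, Nothing), Nothing]` is `[Nothing, Nothing, Just 'x', Nothing]`.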

Christiaan Baaij

Feb 15, 2021, 5:23:07 AM

to clash-l...@googlegroups.com

While the Clash prelude has asyncFIFOSynchronizer, it's far from ideal in terms of safe clock domain crossing (CDC).

You really want to use the vendor CDC FIFOs, e.g.

* XPM_FIFO_ASYNC on Xilinx (https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_2/ug974-vivado-ultrascale-libraries.pdf)

* DCFIFO on Intel (https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_fifo.pdf)

That's because they come with proper embedded timing and placement constraints.

With regard to a shared RAM, you'll have to make a model and blackbox/primitive for a true dual-ported RAM.

For the model you get something like:

trueDualPortRam addrA wEnA dataInA rdEnA addrB wEnB dataInB rdEnB = let
    -- clock periods of the two domains; the shorter period is the faster clock
    periodA = snatToNum (clockPeriod @domA)
    periodB = snatToNum (clockPeriod @domB)
    fastDom = min periodA periodB
    -- bring the write side of both ports into the faster domain
    addrA_fast = veryUnsafeSynchronizer periodA fastDom addrA
    wEnA_fast = veryUnsafeSynchronizer periodA fastDom wEnA
    dataInA_fast = veryUnsafeSynchronizer periodA fastDom dataInA
    addrB_fast = veryUnsafeSynchronizer periodB fastDom addrB
    wEnB_fast = veryUnsafeSynchronizer periodB fastDom wEnB
    dataInB_fast = veryUnsafeSynchronizer periodB fastDom dataInB
    ram = writeLogic addrA_fast wEnA_fast dataInA_fast addrB_fast wEnB_fast dataInB_fast
    -- bring the RAM contents back into each port's own domain
    ramA = veryUnsafeSynchronizer fastDom periodA ram
    ramB = veryUnsafeSynchronizer fastDom periodB ram
  in (readLogic ramA addrA rdEnA, readLogic ramB addrB rdEnB)

where `writeLogic` implements all the "conflict resolution" (i.e. make the value "undefined" when both addresses are the same and both write enables are asserted) and does the "correct" thing when the address or the write enable is "undefined" (throws an X exception).

and `readLogic` implements the read delay.
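As an illustration of the conflict rule only, one step of such write logic might look like the plain-Haskell sketch below. Everything in it is an assumption made for this note: `Mem`, `Write` and `writeStep` are hypothetical names, the memory is a `Data.Map`, an undefined word is represented as `Nothing` rather than an X exception, and undefined addresses or write enables are not handled at all.

```haskell
import qualified Data.Map as Map

-- A stored word is Nothing when its contents are undefined.
type Mem addr dat = Map.Map addr (Maybe dat)

-- One write port for one step: Just (address, data) when write-enabled.
type Write addr dat = Maybe (addr, dat)

-- Apply the writes of both ports for a single step of the fast domain.
-- Two enabled writes to the same address leave that word undefined.
writeStep :: Ord addr => Write addr dat -> Write addr dat -> Mem addr dat -> Mem addr dat
writeStep (Just (a, _)) (Just (b, _)) mem
  | a == b = Map.insert a Nothing mem
writeStep wA wB mem = apply wB (apply wA mem)
  where
    apply Nothing       m = m
    apply (Just (a, d)) m = Map.insert a (Just d) m
```

Writes to different addresses both land, while `writeStep (Just (1,'a')) (Just (1,'b')) Map.empty` leaves address 1 holding `Nothing`.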

Again, the above can only be a model; the vendor synthesis tools are highly unlikely to infer a true dual-ported blockRAM if you let Clash generate HDL for the above.

I vaguely remember some additional intricacies when we did this work for a client (sadly that work couldn't be open-sourced), so you'd have to set up some test infrastructure to make sure the Haskell model and the vendor HDL model behave exactly the same.

In all honesty, both the true dual port ram and dual clock fifo, including primitive/blackbox wrappers for the vendor HDL models, should really be in the Clash standard library...

Peter Lebbing

Feb 15, 2021, 6:18:18 AM

to CLaSH - Hardware Description Language

Hi Gergő,

On Mon, 15 Feb 2021, Gergő Érdi wrote:
> However, I still can't quite wrap my head around how memory access
> would work.

These dual-port blockRAMs in the FPGA are very special beasts. I never
looked further into how they are constructed.

For the FIFO synchronizer, we use a dual-port blockRAM with separate
read and write clocks. It has one read port synchronised to one clock,
and one write port synchronised to another clock. Once the timing of the
blockRAM tells us that a write has landed, a subsequent read of that
address will return the correct data. The write was clocked by the write
clock, but the subsequent read by the read clock. There is no known
relation between the two clocks; when they don't come from the same
source, they will drift relative to each other. This is no problem for
this RAM.

The behaviour of this dual-port blockRAM can, I think, not be expressed
in relation to a single clock. It doesn't matter which clock is slow,
which is fast, and what the phase relation between the two is (the
latter will drift with independent clocks).

At its core, an SRAM is asynchronous. So maybe multiple clock domains are
not quite as magical as they seem at first glance. It's registered, and
those registers are clocked. But every register in itself belongs to a
single clock domain, nothing special there. The actual reads and writes
to the SRAM however are asynchronous, there is no clock involved
anymore. I think most of the magic actually comes from it being
dual-ported. These blockRAMs are quite special beasts. They have two
ports, and if you want, both of those ports can be used for reads as
well as writes. In this application, we don't use that feature.

Where the scientific paper comes into the equation is for the
synchronising of the read and write pointers to the other domain. I
think that's where the intellectual effort lies, in proving that no
matter the glitches, the read and write pointers will never read past
the end or write past the start of the circular buffer. And of course we
only update the write pointer once it is certain that the data has
landed in the SRAM.
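A note on why pointer crossing is workable at all, under the assumption (not spelled out in this thread) that the pointers are passed between domains in Gray code, as is usual for this kind of FIFO: consecutive Gray-coded values differ in exactly one bit, so a pointer sampled mid-change is either the old or the new value, never a mix of several changing bits. A minimal plain-Haskell sketch with hypothetical helper names:

```haskell
import Data.Bits (popCount, shiftR, xor)

-- Binary-reflected Gray code of a pointer value.
toGray :: Int -> Int
toGray n = n `xor` (n `shiftR` 1)

-- Number of bit positions in which two words differ.
bitsChanged :: Int -> Int -> Int
bitsChanged a b = popCount (a `xor` b)
```

For a plain binary pointer, `bitsChanged 7 8` is 4, whereas `bitsChanged (toGray 7) (toGray 8)` is 1. The same holds for the wrap-around of a 4-bit pointer from 15 back to 0.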

When two ports of a blockRAM which are in separate clock domains do an
operation on the same address, there is by definition no knowing what
the result will be for the data at that address (either read or, in the
case of two conflicting writes, written). You'll find in the datasheet
that the behaviour on write conflicts, for instance, is defined or
configurable if the two operations are from the same clock domain.
However, if they are from different clock domains, the datasheets will
just say "undefined result". Because it really cannot be constrained.
One clock might rise a femtosecond before the other on one cycle, and
only after the other on another cycle. Who's to say which was first? Of
course, it's possible to construct clocks that always have a fixed phase
relation with each other. And then you could build simpler
synchronisers. This is unexplored territory for Clash. We assume two
clocks have no fixed relation, and then you need "proper" synchronisers.

> Suppose I have two addressing signals `Signal fast (Maybe (addr, Maybe
> dat))` and `Signal slow (Maybe (addr, Maybe dat))`, and I want to
> connect them to a shared synchronous block RAM (with some static
> arbitration, i.e. either the fast one always takes precedence, or the
> slow one always takes precedence).

Where this train of thought derails is on "synchronous blockRAM". For
understanding, I think you need to split that into "asynchronous SRAM"
and "registers before, and possibly after, this asynchronous SRAM". And
then view the registers separately. Some are in one clock domain, others
are in another. But every single register is in a single clock domain.

I hope this all makes some sense. Also note that I might be wrong in
parts, this is purely what I made of it after thinking about it for some
time. Because I was quite intrigued by this strange RAM that has
multiple ports and clock domains, yet I never made the time to try to
find an authoritative source to explain it to me (well, other than the
datasheet for the FPGA). I have also never been formally educated on
multiple clock domains in one circuit. I think I'm right in what I write
here, but that could just be the Dunning-Kruger effect :-). I would
definitely ask an expert for guidance before I design a multi-clock
circuit myself that needs to be correct. There are bound to be
intricacies that I'm not aware of.

> On Sun, Feb 14, 2021 at 11:37 PM Peter Lebbing <pe...@qbaylogic.com> wrote:
> > It generates VHDL with Clash 1.2.5, but I glanced through the
> > generated files and I strongly suspect we set INCLK0_INPUT_FREQUENCY
> > wrongly in the PLL qsys files. I'm filing bugs.

There was no problem with the generated Qsys file, it was a
misunderstanding on my part. Apparently Quartus expresses frequencies in
picoseconds. Think about that for a second (heh). The mind boggles.

HTH,

Peter.

Peter Lebbing

Feb 15, 2021, 6:24:22 AM

to clash-l...@googlegroups.com

On Mon, 15 Feb 2021, Christiaan Baaij wrote:
> In all honesty, both the true dual port ram and dual clock fifo,
> including primitive/blackbox wrappers for the vendor HDL models,
> should really be in the Clash standard library...

If I'm remembering correctly: You explained to me once, when I said the
same thing, that the annoying thing is the proper HDL primitive depends
on the vendor you're targeting, whereas with most of our generated HDL
code, we strive to generate something that all vendors will accept and
handle correctly.

Peter.

ge...@erdi.hu

Nov 5, 2021, 9:16:10 PM

to Clash - Hardware Description Language

With https://github.com/clash-lang/clash-compiler/pull/1726 merged, I hope this can be revisited. TBH I completely dropped this line of inquiry since the original thread, because I had smaller fish to fry and all the replies made it sound like without TDP RAM it'd be an uphill battle anyway.

So with Clash HEAD, is it now possible for someone to offer me nicely generic "solved-once-and-for-all" Clash code for the following problems, with the assumption `freq(fast) >> freq(slow)`?

1. Shared RAM with r/w access from domain `slow` and r/o access from domain `fast`: I suppose this is a one-liner with the new TDP RAM primitives of Clash.

2. `strobe :: Signal fast (Maybe a) -> Signal slow (Maybe a)`: The intended semantics of this is that if the `fast` signal becomes `Just x` for one period, and the frequency of these `Just` spikes is very low even compared to `slow`, then the `slow` signal should become `Just x` for one (slow) period.

3. `control :: Signal slow (Maybe a) -> Signal fast (Maybe a)`: the diff vs `strobe` would be that we drop the low-frequency requirement, i.e. the input `slow` signal can be allowed to change to a different `Just` value on every slow period, and each one should map to exactly one `Just` value in the `fast` domain.

4. `slow :: Signal slow a -> Signal fast a`, basically seeing the last "consistent" value of `a` from the `slow` side on the `fast` side.

Also, am I right in thinking that the difficulty increases from 1 to 4?
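To pin down the intended semantics of items 2 and 4, here is a list-based reference model in plain Haskell. It is a specification sketch only, not synthesizable Clash and not a safe clock-domain crossing: it assumes the fast/slow ratio is an integer `k >= 1` and that the clock edges are aligned, and `chunksOf`, `strobeModel` and `slowModel` are hypothetical names.

```haskell
import Data.Maybe (catMaybes, listToMaybe)

-- Split a list into consecutive chunks of k elements (last may be shorter).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t

-- Model of item 2: each slow period covers k fast samples and reports the
-- first Just seen among them, if any. Assumes k >= 1 and aligned clocks.
strobeModel :: Int -> [Maybe a] -> [Maybe a]
strobeModel k = map (listToMaybe . catMaybes) . chunksOf k

-- Model of item 4: every slow sample is simply visible for k fast periods.
slowModel :: Int -> [a] -> [a]
slowModel k = concatMap (replicate k)
```

With a ratio of 3, `strobeModel 3 [Nothing, Just 'a', Nothing, Nothing, Nothing, Nothing]` is `[Just 'a', Nothing]`, and with a ratio of 2, `slowModel 2 "ab"` is `"aabb"`. A real implementation would be checked against a model like this in simulation.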

Thanks,

Gergo
