Xilinx and Multi-port memories

Rob Doyle

Dec 25, 2009, 11:38:48 PM

I'm trying to build a register file for an ALU which has 3 read ports
and 1 write port. It is a single-clock design, but I need to assume
that all ports are in use on every clock cycle, worst case.

I can envision implementing this using 3 Dual Port Memories each
with one read port and one write port as follows:

Sorry - ASCII ART (use fixed width font)

Address lines elided -

              +-------+
         +--->| RAM1  | -----> read1
         |    +-------+
         |    +-------+
write1 --+--->| RAM2  | -----> read2
         |    +-------+
         |    +-------+
         +--->| RAM3  | -----> read3
              +-------+

Is this the best way to do this?

If I *had* to add another write port to the memory - can you do that
using memories? I can't see it...

Thanks in advance.

Rob Doyle

whygee

Dec 26, 2009, 2:15:26 AM
Rob Doyle wrote:
> Is this the best way to do this?
AFAIK, IMHO, etc. yes

> If I *had* to add another write port to the memory - can you do that
> using memories? I can't see it...

Good question indeed.

There are some ways to fake a dual-write register
set using single-write blocks, but nothing I know of
can really do it. If you are designing a CPU,
this can badly affect the ISA and/or performance.
That's why most RISC CPUs only have 2R1W instructions.

One simple way to double the number of ports is by clocking
the register set 2x faster, but I assume from your post
that it's what you intend to do. That's a good bet
if your pipeline/clock/timing can handle it, since the memory
blocks can often run faster than the logic on some FPGAs.
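
A minimal sketch of the double-rate idea, under stated assumptions: the module name, the `clk2x` and `phase` inputs, and the default widths are all hypothetical, and the write inputs are assumed to stay stable for the whole pipeline cycle. The single physical write port serves logical port 1 in the first half of the cycle and logical port 2 in the second half.

```verilog
// Sketch only: names, widths and the phase signal are assumptions.
module regfile_2w_timemux #(
    parameter AW = 5,              // address bits (assumed)
    parameter DW = 32              // data bits (assumed)
) (
    input  wire          clk2x,    // assumed: twice the pipeline clock, phase-aligned
    input  wire          phase,    // assumed: 0 in the first half-cycle, 1 in the second
    input  wire          we1, we2,
    input  wire [AW-1:0] wa1, wa2,
    input  wire [DW-1:0] wd1, wd2,
    input  wire [AW-1:0] ra,
    output reg  [DW-1:0] rd
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // The one physical write port alternates between the two logical ports.
    wire          we = phase ? we2 : we1;
    wire [AW-1:0] wa = phase ? wa2 : wa1;
    wire [DW-1:0] wd = phase ? wd2 : wd1;

    always @(posedge clk2x) begin
        if (we) mem[wa] <= wd;
        rd <= mem[ra];             // registered read, returns the old word on a collision
    end
endmodule
```

If both logical ports target the same address, port 2 wins here simply because it is written second.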

Another method is to implement a register/buffer/write cache
that gets written back to the main register set on the following
cycle. The method assumes that the 2W instructions are not
very common and it stalls the pipeline for one cycle in order
to perform the write back, in case the following instruction
does a write too. A special path must also forward the recently
written value and bypass the read ports.

Yet another method splits the register set into two parts,
say the odd and even banks. The ISA will specify that the
2W instructions can not write to two registers of the same
bank. The restriction can be loosened a bit with more banks
(4 or 8) depending on the available resources.
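
The odd/even split can be sketched roughly as follows. The module name, port names and widths are hypothetical, the address LSB is assumed to select the bank, and the ISA is assumed to guarantee that two simultaneous writes never hit the same bank. Only one read port is shown; more would be added by replicating each bank.

```verilog
// Sketch only: assumes AW >= 2 and that software never issues two
// writes to the same bank in one cycle.
module regfile_2w_banked #(
    parameter AW = 5,              // total address bits; bit 0 selects the bank
    parameter DW = 32
) (
    input  wire          clk,
    input  wire          we1, we2,
    input  wire [AW-1:0] wa1, wa2,
    input  wire [DW-1:0] wd1, wd2,
    input  wire [AW-1:0] ra,
    output wire [DW-1:0] rd
);
    reg [DW-1:0] bank_even [0:(1<<(AW-1))-1];
    reg [DW-1:0] bank_odd  [0:(1<<(AW-1))-1];

    // Steer each write port to the bank named by its address LSB.
    wire hit_e1 = we1 & ~wa1[0];
    wire hit_e2 = we2 & ~wa2[0];
    wire hit_o1 = we1 &  wa1[0];
    wire hit_o2 = we2 &  wa2[0];

    // Each bank still has a single write port. If the ISA rule is broken,
    // port 1 wins and port 2's write is silently dropped.
    always @(posedge clk) begin
        if (hit_e1)      bank_even[wa1[AW-1:1]] <= wd1;
        else if (hit_e2) bank_even[wa2[AW-1:1]] <= wd2;
        if (hit_o1)      bank_odd[wa1[AW-1:1]]  <= wd1;
        else if (hit_o2) bank_odd[wa2[AW-1:1]]  <= wd2;
    end

    // The read side looks like one unified register set
    // (asynchronous read shown for brevity).
    assign rd = ra[0] ? bank_odd[ra[AW-1:1]] : bank_even[ra[AW-1:1]];
endmodule
```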

As far as I know, the venerable Alpha EV6 used powerful
combinations of these methods, implementing the 32-register set
with 2 huge banks that could handle 4 instructions
at once (with the help of register renaming and out-of-order
execution). comp.arch readers will fill the gaps :-)


> Thanks in advance.
keep us informed of your advances,

> Rob Doyle
yg

--
http://ygdes.com / http://yasep.org

Muzaffer Kal

Dec 26, 2009, 3:14:17 PM
On Fri, 25 Dec 2009 21:38:48 -0700, Rob Doyle <radi...@gmail.com>
wrote:

>
>If I *had* to add another write port to the memory - can you do that
>using memories? I can't see it...

One thing you can do is to have two copies of your register file and
keep a 'most-recently-written' state for each location. Then each read
path has an additional 2-1 mux after it controlled by the same signal
(and a comparator for the read address). This should give the datapath
you want but whether the extra delay is acceptable depends on your
requirements.
--
Muzaffer Kal

DSPIA INC.
ASIC/FPGA Design Services

http://www.dspia.com

Amal

Dec 28, 2009, 1:17:46 AM
On Dec 26, 3:14 pm, Muzaffer Kal <k...@dspia.com> wrote:
> On Fri, 25 Dec 2009 21:38:48 -0700, Rob Doyle <radioe...@gmail.com>

Xilinx supports 3-port memories as well if it helps. You can have one
read/write port and two read ports with 3 different addresses.

You can either infer it or instantiate the component directly.

Cheers,
-- Amal

whygee

Dec 28, 2009, 3:40:16 AM
hello,

Amal wrote:
> Xilinx supports 3-port memories as well if it helps. You can have one
> read/write port and two read ports with 3 different addresses.
> You can either infer it or instantiate the component directly.

the OP and the thread are about multiple write ports,
because multiple read ports are trivially implemented.
However, simultaneously having 2 read AND 2 write ports
(4 simultaneous addresses, for example) is not as easy
and I guess that few synthesisers will infer the correct
SRAM blocks.

> Cheers,
> -- Amal
happy new year,

rickman

Dec 28, 2009, 11:53:05 AM
On Dec 28, 3:40 am, whygee <y...@yg.yg> wrote:
> hello,
>
> Amal wrote:
> > Xilinx supports 3-port memories as well if it helps. You can have one
> > read/write port and two read ports with 3 different addresses.
> > You can either infer it or instantiate the component directly.
>
> the OP and the thread are about multiple write ports,
> because multiple read ports are trivially implemented.
> However, simultaneously having 2 read AND 2 write ports
> (4 simultaneous addresses, for example) is not as easy
> and I guess that few synthesisers will infer the correct
> SRAM blocks.

The OP doesn't need four ports from a single memory, only three are
needed. If two write ports are needed, there is no way to fake that,
at least no easy way. Three ports are needed, two write and one
read. Then as many read ports can be added as needed by duplicating
the memory with separate read addresses. If the OP can use the same
address for one write port and the read port, then this is just a two
port memory with read/write capability.

If a two write, one read port memory with three addresses is needed
and only two addresses can be input to the memory, then a third memory
can be used to "arbitrate" between two duplicated, two address, one
write, one read port memories. When a write happens, the data is
written to the memory connected to that write port. The third memory
is just one bit wide and the bit at the corresponding address is set
or reset to indicate which port the most recent write has been from.
When a read is done, a mux selects from the appropriate memory block.
The only fly in the ointment is that the "flag" memory has to have two
write and two read ports. So it can't be done in a block ram, but
must be made of logic and FFs. That is some savings over doing the
entire memory in logic if it is wider than one bit... but still a PITA
and quite a mess. But if you *have* to have this memory and the part
does not support it... I guess you have no choice, eh?
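
The "flag" memory arrangement described above might look roughly like this. It is a sketch under assumptions, not a tested design: the module and signal names and the widths are hypothetical, the reads are registered on the assumption that the data memories should map to block RAM, and only one read port is shown.

```verilog
// Sketch only: two 1W1R data memories plus a one-bit-per-word flag array
// that records which write port touched each address most recently.
module regfile_2w1r_flag #(
    parameter AW = 5,
    parameter DW = 32
) (
    input  wire          clk,
    input  wire          we1, we2,
    input  wire [AW-1:0] wa1, wa2,
    input  wire [DW-1:0] wd1, wd2,
    input  wire [AW-1:0] ra,
    output wire [DW-1:0] rd
);
    localparam DEPTH = 1 << AW;

    reg [DW-1:0] mem1 [0:DEPTH-1];   // written only by port 1
    reg [DW-1:0] mem2 [0:DEPTH-1];   // written only by port 2

    // 0 = port 1 wrote last, 1 = port 2 wrote last. It has two write
    // addresses, so it is built from flip-flops rather than block RAM.
    reg [DEPTH-1:0] last_wr;

    reg [DW-1:0] q1, q2;
    reg          sel;

    always @(posedge clk) begin
        if (we1) begin mem1[wa1] <= wd1; last_wr[wa1] <= 1'b0; end
        // On a same-address conflict the later assignment wins: port 2.
        if (we2) begin mem2[wa2] <= wd2; last_wr[wa2] <= 1'b1; end

        // Read both copies and the flag together so they stay in step.
        q1  <= mem1[ra];
        q2  <= mem2[ra];
        sel <= last_wr[ra];
    end

    // The mux picks the copy that holds the most recent write.
    assign rd = sel ? q2 : q1;
endmodule
```

Extra read ports would replicate `mem1`, `mem2` and the read mux, while all of them share the same `last_wr` flags.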

Rick

Selensky

Dec 29, 2009, 10:07:54 AM

Amal,

Which Xilinx devices support a native 3-port mode, without
duplicating memories? Is there any constraint on using this
operation mode combined with different data aspect ratios? I had heard
something about native support of 3-port (two read and one write port
with 3 different addresses) in Altera devices only.

Selensky

Rob Doyle

Dec 29, 2009, 4:37:18 PM

I *need* one write port and three read ports - so I'm OK just
duplicating the RAM.

I could save a clock cycle in the ALU if I could do two writes
and three reads. If I have to stall the pipeline to implement
this, I've gained nothing.

The timing won't permit 2 Register clock cycles per ALU clock cycle
to double-up the register accesses.

The multi-port "flag" memory is the trick I was looking for. The ALU
has 1024 registers so I can envision some tall data selectors,
multiplexers, and accompanying levels of logic to implement the
address decoders.

I think I'm going to stay with simple for now and put this in my
back pocket as "Plan B".

I greatly appreciate the help.

Rob Doyle

Brian Drummond

Dec 29, 2009, 7:16:41 PM
On Tue, 29 Dec 2009 14:37:18 -0700, Rob Doyle <radi...@gmail.com> wrote:

>rickman wrote:
>> On Dec 28, 3:40 am, whygee <y...@yg.yg> wrote:
>>> hello,
>>>
>>> Amal wrote:
>>>> Xilinx supports 3-port memories as well if it helps. You can have one
>>>> read/write port and two read ports with 3 different addresses.

>I could save a clock cycle in the ALU if I could do two writes
>and three reads. If I have to stall the pipeline to implement
>this, I've gained nothing.
>
>The timing won't permit 2 Register clock cycles per ALU clock cycle
>to double-up the register accesses.
>
>The multi-port "flag" memory is the trick I was looking for. The ALU
>has 1024 registers so I can envision some tall data selectors,
>multiplexers, and accompanying levels of logic to implement the
>address decoders.

Another trick, IF you have some control over the instruction set, is to split
the register set in two. Then you can schedule two writes, provided they are to
different halves. (The read side can appear as a single unified regset)

With the large number of registers a BRAM gives you, it is likely that there
will always be "vacant" slots in each half. So this restriction can usually be
hidden by register allocation policy, if you get to dictate such restrictions to
whoever is writing the code generator.

Essentially you trade simpler hardware for some additional complexity in
software; the compiler writers make that invisible to the user.

- Brian

John_H

Jan 1, 2010, 6:13:32 PM
On Dec 29 2009, 10:07 am, Selensky <selen...@gmail.com> wrote:
>
> Which Xilinx devices support a native 3-port mode, without
> duplicating memories? Is there any constraint on using this
> operation mode combined with different data aspect ratios? I had heard
> something about native support of 3-port (two read and one write port
> with 3 different addresses) in Altera devices only.
>
> Selensky

Any good synthesizer can take one memory array with one write and
multiple reads and effectively replicate the memories without manual
intervention from the user. As long as the memory inferences are
written in a way which properly instantiates one memory, having
multiple reads inferred in the same "structure" automatically
replicates the memories delivering post-synthesis names for the memory
elements that are slightly different.
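
As an illustration of the kind of inference described here, the sketch below declares one logical array with one write and three reads. The module name, port names and widths are hypothetical, and whether a given tool replicates it into block RAM or LUT RAM is an assumption to confirm in the synthesis report.

```verilog
// Sketch only: one logical memory, one write, three reads.
module regfile_1w3r_inferred #(
    parameter AW = 5,              // address bits (assumed)
    parameter DW = 32              // data bits (assumed)
) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] wa,
    input  wire [DW-1:0] wd,
    input  wire [AW-1:0] ra1, ra2, ra3,
    output reg  [DW-1:0] rd1, rd2, rd3
);
    // A single array in the source; replication into one physical RAM
    // per read port is left to the synthesizer.
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (we) mem[wa] <= wd;
        // Registered reads with three independent addresses.
        rd1 <= mem[ra1];
        rd2 <= mem[ra2];
        rd3 <= mem[ra3];
    end
endmodule
```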

John_H

Jan 1, 2010, 6:20:28 PM
On Dec 29 2009, 4:37 pm, Rob Doyle <radioe...@gmail.com> wrote:
>
> I *need* one write port and three read ports - so I'm OK just
> duplicating the RAM.
>
> I could save a clock cycle in the ALU if I could do two writes
> and three reads.  If I have to stall the pipeline to implement
> this, I've gained nothing.
>
> The timing won't permit 2 Register clock cycles per ALU clock cycle
> to double-up the register accesses.
>
> The multi-port "flag" memory is the trick I was looking for.  The ALU
> has 1024 registers so I can envision some tall data selectors,
> multiplexers, and accompanying levels of logic to implement the
> address decoders.
>
> I think I'm going to stay with simple for now and put this in my
> back pocket as "Plan B".
>
> I greatly appreciate the help.
>
> Rob Doyle

If you wanted fewer registers (1024, really?) there's a nice technique
that can use LUT RAMs to provide (combinatorially) the read values you
want with two write ports. Two writes to the same address would
result in an undefined value but avoiding that condition results in
seamless operation. Two write ports with three reads would use 8 dual-
port LUT RAM arrays - (write_ports x (read_ports+1) ).

The reason LUT RAMs are needed is the operation is a read-modify-
write.

One could get around the read-modify-write need by delaying the write
one cycle, but then a way of selecting between the read value and the
delayed write value is still needed.

I can provide more detail if needed. Multiple write ports start to
eat resources but they're doable if the performance gain is worth the
resource loss.

Peter Alfke

Jan 1, 2010, 7:54:56 PM

From the bowels of my computer I resurrected a file written more than
3 years ago:

Using Virtex-5 CLB as Multi-Port Memory

The four M-LUTs in a half-CLB can be combined to form a quad-port RAM,
ideally suited for register-file applications.
The four LUTs, called A, B, C, and D are configured in such a way that
the write address applied to D is automatically also multiplexed onto
the write addressing of LUTs A, B, and C.
Writing into D thus also writes into the same location in A, B, and C,
but these three LUTs have their address inputs still available as read
addresses. (In this application, LUT D is never read.)
The structure functions as a quad-port RAM with one write port
(address applied to D) and common data written into LUTs A, B, and C.
There are three independent read ports (addresses applied to LUTs A,
B, and C.) Writing is synchronous, reading is combinatorial.
Each LUT can either be a 64 x 1, or a 32 x 2 RAM.

A similar structure, using common read addresses and individual Data
inputs, acts as simple dual-port memory, either 3 bits wide and 64
deep, or 6 bits wide and 32 deep.

In the Virtex-5 MicroBlaze application, the 32 x 32 register file with
one write port and two read ports, using 384 LUTs in Virtex-4, is
reduced to 44 LUTs, a saving of over 88%.

Peter Alfke, 3-21-06

whygee

Jan 2, 2010, 4:06:08 AM
wow, a great new year's present :-)))

Peter Alfke wrote:
> From the bowels of my computer I resurrected a file written more than
> 3 years ago:
>
> Using Virtex-5 CLB as Multi-Port Memory

<snip>


> In the Virtex-5 MicroBlaze application, the 32 x 32 register file with
> one write port and two read ports, using 384 LUTs in Virtex-4, is
> reduced to 44 LUTs, a saving of over 88%.
>
> Peter Alfke, 3-21-06

Any more information, diagram, schematics, source code,
appnote, or whatever, would be really appreciated :-)

thanks and greetings,

John_H

Jan 2, 2010, 11:49:48 AM

Greetings Peter, always a pleasure.

You describe a physical implementation which folds beautifully into
the Xilinx fabric showing how few routing resources are needed to
implement the bits of the multi-port (read) memories. But isn't this
precisely what one gets when inferring a single port write, multi-port
read memory through HDL?

For that, I wouldn't think example code would be needed since the
inference can have the same physical implementation you describe.
Without explicit placement constraints, both inferred and instantiated
methods are left to the Place & Route to fold everything for each bit
into single CLBs, aren't they? It's certainly easier to apply those
constraints if the designer defines the names for each instance in the
first place.

The bigger challenge raised in this thread is the multi-port with two
write ports which can be performed in CLBs very nicely but with a
little overhead.

If the reader has no interest in multi-port writes with CLB memories,
you can ignore the rest of the message.


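// n, m, clk, we*, wa*, ra* and wdata* are assumed to be declared elsewhere.
reg  [n:0] m1 [m:0], m2 [m:0];   // one bank per write port
wire [n:0] rd1, rd2, rd3;
// Each port stores its data XORed with the other bank's current word.
always @(posedge clk) if( we1 ) m1[wa1] <= wdata1 ^ m2[wa1];
always @(posedge clk) if( we2 ) m2[wa2] <= wdata2 ^ m1[wa2];
// XORing both banks on a read recovers the last value written.
assign rd1 = m1[ra1] ^ m2[ra1];
assign rd2 = m1[ra2] ^ m2[ra2];
assign rd3 = m1[ra3] ^ m2[ra3];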

Since m1 is read at 4 unique addresses (wa2, ra1, ra2, ra3), the
inferred memory will be replicated for 4 total copies.
Since 2 memories are needed for 2 writes, there are 2 sets of these 4
memory copies.

Since writing a value to m1 doesn't affect m2, reading that address
later results in
rdx == m1[was_wa1] ^ m2[was_wa1]
    == (was_wdata1 ^ m2[was_wa1]) ^ m2[was_wa1]
    == was_wdata1

The XOR on the input and the output resurrects the original data
written to that port independent of which read memory accesses it.
The one caveat: two writes to the same address on the same clock
leave that location reading as old ^ wdata1 ^ wdata2, which is neither
written value (the data is unchanged only if both ports write the same
value). Priority could be assigned to one write port by disabling the
write on the other port when a conflict is detected instead.
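
One way the priority option could be sketched is shown below, wrapped in a self-contained module. The module name, the `we2_ok` signal, the widths and the choice of port 1 as the winner are all assumptions made for illustration. The `initial` block is there so that simulation does not propagate X through the XORs, and it relies on the tool honoring RAM initial values.

```verilog
// Sketch only: XOR-based 2-write, 3-read memory with port 1 given priority.
module regfile_2w3r_xor #(
    parameter AW = 5,              // address bits (assumed)
    parameter DW = 32              // data bits (assumed)
) (
    input  wire          clk,
    input  wire          we1, we2,
    input  wire [AW-1:0] wa1, wa2,
    input  wire [DW-1:0] wdata1, wdata2,
    input  wire [AW-1:0] ra1, ra2, ra3,
    output wire [DW-1:0] rd1, rd2, rd3
);
    reg [DW-1:0] m1 [0:(1<<AW)-1];
    reg [DW-1:0] m2 [0:(1<<AW)-1];

    // Zero both banks so unwritten registers read as zero and the
    // simulator never XORs against an unknown word.
    integer i;
    initial for (i = 0; i < (1<<AW); i = i + 1) begin
        m1[i] = {DW{1'b0}};
        m2[i] = {DW{1'b0}};
    end

    // Assumed policy: port 2 is suppressed on a same-address conflict.
    wire we2_ok = we2 & ~(we1 & (wa1 == wa2));

    always @(posedge clk) if (we1)    m1[wa1] <= wdata1 ^ m2[wa1];
    always @(posedge clk) if (we2_ok) m2[wa2] <= wdata2 ^ m1[wa2];

    // Asynchronous reads, as LUT RAM provides.
    assign rd1 = m1[ra1] ^ m2[ra1];
    assign rd2 = m1[ra2] ^ m2[ra2];
    assign rd3 = m1[ra3] ^ m2[ra3];
endmodule
```

The address comparator sits in the write-enable path of port 2 only, so the read paths keep the same depth as in the version without priority.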

This is where CLB SelectRAM design gets interesting and fun!
