Efficient Multi-Ported Memories for FPGAs


Eric

Apr 20, 2010, 1:00:13 PM
Hi All,

I've recently published a paper exploring how to implement memories
with multiple read and write ports on existing FPGAs. I figured it
might be of interest to some.

Summary, paper, slides, and example code are here:
http://www.eecg.utoronto.ca/~laforest/multiport/index.html

There are no patents or other additional IP encumbrances on the code.
If you have any comments or other feedback, I'd like to hear it.

Eric LaForest
PhD student, ECE Dept.
University of Toronto
http://www.eecg.utoronto.ca/~laforest/

John_H

Apr 21, 2010, 6:26:17 AM

Could you mention here or on your page what you mean by
"multipumping?" If you mean time multiplexed access, I can see why
multipumping is bad. [The "pure logic" approach also isn't obvious.]

Do you update the LVT in the same way I might update the RAM value in
a many-write BlockRAM? To implement a many-write BlockRAM, each
"write bank" is maintained separately. To write a new value, only the
write bank for the associated write port is updated but with the XOR
of the write data with the reads for all the other write banks at the
same address. By reading the same address from all write banks, the
XOR of all the reads is the most recent write value (unless there are
multiple writes in the same clock cycle). Since the LVT only has to
be as wide as the number of entries, I can see how this would be very
beneficial for wide data and many write ports.

Aside from wide data, however, I don't see (without going into the
attachments on that page) how updating the LVT is any different than
updating the memory in the first place.

Eric

Apr 22, 2010, 10:52:16 AM
On Apr 21, 6:26 am, John_H <newsgr...@johnhandwork.com> wrote:
> Could you mention here or on your page what you mean by
> "multipumping?"  If you mean time multiplexed access, I can see why
> multipumping is bad.  [The "pure logic" approach also isn't obvious.]

Yes, multipumping is time-multiplexing. It's not entirely bad: if there
is leftover speed margin, you can trade it for area by multipumping, and
it is useful if you have few ports or low speed requirements.

Pure logic refers simply to using only the reconfigurable fabric of
the FPGA to implement the memory. It's not a very scalable
solution. :)

> Do you update the LVT in the same way I might update the RAM value in
> a many-write BlockRAM?

No. We've had several independent mentions of using XOR, but we hadn't
heard of it at the time. We'll be looking at it in the future. The LVT
is implemented in pure logic and has multiple read and write ports
which can all work simultaneously. It remains practical because it is
very narrow: only ceil(log2(number of write ports)) bits per entry,
e.g. 2 bits for 4 write ports, instead of the full data word width.

> Aside from wide data, however, I don't see (without going into the
> attachments on that page) how updating the LVT is any different than
> updating the memory in the first place.

The LVT manages a bunch of Block RAMs with only one write and one read
port, making them all behave as a single multiported memory. The LVT
simply keeps track of which port last wrote to each address. Since the
actual data is stored in Block RAMs, the end result is faster and more
area efficient than other approaches.

Please let me know if you have more questions.

Eric

rickman

Apr 22, 2010, 12:36:53 PM

I guess I don't understand what you are accomplishing with this.
Block rams in FPGAs are almost always multiported. Maybe not N way
ported, but you assume they are single ported when they are dual
ported.

Can you give a general overview of what you are doing without using
jargon? I took a look and didn't get it at first glance.

Rick

Eric

Apr 22, 2010, 1:55:41 PM
On Apr 22, 12:36 pm, rickman <gnu...@gmail.com> wrote:
> I guess I don't understand what you are accomplishing with this.
> Block rams in FPGAs are almost always multiported.  Maybe not N way
> ported, but you assume they are single ported when they are dual
> ported.

But what if you want more ports, say 2-write/4-read, without wait
states?
I assume them to be "simply dual-ported", which means one write port
and one read port, both operating concurrently. It is also possible to
run them in "true dual port" mode, where each port can either read or
write in a cycle. Some of the designs in the paper do that.

> Can you give a general overview of what you are doing without using
> jargon?  I took a look and didn't get it at first glance.

OK. Let me try:

Assume a big, apparently multiported memory of some given capacity and
number of ports. Inside it, I use a small multiported memory
implemented using only the fabric of an FPGA, which stores only the
number of the write port which wrote last to a given address. Thus
this small memory is of the same depth as the whole memory, but much
narrower, hence it scales better.

When you read at a given address from the big memory, internally you
use that address to look up which write port wrote there last, and use
that information to steer the read to the correct internal memory bank
which will hold the data you want. These banks are built up from
multiple Block RAMs so that each bank has one write port and as many
read ports as the big memory appears to have.

The net result is a memory which appears to have multiple read and
write ports which can all work simultaneously, but which leaves the
bulk of the storage to Block RAMs instead of the FPGA fabric, which
makes for better speed and smaller area.
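
To make that concrete, here is a rough behavioural model in Python (not
the example code from the page; the names are made up, and it glosses
over replicating each bank per read port):

class LVTMultiPortRAM:
    """Behavioural sketch of an LVT-style multi-ported memory.

    Each write port owns a bank (a plain list here; on the FPGA each bank
    is built from Block RAMs and replicated so every read port has its
    own copy). The LVT records, per address, which write port wrote last.
    """

    def __init__(self, depth, n_write_ports):
        self.banks = [[0] * depth for _ in range(n_write_ports)]
        self.lvt = [0] * depth   # only log2(n_write_ports) bits wide in hardware

    def write(self, port, addr, data):
        self.banks[port][addr] = data   # data goes into that port's own bank
        self.lvt[addr] = port           # remember who wrote this address last

    def read(self, addr):
        # Steer the read to whichever bank holds the most recent write.
        return self.banks[self.lvt[addr]][addr]


mem = LVTMultiPortRAM(depth=16, n_write_ports=2)
mem.write(0, 5, 0xAAAA)
mem.write(1, 5, 0xBBBB)      # later write to the same address from port 1
assert mem.read(5) == 0xBBBB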

Does that help?

Eric

John_H

Apr 22, 2010, 6:45:54 PM

I appreciate the elaboration here in the newsgroup.

The "true dual port" nature of the BlockRAMs allows one independent
address on each of the two ports with a separate write enable for each
port. The behavior of the BlockRAM can be modified to provide read
data based on the new write data, old data, or no change in the read
data value from last cycle (particularly helpful for multi-pumping).

For an M-write, N-read memory, your approach appears to need M x (N+1)
memories, since M writes can all happen at the same time that N reads hit
the same "most recently written" memory. Please correct me if I'm wrong.
This is the same number of memories required with the XOR approach, but
without the LVT overhead. The time delay in reading the LVT and
multiplexing the memories feels like it would be cumbersome. While this
might not add "wait states", it appears the system would not be able to
run terribly quickly. XORs are pretty quick.

There are always more ways to approach a problem than any one group
can come up with. Kudos on your effort to bring a better approach to
a tough system-level issue for difficult designs.

Weng Tianxiang

Apr 23, 2010, 10:39:42 PM

John_H,
What is the XOR method in this regard? Can you give an explanation, or
a hint as to its source?

Weng

Weng Tianxiang

Apr 25, 2010, 12:03:53 AM

Eric,
Here is my feedback on your paper.

1. It is an excellent paper. If I were to give it a mark from 0 to 100,
I would give it 90.

2. It contains a genuinely inventive idea and solidly resolves a real
problem: given a currently available FPGA chip, how to build a RAM with
the required number of read/write ports out of the available block RAMs,
which have a limited, fixed number of ports. In this regard, your method
is the best! I will use it if I need to.

3. The references are good, but they miss the most important items:
patents in this area filed by Altera and Xilinx. They must have patented
their techniques; big or small, both companies patent everything they
consider important.

4. Extending your idea toward new block RAMs with multiple write/read
ports built into the FPGA silicon is not a good direction: there is
essentially no difficulty in building a block RAM with multiple
write/read ports. If you add 4 sets of read decoders and 4 sets of write
decoders to a current block RAM, it immediately has 4 write and 4 read
ports, with no data races in the design at all. The question is whether
there is enough demand in the market for FPGA manufacturers to do so.

5. Your final conclusion about write-and-read scheduling is not right.
When people use your method, they still face write-and-read scheduling.
For example, suppose there is a waiting pool that receives write and read
requests and can hold 200 of each. Deciding which write and read requests
to deliver to your memory's ports is still a write-and-read scheduling
problem. So your method doesn't eliminate the write-and-read scheduling
problem; you can only say that your solution doesn't create a new one,
and that it imposes a write/read rule: if a write and a read to the same
address are issued in the same clock cycle, the read returns the data
that is about to be overwritten by that write.


Here is an answer to John_H and Rick:


> > For an M-write, N-read memory, your approach appears to need M x (N+1)
> > memories, since M writes can all happen at the same time that N reads hit
> > the same "most recently written" memory. Please correct me if I'm wrong.

John_H is wrong. It needs M columns of block RAM plus one more column
used as the LVT. The LVT column needs M write ports and N read ports,
and its data width is ceil(log2(M)). Each write port writes its data
into its own block RAM column and at the same time writes its column
number into the LVT, so that the LVT stores, for each address, the
number of the column that wrote there last.

On a read, the read port first reads the LVT to get the number of the
column holding the latest write to that address, then reads that column
(block RAM) into its read output register.

It is really a clever idea !!!

Weng

John_H

Apr 25, 2010, 10:57:31 PM
On Apr 25, 12:03 am, Weng Tianxiang <wtx...@gmail.com> wrote:
> Each write port writes its data into its own block RAM column and at the
> same time writes its column number into the LVT, so that the LVT stores,
> for each address, the number of the column that wrote there last.
>
> On a read, the read port first reads the LVT to get the number of the
> column holding the latest write to that address, then reads that column
> (block RAM) into its read output register.
>

So what happens when multiple reads all want to read from the same
memory, because that happens to be the bank where the last value was
written for each of their addresses?

As to the XOR, I don't have code to share; I developed it a while ago
for some asynchronous stuff and it applies well to multi-port writes.

As I try to put together a clearer explanation, I find I may have been
wrong about the memory count for the XOR approach such that the LVT
would use fewer. I still believe the LVT approach requires M*N
BlockRAMs for an M-write, N-read multi-port memory plus the LVT; I'm
having trouble remembering why I thought the "+1" was needed. The XOR
approach appears to need M*(M-1+N) memories.
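
To put numbers on those two formulas (my arithmetic, using the 3-write,
4-read example below):

# BlockRAM counts for an M-write, N-read multi-port memory, per the two
# formulas above; the LVT's own storage sits in the fabric, not BlockRAM.
def lvt_brams(m_write, n_read):
    return m_write * n_read

def xor_brams(m_write, n_read):
    return m_write * (m_write - 1 + n_read)

print(lvt_brams(3, 4))   # 12, plus the LVT itself
print(xor_brams(3, 4))   # 18 -- the "3 sets of 6" worked through below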

If you have 3 write ports and 4 read ports, you'll need 3 sets of *6*
memories. The 6 memories in the "write bank" set all have the same
write address and write data corresponding to that write port. When a
write occurs, the *read* value for that write address is accessed from
the other two write banks so each set of 6 must provide a read address
for the other two write bank addresses. The four reads have their own
memories for their own addresses in each of the write banks.

When the write occurs, the read values for that write address are
retrieved from the other write banks, and the XOR of those values with
the new write data is written to all 6 memories in that write port's
bank. When a read is performed, the value at the read address within
each write bank is retrieved and the values (3 in this case) are XORed
to get the original data.

newDataWr0^oldDataWr1^oldDataWr2 overwrites oldDataWr0 for the write.

The later reads retrieve oldDataWr0, oldDataWr1, and oldDataWr2, but
since oldDataWr0 was updated to newDataWr0^oldDataWr1^oldDataWr2,
the XOR of the three read values is
oldDataWr0^oldDataWr1^oldDataWr2 ==
(newDataWr0^oldDataWr1^oldDataWr2)^oldDataWr1^oldDataWr2 ==
newDataWr0^(oldDataWr1^oldDataWr1)^(oldDataWr2^oldDataWr2) ==
newDataWr0^(0)^(0) ==
newDataWr0

So with a little coordination, the XOR approach requires M*(M-1+N)
BlockRAMs for an M write, N read port memory along with the XORs.
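
A behavioural sketch of the scheme in Python, as I've described it (my
own naming; the physical replication per read and write port is
collapsed into one array per write bank, so it shows only the XOR
bookkeeping, not the BlockRAM count):

class XorMultiPortRAM:
    """Behavioural sketch of the XOR-based M-write, N-read memory
    described above. Each write port owns a bank; a read XORs the same
    address across all banks. In hardware each bank is further replicated
    so every read port, and every other write port, gets its own copy.
    """

    def __init__(self, depth, n_write_ports):
        self.banks = [[0] * depth for _ in range(n_write_ports)]

    def write(self, port, addr, data):
        others = 0                       # XOR of the other banks' old data
        for p, bank in enumerate(self.banks):
            if p != port:
                others ^= bank[addr]
        self.banks[port][addr] = data ^ others   # cancels out on a later read

    def read(self, addr):
        value = 0
        for bank in self.banks:
            value ^= bank[addr]          # XOR across all write banks
        return value


mem = XorMultiPortRAM(depth=16, n_write_ports=3)
mem.write(0, 7, 0x1234)
mem.write(2, 7, 0x5678)      # a later write from another port wins
assert mem.read(7) == 0x5678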

The LVT needs a memory for each write port but requires multiples of
them to accommodate every read port in case the multiple reads for any
one cycle are all from the same write bank for the most recently
updated value.

Depending on the complexity of the LVT, the number of write ports, and
the allowable latencies, the LVT could be a more effective approach.

Weng Tianxiang

Apr 26, 2010, 12:04:13 AM

I don't see any problem reading from any of the 9 block RAMs into one
read register using a select value, or into any number of read registers.

Your XOR method is inferior to the LVT method absolutely, even though
I have no full knowledge of your XOR method.

The LVT method is very clean, very fast, easy to implement, and so
flexible that even 2 block RAMs, each with 2 independent write ports and
2 read ports, can easily be expanded into 4 write ports and 4 read ports.

Weng

John_H

Apr 26, 2010, 7:31:03 AM

There are engineering tradeoffs in everything. You surprise me by
saying my approach is inferior "absolutely." Bad form.

I reread your post after adding mine. It looks like you believe some
form of scheduling is still needed. That's what the LVT approach was
trying to overcome! Or at least I thought it was.

Have you ever tried to implement a 512-entry distributed CLB
SelectRAM? The latency and resources are pretty extreme. If one has
a design with no clocking concerns and lots of room for logic in the
FPGA, the LVT approach is stellar. If the clock rate is a serious
concern and there are plenty of memories left over, the XOR approach
excels. "Absolutely."

If the LVT approach is not designed to allow next-cycle reads, the
approach has little merit over simpler approaches like "multipumping"
or cache-based write queues. But the LVT *can* handle multiple reads
by having one memory for each read port associated with each write
memory.

It's all tradeoffs. There are absolutely no absolutes.

Eric

Apr 26, 2010, 10:23:12 AM
On Apr 25, 10:57 pm, John_H <newsgr...@johnhandwork.com> wrote:
> As to the XOR, I don't have code to share; I developed it a while ago
> for some asynchronous stuff and it applies well to multi-port writes.

The idea is definitely floating around: multiple people have
independently suggested it to me after seeing the LVT approach. It's
certainly an interesting approach.

> As I try to put together a clearer explanation, I find I may have been
> wrong about the memory count for the XOR approach such that the LVT
> would use fewer.  I still believe the LVT approach requires M*N
> BlockRAMs for an M-write, N-read multi-port memory plus the LVT; I'm
> having trouble remembering why I thought the "+1" was needed.  The XOR
> approach appears to need M*(M-1+N) memories.
>
> If you have 3 write ports and 4 read ports, you'll need 3 sets of *6*

> memories. <snip>

Yep, you are right. The number of BlockRAMs required is M*N, plus the
LVT (which uses no Block RAMs).
Thanks for the explanation of the XOR design. You're the first to do
so. It makes a lot of sense, except I don't see why you need the extra
M-1 memories on the read side?

> The LVT needs a memory for each write port but requires multiples of
> them to accommodate every read port in case the multiple reads for any
> one cycle are all from the same write bank for the most recently
> updated value.

*Exactly*

> Depending on the complexity of the LVT, the number of write ports, and
> the allowable latencies, the LVT could be a more effective approach.

It tends to be, given the research so far, but if an extra cycle of
latency (to do the reads to get the decoding data) is acceptable for a
design, the XOR approach could be very useful. The LVT does add delay,
but it's still faster than the alternatives I explored for an
arbitrary number of ports (for small numbers, pure multipumping works
better).

Eric

Apr 26, 2010, 10:32:45 AM
On Apr 25, 12:03 am, Weng Tianxiang <wtx...@gmail.com> wrote:
> 5. Your final conclusion about write-and-read scheduling is not right.
> When people use your method, they still face write-and-read scheduling.
> For example, suppose there is a waiting pool that receives write and read
> requests and can hold 200 of each.
<snip>

There is no read/write scheduling problem *within a cycle*. If you
have more pending reads/writes than there are ports, then of course
there will always be a scheduling problem, but that's a different
problem more akin to that solved by reorder buffers in a dynamically
scheduled CPU.

As for simultaneous reads and writes to the same address, the
behaviour is a function of how the underlying Block RAMs are
configured. For a read and a write to the same address, you can
specify that the read port returns either the old value or the current
one being written. This doesn't affect the rest of the LVT-based
design.

In the case of two simultaneous *writes* to the same address, the
default behaviour is undefined, as in most other multiported memories.
However, there is no *physical* conflict in the banks of Block RAMs,
only in the LVT. So you can go ahead and store both writes and decide
what to record in the LVT as part of your conflict-resolution logic
(e.g. lower port number has priority, and so on).
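
A tiny sketch of that arbitration (hypothetical helper, not from the
example code on the page): every bank still latches its own data; only
the LVT entry needs arbitrating.

# When several write ports hit the same address in one cycle, the data
# lands in each port's own bank regardless; the LVT entry records
# whichever port the chosen policy picks as the "winner".
def lvt_winner(writing_ports):
    return min(writing_ports)    # example policy: lowest port number wins

assert lvt_winner([3, 1]) == 1   # later reads of that address see port 1's data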

John_H

Apr 26, 2010, 6:52:30 PM
On Apr 26, 10:23 am, Eric <eric.lafor...@gmail.com> wrote:
>
> Yep, you are right. The number of BlockRAMs required is M*N, plus the
> LVT (which uses no Block RAMs).
> Thanks for the explanation of the XOR design. You're the first to do
> so. It makes a lot of sense, except I don't see why you need the extra
> M-1 memories on the read side?

Because the value written is the DataIn XORed with the old data at the
new WrAddr from the other write sets, each write address needs a read
from the other write memory sets.

For M write ports, there are M write sets. Each write set has the N read
memories for the N-port read and also has M-1 read memories addressed by
the other write ports' addresses, to complete the XORed data value
stored into that write set.
Eric

Apr 27, 2010, 8:59:48 AM
On Apr 26, 6:52 pm, John_H <newsgr...@johnhandwork.com> wrote:
<snip>

> For M write ports, there are M write sets. Each write set has the N read
> memories for the N-port read and also has M-1 read memories addressed by
> the other write ports' addresses, to complete the XORed data value
> stored into that write set.

Ah. Of course! This way you don't have to wait for the read ports to
be available to get the data you need to do the XORed write. Thank
you!

That's funny. I was pondering the impact of the additional read cycle
this solution seemed to imply, but if I understand this, plain
replication is the solution. :)

You mentioned earlier that you had implemented this at some point in
the past. Could you tell me more about where you heard about this
(domain, application, etc...) or did you come up with it yourself? I'm
just trying to suss out the common origin of the multiple XOR
suggestions I've received.

John_H

Apr 27, 2010, 6:32:58 PM
On Apr 27, 8:59 am, Eric <eric.lafor...@gmail.com> wrote:
>
> You mentioned earlier that you had implemented this at some point in
> the past. Could you tell me more about where you heard about this
> (domain, application, etc...) or did you come up with it yourself? I'm
> just trying to suss out the common origin of the multiple XOR
> suggestions I've received.

This was original development. It started from flag management
between two asynchronous domains. Then it moved to data manipulation
between two async domains. Extending it to larger memories just
flowed.

John_H

Apr 27, 2010, 8:16:02 PM
On Apr 27, 6:32 pm, John_H <newsgr...@johnhandwork.com> wrote:
>
> This was original development.

A little clarification:
Many things that are "good" but not immediately obvious have already
been done. After I'd been using the technique for async transfers and
had talked about it on this newsgroup, applied to multiport memories, a
couple of years back, I found the technique was already in use.

Like the rhombic dodecahedron and triacontahedron I "invented" in high
school (the Greeks beat me to it by far), those shapes and the XOR
technique only seem novel to those who haven't been exposed to them.
