TileLink questions and NVDLA performance

Tynan McAuley

Jul 2, 2020, 12:48:05 PM
to Chipyard

TileLink questions and the NVDLA

Hello! I'm working on an SoC based on Chipyard, and we're using the NVDLA in the design. I'm currently exploring the NVDLA's performance using FireSim and some RTL simulations. This conversation comes from an issue I filed here, but this group seemed like the more appropriate place to discuss it.

For a quick background, the NVDLA has an integrated AXI4-based DMA engine (the DBB interface, or Data Backbone), which it uses:
  • to read neural network layer weights and input feature maps from system memory, and
  • to write output feature maps to system memory.
This AXI4 interface is converted to TileLink here:

// TL <-> AXI
(dbb_tl_node
  := TLBuffer()
  := TLWidthWidget(dataWidthAXI/8)
  := AXI4ToTL()
  := AXI4UserYanker(capMaxFlight=Some(16))
  := AXI4Fragmenter()
  := AXI4IdIndexer(idBits=3)
  := AXI4Buffer()
  := dbb_axi_node)


This TileLink node is connected to the FrontBus here:

fbus.fromMaster(name = Some("nvdla_dbb"), buffer = BufferParams.default) {
  TLBuffer.chainNode(p(NVDLAFrontBusExtraBuffers))
} := nvdla.dbb_tl_node


The fromMaster method is defined here:

def fromMaster[D,U,E,B <: Data]
    (name: Option[String] = None, buffer: BufferParams = BufferParams.none)
    (gen: => NodeHandle[D,U,E,B,TLClientPortParameters,TLManagerPortParameters,TLEdgeOut,TLBundle] =
      TLNameNode(name)): InwardNodeHandle[D,U,E,B] = {
  from("master" named name) {
    inwardNode :=* TLBuffer(buffer) :=* TLFIFOFixer(TLFIFOFixer.all) :=* gen
  }
}


This adds a TLFIFOFixer between that AXI4-to-TileLink converter stack and the FrontBus. As I've found out in my performance experiments, the TLFIFOFixer widget prevents multiple outstanding TileLink transactions from being issued by the NVDLA to the L2 Cache. As you might imagine, this is a huge performance bottleneck in a design that should be able to issue read and write requests back-to-back on consecutive cycles.

My understanding of the TLFIFOFixer comes mainly from here. From this description, it seems like in this case, the TLFIFOFixer is intended to maintain ordering of the NVDLA's TileLink responses relative to the responses of other TileLink clients. I don't understand why that's necessary, or why that should result in this one-transaction-in-flight-at-a-time behavior.

In any case, I removed the TLFIFOFixer from the NVDLA-to-FrontBus connection by swapping the fromMaster call with the coupleFrom method, defined here.

So now Periphery.scala uses this code rather than "fbus.fromMaster(...)":

fbus.coupleFrom("nvdla_dbb") { bus =>
  (bus
    := TLBuffer(BufferParams(p(NVDLAFrontBusExtraBuffers)))
    := nvdla.dbb_tl_node)}


After I made this change, performance drastically improved, but the NVDLA started producing incorrect inference results. So I enabled the TLMonitors in the design, and once the NVDLA started issuing some DMA traffic, the TLMonitors threw an exception at the NVDLA's AXI4ToTL widget, saying that a TileLink "source" field was being reused from an existing in-flight transaction. It makes sense that this would result in the AXI4ToTL widget returning incorrect data to the NVDLA, since the "source" field must uniquely identify an in-flight TileLink transaction.

In the end, I've got a few questions:
  • Why is the TLFIFOFixer necessary in this case? Or is it?
  • When removing the TLFIFOFixer, what is leading this AXI4-to-TileLink converter stack to issue illegal TileLink transactions?
  • Am I going about this wrong, and is there some other way to resolve this memory performance bottleneck?
Thanks so much for any help or clarification on TileLink behavior!

-Tynan

Howard Mao

Jul 6, 2020, 10:52:48 PM
to Chipyard
I agree that FIFO ordering shouldn't be necessary here. The NVDLA wrapper should probably be using "fromPort" instead of "fromMaster". I'll try changing it and see if it complains. 


Howard Mao

Jul 6, 2020, 11:51:34 PM
to Chipyard
Actually, nevermind. It probably should use coupleFrom as you did. I suspect there's something wrong with the chain of widgets that the NVDLA wrapper puts between dbb_axi_node and the AXI to TL converter. Can you tell me how you are running your tests?

Tynan McAuley

Jul 6, 2020, 11:57:55 PM
to Chipyard
Thanks for taking a look!

Looking at these two methods (fromPort and fromMaster), isn't the only difference the port name that gets passed to the "from" function? They both add a TLFIFOFixer into the connection path.

-Tynan

Howard Mao

Jul 7, 2020, 12:43:38 AM
to Chipyard
Oh, you're right. I never noticed this before because I only used fromPort with masters that don't set "requestFIFO", which means the FIFO fixer doesn't do anything to them. In this case, it's likely the AXI4Fragmenter that causes "requestFIFO" to get set.
A possible easy fix is to change the UserYanker so that it only allows a single in-flight request per ID. Could you try this out on your end? It might be quicker than me trying to get my test environment set up. In NVDLA.scala, change

AXI4UserYanker(capMaxFlight=Some(16))

to 

AXI4UserYanker(capMaxFlight=Some(1))

Then see if it fixes your TLMonitor violations.

Tynan McAuley

Jul 7, 2020, 6:59:12 PM
to Chipyard
Ya, I also ran that same experiment, and that does solve the assertion errors. Granted, we're still stuck with poor performance, because the NVDLA is trying to issue many back-to-back requests with the same AXI ID, and we're only able to service them one-at-a-time.

I dug into the AXI4ToTL block a bit more, and I feel that I have a good understanding of the problem. The root cause is how multiple outstanding TileLink requests interact with AXI's per-ID ordering assumptions for outstanding requests. To start with, the "Operation Ordering" section of the TileLink spec says:

> Within a TileLink network, there may be multiple outstanding operations inflight at any given time. These operations may be completed in any order.

Section 5.1 of the AXI4 spec says:

> All transactions with a given AXI ID value must remain ordered, but there is no restriction on the ordering of transactions with different ID values.

So, when the "capMaxFlight" parameter in the AXI4UserYanker is set to more than 1, AXI4ToTL will send multiple TileLink transactions for requests coming from a single AXI ID. However, AXI4ToTL has no logic to deal with out-of-order TileLink responses. For sending read requests, this is the logic AXI4ToTL uses to generate multiple TileLink requests for AXI requests on a single AXI ID:

val r_count = RegInit(Vec.fill(numIds) { UInt(0, width = txnCountBits) })
val r_id = if (maxFlight == 1) {
  Cat(in.ar.bits.id, UInt(0, width=1))
} else {
  Cat(in.ar.bits.id, r_count(in.ar.bits.id)(logFlight-1,0), UInt(0, width=1))
}

So, r_count is a vector of counters, one counter for each supported AXI ID. Here's the code for how those counters are incremented:

val r_sel = UIntToOH(in.ar.bits.id, numIds)
(r_sel.asBools zip r_count) foreach { case (s, r) =>
  when (in.ar.fire() && s) { r := r + UInt(1) }
}

As expected, whenever a request is sent, we increment the appropriate counter based on the AXI ID the request came from.

To see the problem with this scheme, think of this setup: we support only 1 AXI ID (so numIds is 1), and we support two transactions in flight for this AXI ID (so maxFlight is 2). We receive two AXI read requests from this single AXI ID, back-to-back, and AXI4ToTL will happily issue two TileLink Get requests back-to-back, each with unique "r_id"s, since r_count is incremented after the first request. After sending the second request, r_count is incremented a second time, which means r_id returns to the original value used for the first TileLink Get request. Now, if the two responses come back out-of-order (maybe request 1 is a cache miss and request 2 is a cache hit), and there's another AXI request waiting to be sent, then when we send request 3, we'll re-use the same TileLink source identifier from request 1, which is still in flight. This is illegal, since all outstanding TileLink requests must be uniquely identifiable by their source IDs.
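
To make that failure mode concrete, here is a small plain-Scala model (not Chisel and not rocket-chip code, just the counter scheme described above) that walks through the numIds = 1, maxFlight = 2 scenario and shows a source ID being reused while it is still in flight:

// Plain-Scala model of the r_count/r_id scheme above, for numIds = 1, maxFlight = 2.
// Source IDs here are just the low counter bits; "inFlight" tracks outstanding TL Gets.
object SourceReuseDemo extends App {
  val maxFlight = 2
  var rCount = 0                         // per-ID counter, as in r_count(id)
  var inFlight = Set.empty[Int]          // TL source IDs currently outstanding

  def issue(tag: String): Int = {
    val sourceId = rCount % maxFlight    // low bits of the counter form the source
    println(s"$tag issued with source $sourceId" +
      (if (inFlight contains sourceId) "  <-- REUSED while still in flight!" else ""))
    inFlight += sourceId
    rCount += 1
    sourceId
  }

  issue("request 1")                     // source 0 (stays in flight below)
  val s2 = issue("request 2")            // source 1
  inFlight -= s2                         // response 2 returns first (out of order)
  issue("request 3")                     // source 0 again, colliding with request 1
}

Running it reports a collision on the third request, which is exactly the reuse the TLMonitor assertion complains about.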

So that's one problem with this system: we re-use TileLink source IDs when responses come back out-of-order, and this leads to the TileLink Monitors triggering assertions. There's one more problem though: let's assume we fixed AXI4ToTL so that it would only issue TileLink requests with valid source IDs when an out-of-order response came back. This response is still invalid from the AXI4 master's perspective, because it expects all of its requests from a single AXI ID to turn into well-ordered responses. Thus, AXI4ToTL would need re-ordering logic for all outstanding requests within a single AXI ID to convert the out-of-order TileLink responses into in-order AXI responses.

Let me know if I explained that well enough, or if you want me to take another crack at it, or if you think I have it completely wrong.

-Tynan

Howard Mao

Jul 8, 2020, 12:24:05 AM
to Chipyard
You're absolutely right. The logic in the converter violates AXI4 and TL ordering rules. The proper way would be to buffer the TL responses so that you can send them back to the AXI requester in the proper order. If you're up for it, I would recommend fixing this in the AXI4ToTL module yourself and then sending a PR to Rocket Chip. I can help out if you have any questions along the way.

Charles Hong

Jul 11, 2020, 2:00:13 AM
to Chipyard
Hi Tynan,

I previously worked with Abe on integrating NVDLA and would be interested in helping fix this bottleneck for some further work. 

Have you managed to make any progress on a fix? I would be happy to help test or provide other assistance as needed.

Thanks,
Charles

Tynan McAuley

Jul 16, 2020, 12:00:06 PM
to Chipyard
Sorry for the silence on this, and thanks for the offers for help!

I've been doing some experiments on this myself, and had this idea, which builds upon the current AXI4ToTL design. Say maxFlight is the number of TileLink transactions that can be in flight at once for each AXI ID, and numIds is the number of supported AXI IDs. When maxFlight is greater than 1, we add in a memory which I'll call the ReorderRAM. This can be a single-ported SRAM. The ReorderRAM has (maxFlight * numIds) entries, and each entry will store the following data:
  • out.d.bits.data
  • out.d.bits.resp
  • out.d.bits.last
  • out.d.bits.user
In parallel with the ReorderRAM, we maintain a Vec of (maxFlight * numIds) single-bit registers, which indicate if each entry in the ReorderRAM is valid. These are initialized to 0.

Furthermore, we maintain two Vecs of counters: read-request counters and read-response counters. Each Vec contains numIds counters, and each counter will count from 0 to (maxFlight - 1). All of these counters are initialized to 0.

The TileLink source field is constructed in the same way as the current AXI4ToTL design, which can be described by this bundle:

class CompositeSourceID(val numIds: Int, val maxFlight: Int) extends Bundle {
  val axi4Id = UInt(log2Ceil(numIds).W)
  val sourceId = UInt(log2Ceil(maxFlight).W)
  val isWrite = Bool()
}
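
To make the bookkeeping concrete, here is a minimal Chisel sketch of the state this scheme would need. The module, port, and parameter names are hypothetical, the request-issue path and the drain FSM are omitted, and it assumes numIds and maxFlight are powers of two (maxFlight > 1) so that the composite {axi4Id, sourceId}, with the read/write bit already stripped, doubles as the ReorderRAM address (axi4Id * maxFlight + sourceId):

import chisel3._
import chisel3.util._

// One ReorderRAM entry: the TileLink D-channel fields we need to replay in order.
class ReorderEntry(dataBits: Int, userBits: Int) extends Bundle {
  val data = UInt(dataBits.W)
  val resp = UInt(2.W)                   // maps back onto AXI RRESP
  val last = Bool()
  val user = UInt(userBits.W)
}

// Read-response bookkeeping only; request issue and the drain FSM are not shown.
class ReorderState(numIds: Int, maxFlight: Int, dataBits: Int, userBits: Int) extends Module {
  private val seqBits = log2Ceil(maxFlight)
  val io = IO(new Bundle {
    val dFire   = Input(Bool())                                 // a read-response beat arrived
    val dSource = Input(UInt((log2Ceil(numIds) + seqBits).W))   // {axi4Id, sourceId}
    val dEntry  = Input(new ReorderEntry(dataBits, userBits))
    val inOrder = Output(Bool())                                // forward straight to AXI R
  })

  // (maxFlight * numIds)-entry single-ported SRAM plus one valid bit per entry.
  val reorderRam = SyncReadMem(numIds * maxFlight, new ReorderEntry(dataBits, userBits))
  val entryValid = RegInit(VecInit(Seq.fill(numIds * maxFlight)(false.B)))

  // Per-AXI-ID counters: next sourceId to hand out, and next sourceId expected back.
  val readReqCount  = RegInit(VecInit(Seq.fill(numIds)(0.U(seqBits.W))))  // request path omitted
  val readRespCount = RegInit(VecInit(Seq.fill(numIds)(0.U(seqBits.W))))

  val respAxi4Id = io.dSource >> seqBits
  val respSeqNum = io.dSource(seqBits - 1, 0)

  // A response is in order iff it carries the sequence number we expect next for its ID.
  io.inOrder := respSeqNum === readRespCount(respAxi4Id)

  when (io.dFire) {
    when (io.inOrder) {
      readRespCount(respAxi4Id) := readRespCount(respAxi4Id) + 1.U
    } .otherwise {
      // Park the out-of-order beat until the drain FSM replays it to the R channel.
      reorderRam.write(io.dSource, io.dEntry)
      entryValid(io.dSource) := true.B
    }
  }
}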


When we make a read request from an AXI ID, we increment the read-request counter corresponding to that AXI ID. The "sourceId" field in the TileLink request for this AXI read request is set to the appropriate read-request counter value.

When a read response comes back, we compare the "sourceId" field of the TileLink response to the read-response counter for the appropriate AXI ID. If those numbers match, the response is in-order, and if not, the response is out-of-order. In-order responses are sent to the AXI master's R channel as per usual, and then the read-response counter for this AXI ID is incremented. Out-of-order responses are written into the ReorderRAM. The address in the ReorderRAM is given by ((out.d.bits.source.axi4Id * maxFlight.U) + out.d.bits.source.sourceId). That same address is used to set the valid bit corresponding to this entry in the ReorderRAM.

Every time an in-order read response comes back, we check the ReorderRAM valid bit for the next read response, located at entry ((out.d.bits.source.axi4Id * maxFlight.U) + out.d.bits.source.sourceId + 1.U). If that entry is valid, then we start reading data out of the ReorderRAM and sending it to the AXI master's R channel. During this time, we set out.d.ready to 0 if the response on the d channel is a read response. Write responses can flow during this time. In this FSM, we walk through the entries in the ReorderRAM for this single AXI ID until we get to an invalid entry. After reading each entry, we mark it as invalid. Once we get to an invalid entry, we exit this FSM and return to normal read-response behavior. Note that during this time, it is also valid to send read requests.

For a particular AXI ID, once the read-request counter and read-response counter equal each other, and there is more than one transaction outstanding for that AXI ID, we know that we have issued maxFlight read requests for that AXI ID, and we must throttle any further read requests for that AXI ID until the read-response counter is incremented.

Upsides:
  • This is a relatively simple design, and doesn't need a huge number of resources.
  • Even with a single-ported, sequential/synchronous-read, sequential/synchronous-write SRAM (SyncReadMem) for the ReorderRAM, we can saturate the AXI master's R channel with data, so long as we get TileLink responses back fast enough.
Downsides:
  • If the AXI master's pattern of requests uses a single AXI ID over and over again, then the utilization of the ReorderRAM will be very poor.
  • If we get an out-of-order read-response, we can't send another read request with that particular TileLink source ID until that read response gets read out of the ReorderRAM.
  • This only works for single-beat requests (which we can assume for the AXI4ToTL block today, but that isn't ideal). To support multi-beat requests, you would need to multiply the number of ReorderRAMs by the maximum number of beats you want to support. That could get expensive quickly.
Let me know what you think, suggestions are most welcome!

-Tynan

Howard Mao

Jul 17, 2020, 2:11:02 AM
to Chipyard
This is a good start. I have an idea on how to solve the space utilization issue. It's based on the design of the ListBuffer used in the SiFive inclusive cache. 


As the name suggests, it's essentially a set of memories implementing several linked lists. You have two memories of size numIds storing head and tail pointers; two memories of size nEntries, one for the data and one for the next pointers; a numIds-bit register indicating whether each ID's list is valid; and an nEntries-bit register indicating whether each entry is used. Enqueued entries are written into data(tail(pushId)) and dequeued from data(head(popId)). In our case, we could set nEntries to be maxFlight.

In our case, instead of writing the data when a request is made, we just reserve the buffer entry by setting the used bit. Instead of statically mapping TL sources to AXI4 IDs, the TL source would just be the index in the data table (plus a bit to indicate read/write). That way, on a read response, we can use the response source to directly index the data table to write the data. We add one additional memory, a reverse mapping to the AXI4 ID, and one additional register, tracking whether the data has been filled in or not. On a response, we use the reverse mapping and head table to check if it is the head of an ID's list. If it is, we forward the response along with any subsequent present entries in the list by following the next pointers. Otherwise, we write the data into the table and move on to the next TL response.
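
If it helps, here's a rough Chisel sketch of just the tables this implies and the "is this response the head of its ID's list?" check. The names are hypothetical (this is not the actual SiFive ListBuffer), the enqueue/dequeue and pointer-update logic is left out, and it assumes numIds > 1 and nEntries > 1:

import chisel3._
import chisel3.util._

// Tables for a linked-list reorder buffer shared across AXI IDs, in the spirit of the
// SiFive ListBuffer. nEntries plays the role of maxFlight (total outstanding reads),
// and the TL source is simply the entry index plus a read/write bit (stripped here).
class ListReorderState(numIds: Int, nEntries: Int, dataBits: Int) extends Module {
  val io = IO(new Bundle {
    val respEntry  = Input(UInt(log2Ceil(nEntries).W))  // TL source == data-table index
    val respIsHead = Output(Bool())                     // can be forwarded to AXI R now
  })

  // Per-AXI-ID list pointers.
  val head    = Reg(Vec(numIds, UInt(log2Ceil(nEntries).W)))
  val tail    = Reg(Vec(numIds, UInt(log2Ceil(nEntries).W)))
  val idValid = RegInit(VecInit(Seq.fill(numIds)(false.B)))     // ID has a non-empty list

  // Per-entry state.
  val data      = Reg(Vec(nEntries, UInt(dataBits.W)))          // response payloads
  val next      = Reg(Vec(nEntries, UInt(log2Ceil(nEntries).W)))
  val used      = RegInit(VecInit(Seq.fill(nEntries)(false.B))) // reserved at request time
  val filled    = RegInit(VecInit(Seq.fill(nEntries)(false.B))) // response data has arrived
  val entryToId = Reg(Vec(nEntries, UInt(log2Ceil(numIds).W)))  // reverse map to the AXI4 ID

  // A response can be forwarded immediately only if it is the head of its ID's list;
  // otherwise its data is parked in data(respEntry) and filled(respEntry) is set.
  val respId = entryToId(io.respEntry)
  io.respIsHead := idValid(respId) && head(respId) === io.respEntry
}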

With this design, the space requirement scales much better. Instead of O(numIds * maxFlight) it would just be O(maxFlight). The downside is that the indirection could potentially increase the critical path. And of course there's still the issue of only supporting single-beat requests.

Tynan McAuley

Dec 1, 2020, 5:53:58 PM
to Chipyard
Again, sorry for going dark on this. I've got some time to spend some cycles on this issue again, and I'd like to more fully explore multi-beat burst support, as that's going to be very useful for me in the future. Unfortunately, it seems like that's going to be difficult to accomplish while being performant and not using an enormous amount of SRAM. I do like the idea of a data structure based on the ListBuffer module, but supporting multi-beat bursts will be tricky.

I ended up developing the design I described in my previous post, and that worked well enough for designs that only produced single-beat AXI bursts. I was working with a design that produced multi-beat AXI bursts, and discovered something strange: contrary to the AXI4Fragmenter's documentation in Chipyard, the AXI4Fragmenter is capable of producing multi-beat bursts. They just have to contain a power-of-two number of beats, so they can translate into legal TileLink bursts. This behavior is also mentioned in the pull request containing the initial AXI4Fragmenter implementation. This has a few consequences:
  • We can't depend on the AXI4Fragmenter to produce single-beat AXI4 requests for AXI4ToTL.
  • Side effect: the AXI4Fragmenter can produce illegal AXI4-Lite traffic, since AXI4-Lite requires that all transactions have a burst length of 1.
  • The Chipyard documentation should probably be updated to reflect this.
Does that sound right to you?

-Tynan

Howard Mao

Dec 3, 2020, 9:34:36 PM
to Chipyard
Yeah, it seems like I was mistaken when I wrote the documentation for AXI4Fragmenter in the chipyard docs. Its behavior is as you say, though it seems there's an optional flag you can pass to make it conform to AXI4-Lite.

Tynan McAuley

Dec 9, 2020, 1:29:26 AM
to Chipyard
Oh nice, what flag is that? I don't see that in the AXI4Fragmenter file.

I thought a bit more about your idea based on the ListBuffer, and I like it so far. I guess it can be extended to support multi-beat bursts by just instantiating a number of the data memories (one memory for each beat), and then selecting which one to use based on the beat number in a burst. Of course you'll have some wasted space, since not all transactions will use the max burst size. I don't have a great sense for the timing impact, but it appears you could have a cascade of several flop-based RAM reads which feed into a flop-based RAM write. I guess if you're having serious timing issues and you want to revert to the current single-transaction-at-a-time behavior, you could just set the max number of outstanding TileLink transactions to 1, and then a lot of this logic should be unnecessary.

Related to burst size, TileLink can nicely negotiate the max-supported burst size with diplomacy (producing the maxTransfer value in each TL*Parameters object), but AXI doesn't have the same support. The AXI4 spec simply has a maximum burst length of 256 beats for INCR bursts, regardless of bus width. It would be nice to not have to support 256-beat bursts in all scenarios, especially when a designer knows that their AXI components will only produce a bounded burst length. Right now, it appears that most of the Rocket Chip AXI code just assumes the need to support 256-beat bursts, given the prevalence of (1 << AXI4Parameters.lenBits). Should this parameter be negotiated by diplomacy? As a stop-gap measure, could we allow users to override the max-supported burst length for this module, given the potential for generating way too much hardware?
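
For a rough sense of scale, here's a back-of-the-envelope calculation in plain Scala (the 512-bit width and the outstanding-read count are just illustrative numbers, not taken from any particular config) of how much buffering a reorder scheme would need if every outstanding read had to be able to hold a worst-case 256-beat burst:

// Back-of-the-envelope reorder-buffer sizing if every outstanding read must be able
// to hold a full worst-case AXI4 INCR burst. All numbers here are illustrative.
object ReorderSizing extends App {
  val lenBits     = 8                    // AXI4Parameters.lenBits
  val maxBeats    = 1 << lenBits         // 256-beat worst case allowed by the spec
  val dataBits    = 512                  // e.g. a 512-bit DBB-width data bus
  val outstanding = 16                   // illustrative number of in-flight reads to buffer

  def kiB(bits: Long): Double = bits / 8.0 / 1024.0

  val worstCase = outstanding.toLong * maxBeats * dataBits
  val bounded   = outstanding.toLong * 8 * dataBits     // if bursts were capped at 8 beats

  println(f"worst-case buffering: ${kiB(worstCase)}%.1f KiB")  // 256.0 KiB
  println(f"with an 8-beat cap:   ${kiB(bounded)}%.1f KiB")    //   8.0 KiB
}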

-Tynan

Howard Mao

Dec 16, 2020, 10:08:59 PM
to Chipyard
> Oh nice, what flag is that? I don't see that in the AXI4Fragmenter file.

Oh huh, I guess they've removed it from master recently. I swear I saw it at some point ...

> Should this parameter be negotiated by diplomacy? As a stop-gap measure, could we allow users to override the max-supported-burst-length for this module, given the potential for generating way too much hardware?

This would be a useful change. You should submit an issue for it on the rocket-chip repository.



Tynan McAuley

Dec 21, 2020, 4:03:58 PM
to Chipyard

Tynan McAuley

Jan 14, 2021, 1:08:59 AM
to Chipyard
And here's the PR for the feature: https://github.com/chipsalliance/rocket-chip/pull/2773

Sriram R

May 23, 2025, 1:19:19 PM
to Chipyard

I have a doubt about the integration of the DBB interface: how is the 512-bit data from the DBB interface (MCIF) handled?