Memory Bandwidth


Björn Gottschall

May 10, 2022, 6:02:36 AM
to FireSim
Hei,

I have built the following configuration with the FireSim 1.12 release:

TARGET_CONFIG=DDR3FRFCFSLLC4MB_WithDefaultFireSimBridges_WithFireSimHighPerfConfigTweaks_chipyard.MegaBoomConfig

As I understand it, this should give me a core frequency of 3.2 GHz while the buses are all running at 1 GHz, plus a relatively powerful DDR3 model with an LLC. Now I ran the STREAM benchmark and got the following results:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1655.0     0.097115     0.096678     0.100155
Scale:           1676.7     0.095556     0.095426     0.095782
Add:             2339.5     0.102643     0.102585     0.102728
Triad:           2373.3     0.101152     0.101126     0.101182

As you can see, these are very far from any known DDR3 bandwidth. Any ideas what could have gone wrong?

David Biancolin

May 10, 2022, 1:10:45 PM
to FireSim
Hey Bjorn. 

Can you try a DRAM model without the LLC and report your results?

I'd encourage you to enable the instrumentation features in FASED, which you can sample by setting profile-interval to something other than -1. Better yet, pulling open a metasimulation waveform might help, if possible.

- David 

David Biancolin

May 10, 2022, 1:17:05 PM
to FireSim
It's also worth checking the parameters of the AXI4 interface driving FASED. If you copy in the FIRRTL annotation for the FASED bridge, I can have a look at it.

Björn Gottschall

May 10, 2022, 1:57:27 PM
to FireSim
Thanks David! I'll check your other suggestions as soon as I can; for now, here is the part of the FIRRTL that I think you were referring to. I should mention that we have only connected one memory bank of the U250 FPGAs in our design, since we don't need more than 16 GB of memory, and the timing improves a lot so we can drive the frequency quite a bit higher. But that shouldn't impact FASED, right?

{
    "class":"midas.widgets.SerializableBridgeAnnotation",
    "target":"~FireSim|FASEDBridge",
    "channelNames":[
      "reset",
      "axi4_r_fwd",
      "axi4_r_rev",
      "axi4_b_fwd",
      "axi4_b_rev",
      "axi4_ar_fwd",
      "axi4_ar_rev",
      "axi4_w_fwd",
      "axi4_w_rev",
      "axi4_aw_fwd",
      "axi4_aw_rev"
    ],
    "widgetClass":"midas.models.FASEDMemoryTimingModel",
    "widgetConstructorKey":{
      "class":"midas.models.CompleteConfig",
      "userProvided":{
        "class":"midas.models.FirstReadyFCFSConfig",
        "dramKey":{
          "maxBanks":8,
          "maxRanks":4,
          "dramSize":17179869184,
          "lineBits":8
        },
        "schedulerWindowSize":8,
        "transactionQueueDepth":8,
        "backendKey":{
          "writeDepth":4,
          "readDepth":4,
          "latencyBits":12
        },
        "params":{
          "maxReads":32,
          "maxWrites":32,
          "maxReadLength":256,
          "maxWriteLength":256,
          "detectAddressCollisions":false,
          "stallEventCounters":false,
          "localHCycleCount":false,
          "latencyHistograms":false,
          "llcKey":{
            "ways":{
              "min":1,
              "max":8
            },
            "sets":{
              "min":1,
              "max":4096
            },
            "blockBytes":{
              "min":8,
              "max":128
            },
            "mshrs":{
              "min":1,
              "max":8
            }
          },
          "xactionCounters":true,
          "beatCounters":true,
          "targetCycleCounter":false,
          "occupancyHistograms":[
            0,
            2,
            4,
            8
          ],
          "addrRangeCounters":0
        }
      },
      "axi4Widths":{
        "dataBits":64,
        "addrBits":35,
        "idBits":4
      },
      "axi4Edge":{
        "maxReadTransfer":8,
        "maxWriteTransfer":8,
        "idReuse":2,
        "maxFlight":20,
        "address":[
          {
            "base":2147483648,
            "mask":2147483647
          },
          {
            "base":4294967296,
            "mask":4294967295
          },
          {
            "base":8589934592,
            "mask":8589934591
          },
          {
            "base":17179869184,
            "mask":2147483647
          }
        ]
      },
      "memoryRegionName":"MainMemory_0"
    }
  },

Björn Gottschall

May 13, 2022, 6:39:58 AM
to FireSim
Hi David,

I'm still in the process of figuring out what is going on, but to make sure, I wanted to check with you what I should expect in the first place. The memory bus is configured by default to a width of XLen, which in this case is 64 bit. Its frequency is set to 1 GHz, with a comment that the FASED configuration needs to be regenerated if one wants to change this. Within FASED I see that the default timings are taken from the DDR3-2133 configuration. But the DDR3 runs at the memory bus frequency, right? So I should not expect DDR3-2133 speeds (i.e. 17 GB/s) but a peak bandwidth of 8 GB/s, right? If I want to run the DDR3 at 2133 MT/s, is it enough to increase the memory bus frequency, or is there something else that I need to pay attention to in FASED?
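[Editor's note: a quick back-of-the-envelope check of the two figures in question, assuming a 64-bit data bus in both cases; this is just arithmetic, not FASED code.]

```python
def peak_bw_gbs(clock_mhz, data_bits, transfers_per_clock):
    """Peak bandwidth in GB/s for a bus of the given width and clock."""
    return clock_mhz * 1e6 * transfers_per_clock * (data_bits // 8) / 1e9

# DDR3-2133: ~1066 MHz clock, two transfers per clock, 64-bit device bus
ddr3_2133 = peak_bw_gbs(1066.5, 64, 2)   # ~17 GB/s

# A 64-bit memory bus at 1 GHz with one transfer per clock
mbus_64b_1ghz = peak_bw_gbs(1000, 64, 1)  # 8 GB/s
```

This matches the 17 GB/s vs. 8 GB/s contrast above: the device's double data rate is what the single-edge bus cannot carry.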

I ran different configurations now, with DDR3 and with only LatencyPipe, on the MegaBoom and Rocket cores.

MegaBoom DDR3 & LLC - 1654 MB/s
MegaBoom DDR3 - 4105 MB/s
Rocket DDR3 - 1262 MB/s
MegaBoom LatencyPipe - 5218 MB/s
Rocket LatencyPipe - 1130 MB/s

I am currently running a Verilator SmallBoom DDR3 & LLC simulation with the proxy kernel to see what the numbers are. I've also sampled the memory stats, but honestly I don't know what to do with those numbers, since I don't know what they represent or what they should look like.

David Biancolin

May 16, 2022, 4:00:09 PM
to FireSim
Hey Bjorn,

On Friday, May 13, 2022 at 3:39:58 AM UTC-7 Björn Gottschall wrote:
Hi David,

I'm still in the process of figuring out what is going on, but to make sure, I wanted to check with you what I should expect in the first place. The memory bus is configured by default to a width of XLen, which in this case is 64 bit. Its frequency is set to 1 GHz, with a comment that the FASED configuration needs to be regenerated if one wants to change this. Within FASED I see that the default timings are taken from the DDR3-2133 configuration. But the DDR3 runs at the memory bus frequency, right? So I should not expect DDR3-2133 speeds (i.e. 17 GB/s) but a peak bandwidth of 8 GB/s, right? If I want to run the DDR3 at 2133 MT/s, is it enough to increase the memory bus frequency, or is there something else that I need to pay attention to in FASED?
The controller runs at MBus frequency, but there are two transfers per clock period. I think you hit on the big "gotcha": the MBus width is not sized to accommodate the theoretical bandwidth of a 64b DDR data bus. If you want to use the standard 64b data bus, the MBus should be doubled to 128b.
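[Editor's note: a sketch of the width math described here, not FASED code — to carry a DDR device's two-transfers-per-clock bandwidth over a single-data-rate MBus at the same frequency, the bus needs twice the data width.]

```python
def required_mbus_bits(dram_data_bits, dram_transfers_per_clock=2):
    """Width a single-data-rate bus needs, at the same clock frequency,
    to match a DDR device's peak bandwidth."""
    return dram_data_bits * dram_transfers_per_clock

# A standard 64-bit DDR3 channel needs a 128-bit MBus at the same frequency.
assert required_mbus_bits(64) == 128
```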


I ran different configurations now with DDR3 and only LatencyPipe with the MegaBoom and Rocket Core.

MegaBoom DDR3 & LLC - 1654 MB/s
MegaBoom DDR3 - 4105 MB/s
Rocket DDR3 - 1262 MB/s
MegaBoom LatencyPipe - 5218 MB/s
Rocket LatencyPipe - 1130 MB/s

The LLC bandwidths are a little concerning; I'd have to look at a waveform to figure out what's happening there. As it is, the existing LLC is not well suited for systems that can generate tons of outstanding memory requests. It might be better to write a simpler model that composes with the LBP for that purpose.

 
I am currently running a Verilator SmallBoom DDR3 & LLC simulation with the proxy kernel to see what the numbers are. I've also sampled the memory stats, but honestly I don't know what to do with those numbers, since I don't know what they represent or what they should look like.
There should be stats in there for LLC hit rates that hopefully explain the LLC performance.
 

On Tuesday 10 May 2022 at 19:57:27 UTC+2 Björn Gottschall wrote:
Thanks David! I'll check your other suggestions as soon as I can; for now, here is the part of the FIRRTL that I think you were referring to. I should mention that we have only connected one memory bank of the U250 FPGAs in our design, since we don't need more than 16 GB of memory, and the timing improves a lot so we can drive the frequency quite a bit higher. But that shouldn't impact FASED, right?
It will have no impact.
 

Björn Gottschall

May 18, 2022, 3:49:47 PM
to FireSim
If I double the memory bus width to accommodate the dual transfers, wouldn't I need to do the same to the SystemBus and TileBus, since they are all configured the same way and would impose the next bottleneck in the system? Before, I ran 1.10 and everything was running at 3.2 GHz, so of course I didn't have any bandwidth issues. The new configuration makes absolute sense to me, but then the bandwidths aren't as expected for the configured DDR3. I would like to simulate a realistic system with DDR3 memory, and optimally also have an L3 cache. Looking at the numbers, doubling the memory bus width should ideally give me double the bandwidth, right? But that is still quite far from what I thought the DDR3 would provide.


These are the memory stats of the LLC from a MegaBoom with DDR3 and LLC. Since it's a streaming benchmark, which misses basically all the time, the main part of it looks fine to me. The beginning and end of the plot mark the boot and power-off.

llc_stats.png

I can also capture a waveform of a smaller streaming benchmark, if that is of any help. Will be big though.

David Biancolin

May 20, 2022, 5:31:05 PM
to FireSim
Since the clock crossing happens at the MBus, you shouldn't have to change the widths of the internal buses: their frequency is more than 2x the MBus frequency. It _should_ just be a free bandwidth improvement.

I need to be frank that, for OoO machines or vector machines, you're probably best off not using the default LLC model in FASED. Instead I'd suggest either writing your own or improving the existing one. "Writing your own" here could mean either writing a FASED model or something that is elaborated in the target itself. Ideally, Chipyard would have its own coherent L3 cache that would be emitted as part of the target. There's lots of state available on the FPGA, so there's no real reason the cache model _needs_ to go into DRAM.

Ignoring the LLC: given a 1 GHz frequency and a 64b data bus, 16 GiB/s is the theoretical peak for a single DDR3 memory channel, right? This neglects refresh and non-ideal locality of reference. Approaching that limit will require using FRFCFS with an open-page policy and always getting row-buffer hits. The instrumentation should report the number of activations and column commands issued; you can use those to determine the row-buffer hit rate.
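[Editor's note: a sketch of that hit-rate calculation; the counter names here are placeholders, not FASED's actual stat names. It assumes one ACT per row-buffer miss and no speculative activates.]

```python
def row_buffer_hit_rate(column_commands, activations):
    """Fraction of column (read/write) commands that hit an open row.
    Every row-buffer miss requires an ACT before its column command."""
    if column_commands == 0:
        return 0.0
    return (column_commands - activations) / column_commands

# e.g. 1000 column commands requiring 400 activates -> 60% row-buffer hits
assert row_buffer_hit_rate(1000, 400) == 0.6
```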

Another easy way to improve DRAM bandwidth, without changing the DRAM data bus width or choosing a non-standard frequency, is to use more channels and to stripe the address assignment across them. I've done this before, and I believe the default configurations in Chipyard will stripe cache lines across the memory channels if you ask for more than one channel.
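[Editor's note: the striping described here amounts to interleaving consecutive cache lines round-robin across channels; a sketch with a 64-byte line size assumed for illustration.]

```python
def channel_of(addr, n_channels, line_bytes=64):
    """Channel index when consecutive cache lines are striped round-robin."""
    return (addr // line_bytes) % n_channels

# Consecutive 64-byte lines land on alternating channels of a 2-channel system.
lines = [channel_of(a, 2) for a in range(0, 256, 64)]
assert lines == [0, 1, 0, 1]
```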

Björn Gottschall

May 30, 2022, 7:22:44 AM
to FireSim
I've now tried configuring just the beatWidth of the MemoryBus twice as wide, but during the MIDAS transformation it reports overlapping address ranges somewhere in FASED. I guess it is not quite that easy.

The numbers a few messages ago are from a benchmark which just copies one big array to another. The implementation without the DDR3 and LLC, using the LatencyPipe with just a one-cycle delay, only gets near 7 GB/s. Do you have an idea why I'm still at half the theoretical bandwidth? I'll have a look at the row-buffer hit rate as soon as I find some time.

On the side, I also tested the current 1.13.4 branch on AWS, and it reports the same numbers. The shipped LargeBoom (DDR3FRFCFSLLC4MB_WithDefaultFireSimBridges_WithFireSimTestChipConfigTweaks_chipyard.LargeBoomConfig) is even lower, at around 1 GB/s.

Björn Gottschall

Jul 19, 2022, 6:19:07 AM
to FireSim
I want to give some feedback in case more people run into this. We have figured out most of the problems with the memory bandwidth.

We first optimized our streaming benchmark (which I assume, for this case, to be the most common data access pattern; contemporary CPUs handle these access patterns without any bandwidth problems) and unrolled our streaming loops to a width of 8, which increased our DDR3 bandwidth to 2.6 GiB/s (see picture below).
bandwidth_stream.png
But one can see that the bandwidth is still extremely low beyond the L1. So we went with a configuration that removes everything below the L1 (DDR3, L3 and L2), and the bandwidth beyond the L1 increased to 4 GB/s, showing that there are more problems to solve than the memory bus or the L3.

The first and most important problem was that the MegaBoom (4-wide) configuration is simply too small to saturate the MSHRs of the L1 data cache. We have 8 MSHRs in the L1d and 32 STQ/LDQ entries, but a streaming benchmark can use at most 4 MSHRs in this configuration. So we increased the resources of the MegaBoom configuration by 50%-100% (issue queues, ROB, LDQ, STQ, ...) so that all MSHRs of the L1d were used. We also applied a fix in BOOM that had prevented the use of the last LDQ/STQ entry (you can find the fix in the open issues on the BOOM GitHub) and made a fix in the MSHRs that allows reading and writing the MSHR line buffer at the same time (credits to David Metz).
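[Editor's note: Little's law makes it easy to see why a handful of in-flight L1 misses caps streaming bandwidth. A sketch, with a hypothetical 80 ns miss round-trip latency assumed purely for illustration.]

```python
def streaming_bw_gbs(outstanding_misses, line_bytes, miss_latency_ns):
    """Little's law: sustained bandwidth = bytes in flight / latency.
    bytes/ns is numerically equal to GB/s."""
    return outstanding_misses * line_bytes / miss_latency_ns

# With 64-byte lines and an assumed 80 ns miss round trip:
# 4 in-flight misses sustain 3.2 GB/s; all 8 MSHRs would sustain 6.4 GB/s.
assert streaming_bw_gbs(4, 64, 80) == 3.2
assert streaming_bw_gbs(8, 64, 80) == 6.4
```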
While all this improved the bandwidth a bit, it was still not enough. We also increased the memory bus to 128 bit, which in my eyes should actually be the default, since the double data rate of DDR3 cannot be carried by a 64-bit memory bus that only transfers on a single clock edge. The next problem was clock-domain-crossing buffers that backpressured, so we increased them as well. And lastly, the coherence trackers were also not sufficient: waveforms showed that around 9-10 are in use when all MSHRs of the L1d are perfectly utilized. With all this, the bandwidth curve looks like this:
bandwidth_stream.png
Lastly, we disabled the L3, which, as you said David, is not ideal in this scenario:
bandwidth_stream.png

This is up to 68% of the theoretical load bandwidth of the 128-bit memory bus.
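[Editor's note: a hedged sanity check of what that utilization figure implies, assuming the reference is the 128-bit, 1 GHz single-data-rate memory bus described earlier in the thread.]

```python
def mbus_peak_gbs(data_bits=128, clock_ghz=1.0):
    """Peak of a single-data-rate bus in GB/s: bytes per beat times GHz."""
    return (data_bits // 8) * clock_ghz

# 68% of the 16 GB/s peak implies a sustained load bandwidth near 10.9 GB/s.
implied_bw = 0.68 * mbus_peak_gbs()
```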

Cheers,
Björn
