Multicore VexRiscv with SMP


Charles Papon

Apr 19, 2020, 4:59:23 AM
to Linux for LiteX FPGA SoC
Hi,

I'm currently working on a multicore VexRiscv with memory coherency/consistency compatible with Linux, and then on bringing Linux up on it.

This will be based on a write-through invalidate L1 protocol, so something simple which should stay lite ^^
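
(As an illustration only, here is a rough C sketch of what write-through invalidate means for the L1 data caches; the constants and structure are made up, this is not the actual SpinalHDL implementation. Every store goes straight through to the shared memory side, and the other cores invalidate any matching line in their own L1, so no core can keep a stale copy:)

/* Hedged sketch, not the real RTL: per-core write-through L1 with
 * invalidation of the other cores' copies on every store. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES  4
#define NUM_LINES  256            /* lines per L1 data cache (made up) */
#define LINE_BYTES 64

typedef struct {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
} l1_cache_t;

static l1_cache_t l1[NUM_CORES];

static uint32_t line_index(uint32_t addr) { return (addr / LINE_BYTES) % NUM_LINES; }
static uint32_t line_tag(uint32_t addr)   { return  addr / (LINE_BYTES * NUM_LINES); }

/* A store from 'core': write through to memory, then invalidate the
 * line in every other core's L1. */
void store(int core, uint32_t addr, uint32_t data, uint32_t *mem)
{
    mem[addr / sizeof(uint32_t)] = data;              /* write-through */
    uint32_t idx = line_index(addr), tag = line_tag(addr);
    for (int c = 0; c < NUM_CORES; c++) {
        if (c == core)
            continue;
        if (l1[c].valid[idx] && l1[c].tag[idx] == tag)
            l1[c].valid[idx] = false;                 /* invalidate */
    }
}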
So far, I've got a simulation SoC running SMP tests successfully; it can boot Linux in single-core configuration and run OpenSBI in multicore. Now I'm trying to compile Linux 5.6, as OpenSBI requires HSM support from the supervisor (Linux) (https://github.com/riscv/riscv-sbi-doc/blob/master/riscv-sbi.adoc#hart-state-management-extension-extension-id-0x48534d-hsm).
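
(For context, the HSM extension is how the supervisor asks the SBI firmware to start the secondary harts. Below is a minimal, hedged sketch of such a call from S-mode, assuming the standard SBI calling convention: extension ID in a7, function ID in a6, arguments in a0-a2, error code returned in a0; hart_start is function 0 of extension 0x48534D per the spec linked above. This is just an illustration, not the actual Linux code:)

/* Hedged sketch of an SBI HSM hart_start ecall from S-mode (RV32). */
#define SBI_EXT_HSM         0x48534D   /* "HSM" */
#define SBI_HSM_HART_START  0

/* Ask the M-mode firmware (OpenSBI) to start hart 'hartid' executing
 * at 'start_addr'; 'opaque' is handed to the woken hart in a1. */
static long sbi_hart_start(unsigned long hartid,
                           unsigned long start_addr,
                           unsigned long opaque)
{
    register unsigned long a0 asm("a0") = hartid;
    register unsigned long a1 asm("a1") = start_addr;
    register unsigned long a2 asm("a2") = opaque;
    register unsigned long a6 asm("a6") = SBI_HSM_HART_START;
    register unsigned long a7 asm("a7") = SBI_EXT_HSM;

    asm volatile ("ecall"
                  : "+r"(a0), "+r"(a1)
                  : "r"(a2), "r"(a6), "r"(a7)
                  : "memory");
    return (long)a0;   /* SBI error code, 0 on success */
}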

As soon as I have something more or less working with Linux SMP in simulation, I will sync everything and post the steps to reproduce here :)


Charles Papon

Apr 19, 2020, 12:39:24 PM
to Linux for LiteX FPGA SoC
Rawrrrrrr

OpenSBI v0.6-2-gad1aa82
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name          : VexRiscv SMP simulation
Platform HART Features : RV32AIMS
Platform Max HARTs     : 4
Current Hart           : 0
Firmware Base          : 0x80000000
Firmware Size          : 84 KB
Runtime SBI Version    : 0.2

MIDELEG : 0x00000222
MEDELEG : 0x0000b101
[    0.000000] No DTB passed to the kernel
[    0.000000] Linux version 5.0.9 (rawrr@rawrr) (gcc version 8.4.0 (Buildroot 2020.02.1-00038-g1be560af06-dirty)) #1 SMP Sun Apr 19 13:41:21 CEST 2020
[    0.000000] Initial ramdisk at: 0x(ptrval) (8388608 bytes)
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x00000000c0000000-0x00000000c7ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00000000c0000000-0x00000000c7ffffff]
[    0.000000] Initmem setup node 0 [mem 0x00000000c0000000-0x00000000c7ffffff]
[    0.000000] elf_hwcap is 0x1100
[    0.000000] percpu: Embedded 10 pages/cpu @(ptrval) s16784 r0 d24176 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 32512
[    0.000000] Kernel command line: mem=128M@0xC0000000 rootwait console=hvc0 root=/dev/ram0 init=/sbin/init
[    0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
[    0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
[    0.000000] Sorting __ex_table...
[    0.000000] Memory: 118648K/131072K available (2191K kernel code, 88K rwdata, 334K rodata, 132K init, 185K bss, 12424K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.000000] NR_IRQS: 0, nr_irqs: 0, preallocated irqs: 0
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x171024e7e0, max_idle_ns: 440795205315 ns
[    0.000137] sched_clock: 64 bits at 100MHz, resolution 10ns, wraps every 4398046511100ns
[    0.001462] Console: colour dummy device 80x25
[    0.037717] printk: console [hvc0] enabled
[    0.038853] Calibrating delay loop (skipped), value calculated using timer frequency.. 200.00 BogoMIPS (lpj=400000)
[    0.041396] pid_max: default: 32768 minimum: 301
[    0.044466] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.046219] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.065496] rcu: Hierarchical SRCU implementation.
[    0.079124] smp: Bringing up secondary CPUs ...
[    0.108535] smp: Brought up 1 node, 4 CPUs
[    0.115564] devtmpfs: initialized
[    0.129911] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.132290] futex hash table entries: 1024 (order: 4, 65536 bytes)
[    0.186026] clocksource: Switched to clocksource riscv_clocksource
[    0.300318] Unpacking initramfs...
[    0.773632] Initramfs unpacking failed: junk in compressed archive
[    0.783493] workingset: timestamp_bits=30 max_order=15 bucket_order=0
[    0.886834] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
[    0.888444] io scheduler mq-deadline registered
[    0.889494] io scheduler kyber registered
[    1.621251] random: get_random_bytes called from init_oops_id+0x4c/0x60 with crng_init=0
[    1.633257] Freeing unused kernel memory: 132K
[    1.634720] This architecture does not have kernel memory protection.
[    1.636139] Run /init as init process
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Saving random seed: [    2.541507] random: dd: uninitialized urandom read (512 bytes read)
OK
Starting network: ip: socket: Function not implemented
ip: socket: Function not implemented
FAIL

Welcome to Buildroot
buildroot login: root
root
Rawrrrr

root@buildroot:~# nproc
nproc
4
root@buildroot:~# cat /proc/cpuinfo
cat /proc/cpuinfo
processor : 0
hart : 0
isa : rv32im
mmu : sv32

processor : 1
hart : 1
isa : rv32im
mmu : sv32

processor : 2
hart : 2
isa : rv32im
mmu : sv32

processor : 3
hart : 3
isa : rv32im
mmu : sv32

root@buildroot:~#

Keep in mind that this simulation only contains the 4 CPUs and their interconnect :) There is quite some integration/optimisation work remaining.

About OpenSBI, basically the 0.6 release is fine and does not require HSM.
In the CPU itself, I have to fix one issue related to AMO/LR/SC acquire/release, but otherwise it seems stable.
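
(For anyone wondering what acquire/release means here: those are the .aq/.rl ordering bits on the A-extension instructions. A typical pattern the kernel relies on is a spinlock; the hedged sketch below uses C11 atomics, which on RV32IMA compile down to AMO/LR/SC sequences with acquire and release ordering. It only illustrates the semantics being exercised, it is not code from the project:)

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Maps to an amoswap.w.aq loop: later memory operations may not
     * be reordered before the successful acquisition. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;
}

static void spin_unlock(spinlock_t *l)
{
    /* Release ordering: everything done inside the critical section
     * must be visible before the lock is seen as free. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}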

Florent Kermarrec

Apr 19, 2020, 2:55:50 PM
to linux...@googlegroups.com
Nice! Thanks for sharing.
Florent


Charles Papon

Apr 21, 2020, 12:52:02 PM
to Linux for LiteX FPGA SoC
Pushed everything. It can be run with the attached makefile:

make clone opensbi_compile buildroot_init buildroot_compile vexriscv_assembly
make vexriscv_sim


makefile

Charles Papon

Apr 26, 2020, 5:11:13 AM
to Linux for LiteX FPGA SoC
Attached is a first VexRiscv SMP cluster which should be more or less LiteX-ready. The toplevel is named VexRiscvLitexSmpCluster.
rtl.tar.gz

Charles Papon

Apr 27, 2020, 7:02:51 AM
to Linux for LiteX FPGA SoC
Things are looking good.
Screenshot at 2020-04-27 12-59-12.png

Florent Kermarrec

Apr 27, 2020, 7:10:23 AM
to linux...@googlegroups.com
Great, thanks, I'll work on integrating it.
Florent



Drew Fustini

Apr 27, 2020, 7:21:26 AM
to linux...@googlegroups.com
On Mon, Apr 27, 2020 at 1:02 PM Charles Papon
<charles....@gmail.com> wrote:
> Things are looking good.

Nice!

What tool is that in the screenshot?

thanks

Charles Papon

Apr 27, 2020, 7:28:25 AM
to linux...@googlegroups.com
Vivado, unfortunately ^^

Charles Papon

Apr 27, 2020, 8:07:49 AM
to Linux for LiteX FPGA SoC
FMax on Artix-7 speed grade 3 is about 175 MHz; on speed grade 1 (Arty A7) it is about 125 MHz. Actually, the critical path is in the memory stage of each CPU, where the MMU does the address translation and that translation is checked against the d$ tags.
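
(To illustrate where that path sits: the D$ can be indexed while the MMU translates in parallel, and the hit decision at the end of the memory stage then compares the translated physical address against the cache tags. A rough C sketch of that check, with made-up sizes, not the actual SpinalHDL code:)

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64
#define NUM_SETS   64        /* 4 KB direct-mapped way, for illustration */

extern bool     valid[NUM_SETS];
extern uint32_t tag_ram[NUM_SETS];   /* read early, using address bits */

bool dcache_hit(uint32_t vaddr, uint32_t paddr /* produced by the MMU */)
{
    uint32_t set  = (vaddr / LINE_BYTES) % NUM_SETS;   /* index, no MMU needed */
    uint32_t ptag =  paddr / (LINE_BYTES * NUM_SETS);  /* physical tag         */
    /* MMU translation followed by this compare is the long
     * combinatorial path mentioned above. */
    return valid[set] && tag_ram[set] == ptag;
}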

Charles Papon

Apr 29, 2020, 5:40:38 PM
to Linux for LiteX FPGA SoC
Florent got the following on a LiteX simulation:

"BBBBuuuuiiiilllldddd    yyyyoooouuuurrrr    hhhhaaaarrrrddddwwwwaaaarrrreeee,,,,    eeeeaaaassssiiiillllyyyy!!!!"

Hmmmmmmmmmm

Charles Papon

May 1, 2020, 6:36:29 AM
to Linux for LiteX FPGA SoC
Linux now boots on hardware: an Arty A7-35T with 4 cores, at 100 MHz.
I tested the performance and it seems to scale well; for instance, running 3 instances of Dhrystone at the same time only increases the duration of the test by about 10-20%. And there is still a lot to do on the design to improve the memory system. So that looks good.
Also, it seems stable.

Florent and I are currently cleaning things up.
We should get something reproducible soon.

Charles Papon

May 1, 2020, 9:35:45 AM
to Linux for LiteX FPGA SoC
https://github.com/enjoy-digital/litex_vexriscv_smp
should now be usable. But it is still missing the information about how to build OpenSBI/Linux/Buildroot.

Tim 'mithro' Ansell

May 1, 2020, 9:41:40 AM
to linux...@googlegroups.com
How does the resource usage of this SMP design compare to the single core version?


Charles Papon

May 1, 2020, 10:02:51 AM
to Linux for LiteX FPGA SoC
The resource usage of the whole SoC on Arty A7? I don't know how much the non-SMP version is, I will have to check. But currently, with the quad-core, 128-bit iBus and 32-bit dBus, it is about 13k LUTs. This will rise a bit, since I need to design and integrate a few other features on the dBus to improve the performance. But overall, the SMP overhead compared to a non-SMP quad core seems small.

When I have more details about this, I will post them here ^^



Charles Papon

May 1, 2020, 10:29:29 AM
to Linux for LiteX FPGA SoC
Oh, my bad, the 13K was without the LiteX toplevel (no DDR, no UART, ...).
With those included, it is:

+----------------------------+-------+-------+-----------+-------+
|          Site Type         |  Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| Slice LUTs                 | 14762 |     0 |     20800 | 70.97 |
|   LUT as Logic             | 14229 |     0 |     20800 | 68.41 |
|   LUT as Memory            |   533 |     0 |      9600 |  5.55 |
|     LUT as Distributed RAM |   530 |     0 |           |       |
|     LUT as Shift Register  |     3 |     0 |           |       |
| Slice Registers            | 12492 |     0 |     41600 | 30.03 |
|   Register as Flip Flop    | 12492 |     0 |     41600 | 30.03 |
|   Register as Latch        |     0 |     0 |     41600 |  0.00 |
| F7 Muxes                   |    36 |     0 |     16300 |  0.22 |
| F8 Muxes                   |     7 |     0 |      8150 |  0.09 |
+----------------------------+-------+-------+-----------+-------+

So, 15K LUT

Tim 'mithro' Ansell

May 1, 2020, 10:49:26 AM
to linux...@googlegroups.com
Just checking,
~13k LUT for the CPU core complex and roughly ~2k LUT for all the support peripherals like DDR / UART / etc?



Charles Papon

May 1, 2020, 12:09:18 PM
to Linux for LiteX FPGA SoC
Yes, but a bit more for the cores and a bit less for the peripherals. To get a proper idea, we should disable netlist flattening during synthesis.

+---------------------------------------------+-------------------------+------------+------------+---------+------+-------+--------+--------+--------------+
|                   Instance                  |          Module         | Total LUTs | Logic LUTs | LUTRAMs | SRLs |  FFs  | RAMB36 | RAMB18 | DSP48 Blocks |
+---------------------------------------------+-------------------------+------------+------------+---------+------+-------+--------+--------+--------------+
| top                                         |                   (top) |      14762 |      14229 |     530 |    3 | 12492 |     21 |     28 |           16 |
|   (top)                                     |                   (top) |       1332 |       1185 |     144 |    3 |  1966 |      9 |      0 |            0 |
|   VexRiscvLitexSmpCluster                   | VexRiscvLitexSmpCluster |      13433 |      13047 |     386 |    0 | 10526 |     12 |     28 |           16 |
|     (VexRiscvLitexSmpCluster)               | VexRiscvLitexSmpCluster |        276 |        276 |       0 |    0 |   989 |      0 |      0 |            0 |
|     cluster                                 |      VexRiscvSmpCluster |      11245 |      11043 |     202 |    0 |  9173 |      8 |     28 |           16 |
|       (cluster)                             |      VexRiscvSmpCluster |         18 |         18 |       0 |    0 |   218 |      0 |      0 |            0 |
|       cpus_0_core                           |                VexRiscv |       2727 |       2677 |      50 |    0 |  2182 |      2 |      7 |            4 |
|         (cpus_0_core)                       |                VexRiscv |       1087 |       1039 |      48 |    0 |  1812 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |        InstructionCache |        477 |        477 |       0 |    0 |   104 |      2 |      1 |            0 |
|         dataCache_4                         |             DataCache_5 |       1166 |       1164 |       2 |    0 |   266 |      0 |      6 |            0 |
|       cpus_1_core                           |              VexRiscv_1 |       2725 |       2675 |      50 |    0 |  2184 |      2 |      7 |            4 |
|         (cpus_1_core)                       |              VexRiscv_1 |       1106 |       1058 |      48 |    0 |  1807 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |    InstructionCache_1_3 |        497 |        497 |       0 |    0 |   104 |      2 |      1 |            0 |
|         dataCache_4                         |             DataCache_4 |       1127 |       1125 |       2 |    0 |   273 |      0 |      6 |            0 |
|       cpus_2_core                           |              VexRiscv_2 |       2711 |       2661 |      50 |    0 |  2160 |      2 |      7 |            4 |
|         (cpus_2_core)                       |              VexRiscv_2 |       1111 |       1063 |      48 |    0 |  1805 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |    InstructionCache_1_1 |        500 |        500 |       0 |    0 |   104 |      2 |      1 |            0 |
|         dataCache_4                         |             DataCache_2 |       1107 |       1105 |       2 |    0 |   251 |      0 |      6 |            0 |
|       cpus_3_core                           |              VexRiscv_3 |       2666 |       2616 |      50 |    0 |  2160 |      2 |      7 |            4 |
|         (cpus_3_core)                       |              VexRiscv_3 |       1097 |       1049 |      48 |    0 |  1805 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |      InstructionCache_1 |        463 |        463 |       0 |    0 |   104 |      2 |      1 |            0 |
|         dataCache_4                         |               DataCache |       1112 |       1110 |       2 |    0 |   251 |      0 |      6 |            0 |
|       dBusArbiter                           |              BmbArbiter |        135 |        135 |       0 |    0 |    20 |      0 |      0 |            0 |
|         (dBusArbiter)                       |              BmbArbiter |         12 |         12 |       0 |    0 |    15 |      0 |      0 |            0 |
|         memory_arbiter                      |           StreamArbiter |        123 |        123 |       0 |    0 |     5 |      0 |      0 |            0 |
|       exclusiveMonitor                      |     BmbExclusiveMonitor |        242 |        242 |       0 |    0 |   234 |      0 |      0 |            0 |
|         (exclusiveMonitor)                  |     BmbExclusiveMonitor |        114 |        114 |       0 |    0 |   226 |      0 |      0 |            0 |
|         cmdArbiter                          |         StreamArbiter_2 |         29 |         29 |       0 |    0 |     3 |      0 |      0 |            0 |
|         exclusiveReadArbiter                |         StreamArbiter_1 |        100 |        100 |       0 |    0 |     5 |      0 |      0 |            0 |
|       invalidateMonitor                     |    BmbInvalidateMonitor |         24 |         22 |       2 |    0 |    15 |      0 |      0 |            0 |
|         io_output_rsp_fork                  |            StreamFork_1 |          8 |          8 |       0 |    0 |     3 |      0 |      0 |            0 |
|         rspLogic_rspToSyncFiltred_fifo      |              StreamFifo |         16 |         14 |       2 |    0 |    12 |      0 |      0 |            0 |
|     dBusDecoder                             |              BmbDecoder |         14 |         14 |       0 |    0 |     7 |      0 |      0 |            0 |
|     dMemBridge                              |           BmbToLiteDram |        876 |        788 |      88 |    0 |   109 |      4 |      0 |            0 |
|       (dMemBridge)                          |           BmbToLiteDram |          4 |          4 |       0 |    0 |     6 |      0 |      0 |            0 |
|       cmdContext_fifo                       |            StreamFifo_2 |         38 |         38 |       0 |    0 |    12 |      1 |      0 |            0 |
|       io_input_upSizer                      |        BmbUpSizerBridge |         12 |         12 |       0 |    0 |    15 |      0 |      0 |            0 |
|       io_input_upSizer_io_output_unburstify |           BmbUnburstify |        250 |        250 |       0 |    0 |    51 |      0 |      0 |            0 |
|       io_output_rdata_fifo                  |  StreamFifoLowLatency_0 |        403 |        315 |      88 |    0 |    11 |      0 |      0 |            0 |
|       streamFork_4                          |            StreamFork_2 |         18 |         18 |       0 |    0 |     2 |      0 |      0 |            0 |
|       streamFork_4_io_outputs_1_thrown_fifo |            StreamFifo_1 |        155 |        155 |       0 |    0 |    12 |      3 |      0 |            0 |
|     iBusArbiter                             |            BmbArbiter_1 |         52 |         52 |       0 |    0 |     5 |      0 |      0 |            0 |
|       memory_arbiter                        |         StreamArbiter_3 |         52 |         52 |       0 |    0 |     5 |      0 |      0 |            0 |
|     iBusDecoder                             |            BmbDecoder_1 |          7 |          7 |       0 |    0 |     6 |      0 |      0 |            0 |
|     iBusDecoder_io_outputs_0_downSizer      |      BmbDownSizerBridge |        102 |        102 |       0 |    0 |    99 |      0 |      0 |            0 |
|     iMemBridge                              |         BmbToLiteDram_1 |        277 |        181 |      96 |    0 |    63 |      0 |      0 |            0 |
|       (iMemBridge)                          |         BmbToLiteDram_1 |          9 |          9 |       0 |    0 |     6 |      0 |      0 |            0 |
|       cmdContext_fifo                       |            StreamFifo_3 |         30 |         22 |       8 |    0 |    17 |      0 |      0 |            0 |
|       io_input_unburstify                   |         BmbUnburstify_1 |        127 |        127 |       0 |    0 |    29 |      0 |      0 |            0 |
|       io_output_rdata_fifo                  |    StreamFifoLowLatency |        115 |         27 |      88 |    0 |    11 |      0 |      0 |            0 |
|     peripheralArbiter                       |            BmbArbiter_2 |        541 |        541 |       0 |    0 |     3 |      0 |      0 |            0 |
|       memory_arbiter                        |         StreamArbiter_4 |        541 |        541 |       0 |    0 |     3 |      0 |      0 |            0 |
|     peripheralArbiter_io_output_toWishbone  |           BmbToWishbone |         53 |         53 |       0 |    0 |    72 |      0 |      0 |            0 |
+---------------------------------------------+-------------------------+------------+------------+---------+------+-------+--------+--------+--------------+
* Note: The sum of lower-level cells may be larger than their parent cells total, due to cross-hierarchy LUT combining






Tim 'mithro' Ansell

May 1, 2020, 12:49:20 PM
to linux...@googlegroups.com
I know this is just a "get something working" type stage, but a couple of random thoughts:

 - Looks like there is currently a lot of overhead in the dMemBridge and iMemBridge and peripheralArbiter? Could LiteX do more here around native support for native VexRISCV structures to reduce that?
 - Does it make sense for the cache to be shared between cores? 
 - It seems like the caches are using a lot of LUTs? Should some of that be mapping to LUTRAMs?

Keep up the super awesome work!

Tim 'mithro' Ansell



Charles Papon

May 1, 2020, 1:08:02 PM
to Linux for LiteX FPGA SoC
>  It seems like the caches are using a lot of LUTs? Should some of that be mapping to LUTRAMs?
>  Looks like there is currently a lot of overhead in the dMemBridge and iMemBridge and peripheralArbiter? Could LiteX do more here around native support for native VexRISCV structures to reduce that?

At this scale (inner components) we can't give any proper diagnosis from the above report, as Vivado moved/combined LUTs across the hierarchy. I really have to disable the hierarchy flattening to get a proper usage report.
One good instance of that is the area usage of io_output_rdata_fifo, which is totally crazy; it is likely 3 times smaller.
I will do a synthesis run with the proper settings to get a good per-module LUT usage.
The D$ are likely shown as big because the synthesis pulled the MMU and a few other things into them.


> Does it make sense for the cache to be shared between cores? 

The L1, or a potential future L2?

About the L1, the good thing about having a dedicated one in each CPU is that it allows each CPU to be placed and routed independently of the other CPUs; in other words, it relaxes the place and route. Also, if the L1 were shared between CPUs, one issue is that it would require quite a lot of ways to avoid cache thrashing, which would add quite some combinatorial path and would require a longer CPU pipeline.
So it is likely a tradeoff between a shared L1 cache with high latency and many ways vs dedicated L1s with low latency and a low number of ways.
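
(A toy illustration of the thrashing concern, with made-up numbers: in a 4 KB direct-mapped cache, any two addresses 4 KB apart land in the same set, so two cores hammering such aliasing addresses through a shared low-associativity L1 would evict each other on every access; more ways, or a private L1 per core, avoid that.)

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES  64
#define CACHE_BYTES (4 * 1024)
#define NUM_SETS    (CACHE_BYTES / LINE_BYTES)

static unsigned set_of(uint32_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }

int main(void)
{
    uint32_t a = 0x80000000u, b = 0x80001000u;   /* 4 KB apart */
    printf("set(a)=%u set(b)=%u -> %s\n", set_of(a), set_of(b),
           set_of(a) == set_of(b) ? "same set, they evict each other"
                                  : "different sets");
    return 0;
}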


About a potential L2: sure, with the FPGA constraints, I think it really should be shared.


> Keep up the super awesome work!

Thanks :)




Charles Papon

May 2, 2020, 3:16:35 AM
to Linux for LiteX FPGA SoC
@tim I did a synthesis run with -flatten_hierarchy none, to avoid the issue of LUT occupancy being shuffled between modules, and the results are really, really, really different.

Apparently your first guess was more accurate :) 12.5k for the cluster, 2.3k for the peripherals.
Also note, the size of the DCaches is several times smaller than previously reported.

> Looks like there is currently a lot of overhead in the dMemBridge and iMemBridge and peripheralArbiter? Could LiteX do more here around native support for native VexRISCV structures to reduce that?

So looking now at the accurate report below, things seem fine to me now; I agree that in the inaccurate report there really was a lot of overhead.
So basically, the IBusBridge/dBusBridge do the following:
- translate bursts into single-beat accesses for LiteDRAM (a LiteDRAM requirement) (unburstify)
- buffer LiteDRAM read responses, as LiteDRAM does not implement backpressure
- keep track of pending read requests to not overflow the read response buffer (from above)
- for the dBus, also upsize the bus from 32-bit to 128-bit data width
- implement a context FIFO: basically, on BMB, a request can carry a context value which is given back on the response, which is useful to implement all sorts of bridges. That's the cmdContext_fifo (see the sketch below)

These are all things which have to be done in one place or another; for instance, having LiteDRAM support backpressure would only move the FIFO from the dBusBridge to LiteDRAM.
So to me, that's all good :)
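
(To make the context mechanism above concrete, a small hedged sketch, not the actual BMB/SpinalHDL code and assuming in-order responses on this path: the bridge pushes the request's context into a FIFO when the command is sent and pops it back when the matching response returns, so the downstream bus never has to carry that context itself. The same FIFO depth naturally bounds the number of pending requests:)

#include <assert.h>
#include <stdint.h>

#define DEPTH 16

typedef struct { uint32_t ctx[DEPTH]; unsigned push, pop, count; } ctx_fifo_t;

/* Command sent downstream: remember its context. */
static void cmd_sent(ctx_fifo_t *f, uint32_t context)
{
    assert(f->count < DEPTH);            /* also throttles pending requests */
    f->ctx[f->push] = context;
    f->push = (f->push + 1) % DEPTH;
    f->count++;
}

/* Response came back (in order): hand the saved context back upstream. */
static uint32_t rsp_received(ctx_fifo_t *f)
{
    assert(f->count > 0);
    uint32_t context = f->ctx[f->pop];
    f->pop = (f->pop + 1) % DEPTH;
    f->count--;
    return context;
}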



+---------------------------------------------+-------------------------+------------+------------+---------+------+-------+--------+--------+--------------+
|                   Instance                  |          Module         | Total LUTs | Logic LUTs | LUTRAMs | SRLs |  FFs  | RAMB36 | RAMB18 | DSP48 Blocks |
+---------------------------------------------+-------------------------+------------+------------+---------+------+-------+--------+--------+--------------+
| top                                         |                   (top) |      14869 |      14336 |     530 |    3 | 12782 |     21 |     28 |           16 |
|   (top)                                     |                   (top) |       2417 |       2270 |     144 |    3 |  1966 |      9 |      0 |            0 |
|   VexRiscvLitexSmpCluster                   | VexRiscvLitexSmpCluster |      12455 |      12069 |     386 |    0 | 10816 |     12 |     28 |           16 |
|     (VexRiscvLitexSmpCluster)               | VexRiscvLitexSmpCluster |        375 |        375 |       0 |    0 |   996 |      0 |      0 |            0 |
|     cluster                                 |      VexRiscvSmpCluster |      11236 |      11034 |     202 |    0 |  9358 |      8 |     28 |           16 |
|       (cluster)                             |      VexRiscvSmpCluster |         39 |         39 |       0 |    0 |   222 |      0 |      0 |            0 |
|       cpus_0_core                           |                VexRiscv |       2736 |       2686 |      50 |    0 |  2236 |      2 |      7 |            4 |
|         (cpus_0_core)                       |                VexRiscv |       2239 |       2191 |      48 |    0 |  1827 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |        InstructionCache |         75 |         75 |       0 |    0 |   121 |      2 |      1 |            0 |
|         dataCache_4                         |            DataCache__2 |        422 |        420 |       2 |    0 |   288 |      0 |      6 |            0 |
|       cpus_1_core                           |              VexRiscv_1 |       2677 |       2627 |      50 |    0 |  2206 |      2 |      7 |            4 |
|         (cpus_1_core)                       |              VexRiscv_1 |       2200 |       2152 |      48 |    0 |  1819 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |   InstructionCache_1__2 |         58 |         58 |       0 |    0 |   100 |      2 |      1 |            0 |
|         dataCache_4                         |            DataCache__3 |        419 |        417 |       2 |    0 |   287 |      0 |      6 |            0 |
|       cpus_2_core                           |              VexRiscv_2 |       2701 |       2651 |      50 |    0 |  2206 |      2 |      7 |            4 |
|         (cpus_2_core)                       |              VexRiscv_2 |       2225 |       2177 |      48 |    0 |  1819 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |      InstructionCache_1 |         57 |         57 |       0 |    0 |   100 |      2 |      1 |            0 |
|         dataCache_4                         |               DataCache |        420 |        418 |       2 |    0 |   287 |      0 |      6 |            0 |
|       cpus_3_core                           |              VexRiscv_3 |       2723 |       2673 |      50 |    0 |  2214 |      2 |      7 |            4 |
|         (cpus_3_core)                       |              VexRiscv_3 |       2246 |       2198 |      48 |    0 |  1819 |      0 |      0 |            4 |
|         IBusCachedPlugin_cache              |   InstructionCache_1__1 |         58 |         58 |       0 |    0 |   100 |      2 |      1 |            0 |
|         dataCache_4                         |            DataCache__1 |        421 |        419 |       2 |    0 |   295 |      0 |      6 |            0 |
|       dBusArbiter                           |              BmbArbiter |        109 |        109 |       0 |    0 |    20 |      0 |      0 |            0 |
|         (dBusArbiter)                       |              BmbArbiter |         20 |         20 |       0 |    0 |    15 |      0 |      0 |            0 |
|         memory_arbiter                      |           StreamArbiter |         89 |         89 |       0 |    0 |     5 |      0 |      0 |            0 |
|       exclusiveMonitor                      |     BmbExclusiveMonitor |        231 |        231 |       0 |    0 |   239 |      0 |      0 |            0 |
|         (exclusiveMonitor)                  |     BmbExclusiveMonitor |        139 |        139 |       0 |    0 |   231 |      0 |      0 |            0 |
|         cmdArbiter                          |         StreamArbiter_2 |         39 |         39 |       0 |    0 |     3 |      0 |      0 |            0 |
|         exclusiveReadArbiter                |         StreamArbiter_1 |         53 |         53 |       0 |    0 |     5 |      0 |      0 |            0 |
|       invalidateMonitor                     |    BmbInvalidateMonitor |         21 |         19 |       2 |    0 |    15 |      0 |      0 |            0 |
|         (invalidateMonitor)                 |    BmbInvalidateMonitor |          3 |          3 |       0 |    0 |     0 |      0 |      0 |            0 |
|         io_output_rsp_fork                  |            StreamFork_1 |          5 |          5 |       0 |    0 |     3 |      0 |      0 |            0 |
|         rspLogic_rspToSyncFiltred_fifo      |              StreamFifo |         13 |         11 |       2 |    0 |    12 |      0 |      0 |            0 |
|     dBusDecoder                             |              BmbDecoder |         47 |         47 |       0 |    0 |     7 |      0 |      0 |            0 |
|     dMemBridge                              |           BmbToLiteDram |        358 |        270 |      88 |    0 |   207 |      4 |      0 |            0 |
|       (dMemBridge)                          |           BmbToLiteDram |         17 |         17 |       0 |    0 |     6 |      0 |      0 |            0 |
|       cmdContext_fifo                       |            StreamFifo_2 |         17 |         17 |       0 |    0 |    12 |      1 |      0 |            0 |
|       io_input_upSizer                      |        BmbUpSizerBridge |        165 |        165 |       0 |    0 |   114 |      0 |      0 |            0 |
|       io_input_upSizer_io_output_unburstify |           BmbUnburstify |         37 |         37 |       0 |    0 |    50 |      0 |      0 |            0 |
|       io_output_rdata_fifo                  |    StreamFifoLowLatency |        102 |         14 |      88 |    0 |    11 |      0 |      0 |            0 |
|       streamFork_4                          |            StreamFork_2 |          4 |          4 |       0 |    0 |     2 |      0 |      0 |            0 |
|       streamFork_4_io_outputs_1_thrown_fifo |            StreamFifo_1 |         16 |         16 |       0 |    0 |    12 |      3 |      0 |            0 |
|     iBusArbiter                             |            BmbArbiter_1 |         50 |         50 |       0 |    0 |     5 |      0 |      0 |            0 |
|       (iBusArbiter)                         |            BmbArbiter_1 |          2 |          2 |       0 |    0 |     0 |      0 |      0 |            0 |
|       memory_arbiter                        |         StreamArbiter_3 |         48 |         48 |       0 |    0 |     5 |      0 |      0 |            0 |
|     iBusDecoder                             |            BmbDecoder_1 |         79 |         79 |       0 |    0 |     6 |      0 |      0 |            0 |
|     iBusDecoder_io_outputs_0_downSizer      |      BmbDownSizerBridge |         70 |         70 |       0 |    0 |    99 |      0 |      0 |            0 |
|     iMemBridge                              |         BmbToLiteDram_1 |        163 |         67 |      96 |    0 |    63 |      0 |      0 |            0 |
|       (iMemBridge)                          |         BmbToLiteDram_1 |         15 |         15 |       0 |    0 |     6 |      0 |      0 |            0 |
|       cmdContext_fifo                       |            StreamFifo_3 |         23 |         15 |       8 |    0 |    17 |      0 |      0 |            0 |
|       io_input_unburstify                   |         BmbUnburstify_1 |         23 |         23 |       0 |    0 |    29 |      0 |      0 |            0 |
|       io_output_rdata_fifo                  | StreamFifoLowLatency__1 |        102 |         14 |      88 |    0 |    11 |      0 |      0 |            0 |
|       streamFork_4                          |            StreamFork_3 |          0 |          0 |       0 |    0 |     0 |      0 |      0 |            0 |
|     peripheralArbiter                       |            BmbArbiter_2 |         65 |         65 |       0 |    0 |     3 |      0 |      0 |            0 |
|       (peripheralArbiter)                   |            BmbArbiter_2 |          1 |          1 |       0 |    0 |     0 |      0 |      0 |            0 |
|       memory_arbiter                        |         StreamArbiter_4 |         64 |         64 |       0 |    0 |     3 |      0 |      0 |            0 |
|     peripheralArbiter_io_output_toWishbone  |           BmbToWishbone |         13 |         13 |       0 |    0 |    72 |      0 |      0 |            0 |
+---------------------------------------------+-------------------------+------------+------------+---------+------+-------+--------+--------+--------------+
* Note: The sum of lower-level cells may be larger than their parent cells total, due to cross-hierarchy LUT combining

Charles Papon

May 3, 2020, 8:52:06 AM
to Linux for LiteX FPGA SoC

Did some tests where the bandwidth is measured. The shell comes up after 2.5 s.
Note that those tests aren't using LiteDRAM; they run in a virtual SoC.

The noise is due to the timer ticks (I filtered the data a bit to reduce it).

bandwidth.png


Also, in the shell, I ran some commands:

At 4.5 s => 3 Dhrystone instances running in parallel
At 5 s => 1 Dhrystone instance running alone

bandwidth2.png





