Superscalar RV32I CPU core: 3.8/3.6 DMIPS/MHz (2/3-stage)


lite RISC

May 23, 2019, 10:49:04 PM
to RISC-V HW Dev

Hello everyone,

  I am very glad to introduce my superscalar RV32I CPU core. It is fully synthesizable and parameterizable. It has an outstanding benchmark score: 3.8/3.6 DMIPS/MHz (2/3-stage).

 In theory, you can configure an arbitrary number of instructions to be executed in the same cycle. The core evaluates the fetched instructions to determine how many can actually be issued. Below is the parallel-issue distribution with the issue width set to 16 (1 iteration of the "CoreMark 1.0" test):

CoreMark 1.0

ticks = 650135  instructions = 716495  I/T = 1.102071

# instructions executed in one tick -- number of ticks -- fraction of all ticks
           0 --      178771 -- 0.274975
           1 --      322573 -- 0.496163
           2 --       69757 -- 0.107296
           3 --       69186 -- 0.106418
           4 --        5334 -- 0.008204
           5 --        2696 -- 0.004147
           6 --        1343 -- 0.002066
           7 --         219 -- 0.000337
           8 --         114 -- 0.000175
           9 --          62 -- 0.000095
          10 --          36 -- 0.000055
          11 --           4 -- 0.000006
          12 --          11 -- 0.000017
          13 --           1 -- 0.000002
          14 --          12 -- 0.000018
          15 --           0 -- 0.000000
          16 --          16 -- 0.000025

 

As you can see, unless the compiler is aware of the parallel-issue capability and schedules code for it, more than 4 ALUs are of little use. On the other hand, the evaluation of the fetched instructions is an iterative chain in which every instruction takes part, so the critical path grows with the issue width. Fortunately, with a width of 3 or 4 the core already reaches its maximum DMIPS/MHz, and the critical path is acceptable (a width of 3 or 4 gives 20 ns (50 MHz) on an Altera DE2-115 FPGA under worst-case conditions).
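
The distribution above can be cross-checked with a little arithmetic. The following is only a sketch in Python (the variable names are illustrative, and the numbers are copied from the table): it recomputes the totals and the share of ticks that issue more than four instructions.

```python
# Issue histogram from the CoreMark run: index = instructions executed in one
# tick, value = number of ticks in which that many instructions were executed.
ticks_per_width = [178771, 322573, 69757, 69186, 5334, 2696, 1343, 219,
                   114, 62, 36, 4, 11, 1, 12, 0, 16]

total_ticks = sum(ticks_per_width)
total_instructions = sum(n * t for n, t in enumerate(ticks_per_width))
instructions_per_tick = total_instructions / total_ticks

# Ticks in which a fifth (or later) ALU would actually be used.
ticks_above_4 = sum(ticks_per_width[5:])
share_above_4 = ticks_above_4 / total_ticks

print(total_ticks, total_instructions, round(instructions_per_tick, 6))
print(ticks_above_4, round(share_above_4, 4))
```

The totals reproduce the reported 650135 ticks and 716495 instructions, and fewer than 1% of the ticks issue more than four instructions, which is consistent with the remark about extra ALUs.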

 

If you are interested in this project, please visit my GitHub page: https://github.com/risclite/SuperScalar-RISCV-CPU

 

Just download the code, set the key parameters, run the simulation, and see what happens.

lkcl

May 24, 2019, 2:31:05 AM
to RISC-V HW Dev


On Friday, May 24, 2019 at 3:49:04 AM UTC+1, lite RISC wrote:

> Hello everyone,
>
> I am very glad to introduce my superscalar RV32I CPU core. It is fully synthesizable and parameterizable. It has an outstanding benchmark score: 3.8/3.6 DMIPS/MHz (2/3-stage).


fascinating, i will study it in detail as i have a keen interest in superscalar and out-of-order designs.  i notice you based the design on syntacore1's work - well done not re-inventing things that you did not have to!  it has allowed you to focus on one area of significant improvement.

> ticks = 650135  instructions = 716495  I/T = 1.102071
>
> # instructions executed in one tick -- number of ticks -- fraction of all ticks
>            0 --      178771 -- 0.274975
>            1 --      322573 -- 0.496163
>            2 --       69757 -- 0.107296
>            3 --       69186 -- 0.106418
>            4 --        5334 -- 0.008204
>            5 --        2696 -- 0.004147
>            6 --        1343 -- 0.002066
>
> As you can see, unless the compiler is aware of the parallel-issue capability and schedules code for it, more than 4 ALUs are of little use. On the other hand, the evaluation of the fetched instructions is an iterative chain in which every instruction takes part, so the critical path grows with the issue width. Fortunately, with a width of 3 or 4 the core already reaches its maximum DMIPS/MHz, and the critical path is acceptable (a width of 3 or 4 gives 20 ns (50 MHz) on an Altera DE2-115 FPGA under worst-case conditions).


interesting.  can you describe in particular what the design does when it encounters read-after-write, write-after-read and write-after-write register dependencies, how it detects them, and what the core does when they're detected?

are you planning to add what is known as "register renaming", to be able to avoid some of these hazards and get better performance?  register renaming, i gather, is what allows the parallelism to avoid the limitations of the compiler not knowing about the internal parallelism.

also, will you be adding support for RVC at some point?  our team will be doing a multi-issue superscalar OoO design, and mixed 16/32-bit instructions in the queue makes for some fun computer science design issues to solve :)

l.

lite RISC

May 24, 2019, 4:12:13 AM
to RISC-V HW Dev
Hi, lkcl,
   An instruction can be thought of as a kind of function: Rd = F(Rs0, Rs1).
   Rs0/Rs1 are checked against a register list that contains the registers which are the Rd of earlier instructions. If there is a match, the instruction will not be issued.
   Rd is also checked against a register list, which contains the registers that must not be touched.
   These two register lists are built from all earlier instructions. If Rs0/Rs1/Rd are not in the lists, the instruction is good for execution. The instruction then contributes its own registers to the lists.

   I don't think "register renaming" will solve anything. Some register hazards are local and do not involve many instructions. Because there are 31 registers, the compiler will not always favor one register and ignore the others. Why not just accept that and add more ALUs? If you are stalled by some hazard, a CPU with more ALUs can catch up as time goes by.

   I did add the RVC instruction set to my design. It works, but it leads to lower speed, so I had to abandon it to make the core run fast. Maybe I will add it next time.


On Friday, May 24, 2019 at 2:31:05 PM UTC+8, lkcl wrote:

lkcl

May 24, 2019, 5:03:23 AM
to RISC-V HW Dev


On Friday, May 24, 2019 at 9:12:13 AM UTC+1, lite RISC wrote:
> Hi, lkcl,
>    An instruction can be thought of as a kind of function: Rd = F(Rs0, Rs1).
>    Rs0/Rs1 are checked against a register list that contains the registers which are the Rd of earlier instructions. If there is a match, the instruction will not be issued.
>    Rd is also checked against a register list, which contains the registers that must not be touched.

ah that is funny, because i was considering exactly this technique, to augment the scoreboard design.  basically, what you are doing is, detecting (in advance, before issue) that there will be no clashes, either on the source or the destination registers.

do you have a system for ensuring that a register that is to be written to is not allowed to be read?

i.e., if the flag is raised in the Rdest register list, does it stop an instruction from being issued if it is in the Rs0/Rs1 of an instruction?


>    I don't think "register renaming" will solve anything.

it's... particularly complex to explain: i've spent the past... errr.... six months getting to understand it.

register renaming basically avoids the hazards above, allowing parallel ALUs to proceed *even* when some registers have the exact same names.  

detecting the opportunities is, to be honest, a pig to describe, although the actual logic, once implemented, is surprisingly (even confusingly) simple, and may be done in a near-combinatorial fashion.
 
> Some register hazards are local and do not involve many instructions. Because there are 31 registers, the compiler will not always favor one register and ignore the others. Why not just accept that and add more ALUs?

i'm adding more ALUs *and* adding register-renaming.  it's done implicitly, as part of the 6600 scoreboard design.  it's not obvious - at all - that the 6600-style scoreboard even *provides* register-renaming (automatically).

> If you are stalled by some hazard, a CPU with more ALUs can catch up as time goes by.
>
>    I did add the RVC instruction set to my design. It works, but it leads to lower speed.

that's interesting, in itself, it sounds... anomalous.  do you mean that the FPGA runs slower, or do you mean that the SpecMark performance is lower?


> I had to abandon it to make the core run fast. Maybe I will add it next time.

look forward to seeing what happens.  do you have the version that you abandoned?  (i always make sure that all commits are public, so that it is possible to "go back in time").

l.

lite RISC

May 24, 2019, 5:26:59 AM
to RISC-V HW Dev
Sorry for my poor English. I will show you the code:

        rglist_in[0]     = init_rglist;
        rglist_out[0]    = init_rglist;
  These are the two register lists, for sources and destinations. Their initial values are the Rds of memory load operations.

        assign instr_rg_hit[i] = ( |(rglist_in[i]&(((1'b1<<instr_rs0[i])|(1'b1<<instr_rs1[i]))>>1)) )|( |(rglist_out[i]&((1'b1<<instr_rd[i])>>1)) );

  This checks whether the sources/destination of the current instruction match the lists.

      If the current instruction is about to be suspended:
            rglist_in[i+1]    = rglist_in[i]|( (1'b1<<instr_rd[i])>>1 );
            rglist_out[i+1]   = rglist_out[i]|( ((1'b1<<instr_rd[i])|(1'b1<<instr_rs1[i])|(1'b1<<instr_rs0[i]))>>1  );
      If the current instruction is about to be executed:
            rglist_in[i+1]    = rglist_in[i]|( (1'b1<<instr_rd[i])>>1 );
            rglist_out[i+1]   = rglist_out[i]|( (1'b1<<instr_rd[i])>>1 );

      The difference is that if an instruction is about to be executed, its sources will be read right away, so later instructions do not need to avoid writing them.
 
      That is simple. Hope it helps you.
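
To make the data flow of the snippets above easier to follow, here is a small behavioral sketch in Python. It is not the author's RTL: the function names `unary` and `evaluate`, the tuple layout `(rd, rs0, rs1)`, and the assumption that an instruction is executed exactly when `instr_rg_hit` is 0 are all illustrative.

```python
def unary(reg):
    """One-hot mask over x1..x31; x0 maps to 0, mirroring (1'b1 << reg) >> 1."""
    return (1 << reg) >> 1


def evaluate(instrs, init_rglist=0):
    """Walk the fetched instructions in program order within one tick.

    instrs: list of (rd, rs0, rs1) register numbers.
    Returns "execute" or "suspend" for each instruction.
    """
    rglist_in = init_rglist    # registers that must not be read
    rglist_out = init_rglist   # registers that must not be written
    decisions = []
    for rd, rs0, rs1 in instrs:
        hit = bool(rglist_in & (unary(rs0) | unary(rs1))) or bool(rglist_out & unary(rd))
        decisions.append("suspend" if hit else "execute")
        # In both cases the destination becomes unreadable and unwritable.
        rglist_in |= unary(rd)
        rglist_out |= unary(rd)
        if hit:
            # A suspended instruction still has to read its sources later,
            # so younger instructions must not overwrite them.
            rglist_out |= unary(rs0) | unary(rs1)
    return decisions


decisions = evaluate([
    (5, 1, 2),    # x5 = f(x1, x2): no conflict
    (6, 5, 3),    # reads x5, written just above: suspended
    (3, 7, 8),    # writes x3, still needed by the suspended instruction: suspended
    (9, 10, 11),  # independent: executed
])
print(decisions)
```

In this sketch the second instruction is held back by a read-after-write conflict and the third by a write-after-read conflict, while the fourth is free to go ahead.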

On Friday, May 24, 2019 at 10:49:04 AM UTC+8, lite RISC wrote:

Andrew Waterman

May 24, 2019, 5:43:24 AM
to lite RISC, RISC-V HW Dev
On Thu, May 23, 2019 at 9:49 PM lite RISC <risc...@gmail.com> wrote:
>
> Hello everyone,
>
> I am very glad to introduce my super-scalar RV32I cpu core. It is synthesizable and parameterizable fully. It has outstanding benchmark score: 3.8/3.6 DMIPS/MHz(2/3 stage).

Nice work!

3.8 DMIPS/MHz is wide OOO superscalar territory. Are you sure you're
following the benchmarking rules (i.e. no inlining)?
> --
> You received this message because you are subscribed to the Google Groups "RISC-V HW Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to hw-dev+un...@groups.riscv.org.
> To post to this group, send email to hw-...@groups.riscv.org.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/hw-dev/.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/hw-dev/377fca2e-394d-4206-a138-1e751d282a53%40groups.riscv.org.

Marko Zec

May 24, 2019, 6:03:06 AM
to Andrew Waterman, lite RISC, RISC-V HW Dev
On Fri, 24 May 2019 04:43:08 -0500
Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

> On Thu, May 23, 2019 at 9:49 PM lite RISC <risc...@gmail.com> wrote:
> >
> > Hello everyone,
> >
> > I am very glad to introduce my super-scalar RV32I cpu core. It is
> > synthesizable and parameterizable fully. It has outstanding
> > benchmark score: 3.8/3.6 DMIPS/MHz(2/3 stage).
>
> Nice work!
>
> 3.8 DMIPS/MHz is wide OOO superscalar territory. Are you sure you're
> following the benchmarking rules (i.e. no inlining)?

The web page reports 2.14/1.95 (2/3-stage) with -O3 -fno-inline, but
quite possibly even that may not be enough to be fully compliant.

To entirely prevent all inlining (with gcc-7 at least) I had to add
-fno-inline-small-functions -finline-limit=0

to be able to find all the calls to dhrystone's Procs and Funcs really
in place where they should be when looking in the objdump.

Marko

lkcl

May 24, 2019, 6:06:34 AM
to RISC-V HW Dev


On Friday, May 24, 2019 at 10:26:59 AM UTC+1, lite RISC wrote:
> Sorry for my poor English.

no problem.  you're doing fine.  clarity is always needed no matter what the language :)
 
> I will show you the code:
>
>         rglist_in[0]     = init_rglist;
>         rglist_out[0]    = init_rglist;
>   These are the two register lists, for sources and destinations. Their initial values are the Rds of memory load operations.
>
>         assign instr_rg_hit[i] = ( |(rglist_in[i]&(((1'b1<<instr_rs0[i])|(1'b1<<instr_rs1[i]))>>1)) )|( |(rglist_out[i]&((1'b1<<instr_rd[i])>>1)) );
   

oof.  that's far too densely packed to be human-readable.  let me try to line things up, enable fixed-width fonts, and create some additional temporary vars.

assign rs0_unary = (1'b1 << instr_rs0[i]) >> 1;
assign rs1_unary = (1'b1 << instr_rs1[i]) >> 1;
assign rd_unary  = (1'b1 << instr_rd[i])  >> 1;


assign instr_rg_hit[i] = ( |(rglist_in[i]  & (rs0_unary | rs1_unary)) ) |
                         ( |(rglist_out[i] & rd_unary) );


so, noting the | operator in front, which ORs all bits together to produce a single bit.

ok so the shift down by 1 is presumably because there's no interest in register zero (rs/rd==0) ? 

and, instr_rg_hit, rglist_in and rglist_out are a 2D array, 1st (outer) dimension is the number of multi-issue instructions, 2nd dimension is the number of registers?

so, in english:

a hit per register is true if the UNARY version of rs0 or rs1 match the "rglist_in" row for this to-be-issued-instruction, or the UNARY version of rd matches the "rglist_out" row for this to-be-issued-instruction.
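
The two building blocks in that reading, the shifted one-hot ("unary") encoding and the reduction OR, can be tried out in isolation. This is only an illustrative Python sketch; `unary` and `reduce_or` are made-up helper names, and the 31-bit list width is an assumption based on registers x1..x31.

```python
def unary(reg):
    """One-hot mask: x1 -> bit 0 ... x31 -> bit 30; x0 -> 0 (never a hazard)."""
    return (1 << reg) >> 1


def reduce_or(value):
    """Equivalent of Verilog's unary | operator: 1 if any bit is set."""
    return int(value != 0)


# Suppose x5 and x10 are destinations of earlier instructions.
rglist_in = unary(5) | unary(10)

hit_x5 = reduce_or(rglist_in & (unary(5) | unary(0)))   # reads x5 and x0
hit_x0 = reduce_or(rglist_in & (unary(0) | unary(0)))   # reads only x0

print(bin(rglist_in), hit_x5, hit_x0)
```

Because x0 maps to an all-zero mask, an instruction that only touches x0 can never produce a hit, which matches the guess about the shift down by 1.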


>   This checks whether the sources/destination of the current instruction match the lists.
>
>       If the current instruction is about to be suspended:

suspended... do you mean stalled or do you mean "retired" (as in, "end execution")?  or, does "suspended" mean something else?
 
>             rglist_in[i+1]    = rglist_in[i]|( (1'b1<<instr_rd[i])>>1 );
>             rglist_out[i+1]   = rglist_out[i]|( ((1'b1<<instr_rd[i])|(1'b1<<instr_rs1[i])|(1'b1<<instr_rs0[i]))>>1  );
 
so, re-formatting these, and applying fixed-width font:

rglist_in[i+1]  = rglist_in[i]  | rd_unary;
rglist_out[i+1] = rglist_out[i] | ( rd_unary | rs1_unary | rs0_unary );

ok!  so that's quite clear, the rglist_in accumulates (for each instruction [i]) the destination registers, and rglist_out accumulates the source *and* dest registers used in this particular (multi-issue) clock cycle.


>       If the current instruction is about to be executed:
>             rglist_in[i+1]    = rglist_in[i]|( (1'b1<<instr_rd[i])>>1 );
>             rglist_out[i+1]   = rglist_out[i]|( (1'b1<<instr_rd[i])>>1 );

and, reformatting these (using the same temporaries above)

rglist_in[i+1]  = rglist_in[i]  | rd_unary;
rglist_out[i+1] = rglist_out[i] | rd_unary;

so that's also clear.
 
       
>       The difference is that if an instruction is about to be executed, its sources will be read right away, so later instructions do not need to avoid writing them.

i don't quite follow because i don't understand what "suspended" means (it might mean "stalled").

if an instruction is not to be "stalled" (i.e. it is to be executed in *this* cycle), then it is just about to read its source registers, therefore it should not cause a block on the i+1, i+2 etc. instructions.

is that correct?


i have one additional question: are the rglist_in and rglist_out carried over onto the next clock cycle, and are the entries *cleared out* when the instruction finishes?

i would expect that rglist_in[0] and rglist_out[0] would start from the state of the previous clock cycle, is that correct?

thank you for your patience, this is fascinating to understand.

l.

Jacob Lifshay

May 24, 2019, 7:40:05 AM
to lite RISC, RISC-V HW Dev
On Fri, May 24, 2019, 01:12 lite RISC <risc...@gmail.com> wrote:
> Hi, lkcl,
>    An instruction can be thought of as a kind of function: Rd = F(Rs0, Rs1).
>    Rs0/Rs1 are checked against a register list that contains the registers which are the Rd of earlier instructions. If there is a match, the instruction will not be issued.
>    Rd is also checked against a register list, which contains the registers that must not be touched.
>    These two register lists are built from all earlier instructions. If Rs0/Rs1/Rd are not in the lists, the instruction is good for execution. The instruction then contributes its own registers to the lists.
>
>    I don't think "register renaming" will solve anything. Some register hazards are local and do not involve many instructions. Because there are 31 registers, the compiler will not always favor one register and ignore the others. Why not just accept that and add more ALUs? If you are stalled by some hazard, a CPU with more ALUs can catch up as time goes by.

If register renaming is combined with branch prediction and speculative execution, then the processor can be much faster, such as in the following example:

notice how the processor does the equivalent of loop unrolling as it executes the code. that allows it to achieve 5 instructions/cycle for a larger number of loop iterations.

I use r15 instead of a5 or x12 to emphasize that the result of renaming are a different set of registers that can be much larger than the 31 registers that the compiler can specify.


// void func(uint64_t *ptr, uint64_t size)
func:
    // ptr passed in a0
    // size passed in a1
    slli a1, a1, 3
    add a1, a0, a1 // create pointer to end of array
    li a3, 123

loop:
    ld a2, (a0)
    mul a2, a2, a3
    sd a2, (a0)
    addi a0, a0, 8
    bne a0, a1, loop
    ret

assuming all branches are correctly predicted, then the instructions can be run as follows:

starting with a0 renamed to r00 and a1 renamed to r01:

clock cycle 1:
issue #01: "slli a1, a1, 3"      renamed to "slli r02, r01, 3"
issue #02: "add a1, a0, a1"      renamed to "add r03, r00, r02"
issue #03: "li a3, 123"          renamed to "li r04, 123"
issue #04: "ld a2, (a0)"         renamed to "ld r05, (r00)"
issue #05: "mul a2, a2, a3"      renamed to "mul r06, r05, r04"
issue #06: "sd a2, (a0)"         renamed to "sd r06, (r00)"
issue #07: "addi a0, a0, 8"      renamed to "addi r07, r00, 8"
issue #08: "bne a0, a1, loop"    renamed to "bne r07, r03, loop"

clock cycle 2:
execute #01: slli r02, r01, 3
execute #03: li r04, 123
execute #04: ld r05, (r00)
execute #07: addi r07, r00, 8
issue #09: "ld a2, (a0)"         renamed to "ld r08, (r07)"
issue #10: "mul a2, a2, a3"      renamed to "mul r09, r08, r04"
issue #11: "sd a2, (a0)"         renamed to "sd r09, (r07)"
issue #12: "addi a0, a0, 8"      renamed to "addi r10, r07, 8"
issue #13: "bne a0, a1, loop"    renamed to "bne r10, r03, loop"

clock cycle 3:
retire #01: slli r02, r01, 3
execute #02: add r03, r00, r02
execute #05: mul r06, r05, r04
execute #09: ld r08, (r07)
execute #12: addi r10, r07, 8
issue #14: "ld a2, (a0)"         renamed to "ld r11, (r10)"
issue #15: "mul a2, a2, a3"      renamed to "mul r12, r11, r04"
issue #16: "sd a2, (a0)"         renamed to "sd r12, (r10)"
issue #17: "addi a0, a0, 8"      renamed to "addi r13, r10, 8"
issue #18: "bne a0, a1, loop"    renamed to "bne r13, r03, loop"

clock cycle 4:
retire #02: add r03, r00, r02
retire #03: li r04, 123
retire #04: ld r05, (r00)
retire #05: mul r06, r05, r04
execute #06: sd r06, (r00)
execute #08: bne r07, r03, loop
execute #10: mul r09, r08, r04
execute #13: bne r10, r03, loop
execute #14: ld r11, (r10)
execute #17: addi r13, r10, 8
issue #19: "ld a2, (a0)"         renamed to "ld r14, (r13)"
issue #20: "mul a2, a2, a3"      renamed to "mul r15, r14, r04"
issue #21: "sd a2, (a0)"         renamed to "sd r15, (r13)"
issue #22: "addi a0, a0, 8"      renamed to "addi r16, r13, 8"
issue #23: "bne a0, a1, loop"    renamed to "bne r16, r03, loop"

clock cycle 5:
retire #06: sd r06, (r00)
retire #07: addi r07, r00, 8
retire #08: bne r07, r03, loop
retire #09: ld r08, (r07)
retire #10: mul r09, r08, r04
execute #11: sd r09, (r07)
execute #15: mul r12, r11, r04
execute #18: bne r13, r03, loop
execute #19: ld r14, (r13)
execute #22: addi r16, r13, 8
issue #24: "ld a2, (a0)"         renamed to "ld r17, (r16)"
issue #25: "mul a2, a2, a3"      renamed to "mul r18, r17, r04"
issue #26: "sd a2, (a0)"         renamed to "sd r18, (r16)"
issue #27: "addi a0, a0, 8"      renamed to "addi r19, r16, 8"
issue #28: "bne a0, a1, loop"    renamed to "bne r19, r03, loop"

clock cycle 6:
retire #11: sd r09, (r07)
retire #12: addi r10, r07, 8
retire #13: bne r10, r03, loop
retire #14: ld r11, (r10)
retire #15: mul r12, r11, r04
execute #16: sd r12, (r10)
execute #20: mul r15, r14, r04
execute #23: bne r16, r03, loop
execute #24: ld r17, (r16)
execute #27: addi r19, r16, 8
issue #29: "ld a2, (a0)"         renamed to "ld r20, (r19)"
issue #30: "mul a2, a2, a3"      renamed to "mul r21, r20, r04"
issue #31: "sd a2, (a0)"         renamed to "sd r21, (r19)"
issue #32: "addi a0, a0, 8"      renamed to "addi r22, r19, 8"
issue #33: "bne a0, a1, loop"    renamed to "bne r22, r03, loop"

clock cycle 7:
retire #16: sd r12, (r10)
retire #17: addi r13, r10, 8
retire #18: bne r13, r03, loop
retire #19: ld r14, (r13)
retire #20: mul r15, r14, r04
execute #21: sd r15, (r13)
execute #25: mul r18, r17, r04
execute #28: bne r19, r03, loop
execute #29: ld r20, (r19)
execute #32: addi r22, r19, 8
issue #34: "ret"                 renamed to "ret"

clock cycle 8:
retire #21: sd r15, (r13)
retire #22: addi r16, r13, 8
retire #23: bne r16, r03, loop
retire #24: ld r17, (r16)
retire #25: mul r18, r17, r04
execute #26: sd r18, (r16)
execute #30: mul r21, r20, r04
execute #33: bne r22, r03, loop
execute #34: ret

clock cycle 9:
retire #26: sd r18, (r16)
retire #27: addi r19, r16, 8
retire #28: bne r19, r03, loop
retire #29: ld r20, (r19)
retire #30: mul r21, r20, r04
execute #31: sd r21, (r19)

clock cycle 10:
retire #31: sd r21, (r19)
retire #32: addi r22, r19, 8
retire #33: bne r22, r03, loop
retire #34: ret
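
The renaming column of the trace above follows a simple rule: sources are looked up in the current map, then each destination gets a fresh physical register. A minimal sketch of just that rule, in Python, is given below. It ignores free lists, retirement, and misprediction recovery, and the names `rename` and `initial_map` are illustrative rather than taken from any real design.

```python
def rename(instrs, initial_map):
    """Rename architectural registers to ever-increasing physical registers.

    instrs: list of (op, dest, sources) with dest = None for stores/branches.
    initial_map: architectural name -> physical register number.
    """
    mapping = dict(initial_map)
    next_phys = max(mapping.values()) + 1 if mapping else 0
    renamed = []
    for op, dest, sources in instrs:
        # Sources are read from the map *before* the destination is updated.
        phys_sources = [mapping[s] for s in sources]
        phys_dest = None
        if dest is not None:
            phys_dest = next_phys
            next_phys += 1
            mapping[dest] = phys_dest
        renamed.append((op, phys_dest, phys_sources))
    return renamed


# Prologue plus the first loop iteration of func.
program = [
    ("slli", "a1", ["a1"]),
    ("add",  "a1", ["a0", "a1"]),
    ("li",   "a3", []),
    ("ld",   "a2", ["a0"]),
    ("mul",  "a2", ["a2", "a3"]),
    ("sd",   None, ["a2", "a0"]),
    ("addi", "a0", ["a0"]),
    ("bne",  None, ["a0", "a1"]),
]
renamed = rename(program, {"a0": 0, "a1": 1})
for op, dest, sources in renamed:
    print(op, dest, sources)
```

With a0 and a1 starting as r00 and r01, this yields the same physical register numbers as issue #01 to #08 in clock cycle 1.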

lkcl

May 27, 2019, 9:34:53 AM
to RISC-V HW Dev
Hello,

I received your private response, and I replied that I prefer to keep to public discussions. I then did not receive a response from you.

The reason for keeping the discussion public is so that others may learn from the discussion, and others may verify that the insights and evaluations are correct.

For example, in the private reply (that people have not seen so I now have to describe it), you mentioned that the instructions which do not match the dependencies are delayed until the next clock cycle.

This would seem to suggest that the design is executing instructions out of order, and if an interrupt were to occur it could potentially result in data corruption.

Can you please post that private reply publicly so that this can be evaluated by people with more experience?

You may find the following reply by Mitch Alsup useful:

https://groups.google.com/d/msg/comp.arch/LXWtd1L9JoY/7P7yifihBQAJ

The algorithm that Mitch describes is extremely similar to the one that you are using.

However, it works because there is a FULL proper RaW/WaR/WaW dependency system already in place, that is capable of precise instruction order preservation, preventing writes from taking place out of order as well as ensuring that register results are used only when they are ready.

Question: how does the design that you have written deal with interrupts, and how does it guarantee instruction order?

L.

lite RISC

May 27, 2019, 10:21:54 AM
to lkcl, RISC-V HW Dev
Hi, lkcl,
   I have a dedicated area called "QUEUE", which holds instructions that have not been executed yet but are treated as done. When the "QUEUE" area is full, the CPU has to run in order and no instruction can be skipped. So, when "QUEUE" is full, the CPU just loses its out-of-order ability, but it keeps running. When an interrupt arrives, the CPU can fetch the instructions of the trap entry and execute them one by one, in order. Alternatively, we could perhaps provide another, empty "QUEUE" area dedicated to interrupts. Only when this dedicated "QUEUE" is empty can we say that interrupt servicing is over.
   I think the "QUEUE" area is a simple and efficient way to do out-of-order execution: just put aside the instructions stalled by a data hazard and let the instructions behind them be evaluated. If some instructions are stuck for a long time and fill the "QUEUE" area, the CPU never dies; it simply falls back to in-order operation.
   I thought I had replied to the forum publicly, but I had not.
   Just remind me if my answer is not clear. I try to explain clearly because I have run into these problems in my design work and am glad to share them.


On Monday, May 27, 2019 at 9:34 PM, lkcl <luke.l...@gmail.com> wrote:

lkcl

May 27, 2019, 10:39:07 AM
to RISC-V HW Dev, luke.l...@gmail.com


On Monday, May 27, 2019 at 3:21:54 PM UTC+1, lite RISC wrote:
> Hi, lkcl,
>    I have a dedicated area called "QUEUE", which holds instructions that have not been executed yet but are treated as done. When the "QUEUE" area is full, the CPU has to run in order and no instruction can be skipped.

ah super, that would do it!  as long as the writing to registers (and memory?) is done in-order, they can still be *executed* out-of-order.  and as long as the dependencies for read and write are also respected (which the rglist_in/out provide), then register file corruption cannot occur.
 
> So, when "QUEUE" is full, the CPU just loses its out-of-order ability, but it keeps running. When an interrupt arrives, the CPU can fetch the instructions of the trap entry and execute them one by one, in order. Alternatively, we could perhaps provide another, empty "QUEUE" area dedicated to interrupts. Only when this dedicated "QUEUE" is empty can we say that interrupt servicing is over.

the normal way to do this, to get good interrupt latency, is to cancel any pending instructions (throw away the contents of "QUEUE"), start executing the instructions from the interrupt, and once the interrupt is over, re-issue the instructions that were previously thrown away.

a QUEUE for interrupts has the problem that interrupts can be themselves interrupted, so now you have to have dedicated QUEUEs per interrupt level, and it gets complicated very very quickly.

>   I think the "QUEUE" area is a simple and efficient way to do out-of-order execution: just put aside the instructions stalled by a data hazard and let the instructions behind them be evaluated. If some instructions are stuck for a long time and fill the "QUEUE" area, the CPU never dies; it simply falls back to in-order operation.
>   I thought I had replied to the forum publicly, but I had not.

i have done that many times... :)
 
>  Just remind me if my answer is not clear. I try to explain clearly because I have run into these problems in my design work and am glad to share them.

it's really appreciated.

l.

lite RISC

May 27, 2019, 10:47:54 AM
to lkcl, RISC-V HW Dev
Hi, lkcl,
   Data memory operations are forced to be in order, which helps me avoid memory hazards. Allocation of memory operations is a resource-limited process because the "Memory buffer" is not very long. When a memory instruction stalled by a hazard goes into the "QUEUE" area, the CPU declares that the "Memory buffer" is full. That is not true; it is just a way to force in-order behavior. Memory instructions behind it will not skip the memory instruction stored in the "QUEUE" area. So memory operations are in order.
   As for register-register hazards, I have introduced the register-list method. Every instruction is given a register list that tells it which registers must not appear among its Rd, Rs1 and Rs0. If the instruction has a match, it goes into the "QUEUE" area to be evaluated again in the next tick. In the next tick it has priority over the newly fetched instructions. When the "QUEUE" area is full and a new instruction happens to be stalled, the instruction buffer freezes: we cannot put the new instruction aside, and the instructions behind it are not allowed to be evaluated, so freezing the instruction buffer is the only option. If an interrupt arrives in this tick, the newly stalled instruction is the breakpoint, not the instructions stored in "QUEUE", because instructions in "QUEUE" are treated as "DONE": instructions behind them have already been executed and can never be revoked.
   That is how I deal with hazards.
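
The interplay between the register lists, the "QUEUE" area and the freeze can be sketched as a toy scheduler. This is only a Python model under simplifying assumptions that are not from the thread: every executed instruction completes within its tick, memory ordering and the "Memory buffer" are ignored, and the names `run`, `width` and `queue_size` are hypothetical.

```python
from collections import deque


def unary(reg):
    return (1 << reg) >> 1   # x0 -> 0, x1 -> bit 0, ...


def hazard(instr, rl_in, rl_out):
    rd, rs0, rs1 = instr
    return bool(rl_in & (unary(rs0) | unary(rs1)) or rl_out & unary(rd))


def update(instr, hit, rl_in, rl_out):
    rd, rs0, rs1 = instr
    rl_in |= unary(rd)
    rl_out |= unary(rd)
    if hit:                                  # stalled: protect its sources too
        rl_out |= unary(rs0) | unary(rs1)
    return rl_in, rl_out


def run(program, width=4, queue_size=2):
    """Return, per tick, the indices of the instructions executed in it."""
    pending = deque(range(len(program)))     # instruction buffer, program order
    queue, schedule = [], []
    while pending or queue:
        rl_in = rl_out = 0
        executed, next_queue = [], []
        for idx in queue:                    # queued instructions go first
            hit = hazard(program[idx], rl_in, rl_out)
            (next_queue if hit else executed).append(idx)
            rl_in, rl_out = update(program[idx], hit, rl_in, rl_out)
        for _ in range(width):               # then newly fetched ones
            if not pending:
                break
            idx = pending[0]
            hit = hazard(program[idx], rl_in, rl_out)
            if hit and len(next_queue) >= queue_size:
                break                        # QUEUE full: freeze the buffer
            pending.popleft()
            (next_queue if hit else executed).append(idx)
            rl_in, rl_out = update(program[idx], hit, rl_in, rl_out)
        queue = next_queue
        schedule.append(executed)
    return schedule


# (rd, rs0, rs1): a dependency chain x5 -> x6 -> x7 plus two independent ops.
program = [(5, 1, 2), (6, 5, 3), (7, 6, 4), (8, 1, 2), (9, 1, 2)]
print(run(program, queue_size=1))
print(run(program, queue_size=2))
```

With a one-entry queue the second stalled instruction freezes the buffer, so the independent instructions wait a tick; with two entries one of them slips past the stalled chain in the first tick.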


On Monday, May 27, 2019 at 10:21 PM, lite RISC <risc...@gmail.com> wrote:

Marc Gauthier

May 27, 2019, 6:07:12 PM
to lite RISC, lkcl, RISC-V HW Dev
Are queued instructions guaranteed never to raise an exception?

-M

lite RISC

May 27, 2019, 6:59:50 PM
to Marc Gauthier, lkcl, RISC-V HW Dev
Before an instruction is assigned to the "QUEUE" or "EXEC" area, it must first of all be legal. If there is something wrong with it, that takes effect immediately; otherwise, the instruction is treated as an executed instruction, even though it takes effect in a later cycle.

On Tuesday, May 28, 2019 at 6:07 AM, Marc Gauthier <consu...@gstardust.com> wrote:

lite RISC

Jun 2, 2019, 10:53:16 PM
to RISC-V HW Dev
Hi,  I have added RV32M and RV32C support, and some benchmark scores have changed. 


On Friday, May 24, 2019 at 10:49:04 AM UTC+8, lite RISC wrote:

> Hello everyone,

lkcl

Jun 3, 2019, 9:59:01 PM
to RISC-V HW Dev
Fantastic to hear. How did the checking go, on making sure the compiler optimisations were definitely disabled?

Also, earlier, I meant exceptions not interrupts.

In the Libre RISCV SoC which is a full OoO design, we use something known as "shadowing", a matrix of DFFs that ensure that any instruction that might throw an exception will actively prevent ALL newer instructions from committing their results, whether CSRs, STs or REG File.

The shadow is released the moment the instruction knows it will never raise the exception.
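
As a rough illustration of the shadowing idea, here is a toy bookkeeping model in Python. It uses sets where the hardware would use a matrix of flip-flops, and the class and method names are invented for this sketch, not taken from the Libre RISCV SoC sources.

```python
class ShadowTracker:
    """Unresolved instructions cast a shadow over every newer instruction."""

    def __init__(self):
        self.unresolved = []    # tags that may still raise, oldest first
        self.shadowed_by = {}   # tag -> set of older unresolved tags

    def issue(self, tag, may_raise):
        self.shadowed_by[tag] = set(self.unresolved)
        if may_raise:
            self.unresolved.append(tag)

    def resolve(self, tag):
        """The instruction now knows it will never raise: drop its shadow."""
        self.unresolved.remove(tag)
        for shadows in self.shadowed_by.values():
            shadows.discard(tag)

    def can_commit(self, tag):
        return tag not in self.unresolved and not self.shadowed_by[tag]

    def raise_exception(self, tag):
        """Return the newer instructions that must be cancelled."""
        return [t for t, shadows in self.shadowed_by.items() if tag in shadows]


tracker = ShadowTracker()
tracker.issue("ld", may_raise=True)
tracker.issue("add", may_raise=False)
tracker.issue("sd", may_raise=True)
tracker.issue("sub", may_raise=False)

blocked_before = [tracker.can_commit(t) for t in ("add", "sub")]
would_cancel = tracker.raise_exception("ld")
tracker.resolve("ld")
after_ld = [tracker.can_commit(t) for t in ("add", "sub")]
print(blocked_before, would_cancel, after_ld)
```

Here "add" may commit as soon as the load is known to be safe, while "sub" stays blocked until the store behind the load has resolved as well.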

Interrupts - actual execution of them - are effectively just a change of CSR and a new PC. ok plus a bit more, but not a lot.

Sorry for confusion.

So Marc's question is interesting to me.

I know that by only doing single issue LD/ST you have avoided the need to deal with LD/ST exceptions. Are there any other places where exceptions are raised?

If not, you have managed to avoid the problem :)

lite RISC

Jun 3, 2019, 11:12:25 PM
to lkcl, RISC-V HW Dev
Yes, you are right.
For LD/ST instructions, I have a module called "MEMBUF", which collects them and issues them one by one. If an exception is raised on some LD/ST instruction (for example, dmem_resp reports an error), the instructions behind it have already been issued, so the breakpoint is not this LD/ST instruction but some instruction after it.
There are two solutions. One is to make sure that no instruction behind a LD/ST instruction can be executed until that LD/ST instruction has finished correctly. The whole instruction pipeline then stalls whenever it encounters a LD/ST instruction.
The other is for the exception handler to accept that the breakpoint is not the exact point of the exception. It reads another CSR to find out which LD/ST instruction caused it. It then either makes sure this LD/ST instruction is handled properly and continues, or quits at this breakpoint.
I do not know which one is better. It is a good problem to solve.
Thank you!

On Tuesday, June 4, 2019 at 9:59 AM, lkcl <luke.l...@gmail.com> wrote:

lite RISC

Jun 3, 2019, 11:26:24 PM
to lkcl, RISC-V HW Dev
Another way is for "MEMBUF" to keep the pc of every LD/ST instruction. Any instruction that raises an exception then reports its pc as the breakpoint. However, the instructions after the faulting one have already been executed, so any exception handler has to accept that and act on that basis.
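
A minimal sketch of that idea, assuming "MEMBUF" behaves like a FIFO and that the bus answers accesses in issue order (both assumptions, as are the Python names below):

```python
from collections import deque


class MemBuf:
    """In-order buffer of LD/ST accesses that remembers each access's pc."""

    def __init__(self):
        self.pending = deque()   # (pc, kind) in program order

    def push(self, pc, kind):
        self.pending.append((pc, kind))

    def complete(self, resp_ok):
        """Consume the bus response for the oldest access.

        Returns None on success, or the pc to report as the breakpoint.
        """
        pc, _kind = self.pending.popleft()
        return None if resp_ok else pc


membuf = MemBuf()
membuf.push(0x100, "ld")
membuf.push(0x108, "st")
first = membuf.complete(resp_ok=True)     # load finishes normally
second = membuf.complete(resp_ok=False)   # store gets an error response
print(first, hex(second))
```

The reported pc identifies the faulting access, but, as noted above, it says nothing about which younger instructions have already updated the register file.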

On Tuesday, June 4, 2019 at 11:10 AM, lite RISC <risc...@gmail.com> wrote:

Daniel Petrisko

Jun 3, 2019, 11:38:01 PM
to lite RISC, lkcl, RISC-V HW Dev
Hi,

I’m curious about this exception mechanism. 

> Before an instruction is assigned to the "QUEUE" or "EXEC" area, it must first of all be legal

How do you enforce the ordering for illegal instruction exceptions? Presumably you need to drain the in flight instructions and then take the exception. 

That handles the case where the instruction is obviously illegal, e.g. all zeros or multiplication when you don’t support it. But what about something like an emulated CSR access? Do you do the CSR decode before issuing to the QUEUE area as well?

Best,
Dan Petrisko

lite RISC

Jun 3, 2019, 11:49:01 PM
to Daniel Petrisko, lkcl, RISC-V HW Dev
Hi, only "ALU" and "MEM" instructions can be out-of-order. Others, such as mult, CSR, jump, conditional jump, fence and system instructions, should be forced to be in order. A good method is to dedicate the last ALU unit to these odd instructions. When one of these instructions is issued for execution, the instructions after it are not queued or executed. So, when it raises an exception, it is dealt with as in other CPU cores.

On Tue, Jun 4, 2019 at 11:37 AM Daniel Petrisko <petr...@cs.washington.edu> wrote:

高野茂幸

Jun 4, 2019, 2:19:58 AM
to lite RISC, Daniel Petrisko, RISC-V HW Dev, lkcl
Hi,

Then how do you handle dependencies on the register file shared between the out-of-order ALUs, and also the CSRs?

Best Regards,
T

On Tue, Jun 4, 2019 at 12:49 lite RISC <risc...@gmail.com> wrote:

lite RISC

Jun 4, 2019, 4:03:57 AM
to lkcl, RISC-V HW Dev
Hi, All,

 Instructions are divided into two types: common (ALU and memory operations) and uncommon (CSR, mul, div, jump, conditional jump, system, ret, fence and so on).

The latter are executed in order because some are rare (CSR, mul/div, fence) and some are not allowed to be out-of-order (direct jump, conditional jump, ret, sbreak).

 ALU and memory operations are very common, and whether they can be executed depends only on their Rs and Rd. So ALU and memory instructions can be out-of-order.
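
As a rough illustration of the split described above, here is a small Python sketch of an issue check driven only by Rs/Rd. It is an assumption about how such a check could look, not the actual SSRV logic: `Instr`, `pick_issue` and the `in_order` flag are hypothetical names, there is no result forwarding within a cycle, and an "uncommon" instruction simply ends the scan.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    rd: int                 # destination register, 0 (x0) means "none"
    rs: tuple = ()          # source registers
    in_order: bool = False  # True for the "uncommon" class (CSR, mul/div, jump, ...)


def pick_issue(window):
    """Return the names of the instructions in `window` that may start this cycle."""
    issued = []
    held_back = False     # some older instruction had to wait
    written = set()       # rd of every older instruction in the window
    read_waiting = set()  # rs of older instructions that were held back
    for ins in window:
        raw = any(r in written for r in ins.rs)         # needs an older result
        waw = ins.rd != 0 and ins.rd in written         # would reorder two writes
        war = ins.rd != 0 and ins.rd in read_waiting    # would overwrite a pending source
        hazard = raw or waw or war
        if ins.in_order:
            # Uncommon instruction: goes only if everything older went too,
            # and nothing behind it is considered in this cycle.
            if not hazard and not held_back:
                issued.append(ins.name)
            break
        if hazard:
            held_back = True
            read_waiting.update(ins.rs)
        else:
            issued.append(ins.name)
        if ins.rd != 0:
            written.add(ins.rd)
    return issued


window = [
    Instr("add", 3, (1, 2)),
    Instr("sub", 4, (3, 1)),     # needs x3 from "add": waits
    Instr("lw", 5, (2,)),        # independent: may pass "sub"
    Instr("addi", 1, (6,)),      # would overwrite x1 that "sub" still reads: waits
    Instr("beq", 0, (5, 0), in_order=True),
    Instr("xor", 7, (8, 9)),     # behind the branch: not considered
]
result = pick_issue(window)
```

With this window only `add` and `lw` start in the same cycle, which shows the out-of-order part; the branch and everything after it wait.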

I had made a mistake here: when a LD/ST instruction is issued to the memory bus, it is not retired until the memory bus reports resp OK. So, until the memory bus reports resp OK, no following instruction can write its Rd to the register file.

  lkcl gave me a good solution: when a memory instruction is issued, the following instructions can be executed, but their Rd values should be stored in temp registers, not in the register file. If the memory instruction reports OK, the temp registers are written to the register file; otherwise they are abandoned. Temp registers are limited: if they are full and there is no response from the memory bus yet, the CPU stalls until the memory bus returns something. Any Rs needed by the ALU module is fetched from the temp registers first, then from the register file.
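
To make the temp-register scheme concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: a single outstanding memory instruction, and hypothetical names (`SpeculativeRegs`, `temp_slots`, `mem_ok`, `mem_error`) that do not come from the SSRV code.

```python
class SpeculativeRegs:
    """Register file plus a small buffer of not-yet-committed Rd values."""

    def __init__(self, temp_slots=4):
        self.regfile = [0] * 32
        self.temp = []              # (rd, value) pairs, oldest first
        self.temp_slots = temp_slots

    def can_accept(self):
        # When this is False the core has to stall until the bus answers.
        return len(self.temp) < self.temp_slots

    def write(self, rd, value):
        """Park a result produced behind an outstanding memory instruction."""
        if not self.can_accept():
            raise RuntimeError("temp registers full: core must stall")
        if rd != 0:                 # x0 is never written
            self.temp.append((rd, value & 0xFFFFFFFF))

    def read(self, rs):
        """Look in the temp registers first (newest wins), then the register file."""
        for rd, value in reversed(self.temp):
            if rd == rs:
                return value
        return self.regfile[rs]

    def mem_ok(self):
        """Memory instruction finished: commit parked results in program order."""
        for rd, value in self.temp:
            self.regfile[rd] = value
        self.temp.clear()

    def mem_error(self):
        """Memory instruction faulted: abandon everything behind it."""
        self.temp.clear()


regs = SpeculativeRegs()
regs.regfile[1] = 10
regs.write(2, regs.read(1) + 5)   # executed behind a pending load
seen_early = regs.read(2)         # later instructions already see 15
regs.mem_error()                  # the load faulted: x2 never changes
after_error = regs.read(2)
regs.write(2, 7)
regs.mem_ok()                     # this time the access was fine
after_ok = regs.regfile[2]
```

The register file only ever sees values whose preceding memory instruction has reported OK, so the breakpoint can stay at the faulting LD/ST.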



On Tue, Jun 4, 2019 at 9:59 AM lkcl <luke.l...@gmail.com> wrote:
On Monday, June 3, 2019 at 10:53:16 AM UTC+8, lite RISC wrote:

lkcl

Jun 4, 2019, 9:33:48 AM
to RISC-V HW Dev, risc...@gmail.com, petr...@cs.washington.edu, luke.l...@gmail.com


On Tuesday, June 4, 2019 at 7:19:58 AM UTC+1, adaptiveprocessor wrote:
> Hi,
>
> Then how do you handle dependencies on the register file shared between the out-of-order ALUs,

this was discussed earlier

> and also the CSRs?

in lite's design, he is dropping back to "standard" single-issue in-order.

in the Libre RISCV OoO design we are discussing treating CSRs as "Another Register File With Associated Dependency Matrices".

the Dependency Matrices will catch both the read and write points on all CSRs, allowing copies of state information to travel freely in time-sync to ALU pipelines as part of the instruction opcodes that require them, safe in the knowledge that the dependencies on those CSRs have been taken care of.

any CSR that *MUST* be "global" in nature CANNOT be treated this way.

however VL (Vector Length), FP CSR flag information and so on definitely can.  the FP CSR status flags are simply "another register with a write dependency that happens to sit side-by-side with the FP reg and thus has a corresponding write dependency just like the FP reg result".

cancellation of any instruction due to exceptions will thus PREVENT not only the FP result reg from reaching the FP register file, it will also *prevent the CSR FP status flags from reaching the CSR regfile as well*.
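
a rough Python sketch of that last point, purely as illustration: the FP result and the flag bits it raised sit in the same latch, and are either committed together or dropped together. the class and method names (`FpState`, `park`, `resolve`) and the single-entry latch are assumptions made for this example, not the Libre RISC-V design; only the fflags bit values are the standard RISC-V ones.

```python
from dataclasses import dataclass

# Standard RISC-V fflags bits (accrued FP exception flags).
NX, UF, OF, DZ, NV = 0x01, 0x02, 0x04, 0x08, 0x10


@dataclass
class PendingFp:
    rd: int        # destination FP register
    value: float   # result waiting in the latch
    flags: int     # fflags bits raised by this operation


class FpState:
    def __init__(self):
        self.fregs = [0.0] * 32
        self.fflags = 0
        self.latch = None   # one uncommitted result (hypothetical single entry)

    def park(self, rd, value, flags):
        """Hold a finished result and its flags until it is safe to commit."""
        self.latch = PendingFp(rd, value, flags)

    def resolve(self, cancelled):
        """Commit result and flags together, or throw both away together."""
        pending, self.latch = self.latch, None
        if pending is None or cancelled:
            return
        self.fregs[pending.rd] = pending.value
        self.fflags |= pending.flags


fp = FpState()
fp.park(1, float("inf"), DZ)
fp.resolve(cancelled=True)    # an older instruction trapped
flags_after_cancel = fp.fflags
fp.park(2, 0.5, NX)
fp.resolve(cancelled=False)   # no exception: both pieces land
```

after the cancelled divide neither f1 nor fflags has changed, which is the property being described: the flags are just another write that the cancellation blocks.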

l.

lkcl

Jun 4, 2019, 9:35:30 AM
to RISC-V HW Dev, luke.l...@gmail.com
On Tuesday, June 4, 2019 at 9:03:57 AM UTC+1, lite RISC wrote:

> lkcl gives me a good solution: when a memory instruction is issued, following instructions could be executed, but their Rd should be stored in temp registers,

latches.  so, yes, temp-registers.  the latches containing the results are held until you know - for certain - that the exception will definitely, definitely not be raised.

l.

高野茂幸

Jun 4, 2019, 10:38:37 AM
to lkcl, RISC-V HW Dev, luke.l...@gmail.com, petr...@cs.washington.edu, risc...@gmail.com
Hi Luke-san,

I do not yet understand the case of an interrupt/exception (which looks like a different, special thread) and the common thread on the ALU: the two different threads share one register file, which then needs more read/write ports (high energy consumption and large area).
So do you mean to stall or spin the common thread until the special thread terminates (sleeps), so that the two run exclusively and the complexity of the shared register file is reduced, right?
But I do not know whether this can avoid the special thread clobbering variables in the register file (which needs the context of the common thread to be spilled, the common approach).

And treating the CSRs as an alternative register file is similar to this;

Best,
T.

On Tue, Jun 4, 2019 at 22:33 lkcl <lk...@lkcl.net> wrote:

lkcl

Jun 4, 2019, 11:56:05 AM
to RISC-V HW Dev, lk...@lkcl.net, luke.l...@gmail.com, petr...@cs.washington.edu, risc...@gmail.com


On Tuesday, June 4, 2019 at 3:38:37 PM UTC+1, adaptiveprocessor wrote:
> Hi Luke-san,
>
> I do not yet understand the case of an interrupt/exception (which looks like a different, special thread) and the common thread on the ALU: the two different threads share one register file, which then needs more read/write ports (high energy consumption and large area).

 no threads were mentioned previously. and this is slightly confusing terminology, to consider interrupts to be "threads".  from what i understand, interrupts are simply a change of CSR to set the mstatus bit "HI".  there is no "threading" involved in that, although i understand why you use this terminology: interrupts are conceptually just a "different PC and a different context".

plus, and someone else will have to confirm this, i believe that interrupts (traps) really do nothing other than:
* update the mstatus flag
* set the epc to point to where the userspace program was previously executing
* change pc to point to the start of the trap (the appropriate address being looked up in mtvec).

errr... that's it.  that's all.  there is *no* change of registers, no switching register banks, nothing.  even user-mode can have traps, and likewise will use uepc to record the current pc, and redirect to the user trap address by changing pc to point to the appropriate utvec entry.

> So do you mean to stall or spin the common thread

the current execution

> until the special thread

trap

> terminates (sleeps), so that the two run exclusively and the complexity of the shared register file is reduced, right?

not precisely, because threads are not being discussed.

however the register files are still shared (between the current execution and the interrupt).  the designers of RISC-V took that into account for when interrupts (traps) occur, and provided some clean and simple mechanisms to deal with it.

when the trap has occurred, it is the *TRAP's* responsibility to save any registers.  this is done using mscratch.

being discussed is the idea of stopping even changes to MEPC as a commit-prevented operation!  i.e. a trap could be in the process of being executed, results are in the pipelines, get stored in the Reservation Stations (lite calls them "temp registers")... oh and then there's a CANCELLATION event that causes the trap's results to be *thrown away* before they are allowed to be committed to the CSR Memory and the Register Files!

frickin freaky to think even of a trap as "unwindable" or "cancellable" in this way.

> But I do not know whether this can avoid the special thread clobbering variables in the register file (which needs the context of the common thread to be spilled, the common approach).
>
> And treating the CSRs as an alternative register file is similar to this;

appreciated the link, will take a look.
 

Best,
T.


Samuel Falvo II

Jun 4, 2019, 12:10:49 PM
to lkcl, RISC-V HW Dev, lkcl, petr...@cs.washington.edu, risc...@gmail.com
On Tue, Jun 4, 2019 at 8:56 AM lkcl <lk...@lkcl.net> wrote:
> plus, and someone else will have to confirm this, i believe that interrupts (traps) really do nothing other than:
> * update the mstatus flag
> * set the epc to point to where the userspace program was previously executing
> * change pc to point to the start of the trap (the appropriate address being looked up in mtvec).

Forgot one:

* change mcause to indicate the source of the interrupt

Also, this list assumes no interrupt delegation is in place.
Delegation logic may permit the processor substitute m* CSRs with s*
or even u* CSRs, depending on how the delegation bits are configured.
Note that (as I understand the current privspec) some fields are
common between different CSRs (e.g., SPP appears in both mstatus and
sstatus), while others are unique to their respective CSRs (e.g.,
mscratch and mcause, vs sscratch and scause).
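
Putting the list together, here is a short Python sketch of M-mode trap entry with no delegation. It is a simplified model for illustration: the function name `take_trap` and the dict-based CSR file are made up, mtval and the s*/u* variants are left out, and only the mstatus, mcause and mtvec field positions follow the privileged spec.

```python
MSTATUS_MIE = 1 << 3        # machine interrupt enable
MSTATUS_MPIE = 1 << 7       # previous MIE
MSTATUS_MPP_SHIFT = 11      # previous privilege mode, bits 12:11
MSTATUS_MPP_MASK = 0b11 << MSTATUS_MPP_SHIFT
INTERRUPT_BIT = 1 << 31     # top bit of mcause on RV32
PRIV_M = 3


def take_trap(csr, pc, priv, cause, is_interrupt):
    """Update the CSR dict in place and return (new_pc, new_priv)."""
    csr["mepc"] = pc
    csr["mcause"] = cause | (INTERRUPT_BIT if is_interrupt else 0)

    status = csr["mstatus"]
    old_mie = (status >> 3) & 1
    status &= ~(MSTATUS_MIE | MSTATUS_MPIE | MSTATUS_MPP_MASK)
    status |= old_mie << 7                  # MPIE <- MIE, and MIE is now 0
    status |= priv << MSTATUS_MPP_SHIFT     # MPP <- privilege before the trap
    csr["mstatus"] = status

    base = csr["mtvec"] & ~0b11
    vectored = (csr["mtvec"] & 0b11) == 1
    # In vectored mode only interrupts are spread out; exceptions go to base.
    new_pc = base + 4 * cause if (vectored and is_interrupt) else base
    return new_pc, PRIV_M


csr = {"mstatus": MSTATUS_MIE, "mtvec": 0x100 | 1, "mepc": 0, "mcause": 0}
# Machine timer interrupt (cause 7) arriving while user code runs at 0x2000.
new_pc, new_priv = take_trap(csr, pc=0x2000, priv=0, cause=7, is_interrupt=True)
```

Note that nothing here touches the integer register file, which matches the point made earlier in the thread: saving registers is left to the trap handler.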

--
Samuel A. Falvo II

高野茂幸

Jun 4, 2019, 4:43:23 PM
to lkcl, RISC-V HW Dev, luke.l...@gmail.com, petr...@cs.washington.edu, risc...@gmail.com
Luke-san,

Thank you very much for your guidance; I understand almost all of it.
So, the reservation station needs a single-bit column to indicate the trap in order to discard the appropriate entry, right?

Best,
T

On Wed, Jun 5, 2019 at 0:56 lkcl <lk...@lkcl.net> wrote:

lkcl

Jun 4, 2019, 5:43:21 PM
to RISC-V HW Dev, lk...@lkcl.net, luke.l...@gmail.com, petr...@cs.washington.edu, risc...@gmail.com
On Wednesday, June 5, 2019 at 4:43:23 AM UTC+8, adaptiveprocessor wrote:
> Luke-san,
>
>
> Thank you very much for you guide, all most of all I understand.
>
> So, reservation station needs a single bit column to indicate the trap in order to discard appropriate entry, right?

Please do bear in mind it is an idea only just being considered, so the implications have not yet fully been thought through.

The answer is yes... however the reason is as follows.

Whilst CSRs are usually stored in SRAM and treated as bytes, words or DWORDs, Samuel created a CSR Regfile that stores VARIABLE length compacted bits, i.e. only the fields that are writable.

This is to save FPGA space.

Because of this I realised that actually, the bits of CSRs can be viewed as SEPARATE registers, even though they are stored in the same contiguously addressable range.

Thus, MIE and so on are SEPARATE REGISTERs, even though they are 1 bit long.

Thus, YES, there would be a single bit - a copy of the MIE or other *register* - stored in the Reservation Station and passed to decision-making logic for LOCAL use where most other designs would access and check a GLOBAL MIE field, protected by full OoO engine quiescence, full stalling and buffer flushing protocols and so on.

The issue here is that the Function Units are indeed passing around *copies* of CSR bitfields (protected by Dependency Matrices so it is safe to do so), and what I have not thought through yet is quite how *much* information would need to be passed around in this way.

If it is, say, only up to 16 or even 20 bits of CSR State, that is tolerable.

However if it turns out to be 128 or 200 CSR bits that need to be passed into every ALU via the Reservation Stations then that is clearly intolerable, given that the operands are only 64 bit.

A mixed approach however may turn out to be feasible.

Where it will be tricky is the LD/ST, particularly on context switch to M, S and H Mode, and where the ASID changes.

There is still a lot to consider; as I mentioned, it was a very early idea under consideration.

However the fallback is always just to stall, wait for the engine to commit all outstanding instructions, quiesce, and then it is safe to change a global CSR.

L.

高野茂幸

Jun 5, 2019, 1:56:16 AM
to lkcl, RISC-V HW Dev, lk...@lkcl.net, petr...@cs.washington.edu, risc...@gmail.com
Luke-san,

No problem, I just want to clear up an issue with the idea so that it can be realized. Let us make the idea possible.

Best,
T

On Wed, Jun 5, 2019 at 6:43 lkcl <luke.l...@gmail.com> wrote:

lite RISC

Sep 11, 2019, 11:36:15 PM
to 高野茂幸, lkcl, RISC-V HW Dev, lkcl, Daniel Petrisko
Hi, All,

  I have updated the structure of SSRV and written a tutorial on it. It is based on 4 multiple-in, multiple-out buffers connected to each other. Changing the parameters of the 4 buffers leads to different performance.

  Here is the tutorial: https://risclite.github.io/

  Here is the Github Repository : https://github.com/risclite/SuperScalar-RISCV-CPU/

On Wed, Jun 5, 2019 at 1:56 PM 高野茂幸 <adaptive...@gmail.com> wrote:


lite RISC

Mar 22, 2020, 8:41:37 AM
to RISC-V HW Dev
Hello, everyone,
   This project has been updated and now reaches an amazing performance: 6.0 CoreMark/MHz. Its Dhrystone scores are 2.8 DMIPS/MHz (legal) and 4.8 DMIPS/MHz (best).
   Please visit: https://github.com/risclite/SuperScalar-RISCV-CPU. There is a complete solution for simulation and FPGA implementation.
   FYI

   Thanks

On Thu, Sep 12, 2019 at 11:38 AM lite RISC <risc...@gmail.com> wrote: