Hello everyone,
I am very glad to introduce my superscalar RV32I CPU core. It is synthesizable and fully parameterizable. It has an outstanding benchmark score: 3.8/3.6 DMIPS/MHz (2/3 stage).
In theory, you can configure any number of instructions to be executed in the same cycle. The core evaluates the fetched instructions to determine how many can actually be issued. Below is the parallel-issue status with this number set to 16 (1 iteration of the "CoreMark 1.0" test):
CoreMark 1.0
ticks = 650135 instructions = 716495 I/T = 1.102071
# instructions executed in 1 tick -- number of ticks -- fraction of all ticks
0 -- 178771 -- 0.274975
1 -- 322573 -- 0.496163
2 -- 69757 -- 0.107296
3 -- 69186 -- 0.106418
4 -- 5334 -- 0.008204
5 -- 2696 -- 0.004147
6 -- 1343 -- 0.002066
7 -- 219 -- 0.000337
8 -- 114 -- 0.000175
9 -- 62 -- 0.000095
10 -- 36 -- 0.000055
11 -- 4 -- 0.000006
12 -- 11 -- 0.000017
13 -- 1 -- 0.000002
14 -- 12 -- 0.000018
15 -- 0 -- 0.000000
16 -- 16 -- 0.000025
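As a cross-check of the figures above, here is a small Python sketch that recomputes the totals from the histogram. The counts are copied from the table; the variable names are only illustrative and are not taken from the project.

```python
# Issue histogram copied from the CoreMark table above:
# index = instructions executed in one tick, value = number of such ticks.
ticks_by_width = [
    178771, 322573, 69757, 69186, 5334, 2696, 1343, 219, 114,
    62, 36, 4, 11, 1, 12, 0, 16,
]

total_ticks = sum(ticks_by_width)
total_instructions = sum(n * t for n, t in enumerate(ticks_by_width))
instructions_per_tick = total_instructions / total_ticks

# Ticks in which more than 4 instructions were executed at once.
wide_ticks = sum(ticks_by_width[5:])

print(f"ticks = {total_ticks} instructions = {total_instructions} "
      f"I/T = {instructions_per_tick:.6f}")
print(f"share of ticks executing more than 4: {wide_ticks / total_ticks:.4%}")
```

The totals agree with the reported ticks, instructions and I/T, and fewer than 1% of ticks execute more than four instructions.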
As you can see, unless the compiler knows about the parallel issue and makes use of it, more than 4 ALUs are useless. On the other hand, evaluating the fetched instructions is an iterative chain in which every instruction takes part, so the critical path grows with the number of instructions evaluated per cycle. Fortunately, with this number set to 3 or 4, the core already reaches its maximum DMIPS/MHz, and the critical path is acceptable (3/4 gives 20 ns (50 MHz) on an Altera DE2-115 FPGA, under the worst-case condition).
Anyone who is interested in this project, please visit my GitHub page: https://github.com/risclite/SuperScalar-RISCV-CPU
Just download my code, set a few key parameters, run the simulation and see what happens.
Hi, lkcl,
An instruction can be thought of as a kind of function: Rd = F(Rs0, Rs1). Rs0/Rs1 are checked against a register list, which holds the registers that are the Rd of earlier instructions. If either matches, the instruction will not be issued. Rd is also checked against a register list, which holds the registers that must not be touched.
I don't think "register renaming" will solve anything.
Some register hazards are local and do not involve several instructions. Because there are 31 registers, the compiler will not always favor one register and ignore the others. Why not just accept it and add more ALUs?
If you are stalled by some hazard, then as time goes by a CPU with more ALUs can catch up.
I have added the RVC instruction set to my design. It works, but it makes the design slower.
I had to abandon it to make the core run fast. Maybe next time I will add it.
Sorry for my poor English.
I will show you the code:
    rglist_in[0]  = init_rglist;
    rglist_out[0] = init_rglist;
These are the two register lists, one for sources and one for destinations. Their initial value is the set of Rds of memory load operations.
    // One-hot form of each register number; the >>1 drops bit 0, so register 0 never matches.
    assign rs0_unary = (1'b1<<instr_rs0[i]) >> 1;
    assign rs1_unary = (1'b1<<instr_rs1[i]) >> 1;
    assign rd_unary  = (1'b1<<instr_rd[i]) >> 1;
    // Hit if a source is in rglist_in or the destination is in rglist_out.
    assign instr_rg_hit[i] = ( |(rglist_in[i]  & (rs0_unary | rs1_unary)) ) |
                             ( |(rglist_out[i] & rd_unary) );
It checks whether the sources/destination of the current instruction match. If the current instruction is about to be suspended:
    rglist_in[i+1]  = rglist_in[i]  | rd_unary;
    rglist_out[i+1] = rglist_out[i] | ( rd_unary | rs1_unary | rs0_unary );
If the current instruction is about to be executed:
    rglist_in[i+1]  = rglist_in[i]  | rd_unary;
    rglist_out[i+1] = rglist_out[i] | rd_unary;
The difference is that if an instruction is about to be executed, its sources will be read right away, so later instructions do not need to avoid writing them.
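To make the scan easier to follow, here is a rough behavioral model in Python of the register-list check described above. It is only a sketch: `one_hot` and `scan_hazards` are illustrative names, instructions are given as hypothetical `(rd, rs0, rs1)` register-number tuples, and it assumes an instruction is suspended exactly when the check hits.

```python
def one_hot(reg: int) -> int:
    """Mirror of (1 << reg) >> 1: register 1 -> bit 0, ..., register 0 -> no bit."""
    return (1 << reg) >> 1


def scan_hazards(instrs, init_rglist=0):
    """Walk the fetched instructions in order and return one hit flag per instruction.

    instrs: list of (rd, rs0, rs1) register numbers (illustrative encoding).
    init_rglist: bit mask of registers already reserved, e.g. Rds of pending loads.
    """
    rglist_in = init_rglist   # registers that later sources must not read
    rglist_out = init_rglist  # registers that later destinations must not write
    hits = []
    for rd, rs0, rs1 in instrs:
        rd_u, rs0_u, rs1_u = one_hot(rd), one_hot(rs0), one_hot(rs1)
        hit = bool(rglist_in & (rs0_u | rs1_u)) or bool(rglist_out & rd_u)
        hits.append(hit)
        rglist_in |= rd_u
        if hit:
            # Suspended (assumption: suspended == hit): its sources are also
            # protected, so later instructions cannot overwrite them.
            rglist_out |= rd_u | rs0_u | rs1_u
        else:
            # Executed: only its destination is added.
            rglist_out |= rd_u
    return hits


# x3 = f(x1, x2); x4 = f(x3, x1); x1 = f(x5, x6); x7 = f(x5, x6)
program = [(3, 1, 2), (4, 3, 1), (1, 5, 6), (7, 5, 6)]
print(scan_hazards(program))
print(scan_hazards(program, init_rglist=one_hot(5)))
```

In the first call the second instruction hits because it reads x3, the third hits because it would overwrite x1, which the suspended second instruction still needs, and the fourth is free to go. In the second call x5 is reserved from the start, so the fourth instruction hits as well.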
These two register lists are built up from all the earlier instructions. If the check finds that Rs0/Rs1/Rd are not in the lists, the instruction is good for execution. Then this instruction contributes to the register lists in turn.
I received your private response, and I replied that I prefer to keep to public discussions. I then did not receive a response from you.
The reason for keeping the discussion public is so that others may learn from the discussion, and others may verify that the insights and evaluations are correct.
For example, in the private reply (that people have not seen so I now have to describe it), you mentioned that the instructions which do not match the dependencies are delayed until the next clock cycle.
This would seem to suggest that the design is executing instructions out of order, and if an interrupt were to occur it could potentially result in data corruption.
Can you please post that private reply publicly so that this can be evaluated by people with more experience?
You may find the following reply by Mitch Alsup useful:
https://groups.google.com/d/msg/comp.arch/LXWtd1L9JoY/7P7yifihBQAJ
The algorithm that Mitch describes is extremely similar to the one that you are using.
However, it works because there is a FULL proper RaW/WaR/WaW dependency system already in place, that is capable of precise instruction order preservation, preventing writes from taking place out of order as well as ensuring that register results are used only when they are ready.
Question: how does the design that you have written deal with interrupts, and how does it guarantee instruction order?
L.
--
You received this message because you are subscribed to the Google Groups "RISC-V HW Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hw-dev+un...@groups.riscv.org.
To post to this group, send email to hw-...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/hw-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/hw-dev/757b9c59-3f2c-4c2a-a57c-a2f5548522b3%40groups.riscv.org.
Hi, lkcl,
I have a dedicated area called "QUEUE", which holds instructions that have not been executed yet but are treated as done. When the "QUEUE" area is full, the CPU has to run in order and no instruction can be skipped.
So, when "QUEUE" is full, the CPU just loses its out-of-order ability, but it is still alive. When an interrupt arrives, the CPU can fetch the instructions of the trap entry and execute them one by one in order. Anyway, maybe we could supply another empty "QUEUE" area dedicated to interrupts. Only when this dedicated "QUEUE" is empty can we say that servicing the interrupt is over.
I think the "QUEUE" area is a simple and efficient way to do out-of-order execution. Just put aside the instructions stalled by data hazards and let the instructions behind them be evaluated. If some instructions are stuck for a long time and fill the "QUEUE" area, the CPU will not die; it falls back to in-order work.
I thought I had answered on the forum publicly, but I had not.
Just remind me if my answer is not clear. I try to explain clearly because I have run into these problems in my design work and am glad to share them.
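The "QUEUE" behaviour described in this message can be pictured with a very small Python model of one issue cycle. This is a sketch of the idea only, not the core's actual logic: `issue_cycle` and `queue_free` are made-up names, and the hit flags are assumed to come from a hazard check like the one discussed earlier in the thread.

```python
def issue_cycle(hits, queue_free):
    """Decide what happens to each fetched instruction in one cycle.

    hits: per-instruction hazard flags, in program order.
    queue_free: free slots left in the "QUEUE" area (hypothetical parameter).
    Returns (executed, queued, stalled_at); stalled_at is the index where
    issue stops because the QUEUE is full, or None if nothing stalls.
    """
    executed, queued = [], []
    for idx, hit in enumerate(hits):
        if not hit:
            executed.append(idx)      # no hazard: execute now
        elif len(queued) < queue_free:
            queued.append(idx)        # hazard: put aside, keep evaluating
        else:
            # QUEUE full: this instruction cannot be skipped, so it and
            # everything behind it waits (in-order behaviour).
            return executed, queued, idx
    return executed, queued, None


flags = [False, True, True, False]
print(issue_cycle(flags, queue_free=2))
print(issue_cycle(flags, queue_free=1))
print(issue_cycle(flags, queue_free=0))
```

With two free slots both stalled instructions are set aside and the last one still executes; with fewer slots, issue stops at the first stalled instruction that no longer fits.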
When an instruction is evaluated for the "QUEUE" or "EXEC" area, first of all, the instruction must be legal
On Monday, June 3, 2019 at 10:53:16 AM UTC+8, lite RISC wrote:
Hi,
Then how do you treat the dependency on the register file shared with the out-of-order ALUs,
and also CSR?
lkcl gave me a good solution: when a memory instruction is issued, the following instructions can be executed, but their Rd should be stored in temporary registers,
Hi Luke-san,
I do not yet understand the case of an interrupt/exception (which looks like a separate, special thread) alongside the common thread on the ALUs: two different threads share one register file, which then needs more read/write ports (high energy consumption and large area).
So do you mean to stall or spin the common thread until the special thread terminates (sleeps), so that the two run exclusively and the complexity of the shared register file is reduced, right?
But I do not see how this avoids the special thread clobbering variables in the register file (which needs the context of the common thread to be spilled out, the common approach).
And making the CSRs an alternative register file is similar to this approach; you can also optimize control flows.
Best,
T.
On Tue, Jun 4, 2019 at 22:33, lkcl <lk...@lkcl.net> wrote:
On Tuesday, June 4, 2019 at 7:19:58 AM UTC+1, adaptiveprocessor wrote:
> Hi,
> Then how to treat dependency on register file shared with out-of-order ALUs,
this was discussed earlier
> and also CSR?
in lite's design, he is dropping back to "standard" single-issue in-order.
in the Libre RISCV OoO design we are discussing treating CSRs as "Another Register File With Associated Dependency Matrices".
the Dependency Matrices will catch both the read and write points on all CSRs, allowing copies of state information to travel freely in time-sync to ALU pipelines as part of the instruction opcodes that require them, safe in the knowledge that the dependencies on those CSRs has been taken care of.
any CSR that *MUST* be "global" in nature CANNOT be treated this way.
however VL (Vector Length), FP CSR flag information and so on definitely can. the FP CSR status flags are simply "another register with a write dependency that happens to sit side-by-side with the FP reg and thus has a corresponding write dependency just like the FP reg result".
cancellation of any instruction due to exceptions will thus PREVENT not only the FP result reg from reaching the FP register file it will *prevent the CSR FP status flags from reaching the CSR regfile as well*.
l.
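The last point, that an FP result and its status flags are written together or not at all, can be illustrated with a tiny Python model. This is only a sketch of the idea as stated in the message, not code from either design; `Machine`, `PendingFpOp` and `retire` are hypothetical names, and the flags are accumulated with OR on the assumption that they behave like the sticky RISC-V `fflags` bits.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    fp_regs: dict = field(default_factory=dict)  # architectural FP register file
    fflags: int = 0                              # FP status flags held as a CSR


@dataclass
class PendingFpOp:
    rd: int        # destination FP register number
    result: float  # computed result, not yet written back
    flags: int     # status flags raised by this op, carried with the result


def retire(machine: Machine, op: PendingFpOp, cancelled: bool) -> None:
    """Write back a finished FP operation, or drop it entirely if cancelled."""
    if cancelled:
        # Cancelled (e.g. by an exception): neither the result nor the
        # flags reach architectural state.
        return
    machine.fp_regs[op.rd] = op.result
    machine.fflags |= op.flags


m = Machine()
retire(m, PendingFpOp(rd=1, result=0.5, flags=0b00001), cancelled=True)
print(m)
retire(m, PendingFpOp(rd=1, result=0.5, flags=0b00001), cancelled=False)
print(m)
```

Because the flags travel with the result, a cancelled operation leaves both the FP register file and the flags CSR untouched.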
Please do bear in mind it is an idea only just being considered, so the implications have not yet fully been thought through.
The answer is yes... however the reason is as follows.
Whilst CSRs are usually stored in SRAM and treated as bytes, words or DWORDs, Samuel created a CSR Regfile that stores VARIABLE-length compacted bits, i.e. only the fields that are writable.
This is to save FPGA space.
Because of this I realised that actually, the bits of CSRs can be viewed as SEPARATE registers, even though they are stored in the same contiguously addressable range.
Thus, MIE and so on are SEPARATE REGISTERs, even though they are 1 bit long.
Thus, YES, there would be a single bit - a copy of the MIE or other *register* - stored in the Reservation Station and passed to decision-making logic for LOCAL use where most other designs would access and check a GLOBAL MIE field, protected by full OoO engine quiescence, full stalling and buffer flushing protocols and so on.
The issue here is that the Function Units are indeed passing around *copies* of CSR bitfields (protected by Dependency Matrices so it is safe to do so), and what I have not thought through yet is quite how *much* information would need to be passed around in this way.
If it is, say, only up to 16 or even 20 bits of CSR State, that is tolerable.
However if it turns out to be 128 or 200 CSR bits that need to be passed into every ALU via the Reservation Stations then that is clearly intolerable, given that the operands are only 64-bit.
A mixed approach however may turn out to be feasible.
Where it will be tricky is the LD/ST, particularly on context switch to M, S and H Mode, and where the ASID changes.
There is still a lot to consider; as I mentioned, it was a very early idea under consideration.
However the fallback is always just to stall, wait for the engine to commit all outstanding instructions, quiesce, and then it is safe to change a global CSR.
L.