With regret I find I must declare defeat: I won't have my submission ready by the deadline. My core passes all of the RV32I compliance tests, but despite advice and even published code from several other participants, I have been unable to get Zephyr built after about 12 hours spent fighting it, and consequently I haven't even gotten to the point of building FPGA bitstreams.

I am accustomed to FPGAs from the Big FPGA Vendors, and had a fair bit of difficulty getting the Lattice iCEcube2 installed and working; I was never able to get Microsemi Libero working at all. In both cases the problems were primarily with license management. I used to have license management problems with the Big FPGA Vendor toolchains too, but they seem to have made those issues a lot more manageable in recent years.
At the outset I thought I had a good chance of getting first place in both low-resource categories. I still think that my core will likely be smaller than most of the contest entries. I chose a vertically microcoded microarchitecture, which I call Glacial. It is, in fact, _so_ vertical compared to real-world vertically-microcoded processors that I think there needs to be a new term to describe it: "skyscraper microcode".
It was an interesting experience designing a core to optimize for size only, with literally NO consideration given to performance. I constantly had to restrain myself from adding features that would improve performance but add LUTs. During development, I tried to migrate my original Verilog code, which was written in an extremely behavioral style, to a somewhat more structural style. That partially worked, but some of the changes unexpectedly caused Synplify to require significantly more LUTs. The result is that I still have a ridiculous casez statement over an 8-bit subfield of the microinstruction, where the four LSBs are "????" for all cases. Using casez on only the important four bits, or even on 5-7 bits with some ? bits, caused significantly higher LUT utilization, as did rewriting the casez as an explicit decode of the outputs it generates.
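For readers who haven't run into this pattern, here is a minimal sketch of the kind of decode I mean; the module, port names, and encodings are invented for illustration and are not taken from the Glacial source:

```verilog
// Minimal sketch of a casez decode over an 8-bit microinstruction subfield
// where every case leaves the four LSBs as don't-cares.  All names and
// encodings here are invented for illustration, not the actual Glacial code.
module decode_sketch (
    input  wire [7:0] uinst_op,  // 8-bit subfield of the microinstruction
    output reg  [3:0] ctl        // one of the decoded control outputs
);
    always @(*) begin
        casez (uinst_op)
            8'b0000_????: ctl = 4'd0;
            8'b0001_????: ctl = 4'd1;
            8'b0010_????: ctl = 4'd2;
            // ... further cases elided, all with "????" in the low four bits
            default:      ctl = 4'd15;
        endcase
    end
endmodule
```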
Without any specific memory interface, and without the SPI interface, Glacial uses 227 LUTs in the Lattice part. Adding the SPI interface (as yet untested) adds about 10 LUTs. The datapath is 8 bits, and the core uses a single memory address space (also 8 bits wide) to contain the microcode, scratchpad, and RISC-V address space, similar to the IBM System/360 Model 25. The microcode and scratchpad take up a little under 2.5 KiB of that memory. The "Y" register, which is the memory pointer, is currently limited to 16 bits, restricting the RISC-V memory to a maximum of 61.5 KiB, which is sufficient for the Zephyr demos. The Y register could easily be expanded to 24 or 32 bits, but that would require additional LUTs. Rather than use a UART in the FPGA fabric, which would require LUTs for both the UART itself and its address decode, the core has an optional microcode bit-banged UART output, which is included in the LUT count above. A RISC-V "custom0" instruction is used to output a character.
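As a rough picture of how those numbers fit together, here is a sketch of the 64 KiB space reachable through the 16-bit Y register; only the sizes come from the description above, and the concrete base addresses are my own guesses rather than anything from the Glacial source:

```verilog
// Assumed layout of the single address space reachable through the 16-bit Y
// register.  The split is implied by the sizes above; the exact base
// addresses are guesses for illustration only.
module memmap_sketch;
    localparam [15:0] UCODE_BASE = 16'h0000; // microcode + scratchpad, a bit under 2.5 KiB
    localparam [15:0] RISCV_BASE = 16'h0A00; // RISC-V program/data assumed to start here
    localparam [15:0] ADDR_TOP   = 16'hFFFF; // top of the 64 KiB space, ~61.5 KiB for RISC-V
endmodule
```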
In practice, because neither vendor's FPGAs support initialization of large RAMs from the FPGA bitstream, I intended to use the Cortex-M3 of the SmartFusion2 to load the microcode and RISC-V program, while on the Lattice I intended to put a small portion of the microprogram in one EBR RAM and have it boot the rest from the SPI flash.
The core alone will run at around 50 MHz on the Lattice part; maybe somewhat lower once RAM is interfaced. However, each microinstruction takes four clock cycles to execute, and each RISC-V instruction takes hundreds of microinstructions. The name "Glacial" is an understatement.
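For a rough sense of just how glacial: at 50 MHz with four clocks per microinstruction, the core executes about 12.5 million microinstructions per second, and at a few hundred microinstructions per RISC-V instruction that works out to very roughly tens of thousands of RISC-V instructions per second.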
The tasks I _have_ accomplished are:
* defined a microarchitecture
* wrote a microassembler (in Python)
* wrote a microarchitecture simulator (in Python)
* wrote microcode
* debugged microcode (passes RV32I tests)
* wrote Verilog core
* debugged Verilog core (passes RV32I tests)
Writing the microassembler was somewhat easier than it would have been from scratch, because I had previously written an assembler for the Intel 8089 (yes, 808_9_) in Python. I was able to adapt it, though a fair bit of rewriting and new code was nevertheless required.
In my opinion, the timeline of this contest, from announcement to deadline, was unrealistically short, even for a project less ambitious than mine. I estimate that I would have needed another week to complete my entry with the Zephyr samples running and bitstreams for both FPGAs.
For what it's worth, the core is now on GitHub: