On 5/9/21 1:50 PM, Thomas Koenig wrote:
> That is of course a possibility. In the CPU-based approach I
> simply used OpenMP with schedule(dynamic). However, for this
> kind of hobbyist thing, I'd rather learn something interesting
> than throw money at a cloud provider :-)
Agreed. But distributed computing can do wonders for many brute-force
mathematical problems (e.g.
<https://en.wikipedia.org/wiki/Lychrel_number#196_palindrome_quest> ;-) ).
> Seems to be rather high-level, and also rather abstract (ok, so these
> systems are usually aimed at professionals, not at hobbyists).
> I'll look around a bit and see if I can find anything that helps me,
> but at the moment, I have to say it all looks rather daunting :-)
It depends where you start... I see two problems to solve to use an FPGA:
(a) implementing the hardware operator(s) in a fast/efficient way;
(b) integrating said operator(s) into an acceleration framework.
Tackling both at once can be daunting. However, tackling only the first
(which is likely the most interesting one, and somewhat research-like)
should be quite achievable. Once that works, you can figure out how to
put it in 'production' by choosing the right setup (number of operators,
speed, memory requirements, etc.).
E.g., if you can express the problem as a (set of) 32x32 -> 32 operators
(with some ad-hoc encoding, presumably), you can easily add dedicated
instructions to a 32-bit softcore to evaluate the performance benefit.
Then you can work out a path to higher performance. Some softcores let
you use wider operands/results; 64 bits should not be much of a problem
with 64-bit softcores.
Blowing my own horn here (sorry), but for instance you can have a look
at <https://github.com/rdolbeau/VexRiscvBPluginGenerator/>, which is
designed to easily add integer-pipeline instructions to a VexRiscv
(RV32) core in a Linux-capable LiteX SoC. I run a quad-core at 100 MHz
on a ~$90 board; it won't be faster than a beefy CPU, but it's a cheap
way to start evaluating implementations. There's also support for
32x32->64 instructions (from the draft P 'packed SIMD' extension to
RISC-V [2]), 32x32x32->32 instructions (from the draft Zbt 'bitmanip
ternary' extension [1]), and you can even do 32x32x32->64 if you really
want (by abusing both systems, for instance to implement a faster ChaCha
[3]; see e.g.
<https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/data_Chacha64.txt>).
That would be the easiest way to prototype some operators, I believe.
Alternatively, VexRiscv has an FPU 'coprocessor' that can serve as
inspiration for implementing a dedicated unit with data widths up to 64
bits (but that will be more complicated).
Finally, you can go for a full-custom acceleration peripheral; the CFU
Playground is one way, or you can look at the ESP project, as it is more
focused on acceleration. But that's basically solving (b) along with (a).
Cordially,
Romain
[1] <https://github.com/riscv/riscv-bitmanip>
[2] <https://github.com/riscv/riscv-p-spec>
[3] <https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant>
--
Romain