One thing worth cautioning you about is that most of the magic of BLIS
isn't the kernels - it's the framework, meaning the loops, the
parallelism, and the traversal patterns through the data - in short, the
Goto algorithm. And the framework is optimized around the on-chip memory
system of a CPU. To get good performance out of a GPU, you'd likely need
a different "framework" to cope with far higher latencies and far less
cache per thread - the same theoretical framework, but a very different
traversal. On an FPGA, things might look different yet again.
It's all about the notion that matrix multiply is a memory-movement
problem, not a compute problem.
On 8/8/21 4:33 PM, Jeff Diamond wrote:
> Hi Minh. When you mentioned offloading kernels to the GPU using
> OpenCL, I didn't know if OpenCL supported embedding custom assembly
> language, and I didn't know if GPUs directly accepted assembly
> language. (I guess technically AMD ones probably would, I just don't