[riscv-hw] hwacha source


S Madhu

Dec 4, 2014, 1:09:40 AM
to hw-...@lists.riscv.org
Are there any plans to release the Hwacha source?
We are trying to make a call between a regular SIMD unit and
a pure VP unit. Having the source would help answer some of our questions
in the area of data path efficiency. I know the VP will be more efficient, but
would prefer to see some actual comparisons.



Madhu 

Tommy Thorn

Dec 4, 2014, 1:21:03 AM
to S Madhu, hw-...@lists.riscv.org

I hope I won't derail this thread, but I'm interested in how you choose here.
Wouldn't the more important question be which of the two models you can
best support in the toolchain (i.e., which is easier for the compiler to exploit)?

Thanks
Tommy


S Madhu

Dec 4, 2014, 1:54:56 AM
to Tommy Thorn, hw-...@lists.riscv.org
Threads are meant to be derailed; that is how we get rambling but interesting conversations!

In this case, since we plan to add our own compiler extensions, I am less concerned
about toolchain support. That is not to say that toolchain issues should be ignored or that
they are not a major issue.

SIMD involves data path compromises, and adding it as part of the FU matrix in the core
creates all kinds of nasty issues; it is ultimately a compromise, and you pay the price
in power.

But how good or bad a compromise it is depends on the workload. If we have a predominantly
non-vector workload and some occasional SSE-type code, then it is probably not worth the hassle
of a VP. But I may be wrong here. The UCB folks have actual data on this and are best
qualified to give an authoritative answer. But if you are going for DP HPC-type workloads, then the
VPU works far better.

But even here I am not certain how closely the VPU is tied to the ALU complex. I need to look
at the RISC-V coprocessor interface in more detail; the programming model will be evident from this.


Tommy Thorn

Dec 4, 2014, 2:07:50 AM
to S Madhu, hw-...@lists.riscv.org

> On Dec 3, 2014, at 22:54 , S Madhu <smadh...@gmail.com> wrote:
>
> Threads are meant to be derailed; that is how we get rambling but interesting conversations!
>
> In this case, since we plan to add our own compiler extensions, I am less concerned
> about tool chain support. That is not to say that toolchain issues should be ignored or that
> they are not a major issue.
>
> SIMD involves data path compromises, and adding it as part of the FU matrix in the core
> creates all kinds of nasty issues; it is ultimately a compromise, and you pay the price
> in power.

Why is SIMD inherently tighter in the FU matrix than a VPU? Assuming an SSE-style SIMD
with private registers, it would be fairly decoupled except for the explicit instructions to transport
data between RISC-V registers and the co-processor.

> But how good or bad a compromise depends on the workload. If we have a predominantly
> non-vector workload and some occasional SSE type code, then it is probably not worth the hassle
> of a VP.

Yes, I meant to write that; some workloads (e.g. media) may favor short, low-latency SIMD vectors,
whereas VPs supposedly have higher throughput at the cost of vector latency. Maybe it would be
possible to design a hybrid that could support both models.

Tommy

Andrew Waterman

Dec 4, 2014, 4:10:39 AM
to Tommy Thorn, S Madhu, hw-...@lists.riscv.org
As Tommy suggests, it's important to distinguish between the
implementation and the abstraction that the ISA provides. Both
vectors and subword-SIMD can be tightly coupled or heavily decoupled.
For example, Intel's implementations of subword-SIMD are somewhere in
the middle: they achieve modest access-execute decoupling via the
out-of-order issue queues.

At the ISA level, a major distinction is that code generation for
vector machines is ignorant of the hardware vector length, whereas
subword-SIMD codegen requires knowledge of the vector length for
correctness. In the former case, binary code can leverage longer
vectors that future implementations might provide.
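To make the distinction concrete, here is a minimal C sketch of the stripmining idiom. The `setvl()` helper and the `VLMAX` constant are hypothetical stand-ins for a set-vector-length mechanism, not Hwacha's actual instructions:

```c
#include <stddef.h>

/* Sketch only: setvl() stands in for a set-vector-length instruction
 * that answers "how many elements can the hardware process this trip?"
 * VLMAX is a property of the implementation, not of the binary. */
enum { VLMAX = 8 };  /* hypothetical hardware vector length */

static size_t setvl(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

/* Stripmined vector add: the loop structure is ignorant of VLMAX, so
 * the same binary exploits longer vectors on a future machine. */
void vadd(size_t n, const int *a, const int *b, int *c) {
    while (n > 0) {
        size_t vl = setvl(n);            /* hardware picks the strip size */
        for (size_t i = 0; i < vl; i++)  /* stands in for one vector op   */
            c[i] = a[i] + b[i];
        a += vl; b += vl; c += vl;
        n -= vl;
    }
}
```

A subword-SIMD version of the same loop, by contrast, must bake the lane count into the generated code, so a binary built for 128-bit registers cannot use 256-bit ones.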

But back to the matter at hand: we intend to release the Hwacha source
code in 2015. We don't, at UCB, have plans to implement a RISC-V
subword-SIMD extension.

kr...@eecs.berkeley.edu

Dec 4, 2014, 4:24:54 AM
to Andrew Waterman, Tommy Thorn, S Madhu, hw-...@lists.riscv.org

>>>>> On Thu, 4 Dec 2014 01:10:39 -0800, Andrew Waterman <wate...@eecs.berkeley.edu> said:
| As Tommy suggests, it's important to distinguish between the
| implementation and the abstraction that the ISA provides. Both
| vectors and subword-SIMD can be tightly coupled or heavily
| decoupled. For example, Intel's implementations of subword-SIMD are
| somewhere in the middle: they achieve modest access-execute
| decoupling via the out-of-order issue queues.

And as another example, the Cray-1 vector unit was much more tightly
coupled than the Intel or ARM SIMD extensions of current
implementations.

A common fallacy is that packed-SIMD extensions are lower latency than
true vectors for short application vectors. The opposite is usually
true, especially if the length and alignment of the short vectors are
not known at compile time.
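A minimal C sketch of why: with a fixed-width packed-SIMD unit, codegen for an unknown length n needs a main loop plus a scalar tail (and, on machines with alignment constraints, a peel loop as well). The width `W` below is a stand-in for whatever the packed registers hold:

```c
#include <stddef.h>

enum { W = 4 };  /* packed-SIMD width, baked into the binary */

/* What fixed-width SIMD codegen must emit when n is unknown at compile
 * time: a W-at-a-time main loop plus a scalar remainder tail. For short
 * application vectors (n < W) only the slow scalar tail ever runs. */
void vadd_simd(size_t n, const int *a, const int *b, int *c) {
    size_t i = 0;
    for (; i + W <= n; i += W)           /* main loop: whole W-groups */
        for (size_t j = 0; j < W; j++)   /* stands in for one SIMD op */
            c[i + j] = a[i + j] + b[i + j];
    for (; i < n; i++)                   /* scalar tail: 0..W-1 items */
        c[i] = a[i] + b[i];
}
```

A set-vector-length mechanism absorbs the short and odd-sized cases into the same loop, which is why short vectors need not favor packed-SIMD.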

| But back to the matter at hand: we intend to release the Hwacha
| source code in 2015. We don't, at UCB, have plans to implement a
| RISC-V subword-SIMD extension.

But we probably will have some thoughts on the ISA spec.

Krste

S Madhu

Dec 4, 2014, 6:16:12 AM
to Andrew Waterman, Tommy Thorn, hw-...@lists.riscv.org
I agree. Neither SIMD nor a classical VPU per se implies any particular degree of coupling.
I was referring more to the trend of x86 SIMD units being moderately coupled. I had GPUs in mind
when thinking about VPU exemplars.

We have currently implemented a SIMD unit (a weird PPC/x86 hybrid!). But I am less and less convinced
about SIMD and am leaning towards a fairly tightly coupled VPU instead. The data path design is cleaner,
and I can avoid the pain of hard-coding the data width. But I need some data points from my HPC end users
before I make the call.

But at least while implementing our SIMD unit, we realized the fallacy of using reservation stations
(this was inherited from an older design) instead of physical register files when SIMD is involved.

Still have to figure out the implications of combining a highly OoO pipeline with a moderately coupled VPU.
It probably does not make sense to have a highly OoO integer pipeline for an HPC-type workload.

Time to run real-life programs to get data points, I guess!

Yunsup Lee

Dec 4, 2014, 5:54:26 PM
to Krste Asanovic, Andrew Waterman, Tommy Thorn, S Madhu, hw-...@lists.riscv.org
I’d like to point out that our current thought is to define a fully predicated vector ISA, and first bring up an OpenCL compiler within the LLVM compiler framework.  We intend to open-source our OpenCL compiler as well.  The vector ISA will be defined so that it will be easy to target with a loop vectorizer.

For those who might be interested in why we picked predication over a hardware-based approach to handle (complex) control flow in data-parallel architectures, please take a look at our MICRO paper: http://www.cs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf.  The paper also sketches out the compiler algorithm that predicates all control flow.
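The core transformation (if-conversion) can be illustrated in scalar C. This is only an illustration of the idea, not the compiler algorithm from the paper:

```c
#include <stddef.h>

/* If-conversion sketch: the data-dependent branch
 *     if (a[i] > 0) c[i] = a[i]; else c[i] = b[i];
 * becomes a per-element predicate plus predicated (masked) operations,
 * so every element follows one straight-line instruction stream and
 * the vector unit never needs hardware divergence management. */
void select_predicated(size_t n, const int *a, const int *b, int *c) {
    for (size_t i = 0; i < n; i++) {
        int p = a[i] > 0;        /* vector compare -> predicate register */
        c[i] = p ? a[i] : b[i];  /* both arms expressed as one merge     */
    }
}
```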

-Yunsup

Tommy Thorn

Dec 4, 2014, 10:14:14 PM
to Yunsup Lee, Krste Asanovic, Andrew Waterman, S Madhu, hw-...@lists.riscv.org

> On Dec 4, 2014, at 14:54 , Yunsup Lee <yun...@eecs.berkeley.edu> wrote:
>
> I'd like to point out that our current thought is to define a fully predicated vector ISA, and first bring up an OpenCL compiler within the LLVM compiler framework. We intend to open-source our OpenCL compiler as well. The vector ISA will be defined so that it will be easy to target with a loop vectorizer.
>
> For those who might be interested in why we picked predication over a hardware-based approach to handle (complex) control flow in data-parallel architectures, please take a look at our MICRO paper: http://www.cs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf. The paper also sketches out the compiler algorithm that predicates all control flow.

Very interesting paper, thanks for the pointer.

I do have an issue with the implicit premise, though. I worked in the CUDA group for 3.5 years, and
while toy examples can be elegant in an SPMD model such as CUDA, realistic examples quickly become
so convoluted that I suspect you would have been better off with a different model instead.
Example: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
Relatedly, many problems map astonishingly poorly to the SPMD model.

I'm sure I'm ignorant of the academic literature comparing the different models, however.

I look forward to your vector ISA and hope that it works well for arbitrary vector lengths.

Regards,
Tommy
