Proposal for Implementing Universal Intrinsics in OpenBLAS

Kan Chen

Aug 13, 2020, 6:32:58 AM
to OpenBLAS-dev
I'm a developer working on both NumPy and OpenBLAS. NumPy provides a set of unified SIMD macros, called Universal Intrinsics, that abstracts over typical platform-specific intrinsics. With them, you only need to write optimized code once, as long as you implement your algorithms against these unified macros.
Although the kernels in OpenBLAS are written in assembly, I think developers can still benefit from Universal Intrinsics, especially when implementing new algorithms.

My team and I will try to implement the same mechanism as Universal Intrinsics in OpenBLAS. We have contacted Sayed, the main author of Universal Intrinsics in NumPy, and he may support this work.

What do you think about this? Does it make sense?

For more details about Universal Intrinsics: https://numpy.org/devdocs/reference/simd/simd-optimizations.html
The original proposal of Universal Intrinsics: https://numpy.org/neps/nep-0038-SIMD-optimizations.html

Thanks!

guobin...@intel.com

Aug 13, 2020, 11:30:41 PM
to OpenBLAS-dev
IMO, the universal intrinsic approach is beautiful in concept, but not in practice, for the following reasons:
1. Not all instructions/intrinsics have on-par equivalents across architectures, or even across platforms. That means you cannot always write a single piece of code with universal intrinsics; it only works for a limited set of cases.
2. Even when you can write a single piece of universal-intrinsic code, it may not be the optimal one for each arch/platform, while performance is the only, or at least the top, reason we use intrinsics. For example, some intrinsics/instructions perform better on one platform while others perform better on another, which leads to different intrinsic sequences or even different algorithms per arch/platform.
3. Different architectures'/platforms' intrinsics/instructions may differ in runtime modes, rounding methods, and exception handling; all of these lead to different code and algorithms, so a single piece of code is not always possible.

If we consider mixing universal intrinsics with normal intrinsics, that becomes an even bigger mess for both implementation and maintenance.

And I don't think NumPy's practice is a good example here -- their universal-intrinsic work was merged only months ago, and you can still find normal-intrinsic-based PRs being merged recently. If one day we see that NumPy has fully switched to universal intrinsics and has solved all the concerns above elegantly, then it can serve as a good practice.

Zhang Xianyi

Aug 15, 2020, 6:10:52 AM
to Kan Chen, OpenBLAS-dev
Hi Kan,

Thank you for your proposal. 

Recently, our team tried porting the Universal Intrinsics of the OpenCV project to the RISC-V Vector ISA. (Please see the slides, in Chinese: http://crva.ict.ac.cn/crvs2020/index/slides/1-5.pdf )

As @guobin mentioned, not all intrinsics have on-par equivalents. In some special cases (e.g. v_pack_b in OpenCV), we could not find a good RISC-V V implementation.

Thanks

Xianyi

Kan Chen <chen...@huawei.com> wrote on Thu, Aug 13, 2020, at 6:33 PM:

Kan Chen

Aug 21, 2020, 5:46:57 AM
to OpenBLAS-dev
Thanks for your comments.
From my perspective, Universal Intrinsics is more like a set of APIs than a truly unified set of intrinsics; the API sits at a higher level than the intrinsics themselves. Because it targets the specific scenario of numerical computation, it can converge to a small set of common functions, so missing equivalent implementations across platforms is rarely a problem.
For developers, this mechanism reduces the effort of adapting to different platforms while still producing good performance. According to some test cases in NumPy, the performance of functions optimized this way is almost as good as that of manually optimized ones.