IMO, the universal intrinsic approach is beautiful in concept but not in practice, for the following reasons:
1. Not every instruction/intrinsic has an on-par equivalent across architectures, or even across platforms. That means you cannot always write a single piece of 'universal intrinsic' code, only in some limited cases;
2. Even if you can write a single piece of 'universal intrinsic' code, it may not be the optimal one for each arch/platform, while performance is the only, or at least the top, reason we use intrinsics. E.g., some intrinsics/instructions perform better on one platform while others perform better on another arch/platform, which leads to different intrinsic sequences or even different algorithms per arch/platform;
3. Intrinsics/instructions on different archs/platforms may have different runtime modes, rounding methods, or exception handling. All of these lead to different code/algorithms, so a single piece of code is not always possible.
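To make point 3 concrete, here is a minimal sketch of a float-to-int32 conversion behind a "universal" wrapper. The `u_*` names and `convert_f32_to_s32` are hypothetical, invented for this illustration; they are not NumPy's or any other library's API. The underlying intrinsics are real: `_mm_cvtps_epi32` (SSE2) rounds according to the current MXCSR mode, which is round-to-nearest-even by default, while `vcvtq_s32_f32` (NEON) always truncates toward zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "universal" layer: all u_* names are made up for this sketch. */
#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128  u_f32x4;
typedef __m128i u_s32x4;
static inline u_f32x4 u_loadu_f32(const float *p) { return _mm_loadu_ps(p); }
static inline void u_storeu_s32(int32_t *p, u_s32x4 v) { _mm_storeu_si128((__m128i *)p, v); }
/* Rounds per the current MXCSR mode: round-to-nearest-even by default. */
static inline u_s32x4 u_cvt_f32_s32(u_f32x4 v) { return _mm_cvtps_epi32(v); }
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t u_f32x4;
typedef int32x4_t   u_s32x4;
static inline u_f32x4 u_loadu_f32(const float *p) { return vld1q_f32(p); }
static inline void u_storeu_s32(int32_t *p, u_s32x4 v) { vst1q_s32(p, v); }
/* Always truncates toward zero, regardless of any rounding-mode setting. */
static inline u_s32x4 u_cvt_f32_s32(u_f32x4 v) { return vcvtq_s32_f32(v); }
#else
/* Scalar fallback: a C cast truncates toward zero, like the NEON path. */
typedef struct { float   v[4]; } u_f32x4;
typedef struct { int32_t v[4]; } u_s32x4;
static inline u_f32x4 u_loadu_f32(const float *p) {
    u_f32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = p[i];
    return r;
}
static inline void u_storeu_s32(int32_t *p, u_s32x4 v) {
    for (int i = 0; i < 4; i++) p[i] = v.v[i];
}
static inline u_s32x4 u_cvt_f32_s32(u_f32x4 a) {
    u_s32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = (int32_t)a.v[i];
    return r;
}
#endif

/* The "single piece of code" written once against the universal layer.
 * n must be a multiple of 4 to keep the sketch short. */
void convert_f32_to_s32(const float *src, int32_t *dst, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        u_storeu_s32(dst + i, u_cvt_f32_s32(u_loadu_f32(src + i)));
}
```

For the input `{0.5f, 1.5f, 2.5f, -1.5f}`, the SSE2 path (with the default rounding mode) produces `{0, 2, 2, -2}`, while the NEON and scalar paths produce `{0, 1, 2, -1}`. The source is identical, but the results are not, so the wrapper either has to pick one semantic and pay extra instructions on the other arch, or leak the difference to the caller.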
If we consider mixing 'universal intrinsics' with normal intrinsics, that will be an even bigger mess for both implementation and maintenance.
And I don't think NumPy's practice is a good example here: their universal intrinsics were merged only months ago, and you can still find PRs based on normal intrinsics being merged recently. If one day we see that NumPy has fully switched to its universal intrinsics and solved all the concerns above elegantly, then it can serve as a good practice.