What would be the best way to access CPU intrinsics in Cython?

mra...@gmail.com

Aug 10, 2021, 6:25:08 AM
to cython-users
```
cdef extern int __builtin_popcountll(unsigned long long) nogil

cpdef unsigned long long main() except *:
    return __builtin_popcountll(1)
```

When I try compiling the above, I just get a linker error saying that it cannot resolve `__builtin_popcountll`. A possible reason is that I am using MSVC rather than GCC, and MSVC has different intrinsics.

I think that some Python libraries give access to these CPU intrinsics, but I am not sure if Cython would optimize the calls to them so I am avoiding that option. What would be the best way to access CPU intrinsics in Cython in a performant and portable manner? Is there a library for this?

D Woods

Aug 12, 2021, 2:24:23 AM
to cython-users
I think the thing with CPU intrinsics is that they're inherently unportable, so there probably isn't a good solution. You'd be best off looking for a set of C macros that wrap them and then wrapping those with Cython.

I believe people have succeeded in using the vectorized SIMD-type intrinsics from Cython (and you can find a few examples if you search for it) but it isn't pretty (because I think the associated numeric types don't make sense to Cython) and it isn't portable.

Jonathan Kliem

Aug 12, 2021, 2:37:57 AM
to cython-users
See my previous answer on how we deal with it (don't know if it is best of course).

Our approach includes popcount. You can use a tiny C/C++ file for this with preprocessor directives. You need to check something like:
```
#if (__POPCNT__) && (INTPTR_MAX == INT64_MAX)
    _mm_popcnt_u64(A)
#else
    backup(A)
#endif
```
Note that `_mm_popcnt_u64` doesn't work on 32-bit, hence the second test.
You probably also want to check that `unsigned long long` really is 8 bytes.

This approach is at least as portable as SageMath and you just need to compile with `-march=native` or `-mpopcnt` or whatever your compiler supports.
(I just hope that any decent compiler will define `__POPCNT__` if you compile with `-mpopcnt`)
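
For anyone finding this later, here is a minimal sketch of what such a tiny C header could look like. The names `popcount64`, `popcount64_backup`, and `HAVE_HW_POPCNT` are made up for illustration and are not what SageMath actually uses, and the backup is just one common bit-twiddling variant. Using `uint64_t` sidesteps the question of how wide `unsigned long long` is.

```c
/* Hypothetical header sketch; all names here are illustrative only. */
#include <stdint.h>

/* Use the hardware instruction only when the compiler advertises POPCNT
 * and we are on a 64-bit target (same test as in the post above). */
#if defined(__POPCNT__) && (INTPTR_MAX == INT64_MAX)
#include <immintrin.h>   /* declares _mm_popcnt_u64 */
#define HAVE_HW_POPCNT 1
#else
#define HAVE_HW_POPCNT 0
#endif

/* Portable backup: bit-parallel count, works on any C99 compiler. */
static inline int popcount64_backup(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);            /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);                /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;            /* 8-bit sums */
    return (int)((x * 0x0101010101010101ULL) >> 56);       /* add bytes  */
}

/* Single entry point to declare on the Cython side. */
static inline int popcount64(uint64_t x)
{
#if HAVE_HW_POPCNT
    return (int)_mm_popcnt_u64(x);
#else
    return popcount64_backup(x);
#endif
}
```

On the Cython side you would then declare `popcount64` in a `cdef extern from` block that names this header, so the C compiler sees the real definition instead of the linker looking for a symbol.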

I don't know if a special library for this makes a lot of sense performance-wise.
For more involved routines with multiple instructions (and very specific paths for different architectures) it would not be much help.

Jonathan

Btw, the intrinsics are the same for each compiler. However, GCC provides `__builtin_popcountll`, which will use the intrinsic if available and otherwise fall back to a backup.
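
Since the original question was about MSVC, here is a rough sketch of dispatching on the compiler instead of on `__POPCNT__`. The function name `popcount_u64` is hypothetical. One caveat, as far as I know: MSVC's `__popcnt64` is only available on x64 and emits the POPCNT instruction without any runtime CPU check, whereas `__builtin_popcountll` lets GCC/Clang choose between the instruction and a software sequence based on the target flags.

```c
/* Hypothetical compiler-dispatch sketch; popcount_u64 is an illustrative name. */
#include <stdint.h>

#if defined(_MSC_VER) && !defined(__clang__) && defined(_M_X64)
#include <intrin.h>
/* MSVC on x64: maps directly to the POPCNT instruction (no CPU check). */
static inline int popcount_u64(uint64_t x)
{
    return (int)__popcnt64(x);
}
#elif defined(__GNUC__) || defined(__clang__)
/* GCC/Clang (including clang-cl): builtin with a compiler-provided backup. */
static inline int popcount_u64(uint64_t x)
{
    return __builtin_popcountll(x);
}
#else
/* Any other compiler: clear the lowest set bit until nothing is left. */
static inline int popcount_u64(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}
#endif
```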

mra...@gmail.com

Aug 12, 2021, 2:44:09 AM
to cython-users
Thanks.