See my previous answer on how we deal with it (I don't know whether it is the best way, of course).
Our approach includes popcount. You can use a tiny C/C++ file for this with preprocessor directives. You need to check something like
```
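#include <stdint.h>  /* INTPTR_MAX, INT64_MAX */
#if defined(__POPCNT__) && (INTPTR_MAX == INT64_MAX)
    _mm_popcnt_u64(A)  /* hardware popcnt; needs <immintrin.h> */
#else
    backup(A)          /* your own fallback routine */
#endif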
```
Note that `_mm_popcnt_u64` doesn't work on 32-bit, hence the second test.
You probably also want to check that `unsigned long` (or whichever integer type you pass in) really is 8 bytes.
This approach is at least as portable as SageMath and you just need to compile with `-march=native` or `-mpopcnt` or whatever your compiler supports.
(I just hope that any decent compiler will define `__POPCNT__` if you compile with `-mpopcnt`)
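To make the idea concrete, here is a minimal sketch of what such a tiny C file could look like. The names `HAVE_HW_POPCNT`, `popcount64` and `popcount64_backup` are made up for illustration and are not what we actually use; the backup shown is just one simple choice. Using `uint64_t` sidesteps the size check, since that type is exactly 64 bits wherever it exists.

```c
#include <stdint.h>

/* Use the hardware instruction only if the compiler says popcnt is
 * enabled (e.g. via -mpopcnt or -march=native) AND we are on a 64-bit
 * target, because _mm_popcnt_u64 does not exist on 32-bit. */
#if defined(__POPCNT__) && (INTPTR_MAX == INT64_MAX)
#include <immintrin.h>
#define HAVE_HW_POPCNT 1
#else
#define HAVE_HW_POPCNT 0
#endif

/* Hypothetical backup: clear the lowest set bit until none remain. */
static inline int popcount64_backup(uint64_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        count++;
    }
    return count;
}

/* Hypothetical wrapper: picks the fast path at compile time. */
static inline int popcount64(uint64_t x)
{
#if HAVE_HW_POPCNT
    return (int)_mm_popcnt_u64(x);
#else
    return popcount64_backup(x);
#endif
}
```

Both paths should return the same value, so callers only ever see `popcount64` and do not need to know which one was compiled in.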
I don't know if a special library for this makes a lot of sense performance-wise.
For more involved routines with multiple instructions (and very specific paths for different architectures) it would not be much help.
Jonathan
Btw, the intrinsics are the same for each compiler. However, GCC provides `__builtin_popcountll`, which will use the hardware instruction if available and otherwise fall back to a backup.
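For comparison, a minimal sketch of the builtin route, assuming GCC or a compatible compiler such as Clang. The wrapper name `popcount64_builtin` is made up for illustration; the static assertion needs C11 and just checks the 8-byte assumption.

```c
#include <stdint.h>

/* The builtin takes an unsigned long long; make sure that is 8 bytes. */
_Static_assert(sizeof(unsigned long long) == 8,
               "unsigned long long is expected to be 8 bytes");

/* Hypothetical wrapper around the GCC builtin. The compiler emits the
 * popcnt instruction when the target allows it (e.g. with -mpopcnt)
 * and its own fallback code otherwise. */
static inline int popcount64_builtin(uint64_t x)
{
    return __builtin_popcountll((unsigned long long)x);
}
```

With this there is no preprocessor check in your own code, at the price of depending on a compiler-specific builtin.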