LAST build fails on non-amd64 platforms.


Charles Plessy

Dec 20, 2019, 9:18:58 PM
to last-...@googlegroups.com, 946...@bugs.debian.org
[CCed to Debian's bug tracking system]

Hi Martin,

LAST fails to build on non-amd64 platforms, where it does not find
the immintrin.h header file.

We can either try to fix that (most likely relying on you to do it), or
restrict the builds to the amd64 platform (missing opportunities to find
portability bugs such as this one).

What would you prefer?

Have a nice day,

Charles

--
Charles Plessy Akano, Uruma, Okinawa, Japan
Debian Med packaging team http://www.debian.org/devel/debian-med
Tooting from work, https://mastodon.technology/@charles_plessy
Tooting from home, https://framapiaf.org/@charles_plessy

mcf...@edu.k.u-tokyo.ac.jp

Dec 20, 2019, 9:37:01 PM
to last-align
Hi Charles

that's because LAST now uses SIMD, which makes it faster, but requires SSE4.1.

Does it help to use x86intrin.h instead? I guess not.

What other platforms do you have? How important is it to support them?

Have a nice day,
Martin

David Eccles (gringer)

Dec 20, 2019, 10:36:54 PM
to last-...@googlegroups.com
On 21.12.19 15:37, mcf...@edu.k.u-tokyo.ac.jp wrote:
> What other platforms do you have? How important is it to support them?

By my count, 9 official ports, 22 other ports:

https://www.debian.org/ports/

- David

Frith, Martin

Dec 23, 2019, 3:00:39 AM
to David Eccles (gringer), last-align, ple...@debian.org
Thanks David for helping me understand what "platform" means here.

The latest version (1045) might work on the other platforms. The compiler might complain about an unknown "-msse4" option, in which case that option can be omitted, e.g. by:

make CXXFLAGS='-O3 -std=c++11 -pthread -DHAS_CXX_THREADS'

But don't do that on the amd64 platform, because we want to keep the SIMD speedup!!!

It should be quite easy to add ARM SIMD, if that would be useful.

Have a nice day,
Martin


michael...@gmail.com

Dec 31, 2019, 3:07:49 AM
to last-align
Dear Martin and other last-align contributors:

Prior to the release of version 1045, I developed a patch for last-align using the "SIMD Everywhere" (SIMDe) library, which automatically translates SIMD intrinsics to previous generations (avx2 to avx, sse4.1, ssse3, sse3, sse2, sse, mmx) or to the SIMD instructions of different processor architectures (NEON on ARM, etc.):


Perhaps incorporating and adapting this patch would be useful for the last-align authors? That way you can write your code using the most advanced SIMD intrinsics and get the best possible implementation for the hardware available to the user. AVX2 support in SIMDe is still in progress, but it already covers all the AVX2 usage in last-align, and I would be happy to add support myself for any other AVX2 instructions you need. AVX-512 support is planned, and I can work on that as well, if needed.
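To give a flavour of the approach (this is just a sketch of typical SIMDe usage, not the actual patch, and the function here is made up):

  #include "simde/x86/sse4.1.h"   // instead of <immintrin.h> / <x86intrin.h>

  // _mm_max_epi8 is an SSE4.1 instruction (the one LAST's build needs); via
  // SIMDe the same operation also works on CPUs and architectures without SSE4.1.
  static inline simde__m128i bestScores(simde__m128i a, simde__m128i b) {
    return simde_mm_max_epi8(a, b);   // element-wise maximum of 16 signed bytes
  }

SIMDe can also keep the original _mm_* names (its SIMDE_ENABLE_NATIVE_ALIASES mode), which is what makes it close to a drop-in replacement.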

I also amended how Debian builds last-align on x86 systems: we do a fresh compile for each "-m" SIMD level ("-mavx2", "-mavx", and so on) and then we ship all these binaries behind a wrapper script that chooses the best one based upon /proc/cpuinfo.



It would be great if this were also incorporated and adapted by you all, though it is a bit Debian-specific at the moment. If you don't like the wrapper-script approach, then it should be possible to create binaries that contain all the SIMD variations and dispatch within the binary at runtime. I have less experience with this, though I would like to write a general guide for programmers who want to use that technique.
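As a rough sketch of that in-binary dispatch idea (GCC/clang-specific, x86 only, with made-up function names rather than real last-align code):

  #include <cstdio>

  static void alignWithAVX2() { std::puts("AVX2 code path"); }     // hypothetical
  static void alignWithSSE4() { std::puts("SSE4.1 code path"); }   // hypothetical
  static void alignGeneric()  { std::puts("generic code path"); }  // hypothetical

  int main() {
    __builtin_cpu_init();                        // populate the CPU feature flags
    if (__builtin_cpu_supports("avx2"))          alignWithAVX2();
    else if (__builtin_cpu_supports("sse4.1"))   alignWithSSE4();
    else                                         alignGeneric();
  }

In a real program each variant would be compiled with the matching -m flags, and the dispatch would sit high up in the call stack so it runs only once.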

Anyhow, I hope these patches are useful to you all. Thanks again for sharing your work as open source!

P.S. I plan on being at the BioHackathon in Hiroshima in (October?) 2020; perhaps we might meet there or in Tokyo?

Cheers,


Frith, Martin

Jan 6, 2020, 7:14:21 PM
to michael...@gmail.com, last-align
Dear Michael,

this is very interesting and impressive: many thanks for this suggestion!!!
I don't fully understand it, but I have some doubts.

As far as I can understand, if we use AVX2, then the code with your patch compiles down to the exact same thing as it did before your patch. Fine.

But if we use SSE4, the code with your patch compiles down to something a bit different than it did before, and I think less efficient (though the difference might be minor).

As for ARM NEON: I think your patch replaces each AVX/SSE instruction with a NEON instruction. But that's not great for horizontal max/min, which use multiple AVX/SSE instructions, but can be implemented with a single NEON instruction (I think).
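(To illustrate what I mean, here is a rough sketch, not LAST's actual code: a horizontal max of 16 unsigned bytes.)

  #include <stdint.h>
  #if defined(__SSE2__)
    #include <emmintrin.h>
    static inline uint8_t horizontalMaxU8(__m128i v) {
      v = _mm_max_epu8(v, _mm_srli_si128(v, 8));  // fold the high 8 bytes onto the low 8
      v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
      v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
      v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
      return (uint8_t)_mm_cvtsi128_si32(v);       // the max ends up in the lowest byte
    }
  #elif defined(__aarch64__)
    #include <arm_neon.h>
    static inline uint8_t horizontalMaxU8(uint8x16_t v) {
      return vmaxvq_u8(v);                        // one NEON instruction (UMAXV) on AArch64
    }
  #endif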

I don't fully understand your makefile changes: I guess you have some extra script that runs make with different "SFX"s and makes the wrapper script? And you somehow override the -msse4 option?

I sometimes run LAST on Mac computers, which don't seem to have /proc/cpuinfo.

In my tests (before your patch), compiling with AVX2 does not make it much faster, versus compiling with SSE4. Maybe 5% faster at best. Surprising, I don't understand. This reduces my interest in automatically using the "best" available out of AVX, SSE, etc. But my interest would be resurrected if this would make it much faster, which it seems like it should...

Hope I don't seem too negative, but I'll try to learn from these ideas and bear them in mind.

Have a nice day,
Martin

P.S. I can't find much info about that BioHackathon, but you're very welcome to visit us in Tokyo!



nem...@gmail.com

Feb 8, 2020, 5:39:43 PM
to last-align
I'm the primary author of SIMDe, hopefully I can clear some stuff up a little.

On Tuesday, January 7, 2020 at 12:14:21 AM UTC, Frith, Martin wrote:
> As far as I can understand, if we use AVX2, then the code with your patch compiles down to the exact same thing as it did before your patch. Fine.

Correct.
 
> But if we use SSE4, the code with your patch compiles down to something a bit different than it did before, and I think less efficient (though the difference might be minor).

Well, if you *don't* use AVX2 it compiles to something a bit different.  Technically if you use AVX2 you also use MMX, SSE, SSE2, etc.  That's probably what you meant, I just want to be clear.
 
Yes, the non-native implementations will generally be less efficient.  For example, if you're using AVX2 then _mm256_add_epi32 is implemented using _mm256_add_epi32.  Otherwise, if SSE2 is supported then it's implemented using two _mm_add_epi32 calls.  If that doesn't work but GCC-style vector extensions are supported (GCC, clang, icc, etc.) then we use them.  If that doesn't work it falls back on a loop with OpenMP 4 SIMD annotations (which also works in any C99 compiler; the annotations are ignored if unsupported).  To be clear, that all happens at compile time.

Without AVX2 all of these actually compile to the same thing on Intel, but on ARM they'll compile to the equivalent of two vpaddq_s32 calls.

SIMDe shouldn't ever make the code slower, only more portable.  Yes, restricting yourself to SSE4.1 instead of AVX2 is going to be slower, but it will actually work on machines which don't support AVX2, and for machines that do support AVX2 you should be compiling with AVX2.
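To make that concrete, here is a conceptual sketch of the two-lane fallback (not SIMDe's actual source):

  #include <emmintrin.h>

  struct Vec256 { __m128i lo, hi; };   // stand-in for a 256-bit vector on SSE2-only CPUs

  static inline Vec256 addEpi32Emulated(Vec256 a, Vec256 b) {
    // emulate a 256-bit integer add with two 128-bit SSE2 adds
    return { _mm_add_epi32(a.lo, b.lo), _mm_add_epi32(a.hi, b.hi) };
  }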

> As for ARM NEON: I think your patch replaces each AVX/SSE instruction with a NEON instruction. But that's not great for horizontal max/min, which use multiple AVX/SSE instructions, but can be implemented with a single NEON instruction (I think).

Yes, for this reason using SIMDe will often be slower than a fully ported implementation targeting NEON, assuming the person doing the porting knows what they're doing of course.

I don't think that really matters, though.  You're getting a port for basically zero effort, and SIMDe doesn't prevent you from doing a traditional port.  Actually, SIMDe can be a big help there; instead of having to port everything at once you can use SIMDe to create the initial port then add more optimized versions as desired.  If you find that horizontal min/max is a bottleneck in your code you can just port those functions and leave everything else (or port everything if you deem it necessary).

I guess the point is that you don't *lose* anything by using SIMDe.  What you gain is likely to be slightly flawed, but in many cases it's good enough and makes further work easier.

Also, keep in mind that we're just talking about x86 and ARM here, but there is also POWER, RISC-V, WebAssembly, MIPS, and many many others.  SIMDe should work everywhere C99 (or C++98) does.  There is one architecture (Kalray) I didn't even know existed until a couple weeks ago, and SIMDe works on it.

> In my tests (before your patch), compiling with AVX2 does not make it much faster, versus compiling with SSE4. Maybe 5% faster at best. Surprising, I don't understand. This reduces my interest in automatically using the "best" available out of AVX, SSE, etc. But my interest would be resurrected if this would make it much faster, which it seems like it should...

The most likely explanation is that you have a bottleneck somewhere so you can't run the 256-bit vectors at full speed.  Based on that 5% figure there is a good chance that your 128-bit version isn't really running full speed, either.  Have you tried looking for stalls?
 

Michael Crusoe

Feb 9, 2020, 3:48:24 AM
to Frith, Martin, last-align, Evan Nemerson
On Tue, Jan 7, 2020 at 1:14 AM Frith, Martin <mcf...@edu.k.u-tokyo.ac.jp> wrote:
> Dear Michael,
>
> this is very interesting and impressive: many thanks for this suggestion!!!

You are very welcome!
 
> I don't fully understand your makefile changes: I guess you have some extra script that runs make with different "SFX"s and makes the wrapper script? And you somehow override the -msse4 option?

Correct, we run the patched makefile (which doesn't have -msse4 any more) from https://salsa.debian.org/med-team/last-align/blob/master/debian/rules#L35 multiple times to make multiple binaries, each with a different level of SIMD intrinsics enabled: avx2, then avx, then sse4.1, then ssse3, then sse3, and finally just sse2 for x86-64 systems; and ssse3, then sse3, then sse2, then sse, then mmx, and finally without any SIMD intrinsics for 32-bit x86 systems.

The wrapper script then selects which binary to use based upon the capabilities of the user's machine at run time.

For non-x86 systems we compile only once and don't use any wrapper script.

The patch I shared was written specifically for Debian, in the context of making a last-align binary package, so some changes would be needed to make it more generic.
 
I sometimes run LAST on Mac computers, which don't seem to have /proc/cpuinfo.

Yes. Looks like "sysctl machdep.cpu.features machdep.cpu.feature_bits" is the equivalent for macOS.

Another alternative is to do the dispatching as part of the last-align binary itself. If you only support gcc (or gcc + clang) then this is easy using the target_clones attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute (and it wouldn't require all those changes to the Makefile).
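For example (a made-up function, not last-align code; works with GCC, and recent clang, on glibc-based systems):

  // The compiler emits one clone per listed target, plus a resolver that picks
  // the best clone at load time.
  __attribute__((target_clones("avx2", "sse4.1", "default")))
  long sumScores(const int *scores, long n) {
    long total = 0;
    for (long i = 0; i < n; ++i)
      total += scores[i];            // each clone is auto-vectorized for its target
    return total;
  }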
 

> In my tests (before your patch), compiling with AVX2 does not make it much faster, versus compiling with SSE4. Maybe 5% faster at best. Surprising, I don't understand. This reduces my interest in automatically using the "best" available out of AVX, SSE, etc. But my interest would be resurrected if this would make it much faster, which it seems like it should...

It goes both ways: possibly faster execution on more advanced processors, and enabling execution at all on those without SSE4. In Debian we are supposed to ship binaries for x86-64 that still work on systems with only SSE2, and for 32-bit x86 that work without even MMX!
 

> Hope I don't seem too negative, but I'll try to learn from these ideas and bear them in mind.

Your responses are very reasonable, thank you for being open to them!
 

> P.S. I can't find much info about that BioHackathon, but you're very welcome to visit us in Tokyo!

Great! It hasn't been announced on their mailing list. The general website is http://www.biohackathon.org/ and you can be alerted by signing up at https://groups.google.com/forum/#!forum/biohackathon


--
Michael R. Crusoe

Frith, Martin

Feb 11, 2020, 9:02:12 PM
to Michael Crusoe, last-align, Evan Nemerson
Thank you both for the explanations!

> The most likely explanation is that you have a bottleneck somewhere so you can't run the 256-bit vectors at full speed.  Based on that 5% figure there is a good chance that your 128-bit version isn't really running full speed, either.  Have you tried looking for stalls?

This seems the most important thing for me. I don't really know what a "stall" is, or how to look for one. I'd be grateful for any advice.

> SIMDe will often be slower than a fully ported implementation targeting NEON
I wonder if it's generally better to start with a NEON implementation, then use SIMDe to make it work with AVX/SSE...

> > I don't fully understand your makefile changes
It seems you already sent me the links answering this, the first time. Don't know how I missed them, sorry!

I just noticed this: https://github.com/google/highway
Perhaps my main concern about SIMDe is that I'd prefer "Width-agnostic" SIMD. (And runtime dispatch.)

Have a nice day,
Martin

Michael Crusoe

Feb 12, 2020, 7:55:51 AM
to Frith, Martin, last-align, Evan Nemerson


On Wed, Feb 12, 2020, 03:02 Frith, Martin <mcf...@edu.k.u-tokyo.ac.jp> wrote:
> Thank you both for the explanations!

You are very welcome! I'll let Evan respond to most of your questions.



> > I don't fully understand your makefile changes
> It seems you already sent me the links answering this, the first time. Don't know how I missed them, sorry!

No worries, there was a lot going on in my first email!

> I just noticed this: https://github.com/google/highway
> Perhaps my main concern about SIMDe is that I'd prefer "Width-agnostic" SIMD. (And runtime dispatch.)

Neat project. For Debian we must still support running on SSE2-only x86-64, and without any SIMD whatsoever on i686. That project seems to require that x86(-64) CPUs support at least SSE4.

So whatever route you choose, as long as I can still run LAST under the above restrictions (and ideally on arm64 and riscv64), I'm happy with that.

As for runtime dispatch, that is hard to provide in a library as ideally the dispatch would happen much higher up in the call stack, so that there is little to no overhead. If there is anything SIMDe can do to assist, I'm sure we'd be happy to hear about it.

Evan Nemerson

Feb 12, 2020, 2:07:24 PM
to Frith, Martin, Michael Crusoe, last-align
On Wed, 2020-02-12 at 11:01 +0900, Frith, Martin wrote:
> Thank you both for the explanations!

> > The most likely explanation is that you have a bottleneck somewhere so you can't run the 256-bit vectors at full speed.  Based on that 5% figure there is a good chance that your 128-bit version isn't really running full speed, either.  Have you tried looking for stalls?

> This seems the most important thing for me. I don't really know what a "stall" is, or how to look for one. I'd be grateful for any advice.

Warning, that's a fairly large subject. Basically, a stall refers to any time the CPU has to stop work and wait for something to become available. The most common cause is a cache miss, where the CPU can't get the data it needs right away and has to go out to L2, L3, or even main memory, but you can also see stalls for other reasons, like a port on the CPU being busy (common in shuffle-heavy code on x86, because shuffles can only run on one port).

On Linux you can use perf to look for these (<http://www.brendangregg.com/perf.html> has some helpful information). I would probably start with toplev (see <https://halobates.de/blog/p/262> for an introduction) to get a feel for what the bottlenecks are, then use perf to drill down into the program to find where they are.

> > SIMDe will often be slower than a fully ported implementation targeting NEON
> I wonder if it's generally better to start with a NEON implementation, then use SIMDe to make it work with AVX/SSE...

Each approach has advantages and disadvantages. Intel's intrinsics are generally more expressive, so you can offload more of the optimization logic to SIMDe. The vast majority of NEON instructions, on the other hand, have direct equivalents on x86/x86_64, so the port is more straightforward, but you can leave a lot of performance on the table: you end up doing everything in several small steps when SSE/AVX may have instructions that combine several operations into one more efficiently.

My suspicion is that going from Intel to NEON is generally better. That said, SIMDe's NEON API is nowhere near as complete as the SSE/AVX APIs, so it's probably not really an option right now. If you really want to try it, I'd recommend Intel's ARM_NEON_2_x86_SSE: <https://github.com/intel/ARM_NEON_2_x86_SSE>. It's much more complete, but it only allows you to go from NEON to SSE; you can't use it on POWER, for example, nor on x86 CPUs that don't support the ISA extensions it uses (up to SSE4.2, though in a lot of cases SSE2 should be sufficient). Unless you used SIMDe to implement the instructions that AN2XS targets, which would be kind of funny but probably not very efficient.

> I just noticed this: https://github.com/google/highway
> Perhaps my main concern about SIMDe is that I'd prefer "Width-agnostic" SIMD. (And runtime dispatch.)

That's an option. SIMDe is designed to get your application ported with minimal effort. If you're willing to spend the time to rewrite your code then SIMDe may not be the best way to go.

IMHO the biggest problem with abstraction layers like Highway is that you're forced to use a lowest common denominator of functionality. Basically, you have the same problems you would with a NEON->SSE automatic translation, but it tends to be a bit more severe.

The runtime dispatch is cool, but I feel like it's at the wrong level; if you move it up a bit higher in the call stack you only have the overhead of calling it once, or maybe a few times, instead of hitting it over and over (for every time you call Highway). The trade-off, of course, is binary size and you have to think about where to put the dynamic dispatch bits. That's why I've always considered it to be out of scope for SIMDe.

You may also want to look into OpenMP 4's SIMD support, which I actually use in SIMDe. It doesn't do the dynamic dispatching, but it has good compiler support (GCC, clang, ICC, and now even MSVC, plus a few others), and if the compiler doesn't support it, it just becomes normal C code. It doesn't create a runtime dependency on OpenMP in most compilers (gcc/clang/icc have -fopenmp-simd/-qopenmp-simd), and it is width-agnostic. IIRC I learned it by reading <https://pdfs.semanticscholar.org/549d/bb402cc7b59cae393e1c85c5e66f9a0e1169.pdf>, but there may be better resources available these days.
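For example (an illustrative function, not from LAST; compile with -fopenmp-simd, or without it for plain scalar code):

  void addRowScores(const float *a, const float *b, float *out, long n) {
    #pragma omp simd
    for (long i = 0; i < n; ++i)
      out[i] = a[i] + b[i];          // the compiler picks the vector width for the target
  }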

You can combine that with something like Google's cpu_features library (<https://github.com/google/cpu_features/>) for portable dynamic dispatch, or use GCC's target_clones attribute if you don't care about other compilers. target_clones has the benefit of not requiring build system changes, or code changes for that matter, but anything that uses the preprocessor to switch features (like SIMDe) won't work with it.

Intel's SPMD Program Compiler (ISPC, see <https://ispc.github.io/ispc.html#selecting-the-compilation-target>) is also very nice, but AFAIK the targets are limited to x86 and ARM. It's probably the most pleasant way to write SIMD code with good performance right now.

For what it's worth, when I'm writing new code I typically go to OpenMP 4 SIMD first. If it's not fast enough, I'll start rewriting hotspots to target a specific ISA using intrinsics, with some ifdefs to keep the portable version working.

Frith, Martin

Feb 14, 2020, 2:42:08 AM
to Evan Nemerson, Michael Crusoe, last-align
Thanks so much for this awesome free advice! I have a lot to learn...

Evan Nemerson

Feb 14, 2020, 12:47:59 PM
to Frith, Martin, Michael Crusoe, last-align
Actually, I should be thanking you. This has given me lots of ideas for things I should be documenting for SIMDe, which is a task I've been neglecting. In fact, it has already helped push me to create an FAQ to address several of these issues, and I plan on adding some more.


Thanks,
Evan