Help needed testing vectorized Stockfish pawns.cpp...


Nick Pelling

Sep 22, 2019, 6:35:54 PM
to FishCooking
Hi everyone,

I've just pushed a first attempt at vectorizing the big evaluate() loop in pawns.cpp to my Stockfish fork:


Because it relies heavily on SSE4 opcodes (basically, I don't think SF can get much benefit from SSE3 or earlier), I've had to add a new sse4 flag and a new target (ARCH=x86-64-sse4) to the Makefile so that I can test it locally (e.g. my PC is recent enough to have SSE4, but not recent enough to have BMI2 etc).

Hence the immediate problem: I don't know how to autoconfigure fishtest PCs to select an SSE4 target when they have SSE4, so I don't know how to get this tested properly in fishtest. Any suggestions?

Moreover, I'm pretty sure that if this same (completely generic C) code was compiled for AVX or AVX2, gcc should happily target AVX's wider parallelism. So I'd be delighted if anyone would be able to try this out for AVX/AVX2. Of course, you'd then have the same problem of how to stitch it into the Makefile and fishtest (e.g. there is currently no avx flag), but if it performs well, perhaps that would be a nice problem to have. :-)

More generally, this change inevitably falls into the "hippopotamus" category. Because of SSE4 limitations (no gather load, no register-selected bitshifts, etc), I had to do things in a slightly more Bitboard-orientated way than before. The bulk of the changes sits in pawns.cpp, and although the result is a touch more long-winded than before, a lot of the original code should still be recognizably in place and unchanged. Doubtless there's plenty of room for improvement.

All comments, suggestions and experiments are welcome!

Thanks, Nick

PS: note that the signature is 4186329 rather than (the current) 4272173. I'm not yet sure if this is due to bugs on my part or subtle rounding differences in the calculation (or quite possibly both), but hopefully it's not too far from fully working.

Ente

Sep 22, 2019, 10:57:02 PM
to FishCooking
Did a short match. Didn't spot any losses on time or anything suspicious but it lost badly. Basefish had a score of nearly 60%.

Nick Pelling

Sep 23, 2019, 1:59:44 AM
to FishCooking
Hi Ente,

As I flagged in my email, this patch probably isn't fully working yet (so it's no surprise at all it got beaten), but it's compiling and linking well enough to highlight a number of issues with the testing side of things, e.g. detecting sse4 on fishtest etc. And that's the infrastructure stuff I'm trying to figure out at the moment.

Thanks, Nick

Ente

Sep 23, 2019, 7:39:47 AM
to FishCooking
I won't be able to help with the code. Just wanted to say that it worked on my bmi2 system.

Nick Pelling

Oct 13, 2019, 4:25:24 PM
to FishCooking
Hi everyone,

I've just posted a branch with autovectorized pawn code and am running it on fishtest: https://github.com/nickpelling/Stockfish/tree/vectorizepawns6

It benches OK (no change to signature) and godbolt.org assures me that the main loop is now vectorized, but I'm not seeing a huge amount of speedup. How do I tell if any of the vectorizing code is even being called on fishtest?

Also: the same pawn code should vectorize even better on AVX / AVX2 targets, but that isn't something I've targeted before. Ente, do you want to give this a try?

Thanks, Nick

Ente

Oct 15, 2019, 2:49:39 PM
to FishCooking
I'm sorry - currently I don't have a lot of time to do testing etc. Maybe someone else can try a bmi2 build? If I manage to find the time I'll let you know.

Budi Kusasi

Oct 17, 2019, 1:06:45 AM
to FishCooking
Try adding CXXFLAGS=-mtune='skylake-avx512' to the make command line:

CXXFLAGS=-mtune='skylake-avx512' make build ARCH=x86-64....

ts.tom...@gmail.com

Oct 17, 2019, 10:15:12 AM
to FishCooking
Could you post the godbolt link with the code? I was unable to reproduce it with GCC. Clang only vectorizes the code that is, funnily enough, marked "Non-vectorizable entry code".
Also, a few things that are apparent to me:
1. Relying on automatic vectorization is a lot to ask for in this case.
2. Pawn eval is cached, so the evaluation is not that costly. Preparing the arguments for the vectorized approach may carry too much overhead: it's ~1kB of stack space.
3. If you want to have at least a small chance that it gets vectorized, then you need to always prepare 8-element arrays, even if there are fewer pawns.

ts.tom...@gmail.com

Oct 17, 2019, 10:16:21 AM
to FishCooking

3. If you want to have at least a small chance that it gets vectorized, then you need to always prepare 8-element arrays, even if there are fewer pawns.
Note: for me it doesn't vectorize even if I do this.

ts.tom...@gmail.com

Oct 17, 2019, 10:17:40 AM
to FishCooking

ts.tom...@gmail.com

Oct 17, 2019, 10:18:39 AM
to FishCooking
Also AVX512 may be required to have good performance due to bitboards.

Nick Pelling

Oct 17, 2019, 12:20:57 PM
to FishCooking
Hi Tomek,

I'm really sorry: you were looking at my initial engineering test branch, "vectorpawn2", which I used to see whether the general idea could be made to work. The actual code delivery branch, which I tested to confirm it gives the same test signature, is "vectorpawn6".


This is the code I put into godbolt.org, gcc x86-64 9.2 with -O3 -m64 -msse -msse2 -msse3 -mssse3 -msse4.1 and also with -O3 -m64 -mavx (and I expect -O3 -m64 -mavx2 should also work).

Sorry for the misunderstanding!

Cheers, Nick