Thank you all for your interest in this new move generation code.
Here is a comparison between my own kindergarten approach and Toshihiko Okuhara's carry propagation, which looks like the fastest one, with various compilers (gcc 4.7, clang 3.0 and icc 12.1), using PGO (on icc and gcc only), and code targeted at x64, x64-modern (with popcount) and x86. The ffo20-39 test was run 5 times, using 1 core on my laptop (with a Sandy Bridge CPU running at 3.1 GHz) and Fedora 17 as the OS. The timing accuracy is around 1%.
                     kindergarten   carry      acceleration
-----------------------------------------------------------
x64-modern-pgo-icc   00:05,18       00:05,18   0,00%
x64-modern-icc       00:05,19       00:05,11   1,61%
x64-icc              00:05,38       00:05,28   1,86%
x64-modern-pgo-gcc   00:05,40       00:05,39   0,07%
x64-modern-gcc       00:05,57       00:05,40   3,11%
x64-gcc              00:05,72       00:05,62   1,74%
x64-modern-clang     00:05,90       00:05,81   1,58%
x64-clang            00:06,03       00:05,94   1,58%
x86-clang            00:09,06       00:08,39   7,96%
x86-icc              00:09,21       00:08,89   3,60%
x86-gcc              00:10,26       00:09,71   5,62%
As you can see, the acceleration varies from 0% to 8% and strongly depends on the compiler used and the targeted ABI. In practice, the acceleration is only significant on the 32-bit ABI (x86), where code optimized for this target was used only with the carry approach. In my humble opinion, it is not important to be fast here, as running Edax on a 32-bit system is much slower than running it on a 64-bit system anyway; so I am not interested in optimizing the kindergarten approach for this system (although this can be done).
What excites me about further improving move generation speed is the announcement of the AVX2 extensions in future CPUs (coming in 2013), which arrive together with the BMI2 bit-manipulation instructions. Code like "(O & 0x0002040810204080ULL)* 0x0101010101010101ULL >> 57" will be replaced by a single instruction (PEXT, which gathers the masked bits), and will be easy to reverse without using an array (PDEP, which scatters them back). This will probably make the kindergarten approach very competitive.
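To make the bit trick above concrete, here is a minimal sketch in C. It is not the Edax code. The mask and multiplier come from the expression quoted above; the helper names (gather_mul, soft_pext, soft_pdep, reverse7, check_equivalence) are made up for illustration. In the BMI2 set, PEXT is the gather and PDEP the scatter, and soft_pext / soft_pdep are portable loops standing in for them so the sketch runs on any CPU. On BMI2 hardware the _pext_u64 and _pdep_u64 intrinsics from <immintrin.h> would take their place. One detail worth noting: the multiplication delivers the seven diagonal bits in the opposite order to PEXT, so a lookup table would have to be indexed accordingly.

```c
#include <assert.h>
#include <stdint.h>

/* Diagonal mask from the quoted expression: bits 7, 14, 21, ..., 49. */
#define DIAG_MASK 0x0002040810204080ULL

/* Kindergarten gather: the multiplication copies each masked bit into
   the top byte (bits 57..63), and the shift brings them down to 0..6. */
static unsigned gather_mul(uint64_t O)
{
    return (unsigned)(((O & DIAG_MASK) * 0x0101010101010101ULL) >> 57);
}

/* Portable stand-in for PEXT: pack the bits of x selected by mask into
   the low bits of the result, lowest mask bit first. */
static uint64_t soft_pext(uint64_t x, uint64_t mask)
{
    uint64_t out = 0, bit = 1;
    for (; mask != 0; mask &= mask - 1, bit <<= 1) {
        uint64_t lowest = mask & (0 - mask);   /* lowest set bit of mask */
        if (x & lowest)
            out |= bit;
    }
    return out;
}

/* Portable stand-in for PDEP: spread the low bits of x over the
   positions selected by mask (the reverse of soft_pext). */
static uint64_t soft_pdep(uint64_t x, uint64_t mask)
{
    uint64_t out = 0, bit = 1;
    for (; mask != 0; mask &= mask - 1, bit <<= 1) {
        if (x & bit)
            out |= mask & (0 - mask);
    }
    return out;
}

/* Reverse the order of the low 7 bits. */
static unsigned reverse7(unsigned v)
{
    unsigned r = 0;
    for (int i = 0; i < 7; i++)
        r |= ((v >> i) & 1u) << (6 - i);
    return r;
}

/* Check all 128 diagonal patterns, with every other board bit set as
   noise to confirm that both gathers ignore bits outside the mask. */
static void check_equivalence(void)
{
    for (unsigned idx = 0; idx < 128; idx++) {
        uint64_t O = soft_pdep(idx, DIAG_MASK) | ~DIAG_MASK;
        assert(soft_pext(O, DIAG_MASK) == idx);
        assert(gather_mul(O) == reverse7((unsigned)soft_pext(O, DIAG_MASK)));
    }
}
```

Calling check_equivalence() from a small main() exercises every pattern. The scatter direction (soft_pdep here) is the part that otherwise needs a lookup array, since the multiplication trick has no equally cheap inverse.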
So see you in the future for a faster move generator.