Oh, I didn't take Brendan's comment as a no. I just wanted to check my assumptions before diving too deep.
I'm aiming at the simple inner loops that are clearly defined vector operations. I understand that some of my numgo code looks dangerous, but that was mostly due to the complexity of the library I was building toward. Please let me know if there's a routine that needs explaining.
benchmark                     old ns/op     new ns/op     delta
BenchmarkCumSum1-4            13.3          21.7          +63.16%
BenchmarkCumSum5-4            27.1          28.7          +5.90%
BenchmarkCumSum10-4           46.3          35.2          -23.97%
BenchmarkCumSum100-4          478           166           -65.27%
BenchmarkCumSum1000-4         4943          1468          -70.30%
BenchmarkCumSum10000-4        48359         15257         -68.45%
BenchmarkCumSum100000-4       610421        172687        -71.71%
BenchmarkCumSum500000-4       3275409       2053480       -37.31%
BenchmarkAddConst1-4          9.33          12.9          +38.26%
BenchmarkAddConst5-4          19.0          15.3          -19.47%
BenchmarkAddConst10-4         31.6          19.0          -39.87%
BenchmarkAddConst100-4        245           59.5          -75.71%
BenchmarkAddConst1000-4       2217          489           -77.94%
BenchmarkAddConst10000-4      22340         6521          -70.81%
BenchmarkAddConst100000-4     228892        83048         -63.72%
BenchmarkAddConst500000-4     1738894       1265879       -27.20%
BenchmarkDiv1-4               12.8          16.4          +28.12%
BenchmarkDiv5-4               23.5          20.7          -11.91%
BenchmarkDiv10-4              38.5          28.4          -26.23%
BenchmarkDiv100-4             372           180           -51.61%
BenchmarkDiv1000-4            3697          1797          -51.39%
BenchmarkDiv10000-4           38007         18188         -52.15%
BenchmarkDiv100000-4          512667        233465        -54.46%
BenchmarkDiv500000-4          3318610       3015322       -9.14%
BenchmarkL1norm1-4            20.1          21.6          +7.46%
BenchmarkL1norm5-4            37.6          25.5          -32.18%
BenchmarkL1norm10-4           63.7          31.3          -50.86%
BenchmarkL1norm100-4          439           151           -65.60%
BenchmarkL1norm1000-4         4601          1230          -73.27%
BenchmarkL1norm10000-4        44533         13667         -69.31%
BenchmarkL1norm100000-4       534738        157612        -70.53%
BenchmarkL1norm500000-4       3135488       1775517       -43.37%
BenchmarkCaxpy1-4             19.0          18.1          -4.74%
BenchmarkCaxpy5-4             51.5          24.1          -53.20%
BenchmarkCaxpy10-4            96.5          25.9          -73.16%
BenchmarkCaxpy100-4           915           143           -84.37%
BenchmarkCaxpy1000-4          9889          1228          -87.58%
BenchmarkCaxpy10000-4         92242         13957         -84.87%
BenchmarkCaxpy100000-4        1015226       163840        -83.86%
BenchmarkCaxpy500000-4        5060157       2002061       -60.43%
For the comparison against floats, I re-created the entire function that calls the asm routine. The cost of that extra function call shows up in the smallest cases, but it's quickly recouped as the data grows. Caxpy is a direct asm call, since it replaces another internal/asm call.
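
To make the extra-call overhead concrete, here is a rough sketch of the shape of that comparison. It is not the actual benchmark code: `cumSumKernel`, `cumSumLoop`, `cumSumCall`, and `nsPerOp` are hypothetical names, and the "kernel" is plain Go marked `//go:noinline` as a stand-in for an asm routine (assembly functions can't be inlined). Since both versions are pure Go here, it only illustrates the fixed per-call cost at small sizes, not the speedup at large ones.

```go
package main

import (
	"fmt"
	"testing"
)

// cumSumKernel is a hypothetical stand-in for the asm routine. It is plain Go
// so the sketch runs anywhere; go:noinline mimics the fact that an assembly
// function can never be inlined into its caller.
//
//go:noinline
func cumSumKernel(dst, s []float64) {
	sum := 0.0
	for i, v := range s {
		sum += v
		dst[i] = sum
	}
}

// cumSumLoop keeps the loop in the function body, like a pure-Go version.
func cumSumLoop(dst, s []float64) []float64 {
	if len(dst) != len(s) {
		panic("cumsum: length mismatch")
	}
	sum := 0.0
	for i, v := range s {
		sum += v
		dst[i] = sum
	}
	return dst
}

// cumSumCall does the same check, then pays one extra call to reach the kernel.
func cumSumCall(dst, s []float64) []float64 {
	if len(dst) != len(s) {
		panic("cumsum: length mismatch")
	}
	cumSumKernel(dst, s)
	return dst
}

// nsPerOp times f on slices of length n using the standard testing package.
func nsPerOp(f func(dst, s []float64) []float64, n int) float64 {
	s := make([]float64, n)
	for i := range s {
		s[i] = float64(i)
	}
	dst := make([]float64, n)
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			f(dst, s)
		}
	})
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

func main() {
	for _, n := range []int{1, 10, 1000} {
		loop := nsPerOp(cumSumLoop, n)
		call := nsPerOp(cumSumCall, n)
		fmt.Printf("CumSum%-6d loop %8.1f ns/op   call %8.1f ns/op\n", n, loop, call)
	}
}
```

I'd expect the gap between the two columns to be roughly constant in absolute terms, so it matters at length 1 and disappears into the noise by length 1000, but the exact numbers will depend on the machine and Go version.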