Vectorize floats functions

Kunde21

Apr 26, 2016, 2:37:52 AM
to gonum-dev
It came up in this issue that I've got quite a few assembly routines that could be used in gonum/floats functions. I've been tinkering with them in my spare time, as well as with numgo's implementation (removing the broadcasting is pretty easy). The typical speedup is about 3x compared to the for-loop versions. I have these implemented, with a bit of cleanup needed: CumSum, CumProd, Div, Mul, Sum, Scale, AddConst, Distance(L=1), Distance(L=+Inf).

I'm going to work on Equal, EqualApprox, Same, HasNaN, Max, Min, and Span in my free time.  

Are there any issues with adding more assembly behind the floats library?  Of course the assembly will go in internal/asm and floats will call into it.
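
To be concrete, the split would look roughly like this; the routine name, import path, and the internal/asm call here are illustrative, not the final API:

// floats-side wrapper: the argument checks stay in plain Go, the tight loop
// sits behind a call into internal/asm.
package floats

import "github.com/gonum/internal/asm"

// Div performs element-wise division dst[i] /= s[i] and panics if the
// slice lengths differ, matching the behaviour of the pure-Go version.
func Div(dst, s []float64) {
	if len(dst) != len(s) {
		panic("floats: slice lengths do not match")
	}
	asm.DivInPlace(dst, s) // hypothetical vectorized routine in internal/asm
}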

Vladimír Chalupecký

Apr 30, 2016, 11:57:26 PM
to gonum-dev
I think there are no issues apart from having to maintain the assembly in the future. I'd be happy with the additions provided they come with benchmarks.

Brendan Tracey

May 1, 2016, 9:57:15 AM
to gonum-dev
As I said the last time this came up, I'm not sure how useful most of them are. One of the goals of gonum is legible scientific code, and assembly hampers that. Assembly can also bring huge speed benefits, so there is clearly a legibility/speed tradeoff. Daxpy and Ddot are clear winners, since they are in the inner loop of a lot of operations (such as matrix multiplication). Scale also seems like it would have sufficient benefit to merit the cost. Are there any others? Span, for example, I usually run once or twice at the beginning of my program; even if Span were free, the runtime would be essentially unaffected.
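
To make that concrete, here's a rough sketch (not gonum's implementation) of why an axpy kernel pays for itself: a naive matrix multiply can be written so that its innermost work is exactly y += alpha*x, so any speedup to axpy multiplies through the whole computation.

package main

import "fmt"

// axpy computes y += alpha*x, the kernel that assembly would replace.
func axpy(alpha float64, x, y []float64) {
	for i, v := range x {
		y[i] += alpha * v
	}
}

// matMul computes c += a*b for row-major dense matrices; the innermost work
// is one axpy call per (i, k) pair.
func matMul(c, a, b [][]float64) {
	for i := range a {
		for k, aik := range a[i] {
			axpy(aik, b[k], c[i])
		}
	}
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	c := [][]float64{{0, 0}, {0, 0}}
	matMul(c, a, b)
	fmt.Println(c) // [[19 22] [43 50]]
}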

I'm not opposed to having any assembly; I just think there's a cost to it, so we should be judicious. If someone wanted to help improve gonum's speed, I'd recommend they help improve the BLAS implementations rather than add more floats functions in assembly (I'm not saying what you're doing is bad, Kunde, just that there is a maintenance burden).

Kunde21

May 1, 2016, 5:54:29 PM
to gonum-dev
I might be misunderstanding the purpose of the floats library, then. It seems, to me, to be a means to export the internal/asm routines with requirement checks for safety. People are already using it for the tight asm loops that are implemented, and I've seen it recommended for efficiency in multiple places (reddit, golang-nuts, etc.). If it's just a for loop that doesn't leverage hardware instructions through asm routines, I don't see how adding a function call to the mix is useful.

Dan Kortschak

May 1, 2016, 6:08:55 PM
to Kunde21, gonum-dev
I don't think the answer was a no. Part of the reason for the use of floats is legibility, part is performance. Initially I was skeptical about floats from a legibility perspective, but I have come to see the value of it in more complex numerical code.

The concerns I think Brendan has are that the use of asm should not get too much in the way of two things: reviewability and maintainability. This is obviously a cost/benefit balance. Part of the issue is that the core gonum group has relatively limited asm experience (we are learning), so complex asm code adds a maintenance burden; that burden is often justified by a performance benefit, but it needs to be weighed.

The reviewability perspective is very important, at least to me, given the astonishingly high prevalence of crappy scientific software in daily use (and the bad science that follows from it). If the asm code is well tested, clearly written and commented and can be excluded via build tags that replace it with the Go implementation, then I am happy.
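
Something like the following is what I mean by excluding the asm via build tags; the tag name and file layout are only a sketch, written with the current `+build` constraint syntax:

// addconst_amd64.go: used by default on amd64; the vectorized loop itself
// lives in addconst_amd64.s.
// +build !noasm

package asm

// AddConst adds alpha to each element of x. Implemented in assembly.
func AddConst(alpha float64, x []float64)

// addconst_noasm.go: pure-Go fallback, selected with `go build -tags noasm`
// or on architectures without the assembly.
// +build noasm

package asm

// AddConst adds alpha to each element of x.
func AddConst(alpha float64, x []float64) {
	for i := range x {
		x[i] += alpha
	}
}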

Dan

Kunde21

May 1, 2016, 10:47:01 PM
to gonum-dev, kun...@gmail.com
Oh, I didn't take Brendan's comment as a no.  I just wanted to check my assumptions before diving too deep.

I'm aiming at those simple inner loops that are clearly defined vector operations. I understand that some of my numgo code looks dangerous, but that was more due to the complexity of the library I was building towards. Please let me know if there's a routine that needs explaining.

Benchmark status FYI (columns are old ns/op, new ns/op, change):

BenchmarkCumSum1-4            13.3          21.7          +63.16%
BenchmarkCumSum5-4            27.1          28.7          +5.90%
BenchmarkCumSum10-4           46.3          35.2          -23.97%
BenchmarkCumSum100-4          478           166           -65.27%
BenchmarkCumSum1000-4         4943          1468          -70.30%
BenchmarkCumSum10000-4        48359         15257         -68.45%
BenchmarkCumSum100000-4       610421        172687        -71.71%
BenchmarkCumSum500000-4       3275409       2053480       -37.31%
BenchmarkAddConst1-4          9.33          12.9          +38.26%
BenchmarkAddConst5-4          19.0          15.3          -19.47%
BenchmarkAddConst10-4         31.6          19.0          -39.87%
BenchmarkAddConst100-4        245           59.5          -75.71%
BenchmarkAddConst1000-4       2217          489           -77.94%
BenchmarkAddConst10000-4      22340         6521          -70.81%
BenchmarkAddConst100000-4     228892        83048         -63.72%
BenchmarkAddConst500000-4     1738894       1265879       -27.20%
BenchmarkDiv1-4               12.8          16.4          +28.12%
BenchmarkDiv5-4               23.5          20.7          -11.91%
BenchmarkDiv10-4              38.5          28.4          -26.23%
BenchmarkDiv100-4             372           180           -51.61%
BenchmarkDiv1000-4            3697          1797          -51.39%
BenchmarkDiv10000-4           38007         18188         -52.15%
BenchmarkDiv100000-4          512667        233465        -54.46%
BenchmarkDiv500000-4          3318610       3015322       -9.14%
BenchmarkL1norm1-4            20.1          21.6          +7.46%
BenchmarkL1norm5-4            37.6          25.5          -32.18%
BenchmarkL1norm10-4           63.7          31.3          -50.86%
BenchmarkL1norm100-4          439           151           -65.60%
BenchmarkL1norm1000-4         4601          1230          -73.27%
BenchmarkL1norm10000-4        44533         13667         -69.31%
BenchmarkL1norm100000-4       534738        157612        -70.53%
BenchmarkL1norm500000-4       3135488       1775517       -43.37%
BenchmarkCaxpy1-4             19.0          18.1          -4.74%
BenchmarkCaxpy5-4             51.5          24.1          -53.20%
BenchmarkCaxpy10-4            96.5          25.9          -73.16%
BenchmarkCaxpy100-4           915           143           -84.37%
BenchmarkCaxpy1000-4          9889          1228          -87.58%
BenchmarkCaxpy10000-4         92242         13957         -84.87%
BenchmarkCaxpy100000-4        1015226       163840        -83.86%
BenchmarkCaxpy500000-4        5060157       2002061       -60.43%

For the comparison to floats, I'm re-creating the entire function in which the asm routine is called.  That extra function call shows up in the smallest cases, but that's quickly recouped as the data grows.  Caxpy is a direct asm call, since it's replacing another internal/asm call.
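
For concreteness, each benchmark pair looks roughly like this (a sketch; names are illustrative and the real benchmarks will sit alongside the floats tests):

package floats_test

import (
	"testing"

	"github.com/gonum/floats"
)

// addConstGo is the plain for-loop version used for the "old" timing.
func addConstGo(c float64, dst []float64) {
	for i := range dst {
		dst[i] += c
	}
}

func BenchmarkAddConstGo1000(b *testing.B) {
	x := make([]float64, 1000)
	for i := 0; i < b.N; i++ {
		addConstGo(2, x)
	}
}

// The "new" timing calls the floats function backed by internal/asm.
func BenchmarkAddConstAsm1000(b *testing.B) {
	x := make([]float64, 1000)
	for i := 0; i < b.N; i++ {
		floats.AddConst(2, x)
	}
}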