Re-organize gonum/internal

Kunde21

Apr 30, 2016, 8:33:00 PM
to gonum-dev
While working in gonum/internal, I've come to think that lumping everything into internal/asm and identifying the type with function name prefixes isn't the most readable or convenient approach (especially since complex64 uses the prefix C and complex128 uses Z).  My suggestion is to refactor gonum/internal by datatype, with the understanding that every function will be backed by asm, if possible.  Each sub-package would implement the common API functionality for its datatype, with type-specific functionality living only in its type sub-package.  The result would be something like:

Current code:
gonum/internal/f64
gonum/internal/f32
gonum/internal/c64
gonum/internal/c128

Future?:
gonum/internal/i32
gonum/internal/i64

Each sub-package could be organized in a similar fashion, i.e. all standard stubs in one file, each asm routine in its own file, all standard "noasm" tag code in the same file, non-asm functions in the same file, etc.  One benefit is that any type-specific routines could be defined under that type in a separate stubs/noasm pair.  Another benefit is readability in the other gonum libraries: f64.Axpy(a, x, y) and c64.Axpy(a, x, y) have the same functionality and function names, but are easily differentiated by the type-specific package name.
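
As a rough illustration of that last point, here is a minimal single-file sketch of the pure-Go ("noasm") fallbacks. The names axpyF64 and axpyC64 are hypothetical stand-ins for the proposed f64.Axpy and c64.Axpy; the real packages, build tags, and exact signatures are not shown in this thread and are assumed here.

```go
package main

import "fmt"

// axpyF64 stands in for the proposed f64.Axpy (hypothetical name).
// It computes y[i] += a*x[i] for every element of x.
func axpyF64(a float64, x, y []float64) {
	for i, v := range x {
		y[i] += a * v
	}
}

// axpyC64 stands in for the proposed c64.Axpy (hypothetical name).
// Same operation and same argument order, only the element type differs.
func axpyC64(a complex64, x, y []complex64) {
	for i, v := range x {
		y[i] += a * v
	}
}

func main() {
	yf := []float64{1, 1, 1}
	axpyF64(2, []float64{1, 2, 3}, yf)
	fmt.Println(yf) // [3 5 7]

	yc := []complex64{0, 0}
	axpyC64(complex(0, 1), []complex64{1, complex(0, 1)}, yc)
	fmt.Println(yc)
}
```

In the proposed layout each function would live in its own package, so the call site would read f64.Axpy or c64.Axpy and the type prefix in the function name would disappear.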

This would impact mat64, floats, and blas/native libraries, as they utilize internal/asm (gorename could do most/all? of the change-over).

Dan Kortschak

Apr 30, 2016, 8:38:03 PM
to Kunde21, gonum-dev
SGTM, but wait for input from Brendan/Vladimir.

Vladimír Chalupecký

Apr 30, 2016, 11:58:11 PM
to gonum-dev
SGTM

Brendan Tracey

May 1, 2016, 10:00:38 AM
to gonum-dev
SGTM, though I think it should be gonum/internal/asm/f64. I could see that we would want other kinds of internal routines in the future, so keeping the asm all in one subdirectory seems reasonable.

Kunde21

May 18, 2016, 11:29:24 PM
to gonum-dev
I created a PR to allow for line-level comments on the code.  I'd ask that you review the tests first, because functionality is the first priority.  Then, don't be afraid to add line comments to anything you want explained/commented/modified.  I have benchmarks in a separate project (in order to import and compare gonum/floats), so adding them is planned and not a problem.

Kunde21

Jun 22, 2016, 11:21:37 PM
to gonum-dev
kortschak and I have been working through this PR, but I'd like to get more eyes on this code review.  The tests should identify any issues with results, memory alignment, and out-of-range memory access.  If you see anything that the tests might be missing, I'd rather you speak up and possibly be wrong than be quiet and possibly be right.  Sacrificing correct results isn't on my to-do list.

Sorry for the size of the PR, by the way; it should have been at least 3 PRs if I had planned it out correctly.  On that note, it's easier to review the files, not the commits.  The PRs I'm prepping will not make the same mistake.

Kunde21

Jun 22, 2016, 11:23:32 PM
to gonum-dev
Here's the most recent Benchmark data:

name                           old time/op  new time/op   delta
C128AxpyUnitary1-12            5.77ns ± 2%   9.16ns ±14%  +58.80%   (p=0.000 n=8+10)
C128AxpyUnitary2-12            8.12ns ±13%   9.92ns ±11%  +22.23%    (p=0.000 n=9+9)
C128AxpyUnitary3-12            10.2ns ±11%   11.9ns ± 7%  +15.87%  (p=0.000 n=10+10)
C128AxpyUnitary4-12            11.9ns ± 5%    9.8ns ± 6%  -17.11%    (p=0.000 n=9+9)
C128AxpyUnitary5-12            14.4ns ± 7%   10.7ns ± 2%  -25.23%   (p=0.000 n=10+8)
C128AxpyUnitary10-12           26.0ns ± 9%   15.2ns ±12%  -41.49%  (p=0.000 n=10+10)
C128AxpyUnitary100-12           240ns ± 3%     71ns ± 9%  -70.24%  (p=0.000 n=10+10)
C128AxpyUnitary1000-12         2.32µs ± 5%   0.69µs ± 4%  -70.29%  (p=0.000 n=10+10)
C128AxpyUnitary5000-12         11.7µs ± 6%    4.3µs ± 3%  -63.45%    (p=0.000 n=9+8)
C128AxpyUnitary10000-12        23.4µs ± 5%   13.3µs ± 3%  -43.45%    (p=0.000 n=9+7)
C128AxpyUnitary50000-12         119µs ±10%     70µs ± 5%  -40.88%   (p=0.000 n=10+9)
C128AxpyUnitaryTo1-12          6.38ns ± 5%   9.86ns ± 5%  +54.58%  (p=0.000 n=10+10)
C128AxpyUnitaryTo2-12          8.40ns ±12%  10.68ns ± 8%  +27.13%   (p=0.000 n=10+9)
C128AxpyUnitaryTo3-12          9.90ns ± 3%  12.56ns ± 7%  +26.91%   (p=0.000 n=9+10)
C128AxpyUnitaryTo4-12          11.8ns ± 4%   10.6ns ± 2%  -10.79%    (p=0.000 n=9+9)
C128AxpyUnitaryTo5-12          14.0ns ± 9%   11.5ns ± 5%  -17.40%   (p=0.000 n=10+9)
C128AxpyUnitaryTo10-12         24.2ns ± 8%   15.9ns ±12%  -34.09%  (p=0.000 n=10+10)
C128AxpyUnitaryTo100-12         218ns ±11%     73ns ±10%  -66.43%  (p=0.000 n=10+10)
C128AxpyUnitaryTo1000-12       2.05µs ± 1%   0.91µs ± 3%  -55.40%    (p=0.000 n=8+8)
C128AxpyUnitaryTo5000-12       10.5µs ±10%    4.9µs ±10%  -53.55%  (p=0.000 n=10+10)
C128AxpyUnitaryTo10000-12      21.5µs ± 8%   17.0µs ± 4%  -21.00%   (p=0.000 n=9+10)
C128AxpyUnitaryTo50000-12       106µs ± 4%     85µs ± 2%  -20.38%   (p=0.000 n=8+10)
C128AxpyIncN1Inc1-12           7.41ns ± 8%   8.78ns ± 4%  +18.42%  (p=0.000 n=10+10)
C128AxpyIncN2Inc1-12           10.4ns ±11%   10.6ns ± 4%     ~     (p=0.170 n=10+10)
C128AxpyIncN2Inc2-12           10.4ns ± 7%   10.4ns ± 5%     ~      (p=0.784 n=10+9)
C128AxpyIncN2Inc4-12           10.5ns ± 8%   10.8ns ± 9%     ~     (p=0.181 n=10+10)
C128AxpyIncN2Inc10-12          10.4ns ± 5%   10.9ns ±14%     ~      (p=0.233 n=9+10)
C128AxpyIncN3Inc1-12           12.7ns ± 7%   11.6ns ± 7%   -8.81%    (p=0.000 n=9+9)
C128AxpyIncN3Inc2-12           12.9ns ±11%   11.5ns ± 4%  -10.72%   (p=0.000 n=10+9)
C128AxpyIncN3Inc4-12           12.5ns ± 8%   11.8ns ±12%   -5.70%   (p=0.048 n=9+10)
C128AxpyIncN3Inc10-12          12.5ns ± 5%   11.4ns ± 2%   -8.86%    (p=0.000 n=9+8)
C128AxpyIncN4Inc1-12           14.9ns ± 2%   10.4ns ± 7%  -30.35%   (p=0.000 n=8+10)
C128AxpyIncN4Inc2-12           15.1ns ±12%   10.6ns ±13%  -29.72%  (p=0.000 n=10+10)
C128AxpyIncN4Inc4-12           15.0ns ±13%   10.5ns ± 9%  -30.11%  (p=0.000 n=10+10)
C128AxpyIncN4Inc10-12          14.9ns ± 4%   10.1ns ± 3%  -32.11%    (p=0.000 n=9+8)
C128AxpyIncN10Inc1-12          30.2ns ± 9%   16.0ns ± 6%  -46.98%   (p=0.000 n=10+9)
C128AxpyIncN10Inc2-12          30.3ns ± 8%   16.0ns ± 7%  -47.10%  (p=0.000 n=10+10)
C128AxpyIncN10Inc4-12          30.0ns ± 5%   15.6ns ± 6%  -47.87%   (p=0.000 n=9+10)
C128AxpyIncN10Inc10-12         30.3ns ± 6%   16.2ns ±13%  -46.42%  (p=0.000 n=10+10)
C128AxpyIncN1000Inc1-12        2.62µs ± 5%   0.88µs ±16%  -66.23%  (p=0.000 n=10+10)
C128AxpyIncN1000Inc2-12        2.69µs ± 6%   1.13µs ± 6%  -58.09%   (p=0.000 n=10+9)
C128AxpyIncN1000Inc4-12        2.83µs ±10%   2.16µs ±14%  -23.87%  (p=0.000 n=10+10)
C128AxpyIncN1000Inc10-12       4.21µs ±11%   2.15µs ± 4%  -49.07%   (p=0.000 n=10+9)
C128AxpyIncN100000Inc1-12       264µs ±11%    144µs ± 4%  -45.42%  (p=0.000 n=10+10)
C128AxpyIncN100000Inc2-12       317µs ± 7%    232µs ± 6%  -26.85%  (p=0.000 n=10+10)
C128AxpyIncN100000Inc4-12       614µs ± 8%    560µs ± 9%   -8.76%  (p=0.000 n=10+10)
C128AxpyIncN100000Inc10-12     1.95ms ± 2%   1.48ms ± 5%  -23.89%   (p=0.000 n=9+10)
C128AxpyIncN100000IncM1-12      263µs ± 7%    145µs ± 3%  -45.03%  (p=0.000 n=10+10)
C128AxpyIncN100000IncM2-12      324µs ±10%    230µs ± 2%  -28.99%   (p=0.000 n=10+9)
C128AxpyIncN100000IncM4-12      631µs ±11%    572µs ± 9%   -9.35%  (p=0.001 n=10+10)
C128AxpyIncN100000IncM10-12    1.97ms ± 4%   1.55ms ± 9%  -21.57%  (p=0.000 n=10+10)
C128AxpyIncToN1Inc1-12         8.27ns ± 5%  10.22ns ± 8%  +23.63%   (p=0.000 n=9+10)
C128AxpyIncToN2Inc1-12         11.2ns ±12%   12.2ns ±10%   +9.00%   (p=0.001 n=9+10)
C128AxpyIncToN2Inc2-12         11.1ns ± 6%   12.1ns ± 9%   +8.65%  (p=0.000 n=10+10)
C128AxpyIncToN2Inc4-12         11.2ns ± 5%   11.9ns ±10%   +6.80%  (p=0.008 n=10+10)
C128AxpyIncToN2Inc10-12        11.3ns ±13%   12.1ns ± 8%   +6.90%  (p=0.008 n=10+10)
C128AxpyIncToN3Inc1-12         13.4ns ± 3%   12.9ns ± 6%   -3.38%    (p=0.005 n=8+9)
C128AxpyIncToN3Inc2-12         13.8ns ±10%   12.9ns ± 6%   -6.69%    (p=0.001 n=9+9)
C128AxpyIncToN3Inc4-12         13.6ns ± 4%   12.9ns ± 6%   -4.91%   (p=0.003 n=9+10)
C128AxpyIncToN3Inc10-12        13.8ns ± 5%   12.9ns ± 4%   -6.03%  (p=0.000 n=10+10)
C128AxpyIncToN4Inc1-12         16.0ns ± 8%   10.9ns ± 5%  -31.77%   (p=0.000 n=10+9)
C128AxpyIncToN4Inc2-12         15.8ns ± 7%   10.9ns ± 6%  -30.98%   (p=0.000 n=10+9)
C128AxpyIncToN4Inc4-12         15.7ns ± 3%   10.9ns ± 4%  -30.78%   (p=0.000 n=9+10)
C128AxpyIncToN4Inc10-12        16.2ns ± 9%   10.9ns ± 3%  -32.43%   (p=0.000 n=10+9)
C128AxpyIncToN10Inc1-12        33.1ns ± 9%   17.4ns ± 9%  -47.25%  (p=0.000 n=10+10)
C128AxpyIncToN10Inc2-12        31.8ns ± 2%   17.2ns ± 7%  -45.94%   (p=0.000 n=8+10)
C128AxpyIncToN10Inc4-12        33.0ns ±14%   17.2ns ± 7%  -47.96%  (p=0.000 n=10+10)
C128AxpyIncToN10Inc10-12       32.4ns ± 6%   17.5ns ± 4%  -46.12%    (p=0.000 n=9+9)
C128AxpyIncToN1000Inc1-12      2.61µs ± 7%   1.06µs ± 4%  -59.44%   (p=0.000 n=9+10)
C128AxpyIncToN1000Inc2-12      2.67µs ± 2%   1.29µs ± 5%  -51.72%   (p=0.000 n=9+10)
C128AxpyIncToN1000Inc4-12      2.81µs ± 5%   2.53µs ± 3%  -10.09%    (p=0.000 n=9+9)
C128AxpyIncToN1000Inc10-12     5.69µs ± 3%   4.44µs ± 6%  -21.87%   (p=0.000 n=10+9)
C128AxpyIncToN100000Inc1-12     269µs ± 4%    176µs ± 3%  -34.43%  (p=0.000 n=10+10)
C128AxpyIncToN100000Inc2-12     370µs ± 4%    311µs ± 4%  -16.03%  (p=0.000 n=10+10)
C128AxpyIncToN100000Inc4-12    1.18ms ± 4%   1.15ms ± 3%   -2.67%  (p=0.029 n=10+10)
C128AxpyIncToN100000Inc10-12   2.62ms ± 2%   2.34ms ± 1%  -10.82%   (p=0.000 n=9+10)
C128AxpyIncToN100000IncM1-12    268µs ± 4%    181µs ± 2%  -32.64%   (p=0.000 n=10+9)
C128AxpyIncToN100000IncM2-12    376µs ± 3%    313µs ± 3%  -16.87%   (p=0.000 n=8+10)
C128AxpyIncToN100000IncM4-12   1.18ms ± 4%   1.20ms ± 6%     ~     (p=0.353 n=10+10)
C128AxpyIncToN100000IncM10-12  2.66ms ± 2%   2.45ms ± 3%   -7.97%  (p=0.000 n=10+10)
C64AxpyUnitary1-12             6.04ns ±11%   8.95ns ± 2%  +48.35%   (p=0.000 n=10+9)
C64AxpyUnitary2-12             8.96ns ± 6%  10.16ns ± 8%  +13.41%   (p=0.000 n=9+10)
C64AxpyUnitary3-12             12.1ns ± 7%   11.9ns ± 8%     ~      (p=0.088 n=9+10)
C64AxpyUnitary4-12             14.9ns ± 3%   13.1ns ± 9%  -12.28%   (p=0.000 n=9+10)
C64AxpyUnitary5-12             18.3ns ± 2%   14.1ns ± 3%  -23.19%    (p=0.000 n=8+9)
C64AxpyUnitary10-12            38.0ns ± 4%   12.9ns ± 6%  -66.03%  (p=0.000 n=10+10)
C64AxpyUnitary100-12            450ns ± 6%     43ns ±11%  -90.40%  (p=0.000 n=10+10)
C64AxpyUnitary1000-12          4.58µs ± 7%   0.33µs ± 5%  -92.71%   (p=0.000 n=10+9)
C64AxpyUnitary5000-12          22.9µs ± 6%    2.1µs ± 5%  -90.65%  (p=0.000 n=10+10)
C64AxpyUnitary10000-12         45.4µs ± 7%    4.3µs ± 4%  -90.49%   (p=0.000 n=10+9)
C64AxpyUnitary50000-12          226µs ± 7%     35µs ± 1%  -84.47%  (p=0.000 n=10+10)
C64AxpyUnitaryTo1-12           6.09ns ± 6%   9.73ns ± 1%  +59.86%   (p=0.000 n=10+8)
C64AxpyUnitaryTo2-12           9.35ns ± 3%  10.73ns ± 4%  +14.75%    (p=0.000 n=9+9)
C64AxpyUnitaryTo3-12           14.0ns ± 6%   12.4ns ± 1%  -11.26%   (p=0.000 n=10+8)
C64AxpyUnitaryTo4-12           18.5ns ± 4%   13.8ns ± 6%  -25.49%  (p=0.000 n=10+10)
C64AxpyUnitaryTo5-12           23.1ns ± 4%   14.9ns ± 7%  -35.16%   (p=0.000 n=9+10)
C64AxpyUnitaryTo10-12          46.2ns ± 5%   13.8ns ± 9%  -70.11%   (p=0.000 n=10+9)
C64AxpyUnitaryTo100-12          459ns ± 5%     47ns ±17%  -89.83%  (p=0.000 n=10+10)
C64AxpyUnitaryTo1000-12        4.55µs ± 1%   0.33µs ± 2%  -92.72%    (p=0.000 n=8+8)
C64AxpyUnitaryTo5000-12        22.8µs ± 2%    2.3µs ± 6%  -89.86%   (p=0.000 n=9+10)
C64AxpyUnitaryTo10000-12       45.6µs ± 2%    4.8µs ± 5%  -89.48%  (p=0.000 n=10+10)
C64AxpyUnitaryTo50000-12        229µs ± 4%     42µs ± 4%  -81.43%   (p=0.000 n=9+10)
C64AxpyIncN1Inc1-12            7.30ns ±11%   8.79ns ± 8%  +20.40%   (p=0.000 n=10+9)
C64AxpyIncN2Inc1-12            10.2ns ± 7%   10.6ns ± 9%     ~     (p=0.070 n=10+10)
C64AxpyIncN2Inc2-12            10.2ns ± 6%   10.6ns ± 4%     ~      (p=0.081 n=10+9)
C64AxpyIncN2Inc4-12            10.0ns ±11%   10.6ns ± 6%   +6.02%   (p=0.002 n=9+10)
C64AxpyIncN2Inc10-12           10.2ns ±16%   10.8ns ± 9%   +5.71%  (p=0.043 n=10+10)
C64AxpyIncN3Inc1-12            12.7ns ± 3%   11.6ns ± 2%   -8.84%    (p=0.000 n=8+9)
C64AxpyIncN3Inc2-12            12.9ns ± 6%   11.8ns ±10%   -9.04%  (p=0.001 n=10+10)
C64AxpyIncN3Inc4-12            12.7ns ± 4%   11.7ns ± 4%   -8.43%   (p=0.000 n=9+10)
C64AxpyIncN3Inc10-12           12.5ns ± 2%   11.5ns ± 1%   -8.39%    (p=0.000 n=8+8)
C64AxpyIncN4Inc1-12            15.7ns ± 6%   10.3ns ±11%  -34.49%   (p=0.000 n=9+10)
C64AxpyIncN4Inc2-12            15.5ns ± 1%   10.0ns ± 2%  -35.45%    (p=0.000 n=8+8)
C64AxpyIncN4Inc4-12            16.0ns ± 4%   10.2ns ± 5%  -36.46%  (p=0.000 n=10+10)
C64AxpyIncN4Inc10-12           15.9ns ± 8%   10.2ns ± 5%  -35.99%  (p=0.000 n=10+10)
C64AxpyIncN10Inc1-12           34.9ns ± 5%   15.5ns ± 4%  -55.53%   (p=0.000 n=9+10)
C64AxpyIncN10Inc2-12           35.0ns ± 2%   15.6ns ± 5%  -55.37%   (p=0.000 n=10+9)
C64AxpyIncN10Inc4-12           35.4ns ± 7%   15.9ns ±11%  -55.14%  (p=0.000 n=10+10)
C64AxpyIncN10Inc10-12          34.8ns ± 1%   15.6ns ± 4%  -55.32%    (p=0.000 n=8+9)
C64AxpyIncN1000Inc1-12         3.32µs ± 5%   0.82µs ± 4%  -75.20%    (p=0.000 n=9+9)
C64AxpyIncN1000Inc2-12         3.38µs ±10%   0.86µs ± 7%  -74.54%   (p=0.000 n=9+10)
C64AxpyIncN1000Inc4-12         3.34µs ± 2%   1.13µs ± 6%  -66.15%   (p=0.000 n=8+10)
C64AxpyIncN1000Inc10-12        3.39µs ± 6%   2.10µs ± 3%  -38.02%   (p=0.000 n=9+10)
C64AxpyIncN100000Inc1-12        331µs ± 2%     93µs ± 4%  -71.96%   (p=0.000 n=9+10)
C64AxpyIncN100000Inc2-12        330µs ± 2%    144µs ± 1%  -56.25%    (p=0.000 n=8+9)
C64AxpyIncN100000Inc4-12        359µs ± 3%    232µs ± 5%  -35.32%   (p=0.000 n=8+10)
C64AxpyIncN100000Inc10-12       921µs ± 8%    761µs ±12%  -17.41%  (p=0.000 n=10+10)
C64AxpyIncN100000IncM1-12       338µs ± 5%     93µs ± 8%  -72.34%  (p=0.000 n=10+10)
C64AxpyIncN100000IncM2-12       334µs ± 2%    146µs ± 3%  -56.28%   (p=0.000 n=9+10)
C64AxpyIncN100000IncM4-12       368µs ± 8%    232µs ± 4%  -37.03%  (p=0.000 n=10+10)
C64AxpyIncN100000IncM10-12      921µs ±11%    770µs ±15%  -16.33%  (p=0.000 n=10+10)
C64AxpyIncToN1Inc1-12          8.40ns ± 4%  10.09ns ± 5%  +20.09%    (p=0.000 n=9+9)
C64AxpyIncToN2Inc1-12          11.2ns ± 4%   11.9ns ± 8%   +6.23%  (p=0.000 n=10+10)
C64AxpyIncToN2Inc2-12          11.1ns ± 2%   12.0ns ± 9%   +7.95%   (p=0.000 n=9+10)
C64AxpyIncToN2Inc4-12          11.2ns ± 4%   11.9ns ±10%   +6.46%   (p=0.000 n=9+10)
C64AxpyIncToN2Inc10-12         11.2ns ± 4%   11.8ns ± 1%   +5.02%   (p=0.000 n=10+7)
C64AxpyIncToN3Inc1-12          14.6ns ± 4%   13.0ns ± 7%  -10.87%  (p=0.000 n=10+10)
C64AxpyIncToN3Inc2-12          14.6ns ± 2%   12.9ns ± 4%  -11.91%   (p=0.000 n=8+10)
C64AxpyIncToN3Inc4-12          14.6ns ± 1%   12.9ns ± 5%  -11.49%    (p=0.000 n=8+9)
C64AxpyIncToN3Inc10-12         14.7ns ± 4%   12.9ns ± 8%  -12.09%   (p=0.000 n=10+9)
C64AxpyIncToN4Inc1-12          19.6ns ± 5%   11.0ns ± 4%  -43.83%  (p=0.000 n=10+10)
C64AxpyIncToN4Inc2-12          19.4ns ± 1%   11.0ns ± 6%  -43.12%   (p=0.000 n=8+10)
C64AxpyIncToN4Inc4-12          19.7ns ± 3%   11.1ns ±10%  -43.47%   (p=0.000 n=9+10)
C64AxpyIncToN4Inc10-12         19.8ns ± 6%   11.3ns ±11%  -43.19%  (p=0.000 n=10+10)
C64AxpyIncToN10Inc1-12         48.5ns ± 2%   17.0ns ± 3%  -65.08%    (p=0.000 n=9+8)
C64AxpyIncToN10Inc2-12         48.6ns ± 5%   17.3ns ±10%  -64.44%   (p=0.000 n=9+10)
C64AxpyIncToN10Inc4-12         47.9ns ± 1%   16.9ns ± 2%  -64.75%    (p=0.000 n=8+8)
C64AxpyIncToN10Inc10-12        48.2ns ± 2%   17.1ns ± 6%  -64.56%  (p=0.000 n=10+10)
C64AxpyIncToN1000Inc1-12       4.79µs ± 1%   0.87µs ± 2%  -81.80%    (p=0.000 n=9+9)
C64AxpyIncToN1000Inc2-12       4.79µs ± 1%   1.05µs ± 2%  -78.16%    (p=0.000 n=8+8)
C64AxpyIncToN1000Inc4-12       4.89µs ± 5%   1.29µs ± 5%  -73.61%   (p=0.000 n=10+9)
C64AxpyIncToN1000Inc10-12      4.85µs ± 2%   2.55µs ± 1%  -47.48%    (p=0.000 n=9+9)
C64AxpyIncToN100000Inc1-12      485µs ± 2%    103µs ± 4%  -78.82%   (p=0.000 n=9+10)
C64AxpyIncToN100000Inc2-12      488µs ± 2%    177µs ± 2%  -63.66%   (p=0.000 n=9+10)
C64AxpyIncToN100000Inc4-12      505µs ± 4%    311µs ± 5%  -38.55%    (p=0.000 n=9+9)
C64AxpyIncToN100000Inc10-12    1.49ms ± 4%   1.46ms ± 4%     ~     (p=0.089 n=10+10)
C64AxpyIncToN100000IncM1-12     482µs ± 2%    102µs ± 3%  -78.76%   (p=0.000 n=9+10)
C64AxpyIncToN100000IncM2-12     491µs ± 3%    182µs ± 2%  -63.03%  (p=0.000 n=10+10)
C64AxpyIncToN100000IncM4-12     507µs ± 6%    313µs ± 6%  -38.40%   (p=0.000 n=10+9)
C64AxpyIncToN100000IncM10-12   1.49ms ± 4%   1.46ms ± 3%   -2.62%   (p=0.002 n=10+9)
F32AxpyUnitary1-12             4.02ns ± 1%   5.58ns ± 2%  +38.77%    (p=0.000 n=9+9)
F32AxpyUnitary2-12             5.40ns ± 1%   7.12ns ± 3%  +31.92%   (p=0.000 n=9+10)
F32AxpyUnitary3-12             6.50ns ± 2%   8.63ns ± 2%  +32.78%   (p=0.000 n=9+10)
F32AxpyUnitary4-12             7.55ns ± 3%   5.73ns ± 3%  -24.09%  (p=0.000 n=10+10)
F32AxpyUnitary5-12             8.74ns ± 5%   7.09ns ± 1%  -18.95%   (p=0.000 n=10+9)
F32AxpyUnitary10-12            15.7ns ± 2%   10.2ns ± 2%  -34.50%  (p=0.000 n=10+10)
F32AxpyUnitary100-12            212ns ± 1%     22ns ± 3%  -89.71%   (p=0.000 n=9+10)
F32AxpyUnitary1000-12          2.26µs ± 3%   0.18µs ± 2%  -92.04%    (p=0.000 n=9+9)
F32AxpyUnitary5000-12          11.4µs ± 1%    0.9µs ± 2%  -92.28%    (p=0.000 n=9+9)
F32AxpyUnitary10000-12         22.9µs ± 3%    1.8µs ± 1%  -92.36%   (p=0.000 n=10+9)
F32AxpyUnitary50000-12          114µs ± 2%     15µs ± 1%  -87.17%    (p=0.000 n=9+9)
F32AxpyUnitaryTo1-12           4.68ns ± 2%   6.17ns ± 4%  +31.81%    (p=0.000 n=9+9)
F32AxpyUnitaryTo2-12           5.70ns ± 2%   8.11ns ± 1%  +42.20%    (p=0.000 n=8+9)
F32AxpyUnitaryTo3-12           6.78ns ± 5%  10.21ns ± 4%  +50.66%   (p=0.000 n=9+10)
F32AxpyUnitaryTo4-12           7.54ns ± 3%   6.21ns ± 2%  -17.64%   (p=0.000 n=9+10)
F32AxpyUnitaryTo5-12           8.27ns ± 2%   7.66ns ± 3%   -7.35%    (p=0.000 n=9+9)
F32AxpyUnitaryTo10-12          15.4ns ± 4%   11.2ns ± 3%  -26.87%   (p=0.000 n=9+10)
F32AxpyUnitaryTo100-12          153ns ± 5%     22ns ± 1%  -85.33%   (p=0.000 n=9+10)
F32AxpyUnitaryTo1000-12        1.53µs ± 5%   0.18µs ± 1%  -88.24%   (p=0.000 n=9+10)
F32AxpyUnitaryTo5000-12        7.70µs ± 6%   0.90µs ± 7%  -88.31%  (p=0.000 n=10+10)
F32AxpyUnitaryTo10000-12       15.3µs ± 6%    1.8µs ± 3%  -88.45%  (p=0.000 n=10+10)
F32AxpyUnitaryTo50000-12       78.1µs ± 9%   18.7µs ± 2%  -76.11%  (p=0.000 n=10+10)
F32AxpyIncN1Inc1-12            5.48ns ± 2%   5.59ns ± 2%   +2.08%    (p=0.003 n=8+8)
F32AxpyIncN2Inc1-12            6.76ns ± 2%   7.69ns ± 5%  +13.71%    (p=0.000 n=9+9)
F32AxpyIncN2Inc2-12            6.86ns ± 7%   7.68ns ± 5%  +11.86%  (p=0.000 n=10+10)
F32AxpyIncN2Inc4-12            6.75ns ± 4%   7.82ns ± 8%  +15.75%   (p=0.000 n=9+10)
F32AxpyIncN2Inc10-12           6.93ns ± 1%   7.58ns ± 1%   +9.44%    (p=0.000 n=8+8)
F32AxpyIncN3Inc1-12            8.31ns ± 1%   9.73ns ± 5%  +17.02%    (p=0.000 n=8+9)
F32AxpyIncN3Inc2-12            8.39ns ± 2%   9.84ns ± 6%  +17.32%   (p=0.000 n=9+10)
F32AxpyIncN3Inc4-12            8.31ns ± 1%   9.67ns ± 4%  +16.38%    (p=0.000 n=8+9)
F32AxpyIncN3Inc10-12           8.41ns ± 2%   9.77ns ± 4%  +16.20%   (p=0.000 n=8+10)
F32AxpyIncN4Inc1-12            10.0ns ± 1%    7.7ns ± 2%  -23.11%   (p=0.000 n=8+10)
F32AxpyIncN4Inc2-12            10.1ns ± 5%    7.7ns ± 4%  -23.49%   (p=0.000 n=10+9)
F32AxpyIncN4Inc4-12            10.1ns ± 4%    7.7ns ± 2%  -23.77%    (p=0.000 n=9+8)
F32AxpyIncN4Inc10-12           10.1ns ± 4%    7.8ns ± 7%  -22.83%   (p=0.000 n=10+9)
F32AxpyIncN10Inc1-12           16.0ns ± 4%   13.1ns ± 5%  -18.26%    (p=0.000 n=9+9)
F32AxpyIncN10Inc2-12           16.0ns ± 4%   12.9ns ± 1%  -19.41%    (p=0.000 n=9+8)
F32AxpyIncN10Inc4-12           16.0ns ± 3%   13.0ns ± 5%  -18.75%   (p=0.000 n=10+9)
F32AxpyIncN10Inc10-12          15.9ns ± 2%   13.0ns ± 4%  -18.39%    (p=0.000 n=9+9)
F32AxpyIncN1000Inc1-12         1.05µs ± 5%   0.72µs ± 9%  -31.73%  (p=0.000 n=10+10)
F32AxpyIncN1000Inc2-12         1.04µs ± 2%   0.71µs ± 5%  -31.56%    (p=0.000 n=9+9)
F32AxpyIncN1000Inc4-12         1.04µs ± 3%   0.72µs ± 8%  -29.98%   (p=0.000 n=9+10)
F32AxpyIncN1000Inc10-12        1.36µs ± 6%   1.30µs ± 3%   -4.52%   (p=0.001 n=10+9)
F32AxpyIncN100000Inc1-12        103µs ± 5%     72µs ±11%  -30.24%  (p=0.000 n=10+10)
F32AxpyIncN100000Inc2-12        102µs ± 2%     73µs ± 5%  -28.56%    (p=0.000 n=9+9)
F32AxpyIncN100000Inc4-12        136µs ± 2%    119µs ± 6%  -12.33%  (p=0.000 n=10+10)
F32AxpyIncN100000Inc10-12       286µs ± 5%    275µs ± 7%     ~      (p=0.065 n=9+10)
F32AxpyIncN100000IncM1-12       103µs ± 3%     73µs ±15%  -28.75%   (p=0.000 n=9+10)
F32AxpyIncN100000IncM2-12       104µs ± 4%     73µs ± 3%  -29.49%   (p=0.000 n=10+9)
F32AxpyIncN100000IncM4-12       140µs ± 2%    119µs ± 3%  -15.50%  (p=0.000 n=10+10)
F32AxpyIncN100000IncM10-12      284µs ± 2%    274µs ± 6%   -3.76%   (p=0.013 n=9+10)
F32AxpyIncToN1Inc1-12          6.41ns ± 4%   6.46ns ± 1%     ~       (p=0.119 n=9+8)
F32AxpyIncToN2Inc1-12          7.75ns ± 2%   8.19ns ± 5%   +5.60%    (p=0.000 n=9+9)
F32AxpyIncToN2Inc2-12          7.75ns ± 2%   8.21ns ± 5%   +5.84%    (p=0.000 n=8+9)
F32AxpyIncToN2Inc4-12          7.77ns ± 3%   8.15ns ± 2%   +4.88%    (p=0.000 n=9+8)
F32AxpyIncToN2Inc10-12         7.95ns ± 5%   8.15ns ± 1%   +2.46%   (p=0.020 n=10+8)
F32AxpyIncToN3Inc1-12          9.34ns ± 3%  10.07ns ± 7%   +7.82%   (p=0.000 n=10+9)
F32AxpyIncToN3Inc2-12          9.39ns ± 3%  10.03ns ± 5%   +6.90%   (p=0.000 n=10+9)
F32AxpyIncToN3Inc4-12          9.35ns ± 2%  10.00ns ± 3%   +6.94%    (p=0.000 n=9+9)
F32AxpyIncToN3Inc10-12         9.49ns ± 4%  10.08ns ± 6%   +6.27%   (p=0.000 n=9+10)
F32AxpyIncToN4Inc1-12          10.6ns ± 2%    8.7ns ± 6%  -17.82%    (p=0.000 n=8+9)
F32AxpyIncToN4Inc2-12          10.6ns ± 3%    8.8ns ± 6%  -16.28%   (p=0.000 n=9+10)
F32AxpyIncToN4Inc4-12          10.6ns ± 2%    8.7ns ± 2%  -17.91%    (p=0.000 n=9+9)
F32AxpyIncToN4Inc10-12         10.7ns ± 6%    8.7ns ± 2%  -18.08%    (p=0.000 n=9+9)
F32AxpyIncToN10Inc1-12         18.9ns ± 5%   14.7ns ± 1%  -22.10%   (p=0.000 n=10+8)
F32AxpyIncToN10Inc2-12         18.8ns ± 3%   14.7ns ± 2%  -21.39%    (p=0.000 n=9+9)
F32AxpyIncToN10Inc4-12         18.7ns ± 2%   14.9ns ± 3%  -20.68%   (p=0.000 n=9+10)
F32AxpyIncToN10Inc10-12        18.9ns ± 6%   14.7ns ± 2%  -22.56%   (p=0.000 n=10+8)
F32AxpyIncToN1000Inc1-12       1.18µs ± 4%   0.83µs ± 2%  -29.43%   (p=0.000 n=10+8)
F32AxpyIncToN1000Inc2-12       1.18µs ± 3%   0.84µs ± 2%  -29.25%  (p=0.000 n=10+10)
F32AxpyIncToN1000Inc4-12       1.22µs ± 2%   0.84µs ± 6%  -31.04%    (p=0.000 n=9+9)
F32AxpyIncToN1000Inc10-12      1.56µs ± 3%   1.56µs ± 4%     ~       (p=1.000 n=9+9)
F32AxpyIncToN100000Inc1-12      116µs ± 4%     83µs ± 5%  -28.26%  (p=0.000 n=10+10)
F32AxpyIncToN100000Inc2-12      120µs ± 6%     89µs ± 5%  -25.57%  (p=0.000 n=10+10)
F32AxpyIncToN100000Inc4-12      179µs ± 2%    152µs ± 4%  -15.33%   (p=0.000 n=9+10)
F32AxpyIncToN100000Inc10-12     456µs ± 5%    434µs ± 6%   -4.83%    (p=0.011 n=9+9)
F32AxpyIncToN100000IncM1-12     115µs ± 4%     83µs ± 4%  -27.62%    (p=0.000 n=9+9)
F32AxpyIncToN100000IncM2-12     119µs ± 6%     88µs ± 3%  -26.05%  (p=0.000 n=10+10)
F32AxpyIncToN100000IncM4-12     183µs ± 2%    151µs ± 2%  -17.56%   (p=0.000 n=9+10)
F32AxpyIncToN100000IncM10-12    453µs ± 5%    433µs ± 6%   -4.44%    (p=0.014 n=9+9)
F64AxpyUnitary1-12             4.00ns ± 2%   3.29ns ± 2%  -17.69%   (p=0.000 n=9+10)
F64AxpyUnitary2-12             5.32ns ± 3%   3.33ns ± 5%  -37.33%    (p=0.000 n=9+9)
F64AxpyUnitary3-12             6.12ns ± 2%   3.97ns ± 4%  -35.17%   (p=0.000 n=9+10)
F64AxpyUnitary4-12             7.06ns ± 2%   4.26ns ± 3%  -39.68%   (p=0.000 n=10+9)
F64AxpyUnitary5-12             7.89ns ± 2%   4.88ns ± 4%  -38.21%   (p=0.000 n=9+10)
F64AxpyUnitary10-12            12.6ns ± 5%    6.3ns ± 2%  -49.57%  (p=0.000 n=10+10)
F64AxpyUnitary100-12            100ns ± 2%     32ns ± 3%  -68.35%   (p=0.000 n=9+10)
F64AxpyUnitary1000-12           902ns ± 3%    234ns ± 6%  -74.03%   (p=0.000 n=9+10)
F64AxpyUnitary5000-12          4.49µs ± 2%   1.56µs ± 5%  -65.33%   (p=0.000 n=9+10)
F64AxpyUnitary10000-12         8.99µs ± 2%   3.09µs ± 2%  -65.66%    (p=0.000 n=9+9)
F64AxpyUnitary50000-12         46.0µs ± 2%   29.3µs ± 3%  -36.47%   (p=0.000 n=9+10)
F64AxpyUnitaryTo1-12           4.66ns ± 3%   3.90ns ± 4%  -16.33%    (p=0.000 n=9+9)
F64AxpyUnitaryTo2-12           5.81ns ± 3%   4.06ns ± 3%  -30.13%  (p=0.000 n=10+10)
F64AxpyUnitaryTo3-12           6.70ns ± 5%   4.60ns ± 2%  -31.43%   (p=0.000 n=10+9)
F64AxpyUnitaryTo4-12           7.27ns ± 3%   4.62ns ± 3%  -36.45%    (p=0.000 n=9+9)
F64AxpyUnitaryTo5-12           7.90ns ± 2%   5.38ns ± 2%  -31.87%    (p=0.000 n=9+8)
F64AxpyUnitaryTo10-12          13.0ns ± 3%    6.7ns ± 1%  -48.54%    (p=0.000 n=9+9)
F64AxpyUnitaryTo100-12          101ns ± 6%     32ns ± 3%  -68.20%   (p=0.000 n=10+9)
F64AxpyUnitaryTo1000-12         899ns ± 3%    231ns ± 3%  -74.26%    (p=0.000 n=9+9)
F64AxpyUnitaryTo5000-12        4.47µs ± 4%   1.66µs ± 3%  -62.80%    (p=0.000 n=9+9)
F64AxpyUnitaryTo10000-12       8.95µs ± 1%   3.53µs ±10%  -60.60%   (p=0.000 n=9+10)
F64AxpyUnitaryTo50000-12       48.8µs ± 3%   37.1µs ± 1%  -24.07%    (p=0.000 n=9+9)
F64AxpyIncN1Inc1-12            5.54ns ± 3%   4.99ns ± 4%   -9.86%  (p=0.000 n=10+10)
F64AxpyIncN2Inc1-12            6.80ns ± 5%   5.93ns ± 2%  -12.78%  (p=0.000 n=10+10)
F64AxpyIncN2Inc2-12            6.75ns ± 1%   5.92ns ± 3%  -12.27%   (p=0.000 n=9+10)
F64AxpyIncN2Inc4-12            6.83ns ± 4%   5.89ns ± 2%  -13.67%  (p=0.000 n=10+10)
F64AxpyIncN2Inc10-12           6.78ns ± 4%   5.95ns ± 4%  -12.21%   (p=0.000 n=10+9)
F64AxpyIncN3Inc1-12            8.35ns ± 2%   6.60ns ± 2%  -21.02%   (p=0.000 n=10+9)
F64AxpyIncN3Inc2-12            8.35ns ± 2%   6.60ns ± 3%  -20.94%   (p=0.000 n=10+9)
F64AxpyIncN3Inc4-12            8.33ns ± 2%   6.58ns ± 2%  -21.03%   (p=0.000 n=10+9)
F64AxpyIncN3Inc10-12           8.41ns ± 3%   6.62ns ± 3%  -21.30%   (p=0.000 n=10+9)
F64AxpyIncN4Inc1-12            10.1ns ± 3%    8.3ns ± 3%  -18.06%   (p=0.000 n=10+9)
F64AxpyIncN4Inc2-12            10.2ns ± 3%    8.3ns ± 2%  -18.49%   (p=0.000 n=9+10)
F64AxpyIncN4Inc4-12            10.1ns ± 4%    8.3ns ± 2%  -18.34%    (p=0.000 n=9+9)
F64AxpyIncN4Inc10-12           10.0ns ± 3%    8.3ns ± 2%  -17.04%   (p=0.000 n=9+10)
F64AxpyIncN10Inc1-12           16.0ns ± 3%   13.2ns ± 3%  -16.96%   (p=0.000 n=9+10)
F64AxpyIncN10Inc2-12           16.0ns ± 4%   13.2ns ± 4%  -17.34%   (p=0.000 n=10+9)
F64AxpyIncN10Inc4-12           16.0ns ± 3%   13.3ns ± 3%  -17.04%    (p=0.000 n=9+9)
F64AxpyIncN10Inc10-12          16.0ns ± 3%   13.2ns ± 1%  -17.64%    (p=0.000 n=9+8)
F64AxpyIncN1000Inc1-12         1.04µs ± 2%   0.89µs ± 1%  -13.96%    (p=0.000 n=9+9)
F64AxpyIncN1000Inc2-12         1.04µs ± 4%   0.90µs ± 3%  -13.54%    (p=0.000 n=9+9)
F64AxpyIncN1000Inc4-12         1.14µs ± 9%   1.07µs ± 2%   -5.86%   (p=0.000 n=10+9)
F64AxpyIncN1000Inc10-12        2.10µs ± 3%   2.07µs ± 1%     ~       (p=0.136 n=9+9)
F64AxpyIncN100000Inc1-12        103µs ± 1%     92µs ± 1%  -11.37%   (p=0.000 n=8+10)
F64AxpyIncN100000Inc2-12        136µs ± 5%    126µs ± 1%   -7.28%    (p=0.000 n=9+8)
F64AxpyIncN100000Inc4-12        236µs ± 2%    224µs ± 2%   -5.27%    (p=0.000 n=9+9)
F64AxpyIncN100000Inc10-12       769µs ± 7%    735µs ± 9%   -4.37%  (p=0.035 n=10+10)
F64AxpyIncN100000IncM1-12       103µs ± 3%     92µs ± 3%  -11.07%    (p=0.000 n=9+8)
F64AxpyIncN100000IncM2-12       141µs ± 5%    129µs ± 3%   -8.66%  (p=0.000 n=10+10)
F64AxpyIncN100000IncM4-12       240µs ± 5%    224µs ± 2%   -6.49%  (p=0.000 n=10+10)
F64AxpyIncN100000IncM10-12      786µs ± 7%    749µs ± 8%   -4.77%  (p=0.009 n=10+10)
F64AxpyIncToN1Inc1-12          6.44ns ± 2%   6.15ns ± 4%   -4.47%   (p=0.000 n=9+10)
F64AxpyIncToN2Inc1-12          7.86ns ± 6%   6.96ns ± 4%  -11.40%   (p=0.000 n=10+9)
F64AxpyIncToN2Inc2-12          7.77ns ± 2%   6.89ns ± 2%  -11.40%    (p=0.000 n=9+9)
F64AxpyIncToN2Inc4-12          7.71ns ± 0%   7.02ns ± 3%   -8.99%   (p=0.000 n=8+10)
F64AxpyIncToN2Inc10-12         7.74ns ± 1%   6.94ns ± 4%  -10.28%   (p=0.000 n=8+10)
F64AxpyIncToN3Inc1-12          9.60ns ±12%   7.63ns ± 2%  -20.55%   (p=0.000 n=10+9)
F64AxpyIncToN3Inc2-12          9.38ns ± 2%   7.63ns ± 3%  -18.69%   (p=0.000 n=8+10)
F64AxpyIncToN3Inc4-12          9.43ns ± 3%   7.61ns ± 3%  -19.33%    (p=0.000 n=9+9)
F64AxpyIncToN3Inc10-12         9.40ns ± 5%   7.62ns ± 2%  -18.94%   (p=0.000 n=9+10)
F64AxpyIncToN4Inc1-12          10.7ns ± 4%    9.4ns ± 2%  -11.76%   (p=0.000 n=10+9)
F64AxpyIncToN4Inc2-12          10.7ns ± 5%    9.3ns ± 2%  -12.68%   (p=0.000 n=10+9)
F64AxpyIncToN4Inc4-12          10.7ns ± 5%    9.4ns ± 4%  -11.96%  (p=0.000 n=10+10)
F64AxpyIncToN4Inc10-12         10.5ns ± 1%    9.4ns ± 3%  -10.51%   (p=0.000 n=8+10)
F64AxpyIncToN10Inc1-12         18.6ns ± 2%   14.9ns ± 2%  -20.28%    (p=0.000 n=9+8)
F64AxpyIncToN10Inc2-12         18.7ns ± 3%   14.8ns ± 2%  -21.04%   (p=0.000 n=10+9)
F64AxpyIncToN10Inc4-12         18.9ns ± 4%   14.8ns ± 2%  -21.58%    (p=0.000 n=9+9)
F64AxpyIncToN10Inc10-12        18.8ns ± 3%   14.8ns ± 3%  -21.36%   (p=0.000 n=9+10)
F64AxpyIncToN1000Inc1-12       1.18µs ± 4%   0.90µs ± 2%  -23.30%    (p=0.000 n=9+9)
F64AxpyIncToN1000Inc2-12       1.24µs ± 4%   0.94µs ± 2%  -23.92%    (p=0.000 n=9+9)
F64AxpyIncToN1000Inc4-12       1.34µs ± 7%   1.27µs ± 2%   -4.96%   (p=0.000 n=10+9)
F64AxpyIncToN1000Inc10-12      2.58µs ± 3%   2.56µs ± 5%     ~      (p=0.388 n=9+10)
F64AxpyIncToN100000Inc1-12      119µs ± 4%     98µs ± 5%  -17.51%  (p=0.000 n=10+10)
F64AxpyIncToN100000Inc2-12      180µs ± 4%    158µs ± 3%  -12.34%  (p=0.000 n=10+10)
F64AxpyIncToN100000Inc4-12      319µs ± 4%    310µs ± 3%   -3.00%  (p=0.009 n=10+10)
F64AxpyIncToN100000Inc10-12    1.46ms ± 3%   1.44ms ± 2%     ~     (p=0.089 n=10+10)
F64AxpyIncToN100000IncM1-12     117µs ± 1%     98µs ± 2%  -15.76%    (p=0.000 n=8+9)
F64AxpyIncToN100000IncM2-12     184µs ± 2%    159µs ± 3%  -13.75%   (p=0.000 n=10+9)
F64AxpyIncToN100000IncM4-12     320µs ± 4%    310µs ± 5%   -3.18%  (p=0.009 n=10+10)
F64AxpyIncToN100000IncM10-12   1.48ms ± 3%   1.46ms ± 4%     ~     (p=0.218 n=10+10)
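
For readers who want to produce numbers of this kind themselves, here is a minimal sketch of a benchmark over several vector lengths. It times only a plain for-loop baseline; the helper names axpyUnitary and benchAxpy are made up for this example and are not the benchmark code from the PR, which is not shown in this thread.

```go
package main

import (
	"fmt"
	"testing"
)

// axpyUnitary is a plain for-loop baseline: y[i] += a*x[i].
func axpyUnitary(a float64, x, y []float64) {
	for i, v := range x {
		y[i] += a * v
	}
}

// benchAxpy returns a benchmark function for vectors of length n
// (hypothetical helper, not taken from the PR).
func benchAxpy(n int) func(b *testing.B) {
	return func(b *testing.B) {
		x := make([]float64, n)
		y := make([]float64, n)
		for i := range x {
			x[i] = float64(i)
		}
		b.ResetTimer() // exclude the setup above from the timing
		for i := 0; i < b.N; i++ {
			axpyUnitary(2, x, y)
		}
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	for _, n := range []int{1, 10, 1000} {
		r := testing.Benchmark(benchAxpy(n))
		fmt.Printf("AxpyUnitary%d\t%s\n", n, r)
	}
}
```

A table with old/new columns and a delta like the one above is typically produced by running each version several times with go test -bench and comparing the saved outputs with a tool such as benchstat.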

Brendan Tracey

Jun 23, 2016, 12:12:08 PM
to gonum-dev
I'll take a look at the new organization and implementation. Thanks a lot for the hard work!

I'm curious about your future projections. Last I ran it, Go was something like 10x slower than cblas. Have you run the Dgemm benchmark with this code?

Let's assume we get a factor of two (as the F64UnitaryTo benchmarks suggest). Are there more gains to be had in the inner loop, or is this pretty much optimal? I'm not sure how close we can expect to get in the ideal case without just completely implementing in assembly, but I'd be happy to help tweak Dgemm if it can help close the gap significantly. I imagine we have some difficulties in increasing the gap because of assembly call overhead? 

Kunde21

Jun 23, 2016, 9:21:17 PM
to gonum-dev
These benchmarks are comparing against a for loop, so Dgemm is already seeing the asm speedup.  I'm tweaking that code in a different branch to increase the throughput to a full cache line per loop and there's still some work I want to do around the vectors and tails that don't fit the 8-wide loop.  
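
To make the "8-wide loop plus tail" idea concrete, here is a pure-Go sketch of the loop structure. It is not the asm from the PR: the function name axpyUnrolled is hypothetical, and the link to a full cache line assumes float64 elements and 64-byte cache lines (8 x 8 bytes).

```go
package main

import "fmt"

// axpyUnrolled computes y[i] += a*x[i]. The main loop handles 8 elements
// per iteration; the tail loop handles the 0 to 7 elements left over when
// len(x) is not a multiple of 8.
func axpyUnrolled(a float64, x, y []float64) {
	n := len(x)
	if len(y) < n {
		panic("axpyUnrolled: y is shorter than x")
	}
	i := 0
	for ; i+8 <= n; i += 8 {
		// Fixed-size sub-slices let the compiler drop per-element bounds checks.
		xs := x[i : i+8 : i+8]
		ys := y[i : i+8 : i+8]
		ys[0] += a * xs[0]
		ys[1] += a * xs[1]
		ys[2] += a * xs[2]
		ys[3] += a * xs[3]
		ys[4] += a * xs[4]
		ys[5] += a * xs[5]
		ys[6] += a * xs[6]
		ys[7] += a * xs[7]
	}
	// Tail: elements that do not fill a complete group of 8.
	for ; i < n; i++ {
		y[i] += a * x[i]
	}
}

func main() {
	x := make([]float64, 11) // one full group of 8 plus a tail of 3
	y := make([]float64, 11)
	for i := range x {
		x[i] = float64(i + 1)
	}
	axpyUnrolled(2, x, y)
	fmt.Println(y) // [2 4 6 8 10 12 14 16 18 20 22]
}
```

For very short vectors only the tail loop runs, which is one reason small sizes can behave differently from large ones in a design like this.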

Looking at dgemm.go, I don't think there's much more to be gained without pulling another loop into the asm code.  Benchmarks show asm gains for vectors larger than 3, but the loop pays that overhead each iteration.  Like I did here, there's a lot more to be gained by doing the entire matrix multiplication in asm.  Adding the c matrix is easy from there.
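
The per-iteration overhead can be seen in a simplified sketch of a matrix multiply whose innermost loop is an axpy call. This is an assumed, stripped-down structure for illustration (row-major storage, no alpha/beta scaling, no transposes, hypothetical function names), not the actual contents of dgemm.go.

```go
package main

import "fmt"

// axpyUnitary stands in for the asm-backed kernel: y[i] += a*x[i].
func axpyUnitary(a float64, x, y []float64) {
	for i, v := range x {
		y[i] += a * v
	}
}

// gemm computes C += A*B for row-major matrices: A is m-by-k, B is k-by-n,
// and C is m-by-n. The kernel is called m*k times, once per element of A,
// so any fixed call overhead is paid on every one of those iterations.
func gemm(m, n, k int, a, b, c []float64) {
	for i := 0; i < m; i++ {
		ci := c[i*n : (i+1)*n] // row i of C
		for l := 0; l < k; l++ {
			// Row i of C accumulates A[i][l] times row l of B.
			axpyUnitary(a[i*k+l], b[l*n:(l+1)*n], ci)
		}
	}
}

func main() {
	a := []float64{1, 2, 3, 4} // 2x2
	b := []float64{5, 6, 7, 8} // 2x2
	c := make([]float64, 4)
	gemm(2, 2, 2, a, b, c)
	fmt.Println(c) // [19 22 43 50]
}
```

Pulling the l loop (or the whole multiplication) into asm would replace m*k kernel calls with m calls (or a single one), which is the kind of gain described above.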

Getting up to the speed of OpenBLAS is going to depend on porting this 5k-line behemoth and building in a way to identify the CPU capabilities to leverage the AVX instructions.  klauspost (author of asmfmt) has a library that might help, but it's still a non-trivial task.  It's on my dream todo list to implement 

Kunde21

Jun 25, 2016, 6:14:32 PM
to gonum-dev
I got some interesting advice from a hardware-nerd friend today.  Apparently, the LOOP instruction is slower than a DECQ/JNZ pair on newer hardware (post-Nehalem chips) due to changes in the chip design.  My main laptop is on a Nehalem core, so I've been using LOOP in these asm functions.  I'm a bit torn: do we target the newer chips with DECQ/JNZ or save that for the AVX/FMA code?

Also, he's looking over the code in more detail today so I may have some other suggestions from him.  