Improving Gonum performance


Sebastien Binet

Oct 5, 2017, 2:25:14 AM
to gonu...@googlegroups.com
hi there,
I've stumbled on this piece of code from reddit:
I've tried to see what it would look like with Gonum:
it's not always super pretty (especially the part that updates the weights row by row),
and it's even slower than the original (initially because of `mat.Dense.RowView`, which I've modified to reuse a `mat.VecDense` via `RowViewOf`; now it's slow because of the interface checks in `mat.Dot`: https://github.com/sbinet-staging/goAdaline/blob/9b8316625f4170edc195b4e3cd5bb6ca6bb13f8c/main.go#L50)
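(For context, the `runtime.assertI2I2`/`runtime.getitab` samples in the profile further down come from interface checks of the kind sketched here. This is a stdlib-only illustration with hypothetical names, not Gonum's actual code: asserting to a concrete type inside a hot loop pays the itab lookup every iteration, while hoisting the assertion out pays it once.)

```go
package main

import "fmt"

// Vector is a hypothetical read-only interface, standing in for mat.Vector.
type Vector interface{ AtVec(i int) float64 }

// RawVectorer mimics the pattern of exposing underlying storage.
type RawVectorer interface{ RawData() []float64 }

type dense []float64

func (d dense) AtVec(i int) float64 { return d[i] }
func (d dense) RawData() []float64  { return d }

// dotSlow asserts inside the loop: the interface check runs every iteration.
func dotSlow(a, b Vector, n int) float64 {
	var s float64
	for i := 0; i < n; i++ {
		if ra, ok := a.(RawVectorer); ok { // per-iteration interface check
			s += ra.RawData()[i] * b.AtVec(i)
		}
	}
	return s
}

// dotFast hoists the assertions out of the loop and works on raw slices.
func dotFast(a, b Vector, n int) float64 {
	ra, aok := a.(RawVectorer)
	rb, bok := b.(RawVectorer)
	if aok && bok {
		x, y := ra.RawData(), rb.RawData()
		var s float64
		for i := 0; i < n; i++ {
			s += x[i] * y[i]
		}
		return s
	}
	// Generic fallback via the interface methods.
	var s float64
	for i := 0; i < n; i++ {
		s += a.AtVec(i) * b.AtVec(i)
	}
	return s
}

func main() {
	x := dense{1, 2, 3}
	y := dense{4, 5, 6}
	fmt.Println(dotSlow(x, y, 3), dotFast(x, y, 3)) // both print 32
}
```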
see:
```
$> time ./goAdaline-ref -cycles=1000
Test error: 
0.017159516933323248
Weights:
[0.629646421847802 0.5111203968481454 0.21903157034908186 -0.24872268718943477 0.113694617824979 0.03994836902175236 0.09902375643857968 0.5027786680714972 -0.015058660602257478]
real    0m0.080s
user    0m0.077s
sys 0m0.003s
$> time ./goAdaline-gonum -cycles=1000
Test error: 
0.01715951693332323
Weights:
[0.6296464218478025 0.5111203968481459 0.21903157034908222 -0.2487226871894338 0.11369461782497912 0.03994836902175287 0.09902375643858036 0.5027786680714972 -0.015058660602259038]
real    0m0.278s
user    0m0.276s
sys 0m0.003s
```
here is a `pprof` output:
```
$> go tool pprof ./cpu.prof 
File: goAdaline-gonum
Type: cpu
Time: Oct 3, 2017 at 9:29am (CEST)
Duration: 400.60ms, Total samples = 260ms (64.90%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top50
Showing nodes accounting for 260ms, 100% of 260ms total
      flat  flat%   sum%        cum   cum%
     100ms 38.46% 38.46%      260ms   100%  main.main /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
      30ms 11.54% 50.00%       30ms 11.54%  gonum.org/v1/gonum/mat.(*VecDense).At /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/index_no_bound_checks.go
      30ms 11.54% 61.54%       40ms 15.38%  gonum.org/v1/gonum/mat.(*VecDense).RowViewOf /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/vector.go
      10ms  3.85% 73.08%       30ms 11.54%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Ddot <autogenerated>
      10ms  3.85% 80.77%       10ms  3.85%  gonum.org/v1/gonum/mat.(*Dense).At /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/index_no_bound_checks.go
      10ms  3.85% 84.62%       10ms  3.85%  gonum.org/v1/gonum/mat.(*Dense).RawMatrix /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/dense.go
      10ms  3.85% 88.46%       10ms  3.85%  gonum.org/v1/gonum/mat.(*VecDense).RawVector /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/vector.go
      10ms  3.85% 92.31%       20ms  7.69%  gonum.org/v1/gonum/mat.Sum /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/matrix.go
      10ms  3.85% 96.15%       20ms  7.69%  runtime.assertI2I2 /usr/lib/go/src/runtime/iface.go
      10ms  3.85%   100%       10ms  3.85%  runtime.getitab /usr/lib/go/src/runtime/iface.go
         0     0%   100%       10ms  3.85%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Dgemv <autogenerated>
         0     0%   100%       60ms 23.08%  gonum.org/v1/gonum/mat.Dot /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/matrix.go
         0     0%   100%       30ms 11.54%  main.computeError /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
         0     0%   100%      260ms   100%  runtime.main /usr/lib/go/src/runtime/proc.go
(pprof) 
(pprof) list main
Total: 260ms
ROUTINE ======================== main.computeError in /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
         0       30ms (flat, cum) 11.54% of Total
         .          .    198:}
         .          .    199:
         .          .    200:func computeError(data *mat.Dense, expected, weights *mat.VecDense) float64 {
         .          .    201:
         .          .    202:   var errs mat.VecDense
         .       10ms    203:   errs.MulVec(data, weights)
         .          .    204:   errs.SubVec(expected, &errs)
         .          .    205:   errs.MulElemVec(&errs, &errs)
         .          .    206:
         .       20ms    207:   return mat.Sum(&errs) / float64(errs.Len())
         .          .    208:}
ROUTINE ======================== main.main in /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
     100ms      260ms (flat, cum)   100% of Total
         .          .     56:   var errorsTrain []float64
         .          .     57:   var errorsValidate []float64
         .          .     58:   var errorsTest float64
         .          .     59:
         .          .     60:   // Learning
      10ms       10ms     61:   for cycle := 0; cycle < *cycles; cycle++ {
         .          .     62:       var row mat.VecDense
         .          .     63:       for i := 0; i < nrows; i++ {
         .       40ms     64:           row.RowViewOf(data, i)
         .          .     65:           // Calculate estimate
         .       60ms     66:           estimate := mat.Dot(&row, weights)
         .          .     67:           // Update weights (range passes values as a copy)
         .          .     68:           raw := weights.RawVector().Data
      20ms       20ms     69:           for x := range raw {
      70ms      100ms     70:               raw[x] += *learningRate * (expectedY.At(i, 0) - estimate) * data.At(i, x)
         .          .     71:           }
         .          .     72:       }
         .          .     73:
         .          .     74:       // Compute cycle train error
         .       30ms     75:       errorsTrain = append(errorsTrain, computeError(data, expectedY, weights))
         .          .     76:       errorsValidate = append(errorsValidate, computeError(validateData, valExpectedY, weights))
         .          .     77:   }
         .          .     78:
         .          .     79:   errorsTest = computeError(testData, testExpectedY, weights)
         .          .     80:
```
not sure whether there is a more performant Gonum-based way to do this...

sent from my droid

Dan Kortschak

Oct 5, 2017, 2:44:05 AM
to Sebastien Binet, gonu...@googlegroups.com
If mat is causing the problems, don't use it; we provide blas64 (and
blas for that matter) as a public package for a reason.

```untested
rate := *learningRate
// With the 2017 blas64 API, a Vector is just {Inc, Data};
// weights must also be initialised as a blas64.Vector.
row := blas64.Vector{Inc: 1}
raw := data.RawMatrix()
expRaw := expectedY.RawMatrix()
for cycle := 0; cycle < *cycles; cycle++ {
	for i := 0; i < nrows; i++ {
		row.Data = raw.Data[i*raw.Stride : i*raw.Stride+ncols]
		// Calculate estimate.
		estimate := blas64.Dot(ncols, row, weights)
		// Update weights.
		for j, v := range row.Data {
			weights.Data[j] += rate * (expRaw.Data[i*expRaw.Stride] - estimate) * v
		}
	}

	// Compute cycle train error.
	errorsTrain = append(errorsTrain, computeError(data, expectedY, weights))
	errorsValidate = append(errorsValidate, computeError(validateData, valExpectedY, weights))
}
```

It's not ideal, but if you want very high performance in tight loops,
sometimes it's necessary.

Sebastien Binet

Oct 5, 2017, 3:42:06 AM
to Dan Kortschak, gonu...@googlegroups.com
Thanks Dan.
I'll check how this translates performance-wise.

I guess this is good food for thought for multi-dim slices for Go 2 :)

Cheers,
-s

sent from my droid



--
You received this message because you are subscribed to the Google Groups "gonum-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gonum-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sebastien Binet

Oct 7, 2017, 6:02:54 AM
to Dan Kortschak, gonu...@googlegroups.com
On Thu, Oct 5, 2017 at 9:42 AM, Sebastien Binet <seb....@gmail.com> wrote:
Thanks Dan.
I'll check how this translates performance-wise.

ok. so it's better but still no cigar :)

```
$> time ./goAdaline-ref  -cycles=100000
Test error: 
0.017159516933323238
Weights:
[0.6296464218478021 0.5111203968481455 0.2190315703490819 -0.2487226871894348 0.11369461782497882 0.03994836902175239 0.09902375643857965 0.5027786680714972 -0.015058660602257527]

real 0m3.312s
user 0m3.500s
sys 0m0.011s

$> time ./goAdaline-gonum  -cycles=100000
Test error: 
0.017159516933323238
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m12.858s
user 0m12.902s
sys 0m0.044s

$> time ./goAdaline-blas64 -cycles=100000
Test error: 
0.017159516933323238
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m4.734s
user 0m4.794s
sys 0m0.024s
```

goAdaline-blas64 is using gonum/blas instead of gonum/mat.
everything is in the "use-gonum" branch.

There are probably some more gains to be had by migrating everything to gonum/blas (especially in the `computeError` func), but there isn't much left to shave off:
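For what it's worth, a blas64-style `computeError` would be a Gemv for the predictions followed by a dot of the residual with itself. A plain-slice sketch of the same computation (hypothetical helper, not the code in the branch):

```go
package main

import "fmt"

// mse computes mean((y - Xw)^2), the quantity computeError returns,
// with the Gemv (predictions) and Ddot (squared residual norm) steps
// written out as explicit loops.
func mse(x [][]float64, y, w []float64) float64 {
	n := len(x)
	var sum float64
	for i := 0; i < n; i++ {
		// Row i of the Gemv: est = x[i] . w.
		var est float64
		for j, wj := range w {
			est += x[i][j] * wj
		}
		// Accumulate the squared residual.
		r := y[i] - est
		sum += r * r
	}
	return sum / float64(n)
}

func main() {
	x := [][]float64{{1, 0}, {0, 1}}
	y := []float64{1, 2}
	w := []float64{1, 1}
	fmt.Println(mse(x, y, w)) // residuals [0, 1], so mean square is 0.5
}
```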

```
(pprof) top20
Showing nodes accounting for 4.55s, 95.99% of 4.74s total
Dropped 37 nodes (cum <= 0.02s)
Showing top 20 nodes out of 39
      flat  flat%   sum%        cum   cum%
     1.16s 24.47% 24.47%      4.67s 98.52%  main.main /home/binet/work/gonum/src/github.com/Plorenzo/goAdaline/main.go
     1.13s 23.84% 48.31%      1.13s 23.84%  gonum.org/v1/gonum/internal/asm/f64.DotUnitary /home/binet/work/gonum/src/gonum.org/v1/gonum/internal/asm/f64/dot_amd64.s
     0.49s 10.34% 58.65%      0.78s 16.46%  gonum.org/v1/gonum/blas/gonum.Implementation.Dgemv /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/gonum/level2double.go
     0.36s  7.59% 66.24%      1.21s 25.53%  gonum.org/v1/gonum/blas/gonum.Implementation.Ddot /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/gonum/level1double_ddot.go
     0.34s  7.17% 73.42%      0.52s 10.97%  gonum.org/v1/gonum/mat.Sum /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/matrix.go
     0.31s  6.54% 79.96%      1.78s 37.55%  gonum.org/v1/gonum/blas/blas64.Dot /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/blas64/blas64.go
     0.26s  5.49% 85.44%      1.47s 31.01%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Ddot <autogenerated>
     0.16s  3.38% 88.82%      0.16s  3.38%  gonum.org/v1/gonum/mat.(*VecDense).At /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/index_no_bound_checks.go
     0.10s  2.11% 90.93%      0.10s  2.11%  gonum.org/v1/gonum/mat.(*VecDense).MulElemVec /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/vector.go
     0.10s  2.11% 93.04%      0.10s  2.11%  runtime.memclrNoHeapPointers /usr/lib/go/src/runtime/memclr_amd64.s
     0.03s  0.63% 93.67%      0.03s  0.63%  runtime.getitab /usr/lib/go/src/runtime/iface.go
     0.03s  0.63% 94.30%      0.03s  0.63%  runtime.greyobject /usr/lib/go/src/runtime/mgcmark.go
     0.02s  0.42% 94.73%      0.04s  0.84%  gonum.org/v1/gonum/mat.(*VecDense).SubVec /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/vector.go
     0.02s  0.42% 95.15%      1.63s 34.39%  main.computeError /home/binet/work/gonum/src/github.com/Plorenzo/goAdaline/main.go
     0.01s  0.21% 95.36%      0.04s  0.84%  runtime.assertI2I2 /usr/lib/go/src/runtime/iface.go
     0.01s  0.21% 95.57%      0.03s  0.63%  runtime.gcDrain /usr/lib/go/src/runtime/mgcmark.go
     0.01s  0.21% 95.78%      0.03s  0.63%  runtime.scanobject /usr/lib/go/src/runtime/mgcmark.go
     0.01s  0.21% 95.99%      0.03s  0.63%  strconv.formatDigits /usr/lib/go/src/strconv/ftoa.go
         0     0% 95.99%      0.78s 16.46%  gonum.org/v1/gonum/blas/blas64.Gemv /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/blas64/blas64.go
         0     0% 95.99%      0.78s 16.46%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Dgemv <autogenerated>
```

-s

Kunde21

Oct 7, 2017, 8:19:41 PM
to gonum-dev
Can you pull the `f64/gemv` branch and run it with that?

Sebastien Binet

Oct 8, 2017, 3:36:29 AM
to Kunde21, gonum-dev
On Sun, Oct 8, 2017 at 2:19 AM, Kunde21 <kun...@gmail.com> wrote:
Can you pull the `f64/gemv` branch and run it with that?

it improves a bit:

```
$> time ./goAdaline-no-gemv-branch -cycles=100000 
Test error: 
0.017159516933323238
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m4.755s
user 0m4.852s
sys 0m0.017s

$> time ./goAdaline-gemv-branch -cycles=100000 
Test error: 
0.01715951693332324
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m4.281s
user 0m4.374s
sys 0m0.020s
```

but we are still not there:

```
$> time ./goAdaline-ref -cycles=100000 
Test error: 
0.017159516933323238
Weights:
[0.6296464218478021 0.5111203968481455 0.2190315703490819 -0.2487226871894348 0.11369461782497882 0.03994836902175239 0.09902375643857965 0.5027786680714972 -0.015058660602257527]

real 0m3.318s
user 0m4.253s
sys 0m0.057s
```



Kunde21

Oct 8, 2017, 5:44:32 PM
to gonum-dev
I think the biggest problem is the narrow dataset. Any function call inside the training loop won't make up for its call overhead, because there's just not enough data in each row.

If you were to change the algorithm from streaming to batch learning, you could use the bigger matrix calculations to process enough data to see improvements. 
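To make the streaming-vs-batch distinction concrete, here is a hedged plain-slice sketch of one full-batch Adaline cycle, `w += rate * Xᵀ(y - Xw)`, where the two inner loops would collapse to a pair of Gemv calls over the whole matrix in Gonum. All names here are illustrative, not the thread's code:

```go
package main

import "fmt"

// batchCycle performs one full-batch Adaline update instead of
// updating the weights row by row.
func batchCycle(x [][]float64, y, w []float64, rate float64) {
	n, m := len(x), len(w)
	// Residuals for the whole batch: r = y - Xw (one Gemv in blas64).
	r := make([]float64, n)
	for i := 0; i < n; i++ {
		var est float64
		for j := 0; j < m; j++ {
			est += x[i][j] * w[j]
		}
		r[i] = y[i] - est
	}
	// Gradient step: w += rate * Xᵀr (a second, transposed Gemv).
	for j := 0; j < m; j++ {
		var g float64
		for i := 0; i < n; i++ {
			g += x[i][j] * r[i]
		}
		w[j] += rate * g
	}
}

func main() {
	x := [][]float64{{1, 0}, {0, 1}}
	y := []float64{1, 2}
	w := []float64{0, 0}
	batchCycle(x, y, w, 0.5)
	fmt.Println(w) // identity inputs: w moves halfway towards y
}
```

Batching trades per-row weight updates (which converge differently) for far fewer function calls and larger, BLAS-friendly operands.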

Dan Kortschak

Oct 8, 2017, 8:54:35 PM
to Sebastien Binet, Kunde21, gonum-dev
Just for kicks, what happens with `-tag noasm`?

Brendan Tracey

Oct 10, 2017, 5:50:57 PM
to Dan Kortschak, Sebastien Binet, Kunde21, gonum-dev
You mean, `-tags noasm`

Dan Kortschak

Oct 10, 2017, 6:02:46 PM
to Brendan Tracey, Sebastien Binet, Kunde21, gonum-dev
Yes.

On Tue, 2017-10-10 at 15:50 -0600, Brendan Tracey wrote:
> You mean, `-tags noasm`