Improving Gonum performance


Sebastien Binet

Oct 5, 2017, 2:25:14 AM
to gonu...@googlegroups.com
hi there,
I've stumbled on this piece of code from reddit:
I've tried to see what it would look like with Gonum:
it's not always super pretty (especially the part that updates the weights row by row),
and it's even slower than the original (initially because of `mat.Dense.RowView`, which I've modified to reuse a `mat.VecDense` via `RowViewOf`; now it's slow because of the interface checks in `mat.Dot`: https://github.com/sbinet-staging/goAdaline/blob/9b8316625f4170edc195b4e3cd5bb6ca6bb13f8c/main.go#L50)
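(For context, the `runtime.assertI2I2`/`runtime.getitab` samples in the profile further down come from interface checks of the kind sketched here. This is a stdlib-only illustration with hypothetical names, not Gonum's actual code: asserting to a concrete type inside a hot loop pays the itab lookup every iteration, while hoisting the assertion out pays it once.)

```go
package main

import "fmt"

// Vector is a hypothetical read-only interface, standing in for mat.Vector.
type Vector interface{ AtVec(i int) float64 }

// RawVectorer mimics the pattern of exposing underlying storage.
type RawVectorer interface{ RawData() []float64 }

type dense []float64

func (d dense) AtVec(i int) float64 { return d[i] }
func (d dense) RawData() []float64  { return d }

// dotSlow asserts inside the loop: the interface check runs every iteration.
func dotSlow(a, b Vector, n int) float64 {
	var s float64
	for i := 0; i < n; i++ {
		if ra, ok := a.(RawVectorer); ok { // per-iteration interface check
			s += ra.RawData()[i] * b.AtVec(i)
		}
	}
	return s
}

// dotFast hoists the assertions out of the loop and works on raw slices.
func dotFast(a, b Vector, n int) float64 {
	ra, aok := a.(RawVectorer)
	rb, bok := b.(RawVectorer)
	if aok && bok {
		x, y := ra.RawData(), rb.RawData()
		var s float64
		for i := 0; i < n; i++ {
			s += x[i] * y[i]
		}
		return s
	}
	// Generic fallback via the interface methods.
	var s float64
	for i := 0; i < n; i++ {
		s += a.AtVec(i) * b.AtVec(i)
	}
	return s
}

func main() {
	x := dense{1, 2, 3}
	y := dense{4, 5, 6}
	fmt.Println(dotSlow(x, y, 3), dotFast(x, y, 3)) // both print 32
}
```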
see:
```
$> time ./goAdaline-ref -cycles=1000
Test error: 
0.017159516933323248
Weights:
[0.629646421847802 0.5111203968481454 0.21903157034908186 -0.24872268718943477 0.113694617824979 0.03994836902175236 0.09902375643857968 0.5027786680714972 -0.015058660602257478]
real    0m0.080s
user    0m0.077s
sys 0m0.003s
$> time ./goAdaline-gonum -cycles=1000
Test error: 
0.01715951693332323
Weights:
[0.6296464218478025 0.5111203968481459 0.21903157034908222 -0.2487226871894338 0.11369461782497912 0.03994836902175287 0.09902375643858036 0.5027786680714972 -0.015058660602259038]
real    0m0.278s
user    0m0.276s
sys 0m0.003s
```
here is a `pprof` output:
```
$> go tool pprof ./cpu.prof 
File: goAdaline-gonum
Type: cpu
Time: Oct 3, 2017 at 9:29am (CEST)
Duration: 400.60ms, Total samples = 260ms (64.90%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top50
Showing nodes accounting for 260ms, 100% of 260ms total
      flat  flat%   sum%        cum   cum%
     100ms 38.46% 38.46%      260ms   100%  main.main /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
      30ms 11.54% 50.00%       30ms 11.54%  gonum.org/v1/gonum/mat.(*VecDense).At /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/index_no_bound_checks.go
      30ms 11.54% 61.54%       40ms 15.38%  gonum.org/v1/gonum/mat.(*VecDense).RowViewOf /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/vector.go
      10ms  3.85% 73.08%       30ms 11.54%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Ddot <autogenerated>
      10ms  3.85% 80.77%       10ms  3.85%  gonum.org/v1/gonum/mat.(*Dense).At /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/index_no_bound_checks.go
      10ms  3.85% 84.62%       10ms  3.85%  gonum.org/v1/gonum/mat.(*Dense).RawMatrix /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/dense.go
      10ms  3.85% 88.46%       10ms  3.85%  gonum.org/v1/gonum/mat.(*VecDense).RawVector /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/vector.go
      10ms  3.85% 92.31%       20ms  7.69%  gonum.org/v1/gonum/mat.Sum /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/matrix.go
      10ms  3.85% 96.15%       20ms  7.69%  runtime.assertI2I2 /usr/lib/go/src/runtime/iface.go
      10ms  3.85%   100%       10ms  3.85%  runtime.getitab /usr/lib/go/src/runtime/iface.go
         0     0%   100%       10ms  3.85%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Dgemv <autogenerated>
         0     0%   100%       60ms 23.08%  gonum.org/v1/gonum/mat.Dot /home/binet/tmp/go/src/gonum.org/v1/gonum/mat/matrix.go
         0     0%   100%       30ms 11.54%  main.computeError /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
         0     0%   100%      260ms   100%  runtime.main /usr/lib/go/src/runtime/proc.go
(pprof) 
(pprof) list main
Total: 260ms
ROUTINE ======================== main.computeError in /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
         0       30ms (flat, cum) 11.54% of Total
         .          .    198:}
         .          .    199:
         .          .    200:func computeError(data *mat.Dense, expected, weights *mat.VecDense) float64 {
         .          .    201:
         .          .    202:   var errs mat.VecDense
         .       10ms    203:   errs.MulVec(data, weights)
         .          .    204:   errs.SubVec(expected, &errs)
         .          .    205:   errs.MulElemVec(&errs, &errs)
         .          .    206:
         .       20ms    207:   return mat.Sum(&errs) / float64(errs.Len())
         .          .    208:}
ROUTINE ======================== main.main in /home/binet/tmp/go/src/github.com/Plorenzo/goAdaline/main.go
     100ms      260ms (flat, cum)   100% of Total
         .          .     56:   var errorsTrain []float64
         .          .     57:   var errorsValidate []float64
         .          .     58:   var errorsTest float64
         .          .     59:
         .          .     60:   // Learning
      10ms       10ms     61:   for cycle := 0; cycle < *cycles; cycle++ {
         .          .     62:       var row mat.VecDense
         .          .     63:       for i := 0; i < nrows; i++ {
         .       40ms     64:           row.RowViewOf(data, i)
         .          .     65:           // Calculate estimate
         .       60ms     66:           estimate := mat.Dot(&row, weights)
         .          .     67:           // Update weights (range passes values as a copy)
         .          .     68:           raw := weights.RawVector().Data
      20ms       20ms     69:           for x := range raw {
      70ms      100ms     70:               raw[x] += *learningRate * (expectedY.At(i, 0) - estimate) * data.At(i, x)
         .          .     71:           }
         .          .     72:       }
         .          .     73:
         .          .     74:       // Compute cycle train error
         .       30ms     75:       errorsTrain = append(errorsTrain, computeError(data, expectedY, weights))
         .          .     76:       errorsValidate = append(errorsValidate, computeError(validateData, valExpectedY, weights))
         .          .     77:   }
         .          .     78:
         .          .     79:   errorsTest = computeError(testData, testExpectedY, weights)
         .          .     80:
```
not sure whether there is a more performant Gonum-based way to do this...

sent from my droid

Dan Kortschak

Oct 5, 2017, 2:44:05 AM
to Sebastien Binet, gonu...@googlegroups.com
If mat is causing the problems, don't use it; we provide blas64 (and
blas for that matter) as a public package for a reason.

```untested
rate := *learningRate
// With the 2017 blas64 API, a Vector is just {Inc, Data};
// weights must also be initialised as a blas64.Vector.
row := blas64.Vector{Inc: 1}
raw := data.RawMatrix()
expRaw := expectedY.RawMatrix()
for cycle := 0; cycle < *cycles; cycle++ {
	for i := 0; i < nrows; i++ {
		row.Data = raw.Data[i*raw.Stride : i*raw.Stride+ncols]
		// Calculate estimate.
		estimate := blas64.Dot(ncols, row, weights)
		// Update weights.
		for j, v := range row.Data {
			weights.Data[j] += rate * (expRaw.Data[i*expRaw.Stride] - estimate) * v
		}
	}

	// Compute cycle train error.
	errorsTrain = append(errorsTrain, computeError(data, expectedY, weights))
	errorsValidate = append(errorsValidate, computeError(validateData, valExpectedY, weights))
}
```

It's not ideal, but if you want very high performance in tight loops,
sometimes it's necessary.

Sebastien Binet

Oct 5, 2017, 3:42:06 AM
to Dan Kortschak, gonu...@googlegroups.com
Thanks Dan.
I'll check how this translates performance-wise.

I guess this is good food for thought for multi-dim slices for Go 2 :)

Cheers,
-s

sent from my droid



--
You received this message because you are subscribed to the Google Groups "gonum-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gonum-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sebastien Binet

Oct 7, 2017, 6:02:54 AM
to Dan Kortschak, gonu...@googlegroups.com
On Thu, Oct 5, 2017 at 9:42 AM, Sebastien Binet <seb....@gmail.com> wrote:
Thanks Dan.
I'll check how this translates performance-wise.

ok. so it's better but still no cigar :)

```
$> time ./goAdaline-ref  -cycles=100000
Test error: 
0.017159516933323238
Weights:
[0.6296464218478021 0.5111203968481455 0.2190315703490819 -0.2487226871894348 0.11369461782497882 0.03994836902175239 0.09902375643857965 0.5027786680714972 -0.015058660602257527]

real 0m3.312s
user 0m3.500s
sys 0m0.011s

$> time ./goAdaline-gonum  -cycles=100000
Test error: 
0.017159516933323238
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m12.858s
user 0m12.902s
sys 0m0.044s

$> time ./goAdaline-blas64 -cycles=100000
Test error: 
0.017159516933323238
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m4.734s
user 0m4.794s
sys 0m0.024s
```

goAdaline-blas64 is using gonum/blas instead of gonum/mat.
everything is in the "use-gonum" branch.

There are probably some more gains to be had by migrating everything to gonum/blas (especially in the `computeError` func), but there isn't much left to shave off:
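For what it's worth, a blas64-style `computeError` would be a Gemv for the predictions followed by a dot of the residual with itself. A plain-slice sketch of the same computation (hypothetical helper, not the code in the branch):

```go
package main

import "fmt"

// mse computes mean((y - Xw)^2), the quantity computeError returns,
// with the Gemv (predictions) and Ddot (squared residual norm) steps
// written out as explicit loops.
func mse(x [][]float64, y, w []float64) float64 {
	n := len(x)
	var sum float64
	for i := 0; i < n; i++ {
		// Row i of the Gemv: est = x[i] . w.
		var est float64
		for j, wj := range w {
			est += x[i][j] * wj
		}
		// Accumulate the squared residual.
		r := y[i] - est
		sum += r * r
	}
	return sum / float64(n)
}

func main() {
	x := [][]float64{{1, 0}, {0, 1}}
	y := []float64{1, 2}
	w := []float64{1, 1}
	fmt.Println(mse(x, y, w)) // residuals [0, 1], so mean square is 0.5
}
```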

```
(pprof) top20
Showing nodes accounting for 4.55s, 95.99% of 4.74s total
Dropped 37 nodes (cum <= 0.02s)
Showing top 20 nodes out of 39
      flat  flat%   sum%        cum   cum%
     1.16s 24.47% 24.47%      4.67s 98.52%  main.main /home/binet/work/gonum/src/github.com/Plorenzo/goAdaline/main.go
     1.13s 23.84% 48.31%      1.13s 23.84%  gonum.org/v1/gonum/internal/asm/f64.DotUnitary /home/binet/work/gonum/src/gonum.org/v1/gonum/internal/asm/f64/dot_amd64.s
     0.49s 10.34% 58.65%      0.78s 16.46%  gonum.org/v1/gonum/blas/gonum.Implementation.Dgemv /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/gonum/level2double.go
     0.36s  7.59% 66.24%      1.21s 25.53%  gonum.org/v1/gonum/blas/gonum.Implementation.Ddot /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/gonum/level1double_ddot.go
     0.34s  7.17% 73.42%      0.52s 10.97%  gonum.org/v1/gonum/mat.Sum /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/matrix.go
     0.31s  6.54% 79.96%      1.78s 37.55%  gonum.org/v1/gonum/blas/blas64.Dot /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/blas64/blas64.go
     0.26s  5.49% 85.44%      1.47s 31.01%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Ddot <autogenerated>
     0.16s  3.38% 88.82%      0.16s  3.38%  gonum.org/v1/gonum/mat.(*VecDense).At /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/index_no_bound_checks.go
     0.10s  2.11% 90.93%      0.10s  2.11%  gonum.org/v1/gonum/mat.(*VecDense).MulElemVec /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/vector.go
     0.10s  2.11% 93.04%      0.10s  2.11%  runtime.memclrNoHeapPointers /usr/lib/go/src/runtime/memclr_amd64.s
     0.03s  0.63% 93.67%      0.03s  0.63%  runtime.getitab /usr/lib/go/src/runtime/iface.go
     0.03s  0.63% 94.30%      0.03s  0.63%  runtime.greyobject /usr/lib/go/src/runtime/mgcmark.go
     0.02s  0.42% 94.73%      0.04s  0.84%  gonum.org/v1/gonum/mat.(*VecDense).SubVec /home/binet/work/gonum/src/gonum.org/v1/gonum/mat/vector.go
     0.02s  0.42% 95.15%      1.63s 34.39%  main.computeError /home/binet/work/gonum/src/github.com/Plorenzo/goAdaline/main.go
     0.01s  0.21% 95.36%      0.04s  0.84%  runtime.assertI2I2 /usr/lib/go/src/runtime/iface.go
     0.01s  0.21% 95.57%      0.03s  0.63%  runtime.gcDrain /usr/lib/go/src/runtime/mgcmark.go
     0.01s  0.21% 95.78%      0.03s  0.63%  runtime.scanobject /usr/lib/go/src/runtime/mgcmark.go
     0.01s  0.21% 95.99%      0.03s  0.63%  strconv.formatDigits /usr/lib/go/src/strconv/ftoa.go
         0     0% 95.99%      0.78s 16.46%  gonum.org/v1/gonum/blas/blas64.Gemv /home/binet/work/gonum/src/gonum.org/v1/gonum/blas/blas64/blas64.go
         0     0% 95.99%      0.78s 16.46%  gonum.org/v1/gonum/blas/gonum.(*Implementation).Dgemv <autogenerated>
```

-s

Kunde21

Oct 7, 2017, 8:19:41 PM
to gonum-dev
Can you pull the `f64/gemv` branch and run it with that?

Sebastien Binet

Oct 8, 2017, 3:36:29 AM
to Kunde21, gonum-dev
On Sun, Oct 8, 2017 at 2:19 AM, Kunde21 <kun...@gmail.com> wrote:
Can you pull the `f64/gemv` branch and run it with that?

it improves a bit:

```
$> time ./goAdaline-no-gemv-branch -cycles=100000 
Test error: 
0.017159516933323238
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m4.755s
user 0m4.852s
sys 0m0.017s

$> time ./goAdaline-gemv-branch -cycles=100000 
Test error: 
0.01715951693332324
Weights:
[0.6296464218478026 0.511120396848146 0.21903157034908224 -0.24872268718943405 0.11369461782497912 0.0399483690217529 0.09902375643858032 0.5027786680714972 -0.015058660602258935]

real 0m4.281s
user 0m4.374s
sys 0m0.020s
```

but we are still not there:

```
$> time ./goAdaline-ref -cycles=100000 
Test error: 
0.017159516933323238
Weights:
[0.6296464218478021 0.5111203968481455 0.2190315703490819 -0.2487226871894348 0.11369461782497882 0.03994836902175239 0.09902375643857965 0.5027786680714972 -0.015058660602257527]

real 0m3.318s
user 0m4.253s
sys 0m0.057s
```



Kunde21

Oct 8, 2017, 5:44:32 PM
to gonum-dev
I think the biggest problem is the narrow dataset. Any function call inside the training loop won't make up for its call overhead, because there's just not enough data in each row.

If you were to change the algorithm from streaming to batch learning, you could use the bigger matrix calculations to process enough data to see improvements. 
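To make the streaming-vs-batch distinction concrete, here is a hedged plain-slice sketch of one full-batch Adaline cycle, `w += rate * Xᵀ(y - Xw)`, where the two inner loops would collapse to a pair of Gemv calls over the whole matrix in Gonum. All names here are illustrative, not the thread's code:

```go
package main

import "fmt"

// batchCycle performs one full-batch Adaline update instead of
// updating the weights row by row.
func batchCycle(x [][]float64, y, w []float64, rate float64) {
	n, m := len(x), len(w)
	// Residuals for the whole batch: r = y - Xw (one Gemv in blas64).
	r := make([]float64, n)
	for i := 0; i < n; i++ {
		var est float64
		for j := 0; j < m; j++ {
			est += x[i][j] * w[j]
		}
		r[i] = y[i] - est
	}
	// Gradient step: w += rate * Xᵀr (a second, transposed Gemv).
	for j := 0; j < m; j++ {
		var g float64
		for i := 0; i < n; i++ {
			g += x[i][j] * r[i]
		}
		w[j] += rate * g
	}
}

func main() {
	x := [][]float64{{1, 0}, {0, 1}}
	y := []float64{1, 2}
	w := []float64{0, 0}
	batchCycle(x, y, w, 0.5)
	fmt.Println(w) // identity inputs: w moves halfway towards y
}
```

Batching trades per-row weight updates (which converge differently) for far fewer function calls and larger, BLAS-friendly operands.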

Dan Kortschak

Oct 8, 2017, 8:54:35 PM
to Sebastien Binet, Kunde21, gonum-dev
Just for kicks, what happens with `-tag noasm`?

Brendan Tracey

Oct 10, 2017, 5:50:57 PM
to Dan Kortschak, Sebastien Binet, Kunde21, gonum-dev
You mean, `-tags noasm`

Dan Kortschak

Oct 10, 2017, 6:02:46 PM
to Brendan Tracey, Sebastien Binet, Kunde21, gonum-dev
Yes.

On Tue, 2017-10-10 at 15:50 -0600, Brendan Tracey wrote:
> You mean, `-tags noasm`