Knet 1.1.1 is out: performance improvements, new monitoring tools, GPU memory manager

Deniz Yuret

Oct 1, 2018, 1:15:57 PM

to knet-users, knet...@googlegroups.com

Knet 1.1.0 introduced a new interface that allows the use of structs in models and callable objects for model/layer definitions.

Knet 1.1.1 focuses on performance. There are a number of performance improvements, the most important of which is a new GPU memory manager. GPU memory use is reduced by up to 50%, which should allow larger models and larger batch sizes.

While working on performance improvements I developed some monitoring tools as well. Julia frequently crashes when profiling GPU code, so I decided to use TimerOutputs instead. If the KNET_TIMER environment variable is set while Knet is built, the timing code will be compiled in and the `Knet.to` variable will hold timing information for all GPU calls. Similarly, the AUTOGRAD_TIMER environment variable controls whether AutoGrad records timing information for the forward and backward passes over the tape in the `AutoGrad.to` variable. Here is what sample outputs look like:

julia> AutoGrad.to
──────────────────────────────────────────────────────────────────────
                                Time                  Allocations
                       ──────────────────────   ───────────────────────
Tot / % measured:          4.62s / 30.4%           546MiB / 25.0%

Section       ncalls     time   %tot     avg     alloc   %tot      avg
+.[2]              1    328ms  23.3%   328ms   46.4MiB  34.1%  46.4MiB
sum[2]             1    288ms  20.5%   288ms   40.0MiB  29.4%  40.0MiB
*                  1   38.8ms  2.76%  38.8ms    595KiB  0.43%   595KiB
*                  1    269ms  19.2%   269ms    955KiB  0.68%   955KiB
+.                 1    139ms  9.92%   139ms   20.4MiB  15.0%  20.4MiB
*[1]               1    117ms  8.33%   117ms   9.41MiB  6.90%  9.41MiB
record             4   88.7ms  6.31%  22.2ms   3.49MiB  2.56%   894KiB
-[1]               1   65.9ms  4.69%  65.9ms   10.0MiB  7.32%  10.0MiB
-                  1   55.8ms  3.97%  55.8ms    929KiB  0.67%   929KiB
sum                1   50.0ms  3.56%  50.0ms   4.68MiB  3.44%  4.68MiB
+.[1]              1   1.78ms  0.13%  1.78ms   37.7KiB  0.03%  37.7KiB
sum_outgrads       5   1.41ms  0.10%   282μs   28.2KiB  0.02%  5.64KiB

julia> Knet.to
───────────────────────────────────────────────────────────────────────────────
                                         Time                  Allocations
                                ──────────────────────   ───────────────────────
Tot / % measured:                   76.3s / 8.89%           4.10GiB / 0.02%

Section                ncalls     time   %tot     avg     alloc   %tot      avg
sum_32_20                 206    4.96s  73.2%  24.1ms   3.22KiB  0.35%        -
cudaRuntimeGetVersion       1    736ms  10.9%   736ms         -  0.00%        -
cudaSetDevice               1    563ms  8.29%   563ms         -  0.00%        -
cublasSgemm_v2             96    211ms  3.11%  2.20ms    663KiB  72.1%  6.91KiB
cublasCreate_v2             1    166ms  2.44%   166ms         -  0.00%        -
cublasGetVersion_v2         1   2.95μs  0.00%  2.95μs         -  0.00%        -
nvmlInit                    1    161ms  2.37%   161ms         -  0.00%        -
cudaMemcpy              5.17k   72.0ms  1.06%  13.9μs    191KiB  20.8%        -
curandCreateGenerator       1   20.0ms  0.29%  20.0ms         -  0.00%        -
sum_64_20                 602   17.4ms  0.26%  28.9μs   9.41KiB  1.02%        -
cudaMalloc                456   9.93ms  0.15%  21.8μs         -  0.00%        -
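
For anyone who wants to collect tables like these, here is a minimal sketch of one possible workflow. It rests on two assumptions that are not confirmed in this post: that "built" corresponds to running `Pkg.build` with the variables present in the environment, and that the variables only need to be set (the value "1" is arbitrary). If the timers still look empty afterwards, the packages may need to be recompiled by other means.

```julia
# Assumption: the timers are switched on by the mere presence of these
# variables at build time; the value "1" is an arbitrary placeholder.
ENV["KNET_TIMER"] = "1"
ENV["AUTOGRAD_TIMER"] = "1"

using Pkg
Pkg.build("AutoGrad")   # assumption: rebuilding picks up AUTOGRAD_TIMER
Pkg.build("Knet")       # rebuild so the GPU timing code is compiled in

using Knet, AutoGrad

# ... run the training or inference code you want to measure here ...

# Both variables are named in the announcement above; displaying them
# prints the timing tables.
display(Knet.to)
display(AutoGrad.to)
```

Since the timing code is compiled in only on request, it is probably best to unset both variables and rebuild again before running timing-sensitive experiments.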

Deniz Yuret

Jan 5, 2019, 1:23:20 AM

to knet-users, knet...@googlegroups.com

* Support for broadcasting user-defined functions.

* gcheck and @gcheck for gradient checking with Params.

* Added @primitive2 and @zerograd2 for broadcast-only primitives.

* Handle functions where result is not the last thing on tape.

* Added batch matrix multiplication.

* Added tests and docs for new RNN interface.

* Improved serialization and JLD file I/O.

* Added Julia programmer demo to tutorial/08.charlm.

* Renamed broadcast.jl -> binary.jl and broadcast_ops -> binary_ops.