Knet 1.1.0 introduced a new interface that allows the use of structs in models and of callable objects for model and layer definitions.
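
As a rough illustration of what "callable objects for layer definitions" means, here is a minimal sketch in plain Julia. It does not use Knet itself: `Linear`, `Chain` and their fields are hypothetical names chosen for this example, and the weights are ordinary arrays rather than anything Knet-specific.

```julia
# Illustrative sketch in plain Julia; Linear, Chain and their fields are
# hypothetical names, not definitions taken from Knet.

# A layer is a struct that holds its weights...
struct Linear
    w::Matrix{Float64}
    b::Vector{Float64}
end

# ...and becomes callable by defining a method on the struct itself.
(l::Linear)(x) = l.w * x .+ l.b

# A model is another callable struct that applies its layers in order.
struct Chain
    layers::Tuple
end
(c::Chain)(x) = foldl((h, layer) -> layer(h), c.layers; init = x)

model = Chain((Linear([1.0 0.0; 0.0 2.0], [0.5, 0.5]),
               Linear([1.0 1.0], [0.0])))
println(model([1.0, 1.0]))   # prints [4.0]
```

Because the model is just a value that can be called, the same object serves both as the container for parameters and as the forward function.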
Knet 1.1.1 focuses on performance. There are a number of performance improvements, the most important of which is a new GPU memory manager. GPU memory use is reduced by up to 50%, which should allow larger models and larger batch sizes.
While working on performance improvements I developed some monitoring tools as well. Because Julia frequently crashes when profiling GPU code, I decided to use TimerOutputs instead. If the KNET_TIMER environment variable is set while Knet is built, the timing code is compiled in and the `Knet.to` variable holds timing information for all GPU calls. Similarly, the AUTOGRAD_TIMER environment variable controls whether AutoGrad records timing information for the forward and backward passes over the tape in the `AutoGrad.to` variable. Here is what sample output looks like:
julia> AutoGrad.to
───────────────────────────────────────────────────────────────────────
                                Time                  Allocations
                       ──────────────────────   ───────────────────────
    Tot / % measured:      4.62s / 30.4%            546MiB / 25.0%
Section        ncalls     time   %tot     avg     alloc   %tot      avg
───────────────────────────────────────────────────────────────────────
+.[2]               1    328ms  23.3%   328ms   46.4MiB  34.1%  46.4MiB
sum[2]              1    288ms  20.5%   288ms   40.0MiB  29.4%  40.0MiB
  *                 1   38.8ms  2.76%  38.8ms    595KiB  0.43%   595KiB
*                   1    269ms  19.2%   269ms    955KiB  0.68%   955KiB
+.                  1    139ms  9.92%   139ms   20.4MiB  15.0%  20.4MiB
*[1]                1    117ms  8.33%   117ms   9.41MiB  6.90%  9.41MiB
record              4   88.7ms  6.31%  22.2ms   3.49MiB  2.56%   894KiB
-[1]                1   65.9ms  4.69%  65.9ms   10.0MiB  7.32%  10.0MiB
-                   1   55.8ms  3.97%  55.8ms    929KiB  0.67%   929KiB
sum                 1   50.0ms  3.56%  50.0ms   4.68MiB  3.44%  4.68MiB
+.[1]               1   1.78ms  0.13%  1.78ms   37.7KiB  0.03%  37.7KiB
sum_outgrads        5   1.41ms  0.10%   282μs   28.2KiB  0.02%  5.64KiB
───────────────────────────────────────────────────────────────────────
julia> Knet.to
──────────────────────────────────────────────────────────────────────────────────────
                                               Time                  Allocations
                                      ──────────────────────   ───────────────────────
                   Tot / % measured:      76.3s / 8.89%            4.10GiB / 0.02%
Section                       ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
sum_32_20                        206    4.96s  73.2%  24.1ms   3.22KiB  0.35%        -
cudaRuntimeGetVersion              1    736ms  10.9%   736ms         -  0.00%        -
cudaSetDevice                      1    563ms  8.29%   563ms         -  0.00%        -
cublasSgemm_v2                    96    211ms  3.11%  2.20ms    663KiB  72.1%  6.91KiB
cublasCreate_v2                    1    166ms  2.44%   166ms         -  0.00%        -
  cublasGetVersion_v2              1   2.95μs  0.00%  2.95μs         -  0.00%        -
nvmlInit                           1    161ms  2.37%   161ms         -  0.00%        -
cudaMemcpy                     5.17k   72.0ms  1.06%  13.9μs    191KiB  20.8%        -
curandCreateGenerator              1   20.0ms  0.29%  20.0ms         -  0.00%        -
sum_64_20                        602   17.4ms  0.26%  28.9μs   9.41KiB  1.02%        -
cudaMalloc                       456   9.93ms  0.15%  21.8μs         -  0.00%        -
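
The idea of compiling timing code in only when an environment variable is set can be sketched in plain Julia. This is a minimal illustration under stated assumptions, not the actual Knet or AutoGrad source: `DEMO_TIMER`, `TIMER_ENABLED`, `timings`, `@maybe_time` and `demo_sum` are hypothetical names, and a plain `Dict` stands in for a TimerOutputs timer.

```julia
# Hypothetical sketch of build-time-gated timing; not Knet's actual source.
# DEMO_TIMER, TIMER_ENABLED, timings, @maybe_time and demo_sum are
# illustrative names only.

# Read the flag once, when this code is loaded (analogous to build time).
const TIMER_ENABLED = haskey(ENV, "DEMO_TIMER")

# Elapsed seconds per label; a plain Dict stands in for a TimerOutput here.
const timings = Dict{String,Vector{Float64}}()

# Wrap `ex` in timing code only when the flag was set; otherwise expand to
# `ex` unchanged, so the disabled path adds no timing code at all.
macro maybe_time(label, ex)
    if TIMER_ENABLED
        quote
            local t0 = time_ns()
            local val = $(esc(ex))
            push!(get!(timings, $(esc(label)), Float64[]), (time_ns() - t0) / 1e9)
            val
        end
    else
        esc(ex)
    end
end

# A function whose body is timed only if DEMO_TIMER was set at load time.
demo_sum(n) = @maybe_time "demo_sum" sum(1:n)

println(demo_sum(10))   # prints 55
println(TIMER_ENABLED ? timings : "timing code not compiled in")
```

Because the decision is made when the macro expands, changing the environment variable afterwards has no effect until the code is loaded again, which mirrors why the real variables have to be set while the package is built.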