Problem with MNIST classification in latest commits?


Carin Meier

Aug 20, 2017, 7:34:51 PM
to clojure-cortex
Hi - I've been doing some experiments generating MNIST networks with genetic algorithms. Everything was working pretty well, but then I pulled from master and saw very different results for my network.

I went back to the project, ran examples/mnist-classification, and saw pretty different behavior across the last two commits (Softmax image (#208) and Convolution tensors (#207)). I thought I would point it out, since it looks like there was work done in that area.

Here is the result for the current master:

Steps I took:
- checkout master (fec6fa808fe3086d7977fca4c3cae811f920eb0e, Softmax image (#208))
- lein clean
- cd cortex
- lein install
- cd experiment
- lein install
- cd examples/mnist-classification
- rm *nippy
- lein run

Welcome! Please wait while we compile some Clojure code...
Reflection warning, cognitect/transit.clj:142:19 - call to static method writer on com.cognitect.transit.TransitFactory can't be resolved (argument types: unknown, java.io.OutputStream, unknown).
Aug 20, 2017 7:16:30 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/cj/s4l2n2kn0ld5km979f23h83m0000gn/T/jniloader4215364497013515858netlib-native_system-osx-x86_64.jnilib
Loading mnist training dataset.
Done loading mnist training dataset in 17.517s
Loading mnist test dataset.
Done loading mnist test dataset in 3.669s
Training forever from uberjar.
Ensuring image data is built, and available on disk.
Training forever.
Building dataset from folder: mnist/training
Building dataset from folder: mnist/test
Gate opened on http://localhost:8091
Training network:

|                 type |            input |           output |  :bias | :centers | :means | :scale | :variances |   :weights |
|----------------------+------------------+------------------+--------+----------+--------+--------+------------+------------|
|       :convolutional |    1x28x28 - 784 | 20x24x24 - 11520 |   [20] |          |        |        |            |    [20 25] |
|         :max-pooling | 20x24x24 - 11520 |  20x12x12 - 2880 |        |          |        |        |            |            |
|             :dropout |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|                :relu |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|       :convolutional |  20x12x12 - 2880 |    50x8x8 - 3200 |   [50] |          |        |        |            |   [50 500] |
|         :max-pooling |    50x8x8 - 3200 |     50x4x4 - 800 |        |          |        |        |            |            |
| :batch-normalization |     50x4x4 - 800 |     50x4x4 - 800 |  [800] |          |  [800] |  [800] |      [800] |            |
|              :linear |     50x4x4 - 800 |  1x1x1000 - 1000 | [1000] |          |        |        |            | [1000 800] |
|                :relu |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|             :dropout |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|              :linear |  1x1x1000 - 1000 |      1x1x10 - 10 |   [10] |          |        |        |            |  [10 1000] |
|             :softmax |      1x1x10 - 10 |      1x1x10 - 10 |        |          |        |        |            |            |
Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.8417
Classification accuracy: 0.6165
Classification accuracy: 0.7771
Classification accuracy: 0.8351
Saving network to trained-network.nippy
Classification accuracy: 0.8491
Saving network to trained-network.nippy
Classification accuracy: 0.8557
Classification accuracy: 0.8518
Saving network to trained-network.nippy
Classification accuracy: 0.8783
Saving network to trained-network.nippy
Classification accuracy: 0.8885
Classification accuracy: 0.8666
Saving network to trained-network.nippy
Classification accuracy: 0.9065
Classification accuracy: 0.8327
Classification accuracy: 0.8538
Saving network to trained-network.nippy
Classification accuracy: 0.9076
Saving network to trained-network.nippy
Classification accuracy: 0.9327
Classification accuracy: 0.9282
Classification accuracy: 0.9169
Classification accuracy: 0.9246
Classification accuracy: 0.9229
Classification accuracy: 0.9275

Here is the result from a couple of commits back (git checkout bf9ea7b5aa4516cede1c66b2938fd38c1be3b298, Arbitrary transpose (#206)):

Compiling mnist-classification.main
Welcome! Please wait while we compile some Clojure code...
Reflection warning, cognitect/transit.clj:142:19 - call to static method writer on com.cognitect.transit.TransitFactory can't be resolved (argument types: unknown, java.io.OutputStream, unknown).
Aug 20, 2017 7:23:35 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/cj/s4l2n2kn0ld5km979f23h83m0000gn/T/jniloader3330058796203364158netlib-native_system-osx-x86_64.jnilib
Loading mnist training dataset.
Done loading mnist training dataset in 17.325s
Loading mnist test dataset.
Done loading mnist test dataset in 3.012s
Training forever from uberjar.
Ensuring image data is built, and available on disk.
Training forever.
Building dataset from folder: mnist/training
Building dataset from folder: mnist/test
Gate opened on http://localhost:8091
Training network:

|                 type |            input |           output |  :bias | :centers | :means | :scale | :variances |   :weights |
|----------------------+------------------+------------------+--------+----------+--------+--------+------------+------------|
|       :convolutional |    1x28x28 - 784 | 20x24x24 - 11520 |   [20] |          |        |        |            |    [20 25] |
|         :max-pooling | 20x24x24 - 11520 |  20x12x12 - 2880 |        |          |        |        |            |            |
|             :dropout |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|                :relu |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|       :convolutional |  20x12x12 - 2880 |    50x8x8 - 3200 |   [50] |          |        |        |            |   [50 500] |
|         :max-pooling |    50x8x8 - 3200 |     50x4x4 - 800 |        |          |        |        |            |            |
| :batch-normalization |     50x4x4 - 800 |     50x4x4 - 800 |  [800] |          |  [800] |  [800] |      [800] |            |
|              :linear |     50x4x4 - 800 |  1x1x1000 - 1000 | [1000] |          |        |        |            | [1000 800] |
|                :relu |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|             :dropout |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|              :linear |  1x1x1000 - 1000 |      1x1x10 - 10 |   [10] |          |        |        |            |  [10 1000] |
|             :softmax |      1x1x10 - 10 |      1x1x10 - 10 |        |          |        |        |            |            |
Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.89
Saving network to trained-network.nippy
Classification accuracy: 0.917
Classification accuracy: 0.9134
Saving network to trained-network.nippy
Classification accuracy: 0.9258
Saving network to trained-network.nippy
Classification accuracy: 0.9345
Saving network to trained-network.nippy
Classification accuracy: 0.9381
Saving network to trained-network.nippy
Classification accuracy: 0.9554
Saving network to trained-network.nippy
Classification accuracy: 0.9588
Classification accuracy: 0.9521
Saving network to trained-network.nippy
Classification accuracy: 0.9624
Saving network to trained-network.nippy
Classification accuracy: 0.9658
Saving network to trained-network.nippy
Classification accuracy: 0.9667
Classification accuracy: 0.9666
Saving network to trained-network.nippy
Classification accuracy: 0.9684
Saving network to trained-network.nippy
Classification accuracy: 0.9697
Classification accuracy: 0.9695
Saving network to trained-network.nippy
Classification accuracy: 0.9714
Classification accuracy: 0.9698
Classification accuracy: 0.971
Saving network to trained-network.nippy
Classification accuracy: 0.975
Saving network to trained-network.nippy
Classification accuracy: 0.9753
Classification accuracy: 0.9747
Classification accuracy: 0.9734
Saving network to trained-network.nippy
Classification accuracy: 0.976
Classification accuracy: 0.9741
Classification accuracy: 0.9742
Saving network to trained-network.nippy
Classification accuracy: 0.9766
Classification accuracy: 0.9741
Classification accuracy: 0.9703
Classification accuracy: 0.9752


I noticed the older commit seems to run much faster too.

Best,
Carin

Carin Meier

Aug 20, 2017, 7:40:00 PM
to clojure-cortex
Of course, it could also just be me getting my computer into a weird state :) If someone could pull master and let me know what they get for the mnist example, that would be great!
- Carin

Chris Nuernberger

Aug 21, 2017, 6:31:21 PM
to clojure-cortex
Thanks for the heads-up, Carin; we will look into it presently.

Carin Meier

Aug 21, 2017, 7:31:02 PM
to clojure-cortex
Thanks :)

Some more data points in case they are useful. I went a bit further back:

Results at git sha 260c00de7f291d726b5f7c9e8225db149306766c (Xor example (#195)):

Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.8813
Saving network to trained-network.nippy
Classification accuracy: 0.9159
Saving network to trained-network.nippy
Classification accuracy: 0.9194
Saving network to trained-network.nippy
Classification accuracy: 0.9313
Saving network to trained-network.nippy
Classification accuracy: 0.9374
Saving network to trained-network.nippy
Classification accuracy: 0.938
Saving network to trained-network.nippy
Classification accuracy: 0.9459
Saving network to trained-network.nippy
Classification accuracy: 0.9502
Saving network to trained-network.nippy
Classification accuracy: 0.9592
Classification accuracy: 0.955
Classification accuracy: 0.951
Classification accuracy: 0.9373
Classification accuracy: 0.9549
Saving network to trained-network.nippy
Classification accuracy: 0.9652
Classification accuracy: 0.9611
Classification accuracy: 0.962
Saving network to trained-network.nippy
Classification accuracy: 0.9667
Classification accuracy: 0.9666
Classification accuracy: 0.9544
Classification accuracy: 0.9663
Saving network to trained-network.nippy
Classification accuracy: 0.9686
Saving network to trained-network.nippy
Classification accuracy: 0.9696
Classification accuracy: 0.9683
Classification accuracy: 0.9679
Saving network to trained-network.nippy
Classification accuracy: 0.9724
Classification accuracy: 0.9716
Classification accuracy: 0.9715
Saving network to trained-network.nippy
Classification accuracy: 0.9729
Classification accuracy: 0.9677
Classification accuracy: 0.9715
Classification accuracy: 0.9727
Saving network to trained-network.nippy
Classification accuracy: 0.9734
Classification accuracy: 0.9697
Saving network to trained-network.nippy
Classification accuracy: 0.9736
Classification accuracy: 0.9731

vs. results at the latest master sha 6b24564d2a9fe538d0a770c264ca4692b4dae006 (Yolo loss (#209)):


Saving network to trained-network.nippy
Classification accuracy: 0.8616
Classification accuracy: 0.7572
Classification accuracy: 0.5833
Classification accuracy: 0.5087
Classification accuracy: 0.6554
Classification accuracy: 0.5584
Classification accuracy: 0.7951
Classification accuracy: 0.6742
Classification accuracy: 0.6584
Classification accuracy: 0.6086
Classification accuracy: 0.707
Classification accuracy: 0.7715
Saving network to trained-network.nippy
Classification accuracy: 0.8929
Classification accuracy: 0.771
Classification accuracy: 0.8406
Classification accuracy: 0.8392
Classification accuracy: 0.7634
Classification accuracy: 0.8042
Classification accuracy: 0.8701
Saving network to trained-network.nippy
Classification accuracy: 0.9258
Saving network to trained-network.nippy
Classification accuracy: 0.9316
Classification accuracy: 0.8371
Classification accuracy: 0.7638
Classification accuracy: 0.8003
Classification accuracy: 0.7903
Classification accuracy: 0.7563
Classification accuracy: 0.7983
Classification accuracy: 0.8658
Classification accuracy: 0.7351
Classification accuracy: 0.8845
Classification accuracy: 0.8507
Classification accuracy: 0.8708
Classification accuracy: 0.8148
Classification accuracy: 0.8411
Classification accuracy: 0.905

alis...@thinktopic.com

Aug 21, 2017, 8:16:44 PM
to clojure-cortex
Interesting! I just cloned cortex and ran the MNIST example, and I didn't have the same problems you did, Carin, either with time or accuracy.

Here are my results from running this command (which lets the training run for three minutes and captures the output in a file):

timeout -sHUP 3m lein run | tee `git rev-parse --short HEAD`

On master (aka 6b24564):

And bf9ea7b:

Just to be safe, I ran them in two separate, freshly cloned copies of Cortex.

I'm running CUDA v8.0.61 and cuDNN v8.0, on Ubuntu 17.10 with a GTX 960 using the 375.82 proprietary NVIDIA driver. How about you?

Wish I could be of more help, but I don't know a whole lot about Cortex :)

alis...@thinktopic.com

Aug 21, 2017, 8:35:55 PM
to clojure-cortex
Oh, whoops! I didn't pay enough attention to which jar was currently installed. When I run bf9ea7b, I get much faster training, which I've put in this updated gist: https://gist.github.com/atroche/1bf27ae07fe1b586386a83fb8f0d68af

alis...@thinktopic.com

Aug 21, 2017, 8:53:53 PM
to clojure-cortex
I just did a git bisect between master and bf9ea7b (Arbitrary transpose), and it looks like the big drop in performance came in 11ecab9 (Convolution tensors).
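For anyone who wants to replay this kind of hunt, here is a toy, self-contained sketch of the `git bisect run` workflow. It builds a scratch repo with a deliberate "bug" at commit 6 and lets bisect find it automatically; it is illustrative only (the real bisect was over the Cortex repo between master and bf9ea7b, judging good/bad by training speed and accuracy rather than a grep).

```python
# Toy demonstration of the `git bisect run` workflow described above.
# A scratch repo gets 10 commits; a "bug" appears at commit 6; bisect
# locates the first bad commit automatically.
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    return subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True).stdout

git("init", "-q")
git("config", "user.email", "bisect@example.com")
git("config", "user.name", "bisect-demo")

for i in range(1, 11):
    # 'state' flips from "ok" to "bug" at commit 6; 'n' just forces a change.
    with open(os.path.join(repo, "state"), "w") as f:
        f.write("bug" if i >= 6 else "ok")
    with open(os.path.join(repo, "n"), "w") as f:
        f.write(str(i))
    git("add", ".")
    git("commit", "-q", "-m", f"commit {i}")

root = git("rev-list", "--max-parents=0", "HEAD").strip()
git("bisect", "start", "HEAD", root)  # HEAD is bad, the root commit is good
# The run command exits 0 for "good" commits and non-zero for "bad" ones.
out = git("bisect", "run", "grep", "-q", "ok", "state")
bad_sha = next(line for line in out.splitlines()
               if "is the first bad commit" in line).split()[0]
first_bad = git("log", "-1", "--format=%s", bad_sha).strip()
git("bisect", "reset")
print(first_bad)  # commit 6
```

With a real project, the `grep` is replaced by any script whose exit code encodes good/bad, e.g. one that trains for a few epochs and checks the accuracy.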

alis...@thinktopic.com

Aug 21, 2017, 9:51:41 PM
to clojure-cortex
Latest discovery: I ran the mnist example on master while watching the output of gpustat, and the utilization was ~0%. When I ran it on bf9ea7b, it was consistently and appropriately high.

Carin Meier

Aug 22, 2017, 7:53:17 AM
to clojure-cortex
Thanks for the troubleshooting :)

I'm on an older Mac Pro with an NVIDIA GeForce GT 750M, CUDA 8.0.83, and GPU Driver Version 10.17.5 355.10.05.45f01. (Not sure how to verify my cuDNN version, but `cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2` gives:)

#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

Carin Meier

Aug 22, 2017, 8:14:15 AM
to clojure-cortex
In the interest of science, I ran the example without CUDA (just using the CPU context), and the accuracy looks much better. So it seems to be something with the CUDA calculation on my Mac.

Saving network to trained-network.nippy
Classification accuracy: 0.897
Saving network to trained-network.nippy
Classification accuracy: 0.9266
Saving network to trained-network.nippy
Classification accuracy: 0.9351
Saving network to trained-network.nippy
Classification accuracy: 0.9386
Saving network to trained-network.nippy
Classification accuracy: 0.9554
Saving network to trained-network.nippy
Classification accuracy: 0.9574
Saving network to trained-network.nippy
Classification accuracy: 0.9621
Saving network to trained-network.nippy
Classification accuracy: 0.9684

Chris Nuernberger

Aug 22, 2017, 11:13:04 AM
to clojure-cortex
I am not sure we know it is just your machine, but here are some things:

As far as the convolution pathway goes, the xor test doesn't involve it, so if there is a regression in accuracy there as well, then we know it is something else.

Carin, would you mind regressing that xor example a bit to see if you can find the changelist where that happened? That would be super useful.

We can backtrack the performance regressions in the CUDA convolution system and put back in the fast paths where I had removed them; that is no problem. I am very concerned, however, about the accuracy issues, especially with the xor example, so if we can track those down to some particular change I would *greatly* appreciate it. I suspect some aspect of the GPU I am using (probably the CAS instruction) works differently on that class of hardware, and I may be able to reconsider how we are doing some of these things to avoid it in most cases.
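The CAS worry above comes down to the fact that CAS-loop atomics can apply floating-point additions in a hardware- and scheduling-dependent order, and float addition is not associative. A minimal single-threaded illustration of that order sensitivity (not Cortex code, just the underlying arithmetic):

```python
# Floating-point addition is not associative: summing the same three
# values in two different orders gives two different results.
vals = [1e16, -1e16, 1.0]
order1 = (vals[0] + vals[1]) + vals[2]  # (1e16 - 1e16) + 1.0 -> 1.0
order2 = (vals[0] + vals[2]) + vals[1]  # (1e16 + 1.0) - 1e16 -> 0.0
                                        # (the 1.0 is absorbed by 1e16)
print(order1, order2)  # 1.0 0.0
```

A parallel reduction whose addition order varies from run to run, or from GPU to GPU, can therefore produce slightly different sums even when every individual operation is correct.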

Carin Meier

Aug 23, 2017, 6:45:41 AM
to clojure-cortex
Thanks so much Chris.

I'll look into the xor example and do some regression. In the meantime, there is great news: with the latest commit 8be97ddf5013374a75786bdaf315adc5d9de1884, my MNIST runs look perfect with the GPU :)

Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.9041
Saving network to trained-network.nippy
Classification accuracy: 0.9295
Saving network to trained-network.nippy
Classification accuracy: 0.9413
Saving network to trained-network.nippy
Classification accuracy: 0.9526
Classification accuracy: 0.9504
Saving network to trained-network.nippy
Classification accuracy: 0.9591
Saving network to trained-network.nippy
Classification accuracy: 0.9598
Saving network to trained-network.nippy
Classification accuracy: 0.9662
Classification accuracy: 0.9645
Saving network to trained-network.nippy
Classification accuracy: 0.9692
Classification accuracy: 0.9656

Chris Nuernberger

Aug 24, 2017, 4:11:04 PM
to clojure-cortex
OK, that is interesting, and I think we are actually treading on thin ice. Some assumption I am making about GPU CAS instructions isn't completely correct for your GPU, which is extremely unsettling.

Basically, I special-cased some of the tensor operations with cuDNN-optimized fast paths; the issue went away, and I imagine your performance is back. But that also means that in one of those two operations there exists a pathway that fails very subtly on your GPU (or rather, fails less subtly on mine).

In any case, keep calm and keep moving, I guess, until we have to narrow this down further.

Really great news that your system is working fine again.

Carin Meier

Aug 25, 2017, 7:22:55 PM
to clojure-cortex
The performance is back. I checked out the xor example and did some regression testing on it. It looks fine through all the commits, so the problem appears to have been limited to the CNN layers.