Problem with MNIST classification in latest commits?


Carin Meier

Aug 20, 2017, 7:34:51 PM
to clojure-cortex
Hi - I've been doing some experiments generating MNIST networks with genetic algorithms. Everything was working pretty well, but then I pulled from master and saw very different results for my network.

I went back to the project, ran examples/mnist-classification, and saw pretty different behavior across the last two commits (Softmax image (#208) and Convolution tensors (#207)). I thought I would point it out, since it looks like there was work done in that area.

Here is the result for the current master:

Steps I took:
- checkout master (fec6fa808fe3086d7977fca4c3cae811f920eb0e, Softmax image (#208))
- lein clean
- cd cortex
- lein install
- cd experiment
- lein install
- cd examples/mnist-classification
- rm *nippy
- lein run

Welcome! Please wait while we compile some Clojure code...
Reflection warning, cognitect/transit.clj:142:19 - call to static method writer on com.cognitect.transit.TransitFactory can't be resolved (argument types: unknown, java.io.OutputStream, unknown).
Aug 20, 2017 7:16:30 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/cj/s4l2n2kn0ld5km979f23h83m0000gn/T/jniloader4215364497013515858netlib-native_system-osx-x86_64.jnilib
Loading mnist training dataset.
Done loading mnist training dataset in 17.517s
Loading mnist test dataset.
Done loading mnist test dataset in 3.669s
Training forever from uberjar.
Ensuring image data is built, and available on disk.
Training forever.
Building dataset from folder: mnist/training
Building dataset from folder: mnist/test
Gate opened on http://localhost:8091
Training network:

|                 type |            input |           output |  :bias | :centers | :means | :scale | :variances |   :weights |
|----------------------+------------------+------------------+--------+----------+--------+--------+------------+------------|
|       :convolutional |    1x28x28 - 784 | 20x24x24 - 11520 |   [20] |          |        |        |            |    [20 25] |
|         :max-pooling | 20x24x24 - 11520 |  20x12x12 - 2880 |        |          |        |        |            |            |
|             :dropout |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|                :relu |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|       :convolutional |  20x12x12 - 2880 |    50x8x8 - 3200 |   [50] |          |        |        |            |   [50 500] |
|         :max-pooling |    50x8x8 - 3200 |     50x4x4 - 800 |        |          |        |        |            |            |
| :batch-normalization |     50x4x4 - 800 |     50x4x4 - 800 |  [800] |          |  [800] |  [800] |      [800] |            |
|              :linear |     50x4x4 - 800 |  1x1x1000 - 1000 | [1000] |          |        |        |            | [1000 800] |
|                :relu |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|             :dropout |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|              :linear |  1x1x1000 - 1000 |      1x1x10 - 10 |   [10] |          |        |        |            |  [10 1000] |
|             :softmax |      1x1x10 - 10 |      1x1x10 - 10 |        |          |        |        |            |            |
Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.8417
Classification accuracy: 0.6165
Classification accuracy: 0.7771
Classification accuracy: 0.8351
Saving network to trained-network.nippy
Classification accuracy: 0.8491
Saving network to trained-network.nippy
Classification accuracy: 0.8557
Classification accuracy: 0.8518
Saving network to trained-network.nippy
Classification accuracy: 0.8783
Saving network to trained-network.nippy
Classification accuracy: 0.8885
Classification accuracy: 0.8666
Saving network to trained-network.nippy
Classification accuracy: 0.9065
Classification accuracy: 0.8327
Classification accuracy: 0.8538
Saving network to trained-network.nippy
Classification accuracy: 0.9076
Saving network to trained-network.nippy
Classification accuracy: 0.9327
Classification accuracy: 0.9282
Classification accuracy: 0.9169
Classification accuracy: 0.9246
Classification accuracy: 0.9229
Classification accuracy: 0.9275

Here is the result from a couple of commits back (git checkout bf9ea7b5aa4516cede1c66b2938fd38c1be3b298, Arbitrary transpose (#206)):

Compiling mnist-classification.main
Welcome! Please wait while we compile some Clojure code...
Reflection warning, cognitect/transit.clj:142:19 - call to static method writer on com.cognitect.transit.TransitFactory can't be resolved (argument types: unknown, java.io.OutputStream, unknown).
Aug 20, 2017 7:23:35 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/cj/s4l2n2kn0ld5km979f23h83m0000gn/T/jniloader3330058796203364158netlib-native_system-osx-x86_64.jnilib
Loading mnist training dataset.
Done loading mnist training dataset in 17.325s
Loading mnist test dataset.
Done loading mnist test dataset in 3.012s
Training forever from uberjar.
Ensuring image data is built, and available on disk.
Training forever.
Building dataset from folder: mnist/training
Building dataset from folder: mnist/test
Gate opened on http://localhost:8091
Training network:

|                 type |            input |           output |  :bias | :centers | :means | :scale | :variances |   :weights |
|----------------------+------------------+------------------+--------+----------+--------+--------+------------+------------|
|       :convolutional |    1x28x28 - 784 | 20x24x24 - 11520 |   [20] |          |        |        |            |    [20 25] |
|         :max-pooling | 20x24x24 - 11520 |  20x12x12 - 2880 |        |          |        |        |            |            |
|             :dropout |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|                :relu |  20x12x12 - 2880 |  20x12x12 - 2880 |        |          |        |        |            |            |
|       :convolutional |  20x12x12 - 2880 |    50x8x8 - 3200 |   [50] |          |        |        |            |   [50 500] |
|         :max-pooling |    50x8x8 - 3200 |     50x4x4 - 800 |        |          |        |        |            |            |
| :batch-normalization |     50x4x4 - 800 |     50x4x4 - 800 |  [800] |          |  [800] |  [800] |      [800] |            |
|              :linear |     50x4x4 - 800 |  1x1x1000 - 1000 | [1000] |          |        |        |            | [1000 800] |
|                :relu |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|             :dropout |  1x1x1000 - 1000 |  1x1x1000 - 1000 |        |          |        |        |            |            |
|              :linear |  1x1x1000 - 1000 |      1x1x10 - 10 |   [10] |          |        |        |            |  [10 1000] |
|             :softmax |      1x1x10 - 10 |      1x1x10 - 10 |        |          |        |        |            |            |
Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.89
Saving network to trained-network.nippy
Classification accuracy: 0.917
Classification accuracy: 0.9134
Saving network to trained-network.nippy
Classification accuracy: 0.9258
Saving network to trained-network.nippy
Classification accuracy: 0.9345
Saving network to trained-network.nippy
Classification accuracy: 0.9381
Saving network to trained-network.nippy
Classification accuracy: 0.9554
Saving network to trained-network.nippy
Classification accuracy: 0.9588
Classification accuracy: 0.9521
Saving network to trained-network.nippy
Classification accuracy: 0.9624
Saving network to trained-network.nippy
Classification accuracy: 0.9658
Saving network to trained-network.nippy
Classification accuracy: 0.9667
Classification accuracy: 0.9666
Saving network to trained-network.nippy
Classification accuracy: 0.9684
Saving network to trained-network.nippy
Classification accuracy: 0.9697
Classification accuracy: 0.9695
Saving network to trained-network.nippy
Classification accuracy: 0.9714
Classification accuracy: 0.9698
Classification accuracy: 0.971
Saving network to trained-network.nippy
Classification accuracy: 0.975
Saving network to trained-network.nippy
Classification accuracy: 0.9753
Classification accuracy: 0.9747
Classification accuracy: 0.9734
Saving network to trained-network.nippy
Classification accuracy: 0.976
Classification accuracy: 0.9741
Classification accuracy: 0.9742
Saving network to trained-network.nippy
Classification accuracy: 0.9766
Classification accuracy: 0.9741
Classification accuracy: 0.9703
Classification accuracy: 0.9752


I noticed the older commit seems to run much faster too.

Best,
Carin

Carin Meier

Aug 20, 2017, 7:40:00 PM
to clojure-cortex
Of course, it could also just be me getting my computer into a weird state :) If someone could pull master and let me know what they get for the mnist example, that would be great!
- Carin

Chris Nuernberger

Aug 21, 2017, 6:31:21 PM
to clojure-cortex
Thanks for the heads-up, Carin; we will look into it presently.

Carin Meier

Aug 21, 2017, 7:31:02 PM
to clojure-cortex
Thanks :)

Some more data points in case they are useful. I went a bit further back:

Results at git sha 260c00de7f291d726b5f7c9e8225db149306766c (Xor example (#195)):

Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.8813
Saving network to trained-network.nippy
Classification accuracy: 0.9159
Saving network to trained-network.nippy
Classification accuracy: 0.9194
Saving network to trained-network.nippy
Classification accuracy: 0.9313
Saving network to trained-network.nippy
Classification accuracy: 0.9374
Saving network to trained-network.nippy
Classification accuracy: 0.938
Saving network to trained-network.nippy
Classification accuracy: 0.9459
Saving network to trained-network.nippy
Classification accuracy: 0.9502
Saving network to trained-network.nippy
Classification accuracy: 0.9592
Classification accuracy: 0.955
Classification accuracy: 0.951
Classification accuracy: 0.9373
Classification accuracy: 0.9549
Saving network to trained-network.nippy
Classification accuracy: 0.9652
Classification accuracy: 0.9611
Classification accuracy: 0.962
Saving network to trained-network.nippy
Classification accuracy: 0.9667
Classification accuracy: 0.9666
Classification accuracy: 0.9544
Classification accuracy: 0.9663
Saving network to trained-network.nippy
Classification accuracy: 0.9686
Saving network to trained-network.nippy
Classification accuracy: 0.9696
Classification accuracy: 0.9683
Classification accuracy: 0.9679
Saving network to trained-network.nippy
Classification accuracy: 0.9724
Classification accuracy: 0.9716
Classification accuracy: 0.9715
Saving network to trained-network.nippy
Classification accuracy: 0.9729
Classification accuracy: 0.9677
Classification accuracy: 0.9715
Classification accuracy: 0.9727
Saving network to trained-network.nippy
Classification accuracy: 0.9734
Classification accuracy: 0.9697
Saving network to trained-network.nippy
Classification accuracy: 0.9736
Classification accuracy: 0.9731

vs. results at the latest master sha 6b24564d2a9fe538d0a770c264ca4692b4dae006 (Yolo loss (#209)):


Saving network to trained-network.nippy
Classification accuracy: 0.8616
Classification accuracy: 0.7572
Classification accuracy: 0.5833
Classification accuracy: 0.5087
Classification accuracy: 0.6554
Classification accuracy: 0.5584
Classification accuracy: 0.7951
Classification accuracy: 0.6742
Classification accuracy: 0.6584
Classification accuracy: 0.6086
Classification accuracy: 0.707
Classification accuracy: 0.7715
Saving network to trained-network.nippy
Classification accuracy: 0.8929
Classification accuracy: 0.771
Classification accuracy: 0.8406
Classification accuracy: 0.8392
Classification accuracy: 0.7634
Classification accuracy: 0.8042
Classification accuracy: 0.8701
Saving network to trained-network.nippy
Classification accuracy: 0.9258
Saving network to trained-network.nippy
Classification accuracy: 0.9316
Classification accuracy: 0.8371
Classification accuracy: 0.7638
Classification accuracy: 0.8003
Classification accuracy: 0.7903
Classification accuracy: 0.7563
Classification accuracy: 0.7983
Classification accuracy: 0.8658
Classification accuracy: 0.7351
Classification accuracy: 0.8845
Classification accuracy: 0.8507
Classification accuracy: 0.8708
Classification accuracy: 0.8148
Classification accuracy: 0.8411
Classification accuracy: 0.905

alis...@thinktopic.com

Aug 21, 2017, 8:16:44 PM
to clojure-cortex
Interesting! I just cloned cortex and ran the MNIST example, and I didn't have the same problems you did, Carin, either with time or accuracy.

Here are my results from running this command (which lets the training run for three minutes and captures the output in a file):

timeout -sHUP 3m lein run | tee `git rev-parse --short HEAD`

On master (aka 6b24564):

And bf9ea7b:

Just to be safe, I ran them in two separate, freshly cloned copies of Cortex.

I'm running CUDA v8.0.61 and cuDNN v8.0, on Ubuntu 17.10 with a GTX 960 using the 375.82 proprietary NVIDIA driver. How about you?

Wish I could be of more help, but I don't know a whole lot about Cortex :)

alis...@thinktopic.com

Aug 21, 2017, 8:35:55 PM
to clojure-cortex
Oh, whoops! I didn't pay enough attention to which jar was currently installed. When I run bf9ea7b, I get much faster training, which I've put in this updated gist: https://gist.github.com/atroche/1bf27ae07fe1b586386a83fb8f0d68af

alis...@thinktopic.com

Aug 21, 2017, 8:53:53 PM
to clojure-cortex
I just did a git bisect between master and bf9ea7b (Arbitrary transpose), and it looks like the big drop in performance came in 11ecab9 (Convolution tensors).
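For anyone who wants to replay this kind of hunt, here is a toy, self-contained sketch of the `git bisect run` workflow. It builds a scratch repo with a deliberate "bug" at commit 6 and lets bisect find it automatically; it is illustrative only (the real bisect was over the Cortex repo between master and bf9ea7b, judging good/bad by training speed and accuracy rather than a grep).

```python
# Toy demonstration of the `git bisect run` workflow described above.
# A scratch repo gets 10 commits; a "bug" appears at commit 6; bisect
# locates the first bad commit automatically.
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    return subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True).stdout

git("init", "-q")
git("config", "user.email", "bisect@example.com")
git("config", "user.name", "bisect-demo")

for i in range(1, 11):
    # 'state' flips from "ok" to "bug" at commit 6; 'n' just forces a change.
    with open(os.path.join(repo, "state"), "w") as f:
        f.write("bug" if i >= 6 else "ok")
    with open(os.path.join(repo, "n"), "w") as f:
        f.write(str(i))
    git("add", ".")
    git("commit", "-q", "-m", f"commit {i}")

root = git("rev-list", "--max-parents=0", "HEAD").strip()
git("bisect", "start", "HEAD", root)  # HEAD is bad, the root commit is good
# The run command exits 0 for "good" commits and non-zero for "bad" ones.
out = git("bisect", "run", "grep", "-q", "ok", "state")
bad_sha = next(line for line in out.splitlines()
               if "is the first bad commit" in line).split()[0]
first_bad = git("log", "-1", "--format=%s", bad_sha).strip()
git("bisect", "reset")
print(first_bad)  # commit 6
```

With a real project, the `grep` is replaced by any script whose exit code encodes good/bad, e.g. one that trains for a few epochs and checks the accuracy.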

alis...@thinktopic.com

Aug 21, 2017, 9:51:41 PM
to clojure-cortex
Latest discovery: I ran the mnist example on master while watching the output of gpustat, and the utilization was ~0%. When I ran it on bf9ea7b, it was consistently and appropriately high.

Carin Meier

Aug 22, 2017, 7:53:17 AM
to clojure-cortex
Thanks for the troubleshooting :)

I'm on an older Mac Pro with an NVIDIA GeForce GT 750M, CUDA 8.0.83, and GPU Driver Version 10.17.5 355.10.05.45f01. (Not sure how to verify my cuDNN version, but `cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2` gives:)

#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

Carin Meier

Aug 22, 2017, 8:14:15 AM
to clojure-cortex
In the interest of science, I ran the example without CUDA (just using the CPU context), and the accuracy looks much better. So it seems to be something with the CUDA calculation on my Mac.

Saving network to trained-network.nippy
Classification accuracy: 0.897
Saving network to trained-network.nippy
Classification accuracy: 0.9266
Saving network to trained-network.nippy
Classification accuracy: 0.9351
Saving network to trained-network.nippy
Classification accuracy: 0.9386
Saving network to trained-network.nippy
Classification accuracy: 0.9554
Saving network to trained-network.nippy
Classification accuracy: 0.9574
Saving network to trained-network.nippy
Classification accuracy: 0.9621
Saving network to trained-network.nippy
Classification accuracy: 0.9684

Chris Nuernberger

Aug 22, 2017, 11:13:04 AM
to clojure-cortex
I am not sure we know it is just your machine, but here are some things:

As far as the convolution pathway goes, the xor test doesn't involve it, so if there is a regression in accuracy there as well, then we know it is something else.

Carin, would you mind regressing that xor example a bit to see if you can find the changelist where that happened? That would be super useful.

We can backtrack the performance regressions in the CUDA convolution system and put back in the fast paths where I had removed them; that is no problem. I am very concerned, however, about the accuracy issues, especially with the xor example, so if we can track those down to some particular change I would *greatly* appreciate it. I suspect some aspect of the GPU I am using (probably the CAS instruction) works differently on that class of hardware, and I may be able to reconsider how we are doing some of these things to avoid it in most cases.
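The CAS worry above comes down to the fact that CAS-loop atomics can apply floating-point additions in a hardware- and scheduling-dependent order, and float addition is not associative. A minimal single-threaded illustration of that order sensitivity (not Cortex code, just the underlying arithmetic):

```python
# Floating-point addition is not associative: summing the same three
# values in two different orders gives two different results.
vals = [1e16, -1e16, 1.0]
order1 = (vals[0] + vals[1]) + vals[2]  # (1e16 - 1e16) + 1.0 -> 1.0
order2 = (vals[0] + vals[2]) + vals[1]  # (1e16 + 1.0) - 1e16 -> 0.0
                                        # (the 1.0 is absorbed by 1e16)
print(order1, order2)  # 1.0 0.0
```

A parallel reduction whose addition order varies from run to run, or from GPU to GPU, can therefore produce slightly different sums even when every individual operation is correct.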

Carin Meier

Aug 23, 2017, 6:45:41 AM
to clojure-cortex
Thanks so much Chris.

I'll look into the xor example and do some regression. In the meantime, there is great news: with the latest commit 8be97ddf5013374a75786bdaf315adc5d9de1884, my MNIST runs look perfect with the GPU :)

Parameter count: 849780
Saving network to trained-network.nippy
Classification accuracy: 0.9041
Saving network to trained-network.nippy
Classification accuracy: 0.9295
Saving network to trained-network.nippy
Classification accuracy: 0.9413
Saving network to trained-network.nippy
Classification accuracy: 0.9526
Classification accuracy: 0.9504
Saving network to trained-network.nippy
Classification accuracy: 0.9591
Saving network to trained-network.nippy
Classification accuracy: 0.9598
Saving network to trained-network.nippy
Classification accuracy: 0.9662
Classification accuracy: 0.9645
Saving network to trained-network.nippy
Classification accuracy: 0.9692
Classification accuracy: 0.9656

Chris Nuernberger

Aug 24, 2017, 4:11:04 PM
to clojure-cortex
OK, that is interesting, and I think we are actually treading on thin ice. Some assumption I am making about GPU CAS instructions isn't completely correct for your GPU, which is extremely unsettling.

Basically, I special-cased some of the tensor operations with cuDNN-optimized fast paths; the issue went away, and I imagine your performance is back. But that also means that in one of those two operations there exists a pathway that fails very subtly on your GPU (or rather, fails less subtly on mine).

In any case, keep calm and keep moving, I guess, until we have to narrow this down further.

Really great news that your system is working fine again.

Carin Meier

Aug 25, 2017, 7:22:55 PM
to clojure-cortex
The performance is back. I checked out the xor example and did some regression testing on it. It looks fine through all the commits, so the problem appears to have been limited to the CNN layers.