How important is ECC for Neural Networks?

themoo...@googlemail.com

unread,
Nov 19, 2015, 6:45:05 AM11/19/15
to lasagne-users
One difference between the Nvidia Titan Black and the Nvidia K20 is ECC. The Nvidia Titan Black costs 1080 Euro and the K20 costs 3000 Euro.

The rest seems to be similar or in favor of the Titan Black.

* Comparison: Titan Black vs Nvidia K20
* Cores: 2880 vs 2496
* Memory: 6 GB vs 5 GB
* Memory clock: 7.0 Gbps vs 2.6 GHz
* Price: 1080 Euro vs 3000 Euro (via German Amazon)

So my question is: is ECC important for machine learning with Lasagne / neural networks? Does it make a difference? Are there publications about it?

Daniel Renshaw

unread,
Nov 19, 2015, 7:21:07 AM11/19/15
to lasagn...@googlegroups.com
Whether ECC [1] is important to you will depend on your use case. How much of a problem would it be if a random bit in GPU memory flipped its state (either from 0 to 1 or 1 to 0)?

This is a general technology that has no specific relevance to machine learning with Lasagne or neural networks.

A bit flip might occur in memory used by the GPU driver or the Theano framework, which could corrupt the logic of the computation as a whole and perhaps prevent it from continuing. In that case, as long as you are saving the model state to persistent storage at suitable points throughout the computation (e.g. once per epoch), you could simply restart from the last saved state.
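The restart-from-checkpoint approach might look like the minimal sketch below. Everything here is hypothetical illustration, not Lasagne API: `run_one_epoch` is a stand-in for a real training step, and the pickle-based file format is just the simplest thing that works.

```python
import os
import pickle

def run_one_epoch(params):
    # Stand-in for one epoch of training (hypothetical placeholder).
    return [p + 0.1 for p in params]

def train(params, n_epochs, path="model.pkl"):
    start = 0
    if os.path.exists(path):
        # Resume from the last checkpoint, losing at most one epoch of work.
        with open(path, "rb") as f:
            state = pickle.load(f)
        start, params = state["epoch"] + 1, state["params"]
    for epoch in range(start, n_epochs):
        params = run_one_epoch(params)
        with open(path, "wb") as f:
            # Checkpoint after every epoch, so a crash (e.g. from memory
            # corruption) never throws away more than one epoch.
            pickle.dump({"epoch": epoch, "params": params}, f)
    return params
```

If a bit flip crashes the process mid-run, calling `train` again with the same path picks up from the last completed epoch.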

If the corruption occurs in your data or model parameters (the most likely case, since they typically occupy far more memory than the system overheads), the effect may or may not matter to you. A random flip of a bit in a floating-point number could have very little effect or a great deal (e.g. switching the sign). Your computation may or may not be robust to such changes.
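To see the range of outcomes, here is a small illustration (my own, not from the thread) that flips individual bits of a 32-bit float using Python's `struct` module; which bit flips determines whether the change is negligible or catastrophic:

```python
import math
import struct

def flip_bit(x, i):
    """Return the 32-bit float x with bit i flipped (bit 0 = least significant)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << i)))
    return y

w = 0.5
print(flip_bit(w, 0))                 # low mantissa bit: barely changes the value
print(flip_bit(w, 31))                # sign bit: -0.5
print(flip_bit(w, 30))                # top exponent bit: enormous magnitude
print(math.isnan(flip_bit(1.5, 30)))  # an exponent-bit flip can even yield NaN
```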

I don't know how much ECC affects memory speed (I suspect very little, since it's a hardware mechanism and the memory clock is presumably set to a level that permits ECC operation), but GPU memory size is affected: the PDF you linked to suggests the usable memory will be about 10% smaller with ECC enabled.

The probability of suffering a random flip is very small and, I suspect, its impact would usually be negligible. So ECC is unlikely to be useful for most people; that 10% of memory is probably much more valuable.

Daniel


goo...@jan-schlueter.de

unread,
Nov 19, 2015, 9:20:16 AM11/19/15
to lasagne-users
> So my question is: Is ECC important (for machine learning (with Lasagne / Neural Networks))? Does it make a difference? Are there any publications about it?

I'm not aware of any publications that compare it, but it's general practice to disable ECC for deep learning to free up GPU memory:
http://caffe.berkeleyvision.org/performance_hardware.html
ECC is important for long-running simulations where even small errors would accumulate. Neural networks tend to learn fine in the presence of noise, so unless you encounter a burst of bit flips turning a word into NaN, you will not notice (and ECC wouldn't protect you from that scenario either; it can only correct a single bit flip per word).
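The single-flip-per-word limit can be illustrated with a toy Hamming(7,4) code; ECC memory typically uses wider SECDED codes over 64-bit words, so this is only a sketch of the principle, not what the hardware does:

```python
def hamming74_encode(d1, d2, d3, d4):
    # Hamming(7,4): three parity bits protect four data bits.
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]    # codeword positions 1..7

def hamming74_decode(c):
    # The syndrome gives the 1-based position of a single flipped bit (0 = clean).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1               # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]        # recovered data bits
```

Flipping any one of the seven codeword bits still decodes to the original four data bits; flip two and the decoder silently miscorrects, which is why ECC can't save you from multi-bit corruption.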

But your question actually seems to be about which graphics card to buy, and the answer would definitely be a card based on the Maxwell architecture, such as the Titan X, GTX 980 Ti or GTX 980. The cards you listed are older GPUs based on the Kepler architecture, which will not benefit from recent optimizations to the convolution and gemm kernels in cuDNN and other libraries. Furthermore, even the flagship Titan X is cheaper than the price you found for the Titan Black (via the German price comparison site geizhals.at).

Best, Jan

Sander Dieleman

unread,
Nov 19, 2015, 12:43:53 PM11/19/15
to lasagne-users
Was going to say this, but of course Jan already did :)
Definitely get a Maxwell-based card. Get the Titan X if you need the memory (it has 12 GB), otherwise the 980 Ti (which has 6 GB). The 980 is great as well and is a bit less power-hungry, but it usually has only 4 GB.

With cuDNN v3 and its Maxwell-specific optimizations, the performance difference between these architectures is huge. Don't waste your money on Tesla cards (of which no Maxwell versions exist) or previous-generation cards like the Titan.

Sander

themoo...@googlemail.com

unread,
Nov 20, 2015, 4:38:35 AM11/20/15
to lasagne-users, themoo...@googlemail.com
I've just found two interesting articles that are slightly related. The takeaways:
  • Sometimes ECC is cheap to get; then one should take it.
  • ECC is crucial when calculations run long enough that (1) errors become more likely and (2) it becomes very annoying / expensive to run the calculation again.
  • Neural networks can deal with noisy data, so bit flips should not be a problem.
  • Having the Maxwell architecture is much more important than having ECC.