I wanted to post a brief check-in to this Google group, since I have seen very few people post online about successfully running CUDA 6.5 with Python + Theano + Lasagne on an NVIDIA Jetson TK1. For those of you who are unfamiliar:
https://developer.nvidia.com/jetson-tk1

"The NVIDIA Jetson TK1 development kit is a full-featured platform for
Tegra K1 embedded applications. It allows you to unleash the power of
192 CUDA cores to develop solutions in computer vision, robotics,
medicine, security, and automotive."
The Jetson is a SoC sporting a quad-core ARM processor and 2GB of shared system and graphics memory. It runs CUDA 6.5 on 192 Kepler-generation cores. Overall, a solid little machine that can run a small convolutional neural network.
I've been working on a remote-sensing project that requires some form of scene classification and/or object detection on an embedded platform. Our prototype system uses a DCNN architecture, and we've successfully taken production code running on (and models trained on) a Tesla K20m and simply installed it (not ported it, hallelujah Python) on the TK1. There are, of course, compilation issues going from x86 to ARM, but the platform is mature enough that most of the required packages can be installed via apt-get, or installed as Python modules and compiled with gcc or gfortran. Luckily, NVIDIA has done some of the grunt work by providing instructions for installing Ubuntu with the correct CUDA toolkit and NVIDIA drivers.
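For reference, the setup on our board looked roughly like the following. Treat this as a sketch rather than an exact recipe: package names are the usual ones on Ubuntu 14.04 / Linux for Tegra, and the unpinned pip installs pull whatever versions are current.

```shell
# System-level build tools and numeric libraries (compiled for ARM by apt)
sudo apt-get update
sudo apt-get install -y python-dev python-pip gfortran \
    libblas-dev liblapack-dev

# Python scientific stack, compiled locally with gcc/gfortran
pip install --user numpy scipy

# Theano and Lasagne; pin versions as appropriate for CUDA 6.5
pip install --user Theano
pip install --user Lasagne
```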
Now, the bread and butter. Taking the stereotypical 10,000 28x28-pixel patch MNIST example, we were able to get the following performance:
- The Tesla K20m can train the model in a very short amount of time and by the first epoch is achieving >98% on the 60,000 MNIST training data.
- The final model (without data augmentation) achieves a performance of 99.38% on the validation set within 5 minutes, averaging 17 seconds per epoch. This is rather slow, but training includes a lot of debugging output (like convolutional filter drawing) and the times should only serve as a relative benchmark.
- The final trained model achieves a 99.44% accuracy on the MNIST test data and classifies all 10,000 patches in 0.7406 seconds -- getting only 56 examples incorrect. The state-of-the-art performance on MNIST has the error around 20-23 cases, depending on the architecture.
- The Jetson TK1 can go through one full training epoch in roughly 190 seconds. We opted to simply transfer the pre-trained K20m model onto the Jetson.
- The Jetson, using the same model as the K20m, achieves an accuracy of 98.48% in 4.857 seconds.
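For anyone curious how the "transfer the pre-trained model" step works in practice: Lasagne exposes network weights as a list of plain NumPy arrays via `lasagne.layers.get_all_param_values`, so moving a model from the x86 K20m box to the ARM Jetson is just a pickle round trip. A minimal sketch, with dummy arrays standing in for the real `get_all_param_values` call (the shapes here are hypothetical, not our actual architecture):

```python
import pickle

import numpy as np

# On the K20m: lasagne.layers.get_all_param_values(network) returns plain
# NumPy arrays, which are portable across x86 and ARM. We stand in for
# that call with dummy float32 arrays of hypothetical shapes.
param_values = [
    np.random.RandomState(0).randn(32, 1, 5, 5).astype('float32'),  # conv filters
    np.zeros(32, dtype='float32'),                                  # conv biases
]

with open('mnist_model.pkl', 'wb') as f:
    pickle.dump(param_values, f, protocol=2)  # protocol 2 is Python-2 friendly

# On the Jetson TK1: load the arrays and push them into an identically
# defined network with lasagne.layers.set_all_param_values(network, values).
with open('mnist_model.pkl', 'rb') as f:
    loaded_values = pickle.load(f)

assert all(np.array_equal(a, b) for a, b in zip(param_values, loaded_values))
```

The important point is that the serialized file contains only NumPy arrays, so nothing architecture-specific (no compiled Theano graph) crosses the x86-to-ARM boundary.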
Note the drop in accuracy from the K20m to the TK1. We do not see this drop when we run the same model on other x86-supported, non-K20 desktop GPUs. One explanation we can offer is that the ARM-compiled libraries or the Theano compiler are losing bits of precision somewhere, and that this error accumulates as the signal propagates deeper into the network. This is pure guess and intuition and should be taken with a grain of salt.
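To illustrate the kind of effect we are hypothesizing, here is a toy NumPy experiment, not our actual network: it pushes the same input through a stack of hypothetical 256-unit tanh layers at float32 and at float64 and measures how far the two forward passes drift apart.

```python
import numpy as np

rng = np.random.RandomState(0)

def forward(x, weights, dtype):
    # Propagate through a stack of tanh layers at the given precision.
    for W in weights:
        x = np.tanh(x.astype(dtype) @ W.astype(dtype))
    return x.astype(np.float64)

# Ten hypothetical 256-unit layers, scaled to keep activations well-behaved.
weights = [rng.randn(256, 256) / np.sqrt(256) for _ in range(10)]
x = rng.randn(1, 256)

drift = np.max(np.abs(forward(x, weights, np.float32) -
                      forward(x, weights, np.float64)))
print(drift)  # nonzero: per-layer rounding error compounds with depth
```

The drift here is tiny, far too small on its own to explain a ~1% accuracy gap, but it shows the mechanism: small per-layer rounding differences do compound with depth, and a miscompiled or lower-precision math library could amplify the same effect.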
Nevertheless, hopefully this post serves as a verifiable proof of concept and as at least one benchmark of DCNN performance on a Jetson TK1.