Hi Yenda,
Great, I will update ASAP.
I was aware that MKL is only optimised for Intel processors; luckily, the cluster I work on has only Intel processors.
As you mentioned, I changed the configure file using the link line from the advisor. One odd thing: when I used the link line for the GNU option, which outputs flags such as -Wl,--no-as-needed, I ran into a bunch of errors, so I switched to the Intel(R) C/C++ option. After that, the only errors were that some -lmkl libraries were not found, so I kept the "-Wl,-rpath=$mkllibdir" flag from the original configure file, and voila, it worked. I also added the options

  || check_library $OMPLIBDIR "libmkl_tbb_thread" "a" || check_library $OMPLIBDIR "libmkl_tbb_thread" "so"

the same as in linux_configure_omplibdir. That did the trick for me.
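In case it helps anyone reading along, the reason that `||` chain works is that check_library succeeds as soon as one variant (static, then shared) is found. A minimal sketch of the behaviour; note the check_library body here is my simplified stand-in for illustration, not Kaldi's actual helper:

```shell
# Simplified stand-in for configure's check_library (assumption: it
# succeeds when <name>.<ext> exists in the given directory).
check_library() {
  dir=$1; name=$2; ext=$3
  [ -f "$dir/$name.$ext" ]
}

# Demo with a temporary directory standing in for $OMPLIBDIR:
OMPLIBDIR=$(mktemp -d)
touch "$OMPLIBDIR/libmkl_tbb_thread.so"   # pretend only the shared lib exists

FOUND=no
# Try the static archive first, then fall back to the shared object,
# mirroring the two check_library calls added to linux_configure_omplibdir:
if check_library "$OMPLIBDIR" "libmkl_tbb_thread" "a" \
   || check_library "$OMPLIBDIR" "libmkl_tbb_thread" "so"; then
  FOUND=yes
fi
echo "$FOUND"
rm -rf "$OMPLIBDIR"
```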
I called the script thus:
./configure --mkl-root=/opt/intel/mkl --threaded-math=yes --use-cuda=yes --cudatk-dir=/usr/local/cuda-7.0 --omp-libdir=/opt/intel/mkl/lib/intel64
The --omp-libdir option might seem redundant, but without it the script defaults to the previous composer_xe edition, which still uses libiomp5.
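For reference, the link line the advisor produced for the Intel C/C++ option with TBB threading looks roughly like the following; the exact paths and library set are my reconstruction for a 64-bit LP64 install, so double-check against the advisor for your MKL version:

```shell
# Sketch of the advisor-style link line (paths/libs are assumptions):
MKLROOT=/opt/intel/mkl
MKL_LDFLAGS="-Wl,-rpath=$MKLROOT/lib/intel64 -L$MKLROOT/lib/intel64 \
  -lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm -ldl"
```

The -Wl,-rpath part is what I kept from the original configure file so the MKL libraries are found at runtime without fiddling with LD_LIBRARY_PATH.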
So far I have seen considerable speed improvements: around 30-40% in nnet training and ~60% in decoding compared to ATLAS, but only 10-15% and 20% respectively against the previous build, which used OpenMP instead of TBB.
Cheers,
Angel