Dear SIG Build team,
we are self-compiling TensorFlow on various HPC clusters due to hardware
requirements (e.g. CUDA drivers) and were using `--config=mkl` to
(supposedly) enable the use of MKL and/or oneDNN to accelerate the CPU
operations for various DNN ops.
However, we were notified that our self-built package performs worse
than the pip package on CPU, even though we enable more aggressive
optimizations, e.g. `-march=native` to make use of AVX2 etc.
Further investigation revealed severe thread oversubscription, leading
to many involuntary context switches that severely impact performance.
Those can be (mostly) mitigated by setting e.g. OMP_NUM_THREADS=1, but
we can't do that by default for all users of our cluster for obvious
reasons.
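
For anyone who wants to reproduce the observation, here is a minimal
sketch (Linux/Unix only) of how the involuntary context switches of a
workload can be compared with and without `OMP_NUM_THREADS` set. The
helper name `involuntary_switches` and the inline workload are
placeholders for illustration, not what we actually ran:

```python
import os
import resource
import subprocess
import sys


def involuntary_switches(cmd, omp_num_threads=None):
    """Run cmd as a child process and return its involuntary context switches.

    If omp_num_threads is given, OMP_NUM_THREADS is set for the child only,
    so the setting takes effect before any OpenMP runtime is loaded.
    """
    env = dict(os.environ)
    if omp_num_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_num_threads)
    # RUSAGE_CHILDREN accumulates over all waited-for children, so take a delta.
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_nivcsw
    subprocess.run(cmd, env=env, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_nivcsw
    return after - before


if __name__ == "__main__":
    # Placeholder workload: replace with the real TensorFlow benchmark command.
    workload = [sys.executable, "-c", "sum(range(10**6))"]
    for threads in (None, 1):
        print("OMP_NUM_THREADS =", threads, "->",
              involuntary_switches(workload, threads), "involuntary switches")
```

The variable is set only in the child's environment, which is also why
this works as a per-job setting but not as a cluster-wide default.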
Comparing our build with the official pip packages led us to the
`--config=mkl` option mentioned above, which is shorthand for these
flags:
--define=build_with_mkl=true --define=enable_mkl=true
--define=tensorflow_mkldnn_contraction_kernel=0
--define=build_with_openmp=true
Searching the binaries of the pip package for the effects of those flags
makes me conclude that none of them is used, i.e. the official pip
packages are not built with `--config=mkl`. See
https://github.com/easybuilders/easybuild-easyblocks/issues/2577#issuecomment-919914929
for a detailed analysis.
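
The kind of search meant here can be sketched roughly as follows. This
is a generic byte-string scan, not the exact procedure from the linked
analysis; the function name `find_markers` is made up, and which marker
strings are meaningful (e.g. an OpenMP runtime library name) is an
assumption that has to be checked per build:

```python
def find_markers(path, markers, chunk_size=1 << 20):
    """Return the set of byte-string markers that occur in the file at path.

    The file is read in chunks; a tail of the previous chunk is kept so that
    markers straddling a chunk boundary are still found.
    """
    markers = [m if isinstance(m, bytes) else m.encode() for m in markers]
    overlap = max((len(m) for m in markers), default=1) - 1
    found = set()
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            for marker in markers:
                if marker in window:
                    found.add(marker)
            # Keep just enough bytes to catch a marker split across chunks.
            tail = window[-overlap:] if overlap > 0 else b""
    return found
```

It would be called on the shared libraries inside the installed
package, with the path and the marker list filled in by hand. A hit only
shows that the string is present in the binary, not that the
corresponding code path is active at runtime.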
However, disabling (i.e. not passing) `--config=mkl` makes at least one
test fail: //tensorflow/core/kernels/mkl:mkl_fused_batch_norm_op_test
Disabling only the OpenMP part, i.e. passing `--define=build_with_mkl=true
--define=enable_mkl=true
--define=tensorflow_mkldnn_contraction_kernel=0` instead makes many
tests fail:
//tensorflow/c/eager:c_api_cluster_test
//tensorflow/c/eager:c_api_remote_function_test
//tensorflow/c/eager:c_api_remote_test
//tensorflow/c/eager:c_api_test
//tensorflow/core/kernels:matmul_op_test
//tensorflow/core/kernels/mkl:mkl_fused_batch_norm_op_test
//tensorflow/python:convert_to_constants_test
//tensorflow/python/keras/layers:kernelized_test
//tensorflow/python/keras/wrappers:scikit_learn_test
//tensorflow/python/kernel_tests:variables_test
//tensorflow/python/kernel_tests/distributions:dirichlet_test
Using the related `--config=mkl_threadpool` seems to be even worse,
producing NaNs, segfaults, FPEs, etc.
- So what exactly is the purpose of `build_with_mkl` and `enable_mkl`?
- How exactly are those flags related to oneDNN and MKL? I don't see the
actual MKL being used, hence the confusion.
- How are the official pip packages built? Are they tested with that
setting?
Thanks!
Alex