--
To unsubscribe from this group and stop receiving emails from it, send an email to build+un...@tensorflow.org.
Yes, basically "bazel test" is what I'm after. But it seems that is not enough as you need more params and targets.
E.g. from Jason the GPU command was: "bazel --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test --local_test_jobs=2 --local_resources 16384,4.0,1.0 --build_tests_only --config=cuda --action_env=TF2_BEHAVIOR=1 --test_tag_filters=gpu,-no_oss,-oss_serial,-no_gpu,-benchmark-test --test_lang_filters=py --host_javabase=@bazel_tools//tools/jdk:jdk --java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 --host_java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute -- //tensorflow/... -//tensorflow/compiler/... -//tensorflow/lite/... -//tensorflow/python:timeline_test_gpu -//tensorflow/python/profiler/internal:run_metadata_test_gpu"
That's a mouthful and I'm afraid I don't understand half of it. From what I can guess, it limits the number of parallel tests to 2, runs only gpu tests (but not no_gpu ones or benchmarks), and only python tests. Then there is a bunch of exclusions (e.g. compiler or lite) which I'm not sure why they are excluded.
Our goal is to have confidence that everything is fine, especially in the presence of TF_SYSTEM_LIBS. E.g. using OpenSSL instead of BoringSSL led to issues detected only later, and with PyTorch I found a couple of serious issues in OpenBLAS and libunwind on POWER. With TF we don't have those tests yet, hence I'd like a recommendation which tests make sense to run. It should be a thorough test, but doesn't need to cover every little thing.
And I'm also seeing the protobuf -> java related failure, which is especially annoying on POWER HPC systems (java is not trivial to install there). But from a first test that seems to be only a missing definition in the protobuf.bzl file.
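Spelling out my guesses per flag (unverified, just my reading of the command):

```shell
# Unverified reading of the individual flags (startup options come first):
#   --host_jvm_args           heap limits for Bazel's own JVM server process
#   --local_test_jobs=2       run at most 2 tests concurrently on this machine
#   --local_resources         RAM (MB), CPU cores, I/O Bazel may assume available
#   --build_tests_only        only build and run test targets matching the patterns
#   --config=cuda             enable the CUDA build configuration from .bazelrc
#   --test_tag_filters=...    keep tests tagged 'gpu'; a '-' prefix excludes a tag
#   --test_lang_filters=py    only python tests
#   --*java*toolchain/...     pin the JDK used for the java parts of the build
#   --run_under=...           wrap each test in the parallel_gpu_execute launcher
#   after '--': target patterns; a leading '-' excludes that pattern
bazel --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test \
  --local_test_jobs=2 --local_resources 16384,4.0,1.0 --build_tests_only \
  --config=cuda --action_env=TF2_BEHAVIOR=1 \
  --test_tag_filters=gpu,-no_oss,-oss_serial,-no_gpu,-benchmark-test \
  --test_lang_filters=py \
  --host_javabase=@bazel_tools//tools/jdk:jdk \
  --java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 \
  --host_java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 \
  --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
  -- //tensorflow/... -//tensorflow/compiler/... -//tensorflow/lite/... \
  -//tensorflow/python:timeline_test_gpu \
  -//tensorflow/python/profiler/internal:run_metadata_test_gpu
```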
Alex
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Dipl.-Inf. Alexander Grund Research Assistant Technische Universität Dresden Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) Chemnitzer Str.46b, Raum 250 01062 Dresden Tel.: +49 (351) 463-35982 E-Mail: alexand...@tu-dresden.de ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bazel test --config=opt -- //tensorflow/core/... //tensorflow/cc/... //tensorflow/c/... -//tensorflow/core:example_java_proto
Thanks so far, currently trying something based on this. However I have a question regarding exclusions and such:
- I assume for the targets "//tensorflow/core/..." means "include" and "-//tensorflow/core:example_java_proto" means "exclude", is this correct?
- There is "--test_tag_filters=gpu,-no_oss,-oss_serial,-no_gpu,-benchmark-test". Like "run all gpu tests, and not the no_gpu ones"? Is there some suggested preset of tag filters I should use? I mean naively I'd like to run gpu and non-gpu tests. Excluding benchmarks makes sense too.
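If it helps to reason about them: my understanding of how those tag filters combine (a hypothetical sketch, not Bazel's actual code) is that a test is kept when it carries none of the negated tags and, if any positive tags are given, at least one of them.

```python
def matches_tag_filters(test_tags, tag_filters):
    """Sketch of my understanding of Bazel's --test_tag_filters semantics.

    test_tags:   set of tags on the test target.
    tag_filters: list like ["gpu", "-no_oss", "-no_gpu"].
    """
    positives = {t for t in tag_filters if not t.startswith("-")}
    negatives = {t[1:] for t in tag_filters if t.startswith("-")}
    if test_tags & negatives:
        return False          # any excluded tag disqualifies the test
    if positives and not (test_tags & positives):
        return False          # positive filters given, but none matched
    return True

filters = ["gpu", "-no_oss", "-oss_serial", "-no_gpu", "-benchmark-test"]
print(matches_tag_filters({"gpu"}, filters))            # True: gpu test kept
print(matches_tag_filters({"gpu", "no_gpu"}, filters))  # False: -no_gpu wins
```

Under this reading, dropping the leading "gpu" from the filter list would run gpu and non-gpu tests alike while still excluding the negated tags.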
Sorry, I haven't found any other error besides that: "ERROR: /dev/shm/tensorflow-r2.4/tensorflow/python/tools/BUILD:473:8: failed (Exit 1): generate-xml.sh failed: error executing command"
And as mentioned, if I try to run that manually it works and I don't even see how that could fail. I'll retry (log is gone now) and double-check if there is anything else which could be useful.
Ran into this again and the output is:

SUBCOMMAND: # //tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu [action 'Testing //tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu (shard 6 of 8)', configuration: 9b98de9644902f7a44cf7890192f2bf08567062808516f2db87a5cd4c6efae0a, execution platform: @local_execution_config_platform//:platform]
(cd /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmpiNPW_y-bazel-tf/output_base/execroot/org_tensorflow && \
  exec env - \
    PATH=/usr/bin:/bin \
    TEST_BINARY=tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu \
    TEST_NAME='//tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu (shard 6 of 8)' \
    TEST_SHARD_INDEX=5 \
    TEST_TOTAL_SHARDS=8 \
  external/bazel_tools/tools/test/generate-xml.sh bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.log bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.xml 315 142)
ERROR: /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/TensorFlow/tensorflow-r2.4/tensorflow/python/keras/optimizer_v2/BUILD:238:13: failed (Exit 1): generate-xml.sh failed: error executing command
(cd /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmpiNPW_y-bazel-tf/output_base/execroot/org_tensorflow && \
  exec env - \
    PATH=/usr/bin:/bin \
    TEST_BINARY=tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu \
    TEST_NAME='//tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu (shard 6 of 8)' \
    TEST_SHARD_INDEX=5 \
    TEST_TOTAL_SHARDS=8 \
  external/bazel_tools/tools/test/generate-xml.sh bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.log bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.xml 315 142)
Execution platform: @local_execution_config_platform//:platform
INFO: Elapsed time: 1013.764s, Critical Path: 987.36s
INFO: 55 processes: 55 local.
FAILED: Build did NOT complete successfully
I can't see anything else suspicious or related in the log. Maybe some race condition or so, where the file is not yet created or gets removed in between?
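Not an answer from the thread, but a suggestion that may help narrow this down: Bazel has a few flags that make failing actions more talkative, which could be combined when re-running just the failing target.

```shell
# Re-run only the failing target with extra diagnostics:
#   --verbose_failures  print the full command line of the failing action
#   --test_output=all   stream the test's own stdout/stderr to the console
#   --sandbox_debug     keep sandbox directories around for inspection
bazel test --verbose_failures --test_output=all --sandbox_debug \
  //tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu
```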
What is the error you are seeing? The pasted snippet does not reference it.
Probably you will need to exclude a few more targets:
/dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmpiNPW_y-bazel-tf/output_base/execroot/org_tensorflow/bazel-out/ppc-opt/bin/tensorflow/lite/python/schema_py_srcs_no_include_all/tflite
Yes, when run manually it exits with zero exit code and no stdout/stderr.
I'm currently also blocked on TF/Bazel trying to generate XLA stuff even though XLA is not enabled. See https://github.com/tensorflow/tensorflow/issues/45045
Looks like the native.genrule is executed even when nothing depends on its outputs when I do `bazel test //tensorflow/python/...`.
Any ideas here?
--
Yes, building on /dev/shm to reduce build times. Running out of memory is unlikely; I also tried to build on /tmp and the same happens. It also consistently fails on that test. Grepping the log, I see not a single generate-xml being executed except the failing one. Running out of ideas here...
Thanks for the reply. I assume that is similar to running the parenthesized command shown by bazel.
I retried the build, it failed again (it consistently does), and I executed the commands manually as shown by you (with paths changed according to the failing one). The return code is zero and the xml file is created. So to me it really looks like something in Bazel is failing, but I have no idea what.
Also, the error message "error executing command" comes from Bazel itself and is too generic to tell what went wrong, i.e. whether it is caused by a missing file, wrong permissions or a failure in the script. I see no stdout/stderr from the run itself, only the SUBCOMMAND, then ERROR, and that's it.
I can rule out memory issues, using a 700GB /tmp here and the file itself is created, executable and runs fine after the error.
Any more ideas on how to debug this?
Thanks a lot!
FTR: The Bazel binary itself contains a zipped version of generate-xml.sh, so in order to modify it, I modified it before building Bazel so the modified version gets picked up. I was hoping to see any output so I could tell whether the script was created and started to run. But I see no output at all.
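A possibly quicker route than rebuilding Bazel, assuming (I have not verified this for every Bazel version) that the embedded @bazel_tools scripts are extracted into the output base on first use, where they can be edited in place:

```shell
# Locate the extracted copy of generate-xml.sh in the current workspace's
# output base; editing this copy should take effect on the next test run
# (assumption: Bazel does not re-extract it while the server is running).
ls "$(bazel info output_base)/external/bazel_tools/tools/test/generate-xml.sh"
```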
--
Just tried that and still the same issue. Why isn't Bazel giving me any better information on WHY it failed to run that command?
While running this interactively again I noticed something strange: I still have `--jobs 64` set (after `--jobs 1` failed), but also `--local_test_jobs=1`, yet I see:
[506 / 570] 42 / 783 tests; 64 actions, 3 running; last test: //tensorflow/c/eager:c_api_unified_experimental_test_gpu
    Testing //tensorflow/c/eager:c_api_remote_test; 40s local
    Testing //tensorflow/c/eager:c_api_remote_test_gpu; 15s local
    Testing //tensorflow/c:while_loop_test; 7s local
    [Sched] Testing //tensorflow/c/eager:c_api_cluster_test_gpu; 110s
    [Sched] Testing //tensorflow/core/framework:op_def_builder_test; 110s
    [Sched] Testing //tensorflow/core/framework:op_compatibility_test; 110s
    [Sched] Testing //tensorflow/cc/experimental/base/tests:tensorhandle_test; 109s
    [Sched] Testing //tensorflow/cc:gradients_manip_grad_test; 109s ...
Does this indicate it is running 3 tests in parallel even though I instructed it to only run 1 at a time?
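For context, my understanding of the two flags involved (an assumption, not verified against the Bazel documentation): `--jobs` caps the total number of concurrent actions of any kind, while `--local_test_jobs` caps only tests actually executing locally; the `[Sched]` entries above would then be tests that are scheduled but waiting, not running.

```shell
# Hedged sketch of the distinction (assumed semantics):
#   --jobs=64            up to 64 concurrent actions overall (compiles etc.)
#   --local_test_jobs=1  but at most 1 test actually executing at a time
bazel test --jobs=64 --local_test_jobs=1 -- //tensorflow/c/...
```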
It is a long shot and a painful test, but did you try with the -j 1 flag?