--
To unsubscribe from this group and stop receiving emails from it, send an email to build+un...@tensorflow.org.
Yes, basically "bazel test" is what I'm after. But it seems that is not enough as you need more params and targets.
E.g. from Jason the GPU command was: "bazel --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test --local_test_jobs=2 --local_resources 16384,4.0,1.0 --build_tests_only --config=cuda --action_env=TF2_BEHAVIOR=1 --test_tag_filters=gpu,-no_oss,-oss_serial,-no_gpu,-benchmark-test --test_lang_filters=py --host_javabase=@bazel_tools//tools/jdk:jdk --java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 --host_java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute -- //tensorflow/... -//tensorflow/compiler/... -//tensorflow/lite/... -//tensorflow/python:timeline_test_gpu -//tensorflow/python/profiler/internal:run_metadata_test_gpu"
That's a mouthful and I'm afraid I don't understand half of it. From what I can guess, it limits the number of parallel tests to 2, runs only gpu tests (but not no_gpu ones or benchmarks), and only python tests. Then there is a bunch of exclusions (e.g. compiler or lite) which I'm not sure why they are excluded.
Our goal is to have confidence that everything is fine, especially in the presence of TF_SYSTEM_LIBS. E.g. using OpenSSL instead of BoringSSL led to issues detected only later, and with PyTorch I found a couple of serious issues in OpenBLAS and libunwind on POWER. With TF we don't have those tests yet, hence I'd like a recommendation which tests make sense to run. It should be a thorough test, but doesn't need to cover every little thing.
And I'm also seeing the protobuf -> java related failure, which is especially annoying on POWER HPC systems (java is not trivial to install there). But from a first test that seems to be only a missing definition in the protobuf.bzl file.
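Spelling out my guesses per flag (unverified, just my reading of the command):

```shell
# Unverified reading of the individual flags (startup options come first):
#   --host_jvm_args           heap limits for Bazel's own JVM server process
#   --local_test_jobs=2       run at most 2 tests concurrently on this machine
#   --local_resources         RAM (MB), CPU cores, I/O Bazel may assume available
#   --build_tests_only        only build and run test targets matching the patterns
#   --config=cuda             enable the CUDA build configuration from .bazelrc
#   --test_tag_filters=...    keep tests tagged 'gpu'; a '-' prefix excludes a tag
#   --test_lang_filters=py    only python tests
#   --*java*toolchain/...     pin the JDK used for the java parts of the build
#   --run_under=...           wrap each test in the parallel_gpu_execute launcher
#   after '--': target patterns; a leading '-' excludes that pattern
bazel --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test \
  --local_test_jobs=2 --local_resources 16384,4.0,1.0 --build_tests_only \
  --config=cuda --action_env=TF2_BEHAVIOR=1 \
  --test_tag_filters=gpu,-no_oss,-oss_serial,-no_gpu,-benchmark-test \
  --test_lang_filters=py \
  --host_javabase=@bazel_tools//tools/jdk:jdk \
  --java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 \
  --host_java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8 \
  --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
  -- //tensorflow/... -//tensorflow/compiler/... -//tensorflow/lite/... \
  -//tensorflow/python:timeline_test_gpu \
  -//tensorflow/python/profiler/internal:run_metadata_test_gpu
```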
Alex
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Dipl.-Inf. Alexander Grund Research Assistant Technische Universität Dresden Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) Chemnitzer Str.46b, Raum 250 01062 Dresden Tel.: +49 (351) 463-35982 E-Mail: alexand...@tu-dresden.de ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bazel test --config=opt -- //tensorflow/core/... //tensorflow/cc/... //tensorflow/c/... -//tensorflow/core:example_java_proto
Thanks so far, currently trying something based on this. However I have a question regarding exclusions and such:
- I assume for the targets "//tensorflow/core/..." means "include" and "-//tensorflow/core:example_java_proto" means "exclude", is this correct?
- There is "--test_tag_filters=gpu,-no_oss,-oss_serial,-no_gpu,-benchmark-test". Like "run all gpu tests, and not the no_gpu ones"? Is there some suggested preset of tag filters I should use? I mean naively I'd like to run gpu and non-gpu tests. Excluding benchmarks makes sense too.
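If it helps to reason about them: my understanding of how those tag filters combine (a hypothetical sketch, not Bazel's actual code) is that a test is kept when it carries none of the negated tags and, if any positive tags are given, at least one of them.

```python
def matches_tag_filters(test_tags, tag_filters):
    """Sketch of my understanding of Bazel's --test_tag_filters semantics.

    test_tags:   set of tags on the test target.
    tag_filters: list like ["gpu", "-no_oss", "-no_gpu"].
    """
    positives = {t for t in tag_filters if not t.startswith("-")}
    negatives = {t[1:] for t in tag_filters if t.startswith("-")}
    if test_tags & negatives:
        return False          # any excluded tag disqualifies the test
    if positives and not (test_tags & positives):
        return False          # positive filters given, but none matched
    return True

filters = ["gpu", "-no_oss", "-oss_serial", "-no_gpu", "-benchmark-test"]
print(matches_tag_filters({"gpu"}, filters))            # True: gpu test kept
print(matches_tag_filters({"gpu", "no_gpu"}, filters))  # False: -no_gpu wins
```

Under this reading, dropping the leading "gpu" from the filter list would run gpu and non-gpu tests alike while still excluding the negated tags.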
Sorry, I haven't found any other error besides that: "ERROR: /dev/shm/tensorflow-r2.4/tensorflow/python/tools/BUILD:473:8: failed (Exit 1): generate-xml.sh failed: error executing command"
And as mentioned, if I try to run that manually it works and I don't even see how that could fail. I'll retry (log is gone now) and double-check if there is anything else which could be useful.
Ran into this again and the output is:

SUBCOMMAND: # //tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu [action 'Testing //tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu (shard 6 of 8)', configuration: 9b98de9644902f7a44cf7890192f2bf08567062808516f2db87a5cd4c6efae0a, execution platform: @local_execution_config_platform//:platform]
(cd /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmpiNPW_y-bazel-tf/output_base/execroot/org_tensorflow && \
  exec env - \
    PATH=/usr/bin:/bin \
    TEST_BINARY=tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu \
    TEST_NAME='//tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu (shard 6 of 8)' \
    TEST_SHARD_INDEX=5 \
    TEST_TOTAL_SHARDS=8 \
  external/bazel_tools/tools/test/generate-xml.sh bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.log bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.xml 315 142)
ERROR: /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/TensorFlow/tensorflow-r2.4/tensorflow/python/keras/optimizer_v2/BUILD:238:13: failed (Exit 1): generate-xml.sh failed: error executing command
(cd /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmpiNPW_y-bazel-tf/output_base/execroot/org_tensorflow && \
  exec env - \
    PATH=/usr/bin:/bin \
    TEST_BINARY=tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu \
    TEST_NAME='//tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu (shard 6 of 8)' \
    TEST_SHARD_INDEX=5 \
    TEST_TOTAL_SHARDS=8 \
  external/bazel_tools/tools/test/generate-xml.sh bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.log bazel-out/ppc-opt/testlogs/tensorflow/python/keras/optimizer_v2/optimizer_v2_test_gpu/shard_6_of_8/test.xml 315 142)
Execution platform: @local_execution_config_platform//:platform
INFO: Elapsed time: 1013.764s, Critical Path: 987.36s
INFO: 55 processes: 55 local.
FAILED: Build did NOT complete successfully
I can't see anything else suspicious or related in the log. Maybe some race condition or so, where the file is not yet created or gets removed in between?
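Not an answer from the thread, but a suggestion that may help narrow this down: Bazel has a few flags that make failing actions more talkative, which could be combined when re-running just the failing target.

```shell
# Re-run only the failing target with extra diagnostics:
#   --verbose_failures  print the full command line of the failing action
#   --test_output=all   stream the test's own stdout/stderr to the console
#   --sandbox_debug     keep sandbox directories around for inspection
bazel test --verbose_failures --test_output=all --sandbox_debug \
  //tensorflow/python/keras/optimizer_v2:optimizer_v2_test_gpu
```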
What is the error you are seeing? The pasted snippet does not reference it.
Probably you will need to exclude a few more targets:
/dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmpiNPW_y-bazel-tf/output_base/execroot/org_tensorflow/bazel-out/ppc-opt/bin/tensorflow/lite/python/schema_py_srcs_no_include_all/tflite
Yes, when run manually it exits with zero exit code and no stdout/stderr.
I'm currently also blocked on TF/Bazel trying to generate XLA stuff even though XLA is not enabled. See https://github.com/tensorflow/tensorflow/issues/45045
Looks like the native.genrule is executed even when nothing depends on its outputs when I do `bazel test //tensorflow/python/...`.
Any ideas here?
--
Yes, building on /dev/shm to reduce build times. Running out of memory is unlikely; I also tried to build on /tmp and the same happens. It also consistently fails on that test. Grepping the log, I see not a single generate-xml being executed except the failing one. Running out of ideas here...
Thanks for the reply. I assume that is similar to running the parenthesized command shown by bazel.
I retried the build, it failed again (it consistently does), and I executed the commands manually as shown by you (with paths changed according to the failing one). The return code is zero and the xml file is created. So to me it really looks like something in Bazel is failing, but I have no idea what.
Also, the error message "error executing command" comes from Bazel itself and is too generic to tell what went wrong, i.e. whether it is caused by a missing file, wrong permissions or a failure in the script. I see no stdout/stderr from the run itself, only the SUBCOMMAND, then ERROR, and that's it.
I can rule out memory issues, using a 700GB /tmp here and the file itself is created, executable and runs fine after the error.
Any more ideas on how to debug this?
Thanks a lot!
FTR: The Bazel binary itself contains a zipped version of generate-xml.sh, so in order to modify it, I modified it before building Bazel so the modified version gets picked up. I was hoping to see any output so I could tell whether the script was created and started to run. But I see no output at all.
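A possibly quicker route than rebuilding Bazel, assuming (I have not verified this for every Bazel version) that the embedded @bazel_tools scripts are extracted into the output base on first use, where they can be edited in place:

```shell
# Locate the extracted copy of generate-xml.sh in the current workspace's
# output base; editing this copy should take effect on the next test run
# (assumption: Bazel does not re-extract it while the server is running).
ls "$(bazel info output_base)/external/bazel_tools/tools/test/generate-xml.sh"
```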
--
Just tried that and still the same issue. Why isn't Bazel giving me any better information on WHY it failed to run that command?
While running this interactively again I noticed something strange: I still have `--jobs 64` set (after `--jobs 1` failed), but also `--local_test_jobs=1`, yet I see:
[506 / 570] 42 / 783 tests; 64 actions, 3 running; last test: //tensorflow/c/eager:c_api_unified_experimental_test_gpu
    Testing //tensorflow/c/eager:c_api_remote_test; 40s local
    Testing //tensorflow/c/eager:c_api_remote_test_gpu; 15s local
    Testing //tensorflow/c:while_loop_test; 7s local
    [Sched] Testing //tensorflow/c/eager:c_api_cluster_test_gpu; 110s
    [Sched] Testing //tensorflow/core/framework:op_def_builder_test; 110s
    [Sched] Testing //tensorflow/core/framework:op_compatibility_test; 110s
    [Sched] Testing //tensorflow/cc/experimental/base/tests:tensorhandle_test; 109s
    [Sched] Testing //tensorflow/cc:gradients_manip_grad_test; 109s ...
Does this indicate it is running 3 tests in parallel even though I instructed it to only run 1 at a time?
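For context, my understanding of the two flags involved (an assumption, not verified against the Bazel documentation): `--jobs` caps the total number of concurrent actions of any kind, while `--local_test_jobs` caps only tests actually executing locally; the `[Sched]` entries above would then be tests that are scheduled but waiting, not running.

```shell
# Hedged sketch of the distinction (assumed semantics):
#   --jobs=64            up to 64 concurrent actions overall (compiles etc.)
#   --local_test_jobs=1  but at most 1 test actually executing at a time
bazel test --jobs=64 --local_test_jobs=1 -- //tensorflow/c/...
```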
It is a long shot and a painful test, but did you try with the -j 1 flag?