confusing remote_run failure related to logdog bootstrap and buildbot timeout

28 views
Skip to first unread message

Paweł Hajdan, Jr.

unread,
Aug 4, 2016, 12:43:19 PM8/4/16
to infr...@chromium.org, Nico Weber, d...@chromium.org
I'd like to ask for help with https://bugs.chromium.org/p/chromium/issues/detail?id=633294

[7861/36198] ACTION //cc/proto:proto_internal_gen(//build/toolchain/win:clang_x64)
FAILED: pyproto/cc/proto/begin_main_frame_and_commit_state_pb2.py gen/cc/proto/begin_main_frame_and_commit_state.pb.cc gen/cc/proto/begin_main_frame_and_commit_state.pb.h 
INFO:remote_run:Command ['C:\\b\\.recipe_cipd\\logdog_butler.exe', '-log-level', 'warning', '-project', 'chromium', '-prefix', 'bb/chromium.fyi/CrWinAsan/3925', '-output', 'logdog,host="services-dot-luci-logdog.appspot.com"', '-service-account-json', 'c:\\creds\\service_accounts\\service-account-luci-logdog-publisher.json', '-output-max-buffer-age', '30s', 'run', '-stdout', 'tee=stdout', '-stderr', 'tee=stderr', '-streamserver-uri', 'net.pipe:LUCILogDogButler', '--', 'C:\\b\\.recipe_cipd\\logdog_annotee.exe', '-log-level', 'warning', '-project', 'chromium', '-butler-stream-server', 'net.pipe:LUCILogDogButler', '-logdog-host', 'luci-logdog.appspot.com', '-annotate', 'tee', '-name-base', 'recipes', '-print-summary', '-tee', '-json-args-path', 'C:\\b\\rr\\tmpwlna8a\\logdog_annotee_cmd.json', '-result-path', 'C:\\b\\rr\\tmpwlna8a\\bootstrap_result.json'] finished with exit code 1.
WARNING:remote_run:Could not bootstrap LogDog: Failed to open bootstrap result file [C:\b\rr\tmpwlna8a\bootstrap_result.json]: [Errno 2] No such file or directory: 'C:\\b\\rr\\tmpwlna8a\\bootstrap_result.json'
INFO:remote_run:Not using LogDog. Invoking `recipes.py` directly.
INFO:remote_run:Executing command: ['C:\\b\\depot_tools\\python276_bin\\python.exe', 'C:\\b\\.remote_run_cipd\\recipes.py', '--verbose', 'remote', '--repository', 'https://chromium.googlesource.com/chromium/tools/build.git', '--revision', 'origin/master', '--workdir', 'C:\\b\\rr\\tmpwlna8a\\rw', '--', '--verbose', 'run', '--properties-file', 'C:\\b\\rr\\tmpwlna8a\\remote_run_properties.json', '--workdir', 'C:\\b\\rr\\tmpwlna8a\\w', '--output-result-json', 'C:\\b\\rr\\tmpwlna8a\\recipe_result.json', 'chromium']
For some reason compile failure led to a condition that looks like logdog bootstrap failure. Please note before these lines another buildbot message appears about reaching the timeout:
command timed out: 2400 seconds without output, attempting to kill
It's possible it somehow killed logdog pipeline first, which didn't write its output then, and somehow remote_run.py was still alive.

FWIW I consider this orthogonal to remote_run, since both that script and existing annotated_run.py use common logdog_bootstrap python module.

Any advice would be welcome.

Paweł

Erik Staab

unread,
Aug 4, 2016, 8:19:12 PM8/4/16
to Paweł Hajdan, Jr., infr...@chromium.org, Nico Weber, d...@chromium.org
dnj@ will be back next Wednesday and I'm not sure anyone else will get around to looking at this before then. When he's back we'll do more knowledge sharing so others can more easily dig into this.

--
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+unsubscribe@chromium.org.
To post to this group, send email to infr...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/infra-dev/CAATLsPbZO%2BhhuZOEqrn8T69EJG1Sq-fSSLsL6jtGycoDt4JX7Q%40mail.gmail.com.

Ryan Tseng

unread,
Aug 4, 2016, 8:20:24 PM8/4/16
to Erik Staab, Paweł Hajdan, Jr., infr...@chromium.org, Nico Weber, d...@chromium.org
Last time this happened it was a cipd version mismatch of some sort, but I don't recall the details

Daniel Jacques

unread,
Aug 6, 2016, 1:41:51 PM8/6/16
to Ryan Tseng, Erik Staab, Paweł Hajdan, Jr., infr...@chromium.org, Nico Weber, d...@chromium.org
The bootstrap file is supposed to work like this:
  • remote_run runs LogDog Butler.
  • Butler runs Annotee, passing "-result-path C:\\b\\rr\\tmpwlna8a\\bootstrap_result.json". If Annotee successfully executes its payload, the return value of the payload (regardless of its value) is written to this file. If Annotee does not successfully execute it (missing executable, system error, whatever), the file will not be created.
  • Annotee runs the actual payload, the recipe engine.
  • Recipe engine terminates with return code.
  • Annotee reaps the process, notes the successful (albeit with error code) result, and writes the bootstrap result JSON file to the specified path.
    • It is given "-print-summary", so it should be dumping the annotation stream summary before exiting, which begins with "=== Annotee" and is not present in these logs.
  • Butler reaps Annotee.
  • remote_run reaps Butler, looks at the "-result-path" path.
    • If the file exists, remote_run knows that the bootstrapping succeeded.
    • If the file does not exist, remote_run assumes that the bootstrapping failed, and falls back to a "safe" behavior of running without LogDog.
The "safe" behavior probably should be eliminated at some point once we have confidence in the bootstrapping, so that each run has exactly one chance to succeed, even if the bootstrapping failed. The safe fallback was more an opportunity to not ruin literally every build while we worked out any kinks with LogDog deployment.

Anyway, what we're seeing here is that "remote_run" is observing that the bootstrap result file does not exist and assuming the bootstrap failed. What's weird here is that the bootstrap did not appear to fail, as the underlying process ran successfully. I don't see any Butler or Annotee errors, so it looks like both processes terminated gracefully, but I also don't see the Annotee summary printed, suggesting that it was either killed or willingly exited before it could do that. I don't think a willing exit path exists that avoids the summary. remote_run received the forwarded exit code "1", as opposed to "250" or "251" that Butler/Annotee should be returning if either bootstrap failed.

So we have a condition that is expected to be binary, but appears to be otherwise:
  • Either Butler/Annotee should have failed, not created the file, and returned 250/251 to "remote_run".
  • Or they should have succeeded, printed the annotation summary, forwarded the "1" exit code, and created the file.
  • But apparently they failed to create the file and forwarded the "1" exit code.
However, maybe the "1" isn't being forwarded through Butler/Annotee, but just happens to be the return code of some unnatural process termination. If Annotee or Butler were killed externally via some other means and return a "1", that would explain why we're not seeing the summary or the file. This suggests that the killing process passes the return code, and that "1" is not an unusual thing to pass.

We do see this line:
command timed out: 2400 seconds without output, attempting to kill
A loose theory of what's happening, then, would be:
  1. "remote_run" runs LogDog bootstrap, blocks on Butler completion.
  2. BuildBot is killing "remote_run" due to lack of output.
  3. This is being forwarded to the build, which kills children (is this how Windows works?), ultimately killing Annotee and Butler prematurely.
  4. "remote_run" somehow survives this (maybe Python magic via try/catch/finally?), doesn't realize it's been killed, thinks it should keep going, and falls back to "safe" non-LogDog mode upon not observing that file.
  5. Build runs again without LogDog.
I think we really need more work put into handling Windows deaths. On Linux/Mac, Annotee/Butler would catch the signal, kill their subprocesses, flush, and exit through the expected route. This is obviously not happening on Windows.

I think this is a bug in the termination / signal handling parts of Annotee and Butler. Probably present in Kitchen too. I've filed crbug.com/635258 to track. I am still on vacation though, but if someone wants to help with this that'd be very welcome. Otherwise, I'll check it out in earnest when I get back.


On Thu, Aug 4, 2016 at 5:20 PM Ryan Tseng <hin...@google.com> wrote:
Last time this happened it was a cipd version mismatch of some sort, but I don't recall the details

On Thu, Aug 4, 2016 at 5:18 PM, Erik Staab <est...@chromium.org> wrote:
dnj@ will be back next Wednesday and I'm not sure anyone else will get around to looking at this before then. When he's back we'll do more knowledge sharing so others can more easily dig into this.
On Thu, Aug 4, 2016 at 9:43 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
I'd like to ask for help with https://bugs.chromium.org/p/chromium/issues/detail?id=633294

[7861/36198] ACTION //cc/proto:proto_internal_gen(//build/toolchain/win:clang_x64)
FAILED: pyproto/cc/proto/begin_main_frame_and_commit_state_pb2.py gen/cc/proto/begin_main_frame_and_commit_state.pb.cc gen/cc/proto/begin_main_frame_and_commit_state.pb.h 
INFO:remote_run:Command ['C:\\b\\.recipe_cipd\\logdog_butler.exe', '-log-level', 'warning', '-project', 'chromium', '-prefix', 'bb/chromium.fyi/CrWinAsan/3925', '-output', 'logdog,host="services-dot-luci-logdog.appspot.com"', '-service-account-json', 'c:\\creds\\service_accounts\\service-account-luci-logdog-publisher.json', '-output-max-buffer-age', '30s', 'run', '-stdout', 'tee=stdout', '-stderr', 'tee=stderr', '-streamserver-uri', 'net.pipe:LUCILogDogButler', '--', 'C:\\b\\.recipe_cipd\\logdog_annotee.exe', '-log-level', 'warning', '-project', 'chromium', '-butler-stream-server', 'net.pipe:LUCILogDogButler', '-logdog-host', 'luci-logdog.appspot.com', '-annotate', 'tee', '-name-base', 'recipes', '-print-summary', '-tee', '-json-args-path', 'C:\\b\\rr\\tmpwlna8a\\logdog_annotee_cmd.json', '-result-path', 'C:\\b\\rr\\tmpwlna8a\\bootstrap_result.json'] finished with exit code 1.
WARNING:remote_run:Could not bootstrap LogDog: Failed to open bootstrap result file [C:\b\rr\tmpwlna8a\bootstrap_result.json]: [Errno 2] No such file or directory: 'C:\\b\\rr\\tmpwlna8a\\bootstrap_result.json'
INFO:remote_run:Not using LogDog. Invoking `recipes.py` directly.
INFO:remote_run:Executing command: ['C:\\b\\depot_tools\\python276_bin\\python.exe', 'C:\\b\\.remote_run_cipd\\recipes.py', '--verbose', 'remote', '--repository', 'https://chromium.googlesource.com/chromium/tools/build.git', '--revision', 'origin/master', '--workdir', 'C:\\b\\rr\\tmpwlna8a\\rw', '--', '--verbose', 'run', '--properties-file', 'C:\\b\\rr\\tmpwlna8a\\remote_run_properties.json', '--workdir', 'C:\\b\\rr\\tmpwlna8a\\w', '--output-result-json', 'C:\\b\\rr\\tmpwlna8a\\recipe_result.json', 'chromium']
For some reason compile failure led to a condition that looks like logdog bootstrap failure. Please note before these lines another buildbot message appears about reaching the timeout:
command timed out: 2400 seconds without output, attempting to kill
It's possible it somehow killed logdog pipeline first, which didn't write its output then, and somehow remote_run.py was still alive.

FWIW I consider this orthogonal to remote_run, since both that script and existing annotated_run.py use common logdog_bootstrap python module.

Any advice would be welcome.

Paweł

--
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+...@chromium.org.

To post to this group, send email to infr...@chromium.org.
Reply all
Reply to author
Forward
0 new messages