Debugging a non-locally-reproducible build failure


Primiano Tucci

Jan 14, 2021, 5:47:57 AM
to infra-dev, Lalit Maganti, Eric Seckler, Francois Pierre Doray
Heya folks,

TL;DR: I have an obscure build error that doesn't repro locally and happens on only one bot. Is it possible to download the GN desc output or the ninja files from a failed build like this? What's special about that one bot? Any other suggested debugging steps?

Longer version
  • A //third_party/perfetto roll was reverted yesterday because it caused a build failure on the android-archive-dbg bot. The roll was 99.9999% the culprit: the revert made the bot green again.
  • Something subtle is going on: the roll passed the CQ, passed the main waterfall but broke only on that one android-archive-dbg bot. 
  • The failure is not a one-off flake: it failed three times back-to-back on the same bot and went green after the revert.
  • The build error is on lib.unstripped/libperfetto_android_internal.cr.so:
    ld.lld: error: lld uses blx instruction, no object with architecture supporting feature detected  
  • Don't read too much into the lld error (although that's also weird). The real problem is that libperfetto_android_internal shouldn't be known to, or built in, the chromium tree at all (not even in target_os = android configs). I checked older (green) builds for the same bot, and indeed it doesn't show up in the compile output.
  • I really can't tell how that target gets pulled in (and why only on that bot). I am aware of the subtlety that, in GN, a target X can be pulled into the "all" target just because the BUILD.gn file that contains X is referenced elsewhere (e.g. by depending on another target Y defined in the same BUILD.gn). I followed all the paths and queried GN (below), and I don't think that's the case here (but then, it's happening).
  • If I had to guess, I suspect it's related to this one CL, but I can't tell what it is doing wrong. All the references are guarded by perfetto_build_with_android, which is never true in the chromium tree. That's only ever set to true by our GN -> Android.bp offline build translator.
Here's the more interesting part: it doesn't repro locally.
I tried to repro this locally by copying the GN config of the failed bot:
  • As expected, there is no trace of libperfetto_android_internal in the generated ninja files. GN doesn't see that target.
  • $ gn desc out/acomp //third_party/perfetto/src/android_internal:libperfetto_android_internal
    The input //third_party/perfetto/src/android_internal:libperfetto_android_internal matches no targets, configs or files. (Same result when adding the (//build/toolchain/android:android_clang_arm) toolchain.)
  • $ gn desc . --format=json --all-toolchains '//*' > desc.json
    $ grep perfetto_android_internal desc.json  # No results
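The "no trace in the generated ninja files" check can be done with a recursive grep over the GN output directory; any hit shows which `.ninja` file mentions the library and on which line. A minimal sketch, assuming the usual GN output layout (build.ninja, toolchain.ninja and per-directory `.ninja` files under the out dir); `find_ninja_refs` is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every generated .ninja file under a GN output
# directory that mentions a given name, as "path:line:text".
# Assumption: the out dir contains the generated .ninja files; nothing is written.
find_ninja_refs() {
  local out_dir="$1" name="$2"
  # -r: recurse (implies printing file names), -n: line numbers,
  # -F: treat the name as a fixed string, --include: only look at *.ninja files.
  # The exit status is grep's own: 0 = found, 1 = no match, 2 = error.
  grep -rnF --include='*.ninja' -- "$name" "$out_dir"
}
```

For example, `find_ninja_refs out/acomp libperfetto_android_internal` prints nothing and returns 1 when the target is absent from the build graph, as in the local config above.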

At this point I don't know what to do other than throw darts in the dark like this, but I have very low confidence in that, and it will likely cause more land-break-revert cycles.

Questions:
1. Am I missing something obvious? Can somebody see something interesting? Any idea what would be special about that bot?
2. Can I somehow pull the ninja files (or the gn desc --all-toolchains output) from the failed build? I would be really curious to see which ninja file is trying to pull in that library, and where.
3. Any other suggested debugging steps?

Thanks for the help,
Primiano

Takuto Ikuta

Jan 14, 2021, 6:24:20 AM
to Primiano Tucci, infra-dev, Lalit Maganti, Eric Seckler, Francois Pierre Doray
I could repro the same error with:

$ git checkout e772ac076e469b22da84731dd98d66d8ff539694
$ gclient sync
$ cat out/android-archive-dbg/args.gn
# Set build arguments here. See `gn help buildargs`.
is_component_build = true
is_debug = true
symbol_level = 1
target_os = "android"
use_goma = true
$ autoninja -C out/android-archive-dbg/ libperfetto_android_internal.cr.so

What happens if you run above command?


Primiano Tucci

Jan 14, 2021, 8:44:05 AM
to Takuto Ikuta, infra-dev, Lalit Maganti, Eric Seckler, Francois Pierre Doray
Aaah, I see what I was doing wrong.
It turns out that a CL that landed right after the roll range happened to fix the root cause of the issue.
I was checking out third_party/perfetto @ origin/master, assuming the root cause was still there at our ToT.
Your comment, where you did the right thing by checking out precisely the version that failed, shows that. I can repro it as well if I check out the roll revision as you did.

I think this is what happened:
1. The roll passed the CQ because the CQ doesn't build "all" but only the affected targets computed by gn analyze. The problem here is the addition of a spurious library to the build graph that is not referenced by any Chrome target and fails only when built on its own, which really shows up only with :all.
2. I think (I didn't check) the main waterfall didn't go red because no bot there covers the combination of is_debug && is_component && is_android (this seems to fail only in component builds).
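To check which bot configurations actually cover the failing combination, one could scan each config's args.gn. A minimal sketch; `gn_arg_value` and `matches_failing_combo` are hypothetical helpers, and the parsing assumes one `key = value` assignment per line. Args left at their defaults are not resolved (an empty value means "not set explicitly"); `gn args <out_dir> --list=<arg> --short` reports the effective value instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the last explicit assignment to
# $key in an args.gn file, with whitespace and trailing "# ..." comments removed.
# Assumption: one "key = value" per line; defaults are NOT resolved.
gn_arg_value() {
  local file="$1" key="$2"
  sed -nE "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*([^#]*).*/\1/p" "$file" \
    | tail -n 1 | tr -d '[:space:]'
}

# Hypothetical helper: succeed only if the file explicitly sets the combination
# that failed here: is_debug, is_component_build and target_os = "android".
matches_failing_combo() {
  local file="$1"
  [[ "$(gn_arg_value "$file" is_debug)" == "true" ]] &&
    [[ "$(gn_arg_value "$file" is_component_build)" == "true" ]] &&
    [[ "$(gn_arg_value "$file" target_os)" == '"android"' ]]
}
```

For example, `matches_failing_combo out/android-archive-dbg/args.gn` succeeds for the args.gn quoted earlier in this thread.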

Also, I think the mysterious lld error really meant "you are trying to build a shared library, but there are no source files in it".

I think we can call it mystery solved.
Thanks!

Primiano
