Cros SDK producing unbootable kernel images


Curtis Malainey

Aug 8, 2017, 17:18:58
To: Chromium OS dev, Guenter Roeck, Mike Frysinger
As of yesterday (August 7, 2017), by at least 12:30pm PDT, a repo sync has caused our tools to produce unbootable kernel images on development devices. The issue appears to be limited to kernels pushed by update_kernel.sh. It does not appear to be isolated to any specific board (we have experienced issues with cyan, gnawty, and others). Both groeck@ and I have been affected by this issue. Clean commits that previously worked are now broken. Kernels 4.9 and 4.12 have both been tested.

Symptoms observed:
  • The device starts up, gets past firmware, and then either hangs on a black screen or reboots.
  • The device requires a hard reset to get out of the hang.
  • No ramoops are left in pstore.
  • The system can be recovered by booting from an external device and using dd to copy partitions 2 and 3 from a flash drive back onto the host device (a sketch follows below).
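
(A minimal sketch of that recovery flow; the device names are assumptions — internal eMMC at /dev/mmcblk0, USB flash drive at /dev/sda:)

  # Booted from the external USB device; copy the known-good kernel and rootfs back.
  dd if=/dev/sda2 of=/dev/mmcblk0p2   # partition 2: KERN-A
  dd if=/dev/sda3 of=/dev/mmcblk0p3   # partition 3: ROOT-A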

We are currently rolling back our SDK to track down the cause, but have found nothing so far. Does anyone have any hints as to what went into the tree recently that could have caused this?

Guenter Roeck

Aug 8, 2017, 17:21:21
To: Curtis Malainey, Chromium OS dev, Mike Frysinger
Here is a key difference, seen when trying to boot cyan.

old boot, working:

early console in extract_kernel
input_data: 0x00000000023ad276
input_len: 0x00000000007e12d5
output: 0x0000000001000000
output_len: 0x00000000017d2014
kernel_total_size: 0x0000000001bbe000
booted via startup_32()
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel.

new, no longer booting:

early console in extract_kernel
input_data: 0x00000000023ad276
input_len: 0x00000000007e0b3d
output: 0x0000000001000000
output_len: 0x00000000017d1f44
kernel_total_size: 0x0000000001bbe000
booted via startup_64()
Physical KASLR using RDRAND RDTSC...

[ hangs here ]

Guenter

Manoj Gupta

Aug 8, 2017, 18:38:01
To: Guenter Roeck, Curtis Malainey, Chromium OS dev, Mike Frysinger, Matthias Kaehlcke
There have been two major toolchain updates recently:

1. Binutils update to 2.27, about 10 days ago.
2. Clang/LLVM version update on Aug 06. Not sure whether this can affect kernels, since they are still built with GCC.


But the waterfall is all green, and we didn't see any issues in our testing.


Let us know if you think one of these CLs is to blame. In that case, I am curious why we didn't catch this in our testing.

Thanks,
Manoj




Richard Barnette

Aug 8, 2017, 20:17:37
To: Manoj Gupta, Guenter Roeck, Curtis Malainey, Chromium OS dev, Mike Frysinger, Matthias Kaehlcke

> On Aug 8, 2017, at 3:37 PM, Manoj Gupta <manoj...@chromium.org> wrote:
>
> There have been two major toolchain updates recently:
>
> 1. Binutils update to 2.27, about 10 days ago.
> 2. Clang/LLVM version update on Aug 06. Not sure whether this can affect kernels, since they are still built with GCC.
>
> Binutils CL: https://chromium-review.googlesource.com/c/566066
> LLVM CL: https://chromium-review.googlesource.com/c/602683
>
> But the waterfall is all green, and we didn't see any issues in our testing.
>
> Binutils testing: https://docs.google.com/a/google.com/spreadsheets/d/1Dyha7xOBVjSJM0wVPk-JiDSTimCJRORSG2MJKak0wGk/edit?usp=sharing
> llvm testing: https://docs.google.com/a/google.com/spreadsheets/d/1xVLkraenV8I2EV-UzHArghnBoDUYMX5LiwfxMZ8qmZk/edit?usp=sharing
>
> Let us know if you think one of these CLs is to blame. In that case, I am curious why we didn't catch this in our testing.
>
It's probably worth re-emphasizing: The tree is healthy.
Most especially, the canaries are healthy. Whatever
problem is causing the observed failures, it's not affecting
customer builds.

Generally speaking, our builders start fresh. So, it's
possible that something in the process of syncing the code
and then running an incremental build is producing corrupted
images. You might try building from a freshly created chroot
to see if that makes the problem go away.

Guenter Roeck

Aug 8, 2017, 20:51:54
To: Richard Barnette, Manoj Gupta, Curtis Malainey, Chromium OS dev, Mike Frysinger, Matthias Kaehlcke
FWIW, I _did_ try a fresh chroot. That was part of the problem,
because it _introduced_ the problem (while older chroots with no
recent repo sync work fine). I also tested a clean canary image. That
works fine if its kernel is replaced with a 4.9 or 4.12 kernel built
with an older chroot, but not if it is replaced with a kernel from a
new chroot.

We'll do some more testing. Not that it helps much, but we managed to
boot by disabling address space randomization (with nokaslr boot
option). Any 4.9 and 4.12 image with KASLR enabled fails to boot if
built in a new chroot.
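
For anyone else who wants the workaround, a sketch of how it can be applied on the device (the make_dev_ssd.sh flags here are from memory, so treat them as assumptions):

  # Append nokaslr to the command line of KERN-A (partition 2), then reboot.
  /usr/share/vboot/bin/make_dev_ssd.sh --partitions 2 --save_config /tmp/cfg
  # edit /tmp/cfg.2 and append "nokaslr"
  /usr/share/vboot/bin/make_dev_ssd.sh --partitions 2 --set_config /tmp/cfg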

Guenter

Richard Barnette

Aug 8, 2017, 21:07:22
To: Guenter Roeck, Benjamin Gordon, Manoj Gupta, Curtis Malainey, Chromium OS dev, Mike Frysinger, Matthias Kaehlcke
+bmgordon@

> On Aug 8, 2017, at 5:51 PM, Guenter Roeck <gro...@google.com> wrote:
>
> FWIW, I _did_ try a fresh chroot. That was part of the problem,
> because it _introduced_ the problem (while older chroots with no
> recent repo sync work fine). I also tested a clean canary image. That
> works fine if its kernel is replaced with a 4.9 or 4.12 kernel built
> with an older chroot, but not if it is replaced with a kernel from a
> new chroot.
>

Hmmm... The canaries do build everything from scratch: See the
InitSDK and SetupBoard stages. If I understand what you mean about
replacing the kernel, I'd say that's kind of expected, given that the
symptom is "when you build locally with a new chroot, it fails, if
you build with an old chroot, it passes."

IIUC, there have been some changes in the way cros_sdk builds our chroots
within the past few days. I'm thinking specifically about this thread:
https://groups.google.com/a/chromium.org/forum/?hl=en#!topic/chromium-os-dev/yVEWDR7wDfA

I don't know much about the changes, but I'm wondering if maybe that
new code could behave differently on our builders from the way it
behaves on desktops?


> We'll do some more testing. Not that it helps much, but we managed to
> boot by disabling address space randomization (with nokaslr boot
> option). Any 4.9 and 4.12 image with KASLR enabled fails to boot if
> built in a new chroot.
>
You mean "built in a new chroot but not on the release builders",
I presume? I'm just trying to confirm that we know that the
canary builds do enable KASLR.

Guenter Roeck

Aug 8, 2017, 21:21:47
To: Richard Barnette, Benjamin Gordon, Manoj Gupta, Curtis Malainey, Chromium OS dev, Mike Frysinger, Matthias Kaehlcke
On Tue, Aug 8, 2017 at 6:07 PM, Richard Barnette
<jrbar...@chromium.org> wrote:
> +bmgordon@
>
>> On Aug 8, 2017, at 5:51 PM, Guenter Roeck <gro...@google.com> wrote:
>>
>> FWIW, I _did_ try a fresh chroot. That was part of the problem,
>> because it _introduced_ the problem (while older chroots with no
>> recent repo sync work fine). I also tested a clean canary image. That
>> works fine if its kernel is replaced with a 4.9 or 4.12 kernel built
>> with an older chroot, but not if it is replaced with a kernel from a
>> new chroot.
>>
>
> Hmmm... The canaries do build everything from scratch: See the
> InitSDK and SetupBoard stages. If I understand what you mean about
> replacing the kernel, I'd say that's kind of expected, given that the
> symptom is "when you build locally with a new chroot, it fails, if
> you build with an old chroot, it passes."
>
> IIUC, there have been some changes in the way cros_sdk builds our chroots
> within the past few days. I'm thinking specifically about this thread:
> https://groups.google.com/a/chromium.org/forum/?hl=en#!topic/chromium-os-dev/yVEWDR7wDfA
>
> I don't know much about the changes, but I'm wondering if maybe that
> new code could behave differently on our builders from the way it
> behaves on desktops?
>

I don't think this has anything to do with builders or the way a
chroot is created or maintained.

>
>> We'll do some more testing. Not that it helps much, but we managed to
>> boot by disabling address space randomization (with nokaslr boot
>> option). Any 4.9 and 4.12 image with KASLR enabled fails to boot if
>> built in a new chroot.
>>
> You mean "built in a new chroot but not on the release builders",
> I presume? I'm just trying to confirm that we know that the
> canary builds do enable KASLR.
>

We know that chromeos-4.4 and older kernels build fine. Our problem is
with 4.9 and 4.12 kernels, which are not built by any canary builds.
The problem is not with the canary builds, it is with 4.9 and 4.12
kernel images built within a recently updated chroot.

The following does suggest that KASLR is enabled for x86_64 builds.

$ git grep CONFIG_RANDOMIZE_BASE chromeos/
chromeos/config/i386/common.config:CONFIG_RANDOMIZE_BASE=y
chromeos/config/i386/common.config:CONFIG_RANDOMIZE_BASE_MAX_OFFSET=0x20000000
chromeos/config/x86_64/common.config:CONFIG_RANDOMIZE_BASE=y
chromeos/config/x86_64/common.config:CONFIG_RANDOMIZE_BASE_MAX_OFFSET=0x40000000

Guenter

Curtis Malainey

Aug 8, 2017, 21:44:45
To: Chromium OS dev, manoj...@chromium.org, gro...@google.com, cujoma...@google.com, vap...@google.com, m...@chromium.org


On Tuesday, August 8, 2017 at 5:17:37 PM UTC-7, Richard Barnette wrote:

> On Aug 8, 2017, at 3:37 PM, Manoj Gupta <manoj...@chromium.org> wrote:
>
> There have been two major toolchain updates recently:
>
> 1. Binutils update to 2.27, about 10 days ago.
> 2. Clang/LLVM version update on Aug 06. Not sure whether this can affect kernels, since they are still built with GCC.
>
> Binutils CL: https://chromium-review.googlesource.com/c/566066
> LLVM CL: https://chromium-review.googlesource.com/c/602683
>
Reverting Binutils did change the crash behavior slightly, but the symptoms were overall similar. I will try to revert LLVM again tomorrow, but I am currently running into a build error in sys-devel/autofdo-0.15-r5.

> But the waterfall is all green, and we didn't see any issues in our testing.
>
> Binutils testing: https://docs.google.com/a/google.com/spreadsheets/d/1Dyha7xOBVjSJM0wVPk-JiDSTimCJRORSG2MJKak0wGk/edit?usp=sharing
> llvm testing: https://docs.google.com/a/google.com/spreadsheets/d/1xVLkraenV8I2EV-UzHArghnBoDUYMX5LiwfxMZ8qmZk/edit?usp=sharing
>
> Let us know if you think one of these CLs is to blame. In that case, I am curious why we didn't catch this in our testing.
>
It's probably worth re-emphasizing:  The tree is healthy.
Most especially, the canaries are healthy.  Whatever
problem is causing the observed failures, it's not affecting
customer builds.

Generally speaking, our builders start fresh.  So, it's
possible that something in the process of syncing the code
and then running an incremental build is producing corrupted
images.  You might try building from a freshly created chroot
to see if that makes the problem go away.


I did delete my chroot (source and all to be safe) and I am still running into the same issue.

Manoj Gupta

Aug 8, 2017, 22:05:44
To: Curtis Malainey, Chromium OS dev, Guenter Roeck, Mike Frysinger, Matthias Kaehlcke
I have created the CL https://chromium-review.googlesource.com/c/607360 for easier testing of the LLVM change.

Can you try the following steps (a consolidated sketch follows the list):
1. Create a new chroot.
2. Patch in the CL above.
3. sudo emerge llvm clang autofdo
4. ./setup_board --board=$BOARD --nousepkg (The --nousepkg argument is important to avoid pulling in any prebuilt toolchain packages.) This step will take a while.
5. ./build_packages --nousepkg --board=$BOARD ..rest args.. (same as above for --nousepkg)
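
A consolidated sketch of the same sequence (the chroot-creation command is the standard one; $BOARD and the remaining args are whatever you normally use):

  cros_sdk --replace                          # 1. create a fresh chroot
  # 2. patch in the CL above, then, inside the chroot:
  sudo emerge llvm clang autofdo              # 3. rebuild host llvm/clang/autofdo
  ./setup_board --board=$BOARD --nousepkg     # 4. toolchain built from source
  ./build_packages --board=$BOARD --nousepkg  # 5. no prebuilt packages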

Thanks,
Manoj

Guenter Roeck

Aug 8, 2017, 23:05:58
To: David Riley, Manoj Gupta, Curtis Malainey, Chromium OS dev, Mike Frysinger, Matthias Kaehlcke
On Tue, Aug 8, 2017 at 7:33 PM, David Riley <david...@google.com> wrote:
> What's the startup_32 vs startup_64 log difference caused by? ie, what
> emits that line and what makes it choose between the two?
>

All I can say at this time is that this is one of the mysteries.

Guenter

Guenter Roeck

Aug 9, 2017, 13:03:36
To: Matthias Kaehlcke, David Riley, Manoj Gupta, Curtis Malainey, Chromium OS dev, Mike Frysinger
Do we ever build 32-bit images? I thought those were obsolete. Note
this was seen on cyan.

Guenter


On Wed, Aug 9, 2017 at 9:46 AM, Matthias Kaehlcke <m...@chromium.org> wrote:
> On Tue, Aug 8, 2017 at 8:05 PM, Guenter Roeck <gro...@google.com> wrote:
>> On Tue, Aug 8, 2017 at 7:33 PM, David Riley <david...@google.com> wrote:
>>> What's the startup_32 vs startup_64 log difference caused by? ie, what
>>> emits that line and what makes it choose between the two?
>>>
>>
>> All I can say at this time is that this is one of the mysteries.
>
> Apparently in the working case a 32-bit kernel is built, and a 64-bit
> one in the other:
>
> startup_32 and startup_64 are defined in arch/x86/kernel/head_(32|64).S
>
>
> arch/x86/kernel/Makefile:
>
> extra-y := head_$(BITS).o
>
>
> arch/x86/Makefile:
>
> ifeq ($(CONFIG_X86_32),y)
> BITS := 32
> ....
> else
> BITS := 64
> ...
> endif
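
One quick way to confirm which entry point a given build will use (a sketch; the debug vmlinux path inside the chroot is an assumption):

  # The ELF class of the built kernel shows 32- vs 64-bit directly.
  readelf -h /build/$BOARD/usr/lib/debug/boot/vmlinux | grep Class
  #   Class: ELF32  ->  head_32.S / startup_32
  #   Class: ELF64  ->  head_64.S / startup_64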

Curtis Malainey

Aug 9, 2017, 16:28:27
To: Chromium OS dev, cujoma...@google.com, gro...@google.com, vap...@google.com, m...@chromium.org
Unfortunately this did not resolve the issue; the device still fails in exactly the same way (from what I can tell).

Manoj Gupta

Aug 9, 2017, 18:13:15
To: Curtis Malainey, Chromium OS dev, Guenter Roeck, Mike Frysinger, m...@chromium.org
Ok, in that case, I believe this is not related to toolchain updates. 

Thanks,
Manoj


Curtis Malainey

Aug 9, 2017, 18:35:25
To: Mike Frysinger, Guenter Roeck, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke
I don't know about Guenter, but I am building packages and an image, then using cros_workon_make, then using update_kernel.sh. FYI, the nokaslr workaround also works for me. Is there a doc on how to build the kernel directly and push it to a device? (A sketch of my current flow is below.)
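
For reference, roughly the flow I'm using (a sketch; the 4.9 kernel package name is my assumption):

  # inside the chroot
  cros_workon --board=$BOARD start chromeos-kernel-4_9
  cros_workon_make --board=$BOARD chromeos-kernel-4_9
  # push the freshly built kernel to the device over the network
  ~/trunk/src/scripts/update_kernel.sh --board=$BOARD --remote=$DUT_IP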

On Wed, Aug 9, 2017 at 3:19 PM, Mike Frysinger <vap...@google.com> wrote:
how are you building+testing the kernel builds?  are you doing the build_packages+build_image flow, or do you have the ability to build the kernel directly to verify (e.g. go in there and `make`)?
-mike 

Mike Frysinger

Aug 9, 2017, 18:55:03
To: Curtis Malainey, Guenter Roeck, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke
you can try getting an older sync/chroot and see if you can narrow down the time frame and rule out local changes.

you can find periodic snapped manifests here:

or more succinctly:
  mkdir ~/some-temp-path
  cd ~/some-temp-path
  repo init -u https://chrome-internal.googlesource.com/chromeos/manifest-versions -m buildspecs/62/9825.0.0.xml
  repo sync
  cros_sdk
  ./build_packages --board=eve
  ./build_image --board=eve
  <do USE/emerge/whatever for the kernel>

each manifest will also have snapped the sdk & toolchain at the time.  it does this through the files:
  src/third_party/chromiumos-overlay/chromeos/binhost/host/

obviously this will involve quite a lot of network traffic.  you should be able to speed this up by not having to blow away the source checkout when selecting a different manifest.  so if 9825.0.0 worked and you want to try the next one, the incremental step would be:
  repo init -m buildspecs/62/9830.0.0.xml
  repo sync
  cros_sdk --delete
  cros_sdk
  <repeat build/test steps>

g'luck! ;)
-mike

Guenter Roeck

Aug 9, 2017, 19:07:58
To: Mike Frysinger, Curtis Malainey, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke
Hi Mike,

that is quite useful, thanks a lot! Guess I'll spend tomorrow
bisecting through the list. Just hope the oldest one still works ;-)

Guenter

Guenter Roeck

Aug 10, 2017, 17:06:20
To: Mike Frysinger, Curtis Malainey, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke
Bisect is complete. The first failing manifest is 9784.0.0.

Now to the second part, finding the actual culprit....

Guenter


Mike Frysinger

Aug 10, 2017, 17:15:19
To: Guenter Roeck, Curtis Malainey, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke

Guenter Roeck

Aug 10, 2017, 17:37:10
To: Mike Frysinger, Curtis Malainey, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke
How do I get a list of SDK builds? The CLs in the list you provided
all say they landed in 9782, which we know is working.

Thanks,
Guenter

Luis Lozano

Aug 10, 2017, 17:45:16
To: Mike Frysinger, Guenter Roeck, Curtis Malainey, Richard Barnette, Manoj Gupta, Chromium OS dev, Matthias Kaehlcke
(re-sending from chromium account. Sorry if you get this twice)

ok, it seems this is caused by the binutils upgrade:


The CL was submitted Jul 26, but it wasn't active until Jul 27 at 8:26 AM, when the chromiumos-sdk bot published new prebuilts.
We did binutils testing on various kernel versions, but none of them were newer than 4.4.70.

Since the waterfall is not showing this problem and has been clean for two weeks, and since there is a temporary workaround (disabling KASLR), we are not considering a binutils revert for now. Reverting binutils would be very painful.

We will need help from the kernel team to get exact instructions on how to reproduce, and to continue triaging (it's better to have someone with "domain knowledge" helping triage).
Can you please file a bug?

Who from the kernel team can continue helping us with this? 

Thanks