Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system


Aurelien Jarno

Sep 15, 2022, 4:00:03 PM
Hi,

On 2022-09-15 20:59, debian-b...@p0358.net wrote:
> > The first thing would be to provide the output of /proc/cpuinfo
>
> Pasting below (please **NOTE** that "avx2" would normally be there, but is
> currently missing due to this kernel option `clearcpuid=293` with which I
> booted the PC now -- I can **100%** confirm "avx2" was there before, but
> don't want to reboot for now to remove this kernel flag):
>
> # cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 60
> model name : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
> stepping : 3
> microcode : 0x12
> cpu MHz : 2394.664
> cache size : 3072 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2
> ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt xsave avx f16c
> rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi
> flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms invpcid xsaveopt
> dtherm arat pln pts
> vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb
> flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
> bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
> mds swapgs itlb_multihit srbds
> bogomips : 4789.10
> clflush size : 64
> cache_alignment : 64
> address sizes : 39 bits physical, 48 bits virtual
> power management:

Thanks.

> > If you believe the issue is due to AVX2, clearcpuid won't help, as it
> > just clears the corresponding flags from the kernel's point of view, but
> > the cpuid instruction will continue to behave the same. The way to
> > disable that feature at the glibc level is to set the GLIBC_TUNABLES
> > environment variable to "glibc.cpu.hwcaps=-AVX2_Usable".
>
> This works! Indeed, the clearcpuid flag on its own did nothing, as you
> mentioned. This workaround is great to know for the time being.

Great, that's narrowing down the problem.
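
As a minimal sketch of applying the workaround to a single command, the snippet below builds an environment with the tunable quoted above and runs a program with it. The helper names `env_without_avx2` and `run_with_workaround` are hypothetical, and the feature name `AVX2_Usable` is taken verbatim from this thread; it may be spelled differently in other glibc versions.

```python
import os
import subprocess

# Tunable string quoted in this thread (assumed valid for the glibc in use).
AVX2_OFF = "glibc.cpu.hwcaps=-AVX2_Usable"


def env_without_avx2(base_env=None):
    """Return a copy of the environment with the AVX2 workaround added."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("GLIBC_TUNABLES")
    # Tunables are colon-separated, so append instead of overwriting.
    # Assumption: no glibc.cpu.hwcaps entry is already present.
    env["GLIBC_TUNABLES"] = f"{existing}:{AVX2_OFF}" if existing else AVX2_OFF
    return env


def run_with_workaround(cmd):
    """Run cmd (a list of arguments) with the workaround in its environment."""
    return subprocess.run(cmd, env=env_without_avx2(), check=False)
```

For example, `run_with_workaround(["bash"])` would start a shell whose child processes inherit the variable. This only affects processes started that way; it does not set the tunable system-wide.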

> > Same from there due to ASLR. It seems to fail in at least two different
> > locations. Do you have some extra lines around these? Sometimes the kernel
> > also dumps the addresses around the instruction pointer.
>
> Generally these lines all followed a similar pattern, and there was nothing
> printed before or after, just this single line per crash. I will paste a few
> more below. Isn't the "+15a000" the relative offset in libc .so though? It

The +15a000 is the size of the libc.so.6 mapping in virtual memory, not an
offset into the file.

> does seem like an oddly round number, but I loaded the library in IDA
> disassembler and the instructions at this offset do seem to be related to
> AVX2 (linking screenshot which I also pasted on the linked GitHub issue)
> (the highlighted instruction in gray seems to be the one at this
> aforementioned offset):
> https://user-images.githubusercontent.com/5182588/190256853-29ae80aa-0089-4da2-a430-990e2693d15c.png
>
> If my above hypothesis is correct, then I looked at the mother function in
> x-refs and it does seem to be defined in rtld_global_ro table, and its name
> is "__strncmp_avx2". Was something changed in this function between the
> updates?
>
> Pasting more kernel lines:
> kernel: [852124.361775] traps: dhclient[1583381] trap invalid opcode
> ip:7fe19118051d sp:7ffee6e36238 error:0 in libc-2.31.so[7fe191044000+15a000]
> kernel: [852124.468314] traps: nft[1583398] trap invalid opcode
> ip:7fe3418fe51d sp:7fff11342df8 error:0 in libc-2.31.so[7fe3417c2000+15a000]
> kernel: [852124.572700] traps: systemd-shutdow[1377424] trap invalid opcode
> ip:7fde88b724ad sp:7ffc13767028 error:0 in libc-2.31.so[7fde88a3a000+15a000]
> kernel: [ 270.477024] traps: bun[2055] trap invalid opcode ip:2e363f4
> sp:7ffe2320d640 error:0 in bun[2a6f000+2ce2000]
> kernel: [ 279.884807] traps: systemd[2115] trap invalid opcode
> ip:7faf645ec4ad sp:7ffe12e06c48 error:0 in libc-2.31.so[7faf644b4000+15a000]
> kernel: [ 299.637575] traps: bun[2296] trap invalid opcode ip:2e363f4
> sp:7ffd0c0bc9c0 error:0 in bun[2a6f000+2ce2000]
> kernel: [ 331.036417] traps: bash[2462] trap invalid opcode ip:7ff42840051d
> sp:7ffd34ad7278 error:0 in libc-2.31.so[7ff4282c4000+15a000]
> kernel: [ 357.184428] traps: bash[2652] trap invalid opcode ip:7f717873751d
> sp:7fffd34c8848 error:0 in libc-2.31.so[7f71785fb000+15a000]
> kernel: [ 645.517556] traps: bash[3508] trap invalid opcode ip:7f4b6ee8851d
> sp:7ffd74beb6e8 error:0 in libc-2.31.so[7f4b6ed4c000+15a000]
> kernel: [ 876.760209] traps: bash[4225] trap invalid opcode ip:7fd30515a0c4
> sp:7ffc604bb118 error:0 in libc.so.6[7fd30502a000+154000]
> kernel: [ 891.000593] traps: bash[4399] trap invalid opcode ip:7f3c25acd0c4
> sp:7fff33adcab8 error:0 in libc.so.6[7f3c2599d000+154000]
> kernel: [ 1245.008705] traps: systemd[5382] trap invalid opcode
> ip:7fe82964f4ad sp:7ffd9967ace8 error:0 in libc-2.31.so[7fe829517000+15a000]
> kernel: [ 1472.084646] traps: systemd[6104] trap invalid opcode
> ip:7fd0316cb4ad sp:7fff24a010b8 error:0 in libc-2.31.so[7fd031593000+15a000]
> kernel: [ 1487.513379] traps: systemd[6269] trap invalid opcode
> ip:7fa89d8354ad sp:7fffdc2b9328 error:0 in libc-2.31.so[7fa89d6fd000+15a000]
> kernel: [ 1541.866530] traps: systemd[7005] trap invalid opcode
> ip:7fbb764d74ad sp:7ffd302b3718 error:0 in libc-2.31.so[7fbb7639f000+15a000]
> kernel: [ 1774.377940] traps: systemd[7750] trap invalid opcode
> ip:7f5db1ae54ad sp:7ffc9ba5ef58 error:0 in libc-2.31.so[7f5db19ad000+15a000]
> kernel: [66259.261517] traps: bash[428087] trap invalid opcode
> ip:7fc5f7364422 sp:7fff81b7f918 error:0 in libc.so.6[7fc5f7234000+16e000]
> kernel: [67322.502174] traps: bash[435709] trap invalid opcode
> ip:7f24abbad422 sp:7ffe428004f8 error:0 in libc.so.6[7f24aba7d000+16e000]
> kernel: [67339.606742] traps: passwd[436152] trap invalid opcode
> ip:7f4f047ce422 sp:7fff59b0f618 error:0 in libc.so.6[7f4f0469e000+16e000]
> kernel: [67433.720656] traps: adduser[437285] trap invalid opcode
> ip:7f0e09f602b7 sp:7fff697e8f98 error:0 in libc-2.31.so[7f0e09e28000+15a000]
> kernel: [67714.117441] traps: bash[439504] trap invalid opcode
> ip:7f96d3a5c0c4 sp:7ffd554b71a8 error:0 in libc.so.6[7f96d392c000+154000]
>
> Note that this time around they come from different libc compilations:
> - +15a000 one is from Debian Stable (debian:bullseye-20220912-slim docker
> image)
> - +16e000 one is from Debian Sid (debian:sid-slim docker image)
> - +2ce2000 is bun.js, an unrelated program that seems to have libc
> statically linked
> - +154000 is from Fedora for good measure (fedora:rawhide docker image)

As said above, this is basically linked to the size of the libc.so.6
file, or more precisely the part that is mapped into memory. That said,
the crash seems to happen in multiple places, judging by the last
digits of the ip address (the mapping address itself is randomized by
ASLR, but only in page-sized steps, so the low digits stay the same).
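
To make the comparison independent of ASLR, one can subtract the start of the mapping from the ip in each trap line. The sketch below does this for a few of the lines quoted above (re-joined onto one line each); `trap_offsets` is a hypothetical helper, and the result is an offset into the mapping the kernel reports, which is assumed, not verified, to be libc's executable segment. Turning it into a file offset may require adding that segment's offset within the file.

```python
import re
from collections import defaultdict

# Pattern for the kernel "traps:" lines quoted in this thread, once the
# two mail-wrapped halves of each entry are joined back into one line.
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[\d+\] trap invalid opcode "
    r"ip:(?P<ip>[0-9a-f]+) sp:[0-9a-f]+ error:\d+ "
    r"in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def trap_offsets(log_text):
    """Group (ip - mapping start) by (object name, mapping size)."""
    groups = defaultdict(set)
    for m in TRAP_RE.finditer(log_text):
        offset = int(m["ip"], 16) - int(m["base"], 16)
        groups[(m["obj"], m["size"])].add(offset)
    return {key: [hex(v) for v in sorted(vals)] for key, vals in groups.items()}


# A few lines copied from the report above.
SAMPLE = (
    "traps: dhclient[1583381] trap invalid opcode "
    "ip:7fe19118051d sp:7ffee6e36238 error:0 in libc-2.31.so[7fe191044000+15a000]\n"
    "traps: nft[1583398] trap invalid opcode "
    "ip:7fe3418fe51d sp:7fff11342df8 error:0 in libc-2.31.so[7fe3417c2000+15a000]\n"
    "traps: systemd-shutdow[1377424] trap invalid opcode "
    "ip:7fde88b724ad sp:7ffc13767028 error:0 in libc-2.31.so[7fde88a3a000+15a000]\n"
    "traps: adduser[437285] trap invalid opcode "
    "ip:7f0e09f602b7 sp:7fff697e8f98 error:0 in libc-2.31.so[7f0e09e28000+15a000]\n"
    "traps: bash[435709] trap invalid opcode "
    "ip:7f24abbad422 sp:7ffe428004f8 error:0 in libc.so.6[7f24aba7d000+16e000]\n"
)

for key, offsets in trap_offsets(SAMPLE).items():
    print(key, offsets)
```

On this sample, the bullseye build (+15a000) yields three distinct offsets, which matches the impression that the crash happens in more than one place.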

> So this "+" number being the same for each distinct libc build does
> suggest heavily that it is simply the relative instruction offset in the .so
> itself.
>
> I might be wrong though, I also had no idea where to get debug symbols from,
> and gdb didn't seem to be willing to print any useful information either...
> Do you think I should set up another LXC container and upgrade the libc6
> using this env var workaround and then try running some program under gdb
> itself with this variable cleared? I've never actually used gdb debugger,
> but surely a simple debugging up to a crash couldn't be that hard...

The debug symbols are available in the libc6-dbg package. Basically, you
can get a shell using the glibc.cpu.hwcaps workaround, run
`ulimit -c unlimited` to enable core files, and then execute a failing
binary, this time without the glibc.cpu.hwcaps workaround. You can then
examine the core with `gdb <binary> <corefile>`.
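
The steps above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the core file's name and location depend on the system's core_pattern setting, and raising the soft limit to the hard limit matches `ulimit -c unlimited` only when the hard limit is itself unlimited.

```python
import os
import resource
import subprocess


def crash_env(env=None):
    """Copy the environment, dropping GLIBC_TUNABLES so the crash reproduces."""
    env = dict(os.environ if env is None else env)
    env.pop("GLIBC_TUNABLES", None)
    return env


def allow_core_dumps():
    """Raise the core-size soft limit to the hard limit for child processes."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def reproduce(cmd):
    """Run the failing command without the workaround, allowing a core dump."""
    allow_core_dumps()
    return subprocess.run(cmd, env=crash_env(), check=False)


def gdb_command(binary, corefile):
    """Build a non-interactive gdb call: backtrace, then the faulting instruction."""
    return ["gdb", "--batch", "-ex", "bt", "-ex", "x/i $pc", binary, corefile]
```

A hypothetical session would call `reproduce(["/usr/bin/apt"])` from a shell started with the workaround, then run the list returned by `gdb_command("/usr/bin/apt", "core")` once the core file has been located.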

> > The changes that are in this stable release have been (or at least were
> > supposed to, given the bug you reported) in testing/sid for a few
> > months. Are you able to do a test with debian sid, for instance in
> > docker?
>
> Yes, same story apparently. Btw, I tested in a similar way on the latest
> Fedora, with the exact same outcome, so in the end the issue seems not to be
> specific to Debian, but rather to glibc and this particular set of patches
> that eventually found its way into Debian Stable.
>
> # docker run -it --rm debian:sid-slim bash
> # echo $?
> 132
>
> ^ Interestingly, apt is one program that does work on sid, while not working
> on stable.

Ok, thanks. It's interesting it also fails in sid and on Fedora. The
change has been introduced back in February, so it's strange it has not
been noticed yet.

> Looking at this changelog...:
> https://tracker.debian.org/news/1358014/accepted-glibc-231-13deb11u4-source-into-proposed-updates-stable-new-proposed-updates/
>
> ... is there perhaps some way these changes could be applied one by one to
> pinpoint which one is causing issues that way?

Unfortunately, not that easily, unless you want to compile the upstream
sources. As you pointed out, the problem is very likely related to the AVX2
changes, so having the address of the illegal instruction would help a
lot to understand the issue.

> This machine, in case it matters, is a Lenovo G510 laptop. There is some
> update available for the BIOS, but it requires booting up Windows to perform
> it. Should I attempt that? I found some ancient thread on some forum that
> mentioned BIOS update fixes some issues with "freezes" on

As said above, I find it strange that the problem has not been noticed yet,
given that it affects at least two distributions and has been in sid for a
few months. You might want to install the intel-microcode package
and reboot to see if it helps; it should have the same effect as
updating the BIOS from the point of view of the current bug.

Regards
Aurelien

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aure...@aurel32.net http://www.aurel32.net

Stephen Kitt

Sep 24, 2022, 3:00:03 AM
Hi Aurelien,

On Tue, Sep 20, 2022 at 11:20:26PM +0200, Aurelien Jarno wrote:
> Have you been able to progress on that? Do you need some help for a
> specific step?

For what it’s worth, I’ve upgraded libc6 on my Haswell system (Xeon
E3-1245v3) and everything seems to be working fine.

Regards,

Stephen

debian-b...@p0358.net

Sep 24, 2022, 6:40:03 PM
Hello, sorry for the delayed response. I've managed to collect and analyze a
few coredump files with valid symbols (I installed libc6-dbg and
dpkg-dev, pointed gdb at Debian's debuginfod server, and used
apt-get source to get the sources for libc6).

It seems there are at least 3-4 distinct places where it crashes: two
in memchr-avx2.S, one in strlen-avx2.S, and potentially one in
syscall-template.S, although that last one may just be some kind of kill
signal redirect.

Pasting all below:

Core was generated by `apt'.
Program terminated with signal SIGILL, Illegal instruction.
#0 __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:400
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
400 ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or
directory.
(gdb)

#######

Core was generated by `dpkg'.
Program terminated with signal SIGILL, Illegal instruction.
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:514
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/x86_64/multiarch/strlen-avx2.S.
514 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or
directory.
(gdb)

#######

Core was generated by `/usr/bin/perl /usr/sbin/adduser'.
Program terminated with signal SIGILL, Illegal instruction.
#0 __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:135
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
135 ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or
directory.
(gdb)

#######

Core was generated by `useradd'.
Program terminated with signal SIGILL, Illegal instruction.
#0 __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:135
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
135 ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or
directory.
(gdb)

#######

Core was generated by `passwd'.
Program terminated with signal SIGILL, Illegal instruction.
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:514
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/x86_64/multiarch/strlen-avx2.S.
514 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or
directory.
(gdb)

#######

Core was generated by `bash'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007f2006faf087 in kill () at ../sysdeps/unix/syscall-template.S:120
Download failed: Invalid argument. Continuing without source file
./signal/../sysdeps/unix/syscall-template.S.
120 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb)

#######

Core was generated by `su'.
Program terminated with signal SIGILL, Illegal instruction.
#0 __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:135
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
135 ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or
directory.
(gdb)

#######

It does seem that for this SIGILL there's no additional stack trace.
Also, the path containing ".." seems to cause the source code resolution
to fail, but the debug symbols still show the source file and
line, so it should hopefully help to see what exactly fails.

I have yet to try rebooting with the microcode package installed (I'll
check soon and report whether it helps). Even if it does, someone
without a bootable system won't get a chance to install it in the first
place. I'm also a bit curious how these changes triggered this, given
that it never occurred in all these years.

debian-b...@p0358.net

Sep 24, 2022, 9:40:03 PM
I can confirm that updating the microcode by installing the intel-microcode
package and rebooting does indeed mitigate this issue. An LXC container
that was previously bricked by the update now starts and seems to behave
fully normally.

[ 0.000000] microcode: microcode updated early to revision 0x28, date = 2019-11-12

But the microcode update needs to be loaded on every boot (unless I
update the UEFI firmware, presumably), so while it technically solves my
problem on this installation, the concern remains that people with the same
family of processors and outdated microcode will run into this issue and
have no idea why any Linux no longer boots. (Is there even an easy way to
load updated microcode while installing Debian? I would bet its ISO does
not include it due to non-free constraints.)

Aurelien Jarno

Sep 25, 2022, 4:10:03 AM
On 2022-09-25 00:35, debian-b...@p0358.net wrote:
> Hello, sorry for the delayed response. I've managed to collect and analyze
> a few coredump files with valid symbols (I installed libc6-dbg and dpkg-dev,
> pointed gdb at Debian's debuginfod server, and used apt-get source to get
> the sources for libc6).

Thanks a lot for your work. With more data, it's way easier to
understand the issue.

> It seems there are at least 3-4 distinct places where it crashes: two in
> memchr-avx2.S, one in strlen-avx2.S, and potentially one in
> syscall-template.S, although that last one may just be some kind of kill
> signal redirect.

The failing places in memchr-avx2.S and strlen-avx2.S point to BMI2
(bit manipulation) instructions that have been introduced into the AVX2
code, which should not have happened. The syscall-template.S frame is likely
code that catches the signal to display a message and then re-emits it.

> It does seem that for this SIGILL there's no additional stack trace. Also,
> the path containing ".." seems to cause the source code resolution to fail,
> but the debug symbols still show the source file and line, so it should
> hopefully help to see what exactly fails.
>
> I have yet to try rebooting with the microcode package installed (I'll
> check soon and report whether it helps). Even if it does, someone without a
> bootable system won't get a chance to install it in the first place. I'm
> also a bit curious how these changes triggered this, given that it never
> occurred in all these years.

I agree with you that this should be fixed without a microcode update. I
am going to report the issue upstream, and we'll get the fix into the
Debian package.

Now that we understand the bug, I actually find it strange that the
microcode update fixes this; it looks like BMI2 instruction support
has been added by a microcode update. Would it be
possible to provide the output of /proc/cpuinfo with and without the
microcode update applied?
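
One way to compare two /proc/cpuinfo dumps is to diff their flag sets. The sketch below assumes the usual one-line `flags : ...` format; `cpu_flags` and `compare_flags` are hypothetical helper names, and the sample is an excerpt of the flags posted earlier in this thread (captured with clearcpuid=293, so avx2 is hidden there).

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set of the first processor entry in cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # An exact match on the key skips the separate "vmx flags" line.
        if key.strip() == "flags":
            return set(value.split())
    return set()


def compare_flags(before_text, after_text):
    """List flags that appear or disappear between two cpuinfo dumps."""
    before, after = cpu_flags(before_text), cpu_flags(after_text)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Excerpt of the flags line posted earlier in the thread.
SAMPLE = (
    "vmx flags : vnmi preemption_timer invvpid\n"
    "flags : fpu sse4_1 sse4_2 movbe popcnt xsave avx f16c rdrand "
    "fsgsbase tsc_adjust smep erms invpcid xsaveopt\n"
)

flags = cpu_flags(SAMPLE)
print({name: name in flags for name in ("avx", "avx2", "bmi1", "bmi2")})
```

Running `compare_flags` on dumps taken before and after installing intel-microcode would show directly whether bmi2 (or any other flag) changes with the update.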

debian-b...@p0358.net

Oct 4, 2022, 2:40:04 PM
> Is there an easy way to unbrick a system affected by the issue, such as
> a kernel command-line option or a configuration file in /etc? I don't see
> how I can set a GLIBC_TUNABLES environment variable for the whole system.

During my testing I tried to set such an option globally, but
failed, though maybe some method for this exists. As it stands, I
see only two ways of unbricking a system, both assuming you can
access the partition externally from some bootable system:

1. Downgrade the affected libc6 package to a version before the one
causing issues (either chroot and dpkg, or just extract and physically
replace the files). After booting, run apt-mark hold libc6 to prevent the
faulty update from being installed until the issue is fixed.

2. Or install the intel-microcode package, assuming the microcode update
adds the missing instructions in your particular case, basically
coincidentally fixing this issue (the updated CPU microcode is loaded on
every bootup).

Samuel Thibault

Oct 4, 2022, 2:40:04 PM
Hello,

Is there an easy way to unbrick a system affected by the issue, such as
a kernel command-line option or a configuration file in /etc? I don't see
how I can set a GLIBC_TUNABLES environment variable for the whole system.

Samuel