Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster

Pavel Matěja

Aug 14, 2019, 9:00:02 AM

Package: glibc
Version: 2.28-10:amd64

Dear Maintainer,

We are running manually compiled Apache and OpenSSL on Debian servers in
Debian-based chroots.
After the chroot upgrade from Stretch to Buster we started to see strange
SEGFAULTs.
By strange I mean that they appear on only 2 servers out of 6.
Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
E3-1220 v6 produced crashes.
It did not matter whether the host Debian was Stretch or Buster.

I was able to collect coredumps and get backtraces. They look like:
(gdb) bt
#0  tcache_get (tc_idx=0) at malloc.c:2934
#1  __GI___libc_malloc (bytes=3) at malloc.c:3042
#2  0x00007fd8cc0961be in CRYPTO_malloc (num=3, file=0x7fd8cc2a548c
"ssl/statem/extensions_clnt.c", line=1376) at crypto/mem.c:222
#3  0x00007fd8cc26c7b9 in tls_parse_stoc_ec_pt_formats
(s=0x7fd8640592d0, pkt=0x7fd864061810, context=256, x=0x0, chainidx=0)
    at ssl/statem/extensions_clnt.c:1376
#4  0x00007fd8cc266af5 in tls_parse_extension (s=0x7fd8640592d0,
idx=TLSEXT_IDX_ec_point_formats, context=256, exts=0x7fd864061770,
x=0x0, chainidx=0)
    at ssl/statem/extensions.c:715
#5  0x00007fd8cc266bbb in tls_parse_all_extensions (s=0x7fd8640592d0,
context=256, exts=0x7fd864061770, x=0x0, chainidx=0, fin=1)
    at ssl/statem/extensions.c:748
#6  0x00007fd8cc2798b6 in tls_process_server_hello (s=0x7fd8640592d0,
pkt=0x7fd83cff8440) at ssl/statem/statem_clnt.c:1698
#7  0x00007fd8cc277fc7 in ossl_statem_client_process_message
(s=0x7fd8640592d0, pkt=0x7fd83cff8440) at ssl/statem/statem_clnt.c:1039
#8  0x00007fd8cc275499 in read_state_machine (s=0x7fd8640592d0) at
ssl/statem/statem.c:636
#9  0x00007fd8cc274f15 in state_machine (s=0x7fd8640592d0, server=0) at
ssl/statem/statem.c:434
#10 0x00007fd8cc274a1b in ossl_statem_connect (s=0x7fd8640592d0) at
ssl/statem/statem.c:250
#11 0x00007fd8cc25b098 in SSL_do_handshake (s=0x7fd8640592d0) at
ssl/ssl_lib.c:3599
#12 0x00007fd8cc257199 in SSL_connect (s=0x7fd8640592d0) at
ssl/ssl_lib.c:1653
#13 0x00007fd8c957c934 in ssl_io_filter_handshake
(filter_ctx=0x7fd85809a090) at ssl_engine_io.c:1243
#14 0x00007fd8c957deca in ssl_io_filter_output (f=0x7fd85809a0e8,
bb=0x7fd85406b8b0) at ssl_engine_io.c:1760
..

(gdb) bt
#0  tcache_get (tc_idx=0) at malloc.c:2934
#1  __GI___libc_malloc (bytes=16) at malloc.c:3042
#2  0x00007fd8cc0961be in CRYPTO_malloc (num=16, file=0x7fd8cc159913
"crypto/bio/bss_mem.c", line=115) at crypto/mem.c:222
#3  0x00007fd8cc0961f1 in CRYPTO_zalloc (num=16, file=0x7fd8cc159913
"crypto/bio/bss_mem.c", line=115) at crypto/mem.c:230
#4  0x00007fd8cbf9ca0a in mem_init (bi=0x7fd860044130, flags=0) at
crypto/bio/bss_mem.c:115
#5  0x00007fd8cbf9cb3d in mem_new (bi=0x7fd860044130) at
crypto/bio/bss_mem.c:138
#6  0x00007fd8cbf9541a in BIO_new (method=0x7fd8cc204980 <mem_method>)
at crypto/bio/bio_lib.c:94
#7  0x00007fd8cc2454a3 in ssl3_init_finished_mac (s=0x7fd8600a7be0) at
ssl/s3_enc.c:322
#8  0x00007fd8cc281eae in tls_setup_handshake (s=0x7fd8600a7be0) at
ssl/statem/statem_lib.c:91
#9  0x00007fd8cc274ea2 in state_machine (s=0x7fd8600a7be0, server=0) at
ssl/statem/statem.c:419
#10 0x00007fd8cc274a1b in ossl_statem_connect (s=0x7fd8600a7be0) at
ssl/statem/statem.c:250
#11 0x00007fd8cc25b098 in SSL_do_handshake (s=0x7fd8600a7be0) at
ssl/ssl_lib.c:3599
#12 0x00007fd8cc257199 in SSL_connect (s=0x7fd8600a7be0) at
ssl/ssl_lib.c:1653
#13 0x00007fd8c957c934 in ssl_io_filter_handshake
(filter_ctx=0x7fd8580e8b78) at ssl_engine_io.c:1243
#14 0x00007fd8c957deca in ssl_io_filter_output (f=0x7fd8580e8bd0,
bb=0x55b212b0d518) at ssl_engine_io.c:1760
..

The SSLv3 and TLS code paths looked too distinct to cause the same problem.
Since the SEGFAULTs appeared to be related to memory allocation in the new
libc and to CPU performance, I found
http://51.15.138.76/patch/17499/
where Wilco Dijkstra discusses a problem with tcache which "leads to
various crashes in benchtests".

As a workaround I put
export GLIBC_TUNABLES=glibc.malloc.tcache_count=0
in the Apache startup script and have seen no SEGFAULT since.
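
A minimal sketch of how the workaround above might be applied in a startup script without discarding tunables that are already set. GLIBC_TUNABLES is a colon-separated list of name=value pairs. The helper name `with_tunable` and the httpd path in the usage comment are assumptions made for illustration; they do not come from this report.

```shell
#!/bin/sh
# with_tunable NEW CURRENT
# Prints CURRENT with NEW appended, using ':' as the separator that
# GLIBC_TUNABLES expects. Hypothetical helper, not part of glibc or Apache.
# It does not deduplicate: if CURRENT already sets the same tunable,
# both entries are kept.
with_tunable() {
    new="$1"        # e.g. glibc.malloc.tcache_count=0
    current="$2"    # existing GLIBC_TUNABLES value, possibly empty
    if [ -z "$current" ]; then
        printf '%s\n' "$new"
    else
        printf '%s:%s\n' "$current" "$new"
    fi
}

# Usage in a startup script (the httpd path is a placeholder):
#   GLIBC_TUNABLES=$(with_tunable glibc.malloc.tcache_count=0 "${GLIBC_TUNABLES:-}")
#   export GLIBC_TUNABLES
#   exec /path/to/httpd -k start
```

The variable has to be exported before the server binary is executed, because it is read from the environment when the process starts.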

I have coredumps, but they contain production private keys for Apache
which I can't share, and to make things even worse they are 1.6 GB each.

I understand this is a heisenbug which you won't be able to reproduce. The
CPU model dependency is beyond my comprehension.
I'm curious whether you are familiar with the new tcache and whether you
think the patch under discussion can help.
I'll try to build a libc6 package with it to confirm the final solution, but
I'm confused by the patch tree so far.

-- System Information:
Debian Release: Buster
Architecture: amd64 (x86_64)
Kernel: Linux 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2
(2019-08-08) x86_64 GNU/Linux

Fix-tcache-count-maximum.diff

Aurelien Jarno

Aug 17, 2019, 10:00:02 AM

Hi,

On 2019-08-14 14:50, Pavel Matěja wrote:
> Package: glibc
> Version: 2.28-10:amd64
>
> Dear Maintainer,
>
> We are running manually compiled Apache and OpenSSL on Debian servers in
> Debian-based chroots.
> After the chroot upgrade from Stretch to Buster we started to see strange
> SEGFAULTs.
> By strange I mean that they appear on only 2 servers out of 6.
> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
> E3-1220 v6 produced crashes.
> It did not matter whether the host Debian was Stretch or Buster.

[snip]

> The SSLv3 and TLS code paths looked too distinct to cause the same problem.
> Since the SEGFAULTs appeared to be related to memory allocation in the new
> libc and to CPU performance, I found
> http://51.15.138.76/patch/17499/
> where Wilco Dijkstra discusses a problem with tcache which "leads to
> various crashes in benchtests".

This patch looks like an early version of the one that was merged into
glibc 2.29 to fix tunables tcache issues:

https://sourceware.org/bugzilla/show_bug.cgi?id=24531

The patch has been backported to the upstream glibc 2.28 branch:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=58d2672f64176fcb323859d3bd5240fb1cf8f25c

Once the fix has reached unstable and then testing, I'll schedule
an upload to buster with the changes from the upstream glibc 2.28 branch.

> As a workaround I put
> export GLIBC_TUNABLES=glibc.malloc.tcache_count=0
> in the Apache startup script and have seen no SEGFAULT since.
>
> I have coredumps, but they contain production private keys for Apache which I
> can't share, and to make things even worse they are 1.6 GB each.
>
> I understand this is a heisenbug which you won't be able to reproduce. The CPU
> model dependency is beyond my comprehension.
> I'm curious whether you are familiar with the new tcache and whether you think
> the patch under discussion can help.
> I'll try to build a libc6 package with it to confirm the final solution, but I'm
> confused by the patch tree so far.

You can easily build a fixed glibc package this way (provided you have
the glibc build-dependencies, devscripts and git installed):
apt-get source glibc
cd glibc-2.28/
quilt pop -a
debian/rules update-from-upstream
dch -i    # set the version you want and add a new changelog entry
debuild

Regards,
Aurelien

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aure...@aurel32.net http://www.aurel32.net

Florian Weimer

Aug 17, 2019, 4:30:01 PM

* Pavel Matěja:

> By strange I mean that they appear on only 2 servers out of 6.
> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
> E3-1220 v6 produced crashes.
> It did not matter whether the host Debian was Stretch or Buster.

Do you see crashes on stretch as well? What does the backtrace look
like there?

> The SSLv3 and TLS code paths looked too distinct to cause the same problem.
> Since the SEGFAULTs appeared to be related to memory allocation in the new
> libc and to CPU performance, I found
> http://51.15.138.76/patch/17499/
> where Wilco Dijkstra discusses a problem with tcache which "leads to
> various crashes in benchtests".

I was under the impression that this problem only occurs if one of the
tunables has an out-of-bounds value. Do you set any tunables?
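
One way to check the question above for an already running process is to inspect its environment, since glibc reads GLIBC_TUNABLES when the program starts. A minimal sketch, assuming a Linux /proc filesystem; `tunables_in_environ` is a hypothetical helper name and the PID file path in the usage comment is a placeholder.

```shell
#!/bin/sh
# tunables_in_environ FILE
# Prints the GLIBC_TUNABLES entry from a NUL-separated environment dump
# such as /proc/<pid>/environ, or nothing if the variable is not set.
# Hypothetical helper for illustration only.
tunables_in_environ() {
    # Entries are separated by NUL bytes; turn them into lines for grep.
    # '|| true' keeps the exit status at 0 when nothing matches.
    tr '\000' '\n' < "$1" | grep '^GLIBC_TUNABLES=' || true
}

# Usage (the PID file path is a placeholder):
#   tunables_in_environ "/proc/$(cat /path/to/httpd.pid)/environ"
```

Empty output suggests that no tunables were set when the process was started. Reading another user's /proc/<pid>/environ usually requires root.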

Pavel Matěja

Aug 27, 2019, 7:10:02 AM

Sorry for the late answer.

On 17. 08. 19 22:18, Florian Weimer wrote:
> * Pavel Matěja:
>
>> By strange I mean that they appear on only 2 servers out of 6.
>> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
>> E3-1220 v6 produced crashes.
>> It did not matter whether the host Debian was Stretch or Buster.
> Do you see crashes on stretch as well? What does the backtrace look
> like there?
I never saw the SEGFAULT when we had a Stretch-based chroot.

I had just one SEGFAULT on a Stretch host, but I didn't collect coredumps
back then.
Unfortunately the server is already running Buster.
Since the bug is caused by the new libc in the chroot, I should be able to
install just the kernel from Stretch and wait for the SEGFAULT, right?
I think the backtrace will be the same anyway.

>> The SSLv3 and TLS code paths looked too distinct to cause the same problem.
>> Since the SEGFAULTs appeared to be related to memory allocation in the new
>> libc and to CPU performance, I found
>> http://51.15.138.76/patch/17499/
>> where Wilco Dijkstra discusses a problem with tcache which "leads to
>> various crashes in benchtests".
> I was under the impression that this problem only occurs if one of the
> tunables has an out-of-bounds value. Do you set any tunables?
No, I didn't even know they existed.
I have not read the libc sources yet, so I don't know what the patch
actually fixes, nor whether it helps with my problem.

Pavel Matěja

Florian Weimer

Aug 27, 2019, 8:20:02 AM

* Pavel Matěja:

> Sorry for the late answer.
>
> On 17. 08. 19 22:18, Florian Weimer wrote:
>> * Pavel Matěja:
>>
>>> By strange I mean that they appear on only 2 servers out of 6.
>>> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
>>> E3-1220 v6 produced crashes.
>>> It did not matter whether the host Debian was Stretch or Buster.
>> Do you see crashes on stretch as well? What does the backtrace look
>> like there?

> I never saw the SEGFAULT when we had a Stretch-based chroot.
>
> I had just one SEGFAULT on a Stretch host, but I didn't collect coredumps
> back then.
> Unfortunately the server is already running Buster.
> Since the bug is caused by the new libc in the chroot, I should be able to
> install just the kernel from Stretch and wait for the SEGFAULT, right?
> I think the backtrace will be the same anyway.

If I recall correctly, stretch doesn't have the tcache code. If the
crash happened there as well, it's something else.

>>> The SSLv3 and TLS code paths looked too distinct to cause the same problem.
>>> Since the SEGFAULTs appeared to be related to memory allocation in the new
>>> libc and to CPU performance, I found
>>> http://51.15.138.76/patch/17499/
>>> where Wilco Dijkstra discusses a problem with tcache which "leads to
>>> various crashes in benchtests".
>> I was under the impression that this problem only occurs if one of the
>> tunables has an out-of-bounds value. Do you set any tunables?

> No, I didn't even know they existed.
> I have not read the libc sources yet, so I don't know what the
> patch actually fixes, nor whether it helps with my problem.

Then the patch will not help to fix the crash.

(By the way, even if the crash goes away if you use a tunable to disable
the thread cache, it could still be timing-related. It's definitely
possible that the faster malloc/free implementation exposes pre-existing
data races.)

Thanks,
Florian