xenstored in dom0 has one CPU core pegged near 80% usage

Kelly Dean

unread,

Aug 22, 2018, 8:25:28 PM8/22/18

to qubes...@googlegroups.com

System is otherwise idle. Almost no disk I/O. Pausing my app qubes makes no difference. xentop shows only dom0 is really busy (CPU usage fluctuates from 105% to 120%). top in dom0 shows only xenstored is busy (80%). Reserved memory for xenstored is stable, and only 3496 kB. There are two CPU cores, one of which is idle, and the system is usable. The system has been this way for more than 4 hours. Only thing I did before that was update and reboot, then start my usual working set of qubes. There were some fedora-28 updates, but no dom0 updates.

Running 4.0 final release.

What's the easiest way to find out what xenstored is doing?

Jean-Philippe Ouellet

unread,

Aug 22, 2018, 9:50:28 PM8/22/18

to Kelly Dean, qubes-devel

A few methods, in rough order of ease:

1. See if anything interesting in `systemctl status xenstored`

2. If you are likely to be able to reproduce the behavior, enable
tracing by defining XENSTORED_TRACE to non-empty string at top of
/etc/xen/scripts/launch-xenstore and check
/var/log/xen/xenstored-trace.log when it happens again.

3. Check what your xenstored is doing:

$ sudo qubes-dom0-update xen-debuginfo
$ sudo gdb xenstored $(pgrep xenstored)
(gdb) bt
...

Warning: attaching gdb to a suspect process exposes a large attack
surface. You may not want to do this on a live machine if it is
sensitive. Dump a core file and do offline analysis in an unpriveleged
environment instead.

Kelly Dean

unread,

Aug 23, 2018, 6:26:05 AM8/23/18

to Jean-Philippe Ouellet, qubes-devel

Jean-Philippe Ouellet writes:

> 1. See if anything interesting in `systemctl status xenstored`

One interesting thing in the log:
TDB: tdb_open_ex: could not open file /var/lib/xenstored/tdb: No such file or directory

But that file does exist, and it's being written every 2 to 3 seconds. Size is 614400 bytes, which never changes.

> 2. If you are likely to be able to reproduce the behavior, enable
> tracing by defining XENSTORED_TRACE to non-empty string at top of
> /etc/xen/scripts/launch-xenstore and check
> /var/log/xen/xenstored-trace.log when it happens again.

Enabled. Will check log after next reboot.

> 3. Check what your xenstored is doing:
>
> $ sudo qubes-dom0-update xen-debuginfo
> $ sudo gdb xenstored $(pgrep xenstored)
> (gdb) bt

I pasted several randomly sampled backtraces below. (Yes, well...)

> Warning: attaching gdb to a suspect process exposes a large attack
> surface.

I don't understand your warning. Since xenstored is running as root in dom0, the game's already over if xenstored is compromised. Attaching gdb can't give xenstored any access it doesn't already have.

Random backtraces:
(gdb) continue
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007d4d36428af0 in __read_nocancel () from /lib64/libc.so.6
(gdb) bt
#0 0x00007d4d36428af0 in __read_nocancel () from /lib64/libc.so.6
#1 0x000000000040a3ce in tdb_read ()
#2 0x000000000040aab9 in tdb_find ()
#3 0x000000000040ac58 in tdb_find_lock_hash ()
#4 0x000000000040bd38 in tdb_fetch ()
#5 0x0000000000403648 in read_node ()
#6 0x0000000000404c31 in get_node ()
#7 0x00000000004052de in handle_input ()
#8 0x0000000000402970 in main ()
(gdb) continue
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007d4d3642cf20 in __poll_nocancel () from /lib64/libc.so.6
(gdb) bt
#0 0x00007d4d3642cf20 in __poll_nocancel () from /lib64/libc.so.6
#1 0x0000000000402746 in main ()
(gdb) continue
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007d4d36438ea7 in lseek64 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007d4d36438ea7 in lseek64 () from /lib64/libc.so.6
#1 0x000000000040a3ae in tdb_read ()
#2 0x000000000040aab9 in tdb_find ()
#3 0x000000000040ac58 in tdb_find_lock_hash ()
#4 0x000000000040bd38 in tdb_fetch ()
#5 0x0000000000403648 in read_node ()
#6 0x0000000000404c31 in get_node ()
#7 0x00000000004052de in handle_input ()
#8 0x0000000000402970 in main ()
(gdb) continue
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007d4d363b6e90 in malloc () from /lib64/libc.so.6
(gdb) bt
#0 0x00007d4d363b6e90 in malloc () from /lib64/libc.so.6
#1 0x000000000040824b in _talloc ()
#2 0x000000000040836e in talloc_named_const ()
#3 0x0000000000404e72 in handle_input ()
#4 0x0000000000402970 in main ()
(gdb) continue
Continuing.
(gdb) bt
^C
Program received signal SIGINT, Interrupt.
0x00007d4d3642e6a7 in ioctl () from /lib64/libc.so.6
(gdb) bt
#0 0x00007d4d3642e6a7 in ioctl () from /lib64/libc.so.6
#1 0x00007d4d36d2dc2c in xenevtchn_notify () from /lib64/libxenevtchn.so.1
#2 0x0000000000406115 in writechn ()
#3 0x00000000004043e9 in write_messages ()
#4 0x0000000000402946 in main ()
(gdb)

Ergo, this conclusion is the place where I got tired of thinking.

Marek Marczykowski-Górecki

unread,

Aug 30, 2018, 10:26:43 AM8/30/18

to Kelly Dean, qubes...@googlegroups.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Strace can be helpful, see https://github.com/QubesOS/qubes-issues/issues/4191
Do you know what exact action triggers this problem? Given the trace
in the above mentioned issue, I'd guess some network setting operation.

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEhrpukzGPukRmQqkK24/THMrX1ywFAluH/p0ACgkQ24/THMrX
1yycagf/Y4ZOmvSps6tP/q+XfcBcSFRG7Bdw7x5drgMTEtkExUPEsGsZcyImxGyM
i0H9L0kZdejJF8WiFwCDZO6dkwl/L4mI5aEn1C0VTvT779tc0pViIuYekdbVmNSW
IcpqRB6ia0grx1+VK8sPz1y6fJuUv0IXaq7Hc6i6oKphyqczHhK71/G7L8PDv7sW
6e/rn1NmHIcUVEPR8NSqZnHHxYMweKOSa68IgsKK4EEhT6qWoMsyFQ6pOLMv7n4i
/YOSWVu4jGsrd3MlQFBtxv3KBlz8PTuvwzsAJ3hlKo2K9uyCglv2ahY1Wg1hKxvO
Dbw5YFSRfHTdUaYdOWmJg03gOCIOLg==
=apcC
-----END PGP SIGNATURE-----

Kelly Dean

unread,

Aug 31, 2018, 12:25:03 AM8/31/18

to Marek Marczykowski-Górecki, qubes...@googlegroups.com

Marek Marczykowski-Górecki writes:

> Strace can be helpful, see https://github.com/QubesOS/qubes-issues/issues/4191
> Do you know what exact action triggers this problem? Given the trace
> in the above mentioned issue, I'd guess some network setting operation.

I updated, rebooted, started a few of my usual app qubes, then noticed the CPU wasn't going idle as expected. Haven't rebooted again since then, so I haven't had a chance yet to check exactly when the problem starts. But in the meantime, here's a trace:

$ sudo strace -fp $(pgrep xenstored) 2>&1 | grep -v "lseek\|read\|fcntl\|write\|ioctl\|poll"
strace: Process 2248 attached
accept(3, NULL, NULL) = 8
open("/var/lib/xenstored/tdb.0x2072860", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 19
open("/var/lib/xenstored/tdb.0x2072860", O_RDWR) = 20
fstat(20, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(19) = 0
rename("/var/lib/xenstored/tdb.0x2072860", "/var/lib/xenstored/tdb") = 0
close(21) = 0
unlink("/var/lib/xenstored/tdb.0x2072860") = -1 ENOENT (No such file or directory)
close(8) = 0
accept(3, NULL, NULL) = 8
open("/var/lib/xenstored/tdb.0x20721e0", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 19
open("/var/lib/xenstored/tdb.0x20721e0", O_RDWR) = 21
fstat(21, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(19) = 0
rename("/var/lib/xenstored/tdb.0x20721e0", "/var/lib/xenstored/tdb") = 0
close(20) = 0
unlink("/var/lib/xenstored/tdb.0x20721e0") = -1 ENOENT (No such file or directory)
close(8) = 0
accept(3, NULL, NULL) = 8
open("/var/lib/xenstored/tdb.0x20721e0", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 19
open("/var/lib/xenstored/tdb.0x20721e0", O_RDWR) = 20
fstat(20, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(19) = 0
accept(3, NULL, NULL) = 19
open("/var/lib/xenstored/tdb.0x2074a50", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 26
open("/var/lib/xenstored/tdb.0x2074a50", O_RDWR) = 27
fstat(27, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(26) = 0
rename("/var/lib/xenstored/tdb.0x20721e0", "/var/lib/xenstored/tdb") = 0
close(21) = 0
unlink("/var/lib/xenstored/tdb.0x20721e0") = -1 ENOENT (No such file or directory)
close(8) = 0
close(27) = 0
unlink("/var/lib/xenstored/tdb.0x2074a50") = 0
open("/var/lib/xenstored/tdb.0x2072860", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 8
open("/var/lib/xenstored/tdb.0x2072860", O_RDWR) = 21
fstat(21, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(8) = 0
accept(3, NULL, NULL) = 8
open("/var/lib/xenstored/tdb.0x20742f0", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 26
open("/var/lib/xenstored/tdb.0x20742f0", O_RDWR) = 27
fstat(27, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(26) = 0
rename("/var/lib/xenstored/tdb.0x20742f0", "/var/lib/xenstored/tdb") = 0
close(20) = 0
unlink("/var/lib/xenstored/tdb.0x20742f0") = -1 ENOENT (No such file or directory)
close(21) = 0
unlink("/var/lib/xenstored/tdb.0x2072860") = 0
close(8) = 0
open("/var/lib/xenstored/tdb.0x2072860", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 8
open("/var/lib/xenstored/tdb.0x2072860", O_RDWR) = 20
fstat(20, {st_mode=S_IFREG|0640, st_size=655360, ...}) = 0
close(8) = 0
^C

That trace was over the course of about 5 seconds. Without filtering out read/write/etc, I get the typical trace that seberm mentioned in issue #4191.

To mitigate the bug, I attached and paused xenstored using gdb, and the CPU went idle. I leave it paused, but temporarily resume it when I need to start or stop a qube, use qvm-run, etc. That's how I've been getting by for the past week.

Reply all

Reply to author

Forward