Hi,
I have some further info. I partially know cause and have a workaround.
There is my investigation. Some minor inaccuracies might be caused by retrospective writing:
1. I have tried to debug using strace. (Prerequisite: sudo qubes-dom0-install strace) After finding pid of qubesd, I ran:
sudo strace -s 256 -p PID_OF_QUBESD -o /tmp/qubesd.log
It looks like few seconds is enough to get a reasonable sample, see below.
2. I ran sort /tmp/qubesd.log | uniq -c | sort -n (one can also add “ -r | head -n 50”).
I have noticed an interesting line that repeats frequently:
sendto(270, "QubesNoSuchPropertyError\0", 25, 0, NULL, 0) = 25
3. Look closer:
$ grep --before=5 --after=5 QubesNoSuchPropertyError /tmp/qubesd.log
The output contains many repeated occurrences of this, just with a different VM name. It seems to iterate over all the VMs (even those that are not running):
--
epoll_wait(3, [], <some_number>, 0) = 0
getpid() = <some_number>
epoll_wait(3, [], <some_number>, 0) = 0
epoll_wait(3, [], <some_number>, 0) = 0
sendto(<some_number>, "2\0", 2, 0, NULL, 0) = 2
sendto(<some_number>, "QubesNoSuchPropertyError\0", 25, 0, NULL, 0) = 25
sendto(<some_number>, "\0", 1, 0, NULL, 0) = 1
sendto(<some_number>, "Invalid property 'internal' of <some-vm-name>\0", 38, 0, NULL, 0) = 38
shutdown(<some_number>, SHUT_WR) = 0
epoll_wait(3, [{EPOLLIN, {u32=<some_number>, u64=<some_number>}}], 18, 0) = 1
close(<some_number>) = 0
4. WTF, what would iterate over all the VMs? Maybe some script repeatedly runs qvm-ls? Let's ps aux | grep qvm-ls that! During increased CPU workload, I have identified:
qvm-ls --no-spinner --raw-data --fields NAME,FLAGS
5. During the current random CPU workload, I cannot reliably verify if it is the cause of the increased CPU usage, but at least I can verify if it is the cause of the error messages. So, I have tried the command while running this:
(sudo strace -s 256 -p PID_OF_QUBESD 2>&1) | grep 'Invalid property'
And yes, it seems to be the cause of the error messages and maybe also the source of increased CPU load.
6. Let's identify the script that runs the command: I ran htop, switched to tree mode (key t), waited for the qvm-ls (using watch + ps aux + grep) and typed “/qvm-ls”.
And the script to blame is – qubes-i3status
7. And yes, killing qubes-i3status has helped to decrease the CPU load. After doing that, I was able to confirm that qvm-ls --no-spinner --raw-data --fields NAME,FLAGS also causes the CPU load.
So, there are multiple causes combined:
* I have many VMs in my computer.
* I use i3 with qubes-i3status
* The script qubes-i3status calls command qvm-ls --no-spinner --raw-data --fields NAME,FLAGS quite frequently.
* The command qvm-ls --no-spinner --raw-data --fields NAME,FLAGS seems to cause high CPU load. Unfortunately, the process that shows the high CPU usage is qubesd, not qvm-ls.
What can be improved:
a. Don't use qubes-i3status. Problem solved.
b. Optimize qvm-ls. Not sure how hard it is.
c. Optimize qubes-i3status. I am not sure about the ideal way of doing that, but clearly running qvm-ls --no-spinner --raw-data --fields NAME,FLAGS just to compute the number of running qubes is far from optimal. One could add --running. And maybe it could have been written without flags. The script just ignores VMs with the first flag being “0” (maybe in order to ignore dom0) and the second flag being “r” (probably not needed with --running).
Regards,
Vít Šesták 'v6ak'