Hi folks!
I maintain Fedora's instance of openQA ( https://open.qa ), an automated
testing framework. Since I upgraded our deployments recently, I've been
running into a problem where, after running for a while, a "worker
host" (a bare metal machine dedicated to running tests in throwaway
virtual machines) will log an error like this:
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: UID 985 exceeded its 'bytes' quota on UID 985.
and after that, "jobs" (individual test runs) start failing.
This is specific to one particular type of openQA job - groups of
networked jobs. These are for testing e.g. client/server scenarios -
you need a server VM and a client VM and they need to be able to
communicate.
We use openvswitch to enable this - there is an openvswitch bridge on
each worker host, and a bunch of tap devices, one per worker "instance"
(the workers are long-lived processes which take a job, spin up a VM,
run the job, then tear the VM down). The tap devices are connected to
the openvswitch bridge. When a group of networked jobs starts, the
workers run VMs, each configured to use one of the tap devices, and
this allows them to communicate.
openQA has a helper service related to this called
os-autoinst-openvswitch.service, which manages part of this process: as
far as I understand it, it assigns the tap devices for each group of
jobs to a unique VLAN, to avoid network collisions between groups of
jobs (if e.g. two instances of the same group which use static IPs
start at the same time, we don't want them to collide with each other).
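To make that concrete, here is a rough Python sketch of what I
understand the per-group VLAN assignment to boil down to. This is not
the helper's actual code: the function name and the idea of passing the
tap names as a list are made up for illustration, and the ovs-vsctl
arguments are simply copied from the command that shows up in our logs.

```python
def vlan_commands(tap_devices, vlan_tag):
    """Build one ovs-vsctl argument list per tap device for a job group.

    Hypothetical helper: it only constructs the commands, it does not
    run them.
    """
    return [
        ["ovs-vsctl", "set", "port", tap, f"tag={vlan_tag}", "vlan_mode=dot1q-tunnel"]
        for tap in tap_devices
    ]


# Two workers in the same group share a tag, so their VMs can reach each
# other while staying isolated from other groups on the same bridge.
for cmd in vlan_commands(["tap3", "tap4"], 27):
    print(" ".join(cmd))
```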
Right before the error message from dbus-broker, an ovs-vsctl command is
logged, which I believe is this helper service's doing. Right after the
error, dbus-broker logs a bunch of peer disconnections, I think one per
tap device (which is the same as one per worker), so we have this:
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org ovs-vsctl[2185491]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap3 tag=27 vlan_mode=dot1q-tunnel
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: UID 985 exceeded its 'bytes' quota on UID 985.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.60 is being disconnected as it does not have the resources to receive a signal it subscribed to.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.61 is being disconnected as it does not have the resources to receive a signal it subscribed to.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.62 is being disconnected as it does not have the resources to receive a signal it subscribed to.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.64 is being disconnected as it does not have the resources to receive a signal it subscribed to.
...etc etc etc. After that, none of these "grouped networking jobs"
will run on the affected host; on startup they immediately fail with an
error like:
Open vSwitch command 'set_vlan' with arguments 'tap59 6' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name is not activatable
So far the only way I've found to recover from this state is to reboot
the worker host, so it's quite disruptive.
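For anyone wanting to spot this state in their own journal, here is a
small Python sketch that pulls the relevant dbus-broker messages out of
log text. The regular expressions are written against the exact message
wording quoted above, so they are an assumption about the format rather
than anything dbus-broker documents, and the function name is my own.

```python
import re

# Patterns based only on the message text seen in the log excerpt.
QUOTA_RE = re.compile(r"UID (\d+) exceeded its '(\w+)' quota on UID (\d+)")
PEER_RE = re.compile(r"Peer (:\d+\.\d+) is being disconnected")


def summarize(lines):
    """Return (quota_events, disconnected_peers) found in journal lines."""
    quota_events = []
    peers = []
    for line in lines:
        match = QUOTA_RE.search(line)
        if match:
            # (offending UID, quota name, UID whose quota was exceeded)
            quota_events.append((int(match.group(1)), match.group(2), int(match.group(3))))
            continue
        match = PEER_RE.search(line)
        if match:
            peers.append(match.group(1))
    return quota_events, peers


sample = [
    "dbus-broker[4890]: UID 985 exceeded its 'bytes' quota on UID 985.",
    "dbus-broker[4890]: Peer :1.60 is being disconnected as it does not have the resources to receive a signal it subscribed to.",
    "dbus-broker[4890]: Peer :1.61 is being disconnected as it does not have the resources to receive a signal it subscribed to.",
]
print(summarize(sample))
```

Counting the disconnected peers this way would at least confirm or
refute my guess that there is one disconnection per worker instance.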
This started happening when I upgraded the worker hosts from Fedora 41
to Fedora 42, *and* updated the openQA code by about twelve months.
Unfortunately it's hard to figure out which of the two changes triggered
the problem :/
I'm not sure how to proceed in debugging/resolving this situation. Is
this 'bytes' quota configurable? I couldn't find out from a poke
through the source code or documentation. Are there any typical steps
to try to figure out what's going on?
Thanks a lot!
--
Adam Williamson (he/him/his)
Fedora QA
Fedora Chat: @adamwill:fedora.im | Mastodon: @ad...@fosstodon.org
https://www.happyassassin.net