Help dealing with "UID 985 exceeded its 'bytes' quota on UID 985" errors

Adam Williamson

Jul 6, 2025, 12:29:49 PM
to bus1-...@googlegroups.com
Hi folks!

I maintain Fedora's instance of openQA (https://open.qa), an automated
testing framework. Since I upgraded our deployments recently, I'm
running into a problem where, after running for a while, a "worker
host" (those are bare metal machines dedicated to running tests in
throwaway virtual machines) will log an error like this:

Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: UID 985 exceeded its 'bytes' quota on UID 985.

and after that, "jobs" (individual test runs) start failing.

This is specific to one particular type of openQA job - groups of
networked jobs. These are for testing e.g. client/server scenarios -
you need a server VM and a client VM and they need to be able to
communicate.

We use openvswitch to enable this - there is an openvswitch bridge on
each worker host, and a bunch of tap devices, one per worker "instance"
(the workers are long-lived processes which take a job, spin up a VM,
run the job, then tear the VM down). The tap devices are connected to
the openvswitch bridge. When a group of networked jobs starts, the
workers run VMs configured to use one of the tap devices each, and
this allows them to communicate.

openQA has a helper service related to this called
os-autoinst-openvswitch.service, which does stuff around this process: as
far as I understand it, it assigns the tap devices for each group of jobs to a
unique VLAN, to avoid network collisions between groups of jobs (if
e.g. two instances of the same group which use static IPs start at the
same time, we don't want them to collide with each other).
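
A minimal sketch of that VLAN-per-group idea, in Python. The function name assign_vlans and the sequential tag numbering are assumptions made for illustration, not how os-autoinst-openvswitch actually picks tags; the command form follows the ovs-vsctl invocation logged on the worker host, without the vlan_mode argument.

```python
def assign_vlans(groups, first_tag=1):
    """Give every job group its own VLAN tag and list the per-tap commands.

    groups is a list of lists of tap device names, one inner list per group.
    Sequential numbering from first_tag is an assumption for illustration.
    """
    commands = []
    for offset, taps in enumerate(groups):
        tag = first_tag + offset
        for tap in taps:
            # Every tap in the same group shares one tag, so groups stay isolated.
            commands.append(["ovs-vsctl", "set", "port", tap, f"tag={tag}"])
    return commands
```

The function only builds the argument lists; nothing is executed, so it is safe to run on a machine without Open vSwitch.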

Right before the error message from dbus-broker, an ovs-vsctl command is
logged which I believe is this helper service's doing. Right after the
error, dbus-broker logs a bunch of peer disconnections, I think one per
tap device (which is the same as one per worker), so we have this:

Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org ovs-vsctl[2185491]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap3 tag=27 vlan_mode=dot1q-tunnel
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: UID 985 exceeded its 'bytes' quota on UID 985.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.60 is being disconnected as it does not have the resources to receive a signal it subscribed to.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.61 is being disconnected as it does not have the resources to receive a signal it subscribed to.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.62 is being disconnected as it does not have the resources to receive a signal it subscribed to.
Jul 06 07:45:26 openqa-x86-worker01.rdu3.fedoraproject.org dbus-broker[4890]: Peer :1.64 is being disconnected as it does not have the resources to receive a signal it subscribed to.

...etc etc etc. After that, none of these "grouped networking jobs"
will run on the affected host; on startup they immediately fail with an
error like:

Open vSwitch command 'set_vlan' with arguments 'tap59 6' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name is not activatable

So far the only way I've found to recover from this state is to reboot
the worker host, so it's quite disruptive.
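
To check the "one disconnection per worker" guess against a saved journal excerpt, something like the following Python sketch could tally the dbus-broker messages. The function name summarize_broker_log is made up here, and the regular expressions are modelled only on the log lines quoted above, not on any documented dbus-broker log format.

```python
import re

# Patterns modelled on the quoted journal lines (an assumption, not a spec).
QUOTA_RE = re.compile(r"UID (\d+) exceeded its '(\w+)' quota on UID (\d+)")
DISCONNECT_RE = re.compile(r"Peer (:\d+\.\d+) is being disconnected")


def summarize_broker_log(lines):
    """Return (quota_events, disconnected_peers) found in journal lines.

    quota_events is a list of (actor_uid, quota_name, owner_uid) tuples;
    disconnected_peers is a list of unique bus names such as ":1.60".
    """
    quota_events = []
    disconnected = []
    for line in lines:
        if "dbus-broker[" not in line:
            continue  # ignore messages from other programs, e.g. ovs-vsctl
        match = QUOTA_RE.search(line)
        if match:
            quota_events.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
            continue
        match = DISCONNECT_RE.search(line)
        if match:
            disconnected.append(match.group(1))
    return quota_events, disconnected
```

Comparing the length of the second list with the number of worker instances on the host would confirm or refute the guess.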

This started happening when I upgraded the worker hosts from Fedora 41
to Fedora 42, *and* updated the openQA code by about twelve months.
Unfortunately it's hard to figure out which of the two changes triggered
the problem :/

I'm not sure how to proceed in debugging/resolving this situation. Is
this 'bytes' quota configurable? I couldn't find out from a poke
through the source code or documentation. Are there any typical steps
to try to figure out what's going on?

Thanks a lot!
--
Adam Williamson (he/him/his)
Fedora QA
Fedora Chat: @adamwill:fedora.im | Mastodon: @ad...@fosstodon.org
https://www.happyassassin.net



David Rheinsberg

Jul 7, 2025, 5:03:07 AM
to Adam Williamson, bus1-...@googlegroups.com
Hi Adam!

On Sun, 6 Jul 2025 at 18:29, 'Adam Williamson' via bus1-devel
<bus1-...@googlegroups.com> wrote:
> I'm not sure how to proceed in debugging/resolving this situation. Is
> this 'bytes' quota configurable? I couldn't find out from a poke
> through the source code or documentation. Are there any typical steps
> to try to figure out what's going on?

I debugged a similar issue with you a while back (also in openQA). I
used this as an example of how to debug such resource leaks in a blog
post [1]. I hope this can help to understand the underlying problem.
If you have access to long-running machines (regardless of whether they
showed this error or not), you can query the accounting system of
dbus-broker, as shown in the blog post. This should usually help
finding the application that accumulates resources.

There is always the possibility of a bug in dbus-broker. However, so
far all resource exhaustions similar to yours uncovered bugs in
application behavior, where dbus resources were accumulated in the
background and never released. Hence, I recommend looking into updates
to the dbus code in openQA or related utilities. I am also unaware of
any significant changes to dbus-broker in the recent updates.

We have an upcoming change to the accounting system that will track
resource use on a much more fine-grained level, and hopefully better
show which process was involved in the resource exhaustion. However,
this is not in any public release, yet.

Thanks
David

[1] https://dvdhrm.github.io/2021/04/14/locating-dbus-resource-leaks/

Adam Williamson

Jul 7, 2025, 12:39:04 PM
to David Rheinsberg, bus1-...@googlegroups.com
Hey David! I had completely forgotten about that, hah.

So, using the dbus-send command you recommend there and some grepping,
I can see that there are 60 clients with more-than-zero outgoing bytes:

dbus-send --system --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.Debug.Stats.GetStats | grep -2 OutgoingB | grep uint | grep -v "uint32 0" | wc -l

that's exactly the number of workers on this particular host. The
actual numbers are in a gradually-decreasing range between 3875769 and
3782619 currently, but every time I run the command they have ticked up
slightly - it looks kinda like a 'slow leak'. So I guess they just keep
ticking up until the first one hits the quota and then things blow up.
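
The "slow leak" check can be made less manual with a short Python sketch that turns two saved copies of the GetStats output into per-peer deltas. Both function names are hypothetical, and the parser assumes a text layout in which each peer's block starts with a line like string ":1.60" and later contains a string "OutgoingBytes" line followed directly by a line holding a uint32 value; that layout is inferred from the grep pipeline above and may not match every dbus-send or dbus-broker version.

```python
import re

PEER_RE = re.compile(r'string "(:\d+\.\d+)"')
UINT_RE = re.compile(r"uint32 (\d+)")


def outgoing_bytes_by_peer(stats_text):
    """Map each peer ID to its OutgoingBytes counter (assumed text layout)."""
    result = {}
    peer = None
    expect_value = False
    for line in stats_text.splitlines():
        peer_match = PEER_RE.search(line)
        if peer_match:
            peer = peer_match.group(1)  # remember which peer block we are in
            expect_value = False
            continue
        if 'string "OutgoingBytes"' in line:
            expect_value = True  # the counter is assumed to be on the next line
            continue
        if expect_value:
            value_match = UINT_RE.search(line)
            if value_match and peer is not None:
                result[peer] = int(value_match.group(1))
            expect_value = False
    return result


def growing_peers(before, after):
    """Return {peer: increase} for peers whose counter rose between snapshots."""
    return {
        peer: after[peer] - before[peer]
        for peer in after
        if peer in before and after[peer] > before[peer]
    }
```

Taking two snapshots a few minutes apart and passing the parsed results to growing_peers would show which peers keep ticking up.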

This is definitely only affecting the "worker hosts" that run advanced
networking jobs, I see nothing like this on the hosts that don't. So it
must be in the advanced-networking bits somewhere.

Looking at the full dict of the first one in the list, pid is 7488,
which is one of the workers:

_openqa+ 7488 0.0 0.0 106540 79060 ? Ss Jul06 0:18 /usr/bin/perl /usr/share/openqa/script/worker --instance 6

and indeed it has one message filter - Matches is 1. That script is
https://github.com/os-autoinst/openQA/blob/master/script/worker .

At a guess, could this be caused by
https://github.com/os-autoinst/openQA/commit/eeb2e670d650a2bf2a4f8960d3dbf87130402df8
? That is in the worker script and it's in the range of commits that
appeared when I updated our openQA package - I went from a July 2024
snapshot to an April 2025 one.

Thanks a lot for the help, both this time and the previous time!

Adam Williamson

Jul 7, 2025, 12:45:48 PM
to David Rheinsberg, bus1-...@googlegroups.com
On Mon, 2025-07-07 at 09:38 -0700, Adam Williamson wrote:
>
> At a guess, could this be caused by
> https://github.com/os-autoinst/openQA/commit/eeb2e670d650a2bf2a4f8960d3dbf87130402df8
> ? That is in the worker script and it's in the range of commits that
> appeared when I updated our openQA package - I went from a July 2024
> snapshot to an April 2025 one.

Hmm, yeah, that probably *is* the problem, isn't it? Once again we're
keeping a dbus connection around permanently:

my $bus = ($self->{_system_dbus} //= Net::DBus->system(nomainloop => 1));

I'll try adjusting that to just re-init the connection every time, as
we did in the other case.
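
A toy Python model of why the connection lifetime might matter here. This is not the dbus-broker or Net::DBus API: ToyBroker, run_cached and run_per_call are invented for illustration, and the model assumes that signals are queued for a connected peer that never reads them and that the queue is released when the peer disconnects.

```python
class ToyBroker:
    """Toy stand-in for a message broker; not the dbus-broker API."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.queued = {}  # peer name -> bytes queued but never read

    def connect(self, peer):
        self.queued[peer] = 0

    def disconnect(self, peer):
        # Closing the connection releases everything queued for the peer.
        self.queued.pop(peer, None)

    def broadcast(self, size):
        """Queue a signal for every connected peer; return peers over quota."""
        over = []
        for peer in list(self.queued):
            self.queued[peer] += size
            if self.queued[peer] > self.quota_bytes:
                over.append(peer)
                self.disconnect(peer)
        return over


def run_cached(broker, signals, size):
    """Connect once and keep the connection open without ever reading."""
    broker.connect("worker")
    dropped = []
    for _ in range(signals):
        dropped += broker.broadcast(size)
    return dropped


def run_per_call(broker, signals, size):
    """Connect only around each unit of work, then disconnect again."""
    dropped = []
    for _ in range(signals):
        broker.connect("worker")
        # ... a method call would happen here ...
        broker.disconnect("worker")
        dropped += broker.broadcast(size)
    return dropped
```

In this model the cached variant eventually exceeds the quota and is dropped, while the per-call variant never accumulates anything, which is the behaviour the re-init change is hoping for.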

David Rheinsberg

Aug 25, 2025, 2:23:12 AM
to Adam Williamson, bus1-...@googlegroups.com
Hi Adam!

On Mon, 7 Jul 2025 at 18:45, Adam Williamson <awil...@redhat.com> wrote:
>
> On Mon, 2025-07-07 at 09:38 -0700, Adam Williamson wrote:
> >
> > At a guess, could this be caused by
> > https://github.com/os-autoinst/openQA/commit/eeb2e670d650a2bf2a4f8960d3dbf87130402df8
> > ? That is in the worker script and it's in the range of commits that
> > appeared when I updated our openQA package - I went from a July 2024
> > snapshot to an April 2025 one.
>
> Hmm, yeah, that probably *is* the problem, isn't it? Once again we're
> keeping a dbus connection around permanently:
>
> my $bus = ($self->{_system_dbus} //= Net::DBus->system(nomainloop => 1));
>
> I'll try adjusting that to just re-init the connection every time, as
> we did in the other case.

Sorry for not replying. Great to hear you figured it out! And always
glad to offer help!

David