
Bug#1028212: prometheus-node-exporter-collectors: APT update deadlock - prevents unattended security upgrades


Eric Estievenart

Jan 8, 2023, 10:50:05 AM
Package: prometheus-node-exporter-collectors
Version: 0.0~git20221011.8f6be63-1
Severity: important
Tags: security
X-Debbugs-Cc: Salvatore Bonaccorso <car...@debian.org>, stev...@tecwec.eu, Debian Security Team <te...@security.debian.org>

As requested by Salvatore, lowering the priority and avoiding an embargo.
-----

Hello, happy new year, and thanks.

This looks like an apt deadlock, which prevents updates, unattended upgrades, and thus critical security updates
on systems where they are enabled.
(Yes, we can just manually kill the offending apt_info.py process to temporarily clear the issue - but that is not a proper solution.)
As it prevents security updates, and even though it is unlikely to happen at scale or to be practically exploited, I feel this requires real attention.


Symptoms:
Persistent apt update locking error:
# apt update
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock. It is held by process 65553 (python3)
N: Be aware that removing the lock file is not a solution and may break your system.
E: Unable to lock directory /var/lib/apt/lists/

# 1 hour later, same issue, same holding PID 65553

# Processes involved:
# ps aux |grep pyth
root 1259 0.0 0.1 121076 27528 ? Ssl Jan06 0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgra>
root 65553 0.0 0.4 89640 76908 ? S 12:09 0:03 python3 /usr/share/prometheus-node-exporter-collectors/apt_info.py
ee 70395 0.0 0.2 124164 42844 ? Sl 12:35 0:00 /bin/python3.11 /home/ee/.vscode-oss/extensions/ms-python.python> (not suspected)

# ps aux |grep apt
root 65551 0.0 0.0 9552 4252 ? Ss 12:09 0:00 /bin/bash -c /usr/share/prometheus-node-exporter-collectors/apt_>
root 65553 0.0 0.4 89640 76908 ? S 12:09 0:03 python3 /usr/share/prometheus-node-exporter-collectors/apt_info.>
root 65554 0.0 0.0 2464 884 ? S 12:09 0:00 sponge /var/lib/prometheus/node-exporter/apt.prom
_apt 65814 0.0 0.0 27192 13204 ? S 12:09 0:00 /usr/lib/apt/methods/https
_apt 65815 0.0 0.0 24420 10236 ? S 12:09 0:00 /usr/lib/apt/methods/http
_apt 65816 0.0 0.0 27192 13204 ? S 12:09 0:00 /usr/lib/apt/methods/https
_apt 65817 0.0 0.0 24420 10272 ? S 12:09 0:00 /usr/lib/apt/methods/http
_apt 65819 0.0 0.0 17572 7624 ? S 12:09 0:00 /usr/lib/apt/methods/gpgv
_apt 65826 0.0 0.0 27192 13464 ? S 12:09 0:00 /usr/lib/apt/methods/https
_apt 65829 0.0 0.0 24420 10292 ? S 12:09 0:00 /usr/lib/apt/methods/http
_apt 66110 0.0 0.0 17528 7500 ? S 12:10 0:00 /usr/lib/apt/methods/store
_apt 66112 0.0 0.0 18436 8636 ? S 12:10 0:00 /usr/lib/apt/methods/rred
_apt 66113 0.0 0.0 18576 8860 ? S 12:10 0:00 /usr/lib/apt/methods/rred

The deadlock is obviously between the unattended-upgrade process (1259) and the prometheus triptych: 65551/53/54.


# 65553 seems to be the culprit - as apt update tells us
# strace -p 65553
strace: Process 65553 attached
pselect6(29, [12 13 14 16 18 20 22 24 26 28], [], NULL, {tv_sec=0, tv_nsec=499419645}, NULL) = 0 (Timeout)
pselect6(29, [12 13 14 16 18 20 22 24 26 28], [], NULL, {tv_sec=0, tv_nsec=500000000}, NULL) = 0 (Timeout)
... repeats 'forever' ...
All fds are pipes; I could not get more info before the process crashed due to my diagnostic attempts.
An apt/python/prometheus-collector specialist should instantly identify these pipes and draw further deductions from the following state:

# gdb -p 65553 and bt:
#0 0x00007fa4bf65f794 in __GI___select (nfds=29, readfds=0x7ffc24f8e7c0, writefds=0x7ffc24f8e840, exceptfds=0x0,
timeout=0x7ffc24f8e750) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x00007fa4bebad338 in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00007fa4becb1485 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#3 0x00007fa4becb1976 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) ()
from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#4 0x00007fa4bed32fe1 in ?? () from /usr/lib/python3/dist-packages/apt_pkg.cpython-311-x86_64-linux-gnu.so
#5 0x0000000000521cf0 in ?? ()
#6 0x000000000053983c in PyObject_Vectorcall ()
#7 0x000000000052a570 in _PyEval_EvalFrameDefault ()
#8 0x000000000052222b in PyEval_EvalCode ()
#9 0x0000000000647f07 in ?? ()
#10 0x00000000006457cf in ?? ()
#11 0x0000000000651920 in ?? ()
#12 0x000000000065166b in _PyRun_SimpleFileObject ()
#13 0x0000000000651494 in _PyRun_AnyFileObject ()
#14 0x000000000065022f in Py_RunMain ()
#15 0x00000000006248b7 in Py_BytesMain ()
#16 0x00007fa4bf58818a in __libc_start_call_main (main=main@entry=0x624820, argc=argc@entry=2,
argv=argv@entry=0x7ffc24f8f298) at ../sysdeps/nptl/libc_start_call_main.h:58
#17 0x00007fa4bf588245 in __libc_start_main_impl (main=0x624820, argc=2, argv=0x7ffc24f8f298, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc24f8f288) at ../csu/libc-start.c:381
#18 0x0000000000624751 in _start ()

This seems to suggest that the location of the deadlock, for 65553, is:
(apt_info.py)
def _main():
    cache = apt.cache.Cache()

    # First of all, attempt to update the index. If we don't have permission
    # to do so (or it fails for some reason), it's not the end of the world,
    # we'll operate on the old index.
    with contextlib.suppress(apt.cache.LockFailedException, apt.cache.FetchFailedException):
        cache.update()  # <<<<<<<<<<<< VERY LIKELY

I could not confirm the precise location, as trying to get a python backtrace from the process generated a SEGV:
(gdb) call PyRun_SimpleString("print('toto\n')") # to test
'PyRun_SimpleString' has unknown return type; cast the call to its declared return type
(gdb) call (void*)PyRun_SimpleString("print('toto\n')")
Program received signal SIGSEGV, Segmentation fault.
# Oops... will not get a python trace now.

Fortunately, I collected the core (~27MB) - if interested, tell me; I am keeping it for a few weeks:
#0 0x000000000063187a in ?? ()
#1 0x00000000006349b2 in PyImport_AddModuleObject ()
#2 0x0000000000634688 in PyImport_AddModule ()
#3 0x000000000063e323 in PyRun_SimpleStringFlags ()
(but I feel it is unrelated and not so useful - though I may be wrong)

I feel I can't help more now, so throwing the potato 😉

Best,
Eric 'Steve' Estievenart


-- System Information:
Debian Release: bookworm/sid
APT prefers unstable
APT policy: (990, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.0.0-6-amd64 (SMP w/4 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages prometheus-node-exporter-collectors depends on:
ii moreutils 0.67-1
ii prometheus-node-exporter 1.5.0-1+b1
ii python3-apt 2.5.0
ii systemd-sysv 252.4-1

Versions of packages prometheus-node-exporter-collectors recommends:
ii ipmitool 1.8.19-4
ii jq 1.6-2.1
ii nvme-cli 2.2.1-3
ii python3 3.11.1-1
ii smartmontools 7.3-1+b1

prometheus-node-exporter-collectors suggests no packages.

-- no debconf information

Daniel Swarbrick

Jan 8, 2023, 8:21:44 PM
Hi Eric,

Thanks for the detailed bug report. As this is something which can
theoretically affect _any_ apt-based distribution (i.e., derivatives of
Debian), I feel that it should ideally be reported upstream.

I personally run this textfile collector on a Debian bookworm system, as
well as apticron - so this is (I think) a similar scenario where two
independent processes are periodically updating the apt cache, and I
wondered whether that was wise or not. I have seen the textfile
collector block only once so far.

The apt.sh script which apt_info.py replaces only executed "apt-get
--just-print" - so even if executed as root, it never tried to update
the apt cache. In fact, unless you had something else like apticron to
periodically update the apt cache, apt.sh would return stale information.

I guess that a simple workaround would be to tweak the systemd service
so that apt_info.py is executed as an unprivileged user, which would be
unable to update the cache, and theoretically avoid any potential for a
deadlock. Perhaps a recommendation to the upstream developer could be
made, e.g. to add a command-line argument to the script so that it
wouldn't try to update the cache even when executed as root.

Best,
Daniel


Antoine Beaupré

Oct 11, 2023, 2:50:05 PM
On 2023-01-09 14:08:06, Daniel Swarbrick wrote:
> Hi Eric,
>
> Thanks for the detailed bug report. As this is something which can
> theoretically affect _any_ apt-based distribution (i.e., derivatives of
> Debian), I feel that it should ideally be reported upstream.

I'm curious here, actually: which upstream are you thinking of? Because
I have the suspicion this is actually a python3-apt bug rather than
specific to this exporter...

> I personally run this textfile collector on a Debian bookworm system, as
> well as apticron - so this is (I think) a similar scenario where two
> independent processes are periodically updating the apt cache, and I
> wondered whether that was wise or not. I have seen the textfile
> collector block only once so far.

We're seeing repeated problems with this here. We manage a fleet of
about 90 Debian installations, out of which 42 have been upgraded to
bookworm and are showing symptoms. Those machines have an hourly legacy
cron job that updates the apt cache for another monitoring system, with
`apt update -qq`. Since we upgraded to bookworm, we have had 95 warnings
from cron, some of them repeating for hours on end.

One box in particular hung on that lock for over *two days*.

> The apt.sh script which apt_info.py replaces only executed "apt-get
> --just-print" - so even if executed as root, it never tried to update
> the apt cache. In fact, unless you had something else like apticron to
> periodically update the apt cache, apt.sh would return stale information.

That does seem suboptimal, that said. :)

> I guess that a simple workaround would be to tweak the systemd service
> so that apt_info.py is executed as an unprivileged user, which would be
> unable to update the cache, and theoretically avoid any potential for a
> deadlock. Perhaps a recommendation to the upstream developer could be
> made, e.g. to add a command-line argument to the script so that it
> wouldn't try to update the cache even when executed as root.

Surely there should be a timeout to this script or something? Why does
it hang in the first place? I'll investigate a little bit further.
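
(Sketching what such a guard could look like: a generic SIGALRM-based wrapper, with a dummy sleep standing in for cache.update(). Note this is only a sketch - a C extension that restarts its syscalls may not be interruptible this way, which is one argument for a process-level timeout instead.)

```python
import signal

class UpdateTimeout(Exception):
    """Raised when the wrapped call exceeds its time budget."""

def run_with_timeout(func, seconds):
    # Arm a SIGALRM deadline around func() (Unix only, main thread only).
    def _handler(signum, frame):
        raise UpdateTimeout("timed out after %ds" % seconds)
    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)                     # disarm the alarm
        signal.signal(signal.SIGALRM, old)  # restore previous handler
```

One would call `run_with_timeout(cache.update, 30)` instead of `cache.update()`, and treat `UpdateTimeout` like the already-suppressed fetch failures.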

Antoine Beaupré

Oct 11, 2023, 3:20:05 PM
(adding the collector author and python3-apt author in cc)

(context for you fine folks: the Prometheus monitoring system (package
prometheus-node-exporter-collectors) has a systemd timer that runs every
5 minutes and seems to sometimes hang for hours on end. The script was
written by Kyle Fazzari (in cc) using the python-apt library written by
jak (also in cc). I'm trying to figure out if it's using the library
correctly, or if there's a bug in said library.)

On 2023-10-11 14:31:51, Antoine Beaupré wrote:
> Surely there should be a timeout to this script or something? Why does
> it hang in the first place? I'll investigate a little bit further.

So that is what the Python script does:

def _main():
    cache = apt.cache.Cache()

    # First of all, attempt to update the index. If we don't have permission
    # to do so (or it fails for some reason), it's not the end of the world,
    # we'll operate on the old index.
    with contextlib.suppress(apt.cache.LockFailedException, apt.cache.FetchFailedException):
        cache.update()

    cache.open()
    cache.upgrade(True)
    _write_pending_upgrades(cache)
    _write_held_upgrades(cache)
    _write_autoremove_pending(cache)
    _write_reboot_required()

That's the version in bookworm, the latest upstream is similar:

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/34dd42ee2cf5bf1ffcfdea5a3599130f146b88fc/apt_info.py#L93

The _write_*() calls are all things that parse the cache and spew out
metrics. I believe they are irrelevant to this bug.
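
(For reference, the shape of what they emit is easy to sketch. This simplified stand-in takes a plain list of package names where the real _write_pending_upgrades() walks cache.get_changes() and labels by origin/architecture; the metric name here is illustrative, not copied from the script.)

```python
def format_pending_upgrades(package_names):
    """Render a pending-upgrades count in Prometheus textfile format.

    Simplified stand-in for _write_pending_upgrades(): the real function
    derives its input from cache.get_changes(); the metric name is
    illustrative.
    """
    lines = [
        "# HELP apt_upgrades_pending Apt packages pending an upgrade.",
        "# TYPE apt_upgrades_pending gauge",
        "apt_upgrades_pending %d" % len(package_names),
    ]
    return "\n".join(lines) + "\n"
```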

If I look at a stuck server, the process list looks like this:

root 101969 0.0 0.0 4388 3140 ? Ss 10:00 0:00 /bin/bash -c /usr/share/prometheus-node-exporter-collectors/apt_info.py | sponge /var/lib/prometheus
root 101971 0.0 0.9 88124 78232 ? S 10:00 0:04 \_ python3 /usr/share/prometheus-node-exporter-collectors/apt_info.py
_apt 101988 0.0 0.1 21216 11700 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/https
_apt 101989 0.0 0.1 21164 10696 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/http
_apt 101990 0.0 0.1 24336 14196 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/https
_apt 101992 0.0 0.0 12424 6436 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/gpgv
root 101972 0.0 0.0 2464 908 ? S 10:00 0:00 \_ sponge /var/lib/prometheus/node-exporter/apt.prom

That seems to be the APT `gpgv`, `http` and `https` methods stuck
there... This particular one has been stuck for 4 hours at this
point. The network processes have a socket open with the remote HTTP
server in CLOSE_WAIT state.

Obviously, we don't have this problem with other `apt`
incantations. So something must be wrong with the way we're calling apt
here... It looks like it's doing this:

bool res = ListUpdate(progress, *source, pulseInterval);

here:

https://salsa.debian.org/apt-team/python-apt/-/blob/bdcb2550cb6f623f4556bf6581a040642f29dd28/python/cache.cc#L158

while apt itself just calls it as:

if (!ListUpdate(*acquire, *m_cache->GetSourceList())) {

in:

https://sources.debian.org/src/libqapt/3.0.5-2/src/worker/aptworker.cpp/?hl=311#L311

... but that looks like a different interface, perhaps, than the
previous one.

Here's a backtrace with debugging symbols attached for libapt:

(gdb) bt
#0 0x00007f4a6601d744 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761
#2 0x00007f4a65b97935 in AcquireUpdate (Fetcher=..., PulseInterval=PulseInterval@entry=0, RunUpdateScripts=RunUpdateScripts@entry=true,
ListCleanup=ListCleanup@entry=true) at ./apt-pkg/update.cc:71
#3 0x00007f4a65b97e26 in ListUpdate (Stat=..., List=..., PulseInterval=0) at ./apt-pkg/update.cc:37
#4 0x00007f4a65c18fe1 in ?? () from /usr/lib/python3/dist-packages/apt_pkg.cpython-311-x86_64-linux-gnu.so
#5 0x00000000005230d0 in ?? ()
#6 0x000000000053ac2c in PyObject_Vectorcall ()
#7 0x000000000052b940 in _PyEval_EvalFrameDefault ()
#8 0x000000000052360b in PyEval_EvalCode ()
#9 0x0000000000647497 in ?? ()
#10 0x0000000000644d4f in ?? ()
#11 0x0000000000651010 in ?? ()
#12 0x0000000000650d5b in _PyRun_SimpleFileObject ()
#13 0x0000000000650b84 in _PyRun_AnyFileObject ()
#14 0x000000000064f90f in Py_RunMain ()
#15 0x00000000006275c7 in Py_BytesMain ()
#16 0x00007f4a65f461ca in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x00007f4a65f46285 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x0000000000627461 in _start ()

(gdb) up
#1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761
761 in ./apt-pkg/acquire.cc
(gdb) p (pkgAcquire const) *0x7ffdf60fa4c0
$5 = {_vptr.pkgAcquire = 0x7f4a65bf0380 <vtable for pkgAcquire+16>, LockFD = 26, d = 0x0, Items = std::vector of length 49, capacity 64 = {0x2465f50, 0x248dd00,
0x243ea90, 0x243f920, 0x2467520, 0x2479040, 0x24817b0, 0x222b3e0, 0x2482180, 0x21bbe00, 0x21c4840, 0x1f97c60, 0x1f33e50, 0x24869d0, 0x2487ac0, 0x249bea0,
0x249caa0, 0x249dd40, 0x24a0ce0, 0x24a0f40, 0x24a55e0, 0x24a62f0, 0x24a6ed0, 0x24a80c0, 0x24a8ca0, 0x24a9ee0, 0x24aabf0, 0x24a3350, 0x24a3910, 0x24a4080,
0x24ab7f0, 0x24ac7a0, 0x24ad480, 0x24e8e30, 0x24e9c00, 0x24ea880, 0x24eb3d0, 0x24ec3a0, 0x24ecfb0, 0x24f0d10, 0x24f4ec0, 0x24f5520, 0x24f6130, 0x24f7260,
0x24f8430, 0x24f95b0, 0x24fa710, 0x24fb800, 0x216e9b0}, Queues = 0x248d900, Workers = 0x223c580, Configs = 0x2246950, Log = 0x7ffdf60fa5a0, ToFetch = 2,
QueueMode = pkgAcquire::QueueHost, Debug = false, Running = true}
(gdb) up
#2 0x00007f4a65b97935 in AcquireUpdate (Fetcher=..., PulseInterval=PulseInterval@entry=0, RunUpdateScripts=RunUpdateScripts@entry=true,
ListCleanup=ListCleanup@entry=true) at ./apt-pkg/update.cc:71
71 ./apt-pkg/update.cc: No such file or directory.
(gdb) p Fetcher
$2 = (pkgAcquire &) @0x7ffdf60fa4c0: {_vptr.pkgAcquire = 0x7f4a65bf0380 <vtable for pkgAcquire+16>, LockFD = 26, d = 0x0,
Items = std::vector of length 49, capacity 64 = {0x2465f50, 0x248dd00, 0x243ea90, 0x243f920, 0x2467520, 0x2479040, 0x24817b0, 0x222b3e0, 0x2482180, 0x21bbe00,
0x21c4840, 0x1f97c60, 0x1f33e50, 0x24869d0, 0x2487ac0, 0x249bea0, 0x249caa0, 0x249dd40, 0x24a0ce0, 0x24a0f40, 0x24a55e0, 0x24a62f0, 0x24a6ed0, 0x24a80c0,
0x24a8ca0, 0x24a9ee0, 0x24aabf0, 0x24a3350, 0x24a3910, 0x24a4080, 0x24ab7f0, 0x24ac7a0, 0x24ad480, 0x24e8e30, 0x24e9c00, 0x24ea880, 0x24eb3d0, 0x24ec3a0,
0x24ecfb0, 0x24f0d10, 0x24f4ec0, 0x24f5520, 0x24f6130, 0x24f7260, 0x24f8430, 0x24f95b0, 0x24fa710, 0x24fb800, 0x216e9b0}, Queues = 0x248d900,
Workers = 0x223c580, Configs = 0x2246950, Log = 0x7ffdf60fa5a0, ToFetch = 2, QueueMode = pkgAcquire::QueueHost, Debug = false, Running = true}

Note that killing the process and restarting it fixes the issue
completely for us, as documented elsewhere.

So I'm a bit stuck as well. I can't help but think that the previous
shell script, as horrendous as it was, Just Worked for our purpose...

In any case, this is an important regression for us right now.

Antoine Beaupré

Oct 11, 2023, 3:30:05 PM
We deployed the following workaround for this, will report on whether or
not it fixes the issue:

# cat /etc/systemd/system/prometheus-node-exporter-apt.service.d/override.conf
[Service]
# bug https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212
# https://gitlab.torproject.org/tpo/tpa/team/-/issues/41355
TimeoutStartSec=30s
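
(The same effect could be had without systemd by running the collector under a watchdog; a minimal sketch, where the function name and the 30s budget mirroring the override above are my own invention:)

```python
import subprocess

def run_collector(cmd, timeout_s=30):
    """Run a collector command, killing it if it exceeds timeout_s.

    Returns its stdout on success, or None on timeout, so stale metrics
    are simply left unrefreshed instead of holding the apt lock forever.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return done.stdout
    except subprocess.TimeoutExpired:
        return None
```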

Kyle Fazzari

Oct 11, 2023, 4:30:05 PM
On 10/11/23 12:08, Antoine Beaupré wrote:
> (adding the collector author and python3-apt author in cc)

Thanks for looping me in, Antoine. I'm sorry to hear this is happening!
I haven't had a chance to experiment with this yet, but I have a few
preliminary thoughts to share after reading through the conversation.

By the way, Eric, sometimes I forget what good bug reports look like.
Thanks for providing an example.

First of all, my primary use-case for this collector is to alert me of
updates that need to be installed (e.g. security updates). In that
context, operating on a stale cache would give me stale information,
which one could argue is worse than no information. Once is probably
fine, which is why that script silently proceeds if there is a problem
updating the cache, but actually adding the ability to disable the cache
update (as Daniel suggests as a potential solution) seems like it would
be a footgun.

Second, I must admit I'm also rather flummoxed by this behavior. Earlier
in the thread, Daniel said:

> I personally run this textfile collector on a Debian bookworm system,
> as well as apticron - so this is (I think) a similar scenario where
> two independent processes are periodically updating the apt cache,
> and I wondered whether that was wise or not

I don't claim to be an apt expert, but I believe apt uses locks to
prevent the scenario where this would be problematic, no? You can't run
two `apt update`s at the same time: one fails because it can't get the
lock. Similarly, you can't run one `apt update` at the same time as this
Python script:

import apt
cache = apt.cache.Cache()
cache.update()

The Python script uses the same lock, and fails the same way. You can't
run two instances of that script ^ at the same time, either. This is why
we catch the `apt.cache.LockFailedException` error in the script. If the
deadlock is really happening on the `cache.update()` line, that feels
like an apt bug. Obviously jak will know better.

One off-the-cuff idea is that we could probably use a custom apt config
to run the checks on our own copy of the cache without interacting with
the system's cache, but that will greatly complicate this script.

Anyway, I'll do some experimentation and see if I can develop some
properly-formed thoughts.

Kyle

Antoine Beaupré

Oct 11, 2023, 10:30:04 PM
On 2023-10-11 13:17:39, Kyle Fazzari wrote:
> On 10/11/23 12:08, Antoine Beaupré wrote:

[...]

> First of all, my primary use-case for this collector is to alert me of
> updates that need to be installed (e.g. security updates). In that
> context, operating on a stale cache would give me stale information,
> which one could argue is worse than no information. Once is probably
> fine, which is why that script silently proceeds if there is a problem
> updating the cache, but actually adding the ability to disable the cache
> update (as Daniel suggests as a potential solution) seems like it would
> be a footgun.

Yep, that makes sense.

> Second, I must admit I'm also rather flummoxed by this behavior. Earlier
> in the thread, Daniel said:
>
> > I personally run this textfile collector on a Debian bookworm system,
> > as well as apticron - so this is (I think) a similar scenario where
> > two independent processes are periodically updating the apt cache,
> > and I wondered whether that was wise or not
>
> I don't claim to be an apt expert, but I believe apt uses locks to
> prevent the scenario where this would be problematic, no? You can't run
> two `apt update`s at the same time: one fails because it can't get the
> lock. Similarly, you can't run one `apt update` at the same time as this
> Python script:
>
> import apt
> cache = apt.cache.Cache()
> cache.update()
>
> The Python script uses the same lock, and fails the same way. You can't
> run two instances of that script ^ at the same time, either. This is why
> we catch the `apt.cache.LockFailedException` error in the script. If the
> deadlock is really happening on the `cache.update()` line, that feels
> like an apt bug. Obviously jak will know better.

I don't think this is a deadlock, but I admit I haven't looked at the
code. At least the gdb backtraces don't show the hung process as
spinning on a lock, it's waiting on a download:

#0 0x00007f4a6601d744 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761

That's clearly some thread waiting on the network. What I suspect is
happening is that `apt update` applies some timeout that simply never
triggers in `apt.cache.update()`.

> One off-the-cuff idea is that we could probably use a custom apt config
> to run the checks on our own copy of the cache without interacting with
> the system's cache, but that will greatly complicate this script.

Yeah, really, the script you wrote should Just Work. I find the
`cache.upgrade()` call to be a little strange, personally: I would try
ripping that out completely to see if it fixes the issue, but maybe you
have a better idea of why it's there in the first place?

For now, we're using a timeout mitigation in the systemd timer
instead. I've received two more warnings from the other cron job, but
hopefully this will go away.

But because `apt_info.py` can silently fail to update the cache, we may
want to add an extra metric to track the update timestamp of the mirror
info; I filed this bug about that:

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/180

> Anyway, I'll do some experimentation and see if I can develop some
> properly-formed thoughts.

Thank you so much for your response!

I think adding instrumentation around how long the script takes to run
itself would be valuable, that could be a simple time counter added to
the script's output... This would allow tracking this problem in fleets
where there *isn't* such lock contention.
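
(Such a self-instrumentation counter could look like this; the metric names are invented for illustration, not an established convention:)

```python
def format_run_metrics(started, finished):
    """Render last-run timestamp and duration in textfile-collector format.

    Metric names are illustrative; appending this to the script's output
    would make a stuck or slow run visible even without lock contention.
    """
    lines = [
        "# TYPE apt_info_last_run_timestamp_seconds gauge",
        "apt_info_last_run_timestamp_seconds %d" % int(finished),
        "# TYPE apt_info_run_duration_seconds gauge",
        "apt_info_run_duration_seconds %.3f" % (finished - started),
    ]
    return "\n".join(lines) + "\n"
```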

After all, the only reason we found out about this is because we got
repeated emails from cron about apticron or other software failing to
run apt-update. If that's removed from the equation, the script here
just fails silently, and I think that's also possibly a Bad Thing.

a.

--
If ease of use was the ultimate aim for a tool,
the bicycle would never have evolved beyond the tricycle.
— Doug Engelbart, 1925-2013

Kyle Fazzari

Oct 12, 2023, 1:40:05 AM


On 10/11/23 19:24, Antoine Beaupré wrote:
> Yeah, really, the script you wrote should Just Work. I find the
> `cache.upgrade()` call to be a little strange, personally: I would try
> ripping that out completely to see if it fixes the issue, but maybe you
> have a better idea of why it's there in the first place?

It's admittedly been a little while so there's some rust in my noggin,
but the backbone of this script is `cache.get_changes()`. I believe it's
`cache.upgrade()` that actually marks the changes. Note of course that
none of the changes are applied without a `cache.commit()`, so no
upgrade operation is actually performed; this is just a way of saying
"if I were to upgrade, what changes would be made?" Ultimately I
believe the removal of `cache.upgrade()` would result in the collector
never seeing available upgrades.

> But because the `apt_info.py` can silently fail to update the cache, we
> may want to add an extra metric to track the update timestamp on the
> mirror info, I filed this bug about that:
>
> https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/180

Agreed, that seems sound.

>> Anyway, I'll do some experimentation and see if I can develop some
>> properly-formed thoughts.
>
> Thank you so much for your response!
>
> I think adding instrumentation around how long the script takes to run
> itself would be valuable, that could be a simple time counter added to
> the script's output... This would allow tracking this problem in fleets
> where there *isn't* such lock contention.
>
> After all, the only reason we found out about this is because we got
> repeated emails from cron about apticron or other software failing to
> run apt-update. If that's removed from the equation, the script here
> just fails silently, and I think that's also possibly a Bad Thing.

Yeah I won't lie, my immediate thought was "huh, I've never seen that
happen." Then my follow-up was "actually, how would I even know?" But at
the same time, I could make that argument for a lot of collectors! Is
there an established pattern for gathering this kind of data?

Kyle

Antoine Beaupré

Oct 12, 2023, 1:50:05 AM
On 2023-10-11 22:30:16, Kyle Fazzari wrote:
> On 10/11/23 19:24, Antoine Beaupré wrote:

[...]

> Yeah I won't lie, my immediate thought was "huh, I've never seen that
> happen." Then my follow-up was "actually, how would I even know?" But at
> the same time, I could make that argument for a lot of collectors! Is
> there an established pattern for gathering this kind of data?

Filing a timestamp and duration of the last run is typical.

--
Live simply so that all may simply live.
- Gandhi

Kyle Fazzari

Oct 13, 2023, 12:30:05 AM


On 10/11/23 22:45, Antoine Beaupré wrote:
> On 2023-10-11 22:30:16, Kyle Fazzari wrote:
>> On 10/11/23 19:24, Antoine Beaupré wrote:
>
> [...]
>
>> Yeah I won't lie, my immediate thought was "huh, I've never seen that
>> happen." Then my follow-up was "actually, how would I even know?" But at
>> the same time, I could make that argument for a lot of collectors! Is
>> there an established pattern for gathering this kind of data?
>
> Filing a timestamp and duration of the last run is typical.

Very good.

By the way, I came across this Ubuntu bug today, which sounds eerily
familiar, no?

https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851

Antoine Beaupré

Oct 13, 2023, 10:40:05 AM
Oh wow, so that would mean a bug in APT itself. And a regression too,
because we certainly never witnessed this in buster or bullseye...

I wonder if we should clone this bug into apt...

a.
--
The illusion of freedom will continue as long as it's profitable to
continue the illusion. At the point where the illusion becomes too
expensive to maintain, they will just take down the scenery, they will
pull back the curtains, they will move the tables and chairs out of
the way and you will see the brick wall at the back of the theater.
- Frank Zappa

Antoine Beaupré

Oct 13, 2023, 11:50:05 AM
On 2023-10-13 17:26:38, Julian Andres Klode wrote:
> On Fri, Oct 13, 2023 at 10:29:10AM -0400, Antoine Beaupré wrote:
>> On 2023-10-12 21:25:20, Kyle Fazzari wrote:
>> > On 10/11/23 22:45, Antoine Beaupré wrote:
>> >> On 2023-10-11 22:30:16, Kyle Fazzari wrote:
>> >>> On 10/11/23 19:24, Antoine Beaupré wrote:
>> >>
>> >> [...]
>> >>
>> >>> Yeah I won't lie, my immediate thought was "huh, I've never seen that
>> >>> happen." Then my follow-up was "actually, how would I even know?" But at
>> >>> the same time, I could make that argument for a lot of collectors! Is
>> >>> there an established pattern for gathering this kind of data?
>> >>
>> >> Filing a timestamp and duration of the last run is typical.
>> >
>> > Very good.
>> >
>> > By the way, I came across this Ubuntu bug today, which sounds eerily
>> > familiar, no?
>> >
>> > https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851
>>
>> Oh wow, so that would mean a bug in APT itself. And a regression too,
>> because we certainly never witnessed this in buster or bullseye...
>>
>> I wonder if we should clone this bug into apt...
>
> If you use apt-cacher-ng, stop using it. apt is *not* compatible
> with apt-cacher-ng. We already have bugs for that, and the only
> way things will change is if we get HTTP/2 over https on the mirrors
> and then can switch to libcurl.

We are not using apt-cacher-ng, and are now seeing this problem with
unattended-upgrades as well.

> Also please do not run apt update in the background or try to
> calculate dist upgrades, that is evil and you're breaking stuff.
> If you want to check for updates, make sure the periodic apt service
> is configured to run. You are entitled to one run per day. If you
> do not operate the mirror infrastructure please do not run your own
> updates out of band.

I agree this is a little odd and that the prometheus apt plugin should
probably not be the one handling the cache update.

What's the magic setting to make apt check those updates on its own? I
often get confused between unattended-upgrades and apt there...

> Then you may calculate the upgrade, but don't open the cache
> with locking so you don't go block other clients.

Could you expand on this? What's the proper pythonic way of doing this?

The script currently does:

cache = apt.cache.Cache()
with contextlib.suppress(apt.cache.LockFailedException, apt.cache.FetchFailedException):
    cache.update()

cache.open()
cache.upgrade(True)

What's the locking part here? Is this sufficient?

cache = apt.cache.Cache()

> Generally speaking it is likely that we will come up with a daemon
> on the road to Ubuntu 26.04 that you can ask to check for updates
> if it hasn't (again please don't if you don't own your infra) and
> then query for available upgrades; but it's nothing for the next 12 months.

Okay, that makes sense. So the Debian package here clearly has a bug
because it hits those mirrors every 15 minutes, that's definitely
something that needs to change.

> Also Kyle's emails don't arrive at all, well one did in the
> entire thread.

I'm not sure what to say about this... I don't have access to either
Kyle's, Ubuntu's or Debian.org infrastructure to diagnose this... Would
you prefer we move this discussion to the GitHub issue tracker or the
Launchpad issue?

Thanks for your feedback!

--
The reasonable man adapts himself to the world.
The unreasonable man persists in trying to adapt the world to himself.
Therefore, all progress depends on the unreasonable man.
- George Bernard Shaw

Antoine Beaupré

Oct 13, 2023, 12:00:04 PM
On 2023-10-13 11:40:17, Antoine Beaupré wrote:

[...]

> I agree this is a little odd and that the prometheus apt plugin should
> probably not be the one handling the cache update.

I have made a patch upstream for this:

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/pull/181

I believe this would fix this issue as far as apt_info.py is concerned.

I do feel there's an outstanding issue here in apt, however, but that's
probably better served by another bug report.

[...]

> Could you expand on this? What's the proper pythonic way of doing this?
>
> The script currently does:
>
> cache = apt.cache.Cache()
> with contextlib.suppress(apt.cache.LockFailedException, apt.cache.FetchFailedException):
>     cache.update()
>
> cache.open()
> cache.upgrade(True)
>
> What's the locking part here? Is this sufficient?
>
> cache = apt.cache.Cache()

I have just done that; @jak, it would be great if you could just +1 the
above...

Thanks again!

Antoine Beaupré

Oct 13, 2023, 12:10:05 PM
severity 1028212 serious
tags 1028212 +patch
thanks

Also, considering what jak said about scripts that shouldn't hit mirrors
more than once a day, I'm going to mark this as release critical.

The rationale is that we're doing what I consider a severe violation
of Debian policy. While I am not sure there's a written policy on how
often you should hit mirrors, this is *definitely* too often. In our
modest fleet of 100 servers, hitting the mirrors every 15 minutes is
equivalent to generating one "hit" (which is possibly multiple HTTP
requests) *every nine seconds*!
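
A quick check of the arithmetic behind that figure:

```python
servers = 100          # fleet size
interval_s = 15 * 60   # collector interval: 15 minutes
seconds_per_hit = interval_s / servers
print(seconds_per_hit)  # → 9.0: one mirror hit every nine seconds
```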

So we need to fix this, and we need to fix it in bookworm, and IMHO,
ASAP. I'll certainly deploy the attached patch in production on our end
to see what comes up.

--
Do not use your energy to worry.
Use your energy to believe, to create, to learn, to think, and to grow.
- Richard Feynman
0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch

Kyle Fazzari

Oct 13, 2023, 12:10:05 PM


On 10/13/23 08:26, Julian Andres Klode wrote:
> Also please do not run apt update in the background or try to
> calculate dist upgrades, that is evil and you're breaking stuff.
> If you want to check for updates, make sure the periodic apt service
> is configured to run. You are entitled to one run per day. If you
> do not operate the mirror infrastructure please do not run your own
> updates out of band.

A fair critique, although as I mentioned in an earlier email, this
collector cannot do its job if it's running on a significantly
out-of-date cache. At the same time, it feels out of scope for this
Debian package to ensure the periodic apt service is configured to run.
Feels like a rock and a hard place. Thoughts?

Kyle

Kyle Fazzari

Oct 13, 2023, 12:30:05 PM


On 10/13/23 09:14, Antoine Beaupré wrote:
> I think this is a deployment issue. People who provision this package
> should *not* expect it to run `apt update`: I certainly didn't, and the
> previous (shell) implementation didn't either.
>
> We have other tools that continuously pull mirrors, and as jak stated,
> APT can be configured to do so as well (although I'm not sure what the
> canonical way of doing so is).
>
> So let's not do this here.

I don't entirely agree, but disagreement is okay. I do at least
recommend accompanying this with a cache age statistic, as we discussed
earlier.
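
(A minimal sketch of such a statistic, in Prometheus textfile format. The metric name apt_package_cache_timestamp_seconds and the use of the /var/lib/apt/lists mtime as a staleness proxy are illustrative choices, not necessarily what the package implements:)

```python
import os

LISTS_DIR = "/var/lib/apt/lists"  # refreshed by `apt update`

def cache_mtime(path=LISTS_DIR):
    """Return the mtime of the apt lists directory, or None if absent."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None

def render_metric(mtime):
    """Render the timestamp as a Prometheus textfile gauge."""
    if mtime is None:
        return ""
    return (
        "# HELP apt_package_cache_timestamp_seconds Time the apt lists "
        "were last refreshed.\n"
        "# TYPE apt_package_cache_timestamp_seconds gauge\n"
        f"apt_package_cache_timestamp_seconds {mtime:.0f}\n"
    )

if __name__ == "__main__":
    print(render_metric(cache_mtime()), end="")
```

With a gauge like this, an alert on cache age catches a broken periodic update, instead of the exporter papering over it by fetching on its own.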

Kyle

Antoine Beaupré

Oct 13, 2023, 12:40:04 PM
On 2023-10-13 11:59:23, Antoine Beaupré wrote:
> severity 1028212 serious
> tags 1028212 +patch

[...]

> From 3b17a4dcb8caa56191c5be523c874a7f470bd04a Mon Sep 17 00:00:00 2001

[...]

> diff --git a/apt_info.py b/apt_info.py
> index eb1a642..9b1b675 100755
> --- a/apt_info.py
> +++ b/apt_info.py

[...]

> registry = CollectorRegistry()
> _write_pending_upgrades(registry, cache)
> _write_held_upgrades(registry, cache)

That patch doesn't actually apply cleanly on bookworm because upstream
did some refactoring; the attached patch should work better.

A.
--
C'est avec les pierres de la loi qu'on a bâti les prisons,
et avec les briques de la religion, les bordels.
- William Blake
0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch