
Bug#1053509: shasta: autopkgtest regression on arm64: is memory suddenly not enough?


Paul Gevers

Oct 5, 2023, 7:50:05 AM
Source: shasta
Version: 0.11.1-1
Severity: important
User: debi...@lists.debian.org
Usertags: regression

Dear maintainer(s),

Your package has an autopkgtest, great. However, it recently started
failing on arm64 in unstable, testing *and* stable. Until mid-September
2023, most runs passed (there are some rare historical failures for which
we no longer have logs), but somewhere between 2023-09-17 and
2023-10-01 the tests started failing (although in testing we had some
passes on 2023-09-28 and 2023-10-04; interestingly, those runs took *much*
longer than normal, by a factor of ~20). I first suspected that this regression
aligned with my fix for bug 1050256 where I switched the kernel of our
hosts to backports, but there are multiple good runs with that kernel.
Today I learned that "Killed" in the log is often coming from the kernel
when a program is allocating too much memory and the kernel kills the
process. We have some striking logs of failures [1] where the shasta
test actually bails out itself for lack of memory.

So, there are a couple of things (weirdness and options):
1) why does it now suddenly start to (nearly always) fail across the
board on arm64 (in Debian, Ubuntu still seems fine), without changes to
the infrastructure that I know of?
2) do you also believe this is related to memory consumption?
3) If 2 == yes, what are the memory requirements for the test? The test
*could* check for that before it starts and bail out (restriction:
skippable, with exit code 77 [2]) if the amount of memory available is
too small.
4) just stop testing on arm64 (but again, in Ubuntu the test is still
running fine)
5) the recent glibc security update also came to mind, but that's not
*yet* installed on our hosts, nor is it installed in the stable testbed.

The release team has announced [3] that failing autopkgtests on amd64 and
arm64 are considered RC in testing. However, due to the nature of the
symptoms, I've filed this as important for now.

Paul

[1]
https://ci.debian.net/data/autopkgtest/testing/arm64/s/shasta/38544284/log.gz
48s 2023-Oct-04 12:01:05.603191 Memory allocation failure during
mremap call for MemoryMapped::Vector.
48s This assembly requires more memory than available.
48s Rerun on a larger machine.
[2]
https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/doc/README.package-tests.rst
[3] https://lists.debian.org/debian-devel-announce/2019/07/msg00002.html

Étienne Mollier

Dec 3, 2023, 12:40:04 PM
Hi Paul,
> 1) why does it now suddenly start to (nearly always) fail across the
> board on arm64 (in Debian, Ubuntu still seems fine), without changes to
> the infrastructure that I know of?

I'm afraid I'm not sure what is up with shasta eating up more
memory on the arm64 hosts of the CI infrastructure. What I can
see from my end is that the test requires roughly 8 GiB of
anonymous memory to be mapped to do its job. That said, this is
already the case for shasta in bookworm running on a bookworm
kernel, so it doesn't look like a regression per se.

> 2) do you also believe this is related to memory consumption?

The problem you mentioned, where shasta explicitly gives up when
running into memory limits, is reproducible when I disable the
swap on an 8 GiB machine that I have at hand. I attempted to
play with the /proc/sys/vm/overcommit_* settings, but my swap at
the time was too big (10 GiB) to give me the granularity needed
to check whether I could get somewhere with improper overcommit
memory tuning. In any case, the "Killed" status suggests
overcommit is active (or heuristic) on your end for at least
some of the hosts.

By any chance, could you double-check the memory settings on the
CI hosts, just in case, to make sure that the swap didn't drop
off the machines? Or maybe check for inconsistencies in the
memory overcommit settings?
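
For instance, something along these lines run on each worker
would make the comparison quick (just a sketch on my side; which
knobs matter most is an assumption):

  #!/usr/bin/env python3
  # Print the swap size and the vm.overcommit_* settings of this host
  # on one line, so the output of the various arm64 workers can be
  # compared at a glance.
  import socket

  def meminfo_kb(field):
      # Return a /proc/meminfo field value in kB.
      with open("/proc/meminfo") as fh:
          for line in fh:
              if line.startswith(field + ":"):
                  return int(line.split()[1])

  def vm_sysctl(name):
      with open("/proc/sys/vm/" + name) as fh:
          return fh.read().strip()

  print(socket.gethostname(),
        "SwapTotal=%s kB" % meminfo_kb("SwapTotal"),
        *("%s=%s" % (k, vm_sysctl(k))
          for k in ("overcommit_memory", "overcommit_ratio",
                    "overcommit_kbytes")))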

Currently readable test logs suggest that:

* ci-worker-arm64-10 met memory requirements in November,
* ci-worker-arm64-07 did not meet requirements in October,
* ci-worker-arm64-08 did not meet requirements in October,
* ci-worker-arm64-03 did not meet requirements in October.

So this may already be resolved, in case you changed something
in the meantime.

> 3) If 2 == yes, what are the memory requirements for the test? The test
> *could* test for that before it starts and bail out (restriction:
> skippable with exit code 77 [2]) if the amount of memory available is
> too small.

It shouldn't hurt, I guess. I think I can bolt on something that
reads the memory commit capacity and usage from /proc/meminfo at
the beginning of the test, and skips the run if the testbed
can't meet the memory requirement for whatever reason. Note this
may involve some trial and error.
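
Just to give an idea, here is a minimal sketch of what I have in
mind (the 8 GiB threshold and the choice of /proc/meminfo fields
are assumptions on my part, and part of the trial and error
mentioned above):

  #!/usr/bin/env python3
  # Hypothetical pre-flight check for the shasta autopkgtest: exit 77
  # so that a test declared with the "skippable" restriction is
  # reported as skipped instead of failed when memory is too tight.
  import sys

  REQUIRED_KB = 8 * 1024 * 1024  # ~8 GiB, the rough figure seen so far

  def meminfo():
      # Parse /proc/meminfo into a dict of field name -> value in kB.
      fields = {}
      with open("/proc/meminfo") as fh:
          for line in fh:
              name, value = line.split(":", 1)
              fields[name] = int(value.split()[0])
      return fields

  info = meminfo()
  # Commit headroom: how much anonymous memory the system could still
  # promise to processes before hitting the commit limit.
  headroom = info["CommitLimit"] - info["Committed_AS"]
  if headroom < REQUIRED_KB:
      print("only %d kB of commit headroom, %d kB wanted; skipping"
            % (headroom, REQUIRED_KB), file=sys.stderr)
      sys.exit(77)

Whether CommitLimit/Committed_AS or simply MemAvailable is the
better signal probably depends on the overcommit mode of the
testbed, hence the trial and error.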

Have a nice Sunday, :)
--
.''`. Étienne Mollier <emol...@debian.org>
: :' : gpg: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from /dev/pts/1, please excuse my verbosity
`- on air: Ghost - Avalanche

Paul Gevers

Dec 3, 2023, 3:40:05 PM
Hi Étienne,

On 03-12-2023 18:34, Étienne Mollier wrote:
>> 1) why does it now suddenly start to (nearly always) fail across the
>> board on arm64 (in Debian, Ubuntu still seems fine), without changes to
>> the infrastructure that I know of?
>
> I'm afraid I'm not sure what is up with shasta eating up more
> memory on arm64 hosts of CI infrastructure. What I can see from
> my end is that the test roughly requires 8GiB of anonymous
> memory to map for doing its job.

8 GiB... that's not little, considering that's what these hosts have
as RAM (https://wiki.debian.org/ContinuousIntegration/WorkerSpecs).

> Except that, this is already
> the case for shasta in bookworm running on bookworm kernel, so
> that doesn't look to be a regression per se.

Weird.

> Per chance, could you double check the memory settings on the CI
> hosts, just in case, to make sure that the swap didn't drop off
> the machine?

ci-worker-arm64-04: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-02: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-06: -rw------- 1 root root 3.9G May 26 2022 /swap
ci-worker-arm64-03: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-05: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-11: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-07: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-08: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-09: -rw------- 1 root root 3.9G May 27 2022 /swap
ci-worker-arm64-10: -rw------- 1 root root 3.9G May 27 2022 /swap

> Or maybe check for memory overcommit settings
> inconsistencies?

It's (overcommit_kbytes, overcommit_memory, overcommit_ratio) == (0, 0, 50)
across all our hosts.

> Currently readable test logs suggest that:
>
> * ci-worker-arm64-10 met memory requirements in November,
> * ci-worker-arm64-07 did not meet requirements in October,
> * ci-worker-arm64-08 did not meet requirements in October,
> * ci-worker-arm64-03 did not meet requirements in October.

Those hosts should be equivalent. Be aware though that tests don't run
in isolation: on our arm64 hosts, one more test might be running at the
same time. So what's *available* might not be constant over time.

Paul

Étienne Mollier

Dec 4, 2023, 4:40:05 AM
Hi Paul,

Paul Gevers, on 2023-12-03:
> 8GiB... that's not little, considering that that's what these hosts have as
> RAM (https://wiki.debian.org/ContinuousIntegration/WorkerSpecs).
[…]
> ci-worker-arm64-NN: -rw------- 1 root root 3.9G May 27 2022 /swap
[…]
> It's kbytes, memory, ratio == 0, 0, 50 across all our hosts.

Thank you for the figures, that makes:

(50% × 8 GiB RAM) + 4 GiB swap = 8 GiB CommitLimit

With overcommit disabled, that is a hard limit for committing
anonymous memory for the whole system (which makes me wonder how
we hit a failure mode that looks like it should only occur when
some overcommit is allowed). 8 GiB is the lower limit documented
by upstream, so I'm even wondering how it is possible that the
test passed at all in the past. Some more precise calculations of
memory consumption show me an upper bound of 7,900,000 kB, with
some runs consuming in the 6,000,000 kB range. This may be
explained by the algorithm involving random steps. In other
words, tests may have passed on sheer luck…
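
For the record, that is the arithmetic the kernel applies when
vm.overcommit_kbytes is 0; a tiny, purely illustrative sketch to
recompute it on a given host (huge pages ignored) and compare it
with what the kernel reports:

  #!/usr/bin/env python3
  # Recompute CommitLimit = MemTotal * overcommit_ratio / 100 + SwapTotal
  # (all in kB, huge pages ignored) and compare with the kernel's value.

  def meminfo_kb(field):
      # Return a /proc/meminfo field value in kB.
      with open("/proc/meminfo") as fh:
          for line in fh:
              if line.startswith(field + ":"):
                  return int(line.split()[1])
      raise KeyError(field)

  def vm_sysctl(name):
      with open("/proc/sys/vm/" + name) as fh:
          return int(fh.read())

  ratio = vm_sysctl("overcommit_ratio")  # 50 on the CI hosts
  computed = (meminfo_kb("MemTotal") * ratio // 100
              + meminfo_kb("SwapTotal"))
  print("computed CommitLimit:", computed, "kB")
  print("reported CommitLimit:", meminfo_kb("CommitLimit"), "kB")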

> Be aware though that tests don't run in
> isolation. At the same time, on our arm64 hosts, one more test might be
> running. So what's *available* might not be constant in time.

Okay, that means there will be concurrent access to an already
tight space for shasta. It looks like that's on me being greedy;
I'll see what I can do to reduce the anonymous map usage.
Upstream's documentation appears to have a chapter on reducing
memory consumption via options for disk-backed memory maps,
although on a first try it didn't seem to reduce the committed
memory consumption. Let's see if I can get somewhere…

Have a nice day, :)
--
.''`. Étienne Mollier <emol...@debian.org>
: :' : gpg: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from /dev/pts/1, please excuse my verbosity
`- on air: Tangerine Dream - Stratosfear

Étienne Mollier

Dec 4, 2023, 10:50:05 AM
Alright, I think I managed to get somewhere with the program's
configuration options: using an older reference config from
2019, shasta doesn't seem to reserve unnecessary amounts of
memory for itself, and the test should now go through. If this
works, we can forget about checking available memory on the
host, and about the Ubuntu-specific change (apparently, Steve
Langasek disabled the first command of the test suite, which
explains why there were no issues on that distribution).

I will push an updated autopkgtest soon to close the issue.

Have a nice day, :)
--
.''`. Étienne Mollier <emol...@debian.org>
: :' : gpg: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from /dev/pts/2, please excuse my verbosity
`-