Documentation and state of Sage CI

Vincent Macri

Aug 25, 2025, 2:30:11 PM
to sage-devel
Hello all,

This is somewhat open-ended, but as I'm sure we're all aware, our GitHub
Actions CI occasionally fails on PRs for reasons unrelated to the PR
itself. According to
https://github.com/sagemath/sage/actions/metrics/performance this month
over 100,000 minutes have been spent running jobs that ended up failing.
Obviously some jobs will fail (the whole point of CI is to fail when
something is wrong), but frequently CI fails for reasons unrelated to the
PR. I don't know the details of how many CI minutes Sage has per month,
but I'm assuming it's not unlimited.

I don't think we have any significant documentation on our CI setup,
which makes it difficult for contributors who have not touched the CI
before to propose improvements or understand why a job failed. So I
wanted to ask, broadly speaking, how was our CI setup designed? Is the
number of CI minutes we use a month a problem for us? Do people who have
worked on the CI have ideas for how to improve it that they haven't had
the time to implement yet?

I don't think CI details belong in our official documentation, but would
it be possible to document it somewhere (for example, a README in
https://github.com/sagemath/sage/tree/develop/.github/workflows) to
explain enough to give a starting point to Sage contributors who want to
improve the CI?

--
Vincent Macri (he/him)

Kwankyu Lee

Aug 25, 2025, 8:56:43 PM
to sage-devel
A brief history:

1. As far as I know, Matthias (currently off duty) did the most work in setting up the original CI infrastructure. This is based on traditional tools: make and docker.
2. More recently, Tobias made (and is still making) changes to the CI infrastructure, based on meson, with which he intends to replace make.
3. I helped Matthias maintain the CI (especially regarding doc previews), and Dima is helping Tobias. Unfortunately, they did not cooperate with each other.

Perhaps this helps you understand the current pitiful state of the CI infrastructure. Since the CI infrastructure itself is broken and unstable, I think that documenting it now is premature.

By the way, [...] need review.

Tobia...@gmx.de

Aug 26, 2025, 9:55:28 AM
to sage-devel
The biggest issue with the reliability of the CI is a deep design decision in the way the tests are set up. Many doctests have an inherent random element, and this is mostly on purpose, to increase the surface of code paths that are tested and thereby discover new bugs. The disadvantage is that unfortunately some test runs will produce failures that are not connected to the changes in the PR. I don't really see anything that can be done on the level of the CI infrastructure to improve the situation, but I would be happy to get new ideas.

What would help is to a) open a new issue whenever you see an unrelated test failure (so that we can keep track of when/how it happens) and b) work on such issues (searching for 'random' or 'flaky' or 'CI' in the GitHub issues should bring up most of them, e.g. https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22). There have been some recent pushes to resolve some of those random failures, notably by user202729.

I also have a half-working notebook that extracts the failing tests from the CI runs at https://github.com/sagemath/sage/pull/39100, which would help with statistics and point a) above.

>  Is the number of CI minutes we use a month a problem for us?

No, not really. I don't quite remember what plan the Sagemath org is on, but it is not limited in how many minutes per month we can use; instead we have a certain quota of 'runners' that can work in parallel. We do hit this limit sometimes, especially after a new release, when certain longer-running jobs are triggered and a lot of people update their branches. Then it takes a bit longer until the CI results for a PR roll in. We used to have far more serious issues in this regard, but by now it should work relatively smoothly.

There are two other sources of 'systematic' failures:
- Sometimes PRs introduce reproducible build errors on a small subset of systems. This then leads to failures of the CI runs that check those systems after a new release. Matthias used to invest a lot of time and energy into fixing those; I don't have the time to do this but will open an issue if I see such a failure and then after some time disable the failing system (recent example: https://github.com/sagemath/sage/pull/40675). 
- The buildbots tested by Volker on a new release differ in many aspects from the GitHub CI runs. But Volker only looks at the buildbots (to my knowledge) when deciding if a PR is okay to be merged. In particular, almost all recent failures of the linter workflow are a result of this discrepancy. My goal and hope is that we can retire the buildbots sooner rather than later.


On Tuesday, August 26, 2025 at 8:56:43 AM UTC+8 Kwankyu Lee wrote:
1. As far as I know, Matthias (currently off duty) did the most work in setting up the original CI infrastructure. This is based on traditional tools: make and docker.

Small clarification: Matthias introduced the "portability" workflows that check sage-the-distro on various systems and are run after a new release. All the remaining workflows (essentially everything that runs now for PRs) were initially contributed by me 4 or 5 years ago (with the idea to fully migrate to github at some point).

  

Dylan Thurston

Aug 26, 2025, 11:05:42 AM
to 'Tobia...@gmx.de' via sage-devel
On Tue, Aug 26, 2025 at 06:55:28AM -0700, 'Tobia...@gmx.de' via sage-devel wrote:
> The biggest issue with the reliability of the CI is a deep design
> decision in the way the tests are setup. Many doctests have an
> inherent random element, and this is mostly on purpose to increase
> the surface of code paths that are tested and thereby discover new
> bugs. The disadvantage is that unfortunately some test runs will
> produce failures that are not connected to the changes of the PR.
> ...

I know nothing about the internal implementation here, but just this
description suggests a change in practice: when a test on a PR fails,
rerun that test on the existing branch (without the pull request) with
the same random seed to see if it also fails there. If it does, then
automatically file a separate issue.

(I see potential issues with different numbers of random numbers being
used pre- and post-PR, but there are also ways to mitigate that, eg
maintaining a separate random seed used only for generating test
cases.)
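
A rough sketch of what such a re-check could look like, assuming it is
run from a built checkout of develop and that the failing file and the
seed have already been extracted from the PR's CI log (the script and
the issue title below are only illustrations, not an existing tool):

#!/usr/bin/env python3
# Re-run a doctest file that failed on a PR, on the develop branch,
# with the same random seed; if it fails here too, the failure is
# unrelated to the PR, so file a tracking issue with the gh CLI.
import subprocess
import sys

failing_file = sys.argv[1]  # e.g. src/sage/foo/bar.py, taken from the PR's CI log
seed = sys.argv[2]          # the random seed printed by 'sage -t' in that log

result = subprocess.run(
    ["./sage", "-t", "--long", f"--random-seed={seed}", failing_file]
)

if result.returncode != 0:
    # The same test fails on develop with the same seed: not the PR's fault.
    subprocess.run([
        "gh", "issue", "create",
        "--title", f"Random doctest failure in {failing_file} (seed {seed})",
        "--body", "Reproduced on develop with the same random seed as the PR run.",
    ])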

--Dylan

Vincent Macri

Aug 26, 2025, 6:58:13 PM
to sage-...@googlegroups.com
On 2025-08-25 6:56 p.m., Kwankyu Lee wrote:
Perhaps this helps you understand the current pitiful state of the CI infrastructure. Since the CI infrastructure itself is broken and unstable, I think that documenting it now is premature.
Even if formal documentation isn't possible, I still think it would be good to have something informal explaining what people are working on for the CI, how it works, what needs to be improved, etc. Maybe this could be a use case for GitHub Discussions (https://github.com/sagemath/sage/discussions)? It would be nice to have something a bit more centralized than various GitHub issues and PRs, even if it's just a list of the relevant issues and PRs (GitHub's projects feature might also be useful for this: https://docs.github.com/en/issues/planning-and-tracking-with-projects). My worry is that there seem to be very few people who are knowledgeable about our CI. To put it crudely, the bus factor for our CI infrastructure seems to be low.

On 2025-08-26 7:55 a.m., 'Tobia...@gmx.de' via sage-devel wrote:
The biggest issue with the reliability of the CI is a deep design decision in the way the tests are set up. Many doctests have an inherent random element, and this is mostly on purpose, to increase the surface of code paths that are tested and thereby discover new bugs. The disadvantage is that unfortunately some test runs will produce failures that are not connected to the changes in the PR. I don't really see anything that can be done on the level of the CI infrastructure to improve the situation, but I would be happy to get new ideas.
We also occasionally have PRs fail for issues other than random inputs. Sometimes there is a networking issue (not sure if there's anything to be done about that). I've sometimes seen tests fail that did not involve any randomness, where the failure was in code unrelated to the PR and could not be reproduced (for example, https://github.com/sagemath/sage/actions/runs/17134650987/job/48607483558#step:15:8858). I wonder if the stranger/unreproducible failures might be caused by some faulty caching on the CI server, but I don't know enough about how the CI server is configured and what is cached between builds to say whether that might be the case.

On 2025-08-26 7:55 a.m., 'Tobia...@gmx.de' via sage-devel wrote:
work on such issues (searching for 'random' or 'flaky' or 'CI' in the github issues should bring up most of them, eg https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22).

It would be nice to have a GitHub label for these kinds of issues so they can be found more easily. I'm not sure who has permissions to add new labels.

On 2025-08-26 7:55 a.m., 'Tobia...@gmx.de' via sage-devel wrote:
I also have a half-working notebook that extracts the failing tests from the CI runs at https://github.com/sagemath/sage/pull/39100

Getting this from half-working to working and reporting the results somewhere easy to find sounds like a worthwhile endeavour, and would be a good start to collecting information about the CI in one easy-to-find place. I'll open a GitHub issue for this (I'm not saying I'm going to try to work on it myself anytime soon, so if anyone else reading this wants to work on it, feel free).

On 2025-08-26 8:54 a.m., Dylan Thurston wrote:
I know nothing about the internal implementation here, but just this
description suggests a change in practice: when a test on a PR fails,
rerun that test on the existing branch (without the pull request) with
the same random seed to see if it also fails there. If it does, then
automatically file a separate issue.

That sounds like a good idea that is probably technically possible, but I have no idea how it would be implemented. I think we already have something that ignores failures for doctests that also failed on the last commit to the develop branch, which is similar enough that this should be possible.

One issue with this suggestion is that it will mean CI takes longer, which would aggravate the problem of potentially waiting a long time for a CI server to be available for other tests. For setups with limited CI minutes I think it's common to run a limited test suite on PRs and the full test suite on develop. We sort of do this already; more distros are tested on develop than on PRs, for instance.

Another possibility is to run the most important/stable/relevant tests first and only run the full test suite after those pass. Obviously this has the drawback of waiting longer for things to finish if we don't start jobs until the first jobs succeed, but it's something to consider if we reach a point where we are running more CI tests than our CI servers can keep up with. One way to do this would be to run tests on our "most stable" systems first (Linux) and only test the less stable systems (Mac, Windows) if those pass. Or only run the PDF docbuild after the HTML docbuild passes. We'd probably want to have some label to override this for PRs that are trying to fix something like a Windows-specific issue. If our CI usage right now isn't a big problem then I don't think we need to do this; I'm just pointing out that there are options if it becomes an issue in the future.

dmo...@deductivepress.ca

Aug 26, 2025, 7:21:37 PM
to sage-devel
> The biggest issue with the reliability of the CI is a deep design decision in the way the tests are setup. 
> Many doctests have an inherent random element, and this is mostly on purpose to increase the surface 
> of code paths that are tested and thereby discover new bugs. The disadvantage is that unfortunately 
> some test runs will produce failures that are not connected to the changes of the PR. ...

Is this still true? My understanding is that all CI runs now use the same random seed, so I didn't think this was the "biggest issue" any more.

Kwankyu Lee

Aug 26, 2025, 10:30:28 PM
to sage-devel
- Sometimes PRs introduce reproducible build errors on a small subset of systems ... then after some time disable the failing system

I think we should not drop support for a system (platform) that fails because of bugs introduced by PRs. Similarly,

"Sometimes a PR introduces a bug that breaks existing features. Then after some time remove the features." 

would be a horrible idea.

Vincent Macri

Aug 26, 2025, 11:28:15 PM
to sage-...@googlegroups.com
  • My understanding is that all CI runs now use the same random seed
We do use random seeds, and we should.

https://github.com/sagemath/sage/issues/40632 was found because we have a test that generates a random input to a function, and 0 is not a valid input for that function. For some seeds, like the one used by the CI run that found this bug, the test generated 0 as the input value. This is a real bug, so I think it demonstrates that we should use random seeds. It was also unrelated to the PR. Perhaps one could run tests with both a fixed and a random seed to avoid unrelated failures while still testing random inputs, but doubling the amount of CI tests we run seems somewhat wasteful.
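
As a toy illustration of this failure pattern (the function and doctest below are made up, not the actual code from that issue): a randomized doctest like the following passes for most seeds but fails for any seed that happens to produce the one invalid input.

def reciprocal(n):
    """
    Return 1/n (only valid for nonzero n).

    TESTS::

        sage: n = ZZ.random_element(0, 10)  # value depends on the doctest's random seed
        sage: reciprocal(n) * n             # raises ZeroDivisionError whenever n == 0
        1
    """
    return 1 / n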

  1. Sometimes PRs introduce reproducible build errors on a small subset of systems ... then after some time disable the failing system
  2. I think we should not drop support of a system (platform) failing because of bugs introduced by PRs.

I think you two are saying two different things (I added emphasis for the difference) and I generally agree with both.

If Sage fails to build on a system, especially an old system that has reached end of life (which I think is the case for what Tobias was talking about), I don't think we need to spend time trying to support it. For example, we regularly drop support for old Python versions, but there are surely some systems where newer Python versions aren't available. I think that's fine. Ideally, we would have removed the unsupported systems from CI when we dropped support for the old Python. Better yet would be if the CI knew which versions of the distros we test are still supported, without us having to update the CI configuration whenever there is a new or EOL Ubuntu/Fedora/whatever.

As for bugs: if Sage builds on a system and a test then fails, that's a problem. If Sage builds but has failing tests on, say, the most recent Fedora, that's something we obviously should not ignore. If it builds and tests fail on an older system, maybe some Ubuntu LTS that's past EOL, I think it's still worth investigating why the failure occurred: maybe it's relevant to Sage and tells us something, maybe it doesn't. Depending on what test failed and why, we can evaluate whether it's a regression or just a matter of trying to run Sage on an EOL system where a recent enough version of some compiler/library is unavailable. In a perfect world we would track down which specific version of which library is causing the bug and update the build dependencies to disallow that version. In practice I think that might be an unrealistic amount of work for us that provides little practical benefit.

dmo...@deductivepress.ca

Aug 26, 2025, 11:52:24 PM
to sage-devel
I took a look at the "sage -t" in "test-long (src/sage[a-f]*)" of a few recent PRs (going back to June), and it looks to me like they all used the same random seed: 286735480429121101562228604801325644303.  Where are you seeing a different seed?

Tobia...@gmx.de

Aug 27, 2025, 12:14:21 AM
to sage-devel
To expand slightly on the above: we do have tests with a fixed seed (build.yml) and with a variable random seed (ci-meson.yml). But even with a fixed seed you sometimes get failures, often non-reproducible ones. Some of these issues have been known for a long time (e.g. https://github.com/sagemath/sage/issues/29528 has been open for 5 years now).
A related question would be: shall we temporarily disable tests that are known to randomly fail?
Advantage: less noise due to random failures
Disadvantage: less coverage

>  I wonder if the stranger/unreproducible failures might be caused by some faulty caching on the CI server, but I don't know enough about how the CI server is configured and what is cached between builds to say if that might be the case.

In my experience, these issues are almost never specific to CI (i.e. the same error could in principle be reproduced by running the same commands locally on a developer's machine). The only exceptions are the issues related to "docker pull/push" that you sometimes see. Those come from the design decision to run the CI in a new docker container. Fixing those issues by redesigning the corresponding workflows would be desirable (see below).

>  I think we should not drop support of a system (platform) failing because of bugs introduced by PRs.

In theory, I agree. In practice, however, not a single one of the portability issues that I've opened has been fixed. And there is no point in burning CPU cycles if we already know the build will fail on that system; it's also confusing to people looking at the CI results, since it's not clear that this is a known issue.

>  It would be nice to have a GitHub label for these kinds of issues so they can be found more easily. I'm not sure who has permissions to add new labels.

Good idea! Needs to be done by one of the github org admins.

I will see if I can find some time to document the CI infrastructure a bit. In my opinion, its design is pretty stable by now, and I at least don't have any major plans for further restructuring in the near future. A few items I would like to work on:
- Migrate the "long" tests to meson (https://github.com/sagemath/sage/pull/40158)
- Redesign the ci-distro workflow to work directly in the system's container instead of going through docker + tox (similar to how ci.yml works). For macOS this was done in https://github.com/sagemath/sage/pull/40516.
- Rework `dist.yml` to be based on meson, build wheels for sagelib, and make a few general stability improvements (e.g. use the gh CLI/action to create the release)

Dima Pasechnik

Aug 27, 2025, 1:26:48 AM
to sage-...@googlegroups.com
Not always; e.g. we regularly have to update OS versions as they reach EOL, stop being supported by GitHub Actions, etc.
Testing on systems/configurations nobody uses any more is mostly a waste of time and effort (if one not only runs tests, but also tries to fix issues that only appear on such an obsolete system).



Vincent Macri

Aug 27, 2025, 6:38:55 PM
to sage-...@googlegroups.com
> To slightly expand on the above: We do have tests with fixed
> (build.yml) and with variable random seed (ci-meson.yml).

I was not aware that only the meson workflow uses a random seed. That's
good to know, and it explains why the meson jobs seem to fail more often
in CI. As a reviewer/developer it's helpful to know which workflows are
the most stable, and which could fail for unrelated reasons. This fact
should be written down somewhere, probably in the CI documentation if
you end up having time to write it.

> A related question would be: shall we temporarily disable tests that
> are known to randomly fail?
> Advantage: less noise due to random failures
> Disadvantage: less coverage

I don't think tests need to be disabled, but rather the CI should not
report a PR as failing if the same failure occurs on develop. So still
run the failing tests, but don't report the workflow as failing if the
same test fails on develop. I think we already have something like this
for the fixed seed tests, but not for the random seeds.

On the other hand, it would be nice if the CI highlighted a test which
passes in a PR but fails on develop.
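
Conceptually that is just a set comparison between the failures of the
PR run and of a baseline run on develop; a minimal sketch, assuming we
had artifacts listing the failing doctest identifiers of both runs (the
file names below are made up):

import sys

def read_failures(path):
    # One failing doctest identifier per line, e.g. "sage.rings.foo.bar".
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

pr_failures = read_failures("failures_pr.txt")            # hypothetical artifact
develop_failures = read_failures("failures_develop.txt")  # hypothetical artifact

new_failures = pr_failures - develop_failures   # these should fail the workflow
fixed_by_pr = develop_failures - pr_failures    # worth highlighting separately

for t in sorted(new_failures):
    print("new failure:", t)
for t in sorted(fixed_by_pr):
    print("fails on develop, passes here:", t)

# Only failures that are not already present on develop block the PR.
sys.exit(1 if new_failures else 0)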

> >  I wonder if the stranger/unreproducible failures might be caused by
> some faulty caching on the CI server, but I don't know enough about
> how the CI server is configured and what is cached between builds to
> say if that might be the case.
>
> From my experience, these issues are almost never specific to CI (i.e.
> the same error could be reproduced in principle by running the same
> commands locally on a developer's machine). The only exceptions are
> issues related to "docker pull/push" that you sometimes see. Those
> come from the design decision to run the CI in a new docker container.
> Fixing those issues by redesiging the corresponding workflows would be
> desirable (see below).

This failure in code unrelated to the PR is due neither to a random test
nor to a docker issue, and I cannot reproduce it:
https://github.com/sagemath/sage/actions/runs/17134650987/job/48607483558#step:15:8858

I agree that failures like this are very rare though. We do use a lot of
CI every month, so I would not be surprised to learn that the expected
number of monthly random hardware glitches (or solar flares, or whatever
your favourite explanation is for strange computer phenomena) for our CI
setup is non-negligible.

> >  It would be nice to have a GitHub label for these kinds of issues
> so they can be found more easily. I'm not sure who has permissions to
> add new labels.
>
> Good idea! Needs to be done by one of the github org admins.
Would whoever has the permissions for this consider adding two new
labels to GitHub? One called "CI" (or something similar) to be used for
issues/PRs relating to the CI (we have "CI fix" but that's for CI fixes
that should be merged before other PRs). And one called "random seed
failure" (or something similar) for issues that report or PRs that fix
tests that consistently fail for specific random seeds.

Tobia...@gmx.de

Aug 27, 2025, 8:42:50 PM
to sage-devel
I don't think tests need to be disabled, but rather the CI should not
report a PR as failing if the same failure occurs on develop. So still
run the failing tests, but don't report the workflow as failing if the
same test fails on develop. I think we already have something like this
for the fixed seed tests, but not for the random seeds.

That would work for tests where only the output differs sometimes. But most of the problems we have at the moment come from random segfaults, and for those it's probably easier not to run them at all instead of trying to filter them out afterwards.

Vincent Macri

Aug 28, 2025, 3:50:30 PM
to sage-...@googlegroups.com

That is a trickier situation. I do think random segfaults should be considered a bug, but they also shouldn't cause CI to mark a PR as failing (unless that PR introduced a new random segfault, although that's probably easier to say than to detect). As long as we have some CI workflow that still runs and reports those tests (even if only on develop), I think it would be okay to skip them on PRs. I'm not sure what the exact behaviour of "# known bug" is (does it always skip the test, or just in some workflows?), but either that or something like it should work.

Tobia...@gmx.de

Aug 28, 2025, 11:46:48 PM
to sage-devel
Lines annotated with "known bug" are always skipped. What do you think about the following:
1. Introduce a new "random failure" tag and use that to annotate doctests that are known to fail sometimes, either with segfaults, timeouts, or differing output (a sketch of what this could look like follows below).
2. Normally on CI, skip the execution of such tests.
3. In one special run after a release (or perhaps also for PRs?), only execute the "random failure" tests. Since there will not be that many such annotated tests, we could even run it a couple of times and only report an error if none of the runs was successful.
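
Purely as a sketch of what the annotation from point 1 could look like (the "random failure" tag does not exist yet, the syntax simply mirrors existing doctest tags like "# known bug" and "# long time", and the function names and issue number are placeholders):

sage: stable_computation()        # always run
7
sage: some_flaky_computation()    # random failure (segfault), see issue NNNNN
42

Normal CI runs would then skip the tagged line, and the dedicated job from point 3 would execute only the lines carrying the tag.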

Vincent Macri

Aug 29, 2025, 1:58:43 PM
to sage-...@googlegroups.com
> What do you think about the following:
> 1. Introduce a new "random failure" tag and use that to annotate
> doctests that are known to fail sometimes (either with segfaults,
> timeouts, or differing output)
> 2. Normally on CI, skip the execution of such tests.
> 3. In one special run after a release (or perhaps also for PRs?), only
> execute the "random failure" tests. Since there will not be that many
> such annotated tests, we could even run it a couple of times and only
> report an error if none of the runs was successful.

As long as "after a release" includes beta releases I think something
like that would be ideal.

I do think the failing tests should fail very clearly on develop at
least so we don't forget, whereas for pull requests they're mostly
noise. Maybe run it a few times and report an error for develop if there
are any failures, and report an error for PRs if there is a random
failure test that failed every time? We'll still probably get some
unrelated failures on PRs, but if they are in a separate job that's
called "random failures" or something like that I think it will be clear
to developers that it's not a big deal if the test failed. Probably
still worth double-checking the test passes on your own machine
(especially if you touched possibly-related code), but not necessary to
spend time trying to debug. Ideally, "# random failure" tags would be
accompanied by a note on what the random failure is, so it's clear if the
type of failure changes (for example, if it previously gave the wrong
answer and now segfaults, or vice versa).

If GitHub Actions allows this, I think it should be added as a new job in
the existing workflows, so we don't spend extra time building Sage just
for these tests, as long as it's still reported in GitHub as a separate
check.

Michael Orlitzky

Aug 29, 2025, 2:13:42 PM
to 'Tobia...@gmx.de' via sage-devel
On 2025-08-28 20:46:47, 'Tobia...@gmx.de' via sage-devel wrote:
> Lines annotated with "known bug" are always skipped. What do you think
> about the following:
> 1. Introduce a new "random failure" tag and use that to annotate doctests
> that are known to fail sometimes (either with segfaults, timeouts, or
> differing output)
> 2. Normally on CI, skip the execution of such tests.
> 3. In one special run after a release (or perhaps also for PRs?), only
> execute the "random failure" tests. Since there will not be that many such
> annotated tests, we could even run it a couple of times and only report an
> error if none of the runs was successful.

I think this would be counter-productive because it would completely
hide these failures from developers, but leave them there for users to
hit. Not tested / known bug would be better in that regard, since it
hides them from everyone.

But I really don't think we need a special solution for this. These
tests usually aren't that hard to fix. In the time it will take to
come up with an elaborate way to ignore them, most of them could be
fixed.