1. As far as I know, Matthias (currently off duty) did most of the work setting up the original CI infrastructure. It is based on traditional tools: make and Docker.
Perhaps this helps you understand the current pitiful state of the CI infrastructure. Since the infrastructure itself is broken and unstable, I think documenting it now is premature.
The biggest issue with the reliability of the CI is a deep design decision in the way the tests are set up. Many doctests have an inherent random element, mostly on purpose: randomization increases the surface of code paths that are tested and thereby uncovers new bugs. The disadvantage is that some test runs will produce failures that are unrelated to the changes in the PR. I don't really see anything that can be done at the level of the CI infrastructure to improve the situation, but I would be happy to get new ideas.
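
For reference, such a failure can usually be reproduced locally by re-running the affected file with the random seed reported in the CI log, via the --random-seed option of sage -t. A minimal sketch (the file path and seed are placeholders, and ./sage is assumed to be run from the Sage root):

    # Re-run one doctest file with the random seed from the failing CI run.
    # The path and seed below are placeholders, not from an actual failure.
    import subprocess

    seed = "123456789"                        # copy the seed from the CI log
    path = "src/sage/combinat/partition.py"   # the file that failed

    subprocess.run(["./sage", "-t", f"--random-seed={seed}", path])
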
There is some work on such issues (searching for 'random', 'flaky', or 'CI' in the GitHub issues should bring up most of them, e.g. https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22).
It would be nice to have a GitHub label for these kinds of issues so they can be found more easily. I'm not sure who has permissions to add new labels.
I also have a half-working notebook (at https://github.com/sagemath/sage/pull/39100) that extracts the failing tests from the CI runs.
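
For what it's worth, the extraction itself doesn't need much machinery. Something along these lines gets most of the way there (the log path and the exact format of the doctest summary lines are assumptions on my part):

    # Rough sketch: grep the doctest summary lines of the form
    # "sage -t ... # N doctests failed" out of a saved CI log.
    import re
    import sys

    FAIL_LINE = re.compile(r"sage -t [^\n]*# \d+ doctests? failed")

    with open(sys.argv[1]) as f:  # e.g. a log downloaded with the GitHub CLI
        for line in FAIL_LINE.findall(f.read()):
            print(line.strip())
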
I know nothing about the internal implementation here, but this description alone suggests a change in practice: when a test on a PR fails, rerun that test on the existing branch (without the pull request) with the same random seed to see if it also fails there. If it does, automatically file a separate issue.
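
As a rough sketch of that idea (the file, seed, and branch name are placeholders, gh is the GitHub CLI, and in practice Sage would need to be rebuilt after the checkout):

    # Re-run the failing doctest file on the base branch with the same random
    # seed, and open a separate issue if it fails there as well.
    import subprocess

    failing_file = "src/sage/rings/ring.py"   # reported by the PR's test run
    seed = "123456789"                        # same seed the PR run used

    subprocess.run(["git", "checkout", "develop"], check=True)
    rerun = subprocess.run(
        ["./sage", "-t", f"--random-seed={seed}", failing_file],
        capture_output=True,
        text=True,
    )

    if rerun.returncode != 0:
        # The failure is not caused by the PR, so report it separately
        # instead of blocking the PR.
        subprocess.run(
            [
                "gh", "issue", "create",
                "--title", f"Random doctest failure in {failing_file}",
                "--body",
                f"Fails on develop with --random-seed={seed}:\n\n"
                + rerun.stdout[-2000:],
            ],
            check=True,
        )
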
That sounds like a good idea that is probably technically possible, but I have no idea how it would be implemented. I do think we already have something that ignores failures for doctests that also failed on the last commit to the develop branch, which is similar enough that I think this should be possible.
One issue with this suggestion is that it would make CI take longer, which would aggravate the problem of potentially waiting a long time for a CI server to become available for other tests. For setups with limited CI minutes I think it's common to run a limited test suite on PRs and the full test suite on develop. We sort of do this already: more distros are tested on develop than on PRs, for instance.
Another possibility is to run the most important/stable/relevant tests first and only run the full test suite after those pass. Obviously this has the drawback of waiting longer for things to finish if we don't start jobs until the first jobs succeed, but it's something to consider if we reach a point where we are running more CI tests than our CI servers can keep up with. One way to do this would be to run tests on our most stable systems first (Linux) and only test the less stable systems (Mac, Windows) if those pass, or to only run the PDF docbuild after the HTML docbuild passes (a toy sketch of this kind of gating follows below). We'd probably want a label to override this for PRs that are trying to fix something like a Windows-specific issue. If our CI usage right now isn't a big problem then I don't think we need to do this; I'm just pointing out that there are options if it becomes an issue in the future.
- Sometimes PRs introduce reproducible build errors on a small subset of systems ... then after some time disable the failing system
I don't think tests need to be disabled; rather, the CI should not report a PR as failing if the same failure occurs on develop. So still run the failing tests, but don't report the workflow as failing if the same test fails on develop. I think we already have something like this for the fixed-seed tests, but not for the random seeds.
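
In other words, the comparison the CI would need is roughly this (the inputs are whatever identifiers the existing workflow already records for failing doctests; file names are used here purely as an example):

    # Only treat failures as blocking when they do not also occur on develop.
    def new_failures(pr_failures, develop_failures):
        return sorted(set(pr_failures) - set(develop_failures))

    develop = {"src/sage/foo.py", "src/sage/bar.py"}      # baseline from develop
    pr = {"src/sage/foo.py", "src/sage/new_feature.py"}   # failures on the PR

    regressions = new_failures(pr, develop)
    if regressions:
        print("New failures introduced by the PR:", regressions)
        raise SystemExit(1)
    print("All failures also occur on develop; not blocking the PR.")
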
That is a trickier situation. I do think random segfaults should be considered a bug, but they also shouldn't cause CI to mark a PR as failing (unless that PR introduced a new random segfault, although that's probably easier to say than to detect). As long as we have some CI workflow that still runs and reports those tests (even if only on develop), I think it would be okay to skip them on PRs. I'm not sure what the exact behaviour of the # known bug doctest marker is (does it always skip the test or only in some workflows?), but either that or something like it should work.
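
For context, that marker is the doctest annotation that looks like the following in a docstring (the function here is invented, not from the Sage library). As far as I know such tests are skipped unless the corresponding optional tag is requested explicitly, but the doctesting documentation would be the place to confirm that.

    # Invented example showing the annotation.
    def frobnicate(x):
        r"""
        Return the frobnication of ``x``.

        EXAMPLES::

            sage: frobnicate(2)
            4

        The negative case currently gives the wrong sign::

            sage: frobnicate(-2)  # known bug
            -4
        """
        return x * x
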