The past few weeks our CI jobs have been failing for a variety of reasons, making it increasingly difficult to decide whether PRs are safe to merge. However, some fixes have gone in, and I think we are at a stage where, except for one issue, the remaining failures are genuine issues (either in the tests or the code).
The remaining issue is jobs being "skipped"/"cancelled"[1] for no apparent reason after they have been running for more than an hour or so. For example, the JVM 8 and JVM 11 jobs have now started to frequently get "skipped" right in the middle of running tests, around 1 hr 20 min (sometimes 1 hr 30 min) into the job. The worst part is that no logs get shown or made available for download once the job is cancelled, making it really hard to understand what happened. Initially I thought our job timeouts were too low and the jobs were being cancelled on timeout. But now that I've gone and checked, the timeout isn't what's causing this, because we have set it to 4 hours[2].
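For reference, the 4-hour limit is the job-level `timeout-minutes` setting in the workflow. A minimal sketch of what such a job looks like (the job and step names here are illustrative, not our actual ci-actions.yml; only the `timeout-minutes` key mirrors what we set):

```yaml
# Sketch of a job-level timeout in a GitHub Actions workflow.
# Job/step names are illustrative; only timeout-minutes reflects
# the 4-hour limit mentioned above.
jobs:
  jvm-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 240   # 4 hours - well above the ~1h20m cancellations we see
    steps:
      - uses: actions/checkout@v2
      - name: Run JVM tests
        run: mvn -B verify
```

So if this timeout were the cause, the cancellations would show up around the 4-hour mark, not at ~1 hr 20 min.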
After looking around a bit, I think I might have found the cause. I suspect there is some issue with our "cancel-previous-runs" action[3]. I looked at its recent runs for Quarkus and noticed that one of them was scheduled and ran at around the time one of the JVM tests jobs got cancelled. Looking into the "cancel-previous-runs" CI job logs, I see this[4]:
2020-08-23T03:29:15.5846083Z ***
2020-08-23T03:29:15.5846964Z Workflow ID is: ci-actions.yml
2020-08-23T03:29:16.4194484Z 220269448 : https://api.github.com/repos/quarkusio/quarkus/actions/workflows/597537 : in_progress : 8511
2020-08-23T03:29:16.4202540Z First: jaikiran/quarkus/qk-11511
2020-08-23T03:29:16.4229539Z Cleaning up orphan processes
I'm not 100% sure, but does it look like that job identified the JVM tests job (which was currently in progress) as eligible for cancelling and cancelled it? Looking at the code of the "cancel-previous-runs" action, I would have expected it to log a message saying it is cancelling the job, but I don't see that log, so I'm not really 100% sure this is the cause. But given the timestamps involved in these runs, and the lack of any other plausible theory I can think of, perhaps this is indeed what's causing the issue?
[1] https://github.com/quarkusio/quarkus/runs/1017271541?check_suite_focus=true
[2] https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml#L104
[3] https://github.com/n1hility/cancel-previous-runs
[4] https://github.com/quarkusio/quarkus/runs/1017388856?check_suite_focus=true
-Jaikiran
That is an interesting theory and could be a very simple explanation... we should in any case add that logging to help isolate it.
Jason - care to investigate ?
/max
https://xam.dk/about
--
You received this message because you are subscribed to the Google Groups "Quarkus Development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to quarkus-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/quarkus-dev/7eb2a3e1-6dbf-eb35-3c83-048787f9f446%40gmail.com.
On Aug 23, 2020, at 6:55 PM, Max Rydahl Andersen <mand...@redhat.com> wrote:
FYI, I raised the concern with GitHub support but still no response.
Would it make sense to run without the cleaner for a day or two and see
if things start running fine, or would we run out of resources too fast
anyway?
What I have observed is that they are marked as cancelled/skipped until
the last job within the workflow completes. Once the last job in the
workflow completes, these cancelled/skipped ones get marked as failed.
At least, that's what I observed on this one (which I was closely
monitoring): https://github.com/quarkusio/quarkus/runs/1017271541
Interestingly, this specific PR https://github.com/quarkusio/quarkus/pull/11551, which shows one of the jobs as cancelled (JDK Java 11 JVM Tests, cancelled 4 hours ago in 1h 40m 51s) https://github.com/quarkusio/quarkus/pull/11551/checks?check_run_id=1020762644, has this in the logs:
2020-08-24T10:35:46.2127997Z ##[error]The operation was canceled.
2020-08-24T10:35:47.0063961Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
For whatever reason, this specific job managed to retain these logs, unlike the other ones.
-Jaikiran
I am starting to wonder if for some reason we are ending up killing the runner (the OOM killer, or some such resource limit, for example) by running far too many processes (in the tests), plus these services[1] in the job's VM?
Their documentation[2] states:
Each virtual machine has the same hardware resources
available.
2-core CPU
7 GB of RAM memory
14 GB of SSD disk space
I don't know if we end up hitting that 7 GB RAM limit, which then
perhaps triggers a process kill or some such thing. I wish these
runners were more accessible, at least for gathering data like this.
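One low-cost way to gather at least some of that data would be an extra workflow step that logs memory usage in the background, so the job log captures the picture just before any kill. A sketch (the step name and placement are hypothetical, not something we currently have):

```yaml
# Hypothetical diagnostic step: log memory usage once a minute in the
# background so the job log shows RAM/swap pressure leading up to a kill.
- name: Monitor memory usage
  run: |
    while true; do
      date
      free -m          # overall RAM and swap usage on the runner VM
      sleep 60
    done &
```

If the runner VM really is hitting the 7 GB limit, the last few `free -m` samples before the cancellation should show it.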
[1] https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml#L126
[2] https://docs.github.com/en/actions/reference/virtual-environments-for-github-hosted-runners
-Jaikiran
2020-08-25T08:07:05.4482227Z [INFO]
2020-08-25T08:07:05.4482391Z [INFO] Results:
2020-08-25T08:07:05.4482524Z [INFO]
2020-08-25T08:07:05.5492255Z [INFO] Tests run: 39, Failures: 0, Errors: 0, Skipped: 0
2020-08-25T08:07:05.5493807Z [INFO]
2020-08-25T08:07:10.9324179Z ##[error]The operation was canceled.
2020-08-25T08:07:10.9660808Z Cleaning up orphan processes
Reading through some issues reported in the GitHub Actions repo, I
found a very similar one reported against the macOS runners[1]. It
looks like their team is aware of some kind of runner issue (which I
haven't seen explained in that issue). So I decided to go ahead and
ask if they think our issue is a result of some changes in their
runners: https://github.com/actions/virtual-environments/issues/1491
[1] https://github.com/actions/virtual-environments/issues/736
-Jaikiran
That seems to have had a very good impact! I think with that, and with Stuart's PR which is now running in CI (rebased onto the latest master), we should be back in good shape - 🤞