CI job failures - cause behind jobs getting oddly cancelled/skipped?


Jaikiran Pai

Aug 23, 2020, 12:04:23 AM
to Quarkus Development mailing list

The past few weeks our CI jobs have been failing for a variety of reasons, making it increasingly hard to decide whether PRs are safe to merge. However, some fixes have gone in, and I think we are at a stage where, except for one issue, the remaining failures are genuine issues (either in the tests or the code).

The remaining issue is jobs being "skipped"/"cancelled"[1] for no apparent reason after they have been running for more than an hour or so. For example, the JVM 8 and JVM 11 jobs have now started to frequently get "skipped" right in the middle of running tests, after around 1 hr 20 min (sometimes 1 hr 30 min) of the job being in progress. The worst part is that no logs are shown or available for download once the job gets cancelled, making it really hard to understand what happened. Initially I thought our job timeouts were too low and were causing the cancellations, but now that I've gone and checked, the timeout isn't the cause, because we have set it to 4 hours[2].

After looking around a bit, I think I might have found the cause. I suspect there might be some issue with our "cancel-previous-runs" action[3]. I looked at its recent runs for Quarkus and noticed that one of them was scheduled and run at around the time one of the JVM tests jobs got cancelled. Looking into the "cancel-previous-runs" CI job logs, I see this[4]:

2020-08-23T03:29:15.5846083Z ***
2020-08-23T03:29:15.5846964Z Workflow ID is: ci-actions.yml
2020-08-23T03:29:16.4194484Z 220269448 : https://api.github.com/repos/quarkusio/quarkus/actions/workflows/597537 : in_progress : 8511
2020-08-23T03:29:16.4202540Z First: jaikiran/quarkus/qk-11511
2020-08-23T03:29:16.4229539Z Cleaning up orphan processes

I'm not 100% sure, but does it look like that job somehow identified the JVM tests job (which was in progress) as eligible for cancelling and cancelled it? Looking at the code of the "cancel-previous-runs" action, I would have expected it to log a message saying it's cancelling the job, but I don't see that log, so I'm not certain this is the cause. But given the timestamps involved in these runs and the lack of any other plausible theory I can think of, perhaps this is indeed what's causing the issue?

[1] https://github.com/quarkusio/quarkus/runs/1017271541?check_suite_focus=true

[2] https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml#L104

[3] https://github.com/n1hility/cancel-previous-runs

[4] https://github.com/quarkusio/quarkus/runs/1017388856?check_suite_focus=true

-Jaikiran

Max Rydahl Andersen

Aug 23, 2020, 7:55:26 PM
to Jaikiran Pai, Quarkus Development mailing list

That is an interesting theory and could be a very simple explanation... we should in any case add that logging to help isolate it.

Jason - care to investigate?
/max
https://xam.dk/about

--
You received this message because you are subscribed to the Google Groups "Quarkus Development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to quarkus-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/quarkus-dev/7eb2a3e1-6dbf-eb35-3c83-048787f9f446%40gmail.com.

Jason Greene

Aug 23, 2020, 10:22:47 PM
to mand...@redhat.com, Jaikiran Pai, Quarkus Development mailing list
I haven’t looked into these particular runs, but cancelling is the purpose of this job. It will only allow one run for any given fork branch (effectively every PR). So for example if you push twice, the first will get cancelled, and the second will continue.
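The behavior Jason describes (one surviving run per fork branch; older in-progress runs get cancelled) can be sketched roughly as below. This is a hedged illustration in Java, not the actual code of the n1hility/cancel-previous-runs action; the record fields and method names are assumptions for the example:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative selection logic: for each head branch, keep only the newest
// run and mark every older queued/in-progress run for cancellation.
public class CancelPreviousRuns {
    record WorkflowRun(long id, String headBranch, long createdAtEpoch, String status) {}

    static List<Long> runsToCancel(List<WorkflowRun> runs) {
        return runs.stream()
                .filter(r -> "in_progress".equals(r.status()) || "queued".equals(r.status()))
                .collect(Collectors.groupingBy(WorkflowRun::headBranch))
                .values().stream()
                .flatMap(branchRuns -> branchRuns.stream()
                        .sorted(Comparator.comparingLong(WorkflowRun::createdAtEpoch).reversed())
                        .skip(1))          // the newest run on each branch survives
                .map(WorkflowRun::id)
                .collect(Collectors.toList());
    }
}
```

So if you push twice to the same branch, the older run's id is selected for cancellation and the newer one is left alone; a run that is the only one on its branch should never be selected.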

I’m not sure why there would not be a log though. It could be that output isn’t flushed and gets truncated. It could also be a bug. 


Guillaume Smet

Aug 24, 2020, 7:45:23 AM
to Jaikiran Pai, Quarkus Development mailing list
That's an interesting theory.

My issue with this theory is that we have jobs failing [1] (with absolutely no logs, even though they were running before being declared failed) and also jobs being skipped, whereas I would have expected only cancelled jobs if this action were the cause.

Most of the time, the jobs are failing, not cancelled.



Max Rydahl Andersen

Aug 24, 2020, 8:39:25 AM
to Guillaume Smet, Jaikiran Pai, Quarkus Development mailing list


> That's an interesting theory.
>
> My issue with it is that we have jobs failing [1] (with absolutely no
> logs
> whereas they were running before being declared as failing) and also
> skipped. Whereas I would have expected only cancelled jobs if it was
> this
> action going on.
>
> Most of the time, the jobs are failing, not cancelled.
>
> [1]
> https://github.com/quarkusio/quarkus/pull/11554/checks?check_run_id=1020915801

FYI, I raised the concern with GitHub support, but there's still no response.

Would it make sense to run without the cleaner for a day or two and see
if things start running fine, or would we run out of resources too fast anyway?


/max


Guillaume Smet

Aug 24, 2020, 8:49:39 AM
to Max Rydahl Andersen, Jaikiran Pai, Quarkus Development mailing list
On Mon, Aug 24, 2020 at 2:39 PM Max Rydahl Andersen <mand...@redhat.com> wrote:
> fyi, I raised the concern with GitHub support but still no response.
>
> Would it make sense to run without the cleaner for a day or two and see
> if things starts running fine or will
> we run out resources too fast anyway ?

We will run out of resources fast but I would say it's worth a try anyway.

--
Guillaume

Jaikiran Pai

Aug 24, 2020, 10:01:39 AM
to Guillaume Smet, Quarkus Development mailing list

On 24/08/20 5:14 pm, Guillaume Smet wrote:
> That's an interesting theory.
>
> My issue with it is that we have jobs failing [1] (with absolutely no
> logs whereas they were running before being declared as failing) and
> also skipped. Whereas I would have expected only cancelled jobs if it
> was this action going on.
>
> Most of the time, the jobs are failing, not cancelled.

What I have observed is that they are marked as cancelled/skipped until
the last job within the workflow completes. Once the last job in the
workflow completes, these cancelled/skipped ones are marked as failed.
At least, that's what I observed on this one (which I was closely
monitoring): https://github.com/quarkusio/quarkus/runs/1017271541

-Jaikiran

Guillaume Smet

Aug 24, 2020, 10:34:57 AM
to Jaikiran Pai, Quarkus Development mailing list
On Mon, Aug 24, 2020 at 4:01 PM Jaikiran Pai <jai.for...@gmail.com> wrote:
> What I have observed is that, they are marked as cancelled/skipped until
> the last job within the workflow completes. Once the last job in the
> workflow gets completed, these cancelled/skipped ones are being marked
> as failed. At least, that's what I observed on this one (which I was
> closely monitoring) https://github.com/quarkusio/quarkus/runs/1017271541

I don't know. Some of the JDK jobs are also randomly skipped entirely.

And when a build is cancelled by Jason's action, I usually get an email saying things have been cancelled.

For all these failing jobs, I get nothing.

I'm all for experimenting with dropping this action for a while. I'm just unsure it has anything to do with this, as the behavior is not really consistent.

Jaikiran Pai

Aug 24, 2020, 10:37:54 AM
to Guillaume Smet, Quarkus Development mailing list

Interestingly, this specific PR https://github.com/quarkusio/quarkus/pull/11551 which shows one of the jobs as cancelled (JDK Java 11 JVM Tests cancelled 4 hours ago in 1h 40m 51s) https://github.com/quarkusio/quarkus/pull/11551/checks?check_run_id=1020762644 has this in the logs:

2020-08-24T10:35:46.2127997Z ##[error]The operation was canceled.
2020-08-24T10:35:47.0063961Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

For whatever reason, this specific job managed to retain these logs unlike the other ones.

-Jaikiran

Guillaume Smet

Aug 24, 2020, 11:18:44 AM
to Jaikiran Pai, Quarkus Development mailing list

One job is currently skipped and mostly in limbo in the GH UI (you can still see an orange in-progress bullet).

I have no idea what status it will end up in.

Andy Damevin

Aug 24, 2020, 11:23:19 AM
to Guillaume Smet, Jaikiran Pai, Quarkus Development mailing list
And this one got the JVM 8 & 11 test jobs skipped without any reason:



--
Andy Damevin

Guillaume Smet

Aug 24, 2020, 11:24:42 AM
to Jaikiran Pai, Quarkus Development mailing list
It was indeed interesting.

The Java 11 job had started and did a few things. Then somehow got marked as "Skipped" and in the end, when the last job of this run finished, it was marked as "Failed".



Andy Damevin

Aug 24, 2020, 11:28:40 AM
to Guillaume Smet, Jaikiran Pai, Quarkus Development mailing list
Did someone manage to have any complete PR CI passing today?



--
Andy Damevin

Jaikiran Pai

Aug 24, 2020, 11:57:39 AM
to Guillaume Smet, Quarkus Development mailing list

I am starting to wonder if we are somehow killing the runner (via the OOM killer or some such resource limit, for example) by running far too many processes (in the tests) plus these services[1] in the job's VM.

Their documentation[2] states:

Each virtual machine has the same hardware resources available.

    2-core CPU
    7 GB of RAM memory
    14 GB of SSD disk space

I don't know if we end up hitting that 7 GB RAM limit, which then perhaps triggers a process kill or some such thing. I wish these runners were more accessible, at least for gathering data like this.
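Since the runners aren't accessible interactively, one low-tech option is to log memory headroom from inside the build itself, e.g. from a test listener or an extra CI step. A minimal sketch (the class name and output format here are made up for illustration; OS-level numbers would additionally need /proc/meminfo):

```java
// Tiny probe one could periodically invoke during a CI run to record how
// close the JVM is to its limits before a potential OOM kill.
public class MemoryProbe {
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        return "jvm-used=" + usedMb + "MB max=" + maxMb + "MB";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Note this only sees the current JVM; the surefire-forked test JVMs and the service containers each consume their own share of the 7 GB.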

[1] https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml#L126

[2] https://docs.github.com/en/actions/reference/virtual-environments-for-github-hosted-runners

-Jaikiran

Max Andersen

Aug 24, 2020, 12:59:43 PM
to guillau...@gmail.com, Jaikiran Pai, Quarkus Development mailing list
Can you make it happen? Let us know when you think it's better :)

/max

Guillaume Smet

Aug 24, 2020, 1:05:26 PM
to Jaikiran Pai, Quarkus Development mailing list
Ha ha, I just saw your email. I posted the same thing in your PR.

Considering it looks like something the UI doesn't handle at all (things marked as skipped when they had already started, things still in progress whose status then changed from skipped to failed), I'm also in favor of the theory that something is killing the process, and the OOM killer is the usual suspect.

Jason Greene

Aug 24, 2020, 1:20:25 PM
to Jaikiran Pai, Guillaume Smet, Quarkus Development mailing list
Yeah I think this is a likely reason. I will add some debug to catch oom_killer actions. We could also inoculate the runner from oom_killer via oom_adj.




Jason Greene

Aug 25, 2020, 12:15:23 AM
to Jaikiran Pai, Guillaume Smet, Quarkus Development mailing list
Just an update: I am looking into this. I also have a mechanism to get a shell on the system, so I can try to catch it happening.

-Jason 

Andy Damevin

Aug 25, 2020, 4:32:56 AM
to Jason Greene, Jaikiran Pai, Guillaume Smet, Quarkus Development mailing list
This is the first time I've been able to get the logs when a job gets skipped (always around the 1h25 mark):

2020-08-25T08:07:05.4482227Z [INFO] 
2020-08-25T08:07:05.4482391Z [INFO] Results:
2020-08-25T08:07:05.4482524Z [INFO] 
2020-08-25T08:07:05.5492255Z [INFO] Tests run: 39, Failures: 0, Errors: 0, Skipped: 0
2020-08-25T08:07:05.5493807Z [INFO] 
2020-08-25T08:07:10.9324179Z ##[error]The operation was canceled.
2020-08-25T08:07:10.9660808Z Cleaning up orphan processes




--
Andy Damevin

Guillaume Smet

Aug 25, 2020, 9:41:55 AM
to Andy Damevin, Jason Greene, Jaikiran Pai, Quarkus Development mailing list
One odd thing to add to the mystery: I usually have fewer problems with the backport pull requests, i.e. PRs that are not targeting master.

This one fully passed: https://github.com/quarkusio/quarkus/pull/11560 and I can't remember a backport PR failing with the same symptoms (and I create 2-3 of them per release).


Georgios Andrianakis

Aug 25, 2020, 10:03:44 AM
to Guillaume Smet, Andy Damevin, Jason Greene, Jaikiran Pai, Quarkus Development mailing list
Be very careful with PRs right now; it seems like GH Actions is totally stuck.

Georgios Andrianakis

Aug 25, 2020, 10:20:38 AM
to Guillaume Smet, Andy Damevin, Jason Greene, Jaikiran Pai, Quarkus Development mailing list
Should be OK now... I messed up a commit that changed the YAML a bit... Sorry!

Jaikiran Pai

Aug 26, 2020, 8:37:51 AM
to Guillaume Smet, Andy Damevin, Jason Greene, Quarkus Development mailing list

Reading through some issues reported in the GitHub Actions repo, I found a very similar one reported against macOS runners[1]. It looks like their team is aware of some kind of runner issue (which I haven't seen explained in that issue). So I decided to go ahead and ask whether they think our issue is a result of some changes in their runners: https://github.com/actions/virtual-environments/issues/1491


[1] https://github.com/actions/virtual-environments/issues/736


-Jaikiran

Georgios Andrianakis

Aug 26, 2020, 8:46:30 AM
to Jaikiran Pai, Guillaume Smet, Andy Damevin, Jason Greene, Quarkus Development mailing list
Thanks Jaikiran!

Let's see what they say

Guillaume Smet

Aug 26, 2020, 9:16:58 AM
to Georgios Andrianakis, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list
Jason posted a message on Zulip but not here: apparently we run out of both memory and swap. And that's our main issue.

Georgios Andrianakis

Aug 26, 2020, 9:19:11 AM
to Guillaume Smet, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list
Interestingly enough on that topic, https://github.com/quarkusio/quarkus/pull/11608 passed

Jaikiran Pai

Aug 26, 2020, 9:21:50 AM
to Guillaume Smet, Georgios Andrianakis, Andy Damevin, Jason Greene, Quarkus Development mailing list

On 26/08/20 6:46 pm, Guillaume Smet wrote:
> Jason posted a message on Zulip but not here: apparently we get out of
> both memory and swap. And that's our main issue.

Ah! I haven't been on Zulip much.

Thanks, so that at least solves the mystery. Now it's more a matter of seeing whether GitHub Actions can retain these logs and make it more apparent why a job failed. I'll add this detail to the GitHub Actions issue I created.

-Jaikiran

Stuart Douglas

Aug 27, 2020, 3:33:17 AM
to Guillaume Smet, Georgios Andrianakis, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list
I have been looking into this today. It turns out we are leaking our ClassLoader (and all the class definitions) as part of the BuildMojo run in the integration tests. This means that every one of the 100+ integration test modules leaks a lot of metaspace, which would definitely not be helping.
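For context, the usual shape of this kind of leak and its fix can be sketched as follows. This is a hypothetical illustration, not Quarkus's actual BuildMojo code: if each module's build creates a classloader that is never closed or released, that module's classes stay pinned in metaspace for the rest of the JVM's life.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative per-module build: close the throwaway loader and hold no
// lasting reference to it, so GC can reclaim the loaded class metadata.
public class PerModuleBuild {
    static String buildModule(URL[] moduleClasspath) throws Exception {
        // try-with-resources ensures the loader is closed after the build;
        // retaining it in a static field would pin all its classes
        try (URLClassLoader loader = new URLClassLoader(moduleClasspath,
                PerModuleBuild.class.getClassLoader())) {
            // ... load build steps through 'loader' and run them ...
            return "built with " + loader.getURLs().length + " classpath entries";
        } // loader closed here; nothing retains it afterwards
    }
}
```

Closing alone isn't always enough; any cache (thread locals, static registries, JDK internals) that still points at the loader or one of its classes keeps the whole thing alive.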

I have been looking into fixing it and I have found some issues but there are still more to go.

Stuart

Stuart Douglas

Aug 27, 2020, 4:02:03 AM
to Guillaume Smet, Georgios Andrianakis, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list

Guillaume Smet

Aug 28, 2020, 2:17:48 AM
to Stuart Douglas, Georgios Andrianakis, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list
We should continue the overall effort but I got the JVM tests to run successfully with this PR: https://github.com/quarkusio/quarkus/pull/11679 .

It has been merged.

Georgios Andrianakis

Aug 28, 2020, 2:19:44 AM
to Guillaume Smet, Stuart Douglas, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list
That seems to have had a very good impact!

I think with that and with Stuart's PR which is now running in CI (rebased onto the latest master), we should be back in good shape - 🤞

Stuart Douglas

Aug 28, 2020, 2:40:20 AM
to Georgios Andrianakis, Guillaume Smet, Jaikiran Pai, Andy Damevin, Jason Greene, Quarkus Development mailing list
On Fri, 28 Aug 2020 at 16:19, Georgios Andrianakis <gand...@redhat.com> wrote:
> That seems to have had a very good impact!
>
> I think with that and with Stuart's PR which is now running in CI (rebased onto the latest master), we should be back in good shape - 🤞

So with the previous run of my PR the JDK14 job passed, while the others all got most of the way through and then hit the intermittent Mutiny JAX-RS failure.

What I found today was that, in addition to the leak of the Quarkus ClassLoader, we are also leaking the ClassLoader from the docs annotation processor. As this uses Jackson, it has around 3k classes in it, so it represents a very significant leak that is likely the underlying cause of the issues we are having.
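A common way to verify this kind of fix is a weak-reference check: hold only a WeakReference to the suspect loader, drop every strong reference, and see whether GC can clear it. If something (say, a cached annotation processor or a Jackson type cache) still points at the loader, the reference never clears. A hedged sketch; names here are illustrative, not the actual Quarkus test code:

```java
import java.lang.ref.WeakReference;

// Returns true if GC was able to reclaim the referent, i.e. nothing is
// keeping the loader (and its classes) alive anymore.
public class LeakCheck {
    static boolean isCollectable(WeakReference<?> ref) throws InterruptedException {
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc();       // a hint, not a guarantee, hence the retry loop
            Thread.sleep(10);
        }
        return ref.get() == null;
    }
}
```

Running this against a freshly created, unreferenced loader should report collectable; running it against a leaked one would not.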

Hopefully the combination of these two changes should now get us well under the memory limits.

Stuart