No notification after stage fail


Hans Dampf

Nov 21, 2024, 4:26:15 AM
to go-cd
Hi,

[attachment: fail.png]

We ran into this problem tonight. The stage passed, but then the pipeline crashed. The crash itself is not the main problem.

The main problem is that we have an extra failure stage configured that sends a mail with enriched information. My guess is that this failure stage did not get triggered because the previous stage succeeded.

I found this part in the documentation, but I'm not sure whether it would have worked in this case.

If not, then there should be a way to catch these events outside the stage and job level, but still within the pipeline, and generate an alert.

Regards

Chad Wilson

Nov 21, 2024, 4:47:54 AM
to go...@googlegroups.com
What does the GoCD server think the status of that job and stage is? What does "the pipeline crashed" mean? If the stage is shown as passed by the GoCD server, what downstream problem did this cause? Did subsequent stages or pipelines not trigger correctly?

The error looks like your agent had some kind of problem talking to the server or reporting its status. If that's the case, then there is potentially a chicken-and-egg problem here that might prevent reporting at the level of scope you suggest, depending on the root cause of the issue:
  • if the agent couldn't talk to the server to report its status, and the error was not recoverable by the agent, then you'd probably need to monitor agents for such connectivity errors. Agents do expose a health API which reports their ability to connect to the server - this could be monitored externally (see the sketch after this list), but it would not have "pipeline/stage/job" scope.
  • if the GoCD server itself had a problem preventing it from correctly updating the status from the agent, it would depend on what the cause of that error is/was and whether it happened within the scope of a stage/job. If the stage/job was left in an indeterminate state, there'd potentially be a similar problem with knowing how to report the status at the pipeline/stage's scope. The server has its own internal error reporting/tracking (the one that drives the red errors/warnings in the UI, which also has its own API for external consumption), but we'd need to know what the root cause was and whether it triggered such an error/warning.
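For the first case, here's a minimal external monitor sketch in Python. It assumes the agent status API is enabled on its default port 8152 and that you have a personal access token for the server API; the hostnames, URLs and the alert function are placeholders, and you should check the endpoint paths against your GoCD version:

    import requests

    AGENTS = ["agent1.example.com", "agent2.example.com"]  # hypothetical agent hosts
    SERVER = "https://gocd.example.com"                    # hypothetical server URL
    TOKEN = "..."                                          # personal access token

    def agent_connected(host):
        """True if the agent reports it can reach the GoCD server."""
        try:
            r = requests.get(
                f"http://{host}:8152/health/latest/isConnectedToServer", timeout=5)
            return r.status_code == 200
        except requests.RequestException:
            return False

    def server_health_messages():
        """The server's own errors/warnings (what drives the red banners in the UI)."""
        r = requests.get(
            f"{SERVER}/go/api/server_health_messages",
            headers={"Accept": "application/vnd.go.cd.v1+json",
                     "Authorization": f"Bearer {TOKEN}"},
            timeout=10)
        r.raise_for_status()
        return r.json()

    def alert(text):
        print(f"ALERT: {text}")  # wire this up to mail/chat instead

    if __name__ == "__main__":
        for host in AGENTS:
            if not agent_connected(host):
                alert(f"agent {host} cannot reach the GoCD server")
        for msg in server_health_messages():
            if msg.get("level") == "ERROR":
                alert(f"server health: {msg.get('message')}")

Run it from cron or your monitoring system. Note it only has host/server scope, not pipeline/stage/job scope - which is exactly the limitation above.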

-Chad


Wolfgang Achinger

Dec 20, 2024, 2:42:46 AM
to go...@googlegroups.com
OK, I think we found the problem. We had disk I/O latency peaks of 1-6 s on the GoCD server, which resulted in the pipeline timeouts. So the cause of the problem was the storage beneath the VM. We are still investigating the reason for these peaks, but we noticed the same peaks on several VMs on the same storage at the same time.
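In case it helps anyone catch this kind of storage stall earlier, here's a rough sampler sketch in Python that reads Linux /proc/diskstats and computes the average latency per completed I/O over an interval (similar to iostat's await). The device name and the 1 s threshold are assumptions for illustration:

    import time

    DEVICE = "sda"  # hypothetical device backing the GoCD server's data directory

    def read_stats(dev):
        """Returns (completed ops, ms spent on I/O) for a device from /proc/diskstats."""
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if parts[2] == dev:
                    ops = int(parts[3]) + int(parts[7])   # reads + writes completed
                    ms = int(parts[6]) + int(parts[10])   # ms spent reading + writing
                    return ops, ms
        raise ValueError(f"device {dev} not found")

    while True:
        ops0, ms0 = read_stats(DEVICE)
        time.sleep(10)
        ops1, ms1 = read_stats(DEVICE)
        d_ops = ops1 - ops0
        await_ms = (ms1 - ms0) / d_ops if d_ops else 0.0
        if await_ms > 1000:  # ~1 s average latency, like the peaks we saw
            print(f"ALERT: {DEVICE} avg I/O latency {await_ms:.0f} ms over 10 s")

The field offsets follow the kernel's diskstats documentation (reads/writes completed and the milliseconds spent on each).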
