Long pauses between steps and stages

Fábio Almeida

Sep 1, 2025, 12:33:18 PM
to jenkins...@googlegroups.com

Hello Everyone!

My team and I have been struggling with a problem where Jenkins stalls a lot between steps and stages, particularly in change validation (which we trigger with the Gerrit Trigger plugin). The periodic executions (daily, during the night) do not exhibit the same behaviour, although they do exactly the same work as the change validation.

This happens in all stages and all steps, but it is easiest to see in a single-step stage: the step (a bat command) takes less than 30 seconds to execute, yet Jenkins reports over 5 minutes of execution time for the stage.
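
One way to pin down where the extra minutes go is to compare the timestamps of consecutive console-log lines. The following is only a minimal sketch: it assumes the build log was saved with per-line timestamps of the form `[2025-09-01T12:00:00.000Z]`, and the `find_gaps` helper and the sample lines are invented for illustration, not taken from a real build.

```python
import re
from datetime import datetime, timedelta

# Assumed line format: "[2025-09-01T12:00:00.000Z] rest of the line".
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\]")


def find_gaps(lines, threshold=timedelta(seconds=30)):
    """Return (gap, previous_line, next_line) for silent periods >= threshold."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # skip lines that carry no timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if prev_ts is not None and ts - prev_ts >= threshold:
            gaps.append((ts - prev_ts, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps


# Hypothetical log excerpt, invented for illustration only.
SAMPLE_LOG = [
    "[2025-09-01T12:00:00.000Z] [Pipeline] stage",
    "[2025-09-01T12:00:01.500Z] [Pipeline] bat",
    "[2025-09-01T12:00:25.000Z] build finished",
    "[2025-09-01T12:05:10.000Z] [Pipeline] }",
]

if __name__ == "__main__":
    for gap, before, after in find_gaps(SAMPLE_LOG):
        print(f"{gap} of silence between:\n  {before}\n  {after}")
```

If the long gaps consistently sit between the end of a step's output and the next `[Pipeline]` line, that would point at the controller rather than at the step itself.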

A couple of months ago we upgraded to Jenkins 2.492.3 (from 2.319.1), and the problem started to manifest itself.
Recently, we bumped the controller's resources from 4 CPUs and 5 GB of RAM to 8 CPUs and 10 GB of RAM, with no discernible improvement in performance (for this particular case).

We have upwards of 100 declarative pipelines with the occasional script step (some of them multibranch), and we are using the "Performance-optimized" option in the Speed/Durability configuration.
We have 16 agents with a total of 45 executors, and at any given time we have ~20-30 concurrent executions throughout the day (and ~5 concurrent executions throughout the night).
Unfortunately our pipelines are long, with most of them taking ~1h.

We've gathered GC logs and have seen some worrisome patterns.

After the recent increase in resources, this pattern seems to have disappeared, but we are still experiencing the same long pauses, as I've said before.

Our current configurations to the JVM are: "-server -Xms8G -Xmx8G \
      -XX:MaxDirectMemorySize=1G -XX:MaxMetaspaceSize=512M \
      -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+ParallelRefProcEnabled -XX:+ExplicitGCInvokesConcurrent \
      -XX:+UseStringDeduplication -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions"
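
To judge whether GC pauses alone could account for multi-minute stalls, the stop-the-world pauses in a GC log can be totalled. This is a minimal sketch under assumptions: it expects the JDK unified logging format produced by `-Xlog:gc` with default decorations (a flag that is not in the options above), and the `summarize_pauses` helper and the sample lines are invented for illustration.

```python
import re

# Assumed format (JDK unified logging, -Xlog:gc, default decorations), e.g.
# "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 35.120ms"
PAUSE_RE = re.compile(
    r"^\[(?P<uptime>\d+\.\d+)s\].*?GC\((?P<id>\d+)\) (?P<kind>Pause .*?) "
    r"\d+[KMG]->\d+[KMG]\(\d+[KMG]\) (?P<ms>\d+\.\d+)ms$"
)


def summarize_pauses(lines, slow_ms=1000.0):
    """Total the stop-the-world pauses and list those at or above slow_ms."""
    count, total_ms, max_ms, slow = 0, 0.0, 0.0, []
    for line in lines:
        match = PAUSE_RE.match(line.strip())
        if not match:
            continue  # concurrent phases and other lines are ignored
        ms = float(match.group("ms"))
        count += 1
        total_ms += ms
        max_ms = max(max_ms, ms)
        if ms >= slow_ms:
            slow.append((float(match.group("uptime")), match.group("kind"), ms))
    return {"count": count, "total_ms": total_ms, "max_ms": max_ms, "slow": slow}


# Hypothetical log lines, invented for illustration only.
SAMPLE_GC_LOG = [
    "[10.100s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 35.500ms",
    "[20.200s][info][gc] GC(2) Concurrent Mark Cycle 120.000ms",
    "[30.300s][info][gc] GC(3) Pause Full (G1 Compaction Pause) 8000M->7900M(8192M) 4200.250ms",
]

if __name__ == "__main__":
    print(summarize_pauses(SAMPLE_GC_LOG))
```

Pauses that add up to only a few seconds per hour would suggest that the stalls come from somewhere other than garbage collection.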


With all that being said, I'm not sure how to continue with the analysis:

  • Do we need even more resources? CPU or RAM?
    • We are running the Jenkins controller as a Docker container; is that inadvisable?
  • Am I constraining the JVM too much? Should I just define a minimum and maximum heap and leave the rest alone?
  • Are there any known issues that could explain this behaviour? Either with plugins or Jenkins itself?
    • I find it odd that we have problems during the day but not during the night.
      • Is it the Gerrit Trigger plugin's processing of events?
      • Or is it just too many concurrent executions for our current resources?
  • Are we just doing something plain wrong in our pipelines?
    • Are there any known patterns that might be causing this problem?

Pardon the long post, but any general advice on how to fix or further debug this problem would be much appreciated.

Best Regards,

PS - I'm not sure if this mailing list is OK with screenshots; pardon me if not.

Fábio Almeida
Platform Engineering Team Lead

SISCOG - Sistemas Cognitivos, SA
A Campo Grande, 378 - 3º, 1700-097 Lisboa, Portugal
T +351 217 529 100
W www.siscog.pt

Optimising the resources of the world





Maciej Jaros

Sep 2, 2025, 4:53:12 AM
to jenkins...@googlegroups.com
I guess that depends on what you are doing, but I see a pretty similar effect after building an application with Maven. The build runs on a separate Jenkins node, and after the build it takes a minute or more for the process to complete. Jenkins needs that time to transfer large artifacts (war, jar) from the build node to the main node. In our case the jar is large (~200MB) because it is a standalone file, so naturally it takes some time to transfer. Not sure if that is your problem.

So I would look at where your steps are executed, and at whether you really need a separate job for every small thing or could make it an optional step in one job. Or you could build things on the main node.

Cheers,
Maciej


Fábio Almeida

Sep 3, 2025, 5:28:54 AM
to Jenkins Users

Thank you for your reply Maciej.

However, I don't think I have the same problem you are describing, since no artifacts are being archived into Jenkins itself; we archive things in Nexus at a later stage.
Additionally, as I said, the nightly jobs, which are identical to the change-validation jobs, aren't affected by this slowdown.

At the moment, I'm convinced that this is a problem with either the change discovery done by the Gerrit Trigger plugin or load (the latter seems less likely, because doubling our resources had virtually no effect on the problem).

If anyone would have any pointers on how to further pursue this issue, I'd be deeply grateful.

Fábio

Björn Pedersen

Sep 4, 2025, 8:40:58 AM
to Jenkins Users
fabio....@siscog.pt wrote on Wednesday, September 3, 2025 at 11:28:54 UTC+2:

At the moment, I'm convinced that this is a problem with either the change discovery done by the Gerrit Trigger plugin or load (the latter seems less likely, because doubling our resources had virtually no effect on the problem).



Gerrit Trigger does not 'discover changes'; it listens to stream events (at least in the normal setup). You can configure it to wait for replication to finish as well, which can introduce a certain lag (depending on how your Gerrit replication instance is configured, e.g. with a long replication delay).
If you look at the classic view (not Blue Ocean / pipeline stages...), you should see timing information:

This run spent:

8.8 sec waiting;   <== we have a 5 sec. replication delay
8 min 58 sec build duration;
9 min 7 sec total from scheduled to completion
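
As a quick sanity check on the figures above, the waiting time plus the build duration should add up to roughly the total, which shows how small a share the queue wait is. A minimal sketch follows; the `to_seconds` helper and its unit table are made up for illustration, and the summary may use other unit labels in other cases.

```python
import re

# Unit labels assumed from the summary above ("min", "sec"); "hr" and "ms" are guesses.
UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1, "ms": 0.001}
TOKEN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(hr|min|sec|ms)")


def to_seconds(text):
    """Convert a duration such as '8 min 58 sec' into seconds."""
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in TOKEN_RE.findall(text))


waiting = to_seconds("8.8 sec")
building = to_seconds("8 min 58 sec")
total = to_seconds("9 min 7 sec")

if __name__ == "__main__":
    print(f"waiting + building = {waiting + building:.1f} s, reported total = {total:.1f} s")
    print(f"share of time spent waiting = {waiting / total:.1%}")
```

Running the same comparison on one of the slow change-validation builds would show whether the lost time is counted as waiting (queue or replication delay) or as build duration.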