Activity page unresponsive for jobs with large number of nodes


Alexander Szele

Mar 2, 2023, 7:06:07 AM
to rundeck-discuss
For process reasons, we had to delay our upgrade well past v3.4, so we only recently upgraded from v3.3 to v4.10.1 (following all the prescribed interim upgrade steps along the path). Everything is working except the activity log display pages for jobs with a large number of nodes (200+).

The problem seems to happen on the call to .../execution/ajaxExecState/... It simply takes forever and eventually times out. With 200-500 nodes the page eventually loads after 5+ minutes or so, but once we hit around 700+ nodes, the page doesn't load at all.

I've confirmed that it doesn't matter whether the job activity is a failure or success report, and it doesn't seem to be an issue of log size. A job that targets a single node produces an 8M log that we can display/download without issue, but the job that targets 700+ nodes, which usually produces a smaller log, won't even display the "whale" alert. The page simply goes unresponsive and eventually dies.

We increased the system RAM from 4G to 8G and increased the Java heap from 2G to 4G. When loading the page, the system report shows no appreciable increase in Java heap usage - it never goes above about 30%.

There is no appreciable impact on the system CPU or RAM.

I purged the execution history from about 75k entries down to less than 25k (~6 months of retention); it had no impact on the load times of the 200- or 500-node jobs, and the 700-node job still won't load.

I'm not clear what version introduced the change that caused this, as I didn't notice it during any of our interim testing along the version-by-version upgrade path, but it seems likely to be related to the new UI introduced in 3.4 (just a guess). Notably, we never experienced any node-count-based delay in activity log displays on v3.3 (just the "whale" alerts on large logs...)

I've searched and found nothing in this discussion group and nothing anywhere online related to this.

Any help or suggestions would be appreciated!

Thanks!
Alex Szele
Message has been deleted

rac...@rundeck.com

Mar 2, 2023, 7:48:48 AM
to rundeck-discuss
Hi Alex,

Do you see any clue in the service.log file? Does this occur only on jobs dispatched to 200+ nodes? Do you see any errors in the browser dev console?

Regards!

Alexander Szele

Mar 2, 2023, 9:39:03 AM
to rundeck-discuss
I've tailed the service log while loading a page and literally nothing happens. The access log just shows the page call and a single subsequent ajaxExec call. I've noted no errors anywhere in any of the logs. In fairness, it doesn't appear to be an "error" so much as the page just takes an extremely long time to load. The browser console gives the same feedback as when I look at an activity log for one node that loads immediately:
  • GET https://<url>/assets/static/css/pages/execution-show.css net::ERR_ABORTED 404 (Not Found)
  • DevTools failed to load source map: Could not load content for https://<url>/assets/vendor/perfect-scrollbar.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE
  • DevTools failed to load source map: Could not load content for https://<url>/assets/execution/knockout.mapping.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE
I see the source map warnings for myriad objects throughout the site. That missing CSS file is the only thing semi-unique to the activity log page, but again, logs for a small number of nodes load immediately.

As for the number of nodes, it seems to be an additive issue: each node in a job log takes a couple of seconds to load. A job with one node loads more or less instantly, 10 is a bit slower, 100 is a lot slower, and 200-500 is worse still. Currently, by my timing, our nightly full-environment jobs that hit about 750 nodes take about 6.5 minutes to load the activity log.

And it's definitely the nodes. We have a gitlab backup job that hits a single node and produces an 8.5M log. The activity log loads immediately (and displays the whale alert appropriately). The nightly job that runs log maintenance/review hits 750 nodes and produces only about a 4.5M log, but takes 6.5 minutes to display the activity log.

I've gone through the "tuning" page on Rundeck and everything is basically in order. ulimit is 65535. We doubled the heap size from 2G to 4G. I'm noting zero impact on CPU or RAM on either the database or the application servers. The Java Heap report on the server status page never climbs above 30-35%. It's like the page is working super hard at something but only using a fraction of the system resources to do it.
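For anyone retracing the heap bump mentioned above: on a standard RPM/DEB install this would typically be set via the rundeckd service profile. The file paths and variable name below are assumptions based on a stock packaged install, not details from this thread:

```shell
# /etc/sysconfig/rundeckd (RPM) or /etc/default/rundeckd (DEB) -- paths are
# assumptions for a packaged install; adjust for your environment.
# Doubles the max heap from the 2G default-ish setting to 4G, as described above.
RDECK_JVM_SETTINGS="-Xms1g -Xmx4g"
```

As noted in the post, though, heap headroom wasn't the bottleneck here: usage never climbed above ~30-35%.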

Honestly, I'm just plumb out of ideas on this one.

Thanks!
Alex

rac...@rundeck.com

Mar 2, 2023, 2:47:36 PM
to rundeck-discuss
Hi Alex,

Is your Rundeck instance behind a proxy server? If so, could you try using the instance directly? (without the proxy).

Regards.

Alexander Szele

Mar 2, 2023, 3:10:12 PM
to rundeck-discuss
Yep. We've got a proxy handling the SSL certs and https forwarding (per the standard config recommendations, unless that has changed...). Using a local hosts file, I was able to log in directly against the server over http on port 4440. Same thing: ~6 minutes to load activity report 81424 (1.3MB rdlog file with ~750 nodes); instant load for activity report 81346 (8.3MB rdlog file with a single node).

Not sure what's happening here. Every time I mention the data-base, something deletes my reply... Suffice to say, the data-base appears fine. (Hopefully this will make it past the defenses...)

I also tried implementing the new setting from https://github.com/rundeck/rundeck/pull/7508. I set rundeck.logviewer.trimOutput as low as 100kb and restarted Rundeck, with no difference in behaviour.
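For anyone following along, the property from that PR goes in rundeck-config.properties; the value shown is just the aggressive one tried above (which, per the post, made no difference to this node-count problem):

```properties
# rundeck-config.properties
# Trims very large execution log output in the log viewer (PR #7508).
# Set very low here as a test; had no effect on the node-count slowdown.
rundeck.logviewer.trimOutput=100kb
```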

I suspect there is something in the rundeck-config properties or framework config files that changed between v3.3.16 and v4.10.1 that I'm not accounting for. The only change I know of for sure is that rundeck.projectsStorageType = "filesystem" is deprecated.

I've gone through the release notes for every version between 3.3.16 and 4.10.1 but found nothing else that seems even vaguely related.

It's notable that the missing CSS file ("/assets/static/css/pages/execution-show.css") doesn't match the format of the other assets, which have a long string of letters/numbers appended to the filename, but I can't fathom a) how I could fix that, or b) why it would cause the node display problem...

Thanks again!
Alex

Alexander Szele

May 18, 2023, 12:20:00 PM
to rundeck-discuss
Sorry for necroing this thread, but we recently upgraded to 4.13 hoping it might resolve the issue highlighted above, and the problem persists. We've had no luck tracking down a source for this issue, and it remains a thorn in our operations. Does anyone have any further suggestions for troubleshooting?

Thanks

rac...@rundeck.com

May 18, 2023, 4:16:53 PM
to rundeck-discuss
Hi Alex,

I tried to reproduce the issue, unsuccessfully. Is it still happening in another browser, like Firefox? I've seen similar issues related to Google Chrome.

Apologies for the "Message has been deleted" errors. It was an odd and annoying bug in Google Groups; it's fixed now.

Regards.

Alexander Szele

May 18, 2023, 4:52:13 PM
to rundeck-discuss
I just confirmed the same behavior in Chrome, Firefox, and Edge. All three do the same thing: the page bogs down, and the browsers give their various "this page is slowing down the browser" warnings prompting you to stop the page. But if you let it run for 5-10 minutes, the page eventually finishes processing.

I also just ran it through the Firefox Profiler, pulling the profile after about 15 seconds, and there's a nested mess of JavaScript that ends in a get HTMLCollection.length (under the MutationCallback tree). That HTMLCollection.length accounts for something on the order of 6,000 self-samples and is responsible for almost the entire running time of the page load (when I checked). It's in a script listed as https://<our server url>/assets/static/components/uisockets-f616e545f28116cec9fa19bdff6c2cc3.js

Based on what I'm seeing, and the fact that the delay is additive (well, probably multiplicative if not exponential) as the number of nodes increases, I have to assume there is some kind of nested loop running node-by-node on page display. But I have no idea whether this is a configuration issue I can actually resolve or a codebase problem that only manifests with some combination of settings/build on our end.
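To illustrate the suspected pattern (a hypothetical simulation, not Rundeck's actual code): if a mutation callback re-walks a live collection every time a node row is inserted, which is what repeated HTMLCollection.length reads under a MutationCallback would suggest, total work grows quadratically with the node count:

```javascript
// Hypothetical sketch of the suspected O(n^2) pattern -- not Rundeck's code.
// Each inserted "node row" fires a callback that re-walks the whole
// collection, as a MutationObserver reading a live HTMLCollection would.
function simulateRender(nodeCount) {
  const rows = [];
  let lengthReads = 0; // stand-in for the HTMLCollection.length self-samples

  const onMutation = () => {
    // The callback re-scans every existing row on every insertion.
    for (let i = 0; i < rows.length; i++) {
      lengthReads++;
    }
  };

  for (let n = 0; n < nodeCount; n++) {
    rows.push({ node: n });
    onMutation(); // observer fires once per DOM insertion
  }
  return lengthReads;
}

// 10x the nodes => ~100x the work, matching the "additive, probably
// multiplicative" slowdown described above.
console.log(simulateRender(10));  // 55
console.log(simulateRender(100)); // 5050
```

Going from 100 to 750 rows under this pattern is roughly a 56x increase in work, which would fit the observed jump from "a lot slower" to 6.5-minute loads.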

rac...@rundeck.com

May 19, 2023, 1:34:31 PM
to rundeck-discuss
Hi Alex,

I created a new 702-node environment on 4.13... Issue reproduced. This is the message you're seeing, right?

I also tested on 3.4, and it doesn't happen there.

Please open a new issue here; I'll mark it as reproduced.

Thank you!

Alexander Szele

May 19, 2023, 2:53:45 PM
to rundeck-discuss
Yes! Obviously the page name in the error message is different, but yup, that's exactly what we're running into. Thanks, I'll open the issue straight away.

Alexander Szele

May 19, 2023, 3:14:57 PM
to rundeck-discuss
Thanks again. Issue submitted as https://github.com/rundeck/rundeck/issues/8335