For process reasons we had to delay our upgrade well past v3.4, so we only recently went from v3.3 to v4.10.1 (following all the interim upgrade steps in the prescribed path). Everything is working except the activity log display pages for jobs that target a large number of nodes (200+).
The problem seems to be the call to .../execution/ajaxExecState/... It simply takes forever and eventually times out. With 200-500 nodes the page eventually loads after 5+ minutes, but at around 700+ nodes the page doesn't load at all.
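To rule out the browser/UI layer, the slow call can be timed directly against the API. This is a minimal sketch using the execution state endpoint (the closest API equivalent of the ajaxExecState UI call); $RD_URL, $RD_TOKEN, and the execution ID are placeholders for your instance, and the API version number may need adjusting for 4.10.x:

```shell
# Time how long Rundeck takes to return the per-node execution state.
# $RD_URL, $RD_TOKEN, and EXEC_ID are placeholders for your instance.
EXEC_ID=12345
curl -s -o /dev/null \
  -H "X-Rundeck-Auth-Token: $RD_TOKEN" \
  -w "HTTP %{http_code} in %{time_total}s\n" \
  "$RD_URL/api/41/execution/$EXEC_ID/state"
```

If this call is also slow for the 700-node job, the bottleneck is server-side state assembly rather than the new UI's rendering of it.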
I've confirmed it doesn't matter whether the job activity is a failure or a success report, and it doesn't seem to be a log-size issue: a job that targets a single node produces an 8MB log we can display and download without issue, while the job that targets 700+ nodes, which usually produces a smaller log, won't even display the "whale" alert. The page simply goes unresponsive and eventually dies.
We increased the system RAM from 4GB to 8GB and the Java heap from 2GB to 4GB. While the page loads, the system report shows no appreciable increase in Java heap usage (it never goes above about 30%), and there is no appreciable impact on system CPU or RAM either.
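As a cross-check on the system report numbers, heap occupancy and GC activity can be watched from the shell while the page loads. This assumes a JDK on the box, and the pgrep pattern is a guess at how the Rundeck process is named:

```shell
# Print heap occupancy and GC stats for the Rundeck JVM every 5 seconds.
# The pgrep pattern is an assumption; adjust it to match your process name.
RD_PID=$(pgrep -f rundeck | head -n1)
jstat -gcutil "$RD_PID" 5000
```

If the old-generation column stays flat while the request hangs, that supports the theory that the delay is not memory pressure.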
I purged the execution history from about 75k entries down to under 25k (~6 months of retention); it had no impact on the load times of the 200- or 500-node jobs, and the 700-node job still won't load.
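For anyone wanting to reproduce the purge, it can be scripted against Rundeck's bulk-delete executions endpoint rather than clicked through the UI. This is only a sketch: $RD_URL and $RD_TOKEN are placeholders, the IDs are examples, and the API version may need adjusting for your release:

```shell
# Bulk-delete executions by ID via the Rundeck API.
# The IDs below are examples; build the real list from an executions query.
curl -s -X POST \
  -H "X-Rundeck-Auth-Token: $RD_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"ids": [101, 102, 103]}' \
  "$RD_URL/api/41/executions/delete"
```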
I'm not sure which version introduced the change, since I didn't notice the problem during any of our version-by-version interim testing along the upgrade path, but it seems likely to be related to the new UI introduced in 3.4 (just a guess). Notably, on v3.3 we never saw any delay in activity log displays based on node count, just the "whale" alerts on large logs.
I've searched this discussion group and the web more broadly and found nothing related to this.
Any help or suggestions would be appreciated!
Thanks!
Alex Szele