To whom it may concern,
Over the past few weeks our system has been going down almost daily, often a few times a day, for roughly 30 minutes to 1 hour at a time. We are running our production application server on a VM with 4 cores and 32 GB of RAM, with OpenJDK Runtime Environment 1.8.0_372-b07 on RHEL 7 (x86_64). We are currently on XNAT 1.8.8, build 92, with the following plugins:
xnatMlPlugin - 1.0.2, dicom-query-retrieve - 1.0.4-xpi, datasetsPlugin - 1.0.3, xsyncPlugin - 1.5.0 (runs hourly on a single project that isn't updated often), ohifViewerPlugin - 3.5.1, xnat-ldap-auth-plugin - 1.1.0, containers - 3.3.1-fat, batchLaunchPlugin - 0.6.0.
All plugins were installed on May 25, 2023, so I don't see any recent plugin changes that would explain this behavior.
Our overall setup is three VMs: a web VM running a reverse proxy that points to the app VM (specs above), which in turn connects to a DB VM. During outages we cannot reach the app VM through a web browser, but everything is still accessible via SSH. Connection tests between the VMs during an outage show nothing network-related that would explain the behavior, so we started looking elsewhere, using the JavaMelody monitoring tool.
We are not currently running any containers on this production system (containers are being tested only on our mirror test system), so it is mainly being used as a repository at the moment. There are a few large projects with 5000+ subjects; in the past we extended the HTTP keep-alive timeout because the scan-type cleanup page wasn't loading, but those changes were made a year ago, so I don't see why behavior would change over the last 30 days or so. There was also a firewall change a few weeks back, but our development system (same setup) sits behind the same firewall and stays accessible during outages, so we do not believe this is connection related. Some users also hit the REST API with xnatpy scripts, which were running around the time of last week's freeze, but as far as I can tell from the access logs they were not running today.
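For context, the xnatpy usage is roughly along these lines (the host, project ID, and credentials below are placeholders; the real scripts belong to our users), mostly listing subjects and sessions:

import xnat

# Placeholder connection details; actual scripts iterate over similar listings.
with xnat.connect('https://xnat.example.org', user='someuser', password='***') as session:
    project = session.projects['SOME_PROJECT']
    for subject in project.subjects.values():
        for experiment in subject.experiments.values():
            print(experiment.label)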
Looking at the thread data in the JavaMelody monitoring tool, we see the following:
Threads: Number = 241, Maximum = 300, Total Started = 382. However, just prior to the system freezing I saw Total Started = 28,750.
Non-heap memory = 364 MB, Loaded classes = 29,282, Committed virtual memory = 16,464 MB, Free physical memory = 240 MB, Total physical memory = 31,993 MB.
The number of loaded classes seems very high to me, so I suspect something here is causing the issue. I have screenshots of the thread details from JavaMelody at the time Total Started was ~29k, as well as htop output from the VM's terminal showing the Java process using 16 GB of RAM; let me know if there is a spot you'd like me to upload this info to help pinpoint where to look. Any suggestions would be greatly appreciated. A Tomcat restart cleared the issue and reset Total Started to 598.
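If full thread dumps would help, I can capture them the next time the thread count starts climbing. A minimal sketch of what I had in mind (the PID and output path are placeholders, and it assumes jcmd from the same JDK 8 install is on the PATH and is run as the Tomcat user so it can attach to the JVM):

#!/usr/bin/env python
# Rough sketch: snapshot the Tomcat JVM's thread stacks with jcmd so we can
# see which threads are piling up before a freeze. PID and output directory
# are placeholders for our environment.
import os
import subprocess
import time

TOMCAT_PID = "12345"                # replace with the java PID from htop
OUT_DIR = "/tmp/xnat-thread-dumps"  # placeholder location

if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)

stamp = time.strftime("%Y%m%d-%H%M%S")
out_path = os.path.join(OUT_DIR, "threads-%s.txt" % stamp)

with open(out_path, "w") as fh:
    # "Thread.print" dumps the stack trace of every live JVM thread
    subprocess.call(["jcmd", TOMCAT_PID, "Thread.print"], stdout=fh)

I could run something like this from cron every few minutes and keep the dumps from just before the next freeze, if that would be useful.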
Thanks,
Ajay