System freezing daily - Potential Java threads issue?


Ajay Kurani

Mar 29, 2024, 4:58:35 PM
to xnat_discussion
To whom it may concern,
   Over the past few weeks our system has been going down almost daily, sometimes a few times a day, for periods of ~30 minutes to 1 hour. We are running our production application server on a VM with 4 cores and 32 GB of RAM, with OpenJDK Runtime Environment 1.8.0_372-b07 on RHEL 7 x86_64. We are currently on XNAT 1.8.8, build 92, with the following plugins:
xnatMlPlugin - 1.0.2, dicom-query-retrieve - 1.0.4-xpi, datasetsPlugin - 1.0.3, xsyncPlugin - 1.5.0 (runs every hour on a single project that isn't updated often), ohifViewerPlugin - 3.5.1, xnat-ldap-auth-plugin - 1.1.0, containers - 3.3.1-fat, batchLaunchPlugin - 0.6.0.

All plugins were installed on May 25, 2023, so I don't see anything recent that would explain this behavior.

Overall we have three VMs: a web VM running a reverse proxy that points to our app VM (specs mentioned above), which in turn connects to a DB VM. During outages we cannot connect through a web browser directly to the app VM, but everything is accessible via SSH, and connection tests between the VMs at those times have not shown anything that would explain the behavior, so we started looking at other things in the JavaMelody tool.


We are not currently running any containers on this production system (containers are being tested only on our mirror test system), so it is mainly being used as a repository at the moment. There are a few large projects with 5000+ subjects where, in the past, we extended the keep-alive on the HTML because the scan-type cleanup wasn't loading, but those changes were made a year ago, so I don't see why behavior would change over the last 30 days or so. Additionally, there was a firewall change a few weeks back, but our development system (same setup) is behind the same firewall and remains accessible during outages, so we do not believe this is connection related. Some users also run REST API calls through XNATpy functions, which happened around the time of the freezes last week, but as far as I can tell from the access logs that was not running today.

When looking at threads in the JavaMelody monitoring tool we see the following:

Number = 241, Maximum = 300, Total Started = 382. However, prior to the system freezing I saw Total Started = 28,750.

Non-heap memory = 364 MB, loaded classes = 29,282, committed virtual memory = 16,464 MB, free physical memory = 240 MB, total physical memory = 31,993 MB.


The number of loaded classes seems very high to me, so I suspect something here is causing the issue. I have screenshots of the thread details from JavaMelody from when Total Started was ~29k, as well as htop output showing the Java process using 16 GB of RAM, in case there is somewhere I can upload this info to help pinpoint where to look. Any suggestions would be greatly appreciated. A Tomcat restart cleared the issue and reset Total Started to 598.
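For reference, a quick way to see what those ~29k started threads are actually doing is a JVM thread dump taken while the system is wedged. Below is a minimal sketch, assuming jstack from the Tomcat JDK is on the PATH; the PID value is a placeholder to be filled in by hand:

#!/usr/bin/env python3
# Minimal sketch: capture a thread dump of the Tomcat JVM with jstack
# and count thread states, to see where threads are piling up.
import subprocess
from collections import Counter

TOMCAT_PID = "12345"  # placeholder: use the real Tomcat java PID from htop/top

dump = subprocess.run(["jstack", "-l", TOMCAT_PID],
                      capture_output=True, text=True, check=True).stdout

# jstack prints one "java.lang.Thread.State: <STATE>" line per thread.
states = Counter(line.split("java.lang.Thread.State:")[1].split()[0]
                 for line in dump.splitlines()
                 if "java.lang.Thread.State:" in line)

for state, count in states.most_common():
    print(f"{state:15s} {count}")

A large count of BLOCKED or WAITING threads (and the stack traces they share in the full dump) usually points at the resource they are stuck on.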

Thanks,
Ajay

Tim Olsen

Apr 2, 2024, 10:46:27 AM
to xnat_discussion
Are the outages correlated with any particularly large uploads? We used to see issues like that related to large uploads.

XNAT 1.8.9 included some significant performance improvements, as did recent versions of the OHIF plugin (which used to impact the server on large file uploads). It's probably worth updating to see if that improves performance here.

Tim

Ajay Kurani

Apr 5, 2024, 3:53:23 PM
to xnat_discussion
Hi Tim,
   Thanks for the suggestion! We do see this happening when the MRI center is doing uploads, and also when a group using XNATpy runs massive downloads that pull each individual DICOM instead of a zip of the scans. The system just crashed as the MRI center was uploading a 15 GB session. In terms of XNATpy, have you had issues where lots of downloads or many requests per session cause a problem? We think we have isolated the issue to the Tomcat process, as restarting Tomcat brings the system back up immediately. This is running on a 32 GB, 4-core VM where CPU usage is not an issue.
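For what it's worth, pulling a whole session as a single archive instead of one request per DICOM file cuts the request count dramatically, which is gentler on Tomcat. A minimal xnatpy sketch, assuming xnatpy's download() helper and with placeholder server URL, credentials and IDs:

#!/usr/bin/env python3
# Minimal sketch: download an entire experiment as one zip archive with
# xnatpy instead of requesting every DICOM file individually.
# The server URL, credentials, project and experiment IDs are placeholders.
import xnat

with xnat.connect("https://xnat.example.edu", user="username", password="password") as session:
    experiment = session.projects["MYPROJECT"].experiments["MYEXPERIMENT"]
    # One request for a zip of the whole session, rather than thousands of per-file GETs.
    experiment.download("/tmp/MYEXPERIMENT.zip")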

In terms of the Java process, using the top command we see 16.1g VIRT, 1.5 GB RES and about 28,000 KB SHR (small in comparison). Using htop we see the Java command as follows:
/usr/bin/java -Djava.util.logging.config.file=/opt/tomcat/latest/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.security.egd=file:/dev/./urandom -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Xms1g -Xmx12g -XX:+UseConcMarkSweepGC

Given that the maximum Java heap is 12 GB and top shows 16.1g VIRT and 1.5 GB RES, do you think this value is undersized?

Or, based on your experience, does 1.8.9 fix these issues? We will be upgrading next week regardless, but we are hoping to find the root cause so we know what to look for in case this comes up in the future.

Thanks,
Ajay

Tim Olsen

Apr 5, 2024, 4:35:28 PM
to xnat_discussion
Is PostgreSQL running on a separate server? If so, you can probably increase the Tomcat memory to 2/3 or 3/4 of system memory.
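(For reference, 2/3 to 3/4 of the 32 GB VM works out to roughly 21-24 GB for -Xmx, versus the current -Xmx12g, leaving the remainder for the OS and the JVM's non-heap memory.)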

1.8.9 will definitely improve the system performance. But I know the older OHIF would lock up a server after really big sessions were uploaded, and 15 GB definitely counts as really big.

I think if you increase your memory and upgrade XNAT (and OHIF) you should avoid these outages. I don't think I've ever seen downloads bring down a server, but it is theoretically possible.

Tim

Ajay Kurani

Apr 5, 2024, 4:44:30 PM
to xnat_di...@googlegroups.com
Hi Tim,
    Yes, the Postgres DB is on a separate VM. Does the OHIF viewer lock up the server even when it is not being used to view images?

We will update the system and OHIF as recommended. As for Tomcat, we will increase the heap from 12 GB to at least 20-24 GB.


Thanks for the suggestions.

Best,
Ajay

Ajay Kurani

Apr 19, 2024, 10:36:55 PM
to xnat_discussion
Hi Tim,
   I wanted to circle back with the results. We upgraded to XNAT 1.8.10 build 117 and OHIF viewer 3.6.1. We raised the Tomcat thread limit to 300 (the default was ~250) after seeing the maximum was around 260 in the JavaMelody report. On our 4-core, 32 GB RAM VM we also increased the Java heap from 12 GB to 22 GB.

When testing, we uploaded the 15 GB session from the scanner directly again, and this time RAM did not shoot up; the data, which usually took about an hour on 1.8.8, finished uploading in 37 minutes. We also noticed that RAM usage never went above 10 GB, whereas in the past it was much higher. During this upload we had several users log in and several browsers viewing images in the OHIF viewer, and the system did not crash. We performed this twice last week and all went well. Only at the end of the upload did CPU usage jump to almost 100% on all 4 cores, which may be due to archiving or decompression, but it was short-lived.

When opening the large imaging sequence (Siemens XA30, where each slice is a separate DICOM file, so the resting-state series was 76,000 images) we noticed a slowdown on the system. We saw some differences in behavior between Safari and Firefox, which I will detail in a separate thread for OHIF. It took ~50 minutes on Firefox for the page to load; it did load and the system did not crash, although it became slow. RAM usage spiked to 20 GB, but more importantly CPU spiked on all 4 cores for most of that time; the system slowed down but did not crash.

Overall, the system performed much better and did not need rebooting, which is a big plus. We also had IT investigate, and it looks like the VM is waiting for cores to be allocated about 15% of the time, as resources for the VMs appear to be low. This may also be contributing to lags during peak VM usage.

Thanks for all of the help with this matter and your suggestions.

Best,
Ajay

DrRadiohead

Jun 2, 2024, 11:55:51 AM
to xnat_discussion
Dear Ajay and other XNAT team members

Some questions:
1. Is your XNAT running locally or on the cloud?
2. Do you have suggestions for connecting the MRI scanner to XNAT deployed on the cloud?

Regards

Ajay Kurani

Jun 4, 2024, 12:58:22 PM
to xnat_discussion
Hi,
  
1) We are running our instance of XNAT locally. We have deployed the components across 3 VMs maintained by our university: a web VM with a reverse proxy connecting to the app VM (which contains XNAT), and a third VM that just runs the database.

2) Connecting the MRI scanner to an XNAT instance in the cloud will depend on your site's rules and IT security, but if you CANNOT let any data with PHI leave the university (which is the likely case, since they will consider your cloud instance outside), one fairly automated approach is to set up the CTP software from RSNA on a local system that de-identifies the DICOMs, and then send those clean DICOMs to your XNAT cloud instance (a rough sketch of the de-identification step is below). The whole workflow will probably need to be validated by your IT security team to ensure it meets their legal/compliance requirements, but it can be a fairly automated workflow. I think someone on the XNAT team will be much more knowledgeable than I am about connecting to the cloud.
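Purely as an illustration of that "clean locally, then forward" idea (not a substitute for CTP or a vetted de-identification profile), a rough pydicom sketch might look like the following; the paths, PatientID mapping and tag list are placeholders and are far from complete:

#!/usr/bin/env python3
# Rough illustration only: blank a few common PHI tags from DICOM files
# before they leave the local network. A production deployment should use
# a vetted profile (e.g. RSNA CTP) approved by your compliance/IT team.
# The paths and tag list below are placeholders.
from pathlib import Path
import pydicom

SRC = Path("/data/incoming_from_scanner")      # placeholder input directory
DST = Path("/data/deidentified_for_cloud")     # placeholder output directory
DST.mkdir(parents=True, exist_ok=True)

PHI_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress",
            "ReferringPhysicianName", "InstitutionName"]

for dcm_path in SRC.rglob("*.dcm"):
    ds = pydicom.dcmread(dcm_path)
    for tag in PHI_TAGS:
        if tag in ds:
            ds.data_element(tag).value = ""    # blank the value, keep the element
    ds.PatientID = "ANON_SUBJECT"              # placeholder subject mapping
    ds.remove_private_tags()                   # drop vendor private tags
    ds.save_as(DST / dcm_path.name)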

Best,
Ajay
