Dataverse coming to a grinding halt with a CPU load of 400%


Paul Boon

Jul 11, 2019, 2:13:02 PM
to Dataverse Users Community


Help needed with dataverse.nl coming to a grinding halt with a CPU load of 400%
 
The high CPU load seems to be a result of a high memory usage. 
Probably the GC kicks in when the memory is getting full. 
We have looked in the logs, but found no clues yet to what triggered this. 
When Glassfish or the whole server is restarted, memory usage drops back to a low level but then immediately starts increasing in a very linear way, which is not what you would expect if normal user activity were causing it. 
I have not seen this on any of our other servers, which have similar configurations. 

We run a fork of Dataverse 4.10.1 on Red Hat 7, with 16GB of memory and enough free disk space. 

Memory related options
-Xmx12500M
-Xms12500M
-XX:MaxPermSize=512m
-XX:PermSize=256m

I also looked into the number of open file descriptors, which caused trouble in some previous versions, but that looks all right. 
The weird thing is that it ran without problems from the time we deployed it to production on 2019-05-09 until 2019-07-10, when we got the near-400% CPU load. 
Looking back, we can see a first stepwise increase in memory usage on 2019-05-30; it then stays stable at that higher level until 2019-07-10, when it goes up again and causes problems. 
Restarting only helps temporarily, because memory usage climbs back (linearly) to the last (full) level in eight hours. 
Maybe someone has an idea what kind of process could cause this linear increase in memory usage. 
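
To put a number on how linear the climb is, one option is to fit a least-squares line through a few (time, used memory) samples and read off the growth rate and a rough time until the heap is full. The following is only a minimal sketch: the function names are invented for illustration, and it assumes the samples are taken by hand (for example from a monitoring graph) as hours since restart and MB in use.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since restart, used memory in MB)


def fit_line(samples: List[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope in MB/hour, intercept in MB)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must not all share the same timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    slope = cov / var_t
    return slope, mean_m - slope * mean_t


def hours_until_full(samples: List[Sample], heap_limit_mb: float) -> Optional[float]:
    """Hours after the last sample until the fitted line reaches the heap limit.

    Returns None when memory is not growing, because the line never gets there.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    time_full = (heap_limit_mb - intercept) / slope
    return max(0.0, time_full - samples[-1][0])
```

With made-up samples such as `[(0.0, 2000.0), (1.0, 3300.0), (2.0, 4600.0), (3.0, 5900.0)]` and the 12500 MB heap from the options above, `fit_line` gives a slope of 1300 MB per hour and `hours_until_full` returns a little over 5 hours after the last sample. These values are illustrative only, not measurements from dataverse.nl.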

Any thoughts on how to proceed and find the cause of the problem are much appreciated. 


Below is the memory usage, showing first a restart of Glassfish, then a full server reboot. 

Paul Boon

Jul 11, 2019, 2:20:02 PM
to Dataverse Users Community
Somehow the image didn't come through
PastedGraphic-5.png

Condon, Kevin M

Jul 11, 2019, 2:33:57 PM
to Dataverse Users Community

Hi Paul,

Are you running version 4.15? There was an issue with too many threads being opened when accessing the dataset page. We've just released a fix in 4.15.1, available since last evening.

Regards,

Kevin

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/8f471e66-e448-446d-85dd-893424c09220%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Don Sizemore

Jul 11, 2019, 2:46:10 PM
to dataverse...@googlegroups.com
Hello,

We're still running 4.9.4 here but our production Dataverse typically clocks in around 18G-20G of active memory (Xmx32768m). Any chance you could bump your RHEL box to 24G of RAM?

On memory: are you functioning as a harvesting server? That's typically when we see our memory usage climb.

What changes are involved in your fork? Is the code up somewhere? (I found your fork of DVN3)

Donald



Philip Durbin

Jul 11, 2019, 3:22:25 PM
to dataverse...@googlegroups.com
Speaking of harvesting, a little bird just pointed out a potential memory problem here:


The regularity of the climb in memory is strange to me but harvesting would explain that. I don't know how long harvesting lasts, though.

As we mentioned on the last community call and in (updated) release notes for Dataverse 4.15, there was a memory leak in that version. This has been fixed in https://github.com/IQSS/dataverse/pull/6004 in 4.15.1, released yesterday.

The same bug still exists in OAISetServiceBean in 4.15.1. Paul, especially since you're running a fork, you could try patching your version of the file to see if it helps. It looks like the code is the same in the version you're running: https://github.com/IQSS/dataverse/blob/v4.10.1/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAISetServiceBean.java#L139

If this helps, please do make a pull request, of course. :)

There are more scientific ways to identify the problem than this but it's something to try. :)


- Download multiple files as zip can be memory hungry.
- Ingest can be memory hungry.

Please keep us posted! Hang in there!

Phil




Heppler, Michael

Jul 11, 2019, 3:53:59 PM
to dataverse...@googlegroups.com
There is also a newly opened GitHub issue to further investigate the performance issue the patch resolved.

Only use one HttpSolrClient object per Solr server (i.e. One HttpSolrClient) #6005

Feel free to drop any additional solr server vs server stability information in there.





Paul Boon

Jul 12, 2019, 8:28:02 AM
to Dataverse Users Community
I disabled the 'OAI Server' on the dashboard, but the steady climb in memory usage has not changed. 
The logs show nothing special coming from outside, although I would expect the cause to be something like a very regular, automated request. 
It is so linear that it looks like a ticking clock; what else could keep adding to the memory consumption like this?

I have made a thread dump with asadmin (while CPU was still low), because kill -3 didn't do anything and jstack wasn't working either. I am not sure it helps in finding the leak, though. 

Meanwhile we have doubled the memory to 32GiB and activated the restart scripts; now I keep my fingers crossed that things stay stable until the leak is fixed. 



Paul Boon

Jul 12, 2019, 9:20:18 AM
to Dataverse Users Community
Hi Donald, 
We have just upgraded to 32GiB and I disabled the OAI Server, but the leak is still going on. 

The changes in our fork are roughly:
- we changed the homepage a bit
- we made changes related to the shibboleth login
- we added some endpoints to the api
- we added some extra UI stuff on the dataset page
- we have some extra tables and columns in the database

Paul

Paul Boon

Jul 16, 2019, 6:03:52 AM
to Dataverse Users Community
I think we have found the cause of our problem!
We discovered search requests coming from one IP, every second.
When we do the same on our test server, we get comparable leakage. 
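
For anyone who wants to check their own access logs for a client like this, counting requests per client IP is usually enough to make a once-per-second requester stand out. This is a minimal sketch, not the exact procedure we followed: it assumes Common Log Format style lines (client IP first, request line in double quotes), which may not match your Glassfish or proxy log configuration, and the function names are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: Common Log Format style lines, e.g.
# 192.0.2.10 - - [10/Jul/2019:10:00:00 +0200] "GET /some/path HTTP/1.1" 200 512
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def requests_per_client(lines: Iterable[str], path_prefix: str = "") -> Counter:
    """Count requests per client IP, optionally only for paths with a given prefix."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like access log entries
        ip, _timestamp, _method, path = match.groups()
        if path.startswith(path_prefix):
            counts[ip] += 1
    return counts


def busiest_clients(lines: Iterable[str], top: int = 5,
                    path_prefix: str = "") -> List[Tuple[str, int]]:
    """Return the `top` client IPs with the most matching requests."""
    return requests_per_client(lines, path_prefix).most_common(top)
```

A client that sends one request every second would show up with roughly 3600 requests per hour of log, far above normal visitors.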

Blocking this IP is one thing, but I want the leak to be fixed as well. 
Looks like it is the search, so probably related to this HttpSolrClient mentioned here. 
Does anyone know where to find the 'patch' or some tips on how to fix it in our 4.10.1 based fork?

Don Sizemore

Jul 16, 2019, 6:51:35 AM
to dataverse...@googlegroups.com
Looks like it is the search, so probably related to this HttpSolrClient mentioned here. 
Does anyone know where to find the 'patch' or some tips on how to fix it in our 4.10.1 based fork?

Hello,

sekmiller fixed the memory leak here:

I don't see a branch corresponding to #6005 just yet...

D


Philip Durbin

Jul 16, 2019, 8:41:32 AM
to dataverse...@googlegroups.com
Yes, https://github.com/IQSS/dataverse/compare/4.15-prod-patch7-1 was the patch Harvard Dataverse ran in production for a week or two but this was eventually cleaned up into https://github.com/IQSS/dataverse/pull/6004 which shipped with Dataverse 4.15.1 (now deployed to Harvard Dataverse and the demo site).

Indeed, please stay tuned for https://github.com/IQSS/dataverse/issues/6005

Paul, are you doing anything fancy to test? Just a curl command in a "while true" loop with a 1 second sleep? I'm asking because "Better Detection of Memory Leaks/Usage" at https://github.com/IQSS/dataverse/issues/5977 is in our current sprint and we're happy to pick the community's brain about tools and techniques. I'm glad you found the root cause and that blocking that IP address in production helped!
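
For reference, the kind of loop described above could look like the following minimal sketch in Python rather than curl. The function names are invented for illustration, and the fetch and sleep functions are passed in as parameters only so the loop can be tried without touching a real server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """GET the URL, read the whole body, and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still count as a completed request


def hammer(url: str, count: int, interval_s: float = 1.0,
           fetch: Callable[[str], int] = fetch_status,
           sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Send `count` GET requests, pausing `interval_s` seconds between them."""
    statuses = []
    for i in range(count):
        statuses.append(fetch(url))
        if i < count - 1:
            sleep(interval_s)  # no pause needed after the final request
    return statuses
```

Calling something like `hammer("http://localhost:8080/", count=600)` against a test instance (the URL here is a placeholder) while watching heap usage would show whether memory climbs with each request.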

Phil


Paul Boon

Jul 18, 2019, 5:13:35 AM
to Dataverse Users Community
The scenario that produces the leak has been reduced to a simple GET request on a dataverse page (not only the homepage) every second. If this is done for 10 minutes, we get a 200M leak. 
Kevin is also able to reproduce this. 
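
As a rough back-of-the-envelope check on those numbers, here is a small sketch of the arithmetic. The function names are made up, and it assumes the leak grows at a constant rate and that the heap starts out empty, which is a simplification.

```python
def leak_per_request_mb(leaked_mb: float, duration_s: float,
                        requests_per_s: float) -> float:
    """Average memory leaked per request, in MB."""
    return leaked_mb / (duration_s * requests_per_s)


def hours_to_fill(heap_mb: float, leaked_mb: float, duration_s: float) -> float:
    """Hours needed to leak `heap_mb` at the observed constant rate."""
    rate_mb_per_s = leaked_mb / duration_s
    return heap_mb / rate_mb_per_s / 3600.0


# 200 MB in 10 minutes (600 s) at one request per second:
per_request = leak_per_request_mb(200.0, 600.0, 1.0)   # about 0.33 MB per request
fill_time = hours_to_fill(12500.0, 200.0, 600.0)       # about 10.4 hours
```

That works out to roughly a third of a megabyte per request, and about 10.4 hours to leak the 12500M heap from the first post. This is in the same range as the eight hours we saw in production, although the request mix there is not necessarily identical to the test.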

After narrowing it down, it is good to do some exploration. 
I changed the GC to -XX:+UseG1GC, but it made no difference.
And I just discovered that it also leaks if I request the login page, so I can't imagine it has anything to do with Solr. 
Maybe it is Glassfish or PrimeFaces or something very basic. 
 
For testing I use my Vagrant VM with JMX enabled in the JVM options and JConsole attached to see the leakage in a 'live' graph. 
To force the leakage I now use JMeter. For diving deeper into the cause of the leak I think VisualVM would be better than JConsole. 
It seems so weird that we didn't experience this problem earlier. 



Paul Boon

Jul 18, 2019, 8:00:51 AM
to Dataverse Users Community