High CPU usage in DSpace 7.6 leading to server issues, lots of errors/examples (crossposted from Slack)

Carolyn Sullivan

Mar 5, 2024, 5:30:15 AM
to DSpace Technical Support
Hello all,

As you might have seen if you frequent the Technical Support channel in the DSpace Slack, we've been encountering high CPU usage in DSpace 7.6 that leads to degraded performance (i.e. site unavailability), along with a lot of errors.  We're not entirely sure which errors are the significant ones for our performance problems, and would welcome any input from the community to help us improve things.  Also, if you're encountering similar issues, please do let us know--maybe we're all having the same problems and can solve them collaboratively.  Thanks already to Tim Donohue and Mark Wood for their suggestions in the DSpace Slack!  I've aggregated the responses we've already received here so we can keep track of the suggestions.

So, a summary of our issues:  we've set up our server following best practices in the Performance Tuning documentation.  We run everything on a single server with 4 CPUs and 12 GB of RAM, the configuration that worked for our previous version of DSpace (6.3).  Initially, we had pm2 configured in cluster mode with max instances and max_memory_restart: 500M. With this configuration, the node instances kept restarting roughly every minute, monopolizing the CPUs and starving the other components.  We have since tuned it down to 3 instances, i.e.:

{
    "apps": [
        {
           "name": "dspace-ui",
           "cwd": "/var/dspace-frontend/",
           "script": "dist/server/main.js",
           "instances": "3",
           "exec_mode": "cluster",
           "timestamp": "YYYY-MM-DD HH:mm Z",
           "out_file": "log/dspace-ui.log",
           "error_file": "log/dspace-ui_error.log",
           "merge_logs": true,
           "env": {
              "NODE_ENV": "production",
              "NODE_EXTRA_CA_CERTS": "/etc/ssl/certs/rootCA2.crt"
           },
           "max_memory_restart": "1500M",
           "node_args": "--max_old_space_size=4096"
        }
    ]
}

A review of processes with top shows very active node instances despite low traffic (~1 request/second):

Tasks: 289 total,   5 running, 284 sleeping,   0 stopped,   0 zombie
%Cpu(s): 93.9 us,  3.0 sy,  0.0 ni,  3.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  11965.2 total,    440.4 free,  10901.1 used,    623.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    613.9 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1504803 dspace    20   0 2328180   1.5g  13936 R 100.0  12.5  10:53.08 node /v+
1506783 dspace    20   0 2620092   1.7g  14024 R  93.8  14.8   9:44.49 node /v+
1506913 dspace    20   0 1383380 586472  14180 R  93.8   4.8   4:57.11 node /v+
1508040 dspace    20   0  733380 141452  36952 R  75.0   1.2   0:00.77 node /v+
    781 root      20   0  237020   2536    944 S   6.2   0.0   9:41.79 vmtoolsd
      1 root      20   0  171488   7176   2492 S   0.0   0.1   0:44.04 systemd

Our cache settings are set as follows:
# Caching settings
cache:
...
  serverSide:
    debug: false
    botCache:
      max: 1000
      timeToLive: 86400000 # 1 day
      allowStale: true
    anonymousCache:
      max: 1000
      timeToLive: 10000 # 10 seconds
      allowStale: true

The main question from our systems analyst (François Malric): is this level of constantly high CPU usage normal for node.js?  Or is it likely that our DSpace is performing poorly due to underlying issues?

Here are some examples of the errors we've seen:

(1) From our DSpace logs:  according to these, we don't have that much traffic (HTTP 0.96 req/min as reported below), which would likely be higher if bot traffic were the issue.  Nota bene: pm2 monit likely has a bug here, as it labels the unit req/min when it should be req/sec.

-- Process List ---------------------------------------------------------------
[ 1] dspace-ui          Mem: 824 MB
[ 2] dspace-ui          Mem: 316 MB
[ 3] dspace-ui          Mem: 777 MB
[ 0] pm2-logrotate      Mem:  45 MB

-- dspace-ui Logs -------------------------------------------------------------
dspace-ui > The response for 'https://ruor.uottawa.ca/server/api/core/items/d2d3c
dspace-ui > 1 rules skipped due to selector errors:
dspace-ui >   .custom-file-input:lang(en)~.custom-file-label -> unmatched
dspace-ui > GET /handle/10393/19705/simple-search?query=&sort_by=score&order=desc
dspace-ui > 1 rules skipped due to selector errors:
dspace-ui >   .custom-file-input:lang(en)~.custom-file-label -> unmatched
dspace-ui > Redirecting from /bitstreams/e524c49e-5fc2-4e74-b69d-0c890238ab3b/dow
dspace-ui > GET /bitstreams/e524c49e-5fc2-4e74-b69d-0c890238ab3b/download 302
dspace-ui > ERROR Error: Cannot set headers after they are sent to the client
dspace-ui >     at new NodeError (node:internal/errors:405:5)
dspace-ui >     at ServerResponse.setHeader (node:_http_outgoing:648:11)
dspace-ui >     at ServerResponseService.setHeader (/opt/dspace-frontend/dist/ser
dspace-ui >     at Object.next (/opt/dspace-frontend/dist/server/9366.js:1:4722)
dspace-ui >     at ConsumerObserver2.next (/opt/dspace-frontend/dist/server/main.
dspace-ui >     at SafeSubscriber2.Subscriber2._next (/opt/dspace-frontend/dist/s
dspace-ui >     at SafeSubscriber2.Subscriber2.next (/opt/dspace-frontend/dist/se
dspace-ui >     at /opt/dspace-frontend/dist/server/main.js:1:4471483
dspace-ui >     at OperatorSubscriber2._this._next (/opt/dspace-frontend/dist/ser
dspace-ui >     at OperatorSubscriber2.Subscriber2.next (/opt/dspace-frontend/dis
dspace-ui >   code: 'ERR_HTTP_HEADERS_SENT'
dspace-ui > }
dspace-ui > 1 rules skipped due to selector errors:
dspace-ui >   .custom-file-input:lang(en)~.custom-file-label -> unmatched
dspace-ui > Warning [ERR_HTTP_HEADERS_SENT]: Tried to set headers after they
dspace-ui > GET /items/d2d3cc05-419a-488e-912e-1ff20ab7a654 200 3281.694 ms - -

-- Custom Metrics -------------------------------------------------------------
Heap Size                709.14 MiB
Event Loop Latency p95
Event Loop Latency         34.32 ms
Active handles                   10
Active requests                   1
HTTP                   0.96 req/min
HTTP P95 Latency            4009 ms
HTTP Mean Latency            868 ms

-- Metadata -------------------------------------------------------------------
App Name        dspace-ui
Namespace       default
Version         N/A
Restarts        38
Uptime          5m
Script path     /opt/dspace-frontend/dist/server/main.js
Script args     N/A
Interpreter     node

We'd particularly appreciate feedback on these error messages (shown in the log excerpt above):
  • 1 rules skipped due to selector errors:
    .custom-file-input:lang(en)~.custom-file-label -> unmatched
  • ERROR Error: Cannot set headers after they are sent to the client
  • Warning [ERR_HTTP_HEADERS_SENT]: Tried to set headers after they
    • Suggestion from Mark Wood: the Headers Sent errors seem to be mostly an annoyance, but the constant dumping of stack traces is bloating the log.
These errors occur constantly :'(

Suggestion from Mark Wood on the error messages: The most serious are probably the proxy errors.  It appears that PM2 is closing proxy connections, probably because there are too many.  The machine is simply being asked to do more work than it can handle in the available time.  We see this too, even after doubling our CPU and memory from levels that were quite adequate for v6.  We are about to throw a big increase in resources at v7 to see if that helps, as it has at other sites.

(2) Examples of errors from our Apache error log (we run Apache as a proxy, as recommended in the documentation):

[Fri Mar 01 14:51:00.740446 2024] [proxy:error] [pid 1494894:tid 140510257739520] [client 66.XXX.75.XXX:0] AH00898: Error reading from remote server returned by /handle/10393/19705/simple-search
[Fri Mar 01 14:53:26.192799 2024] [proxy_http:error] [pid 1494894:tid 140510257739520] (104)Connection reset by peer: [client 66.XXX.75.XXX:0] AH01102: error reading status line from remote server localhost:4000
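
A sketch of the kind of Apache reverse-proxy settings that could at least reduce these AH00898/AH01102 errors while SSR responses are slow (the values are only examples, and this treats the symptom rather than the underlying load):

# Illustrative Apache reverse-proxy settings (not a verified config)
ProxyPreserveHost On
# Allow the Node SSR backend more time to answer before Apache reports an error
ProxyTimeout 300
# retry=0 stops Apache from marking the backend "down" after one failed request
ProxyPass        / http://localhost:4000/ timeout=300 retry=0
ProxyPassReverse / http://localhost:4000/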

Some suggestions we've seen in the DSpace Slack already:
Tim Donohue:
  • Initial increased bot traffic is common, but tends to decrease over time
  • Review major errors in DSpace/Tomcat/Postgres/etc. logs
  • Enable more caching in server-side rendering as that uses the most CPU in Node.js
    • Seconded by Mark Wood: Increasing the caching will likely reduce the CPU demand but memory demand will increase drastically.
  • Mark Wood: In general, DSpace 7.x is much more computationally expensive than the previous versions
If you've read this far, thank you so much for your time and consideration.  The wider DSpace community seems to be struggling with these issues, and we would all welcome your observations and suggestions for resolving them.

Best,
Carolyn Sullivan

Maruan Sahyoun

Mar 5, 2024, 6:09:08 AM
to DSpace Technical Support
Dear Carolyn,

Not directly answering your question, but we are running (a possibly smaller instance of DSpace for a non-institutional site) without PM2, with node.js directly. With the amount of traffic you are having, you might want to give that a try to rule out PM2 and its handling as a source of error. With our install we are not getting, e.g., ERR_HTTP_HEADERS_SENT.
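
A minimal way to try that, reusing the script path and environment from the pm2 config earlier in this thread (a single process with no automatic restart, so treat it as a test setup only; the heap size is just an example):

cd /var/dspace-frontend
NODE_ENV=production NODE_EXTRA_CA_CERTS=/etc/ssl/certs/rootCA2.crt \
  node --max_old_space_size=1024 dist/server/main.js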

BR
Maruan Sahyoun
FileAffairs GmbH

Majo

Mar 5, 2024, 6:22:23 AM
to DSpace Technical Support
Hello Carolyn Sullivan.

I would like to offer a few points I noticed. I was responsible for deploying
one instance of DSpace and I am quite familiar with the problems you described.
However, I am no expert, so take all of the following with a grain of salt
(and perhaps a bit of hope that someone more experienced will also reply).

First of all, the resources you have available are far from enough.
The instance we deployed was small and we are using 15 CPUs and 30 GB of RAM.
Initially we had about your specs and I couldn't make it work reliably, no matter what I did.
(Limiting bots helped a great deal, but that is certainly not ideal and the performance was still terrible.)

Secondly, the caching. By trial and error, I arrived at 20 pages for
bots and 100 pages for anonymous users. When it was significantly higher, each core used a lot
of memory and therefore kept restarting, losing all benefit of the cache. I suggest you observe
how much memory your individual cores consume and, if it is too much, decrease the cache.
My conclusion is that you do not need excessive amounts of caching, because yes,
it will consume too much memory, and if it causes swapping it will only slow everything
down. I was slightly surprised by the settings that work for us, but they are effective
when used together with more CPU and RAM resources.

I consider the points above the most important, but your setting
max_memory_restart = 1500M is also peculiar. Why allow 4096 MB of memory for each core if you
restart it at 1500M? We only use the max_old_space_size argument. I am not sure, but I would
either remove the restart argument or adjust it to match max_old_space_size.
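
For illustration, one way to avoid that mismatch is to keep --max_old_space_size at or below max_memory_restart, so node's garbage collector kicks in before pm2 kills the process. Using Carolyn's config from the first message (the 1024M heap value is only an example, not a recommendation for her hardware):

{
    "apps": [
        {
           "name": "dspace-ui",
           "script": "dist/server/main.js",
           "instances": "3",
           "exec_mode": "cluster",
           "max_memory_restart": "1500M",
           "node_args": "--max_old_space_size=1024"
        }
    ]
}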

We also have the "1 rules skipped due to selector errors" constantly, but
it doesn't appear to be a limiting factor. I would be happy if it were resolved, however.

Perhaps one last note: the Angular frontend with SSR on consumes a lot of resources,
apparently much more than it previously did, so no matter the settings, you will
have to up the CPU and RAM. That could also resolve the proxy timeouts.
Our cores (before I tuned the settings) were so overwhelmed that we frequently
got 504 errors (we have Angular behind a reverse proxy; it did not manage to
respond within the 45 or 50 second limit).

I wish you a lot of luck as I am afraid you will need it.

Best regards,
Majo


Edmund Balnaves

Mar 5, 2024, 5:13:09 PM
to DSpace Technical Support
We are running DSpace 7 instances in a multi-tenanted environment in a reasonably stable way.

Our experience is that lots of memory is needed, and we do see lockups in cluster instances periodically.  Even low levels of bot activity can stress the system, and the performance of DSpace 7 is pretty underwhelming, but we have managed to maintain stable instances.  Trimming your caching would be wise to keep memory within reasonable bounds.  We have written shell scripts to monitor and restart instances that look to be locked up, and have put in place some memory limits for auto-restart, per your approach.


Edmund Balnaves
Prosentient Systems

uOttawa Library

Mar 6, 2024, 11:38:43 AM
to DSpace Technical Support
I would like to understand how memory is used by the node instances.

There are comments in the example frontend configuration file that mention the following:

      # Maximum number of pages to cache for known bots. Set to zero (0) to disable server side caching for bots.
      # Default is 1000, which means the 1000 most recently accessed public pages will be cached.
      # As all pages are cached in server memory, increasing this value will increase memory needs.
      # Individual cached pages are usually small (<100KB), so max=1000 should only require ~100MB of memory.

We have both the bot cache and the anonymous cache set to max: 1000. This would mean a total of ~200MB of cache (per instance?). We allocate 1.5GB to each instance (max_memory_restart), so the cache wouldn't be the main cause of the high memory usage. Since the original post above we have set max_old_space_size=1024, and this seems to make the instances stay alive longer (they now restart every ~90 min, when they exceed the 1.5GB limit).

It isn't clear if the cache is shared amongst instances (to avoid having to render the same frequently accessed content in every instance), but in any case it wouldn't be the main source of memory use, according to the comments.

François

DSpace Technical Support

Mar 6, 2024, 2:47:31 PM
to DSpace Technical Support
Hi all,

I wanted to chime in briefly to say that I appreciate everyone sharing your experiences with high CPU issues, as it does help the developers & me to hear what everyone is encountering under heavier load and/or bot activity.  The more that institutions can share their experiences, the more likely we can begin to narrow down the problem(s) and build better documentation/guidelines for everyone.

One thing that is clear is that Server Side Rendering (SSR) from Angular **does seem to be more CPU heavy than we anticipated**.  This is why the basic "SSR caching" was added in the first place.  However, what's also starting to become clear is that the basic SSR caching may not be enough.  (In all honesty, we knew it would help in some scenarios but possibly not *all* scenarios.)

I can verify though that the existing basic SSR caching is *per instance*.  So, when using "cluster mode" (and running several instances at once), there is no way to currently share that cache across instances (as the cache is literally just stored in the memory of each instance).  This means it has a more limited impact than we initially hoped.  

This may mean we need to begin looking at some more advanced caching options for Angular SSR. To be clear though,  this SSR performance/caching issue shouldn't be specific to DSpace 7, as we are just using the SSR tools from Angular.io. So, it's possible that tools may already exist out there from other sites/applications that use Angular SSR.

In the meantime, I would ask that sites which have this working well consider also sharing how you "stabilized" your high CPU usage. I know there are sites out there who've done this (as there are a growing number of sites running DSpace 7 in production).  It'd just be helpful, for me (and others), if we can learn from each other in order to create better documentation & best practices for DSpace 7.  (All DSpace documentation & best practices have always been a collaborative/community effort because we don't have a central development team.)

Tim

Alan Orth

Mar 7, 2024, 12:46:33 AM
to DSpace Technical Support
Dear all,

Our experience with moving to DSpace 7.6 in production was that bots exhausted the SSR cache immediately. We effectively solved it by adding rate limiting of bots in nginx.
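
As a rough sketch of what that kind of bot rate limiting can look like in nginx (the user-agent pattern, zone size and rate below are illustrative rather than our exact production values):

# http {} context: requests whose map result is empty are not rate limited
map $http_user_agent $bot_limit_key {
    default                        "";
    "~*(bot|crawl|spider|slurp)"   $binary_remote_addr;
}
limit_req_zone $bot_limit_key zone=bots:10m rate=30r/m;

# server {} context, in front of the dspace-ui proxy
location / {
    limit_req zone=bots burst=10 nodelay;
    proxy_pass http://localhost:4000;
}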

I also noticed that many applications have had performance issues with Angular SSR due to its use of `inlineCriticalCss`. See some discussions:


On that note there is a draft pull request for dspace-angular to allow disabling inlineCriticalCss: https://github.com/DSpace/dspace-angular/pull/2067

Regards,

Majo

Mar 7, 2024, 1:23:08 AM
to DSpace Technical Support
Hello everybody.

Firstly, I would like to respond specifically to the message from François about caching.
The comments in the config file say that a cached page usually takes about 100 KB, so
having the setting at 1000 should be very much OK. When I was experimenting with the cache
and watching the consumed memory, I found that to be false. When I decreased the page cache
and waited several hours, up to perhaps a day, I could make a guess based on how
much memory the individual cores consumed, and it was heavily dependent on the cache
setting. As I wrote in one (in fact several, I believe) of my previous messages, a better
value was much smaller: about 20 pages for bots and 100 for anonymous users.

Not only does this conserve the memory of individual cores and prevent them from restarting,
it also lets the frontend run smoothly, without overly long delays, and the performance
was acceptable for us.

It also appeared to me that the cache is indeed NOT shared between instances.
I used pm2 monit to see how much each core used, and all of them seemed to increase their
usage as the cache grew. The memory consumed overall corresponded to the increased
use of RAM as reported by the operating system. I therefore had to balance the
number of instances against the available RAM and CPUs.
Too many CPUs with not a lot of RAM results either in cores crashing and
restarting, or in not enough cache and therefore slow performance.

However, I look forward to any solutions that might be found, as per Tim's message
about searching for more advanced Angular SSR caching.

There is one more thing to add, again a bit of speculation. I think having quite a low
number of cached pages could make sense specifically for DSpace. There are
only a few pages that are frequented by all users, where it makes sense to
cache them, and plenty of pages that are very "personal" or only "on demand",
which are of course the items. To me it seems there is no point in caching those,
as each user might want to view a different page, but almost everyone will land
on the homepage, the login page and perhaps a few others. Anyway, I believe limits of
20 and 100 (as mentioned above) leave plenty of room for caching those items that
might be very frequented.

Best regards,
Majo

uOttawa Library

Mar 7, 2024, 3:06:42 PM
to DSpace Technical Support
Thank you everyone for your comments. We have made changes to the caching, now using this in our config.prod.yml
 
...
  serverSide:
    botCache:
      max: 20
    ...
    anonymousCache:
      max: 100
...

This seems to make a huge difference. The cluster instances' memory is more stable now and not causing restarts. They peak at around 1000MB each and go back down to ~500MB, even with "max_memory_restart": "2500M" and "--max_old_space_size=2048". We've had no restarts so far (only 2h after the change, but pm2 monit gives me more confidence now; the memory use doesn't keep increasing like before).

François

uOttawa Library

Mar 12, 2024, 3:45:11 PM
to DSpace Technical Support
I have set the botCache back to max: 1000, as I think it makes sense to have more cache for bots in order to lower CPU usage over time.

So far, memory use per cluster instance is peaking at around 1500MB and remains around that value - no instance restarts yet (after 4 days).
%Cpu(s) is at ~50% overall with 4 cores.

François