You are correct that nothing is causing high CPU utilization, but for some reason there is high CPU load (those are two different things).
And I don't remember having high CPU load on 13.x.x.
Is there a way to debug this somehow?
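One way to start debugging this, as a rough sketch and assuming a Linux host: the load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D, usually waiting on I/O), so listing those processes during a spike can show what drives load while utilization stays low. The function names below are hypothetical, and the scan only looks at each process's main thread, not at every thread.

```python
import os
from typing import Iterator, Optional, Tuple


def parse_stat(line: str) -> Optional[Tuple[int, str, str]]:
    """Parse one /proc/<pid>/stat line into (pid, command, state).

    The command is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the first "(" and the last ")".
    """
    open_paren = line.find("(")
    close_paren = line.rfind(")")
    if open_paren == -1 or close_paren == -1:
        return None
    fields = line[close_paren + 1:].split()
    if not fields:
        return None
    pid = int(line[:open_paren].strip())
    command = line[open_paren + 1:close_paren]
    return pid, command, fields[0]


def load_contributors(proc_root: str = "/proc") -> Iterator[Tuple[int, str, str]]:
    """Yield processes currently in state R (runnable) or D (uninterruptible)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                parsed = parse_stat(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if parsed and parsed[2] in ("R", "D"):
            yield parsed


if __name__ == "__main__":
    for pid, command, state in load_contributors():
        print(pid, state, command)
```

Running it a few times around the 04:00 peak would show whether the same processes keep appearing in state D.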
So your max hits 1.11, 0.48, 0.37. Mine is 5 times higher at 6.45, 1.93, 0.998, and yet I do not get any critical alerts. Perhaps the problem is that your monitoring software is not configured correctly and the thresholds are too low. My peaks are also at around 04:00, so this is likely due to a sidekiq job.
We are not using any of the new features that come with GitLab 14, and whatever we are not using that can be disabled is disabled.
I have not seen any mention in the changelogs of an increase in CPU load or CPU requirements.
Throwing more CPU resources at it is not really a solution if those resources are not actually utilized (as can be seen, going from 2 to 4 cores did not reduce the duration or intensity of the CPU load).
I am reluctant to increase the CPU load limit for the monitoring, since when we got hit by the RCE the limit was pretty high, at 2.0. As such, it failed to flag the VM being used to mine crypto, since only 1 core was set to mine and the load stayed between 1.0 and 2.0.
If there is nothing to be done, I would consider switching to CPU utilization monitoring, which seems a better fit for this behavior.
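To make the load versus utilization distinction concrete, here is a minimal sketch, assuming a Linux host with /proc/stat. It normalizes the load average per core and computes utilization from two samples of the aggregate "cpu" line. The helper names are hypothetical; idle and iowait time are treated as not busy, which is one common convention rather than the only one.

```python
import os
import time
from typing import List


def per_core_load(load_avg: float, cores: int) -> float:
    """Load average divided by core count (1.0 means one task per core)."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load_avg / cores


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def cpu_utilization(before: List[int], after: List[int]) -> float:
    """Fraction of time spent busy between two samples.

    Only the first eight counters are summed (user, nice, system, idle,
    iowait, irq, softirq, steal); guest time is already part of user/nice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return (total - idle) / total


def read_cpu_times(path: str = "/proc/stat") -> List[int]:
    with open(path) as fh:
        return parse_cpu_line(fh.readline())


if __name__ == "__main__":
    first = read_cpu_times()
    time.sleep(1.0)
    second = read_cpu_times()
    one_minute_load = os.getloadavg()[0]
    print("per-core load:", per_core_load(one_minute_load, os.cpu_count() or 1))
    print("utilization:  ", cpu_utilization(first, second))
```

A high per-core load next to a low utilization figure points at tasks waiting rather than computing, which matches the behavior described above.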
Theoretically mine is probably lower than required. If we take 3.6GHz and multiply by 4cpu, then we have approx 14.4GHz total. For 2.4GHz, I would need 6cpu to get to 14.4GHz to make it equivalent. Maybe not a brilliant calculation, probably a bit simplified, but can give an idea for figuring out the requirements perhaps.
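The same back-of-the-envelope comparison as a small sketch. It uses integer MHz to avoid floating-point rounding, and it keeps the simplification made above: it ignores per-core architecture differences, turbo, and hypervisor overhead. The function names are illustrative only.

```python
def aggregate_mhz(clock_mhz: int, cores: int) -> int:
    """Total clock budget: per-core clock multiplied by core count."""
    return clock_mhz * cores


def equivalent_cores(target_total_mhz: int, clock_mhz: int) -> int:
    """Cores needed at clock_mhz to reach target_total_mhz, rounded up."""
    return -(-target_total_mhz // clock_mhz)


reference = aggregate_mhz(3600, 4)        # 4 cores at 3.6GHz
needed = equivalent_cores(reference, 2400)  # cores needed at 2.4GHz
print(reference, needed)
```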
I had a similar issue here when I upgraded to 14.6.0 from 14.5.2.
While examining the logs, I detected a sidekiq boot/shutdown loop that was causing high CPU load in my case. I run GitLab on a VPS with 2 CPU cores with no issues.
Technically, what you could have done was take a backup of the machine after the upgrade, make a clean install of GitLab, restore gitlab.rb and gitlab-secrets.json to /etc/gitlab on the new server, run gitlab-ctl reconfigure, restore the backup from the old server, and see if the hardware overload started again. It probably would after restoring the backup, since an empty clean server install is not the same as one that is running with a load of repos, etc.
This is my configuration:
Puma port: 9999
Nginx port: 9091
auth_backend set to http://localhost:9999
And disabled prometheus, alertmanager, grafana, exporters and letsencrypt.
The other smtp and time zone settings do not appear to be relevant to this problem.
The settings that I have not named are commented out in the gitlab.rb file, so the default values are used.
So if you have these sidekiq issues and you tried this on a VPS with lower specs than what is outlined in the GitLab documentation, I seriously suggest using 4 CPUs and 8GB RAM, despite you disabling a load of stuff like grafana and prometheus. There obviously seem to be changes in requirements for 14.6.x compared with earlier versions, which means that what was possible before is becoming less and less possible. But this is normal, so it cannot be expected that GitLab will always run on hardware below the recommended specs.
I just noticed this thread.
My gitlab-ce instance has been running for a long while.
Somewhere along the upgrade path from gitlab-ce 13 to 15, where I am now, I noticed a large increase in CPU load.
My system only serves about 30 users and only 1 or 2 are active on a given day, so previously (last year) I ran with only 2 cpus and 4 GB ram. But I started getting timeouts with an unresponsive browser at times, so I upgraded to the recommended hardware.
In the dynamic web ecosystem, systems are scaling and becoming increasingly distributed, elevating the importance of observability, automation, and robust testing frameworks. At the helm of building resilient systems is chaos engineering, a practice where failures are intentionally injected into the system to identify weaknesses before they cause real issues. New Relic runs weekly chaos experiments in our pre-production environment to unearth and address potential system failures, particularly in complex environments like relational databases.
Amazon Aurora, among other databases, poses unique challenges given its distributed nature, architecture, and failover method for providing high availability. This post will cover how to leverage observability and chaos engineering to ensure your services can handle database degradations.
Validate failover and application robustness: This highlights weaknesses in application error handling and driver configurations, offering an opportunity to reinforce system defenses.
New Relic Amazon CloudWatch Metrics Stream using Kinesis Firehose: Utilize the AWS CloudWatch metric streams to send a continuous flow of AWS CloudWatch metrics into New Relic, including real-time monitoring data for Aurora databases.
Failover is a critical procedure for maintaining data availability during unexpected disruptions. The Amazon Aurora failover process is designed to minimize downtime by automatically redirecting database operations to a standby instance. Understanding this operation can guide you in preparing your systems for resilience.
Instance selection: Aurora chooses a suitable reader instance to promote to primary status. The selection factors in the instance's specifications, its availability zone, and predefined failover priorities. For an in-depth look at this process, AWS provides comprehensive details in their section on high availability for Amazon Aurora.
DNS endpoint updates: To reflect the new roles, Aurora updates its DNS records. The writer endpoint points to the new primary instance, while the reader endpoint aggregates the old primary node into its list of readers. The DNS change uses a time-to-live (TTL) configuration of 5 seconds, but client-side caching may affect how quickly this change is recognized.
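To observe how quickly a client actually sees the endpoint move, a small polling loop can time the DNS change. This is a minimal sketch under stated assumptions: the hostname in the usage note is a placeholder, the function name is hypothetical, and the result reflects whatever caching the local resolver applies, which is exactly the client-side effect mentioned above.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_dns_change(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Return seconds until hostname resolves to a new address, or None on timeout."""
    original = resolve(hostname)
    start = clock()
    while clock() - start < timeout_s:
        sleep(interval_s)
        if resolve(hostname) != original:
            return clock() - start
    return None
```

Calling `wait_for_dns_change("my-cluster-writer.example.internal")` just before triggering a failover in a test environment gives a rough client-side measurement to compare against the 5-second TTL.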
During the failover transition, brief connection interruptions are expected. Clients should be prepared for a few seconds of disruption as Aurora shifts traffic to the new WRITER and adjusts the READER setup. Typically, these interruptions last around 5 seconds, as per AWS's observations documented for failover.
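One way an application can ride out an interruption of a few seconds is to retry failed operations with a short exponential backoff. The sketch below is illustrative only: the function name, the default of five attempts, and the choice of ConnectionError as the retryable exception are assumptions, and a real driver will raise its own exception types.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_disconnect(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be >= 1")
```

With the defaults the delays are 0.5, 1, 2, and 4 seconds, so the total wait comfortably covers an interruption of around 5 seconds. Retrying writes is only safe when the operation is idempotent or wrapped in a transaction that did not commit.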
Facilitating a smooth failover demands that our applications and drivers are attuned to handle these transitions gracefully. In the subsequent sections, we discuss driver configurations and optimal settings for your client applications to manage these transitions with minimum impact.
Monitoring for driver and system errors is an integral part of understanding how your applications weather a database failover. Watch both the client-side errors in APM and the errors recorded in the database server logs for signs of writes being sent to an instance in a reader or replica role.
Here are some examples of the errors you might encounter when a MySQL or PostgreSQL database switches to a read-only state during failover, indicating the application attempted a write operation that isn't allowed:
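As an illustration, the standard server messages for this condition are MySQL error 1290 ("The MySQL server is running with the --read-only option so it cannot execute this statement") and PostgreSQL's "cannot execute INSERT in a read-only transaction" (the statement name varies). A minimal sketch of a classifier for log lines or exception text follows; the function name and pattern list are assumptions for illustration, not part of any New Relic or AWS tooling, and your driver may wrap these messages differently.

```python
import re

# Patterns for the stock MySQL and PostgreSQL read-only error messages.
READ_ONLY_PATTERNS = [
    re.compile(r"running with the --read-only option", re.IGNORECASE),
    re.compile(r"cannot execute .+? in a read-only transaction", re.IGNORECASE),
]


def is_read_only_error(message: str) -> bool:
    """Return True if message looks like a write rejected by a read-only instance."""
    return any(pattern.search(message) for pattern in READ_ONLY_PATTERNS)
```

A check like this can tag errors during a chaos experiment so that writes hitting a demoted instance are counted separately from ordinary connection failures.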
Identifying these error messages quickly is crucial for a swift incident response. Whether you're employing New Relic APM to analyze application performance or you're diving into more granular log data, staying vigilant for these error patterns during and after a failover can greatly assist in your troubleshooting process.
To weather the disruptions of failover events, database connection drivers require precise tuning. This not only prepares your system for planned maintenance but also lays the groundwork for coping with unplanned scenarios.
Specify driver options: For AWS databases, consider the specialized JDBC driver for MySQL and JDBC driver for PostgreSQL. If smart drivers aren't available, manage connection lifetime settings to ensure connections are refreshed in line with the DNS TTL.
Configure connection max-lifetime: If a smart database driver is not an option, an alternative is to configure your connection pooler with a max lifetime for each connection. Once a connection has been open longer than that threshold, it is closed and then reopened. A setting like this helps ensure that DNS changes are picked up, because each new connection resolves the endpoint again.
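The max-lifetime idea can be sketched in a few lines. This is a simplified single-connection holder, not a real pooler: the class name is hypothetical, `connect` stands in for whatever factory your driver provides, and production poolers expose this as a setting (for example a max lifetime or recycle option) rather than requiring custom code.

```python
import time
from typing import Any, Callable, Optional


class MaxLifetimeConnection:
    """Hold one connection and reopen it once it exceeds max_lifetime_s."""

    def __init__(
        self,
        connect: Callable[[], Any],
        max_lifetime_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._max_lifetime_s = max_lifetime_s
        self._clock = clock
        self._conn: Optional[Any] = None
        self._opened_at = 0.0

    def get(self) -> Any:
        """Return a connection no older than the configured lifetime."""
        now = self._clock()
        expired = self._conn is not None and now - self._opened_at >= self._max_lifetime_s
        if self._conn is None or expired:
            if self._conn is not None:
                self._conn.close()
            # Reconnecting resolves the endpoint hostname again, so a
            # failover's DNS update is picked up within one lifetime.
            self._conn = self._connect()
            self._opened_at = now
        return self._conn
```

The lifetime is a trade-off: shorter values pick up a new writer sooner but add reconnect overhead.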
New Relic APM: Helps you understand the client side viewpoint. What does the application see for its own activity against the database server? What was the response time, throughput, and error rate?
New Relic Amazon CloudWatch Metrics Stream using Kinesis Firehose: Utilize metric streams to send a continuous flow of CloudWatch metrics into New Relic, including real-time monitoring data for Aurora databases.
Alerting: Set up the appropriate alerts for your situation. Strong candidates are often error rate, throughput and response time. For failover situations, error rate in particular is important, but write throughput and response time monitoring can provide additional value for many situations.
As you continue to refine your resilience strategies, the scope and complexity of your chaos experiments can expand. Anticipate explorations into new services, the impact of different scale and load patterns, and more sophisticated failure simulations.
Bryant Vinisky is a Lead Software Engineer on the Database Engineering team at New Relic. With a focus on observability and reliability, since 2016 he's helped New Relic craft resilient database architectures and drive forward platform-level solutions focused on datastores. When away from his terminal, he's an avid hiker, gamer, musician, and a green-thumbed plant enthusiast.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.