On 27 Nov 2020, at 09:27, elzoc...@gmail.com <elzoc...@gmail.com> wrote:

Hi Gerrit Community,

First, a 3.2 upgrade status: I have a snapshot of our production currently running, upgraded successfully to 3.2.5.1. All basic tests look good, which is promising; I am hoping to move prod before the end of December!

I am currently taking a look at the Gatling suite; it is very satisfying to see the clones ramping up, fast :) I still need to integrate it with a Jenkins server and set up the Gatling servers in the cloud, as my home connection isn't that good. I also need to figure out how to add other operations besides clones and fetches.

One strange thing I noted during one of the hops to 3.2: I was asked if I wanted to install the plugin-manager plugin, and I said yes to see what it was. Once Gerrit was booted, clicking the new Plugins tab displayed the plugin-manager page with a "back to gerrit" (or something similar) link on the left, maybe an icon on the right, but nothing else; no plugins displayed. I checked the settings: I have allowRemoteAdmin set, and I tried setting jenkinsUrl to the default gerrit-forge, but I couldn't get anything displayed. Since then I destroyed the server and restarted from scratch; the second time I didn't say yes to installing the plugin, so I can't check further on that point.

OK, that's it for the 3.2 status.

Now to prod: on the 8th of November, we upgraded prod to 2.16.23.
Since then, I've received a few alerts from our monitoring tool about some very high system load (Ubuntu 18.04 LTS). The logs tell me system load has been over 40 in the 1 min and 5 min stats, and over 30 in the 15 min one; CPU activity during those times is normal.
Yesterday I saw my first one live, when one of my colleagues told me his clones were stalling. Attached [show-threads-queue.txt] is an example of show-caches / show-queue at that time: only a few clones in progress, all from very light repos (<100 MiB). All the clones at that time were coming from the same VPC as Gerrit, so proximity/speed should be as good as it can be; no flaky home network :)

repo1 - 97M
repo2 - 3.7M
repo3 - 1.1M
repo4 - 5.6M
repo5 - 484K
repo6 - 1.6M
repo7 - 1.8M
repo8 - 5.3M
repo9 - 14M
I got to run top in time (attached [top.txt]); as you can see, lots of HTTP threads are stuck in uninterruptible sleep. At 17:25:46 all 9 clones were still there; at 17:26:16 they were all gone. I was watching top, and the `D` threads all disappeared at the same time.

On the stats side (thanks for the gerrit-monitoring project; I added those dashboards last month, and they look good :)), over the past 2 months there is no real variation. On the Grafana/Prometheus side, only the system load shows noticeably different behaviour; the Java heap is creeping up, but that has always been the case.
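For anyone wanting to reproduce this kind of capture, a sketch of how the attached top.txt could be produced: dump one batch iteration of per-thread top and keep only threads in uninterruptible sleep (state "D"). Column 8 being the state field matches the default Ubuntu 18.04 top layout; that column index is an assumption, so adjust it if your top build prints a different layout.

```shell
# One batch snapshot of all threads (-H), keep rows whose state field
# (column 8 in the default layout) is "D" = uninterruptible sleep,
# printing the thread id and command name.
top -H -b -n 1 | awk '$8 == "D" {print $1, $NF}'
```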
On the AWS monitoring side, system load again shows the same as Grafana. The other stat that shows different behaviour is the EFS data_read.iobytes graph, see attached [efs_data_read_iobytes.png]. The graphs below span 30 days; as you can see, it's easy to pinpoint the upgrade day without looking at the dates.
<efs_data_read_iobytes.png><grafana_system.png>
Any idea? I've been monitoring the group and didn't see anything similar pass through.
I'll upgrade to 2.16.25 this Sunday and will see if next week is better; if not, I might have to downgrade back to 2.16.22 if our community starts noticing. I've never done a downgrade; it might be a first for me.
Many thanks,
Cedric.
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en
---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/9b18b180-ac6d-4df7-bcd1-05ef90304614n%40googlegroups.com.
<grafana_system.png><top.txt><show-threads-queue.txt><efs_data_read_iobytes.png>
--
On 27 Nov 2020, at 09:33, elzoc...@gmail.com <elzoc...@gmail.com> wrote:

Sorry, the show-threads attachment got eaten; new file attached.
<show-threads-queue.txt><grafana_system.png><efs_data_read_iobytes.png>
Hi Cedric,

Thanks for your long and comprehensive e-mail. I would recommend keeping future e-mails more focused: your current message covers *a lot* of stuff and, at times, it is difficult to focus exactly on the problem you want to highlight and the question to the community.
See my feedback inline.

On 27 Nov 2020, at 09:27, elzoc...@gmail.com <elzoc...@gmail.com> wrote:

> Now to prod, on the 8th of November, we upgraded Prod to 2.16.23.

You should consider v2.16.25, which has two *important* security fixes. Just be careful with the reverse-proxy configuration, and make sure you test your upgrades in staging before going to production.
> Since then, I've received a few alerts from our monitoring tool about some very high system-load (ubuntu 18.04 LTS), logs tell me system_load has been over 40 in the 1min, 5min stat, and over 30 in the 15min one. CPU activity during those times is normal.

Is Gerrit the only process running on the box?
> Attached [show-threads-queue.txt] an example of the show-caches show-queue at that time. Only a few clones in progress, very light repos (<100MiB). [...]
> repo1 - 97M, repo2 - 3.7M, repo3 - 1.1M, repo4 - 5.6M, repo5 - 484K, repo6 - 1.6M, repo7 - 1.8M, repo8 - 5.3M, repo9 - 14M

What are those numbers referring to? The size of the clone? The size of the bare repository?
> I got to do top in time (attached [top.txt]); lots of HTTP threads are stuck in un-interruptible sleep. [...] java Heap is creeping up, but that has always been the case.

That's not normal: have you collected the JVM GC logs and analysed them?
> On the AWS monitoring side, here again systemload shows same as grafana; the other stat which shows a different behaviour is the EFS data_read.iobytes graphs, see attached [efs_data_read_iobytes.png].

Be careful with the AWS EFS settings: I have noticed that the throughput configuration has a huge impact on Gerrit performance.

Off topic: why are you using EFS and not EBS? Are you in an HA configuration?
> Any idea? I've been monitoring the group, didn't see anything similar pass through.

To give you the root cause of the slowdown, I would need access to the full configuration (apart from secrets) and the past 30 days of logs, so that I can analyse the performance figures over time.
Also, you need to enable the JVM GC log to understand *IF* the problems noticed are caused by excessive GC.
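For reference, one way to do that on Gerrit 2.16 (which typically runs on Java 8) is via container.javaOptions in gerrit.config; the path and rotation values below are illustrative assumptions, not recommendations:

```
[container]
  # Java 8 GC-logging flags (replaced by -Xlog:gc* on Java 9+)
  javaOptions = -Xloggc:/path/to/gerrit/logs/gc.log
  javaOptions = -XX:+PrintGCDetails
  javaOptions = -XX:+PrintGCDateStamps
  javaOptions = -XX:+UseGCLogFileRotation
  javaOptions = -XX:NumberOfGCLogFiles=10
  javaOptions = -XX:GCLogFileSize=10M
```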
We've noticed that tuning the JGit cache settings can greatly reduce the frequency and severity of full and ergonomics GCs.
> I'll upgrade to 2.16.25 this Sunday, will see if next week is better, and if not I might have to downgrade back to 2.16.22 if our community starts noticing; never did a downgrade, might be a first for me.
I wouldn't recommend upgrading/downgrading without any evidence of what the problem is. It would be like giving random medicines to a sick patient: it may get better, it may get even worse, you just don't know ;-(

HTH
Luca.
Hi,

(Sorry, Groups gave me an error when I tried publishing; I hope it won't send 5 copies... As I was about to retry, a new message from Matthias arrived, so I will add my answers to it here too. Thanks!)

Thanks for your answers; I will answer most of them inline.

To Anish's question: for Gerrit stats I am using the Grafana dashboards which were posted a few weeks back. AWS/EFS stats are from Datadog.

Last note from Luca: I do not have any meaningful sshd_log. The server is behind an AWS ELB, which only allows HTTPS traffic through; SSH is enabled in the config so we can perform maintenance tasks like gc, flush, or adding internal users.

I took a look through random show-threads outputs I captured over the past few months; the project cache is rarely different from 90%.

To answer Matthias: it's high, but that's just the system load; CPUs are low, so it's probably IO. I included a top.txt in the first message; it's an output of "top -H" sorted by process status and reversed, so the "D"s are on top. His cloning slaves are within the VPC, all in the same AWS AZ, so the interconnect should be optimal. Git GC is run every morning through the gerrit.config [gc] fields.
>> Do you use parallel gc or G1GC?

I don't know, sorry; my Java experience is somewhat limited.
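For what it's worth, a hypothetical way to answer that question (not from the thread) is to ask the JVM itself which collector flag is enabled; this assumes the `java` binary on PATH is the same JVM Gerrit runs on:

```shell
# Print the final values of all JVM flags and keep the collector flags
# that are enabled (e.g. UseParallelGC or UseG1GC = true).
java -XX:+PrintFlagsFinal -version 2>/dev/null \
  | grep -E 'Use(ParallelGC|G1GC|ConcMarkSweepGC) *= *true'
```

On Java 8, Parallel GC is typically the default unless G1 was enabled explicitly in container.javaOptions.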
I did the refs check on 2 repos which were stalling for 5 min earlier; there were 5 other similar clones in that show-threads (<1MiB).

repo 1
$ du -h | tail -n 1
380K .
$ find refs -type f | wc -l
0
$ git show-ref | wc -l
129

repo 2
$ du -h | tail -n 1
7.7M .
$ git show-ref | wc -l
1025
$ find refs -type f | wc -l
0
core.packedGitLimit | 10 MiB | ✅ | Maximum number of bytes to cache in memory from pack files. At most 1/4 of the max heap size.
core.packedGitMmap | false | ✅ | Whether to use Java NIO virtual-memory mapping for the JGit buffer cache. When true, cache windows are mapped via NIO virtual memory; when false, each window is read into a byte[] with standard read calls. true is experimental and may cause instabilities and crashes, since Java doesn't support explicit unmapping of file regions mapped to virtual memory.
core.packedGitOpenFiles | 128 | ⃞ | Maximum number of streams to open at a time. Open packs count against the process limits.
core.packedGitUseStrongRefs | false | ⃞ | Whether the window cache should use strong references (true) or SoftReferences (false). When false, the JVM will drop data cached in the JGit block cache when heap usage comes close to the maximum heap size.
core.packedGitWindowSize | 8 KiB | ✅ | Number of bytes of a pack file to load into memory in a single read operation. This is the "page size" of the JGit buffer cache, used for all pack access operations; all disk IO occurs as single window reads. Setting this too large may cause the process to load more data than required; setting it too small may increase the frequency of read() system calls.
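Tying that table back to gerrit.config: these knobs live in the [core] section. A sketch with illustrative values for a heap of a few GiB follows; the numbers are assumptions to be validated against your own GC logs and load tests, not recommendations:

```
[core]
  # Illustrative values only -- tune against your own heap size and GC logs.
  packedGitLimit = 512m        ; must stay well under 1/4 of the max heap
  packedGitOpenFiles = 2048    ; raise the OS file-descriptor limit to match
  packedGitWindowSize = 64k    ; larger windows mean fewer read() calls
  packedGitMmap = false        ; mmap is experimental per the table above
```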
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/e3832060-88ad-45cc-97f9-cb33118936f6n%40googlegroups.com.