Scaling Gerrit for large installations


Jean-Baptiste Queru

Nov 11, 2014, 9:53:22 PM
to Repo-discuss
Dear Gerrit-ers,

We (Yahoo) are in the process of evaluating the hardware that we need in order to run Gerrit at our scale.

We're using the following 2 sources as references:

As a rule of thumb, we're looking at 2TB+ of repository data, 20k+ projects, 20k+ pushes per day, thousands of users. As a mitigation, we have an easy opportunity to split the world into 6 buckets that we'd shard across 6 masters (which won't be quite even).

We already have systems in place to handle clones/fetches at our scale, and Gerrit would be used only for pushes/reviews.

We'd appreciate anyone's first-hand experience in sizing servers for such an installation, including (if known) the measured or expected benefits of additional RAM, cores, or I/O speed.

Thanks much,
JBQ

--

Jean-Baptiste M. "JBQ" Quéru
Architect, Mobile, Yahoo

Doug Kelly

Nov 12, 2014, 10:53:04 AM
to repo-d...@googlegroups.com, j...@yahoo-inc.com
On Tuesday, November 11, 2014 8:53:22 PM UTC-6, JBQ wrote:
> Dear Gerrit-ers,
>
> We (Yahoo) are in the process of evaluating the hardware that we need in order to run Gerrit at our scale.
>
> We're using the following 2 sources as references:
>
> As a rule of thumb, we're looking at 2TB+ of repository data, 20k+ projects, 20k+ pushes per day, thousands of users. As a mitigation, we have an easy opportunity to split the world into 6 buckets that we'd shard across 6 masters (which won't be quite even).

We're probably not even 1/10th the size (just shy of 2k projects, ~150GB of data, 2k pushes per day; the one metric we come close to is probably supporting somewhere around a thousand users).  We have attempted sharding the repositories across three servers, though this really has roots in organizational differences, and has practically turned into one main server with two servers supporting a much smaller subset of users/projects.  Still, it has balanced some of the load, but at the same time increased our administrative complexity by a factor of three. :)

> We already have systems in place to handle clones/fetches at our scale, and Gerrit would be used only for pushes/reviews.

I think this was probably the best thing for us -- since pushes account for maybe a tenth of the total number of git operations across our servers.  Granted, we've only really succeeded in getting CI to use the systems for clone/fetch (and a relatively small percentage of users), but that still accounts for a large amount of network traffic and also 20-30% of the logins to the server, if memory serves.

> We'd appreciate anyone's first-hand experience in sizing servers for such an installation, including (if known) the measured or expected benefits of additional RAM, cores, or I/O speed.
 
The best documented example of sizing the machines I've seen was actually the presentation Ericsson gave at the last Gerrit User Summit (and I believe their scale is on your order of magnitude):
Martin has also been trying to gather statistics on larger sites in another thread:

The configuration we use is 32 cores and 64GB RAM, with rather fast SSDs directly attached to each server.  On most servers, this is barely taxed (practically, we could run all three of our servers combined on a single physical system).  On our main server, a little under half the RAM is used by the JVM, while the rest is used as disk cache by the OS.  Obviously, the number of cores lets you scale the number of concurrent connections, but this also ties into the number of database threads (and your supporting database).  We use PostgreSQL, and I've estimated (possibly overkill) 1 DB thread per HTTP thread and 2 DB threads per SSH thread.  I'm not sure this estimate is accurate, but it certainly isn't constraining us (one DB thread per HTTP/SSH thread may be sufficient).  Your backend database should scale accordingly (and I believe ours is on a similarly-configured system).
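
For concreteness, the knobs Doug describes live in `$site_path/etc/gerrit.config`. Here is a hedged sketch using his 1-DB-thread-per-HTTP and 2-per-SSH estimate; `sshd.threads`, `httpd.maxThreads`, and `database.poolLimit` are real Gerrit 2.x options, but the site path and the sizes are illustrative, not recommendations:

```shell
# Illustrative sizing for a ReviewDb-era Gerrit master; keys are real
# gerrit.config options, numbers and the site path are examples only.
GERRIT_SITE="${GERRIT_SITE:-/tmp/gerrit-site}"   # assumed site path
mkdir -p "$GERRIT_SITE/etc"
CFG="$GERRIT_SITE/etc/gerrit.config"

git config -f "$CFG" sshd.threads 32        # roughly one per core
git config -f "$CFG" httpd.maxThreads 32
git config -f "$CFG" database.poolLimit 96  # 32 HTTP*1 + 32 SSH*2
```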

Since you said the clones/fetches are handled elsewhere, it sounds like you'd be different from many of the other configurations (even ones that use replication, but still use Gerrit to serve replicated content to honor the ACLs).  You may decide to use Gerrit's replication plugin to push to your mirroring solution (and certainly, the tips there about using the git protocol over some trusted connection would be valid).  The thing that I'm thinking of is the fact that you won't be handling clones -- that's usually the most expensive operation in CPU time, and pushes (and even fetches) are relatively short-lived by comparison.  20k/day is still a lot, though. :)
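
If you do push to an external mirror tier, the replication plugin reads `$site_path/etc/replication.config`. A minimal sketch follows; `remote.NAME.url`, `mirror`, `threads`, and the `${name}` placeholder are the plugin's documented options, while the remote name, host, and thread count are made up:

```shell
# Sketch: replicate every project to a hypothetical mirror host.
GERRIT_SITE="${GERRIT_SITE:-/tmp/gerrit-site}"   # assumed site path
mkdir -p "$GERRIT_SITE/etc"
RCFG="$GERRIT_SITE/etc/replication.config"

# ${name} is expanded by the plugin, so keep it single-quoted here.
git config -f "$RCFG" remote.mirrors.url 'git://mirror.example.com/${name}.git'
git config -f "$RCFG" remote.mirrors.mirror true   # also propagate deletes
git config -f "$RCFG" remote.mirrors.threads 4
```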

Our previous main Gerrit server was 16 cores, 36GB RAM.  When we retired it, we repurposed it as our main mirror server, handling the CI connections.  It still serves just fine in this role, and while the master server doubled most of the specs of that server, it wasn't lagging much before the migration, and the new server certainly seems to be comfortable handling the load -- but it did allow us to increase the maximum number of concurrent connections and the JVM's maximum heap size.  Also, while we increased the capacity of our master server, the number of users we've supported hasn't changed much, and the number of repositories has increased at a fairly steady 2-3 a week on average, I think.  Additionally, our load has been steady over the past 7 months or so (I don't have data from before we changed systems, but it wasn't long after the move I started capturing more).

While not nearly on your scale, I hope some of this anecdotal evidence is helpful.  There's still a lot to be said for other operations that aren't strictly Gerrit's responsibility (such as repacking repositories regularly) -- housekeeping tasks such as this are still important, and will benefit performance (though, I believe most of the gain from the bitmap indexes is in fetching).  It seems that currently we're bound by database connections and cores (since a single git operation tends to occupy one core for a period of time) for the number of threads we can support, and not so much the memory (though throwing additional memory at Java does tend to get it used).
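
The housekeeping Doug mentions can be as simple as a scheduled repack. This is a minimal sketch for one bare repo; the path is illustrative, and the flags are standard git (`-b` writes the bitmap index he refers to, which mostly speeds up clone/fetch object counting):

```shell
# Minimal housekeeping pass for one repo; path is hypothetical.
REPO="${REPO:-/srv/git/example.git}"
if [ -d "$REPO" ]; then
    # Collapse everything into a single pack, drop old packs,
    # and write a reachability bitmap alongside it.
    git -C "$REPO" repack -a -d -b
    # Consolidate loose refs so ref reads don't hit one file per ref.
    git -C "$REPO" pack-refs --all
fi
```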

Best of luck!

--Doug Kelly

Vlad Canţîru

Nov 12, 2014, 6:32:27 PM
to Doug Kelly, repo-d...@googlegroups.com, j...@yahoo-inc.com
Hi,

Doug has covered all aspects pretty well, but I would still emphasize a few things. First, you have to decide where you will store your repos: on a local file system (e.g. SSD) or on network storage, because this will significantly influence the architecture and hardware specs. I hope the next few ideas will help you decide.

Keeping 20K repos/2TB of data clean is not easy; it takes significant time and CPU power. Also, allocate extra disk space generously. Our general rule is to keep at least double the size of all hosted git repos free. This is especially sensitive if you have large repositories that can temporarily "explode" in size during garbage collection. Garbage collection does occasionally fail (it happens), and then instead of a 30GB repo you end up with a 100GB one until the next repack finally succeeds.
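
That 2x rule of thumb is easy to turn into a monitoring check. The sketch below is a toy illustration; the path is hypothetical and the `headroom_ok` helper is ours, not part of any Gerrit tooling:

```shell
# Toy check of the "keep >= 2x the repo data free for GC" rule of thumb.
headroom_ok() {  # usage: headroom_ok <repos_kb> <free_kb>
    [ "$2" -ge $(( $1 * 2 )) ]
}

# /srv/git is an illustrative path; skip the check if it doesn't exist.
repos_kb=$(du -sk /srv/git 2>/dev/null | awk '{print $1}')
free_kb=$(df -Pk /srv/git 2>/dev/null | awk 'NR==2 {print $4}')
if [ -n "$repos_kb" ] && [ -n "$free_kb" ]; then
    headroom_ok "$repos_kb" "$free_kb" || echo "WARNING: less than 2x GC headroom"
fi
```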

If you choose to store your gits on a local file system and will deal with pushes and web UI operations only (distributed across six master instances), you can probably look at medium-to-low capacity hardware, or even VMs. A local file system with few cores makes it a challenge to keep your gits clean, but there is still a small advantage: on a local file system you can repack less often. Reading refs in dirty gits on SSD will still be quick enough to give good Gerrit performance, but eventually you'll have to deal with garbage collection.

With shared storage you would want a dedicated physical box for repack operations; the more cores the better, otherwise a full cycle might take you days. Some of the busy repos you'll most likely have to clean more than once a day. You want to avoid reading ref by ref from network storage that has at best medium I/O performance. Repacking your gits often is key for system performance. An alternative is to ignore dormant repos and clean only the busy ones; still, I believe dedicated hardware, if at all possible, is a must at this scale.
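
The "clean only the busy repos" idea can be sketched as below, assuming bare repos under an illustrative path and GNU find; note that repos whose refs are fully packed would need a different recency test (e.g. pack or log mtimes):

```shell
# Sketch: repack only repos whose loose refs changed in the last day,
# leaving dormant repos alone.  Path and schedule are illustrative.
GIT_ROOT="${GIT_ROOT:-/srv/git}"
for repo in "$GIT_ROOT"/*.git; do
    [ -d "$repo" ] || continue   # skip if the glob matched nothing
    # Any ref file newer than a day counts the repo as "busy".
    if find "$repo/refs" -type f -newermt '1 day ago' 2>/dev/null | grep -q .; then
        git -C "$repo" repack -a -d -b
    fi
done
```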

Also, if you store gits on network shared storage, lots of RAM helps cache repos, which keeps Gerrit performance up. I suspect this is the reason for the 1TB of RAM mentioned in the scaling article. Obviously it all depends on which storage technologies you use, but in most cases there is a trade-off between local file system and network storage, unless you have access to technologies that can give you both.

Another important aspect/question is: will you totally restrict reads from the master, or will reads still be allowed/expected?
This matters for how you tune your Java heap, because the JVM loads every repo that is cloned into memory, and those objects end up in the old-generation heap. If there are no clones at all, I can speculate that you will be fine with as little as 6GB of Java heap, maybe less; but if there are fetches/clones, especially reads of large repos, full JVM garbage collection will kick in too often. In some cases, when a few large repos are fetched every few minutes (e.g. Jenkins with a polling approach), those repos end up in the old-generation heap "forever". The best case you can get is "stop the world" garbage collection that runs way too often; the worst case is that the old-generation heap cannot be flushed and Gerrit is crippled, running on the young-generation heap only.
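
To make the heap discussion concrete: the heap and GC flags live in gerrit.config's [container] section. Here is a sketch using Vlad's 6GB no-clones figure; `container.heapLimit` and `container.javaOptions` are real options, while the site path, log path, and the JDK 7/8-era GC-logging flags are illustrative:

```shell
# Illustrative heap settings for a push-mostly master.
GERRIT_SITE="${GERRIT_SITE:-/tmp/gerrit-site}"   # assumed site path
mkdir -p "$GERRIT_SITE/etc"
CFG="$GERRIT_SITE/etc/gerrit.config"

git config -f "$CFG" container.heapLimit 6g   # the no-clones estimate above
# Log GC activity so you can watch full-GC frequency before resizing.
git config -f "$CFG" container.javaOptions \
    "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/gerrit/gc.log"
```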

We used to safely run the master with about 15K pushes and 40-45K reads a day on 16 cores/64GB RAM with a 16-18GB Java heap, and full garbage collection ran every 1-1.5 hours during the busiest part of the day. Reads from the master were not restricted, but we managed to serve 95% of reads from slaves. We had to ban master reads for a number of large repos to keep the Java heap that low.
We have since upgraded the hardware, become less restrictive, and increased the Java heap significantly, though it is still in the range of 40GB (a short list of repos are still banned and can be read from slaves only). Tuning the Java heap is sensitive; our approach is to consider an incremental increase once full garbage collections consistently run less than one hour apart.

Without knowing specific details like storage type, git repo sizes, and whether the master will serve reads, it is hard to be more precise than this.
No matter which way you decide to go, I don't think you'll have any significant issues or special hardware needs. In most cases people get in trouble not because of the hardware but because of not configuring system settings well and not revising them periodically[1]. Getting familiar with all the options will make the difference.



Hope this helps,
Vladimir Cantiru

--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jean-Baptiste Queru

Nov 12, 2014, 7:44:46 PM
to Vlad Canţîru, Doug Kelly, repo-d...@googlegroups.com
Vlad, Doug,

Very helpful. Thanks a lot.

JBQ
 
--

Jean-Baptiste M. "JBQ" Quéru
Architect, Mobile, Yahoo

