Gerrit replication problem using ssh tunnels and git protocol


Piotr Gadzinski

Aug 13, 2014, 6:08:54 AM
to repo-d...@googlegroups.com
Hi all,

We are running replication using ssh tunnels with git protocol. On the slaves we have a git-daemon started with the following options:

--verbose --syslog --reuseaddr --export-all --enable=receive-pack --listen=127.0.0.1

This setup worked fine for around 2 weeks, but now we are seeing the following replication errors on the master:

ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue (PushOne.java:228): Cannot replicate to git://localhost:31300/repo_name.git; repository not found

On the slaves in the syslog there are 2 different errors coming from the git-daemon:

"Too many children, dropping connection"
"fatal: failed to read object ccb9b07703f5f70dcc32ed8f29d2d24ea3423ca3: Too many open files"

And indeed, the git-daemon has been forking and not cleaning up the child processes. ps aux | grep git-daemon | wc -l shows 33.
Replication is configured to use up to 3 threads per remote. At the same time, the Gerrit queue does not seem to execute that many replication tasks (not even in total, let alone per remote).
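
A minimal sketch of how the process count above could be taken, assuming a POSIX shell. The pipeline `ps aux | grep git-daemon | wc -l` may also count the grep process itself, so the number it reports can be one higher than the number of daemon processes. The function name `count_git_daemon` is made up for this illustration.

```shell
# Count git-daemon processes in `ps` output read from stdin.
# The pattern '[g]it-daemon' matches the text "git-daemon" but not
# the grep command line itself, which contains "[g]it-daemon".
count_git_daemon() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    grep -c '[g]it-daemon' || true
}

# Example invocation on a live system:
#   ps aux | count_git_daemon
```

The result includes the parent daemon, so the number of children is one less than the value printed.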

And at the end we get:

git-daemon status
Checking for service git-daemon                                                                              dead

If we kill the child git-daemon processes and restart the git-daemon service then after some time the situation occurs again.

Here is a sample remote definition:

[remote "name"]
 url = git://localhost:30500/${name}.git
 adminUrl = gerrit-slave-E:/path/gerrit/data/git/${name}.git
 threads = 3
 replicationDelay = 0
 mirror = true

Gerrit version: 2.7
git version: 1.7.12.4

Has anyone faced such a problem? What could be the root cause of this?

Thanks!

Lundh, Gustaf

Aug 13, 2014, 7:22:10 AM
to Piotr Gadzinski, repo-d...@googlegroups.com

Just a guess here, but have you checked whether the receiving git-daemon is just spawning git repacks? Those are quite slow, especially when several of them are running at the same time and you start to run out of memory.

 

Best regards

Gustaf


Piotr Gadzinski

Aug 13, 2014, 9:11:12 AM
to repo-d...@googlegroups.com, piotr.ga...@gmail.com
I have checked and there are hanging git receive-pack processes (children of git-daemon). Some are very old; for example, some were started 3 days ago.

I have also looked into the syslog for repack and found:

git-daemon[456]: error: failed to run repack

but nothing about OOM errors. We have those boxes under monitoring and memory hasn't been an issue; we still have plenty free.
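
A rough sketch of how long-running receive-pack processes like the ones described above could be listed, assuming a POSIX shell and a `ps` that supports the `etime` output field. The helper names `etime_to_seconds` and `list_stale_receive_pack`, and the one-day threshold in the example, are assumptions made for this illustration.

```shell
# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $1
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3) { h = p[1]; m = p[2]; s = p[3] }
        else        { h = 0;    m = p[1]; s = p[2] }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

# Read lines of "pid etime args..." from stdin and print those that
# mention receive-pack and are at least $1 seconds old.
list_stale_receive_pack() {
    threshold="$1"
    while read -r pid etime args; do
        case "$args" in
            *receive-pack*)
                age=$(etime_to_seconds "$etime")
                if [ "$age" -ge "$threshold" ]; then
                    echo "$pid $etime $args"
                fi
                ;;
        esac
    done
}

# Example invocation: processes older than one day (86400 seconds).
#   ps -eo pid=,etime=,args= | list_stale_receive_pack 86400
```

This only reports candidates. Whether it is safe to kill a given process still has to be decided by hand.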

Any other suggestions?

Thanks!

Vlad Canţîru

Aug 13, 2014, 9:04:07 PM
to Piotr Gadzinski, repo-d...@googlegroups.com
Hi,

It looks like you have at least a few potential issues. I cannot say which one affects your replication, but here is what I would do:

- First, git 1.7.12 is too old. I would upgrade to at least 1.8.5, as there were major performance improvements in git starting with 1.8.0.

- Then, for the git daemon options, you can try setting --max-connections to a higher value. The default is 32, and from what I see you have 33, which is why it stops working. I believe this is not the real root cause but a symptom. Until you figure out the real root cause, try playing with this option to buy some time.

- Given the above symptom, my first question would be how clean the repos are. If you don't run garbage collection regularly, that is what I would do as soon as you upgrade your git version: "git gc --aggressive", "git repack..." (or try "gerrit gc", which comes built in with Gerrit). Any of these will do the job; over time you can pick your preferred solution.

- For the "Too many open files" error, I would check the OS open file limit (/etc/security/limits.conf or "ulimit -a"). Git works with a large number of files, and machines with a standard configuration might not have the limit set high enough.
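
A small sketch for the connection and open-file suggestions above, assuming a Linux host where /proc/<pid>/limits is available. The helper name `soft_nofile_limit` and the value 64 are illustrative assumptions, not recommendations from the thread. The daemon options are the ones from the original post, plus --max-connections.

```shell
# Print the soft "Max open files" limit from a file in
# /proc/<pid>/limits format. $1 is the path to that file.
soft_nofile_limit() {
    # Fields: Max(1) open(2) files(3) soft(4) hard(5) units(6)
    awk '/^Max open files/ { print $4 }' "$1"
}

# Example: check the limit that applies to the running daemon,
# where <pid> is the process ID of git-daemon:
#   soft_nofile_limit /proc/<pid>/limits
#
# Example: start the daemon with a higher connection cap
# (64 is an arbitrary illustrative value):
#   git daemon --verbose --syslog --reuseaddr --export-all \
#       --enable=receive-pack --listen=127.0.0.1 --max-connections=64
```

Reading the limit from /proc shows what the daemon process actually got, which can differ from what "ulimit -a" prints in an interactive shell.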

That is pretty much all. I would also suggest a few more things, not directly related to this issue but based on the setup you described, that would give your replication better stability and performance:
 

- I would not use replicationDelay = 0. If there are two consecutive pushes, the second one will fail due to a collision and will be rescheduled by the replication plugin. Especially if repos are dirty, git daemon can spend minutes receiving a commit through the replication operation. With a delay, the replication operation pushes all commits received during the waiting time. It might be counter-intuitive, but the busier a repository is, the longer the delay should be to ensure predictable behavior.
I would suggest starting with the default of 15 seconds, which collects all pushes during this time frame and replicates them as one single transaction. If you really want it lower, adjust gradually and watch the replication performance and logs for errors.


- Also, threads = 3 is too low. There is a high risk of a bottleneck in your replication mechanism. I have seen this on many occasions and I am sure sooner or later it will happen to you. Normally replication is extremely quick, but if somebody pushes large data, like an initial bare repo push, and does it for a few repos, then all available replication threads will be occupied for longer than normally expected, and replication of other repos will stall, waiting in the queue for an available thread. Three is just too low; you can start with 5 or 6 and increase as needed once your system becomes busier.
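
Applied to the sample remote definition from the original post, the two suggestions above might look like this. It is only a sketch: the url and adminUrl are copied unchanged from the question, and the values 15 and 5 are the starting points mentioned above, not tuned settings.

```
[remote "name"]
 url = git://localhost:30500/${name}.git
 adminUrl = gerrit-slave-E:/path/gerrit/data/git/${name}.git
 threads = 5
 replicationDelay = 15
 mirror = true
```

Both values are meant to be adjusted gradually while watching the replication log.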

Hope this helps,

Vladimir Cantiru