gerrit gc can delete objects currently being pushed

Doug Kelly

Mar 25, 2015, 10:50:12 AM
to repo-d...@googlegroups.com
So, I came into work today to an interesting problem.  I think I know the cause, but I wanted to get some other opinions, too.  In this specific case, a gerrit gc was started within seconds of the user pushing the object.  From the database, I see the patch created at 2015-03-24 21:02:32.427-05.  In the GC log, I find the repack started at 2015-03-24 21:02:31,767.  The user received an unpacker error ("Missing unknown" for the sha1 he was attempting to upload), and the database entries otherwise look normal.

This sounds like a race condition.  As I recall, the gc runner script written before my time perhaps foolishly removes the reflog, which could open the window for such a race (I can't find the source, but I believe I've heard this warning before in connection with git reflog expire --expire=now).  This is also the first time we've seen such a, erm... lucky user.

Some of the detailed messages:

"C:\Program Files (x86)\Git\bin\git.exe" push --recurse-submodules=check --progress "origin" master:refs/for/<branch>
Counting objects: 7, done.
Delta compression using up to 12 threads.
Total 4 (delta 3), reused 0 (delta 0)
fatal: Unpack error, check server log
error: unpack failed: error Missing unknown 0150f526448c6ed6113b17fcbf6cc47ab3f90125
To <server>:<project>
! [remote rejected] master -> refs/for/<branch> (n/a (unpacker error))
error: failed to push some refs to '<server>:<project>'
Done

And from gc_log on the server:
[2015-03-24 21:02:31,767] INFO  : [<project>] gc config: gc.autopacklimit=4; gc.packrefs=true; gc.reflogexpire=never; gc.reflogexpireunreachable=never;
[2015-03-24 21:02:31,768] INFO  : [<project>] pack config: maxDeltaDepth=50, deltaSearchWindowSize=10, deltaSearchMemoryLimit=0, deltaCacheSize=52428800, deltaCacheLimit=100, compressionLevel=9, indexVersion=2, bigFileThreshold=52428800, threads=8, reuseDeltas=true, reuseObjects=true, deltaCompress=true, buildBitmaps=true
[2015-03-24 21:02:32,065] INFO  : [<project>] before: sizeOfPackedObjects=357748052, sizeOfLooseObjects=444265, numberOfPackedObjects=233644, numberOfPackFiles=53, numberOfPackedRefs=30706, numberOfLooseRefs=87, numberOfLooseObjects=141
[2015-03-24 21:03:02,217] INFO  : [<project>] after:  sizeOfPackedObjects=356300241, sizeOfLooseObjects=398, numberOfPackedObjects=233466, numberOfPackFiles=2, numberOfPackedRefs=30792, numberOfLooseRefs=2, numberOfLooseObjects=1

My gut tells me to use sane reflog expire values (maybe 1 day for unreachable, 7 days for reachable), remove the hack of deleting the reflog entirely, and hope this issue doesn't occur again.  If it does happen again, at least we'll know the cause isn't removing the reflog seconds before running a gc!
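
For reference, something like this in the repo config is what I have in mind (the values are just my guess at "sane", not anything we've tested):

  [gc]
    reflogExpire = 7 days
    reflogExpireUnreachable = 1 day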

Christian Halstrick

Mar 25, 2015, 12:50:52 PM
to repo-d...@googlegroups.com
In JGit we at least try to cope with this situation.  When the server receives a push, the received data goes into a packfile, and this packfile is protected by a file with the same name as the packfile but with the suffix ".keep".  JGit's garbage collector takes care not to remove packfiles which have such a keep file (even if the objects in that packfile are not yet referenced by any commits).  But at some point JGit decides that the push is fully handled and "unlocks" the packfile (removes the .keep file).  If Gerrit has not yet triggered the creation of the new refs (refs/changes/...) and gc starts in that moment, I guess that gc will find the objects in the packfile to be unreachable and will delete the packfile (and therefore the just-pushed objects).
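
To illustrate (file names made up), while the push is still being handled the pack directory looks roughly like:

  objects/pack/pack-3f2a9c....pack
  objects/pack/pack-3f2a9c....idx
  objects/pack/pack-3f2a9c....keep    <- gc skips this pack as long as the .keep file exists

Once the .keep file is gone and before the refs/changes/* refs exist, nothing protects those objects from gc.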

But I don't see how increasing reflog expire values would help in this situation.

Doug Kelly

Mar 25, 2015, 1:53:51 PM
to repo-d...@googlegroups.com


On Wednesday, March 25, 2015 at 11:50:52 AM UTC-5, Christian Halstrick wrote:
In JGit we at least try to cope with this situation.  When the server receives a push, the received data goes into a packfile, and this packfile is protected by a file with the same name as the packfile but with the suffix ".keep".  JGit's garbage collector takes care not to remove packfiles which have such a keep file (even if the objects in that packfile are not yet referenced by any commits).  But at some point JGit decides that the push is fully handled and "unlocks" the packfile (removes the .keep file).  If Gerrit has not yet triggered the creation of the new refs (refs/changes/...) and gc starts in that moment, I guess that gc will find the objects in the packfile to be unreachable and will delete the packfile (and therefore the just-pushed objects).

But I don't see how increasing reflog expire values would help in this situation.


Interesting.  Yeah, I found out we're not expiring the entire reflog, just deleting logs/refs/changes.  From the sound of it, there is a (possibly quite small) window where the pack could be unreferenced, with no .keep file, and get garbage collected.  And if that's not recorded in the reflog (which seems to be the case, since the ref hasn't been created yet), I can see how reflog expiry wouldn't help in the specific case you've described.  There was an entry under refs/changes/* pointing to the SHA1 of the change, which is how I found the issue to begin with -- replication was stuck because it couldn't find the object behind refs/changes.  So, did JGit perhaps allow blindly creating the ref without ensuring the object exists?  I know the C git client will complain if you attempt to create a ref to a nonexistent object...
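
(As a made-up example of what I mean, using the sha1 from my original mail and an arbitrary ref name:

  $ git update-ref refs/changes/25/12345/1 0150f526448c6ed6113b17fcbf6cc47ab3f90125

C git refuses this with an error about writing a ref with a nonexistent object, if I remember the wording correctly.)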

--Doug

Christian Halstrick

Mar 27, 2015, 6:00:52 AM
to repo-d...@googlegroups.com
I started a related discussion for JGit here: http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg02860.html

Jon Bettcher

May 20, 2015, 12:40:44 PM
to Christian Halstrick, Repo and Gerrit Discussion
I too seem to have run into this issue today with gerrit 2.11 built off of stable-2.11.  Timeline:
  • User begins push of draft to repo at 05:01:11,707
  • Gerrit GC for a repo begins at 05:01:13,58
  • User completes push of draft to repo at 05:01:18,815
  • At 05:03:45 GC completes with the exception below in the error_log
  • All further pushes to the repo also result in failures due to the same missing unknown
The resolution was to delete the offending ref from gerrit's git repo, as well as the offending patchset from the reviewdb.
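
Roughly, the cleanup looked like this, assuming change 12345, patch set 1 (substitute the real change number and patch set, and back up the DB first):

  $ git --git-dir=/path/to/<project>.git update-ref -d refs/changes/45/12345/1
  gerrit> DELETE FROM patch_sets WHERE change_id = 12345 AND patch_set_id = 1;

(the second statement run through gerrit gsql against the reviewdb)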

Gerrit gc is currently enabled for all repos, once a day, during what should be a low-usage period, but the repository does get used by teams around the globe.
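
For context, the schedule is configured along these lines in gerrit.config (times here are illustrative):

  [gc]
    startTime = 05:00
    interval = 1 day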

Has anyone discovered a way to mitigate this race condition? 

[2015-05-20 05:03:47,444] ERROR com.google.gerrit.server.git.GarbageCollection : 
org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
        at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:192)
        at com.google.gerrit.server.git.GarbageCollection.run(GarbageCollection.java:83)
        at com.google.gerrit.server.git.GarbageCollection.run(GarbageCollection.java:61)
        at com.google.gerrit.server.git.GarbageCollectionRunner.run(GarbageCollectionRunner.java:83)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:379)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing unknown 50b4c4a788b84f823f315631abf14a3f17c31f29
        at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:148)
        at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:229)
        at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:865)
        at org.eclipse.jgit.internal.storage.file.GC.prune(GC.java:396)
        at org.eclipse.jgit.internal.storage.file.GC.gc(GC.java:170)
        at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:175)



Bassem Rabil

May 21, 2015, 8:05:15 AM
to repo-d...@googlegroups.com, j...@bettcher.net, christian...@gmail.com
We experienced this behaviour before, using JGit 3.6.2 and 3.7 with Gerrit 2.10.  We run garbage collection with standalone JGit, outside the JVM running the Gerrit application.
To resolve this we fetch the missing ref from the slaves.
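
Something along these lines, with a made-up ref name (the real ref comes from the affected change):

  git fetch <slave>:<project> refs/changes/45/12345/1:refs/changes/45/12345/1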

Christian,
Any updates on your proposed fix for this, i.e. adding a grace period for pack files to the garbage collection in JGit?

Thanks
Bassem

Bassem Rabil

May 22, 2015, 10:04:00 AM
to repo-d...@googlegroups.com, j...@bettcher.net, christian...@gmail.com
I registered a bug in the JGit tracker for this issue at:

Christian Halstrick

May 22, 2015, 10:30:38 AM
to repo-d...@googlegroups.com, christian...@gmail.com, j...@bettcher.net
Let's continue the discussion on the JGit bug.

Bassem Rabil

Jun 8, 2015, 9:21:51 AM
to repo-d...@googlegroups.com, j...@bettcher.net, christian...@gmail.com
We experienced another instance of this issue, with objects being deleted while they were being pushed.  The impact at our end is severe because it blocks all pushes to the affected repository.  Is there a chance to get this fixed for the coming 2.11.1 release?

Thanks
Bassem

Matthias Sohn

Jun 16, 2015, 10:31:00 AM
to Bassem Rabil, Repo and Gerrit Discussion, j...@bettcher.net, Christian Halstrick
Christian has proposed a fix in JGit.
Could you give this fix a try?
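
If I read the change correctly, it teaches gc to leave packfiles younger than a grace period alone.  Assuming the grace period ends up configurable the same way the loose-object expiry is, it would be set roughly like this (the key name here is my assumption, please check the change itself):

  [gc]
    prunepackexpire = 1.hour.ago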

-Matthias

Bassem Rabil

Jun 16, 2015, 11:06:30 AM
to repo-d...@googlegroups.com, j...@bettcher.net, bassem.ra...@ericsson.com, christian...@gmail.com
Unfortunately we were unable to reproduce this issue on our test/staging environments.  Whenever we increase the push rate to try to reproduce it, we run into locking issues at the repository and DB level, and decreasing the push rate while running garbage collection does not reproduce the issue either.  This case needs all the planets aligned to reproduce consistently, and it might take a few months to show up again.  However, I will run gc stress tests with this fix on our test environment to make sure it brings no regressions, and I will be able to confirm the fix once it is in our production, around the July-August time frame.  I will keep you posted on the JGit bug tracker regarding the stress test results.

Thanks 
Bassem