jgit "pack is corrupt, removing it from pack list" in Gerrit 2.10


James E. Blair

May 11, 2015, 6:14:34 PM
to repo-d...@googlegroups.com
Hi,

OpenStack just upgraded to version 2.10 (patch level equivalent to
2.10.3), with jgit 3.7.2, and we see the following error shortly after
the system starts to see significant load:

[2015-05-11 16:30:13,687] ERROR org.eclipse.jgit.internal.storage.file.ObjectDirectory : Pack file /home/gerrit2/review_site/git/openstack/nova.git/objects/pack/pack-93ad57004de887eb835b2bd4df2d7c3f6a5c394b.pack is corrupt, removing it from pack list
org.eclipse.jgit.errors.CorruptObjectException: Object at 87,706,216 in /home/gerrit2/review_site/git/openstack/nova.git/objects/pack/pack-93ad57004de887eb835b2bd4df2d7c3f6a5c394b.pack has bad zlib stream
at org.eclipse.jgit.internal.storage.file.PackFile.load(PackFile.java:840)
at org.eclipse.jgit.internal.storage.file.PackFile.get(PackFile.java:259)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openPackedObject(ObjectDirectory.java:417)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openPackedFromSelfOrAlternate(ObjectDirectory.java:386)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObject(ObjectDirectory.java:378)
at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:145)
at org.eclipse.jgit.diff.ContentSource$ObjectReaderSource.open(ContentSource.java:140)
at org.eclipse.jgit.diff.ContentSource$Pair.open(ContentSource.java:276)
at org.eclipse.jgit.diff.DiffFormatter.open(DiffFormatter.java:1033)
at org.eclipse.jgit.diff.DiffFormatter.createFormatResult(DiffFormatter.java:963)
at org.eclipse.jgit.diff.DiffFormatter.toFileHeader(DiffFormatter.java:928)
at com.google.gerrit.server.patch.PatchListLoader$2.call(PatchListLoader.java:203)
at com.google.gerrit.server.patch.PatchListLoader$2.call(PatchListLoader.java:200)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.zip.DataFormatException
at org.eclipse.jgit.internal.storage.file.WindowCursor.inflate(WindowCursor.java:323)
at org.eclipse.jgit.internal.storage.file.PackFile.decompress(PackFile.java:340)
at org.eclipse.jgit.internal.storage.file.PackFile.load(PackFile.java:813)
... 16 more

Because that pack file holds the bulk of the objects in that repository,
many actions on that repository fail thereafter.

A restart of Gerrit clears the error for a while, but then it returns.
This affects more than one repository (we've seen at least 11
repositories affected so far). For a given repository, the largest pack
file is the one affected. This may simply be due to chance as that pack
file holds the bulk of the objects for the repo. We have seen it affect
pack files ranging in size from 1 MB to 300 MB.

We have not seen a repeat of the file offset in that error, so each time
the error returns, it seems to be in a different place in the file.

Git fsck reports no errors other than some dangling commits. Git
verify-pack reports no errors on the pack file. We used git show-index
to find the blob at the offset indicated, and git show on that blob
works fine. We installed jgit-cli and it is also able to show the same
blob without error.

The files are on a local ext4 filesystem. We do repack repositories
once a week, and we believe we have also seen errors related to the pack
file names being reused. However, we believe that to be a separate
error and not related to the "bad zlib stream" error under discussion
here. In particular, we have restarted Gerrit multiple times since
encountering it after the last external repack was completed, so nothing
should be altering the repositories on disk aside from Gerrit at this
point.

It's worth noting that this commit recently touched relevant parts of
jgit:

https://github.com/eclipse/jgit/commit/94c4d7eee85d5ffe19d04c5a6e60192430d4fe1e#diff-cd9200eacde17f31ca6b3f490d4a2a97R319

We have not found a logic error in the commit (which is in jgit 3.7).
However, with it, if the supplied buffer is not large enough to receive
the decompressed object, the packfile will be marked as corrupt. If the
packfile is not actually corrupt (which we believe to be the case), this
suggests that either Gerrit is not sizing the buffer correctly or
unexpected data are being produced in the decompression process (whether
via zlib or jgit's window routines).
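
To make the buffer-size concern concrete, here is a minimal sketch that uses only java.util.zip. It is not jgit's code: the class name InflateBufferSketch and both helper methods are made up for illustration. It only shows that a healthy zlib stream inflated into a destination buffer that is too small stops with the buffer full and the stream unfinished, which a caller could mistake for corruption if it assumes the buffer was sized correctly.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflateBufferSketch {
    /** Compresses raw bytes into a zlib stream (illustrative helper). */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        // Generous output buffer for the small inputs used in this sketch.
        byte[] out = new byte[raw.length + 64];
        int n = deflater.deflate(out);
        deflater.end();
        return Arrays.copyOf(out, n);
    }

    /**
     * Inflates into a buffer of dstSize bytes with a single call and reports
     * whether the end of the zlib stream was reached.
     */
    public static boolean finishesWithin(byte[] compressed, int dstSize)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] dst = new byte[dstSize];
            inflater.inflate(dst);
            return inflater.finished();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] raw = new byte[1000];
        Arrays.fill(raw, (byte) 'a');
        byte[] packed = deflate(raw);
        System.out.println("roomy buffer finished: "
                + finishesWithin(packed, raw.length + 16));
        System.out.println("short buffer finished: "
                + finishesWithin(packed, raw.length / 2));
    }
}
```

In the short-buffer case no exception comes from zlib itself. Any "corrupt" verdict would have to come from the caller's handling of the unfinished stream, which is why the buffer sizing is worth checking.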

We would appreciate some suggestions as to how to further diagnose this
error.

Thanks,

Jim

Justin Clift

May 14, 2015, 10:01:44 PM
to James E. Blair, repo-d...@googlegroups.com
On 11 May 2015, at 23:20, James E. Blair <cor...@gnu.org> wrote:
<snip>
> OpenStack just upgraded to version 2.10 (patch level equivalent to
> 2.10.3), with jgit 3.7.2, and we see the following error shortly after
> the system starts to see significant load:
>
> [2015-05-11 16:30:13,687] ERROR org.eclipse.jgit.internal.storage.file.ObjectDirectory : Pack file /home/gerrit2/review_site/git/openstack/nova.git/objects/pack/pack-93ad57004de887eb835b2bd4df2d7c3f6a5c394b.pack is corrupt, removing it from pack list
> org.eclipse.jgit.errors.CorruptObjectException: Object at 87,706,216 in /home/gerrit2/review_site/git/openstack/nova.git/objects/pack/pack-93ad57004de887eb835b2bd4df2d7c3f6a5c394b.pack has bad zlib stream
<snip>
> A restart of Gerrit clears the error for a while, but then it returns.

For us (gluster.org), we do a manual "git gc" on the affected repo when
it happens, and the error seems to go away for a week or two before
needing it again.

We're on an older Gerrit version though, which doesn't have the automatic
git gc stuff. 2.10.x does though, so you may find benefit in enabling it
for your installation (daily maybe?).

https://gerrit-documentation.storage.googleapis.com/Documentation/2.10/config-gerrit.html#gc
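
As a rough illustration, a daily scheduled gc might look like the fragment below in gerrit.config. This is a sketch only: gc.startTime and gc.interval are the keys that documentation section is understood to describe, and the 6:00 start time is an arbitrary example value, so check the linked page for the exact syntax before relying on it.

```ini
[gc]
  # Assumed example values; adjust to a quiet period for your site.
  startTime = 6:00
  interval = 1 day
```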


<snip>
> The files are on a local ext4 filesystem. We do repack repositories
> once a week, and we believe we have also seen errors related to the pack
> file names being reused. However, we believe that to be a separate
> error and not related to the "bad zlib stream" error under discussion
> here. In particular, we have restarted Gerrit multiple times since
> encountering it after the last external repack was completed, so nothing
> should be altering the repositories on disk aside from Gerrit at this
> point.
>
> It's worth noting that this commit recently touched relevant parts of
> jgit:
>
> https://github.com/eclipse/jgit/commit/94c4d7eee85d5ffe19d04c5a6e60192430d4fe1e#diff-cd9200eacde17f31ca6b3f490d4a2a97R319
>
> We have not found a logic error in the commit (which is in jgit 3.7),
> however, with it, if the supplied buffer is not large enough to receive
> the decompressed object, the packfile will be marked as corrupt. This
> suggests that if the packfile is not actually corrupt (which we believe
> to be the case) that either Gerrit is not sizing the buffer correctly or
> unexpected data are being produced in the decompression process (whether
> via zlib or jgit's window routines).
>
> We would appreciate some suggestions as to how to further diagnose this
> error.

If you find and fix a Gerrit bug to resolve this, that would be really
welcome. :)

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift

Doug Kelly

May 18, 2015, 2:14:01 PM
to repo-d...@googlegroups.com, cor...@gnu.org, jus...@gluster.org


On Thursday, May 14, 2015 at 9:01:44 PM UTC-5, Justin Clift wrote:
> On 11 May 2015, at 23:20, James E. Blair <cor...@gnu.org> wrote:
> <snip>
> > OpenStack just upgraded to version 2.10 (patch level equivalent to
> > 2.10.3), with jgit 3.7.2, and we see the following error shortly after
> > the system starts to see significant load:
> >
> > [2015-05-11 16:30:13,687] ERROR org.eclipse.jgit.internal.storage.file.ObjectDirectory : Pack file /home/gerrit2/review_site/git/openstack/nova.git/objects/pack/pack-93ad57004de887eb835b2bd4df2d7c3f6a5c394b.pack is corrupt, removing it from pack list
> > org.eclipse.jgit.errors.CorruptObjectException: Object at 87,706,216 in /home/gerrit2/review_site/git/openstack/nova.git/objects/pack/pack-93ad57004de887eb835b2bd4df2d7c3f6a5c394b.pack has bad zlib stream
> <snip>
> > A restart of Gerrit clears the error for a while, but then it returns.
>
> For us (gluster.org), we do a manual "git gc" on the affected repo when
> it happens, and the error seems to go away for a week or two before
> needing it again.
>
> We're on an older Gerrit version though, which doesn't have the automatic
> git gc stuff.  2.10.x does though, so you may find benefit in enabling it
> for your installation (daily maybe?).
>
>   https://gerrit-documentation.storage.googleapis.com/Documentation/2.10/config-gerrit.html#gc

I can confirm both of these points.  Specifically, I've encountered the error while doing a reindex as part of the upgrade path, and also solved the issue by running a manual "git gc" (this was in the process of converting a very old, legacy Gerrit database to the latest version, making the required upgrades as I went).  That said, I've been running 2.10.2 on our main servers for a few weeks now, and not yet seen this issue with a running system... but it will be another point to watch for!

--Doug 

Will Saxon

May 19, 2015, 1:17:45 PM
to repo-d...@googlegroups.com, cor...@gnu.org, jus...@gluster.org

We're stuck on 2.9 for the same reasons. Reindex barfs on some of our repos - we have a handful containing changesets with large zipfiles - and the result is that these repos either show a severely reduced number of changes or no changes at all. I've been trying to find time to write this up. We've run gc/repack/prune and then fsck prior to attempting reindex, and there is no indication that we have corrupt packs or anything that should prevent indexing.

The exceptions we get are all similar to the above. In all cases, the exceptions are prefixed with something like this:

[2015-05-19 12:13:52,876] WARN  com.google.gerrit.server.patch.PatchListLoader : 5000 ms timeout reached for Diff loader in project theProject on commit e3c20263e55cecaf24ba476972b01a3a55eecc97 on path somePath/someZip.zip comparing a68c4513d85b9f3685371d5fdc65aa7b3be5d1f6..f6ddef38ce110e9ea3d04a8fff2eafb12f5380f7

In some cases the somePath/someZip.zip is /dev/null; these seem to correlate with the second SHA being 0.

At least for us, I think these exceptions are entirely due to mishandling of the timeout. Is there a way to bump the timeout? Running gc is not doing anything for us.

We are running Gerrit in a 4CPU/16GB VM, which has been fine for day-to-day usage and prior upgrades, including the initial index creation.

-Will

Bassem Rabil

May 19, 2015, 1:37:36 PM
to repo-d...@googlegroups.com, cor...@gnu.org, jus...@gluster.org
This can be tuned with the "diff" cache configuration in gerrit.config [1]. I am quoting the description of this cache setting below. If your Gerrit instance has many changes with large diffs, I think you may need to increase this to a few minutes.


"cache.diff.timeout
Maximum number of milliseconds to wait for diff data before giving up and falling back on a simpler diff algorithm that will not be able to break down modified regions into smaller ones. This is a work around for an infinite loop bug in the default difference algorithm implementation.

Values should use common unit suffixes to express their setting:

ms, milliseconds
s, sec, second, seconds
m, min, minute, minutes
h, hr, hour, hours
If a unit suffix is not specified, milliseconds is assumed.

Default is 5 seconds."

[1] https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#cache

Will Saxon

May 19, 2015, 2:03:08 PM
to repo-d...@googlegroups.com, jus...@gluster.org, cor...@gnu.org
Setting this timeout to 2m resolves the reindex problem for us.
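
For reference, the change amounts to a fragment like the following in gerrit.config. The section and key names follow the cache.diff.timeout description quoted earlier in this thread, and 2m is simply the value that worked here, not a general recommendation.

```ini
[cache "diff"]
  # Default is 5 seconds; unit suffixes such as s, m, h are accepted.
  timeout = 2m
```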

I don't think a diff cache timeout should be causing reindex to fail entire packs/projects though, should it?

-Will

Bassem Rabil

May 19, 2015, 2:12:24 PM
to Will Saxon, repo-d...@googlegroups.com, jus...@gluster.org, cor...@gnu.org
This timeout was introduced to avoid getting into an infinite loop in MyersDiff when computing the diff; however, it does not cause the reindex to fail. The message is a warning that the diff is falling back to a simpler algorithm instead of MyersDiff.

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to a topic in the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/repo-discuss/CYYoHfDxCfA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to repo-discuss...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bassem Rabil

May 19, 2015, 2:18:12 PM
to Will Saxon, repo-d...@googlegroups.com, justin, corvus
You can check this discussion [1] for an explanation of some changes in JGit starting with Gerrit 2.10.

Will Saxon

May 19, 2015, 2:52:19 PM
to repo-d...@googlegroups.com, cor...@gnu.org, sax...@gmail.com, jus...@gluster.org
I think it's pretty clear that it is at least strongly correlated with reindex failure. Every time we reindex and encounter a diff timeout, the associated pack and project fail to reindex.

Sven Selberg

May 20, 2015, 9:27:39 AM
to repo-d...@googlegroups.com, sax...@gmail.com, cor...@gnu.org


On Tuesday, May 19, 2015 at 20:52:19 UTC+2, Will Saxon wrote:
> I think it's pretty clear that it is at least strongly correlated with reindex failure. Every time we reindex and encounter a diff timeout, the associated pack and project fail to reindex.


I think Sasa has a very plausible explanation of how [1].

/Sven



1. PatchListLoader times out for "Diff loader".
   (~200 ms later)
2. Project pack file is deemed corrupt and removed.
   (almost instantaneously)
3. PatchListLoader : Error computing PatchListKey.
   Since jgit finds no valid pack file (my assumption), it cannot get refs/meta/config and can't compute access, so it ends up with
4. "Cannot read project"
   until gc.

If you have any idea about what could be the root cause, all help would be much appreciated.

Zivkov wrote:

Sven, I believe that you found the root cause!

Haven't yet looked at the source code but I am quite sure that what happens is the following:
* the diff computation is done in its own thread T1
* the "main" thread waits on the T1 with some predefined timeout
* when the timeout is reached the thread T1 will be canceled
* if, at the time when T1 was canceled, it was performing I/O on a pack file, an exception will be thrown by the JVM
* the default exception handling at this place in JGit will declare the pack file as corrupt and remove it from its list

The problem likely has to be solved in the last two steps, i.e. we have to differentiate between an
exception thrown due to real pack corruption and an exception thrown due to the thread being interrupted.
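
The sequence described above can be reproduced outside Gerrit with a small sketch. Everything here is an assumption made for illustration: the class InterruptedReadSketch, the classify helper and the verdict strings are invented, and an NIO FileChannel stands in for whatever read path JGit actually uses. The point is only that a read cut short by Future.cancel(true) surfaces as an IOException subtype, so a blanket catch of IOException cannot tell it apart from real corruption unless it checks for the interrupt-related types.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InterruptedReadSketch {
    /** Hypothetical classification: interrupt-related failures are not corruption. */
    public static String classify(IOException e) {
        if (e instanceof ClosedByInterruptException
                || e instanceof InterruptedIOException) {
            return "interrupted";
        }
        return "possibly-corrupt";
    }

    /** Reads the file in a loop on thread T1 and cancels T1 after timeoutMs. */
    public static String readUntilCancelled(Path file, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<String> verdict = new CompletableFuture<>();
        CountDownLatch started = new CountDownLatch(1);
        try {
            Future<?> t1 = pool.submit(() -> {
                started.countDown();
                try (FileChannel ch =
                        FileChannel.open(file, StandardOpenOption.READ)) {
                    ByteBuffer buf = ByteBuffer.allocate(512);
                    while (true) { // keep reading until the interrupt arrives
                        buf.clear();
                        ch.read(buf, 0);
                    }
                } catch (IOException e) {
                    verdict.complete(classify(e));
                }
            });
            started.await(); // make sure T1 is running before the clock starts
            try {
                t1.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                t1.cancel(true); // interrupts T1 in the middle of its reads
            }
            return verdict.get(5, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("pack-sketch", ".bin");
        Files.write(file, new byte[4096]);
        System.out.println(readUntilCancelled(file, 200));
        Files.deleteIfExists(file);
    }
}
```

The file on disk is never modified, yet the reader sees an IOException. A handler that treated every IOException as corruption would drop a perfectly good file from its list.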

Matthias Sohn

May 20, 2015, 10:18:42 AM
to Sven Selberg, Repo and Gerrit Discussion, sax...@gmail.com, cor...@gnu.org
When PackFile.idx [1] catches an IOException it marks the corresponding pack invalid. If the exception happens
to be an InterruptedIOException this could be the cause of the problem we are chasing here.


-Matthias

Saša Živkov

May 20, 2015, 11:07:06 AM
to Matthias Sohn, Sven Selberg, Repo and Gerrit Discussion, Will Saxon, cor...@gnu.org

Khai Do

Aug 5, 2015, 6:01:59 PM
to Repo and Gerrit Discussion, matthi...@gmail.com, sven.s...@sonymobile.com, sax...@gmail.com, cor...@gnu.org
The bug [1] for this problem was marked fixed with change 68978 [2]. However, I was able to reproduce the bug even with that change. The bug is now in the open state, so I was wondering whether anyone is working on a fix?


Glenn Chen

Aug 23, 2015, 7:38:56 AM
to Repo and Gerrit Discussion
We are now running Gerrit 2.10.6, but this problem is still there.
Will this be fixed in the next release of 2.10?

Thanks
Glenn

 

Glenn Chen

Aug 25, 2015, 9:55:17 AM
to Repo and Gerrit Discussion
Most of our users are used to the "old screen" UI,
but the "old screen" is removed in 2.11.

We hope Gerrit can release an update for 2.10.x,
so that our users can keep the "old screen".

Thanks a lot
Glenn

On Sunday, August 23, 2015 at 7:38:56 PM UTC+8, Glenn Chen wrote: