We recently upgrade from Gerrit 3.8.x to Gerrit 3.9.9 and had to wipe our large H2 caches because online indexing made them so large that they were taking way too long to build bloomfilters (all while blocking everything) on a large 135G cache. This blocking is for sure an issue, and something that I know work is being done to avoid, so I don't want to hijack this thread to discuss that. However, since we had to wipe some caches, it slowed things down in other ways (as can be expected), and I noticed what I believe is a fairly egregious concurrency issue that we probably should address.
The concurrency issue I am concerned about is that somewhere in the ChangeKindCacheImpl.getChangeKind() stack, possibly due to either a misuse of the caching API, or an issue in Caffeine itself. If the caches are not well populated, we will see many threads blocking on the ConcurrentHashMap.compute() method (see #1 below), while the thread holding the lock is also calling ConcurrentHashMap.compute() and is doing a jgit ThreeWayMerger.merge() call (see #2 below) which is obviously a slow operation which is potentially doing IO (accessing pack files). This is a problem because the ConcurrentHashMap.compute() was designed for short operations as it blocks other threads (as is clearly seen by my stack trace with 16 threads waiting on the same lock) and as mentioned in the javadocs for ConcurrentHashMap.compute():
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html#compute-K-java.util.function.BiFunction-"The entire method invocation is performed atomically. Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple..."
If the compute() operation is doing IO, I would say this no longer qualifies as short and simple. I hope I/we can fix this, but I am not quite sure yet why the misuse happens, i.e. which calling layer is violating a guideline, is it the H2CacheImpl.get() call, the CaffeinatedGuavaCache.get() call? It seems self evident to my that a caching layer must assume that computing a value to store in a cache (but not necessarily a hashMap) will never be "sort and simple", or else a cache would not be needed in the first place. Therefore, a caching API should make sure that computing a value for a key does not block other operations from other threads on its cache while computing this value (and thus likely should not be using ConcurrentHashMap.compute() for this computation?
I will keep looking into this, but in the meantime, maybe someone will tell me that this has been fixed in a later version of Caffeine, or that we are misusing the Caffeine APIs, or that I should post this on a Caffeine mailing list, or that I am simply way off-base and point me in the right direction...? If anyone else can back up my story and say that they have also seen this as an issue, it would be helpful to the conversation.
Thanks,
-Martin
* 1 Blocked thread example (16 almost identical example of these in my stack trace)
"HTTP GET /r/a/changes/?q=change:XXXX&o=ALL_REVISIONS&o=CURRENT_COMMIT&o=DETAILED_ACCOUNTS&o=LABELS ... waiting for monitor entry [0x00007f89a73f0000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.concurrent.ConcurrentHashMap.compute(java...@17.0.9/ConcurrentHashMap.java:1931)
- waiting to lock <0x00004ffe0b17a0d8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2411)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
at com.github.benmanes.caffeine.guava.CaffeinatedGuavaCache.get(CaffeinatedGuavaCache.java:67)
at com.google.gerrit.server.cache.h2.H2CacheImpl.get(H2CacheImpl.java:164)
at com.google.gerrit.server.change.ChangeKindCacheImpl.getChangeKind(ChangeKindCacheImpl.java:347)
at com.google.gerrit.server.change.ChangeKindCacheImpl.getChangeKindInternal(ChangeKindCacheImpl.java:395)
at com.google.gerrit.server.change.ChangeKindCacheImpl.getChangeKind(ChangeKindCacheImpl.java:365)
at com.google.gerrit.server.change.RevisionJson.toRevisionInfo(RevisionJson.java:341)
* 2 Thread doing a complex IO operation while inside compute() holding up all the other threads:
java.lang.Thread.State: BLOCKED (on object monitor)
at org.eclipse.jgit.internal.storage.file.Pack.idx(Pack.java:163)
- waiting to lock <0x0000401c5e3b93c8> (a org.eclipse.jgit.internal.storage.file.Pack)
at org.eclipse.jgit.internal.storage.file.Pack.get(Pack.java:283)
at org.eclipse.jgit.internal.storage.file.PackDirectory.open(PackDirectory.java:227)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openPackedObject(ObjectDirectory.java:412)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openPackedFromSelfOrAlternate(ObjectDirectory.java:375)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObjectWithoutRestoring(ObjectDirectory.java:365)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObject(ObjectDirectory.java:350)
at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:133)
at com.google.gerrit.server.git.InMemoryInserter$Reader.open(InMemoryInserter.java:128)
at org.eclipse.jgit.lib.ObjectReader$Filter.open(ObjectReader.java:636)
at org.eclipse.jgit.treewalk.CanonicalTreeParser.reset(CanonicalTreeParser.java:191)
at org.eclipse.jgit.treewalk.CanonicalTreeParser.createSubtreeIterator0(CanonicalTreeParser.java:225)
at org.eclipse.jgit.treewalk.CanonicalTreeParser.createSubtreeIterator(CanonicalTreeParser.java:203)
at org.eclipse.jgit.treewalk.CanonicalTreeParser.createSubtreeIterator(CanonicalTreeParser.java:39)
at org.eclipse.jgit.treewalk.TreeWalk.enterSubtree(TreeWalk.java:1371)
at org.eclipse.jgit.merge.ResolveMerger.mergeTreeWalk(ResolveMerger.java:1985)
at org.eclipse.jgit.merge.ResolveMerger.mergeTrees(ResolveMerger.java:1922)
at org.eclipse.jgit.merge.ResolveMerger.mergeImpl(ResolveMerger.java:928)
at org.eclipse.jgit.merge.Merger.merge(Merger.java:233)
at org.eclipse.jgit.merge.Merger.merge(Merger.java:186)
at org.eclipse.jgit.merge.ThreeWayMerger.merge(ThreeWayMerger.java:95)
at com.google.gerrit.server.change.ChangeKindCacheImpl$Loader.call(ChangeKindCacheImpl.java:243)
at com.google.gerrit.server.change.ChangeKindCacheImpl$Loader.call(ChangeKindCacheImpl.java:167)
at com.google.gerrit.server.cache.h2.H2CacheImpl.lambda$get$1(H2CacheImpl.java:175)
at com.google.gerrit.server.cache.h2.H2CacheImpl$$Lambda$1997/0x00007f8ca8d8c458.call(Unknown Source)
at com.github.benmanes.caffeine.guava.CaffeinatedGuavaCache.lambda$get$0(CaffeinatedGuavaCache.java:69)
at com.github.benmanes.caffeine.guava.CaffeinatedGuavaCache$$Lambda$1561/0x00007f8ca8cb1e38.apply(Unknown Source)
at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
at com.github.benmanes.caffeine.cache.LocalCache$$Lambda$1051/0x00007f8ca8907248.apply(Unknown Source)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2413)
at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$1052/0x00007f8ca8907c70.apply(Unknown Source)
at java.util.concurrent.ConcurrentHashMap.compute(java...@17.0.9/ConcurrentHashMap.java:1916)
- locked <0x00004ffe0b17a0d8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2411)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
at com.github.benmanes.caffeine.guava.CaffeinatedGuavaCache.get(CaffeinatedGuavaCache.java:67)
at com.google.gerrit.server.cache.h2.H2CacheImpl.get(H2CacheImpl.java:164)
at com.google.gerrit.server.change.ChangeKindCacheImpl.getChangeKind(ChangeKindCacheImpl.java:347)
at com.google.gerrit.server.change.ChangeKindCacheImpl.getChangeKindInternal(ChangeKindCacheImpl.java:395)
at com.google.gerrit.server.change.ChangeKindCacheImpl.getChangeKind(ChangeKindCacheImpl.java:365)
at com.google.gerrit.server.change.RevisionJson.toRevisionInfo(RevisionJson.java:341)
at com.google.gerrit.server.change.RevisionJson.getRevisions(RevisionJson.java:277)