Another possible memory leak in fcrepo 4.7.4

Peter Matthew Eichman

Aug 21, 2017, 4:35:43 PM
to fedor...@googlegroups.com
Hello all,

In the process of doing another large load of objects, we have discovered another possible memory leak. When running fcrepo 4.7.4 in Tomcat with a 2GB max heap, we lost the ability to write to the repository after about 4000 objects. (Read operations through HTTP GET continued to succeed, however.)

Heap dumps taken at this point, when memory usage was close to the 2GB ceiling, show approximately 1.2GB taken up by a single JCR session [1]. I am not an expert in reading these, but there appears to be a circular reference in memory: the org.modeshape.jcr.JcrSession (at address 0xac2fb168) holds a number of ConcurrentHashMaps and Nodes, which in turn hold additional references back to the same JcrSession @ 0xac2fb168 [2]. Could this be what is preventing these objects from being garbage collected?

Additionally, my best guess so far, based on the string values I find when drilling down into the ConcurrentHashMap structures, is that this may be happening somewhere in the fcrepo-audit module [3]. The strings that appear to be RDF property local names are all audit/PREMIS related: "hasEventRelatedAgent", "hasEventRelatedObject", and so forth. Has anyone else experienced memory issues like this when using the internal audit module?

For now, we have implemented a workaround of increasing our VM RAM and JVM max heap size. We are also investigating switching to the external triplestore audit module [4], for reasons beyond these possible memory problems. However, if this is indeed a memory leak in the internal audit module or elsewhere in the fcrepo core, we are also interested in helping to fix it properly.

Thanks,
-Peter

[1] see attached dominator_tree.png
[2] see attached list_objects.png

--
Peter Eichman
Senior Software Developer
University of Maryland Libraries

Andrew Woods

Aug 21, 2017, 5:18:12 PM
to fedor...@googlegroups.com
Hello Peter,
Can you describe how to reproduce the memory leak you are observing? 

Are the steps limited to:
1. Deploy Fedora with the Audit module enabled
2. Ingest resources (container? binary?)

I notice that `list_objects.png` includes a reference to the AccessControlManager. Do you have a specific authorization configuration that may also be coming into play?

Thanks,
Andrew


Peter Matthew Eichman

Aug 21, 2017, 5:28:43 PM
to fedor...@googlegroups.com
Hi Andrew,

That description sounds correct. We are ingesting containers only. Specifically, we are GETting binaries that were previously loaded into fcrepo (METS XML files), parsing them in our batchloader extractocr.py client [1], and creating Web Annotation objects.

We are processing each page as a single transaction; each transaction contains roughly 20 web annotation objects. Each web annotation object has an additional 5 hash URI objects that make up its component parts (body, target, selector, etc.).
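For concreteness, the per-page pattern looks roughly like the sketch below. This is only an illustrative outline against the standard Fedora 4 transaction endpoints (fcr:tx, fcr:tx/fcr:commit, fcr:tx/fcr:rollback), not the actual extractocr.py code; the base URL, certificate paths, and the Turtle payload are placeholders.

import requests

# Illustrative sketch only: the base URL, cert paths, and Turtle payload are
# placeholders, not the real extractocr.py configuration.
FCREPO = "https://fcrepo.example.edu/fcrepo/rest"

# A minimal Web Annotation with hash-URI component parts (the real objects
# have five such parts: body, target, selector, etc.).
ANNOTATION_TTL = """
@prefix oa: <http://www.w3.org/ns/oa#> .
<> a oa:Annotation ; oa:hasBody <#body> ; oa:hasTarget <#target> .
<#target> oa:hasSelector <#selector> .
"""

session = requests.Session()
session.cert = ("/path/to/client-cert.pem", "/path/to/client-key.pem")

def load_page(annotations):
    """Create all web annotation objects for one page inside one transaction."""
    # Start a transaction; Fedora 4 returns its URI in the Location header.
    tx_uri = session.post(FCREPO + "/fcr:tx").headers["Location"]
    try:
        for turtle in annotations:  # roughly 20 annotations per page
            resp = session.post(tx_uri, data=turtle,
                                headers={"Content-Type": "text/turtle"})
            resp.raise_for_status()
        # Commit the whole page at once.
        session.post(tx_uri + "/fcr:tx/fcr:commit").raise_for_status()
    except requests.HTTPError:
        # Roll back so a failed page leaves nothing behind.
        session.post(tx_uri + "/fcr:tx/fcr:rollback")
        raise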

The batchloader client is using an SSL client cert for authentication, and that user is configured with the fedoraAdmin role in our Tomcat users.xml.

Thanks,
-Peter

Peter Matthew Eichman

Aug 23, 2017, 12:49:31 PM
to fedor...@googlegroups.com
Hi Andrew,

I have some additional information that may be of use. Since my original post, we have been running our Tomcat with a larger heap (8GB), but we still encountered the transaction hangup problem when the heap was only about 50% full. We also tried increasing the Modeshape event buffer size from 1024 to 2048, but that did not help either. Today I have been running the extractocr process with Solr indexing disabled (feature:stop umd-fcrepo-indexing-solr in Karaf), and so far that has yielded the best results: it has been running for about 3 hours, has processed 1280 objects, and has created at least 10,000 objects.

Also of note: each time Tomcat hung up this week and stopped accepting transaction commits, I had to kill the process explicitly rather than use Tomcat's normal shutdown command; otherwise it reported that it was waiting for threads to complete, but they never did. Could that be caused by the Modeshape event buffer filling up and not being able to empty itself out?

To sum up:
* Probably not a classic memory leak (heap filling up) situation
* Problems with the Modeshape event buffer?
* Getting overloaded due to concurrent Solr indexing requests?

Thanks,
-Peter

herr.s...@googlemail.com

Aug 24, 2017, 7:56:13 AM
to Fedora Tech

Dear all,

I am facing much the same difficulties as Peter. Some months ago I reported an ingest problem here: https://groups.google.com/forum/#!searchin/samvera-tech/ingesting$20problem|sort:relevance/samvera-tech/cTY24Q8rBGc/9vM0MP9AEwAJ. The hints I got were all fine, but they didn't solve the problem.

Now I am ingesting a collection (without any files), and after some three or four thousand objects the process slows down. When I index less data to Solr, or none at all, I see an improvement.

I am asking myself whether the problem might be not too little memory but too much. Maybe the garbage collector (which might struggle with a very large heap) is the culprit, either in Fedora or in Solr or in both ...

I didn't try out all combinations, but for me this behaviour is visible in Fedora 4.7.3 and in the older 4.5.1.

Yours

Oliver

Peter Matthew Eichman

Aug 24, 2017, 1:41:43 PM
to fedor...@googlegroups.com
Hi Oliver,

It's good to know that other people are having similar problems. I've begun to wonder if it has to do with the number of concurrent connections to Tomcat, or something similar. Andrew has suggested that I reach out to the Modeshape folks to see if they might have any more suggestions. Also, in the past we had some similar stuck thread problems with our Fedora 2 instance that were (mostly) addressed by turning off swap for the machine it was running on.

Still in search of a solution,
-Peter


Aaron Birkland

Aug 24, 2017, 4:47:04 PM
to fedor...@googlegroups.com

Hi Peter & Oliver,

It's not entirely clear to me what the triggering factor might be. As far as I understand it, the scenario as described by Peter involves:

  • Ingesting thousands of objects
  • Relatively short-lived transactions containing tens or hundreds of objects
  • Async Solr indexing, which results in many concurrent GET requests to the repository

Out of curiosity, is an external message broker being used, and is Solr subscribed to a queue via this broker? When Solr is disabled, are messages still being placed into a queue for later Solr indexing?

Thanks,

  -Aaron


herr.s...@googlemail.com

Aug 25, 2017, 4:37:29 AM
to Fedora Tech
Hi everybody, especially Aaron and Peter,

Because I am modelling the data with Samvera, it seems I can't turn off Solr entirely (and it wouldn't make sense for us anyway).

So I will keep searching for a solution as well, and I will let you know if I find anything interesting.

Kind regards from

Oliver

--
Dr. Oliver Schöner
Administrator für virtuelle Fachbibliotheken, Abt. IDM 2.3
Staatsbibliothek zu Berlin – Preußischer Kulturbesitz

Tel.:       +49 30 266-43 22 33

Peter Matthew Eichman

Aug 25, 2017, 11:06:33 AM
to fedor...@googlegroups.com
Hi Aaron,

Yes, that is an accurate description of our scenario.

We are using an external ActiveMQ instance, and yes, the messages are getting stored in the queue until we re-enable Solr indexing. We are controlling the indexing using Camel routes [1][2][3] in Karaf. Thus we actually have independent control over Solr indexing and Fuseki indexing (we've been leaving the Fuseki indexing on, and it hasn't appeared to cause a problem).

And once we re-enable the Solr indexing, the indexing proceeds without any problems.

-Peter

Peter Matthew Eichman

Aug 29, 2017, 2:49:11 PM
to fedor...@googlegroups.com
We are now encountering a further complication. We have started experiencing stuck Tomcat threads that appear to be due to locking code in org.modeshape.common.collection.ring.RingBuffer. These don't appear to be connected to JVM memory usage, and they require an explicit kill of Tomcat followed by a restart.

Partial stack trace from this problem:

WARNING: Thread "http-bio-9601-exec-1" (id=49) has been active for 33,364 milliseconds (since 8/28/17 4:15 PM) to serve the same request for https://fcrepo.lib.umd.edu/fcrepo/rest/tx:e8c89983-3eb1-49fd-a762-74328b4f2065/fcr:tx/fcr:commit and may be stuck (configured threshold for this StuckThreadDetectionValve is 30 seconds). There is/are 1 thread(s) in total that are monitored by this Valve and may be stuck.
java.lang.Throwable
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.modeshape.common.collection.ring.RingBuffer.add(RingBuffer.java:153)
        at org.modeshape.jcr.bus.RepositoryChangeBus.notify(RepositoryChangeBus.java:180)
        at org.modeshape.jcr.cache.document.WorkspaceCache.changed(WorkspaceCache.java:320)
        at org.modeshape.jcr.txn.Transactions.updateCache(Transactions.java:296)
        at org.modeshape.jcr.cache.document.WritableSessionCache.save(WritableSessionCache.java:748)
        at org.modeshape.jcr.JcrSession.save(JcrSession.java:1179)
        at org.fcrepo.kernel.modeshape.FedoraSessionImpl.commit(FedoraSessionImpl.java:95)
        at org.fcrepo.kernel.modeshape.services.BatchServiceImpl.commit(BatchServiceImpl.java:115)
        at org.fcrepo.http.api.FedoraTransactions.finalizeTransaction(FedoraTransactions.java:133)
        at org.fcrepo.http.api.FedoraTransactions.commit(FedoraTransactions.java:101)

Peter Matthew Eichman

Aug 29, 2017, 2:49:43 PM
to fedor...@googlegroups.com
I should add that this happens even with Solr indexing disabled, and relatively frequently (it happened twice today).

Aaron Birkland

Aug 29, 2017, 2:51:51 PM
to fedor...@googlegroups.com

Hm, so this is happening with just:

  • Ingesting thousands of objects
  • Relatively short-lived transactions containing tens or hundreds of objects

If Solr is disabled, is there anything else doing reads? Fuseki indexing?


Peter Matthew Eichman

Aug 29, 2017, 3:02:51 PM
to fedor...@googlegroups.com
Fuseki indexing is still happening. Our next step will be to disable both indexing processes.

Peter Matthew Eichman

Aug 30, 2017, 1:43:42 PM
to fedor...@googlegroups.com
Aaron,

We just tried it with all indexing disabled this morning, but it still only got through 10 pages before hanging up.

-Peter

Benjamin J. Armintor

Aug 30, 2017, 3:34:06 PM
to fedor...@googlegroups.com
Peter:

Am I right that this is the issue Mohamed Rasheed raised in https://developer.jboss.org/thread/275431?_sscc=t ? If so, maybe Oliver (and others) voting for the issue over there would help?

- Ben

Benjamin J. Armintor

Aug 30, 2017, 3:54:47 PM
to fedor...@googlegroups.com
Peter:

I blame search relevance, but apologies: I had not yet seen that you raised https://developer.jboss.org/thread/275858 - hopefully Oliver can upvote that, too?

- Ben

Peter Matthew Eichman

Aug 31, 2017, 9:15:29 AM
to fedor...@googlegroups.com
Ben,

Yes, my post and Mohamed's are about the same issue. Any traction we can get in the Modeshape community for help looking into this would be appreciated.

-Peter

Peter Matthew Eichman

Aug 31, 2017, 2:59:06 PM
to fedor...@googlegroups.com
Hello all,

Thanks to everyone on the tech call for their suggestions of things to try. We tried adding the interruptThreadThreshold attribute to the stuck thread detection valve, but while it was clearly running, it wasn't able to solve the problem (or even stop the stuck threads, for that matter).

The next thing we are going to try is to bump Modeshape to 5.4. I found a bugfix ticket [1] in the release notes [2] that sounds a lot like our issue, so I am hoping that an upgrade is all that is needed to solve this.

-Peter

Peter Matthew Eichman

Sep 5, 2017, 1:53:52 PM
to fedor...@googlegroups.com
Hello all,

We believe we have narrowed the problem down to the Modeshape 5.3 bug with user-initiated transactions [1]. By adapting our batchloader not to retry previously failed transactions each time we ran it, we were able to successfully complete our load of OCR annotations (~200,000 objects) over the weekend, with no stuck Tomcat threads due to hanging transactions.
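In case it is useful to anyone else, the adaptation was essentially client-side bookkeeping: record the pages whose transactions failed and skip them on later runs instead of retrying them. A rough sketch of that idea follows; the log file name and the load_page() helper are illustrative placeholders, not our actual batchloader code.

import json
from pathlib import Path

# Illustrative only: the log file name and load_page() helper are placeholders,
# not the actual batchloader implementation.
FAILED_LOG = Path("failed-pages.json")

def read_failed():
    return set(json.loads(FAILED_LOG.read_text())) if FAILED_LOG.exists() else set()

def write_failed(failed):
    FAILED_LOG.write_text(json.dumps(sorted(failed)))

def run(pages):
    """pages: an iterable of (page_id, annotations) pairs."""
    failed = read_failed()
    for page_id, annotations in pages:
        if page_id in failed:
            # Skip a transaction that previously failed and was rolled back;
            # retrying those was what coincided with the hung commits for us.
            continue
        try:
            load_page(annotations)  # one transaction per page, as sketched earlier
        except Exception:
            failed.add(page_id)
            write_failed(failed)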

I am currently working on creating a test case that isolates this hanging behavior using rolled back transactions. Once I get that complete, I will be happy to contribute it, especially as part of an upgrade to Modeshape 5.4 on the 4.7-maintenance branch.

Thanks,
-Peter

Aaron Birkland

Sep 5, 2017, 1:57:12 PM
to fedor...@googlegroups.com

Hi Peter,

That's great. I wonder if I was unable to reproduce it because I wasn't intentionally rolling back transactions?

In any case, does it seem like 5.4 fixes it when you re-enable rollbacks?


Peter Matthew Eichman

Sep 5, 2017, 4:03:51 PM
to fedor...@googlegroups.com
Hi Aaron,

We haven't gotten to testing it on 5.4 yet; I've been trying to get the test case written so that it fails on 5.3. No luck so far. My guess is that this is because one of the triggering conditions in Modeshape is the rollback being called from a different thread than the one where the transaction was created, and I'm not sure how to force that in my test case.

-Peter

Peter Matthew Eichman

Sep 8, 2017, 2:24:41 PM
to fedor...@googlegroups.com
Hello all,

One more status update. Since I have been unable to construct a reliable test case to demonstrate the stuck thread issue, we are putting that on hold. Instead, I have created PRs [1] [2] for upgrading to Modeshape 5.4.0.Final [3], as (at least at first blush) that would appear to fix the problem.

Thanks,
-Peter

Andrew Woods

Sep 10, 2017, 8:04:57 AM
to fedor...@googlegroups.com
Hello Peter,
I am glad we have a path forward on this issue.

We will get your pull requests into the codebase after verifying that there are no unexpected impacts from bumping the ModeShape version.

Thanks again,
Andrew