Tomcat memory error corrupts Fedora 4 data store

Jim Coble

Mar 24, 2016, 9:57:20 AM

to fedor...@googlegroups.com, hydra...@googlegroups.com

Posted to fedora-tech and hydra-tech

Fedora 4.4.0, Tomcat 7.0.54, Java 1.8.0_71, RHEL 7.2

During a trial migration of objects from our Fedora 3 repository into a Fedora 4 repository, Tomcat experienced an OutOfMemoryError (“GC overhead limit exceeded”). This appears to have resulted in at least one corrupted object in the Fedora 4 data store. When we try to access the object, we get the following error from Infinispan …

org.infinispan.persistence.spi.PersistenceException: java.io.StreamCorruptedException: Unexpected byte found when reading an object: 50
at org.infinispan.marshall.core.MarshalledEntryImpl.unmarshall(MarshalledEntryImpl.java:116)
at org.infinispan.marshall.core.MarshalledEntryImpl.getValue(MarshalledEntryImpl.java:61)
at org.infinispan.interceptors.CacheLoaderInterceptor.loadIfNeeded(CacheLoaderInterceptor.java:270)
at org.infinispan.interceptors.CacheLoaderInterceptor.loadIfNeededAndUpdateStats(CacheLoaderInterceptor.java:335)
at …

This is actually the second time this has happened to us during a trial migration. (For the first time, see here: https://groups.google.com/forum/#!msg/fedora-tech/6cfSLCTl0q8/OAASoBcmBQAJ .) As suggested by Andrew the first time we encountered the problem, we ran the verify_leveldb.py script found at https://wiki.duraspace.org/display/FEDORA4x/Backup+and+Restore#BackupandRestore-LevelDBBackup but, as near as we can determine, it did not find any problem with the LevelDB database.

It is not at all clear to us how to recover the corrupted object. We also could not delete the object. Attempts to send a DELETE command to the object using curl resulted in the same error message noted above.

The fact that the problem occurred in the first place raises concerns for us about the fault tolerance of the Fedora 4 architecture, if a Tomcat memory error can corrupt the data store. Add to this that we know of no clear way to recover or even delete the corrupted object, and it becomes a blocker for us in migrating from Fedora 3 to Fedora 4.

Has anyone else experienced this type of error? Does anyone have any idea how we can recover — or at least delete — the corrupted object?

Thanks.

—Jim

--------------------------------------------

Jim Coble

Digital Repository Services

Duke University Libraries

jim....@duke.edu

Benjamin Armintor

Mar 24, 2016, 10:31:11 AM

to fedor...@googlegroups.com

Jim-

I'm still not clear on the environment or the particulars. Was the Tomcat that threw the OutOfMemoryError the Fedora 3 Tomcat or the Fedora 4 Tomcat? What tool were you using to perform the migration? Do you have complete stacktraces of both the OutOfMemoryError and the PersistenceException?

Regards,

Ben

--
You received this message because you are subscribed to the Google Groups "Hydra-Tech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hydra-tech+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jim Coble

Mar 24, 2016, 11:04:25 AM

to Fedora Tech

Ben--

Memory error was thrown by Tomcat running Fedora 4 and a gist of the error can be found at https://gist.github.com/coblej/03a216d4df41f725949e .

And a gist of the infinispan error can be found at https://gist.github.com/coblej/16bc1fdc9be2613ddd78 .

Thanks for any help you can provide.

--Jim

Benjamin Armintor

Mar 24, 2016, 11:11:36 AM

to fedor...@googlegroups.com

Jim--

Thanks! What was the migration tool?

--Ben

Jim Coble

Mar 24, 2016, 11:27:13 AM

to Fedora Tech

Ben--

Sorry, forgot you asked about that. We are using a somewhat customized version of the FedoraMigrate gem ( https://github.com/projecthydra-labs/fedora-migrate ). The problem occurred during the processing of a set of Resque jobs using this job class https://github.com/duke-libraries/dul-hydra/blob/develop/lib/dul_hydra/migration/migrate_single_object_relationships_job.rb .

--Jim

Andrew Woods

Mar 25, 2016, 4:31:06 PM

to fedor...@googlegroups.com

Hello Jim,

Thanks for raising the issue.

Based on the information you have provided, my suspicion is that your garbage collection JVM options need to be tuned [1]. As a result of the application failure due to garbage collection out-of-memory and a hard-stop of Tomcat7, it appears that your LevelDB was left in an inconsistent state. As a result of your previous email list posting [2] with a similar error, Fedora now offers the ability to configure [3] either a MySQL or PostgreSQL database for F4 objects as an alternative to LevelDB.

That said, I would be interested in understanding the exact state of the surrounding application environment with an eye towards determining the circumstances that caused the issue, how a correct state can be restored and suggestions for avoiding it in the future. With this in mind, below are several categories of questions:

1) Application environment
2) Migration process and data

3) Changes since previous related thread

4) Recovery efforts

#1. To begin with, could you please provide:
a) the version of F4 you were using
b) the JAVA_OPTS that were set for your F4 Tomcat7 process
c) the version of Java that you are using, and

d) the amount of RAM on your F4 machine

#2. It sounds like you were using a modified version of the fedora-migrate [4] gem. Could you please describe the topology of your migration setup, i.e.
a) Were the F3 and F4 applications on the same or different servers?

b) On which machine was the fedora-migrate client running?

Also,

c) What was the nature of the resources you were migrating? All objects? Mix of objects and datastreams?

d) What kind of resource failed? Object? Datastream?

e) How many resources migrated before the failure?

f) Was this number roughly the same as the last failure you posted to this list?

#3. Could you please,
a) describe the changes you made to your migration process and configuration since the last time you posted a similar error to this list?

b) Did you explore the use of F4 transactions as suggested in the previous thread?

#4. As noted last time, there are scripts for inspecting [5], detecting, and recovering [6] from LevelDB issues.

a) If you have not already done so, could you please run the documented scripts and respond with the output of those executions?

b) Are you in a position to tar the LevelDB directory (fcrepo.home/fcrepo.ispn.repo.cache), making it available for myself or others to inspect?

It would be very helpful to the community if you could help debug the situation by providing responses to the above questions. In the meantime, I would suggest inspecting and tuning your garbage-collection configuration [1] while also switching in either MySQL or PostgreSQL for the default LevelDB.

Please respond back indicating where further assistance would be helpful.

Thanks,

Andrew
[1] https://jira.duraspace.org/browse/FCREPO-1294?focusedCommentId=43880&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-43880
[2] https://groups.google.com/forum/#!msg/fedora-tech/6cfSLCTl0q8/OAASoBcmBQAJ
[3] https://wiki.duraspace.org/display/FEDORA4x/Configuring+JDBC+Object+Store
[4] https://github.com/projecthydra-labs/fedora-migrate
[5] https://wiki.duraspace.org/display/FEDORA4x/How+to+inspect+LevelDB
[6] https://wiki.duraspace.org/display/FEDORA4x/Backup+and+Restore#BackupandRestore-LevelDBBackup

Jim Coble

Mar 25, 2016, 4:58:26 PM

to Fedora Tech

Andrew--

Thanks for your reply. I will pull together the information you requested and post it early next week.

--Jim

Jim Coble

Mar 28, 2016, 10:56:05 AM

to fedor...@googlegroups.com

Andrew--

Below, you will find answers to the questions you asked.

I think you are correct that tuning the garbage collection parameters on our Tomcat JVM could perhaps reduce the likelihood that the Tomcat error will recur. As you will see below, the max heap space was set to 2GB when we did the trial migration. We have 16GB of RAM on the Tomcat server so we are already planning to increase -Xmx to 8g. Any other suggestions you or others might have for tuning the Tomcat JVM JAVA_OPTS would be appreciated.

With respect to LevelDB, we are interested in switching to MySQL and have already been planning to do so once MySQL support is available in a released version of Fedora 4. Do you have an estimate on when that might be?

Finally, our main concern at this point is how to recover from the corrupted object. We have not been able to figure out how to access any information about it (e.g., we don't know how to find the LevelDB entries that pertain to it) nor are we able to delete it. Ideally, we would like to be able to recover as much information as we can about the object. Barring that, we would at least like to be able to delete it from the Fedora 4 repository.

Thanks for any help you or others in the community can give us on this.

--Jim

#1. To begin with, could you please provide:
a) the version of F4 you were using

>>> Release: 4.4.0 | Build #f6d33b32 (2015-10-12)

b) the JAVA_OPTS that were set for your F4 Tomcat7 process

>>> We basically copied the JAVA_OPTS from the F4 documentation, though I think we did increase the heap space a bit.

JAVA_OPTS="-Dfcrepo.home=/srv/perkins/fcrepo \
-Dsolr.solr.home=/srv/perkins/solr/solr \
-Dfcrepo.audit.container=/audit \
-Djava.awt.headless=true \
-Dfile.encoding=UTF-8 \
-server \
-Xms1g \
-Xmx2g \
-XX:NewSize=256m \
-XX:MaxNewSize=256m \
-XX:MetaspaceSize=256m \
-XX:MaxMetaspaceSize=256m \
-XX:+DisableExplicitGC"

c) the version of Java that you are using, and

>>> java version "1.8.0_71"

d) the amount of RAM on your F4 machine

>>> 16GB

#2. It sounds like you were using a modified version of the fedora-migrate [4] gem. Could you please describe the topology of your migration setup, i.e.
a) Were the F3 and F4 applications on the same or different servers?

>>> Different servers (VMs). There are three servers relevant to the migration process:

-- Server running F3

-- Server running ActiveFedora9 Hydra application code (including the fedora-migrate code)

-- Server running F4

b) On which machine was the fedora-migrate client running?

>>> The middle server listed above. This server runs our ActiveFedora9 Hydra staff application code, which includes the fedora-migrate gem and our custom migration code. It communicates with the source F3 server and the target F4 server via Tomcat SSL port 8443.

Also,

c) What was the nature of the resources you were migrating? All objects? Mix of objects and datastreams?

>>> We were migrating all objects in our pre-production F3 repository (11,541 objects). The migration occurs in two passes. The first pass migrates the objects and all their datastreams except for the RELS-EXT (and DC) datastream. The second pass migrates the RELS-EXT datastream for each object (by setting relationship attributes on the previously migrated objects). For some objects (those containing structural metadata), there is a third pass that updates the structural metadata with the F4 IDs.

d) What kind of resource failed? Object? Datastream?

>>> The failure occurred on the second pass, where the migration code was attempting to migrate the relationships (RELS-EXT datastream) for an object that, as far as we know, was otherwise successfully migrated in the first pass.

e) How many resources migrated before the failure?

>>> When the failure occurred, the first pass had been successfully completed for all 11,541 objects and we were about 8,000 objects into the second pass (relationship migration). There was a pause of a few hours (not exactly sure how many) between the end of the first pass and the beginning of the second pass, though Tomcat was not restarted during that interval.

f) Was this number roughly the same as the last failure you posted to this list?

>>> I honestly don't remember exactly where the error occurred in the previous failure or even whether it was during the first pass or the second pass.

#3. Could you please,
a) describe the changes you made to your migration process and configuration since the last time you posted a similar error to this list?

>>> Our migration process uses Resque to queue up a migration job for each object for each of the migration passes. On the previous failure, we had assigned 5 Resque workers to the migration queue, which meant that as many as 5 of these migration jobs were being executed in parallel. Following the first failure, we wiped the F4 repository and re-ran the migration with 1 Resque worker -- that migration run did not fail and ran to successful completion of all passes. Then, for this most recent trial migration, we used 2 Resque workers and it was the failure of that run that prompted this thread.

b) Did you explore the use of F4 transactions as suggested in the previous thread?

>>> We did not. ActiveFedora, on which the FedoraMigrate gem and our application code rely, does not currently support F4 transactions, so we don't consider the use of F4 transactions a truly viable option for us unless/until ActiveFedora provides support for them.

#4. As noted last time, there are scripts for inspecting [5], detecting, and recovering [6] from LevelDB issues.

a) If you have not already done so, could you please run the documented scripts and respond with the output of those executions?

>>>

(python)tomcat@lib-repostore-pre-01 /srv/perkins $ bin/verify_leveldb.py fcrepo/fcrepo.ispn.repo.cache/dataFedoraRepository/

bin/verify_leveldb.py : Inspecting db: fcrepo/fcrepo.ispn.repo.cache/dataFedoraRepository/

bin/verify_leveldb.py : Backup verification successful!

bin/verify_leveldb.py : Total records: 658974

bin/verify_leveldb.py : Time taken: 0:00:07.317280

(python)tomcat@lib-repostore-pre-01 /srv/perkins $ bin/repair_leveldb.py fcrepo/fcrepo.ispn.repo.cache/dataFedoraRepository/

mv: cannot stat ‘fcrepo/fcrepo.ispn.repo.cache/dataFedoraRepository//*.ldb’: No such file or directory

b) Are you in a position to tar the LevelDB directory (fcrepo.home/fcrepo.ispn.repo.cache), making it available for myself or others to inspect?

>>> I believe you should be able to access the compressed directory here: https://duke.box.com/fcrepo-ispn-repo-cache-tar-gz . I set that link to expire at the end of April 2016. While I don't believe there is anything particularly sensitive in there, it is a subset of actual data from our repository. If it helps in analyzing the LevelDB database, the F4 ID for the object with the error is 51/6c/bf/96/516cbf96-5449-4bdc-abe7-a4c5bb628e30 .
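
The object ID quoted above looks like a UUID prefixed with four two-character segments cut from its first eight hex digits. The following is a minimal Python sketch of that layout, inferred only from this single ID. The function name and the `depth` and `width` parameters are hypothetical and are not part of any Fedora API; the sketch may simply help when searching the LevelDB keys for the object.

```python
import uuid


def pairtree_path(object_uuid: str, depth: int = 4, width: int = 2) -> str:
    """Prefix a UUID with short segments cut from its leading hex digits.

    With depth=4 and width=2 this reproduces the shape of the ID quoted in
    the thread; whether Fedora always uses these values is an assumption.
    """
    # Validate and normalise the input so malformed IDs fail early.
    canonical = str(uuid.UUID(object_uuid))
    hex_digits = canonical.replace("-", "")
    # Take `depth` consecutive chunks of `width` characters from the front.
    segments = [hex_digits[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(segments + [canonical])


if __name__ == "__main__":
    print(pairtree_path("516cbf96-5449-4bdc-abe7-a4c5bb628e30"))
```

For the UUID in this thread the function returns the same string Jim quoted, which is the only evidence for the layout.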

David Chandek-Stark

Mar 30, 2016, 12:27:23 PM

to Fedora Tech, hydra...@googlegroups.com

Another oddity in the current state of our repository: while requesting text/html at /fcrepo/rest/prod (root container) blows up with the infinispan exception, a request for text/turtle returns a prematurely terminated chunked response and no error (neither in the Tomcat error log nor in the access log). That is, with `curl --raw` I can see the chunked encoding termination sequence at the end of obviously truncated content. The corrupted object URL is not included in the partial response. This appears to mean that we can't discover all the objects in the repository using the REST API. Earlier troubleshooting to check whether there were additional corrupt objects relied on our Solr index to iterate over known object URIs.
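
The Solr-driven check mentioned above can be sketched roughly as follows: take a list of known object URIs, probe each one, and collect those that answer with a server error. This is only an illustration. The function names are hypothetical, the URIs are assumed to come from some external index, and whether a HEAD request is enough to trigger the Infinispan error (rather than a full GET) is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(uri: str, timeout: float = 30.0) -> int:
    """Return the HTTP status of a HEAD request, including error statuses."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; keep only the code.
        return err.code


def find_failing(uris: Iterable[str],
                 fetch_status: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Map every URI that answers with a 5xx status to that status."""
    failing = {}
    for uri in uris:
        status = fetch_status(uri)
        if status >= 500:
            failing[uri] = status
    return failing
```

Passing the status lookup in as a parameter keeps the loop testable without a running repository; in real use the default `http_status` would be called once per URI.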

--David

Andrew Woods

Apr 4, 2016, 1:45:59 PM

to fedor...@googlegroups.com

Hello Jim,
I have run the "verify_leveldb.py" [1] script over the leveldb files you provided with no indication of a corrupted state. I have not yet attempted to perform a visual inspection of the database, but will keep you posted. Also, if you have the entire fcrepo.home directory, a tar of that would be helpful for debugging. I am unable to start up F4 with the provided leveldb files alone.

We will continue to investigate how to get out of a corrupted state (in the absence of a system backup). In the meantime, you appear to have an excellent, reproducible environment for creating a corrupted repository.

Would you be able to run your same ingest against the 4.5.1-RC [2], with MySQL configured [3]? It would be helpful to see if that configuration behaves differently.

Also, in addition to increasing your heap space, you may want to try adding "-XX:+UseG1GC", see IRC discussion [4]. To help in analyzing your Java garbage collection patterns, you will want to retain the garbage collection log with JAVA_OPTS like the following:
----
JAVA_OPTS="${JAVA_OPTS} -Xloggc:/tmp/java-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
----

You can then graphically inspect the garbage collection behavior with GCViewer [5].

[1] https://wiki.duraspace.org/display/FEDORA4x/Backup+and+Restore#BackupandRestore-BackupStrategies

[2] https://github.com/fcrepo4/fcrepo4/releases/tag/fcrepo-4.5.1-RC-1

[3] https://wiki.duraspace.org/display/FEDORA4x/Configuring+JDBC+Object+Store

[4] http://irclogs.fcrepo.org/2016-03-31.html

[5] https://github.com/chewiebug/GCViewer

Jim Coble

Apr 5, 2016, 11:47:13 AM

to Fedora Tech

Andrew--

Thanks for your reply.

I am working on getting approval to send you the entire fcrepo.home directory. I will let you know if/when I am able to do that.

We deployed Fedora 4.5.1-RC1 in our development server environment late last week and configured it to use MySQL. We ran a trial migration in that environment starting on Friday morning. That migration run was done using the Tomcat JVM heap space set to 8GB and 10 migration workers. We did get about half a dozen job failures on the relationship migration pass but, when I retried the failed jobs, they ran successfully. So, not a flawless run but, as near as we can tell, it did not result in any enduring data store corruption.

We will try a migration run using the "-XX:+UseG1GC" option on the Tomcat JVM as you suggest (as well as the garbage collection logging options you suggest). When using the "G1GC" option, should we make any other changes to the JAVA_OPTS recommended here? https://wiki.duraspace.org/display/FEDORA44/Java+HotSpot+VM+Options+recommendations .

At this point, I think our number one concern is how to remove the corrupted object from the repository (or otherwise recover from the corrupted state).

Thanks for your help on this.

--Jim

Diana Cooper

May 18, 2016, 11:56:03 AM

to Fedora Tech

Hello,

I am also having a similar problem. It started last week after there was an out of memory condition with tomcat running fedora. Now, accessing certain content results in the StreamCorruptedException error. The exception stack trace matches Jim's exactly.

I ran the repair script as Jim described, which, unlike Jim's case, resulted in no errors; no output at all.

After running the repair script, the same content results in the corruption exception.

I'm running Fedora 4.1.1 and Java 1.7.0_91 on a machine with 16GB RAM.

tomcat command line:

/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java \
-Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties \
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager \
-Xms4G \
-Xmx8G \
-server \
-XX:+UseParallelGC \
-Djava.endorsed.dirs=/usr/local/tomcat/endorsed \
-classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar \
-Dcatalina.base=/usr/local/tomcat \
-Dcatalina.home=/usr/local/tomcat \
-Djava.io.tmpdir=/usr/local/tomcat/temp \
org.apache.catalina.startup.Bootstrap start

If we were to use mysql instead of leveldb, would there be a way to migrate from one to the other?

My priority is also to recover from the corrupted state, either by removing the corrupted content or repairing it.

Thanks,

Diana

Jim Coble

May 18, 2016, 1:17:39 PM

to Fedora Tech

Diana--

In our case, we did not find a way to recover from the corrupted state but, since it occurred during a trial migration, we had the luxury of deleting the corrupted repository and starting over. Although this won't help with a recovery effort, here is what we are doing now, in case it is of any use to you going forward.

We are now using MySQL as Infinispan's data store. I don't know if it's any less prone than LevelDB to the kind of corruption problem we encountered should a similar Tomcat memory error occur but we are certainly more familiar and comfortable with it.

We also made some changes to how we run Tomcat, some of which you already do. Like you, we now have a maximum of 8 GB of heap space on a 16 GB VM. We also switched the garbage collector to Garbage First (-XX:+UseG1GC). With this configuration, we have not experienced any further Tomcat memory errors while running a number of additional trial migrations with up to 10 migration workers at a time (using Resque). We are scheduled to begin migrating "for real" next Monday so I hope our luck holds out.
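
As a rough illustration of the settings discussed in this thread, the sketch below joins the 8 GB heap limit, the G1 collector flag, and the GC-logging flags Andrew suggested into one options string. The helper name is hypothetical, the log path is the one from Andrew's example, and the logging flags are the Java 8 forms quoted in the thread (later JDKs replaced them), so treat this as a starting point rather than a recommended configuration.

```python
def build_java_opts(max_heap_gb: int, gc_log_path: str) -> str:
    """Join the heap, collector, and GC-logging flags quoted in the thread."""
    flags = [
        f"-Xmx{max_heap_gb}g",      # maximum heap, e.g. 8 GB on a 16 GB VM
        "-XX:+UseG1GC",             # Garbage First collector
        f"-Xloggc:{gc_log_path}",   # write the GC log to this file (Java 8 form)
        "-XX:+PrintGCDetails",      # per-collection detail in the log
        "-XX:+PrintGCDateStamps",   # wall-clock timestamps in the log
    ]
    return " ".join(flags)


if __name__ == "__main__":
    print(build_java_opts(8, "/tmp/java-gc.log"))
```

The resulting string would be appended to the existing JAVA_OPTS; the other options from the earlier configuration are deliberately left out here.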

We upgraded to Fedora 4.5.1 to get the MySQL support.

--Jim

Diana Cooper

May 18, 2016, 4:59:02 PM

to Fedora Tech

Thanks Jim.

I was afraid of that.

Diana

Andrew Woods

May 18, 2016, 10:18:45 PM

to fedor...@googlegroups.com

Hello Diana,

I am sorry to hear about the data corruption. We may have some good news, however. I have a modified version of the Fedora 4.1.1 web-application that should skip over corrupted elements. Ideally, you will be able to load your corrupted fcrepo.home with this web-application, then execute /fcr:backup followed by /fcr:restore into a fresh 4.1.1 Fedora installation. If all goes well, then you can upgrade to 4.5.1 with MySQL/PostgreSQL.

To start with, could you stop your current Fedora 4.1.1, and install fcrepo-webapp-4.1.1-recovery.war:

https://github.com/fcrepo4/fcrepo4/releases/download/fcrepo-4.1.1/fcrepo-webapp-4.1.1-recovery.war

If this patched application is able to successfully load, please perform a backup/restore as documented here:

https://wiki.duraspace.org/display/FEDORA41/RESTful+HTTP+API+-+Backup+and+Restore
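
The backup and restore calls could be scripted along these lines. This is a sketch under stated assumptions: the request shape (a POST whose body is a directory path on the server) should be checked against the documentation linked above, and the base URL, directory, and function names are hypothetical. The code only builds the requests; it does not send them.

```python
import urllib.request


def _post(base_url: str, endpoint: str, directory: str) -> urllib.request.Request:
    """Build (but do not send) a POST whose body is a server-side path."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/" + endpoint,
        data=directory.encode("utf-8"),
        method="POST",
    )


def backup_request(base_url: str, backup_dir: str) -> urllib.request.Request:
    """Request for fcr:backup, intended for the recovery instance."""
    return _post(base_url, "fcr:backup", backup_dir)


def restore_request(base_url: str, backup_dir: str) -> urllib.request.Request:
    """Request for fcr:restore, intended for a fresh, empty instance."""
    return _post(base_url, "fcr:restore", backup_dir)


if __name__ == "__main__":
    # Hypothetical base URL and directory; adjust to the local installation.
    base = "http://localhost:8080/rest"
    for req in (backup_request(base, "/tmp/fcrepo-backup"),
                restore_request(base, "/tmp/fcrepo-backup")):
        print(req.get_method(), req.full_url, req.data)
```

Each request object can be passed to `urllib.request.urlopen` once the target instance is confirmed: the backup against the recovery web application, the restore against the new installation.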

Good luck,

Andrew


Diana Cooper

May 20, 2016, 1:53:35 PM

to Fedora Tech

Hi Andrew,

Thanks for your reply.

The upshot is it didn't work. I verified the original 4.1.1 war causes the backup to crash with the StreamCorruptedException.

With the recovery war deployed, and before the backup/restore, accessing the corrupted content (for lack of a better description, since I don't actually know what is corrupted) produced the following exception:

INFO 13:30:03.381 (WildcardExceptionMapper) Exception intercepted by WildcardExceptionMapper:
java.lang.NullPointerException: null
at org.modeshape.jcr.JcrSession.cachedNode(JcrSession.java:622) ~[fcrepo-kernel-impl-4.1.1.jar:na]
at org.modeshape.jcr.JcrSession.node(JcrSession.java:656) ~[fcrepo-kernel-impl-4.1.1.jar:na]
at org.modeshape.jcr.JcrSession.node(JcrSession.java:675) ~[fcrepo-kernel-impl-4.1.1.jar:na]
at org.modeshape.jcr.JcrSession.getNode(JcrSession.java:859) ~[fcrepo-kernel-impl-4.1.1.jar:na]
at org.modeshape.jcr.JcrSession.getNode(JcrSession.java:842) ~[fcrepo-kernel-impl-4.1.1.jar:na]
at org.modeshape.jcr.JcrSession.getNode(JcrSession.java:126) ~[fcrepo-kernel-impl-4.1.1.jar:na]
at org.fcrepo.http.commons.api.rdf.HttpResourceConverter.getNode(HttpResourceConverter.java:255) ~[fcrepo-http-commons-4.1.1.jar:na]
at org.fcrepo.http.commons.api.rdf.HttpResourceConverter.doForward(HttpResourceConverter.java:116) ~[fcrepo-http-commons-4.1.1.jar:na]
at org.fcrepo.http.commons.api.rdf.HttpResourceConverter.doForward(HttpResourceConverter.java:77) ~[fcrepo-http-commons-4.1.1.jar:na]
at com.google.common.base.Converter.correctedDoForward(Converter.java:154) ~[guava-18.0.jar:na]
at com.google.common.base.Converter.convert(Converter.java:147) ~[guava-18.0.jar:na]
at org.fcrepo.http.api.FedoraBaseResource.getResourceFromPath(FedoraBaseResource.java:65) ~[fcrepo-http-api-4.1.1.jar:na]
at org.fcrepo.http.api.ContentExposingResource.resource(ContentExposingResource.java:410) ~[fcrepo-http-api-4.1.1.jar:na]
at org.fcrepo.http.api.FedoraLdp.describe(FedoraLdp.java:185) ~[fcrepo-http-api-4.1.1.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_91]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_91]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_91]
at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_91]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) ~[jersey-server-2.13.jar:na]

I did the backup/restore using the recovery war and there were no errors.

Now, accessing the same corrupted content produces just this (no stack trace):

INFO 15:02:24.251 (WildcardExceptionMapper) Exception intercepted by WildcardExceptionMapper:

java.lang.NullPointerException: null

There seems to be a range of resource ids that cause this exception to occur. The application just chooses ids at random when creating a new resource, and if the id exists or has a tombstone, a new id is chosen. I've hacked the code to detect this exception and pretend the resource is gone. But I'm not sure if other problems will surface. And the exception message on the application side is perhaps not specific enough so my hack would potentially hide bigger problems.
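
The ID-selection workaround described above could be narrowed so that server errors are recorded instead of being silently treated as "gone". The sketch below assumes that a free ID answers 404, an existing resource 200, a tombstone 410, and a corrupted entry some 5xx status. The thread does not state these status codes, so they are assumptions, and all names are hypothetical.

```python
import uuid
from typing import Callable, List, Tuple


def pick_free_id(fetch_status: Callable[[str], int],
                 new_id: Callable[[], str] = lambda: str(uuid.uuid4()),
                 max_attempts: int = 20) -> Tuple[str, List[str]]:
    """Return a free ID plus the candidate IDs that answered with a 5xx."""
    suspect: List[str] = []
    for _ in range(max_attempts):
        candidate = new_id()
        status = fetch_status(candidate)
        if status == 404:
            # Nothing stored under this ID: safe to use.
            return candidate, suspect
        if status >= 500:
            # Possibly corrupted: skip it, but keep a record for follow-up.
            suspect.append(candidate)
        # Any other status (e.g. 200 or 410) means the ID is taken; retry.
    raise RuntimeError(f"no free ID found in {max_attempts} attempts")
```

Returning the suspect IDs alongside the chosen one keeps the application moving while leaving a trail of IDs to investigate, which addresses the worry that the hack might hide bigger problems.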

Diana

Andrew Woods

May 20, 2016, 2:05:08 PM

to fedor...@googlegroups.com

Hello Diana,

Thanks for the feedback. Let me see if I can summarize what you have done.

- Prior to any remedies, when you try to access your Fedora, you get an exception and you can see no resources

- You stopped your Fedora

- You deployed the "4.1.1 recovery" Fedora

- You were then able to see the resources in your repository with the exception of one or more corrupted resources

- You performed /fcr:backup

- You started a new, fresh, empty 4.1.1 Fedora (not a "recovery" Fedora)

- You performed /fcr:restore on the new Fedora

Is this correct?

Thanks,

Andrew

Diana Cooper

May 20, 2016, 2:26:12 PM

to Fedora Tech

Hi Andrew,

Thank you. A few corrections:

- Prior to any remedies, most content was accessible in Fedora. A couple resource ids resulted in the stream corrupted exception. I'm not sure what content, if any, should have been stored at these resource ids.
- After deploying the 4.1.1 recovery Fedora, the same content resulted in an exception (though a different one). The rest of the content was still accessible.

- After the backup, I forgot to switch back to the non-recovery Fedora. Does the recovery war cause problems for the restore?

Diana

Andrew Woods

May 20, 2016, 3:40:45 PM

to fedor...@googlegroups.com

Hello Diana,

The main question is, "Did you fcr:restore to a new, empty Fedora?" And if so, are you still getting exceptions?

I would expect that you would no longer get exceptions, and that the corrupted resources no longer exist.

Thanks,

Andrew

Diana Cooper

May 20, 2016, 4:22:52 PM

to fedor...@googlegroups.com

Hi Andrew,

Yes. I stopped Fedora, created a new completely empty directory mounted at fcrepo4-data, restarted Fedora, then started the restore process. The original and new data directories are roughly the same size, around 420GB.

Diana

Diana Cooper

May 20, 2016, 5:10:00 PM

to Fedora Tech

I'm not sure the ids for which I'm getting exceptions ever pointed to any actual data. I don't have references to those ids in my inventory list.

I am continuing to get exceptions like the one I reported above.

Is there any information I can provide to help track down this issue?

Diana

Andrew Woods

May 20, 2016, 5:15:12 PM

to fedor...@googlegroups.com

Thanks, Diana.

Is it true that your Fedora is in a working state, but there are one or more corrupted/missing resources?

If that is true, then it becomes a question of how to restore the missing resource(s), presumably from tape backup or the original source. It should be noted that the external fcrepo-serialization [1] capability is designed to follow your repository activity, writing raw RDF and optionally binaries to disk as a portable backup of your Fedora.

Then you will want to upgrade your repository to 4.5.1 with a MySQL or PostgreSQL backend, using JAVA_OPTS informed by the recommendations [2].

Please respond back to the list if there is anything else that the community can offer.

Andrew

[1] https://github.com/fcrepo4-exts/fcrepo-camel-toolbox/tree/master/fcrepo-serialization

[2] https://wiki.duraspace.org/display/FEDORA4x/Java+HotSpot+VM+Options+recommendations