Hi Joerg,
There is a lot of information here and I'd like to restate my
understanding of what you are trying to accomplish so I can offer
assistance.
You would like to migrate your data, studies and files, from one system,
DSpace, to another, DVN. The DVN instance is currently running in a
virtual machine with 12GB RAM and 4 CPUs. You are using the DVN batch
import utility to import both studies and files, after first creating study
metadata in DDI format for import. You have successfully imported most of
the studies without issue, though there were some initial problems that
required adding more memory.
Now, two studies in particular are failing to import. The server log you
provided reports an out-of-memory exception:
Caused by: java.lang.OutOfMemoryError: Java heap space
at edu.harvard.iq.dvn.ingest.statdataio.impl.plugins.sav.SAVFileReader.decodeRecordTypeDataCompressed(SAVFileReader.java:2319)
Let me explain how this utility works, what I think might be happening,
and then offer some ideas on how to work around the issue.
First, the batch import utility can be run multiple times with different
batches. Each batch is a predefined set of study metadata and associated
files. The study metadata is of primary importance, of course, since files
are associated with studies. Not all file types are processed in the same
way, however. Some file types (.sav, .por, .dta) are recognized as
subsettable and undergo additional processing: the file metadata is
extracted, summary statistics are generated, and a tabular data text
version of the file is created. Depending on the size and complexity of
the subsettable file, this can take a lot of system resources. As a rule of
thumb, although we accept files of up to 2GB, around 200MB is a practical
suggested limit for subsettable files, though we have successfully ingested
subsettable files of almost 1GB. Plain text files of up to 2GB present no
such issues.
I'm not sure whether you are trying to run the entire import as a single
batch each time or whether you are doing the import incrementally. What I
think might be happening is that, over the course of the import, the heap
is being consumed but stays within its limits until the utility encounters
the 300MB .sav file in one of the studies you mentioned. That file is large
enough to exhaust the remaining resources and halt the import. My guess and
hope is that if this file were simply excluded from the study import and
uploaded manually afterwards, the problem would not occur. Another approach
might be to batch import all the studies that succeed and then import each
of the two failing studies individually with the batch import utility,
effectively lightening the load of each batch.
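If it would help to identify which subsettable files are likely to be too
large before you build your batches, here is a minimal sketch of a small
Java program that flags them. It assumes the files for a batch sit together
in a single directory and uses the 200MB rule of thumb mentioned above; the
directory path and the threshold are only placeholders for your own values.

import java.io.IOException;
import java.nio.file.*;

public class FindLargeSubsettableFiles {
    // Practical suggested limit for subsettable files (see the rule of thumb above).
    private static final long LIMIT_BYTES = 200L * 1024 * 1024;

    public static void main(String[] args) throws IOException {
        // Directory containing the batch's files; defaults to the current directory.
        Path importDir = Paths.get(args.length > 0 ? args[0] : ".");
        // Look only at the subsettable file types (.sav, .por, .dta) in that directory.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(importDir, "*.{sav,por,dta}")) {
            for (Path file : files) {
                long size = Files.size(file);
                if (size > LIMIT_BYTES) {
                    System.out.printf("%s is %d MB; consider leaving it out of the batch%n",
                            file.getFileName(), size / (1024 * 1024));
                }
            }
        }
    }
}

Anything the program lists would be a candidate for excluding from the
batch and uploading through the web interface afterwards.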
Some of the specific errors and notifications are, I think, simply a
result of this ingest process.
I do want to confirm that you have been increasing your heap size as you
have been adding memory. Check your domain.xml file and confirm that the
-Xms and -Xmx options are both set to around 10GB.
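In a default GlassFish installation those settings appear as jvm-options
entries in the java-config section of domain.xml, roughly like this (the
10g values are only the example figures from above; your java-config
element will contain other options as well):

<jvm-options>-Xms10g</jvm-options>
<jvm-options>-Xmx10g</jvm-options>

If I remember correctly, you can also list the settings the running server
is actually using with the asadmin list-jvm-options command.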
Also, please note that we mainly use VMs for development or testing
purposes, and they may not provide sufficient performance to process large
amounts of data in a production environment.
Regards,
Kevin