Unable to restart DVN after a data load (for the 3rd time!)


Joerg Messer

May 16, 2013, 6:48:28 PM
To: dataverse...@googlegroups.com
Greetings,

I've been busily trying to move our ~1600 studies (205GB) from an instance of DSpace (v1.5) to DVN (v3.4). Things have been going quite well and most of the studies appeared to load without issue. Unfortunately, there appear to be one or more studies that make it impossible to restart DVN after the load; at least, that is my current theory. I've gone through two cycles of loading the data, not being able to restart, and having to re-initialize my DVN instance. It has now happened for the third time, so I thought it wise to ask for some help.

When I try to restart DVN using "service glassfish start", it takes a while and I finally end up with message #1 (below). After the restart attempt, the GlassFish daemon is running and putting considerable load on the server, but I'm not able to connect to it either via the admin interface on port 4848 or via port 80. A further attempt with the same command gives me message #2 (below). The only remaining option is to kill the GlassFish daemon (signal 9) and try the whole cycle again, to no avail.

I see a large number of messages describing SPSS variables in the server.log file, along the lines of

[#|2013-05-16T12:28:16.532-0700|INFO|glassfish3.1.2|edu.harvard.iq.dvn.ingest.statdataio.impl.plugins.sav|_ThreadID=16;_ThreadName=Thread-2;|X050 => Type of habitat|#]

for two different studies. These messages are always terminated by a message similar to the following:

[#|2013-05-16T12:28:16.548-0700|INFO|glassfish3.1.2|edu.harvard.iq.dvn.ingest.statdataio.impl.plugins.sav|_ThreadID=16;_ThreadName=Thread-2;|sumstat:long case=[0, 0, 0, 0, 0,...,0, 0]

I'm currently running DVN on a Red Hat (v5.8) Linux VM with 4 virtual CPUs and 12GB of memory.

A copy of my server.log file is included, and I can provide other info if anyone is interested. There are some other error messages in the server.log file which may be relevant, but they don't mean much to me. The two studies which might be causing problems are:

European and World Values Survey Integrated Data File, 1999-2002
http://abacus.library.ubc.ca/handle/10573/41360

European and World Values Surveys Four-wave Integrated Data File, 1981-2004
http://abacus.library.ubc.ca/jspui/handle/10573/41294

A further oddity is that I get "Dataverse Network: Upload request complete" messages for these two studies every time I try to restart Glassfish.  I wonder if these uploads are stranded in some sort of limbo state.

Needless to say, any pointers would be very much appreciated. 

//Joerg Messer (UBC)


---
Message #1
---
service glassfish start
Starting GlassFish server: glassfishWaiting for domain1 to start ............................................................
No response from the Domain Administration Server (domain1) after 600 seconds.
The command is either taking too long to complete or the server has failed.
Please see the server log files for command status.
Please start with the --verbose option in order to see early messages.
Command start-domain failed.
---

---
Message #2
---
service glassfish start
Starting GlassFish server: glassfishThere is a process already using the admin port 4848 -- it probably is another instance of a GlassFish server.
Command start-domain failed.
---
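
For reference, when this happens I've been clearing the wedged daemon and retrying with more logging along these lines (the asadmin path will depend on where GlassFish is installed, and the PID check is just my own habit, not anything from the DVN docs):

netstat -tlnp | grep 4848        # see which process is holding the admin port
ps -ef | grep glassfish          # find the GlassFish java process
kill -9 <glassfish-pid>          # last resort, as described above
/usr/local/glassfish3/bin/asadmin start-domain --verbose domain1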

server_log.txt

Philip Durbin

May 17, 2013, 8:25:38 AM
To: dataverse...@googlegroups.com
OutOfMemoryError...

https://github.com/IQSS/dvn/blob/3.4/src/DVN-ingest/src/edu/harvard/iq/dvn/ingest/statdataio/impl/plugins/sav/SAVFileReader.java#L2319
is the line that corresponds to

Caused by: java.lang.OutOfMemoryError: Java heap space
at edu.harvard.iq.dvn.ingest.statdataio.impl.plugins.sav.SAVFileReader.decodeRecordTypeDataCompressed(SAVFileReader.java:2319)
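
If it helps to confirm on your end, grepping the log should surface it (assuming the standard domain1 layout):

grep -n OutOfMemoryError glassfish3/glassfish/domains/domain1/logs/server.log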



--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Condon, Kevin

May 17, 2013, 11:25:31 AM
To: dataverse...@googlegroups.com

Hi Joerg,

There is a lot of information here and I'd like to restate my
understanding of what you are trying to accomplish so I can offer
assistance.

You would like to migrate your data, studies and files, from one system, DSpace, to another system, DVN. The DVN instance is currently running in a virtual machine that has 12GB RAM and 4 CPUs. You are using the DVN batch import utility to import both studies and files, after first creating study metadata in DDI format for import. You have successfully imported most of the studies without issue, though there seemed to be some initial issues with needing more memory.

Now, two studies in particular are failing import. The server log you
provided reports an out of memory exception:

Caused by: java.lang.OutOfMemoryError: Java heap space
at edu.harvard.iq.dvn.ingest.statdataio.impl.plugins.sav.SAVFileReader.decodeRecordTypeDataCompressed(SAVFileReader.java:2319)

Let me explain how this utility works, what I think might be happening, and then offer some ideas on how to work around this issue.

First, the batch import utility can be run multiple times with different
batches. Each batch is a predefined set of study metadata and associated
files. Of primary importance is the study metadata, of course, since
files are associated with studies. Not all file types are processed in the
same way however. Some file types, .sav, .por, .dta, are recognized as
subsettable and undergo additional processing where the file metadata is
extracted, summary statistics are generated and a tabular data text file
version of the file is created. Depending on the size and complexity of
the subsettable file, this can take a lot of system resources. A rule of
thumb is that although we accept files of up to 2GB, around 200MB for
subsettable files is a practical suggested limit, though we have
successfully ingested subsettable files of almost 1GB. Text files of 2GB
have no such issues.

I'm not sure whether you are trying to run the entire import as a single batch each time or doing the import incrementally. What I think might be happening is that during the course of the import, the heap is being consumed, but still within limits, until it encounters the 300MB .sav file in one of the studies you mentioned. This is large enough to cause it to run out of resources and halt. My guess and hope is that if this file were simply not included in the study import and uploaded manually afterwards, this problem would not occur. Another approach might be to batch import all the successful studies and then do each of these studies individually using the batch import utility, effectively lightening the load of each batch.

Some of the specific errors and notifications are, I think, just a result of this ingest process.

I do want to confirm that you have been increasing your heap size as you
have been adding memory. Check your domain.xml file and confirm your Xms
and Xmx options are both around 10GB.
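
The entries in domain.xml (glassfish3/glassfish/domains/domain1/config/domain.xml in a default layout) would look roughly like the following; the 10g values are just illustrative, and GlassFish needs a restart to pick them up:

<jvm-options>-Xms10g</jvm-options>
<jvm-options>-Xmx10g</jvm-options>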

Also, please note that we mainly use VMs for development or testing purposes, and they may not provide sufficient performance to process large amounts of data in a production environment.

Regards,
Kevin


Joerg Messer

May 24, 2013, 4:16:39 PM
To: dataverse...@googlegroups.com
Kevin,

I've followed your advice, as I've indicated in a separate email message. The problem is definitely related to the processing of large subsettable files during the batch upload, as you indicated. I have about 8 studies that have large SPSS .por and .sav files which range in size from 200MB to 600MB. Alas, they're causing nothing but problems. I've done some new tests where I ZIP them up during the initial upload and do a separate file upload of the uncompressed file later. We don't seem to have any further problems with the bulk loads themselves, but as soon as I try to manually load a large .por/.sav file I get some very strange behavior.

The largest subsettable file loaded so far is 257MB, after increasing the heap size to 10GB. When I tried to upload the next largest file (313MB), DVN sends out a status message indicating that "Dataverse Network: The upload of your subsettable file(s) is in progress" and later a completion message indicating that the upload has succeeded. When I check to see if this is in fact the case, I find no sign of the uploaded file. Other than that, the VM CPU/memory load goes back down and the DVN server appears to be working fine.

The problem is that as soon as I try to restart the server (which now works fine after increasing the memory), the CPU/memory load immediately shoots up and I get a message indicating that "Dataverse Network: Upload request complete". When I check to see whether the subsettable file is now available, it is nowhere to be found. Somewhat later I get the exact same message again, and still no file. The system load eventually goes back down and all appears well again. I've tried restarting DVN a number of times and I get the same behaviour each time. I've also increased the JVM heap memory another 50% to 15GB between restart cycles, without any improvement. The error message in the server.log file is "java.lang.OutOfMemoryError: GC overhead limit exceeded".

BTW, the server settings are:

-XX:+UnlockDiagnosticVMOptions
-XX:+UseParallelOldGC
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:MaxPermSize=384m
-XX:NewRatio=2
-Xmx15000m
-Xss256k
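
For what it's worth, these are the <jvm-options> entries from domain.xml; I double-checked the running values with asadmin (default domain assumed):

asadmin list-jvm-options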
 

I know that I'm pushing the system a bit, since you've already mentioned that anything over 200MB for a subsettable file may not work, but it does raise a few questions:

1) If we, or any of our users, load a file for which we have insufficient resources to process, how do we recover? In other words (and this harks back to my batch load termination post), how do we flush the attempted file upload from the load queue so that it doesn't cause performance grief every time we restart DVN?

2) Is there any way to set a limit on the size of subsettable files which will be processed?  This should keep the problem from happening. 

3) How do we determine the amount of memory that we need going into production?  I believe there was something in the docs about the IQSS server running with 64GB.  How did you determine this number?

4) Would you care to hazard a guess as to how much memory would be required to successfully upload a 600MB .por file?  You indicated that you were able to process a 1GB file and it would be interesting to know how the server was resourced for this ingest. How much memory are other DVN sites using?

5) Can the resource requirements needed for ingest be reduced later for "regular" use?

6) Would it be best to limit our ingest of subsettable files to those under 200MB in size?  You mentioned this as a practical limit.  I'd like to make sure that our DVN server is as robust as possible, so this would be a small compromise.

7) Lastly, do you see any problems with the GlassFish JVM settings I included?

Apologies for the many questions.  I'm hoping we're about to overcome the last hurdles before going into production with DVN.

//Joerg Messer (UBC)

Condon, Kevin

May 31, 2013, 5:37:33 PM
To: dataverse...@googlegroups.com


From: dataverse...@googlegroups.com [dataverse...@googlegroups.com] on behalf of Joerg Messer [joerg....@gmail.com]
Sent: Friday, May 24, 2013 4:16 PM
To: dataverse...@googlegroups.com
Subject: Re: Unable to restart DVN after a data load (for the 3rd time!)
BTW, the server settings are:

-XX:+UnlockDiagnosticVMOptions
-XX:+UseParallelOldGC
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:MaxPermSize=384m
-XX:NewRatio=2
-Xmx15000m
-Xss256k
 

I know that I'm pushing the system a bit, since you've already mentioned that anything over 200MB for a subsettable file may not work, but it does raise a few questions:

1) If we, or any of our users, load a file for which we have insufficient resources to process, how do we recover? In other words (and this harks back to my batch load termination post), how do we flush the attempted file upload from the load queue so that it doesn't cause performance grief every time we restart DVN?

Subsettable files that do not complete ingest, such as when GlassFish is restarted, remain in an upload directory, and ingest keeps retrying them upon restart. So, if you kill GlassFish because it is consuming too much memory, restarting will just kick off the ingest again.

The directory where the files live is: glassfish3/glassfish/domains/domain1/applications/DVN-web/temp. Just delete the file in question and you should be all set.
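
Something along these lines should do it, assuming the default domain1 layout (stop GlassFish first so ingest doesn't immediately pick the file back up, or kill the process as you have been doing):

service glassfish stop
ls -lh glassfish3/glassfish/domains/domain1/applications/DVN-web/temp
rm glassfish3/glassfish/domains/domain1/applications/DVN-web/temp/<stuck file>
service glassfish start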

2) Is there any way to set a limit on the size of subsettable files which will be processed?  This should keep the problem from happening.  

Not currently, though that is something we are looking at doing: https://redmine.hmdc.harvard.edu/issues/2617


3) How do we determine the amount of memory that we need going into production?  I believe there was something in the docs about the IQSS server running with 64GB.  How did you determine this number? 

Partly through trial and error, and partly by purchasing the current higher-end server-class machines available. Our previous machines, running until last October, had 32GB and worked well for a long time. File sizes have been growing larger and larger recently, so we will need to revisit the size-limiting option and code efficiency.

4) Would you care to hazard a guess as to how much memory would be required to successfully upload a 600MB .por file?  You indicated that you were able to process a 1GB file and it would be interesting to know how the server was resourced for this ingest. How much memory are other DVN sites using?

I don't have an answer for you. We were able to ingest an 800MB SPSS file using our 32GB machine, though it took a long time (more than an hour). I don't have numbers on what other sites are using, though we did post this question to the list several months ago. If sites do not have many subsettable files and do not do much harvesting, then the memory requirements would be more modest. For example, our build environment has only 4GB.


5) Can the resource requirements needed for ingest be reduced later for "regular" use?

Well, ingest is part of the issue. Other operations on that large file could be problematic as well, such as downloading in an alternate format, subsetting, or running analyses.


6) Would it be best to limit our ingest of subsettable files to those under 200MB in size?  You mentioned this as a practical limit.  I'd like to make sure that our DVN server is as robust as possible, so this would be a small compromise.

It would be a good guideline. Files larger than 200MB can be uploaded without additional processing if you use type "Other".


7) Lastly, do you see any problems with the GlassFish JVM settings I included?

I see no obvious problems. We have experimented with different settings recommended by the glassfish3 developer:


I hope those answers are useful.

Kevin

Joerg Messer

June 3, 2013, 2:42:02 PM
To: dataverse...@googlegroups.com
Kevin,

I upped the memory on our VM to 32GB last week and, as you proposed, ingested the really large files manually. The largest subsettable .por file that was successfully uploaded was over 500MB. We had an even larger 600MB file, but it seems to suffer from some sort of corruption, since it failed early in the ingest process. With this substantial amount of RAM in place, the memory errors have disappeared. It looks like our bulk loading woes have been resolved. Many thanks for your help. Looking forward to the next release of DVN. Ciao.

//Joerg