Error #1 - OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
at org.htmlparser.lexer.InputStreamSource.fill(InputStreamSource.java:337)
at org.htmlparser.lexer.InputStreamSource.read(InputStreamSource.java:396)
at org.htmlparser.lexer.Page.getCharacter(Page.java:705)
at org.htmlparser.lexer.Lexer.scanJIS(Lexer.java:685)
at org.htmlparser.lexer.Lexer.parseString(Lexer.java:749)
at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:394)
at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:69)
at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
This is fairly explicit; however, I now doubt that the solution is as simple as adding RAM. Experiments with less RAM while indexing this WARC just make it fail faster! I will investigate further before I proceed with getting more RAM.
Error #2 - failed canonicalize address
WARNING: FAILED canonicalize(http://www.laptopsonline.co.nz/wEPDwUKLTk1NDY5Nzk2NQ9kFgJmD2QWAgIBD2QWAgIDD2QWAgIDD2QWAmYPZBYCZg9kFh):NLNZ-NZ-CRAWL-003-20130224110204186-01513-7390~wbgrp-crawl006.us.archive.org~8443.warc.gz 116293486
This error is referenced here: http://sourceforge.net/p/archive-access/mailman/message/30255525/ . The mail thread identifies old versions of Java and OpenWayback as a potential cause; we are OK on that count, as we are running much newer versions. The thread concludes that the problem was an old version of Wayback (<1.6.0) that was used to create the WARCs. That does not seem to be the problem here, as the tag in our WARC says "software: Heritrix/3.1.2", which seems to be a recent version, so I don't know what the cause might be.
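To look at the offending record directly, the warning line can be split into its parts. Below is a minimal sketch, assuming the line always has the shape "FAILED canonicalize(<url>):<WARC file> <number>" as in the example above, and assuming the trailing number is the record's offset in the WARC (I have not confirmed that). The class name CanonicalizeWarning and its parse method are my own, not part of OpenWayback.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CanonicalizeWarning {
    // Assumed layout: "FAILED canonicalize(<url>):<warc file> <number>".
    // The greedy (.*) backtracks to the last "):" so URLs containing
    // parentheses or colons are still captured whole.
    private static final Pattern LINE =
            Pattern.compile("FAILED canonicalize\\((.*)\\):(\\S+) (\\d+)");

    /** Returns {url, warcFile, number}, or null if the line does not match. */
    static String[] parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "WARNING: FAILED canonicalize(http://www.laptopsonline.co.nz/"
                + "wEPDwUKLTk1NDY5Nzk2NQ9kFgJmD2QWAgIBD2QWAgIDD2QWAgIDD2QWAmYPZBYCZg9kFh):"
                + "NLNZ-NZ-CRAWL-003-20130224110204186-01513-7390"
                + "~wbgrp-crawl006.us.archive.org~8443.warc.gz 116293486";
        String[] fields = parse(line);
        System.out.println("url:    " + fields[0]);
        System.out.println("warc:   " + fields[1]);
        System.out.println("number: " + fields[2]); // assumed to be the record offset
    }
}
```

With the file name and number in hand, the single record can be pulled out of the WARC and inspected on its own rather than re-indexing the whole file.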
Error #3 - Failed parse of http status line
java.io.IOException: Failed parse of http status line.
at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
The code says:
String statusLineStr = EncodingUtil.getString(statusBytes, 0,
        statusBytes.length - eolCharCount, ARCConstants.DEFAULT_ENCODING);
if ((statusLineStr == null)
        || !StatusLine.startsWithHTTP(statusLineStr)) {
    throw new RecoverableIOException("Failed parse of http status line.");
}
So maybe it is the DNS responses in the WARCs that trigger this problem, like this one:
WARC-Target-URI: dns:theprojectoffice.co.nz
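To illustrate the suspicion, here is a small sketch of the kind of check the quoted code performs. The method startsWithHttp below is my own stand-in that approximates StatusLine.startsWithHTTP (skip leading whitespace, then require the literal "HTTP"); it is not the library code. The second input is a made-up first line for a DNS record payload, since I have not dumped the real one yet.

```java
class StatusLineCheck {
    // Stand-in for the quoted check: skip leading whitespace,
    // then require the line to begin with "HTTP".
    static boolean startsWithHttp(String line) {
        if (line == null) {
            return false;
        }
        int at = 0;
        while (at < line.length() && Character.isWhitespace(line.charAt(at))) {
            at++;
        }
        return line.startsWith("HTTP", at);
    }

    public static void main(String[] args) {
        // A normal HTTP response record starts with a status line.
        String httpFirstLine = "HTTP/1.1 200 OK";
        // Hypothetical first line of a dns: record payload (an assumption).
        String dnsFirstLine = "20130224110204";

        System.out.println(startsWithHttp(httpFirstLine));
        System.out.println(startsWithHttp(dnsFirstLine));
    }
}
```

If the dns: records really do reach this check, anything that does not begin with "HTTP" would raise the exception above. Whether they reach it at all is still something I need to verify.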
Just to rule out the obvious: you say the VM has 4GB, but you never state what the JVM max heap is configured as (the -Xmx option).
If left at the default (256MB, I think), you'd definitely hit OOMEs.
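A quick way to see what the JVM actually gets is to ask it. This is only a sketch: the class name HeapCheck is made up, and it needs to be run with the same java binary and the same options as the indexer for the number to mean anything.

```java
class HeapCheck {
    /** Maximum heap this JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: "
                + maxHeapMegabytes() + " MB");
    }
}
```

If the number is far below the 4GB of the VM, pass a larger -Xmx wherever the indexer's java command is launched and try the same WARC again.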
- Kris
--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.