Some issues trying to stand up a test instance

Jay Gattuso

Nov 9, 2015, 1:57:12 PM
to openwayback-dev
Hi all, 

One of my dev colleagues is trying to put together a reference instance of OW over some WARC content we have. 

The app sits on a VM with 4 GB RAM and 250 GB storage, and the WARCs are on an NTFS mount. 

We're running into all sorts of technical issues, and he asked me to see if the following errors had an obvious / known cause... 


Any thoughts or suggestions appreciated. 


________________________


Error #1 - OutOfMemoryError: Java heap space

java.lang.OutOfMemoryError: Java heap space
    at org.htmlparser.lexer.InputStreamSource.fill(InputStreamSource.java:337)
    at org.htmlparser.lexer.InputStreamSource.read(InputStreamSource.java:396)
    at org.htmlparser.lexer.Page.getCharacter(Page.java:705)
    at org.htmlparser.lexer.Lexer.scanJIS(Lexer.java:685)
    at org.htmlparser.lexer.Lexer.parseString(Lexer.java:749)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:394)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:69)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

This is fairly explicit; however, I now doubt that the solution is as simple as increasing RAM. Experiments with less RAM while indexing this WARC just make it fail faster! I will investigate further before I proceed with getting more RAM.


Error #2 - failed canonicalize address

WARNING: FAILED canonicalize(http://www.laptopsonline.co.nz/wEPDwUKLTk1NDY5Nzk2NQ9kFgJmD2QWAgIBD2QWAgIDD2QWAgIDD2QWAmYPZBYCZg9kFh):NLNZ-NZ-CRAWL-003-20130224110204186-01513-7390~wbgrp-crawl006.us.archive.org~8443.warc.gz 116293486

This error is referenced here: http://sourceforge.net/p/archive-access/mailman/message/30255525/ . That mail thread identifies old versions of Java and OpenWayback as a potential cause; we are fine on that front, as we are running much newer versions. The thread concludes that the problem was an old version of Wayback (<1.6.0) that was used to make the WARCs. That does not seem to be the problem here: the tag in our WARC says "software: Heritrix/3.1.2", which seems to be a recent version, so I don't know what it might be.


Error #3 - Failed parse of http status line

java.io.IOException: Failed parse of http status line.
    at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)

The code says:

            String statusLineStr = EncodingUtil.getString(statusBytes, 0,
                    statusBytes.length - eolCharCount, ARCConstants.DEFAULT_ENCODING);
            if ((statusLineStr == null) ||
                    !StatusLine.startsWithHTTP(statusLineStr)) {
                throw new RecoverableIOException("Failed parse of http status line.");

So maybe this problem is caused by the DNS responses in the WARCs, like this one:

WARC-Target-URI: dns:theprojectoffice.co.nz
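
To illustrate the suspicion, here is a minimal sketch of why a DNS record's payload would fail that test. `StatusLineCheck` and both methods are hypothetical, simplified stand-ins rather than the OpenWayback code, and the payload line is invented for illustration:

```java
final class StatusLineCheck {
    /** Simplified stand-in for the "starts with HTTP" test in the quoted indexer code. */
    static boolean looksLikeHttpStatusLine(String firstLine) {
        return firstLine != null && firstLine.startsWith("HTTP");
    }

    /** True when the target URI uses the dns: scheme, so no HTTP response is expected. */
    static boolean isDnsRecord(String targetUri) {
        return targetUri != null && targetUri.startsWith("dns:");
    }

    public static void main(String[] args) {
        String uri = "dns:theprojectoffice.co.nz";
        String firstLine = "20130224110204"; // invented payload line, not taken from the WARC
        System.out.println("dns record: " + isDnsRecord(uri));
        System.out.println("looks like HTTP: " + looksLikeHttpStatusLine(firstLine));
    }
}
```

If a record like this reaches the HTTP parsing path, the status-line check cannot succeed, which would match the exception above.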


andrew.jackson

Nov 9, 2015, 3:26:31 PM
to openwayback-dev
Issue 1 appears to be another instance of this issue: https://github.com/iipc/openwayback/issues/162 - there's no workaround apart from using a different CDX indexer AFAIK.

For 2, you might want to extract the record from offset 116293486 in NLNZ-NZ-CRAWL-003-20130224110204186-01513-7390~wbgrp-crawl006.us.archive.org~8443.warc.gz and take a look - the code:

https://github.com/internetarchive/wayback/blob/master/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WARCRecordToSearchResultAdapter.java#L205-L206

only shows a shortened URL, so if it's malformed in a way the canonicaliser can't cope with, you'll have to look at the source.

Similarly, for 3, I think you'd have to look at the record and see what's happening. DNS responses should really be skipped, so I suspect the data has an issue that the code isn't coping with.
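
For anyone who wants to look at a record without extra tooling, here is a rough sketch using only the Java standard library. It assumes the number after the file name in the warning is a byte offset into the .warc.gz and that each record is its own gzip member; `WarcRecordPeek` and `headLines` are made-up names:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

final class WarcRecordPeek {
    /** Reads up to maxLines lines from the gzip member that starts at the given byte offset. */
    static List<String> headLines(String path, long offset, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (FileInputStream in = new FileInputStream(path)) {
            // Moving the channel position also moves the stream's read position.
            in.getChannel().position(offset);
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new GZIPInputStream(in), StandardCharsets.ISO_8859_1));
            String line;
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WarcRecordPeek <file.warc.gz> <offset>
        for (String line : headLines(args[0], Long.parseLong(args[1]), 40)) {
            System.out.println(line);
        }
    }
}
```

The first 40 lines should cover the WARC headers and the start of the payload, which is where the full URL and the status line live. If the record is shorter than that, the output simply runs on into the next record.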

HTH,
Andy

Mohamed Elsayed

Nov 16, 2015, 8:47:26 AM
to openwayback-dev
I have an OutOfMemoryError with a different cause. I am deploying OpenWayback 2.2.0 on tomcat7 on Debian 8 (Jessie).

Exception in thread "http-bio-80-exec-452" java.lang.OutOfMemoryError: Java heap space
        at java.util.HashMap.inflateTable(HashMap.java:316)
        at java.util.HashMap.put(HashMap.java:488)
        at sun.util.resources.OpenListResourceBundle.loadLookup(OpenListResourceBundle.java:134)
        at sun.util.resources.OpenListResourceBundle.loadLookupTablesIfNecessary(OpenListResourceBundle.java:113)
        at sun.util.resources.OpenListResourceBundle.handleGetObject(OpenListResourceBundle.java:74)
        at java.util.ResourceBundle.getObject(ResourceBundle.java:389)
        at java.util.ResourceBundle.getObject(ResourceBundle.java:392)
        at java.util.ResourceBundle.getString(ResourceBundle.java:355)
        at java.util.Locale.getDisplayString(Locale.java:1670)
        at java.util.Locale.getDisplayLanguage(Locale.java:1580)
        at java.util.Locale.getDisplayLanguage(Locale.java:1561)
        at org.archive.wayback.core.WaybackRequest.extractHttpRequestInfo(WaybackRequest.java:1158)
        at org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:284)
        at org.archive.wayback.util.webapp.RequestMapper.handleRequest(RequestMapper.java:198)
        at org.archive.wayback.util.webapp.RequestFilter.doFilter(RequestFilter.java:146)

My workaround to keep the service running is to restart tomcat7 from time to time! Hopefully, I will work on it in the next few days.

Kristinn Sigurðsson

Nov 16, 2015, 9:22:37 AM
to openway...@googlegroups.com

Just to exclude the obvious, you say the VM has 4GB, but you never state what the JVM max memory is configured as (-Xmx option)?

 

If left at default (256MB I think) you’d definitely hit OOMEs.
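
One quick way to confirm which limit the JVM actually picked up is to ask the runtime itself. A minimal sketch (`HeapCheck` is a made-up name); run it with the same -Xmx setting you pass to the indexer or to Tomcat:

```java
final class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

For example, `java -Xmx2g HeapCheck` should report a figure close to 2048 MiB. The exact number depends on the JVM and the garbage collector in use.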

 

- Kris

 

Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is

Mohamed Elsayed

Nov 24, 2015, 4:18:53 AM
to openwayback-dev
After setting the maximum heap size to 2G in catalina.sh, I no longer needed to restart tomcat7 periodically, but another issue came up that forced me to restart it, as shown below:

Nov 24, 2015 11:00:09 AM org.archive.wayback.util.webapp.PortMapper getRequestHandlerContext
FINER: No mapping for web.archive.bibalex.org/web
Nov 24, 2015 11:00:09 AM org.archive.wayback.util.webapp.PortMapper getRequestHandlerContext
FINER: No mapping for web.archive.bibalex.org/
Nov 24, 2015 11:00:09 AM org.archive.wayback.util.webapp.PortMapper getRequestHandlerContext
FINE: Mapped to RequestHandler with /web
Nov 24, 2015 11:00:09 AM org.archive.format.gzip.zipnum.ZipNumBlockLoader attemptLoadBlock
SEVERE: java.io.FileNotFoundException: cdx_name.cdx.gz (Too many open files) -- -r 5840861196-5840928010 cdx_name.cdx.gz
Nov 24, 2015 11:00:09 AM org.archive.wayback.webapp.AccessPoint logError
WARNING: Runtime Error
org.archive.wayback.exception.ResourceIndexNotAvailableException: java.io.FileNotFoundException: cd_name.cdx.gz (Too many open files) -- -r 5840861196-5840928010 cdx_name.cdx.gz
        at org.archive.wayback.resourceindex.LocalResourceIndex.doCaptureQuery(LocalResourceIndex.java:216)
        at org.archive.wayback.resourceindex.LocalResourceIndex.query(LocalResourceIndex.java:332)
        at org.archive.wayback.webapp.AccessPoint.queryIndex(AccessPoint.java:598)
        at org.archive.wayback.memento.DefaultMementoHandler.renderMementoTimemap(DefaultMementoHandler.java:20)
        at org.archive.wayback.webapp.AccessPoint.handleQuery(AccessPoint.java:1168)
        at org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:325)
        at org.archive.wayback.util.webapp.RequestMapper.handleRequest(RequestMapper.java:198)
        at org.archive.wayback.util.webapp.RequestFilter.doFilter(RequestFilter.java:146)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)

Another interesting thing: the Java process uses almost 50% of the memory capacity after tomcat7 has been running for about 2 days. I found this by running the top command.

andrew.jackson

Nov 26, 2015, 2:53:04 PM
to openwayback-dev
Ah, good old 'Too many open files'. The simplest fix is to up the ulimit on the number of open file handles that the user account running Wayback can have.

However, if the number of handles needed keeps growing, this may indicate a file handle leak. For example, the FlatFile implementation opens a fresh file handle for every CDX file, for every request, so the number of file handles you need will go up when you have more visitors. My proposed pull request should fix that (https://github.com/iipc/openwayback/pull/277), but I think you are using the ZipNum thing, which may behave differently.
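
To tell a leak from a limit that is simply too low, it helps to watch the descriptor count over time. Here is a minimal sketch, assuming a HotSpot/OpenJDK JVM on a Unix-like system where `com.sun.management.UnixOperatingSystemMXBean` is available; `OpenFileCount` is a made-up name. It reports on the JVM it runs in, so to watch Wayback it would have to run inside Tomcat, or you could read the same attributes over JMX:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class OpenFileCount {
    /** Returns {open, max} file descriptor counts, or null if the JVM does not expose them. */
    static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // not a Unix-style JVM, or the bean is unavailable
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("File descriptor counts are not available on this JVM");
        } else {
            System.out.println("Open file descriptors: " + counts[0] + " of " + counts[1]);
        }
    }
}
```

If the open count keeps climbing under steady traffic and never drops back, that points to a leak rather than a ulimit that is merely too low.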

Andy