Cdx-Indexer

Fernando Melo

Sep 25, 2015, 5:19:44 AM
to openwayback-dev
I tested creating CDX indexes for some old ARC files.
The cdx-indexer worked fine for all of them except one, where it crashed. Can someone help me find out what is wrong with that specific ARC file?

Best Regards
      Fernando Melo

Kristinn Sigurðsson

Sep 28, 2015, 7:10:45 AM
to openway...@googlegroups.com
Is there any error message?

If not, your best bet is to look at how far it got with creating the CDX file. Then inspect the ARC manually starting with the last entry in the CDX file. If the cdx-indexer is crashing, it is almost certainly a faulty ARC.
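
The "how far did it get" step can be sketched in Python as below. This is only a rough illustration, not part of the cdx-indexer: it assumes the CDX file starts with a legend line such as " CDX N b a m s k r M S V g", where `V` is the compressed ARC offset and `g` the file name, and the function name `last_indexed_record` is made up for this example. It picks the record with the highest offset rather than the last line, so it should also work if the CDX has been sorted.

```python
# Legend letters assumed to be present in the CDX header line.
OFFSET_KEY = "V"    # compressed ARC file offset
FILENAME_KEY = "g"  # ARC file name


def last_indexed_record(cdx_path):
    """Return the CDX record with the highest ARC offset as a dict, or None."""
    with open(cdx_path, encoding="utf-8", errors="replace") as fh:
        header = fh.readline().split()
        if not header or header[0] != "CDX":
            raise ValueError("first line is not a CDX header")
        legend = header[1:]  # one letter per field, in column order
        best = None
        for line in fh:
            fields = line.split()
            if len(fields) != len(legend):
                continue  # skip blank or truncated lines
            record = dict(zip(legend, fields))
            offset = record.get(OFFSET_KEY, "")
            if offset.isdigit() and (
                best is None or int(offset) > int(best[OFFSET_KEY])
            ):
                best = record
    return best
```

The offset in the returned record is where the last successfully indexed record begins in the ARC, so the manual inspection would start there and continue with the record that follows it.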

- Kris

-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is

Fernando Melo

Sep 28, 2015, 8:55:42 AM
to openwayback-dev
Yes, a long stack trace is output, but the CDX is produced anyway.
It appears to be the same error over and over, about an invalid port number.

org.apache.commons.httpclient.URIException: invalid port number
at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2248)
at org.archive.url.LaxURI.parseAuthority(LaxURI.java:190)
at org.archive.url.LaxURI.parseUriReference(LaxURI.java:359)
at org.apache.commons.httpclient.URI.<init>(URI.java:147)
at org.archive.url.LaxURI.<init>(LaxURI.java:77)
at org.archive.url.UsableURI.<init>(UsableURI.java:128)
at org.archive.url.UsableURIFactory.makeOne(UsableURIFactory.java:287)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:275)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:265)
at org.archive.url.UsableURIFactory.getInstance(UsableURIFactory.java:233)
at org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adaptInner(ARCRecordToSearchResultAdapter.java:102)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:60)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:40)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: String too long. .com
at org.archive.url.UsableURIFactory.fixupDomainlabel(UsableURIFactory.java:615)
at org.archive.url.UsableURIFactory.fixupAuthority(UsableURIFactory.java:569)
at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:428)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:275)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:265)
at org.archive.url.UsableURIFactory.getInstance(UsableURIFactory.java:233)
at org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adaptInner(ARCRecordToSearchResultAdapter.java:102)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:60)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:40)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
org.apache.commons.httpclient.URIException: invalid port number
at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2248)
at org.archive.url.LaxURI.parseAuthority(LaxURI.java:190)
at org.archive.url.LaxURI.parseUriReference(LaxURI.java:359)
at org.apache.commons.httpclient.URI.<init>(URI.java:147)
at org.archive.url.LaxURI.<init>(LaxURI.java:77)
at org.archive.url.UsableURI.<init>(UsableURI.java:128)
at org.archive.url.UsableURIFactory.makeOne(UsableURIFactory.java:287)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:275)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:265)
at org.archive.url.UsableURIFactory.getInstance(UsableURIFactory.java:233)
at org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adaptInner(ARCRecordToSearchResultAdapter.java:102)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:60)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:40)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
org.apache.commons.httpclient.URIException: invalid port number
at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2248)
at org.archive.url.LaxURI.parseAuthority(LaxURI.java:190)
at org.archive.url.LaxURI.parseUriReference(LaxURI.java:359)
at org.apache.commons.httpclient.URI.<init>(URI.java:147)
at org.archive.url.LaxURI.<init>(LaxURI.java:77)
at org.archive.url.UsableURI.<init>(UsableURI.java:128)
at org.archive.url.UsableURIFactory.makeOne(UsableURIFactory.java:287)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:275)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:265)
at org.archive.url.UsableURIFactory.getInstance(UsableURIFactory.java:233)
at org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adaptInner(ARCRecordToSearchResultAdapter.java:102)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:60)
at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:40)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

Kristinn Sigurðsson

Sep 28, 2015, 9:28:45 AM
to openway...@googlegroups.com
That seems pretty clear. The ARC contains (probably multiple) entries with an invalid URL. You'll either need to fix the ARC (if possible) or accept that not everything in it can be indexed.

This shouldn't affect other entries in the ARC, just the ones with the malformed URLs.
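
To see which URLs are likely affected, a rough pre-check along these lines may help. This is only a sketch: it uses Python's standard `urllib.parse` rather than the `LaxURI`/`UsableURIFactory` classes in the stack trace, so it approximates the "invalid port number" check and will not match the indexer's behaviour exactly. The function names and the sample URLs are made up for illustration, and how you extract the URLs from the ARC is left open.

```python
from urllib.parse import urlsplit


def has_invalid_port(url):
    """Return True if the URL's port is not an integer in the range 0-65535."""
    try:
        # Reading .port triggers validation; it raises ValueError for
        # non-numeric or out-of-range ports.
        urlsplit(url).port
    except ValueError:
        return True
    return False


def find_bad_urls(urls):
    """Return the URLs that fail the port check, in their original order."""
    return [u for u in urls if has_invalid_port(u)]


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the ARC discussed in this thread.
    sample = [
        "http://example.com/",
        "http://example.com:8080/index.html",
        "http://example.com:80a/",
        "http://example.com:99999/",
    ]
    for url in find_bad_urls(sample):
        print(url)
```

Any URL this flags is a candidate for one of the entries the indexer skipped; URLs it passes may still be rejected by the indexer for other reasons.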

- Kris

> -----Original Message-----
> From: openway...@googlegroups.com [mailto:openwayback-d...@googlegroups.com] On Behalf Of Fernando Melo
> r.java:216)
> org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: String too long. .com
> at org.archive.url.UsableURIFactory.fixupDomainlabel(UsableURIFactory.java:615)
> at org.archive.url.UsableURIFactory.fixupAuthority(UsableURIFactory.java:569)
> at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:428)
> at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:275)
> at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:265)
> at org.archive.url.UsableURIFactory.getInstance(UsableURIFactory.java:233)
> at org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)
> at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adaptInner(ARCRecordToSearchResultAdapter.java:102)
> at org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter.adapt(ARCRecordToSearchResultAdapter.java:60)
> at
> org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapt
> r.java:216)
>
>
> On Friday, 25 September 2015 at 10:19:44 UTC+1, Fernando Melo wrote:
>
> I tested creating the .cdx indexes for some old arc files.
> The cdx-indexer worked fine for all the arc files except one where it
> crashed. Can someone help me find what is wrong with a specific arc file?
>
> Best Regards
> Fernando Melo
>

Fernando Melo

Sep 28, 2015, 10:40:33 AM
to openwayback-dev
Thank you, that helped. I wanted to be sure that it was not indexing the problematic URLs.