CDX index resource not available for all sites


conorsh...@gmail.com

Sep 18, 2017, 8:21:34 AM
to openwayback-dev
Hi all,

The problem:

I have a domain crawl of the .ie domain from 2007 which I'm trying to access. The overall size of the crawl is 3.7 TB.

I followed the setup instructions here:  https://github.com/iipc/openwayback/wiki/How-to-configure

If I use the BDB option I can view the sites, but the index grows to about 1 TB quite quickly and I run out of space before everything is indexed.

I already have a CDX file that was generated at the time of the crawl (2007). 

If I use the CDX option I can see the links to view sites, but all of them return Resource Not Available.

Further details:

I only have read and execution rights on the folder containing the webarchive. 

I tried generating a new cdx file and path-index for a single file but ran into the same issue.

I had a look at this answer, but I don't think that's my issue since the CDX file was working in the past, although it was using wayback and nutch instead of openwayback.

Recreating the problem:

Unfortunately I can't attach any of the warc files I'm working with. I've also left out my cdx file since it's 68 GB in size.

I've attached my configuration files and some screenshots of what happens using BDB vs CDX.

To switch between configurations I copy wayback.xml.bdb or wayback.xml.cdx to /usr/share/tomcat/openwayback/WEB_INF/wayback.xml.

If anyone can see what I'm doing wrong or point me in the direction of further documentation I'd really appreciate it.

Thanks,

Conor

list_view_bdb.PNG
list_view_cdx.PNG
site_bdb.PNG
site_cdx.PNG
BDBCollection.xml
CDXCollection.xml
wayback.xml.bdb
wayback.xml.cdx

Sawood Alam

Sep 18, 2017, 10:08:46 AM
to openway...@googlegroups.com
Hi Conor,

You said that all of them return Resource Not Available, but one of your screenshots shows an example that works. That said, from what you describe, the CDX files appear to be in place and sorted as required, or else you would not be able to see the listing. The likely causes are: there is some issue in your path-index.txt file (or its configuration); the WARC files are not located where the path-index says they are; the file permissions on the path-index or the WARC files need to be revisited; or the WARC files are corrupted in some way (in the past I have seen WARC files whose first block was uncompressed while the rest of the WARC was gzipped).

I would perhaps chase a failing request to the end: find that URL and timestamp in the CDX file manually, read the filename and offset from the CDX entry, find that filename's mapping in the path-index file, seek to the offset in the WARC file and read the bytes of the record, then decompress them (if gzipped) to see the content. I would choose a text file (HTML/CSS/JS) for this exercise.
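
A rough Python sketch of that manual trace, in case it helps. It rests on assumptions that may not match your files: each CDX line is space-separated, starts with the URL key and the timestamp, and ends with the compressed offset and the filename (as in the common `CDX N b a m s k r V g` layout), and each path-index line is `filename<TAB>path`. The function names are made up for illustration and are not part of OpenWayback.

```python
import zlib


def find_cdx_entry(cdx_path, urlkey, timestamp):
    """Scan the CDX file for the first line matching urlkey and timestamp.

    Returns (urlkey, timestamp, offset, filename) or None.
    Assumption: offset and filename are the last two fields on the line.
    """
    prefix = f"{urlkey} {timestamp} "
    with open(cdx_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith(prefix):
                fields = line.split()
                return fields[0], fields[1], int(fields[-2]), fields[-1]
    return None


def lookup_path(path_index, filename):
    """Return the location mapped to filename in the path-index, or None.

    Assumption: each line is 'filename<TAB>path'.
    """
    with open(path_index, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) > 1 and parts[0] == filename:
                return parts[1]
    return None


def read_record(warc_path, offset, max_bytes=65536):
    """Read up to max_bytes at offset and gunzip them if they look gzipped."""
    with open(warc_path, "rb") as f:
        f.seek(offset)
        chunk = f.read(max_bytes)
    if chunk[:2] == b"\x1f\x8b":  # gzip magic number
        # wbits=31 selects gzip framing; a decompress object tolerates
        # a truncated chunk and stops at the end of the first gzip member.
        return zlib.decompressobj(wbits=31).decompress(chunk)
    return chunk
```

If `lookup_path` returns None, or `read_record` fails to open the file or returns something that does not start with a WARC or ARC record header, that points at the path-index, the permissions, or the offsets respectively. The linear scan is slow on a very large CDX file, so it is only meant for checking a handful of entries.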

Alternatively, I would grab a few WARC files, copy them elsewhere, then run them through a different replay system such as PyWB to locate the potential issue.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529


--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openwayback-dev+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/openwayback-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

conorsh...@gmail.com

Sep 19, 2017, 4:04:06 AM
to openwayback-dev
Hi Sawood,

Thanks for your reply. The screenshot which shows a working site is using BDB. I can't get any site to work when using CDX. The screenshots ending in _bdb use BDB and those ending in _cdx use CDX. I'm sure the WARC files aren't corrupt, as they work with BDB. I'd like to use CDX instead of BDB because it scales far better and I would like to host an entire domain crawl.

The idea about path-index.txt seems promising. I'll have a look at making a fresh path-index file and possibly changing permissions on the WARC files.
I'm also going to look into PyWB.

Thanks again,
Conor

Sawood Alam

Sep 19, 2017, 8:17:15 AM
to openway...@googlegroups.com
Hi Conor,

Just a quick note that path-index.txt files are also sorted the same way CDX files are, i.e., with the LC_ALL=C environment variable set.
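
A quick way to check an existing file, sketched in Python on the assumption that sorting under LC_ALL=C amounts to plain byte-wise ordering of whole lines. The function name is only illustrative.

```python
def first_unsorted_line(path):
    """Return the 1-based number of the first line that sorts before the
    line above it in byte order, or None if the file is sorted."""
    prev = None
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            # Compare without the line terminator, as sort does.
            line = line.rstrip(b"\r\n")
            if prev is not None and line < prev:
                return lineno
            prev = line
    return None
```

Running it on both the CDX file and path-index.txt should return None for each; any line number it reports is where the ordering first breaks.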

Best,

