The problem:
I have a domain crawl of the .ie domain from 2007 which I'm trying to access. The overall size of the crawl is 3.7 TB.
I followed the setup instructions here: https://github.com/iipc/openwayback/wiki/How-to-configure
If I use the BDB option I can view the sites, but the index grows to about 1 TB quite quickly and I run out of disk space before everything is indexed.
I already have a CDX file that was generated at the time of the crawl (2007).
If I use the CDX option I can see the links to view sites, but all of them return "Resource Not Available".
Further details:
I only have read and execute permissions on the folder containing the web archive.
I tried generating a new CDX file and path-index for a single file but ran into the same issue.
I had a look at this answer, but I don't think it applies here: the CDX file worked in the past, although that setup used Wayback and Nutch rather than OpenWayback.
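One check that might narrow this down is whether the file names recorded in the CDX actually resolve through the path-index to files that can be read. Below is a minimal sketch of such a check. It rests on several assumptions that are not confirmed for this setup: that the CDX starts with a header line such as " CDX N b a m s k r M S V g" where "g" marks the file name column, that the path-index is a tab-separated list of file name and local path, and that the path-index needs to be byte-sorted for lookups. The function names are made up for illustration.

```python
import os
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def filename_column(header: str) -> Optional[int]:
    """Return the position of the 'g' (file name) field in a CDX header line."""
    fields = header.split()
    # Assumed header shape: " CDX N b a m s k r M S V g"
    if not fields or fields[0] != "CDX":
        return None
    letters = fields[1:]
    return letters.index("g") if "g" in letters else None


def load_path_index(lines: Iterable[str]) -> Dict[str, str]:
    """Parse assumed 'name<TAB>path' lines into a dict."""
    index: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, _, path = line.partition("\t")
        index[name] = path
    return index


def is_byte_sorted(lines: List[str]) -> bool:
    """True if the lines are in bytewise order (what `LC_ALL=C sort` produces)."""
    encoded = [line.encode("utf-8") for line in lines]
    return all(a <= b for a, b in zip(encoded, encoded[1:]))


def check_records(
    cdx_lines: Iterable[str],
    path_index: Dict[str, str],
    exists: Callable[[str], bool] = os.path.exists,
) -> List[Tuple[str, str]]:
    """Report each archive file named in the CDX that cannot be resolved."""
    col: Optional[int] = None
    seen = set()
    problems: List[Tuple[str, str]] = []
    for line in cdx_lines:
        if line.startswith(" CDX"):
            col = filename_column(line)
            continue
        if col is None:
            continue  # no usable header seen yet
        fields = line.split()
        if len(fields) <= col or fields[col] in seen:
            continue
        name = fields[col]
        seen.add(name)
        path = path_index.get(name)
        if path is None:
            problems.append((name, "not in path-index"))
        elif not exists(path):
            # Only meaningful for local paths, not http:// locations.
            problems.append((name, "path not found: " + path))
    return problems
```

With a 68 GB CDX it is probably enough to feed in the first few thousand lines, for example via `itertools.islice(open("index.cdx"), 5000)` together with `load_path_index(open("path-index.txt"))`, where both file names are placeholders. If every record is reported as "not in path-index", the names in the 2007 CDX and the keys in the path-index may simply not match.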
Recreating the problem:
Unfortunately I can't attach any of the WARC files I'm working with. I've also left out my CDX file since it's 68 GB in size.
I've attached my configuration files and some screenshots of what happens using BDB vs CDX.
To switch between configurations I copy wayback.xml.bdb or wayback.xml.cdx to /usr/share/tomcat/openwayback/WEB-INF/wayback.xml.
If anyone can see what I'm doing wrong or point me in the direction of further documentation I'd really appreciate it.
Thanks,
Conor
--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openwayback-dev+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/openwayback-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.