On Thu, Feb 07, 2019 at 08:39:35AM -0800, Spencer Dorsey wrote:
> I'll dig through the cdx toolkit and see what I can put together while I
> wait for 2008-2012 index.
Excellent :-) Once the 2008-2012 index of ARC files comes out, it
might take a short while for me to make sure cdx_toolkit works with it
properly. The base warcio library does seamlessly deal with both ARC
and WARC, but there's always a chance for a bug or two on my side.
Also, I'd recommend that you check out extracting from the Internet
Archive's Wayback Machine in addition to Common Crawl. Switching from
one to the other is as simple as:
% cdxt --cc --from=2008 --to=2019 warc 'example.com/*' --prefix CC-EXAMPLE-COM
% cdxt --ia --from=2008 --to=2019 warc 'example.com/*' --prefix IA-EXAMPLE-COM
You'll want to experiment with adding some "-v" flags on the end,
and/or running "warcio index" on the warc.gz file as it's being
generated, to see what's going on. For a big domain, I generally
extract each year separately because of the long runtimes. For
example, running an extract of IA's NY Times archive took about 12
wall-clock hours per year. Here's how to grab one year:
% cdxt --ia --from=2012 --to=2012 warc 'example.com/*' --prefix IA-EXAMPLE-COM --subprefix 2012
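If you end up doing the year-by-year approach for a range of years, a small
shell loop saves some typing. This is only a rough sketch: it prints the
commands rather than running them, the year_commands helper name is made up
for illustration, and the year range and IA-EXAMPLE-COM prefix are
placeholders to adjust for your domain.

```shell
#!/bin/sh
# Dry run: print one cdxt command per year instead of running it.
# year_commands is a hypothetical helper; the year range and the
# IA-EXAMPLE-COM prefix are placeholders, not values from a real run.
year_commands() {
    year=$1
    last=$2
    while [ "$year" -le "$last" ]; do
        # Same flags as the single-year command above, one line per year.
        echo "cdxt --ia --from=$year --to=$year warc 'example.com/*' --prefix IA-EXAMPLE-COM --subprefix $year"
        year=$((year + 1))
    done
}

year_commands 2008 2012
```

Once the printed commands look right, piping the output into "sh" would run
them one after another. Given the long runtimes, it's probably worth trying a
single year first.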
-- greg