Finding and downloading entire domains


Spencer Dorsey

Feb 6, 2019, 8:07:26 PM
to Common Crawl
I can see that a few questions like this have already been asked + answered but it looks like things have changed quite a bit recently so I figured I would just ask. I've been tasked with downloading the full contents (going as far back as I can) of about 50 domains for a (university) research team. I have fast internet, a couple powerful computers, and I know Python. Will this be possible? Any pointers on where to start?

Also, I noticed that the posted indexes only go back to 2013. Does the Crawl contain data from before 2013 or am I on my own there?

Thanks for your patience

Sebastian Nagel

Feb 7, 2019, 8:32:40 AM
to common...@googlegroups.com
Hi Spencer,

> Will this be possible?

Yes, definitely, and most of these points have already been answered in preceding threads on this list.
If it's ok to "buffer" the content in custom WARC files, the easiest option would be to use the cdx
toolkit, see https://pypi.org/project/cdx-toolkit/
But the decision on how to proceed also depends on the expected total amount of data: if it's about
millions of page captures, it might be worth gathering the data first on a machine in the AWS
us-east-1 region, where the Common Crawl data is located.
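Once you have matches from the index, each CDX JSON line carries `filename`, `offset` and `length` fields that point at one gzipped record inside a crawl file, which you can pull with an HTTP Range request. A minimal stdlib-only sketch; the function names are mine, and the data-host prefix is an assumption (it has changed over the years, so check the current Common Crawl documentation):

```python
import gzip
import urllib.request

# Assumed data host; verify against current Common Crawl docs.
CC_PREFIX = "https://data.commoncrawl.org/"

def fetch_raw_record(filename, offset, length):
    """Fetch one record's compressed bytes with an HTTP Range request.

    filename, offset and length come straight from a CDX index JSON line.
    """
    req = urllib.request.Request(
        CC_PREFIX + filename,
        headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def decompress_record(raw):
    """Each record is a standalone gzip member, so plain gzip.decompress works."""
    return gzip.decompress(raw)
```

The nice property of this layout is that you can concatenate the raw (still-compressed) records into your own valid `.warc.gz` "buffer" file without ever recompressing.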

> Also, I noticed that the posted indexes only go back to 2013. Does the Crawl contain
> data from before 2013 or am I on my own there?

There are three data sets covering the years 2008 - 2012. The good news: URL indexes for these
crawls are already in preparation, but it might still take a couple of weeks until they're finally
published.
The older crawls use the ARC format, but that shouldn't matter with Python: the module "warcio" can
be used to parse ARC files the same way as WARC files.

Best,
Sebastian

Tom Morris

Feb 7, 2019, 10:20:34 AM
to common...@googlegroups.com
On Thu, Feb 7, 2019 at 8:32 AM Sebastian Nagel <seba...@commoncrawl.org> wrote:

> Will this be possible?

Yes, definitely, and most of these points have already been answered in preceding threads on this list.

I was going to answer "almost certainly not," but finding which answer is closest to the truth for your use case is easy using the URL index.

My thinking is that none of your 50 sites will have 100% coverage in either breadth or depth, which is how I'm interpreting your request.

Sebastian may disagree, but since you've only got 50 sites, I think it'd be OK to ping the CC Index Server directly rather than downloading the CDX files, as long as you're polite about it. It's only going to be a few thousand queries. For example, this will tell you how many page crawls are in the 2013 index for Cornell:

It'd be a pretty quick matter to use one of the Python toolkits to cycle through all 50 domains and ~60 crawls to see what your coverage looks like.
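That coverage sweep needs nothing beyond the stdlib. A sketch, where the crawl IDs and domains are placeholders and the query parameters follow the index-server examples elsewhere in this thread:

```python
import json
import urllib.request

INDEX = "http://index.commoncrawl.org/{crawl}-index"

def coverage_query(crawl, domain):
    """Build a showNumPages query URL for one crawl/domain pair."""
    base = INDEX.format(crawl=crawl)
    return ("{base}?url={domain}&matchType=domain"
            "&output=json&showNumPages=true").format(base=base, domain=domain)

def sweep(crawls, domains):
    """Yield (crawl, domain, url) for every pair; fetch them politely."""
    for crawl in crawls:
        for domain in domains:
            yield crawl, domain, coverage_query(crawl, domain)

# Usage (network call commented out so the sketch runs offline):
# for crawl, domain, url in sweep(["CC-MAIN-2013-20"], ["cornell.edu"]):
#     with urllib.request.urlopen(url) as resp:
#         print(crawl, domain, json.loads(resp.read()))
```

At 50 domains times ~60 crawls, that's ~3,000 queries, so a small `time.sleep()` between requests keeps it well within polite limits.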

Tom

Sebastian Nagel

Feb 7, 2019, 10:53:55 AM
to common...@googlegroups.com
Hi Tom, hi Spencer,

in case you also want to include subdomains, you can use this query:

http://index.commoncrawl.org/CC-MAIN-2013-20-index?url=cornell.edu&output=json&matchType=domain&showNumPages=true

The result
{"pageSize": 5, "blocks": 228, "pages": 46}
indicates that there are about 228*3000 page captures for cornell.edu and subdomains,
one block holds 3000 page captures.
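That arithmetic is easy to wrap in a helper. A sketch, assuming the roughly-3,000-captures-per-block figure given above:

```python
import json

CAPTURES_PER_BLOCK = 3000  # approximate, per the note above

def estimate_captures(index_response):
    """Turn a showNumPages JSON response into a rough capture count."""
    info = json.loads(index_response)
    return info["blocks"] * CAPTURES_PER_BLOCK

print(estimate_captures('{"pageSize": 5, "blocks": 228, "pages": 46}'))
# 228 blocks * 3000 captures/block = 684000
```

So the cornell.edu example above works out to roughly 680,000 page captures in that one crawl.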

> it'd be OK to ping the CC Index Server directly

A few thousand queries is definitely not an issue; the server can handle several million queries per
day. Just don't overload it, so that it stays responsive for other users.

Thanks,
Sebastian

Spencer Dorsey

Feb 7, 2019, 11:39:36 AM
to Common Crawl
Hi Sebastian,

I'll dig through the cdx toolkit and see what I can put together while I wait for the 2008-2012 index.

Thanks for the fast reply!

~Spencer

Spencer Dorsey

Feb 7, 2019, 11:43:52 AM
to Common Crawl
I'll definitely test this with a couple of domains, but is it generally best to think of the CC data for any given domain as a sample? I would aim for 100% coverage with any custom scrapers I build, but I'd probably be ok with getting 90%+ of the pages for a domain if it saves me the time and hassle of going through each domain myself.

Maybe I'll spread the queries over a few days and hit up the index.

Thanks!

~Spencer

Greg Lindahl

Feb 8, 2019, 2:17:04 PM
to common...@googlegroups.com
On Thu, Feb 07, 2019 at 08:39:35AM -0800, Spencer Dorsey wrote:

> I'll dig through the cdx toolkit and see what I can put together while I
> wait for 2008-2012 index.

Excellent :-) Once the 2008-2012 index of ARC files comes out, it
might take a short while for me to make sure cdx_toolkit works with it
properly. The base warcio library does seamlessly handle both ARC
and WARC, but there's always a chance of a bug or two on my side.

Also, I'd recommend that you check out extracting from the Internet
Archive's Wayback Machine in addition to Common Crawl. Switching from
one to the other is as simple as:

% cdxt --cc --from=2008 --to=2019 warc 'example.com/*' --prefix CC-EXAMPLE-COM
% cdxt --ia --from=2008 --to=2019 warc 'example.com/*' --prefix IA-EXAMPLE-COM

You'll want to experiment with adding some "-v" flags on the end,
and/or running "warcio index" on the warc.gz file as it's being
generated, to see what's going on. For a big domain, I generally
extract each year separately because of the long runtimes. For
example, running an extract of IA's NY Times archive took about 12
wall-clock hours per year. Here's how to grab one year:

% cdxt --ia --from=2012 --to=2012 warc 'example.com/*' --prefix IA-EXAMPLE-COM --subprefix 2012

-- greg


