Using Common Crawl to Build a Specific Database

Kim

Jul 9, 2016, 12:05:11 AM
to Common Crawl
Hello,

I'm a non-technical person working with some developers on a prototype.  Is it possible to use CC to extract URLs in a specific business domain, say, architecture?

Even more specifically, can one extract from CC a listing of URLs using keywords from that business domain?

Thanks in advance,

Kim

Ivan Habernal

Jul 11, 2016, 3:37:38 AM
to Common Crawl
Hi Kim,

If you're fine with only URLs that contain certain keywords, you can have a look at the list of URLs extracted from CC that we made for our C4Corpus:

https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_list_of_urls_from_commoncrawl
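
If a keyword filter over that list is all you need, a minimal sketch (assuming one of the list files has been downloaded locally as url-list.txt.gz; the file name is hypothetical) could be:

  # filter the downloaded URL list for a keyword (file name is hypothetical)
  gzip -dc url-list.txt.gz | grep -i architecture > architecture-urls.txt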

But it seems to me that your goal is to get the HTML pages (not only the URLs) for your domain of interest; for that you would probably need to build a search engine over the corpus or do some topic-model clustering, neither of which is trivial. Have a look at Common Search: https://about.commonsearch.org/

Best,

Ivan

Sebastian Nagel

Jul 11, 2016, 4:53:47 AM
to Common Crawl
Hi Kim,

I would second Ivan that the task is not trivial. But it's doable! One approach could be:

First, get pages/sites of the desired domain from DMOZ or Common Search:
  http://www.dmoz.org/search?q=architecture
  https://uidemo.commonsearch.org/?g=en&q=architecture
Optionally, do this for other domains as well, and/or add close but undesired domains (e.g., "home depot") to use as negative examples later.

Second, get the location in the Common Crawl archives via the CC index:
  http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=archiseek.com&matchType=domain&output=json
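
If you want to script this lookup, a minimal sketch with curl and jq (both assumed to be installed; the output file name is just an example) could be:

  # query the index server and keep the WARC file name, offset and length
  # ("filename", "offset" and "length" are fields of the JSON records returned with output=json)
  curl -s 'http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=archiseek.com&matchType=domain&output=json' \
    | jq -r '[.filename, .offset, .length] | @tsv' \
    > archiseek-locations.tsv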

Third, "fetch" the pages of your domain from the Common Crawl archives on AWS S3
and train a classifier on the page content.

Now you could run the classifier over the whole Common Crawl data to get more
"architecture" pages, or even try to find links to architecture pages not included in
the Common Crawl data.


The Common Crawl index is also available on AWS S3 as gzipped files. If you only want to
grep the URLs for keywords, that's the way to go:

  # crawl to search: one of the monthly crawl identifiers
  idx=CC-MAIN-2016-26
  # the index of a crawl is split into 300 gzipped cdx files (cdx-00000.gz ... cdx-00299.gz);
  # stream each one from S3, decompress it, and grep the lines for the keyword
  for i in `seq 0 299`; do
     aws s3 cp --no-sign-request s3://commoncrawl/cc-index/collections/$idx/indexes/cdx-`printf %05d $i`.gz - \
       | gzip -dc \
       | grep -i architecture
  done

For how to get the page content using the WARC file location and offset, see one of the previous threads in this group.
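
A minimal sketch of such a range request (assuming the shell variables filename, offset and length hold the values of one index record; the commoncrawl bucket is also readable anonymously over HTTPS):

  # fetch a single gzipped WARC record by its byte range and decompress it
  curl -s -r "$offset-$(($offset+$length-1))" \
    "https://commoncrawl.s3.amazonaws.com/$filename" \
    | gzip -dc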


Best,
Sebastian