How to download specific domain/sub domain data with WARC, WET format


Nelson Jiao

Apr 26, 2017, 10:34:21 PM
to Common Crawl
Hi,

It seems that https://github.com/trivio/common_crawl_index, which supported downloading data for a specific domain/subdomain with remote_copy, is deprecated. I cannot find any other solution in this mailing list.

What I am trying to do is download data for a specific domain/subdomain locally for analysis (https://github.com/CI-Research/KeywordAnalysis).

  • e.g. "com.ibm.www", "com.ibm.www/watson/"
  • Preferably in WET format; WARC format is OK too.
Thanks in advance,

Nelson

Sebastian Nagel

Apr 28, 2017, 5:04:07 AM
to common...@googlegroups.com
Hi Nelson,

the Trivio index covers crawl data released up to 2012. The index for the 2012 data
is still up at
http://urlsearch.commoncrawl.org/

URLs from 2013 until now are indexed on
http://index.commoncrawl.org/

The API
https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference
lets you search for URL prefixes or domains, e.g.,
http://index.commoncrawl.org/CC-MAIN-2017-13-index?url=ibm.com/watson
http://index.commoncrawl.org/CC-MAIN-2017-13-index?url=ibm.com&matchType=domain
or even lets you access content directly:
http://index.commoncrawl.org/CC-MAIN-2017-13/20170331020845/https://www.ibm.com/watson/
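Queries like the ones above can also be built programmatically. A minimal Python sketch using only the standard library (the helper name cdx_query_url is my own, not part of any Common Crawl tooling):

```python
from urllib.parse import urlencode

def cdx_query_url(collection, url, match_type=None, output="json"):
    """Build a CDX index API query URL for a Common Crawl collection."""
    params = {"url": url, "output": output}
    if match_type:
        params["matchType"] = match_type  # e.g. "domain" or "prefix"
    return "http://index.commoncrawl.org/%s-index?%s" % (collection, urlencode(params))

# Domain query for all ibm.com captures in the CC-MAIN-2017-13 crawl:
print(cdx_query_url("CC-MAIN-2017-13", "ibm.com", match_type="domain"))
```

Fetching that URL returns one JSON record per line, each with the filename, offset, and length fields used below.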

But you can also use the WARC file path, offset, and length to fetch single records from the
archives. E.g., given the index record

com,ibm)/analytics/watson-analytics/us-en/operations 20170328162103 {"url":
"https://www.ibm.com/analytics/watson-analytics/us-en/operations", "mime": "text/html", "status":
"200", "digest": "BC34YS6UH5P3HUSJJDPDE7VRPT76SVMJ", "length": "15127", "offset": "938031737",
"filename":
"crawl-data/CC-MAIN-2017-13/segments/1490218189802.18/warc/CC-MAIN-20170322212949-00240-ip-10-233-31-227.ec2.internal.warc.gz"}

one possible way is
% curl -s -r 938031737-$((938031737+15127-1)) \
  "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-13/segments/1490218189802.18/warc/CC-MAIN-20170322212949-00240-ip-10-233-31-227.ec2.internal.warc.gz" \
  | gzip -dc
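The same range-request fetch can be done from Python; a sketch assuming only the standard library (the helper names are mine). Note that the HTTP Range header end byte is inclusive, hence offset+length-1:

```python
import gzip
import urllib.request

def byte_range(offset, length):
    """HTTP Range header value; the end byte is inclusive."""
    return "bytes=%d-%d" % (offset, offset + length - 1)

def fetch_warc_record(filename, offset, length):
    """Fetch and decompress one gzipped WARC record from Common Crawl on S3."""
    url = "https://commoncrawl.s3.amazonaws.com/" + filename
    req = urllib.request.Request(url, headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

With the index record above you would call fetch_warc_record(filename, 938031737, 15127), passing the "filename" field from the JSON.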

But see also https://groups.google.com/d/msg/common-crawl/8vnQnUA-0-0/8aT5g-9SFgAJ

There are also tools to do this on a local copy of the index, e.g.
https://github.com/centic9/CommonCrawlDocumentDownload/

> * Better download it as WET format, WARC format is OK too.
At present, the WET files are not indexed.

Best,
Sebastian


Nelson Jiao

Apr 28, 2017, 5:24:49 AM
to Common Crawl
Sebastian,

This is really helpful. We will try some of the approaches you pointed us to.

We really appreciate your help!

Nelson  