Hi Nelson,
The Trivio index covers crawl data released up to 2012. The index for the 2012 data
is still up at
http://urlsearch.commoncrawl.org/
URLs from 2013 onward are indexed at
http://index.commoncrawl.org/
The API
https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference
can be used to search for URL prefixes or domains, e.g.,
http://index.commoncrawl.org/CC-MAIN-2017-13-index?url=ibm.com/watson
http://index.commoncrawl.org/CC-MAIN-2017-13-index?url=ibm.com&matchType=domain
or even to access content directly
http://index.commoncrawl.org/CC-MAIN-2017-13/20170331020845/https://www.ibm.com/watson/
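The index queries above can also be issued from a script. The following is only a minimal sketch in Python: it assumes the CDX server accepts `output=json` and returns one JSON object per line, as described in the pywb API reference linked above. The function names (`build_query_url`, `parse_records`, `query_index`) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Index server from the examples above.
INDEX_SERVER = "http://index.commoncrawl.org"


def build_query_url(collection, url, match_type=None):
    """Build a CDX query URL, e.g. for collection "CC-MAIN-2017-13"."""
    # output=json is an assumption taken from the pywb CDX server API docs.
    params = {"url": url, "output": "json"}
    if match_type is not None:
        params["matchType"] = match_type
    return f"{INDEX_SERVER}/{collection}-index?{urlencode(params)}"


def parse_records(text):
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def query_index(collection, url, match_type=None):
    """Run the query over the network and return the index records."""
    with urlopen(build_query_url(collection, url, match_type)) as response:
        return parse_records(response.read().decode("utf-8"))


# Example call (needs network access):
# for record in query_index("CC-MAIN-2017-13", "ibm.com", match_type="domain"):
#     print(record["filename"], record["offset"], record["length"])
```

Each returned record should carry the `filename`, `offset` and `length` fields needed to fetch the archived record itself.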
You can also use the WARC file path, offset and length to fetch single records from the
archives. E.g., to fetch the record described by the index entry
com,ibm)/analytics/watson-analytics/us-en/operations 20170328162103 {"url": "https://www.ibm.com/analytics/watson-analytics/us-en/operations", "mime": "text/html", "status": "200", "digest": "BC34YS6UH5P3HUSJJDPDE7VRPT76SVMJ", "length": "15127", "offset": "938031737", "filename": "crawl-data/CC-MAIN-2017-13/segments/1490218189802.18/warc/CC-MAIN-20170322212949-00240-ip-10-233-31-227.ec2.internal.warc.gz"}
one possible way is
% curl -s -r938031737-$((938031737+15127-1)) \
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-13/segments/1490218189802.18/warc/CC-MAIN-20170322212949-00240-ip-10-233-31-227.ec2.internal.warc.gz" \
    | gzip -dc
But see also
https://groups.google.com/d/msg/common-crawl/8vnQnUA-0-0/8aT5g-9SFgAJ
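The curl range request above can be sketched in Python as well. This is a rough illustration, not a tested client: it assumes the data is still served from the host used in the curl example and that each WARC record is its own gzip member, so the byte slice decompresses on its own. The names `range_header` and `fetch_record` are hypothetical.

```python
import gzip
from urllib.request import Request, urlopen

# Host taken from the curl example above; it may differ for newer crawls.
DATA_SERVER = "https://commoncrawl.s3.amazonaws.com"


def range_header(offset, length):
    """Return the HTTP Range value covering one record.

    The index stores offset and length as strings, so convert them first.
    The last byte is inclusive, hence the "- 1".
    """
    first = int(offset)
    last = first + int(length) - 1
    return f"bytes={first}-{last}"


def fetch_record(filename, offset, length):
    """Fetch one gzipped WARC record and return its decompressed bytes."""
    request = Request(
        f"{DATA_SERVER}/{filename}",
        headers={"Range": range_header(offset, length)},
    )
    with urlopen(request) as response:
        compressed = response.read()
    return gzip.decompress(compressed)


# Example call with the values from the index entry above (needs network access):
# record = fetch_record(
#     "crawl-data/CC-MAIN-2017-13/segments/1490218189802.18/warc/"
#     "CC-MAIN-20170322212949-00240-ip-10-233-31-227.ec2.internal.warc.gz",
#     "938031737", "15127")
# print(record.decode("utf-8", errors="replace")[:500])
```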
There are also tools to do this on a local copy of the index, e.g.
https://github.com/centic9/CommonCrawlDocumentDownload/
> * Better download it as WET format, WARC format is OK too.
At present, the WET files are not indexed.
Best,
Sebastian
> * e.g. "com.ibm.www", "com.ibm.www/watson/"
>
> Thanks in advance,
>
> Nelson