extract data using offset value in CDX API


Gautam Balasubramanian

Jul 4, 2016, 9:11:28 AM
to Common Crawl, Manoharan Chinnasamy
We are trying to figure out how to use your data dump for a specific domain. We have found the link to the WARC file using the CDX API.

Example:
 http://index.commoncrawl.org/CC-MAIN-2016-22-index?url=barchick.com&matchType=domain&output=json

Part of the output is:
{"urlkey": "com,barchick)/", "timestamp": "20160524114937", "status": "200", "url": "http://www.barchick.com/", "filename": "crawl-data/CC-MAIN-2016-22/segments/1464049270555.40/warc/CC-MAIN-20160524002110-00228-ip-10-185-217-139.ec2.internal.warc.gz", "length": "17240", "mime": "text/html", "offset": "375611148", "digest": "STCRRRMZY5SHDFVUCNBO2YBA4DRXWFUT"}

What are the offset and length used for, and how can we leverage them?

Thanks,
Gautam B.

Sebastian Nagel

Jul 4, 2016, 10:23:53 AM
to Common Crawl, mchin...@owler.com
 Hi Gautam,

offset and length indicate the position of the WARC record in the WARC file: the byte offset at which the record starts and its compressed length in bytes. It's possible to pull out the archived document, including the WARC record headers, e.g., with the command below. Of course, this is also possible using a simple HTTP request with a byte range specified.
Running it on EC2 in the AWS us-east-1 region should be significantly faster, especially when many documents are extracted.
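
The HTTP variant could look roughly like the Python sketch below. It is an illustration only and makes some assumptions: the base URL (https://data.commoncrawl.org/) is assumed to be the public HTTP endpoint for the bucket, and the function names are made up for this example. The index returns offset and length as strings, so convert them with int() first.

```python
import urllib.request

# Assumption: public HTTP endpoint that serves the same keys as the S3 bucket.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range_header(offset: int, length: int) -> str:
    """Build the Range header value for one record.

    HTTP byte ranges are inclusive on both ends, hence the "- 1".
    """
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch the gzip-compressed bytes of a single WARC record.

    filename, offset and length are the fields from the index output
    (hypothetical helper, not part of any Common Crawl library).
    """
    request = urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": byte_range_header(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

For the record above, byte_range_header(375611148, 17240) gives the same range as the aws command at the end of this message, and fetch_record(...) with the "filename" value from the index should return the same bytes as /tmp/test_warc_chunk.gz.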

index.commoncrawl.org also extracts documents this way. The URL to fetch this document for the given timestamp is:
  http://index.commoncrawl.org/CC-MAIN-2016-22/20160524114937id_/http://www.barchick.com/

Please do not use index.commoncrawl.org "at scale"! It's a single, small server. Thanks.
Accessing the index and WARC files directly from a small EC2 machine should also be much faster.

See also
  https://github.com/ikreymer/pywb/wiki/CDX-Server-API
  https://github.com/centic9/CommonCrawlDocumentDownload/

Best,
Sebastian


 % aws s3api get-object \
        --range bytes=375611148-$((375611148+17240-1)) \
        --bucket commoncrawl \
        --key crawl-data/CC-MAIN-2016-22/segments/1464049270555.40/warc/CC-MAIN-20160524002110-00228-ip-10-185-217-139.ec2.internal.warc.gz \
        /tmp/test_warc_chunk.gz

 % zcat /tmp/test_warc_chunk.gz | head
WARC/1.0
WARC-Type: response
WARC-Date: 2016-05-24T11:49:37Z
WARC-Record-ID: <urn:uuid:cb2a25f0-bd05-4dc9-9059-673904f530f1>
Content-Length: 79309
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:dfddc741-6c55-4368-992a-e651b5496b42>
WARC-Concurrent-To: <urn:uuid:90506c88-e0a0-489a-9172-aacf8fc6e744>
WARC-IP-Address: 178.79.176.202
WARC-Target-URI: http://www.barchick.com/

 % zcat /tmp/test_warc_chunk.gz | tail
/* <![CDATA[ */
var _wpcf7 = {"loaderUrl":"http:\/\/www.barchick.com\/wp-content\/plugins\/contact-form-7\/images\/ajax-loader.gif","sending":"Sending ...","cached":"1"};
/* ]]> */
</script>
<script type='text/javascript' src='http://www.barchick.com/wp-content/plugins/contact-form-7/includes/js/scripts.js?ver=3.3.1'></script>
<script type='text/javascript' src='http://www.barchick.com/wp-includes/js/wp-embed.min.js?ver=4.5.2'></script>

</body>
</html>
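
The same inspection can be done without zcat. Below is a minimal Python sketch, assuming the chunk holds exactly one gzip-compressed WARC record (which is what the offset/length range selects); parse_warc_record is a hypothetical helper name, not a function from any WARC library.

```python
import gzip


def parse_warc_record(chunk: bytes):
    """Decompress one WARC record and split it into version, headers and body."""
    data = gzip.decompress(chunk)

    # The WARC header block ends at the first empty line (CRLF CRLF).
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")

    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length is the size of the record body (for a response record:
    # the HTTP headers plus the payload); the record's trailing CRLFs are cut off.
    body = rest[: int(headers["Content-Length"])]
    return version, headers, body
```

Reading /tmp/test_warc_chunk.gz in binary mode and passing the bytes to parse_warc_record should yield the WARC headers shown by "head" above, with the HTTP response in body.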