Querying Common Crawl with WARC-Record-ID

Jai Pancholi

Jul 12, 2018, 6:48:48 AM

to Common Crawl

Hello,

I am currently trying to index records of the News crawl whose response contains a certain string. In my index I am storing the WARC-Record-ID for the records which contain this string. With this index, how would I go about querying Common Crawl with the WARC-Record-ID to pull the full record?

Would I need to store extra information? I noticed that in the non-News WARC records the header also carries a WARC-Warcinfo-ID value, which could be used to identify the warc.gz file the record is in, but this is not present in the News crawl.

Thank you.

Jai Pancholi

Jul 12, 2018, 6:52:07 AM

to Common Crawl

Would using something like a byte offset be useful?

Sebastian Nagel

Jul 12, 2018, 7:23:33 AM

to common...@googlegroups.com

Hi Jai,

yes, if you have the file name (S3 path), WARC record offset and length (in bytes)
you can fetch a single WARC record, see also
https://groups.google.com/forum/#!msg/common-crawl/pQ34q-_EARU/FLFtvTfXAwAJ
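A minimal sketch of such a fetch, using only the Python standard library. It assumes the files are reachable over HTTP under a base URL (BASE_URL below is an assumption, not taken from this thread) and that each WARC record is stored as its own gzip member, so the byte range can be decompressed on its own. The function names are hypothetical.

```python
import gzip
import urllib.request

# Assumption: public HTTP endpoint for the data set; adjust to your access method.
BASE_URL = "https://data.commoncrawl.org/"


def range_header(offset, length):
    """Return an HTTP Range value covering `length` bytes starting at `offset`."""
    # The end position of an HTTP byte range is inclusive, hence the -1.
    return "bytes={}-{}".format(offset, offset + length - 1)


def decode_record(raw):
    """Decompress the bytes of one record (assumed to be one gzip member)."""
    return gzip.decompress(raw)


def fetch_record(path, offset, length, base_url=BASE_URL):
    """Fetch and decompress one WARC record given file path, offset and length."""
    request = urllib.request.Request(
        base_url + path, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return decode_record(response.read())
```

Calling fetch_record with the stored S3 path, offset and length should return the raw record (WARC headers, HTTP headers and payload) as bytes.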

Best,
Sebastian

On 07/12/2018 12:52 PM, Jai Pancholi wrote:
> Would using something like a byte offset
> <http://www.automatingosint.com/blog/2015/08/osint-python-common-crawl/> be useful?

Jai Pancholi

Jul 12, 2018, 7:39:46 AM

to Common Crawl

Thanks for the speedy response! How would I get the file name, WARC record offset and length from a single record? I am using the cc-pyspark library, and only the first record per WARC file (record type warcinfo) seems to contain the file name.

The only headers that seem to be available are:

WARC/1.0
WARC-Record-ID: <urn:uuid:67d699d5-1b5a-4766-9762-424b0f1f61b6>
Content-Length: 226358
WARC-Date: 2018-07-05T07:29:29Z
WARC-Type: response
WARC-Target-URI: https://www.libertatea.ro/stiri/declaratia-unica-trebuie-depusa-pana-pe-16-iulie-care-reducerile-sunt-acordate-de-fisc-2318001
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:NN25PNASXCFSJLONXPO4GZ4AAORYJ6LV
WARC-Block-Digest: sha1:W4665EQRODKGSAXTR7LRHU6QDO5NCFXP


Sebastian Nagel

Jul 12, 2018, 8:51:24 AM

to common...@googlegroups.com

Hi Jai,

unfortunately, cc-pyspark needs to be patched to achieve this.
Offset and length are not part of the ArcWarcRecord but are known
only to the ArchiveIterator, see the warcio docs [1,2] and
the attached patch.
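To illustrate where offset and length come from, here is a rough standard-library sketch. It is not the warcio or cc-pyspark API and not the attached patch. It assumes each record is its own gzip member and walks the members of a file held in memory, mapping each WARC-Record-ID to (file name, offset, length). All function names are hypothetical.

```python
import gzip
import re
import zlib

# Matches the WARC-Record-ID header line; the first match is the record's own header.
RECORD_ID = re.compile(rb"^WARC-Record-ID:\s*(\S+)", re.MULTILINE)


def iter_members(data):
    """Yield (offset, length, decompressed bytes) for each gzip member in `data`."""
    offset = 0
    while offset < len(data):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decompressor.decompress(data[offset:])
        # Bytes after the end of this member are left in unused_data.
        length = len(data) - offset - len(decompressor.unused_data)
        yield offset, length, payload
        offset += length


def build_index(data, filename):
    """Map each WARC-Record-ID to (filename, offset, length)."""
    index = {}
    for offset, length, payload in iter_members(data):
        match = RECORD_ID.search(payload)
        if match:
            index[match.group(1).decode("ascii")] = (filename, offset, length)
    return index


def read_record(data, offset, length):
    """Decompress a single record using the stored offset and length."""
    return gzip.decompress(data[offset:offset + length])
```

The slicing copies the remaining data for every member, so this is only suitable for small files; a streaming reader would be needed for full-size WARC files.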

Feel free to open an issue on GitHub for cc-pyspark [3] to improve
this. A good idea for how to extend the interface without breaking
backward compatibility is also welcome.

Thanks,
Sebastian

[1] https://pypi.org/project/warcio/
[2] https://github.com/webrecorder/warcio
[3] https://github.com/commoncrawl/cc-pyspark/issues

sparkcc.patch

Sebastian Nagel

Jul 19, 2019, 4:38:09 AM

to Common Crawl

Hi,

follow up: this has been tracked at

https://github.com/commoncrawl/cc-pyspark/issues/6

A pull request is now open for review; I'll merge it into the master branch soon.

Sebastian
