Obtaining WARC files from TLDs


Rens Oliemans

Jan 27, 2021, 3:47:26 PM
to Common Crawl
Hi everyone! I have been trying to download WARC files for a lot of web pages from a specific TLD, but am not sure how to proceed.
Let's take .fr in crawl 2020-50 as an example; my approach was as follows:
grep '^fr' cluster-2020-50.idx | cut -f 2 | uniq
> cdx-00190.gz, cdx-00191.gz, cdx-00192.gz, cdx-00193.gz, cdx-00194.gz

However, when I take a look at these .gz files, I see a lot of entries with the following structure:
"urlinfo, mime, url, length, offset, ..., filename"
where filename is something like "crawl-data/CC-MAIN-2020-50/segments/xxx/warc/CC-MAIN-2020...-...warc.gz"
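
If I understand correctly, each line in those shards is a SURT key, a timestamp and a JSON blob,
so I can get the WARC file name, offset and length per capture with something like this
(field values made up):

    import json

    line = ('fr,example)/ 20201201000000 '
            '{"url": "https://example.fr/", "mime": "text/html", "status": "200", '
            '"length": "1234", "offset": "5678", '
            '"filename": "crawl-data/CC-MAIN-2020-50/segments/.../warc/CC-MAIN-...warc.gz"}')

    # Split off the SURT key and timestamp, then parse the JSON part.
    surt_key, timestamp, fields_json = line.split(' ', 2)
    fields = json.loads(fields_json)
    warc_path = fields['filename']
    offset, length = int(fields['offset']), int(fields['length'])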

But I want to get a lot of WARC records for a specific TLD, so it seems really inefficient to ask the S3 server to get me (a part of) a .warc.gz file for every single record.
Rather, I'd like to get a huge .warc.gz file and process it locally, filtering out the irrelevant pages. How would I go about doing this?


Sebastian Nagel

Jan 28, 2021, 5:34:39 AM
to common...@googlegroups.com
Hi Rens,

> it seems really inefficient to ask the s3 server to get me (a part of) a .warc.gz file

No, sending range requests to S3 is actually quite efficient.

Just in case you've missed it: using the WARC file name, record offset and length,
you can send a range request to pick only the records selected from the index. See:
https://groups.google.com/g/common-crawl/c/iZVW5ai9jQI/m/9RKQll_lAQAJ
https://groups.google.com/g/common-crawl/c/7nVhY9D1qa0/m/QMAUukYsCAAJ
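
For example, a minimal sketch in Python, assuming the filename, offset and length come
from a CDX index entry; the HTTPS prefix below is one way to reach the data, adjust it
if you read from the S3 bucket directly:

    import io

    import requests                                      # pip install requests
    from warcio.archiveiterator import ArchiveIterator   # pip install warcio

    CC_PREFIX = 'https://commoncrawl.s3.amazonaws.com/'

    def fetch_record(filename, offset, length):
        # One range request per record: the response body is a complete,
        # individually gzipped WARC record which warcio can read as-is.
        rng = 'bytes={}-{}'.format(offset, offset + length - 1)
        resp = requests.get(CC_PREFIX + filename, headers={'Range': rng})
        resp.raise_for_status()
        for record in ArchiveIterator(io.BytesIO(resp.content)):
            return record.content_stream().read()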


> Rather, I'd like to get a huge .warc.gz file and process it locally, filtering the irrelevant pages.
> How would I go about doing this?

Use Greg's cdx-toolkit (https://pypi.org/project/cdx-toolkit/) to query the CDX index and write
all matched records into a WARC file - see the command "warc".
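
A minimal sketch of querying the index with the Python API (the host pattern is just an
example; check the cdx-toolkit README for the exact options of your version):

    import cdx_toolkit                     # pip install cdx-toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    # Iterate over Common Crawl index entries matching the URL pattern.
    for obj in cdx.iter('example.fr/*', limit=10):
        print(obj['timestamp'], obj['status'], obj['url'])

The command-line equivalent of the "warc" command should look roughly like
cdxt --cc warc 'example.fr/*' (see cdxt --help for the exact options).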

You might also have a look at the columnar index, which lets you filter out irrelevant pages already in the query:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
If so, there's a Spark job which extracts records via the columnar index and stores them in WARC files:
https://github.com/commoncrawl/cc-index-table/#export-subsets-of-the-common-crawl-archives
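
A minimal PySpark sketch of the query part, with the table path and column names as
documented for the columnar index (the output path is just a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('cc-index-fr').getOrCreate()

    # Outside EMR you typically need the s3a:// scheme plus Hadoop's S3 support.
    df = spark.read.parquet('s3://commoncrawl/cc-index/table/cc-main/warc/')
    df.createOrReplaceTempView('ccindex')

    # Select the WARC file name, offset and length of every .fr capture.
    fr_pages = spark.sql("""
        SELECT url, warc_filename, warc_record_offset, warc_record_length
        FROM ccindex
        WHERE crawl = 'CC-MAIN-2020-50'
          AND subset = 'warc'
          AND url_host_tld = 'fr'
    """)
    fr_pages.write.parquet('fr-2020-50-index/')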


Best,
Sebastian


Rens Oliemans

Feb 11, 2021, 6:15:48 PM
to Common Crawl
Hi Sebastian,

Thanks for the helpful reply! We're doing range requests now, and they are indeed more efficient than my intuition suggested, nice :)
We looked into cdx-toolkit at the beginning, but since we're doing some parts in Spark it came with some disadvantages. Right now we're doing it more or less manually, which isn't too complex really.
We're now well on our way. Thanks for Common Crawl, it's awesome.

Best,
Rens

Sebastian Nagel

Feb 12, 2021, 4:49:59 AM
to common...@googlegroups.com
Hi Rens,

> We're doing range requests now, and they are indeed more efficient than my intuition suggested, nice :)

Thanks for letting me know. Indeed, the performance of the S3 servers when responding to many small requests is amazing.

> we're doing some parts in Spark

I hope you didn't miss that you can send the range requests directly from Spark. Examples are available for Python/PySpark [1,2] and Java [3].
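
Just as a rough illustration (cc-pyspark structures this differently, see [1,2]): assuming
you stored the rows selected via the columnar index, you can fetch the corresponding
records inside a Spark job along these lines:

    import io

    import requests
    from pyspark.sql import SparkSession
    from warcio.archiveiterator import ArchiveIterator

    CC_PREFIX = 'https://commoncrawl.s3.amazonaws.com/'

    def fetch_partition(rows):
        # Fetch each selected WARC record with a range request and yield its payload.
        with requests.Session() as session:
            for row in rows:
                rng = 'bytes={}-{}'.format(
                    row.warc_record_offset,
                    row.warc_record_offset + row.warc_record_length - 1)
                resp = session.get(CC_PREFIX + row.warc_filename,
                                   headers={'Range': rng})
                resp.raise_for_status()
                for record in ArchiveIterator(io.BytesIO(resp.content)):
                    yield (row.url, record.content_stream().read())

    spark = SparkSession.builder.appName('fetch-fr-records').getOrCreate()
    # 'fr-2020-50-index/' is the placeholder output of the query sketched in my previous mail.
    selected = spark.read.parquet('fr-2020-50-index/')
    records = selected.rdd.mapPartitions(fetch_partition)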

Of course, if you have a complex processing pipeline afterwards (or even multiple ones) that needs a lot of tweaking or iterations,
it can be better to store the extracted subset first. Then it doesn't matter which tool you use to extract it.

Best,
Sebastian

[1] https://github.com/commoncrawl/cc-pyspark/blob/9b1fe7955caa9bea99d36829f75a5fcbd71f0d9a/sparkcc.py#L375
[2] https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
[3] https://github.com/commoncrawl/cc-index-table/#export-subsets-of-the-common-crawl-archives


