How to build a subset of the Common Crawl dataset?


Chillar Anand

Nov 1, 2018, 11:09:29 AM
to Common Crawl
I am trying to build a dataset of crawls of the top 1 million domains from CC data to analyze trends in technologies.

I am trying to achieve this without downloading the entire index or the entire dataset. With the cc-index-server repo, I can run a mirror locally and use it to query and fetch the required pages.
However, this adds the overhead of running a WSGI server. Is there a simple way to download the required URLs from the cluster index file itself?

Is there a better way to build a subset of the CC dataset?

Sebastian Nagel

Nov 5, 2018, 5:29:15 AM
to common...@googlegroups.com
Hi,

there has been a similar question recently in this group about extracting job pages:
https://groups.google.com/forum/#!topic/common-crawl/EBYaos2Yk1M

In general, it's possible to fetch single WARC records using the WARC filename,
record offset, and length given in the URL index; see:
https://groups.google.com/forum/#!msg/common-crawl/pQ34q-_EARU/FLFtvTfXAwAJ

You may use Greg's https://github.com/cocrawler/cdx_toolkit to download the records.
Alternatively, the project https://github.com/commoncrawl/cc-index-table
contains code to fetch WARC records using Spark.
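
For illustration, a minimal sketch of fetching a single record with an HTTP range
request (the filename, offset, and length below are placeholders; in practice they
come from the URL index, and the public https://data.commoncrawl.org/ endpoint is assumed):

```python
import gzip
import requests

# Placeholders -- in practice these come from a URL-index lookup
# (columns: warc_filename, warc_record_offset, warc_record_length).
warc_filename = "crawl-data/CC-MAIN-2018-43/segments/<segment>/warc/<file>.warc.gz"
offset, length = 1234567, 54321

# Each WARC record is stored as its own gzip member, so a byte-range
# request returns a self-contained compressed chunk.
resp = requests.get(
    "https://data.commoncrawl.org/" + warc_filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
resp.raise_for_status()

record = gzip.decompress(resp.content)  # WARC headers + HTTP headers + payload
print(record.decode("utf-8", errors="replace")[:500])
```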

> build a dataset of crawls of the top 1 million domains

If you want all pages from the top 1 million domains, it's probably faster to
process the entire archives. About 30 million domains are covered per
monthly crawl. The number of page captures per domain follows a power-law
distribution: the top domains (e.g. wordpress.com) have millions of pages,
while in the long tail there are only a few pages per domain.

The top 1 million domains with the most captures already account for 70%
of a monthly crawl.

Of course, it largely depends on what you define as "top" and "domain", and
whether all pages of those domains are required.

Best,
Sebastian

Chillar Anand

Nov 17, 2022, 11:24:06 PM
to Common Crawl
I occasionally fetch a small subset (~0.5 to 5%) of CC data for my personal use.

I want this step to be a single command, and I want to do it without using
big data processing tools (e.g. Presto (Athena), Spark) or incurring heavy costs.

Here is a single command to scan the entire CC index and fetch the required subset.

[screenshot of the single DuckDB command]
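
A rough sketch of such a command, using DuckDB's Python API against the columnar
index on S3 (the crawl id, output file name, and use of the httpfs extension are
assumptions here):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# The commoncrawl bucket is public and lives in us-east-1;
# anonymous read access is assumed (no credentials set).
con.execute("SET s3_region='us-east-1'")

# Scan one monthly crawl of the columnar index and write the matching
# rows to a local Parquet file.
con.execute("""
    COPY (
        SELECT *
        FROM parquet_scan(
            's3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/*.parquet')
        WHERE content_languages ILIKE '%tel%'
    ) TO 'cc-index-subset.parquet' (FORMAT PARQUET)
""")
```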
I already have a few servers running in the us-east-1 region. I run this command on those
servers to create a subset of the index. It can be run on a laptop as well, but it will be much slower.

This setup has worked well for me. I wrote a detailed blog post about it as well.


Best,
Chillar Anand

Greg Lindahl

Nov 18, 2022, 4:52:26 PM
to common...@googlegroups.com
Chillar,

Thanks for posting this duckdb example; I had been wanting to try out
duckdb with CC's columnar index!

I maintain the cdx_toolkit python client, which allows you to use
supersets of CC crawls as if they were a single index, similar to how
the Internet Archive works.
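
A minimal usage sketch (the URL pattern and limit are just for illustration):

```python
import cdx_toolkit

# Query the Common Crawl indexes as if they were one big index.
cdx = cdx_toolkit.CDXFetcher(source='cc')

# Field names follow the CDX JSON output (url, status, timestamp, ...).
for obj in cdx.iter('commoncrawl.org/*', limit=5):
    print(obj['url'], obj['status'], obj['timestamp'])
```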

I have a nearly-finished Athena branch for the columnar index, and
along the way I was curious how fast duckdb (free but rate-limited)
is compared to Amazon Athena (which charges small amounts of $
for metadata operations).

-- greg

https://github.com/cocrawler/cdx_toolkit


Greg Lindahl

Nov 18, 2022, 5:07:41 PM
to common...@googlegroups.com

On Fri, Nov 18, 2022 at 01:52:21PM -0800, Greg Lindahl wrote:

> along the way I was curious how fast duckdb (free but rate limited)
> is compared to Amazon Athena (which charges small amounts of $
> for metadata operations.)

Oh, and I see you've given numbers: it's an hour to iterate over an entire
single crawl. Your query

```
where content_languages ilike '%tel%'
```

isn't indexed, so presumably it's a full read of that one column.

Chillar Anand

Nov 18, 2022, 10:58:16 PM
to Common Crawl
> it's an hour to iterate over an entire single crawl.

Yes. I wanted to use only free resources for testing, so I was using a t2.micro instance.
With larger instances, it will be much faster.

> so presumably it's a full read of that one column.

Yes. It is a full read of the column.

Sebastian Nagel

Nov 20, 2022, 4:36:23 PM
to common...@googlegroups.com
Hi,

thanks for the really nice example! I'll add it to our list of examples.

DuckDB is a great tool; I have used it a couple of times to inspect and query
Parquet files on local disk. It's good to know that it can also be used
for data on S3.
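
For example, inspecting a locally downloaded part file could look like this
(a sketch using the Python API; the file name is a placeholder):

```python
import duckdb

# How often does each content_languages combination occur in this part file?
con = duckdb.connect()
rows = con.execute("""
    SELECT content_languages, count(*) AS n
    FROM 'part-00000.parquet'
    GROUP BY content_languages
    ORDER BY n DESC
    LIMIT 20
""").fetchall()
for langs, n in rows:
    print(n, langs)
```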


> > so presumably it's a full read of that one column.
>
> Yes. It is a full read of the column.

That's not necessarily the case. The column `content_languages`
should be backed by a dictionary in almost all row groups.
The Parquet writer falls back from dictionary to plain encoding
only if the number of unique combinations of content languages
gets too high to be stored efficiently in a dictionary.
If there is a dictionary, only the dictionary needs to be read to
determine whether a row group contains any rows with Telugu content;
if it doesn't, the row group is skipped entirely.
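
To check this, the column-chunk metadata can be inspected, for example with
pyarrow on a locally downloaded part file (a sketch; the file name is a placeholder):

```python
import pyarrow.parquet as pq

# Print the encodings used by the content_languages column chunk in each
# row group. The cc-index table schema is flat, so the Arrow field index
# matches the Parquet leaf-column index.
pf = pq.ParquetFile("part-00000.parquet")  # placeholder file name
col_idx = pf.schema_arrow.get_field_index("content_languages")

for rg in range(pf.metadata.num_row_groups):
    chunk = pf.metadata.row_group(rg).column(col_idx)
    print(rg, chunk.encodings, chunk.total_compressed_size)
```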

To speed up the query, I would try to avoid extracting all columns
into the results. Instead of

select * from ...

list only the columns needed for further processing steps, e.g.

select url, content_languages,
       warc_filename, warc_record_offset, warc_record_length
from ...

If possible, avoid columns which make up a significant part of the
storage; see the numbers in
http://data.commoncrawl.org/cc-index/table/cc-main/index.html


Best,
Sebastian

Greg Lindahl

Nov 20, 2022, 6:57:21 PM
to common...@googlegroups.com
On Sun, Nov 20, 2022 at 10:36:18PM +0100, Sebastian Nagel wrote:

> If possible avoid columns which make a significant part of the
> storage, see the numbers in
> http://data.commoncrawl.org/cc-index/table/cc-main/index.html

Oh, that's a great table! It seems like the typical use case could leave
off half of the compressed size... you usually only need one of {url,
url_surtkey} and very rarely need content_digest.

-- greg

Chillar Anand

Nov 21, 2022, 12:18:38 AM
to Common Crawl
> That's not necessarily the case.

Thanks for the detailed explanation. I was not aware of that.

> only list the columns needed for further processing steps

I had read about that in your blog posts but forgot about it, since the data extraction finishes within an hour anyway.

With all columns, extracting the data takes about an hour.
With only 5 columns, it takes about 20 minutes: a ~65% improvement in performance.

I have updated the article as well. Thanks again.