Small Portion of CC data

Aline Bessa

Apr 27, 2015, 4:36:23 PM
to common...@googlegroups.com
Hi all,

I've been googling with no success. Are there any nice samples of Common Crawl's data available? A set of 100,000 or 1,000,000 pages would be interesting, and super useful for my research project (as long as the pages are randomly sampled).

Stephen Merity

Apr 27, 2015, 6:50:36 PM
to common...@googlegroups.com
Hi Aline,

This is an interesting question and one that has come up before. We'd love to provide a solution for this, but we've had trouble in the past defining the requirements of such a dataset.

The main issue is what constitutes a random sample of the web. I'd imagine that a good approximation of a random sample would be obtaining a single page from each domain. There are hundreds of millions of domains, however, so for a sample of 100k or 1 million web pages you wouldn't even get a page per domain. Another question is which page we would want from each domain. Home pages are not likely to make a great sample, as they're all quite formulaic in structure and frequently contain little information beyond snippets linking to other, more detailed pages. If not home pages, how do we decide which page to take from each domain? Deciding on the size of the sample is yet another question.

The core issue is making the random sample support the widest possible set of use cases. If anyone has any insights on that, I'd be interested to hear them! :)

As a first approximation, a single segment from a crawl archive generally contains a few million web pages. Processing one segment to create a sample of documents should be relatively easy, and may fit well depending on the task.
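
If it's useful, here is a minimal sketch of that approach using the warcio library; the segment path and the sample size are placeholders you'd replace with a real WARC file from one crawl segment. It streams the segment and keeps a uniform random sample of response records via reservoir sampling.

    import random
    from warcio.archiveiterator import ArchiveIterator

    SEGMENT_WARC = "segment.warc.gz"  # placeholder: one WARC file from a crawl segment
    SAMPLE_SIZE = 1000                # placeholder: how many pages to keep

    def sample_segment(path, k):
        """Reservoir-sample k (url, body) pairs from a WARC file."""
        sample, seen = [], 0
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                seen += 1
                if len(sample) < k:
                    sample.append((url, body))
                else:
                    j = random.randrange(seen)
                    if j < k:
                        sample[j] = (url, body)
        return sample

    pages = sample_segment(SEGMENT_WARC, SAMPLE_SIZE)
    print(len(pages), "pages sampled")

Running it over several segments and merging the results would get you closer to a crawl-wide sample, at the cost of more I/O.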


--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Tom Morris

Apr 28, 2015, 2:48:34 PM
to common...@googlegroups.com
On Mon, Apr 27, 2015 at 6:50 PM, Stephen Merity <ste...@commoncrawl.org> wrote:

> The main issue is what constitutes a random sample of the web. I'd imagine that a good approximation of a random sample would be obtaining a single page from each domain. There are hundreds of millions of domains, however, so for a sample of 100k or 1 million web pages you wouldn't even get a page per domain. Another question is which page we would want from each domain. Home pages are not likely to make a great sample, as they're all quite formulaic in structure and frequently contain little information beyond snippets linking to other, more detailed pages. If not home pages, how do we decide which page to take from each domain? Deciding on the size of the sample is yet another question.
>
> The core issue is making the random sample support the widest possible set of use cases. If anyone has any insights on that, I'd be interested to hear them! :)

One of the things that crossed my mind recently is that it might be nice to have a CC-based sample that approximates the methodology used to construct the ClueWeb12 corpus (and an open-source pipeline to create it from any given CC crawl, so that it can be easily replicated). It would be impossible to replicate exactly, since the crawling is done with a different seed URL list and a different methodology, but a few things could be done to bring them into closer alignment: 1) remove duplicates, 2) remove multimedia files, and 3) filter against URLBlacklist.com. The ClueWeb12 corpus also filtered out non-English and "adult" documents, which I have mixed feelings about. Perhaps the CC corpus could segregate them and let people make their own choices about what to include.
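
A rough sketch of those three filtering steps in Python follows. One caveat: the blocklist loader below assumes a flat file with one domain per line, whereas the actual URLBlacklist download is organized into per-category directories, so real code would need to walk those and pick the categories it cares about.

    import hashlib
    from urllib.parse import urlparse

    def load_blocklist(path):
        """Load a blocklist, assumed here to be one domain per line."""
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def keep_page(url, content_type, body, blocklist, seen_hashes):
        """Return True if the page survives all three filters."""
        # 1) drop exact duplicates via a content hash
        digest = hashlib.sha1(body).hexdigest()
        if digest in seen_hashes:
            return False
        seen_hashes.add(digest)
        # 2) drop multimedia / non-HTML responses
        if content_type and not content_type.lower().startswith("text/html"):
            return False
        # 3) drop pages whose host (or any parent domain) is blocklisted
        host = (urlparse(url).hostname or "").lower()
        parts = host.split(".")
        if any(".".join(parts[i:]) in blocklist for i in range(len(parts))):
            return False
        return True

With warcio, the content type would come from record.http_headers.get_header('Content-Type'), and the hash set would need to be shared (or replaced by something like a MinHash/simhash index) if the filtering were distributed across machines.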

In terms of random sampling, ClueWeb12 used a super-simple fixed ~7% sample, taking every 14th page. A single page per domain doesn't seem like it would be all that useful, but if you were to do it, I'm not sure you could improve on the home page. As sparse and formulaic as home pages may be, they should, by rights, convey the main purpose and message of a site. Given that we're a couple of years down the road, perhaps doubling or quadrupling the size of their small sample would be a good target; that would be in the 100-200M page range.
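
For what it's worth, both sampling schemes are only a few lines each. This is just a sketch over an iterable of (url, page) pairs, not tied to any particular pipeline:

    from urllib.parse import urlparse

    def every_nth(records, n=14):
        """ClueWeb12-style systematic sample: keep every n-th page (~7% for n=14)."""
        for i, rec in enumerate(records):
            if i % n == 0:
                yield rec

    def one_page_per_domain(records):
        """Keep one page per host, preferring the home page when it shows up."""
        chosen = {}  # host -> ((url, page), is_home)
        for url, page in records:
            parsed = urlparse(url)
            host = parsed.hostname
            if host is None:
                continue
            is_home = parsed.path in ("", "/")
            if host not in chosen or (is_home and not chosen[host][1]):
                chosen[host] = ((url, page), is_home)
        return [rec for rec, _ in chosen.values()]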

Tom