Random sampling from Common Crawl data


PythonGuru

Aug 18, 2015, 8:06:23 PM
to Common Crawl
Hello,

I am trying to randomly sample a small number of webpages from randomly selected files in a randomly selected segment of the March 2015 data, either on Amazon EMR or locally (if possible).

Can anyone guide me to a source or tutorial on how to do that? 

I would appreciate it if you could help!

Ivan Habernal

Aug 19, 2015, 2:03:07 AM
to Common Crawl
Hi,

I was recently doing something similar; here are some of the steps:

Download the list of all WARC files:

wget https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-27/warc.paths.gz

gunzip warc.paths.gz

Get 1000 random paths to WARC files (or any number you want):

shuf warc.paths | head -n 1000 > top1kwarc.txt

Download the files (I put them directly onto our local HDFS):

for i in `cat top1kwarc.txt` ; do
  replaced=$(echo $i | tr "/" "_")
  echo $replaced
  wget "https://aws-publicdatasets.s3.amazonaws.com/$i" -O - | hadoop fs -put - /our_local_hdfs_path/$replaced
done

To extract the HTML pages from the warc.gz files, you can use any tool you like (for instance https://github.com/ept/warc-hadoop).
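
If you would rather stay in Python, a minimal sketch using the warcio library (pip install warcio) might look roughly like this (I haven't used it for this myself; the input file name is just an illustration):

from warcio.archiveiterator import ArchiveIterator

# Iterate over one downloaded warc.gz and pull out the HTML response records.
with open('CC-MAIN-20150627031811-00000.warc.gz', 'rb') as stream:  # illustrative name
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        content_type = record.http_headers.get_header('Content-Type') or ''
        if 'text/html' in content_type:
            url = record.rec_headers.get_header('WARC-Target-URI')
            html = record.content_stream().read()
            # ... process url and html here ...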

Hope that helps!

Best,

Ivan

On Wednesday, August 19, 2015 at 2:06:23 AM UTC+2, PythonGuru wrote:

PythonGuru

Aug 19, 2015, 12:40:38 PM
to Common Crawl
Thank you so much, Ivan.

I need to randomize the segment before selecting files randomly. Any idea on how to do that?

Tom Morris

Aug 19, 2015, 12:59:05 PM
to common...@googlegroups.com
On Wed, Aug 19, 2015 at 12:40 PM, PythonGuru <hsun...@gmail.com> wrote:

I need to randomize the segment before selecting files randomly. Any idea on how to do that?

A Python guru would probably do this in Python, but if you want to do it from the shell, just start with the segments list:


and use the result to filter the list of warc files to just those for the chosen segment and do your random draw from that.
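
Or, staying in Python, a rough sketch of the same idea that derives the segment IDs from the warc.paths file mentioned earlier (untested; the sample size and file names are arbitrary):

import random

# Read all WARC paths from the earlier warc.paths download.
with open('warc.paths') as f:
    paths = [line.strip() for line in f if line.strip()]

# The segment ID is the path component right after "segments/".
segments = sorted({p.split('/segments/')[1].split('/')[0] for p in paths})

# Pick one segment at random, then draw random WARC files within it.
segment = random.choice(segments)
in_segment = [p for p in paths if '/segments/%s/' % segment in p]
print('\n'.join(random.sample(in_segment, min(10, len(in_segment)))))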

Tom


PythonGuru

Aug 21, 2015, 2:18:39 PM
to Common Crawl
Thank you, Tom.

From your experience, do you guys recommend that I extract the main text of the webpages from CC data on EMR or locally? 

If you recommend using EMR, can you let me know the steps it takes to randomly sample webpages from a randomly selected segment and files?

Thank you.

Tom Morris

Aug 22, 2015, 12:00:53 AM
to common...@googlegroups.com
On Fri, Aug 21, 2015 at 2:18 PM, PythonGuru <hsun...@gmail.com> wrote:

From your experience, do you guys recommend that I extract the main text of the webpages from CC data on EMR or locally? 

Either works.  It's a tradeoff that depends on your familiarity with EMR vs local processing, the scale of the job, your budget and a bunch of other factors.  I'm not sure anyone can make that decision for you.  If it's just a few files from a single segment and you're not familiar with Hadoop and EMR, local processing is almost certainly easier.  If you want to process an entire crawl, it's almost certainly worthwhile to learn Hadoop/EMR, even if you don't know it already.  In between, well, it depends...

If you recommend using EMR, can you let me know the steps it takes to randomly sample webpages from a randomly selected segment and files?

I'd probably do the random selection as a one-time offline process and create a static list of files that is reproducible input to your processing job.
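
For example, a small sketch of that offline step (the seed, count, and file names are arbitrary; seeding the RNG is just what makes the draw repeatable):

import random

random.seed(2015)  # fixed seed so the same file list comes out on every run

with open('warc.paths') as f:
    paths = [line.strip() for line in f if line.strip()]

# Write a static, reproducible list of 1000 WARC files to feed the processing job.
with open('sampled_warcs.txt', 'w') as out:
    out.write('\n'.join(random.sample(paths, 1000)) + '\n')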

Tom

PythonGuru

Aug 25, 2015, 7:38:10 AM
to Common Crawl
Thank you for sharing your thoughts. For me to select 3000 pages, it might be easier to select from a list of URLs.
Can you recommend an efficient way of getting only the URLs from CC-MAIN-2015-14? Thank you.

Tom Morris

Aug 25, 2015, 11:16:31 AM
to common...@googlegroups.com
On Tue, Aug 25, 2015 at 7:38 AM, PythonGuru <hsun...@gmail.com> wrote:
Thank you for sharing your thoughts. For me to select 3000 pages, it might be easier to select from a list of URLs.
Can you recommend an efficient way of getting only the URLs from CC-MAIN-2015-14? Thank you.

The URL index is certainly the easiest way to get a list of URLs for a crawl.  If you're willing to accept the half million URLs in the meta index as being suitably randomly distributed, you could just use that:

$ aws --no-sign-request s3 cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cluster.idx .
$ wc -l *idx
  549054 cluster.idx
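
A rough sketch of drawing 3000 entries from it; note that the first field of each line is a SURT-style key (host reversed), so the small conversion helper below is only my approximation for the simple cases:

import random

def surt_to_url(key):
    # e.g. "com,example)/some/path" -> "http://example.com/some/path"
    host_part, _, path = key.partition(')')
    return 'http://' + '.'.join(reversed(host_part.split(','))) + path

with open('cluster.idx') as f:
    lines = f.read().splitlines()

# Sample 3000 index lines and print their URLs.
for line in random.sample(lines, 3000):
    print(surt_to_url(line.split(' ', 1)[0]))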

Otherwise you could download the entire 300-file index and process that:

$ aws --no-sign-request s3 cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz .
...
$ aws --no-sign-request s3 cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00299.gz .
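
Each line in those cdx files should end in a JSON blob that includes the original url (plus the WARC filename, offset and length if you later want to fetch the page itself), so a sketch for sampling 3000 URLs from one shard might look like this (shard name and sample size are illustrative):

import gzip
import json
import random

urls = []
with gzip.open('cdx-00000.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # Each line looks like: "<SURT key> <timestamp> <JSON>"
        urls.append(json.loads(line.split(' ', 2)[2])['url'])

print('\n'.join(random.sample(urls, 3000)))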

The announcement describing the index is here (but note that the index location is incorrect in that post):

Tom