wget https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-27/warc.paths.gz
gunzip warc.paths.gz
Get 1000 random paths (or however many you want) to WARC files:
shuf warc.paths | head -n 1000 > top1kwarc.txt
Download the files (I was putting them directly on our local HDFS):

for i in `cat top1kwarc.txt`; do
  replaced=$(echo $i | tr "/" "_")
  echo $replaced
  wget "https://aws-publicdatasets.s3.amazonaws.com/$i" -O - | hadoop fs -put - /our_local_hdfs_path/$replaced
done
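Note that wget -O - streams each file to stdout and hadoop fs -put - reads from stdin, so the WARC files go straight into HDFS without touching local disk.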
I need to randomize the segments before selecting files randomly. Any ideas on how to do that?
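One way to do it (a sketch on my part, assuming the segment ID is the fifth "/"-separated field of each warc.paths line, i.e. common-crawl/crawl-data/CC-MAIN-2015-27/segments/<segment>/warc/<file>): draw random segments first, then sample files only from those segments:

# Draw 10 random segment IDs from the path list
cut -d/ -f5 warc.paths | sort -u | shuf -n 10 > segments.txt
# Keep only paths belonging to the drawn segments, then sample 1000 files
grep -F -f segments.txt warc.paths | shuf -n 1000 > top1kwarc.txt

shuf -n draws without replacement, and grep -F treats each segment ID as a literal string, so the dot in IDs like 1435375091587.3 is not interpreted as a regex wildcard.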
In your experience, do you guys recommend extracting the main text of the webpages from CC data on EMR or locally?
If you recommend using EMR, can you let me know the steps to randomly sample webpages from randomly selected segments and files?
Thank you for sharing your thoughts. Since I only need to select 3000 pages, it might be easier to select from a list of URLs. Can you recommend an efficient way of getting just the URLs from CC-MAIN-2015-14? Thank you.
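If a URL index was built for CC-MAIN-2015-14, the Common Crawl CDX index is one option (my suggestion, not something confirmed above): each crawl's index is sharded into files under cc-index/collections/<CRAWL>/indexes/, and every line ends in a JSON blob whose "url" field is the original URL. A rough sketch, with the shard name and exact path layout as assumptions:

# Grab one index shard (path layout is an assumption)
wget https://aws-publicdatasets.s3.amazonaws.com/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz
# Extract the "url" value from each line's JSON part and sample 3000 at random
zcat cdx-00000.gz | grep -oE '"url": ?"[^"]*"' | cut -d'"' -f4 | shuf -n 3000 > urls.txt

Since the index shards are sorted by SURT key, a single shard covers one alphabetical slice of domains; drawing a few shards at random before sampling spreads the 3000 URLs across the whole crawl.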