Updated getting started with Common Crawl

Matt Horridge

Nov 10, 2016, 9:13:55 AM
to Common Crawl
Hi there

I'm a student new to the Common Crawl data set, and I was wondering whether there are any updated tutorials for beginners.

The few I've found online seem quite out of date. AWS, and specifically EMR, seem to have been updated quite a bit in the last few years, and navigating them to execute a sample job is a bit of a minefield right now.


Any help would be much appreciated

Regards

Sebastian Nagel

Nov 14, 2016, 10:18:22 AM
to common...@googlegroups.com
Hi Matt,

thanks for your interest in Common Crawl. You're right, the "officially" provided
examples need an overhaul, and we hope to get this done soon. A list of verified
examples is already being compiled.

> and specifically EMR

If you have in mind
https://github.com/commoncrawl/cc-mrjob
I've tested it recently: running it locally works out of the box;
in distributed mode there is one issue, for which a workaround is available.

Best,
Sebastian