Updated getting started with Common Crawl


Matt Horridge

2016-11-10 09:13:55
To: Common Crawl
Hi there

I'm a student new to the Common Crawl data set, and I was wondering whether there are any updated tutorials for beginners.

The few I've found online seem quite out of date. AWS, and specifically EMR, seem to have been updated quite a bit in the last few years, and navigating them to execute a sample job is a bit of a minefield right now.


Any help would be much appreciated

Regards

Sebastian Nagel

2016-11-14 10:18:22
To: common...@googlegroups.com
Hi Matt,

Thanks for your interest in Common Crawl. You're right, the "officially" provided
examples need an overhaul, and we hope to get this done soon. A list of verified
examples is already being compiled.

> and specifically EMR

If you have in mind
https://github.com/commoncrawl/cc-mrjob
I've tested it recently: running it locally works out of the box;
in distributed mode there is one issue, with a workaround available.

Best,
Sebastian