I'm a student new to the Common Crawl data set and I was wondering if there are any up-to-date tutorials for beginners.
The few I've found online seem quite out of date. AWS, and specifically EMR, seem to have been updated quite a bit in the last few years, and navigating them to execute a sample job is a bit of a minefield right now.
Any help would be much appreciated.
Regards
Sebastian Nagel
Nov 14, 2016, 10:18:22
to common...@googlegroups.com
Hi Matt,
thanks for your interest in Common Crawl. You're right, the "officially" provided
examples need an overhaul and we hope to get this done soon. A list of verified
examples is already being compiled.
> and specifically EMR
If you have
https://github.com/commoncrawl/cc-mrjob in mind: I've tested it recently. Running it locally works out of the box;
in distributed mode there is one issue, for which a work-around is available.
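
For orientation, a sample job of this kind boils down to a map step and a reduce step. The following is only a minimal sketch in plain Python, not the code from cc-mrjob: the function names (`map_tags`, `reduce_counts`, `run_local`) and the sample HTML strings are made up for illustration, and a real job would read page bodies from WARC records instead of an in-memory list.

```python
import re
from collections import Counter

# Hypothetical sample input; a real job would read page bodies from WARC records.
SAMPLE_PAGES = [
    "<html><body><p>Hello</p><p>World</p></body></html>",
    "<html><body><a href='#'>link</a></body></html>",
]

# Matches opening tags only, e.g. "<p>" or "<a href=...>"; closing tags are skipped.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def map_tags(page):
    """Map step: emit (tag, 1) for every opening tag found in one page."""
    for match in TAG_RE.finditer(page):
        yield match.group(1).lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per tag."""
    totals = Counter()
    for tag, count in pairs:
        totals[tag] += count
    return dict(totals)


def run_local(pages):
    """Run both steps in a single process, with no cluster involved."""
    pairs = (pair for page in pages for pair in map_tags(page))
    return reduce_counts(pairs)


if __name__ == "__main__":
    for tag, n in sorted(run_local(SAMPLE_PAGES).items()):
        print(tag, n)
```

Getting something like this to run on a handful of local inputs first keeps the logic of the job separate from any cluster configuration problems.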