Errors When Running Example Code

Marianne Fletcher

Aug 4, 2017, 12:30:24 AM

to Common Crawl

I am almost completely new to AWS and Common Crawl and I'm having a tonne of difficulty getting the sample code to work, even on the provided AMI which I got here:

https://aws.amazon.com/amis/common-crawl-quick-start/

This is the point the example gets up to and the error message for ExampleMetadataDomainPageCount:

17/08/04 04:17:40 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost/user/ec2-user/.staging/job_201708040348_0005

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/common-crawl%2Fparse-output%2Fsegment%2F1341690166822%2Fmetadata-01849' - ResponseCode=403, ResponseMessage=Forbidden

I believe it is a problem with my credentials since I am forbidden access. I am using the Access Keys from the AWS Security Credentials panel but there are a lot of different credentials that I've needed (for SSH, CodeCommit access, etc.) and I could be using the wrong ones. The Access Keys seem to be the only ones with the right format for the .awssecret file as specified in the readme, though.

I'm really hoping that someone with some experience will know what the problem is. I am making no progress by myself.

Thank you!

Sebastian Nagel

Aug 4, 2017, 4:16:44 AM

to common...@googlegroups.com

Hi Marianne,

The AMI dates to 2012 and I would be surprised if it worked without any changes. Sorry about that.
There are still many old and partially outdated examples out there which either need maintenance
or should be deprecated.

The AMI is intended for processing the 2012 data set [1]. It cannot be used to process newer data
because the data format has changed from ARC to WARC. We release the data of the main crawl
monthly, at present about 3 billion pages per month. In case you're interested in newer
data, have a look at the examples listed at
http://commoncrawl.org/the-data/examples/ (newer examples on top)

We actively maintain two example libraries for processing the data from 2013 to the present:
https://github.com/commoncrawl/cc-mrjob (Python, mrjob)
https://github.com/commoncrawl/cc-pyspark (Python, Spark)

In case you want to continue with the AMI: as far as I can see, the AMI still points to the old
public data set location, which has been changed [2] from
s3://aws-publicdatasets/common-crawl/
to
s3://commoncrawl/
That needs to be fixed first. But it may be that there are further problems.
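
A minimal sketch of that location change in Python, for illustration only. It assumes the key in the error message above lives in the old aws-publicdatasets bucket, and the helper names (rewrite_location, key_from_error) are hypothetical, not part of the AMI or any Common Crawl library. It only rewrites the path string; it does not check that the object exists at the new location.

```python
from urllib.parse import unquote

# Old and new public data set locations, as given in this thread.
OLD_PREFIX = "aws-publicdatasets/common-crawl/"
NEW_PREFIX = "commoncrawl/"


def rewrite_location(uri: str) -> str:
    """Map a URI under the old location to the new bucket.

    The scheme (s3, s3n, ...) is kept as-is; URIs outside the old
    location are returned unchanged.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI with a scheme: {uri!r}")
    if rest.startswith(OLD_PREFIX):
        rest = NEW_PREFIX + rest[len(OLD_PREFIX):]
    return f"{scheme}://{rest}"


def key_from_error(encoded_key: str) -> str:
    """Decode the percent-encoded object key shown in the S3 error message."""
    return unquote(encoded_key).lstrip("/")


# Key copied from the 403 error message earlier in this thread.
encoded = "/common-crawl%2Fparse-output%2Fsegment%2F1341690166822%2Fmetadata-01849"
key = key_from_error(encoded)

# Assumption: the failing request targeted the old aws-publicdatasets bucket.
old_uri = "s3://aws-publicdatasets/" + key
new_uri = rewrite_location(old_uri)
print(new_uri)
```

The same substitution would have to be applied wherever the example code or its configuration hard-codes the old location.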

If you have a specific use case in mind, it may be a good idea to ask other users in this group
how they would approach it.

Thanks and best,
Sebastian

[1] http://commoncrawl.org/2012/07/2012-crawl-data-now-available/
[2] https://groups.google.com/d/topic/common-crawl/nKuQK68rebo/discussion

Marianne Fletcher

Aug 4, 2017, 6:37:15 AM

to Common Crawl

Thank you for letting me know.

I'll have a look at the other examples that you mentioned.

The project is a university assignment in which we learn how to combine different AWS services to process a large dataset, so it should be quite simple once I learn how to connect them. But I am completely new to these services.

Sebastian Nagel

Aug 4, 2017, 6:55:01 AM

to common...@googlegroups.com

Hi Marianne,

> we learn how to combine different AWS services
> and process a large dataset

The "classical" way would be to use S3 + EC2 + EMR.
Of course, there are many other possible ways, e.g.,
replacing EC2 + EMR with one of the serverless services.

Good luck and feel free to ask for further advice!

Best,
Sebastian
