Hi Marianne,
The AMI dates to 2012, and I would be surprised if it works without any changes. Sorry for that.
There are still many old and partially outdated examples out there which need maintenance or
should otherwise be deprecated.
The AMI is intended for processing the 2012 data set [1]. It cannot be used to process newer data
because the data format has changed from ARC to WARC. We release the data of the main crawl
monthly, at present about 3 billion pages per month. In case you're interested in newer
data, have a look at the examples listed in
http://commoncrawl.org/the-data/examples/ (newer examples on top)
We actively maintain two example libraries to process the data from 2013 to the present:
https://github.com/commoncrawl/cc-mrjob (Python, mrjob)
https://github.com/commoncrawl/cc-pyspark (Python, Spark)
In case you want to continue with the AMI: afaics, the AMI still holds the old public data set
location which has been changed [2] from
s3://aws-publicdatasets/common-crawl/
to
s3://commoncrawl/
That needs to be fixed first. But it may be that there are further problems.
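To illustrate the path fix, here is a minimal sketch in Python of rewriting the old location to the new one. The function name migrate_path and the example key are hypothetical and only for illustration; I have not checked where exactly the AMI stores its paths, so this is not a tested patch.

```python
# Old and new bucket locations, as given above.
OLD_PREFIX = "s3://aws-publicdatasets/common-crawl/"
NEW_PREFIX = "s3://commoncrawl/"


def migrate_path(path: str) -> str:
    """Return `path` with the old bucket prefix replaced by the new one.

    Paths that do not start with the old prefix are returned unchanged.
    (Hypothetical helper, not part of the AMI or of any Common Crawl library.)
    """
    if path.startswith(OLD_PREFIX):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    return path


if __name__ == "__main__":
    # The key below is made up for the example.
    print(migrate_path("s3://aws-publicdatasets/common-crawl/example/key"))
```

The same replacement would need to be applied wherever the AMI's scripts or configuration files refer to the old bucket.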
If you have a specific use case in mind, it may be a good idea to ask other users in this group
how they would approach it.
Thanks and best,
Sebastian
[1]
http://commoncrawl.org/2012/07/2012-crawl-data-now-available/
[2]
https://groups.google.com/d/topic/common-crawl/nKuQK68rebo/discussion