Accessing S3 data on AWS EMR


Peter Schulam

Apr 22, 2022, 2:52:47 PM
to Common Crawl
Hi everyone,

First, thanks to the CC team for making such great data available!

I’m having some trouble accessing the data through S3 when running Spark jobs on AWS EMR. I’ve read about the recent changes to access (the move to CloudFront and the end of unsigned S3 requests), but I’m still getting access-denied exceptions in my code.

I believe both my head node and core nodes have the correct permissions. They are using the EMR_EC2_DefaultRole, which allows all S3 operations on all resources. I can also confirm access to the CC data by other means on these machines:

* If I run `aws s3 ls s3://commoncrawl/crawl-data/` on the head node, it works as expected.
* If I run `hdfs dfs -ls s3://commoncrawl/crawl-data/` it also works as expected.

I’ve been accessing this data through S3 via Spark for a while now and never had any problems. I’m sure it’s related to the recent changes at the beginning of the month, but I don’t see where I’ve gone wrong.

Has anyone had similar issues? Or have suggestions on what I might check to debug?

Thank you!

Peter

Sebastian Nagel

Apr 22, 2022, 4:53:29 PM
to common...@googlegroups.com
Hi Peter,

> * If I run `aws s3 ls s3://commoncrawl/crawl-data/` on the head node

What happens if you run it on one of the core or task nodes?

> I’m sure it’s related to the recent changes at
> the beginning of the month, but I don’t see where I’ve gone wrong.

Assuming the source code is in Git, you could grep for the following
keywords:
- if using Python and boto3:
git grep -F UNSIGNED
- if using Java or Scala (JVM):
git grep -F AnonymousAWSCredentialsProvider
- if using the AWS CLI:
git grep -F -e --no-sign-request

These keywords could point to unauthenticated S3 access, which is
now disabled.

Otherwise, could you share some context?
- programming language (Scala, Java, Python, R)
- the exception stack trace
- a code snippet, if possible

Thanks!

Best,
Sebastian

Peter Schulam

Apr 22, 2022, 6:42:19 PM
to Common Crawl
Thanks so much for your help, Sebastian!

I was in a pyspark shell to collect a stack trace to report here, and it ended up working. I have no idea what changed; perhaps it was something temporary on the AWS side.

In any case, I'm including answers to your questions below in case it's helpful to anyone in the future.


> What happens if you run it on one of the core or task nodes?

Great suggestion, thanks. I don't have any issues accessing the data from the core/task nodes through the AWS CLI.

> Assuming the source code is in Git, you could grep for the following
> keywords:

I don't have a code base at this point; I had just fired up a `pyspark` shell to run a quick analysis and was unable to read in an index file.

However, I did dump the pyspark configuration with:

```
# Dump the effective Spark configuration to a file for inspection.
conf = sc.getConf().getAll()
with open("conf.txt", "w") as stream:
    for k, v in sorted(conf):
        stream.write(f"{k}={v}\n")
```

And I don't see the AnonymousAWSCredentialsProvider class listed in any of the spark.hadoop.fs.s3* settings, so I'm fairly sure I'm sending authenticated requests. I also double-checked by trying to access data in private buckets, and that succeeds.
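In case it's useful to anyone checking their own dump, a small helper along these lines (hypothetical, not from my actual session) could flag settings that force anonymous S3 access:

```python
def find_anonymous_s3_settings(conf):
    """Scan (key, value) pairs, e.g. from sc.getConf().getAll(),
    for S3/S3A settings that select the anonymous credentials
    provider, which Common Crawl no longer accepts."""
    suspects = []
    for key, value in conf:
        if key.startswith(("spark.hadoop.fs.s3", "fs.s3")) and \
                "AnonymousAWSCredentialsProvider" in value:
            suspects.append((key, value))
    return suspects
```

An empty result, as in my case, suggests Spark is using the default (authenticated) credentials chain.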