Need help in accessing common crawl data through AWS (Athena/EMR Serverless)

Stierle O.

Jul 23, 2024, 10:25:46 AM
to Common Crawl
I am currently working on a university project that involves accessing and processing Common Crawl data using AWS services, specifically Athena and EMR Serverless. I have been following the tutorial in this AWS blog post <https://aws.amazon.com/de/blogs/big-data/preprocess-and-fine-tune-llms-quickly-and-cost-effectively-using-amazon-emr-serverless-and-amazon-sagemaker/>, but I have run into some challenges towards the end of the process and am unable to get it to work as intended.

Here is a brief overview of the steps I have taken and the issues I am facing:

  1. Set Up AWS Athena and EMR Serverless: I have set up Athena and EMR Serverless as per the instructions in the blog post.
  2. Data Processing: I am trying to process the Common Crawl data using the setup, but I am running into issues, particularly when it comes to executing the EMR Serverless job and processing the WARC files correctly.
    1. I have set up an Anaconda environment on an EC2 instance and have run all commands from there.
    2. Some jobs were successful; however, the output file was a .gz.parquet file and therefore inaccessible through the S3 Select function in AWS.

Despite following the steps outlined in the tutorial, I am unable to get the expected results. Specifically, I am struggling with the following:

  • Job Execution: While trying to resolve the file output issue, my EMR Serverless jobs are failing, and I am not sure how to debug them effectively.
  • Data Processing: I need assistance in correctly processing the WARC files and ensuring that the output is in the desired format (Parquet), so that I can use S3 Select.

Could someone please provide me with guidance on how to troubleshoot and resolve these issues? Additionally, any tips on best practices for processing Common Crawl data using AWS services would be greatly appreciated.

Thank you for your time and assistance in advance.

Sebastian Nagel

Jul 24, 2024, 7:32:50 AM
to common...@googlegroups.com
Hi Oliver,

one note ahead: I wasn't aware of this example - it's a great one,
although not a simple one!

We definitely want to replay it. Unfortunately, given that the holiday
season is straight ahead, this might take a few weeks. Sorry about that!


> 2. ... however the output file was a
> .gz.parquet file, thus inaccessible through the S3 Select
> function in AWS.

From [1]:
"Amazon S3 Select supports columnar compression for Parquet
using GZIP or Snappy."

Could you share more information or the exact error message?
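
If you want to rule out S3 Select itself, a minimal test from Python
might look roughly like this (a sketch only; bucket, key and query are
placeholders, not tested against your data):

import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-output-bucket",                      # placeholder bucket
    Key="output/part-00000.gz.parquet",             # placeholder: one output part file
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s LIMIT 10",
    InputSerialization={"Parquet": {}},             # Parquet input; column compression is read from the file
    OutputSerialization={"JSON": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
    elif "Stats" in event:
        print("bytes scanned:", event["Stats"]["Details"]["BytesScanned"])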

And just to confirm: unless configured otherwise, a cc-pyspark job
writes its output as Parquet, using gzip compression for the column
chunk data.

It's possible to disable Parquet compression:

$> $SPARK_HOME/bin/spark-submit ./server_count.py --help
...
  --output_compression OUTPUT_COMPRESSION
                        Output compression codec: None, gzip/zlib
                        (default), zstd, snappy, lzo, etc.

Please also have a look at related output options. For example, you
might want to use JSON or CSV in the beginning for easier debugging.
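
If in doubt about the output file itself, you could also copy one part
file to your machine and inspect it locally, for example with pyarrow
(a sketch; the file name is a placeholder and pyarrow is assumed to be
installed):

import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.gz.parquet")   # placeholder: one downloaded output part file
print(pf.schema_arrow)                         # column names and types
md = pf.metadata
print("rows:", md.num_rows, "row groups:", md.num_row_groups)
# compression codec of the first column chunk (should report GZIP with the defaults)
print(md.row_group(0).column(0).compression)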


> * *Job Execution*: While trying to resolve the file output issue, my
> EMR Serverless jobs are failing, and I am not sure how to debug
> the issue effectively.

Unfortunately, I have no experience with EMR Serverless. The key on EMR
(or with Hadoop and Spark in general) is to look into the task / executor
logs. The log file of the job client might not show the reason for
the failure.

A general piece of advice: I'd always test my jobs in local mode (on your
desktop or laptop) first, with a few WARC files as input, later with more,
and finally repeat the same on EMR: a small input and a small cluster
first, etc.
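
For example, to prepare a small test input you could take just the first
couple of paths from the warc.paths listing of a crawl (a rough sketch;
the crawl id is only an example, and whether the job expects https://
URLs, s3:// URIs or local paths depends on how it is configured):

import gzip
import urllib.request

# fetch the WARC path listing of one crawl (crawl id is just an example)
listing_url = ("https://data.commoncrawl.org/"
               "crawl-data/CC-MAIN-2023-23/warc.paths.gz")
with urllib.request.urlopen(listing_url) as r:
    paths = gzip.decompress(r.read()).decode("utf-8").splitlines()

# keep only the first two WARC files as input for a local test run
with open("test_input.txt", "w") as f:
    for p in paths[:2]:
        f.write("https://data.commoncrawl.org/" + p + "\n")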

Otherwise, if you could share more information about your project:
- logs of the failing jobs and/or tasks
- your use case or code
that would help to give you more detailed advice.


Best,
Sebastian


[1]
https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html
[2] https://status.commoncrawl.org/

Stierle O.

Aug 2, 2024, 11:03:28 AM
to Common Crawl
Hi Sebastian,

thank you for your quick answer and sorry for my late reply.

I have tried to implement your suggestion about disabling the Parquet compression by adding "--output_compression", "None" to the entryPointArguments of my SparkSubmit job; however, the job failed. As you said, the reason was not accessible through the driver logs in EMR Serverless Studio.
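
For reference, the job submission is roughly of this shape (sketched here
with boto3; the application id, role ARN and S3 paths are placeholders,
and the actual script is the one from the blog post):

import boto3

emr = boto3.client("emr-serverless")
resp = emr.start_job_run(
    applicationId="00abc123def456",                          # placeholder application id
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",  # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/my_job.py",   # placeholder for the blog post's script
            "entryPointArguments": [
                "s3://my-bucket/input/filtered_paths.txt",      # placeholder input from the Athena step
                "s3://my-bucket/output/",                       # placeholder output location
                "--output_compression", "None",                 # the option I tried to add
            ],
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            # writing logs to S3 so driver/executor logs stay accessible
            "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/logs/"}
        }
    },
)
print(resp["jobRunId"])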

Just to clarify my approach from the blog post:
1. Set up an S3 bucket to store input files, scripts, output, etc.
2. Use AWS Athena to filter the Common Crawl data needed for further processing (roughly as in the query sketch below).
3. Run a Spark job to further process the data filtered with Athena (on EMR Serverless, submitted from my EC2 instance with the Conda environment).
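
The Athena step (2.) is essentially a query against the Common Crawl
columnar index along these lines (a simplified sketch, not my exact
query; the database/table name and filter values are examples):

import boto3

athena = boto3.client("athena")
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2023-23'
  AND subset = 'warc'
  AND url_host_tld = 'de'
LIMIT 100
"""
resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "ccindex"},                            # example database name
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"}, # placeholder
)
print(resp["QueryExecutionId"])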

My Questions:
1. Are my assumptions about each step correct, or have I missed or misinterpreted something?
2. Would you recommend using an EMR cluster rather than EMR Serverless, given that you know more about the former?


You said:
> A general advice: I'd always test my jobs in local mode (on your
> desktop or laptop), first with few, later with more WARC files
> as input, finally you repeat the same on EMR: little input and small
> cluster first, etc.

I have tried to follow the blog post because it was the only resource I found that uses AWS in any meaningful way. I have also tried to access the Common Crawl data via https:// (locally), but I never figured out how to get beyond the WARC paths file. I am stuck with these .gz files opened as a text file: crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz
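
Is something along these lines the intended way to get from a warc.paths
entry to actual records? (An untested sketch, assuming the warcio and
requests packages are available.)

import requests
from warcio.archiveiterator import ArchiveIterator

path = ("crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/"
        "CC-MAIN-20230527223515-20230528013515-00000.warc.gz")
url = "https://data.commoncrawl.org/" + path

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
            break   # just the first response record as a smoke test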


Thank you for your time and answer.

Sincerely,
Oliver