Hi Oliver,
One note ahead: I wasn't aware of this example - it's a great one,
although not a simple one!
We definitely want to reproduce it. Unfortunately, given that the holiday
season is straight ahead, this might take a few weeks. Sorry about that!
> 2. ... however the output file was a
> .gz.parquet file, thus inaccessible through the S3 Select
> function in AWS.
From [1]:
"Amazon S3 Select supports columnar compression for Parquet
using GZIP or Snappy."
Could you share more information or the exact error message?
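In case it helps to narrow this down, here is a minimal boto3 sketch for
querying a Parquet object via S3 Select (bucket, key and query are just
placeholders you'd need to adapt):

  import boto3

  s3 = boto3.client("s3")
  resp = s3.select_object_content(
      Bucket="your-bucket",                     # placeholder
      Key="output/part-00000-xyz.gz.parquet",   # placeholder
      ExpressionType="SQL",
      Expression="SELECT * FROM S3Object s LIMIT 10",
      InputSerialization={"Parquet": {}},
      OutputSerialization={"JSON": {}},
  )
  for event in resp["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode("utf-8"))

If a call along these lines fails for your output files, the exact error
message would tell us whether it's really a compression issue.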
And just to confirm: if not configured otherwise, a cc-pyspark job
writes its output as Parquet, using gzip compression for the column
chunk data.
It's possible to disable Parquet compression:
$> $SPARK_HOME/bin/spark-submit ./server_count.py --help
...
--output_compression OUTPUT_COMPRESSION
                     Output compression codec: None, gzip/zlib
                     (default), zstd, snappy, lzo, etc.
Please also have a look at related output options. For example, you
might want to use JSON or CSV in the beginning for easier debugging.
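Just in case it helps with debugging the current output: the .gz.parquet
files are ordinary Parquet files whose column chunks are gzip-compressed,
so any Parquet reader can open them once downloaded. A quick check
(file name is a placeholder, assumes pandas with pyarrow installed):

  import pandas as pd

  # gzip here is Parquet-internal column-chunk compression and is
  # handled transparently by the reader
  df = pd.read_parquet("part-00000-xyz.gz.parquet")
  print(df.head())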
> * *Job Execution*: While trying to resolve the file output issue, my
> EMR Serverless jobs are failing, and I am not sure how to debug
> the issue effectively.
Unfortunately, I have no experience with EMR Serverless. The key on EMR
(or with Hadoop and Spark in general) is to look into the task / executor
logs. The log file of the job client might not show the reason for
the failure.
One piece of general advice: I'd always test jobs in local mode first
(on your desktop or laptop), starting with a few WARC files as input
and then adding more. Then repeat the same on EMR: a small input and a
small cluster first, etc.
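For a first sanity check you don't even need Spark: cc-pyspark reads the
WARC files with warcio, so you can verify the record processing on a
single, locally downloaded WARC file, roughly along these lines (the
file name is a placeholder):

  from collections import Counter
  from warcio.archiveiterator import ArchiveIterator

  counts = Counter()
  # one WARC file downloaded to local disk (placeholder name)
  with open("CC-MAIN-...-00000.warc.gz", "rb") as stream:
      for record in ArchiveIterator(stream):
          if record.rec_type == "response" and record.http_headers:
              server = record.http_headers.get_header("Server") or "(unknown)"
              counts[server] += 1

  for server, n in counts.most_common(10):
      print(n, server)

If that works, the remaining failures are more likely related to the
cluster setup or S3 access than to the processing logic itself.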
Otherwise, if you could share more information about your project:
- logs of the failing jobs and/or tasks
- your use case or code
that would help me give you more detailed advice.
Best,
Sebastian
[1]
https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html
[2]
https://status.commoncrawl.org/
On 7/23/24 14:41, Stierle O. wrote:
> I am currently working on a university project that involves accessing
> and processing Common Crawl data using AWS services, specifically Athena
> and EMR Serverless. I have been following the tutorial provided in this
> AWS blog post
> <https://aws.amazon.com/de/blogs/big-data/preprocess-and-fine-tune-llms-quickly-and-cost-effectively-using-amazon-emr-serverless-and-amazon-sagemaker/>,
> but I have encountered some challenges towards the end of the process
> and am unable to get it to work as intended.
>
> Here is a brief overview of the steps I have taken and the issues I am
> facing:
>
> 1. *Set Up AWS Athena and EMR Serverless*: I have set up Athena and EMR
> Serverless as per the instructions in the blog post.
> 2. *Data Processing*: I am trying to process the Common Crawl data
> using the setup, but I am running into issues, particularly when it
> comes to executing the EMR Serverless job and processing the WARC
> files correctly.
> 1. I have set up an anaconda environment as a EC2 Instance and have
> run all commands from there.
> 2. Some Jobs were successful, however the output file was a
> .gz.parquet file, thus inaccessible through the S3 Select
> function in AWS.
>
> Despite following the steps outlined in the tutorial, I am unable to get
> the expected results. Specifically, I am struggling with the following:
>
> * *Job Execution*: While trying to resolve the file output issue, my
> EMR Serverless jobs are failing, and I am not sure how to debug the
> issue effectively.
> * *Data Processing*: I need assistance in correctly processing the