Stierle O. <stierle...@gmail.com>: Aug 02 08:03AM -0700
Hi Sebastian,
thank you for your quick answer and sorry for my late reply.
I have tried to implement your suggestion to disable the Parquet compression
by adding "--output_compression", "None" to the entryPointArguments of my
SparkSubmit job command; however, the job failed. As you said, the logs
weren't accessible in the EMR Serverless Studio driver logs.
Just to clarify my approach from the blog post:
1. Set up an S3 bucket to store input files, scripts, output, etc.
2. Use AWS Athena to filter the necessary Common Crawl data for further
processing (see the query sketch after this list).
3. Run a Spark job to further process the filtered Athena data (on EMR
Serverless, submitted from my EC2 instance with a Conda environment).
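For step 2, this is roughly the kind of query I run against the columnar
index (table and column names as I understand them from the Common Crawl
documentation; the crawl label, TLD filter and result location are
placeholders):

import boto3

# query against the Common Crawl columnar index (database/table "ccindex")
query = """
SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2023-23'
  AND subset = 'warc'
  AND url_host_tld = 'com'
LIMIT 1000
"""

athena = boto3.client("athena", region_name="us-east-1")
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://<my-bucket>/athena-results/"},
)
print(response["QueryExecutionId"])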
My Questions:
1. Am I correct in my assumptions about each step, or have I missed or
misinterpreted something?
2. Would you recommend using an EMR cluster rather than EMR Serverless, given
that you know more about the former?
You said:
> desktop or laptop), first with few, later with more WARC files
> as input, finally you repeat the same on EMR: little input and small
> cluster first, etc.
I have tried to follow the blog post because it was the only resource I found
that uses AWS in any meaningful way. I have also tried to access the Common
Crawl data over https:// (locally); however, I never figured out how to get
beyond the WARC paths file. I am stuck with these .gz files, which open as a
text file of lines like:
crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz
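Concretely, what I tried locally looks roughly like this (assuming requests
and warcio are installed; the paths file name is from my download). Is
prefixing the listed paths with https://data.commoncrawl.org/ the intended
way to fetch the actual WARC files, or am I missing a step?

import gzip

import requests
from warcio.archiveiterator import ArchiveIterator

PREFIX = "https://data.commoncrawl.org/"

# warc.paths.gz lists one relative WARC path per line
with gzip.open("warc.paths.gz", "rt") as paths:
    warc_path = paths.readline().strip()

# stream one WARC file and look at its first response record
resp = requests.get(PREFIX + warc_path, stream=True)
resp.raise_for_status()
for record in ArchiveIterator(resp.raw):
    if record.rec_type == "response":
        print(record.rec_headers.get_header("WARC-Target-URI"))
        break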
Thank you for your time and answer.
Sincerely,
Oliver
Sebastian Nagel wrote on Wednesday, 24 July 2024 at 13:32:50 UTC+2: