Re: [cc] Digest for common-crawl@googlegroups.com - 1 update in 1 topic


Lee Prevost

Aug 3, 2024, 7:52:07 AM
to common...@googlegroups.com
Having gone down this rabbit hole before: could your needs be satisfied by the CC index, which can be used first to target your archives and limit your search?

If so, there is an excellent guide on how to set up the CC index using nothing more than AWS Athena; I can try to find it and point you to it. Athena's metadata pointing to the underlying Parquet files just needs to be refreshed periodically with a simple SQL command (see the sketch below). It makes finding WARCs, WETs, and other info immensely easier. There is also a web version of the index search, but it is much more limited than Athena.
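For illustration, a minimal sketch of that refresh using boto3, assuming the "ccindex" database and table names from Common Crawl's columnar index guide; the results bucket is a placeholder:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Refresh Athena's partition metadata so newly released monthly
    # crawls become visible to queries against the columnar index.
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE ccindex;",
        QueryExecutionContext={"Database": "ccindex"},
        # Placeholder bucket for Athena query results:
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )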

Lee
Sent from my iPad

On Aug 2, 2024, at 10:14 PM, common...@googlegroups.com wrote:


Stierle O. <stierle...@gmail.com>: Aug 02 08:03AM -0700

Hi Sebastian,
 
thank you for your prompt answer, and sorry for my late one.
 
I have tried to implement your suggestion about disabling the Parquet
compression by adding "--output_compression", "None" to the
entryPointArguments of my SparkSubmit job command; however, it failed. As you
said, the logs weren't accessible in the EMR Serverless Studio driver logs.
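For context, a sketch of roughly how such an EMR Serverless submission looks via boto3 (application ID, role ARN, and S3 paths are placeholders; the input listing and output table follow cc-pyspark's positional-argument convention). The s3MonitoringConfiguration part ships driver and executor logs to S3, which may be a workaround when they aren't readable in the Studio UI:

    import boto3

    emr = boto3.client("emr-serverless", region_name="us-east-1")

    emr.start_job_run(
        applicationId="my-application-id",  # placeholder
        executionRoleArn="arn:aws:iam::123456789012:role/my-emr-serverless-role",  # placeholder
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/scripts/my_cc_job.py",  # placeholder
                "entryPointArguments": [
                    "--output_compression", "None",
                    "s3://my-bucket/input/warc_paths.txt",  # input listing (placeholder)
                    "my_output_table",                      # output table (placeholder)
                ],
            }
        },
        # Write driver/executor logs to S3 so they remain accessible:
        configurationOverrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/emr-logs/"}
            }
        },
    )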
 
Just to clarify my approach from the blog post:
1. Set up an S3 bucket to store input files, scripts, output, etc.
2. Use AWS Athena to filter the necessary Common Crawl data for further
processing (see the sketch after this list)
3. Run a Spark job command to further process the filtered Athena data (in
EMR Serverless, submitted from my EC2 instance with a Conda environment)
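For step 2, a minimal sketch of the kind of filter query I mean, again via boto3 and using the column names of the columnar index ("ccindex"); the crawl, TLD filter, and bucket are placeholders:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Select the WARC file name plus byte offset/length of each record,
    # so the Spark job can fetch only the ranges it needs.
    query = """
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM "ccindex"."ccindex"
    WHERE crawl = 'CC-MAIN-2023-23'   -- placeholder crawl
      AND subset = 'warc'
      AND url_host_tld = 'com'        -- placeholder filter
    LIMIT 100;
    """
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "ccindex"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )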
 
My Questions:
1. Am I correct in my assumptions about each step, or have I potentially
missed or misinterpreted something?
2. Would you recommend using an EMR cluster rather than EMR Serverless,
given that you know more about the former?
 
 
You said:
> desktop or laptop), first with few, later with more WARC files
> as input, finally you repeat the same on EMR: little input and small
> cluster first, etc.
 
I have tried to follow the blog post, because it was the only thing I
found that utilizes AWS in any meaningful way. I have also tried to access
the Common Crawl data over HTTPS (locally); however, I never figured
out how to get beyond the WARC paths file. I am stuck with entries like
this one, from the .gz file opened as a text file:
crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz
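Would the right approach be to prepend the public endpoint https://data.commoncrawl.org/ to each of these relative paths and stream the resulting URL, e.g. with requests and warcio? A sketch of what I mean (names and the early break are just for illustration):

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Entries in warc.paths are relative; prepend the public HTTPS endpoint.
    path = ("crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/"
            "CC-MAIN-20230527223515-20230528013515-00000.warc.gz")
    url = "https://data.commoncrawl.org/" + path

    # Stream the gzipped WARC and iterate its records without writing
    # the whole ~1 GB file to disk first.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for record in ArchiveIterator(resp.raw):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))
                break  # first response record only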
 
 
Thank you for your time and answer.
 
Sincerely,
Oliver
 
Sebastian Nagel wrote on Wednesday, July 24, 2024 at 13:32:50 UTC+2:
 