Common Crawl and Apache Spark

giorgio79

Jan 6, 2015, 10:54:58 AM
to common...@googlegroups.com
Is anyone processing Common Crawl with Apache Spark (https://spark.apache.org/)? I'd love to read a tutorial :)

Taka Shinagawa

Jan 7, 2015, 4:50:48 AM
to common...@googlegroups.com
I've processed a small subset of the Common Crawl dataset with Spark (in both Scala and Python). For large datasets, the challenge is the memory available on local hardware/EC2 instances. I'm looking into it further and am planning to write a tutorial.
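
To give a rough idea of what I mean by a small subset, here's a minimal PySpark sketch: it reads a couple of WET (extracted plain text) files straight from the public commoncrawl bucket and does a trivial word count. The paths below are placeholders (in practice you take them from the wet.paths.gz listing published with each crawl), and it assumes the S3 filesystem support (hadoop-aws / s3a) and credentials are already configured.

from pyspark import SparkContext

sc = SparkContext(appName="cc-wet-sample")

# Placeholder paths; real ones come from the wet.paths.gz listing for a crawl.
wet_paths = ",".join([
    "s3a://commoncrawl/crawl-data/CC-MAIN-2015-06/segments/.../wet/...-00000.warc.wet.gz",
    "s3a://commoncrawl/crawl-data/CC-MAIN-2015-06/segments/.../wet/...-00001.warc.wet.gz",
])

# Spark decompresses .gz text files transparently; each gzip file becomes
# one (unsplittable) partition.
lines = sc.textFile(wet_paths)

# Crude filter: skip WARC/HTTP header lines, keep the extracted page text.
text = lines.filter(lambda l: not (l.startswith("WARC") or l.startswith("Content-")))

word_counts = (text.flatMap(lambda l: l.lower().split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

print(word_counts.take(10))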

Cheers,
Taka

Akshay Bhat

Jan 7, 2015, 9:46:41 PM
to common...@googlegroups.com
In my opinion, Spark is unsuitable for Extract-Transform-Load (ETL) type tasks on the raw Common Crawl data.
It is, however, very well suited for analysis once you have the extracted data stored on S3.
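
For example, once an extraction pass has already written per-page records to your own bucket, the analysis side in Spark is pleasant. A sketch, with a made-up bucket, prefix and schema (JSON records with url/domain/text fields):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cc-extracted-analysis").getOrCreate()

# Hypothetical location of the already-extracted records on S3.
df = spark.read.json("s3a://my-extracted-bucket/CC-MAIN-2015-06/*.json.gz")

# Example analysis: pages per domain, largest first.
(df.groupBy("domain")
   .agg(F.count("*").alias("pages"))
   .orderBy(F.desc("pages"))
   .show(20))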

Wojciech Stokowiec

Oct 8, 2015, 6:26:44 AM
to Common Crawl
We've processed CC with Scala/Akka actors running on a small cluster and then dumped the filtered results (Polish websites) to Cassandra. From there we've used Spark for analytics, n-gram building, etc.
I also think that Spark is not well suited for CC processing.
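
That said, the analytics step on top of Cassandra worked nicely for us. Roughly what it looks like, with names changed: this assumes the DataStax spark-cassandra-connector is on the classpath (e.g. via --packages) and spark.cassandra.connection.host is set; the keyspace, table and column names below are just illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cc-cassandra-analytics").getOrCreate()

# Load the filtered pages that the Akka pipeline wrote to Cassandra.
pages = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="cc", table="polish_pages")
         .load()
         .where(F.col("text").isNotNull()))

# Toy n-gram step: lowercase, split into words, count bigrams.
words = pages.select(F.split(F.lower(F.col("text")), r"\s+").alias("w"))
bigrams = (words.rdd
           .flatMap(lambda row: zip(row.w, row.w[1:]))
           .map(lambda b: (b, 1))
           .reduceByKey(lambda a, b: a + b))

print(bigrams.takeOrdered(20, key=lambda kv: -kv[1]))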

Colin Dellow

Oct 8, 2015, 10:29:52 PM
to Common Crawl
I'd be curious to hear how others have tackled this.

I've used Spark in perhaps an odd way when processing Common Crawl:

- spin up a bunch of EC2 spot instances
- each instance auto-configures itself
- each instance fetches a list of CC files to process (e.g., one of the aggregate .gz path listings published by CC)
- each instance works through that list:
  - uses S3 as a distributed lock to record which instance is working on each file (rough sketch below)
  - publishes intermediate results to S3
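
Roughly, the claim step looks something like this. The coordination bucket and key layout are just illustrative, and plain S3 gives you no real atomicity, so treat it as best-effort deduplication of work rather than a true lock:

import socket

import boto3

s3 = boto3.client("s3")
LOCK_BUCKET = "my-coordination-bucket"   # illustrative bucket name
ME = socket.gethostname()

def try_claim(cc_file):
    """Best-effort claim of one CC file via a marker object on S3."""
    key = "locks/" + cc_file.replace("/", "_")
    # If another instance already left a marker, skip this file.
    if s3.list_objects_v2(Bucket=LOCK_BUCKET, Prefix=key).get("KeyCount", 0) > 0:
        return False
    s3.put_object(Bucket=LOCK_BUCKET, Key=key, Body=ME.encode())
    # Read the marker back; if someone else's write won, defer to them.
    winner = s3.get_object(Bucket=LOCK_BUCKET, Key=key)["Body"].read().decode()
    return winner == ME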

Why this is odd: each EC2 instance runs its own standalone Spark instance. I basically use Spark because it's a convenient way to scale across the cores of heterogeneous instance types, and it has good integration with S3. I'll typically run another job afterwards to coalesce the intermediate results into a single result.
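
A stripped-down version of what each instance then runs: a local-mode Spark context that works through whatever files it managed to claim and writes per-file intermediate output back to S3. The listing and extraction helpers here are hypothetical stand-ins for whatever your job actually does, and the bucket names are again invented:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("cc-worker")
sc = SparkContext(conf=conf)

def fetch_file_listing():
    # Hypothetical: in practice this reads one of CC's aggregate *.paths.gz files.
    return ["crawl-data/CC-MAIN-2015-06/segments/.../wet/...-00000.warc.wet.gz"]

def extract_records(line):
    # Hypothetical per-line extraction; here it just yields the line itself.
    yield line

for cc_file in fetch_file_listing():
    if not try_claim(cc_file):            # the S3 marker trick sketched above
        continue
    lines = sc.textFile("s3a://commoncrawl/" + cc_file)
    result = (lines.flatMap(extract_records)
                   .map(lambda r: (r, 1))
                   .reduceByKey(lambda a, b: a + b))
    out = "s3a://my-results-bucket/intermediate/" + cc_file.replace("/", "_")
    result.saveAsTextFile(out)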

So, while you can't do an end-to-end job wholly within Spark, this is a really cheap way to process CC data. Plus, you can dial the time-to-completion up/down at any point by just killing or adding new instances.