I'd be curious to hear how others have tackled this.
I've used Spark in perhaps an odd way when processing Common Crawl:
- spin up a bunch of EC2 spot instances
- the instance auto-configures itself
- the instance fetches a list of CC files to process (e.g., one of the aggregate .gz files published by CC)
- the instance works through that list:
- uses S3 as a distributed lock to record which instance is working on which file (roughly as in the sketch after this list)
- publishes intermediate results to S3
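The fetch + claim steps look roughly like the boto3 sketch below. The bucket and key names (my-cc-work, claims/, the particular crawl listing) are placeholders rather than my actual setup, and the check-then-put is only a best-effort lock; the rare duplicate gets squashed later when coalescing.

```python
# Rough sketch of the fetch + claim steps, assuming boto3; the scratch bucket
# and claim prefix below are placeholders, not my actual setup.
import gzip
import io

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

CC_BUCKET = "commoncrawl"
PATHS_KEY = "crawl-data/CC-MAIN-2024-10/wet.paths.gz"  # one of CC's aggregate .gz listings
WORK_BUCKET = "my-cc-work"                             # placeholder scratch bucket


def fetch_path_list():
    """Download and decompress the aggregate .gz listing into a list of file paths."""
    body = s3.get_object(Bucket=CC_BUCKET, Key=PATHS_KEY)["Body"].read()
    with gzip.open(io.BytesIO(body), "rt") as fh:
        return [line.strip() for line in fh if line.strip()]


def try_claim(path, instance_id):
    """Best-effort 'lock': write a marker key unless some other instance already has.

    Check-then-put isn't truly atomic, so two instances can occasionally grab
    the same file; duplicates are cheap to dedupe in the coalesce step.
    """
    claim_key = "claims/" + path.replace("/", "_")
    try:
        s3.head_object(Bucket=WORK_BUCKET, Key=claim_key)
        return False  # already claimed by another instance
    except ClientError:
        s3.put_object(Bucket=WORK_BUCKET, Key=claim_key, Body=instance_id.encode())
        return True
```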
Why this is odd: each EC2 instance runs its own standalone Spark rather than joining a shared cluster. I basically use Spark because it's a convenient way to scale across cores on heterogeneous instance types, and it has good integration with S3. I'll typically run another job afterwards to coalesce the intermediate results into a single result.
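The per-instance Spark side is nothing fancy; it's roughly the sketch below, assuming pyspark with the S3A connector on the classpath. The extraction logic and the my-cc-work bucket are stand-ins, not the real pipeline.

```python
# Rough shape of the per-instance worker; extraction logic and output bucket
# are stand-ins for the real thing.
from pyspark.sql import SparkSession

# local[*] = a single standalone Spark using every core on this box,
# whatever instance type the spot market handed us.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("cc-worker")
    .getOrCreate()
)


def process_claimed_file(warc_path):
    """Read one claimed CC file from S3, extract what we need, write an intermediate result."""
    records = spark.read.text("s3a://commoncrawl/" + warc_path)
    # Stand-in for the actual extraction logic.
    extracted = records.filter(records.value.startswith("WARC-Target-URI"))
    extracted.write.mode("overwrite").parquet(
        "s3a://my-cc-work/intermediate/" + warc_path.replace("/", "_")
    )


def coalesce_intermediates():
    """Separate follow-up job: fold all the per-file outputs into one result."""
    (spark.read.parquet("s3a://my-cc-work/intermediate/*")
        .coalesce(1)
        .write.mode("overwrite")
        .parquet("s3a://my-cc-work/final"))
```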
So, while you can't do an end-to-end job wholly within Spark, this is a really cheap way to process CC data. Plus, you can dial the time-to-completion up/down at any point by just killing or adding new instances.