Have you considered making the crawl index available on BigQuery? BigQuery's bigquery-public-data project already hosts a lot of really useful datasets, and the CC index would be a great addition. I know it would be very useful to me in my own search engine research.
Since the crawl indexes are already published as Parquet files, the path into BigQuery should be relatively straightforward. I did a quick cost analysis; the results are below.
What would need to be done, with the associated cost points:
1. Copy Parquet files from commoncrawl S3 bucket to Google Cloud Storage [1]
a. S3 egress
b. GCS ingress
c. GCS storage
2. Load into BigQuery [2]
d. Compute to run BigQuery load job
e. BigQuery ingest
f. BigQuery storage
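For concreteness, here is a rough sketch of steps 1 and 2 as CLI commands. The bucket, dataset, and crawl names below are hypothetical placeholders, and the script only prints the commands it would run, since actually executing them requires live AWS and GCP credentials:

```shell
#!/bin/sh
# Sketch of the S3 -> GCS -> BigQuery path. All names here are
# illustrative assumptions, not real buckets or tables.
CRAWL="CC-MAIN-2023-50"                                # example crawl id
SRC="s3://commoncrawl/cc-index/table/cc-main/warc/crawl=${CRAWL}/"
GCS="gs://my-cc-staging/crawl=${CRAWL}/"               # hypothetical staging bucket
TABLE="cc_index.cc_main_2023_50"                       # hypothetical dataset.table

# Step 1 (1a-1c): copy the Parquet files from S3 to GCS. gsutil can read
# s3:// URLs directly when AWS credentials are configured in ~/.boto.
COPY_CMD="gsutil -m cp -r ${SRC} ${GCS}"

# Step 2 (2d-2f): load the staged Parquet files into a BigQuery table.
LOAD_CMD="bq load --source_format=PARQUET ${TABLE} ${GCS}*.parquet"

# Printed rather than executed here, since both need real credentials.
echo "$COPY_CMD"
echo "$LOAD_CMD"
```

In practice the Storage Transfer Service (or the BigQuery Data Transfer Service mentioned below) could replace the gsutil step for a dataset this large, but the two commands above capture the basic shape of the pipeline.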
The associated costs:
1a. Free, since the source data is hosted in an AWS Public Datasets bucket
1b. Free
1c. Temporary staging storage only, so should be very low
2d. Free if using the BigQuery Data Transfer Service? [3]
2e. Free if using the BigQuery Data Transfer Service? [3]
2f. Free via the Google Cloud Public Datasets program, if Google accepted CC into it
I'd love to hear your thoughts on this.
Regards,
Alan Gibson