CC index on Google BigQuery


Alan Gibson

Feb 19, 2022, 3:23:39 PM
to Common Crawl
Hello CC,

Have you considered making the crawl index available on BigQuery? There are already a lot of genuinely useful datasets in BigQuery's bigquery-public-data project, and the CC index would be a natural addition. I know it would be very useful to me in my search engine research.
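
To give a sense of what this would enable, here is a minimal sketch of a lookup using the google-cloud-bigquery Python client. The dataset and table name below are hypothetical placeholders, but the columns are those of Common Crawl's columnar URL index:

# Minimal sketch: look up index records for one domain in one crawl.
# The table name bigquery-public-data.commoncrawl.index is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM `bigquery-public-data.commoncrawl.index`
    WHERE crawl = 'CC-MAIN-2022-05'
      AND subset = 'warc'
      AND url_host_registered_domain = 'example.com'
"""
for row in client.query(query).result():
    print(row.url, row.warc_filename, row.warc_record_offset)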

Since the crawl indexes are already available as Parquet files, there would be a relatively easy path into BigQuery. I did a quick cost analysis; the results, and a rough sketch of the load step, are below.

What would need to be done, and where costs could arise:

1. Copy Parquet files from commoncrawl S3 bucket to Google Cloud Storage [1]
    a. S3 egress
    b. GCS ingress
    c. GCS storage
2. Load into BigQuery [2]
    d. Compute to run BigQuery load job
    e. BigQuery ingest
    f. BigQuery storage

The associated costs:

1a. Free, since the source data is hosted in an AWS Public Data bucket
1b. Free
1c. Temporary staging storage only, so should be very low
2d. Free if using BigQuery Data Transfer Service? [3]
2e. Free if using BigQuery Data Transfer Service? [3]
2f. Free via the Google Public Dataset program if they accepted CC
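
To make step 2 concrete, here is a rough sketch of the load using the google-cloud-bigquery Python client, assuming the Parquet files of one crawl have already been staged in a GCS bucket (the bucket, project, and table names are placeholders). Step 1 itself could be handled by the Storage Transfer Service, or by gsutil cp with an s3:// source given AWS credentials.

# Rough sketch: load staged Parquet index files into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    # Parquet is self-describing, so no explicit schema is needed.
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Path layout mirrors s3://commoncrawl/cc-index/table/cc-main/warc/.
# Note: crawl= and subset= are Hive partition keys encoded in the path;
# preserving them as columns would need hive-partitioning options.
load_job = client.load_table_from_uri(
    "gs://my-cc-staging/cc-index/table/cc-main/warc/"
    "crawl=CC-MAIN-2022-05/subset=warc/*.parquet",
    "my-project.commoncrawl.ccindex",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes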

I'd love to hear your thoughts on this.

Regards, 

Alan Gibson

Sebastian Nagel

Feb 21, 2022, 6:49:33 AM
to common...@googlegroups.com
Hi Alan,

thanks for the suggestion and the detailed description of how to
transfer the data.

Unfortunately, Common Crawl has a single engineer, and we do not have
the resources to do the onboarding and, more importantly, to provide
ongoing support for BigQuery users, write tutorials, etc. Without the
prospect of building a community of users there, it does not make
sense, at least in my opinion.

Thanks for your understanding! I know the index could be a valuable
resource if you're bound to Google Cloud.

Best,
Sebastian