CC index on Google BigQuery


Alan Gibson

Feb 19, 2022, 3:23:39 PM
to Common Crawl
Hello CC,

Have you considered making the crawl index available on BigQuery? There are already a lot of genuinely useful datasets in BigQuery's bigquery-public-data project, and the CC index would be a natural addition. I know it would be very useful to me in my search engine research.
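
To give a sense of what this would enable, here is a minimal sketch of a lookup using the google-cloud-bigquery Python client. The dataset and table name below are hypothetical placeholders, but the columns are those of Common Crawl's columnar URL index:

# Minimal sketch: look up index records for one domain in one crawl.
# The table name bigquery-public-data.commoncrawl.index is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM `bigquery-public-data.commoncrawl.index`
    WHERE crawl = 'CC-MAIN-2022-05'
      AND subset = 'warc'
      AND url_host_registered_domain = 'example.com'
"""
for row in client.query(query).result():
    print(row.url, row.warc_filename, row.warc_record_offset)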

Since the crawl indexes are already available as Parquet files, there would be a relatively easy path into BigQuery. I did a quick cost analysis; the results, and a rough sketch of the load step, are below.

What would need to be done, and where costs could arise:

1. Copy Parquet files from commoncrawl S3 bucket to Google Cloud Storage [1]
    a. S3 egress
    b. GCS ingress
    c. GCS storage
2. Load into BigQuery [2]
    d. Compute to run BigQuery load job
    e. BigQuery ingest
    f. BigQuery storage

The associated costs:

1a. Free, since the source data is hosted in an AWS Public Data bucket
1b. Free
1c. Temporary staging storage only, so should be very low
2d. Free if using BigQuery Data Transfer Service? [3]
2e. Free if using BigQuery Data Transfer Service? [3]
2f. Free via the Google Public Dataset program if they accepted CC
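
To make step 2 concrete, here is a rough sketch of the load using the google-cloud-bigquery Python client, assuming the Parquet files of one crawl have already been staged in a GCS bucket (the bucket, project, and table names are placeholders). Step 1 itself could be handled by the Storage Transfer Service, or by gsutil cp with an s3:// source given AWS credentials.

# Rough sketch: load staged Parquet index files into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    # Parquet is self-describing, so no explicit schema is needed.
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Path layout mirrors s3://commoncrawl/cc-index/table/cc-main/warc/.
# Note: crawl= and subset= are Hive partition keys encoded in the path;
# preserving them as columns would need hive-partitioning options.
load_job = client.load_table_from_uri(
    "gs://my-cc-staging/cc-index/table/cc-main/warc/"
    "crawl=CC-MAIN-2022-05/subset=warc/*.parquet",
    "my-project.commoncrawl.ccindex",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes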

I'd love to hear your thoughts on this.

Regards, 

Alan Gibson

Sebastian Nagel

Feb 21, 2022, 6:49:33 AM
to common...@googlegroups.com
Hi Alan,

thanks for the suggestion and the detailed description of how to
transfer the data.

Unfortunately, Common Crawl has a single engineer, and we do not have
the resources to do the onboarding and, more importantly, to provide
ongoing support for BigQuery users, write tutorials, etc. Without the
prospect of building a community of users there, it does not make
sense, at least in my opinion.

Thanks for your understanding! I know the index could be a valuable
resource if you're bound to Google Cloud.

Best,
Sebastian