Is there a plan to host the data on Google's public datasets?

Derek Chia

May 22, 2016, 2:34:09 AM
to Common Crawl
Hello,

Just want to find out if there are any plans to upload the data from CommonCrawl (WARC files) into Google's public datasets - https://www.google.com/publicdata/directory. I thought it would be interesting to make use of Google's tools (e.g. BigQuery) to conduct analysis.

Cheers,

Derek

Tom Morris

Jun 1, 2016, 11:41:05 AM
to common...@googlegroups.com
On Sun, May 22, 2016 at 2:34 AM, Derek Chia <derek...@gmail.com> wrote:

Just want to find out if there are any plans to upload the data from CommonCrawl (WARC files) into Google's public datasets - https://www.google.com/publicdata/directory. I thought it would be interesting to make use of Google's tools (e.g. BigQuery) to conduct analysis.

I don't know if there are any plans to do this. Anyone from CommonCrawl want to comment?

The link you provided doesn't actually go to the BigQuery public datasets though. Those are datasets for use with the Public Data Explorer visualization tool.

BigQuery public datasets are described here: https://cloud.google.com/bigquery/public-data/
although, ironically, the Reddit list is more complete: https://www.reddit.com/r/bigquery/wiki/datasets
because many of the semi-official datasets are in different buckets, such as Felipe Hoffa's bucket, fh-bigquery, the HTTP Archive bucket, etc.

For non-BigQuery access, Google also makes some public data available on Google Cloud Storage, such as the public genomics data, which would be another option for storing the CommonCrawl data, depending on what type of processing folks wanted to do.

If the CommonCrawl folks haven't already started the discussion, I'd be willing to query Google and see if we can get them interested in hosting.

Tom

Sara Crouse

Jun 7, 2016, 3:19:23 PM
to Common Crawl


Hi Derek and Tom,

Thanks for raising this topic. At this time, there are no plans to upload Common Crawl data to Google public datasets. That said, others may be using BigQuery to process derivative (structured) datasets using file formats that are more compatible with Google's processing tools. 

Sara