Preview release of URL in columnar format

Sebastian Nagel

Jan 26, 2018, 9:44:22 AM
to common...@googlegroups.com
Dear Common Crawl users,

we're glad to announce the preview of our URL index in a tabular/columnar format at
s3://commoncrawl/cc-index/table/cc-main/warc/

More details and examples of how to use the table can be found on GitHub [1]
and a human-readable list of the fields is provided at [2].

It contains the same data as the URL index (http://index.commoncrawl.org/)
but in a different format (Parquet [3]), which is more suitable for analytical queries
and allows reading individual columns without scanning the entire data set. It is fast
to process or query if you're only interested in domain names, MIME types, etc.
while the existing URL index server is the better option to look up a single URL.
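
To illustrate why a columnar layout helps with this kind of query, here is a
minimal sketch in plain Python. It does not read the actual table: the records
and the field names ("url", "host", "mime") are made up for the example and
are not the real table schema (see [2] for that). The point is only that,
once the data is stored column by column, an aggregation over MIME types
touches a single column and never looks at the URLs.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only
# and do not reflect the real schema of the columnar index.
rows = [
    {"url": "http://example.com/", "host": "example.com", "mime": "text/html"},
    {"url": "http://example.com/a.pdf", "host": "example.com", "mime": "application/pdf"},
    {"url": "http://example.org/", "host": "example.org", "mime": "text/html"},
]


def to_columns(records):
    """Pivot row records into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def count_values(columns, name):
    """Aggregate over one column; the other columns are never read."""
    return Counter(columns[name])


columns = to_columns(rows)
mime_counts = count_values(columns, "mime")
print(mime_counts.most_common())
```

Parquet applies the same idea on disk, so a query engine can skip the columns
a query does not ask for. Looking up one specific URL, by contrast, needs all
fields of a single record, which is what the existing index server is good at.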

Note that it's a preview release, i.e., there is little documentation and the format
(table schema) may change. But if you're already familiar with Parquet [3], Spark [4]
or Presto/Athena [5,6], you may give it a try. Let us know about your experience,
share your examples, or tell us how we might improve the format.

Right now, only a single monthly crawl (CC-MAIN-2018-05, January 2018) is contained
in the columnar index. We'll add more archives as soon as the format is stable.

Thanks,
Sebastian


[1] https://github.com/commoncrawl/cc-index-table/
[2] https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
[3] http://parquet.apache.org/
[4] http://spark.apache.org/
[5] https://prestodb.io/
[6] https://aws.amazon.com/athena/