How can we use the Host- and Domain-Level Web Graphs efficiently?

zhan su

Feb 27, 2023, 7:05:04 AM

to Common Crawl

Thanks for your great work on the Common Crawl dataset. I have a question about the graph dataset. Can I use the file "cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz" directly to create a database on Amazon's platform (Athena)? I want to look up historical website ranks. It would be convenient to have a database so that I can query the historical PageRank directly.

Sebastian Nagel

Feb 28, 2023, 6:06:35 AM

to common...@googlegroups.com

Hi,

I think it's possible to ingest a gzipped, tab-separated text file into Athena,
see [1,2]. However, querying the data would be quite inefficient.
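As a rough sketch of what such an ingestion could look like, the snippet below
only builds the DDL string for an external table over the gzipped file. The
table name, the S3 location and the column names and types are assumptions (the
column names are modelled on the header line of the ranks file and should be
checked against the actual file). The statement itself would still have to be
run in Athena after copying the file to your own bucket.

```python
def build_host_ranks_ddl(table: str, s3_location: str) -> str:
    """Return an Athena CREATE EXTERNAL TABLE statement for a tab-separated file.

    Assumptions (not taken from the thread): the column names and types, and
    that the gzipped ranks file has been copied into the folder `s3_location`.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  harmonicc_pos BIGINT,\n"
        "  harmonicc_val DOUBLE,\n"
        "  pr_pos BIGINT,\n"
        "  pr_val DOUBLE,\n"
        "  host_rev STRING\n"
        ")\n"
        # The doubled backslash puts a literal \t into the SQL text.
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        # Skip the header line that starts with '#'.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical table name and bucket, for illustration only.
ddl = build_host_ranks_ddl("host_ranks", "s3://my-bucket/webgraph/host-ranks/")
print(ddl)
```

With a table like this, every query has to decompress and scan the whole text
file, even to look up a single host, which is the inefficiency mentioned above.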

The text files containing the rankings (and also the vertex labels and edges)
were introduced in 2017 [3], following the format used for the Common Search
webgraphs [4]. The aim was a universally readable format without
dependencies on specific tools or software libraries.
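To illustrate the "no dependencies" point, here is a minimal sketch that reads a
ranks file using only the Python standard library. The column layout (five
tab-separated fields, a header line starting with "#", host names in reversed
notation such as "org.example") is an assumption that should be verified against
the file you download. The names HostRank, read_host_ranks and lookup_host are
made up for this example.

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class HostRank(NamedTuple):
    # Assumed column layout of the ranks file; check the file's header line.
    harmonicc_pos: int
    harmonicc_val: float
    pr_pos: int
    pr_val: float
    host_rev: str


def read_host_ranks(path: str) -> Iterator[HostRank]:
    """Stream rows from a gzipped, tab-separated ranks file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # header line
            hc_pos, hc_val, pr_pos, pr_val, host_rev = line.rstrip("\n").split("\t")
            yield HostRank(int(hc_pos), float(hc_val), int(pr_pos), float(pr_val), host_rev)


def lookup_host(path: str, host_rev: str) -> Optional[HostRank]:
    """Linear scan for one host; returns None if the host is not in the file."""
    for row in read_host_ranks(path):
        if row.host_rev == host_rev:
            return row
    return None
```

A call such as lookup_host("cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz",
"org.example") works without any extra tooling, but it scans the file from the
start on every lookup.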

However, perhaps it's time to explore a more efficient storage format to hold
the rankings and eventually merge them with the vertex file. The tools for
reading Parquet files have improved significantly in recent years, and Parquet
readers are now available for virtually all programming languages.

Thanks for the suggestion! There's now an open request in the cc-webgraph
project on GitHub to track this idea. Feel free to comment on [5] to specify
your use case. Any other suggestions or even objections are also welcome!

> I want to know the historical websites rank.

Does this mean that you're also interested in the rankings derived from previous
graphs, or just in the most recent one?

Best,
Sebastian

[1] https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html
[2] https://docs.aws.amazon.com/athena/latest/ug/compression-support-iceberg.html
[3] https://commoncrawl.org/2017/05/hostgraph-2017-feb-mar-apr-crawls/
[4] https://web.archive.org/web/20171117063402/https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank/
[5] https://github.com/commoncrawl/cc-webgraph/issues/7

On 2/27/23 13:05, zhan su wrote:
> Thanks for your great work for the common crawl dataset. I have a question about
> the graph dataset. Can I use the file
> "cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz
> <https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/host/cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz>"
> directly to creat a database from amazon platform(Athena)? I want to know the
> historical websites rank. It is convenient to have a database so I can search
> the historical pagerank directly.
