How can we use the Host- and Domain-Level Web Graphs efficiently?
zhan su
Feb 27, 2023, 7:05:04 AM
to Common Crawl
Thanks for your great work on the Common Crawl dataset. I have a question about the graph dataset. Can I use the file "cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz" directly to create a database on the Amazon platform (Athena)? I want to know the historical ranks of websites. It would be convenient to have a database so that I can look up the historical PageRank directly.
Sebastian Nagel
Feb 28, 2023, 6:06:35 AM
to common...@googlegroups.com
Hi,
I think it's possible to ingest a gzipped, tab-separated text file into Athena,
see [1,2]. However, querying the data would be quite inefficient.
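As a rough illustration of such an ingestion, the sketch below builds a `CREATE EXTERNAL TABLE` statement for a tab-separated text file stored on S3. The column names and types, the single header line, the table name and the bucket path are all assumptions made for the example and should be checked against the actual ranks file.

```python
# Assumed column layout of a host ranks file; verify against the real header.
RANK_COLUMNS = [
    ("harmonicc_pos", "BIGINT"),
    ("harmonicc_val", "DOUBLE"),
    ("pr_pos", "BIGINT"),
    ("pr_val", "DOUBLE"),
    ("host_rev", "STRING"),
]


def athena_ddl(table, s3_location, columns=RANK_COLUMNS):
    """Build an Athena DDL statement for a tab-separated text table.

    `table` and `s3_location` are hypothetical names chosen by the caller.
    """
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        # Assumes the file starts with exactly one header line.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical table name and bucket, for illustration only.
ddl = athena_ddl("host_ranks", "s3://example-bucket/host-ranks/")
print(ddl)
```

A table defined this way would likely have to scan the whole gzipped file for every query, because gzip text is neither splittable nor columnar, which is one reason the queries are inefficient.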
The text files containing the rankings (and also the vertex labels and edges)
were introduced in 2017 [3], following the format used for the Common Search
webgraphs [4]. The aim was to have a universally readable format without
dependencies on specific tools or software libraries.
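To illustrate the "no dependencies" point, here is a minimal sketch that reads a gzipped, tab-separated ranks file with the Python standard library only. The column names, the "#"-prefixed header line and the reversed host name notation are assumptions about the file layout, and the sample rows are made-up values, not real rankings.

```python
import csv
import gzip
import io

# Assumed column layout of a host ranks file; verify against the real header.
COLUMNS = ["harmonicc_pos", "harmonicc_val", "pr_pos", "pr_val", "host_rev"]


def iter_ranks(source):
    """Yield one dict per data row; `source` is a path or a binary file object."""
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and the assumed header line
            yield dict(zip(COLUMNS, row))


def reverse_host(host):
    """Turn 'www.example.com' into 'com.example.www' (assumed notation)."""
    return ".".join(reversed(host.split(".")))


def find_host(source, host):
    """Linear scan for one host; returns the row as a dict, or None."""
    target = reverse_host(host)
    for record in iter_ranks(source):
        if record["host_rev"] == target:
            return record
    return None


def sample_ranks_gz():
    """Build a tiny in-memory gzip file with made-up rows for the demo."""
    lines = [
        "#harmonicc_pos\t#harmonicc_val\t#pr_pos\t#pr_val\t#host_rev",
        "1\t3.0E7\t2\t0.004\torg.example.www",
        "2\t2.5E7\t1\t0.009\tcom.example.www",
    ]
    raw = gzip.compress(("\n".join(lines) + "\n").encode("utf-8"))
    return io.BytesIO(raw)


print(find_host(sample_ranks_gz(), "www.example.com"))
```

Every lookup here is a full scan of the file, so this is only practical for occasional queries; it is not a substitute for an indexed or columnar store.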
However, perhaps it's time to explore a more efficient storage format to hold
the rankings and eventually merge them with the vertex file. The tools for
reading Parquet files have significantly improved in recent years and Parquet
readers are now available for virtually all programming languages.
Thanks for the suggestion! There's now an open request in the cc-webgraph
project on GitHub to track this idea. Feel free to comment on [5] to describe
your use case. Any other suggestions or even objections are also welcome!
> I want to know the historical websites rank.
Does this mean that you're also interested in the rankings derived from previous
graphs or just the most recent?
On 2/27/23 13:05, zhan su wrote:
> Thanks for your great work for the common crawl dataset. I have a question about
> the graph dataset. Can I use the file
> "cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz