Hi Jay,
> I am writing a book ...
> We have a chapter completely dedicated to exploring common crawl datasets ...
Sounds great! And feel free to start as many threads as there are open questions
about Common Crawl data.
> exploring page level webgraphs to rank pages. Was that idea ever implemented by common crawl and
> some page level graph data made public?
It wasn't realized. There are approximately 50 billion unique URLs linked in one monthly crawl,
so it would be painful to build such a large graph and rank its nodes.
> or are weights from host level web graphs being used currently to drive the frequency of
> crawling the same pages via monthly crawls?
Yes. Domain-level and host-level ranks based on harmonic centrality are mapped
- from domain homepages to linked pages using OPIC
- or distributed to pages selected and weighted by inlink counts
Most importantly, the domain-level ranks are used to define how many pages and subdomains are
allowed to be crawled from each domain.
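To make the second mapping above a bit more concrete, here is a rough sketch, not our actual
crawler code: a host-level rank is split over a host's pages in proportion to their inlink
counts. The URLs, the counts and the rank value are all made up for illustration.

// Illustrative only -- hypothetical inputs, not the real CrawlDb logic.
import java.util.LinkedHashMap;
import java.util.Map;

public class RankDistributionSketch {

    /** Split a host-level rank across its pages, proportional to inlink counts. */
    static Map<String, Double> distributeRank(double hostRank, Map<String, Integer> inlinkCounts) {
        long totalInlinks = inlinkCounts.values().stream().mapToLong(Integer::longValue).sum();
        Map<String, Double> pageScores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : inlinkCounts.entrySet()) {
            // each page gets a share of the host rank weighted by its inlink count
            pageScores.put(e.getKey(), hostRank * e.getValue() / (double) totalInlinks);
        }
        return pageScores;
    }

    public static void main(String[] args) {
        Map<String, Integer> inlinks = new LinkedHashMap<>();
        inlinks.put("https://example.com/", 120);         // homepage, most inlinks
        inlinks.put("https://example.com/about", 15);
        inlinks.put("https://example.com/blog/post-1", 5);

        distributeRank(0.42, inlinks)                     // 0.42 = made-up host rank
            .forEach((url, score) -> System.out.printf("%.4f  %s%n", score, url));
    }
}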
> there is something known as transpose of the graph in compressed BVGraphs format; it probably is
> what I am looking for but I just want to confirm it here especially since file size seems larger
> (8.7GB) than the other graph file (7.40GB).
Yes, that's correct: the transpose contains the same edges with their direction reversed, so a
node's successors in the transposed graph are its predecessors (in-links) in the original graph.
For most of our host/domain graphs (but not all) the transposed graphs are larger.
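If it helps, here is a minimal sketch of reading both graphs with the WebGraph library
(it.unimi.dsi.webgraph). The basenames and the node id are placeholders; random access needs
the offsets file next to the .graph file, and node ids map to host names via the accompanying
vertices file.

// Minimal sketch, placeholder basenames -- substitute the files you downloaded.
import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

public class TransposeSketch {
    public static void main(String[] args) throws Exception {
        BVGraph graph = BVGraph.load("cc-host-graph");        // forward graph: successors = out-links
        BVGraph transpose = BVGraph.load("cc-host-graph-t");  // transposed graph: successors = in-links

        int node = 42;  // arbitrary node id, just for illustration
        System.out.println("out-degree: " + graph.outdegree(node));
        System.out.println("in-degree:  " + transpose.outdegree(node));

        // iterate over the node's in-links (its successors in the transposed graph)
        LazyIntIterator inLinks = transpose.successors(node);
        for (int pred; (pred = inLinks.nextInt()) != -1; ) {
            System.out.println("linked from node " + pred);
        }
    }
}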
Best,
Sebastian