page-level webgraphs? and opening BVGraph files


jay patel

Jun 27, 2020, 1:02:43 AM
to Common Crawl
Hi all,

I am writing a book for Apress (a Springer imprint), aimed at a Python/PySpark-fluent audience, titled "Getting Structured Data from the Internet: Web Crawling on Production Scale".

We have a chapter dedicated entirely to exploring Common Crawl datasets, so that our readers can work with real-world web corpora for practical applications discussed in the book, such as spam detection and graph analysis.

So I apologize in advance if I create multiple threads in the coming weeks asking for more details here.

OK, so the question I had today was about page-level webgraphs. I know that in one of the old threads here Sebastian mentioned that they were exploring page-level webgraphs to rank pages. Was that idea ever implemented by Common Crawl, and was any page-level graph data made public? Or are weights from the host-level webgraphs currently used to drive how frequently the same pages are re-fetched in the monthly crawls?

Secondly, with reference to the Feb/Mar/May 2020 host-level webgraphs (https://commoncrawl.org/2020/06/host-and-domain-level-web-graphs-febmarmay-2020/), the host edges text files represent outlinks, i.e. (from_id, to_id) pairs; however, for most practical applications we would need to perform link inversion on them to get inlinks. Not a big deal; these text files could probably be loaded into a SQL database and queried for inlinks.
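For instance, a minimal PySpark sketch of that inversion (untested here; the input path is just a placeholder for wherever the edges file lives):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cc-host-inlinks").getOrCreate()

    # Placeholder path: point this at the host-level edges file(s) of the
    # release. Each line is a tab-separated pair <from_id>\t<to_id>,
    # i.e. one outlink.
    edges = (spark.read
             .option("sep", "\t")
             .csv("cc-main-2020-feb-mar-may-host-edges.txt.gz")
             .toDF("from_id", "to_id"))

    # Link inversion: group the outlinks by their target node, so each row
    # gives a host id plus the ids of all hosts linking to it (its inlinks).
    inlinks = edges.groupBy("to_id").agg(
        F.collect_list("from_id").alias("inlink_from_ids"))

    inlinks.show(5, truncate=False)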

However, I noticed that there is something called the transpose of the graph in the compressed BVGraph format; it is probably what I am looking for, but I just want to confirm it here, especially since its file size (8.7 GB) is larger than that of the other graph file (7.4 GB).

Lastly, I have only ever worked with BVGraph using the framework described by Sebastiano Vigna et al. (http://webgraph.di.unimi.it/). Does anyone have examples or blog posts showing how to work with these files in pure CPython 3.x? The official documentation mentions a Python 2.x, Jython-based package (https://github.com/mapio/py-web-graph), but I would rather not complicate things for my readers by introducing Java packages, Jython querying, etc.


Tom Morris

Jun 27, 2020, 2:25:01 PM
to common...@googlegroups.com
On Sat, Jun 27, 2020 at 1:02 AM jay patel <jaypa...@gmail.com> wrote:

> Lastly, I have only ever worked with BVGraph using the framework described by Sebastiano Vigna et al.
> (http://webgraph.di.unimi.it/). Does anyone have examples or blog posts showing how to work with these
> files in pure CPython 3.x? The official documentation mentions a Python 2.x, Jython-based package
> (https://github.com/mapio/py-web-graph), but I would rather not complicate things for my readers by
> introducing Java packages, Jython querying, etc.

I've never used it, but it looks like one possibility might be https://pypi.org/project/pylibbvg/, which is built on top of the C library https://github.com/dgleich/libbvg.
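Going by its README, usage looks roughly like the sketch below. I haven't tested it, so treat the graph basename as a placeholder and the attribute/method names as assumptions to verify against the installed version:

    # Untested sketch based on the pylibbvg README; the attribute and
    # method names are assumptions to check against the installed version.
    import bvg  # pip install pylibbvg

    # Load a graph by its basename (expects the .graph/.properties/.offsets
    # files side by side); an offset step of 1 enables random access.
    G = bvg.BVGraph('cc-main-2020-feb-mar-may-host', 1)

    print(G.nverts, G.nedges)  # node and edge counts

    # Stream the edges sequentially without decompressing the whole graph.
    for (src, dst) in G.edges():
        print(src, '->', dst)
        break  # just the first edge, as a smoke test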

Tom 

jay patel

Jun 28, 2020, 7:02:41 AM
to Common Crawl
Thanks a lot, Tom, for pointing that out. Somehow I had missed this library!

Sebastian Nagel

Jun 29, 2020, 10:04:38 AM
to common...@googlegroups.com
Hi Jay,

> I am writing a book ...
> We have a chapter completely dedicated to exploring common crawl datasets ...

Sounds great! And feel free to start as many threads as there are open questions
about Common Crawl data.


> exploring page-level webgraphs to rank pages. Was that idea ever implemented by Common Crawl,
> and was any page-level graph data made public?

It wasn't realized. There are approx. 50 billion unique URLs linked in one monthly crawl;
it would be painful to build graphs that large and rank their nodes.


> Or are weights from the host-level webgraphs currently used to drive how frequently the same
> pages are re-fetched in the monthly crawls?

Yes. Domain-level and host-level ranks based on harmonic centrality are mapped:
- from domain homepages to linked pages using OPIC,
- or distributed to pages selected and weighted by inlink counts.

Most importantly, the domain-level ranks are used to define how many pages and subdomains are
allowed to be crawled from each domain.
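For illustration only, a toy sketch (hypothetical names and numbers, not our actual implementation) of the inlink-weighted variant:

    # Toy sketch, not Common Crawl's production code: spread a host-level
    # rank over a selection of its pages, proportionally to inlink counts.
    def distribute_rank(host_rank, page_inlink_counts):
        """page_inlink_counts: dict mapping page URL -> number of inlinks."""
        total = sum(page_inlink_counts.values())
        return {url: host_rank * count / total
                for url, count in page_inlink_counts.items()}

    # Hypothetical numbers, purely for illustration:
    print(distribute_rank(0.9, {"https://example.com/": 10,
                                "https://example.com/about": 4,
                                "https://example.com/blog": 1}))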


> there is something called the transpose of the graph in the compressed BVGraph format; it is
> probably what I am looking for, but I just want to confirm it here, especially since its file
> size (8.7 GB) is larger than that of the other graph file (7.4 GB).

Yes, that's correct. In the transposed graph every edge is reversed, so the successors of a node there are exactly its inlinks in the original graph. For most of our host/domain graphs (but not all), the transposed graph is larger.

Best,
Sebastian