Host- and domain-level web graph data sets of Nov/Dec/Jan 2017/2018 crawls


Sebastian Nagel

Feb 8, 2018, 3:43:28 AM
to common...@googlegroups.com
Hi all,

we're pleased to announce the 4th release of our webgraph data set.
More details and links to download the data set can be found on our blog:

http://commoncrawl.org/2018/02/webgraphs-nov-dec-2017-jan-2018/

As usual we provide for both the host-level and domain-level graph:
- graph in text format (vertices and edges)
- graph for use in the webgraph framework [1]
- harmonic centrality and page rank
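
For readers who want to work with the text-format graph, the following is a minimal Python sketch of loading it. It assumes (this is not stated in the announcement) that the vertices file holds one tab-separated `id<TAB>name` pair per line and the edges file one `from_id<TAB>to_id` pair per line. The function names and the sample data are hypothetical.

```python
import gzip
import io
from collections import defaultdict


def read_vertices(lines):
    """Map node id -> node name from 'id<TAB>name' lines (assumed layout)."""
    vertices = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        node_id, name = line.split("\t", 1)
        vertices[int(node_id)] = name
    return vertices


def read_edges(lines):
    """Build adjacency lists from 'from_id<TAB>to_id' lines (assumed layout)."""
    out_links = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        src, dst = line.split("\t", 1)
        out_links[int(src)].append(int(dst))
    return out_links


def open_text(path):
    """Open a plain or gzip-compressed text file for line-by-line reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


# Tiny in-memory sample with made-up names, standing in for the real files.
sample_vertices = io.StringIO("0\torg.example\n1\torg.example.www\n2\tcom.example\n")
sample_edges = io.StringIO("0\t1\n0\t2\n1\t2\n")

vertices = read_vertices(sample_vertices)
out_links = read_edges(sample_edges)
```

With real files, `open_text(path)` can be passed to either reader in place of the in-memory sample. Holding a full graph in Python dictionaries is unlikely to fit in memory at this scale, so processing the edges as a stream is probably the safer choice.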


Best,
Sebastian


[1] http://webgraph.di.unimi.it/

Sebastian Nagel

Feb 21, 2018, 11:31:54 AM
to common...@googlegroups.com
Hi,

apologies for this issue...

Because of a bug [1], only links from the January 2018 crawl
were used in the "Nov/Dec/Jan 2017/2018" webgraph release.

Of course, the release will be fixed to include all links from
all 3 monthly crawls. This will affect the host- and domain-level graphs
and also the rankings. We'll eventually also keep the erroneous
release (but will correct the release path and file names).

Note that previous releases are not affected: although the bug [1] was
already present in the first version of the shell script that builds
the host-level graphs, whether it takes effect also depends on how the
script was called (in one run, or step by step for each monthly crawl
and the merged graph).

The short story of how the bug was uncovered: while reading a paper about
the July 2017 webgraph release [2], I (finally) wondered why the
domain-level graph is about 25% smaller than those of the previous two
releases. For the host-level graph a smaller size was expected because
spam domains with large numbers of hosts/subdomains have been excluded
during the last crawls. However, that should have only a small impact
on the number of domains. A careful check of the log files then
provided the final evidence that the smaller size is better explained
by a bug. Very sorry about that...
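
A plausibility check of this kind can be automated. The sketch below is hypothetical and not part of cc-webgraph: it compares the node count of a new release with that of the previous one and flags a drop beyond a chosen threshold. The function names, the 10% threshold, and the example counts are assumptions for illustration only.

```python
def relative_change(previous, current):
    """Return the relative change from previous to current (-0.25 means -25%)."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    return (current - previous) / previous


def check_graph_size(previous_nodes, current_nodes, max_drop=0.10):
    """Return True if the node count did not shrink by more than max_drop."""
    return relative_change(previous_nodes, current_nodes) >= -max_drop


# Illustrative numbers only, not taken from any release.
previous_nodes = 100_000_000
current_nodes = 75_000_000  # a 25% drop

ok = check_graph_size(previous_nodes, current_nodes)
```

A failed check does not prove a bug, since a smaller graph can have legitimate causes such as the spam filtering mentioned above. It only signals that the logs deserve a closer look.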

Best,
Sebastian

[1] https://github.com/commoncrawl/cc-webgraph/commit/0a406f6c988678bc480340d17a2415442f75dc9a
[2] https://arxiv.org/abs/1802.05435

Sebastian Nagel

Feb 23, 2018, 3:41:23 AM
to common...@googlegroups.com, Lukasz Bilangowski
Hi Lukasz,

I expect it to be ready at the end of next week.

Best,
Sebastian

On 02/23/2018 12:20 AM, Lukasz Bilangowski wrote:
> Hi Sebastian,
>
> Bugs happen and I wanted to let you know your transparency is much appreciated.
>
> Do you maybe know when more or less the fixed release can be expected to arrive on S3?
>
> Best regards,
> Lukasz
> Webfinery
>> --
>> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
>> To post to this group, send email to common...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/common-crawl.
>> For more options, visit https://groups.google.com/d/optout.
>

Sebastian Nagel

Feb 28, 2018, 11:08:30 AM
to common...@googlegroups.com
Hi everyone,

the fixed version is now available; you'll find the updated download links at
http://commoncrawl.org/2018/02/webgraphs-nov-dec-2017-jan-2018/

The graphs now fit the expected size:
- domain-level graph: 94 million nodes, 1.44 billion edges
- host-level graph: 2.75 billion nodes, 8.6 billion edges

The "broken" graphs are kept with the correct name cc-main-2018-jan-*,
see the note in the blog post.

Best,
Sebastian

Soner Altin

Mar 26, 2018, 11:40:40 AM
to Common Crawl
Hi all,

This version doesn't have the ranks.txt.gz file (hosts ranked by harmonic centrality and PageRank) for the host graph, which older versions had. Is there a reason for this, or am I looking at the wrong path?

Thanks in advance!

Sebastian Nagel

Mar 26, 2018, 1:37:44 PM
to common...@googlegroups.com
Hi,

> Is there any reason for this or am I looking for a wrong path?

No, you're looking in the right place. Sorry, but I took the host-level ranks down
a week ago because the list was incomplete. I first noticed it
when using the ranks to feed the March crawl, but didn't have the time to look
into the problem. Somewhere in
https://github.com/commoncrawl/cc-webgraph/blob/master/src/script/webgraph_ranking/process_webgraph.sh
an error while sorting and joining the ranks is not properly caught and causes
the script to exit.
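
To illustrate the failure mode, here is a minimal Python sketch of a join that fails loudly instead of leaving an incomplete list behind. It is not the code in process_webgraph.sh. The function name, the input layout (node id mapped to a score), and the sample values are assumptions.

```python
def join_ranks(harmonic, pagerank):
    """Join two {node_id: score} mappings into (id, harmonic, pagerank) rows.

    Raises ValueError if the two inputs do not cover the same node ids,
    so that an incomplete ranks list is never returned.
    """
    mismatched = harmonic.keys() ^ pagerank.keys()
    if mismatched:
        raise ValueError(f"{len(mismatched)} node ids are missing from one input")
    joined = [(node_id, harmonic[node_id], pagerank[node_id]) for node_id in harmonic]
    # Order rows by harmonic centrality, highest first.
    joined.sort(key=lambda row: row[1], reverse=True)
    return joined


# Made-up scores for three nodes.
harmonic = {0: 3.0, 1: 5.0, 2: 1.0}
pagerank = {0: 0.2, 1: 0.5, 2: 0.3}

ranks = join_ranks(harmonic, pagerank)

# An input with a missing node is rejected rather than silently truncated.
try:
    join_ranks(harmonic, {0: 0.2, 1: 0.5})
    incomplete_detected = False
except ValueError:
    incomplete_detected = True
```

For a list of the size discussed here, in-memory dictionaries are probably not practical. The same completeness check could instead be applied while merging two sorted files as streams.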

I plan to fix or even rewrite this part (sorting and joining a list of 2+ billion
entries is slow anyway) when preparing the next release of the webgraph (early in May
for Feb/Mar/Apr). For now I can only offer the domain-level ranks or the previous
host-level ranks as an alternative. Sorry.

Best,
Sebastian



Sebastian Nagel

May 7, 2018, 10:45:52 AM
to Common Crawl
Hi Soner,

the ranks file is now available and complete.

Thanks for your patience,
Sebastian

Soner Altin

May 7, 2018, 1:20:38 PM
to common...@googlegroups.com
Hi Sebastian,

That's great, thanks a lot!

--
Best,

Soner