Large hyperlink graph published, covering 3.5 billion web pages and 128 billion hyperlinks

Robert Meusel

Nov 12, 2013, 08:31:35

to web-data...@googlegroups.com

Hi all,

the Web Data Commons team is happy to announce the publication of a new large hyperlink graph.

The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.

The graph can be downloaded in various formats from http://webdatacommons.org/hyperlinkgraph

We provide initial statistics about the topology of the graph at http://webdatacommons.org/hyperlinkgraph/topology.html

We hope that the graph will be useful for researchers who:

Develop search algorithms that rank results based on the hyperlinks between pages.
Develop spam detection methods that identify networks of web pages published in order to trick search engines.
Develop graph analysis algorithms and can use the hyperlink graph to test the scalability and performance of their tools.
Work in Web Science and want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

We want to thank the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors a lot.

Best Regards,

Chris, Oliver & Robert

Akshay Bhat

Nov 27, 2013, 21:48:24

to web-data...@googlegroups.com

Hi, I tried downloading it, but it seems that the bucket is not made public.

s3cmd get --add-header=x-amz-request-payer:requester s3://wgc-2012-data/index/index-00054.gz --force

s3://wgc-2012-data/index/index-00054.gz -> ./index-00054.gz [1 of 1]

ERROR: S3 error: 403 (Forbidden):

aws s3 cp s3://wgc-2012-data/index/index-00010.gz temp.gz

A client error (AccessDenied) occurred: Access Denied

Robert Meusel

Nov 28, 2013, 07:51:08

to web-data...@googlegroups.com

Hi,

I just rechecked, and both the bucket and the files are readable and downloadable for authenticated users. Please make sure you are using the right version of s3cmd (1.5.0-alpha1), as the standard version does not support extra headers.

Cheers,

Robert

Akshay Bhat

Nov 28, 2013, 21:38:50

to web-data...@googlegroups.com

I still cannot download it. I am using the latest version of Amazon's official awscli.
I can download the Common Crawl data (which is similarly shared using Requester Pays), but I cannot download your files.

E.g.

aws-test aub3$ aws s3 cp s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz temp.gz

Completed 4 of 20 part(s) with 1 file(s) remaining

aub3$ aws s3 cp s3://wgc-2012-data/webgraph/original/network.graph temp.graph

A client error (AccessDenied) occurred: Access Denied

Robert Meusel

Nov 29, 2013, 05:34:32

to web-data...@googlegroups.com

Hi,

Unfortunately, the data is not made available in the same way as the Common Crawl. Their data is stored in a public bucket, which means Amazon takes care of all charges (storage, requests, access). We do not have a public bucket, so we pay for the storage. Because we use the Requester Pays option (http://docs.aws.amazon.com/AmazonDevPay/latest/DevPayDeveloperGuide/S3RequesterPays.html), authenticated users who want to download have to explicitly add the "requester-pays" header to signal that they are aware of the charges made to their account.

Please retry and add the header to your download (s3cmd uses the --add-header=x-amz-request-payer:requester option).
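For anyone scripting the download rather than using s3cmd, the same header can also be sent from Python. The following is only a rough sketch, assuming the third-party boto3 library is installed and AWS credentials are configured; the helper names are made up for illustration, and the bucket and key are the ones quoted earlier in this thread. boto3 exposes the x-amz-request-payer header as the RequestPayer parameter.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs a bucket and a key: {uri}")
    return bucket, key


def requester_pays_extra_args():
    """Extra arguments that tell S3 the requester accepts the charges."""
    return {"RequestPayer": "requester"}


def download_requester_pays(uri, dest):
    """Download one object from a Requester Pays bucket to a local file."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    bucket, key = split_s3_uri(uri)
    s3 = boto3.client("s3")
    # The requester's AWS account is billed for the request and the transfer.
    s3.download_file(bucket, key, dest, ExtraArgs=requester_pays_extra_args())


# Example call (needs valid credentials and incurs charges):
# download_requester_pays("s3://wgc-2012-data/index/index-00054.gz", "index-00054.gz")
```

Without the RequestPayer argument the request should be rejected with the same 403 / Access Denied error shown above.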

Hope this helps,

Robert

Akshay Bhat

Nov 29, 2013, 08:37:10

to web-data...@googlegroups.com

Thanks, that worked (I used the latest version of s3cmd).

Tie hky

Sep 7, 2014, 16:40:26

to web-data...@googlegroups.com

Hi Robert,

I am interested in the hyperlink data and want to download the data.

Would you please answer the following questions? Thanks.

1. Hyperlink graph 2012:

For the Index/Arc files, the total size is 376 GB: 45 GB (Index files) and 331 GB (Arc files).

But for the WebGraph files, the total size is 56 GB: 52 GB (network.graph) + 4 GB (network.offsets) + 1.5 MB (network.properties).

The sizes are significantly different. Is that normal?

2. Hyperlink Graph 2014:

The size of the Index/Arc files is 20 GB.

The size of the WebGraph files is 22.1 GB: 20 GB (webgraph.graph) and 2.1 GB (webgraph.offsets).

This raises two questions:

(1) The sizes of the two formats are almost the same. Why are the sizes of the two formats significantly different in the 2012 data?

(2) Why is the 2014 data so much smaller than the 2012 data, when the number of hyperlinks in 2014 is about half of that in 2012?

Robert Meusel

Sep 8, 2014, 03:39:09

to web-data...@googlegroups.com

Hi Tie,

Great to hear that you are interested in using the graph data.

The sizes of the files you named for 2012 differ because the Index and Arc files are plain text. In the Arc file, each line has two long values, representing the IDs of two pages within the graph. In the Index file, each line has the URL of a page and the corresponding ID. The WebGraph files use a dedicated format that is highly optimized for graphs, so the files are much smaller.

For 2014 you are comparing the size of the Index with the size of the WebGraph files, which you should not do: the Index, as described above, includes the URLs as plain text, while the WebGraph files only include an optimized binary representation of the network/graph using just IDs (longs or ints).

As the 2012 graph exceeds the INTEGER range, we need to use LONG, resulting in larger files. The 2014 graph stays below MAX_INTEGER, so the WebGraph files use INT.
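To make the two plain-text layouts concrete, here is a small Python sketch. It assumes the fields on each line are separated by whitespace (a tab or a space), which is not stated in this thread, and the sample lines and function names are made up for illustration rather than taken from the real files.

```python
def parse_arc_line(line):
    """Return the two page IDs found on one Arc line."""
    first, second = line.split()
    return int(first), int(second)


def parse_index_line(line):
    """Return (url, page_id) from one Index line; the ID is the last field."""
    url, page_id = line.rsplit(None, 1)
    return url, int(page_id)


def resolve_arcs(index_lines, arc_lines):
    """Replace the IDs in each arc with the URLs listed in the index."""
    id_to_url = {}
    for line in index_lines:
        url, page_id = parse_index_line(line)
        id_to_url[page_id] = url
    return [
        (id_to_url[a], id_to_url[b])
        for a, b in (parse_arc_line(line) for line in arc_lines)
    ]


# Hypothetical sample data, not real entries from the graph.
sample_index = ["http://example.com/a\t0\n", "http://example.com/b\t1\n"]
sample_arcs = ["0\t1\n"]
resolved = resolve_arcs(sample_index, sample_arcs)
```

For the full files an in-memory dictionary like this would be far too large; the sketch is only meant to show how the two files relate to each other.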
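The range argument can be checked with a few lines of Python. This sketch assumes that INT means a 32-bit signed integer as in Java, and it uses the 3.5 billion page count from the announcement at the top of this thread; the function name is made up for illustration.

```python
# Largest value of a 32-bit signed integer (Integer.MAX_VALUE in Java).
INT_MAX = 2**31 - 1


def fits_in_int(node_count):
    """True if node IDs 0..node_count-1 can all be stored as 32-bit signed ints."""
    return 0 <= node_count - 1 <= INT_MAX


pages_2012 = 3_500_000_000  # page count given for the 2012 graph
needs_long_2012 = not fits_in_int(pages_2012)
```

With about 3.5 billion pages the check fails, so the 2012 IDs need a 64-bit type, which matches the explanation above.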

Hope this helps,

Robert

Tie hky

Sep 11, 2014, 09:07:50

to web-data...@googlegroups.com

Thanks Robert.

WDC provides different formats for the WebGraph data. Would you please let me know what tools are used to convert the data into the different formats, and what the format of the original data is?

Because the WebGraph data is significantly smaller, I'd like to download it and retrieve some information from it.

I am new to WebGraph. Is it just a graph display tool, or does it provide functions to retrieve different information from the data?

Thanks again.
