Link to Graph node data is not working


Tom Alby

Feb 3, 2021, 12:10:56 PM
to Common Crawl
Hi,

in the last blog post about graph data (https://commoncrawl.org/2020/10/host-and-domain-level-web-graphs-julaugsep-2020/), the node data is linked to https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/cc-main-2020-jul-aug-sep-host-vertices.paths.gz , but that link seems to be broken: I get a 162-byte binary file. I couldn't find the file in the S3 bucket either.

All other data works for me. Any idea where that file is?

Best

Tom

Sebastian Nagel

Feb 3, 2021, 1:21:14 PM
to common...@googlegroups.com
Hi Tom,

the link points to a gzip-compressed file listing the paths of the 12 vertices files.
By adding the prefix https://commoncrawl.s3.amazonaws.com/ to each path you get the
list of URLs to download all vertices files. Same procedure for the edges.
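
The procedure described above could be sketched in Python roughly as follows. This is a minimal sketch, not an official Common Crawl tool: the function name `paths_to_urls` is hypothetical, the sample paths in the usage part are made-up placeholders rather than real file names, and it assumes the paths file holds one path per line.

```python
import gzip

# Prefix named in the reply above.
PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def paths_to_urls(paths_gz_bytes, prefix=PREFIX):
    """Decompress a *.paths.gz payload and turn each listed path into a URL.

    Assumes one path per line; blank lines are skipped.
    """
    text = gzip.decompress(paths_gz_bytes).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder paths for illustration only, not real file names.
    sample = gzip.compress(b"example/part-00000.txt.gz\nexample/part-00001.txt.gz\n")
    for url in paths_to_urls(sample):
        print(url)
```

For the real file, the bytes of the downloaded *.paths.gz file (read in binary mode) would be passed in instead of the sample payload.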

While the domain graph is small enough to fit into two single files (one for the node
names and one for the edges), the host-level graph is shipped in multiple files.

> All other data works for me.

If you already downloaded the *.graph/*.properties files, you could just use them to explore the graphs.
Alex Xue recently wrote a tutorial on how to explore the graphs in the webgraph format:
https://github.com/commoncrawl/cc-notebooks/tree/master/cc-webgraph-statistics

Best,
Sebastian





Tom Alby

Feb 3, 2021, 1:47:38 PM
to Common Crawl
Thanks, Sebastian.
I now understand what the issue was: macOS did not gunzip the file properly, resulting in a rubbish file. With gunzip on my Linux computer, it worked just fine. Sorry about the confusion.
Best
Tom

Sebastian Nagel

Feb 3, 2021, 2:26:26 PM
to common...@googlegroups.com
Hi Tom,

thanks for the feedback. I'll update the description to make it
clear that the host-level graphs are shipped in multiple files. I
also briefly considered using a different packaging format for the
next webgraph release (expected in about one week). But the
multi-file gzipped packaging makes it easier to read the graph
using big data tools. In addition, the host-level graphs may vary
in size, so two single files could become difficult to handle.
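
One way to see why the multi-file gzipped layout is convenient: each part can be decompressed and processed on its own, so a reader never needs the whole graph in a single file. A minimal Python sketch, assuming the parts have already been downloaded to local files; `iter_lines` is a hypothetical helper, and nothing is assumed about what each line contains.

```python
import gzip
from typing import Iterable, Iterator


def iter_lines(part_files: Iterable[str]) -> Iterator[str]:
    """Yield text lines from several gzip-compressed parts, one part at a time."""
    for path in part_files:
        # Text mode ("rt") decompresses and decodes while streaming,
        # so only a small buffer is held in memory at any time.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Because the parts are independent, the same loop body could also be handed to separate workers, one part per worker.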

Best,
Sebastian


Tom Morris

Feb 3, 2021, 9:03:13 PM
to common...@googlegroups.com
I get "error 79 - Inappropriate file type or format" when I double-click on the file in OS X. However, the `file` command recognizes it as a gzip file and the `gunzip` command unzips it from a terminal window without problems, so I chalk this up to an OS X Archive Utility bug or weirdness. I tried a couple of simple things like renaming it to .txt.gz instead of .paths.gz, but that didn't improve matters.

Tom
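
For anyone who wants to check a downloaded file without relying on a desktop archive tool, the two terminal steps mentioned above can be approximated in Python. This is only a sketch: `looks_like_gzip` and `safe_gunzip` are hypothetical helper names, and the check is limited to the two-byte gzip signature defined in RFC 1952, which is far less thorough than what the `file` command does.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the payload starts with the gzip signature."""
    return data[:2] == GZIP_MAGIC


def safe_gunzip(data: bytes) -> bytes:
    """Decompress a gzip payload, refusing input without the signature."""
    if not looks_like_gzip(data):
        raise ValueError("not a gzip stream")
    return gzip.decompress(data)
```

If `looks_like_gzip` returns True for the downloaded bytes and `safe_gunzip` yields readable text, the download itself is fine and the problem lies with the tool used to open it.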
