Web Graph "\n" on domain name


Mark Smith

Jun 2, 2017, 9:53:24 AM
to Common Crawl
Hello,

I have another data question I'd like to get clarified.

I'm using the Web Graph data.

I've downloaded the latest vertices.txt.gz that was posted this weekend.

I notice that there are some hostnames that end in \n. Here's an example:

gzcat vertices.txt.gz | grep -n "org.cimm-us.www"
353936797:353936796 org.cimm-us.www\n


If I cross check that against ranks.txt.gz, the \n does not appear:

gzcat ranks.txt.gz | grep -n "org.cimm-us.www"
20864878:20864901 16910086 27970826 2.46094778903277e-09 org.cimm-us.www

I'm assuming it's just a small data issue, but I wanted to make sure the \n doesn't carry a different meaning.

-Mark

Sebastian Nagel

Jun 2, 2017, 10:13:43 AM
to Common Crawl
Hi Mark,

that's due to a "hot fix" for superfluous newlines in the text export of the graph, see
  https://groups.google.com/d/topic/common-crawl/OebIGTP4GMs/discussion

If in doubt, just ignore all nodes containing a \n.

Thanks,
Sebastian
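
For anyone who wants to apply that advice on the command line, here is a minimal sketch of such a filter. It assumes the stray \n is a literal backslash followed by the letter n (which is what the grep output earlier in the thread suggests), not a real line break. The function name drop_newline_nodes and the output file name are hypothetical, chosen only for illustration.

```shell
# Assumption: the stray "\n" in vertices.txt.gz is the two-character
# sequence backslash + "n", not an actual newline.
# drop_newline_nodes is a hypothetical helper name, not part of any tool.
drop_newline_nodes() {
    # -F: match a fixed string, so the backslash is not a regex escape
    # -v: keep only the lines that do NOT contain the sequence \n
    grep -vF '\n'
}

# Usage sketch: write a cleaned copy of the vertex list.
# gzip -dc vertices.txt.gz | drop_newline_nodes | gzip > vertices.clean.txt.gz
```

This drops the affected lines entirely rather than stripping the trailing \n, which matches the suggestion to ignore those nodes. Any code that later looks up node IDs should be prepared for the dropped IDs to be missing from the cleaned file.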