In-House Web Graph vertices.txt format confusion

Mark Smith

May 28, 2017, 11:09:02 AM
to Common Crawl
Hello,

I'm processing the new in-house web graph and am confused by something in vertices.txt.gz

I'm expecting every line to contain two fields but I'm finding some lines do not.

Here's an example case:
gzcat gz/vertices.txt.gz | tail -n +311896710 | more

311894452       il.clullle
311894453       il.cm
311894454       il.co
311894455       il.co
.zemereshet
311894456       il.co.0
311894457       il.co.0-15
311894458       il.co.0-5
311894459       il.co.004
311894460       il.co.004.mail


Notice the .zemereshet line. It has only a single field and appears to be out of sort order.

I count 2761 lines in that file that are either blank or are a single hostname.

What do those lines mean?

-Mark
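
For anyone who wants to reproduce the count, here is a minimal Python sketch. It assumes a well-formed line is a numeric id followed by whitespace and a host name; the function names are made up for illustration, and the path passed to `count_malformed_in_file` is whatever local copy of the file you have.

```python
import gzip
from typing import Iterable


def is_well_formed(line: str) -> bool:
    """True if the line looks like '<numeric id><whitespace><host>' (assumed layout)."""
    fields = line.split()
    return len(fields) == 2 and fields[0].isdigit()


def count_malformed(lines: Iterable[str]) -> int:
    """Count blank lines and lines that do not have exactly two fields."""
    return sum(1 for line in lines if not is_well_formed(line))


def count_malformed_in_file(path: str) -> int:
    # Stream the gzip file as text so it never has to fit in memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_malformed(handle)
```

Calling `count_malformed_in_file("gz/vertices.txt.gz")` streams the file once and returns the number of lines that fail the two-field check.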

Sebastian Nagel

May 28, 2017, 11:45:01 AM
to common...@googlegroups.com
Hi Mark,

Argh, I see. It was a bug: host names containing newlines were not
rejected while building the graph. The bug is already fixed [1], but the graph
is merged from three monthly crawls, and one or two of them still contain these
nodes. Because the graph is stored intermediately in Parquet, this doesn't matter
there, but of course it breaks the text format.

I'll fix vertices.txt.gz by tomorrow by joining the two lines with a '\n'
in between. This produces an invalid host name but avoids the same host
appearing twice.

If you can't wait, just ignore the line ".zemereshet".

Thanks,
Sebastian

[1] https://github.com/commoncrawl/cc-pyspark/commit/00913145e370f95b7a1fd23fa8410d6d01ca261d
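
The joining step described above could be sketched roughly as follows. This is only an illustration, not the code used to regenerate the file: it assumes that a record starts with a numeric id followed by whitespace and a host name, that any other line is a fragment of the previous host name, and that the fragments are rejoined with an escaped backslash-n. The function name `join_split_records` is hypothetical.

```python
from typing import Iterable, Iterator, Optional


def join_split_records(lines: Iterable[str]) -> Iterator[str]:
    """Re-attach stray fragments to the record before them (heuristic)."""
    pending: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        # Assumption: a real record has exactly two fields and a numeric id.
        starts_record = len(fields) == 2 and fields[0].isdigit()
        if starts_record or pending is None:
            if pending is not None:
                yield pending
            pending = line
        else:
            # Fragment of the previous host name: join with an escaped newline.
            pending += "\\n" + line
    if pending is not None:
        yield pending
```

With this sketch the broken pair from the example would come out as one record whose host name is `il.co\n.zemereshet`, which is invalid as a host name but keeps the vertex ids unique. A fragment that happens to look like an id plus a host would be misread as a new record, so treat the check as a heuristic.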



Mark Smith

May 28, 2017, 11:54:59 AM
to Common Crawl
Hi Sebastian,

Thanks for checking and letting me know. I'll ignore those lines.

-Mark

Sebastian Nagel

May 28, 2017, 12:22:19 PM
to common...@googlegroups.com
Hi Mark,

It's fixed now.

Thanks,
Sebastian