Hi Mark,
argh, I see. It was a bug: host names containing newlines were not
rejected while building the graph. The bug is already fixed [1], but the graph
is merged from three monthly crawls and one or two of them still contain these
nodes. Because the graph is stored in Parquet as an intermediate format, this
doesn't matter there, but it breaks the text format of course.
I'll fix vertices.txt.gz by tomorrow and join the two lines with an '\n'
in between. This makes an invalid host name but keeps the same host from
appearing twice.
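A rough sketch of that join, in case it helps to see it spelled out. This is not the actual fix script. It assumes each well-formed record is "<numeric id><TAB><host name>" and that the '\n' is written as the two literal characters backslash and n; the name join_continuations is made up for illustration.

```python
import re

# Assumption: a well-formed record starts with a numeric id followed by a tab.
RECORD_START = re.compile(r"^\d+\t")


def join_continuations(lines):
    """Yield one record per vertex, folding stray lines into the previous record."""
    pending = None
    for raw in lines:
        line = raw.rstrip("\n")
        if RECORD_START.match(line):
            # A new record begins, so the previous one is complete.
            if pending is not None:
                yield pending
            pending = line
        elif pending is not None:
            # Continuation of a host name that contained a newline: rejoin it
            # with a literal backslash-n so the record stays on one line.
            pending += "\\n" + line
        # A stray line before the very first record is dropped.
    if pending is not None:
        yield pending


# Usage sketch (file name taken from the thread):
#   import gzip
#   with gzip.open("vertices.txt.gz", "rt", encoding="utf-8") as f:
#       for record in join_continuations(f):
#           ...
```

A continuation line that happens to start with digits and a tab would be mistaken for a new record; the sketch does not try to handle that case.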
If you can't wait, just ignore the line ".zemereshet".
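For the workaround, something like the following should do as a minimal sketch. It again assumes tab-separated "<numeric id><TAB><host name>" records, and iter_vertices is a hypothetical helper name. It skips every line that does not have exactly two fields, which covers both the blank lines and the single-hostname lines.

```python
def iter_vertices(lines):
    """Yield (vertex_id, host) pairs, skipping lines that are not two fields."""
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        # Blank lines and stray host name fragments have no numeric id field.
        if len(fields) != 2 or not fields[0].isdecimal():
            continue
        yield int(fields[0]), fields[1]


# Usage sketch (file name taken from the thread):
#   import gzip
#   with gzip.open("vertices.txt.gz", "rt", encoding="utf-8") as f:
#       for vertex_id, host in iter_vertices(f):
#           ...
```

Note that the truncated first half of a broken record is still emitted, so a host such as il.co can show up under two different ids until the file is fixed.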
Thanks,
Sebastian
[1]
https://github.com/commoncrawl/cc-pyspark/commit/00913145e370f95b7a1fd23fa8410d6d01ca261d
On 05/28/2017 05:09 PM, Mark Smith wrote:
> Hello,
>
> I'm processing the new in-house web graph and am confused by something in vertices.txt.gz.
>
> I'm expecting every line to contain two fields but I'm finding some lines do not.
>
> Here's an example case:
> gzcat gz/vertices.txt.gz | tail -n +311896710 | more
>
> 311894452 il.clullle
> 311894453 il.cm
> 311894454 il.co
> 311894455 il.co
> .zemereshet
> 311894456 il.co.0
> 311894457 il.co.0-15
> 311894458 il.co.0-5
> 311894459 il.co.004
> 311894460 il.co.004.mail
>
>
> Notice that ".zemereshet" line. It is a single field and appears to be out of sort order.
>
> I count 2761 lines in that file that are either blank or are a single hostname.
>
> What do those lines mean?
>
> -Mark