Long time baseline web graph


Jeff Allen

Apr 28, 2021, 12:18:17 PM
to Common Crawl

Hi All,

I'm working on a project that involves a long time baseline calculation of PageRank/centrality, with a custom URL normalization (so not just host/domain level).

I've been playing around with the WAT files, which are great. But I don't have the resources to process all of the WATs from 2013 to the present, so I'd like to do some sampling if possible.
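
For context, the core of what I'm doing with the WATs is just pulling outlinks out of each metadata record, roughly like the sketch below (using warcio; the normalize() function is only a stand-in for my custom normalization, which I've left out):

    import json
    from warcio.archiveiterator import ArchiveIterator

    def normalize(url):
        # Placeholder for my custom URL normalization (details omitted here).
        return url

    def iter_edges(wat_path):
        """Yield (source_url, target_url) edges from one gzipped WAT file."""
        with open(wat_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'metadata':
                    continue
                data = json.loads(record.content_stream().read())
                envelope = data.get('Envelope', {})
                src = envelope.get('WARC-Header-Metadata', {}).get('WARC-Target-URI')
                links = (envelope.get('Payload-Metadata', {})
                                 .get('HTTP-Response-Metadata', {})
                                 .get('HTML-Metadata', {})
                                 .get('Links', []))
                for link in links:
                    # Note: link['url'] may be relative and would need to be
                    # resolved against src before normalization.
                    if src and 'url' in link:
                        yield normalize(src), normalize(link['url'])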

Is there a recommended way to sample the crawl? If I pull in the first ~20 WAT files for each crawl, will that have any biases? Or if I randomly pull in 20 WAT files for each crawl, will that get around any biases?

I just want to avoid a situation where the crawl has some sequential structure, such that using only a subset of the WATs for each crawl would give a biased view of domains.

Also, do you have a sense from the host- and domain-based calculations of what fraction of the crawl you need to get reasonably accurate results? For example, if you use 10% of the crawl, how different are the rankings (or how deep in the list do you have to go before the results diverge completely)?

Finally, based on previous questions, I'm guessing there isn't a data set available that is simply the URL-level graph? It looks like the external link graph (URLs and all links on the URL to a different domain) is only ~3% of the WAT file size, so that would help scale my project up.

Loving the Common Crawl data, so thank you!
Jeff

Sebastian Nagel

Apr 28, 2021, 4:44:18 PM
to common...@googlegroups.com
Hi Jeff,

> randomly pull in 20 WAT files

This is the better option. The pages captured in one segment are distributed pseudo-randomly
over its WARC/WAT/WET files, but because each segment is fetched by a single job, all of its
pages are captured within the same time range (2-3 hours). Taking just the first WAT files
would therefore concentrate the sample in a few segments and time windows; picking WAT files
at random across the whole crawl gives you a "better" random sample.
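
For example, something like this (just a sketch: the per-crawl wat.paths.gz listing is the standard way to enumerate the WAT files; the host and crawl id below are placeholders you would adjust):

    import gzip
    import random
    import urllib.request

    # Per-crawl listing of all WAT file paths (crawl id and host are examples).
    listing = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-17/wat.paths.gz"

    with urllib.request.urlopen(listing) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").split()

    # Uniform random sample of 20 WAT files, spread over all segments.
    random.seed(42)  # fix the seed if you want a reproducible sample
    for path in random.sample(paths, 20):
        print("https://data.commoncrawl.org/" + path)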


> Also, do you have a sense from the host- and domain-based calculations of what fraction of the crawl you
> need to get reasonably accurate results? For example, if you use 10% of the crawl, how different are
> the rankings (or how deep in the list do you have to go before the results diverge completely)?

Yes, a graph spanned by a small page sample can become extremely sparse. But I have no idea what
level of density is required to get a robust ranking. I'd be interested in those findings!
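
If you do measure it, one simple way to quantify the difference is the overlap of the top-k entries between the ranking from the full crawl and the one from your sample (a sketch; the two score dicts are hypothetical):

    def topk_overlap(rank_full, rank_sample, k):
        """Fraction of the top-k items (by score) shared by two rankings."""
        top_full = set(sorted(rank_full, key=rank_full.get, reverse=True)[:k])
        top_sample = set(sorted(rank_sample, key=rank_sample.get, reverse=True)[:k])
        return len(top_full & top_sample) / k

    # e.g. compute topk_overlap(...) for k = 100, 1_000, 10_000, ... to see how
    # deep into the list the two rankings stay in reasonable agreement.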



> there isn't a data set available that is simply the URL-level graph?

No. The page-level graph is too big to work with, or put differently: the benefits likely
wouldn't outweigh the costs.

Just for comparison: every month, about 10-15% of the WAT files of the preceding crawl
are processed to extract and sample links/URLs. From the February/March crawl we observed
the following numbers:

8610 WAT files processed (random sample)
366 million WAT response records

73.1 billion links in total
14.4 billion media links (skipped)
38.7 billion page links (unified per page)
8.25 billion unique URLs before filtering and normalization
5.85 billion unique URLs after filtering and normalization
2.84 billion URLs sampled


> external link graph (URLs and all links on the URL to a different domain) is ~3% of the WAT file size

Yes, that sounds reasonable. Most links are domain-internal, so throwing them away significantly reduces the size of the graph.
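
A sketch of that filter (using the third-party tldextract package to compare registered domains; just one way to do it):

    import tldextract  # pip install tldextract

    def is_external(src_url, target_url):
        """True if the link crosses a registered-domain boundary."""
        src = tldextract.extract(src_url).registered_domain
        dst = tldextract.extract(target_url).registered_domain
        return src != dst

    # is_external("https://blog.example.com/a", "https://example.com/b")  -> False
    # is_external("https://example.com/a", "https://other.org/b")         -> True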


Best,
Sebastian
