> randomly pull in 20 WAT files
This is the better option. The pages captured in one segment are distributed pseudo-randomly
over WARC/WAT/WET files. Because one segment is fetched by one job, all pages are captured
in the same time range (2-3 hours). By picking WAT files at random, you get a "better" random
sample of pages.
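For illustration, here is a minimal Python sketch of that sampling step. The crawl ID, the
segment ID and the download host are placeholders, so adjust them to whatever crawl you
actually want to sample:

import gzip
import random
import urllib.request

CRAWL = "CC-MAIN-2016-07"       # placeholder crawl ID
SEGMENT = "1454701145519.33"    # placeholder segment ID
PATHS_URL = f"https://data.commoncrawl.org/crawl-data/{CRAWL}/wat.paths.gz"

# Download the list of all WAT files belonging to the crawl.
with urllib.request.urlopen(PATHS_URL) as response:
    paths = gzip.decompress(response.read()).decode("utf-8").splitlines()

# Keep only the WAT files of one segment, then pick 20 of them at random.
segment_paths = [p for p in paths if f"/segments/{SEGMENT}/" in p]
sample = random.sample(segment_paths, min(20, len(segment_paths)))

for path in sample:
    print(f"https://data.commoncrawl.org/{path}")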
> Also, do you have a sense from the host and domain based calculations of what fraction of the crawl you
> need to get reasonably accurate results? For example, if you use 10% of the crawl, how different are
> the rankings (Or, how deep in the list do you have to go before the results are completely different)?
Yes, a graph spanned by a small page sample can become extremely sparse, but I have no idea what
level of density is required to get a robust ranking. I'd be interested in those findings!
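One simple way to explore this, sketched below, is to measure the overlap of the top-k entries
of two rankings for increasing k, e.g. a ranking computed from a 10% sample against one computed
from the full crawl. The function and the toy rankings are purely illustrative, not part of any
existing tooling:

# `ranking_full` and `ranking_sample` are assumed to be lists of domains
# (or hosts), highest-ranked first.
def topk_overlap(ranking_full, ranking_sample, ks=(10, 100, 1000, 10000)):
    """Return {k: fraction of the full top-k also present in the sample top-k}."""
    overlaps = {}
    for k in ks:
        top_full = set(ranking_full[:k])
        top_sample = set(ranking_sample[:k])
        overlaps[k] = len(top_full & top_sample) / k
    return overlaps

# Toy example: the sample ranking agrees at the head of the list
# and drifts further down.
full = [f"domain{i}.example" for i in range(20000)]
sample = full[:50] + full[60:1000] + full[50:60] + full[1000:]
print(topk_overlap(full, sample, ks=(10, 100, 1000)))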
> there isn't a data set available that is simply the URL level graph?
No. The page-level graph is too big to work with, or, put differently, the benefits would likely
not outweigh the costs.
Just for comparison: every month about 10-15% of the WAT files of the preceding crawl
are processed to extract and sample links/URLs. From the February/March crawl we observed
the following numbers:

  8,610 WAT files processed (random sample)
  366 million WAT response records
  73.1 billion links in total
  14.4 billion media links (skipped)
  38.7 billion page links (unified per page)
  8.25 billion unique URLs before filtering and normalization
  5.85 billion unique URLs after filtering and normalization
  2.84 billion URLs sampled
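For anyone who wants to reproduce numbers like these on a smaller scale, here is a rough sketch
of the link extraction step using the warcio package. The JSON paths follow the usual layout of
WAT records but should be checked against actual data; the file name is just a placeholder and
error handling is minimal:

import json
from urllib.parse import urljoin

from warcio.archiveiterator import ArchiveIterator

def iter_page_links(wat_stream):
    """Yield (page_url, link_url) pairs from an open WAT file stream."""
    for record in ArchiveIterator(wat_stream):
        if record.rec_type != 'metadata':
            continue
        data = json.loads(record.content_stream().read())
        envelope = data.get('Envelope', {})
        page_url = envelope.get('WARC-Header-Metadata', {}).get('WARC-Target-URI')
        html = envelope.get('Payload-Metadata', {}) \
                       .get('HTTP-Response-Metadata', {}) \
                       .get('HTML-Metadata', {})
        if not page_url:
            continue
        seen = set()                       # unify links per page
        for link in html.get('Links', []):
            target = link.get('url')
            if not target:
                continue
            absolute = urljoin(page_url, target)
            if absolute not in seen:
                seen.add(absolute)
                yield page_url, absolute

# Usage: stream one (gzipped) WAT file and print the first few links.
if __name__ == '__main__':
    with open('example.warc.wat.gz', 'rb') as stream:
        for i, (page, target) in enumerate(iter_page_links(stream)):
            print(page, '->', target)
            if i >= 9:
                break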
> external link graph (URLs and all links on the URL to a different domain) is ~3% of the WAT file size
Yes, that sounds reasonable. Most links are domain-internal; throwing them away significantly reduces the size of the graph.
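As a sketch of that filtering, the snippet below keeps only links whose host differs from the
page's host. A stricter "different domain" test would compare registered domains against the
public suffix list (e.g. with the tldextract package), but plain host comparison shows the idea:

from urllib.parse import urlparse

def is_external(page_url, link_url):
    """True if the link points to a different host than the page."""
    page_host = (urlparse(page_url).hostname or '').lower()
    link_host = (urlparse(link_url).hostname or '').lower()
    return bool(link_host) and link_host != page_host

# Example: filter the (page, link) pairs produced by a WAT extraction step.
pairs = [
    ('http://www.example.com/a', 'http://www.example.com/b'),   # internal
    ('http://www.example.com/a', 'http://www.example.org/'),    # external
]
external = [(p, l) for p, l in pairs if is_external(p, l)]
print(external)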