--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.
We're still using a list of URLs, primarily composed of the blekko-sourced data, rather than doing "link discovery on crawl".
We're building a new system that will use PageRank over the extracted web graph to prioritize which pages to crawl and which new pages to add.
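For the curious, the idea is roughly this (a minimal sketch, not Common Crawl's actual code; the graph, URLs, and function names are all made up for illustration): compute PageRank over the extracted link graph by power iteration, then sort candidate URLs by score to order the crawl frontier.

```python
# Hypothetical sketch of PageRank-based frontier prioritization.
# graph maps each page to the list of pages it links to.

def pagerank(graph, damping=0.85, iterations=50):
    pages = set(graph) | {p for links in graph.values() for p in links}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            links = graph.get(page, [])
            if links:
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy extracted web graph (illustrative only).
graph = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": ["a.example/"],
    "d.example/": ["c.example/"],
}
scores = pagerank(graph)
# Crawl frontier ordered by score, highest first.
frontier = sorted(scores, key=scores.get, reverse=True)
```

In a real system the graph would come from links extracted during previous crawls, and the top-ranked uncrawled URLs would be scheduled first.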
I just stumbled upon this paper by Martin Hepp: http://ceur-ws.org/Vol-1426/paper-04.pdf

Maybe, for the sake of the representativeness of the crawls, it would be worth looking at the actual sitemaps of the high-ranked domains and crawling the pages they list (besides others).
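Mechanically, that would just mean pulling the <loc> entries out of each domain's sitemap and feeding them into the frontier. A quick illustrative sketch (the sample XML is made up, and real sitemaps would of course be fetched over HTTP and may be sitemap indexes pointing at further files):

```python
# Illustrative sketch: extract the URLs a site declares in its sitemap.
import xml.etree.ElementTree as ET

# Standard sitemap namespace per sitemaps.org.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(NS + "loc")]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/products</loc></url>
</urlset>"""

print(sitemap_urls(sample))
# → ['http://example.com/', 'http://example.com/products']
```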
Cheers
On Monday, August 10, 2015 at 09:49:14 UTC+2, Robert Meusel wrote:

Hi,

I have seen that within the newer crawl data announcements, blekko is no longer mentioned. Can you briefly explain how the crawls were obtained? What was the strategy? Is a list of URLs still used, or is it back to "link discovery on crawling"?

Thanks a lot,
Robert
That's an interesting paper (in the proceedings of a conference which apparently hasn't happened yet), but Martin Hepp has a very specific focus (structured product data) and agenda (increasing the actual and perceived usage of it). That may not be representative of how the majority use, or want to use, the Common Crawl.

Note that, given a fixed budget, focusing on crawling entire domains, whether via sitemaps or other means, necessarily reduces the number of domains that can be crawled. Focusing on crawling all structured product data would mean sacrificing crawling popular pages. While it's clear that for Martin Hepp's purposes a crawl consisting of all the structured product data on the web would be a good thing, I doubt it would benefit the majority of Common Crawl's users, given the sacrifices it would require.
I agree that for many of the purposes of the Common Crawl, such a "deep" crawl is not needed, and the trade-offs would not be sensible to make.