Crawling Strategy of newer Crawls


Robert Meusel

Aug 10, 2015, 3:49:14 AM
to Common Crawl
Hi,

I have seen that blekko is no longer mentioned in the newer crawl data announcements. Can you briefly explain how the crawls were obtained? What was the strategy? Is a list of URLs still used, or is it back to "link discovery on crawling"?

Thanks a lot,
Robert

Stephen Merity

Aug 11, 2015, 7:52:08 PM
to common...@googlegroups.com
Hi Robert,

Since blekko was acquired by IBM, they are no longer able to provide URL ranking information to us. We're still using a list of URLs, primarily composed of the blekko-sourced data, rather than "link discovery on crawl", mainly because we modified Nutch heavily to work that way. Running "link discovery on crawling" at scale is problematic on AWS with the Nutch codebase as it stands.

We're creating a new system which will use PageRank over the extracted web graph to prioritize which web pages to crawl and which new web pages to add. Whilst the current URL list is certainly not optimal, we feel that continuing to provide monthly snapshots is useful for a number of use cases. This will also allow us to broaden and refine the reach of the pages that we crawl, such as expanding our coverage of the non-English web.
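To give a rough idea of what that prioritization could look like (a toy sketch only, not our production pipeline; the edge-list format, the networkx library, and the cut-off are assumptions for illustration):

    # Rough sketch: rank hosts of an extracted web graph by PageRank and
    # keep the highest-ranked ones as crawl seeds. Assumes a whitespace-
    # separated edge list with one "source_host target_host" pair per line.
    import networkx as nx

    def top_hosts(edge_list_path, n=100000):
        graph = nx.DiGraph()
        with open(edge_list_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    graph.add_edge(parts[0], parts[1])
        ranks = nx.pagerank(graph, alpha=0.85)  # standard damping factor
        return sorted(ranks, key=ranks.get, reverse=True)[:n]

    for host in top_hosts("host_graph.tsv", n=100):
        print(host)

The real system would of course run over a much larger graph and feed the ranking back into seed selection rather than just printing it.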

We're also open to suggestions on obtaining or producing a list of URLs to target. Additionally, for anyone interested in improving Nutch, we have a number of features waiting to be implemented that would make it a better out-of-the-box system for performing web archiving at scale.




--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Robert Meusel

Aug 12, 2015, 6:48:16 AM
to Common Crawl
Hi Stephen,

Thanks a lot for the explanation. This is really helpful for interpreting the results we extract from the crawls. Will the whole list of URLs you are using/considering be made public?

Cheers,
Robert

Christian Buck

Aug 12, 2015, 7:35:49 AM
to common...@googlegroups.com
Hi,

It would be great to have a more multilingual crawl; the blekko-seed-based crawls seem to be strongly biased towards the English part of the web, as opposed to e.g. the 2012 crawl.

We maintain a list of language distributions per domain that I'd be happy to share to help boost the multilingual sites.
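To sketch what I mean (the file format and the weighting below are made up purely for illustration, not the actual format of our list):

    # Rough sketch: up-weight hosts whose content is mostly non-English
    # when scoring candidate seeds. Assumed input format: one
    # "host language fraction" triple per line.
    def load_language_fractions(path):
        fractions = {}
        with open(path) as f:
            for line in f:
                host, lang, frac = line.split()
                fractions.setdefault(host, {})[lang] = float(frac)
        return fractions

    def boosted_score(base_score, host_fractions, boost=2.0):
        non_english = 1.0 - host_fractions.get("en", 0.0)
        return base_score * (1.0 + boost * non_english)

With these toy numbers, a host that is 90% German would end up with roughly 2.8x the weight of an all-English host with the same base score.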

cheers,
Christian

Tom Morris

Aug 12, 2015, 2:27:17 PM
to common...@googlegroups.com
Stephen - Thanks.  This is great information.  It'd be nice if meta information like this could be released concurrently with the changes as they're deployed.

On Tue, Aug 11, 2015 at 7:51 PM, Stephen Merity <ste...@commoncrawl.org> wrote:
We're still using a list of URLs, primarily composed of the blekko sourced data, rather than "link discovery on crawl",

So the same list of URLs is crawled each month (modulo reachability issues, etc.)? When did the crawler cut over from the monthly blekko-provided list to the static list?
 
We're creating a new system which will use PageRank over the extracted web graph to prioritize which web pages to crawl and which new web pages to add.

Is there a timeframe for when this will be deployed?

Tom

Robert Meusel

Aug 24, 2015, 10:33:21 AM
to Common Crawl
Hi There,

I just stumbled upon this paper by Martin Hepp: http://ceur-ws.org/Vol-1426/paper-04.pdf
For the sake of the representativeness of the crawls, it might be worth looking at the "actual" sitemaps of the high-ranked domains and crawling the pages listed there (among others).
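A minimal version of that idea (just a sketch; it assumes the sitemap sits at the conventional /sitemap.xml location, which is not guaranteed, and real sites often point to it from robots.txt or use sitemap index files) could be as simple as:

    # Rough sketch: collect the URLs listed in a domain's sitemap so they
    # could be added to the frontier alongside link-discovered pages.
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_urls(domain):
        with urllib.request.urlopen("http://%s/sitemap.xml" % domain) as resp:
            tree = ET.parse(resp)
        return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]

    for url in sitemap_urls("example.com"):
        print(url)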

Cheers

Tom Morris

Aug 24, 2015, 11:50:30 AM
to common...@googlegroups.com
On Mon, Aug 24, 2015 at 10:33 AM, Robert Meusel <robert...@gmail.com> wrote:
I just stumbled upon this paper by Martin Hepp: http://ceur-ws.org/Vol-1426/paper-04.pdf
For the sake of the representativeness of the crawls, it might be worth looking at the "actual" sitemaps of the high-ranked domains and crawling the pages listed there (among others).

That's an interesting paper (in the proceedings of a conference which apparently hasn't happened yet), but Martin Hepp has a very specific focus (structured product data) and agenda (increasing its actual and perceived usage). That may not be representative of how the majority use, or want to use, the Common Crawl.

Note that, given a fixed budget, focusing on crawling entire domains, whether via sitemaps or other means, will necessarily reduce the number of domains that are crawled. Focusing on crawling all structured product data will mean sacrificing the crawling of popular pages. While it's clear that for Martin Hepp's purposes a crawl consisting of all the structured product data on the web would be a good thing, I doubt it would benefit the majority of Common Crawl's users, given the sacrifices it would require.

Tom

 


Martin Hepp

Aug 25, 2015, 3:57:23 AM
to Common Crawl
Hi Tom.


On Monday, August 24, 2015 at 5:50:30 PM UTC+2, Tom Morris wrote:
That's an interesting paper (in the proceedings of a conference which apparently hasn't happened yet), but Martin Hepp has a very specific focus (structured product data) and agenda (increasing its actual and perceived usage). That may not be representative of how the majority use, or want to use, the Common Crawl.

Note that, given a fixed budget, focusing on crawling entire domains, whether via sitemaps or other means, will necessarily reduce the number of domains that are crawled. Focusing on crawling all structured product data will mean sacrificing the crawling of popular pages. While it's clear that for Martin Hepp's purposes a crawl consisting of all the structured product data on the web would be a good thing, I doubt it would benefit the majority of Common Crawl's users, given the sacrifices it would require.

I would like to clarify that the problem we describe in the paper is a generic one that arises as soon as you use CommonCrawl to extract structured data. We limit our analysis to the product / e-commerce domain because that is the only domain we have evidence for, but it is clear that for any database-driven dynamic Web site with many entries in the database, the CommonCrawl approach will not cover the full original database content.

I agree that for many of the purposes of the CommonCrawl, such a "deep" crawl is not needed and the tradeoffs would not be sensible to make. Our paper tries to explain why what the Web Data Commons project extracts from CommonCrawl is a very limited and biased subset of the structured data found on the Web. There is no problem with CommonCrawl per se. We just challenge the general validity of the Web Data Commons extraction statistics and want to temper hopes and expectations about what can be done with the resulting body of extracted data.

Martin

Tom Morris

Aug 25, 2015, 12:45:13 PM
to common...@googlegroups.com
On Tue, Aug 25, 2015 at 3:57 AM, Martin Hepp <mfh...@gmail.com> wrote:

I agree that for many of the purposes of the CommonCrawl, such a "deep" crawl is not needed and the tradeoffs would not be sensible to make. 

Yes, that's the main point I was trying to make in response to Robert's suggestion that full site crawls be done.

It might be interesting to try to use the sitemaps to at least capture a "representative" (whatever that means) thin vertical slice of a web site, but I'm not sure how you'd do that, or how you'd balance it against the competing goal of doing a popularity-based (i.e. PageRank) crawl.
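One naive way to do it (purely a sketch of the idea, with an arbitrary per-host cap, not a worked-out policy) would be to take a small random sample from each host's sitemap and spend the rest of the page budget on the popularity-ranked list:

    # Rough sketch: cap the sitemap contribution per host, then fill the
    # remaining budget from a popularity-ordered (e.g. PageRank) URL list.
    import random

    def build_frontier(sitemap_urls_by_host, ranked_urls, budget, per_host_cap=100):
        seen = set()
        frontier = []
        for urls in sitemap_urls_by_host.values():
            for url in random.sample(urls, min(per_host_cap, len(urls))):
                if url not in seen and len(frontier) < budget:
                    seen.add(url)
                    frontier.append(url)
        for url in ranked_urls:
            if len(frontier) >= budget:
                break
            if url not in seen:
                seen.add(url)
                frontier.append(url)
        return frontier

Whether a random per-host sample is in any way "representative" is, of course, exactly the open question.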

There's an old (2009) Google paper that contains some interesting data on sitemap usage by web sites and how Google integrates them (or integrated them at the time) into its crawling prioritization, URL canonicalization, etc. They weren't worried about capturing a representative slice, but rather about integrating sitemaps into the popularity ranking.

Has anyone seen any more recent followups on sitemaps by Google or other search engines?

Tom