How crawling is performed


Spider99

Oct 9, 2017, 1:57:02 AM
to Common Crawl
Hi,

I wanted to know how Common Crawl crawls the web from the seed URLs: do the spiders just keep following the links from the pages in DFS/BFS fashion, or is there some other method?

Please take this question in the context of both news crawling and normal crawling.

A follow-up question in the context of news data: how are you separating listing/index news pages from description news pages? As far as I can see, all the pages in the CC News archive are description news pages; there aren't any listing/index news pages.

Thanks,

Sebastian Nagel

Oct 9, 2017, 5:00:44 AM
to common...@googlegroups.com
Hi,

there is a huge database of URLs which stores, for each URL, the page fetch status, fetch time, score,
and content signature. It contains about 15 billion URLs right now. The fetch lists of the monthly
crawls are sampled from these 15 billion URLs:
- select URLs for which
  threshold >= ((score * time_elapsed_since_last_fetch) + status_penalty)
- penalties are applied to 404s, robots.txt exclusions, duplicates, not-modified pages, etc.,
  because there is a good chance that we get the same status again
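
To make the selection rule above concrete, here is a minimal Python sketch. It applies the inequality
exactly as quoted; the record fields, the time unit (days), the status names and all penalty values
are assumptions made up for illustration and are not taken from the actual crawler.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """One row of a hypothetical crawl database (field names are assumptions)."""
    url: str
    score: float
    days_since_last_fetch: float
    status: str


# Hypothetical penalty values; the post does not give the real numbers.
STATUS_PENALTY = {
    "fetched": 0.0,
    "not_found": 2.0,      # 404
    "robots_denied": 2.0,  # excluded by robots.txt
    "duplicate": 1.0,
    "not_modified": 1.0,
}


def selection_value(rec: UrlRecord) -> float:
    """Right-hand side of the rule: (score * time_elapsed) + status_penalty."""
    return rec.score * rec.days_since_last_fetch + STATUS_PENALTY.get(rec.status, 0.0)


def is_selected(rec: UrlRecord, threshold: float) -> bool:
    """Apply the inequality as written: threshold >= value."""
    return threshold >= selection_value(rec)


def build_fetch_list(records, threshold):
    """Return the URLs of all records that pass the rule."""
    return [rec.url for rec in records if is_selected(rec, threshold)]
```

With these made-up numbers, a record with a 404 status gets a larger value than an otherwise identical
successfully fetched one, so it is less likely to pass the same threshold.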

Last month we added one billion new URLs to the crawl database using three different approaches.
One of them is a breadth-first crawl, see
http://commoncrawl.org/2017/09/september-2017-crawl-archive-now-available/


> news crawling

is different. The news crawler uses RSS and Atom feeds and news sitemaps to find links to articles,
see https://groups.google.com/d/topic/common-crawl/eQC0nLVqmQs/discussion
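
As a rough illustration of feed-based link discovery (not the news crawler's actual code), the
following Python sketch pulls article links out of an RSS 2.0 or Atom document using only the standard
library. The function name is hypothetical, and news sitemaps are not handled here.

```python
import xml.etree.ElementTree as ET
from typing import List

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def article_links(feed_xml: str) -> List[str]:
    """Collect article URLs from an RSS 2.0 or Atom feed given as a string."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0: <rss><channel><item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    # Atom: <feed><entry><link href="URL"/></entry>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            # A missing rel attribute means rel="alternate" in Atom.
            if href and link.get("rel", "alternate") == "alternate":
                links.append(href)
    return links
```

The feed points directly at articles, so no link following from listing pages is needed to find them.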


Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Spider99

Oct 9, 2017, 6:42:13 AM
to Common Crawl
Thanks Sebastian,

I wanted to know whether the news crawl data contains news listing/index pages or not.

For example, I am talking about pages like this one (http://accuray.com/news-events).

Sebastian Nagel

Oct 9, 2017, 6:54:46 AM
to common...@googlegroups.com
> whether news crawled data contains news listing/index pages or not?

There should be no (or almost no) listing/index pages. Even if a listing or index page is contained
in feeds or sitemaps, it's only fetched once and never again. The exception, of course, is when the
URL isn't unique (e.g. contains a timestamp).
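
The "fetched once and never again" behaviour can be pictured as a filter on already-seen URLs. This is
only a sketch under the assumption of an in-memory set; the function name is hypothetical and the real
crawler's bookkeeping is not described in this thread.

```python
def new_urls(candidates, seen):
    """Return the candidate URLs not fetched before and mark them as seen."""
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A listing page whose URL stays the same is filtered out on every later feed poll, while a URL that
changes each time (for instance because it embeds a timestamp) looks new and is fetched again.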