All data contained in latest crawl?

brano199

Jul 21, 2017, 6:39:35 AM
to Common Crawl
Hello,

I have a question: does a given crawl, say CC-MAIN-2017-26, contain all the data from all previous crawls (that is, every URL that was there before, with a refreshed timestamp) plus the new URLs found in this crawl?

Thank you for the answer.

Sebastian Nagel

Jul 21, 2017, 8:21:18 AM
to common...@googlegroups.com
No. There is some overlap between the crawls, but a previous crawl is never a subset of a newer one:
- pages disappear
- pages get excluded by robots.txt
- pages detected as duplicates are (re)fetched less often
- low-ranking pages are not scheduled for refetch every month

You'll find the numbers here:
https://commoncrawl.github.io/cc-crawl-statistics/
(overlap, size, amount of new URLs/pages, etc.)
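
As a rough illustration of what "overlap" means here, it can be treated as set arithmetic over the URL sets of two crawls. The sketch below is only an assumption about how one might compute it by hand; the function name and the sample URLs are made up, and it is not how the statistics page itself is produced. Jaccard similarity is used as one common overlap measure.

```python
def crawl_overlap(older_urls, newer_urls):
    """Compare the URL sets of two crawls (hypothetical helper)."""
    older, newer = set(older_urls), set(newer_urls)
    shared = older & newer
    union = older | newer
    return {
        "shared": len(shared),                # URLs present in both crawls
        "only_in_older": len(older - newer),  # URLs missing from the newer crawl
        "new_in_newer": len(newer - older),   # URLs first seen in the newer crawl
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Made-up example data, not real crawl contents.
older = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
newer = ["http://example.com/b", "http://example.com/c", "http://example.com/d"]
stats = crawl_overlap(older, newer)
print(stats)
```

A non-zero `only_in_older` count is exactly the case described above: the older crawl holds pages that the newer one does not.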

Best,
Sebastian

brano199

Jul 21, 2017, 12:25:36 PM
to Common Crawl
Thank you for the clarification, Sebastian. I think I need to be more concrete, though. I asked because I want to analyze user ratings on TripAdvisor, and I would like to get the freshest data for as many hotels, attractions, and restaurants as possible. So the real question is whether I necessarily have to go through all crawls to get all the data. That is why I asked about the overlap: I thought that the deeper I go into the past, the more overlap there would be.

To explain my needs better: does the latest crawl contain all TripAdvisor links not blocked by robots.txt, or does it skip some parts of the site?
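
Since a previous crawl is never a subset of a newer one, getting the freshest copy of as many pages as possible would mean combining captures from several crawls and keeping the newest one per URL. Below is a minimal sketch of that merge step, assuming each capture has already been reduced to a (url, timestamp, crawl) tuple. The function name, the tuple layout, the 14-digit timestamp format, and the sample records are all assumptions for illustration, not real Common Crawl data.

```python
def latest_captures(records):
    """Keep the most recent capture per URL (hypothetical helper).

    `records` is an iterable of (url, timestamp, crawl_id) tuples.
    Timestamps are assumed to be 14-digit strings (YYYYMMDDhhmmss),
    so plain string comparison orders them chronologically.
    """
    latest = {}
    for url, timestamp, crawl_id in records:
        best = latest.get(url)
        if best is None or timestamp > best[0]:
            latest[url] = (timestamp, crawl_id)
    return latest


# Made-up records: one page appears in both crawls, one only in the older
# crawl, and one only in the newer crawl.
records = [
    ("http://example.com/hotel-1", "20170525101500", "CC-MAIN-2017-22"),
    ("http://example.com/hotel-2", "20170526093000", "CC-MAIN-2017-22"),
    ("http://example.com/hotel-1", "20170623141000", "CC-MAIN-2017-26"),
    ("http://example.com/hotel-3", "20170624080000", "CC-MAIN-2017-26"),
]
freshest = latest_captures(records)
for url, (timestamp, crawl_id) in sorted(freshest.items()):
    print(url, timestamp, crawl_id)
```

In this toy data, `hotel-2` survives only because the older crawl was included, which is why looking at the latest crawl alone can miss pages.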

