Hi Michael,
> 1) If some specific webpage appears in e.g. crawl dump August 2017, can I assume that it will also
> appear in all future dumps? I.e. once a webpage is discovered by the crawler, will it always be
> included in every single dump from that point on?
It will be revisited but not necessarily every month. How often depends on:
- the score of a page
- its fetch and duplicate status
- the number of pages per site/host: for politeness we guarantee a delay between successive
requests to the same host. Within one monthly crawl the number of pages per host is usually
less than 500,000. For large sites (> 500,000 pages) this necessarily means a longer average
re-fetch interval per page.
cf. https://groups.google.com/d/topic/common-crawl/dqIoNOA3koM/discussion
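To make the per-host point concrete, here is a rough back-of-the-envelope sketch. It treats the "usually less than 500,000" figure as a hard cap per monthly crawl, which is a simplifying assumption, and it ignores page score and fetch/duplicate status. The function name and the constant are made up for illustration and are not part of the crawler.

```python
# Assumption: at most this many pages of a single host end up in one monthly
# crawl (the "usually less than 500,000" figure, treated here as a hard cap).
PAGES_PER_HOST_PER_CRAWL = 500_000


def min_avg_refetch_interval_months(site_pages: int,
                                    per_crawl_cap: int = PAGES_PER_HOST_PER_CRAWL) -> float:
    """Lower bound on the average number of months between two fetches of a page.

    If a host has more pages than fit into one monthly crawl, the pages have to
    take turns, so the average interval grows with site_pages / per_crawl_cap.
    The result is never below 1.0 because the crawls are monthly.
    """
    if site_pages <= 0 or per_crawl_cap <= 0:
        raise ValueError("site_pages and per_crawl_cap must be positive")
    return max(1.0, site_pages / per_crawl_cap)


if __name__ == "__main__":
    for pages in (100_000, 500_000, 2_000_000):
        months = min_avg_refetch_interval_months(pages)
        print(f"{pages:>9} pages: at least {months:.1f} month(s) between fetches on average")
```

This is only a lower bound: an individual page on a large site may still be fetched more or less often than the average, depending on the other factors in the list.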
> 2) If some specific webpage appears for the first time in e.g. crawl dump August 2017 (i.e. it has
> never appeared in any previous crawl dump), does that give me any kind of information about the
> creation date of that webpage or is it just as likely to be the first appearance of that webpage
> in a crawl dump because the crawler was just seeded with some new starting points and therefore
> discovered that webpage for the first time (even though the page itself might already be e.g.
> several years old)?
Both are possible: a first appearance may mean the page is new, or only that the crawler has
just discovered it.
One option is to look for a creation timestamp in the content. Some CMSs provide this
information, and the modification time is frequently given in the HTML or in the HTTP header.
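As a minimal sketch of that idea, the following collects date hints from a captured response using only the Python standard library. The function name, the class name, and the list of meta keys are illustrative assumptions, not an exhaustive or authoritative set. Which fields are actually present varies from site to site.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Assumption: meta names/properties that some CMSs emit; not an exhaustive list.
DATE_META_KEYS = {
    "article:published_time",
    "article:modified_time",
    "date",
    "dcterms.created",
    "dcterms.modified",
    "last-modified",
}


class MetaDateParser(HTMLParser):
    """Collects the content of <meta> tags whose name/property looks date-related."""

    def __init__(self):
        super().__init__()
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = (a.get("name") or a.get("property") or a.get("http-equiv") or "").lower()
        content = a.get("content")
        if key in DATE_META_KEYS and content:
            # Keep the first value seen for each key.
            self.dates.setdefault(key, content)


def date_hints(http_headers: dict, html: str) -> dict:
    """Return date hints found in the HTTP headers and the HTML of one capture."""
    hints = {}
    for name, value in http_headers.items():
        if name.lower() == "last-modified":
            try:
                hints["http:last-modified"] = parsedate_to_datetime(value).isoformat()
            except (TypeError, ValueError):
                pass  # Unparseable header value: skip it rather than guess.
    parser = MetaDateParser()
    parser.feed(html)
    for key, value in parser.dates.items():
        hints["meta:" + key] = value
    return hints
```

These values are hints rather than proof: a Last-Modified header on a dynamically generated page often reflects the time the response was produced, and meta dates are only as reliable as the CMS that wrote them.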
Best,
Sebastian