Hi Rich,
just a draft of one way to get close to the desired result...
I don't know of a ready-to-use solution for this problem. Some programming is needed. :)
And maybe someone has a better and smarter solution for this interesting problem!
> To do that I'd like to search to find websites that are of at least 100 pages in size, are still
> online today, but that haven't had any pages added or updated since 2014.
The Common Crawl index contains a document checksum / hash for each crawled page, computed on the
"raw" HTML content. This includes navigation elements and boilerplate, and may also contain elements
which change frequently: date or time, number of visitors, etc.
Indexes are available back to 2014. One possible way to get the desired set of websites:
1. prepare a list of websites of sufficient size. This is possible with Common Crawl data, but keep
in mind that only a sample is crawled and there is no guarantee that a website is crawled
exhaustively.
2. take at least two indexes: one from 2014 (or early 2015) and a recent one, and compare
the two checksums for each page of your list of candidate websites. If the checksums differ,
the website the page belongs to can be excluded.
Maybe it's more efficient to start with step 2, since the list of pages unchanged since 2014 is
probably quite small (esp. if the raw HTML is compared).
The result would be a list of candidates, not a ready result set, because
- [of at least 100 pages in size]: the number of crawled pages of a site is probably lower than the
real number of pages
- sites may be missing from the Common Crawl snapshots altogether
- page additions are also hard to detect with a snapshot crawl
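To make steps 1 and 2 a bit more concrete, here is a rough Python sketch of the checksum comparison. It is only an illustration under some assumptions: each index line is taken to look like "<urlkey> <timestamp> <json>", with "url" and "digest" fields in the JSON part, and the function names and the min_pages parameter are made up for this example. Fetching the index files is left out, and for a real index the data would be far too large to hold in memory like this.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def parse_cdx_line(line):
    """Return (url, digest) for one index line.

    Assumption: the line is "<urlkey> <timestamp> <json>" and the JSON
    part has "url" and "digest" fields.
    """
    _urlkey, _timestamp, payload = line.strip().split(" ", 2)
    record = json.loads(payload)
    return record["url"], record["digest"]


def digests_by_url(lines):
    """Map URL -> digest; if a URL occurs more than once, the last one wins."""
    return dict(parse_cdx_line(line) for line in lines if line.strip())


def unchanged_site_candidates(old_lines, new_lines, min_pages=100):
    """Hosts with at least min_pages pages whose digest is identical in
    both indexes and with no page whose digest differs."""
    old = digests_by_url(old_lines)
    new = digests_by_url(new_lines)
    unchanged = defaultdict(int)
    changed_hosts = set()
    for url, old_digest in old.items():
        new_digest = new.get(url)
        if new_digest is None:
            continue  # page not in the recent index: no evidence either way
        host = urlsplit(url).hostname
        if new_digest == old_digest:
            unchanged[host] += 1
        else:
            changed_hosts.add(host)  # one changed page excludes the whole site
    return sorted(
        host
        for host, count in unchanged.items()
        if count >= min_pages and host not in changed_hosts
    )
```

The result is still only a candidate list: pages that appear in just one of the two indexes are ignored, so added pages go unnoticed, and the page count is the number of crawled pages, not the real size of the site.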
Best,
Sebastian