Sorry for the somewhat silly question, but I am very new to Common Crawl.
For research purposes, I wish to roughly assess the rate at which webpages' content changes over time.
For example, given a large number of URLs, the text of 10% will change within a week, 25% within 4 weeks, 35% within a year, etc.
I thought of using Common Crawl for this purpose (comparing the text of the same URLs across different crawls and evaluating the change rate),
but if I understand correctly, Common Crawl actually tries to avoid re-crawling the same webpages unless it knows they have changed.
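To make it concrete, here is a minimal sketch of the kind of comparison I had in mind, using the public CDX index at index.commoncrawl.org. The crawl IDs are just examples, and comparing the per-capture content digest only detects byte-level changes in the raw capture, not changes in the extracted text, so this would just be a first approximation:

```python
import json
import requests

CRAWLS = ["CC-MAIN-2023-50", "CC-MAIN-2024-10"]  # example crawl IDs
URL = "example.com/"

def lookup(crawl_id, url):
    # Query the CDX index for one crawl; the response is one JSON record per line.
    resp = requests.get(
        f"https://index.commoncrawl.org/{crawl_id}-index",
        params={"url": url, "output": "json"},
        timeout=30,
    )
    if resp.status_code == 404:  # URL not captured in this crawl
        return []
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]

digests = {}
for crawl in CRAWLS:
    records = lookup(crawl, URL)
    if records:
        # "digest" is a hash of the captured payload, so equal digests mean
        # byte-identical content between the two captures.
        digests[crawl] = records[0]["digest"]

if len(digests) == len(CRAWLS):
    print(URL, "changed between crawls:", len(set(digests.values())) > 1)
else:
    print(URL, "is missing from at least one crawl")
```

Of course, diffing the extracted text from the WARC/WET files would be more accurate than comparing digests, but even this coarse check depends on the same URLs appearing in multiple crawls, hence my questions below.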
So my questions are:
1. Do you think it is reasonable to use Common Crawl for this task?
2. Has anyone tried something similar, or does anyone have a pointer to something similar?
3. How does Common Crawl "know" that some URLs have changed? Is this information available? Do you think it would be useful for my purpose?
This is a side note in my research, so I am not aiming for novel results of my own.
I just need a rough answer to evaluate whether some other mechanism could rely on webpages' change rate.
Many thanks :)
Uri