Can you estimate the rate at which webpages' content changes using Common Crawl?


Uri Klarman

Jul 12, 2016, 11:38 AM
to Common Crawl
Sorry for the somewhat silly question, but I am very new to Common Crawl.

For research purposes, I wish to roughly assess the rate at which webpages' content changes over time.
For example, given a large number of URLs, the text of 10% will change within a week, 25% after 4 weeks, 35% within a year, etc.

I thought of using Common Crawl for this purpose (comparing the text of the same URLs across different crawls and evaluating the change rate),
but if I understand correctly, Common Crawl actually tries to avoid re-crawling the same webpages unless it knows they have changed.

So my questions are:
1. Do you think it is reasonable to use Common Crawl for this task?
2. Has anyone tried something similar, or does anyone have a pointer to something similar?
3. How does Common Crawl "know" that some URLs have changed? Is this information available? Do you think it would be useful for my purpose?

This is a side-note in my research, so I am not aiming at novel results single-handedly achieved by myself.
I just need an answer to evaluate if some other mechanism could rely on webpages change rate.

Many thanks :)
Uri

Tom Morris

Jul 14, 2016, 11:16 AM
to common...@googlegroups.com
On Tue, Jul 12, 2016 at 11:38 AM, Uri Klarman <uri.k...@gmail.com> wrote:

For research purposes, I wish to roughly assess the rate at which webpages' content changes over time.
For example, given a large number of URLs, the text of 10% will change within a week, 25% after 4 weeks, 35% within a year, etc.

I thought of using Common Crawl for this purpose (comparing the text of the same URLs across different crawls and evaluating the change rate),
but if I understand correctly, Common Crawl actually tries to avoid re-crawling the same webpages unless it knows they have changed.

This may be true for some recent crawls, but I don't think it was always true historically.
 
So my questions are:
1. Do you think it is reasonable to use Common Crawl for this task?

I don't think it's unreasonable, but you'd probably need to get started to see if it's actually feasible in practice.
 
2. Has anyone tried something similar, or does anyone have a pointer to something similar?
3. How does Common Crawl "know" that some URLs have changed? Is this information available? Do you think it would be useful for my purpose?

You could use the hash of the page content stored in the Common Crawl Index as a quick way to tell whether the contents differ. Bear in mind, though, that the hash covers the whole page: not just the main content but also boilerplate, changeable elements like "recent Tweets" or "related stories", and other material you may or may not want to count when deciding whether a page "really" changed.
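To make the idea concrete, here is a rough sketch (not Common Crawl's own tooling) of comparing per-URL content digests between two crawls. The field names `urlkey` and `digest` match the JSON records returned by the Common Crawl CDX index, but the sample records and values below are invented for illustration:

```python
# Sketch: estimate the fraction of URLs whose content changed between two
# crawls by comparing the content digests recorded in the CDX index.
# The record dicts imitate CDX index JSON ("urlkey", "digest", "timestamp");
# the sample values are made up.

def fraction_changed(old_records, new_records):
    """Fraction of URLs present in both crawls whose digest differs."""
    old = {r["urlkey"]: r["digest"] for r in old_records}
    new = {r["urlkey"]: r["digest"] for r in new_records}
    common = old.keys() & new.keys()
    if not common:
        return 0.0
    changed = sum(1 for key in common if old[key] != new[key])
    return changed / len(common)

# Invented sample records for two hypothetical crawls:
crawl_a = [
    {"urlkey": "com,example)/", "digest": "AAAA", "timestamp": "20160601000000"},
    {"urlkey": "com,example)/about", "digest": "BBBB", "timestamp": "20160601000000"},
]
crawl_b = [
    {"urlkey": "com,example)/", "digest": "AAAA", "timestamp": "20160701000000"},
    {"urlkey": "com,example)/about", "digest": "CCCC", "timestamp": "20160701000000"},
]

print(fraction_changed(crawl_a, crawl_b))  # one of two URLs changed -> 0.5
```

As noted above, a differing digest only tells you that *some* byte of the page changed, boilerplate included; for a stricter notion of change you would need to extract the main text first and compare that instead.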

Tom




Uri Klarman

Jul 15, 2016, 2:25 PM
to Common Crawl
Thank you Tom for your answers. I will dig deeper into this...
Uri