time series models from common crawl data


leuemi...@gmail.com

Nov 6, 2017, 2:04:31 AM
to Common Crawl
Hey there,

I am building language-specific word embedding models from Common Crawl data, using your index search to filter for country-code top-level domains and then downloading the specific parts of the WARC files for the search results. It's working great so far, and I want to express my gratitude for providing the service you do.
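The range-request part of that workflow can be sketched roughly as follows (a minimal sketch, not from the thread: the field names `filename`, `offset`, and `length` follow the CDX index server's JSON output, and the actual HTTP fetch is only indicated in a comment):

```python
# Sketch: parse one result line from the Common Crawl CDX index API and
# build the HTTP Range header that covers exactly one WARC record.
import json

def parse_cdx_line(line):
    """Each CDX API result line is a standalone JSON object."""
    return json.loads(line)

def byte_range(record):
    """Range header for the compressed WARC record at offset/length."""
    offset = int(record["offset"])
    length = int(record["length"])
    return "bytes={}-{}".format(offset, offset + length - 1)

# Example CDX result line (fields abbreviated; filename path elided):
line = '{"urlkey": "de,example)/", "filename": "crawl-data/.../x.warc.gz", "offset": "1024", "length": "2048"}'
rec = parse_cdx_line(line)
print(byte_range(rec))  # bytes=1024-3071

# The record itself would then be fetched with something like:
#   requests.get(data_url_prefix + rec["filename"],
#                headers={"Range": byte_range(rec)})
# where data_url_prefix points at the Common Crawl data bucket.
```

Since `length` is the size of the compressed record, the end of the range is `offset + length - 1` (HTTP ranges are inclusive).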

Following these first steps I would also like to think about how I could potentially build those word embedding models for different points in time, so e.g. a model for 2014, 2015, 2016, 2017 etc.

Basically, for every website I grab from one of your WARC files I would need a "creation" timestamp (meaning: creation of that website, not creation of the crawled information in the archive). AFAIK those don't exist in any kind of standardized form. So what I have been thinking about now is whether I could utilize the temporal structure of your crawl archives to get this kind of information.
For example, as of right now, a new crawl dump arrives every month. So I would like to know:

1) If some specific webpage appears in e.g. crawl dump August 2017, can I assume that it will also appear in all future dumps? I.e. once a webpage is discovered by the crawler, will it always be included in every single dump from that point on?
2) If some specific webpage appears for the first time in e.g. crawl dump August 2017 (i.e. it has never appeared in any previous crawl dump), does that give me any kind of information about the creation date of that webpage or is it just as likely to be the first appearance of that webpage in a crawl dump because the crawler was just seeded with some new starting points and therefore discovered that webpage for the first time (even though the page itself might already be e.g. several years old)?

In general I would like to know if I can infer any kind of temporal information about a crawled webpage in one of your archives by means of the date when that archive was released.

Thanks in advance for any help.

Kind Regards,
//Michael

Sebastian Nagel

Nov 7, 2017, 5:12:22 AM
to common...@googlegroups.com
Hi Michael,

> 1) If some specific webpage appears in e.g. crawl dump August 2017, can I assume that it will also
> appear in all future dumps? I.e. once a webpage is discovered by the crawler, will it always be
> included in every single dump from that point on?

It will be revisited but not necessarily every month. How often depends on:
- the score of a page
- its fetch and duplicate status
- number of pages per site/host: for politeness we guarantee a delay between successive requests
  to the same host. Within one monthly crawl the number of pages per host is usually less than
  500,000. For large sites (> 500,000 pages) this means necessarily a longer average re-fetch
  interval per page.
cf. https://groups.google.com/d/topic/common-crawl/dqIoNOA3koM/discussion

> 2) If some specific webpage appears for the first time in e.g. crawl dump August 2017 (i.e. it has
> never appeared in any previous crawl dump), does that give me any kind of information about the
> creation date of that webpage or is it just as likely to be the first appearance of that webpage
> in a crawl dump because the crawler was just seeded with some new starting points and therefore
> discovered that webpage for the first time (even though the page itself might already be e.g.
> several years old)?

Both ways are possible.

One way to go could be to look at a creation timestamp in the content: some CMSs provide this
information, and the modification time is frequently given in the HTML or the HTTP headers.
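Extracting those two hints could look like this (a standard-library sketch; the `Last-Modified` header is defined by HTTP, while the `article:published_time` meta tag is an Open Graph convention that only some pages use):

```python
# Sketch: pull modification/creation hints from a captured HTTP response.
# Neither signal is guaranteed to be present, and Last-Modified often
# reflects server config rather than true page creation time.
import re
from email.utils import parsedate_to_datetime

def last_modified(headers):
    """Parse an HTTP Last-Modified header, if present."""
    value = headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

META_RE = re.compile(
    r'<meta[^>]+property=["\']article:published_time["\']'
    r'[^>]+content=["\']([^"\']+)', re.I)

def published_time(html):
    """Look for an Open Graph publication timestamp in the HTML."""
    m = META_RE.search(html)
    return m.group(1) if m else None

print(last_modified({"Last-Modified": "Tue, 15 Nov 1994 12:45:26 GMT"}))
print(published_time(
    '<meta property="article:published_time" content="2017-08-01T00:00:00Z">'))
```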


Best,
Sebastian

leuemi...@gmail.com

Nov 9, 2017, 1:45:06 AM
to Common Crawl
Hey Sebastian, thanks a lot for your detailed answer.


> 2) If some specific webpage appears for the first time in e.g. crawl dump August 2017 (i.e. it has
> never appeared in any previous crawl dump), does that give me any kind of information about the
> creation date of that webpage or is it just as likely to be the first appearance of that webpage
> in a crawl dump because the crawler was just seeded with some new starting points and therefore
> discovered that webpage for the first time (even though the page itself might already be e.g.
> several years old)?

> Both ways are possible.
>
> One way to go could be to look at a creation timestamp in the content: some CMSs provide this
> information, and the modification time is frequently given in the HTML or the HTTP headers.

Okay, so would you agree that there is currently no consistent way to infer any kind of information about the creation/change date of a webpage by means of checking in which crawls it appears/not appears?

Thank you also for the suggestion about the modification timestamp in the HTTP header. I have not seen that very often in the wild, but I'll be sure to check it out; maybe it appears more often than I thought.

Keep up the great work.

//Michael

Tom Morris

Nov 9, 2017, 12:51:31 PM
to common...@googlegroups.com
On Tue, Nov 7, 2017 at 5:12 AM, Sebastian Nagel <seba...@commoncrawl.org> wrote:
Hi Michael,

> 1) If some specific webpage appears in e.g. crawl dump August 2017, can I assume that it will also
> appear in all future dumps? I.e. once a webpage is discovered by the crawler, will it always be
> included in every single dump from that point on?

> It will be revisited but not necessarily every month. How often depends on:
> - the score of a page
> - its fetch and duplicate status
> - number of pages per site/host: for politeness we guarantee a delay between successive requests
>   to the same host. Within one monthly crawl the number of pages per host is usually less than
>   500,000. For large sites (> 500,000 pages) this means necessarily a longer average re-fetch
>   interval per page.
> cf. https://groups.google.com/d/topic/common-crawl/dqIoNOA3koM/discussion

At a slightly higher level than Sebastian's answer: for old crawls there is a large amount of overlap between crawls, but for newer crawls there is very little.
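One way to check this yourself (a hypothetical sketch, not an official tool): compare the URL sets of two monthly crawls and compute what fraction of one appears in the other.

```python
# Sketch: overlap between two crawls, given their URL lists
# (e.g. extracted from the per-crawl CDX index files).
def crawl_overlap(urls_a, urls_b):
    """Fraction of crawl A's URLs that also appear in crawl B."""
    a = set(urls_a)
    return len(a & set(urls_b)) / len(a) if a else 0.0

print(crawl_overlap(["u1", "u2", "u3", "u4"], ["u3", "u4", "u5"]))  # 0.5
```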

You can see the stats along with pretty pictures here:

Even given those two generalizations however, nothing is guaranteed.

Tom