Hi Christian,
67 million seems a high number because there are only 52.5 million unique host names (with at least
one page of successfully fetched content) in the August crawl (CC-MAIN-2016-36).
In principle, a request could also result in a 404 or some other "failure".
But at least for this WARC file almost all responses are redirects:
zgrep -A30 "GET / HTTP/1" ./CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz \
| grep -E '^HTTP/1\.[01] ' | sort | uniq -c | sort -k1,1nr
323 HTTP/1.1 301 Moved Permanently
74 HTTP/1.1 302 Found
36 HTTP/1.0 301 Moved Permanently
20 HTTP/1.1 302 Moved Temporarily
5 HTTP/1.0 302 Found
5 HTTP/1.1 302 Object moved
4 HTTP/1.1 303 See other
3 HTTP/1.1 301 Moved
3 HTTP/1.1 302 Redirect
2 HTTP/1.1 301 MOVED PERMANENTLY
1 HTTP/1.0 301 Redirect
1 HTTP/1.1 301
1 HTTP/1.1 301 MovedPermanently
1 HTTP/1.1 301 Redirect
1 HTTP/1.1 301 TLS Redirect
1 HTTP/1.1 301 http://radaronline.com/
1 HTTP/1.1 302 Movido temporalmente
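Many of the lines in the listing above differ only in the free-text reason phrase ("Moved Permanently", "Redirect", "Movido temporalmente"). A minimal sketch of how the same status lines could be grouped by numeric status code alone follows. The function name count_status_codes is hypothetical, and the sketch assumes the input contains HTTP/1.0 or HTTP/1.1 status lines mixed with other header lines, as produced by the zgrep command above:

```shell
# Hypothetical helper (not part of any Common Crawl tooling).
# Reads lines on stdin, keeps only HTTP/1.x status lines, and counts them
# by three-digit status code, ignoring the reason phrase.
# substr() trims a trailing carriage return when the reason phrase is empty.
count_status_codes() {
  awk '/^HTTP\/1\.[01] [0-9][0-9][0-9]/ { count[substr($2, 1, 3)]++ }
       END { for (code in count) print count[code], code }' |
    sort -k1,1nr -k2,2n
}
```

It could take the place of the grep, sort and uniq stages, i.e. the zgrep output would be piped straight into count_status_codes.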
One also needs to check whether the redirect target was successfully fetched,
and, of course, sometimes there may be two GET requests - one for http, one for https.
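Because a host may be requested once over http and once over https, counting "GET /" requests can overstate the number of hosts. A rough sketch of collapsing such pairs is shown here. It assumes the request URLs (for example the WARC-Target-URI values) have already been extracted one per line, and the function name unique_hosts is hypothetical:

```shell
# Hypothetical helper: reads one URL per line on stdin and prints each host
# name once, so http:// and https:// requests to the same host collapse.
# Simplification: user-info parts (user@host) are not handled.
unique_hosts() {
  tr '[:upper:]' '[:lower:]' |
    sed -E 's|^https?://||; s|[/?#].*$||; s|:[0-9]+$||' |
    sort -u
}
```

Piping the result into wc -l would then give a host count that ignores the scheme.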
Best,
Sebastian
> Host: english.stackexchange.com
> Accept-Encoding: x-gzip, gzip, deflate
> User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
> Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
>
> Best,
> Sebastian
>
> On 11/02/2016 11:04 AM, Christian Lund wrote:
> > Is it feasible to tag a page response with something like "root" or "home"?
> >
> > One could obviously check if the target URI is "/", but a lot of sites redirect the homepage to a
> > specific location, e.g. a subdirectory "/en/home.html".
> >
> > So I was thinking that if - in the commonCrawl URL index - each domain that is "root" could tag the
> > page response as such, which would make it easy to detect homepages when processing the WARC files.
> >
> > PS. Does the crawler specify language preference, I assume it is en/us?
> >
> > --
> > You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to
> > common-crawl...@googlegroups.com.
> > To post to this group, send email to common...@googlegroups.com.
> > Visit this group at https://groups.google.com/group/common-crawl.
> > For more options, visit https://groups.google.com/d/optout.