Detecting Home


Christian Lund

Nov 2, 2016, 6:04:37 AM
to Common Crawl
Is it feasible to tag a page response with something like "root" or "home"?

One could obviously check whether the target URI is "/", but a lot of sites redirect the homepage to a specific location, e.g. a subdirectory like "/en/home.html".

So I was thinking that, in the Common Crawl URL index, the page response for each domain's root could be tagged as such, which would make it easy to detect homepages when processing the WARC files.

PS: Does the crawler specify a language preference? I assume it is en-US.

Christian Lund

Nov 2, 2016, 6:35:11 AM
to Common Crawl
Common Search does roughly what I outlined (keeping only URLs whose path is "/" and whose query string is empty) in order to detect homepages.

class Homepages(FilterPlugin):
    """ Filters homepages """

    def match_url(self, url):
        return (url.parsed.path == "/" and url.parsed.query == "")

But if www.example.com 301 redirects to www.example.com/home.html, then I suppose that the Common Crawl target URI will be /home.html and "/" will never appear for www.example.com, is that correct?

Sebastian Nagel

Nov 2, 2016, 10:00:11 AM
to common...@googlegroups.com
Hi Christian,

this would mean first inverting the redirects and assigning to each page/URL a list
of "redirected from" URLs. We could then save this list in the WARC metadata records.
Yes, this would be possible, but it would need to be implemented.

For now, you could take the redirects from the non-200-responses dataset
http://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/
and extract a list of additional homepage target URLs. If redirect chains are accepted,
the computation becomes trickier. The good news: the non-200-response data
is comparably small, only 400 GB for the September crawl.
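
For illustration only (this is not crawler code), a minimal sketch of such an extraction,
assuming a WARC parser like warcio; file paths and function names are made up:

from urllib.parse import urljoin, urlsplit
from warcio.archiveiterator import ArchiveIterator

def homepage_redirect_targets(warc_path):
    """Collect, per host, the redirect target of the request for "/"."""
    targets = {}
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            parts = urlsplit(uri)
            if parts.path not in ('', '/') or parts.query:
                continue  # only look at requests for the root path
            status = record.http_headers.get_statuscode()
            location = record.http_headers.get_header('Location')
            if status.startswith('3') and location:
                # Location may be relative, so resolve it against the request URI
                targets[parts.netloc] = urljoin(uri, location)
    return targets

The resulting host -> target map could then be joined against the main dataset
(or the URL index) to mark pages like /en/home.html as homepages.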

I think the "root" or "home" logic would need stricter rules regarding cross-domain
or cross-host redirects, and also ephemeral redirects used to set a session ID or a cookie.
I would rather leave it to users to define rules of this kind; for the crawler,
a redirect is just a redirect, without any semantics.
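
Just as an example of what such a user-defined rule could look like (names and
session markers are purely illustrative, not a recommendation):

from urllib.parse import urlsplit

SESSION_MARKERS = ('sessionid', 'jsessionid', 'phpsessid', 'sid=')  # illustrative only

def accept_homepage_redirect(source, target):
    """Example rule: accept only same-host redirects without obvious session IDs."""
    src, tgt = urlsplit(source), urlsplit(target)
    if src.netloc.lower() != tgt.netloc.lower():
        return False  # reject cross-host (and cross-domain) redirects
    lowered = (tgt.path + '?' + tgt.query).lower()
    return not any(marker in lowered for marker in SESSION_MARKERS)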


> PS: Does the crawler specify a language preference? I assume it is en-US.

Yes. Good point. To get multilingual content, it may eventually be better not to send
a preferred language at all.

By the way, the request headers are contained in the WARC files:

WARC/1.0
WARC-Type: request
...

GET /questions/45933/is-it-proper-to-capitalize-after-an-acronym/45959 HTTP/1.0
Host: english.stackexchange.com
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8


Best,
Sebastian

Sylvain Zimmer

Nov 2, 2016, 11:34:46 AM
to common...@googlegroups.com
But if www.example.com 301 redirects to www.example.com/home.html, then I suppose that the Common Crawl target URI will be /home.html and "/" will never appear for www.example.com, is that correct?

Exactly, the filter definitely isn't perfect and having the redirect dataset could help fix that.

(One case where it still works, though, is when "/home.html" has a canonical URL meta tag pointing back to "/".)

If you really need it, I could help you implement the redirect dataset as a new Common Search plugin that does the inversion Sebastian talked about. That would make the homepage filter work in those cases!

Christian Lund

Nov 2, 2016, 11:41:17 AM
to Common Crawl
Hi Sebastian,

Thanks for the input.

I did a sample test on a single file (CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc) from the August crawl, which has 29,800 non-200 response files in total. Using grep "GET / HTTP/1" I found:

482 entries in the non-200 response file
1775 entries in the 200 response file

A very crude extrapolation brings this to:

29,800 * 482 = 14,363,600
29,800 * 1,775 = 52,895,000

i.e. roughly 67,258,600 home pages in the August 2016 crawl.

The trick is then to figure out when a redirect is final. Are redirect targets fetched right away and recorded in the same segment? If that is the case, then I wouldn't need to check the larger WARC files for a 200 response; I could assume that if the redirect target URI is not present within the same non-200 segment, then it must have returned 200. Would that be the case?

Christian Lund

Nov 2, 2016, 11:54:47 AM
to Common Crawl
Hi Sylvain,

Thanks for coming up with a possible solution; I might take you up on the offer.

For now I'm going to see if I can detect redirect chains within the crawl segments, as this would be fairly fast to compute.
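
Roughly what I have in mind (just a sketch, assuming I first build a dict mapping
redirect source URLs to their targets from the non-200 WARCs):

def resolve_redirect_chain(url, redirects, max_hops=5):
    """Follow source -> target mappings until there is no further redirect,
    a loop is detected, or the hop limit is reached."""
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url

If the final URL is no longer a redirect source itself, I would treat it as the
homepage of the host I started from.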


Sebastian Nagel

Nov 2, 2016, 12:03:15 PM
to common...@googlegroups.com
Hi Christian,

67 million seems too high, given that there are only 52.5 million unique host names (with at least
one page of successfully fetched content) in the August crawl (CC-MAIN-2016-36).

In principle, a request could also result in a 404 or some other "failure",
but at least for this WARC file almost all of these responses are redirects:

zgrep -A30 "GET / HTTP/1" ./CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz \
| grep -E '^HTTP/1\.[01] ' | sort | uniq -c | sort -k1,1nr
323 HTTP/1.1 301 Moved Permanently
74 HTTP/1.1 302 Found
36 HTTP/1.0 301 Moved Permanently
20 HTTP/1.1 302 Moved Temporarily
5 HTTP/1.0 302 Found
5 HTTP/1.1 302 Object moved
4 HTTP/1.1 303 See other
3 HTTP/1.1 301 Moved
3 HTTP/1.1 302 Redirect
2 HTTP/1.1 301 MOVED PERMANENTLY
1 HTTP/1.0 301 Redirect
1 HTTP/1.1 301
1 HTTP/1.1 301 MovedPermanently
1 HTTP/1.1 301 Redirect
1 HTTP/1.1 301 TLS Redirect
1 HTTP/1.1 301 http://radaronline.com/
1 HTTP/1.1 302 Movido temporalmente

You would also need to check whether the redirect target was successfully fetched,
and, of course, sometimes there are two "GET /" requests per host: one for http, one for https.
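
For example (again only an illustration, not crawler code), given a host -> redirect-target
map and the set of successfully fetched URLs, e.g. taken from the URL index:

from urllib.parse import urlsplit, urlunsplit

def strip_scheme(url):
    """Drop http/https so both homepage fetches of a host collapse to one key."""
    p = urlsplit(url)
    return urlunsplit(('', p.netloc.lower(), p.path or '/', p.query, ''))

def confirmed_homepages(targets, fetched_urls):
    """Keep only redirect targets that were actually fetched successfully."""
    fetched = {strip_scheme(u) for u in fetched_urls}
    return {host: url for host, url in targets.items()
            if strip_scheme(url) in fetched}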

Best,
Sebastian

Christian Lund

Nov 2, 2016, 7:40:10 PM
to Common Crawl
Hi Sebastian,

Yes, of course my estimate was off, because I mistakenly added both non-200 and 200 responses, which makes no sense.

Is it safe to assume that if the crawler encounters a redirect (301 or 302), the follow-up request will be fetched within the same segment, or could it end up in another segment?

Greg Lindahl

Nov 2, 2016, 11:12:45 PM
to common...@googlegroups.com
On Wed, Nov 02, 2016 at 04:34:24PM +0100, Sylvain Zimmer wrote:
> >
> > But if www.example.com 301 redirects to www.example.com/home.html, then I
> > suppose that the commonCrawl target URI will contain /home.html and never
> > have "/" for www.example.com, is that correct?
> >
>
> Exactly, the filter definitely isn't perfect and having the redirect
> dataset could help fix that.

A full WARC implementation includes the redirects. That's the way IA
does it, and their CDX index contains enough data that you can follow
a redirect chain to the final file, even if it's multiple redirects.

I see that since August Sebastian has been saving redirects in a
separate set of WARCs, that are not in the urlindex. Which is a bit
of a shame, but not too bad, given that it's only 400 gigs of data.

-- greg

Sebastian Nagel

Nov 3, 2016, 9:01:28 AM
to common...@googlegroups.com
Hi Christian,

in most cases the redirect target should be fetched within the same segment.
Possible reasons why a redirect target is not recorded in one of the WARC files
(take all of them, including the non-200 ones):
- it was excluded by robots.txt
- a transient failure with no HTTP status (network timeout, SSL exception, etc.)
- the redirect target was part of a long per-host queue which was dropped because
of a time limit or too many exceptions for that queue

Of course, another URL could redirect to the same target; then you may find the target
in another segment as well. That's why we have URL-level duplicates (about 2% now).

Best,
Sebastian

Sebastian Nagel

Nov 3, 2016, 9:40:13 AM
to common...@googlegroups.com
> I see that since August Sebastian has been saving redirects in a
> separate set of WARCs,

Keeping them in separate datasets was Sylvain's idea:
no changes to the existing datasets, and no NLP people complaining that their
language models go wrong because "you are being redirected" is now the
most frequent English 4-gram.

> given that it's only 400 gigs of data.
That's a nice size. Especially the robots.txt data is a good entry point for server-related
metrics without the need for "big data": it is small enough to process on a smaller EC2
instance in a couple of hours. And there should be a response from every crawled host,
even if the main data contains no successfully fetched content for that host.
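
For example, a quick (purely illustrative) metric, again assuming a WARC parser
like warcio: tally the Server response header over one robots.txt WARC file.

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def server_header_counts(robots_warc_path):
    """Count the Server header across all responses in a robots.txt WARC file."""
    counts = Counter()
    with open(robots_warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                counts[record.http_headers.get_header('Server', 'unknown')] += 1
    return counts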

> that are not in the urlindex. Which is a bit of a shame

Thanks, good idea! They should be added for one of the next crawls, see
https://github.com/commoncrawl/webarchive-indexing/issues/3

Sebastian

Greg Lindahl

Nov 3, 2016, 1:01:48 PM
to common...@googlegroups.com
On Thu, Nov 03, 2016 at 02:40:10PM +0100, Sebastian Nagel wrote:
> > that are not in the urlindex. Which is a bit of a shame
>
> Thanks, good idea! They should be added for one of the next crawls, see
> https://github.com/commoncrawl/webarchive-indexing/issues/3

Yay, thank you Sebastian -- that's a great way to make non-200s easily
usable by casual users while keeping them segregated for bulk users.

-- greg

