first level links


Roxana Danger

unread,
May 17, 2018, 7:27:39 AM5/17/18
to Common Crawl
Hi all,
The CC work is great! Thanks for democratizing the data!

I am trying to classify the pages found in the first-level links of each domain. To do so, my idea is to retrieve the home page of each domain and then extract all of the hrefs inside this page.

For now, I am just using the UK region, but at least half of the home pages (amongst the 10 biggest sites) have status 301. So there is no page to download and, therefore, I can't retrieve their direct links.

Is there any better way to retrieve all of the URLs referred directly by the home page per domain?

Thank you very much in advance.
Roxana


  

Sebastian Nagel

unread,
May 17, 2018, 9:10:19 AM5/17/18
to common...@googlegroups.com
Hi Roxana,

> have status 301. So, I there is no page to download and therefore,
> can't retrieve its direct links.

That's a redirect and you have to follow it; the URL to follow
is in the "Location:" header of the HTTP response.

> Is there any better way to retrieve all of the URLs referred directly by the home page per domain?

Often the redirect points to a variant you could guess using some basic heuristics:
http:// -> https://
example.com -> www.example.com
/ -> /index.html
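A quick sketch of those guesses as a standard-library helper (the function name guess_variants is hypothetical, not part of any Common Crawl tooling; the heuristics are exactly the three above):

```python
from urllib.parse import urlsplit, urlunsplit

def guess_variants(url):
    """Generate candidate URLs a home-page redirect often points to.

    Heuristics only: scheme upgrade, www prefix, /index.html suffix.
    """
    scheme, netloc, path, query, frag = urlsplit(url)
    variants = set()
    if scheme == "http":
        # http:// -> https://
        variants.add(urlunsplit(("https", netloc, path, query, frag)))
    if not netloc.startswith("www."):
        # example.com -> www.example.com
        variants.add(urlunsplit((scheme, "www." + netloc, path, query, frag)))
    if path in ("", "/"):
        # / -> /index.html
        variants.add(urlunsplit((scheme, netloc, "/index.html", query, frag)))
    variants.discard(url)
    return sorted(variants)
```

Each candidate can then be looked up in the URL index until one with status 200 turns up.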

But it's also possible to retrieve the "Location" header from the Common Crawl data.
Could you share how you've retrieved the successful (status 200) 50% of the home pages?
That would make it easier to proceed with the redirects.

Thanks,
Sebastian

Roxana Danger

unread,
May 17, 2018, 9:29:15 AM5/17/18
to common...@googlegroups.com
Hi Sebastian,
Thank you very much for explaining redirects and the solution using the "Location" header.

I am basically following the example by David Cedar (https://github.com/chedame/python-common-crawl-amazon-example and https://www.cedar.net.au/using-python-and-common-crawl-to-find-products-from-amazon-com/). I am using the function search_domain(domain) with domains extracted via Athena, and taking the record with the shortest URL from the record_list returned by this method, as in:

def homeRecord(record_list):
    # Assume the successfully fetched record (status 200) with the
    # shortest URL is the domain's home page.
    home = None
    for record in record_list:
        if record['status'] == '200':
            if home is None or len(record['url']) < len(home['url']):
                home = record
    return home

With it I am planning to use download_page(record) and BeautifulSoup to retrieve all the links of my interest.
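The link-extraction step could look roughly like the following dependency-free sketch using only the standard library (BeautifulSoup's find_all("a", href=True) does the same job; extract_links is a hypothetical helper name):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip empty hrefs and in-page anchors; resolve the rest
            # against the page URL so relative links become absolute.
            if name == "href" and value and not value.startswith("#"):
                self.links.add(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return sorted(parser.links)
```

Deduplicating via a set matters on real home pages, where navigation menus repeat the same hrefs many times.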

But of course, it feels like I am re-doing a bit of what you have already done. Is there a link graph that I can follow in a more "clever" way?

Best regards,
Roxana




--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.

Sebastian Nagel

unread,
May 17, 2018, 9:46:08 AM5/17/18
to common...@googlegroups.com
Hi Roxana,

> But of course, it feels like I am re-doing a bit of what you have already done. Is there a link
> graph that I can follow in a more "clever" way?

Unfortunately, we do not have a hyperlink graph at the page level. It would be quite big; the
host-level graph already has more than 2 billion nodes.

Here is only a short example of how you could approach the redirects. I would need some more time
to dig into David's project to give you a more concrete solution. Let me know if you need further help. Thanks!

Let's take one redirect record from the URL index:

https://index.commoncrawl.org/CC-MAIN-2018-17-index?url=commoncrawl.org/faq&output=json

{"urlkey": "org,commoncrawl)/faq",
 "timestamp": "20180423071727",
 "url": "http://commoncrawl.org/faq/",
 "status": "301",
 "mime-detected": "text/html",
 "mime": "text/html",
 "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ",
 "filename": "crawl-data/CC-MAIN-2018-17/segments/1524125945855.61/crawldiagnostics/CC-MAIN-20180423070455-20180423090455-00167.warc.gz",
 "offset": "1921949",
 "length": "739"}

You'll get the WARC record using the filename, offset and length, and uncompress it via "gzip -dc":

curl --range 1921949-$((1921949+739-1)) \
  https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-17/segments/1524125945855.61/crawldiagnostics/CC-MAIN-20180423070455-20180423090455-00167.warc.gz \
  | gzip -dc

WARC/1.0
WARC-Type: response
WARC-Date: 2018-04-23T07:17:27Z
WARC-Record-ID: <urn:uuid:76f3b8cf-4819-4d50-b9e5-e7e95aff027f>
Content-Length: 549
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:a876f3c0-97b3-4db5-bbd8-50d0415fc81e>
WARC-Concurrent-To: <urn:uuid:d0c1f496-e2fa-4a56-9dd2-5dcd02a29840>
WARC-IP-Address: 104.28.21.25
WARC-Target-URI: http://commoncrawl.org/faq/
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:7OOVMBC4RWNLOFTLCBT2JLALRDLKMMCV
WARC-Truncated: length
WARC-Identified-Payload-Type: text/html

HTTP/1.1 301 Moved Permanently
Date: Mon, 23 Apr 2018 07:17:27 GMT
Content-Type: text/html
Connection: close
Set-Cookie: __cfduid=deec9068100e1deb492202618ee09ce841524467847; expires=Tue, 23-Apr-19 07:17:27
GMT; path=/; domain=.commoncrawl.org; HttpOnly
X-Powered-By: PHP/5.5.9-1ubuntu4.21
Location: http://commoncrawl.org/big-picture/frequently-asked-questions/
CF-Cache-Status: EXPIRED
Vary: Accept-Encoding
Expires: Mon, 23 Apr 2018 11:17:27 GMT
Cache-Control: public, max-age=14400
Server: cloudflare
CF-RAY: 40fe8cef11bd2144-EWR


That's the same as done by download_page(record). Just get the "Location:" line from the header
and continue with its value:

https://index.commoncrawl.org/CC-MAIN-2018-17-index?url=http://commoncrawl.org/big-picture/frequently-asked-questions/&output=json

which returns a successfully fetched record (status 200):

{"urlkey": "org,commoncrawl)/big-picture/frequently-asked-questions",
 "timestamp": "20180423071727",
 "url": "http://commoncrawl.org/big-picture/frequently-asked-questions/",
 "status": "200",
 "mime-detected": "text/html",
 "mime": "text/html",
 "digest": "GV5WPSWENGGHOJ3W257HBX372BS4F7JY",
 "filename": "crawl-data/CC-MAIN-2018-17/segments/1524125945855.61/warc/CC-MAIN-20180423070455-20180423090455-00605.warc.gz",
 "offset": "72023082",
 "length": "7166"}


Be aware that it may happen that you need to follow a chain of redirects!
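The steps above (index lookup, ranged fetch, read "Location", repeat) can be sketched as follows. This is untested against the live index; lookup, fetch_warc_record and resolve are hypothetical helper names, and it assumes the index answers with newline-delimited JSON records exactly as shown above:

```python
import gzip
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-17-index"

def lookup(url):
    """Return the first URL-index record for a URL (raises on a 404)."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(INDEX + "?" + query) as resp:
        # The index returns one JSON record per line; take the first.
        return json.loads(resp.read().decode().splitlines()[0])

def fetch_warc_record(record):
    """Range-request a single WARC record from S3 and gunzip it."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    req = urllib.request.Request(
        "https://commoncrawl.s3.amazonaws.com/" + record["filename"],
        headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    return gzip.decompress(data).decode("utf-8", errors="replace")

def extract_location(warc_text):
    """Pull the redirect target out of the archived HTTP headers."""
    for line in warc_text.splitlines():
        if line.lower().startswith("location:"):
            return line.split(":", 1)[1].strip()
    return None

def resolve(url, max_hops=5):
    """Follow archived redirects until a status-200 record (or give up)."""
    for _ in range(max_hops):
        record = lookup(url)
        if record["status"] == "200":
            return record
        location = extract_location(fetch_warc_record(record))
        if location is None:
            return None
        url = location
    return None
```

The max_hops cap is there precisely because of the redirect chains mentioned above; a redirect may also point to a URL the crawl never fetched, in which case lookup will fail and the chain dead-ends.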


Hope that helps you to implement the solution.


Best,
Sebastian


On 05/17/2018 03:29 PM, Roxana Danger wrote:
> I am basically following the example by David Cedar
> (https://github.com/chedame/python-common-crawl-amazon-example and
> https://www.cedar.net.au/using-python-and-common-crawl-to-find-products-from-amazon-com/)

Roxana Danger

unread,
May 17, 2018, 9:52:27 AM5/17/18
to common...@googlegroups.com
Hi Sebastian,
that's great, and for sure I will reimplement the method to take this into account.
I will share it once it is running.
Best regards,
Roxana



Greg Lindahl

unread,
May 17, 2018, 8:43:43 PM5/17/18
to common...@googlegroups.com
On Thu, May 17, 2018 at 03:10:15PM +0200, Sebastian Nagel wrote:

> Often the redirect points to a variant you could guess using some basic heuristics:
> http:// -> https://
> example.com -> www.example.com
> / -> /index.html

Sebastian,

This was my reason for asking you to put the redirs in the index, and
I'm glad that they're there and this is fairly easy to do now!

In my not-common-crawl dataset of the top 15 million hosts, just over
1 million of them have a different-host redirect for their frontpage,
and half a million of them have a redirect from their frontpage to an
interior page like /index.php or whatever. So it's pretty common.

I don't count the extremely common situation of adding or subtracting
www or http/https, which have the same SURT and so all you have to do
is look for a '200' among the possible records.
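Greg's observation amounts to something like this tiny helper (pick_fetched is a hypothetical name): since http/https and www/no-www variants share one SURT key, the index records under that key can simply be scanned for a 200 without fetching anything.

```python
def pick_fetched(records):
    """Among URL-index records sharing one SURT key (http/https and
    www/no-www variants of the same frontpage), return a status-200
    record if any exists, else None."""
    for record in records:
        if record.get("status") == "200":
            return record
    return None
```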

-- greg
