Hi Roxana,
> But of course, it feels like I am re-doing a bit of what you have already done. Is there a link
> graph that I can follow in a more "clever" way?
Unfortunately, we do not have a hyperlink graph at the page level. It would be quite big; the
host-level graph already has more than 2 billion nodes.
Here is just a short example of how you could approach the redirects. I would need some more time
to dig into David's project to give you a more concrete solution. Let me know if you need further
help. Thanks!
Let's take one redirect record from the URL index:
https://index.commoncrawl.org/CC-MAIN-2018-17-index?url=commoncrawl.org/faq&output=json
{"urlkey": "org,commoncrawl)/faq", "timestamp": "20180423071727", "url":
"
http://commoncrawl.org/faq/", "status": "301", "mime-detected": "text/html", "mime": "text/html",
"digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "filename":
"crawl-data/CC-MAIN-2018-17/segments/1524125945855.61/crawldiagnostics/CC-MAIN-20180423070455-20180423090455-00167.warc.gz",
"offset": "1921949", "length": "739"}
You'll get the WARC record using the filename, offset, and length, and decompress it via "gzip -dc":

curl --range 1921949-$((1921949+739-1)) \
  "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-17/segments/1524125945855.61/crawldiagnostics/CC-MAIN-20180423070455-20180423090455-00167.warc.gz" \
  | gzip -dc
WARC/1.0
WARC-Type: response
WARC-Date: 2018-04-23T07:17:27Z
WARC-Record-ID: <urn:uuid:76f3b8cf-4819-4d50-b9e5-e7e95aff027f>
Content-Length: 549
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:a876f3c0-97b3-4db5-bbd8-50d0415fc81e>
WARC-Concurrent-To: <urn:uuid:d0c1f496-e2fa-4a56-9dd2-5dcd02a29840>
WARC-IP-Address: 104.28.21.25
WARC-Target-URI: http://commoncrawl.org/faq/
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:7OOVMBC4RWNLOFTLCBT2JLALRDLKMMCV
WARC-Truncated: length
WARC-Identified-Payload-Type: text/html
HTTP/1.1 301 Moved Permanently
Date: Mon, 23 Apr 2018 07:17:27 GMT
Content-Type: text/html
Connection: close
Set-Cookie: __cfduid=deec9068100e1deb492202618ee09ce841524467847; expires=Tue, 23-Apr-19 07:17:27 GMT; path=/; domain=.commoncrawl.org; HttpOnly
X-Powered-By: PHP/5.5.9-1ubuntu4.21
Location: http://commoncrawl.org/big-picture/frequently-asked-questions/
CF-Cache-Status: EXPIRED
Vary: Accept-Encoding
Expires: Mon, 23 Apr 2018 11:17:27 GMT
Cache-Control: public, max-age=14400
Server: cloudflare
CF-RAY: 40fe8cef11bd2144-EWR
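The same byte-range fetch can be done in Python; just a sketch, and the helper name
download_warc_record is mine (I'd have to check whether download_page(record) in David's project
does exactly this internally):

import gzip
import requests

def download_warc_record(record):
    """Fetch one WARC record by byte range and decompress it."""
    offset = int(record["offset"])
    length = int(record["length"])
    url = "https://commoncrawl.s3.amazonaws.com/" + record["filename"]
    # WARC files are concatenations of independently gzipped records,
    # so this byte range decompresses on its own.
    resp = requests.get(url, headers={
        "Range": "bytes=%d-%d" % (offset, offset + length - 1)})
    return gzip.decompress(resp.content).decode("utf-8", errors="replace")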
That's the same as what download_page(record) does. Just take the "Location:" line from the HTTP
header and continue with its value:
https://index.commoncrawl.org/CC-MAIN-2018-17-index?url=http://commoncrawl.org/big-picture/frequently-asked-questions/&output=json
which returns a successfully fetched record (status 200):
{"urlkey": "org,commoncrawl)/big-picture/frequently-asked-questions", "timestamp": "20180423071727",
"url": "
http://commoncrawl.org/big-picture/frequently-asked-questions/", "status": "200",
"mime-detected": "text/html", "mime": "text/html", "digest": "GV5WPSWENGGHOJ3W257HBX372BS4F7JY",
"filename":
"crawl-data/CC-MAIN-2018-17/segments/1524125945855.61/warc/CC-MAIN-20180423070455-20180423090455-00605.warc.gz",
"offset": "72023082", "length": "7166"}
Be aware that you may need to follow a chain of redirects!
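Putting it together, a rough sketch of a loop that resolves such a chain through the index
(lookup is a small wrapper around the index query above, download_warc_record is the helper
sketched earlier, and the hop limit of 10 is an arbitrary safety net):

import json
import re
import requests
from urllib.parse import urljoin

INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-17-index"

def lookup(url):
    """Return the first index record for a URL, or None if not indexed."""
    resp = requests.get(INDEX, params={"url": url, "output": "json"})
    if resp.status_code != 200:
        return None
    return json.loads(resp.text.splitlines()[0])

def resolve_redirects(url, max_hops=10):
    """Follow 3xx captures via their Location header until a non-redirect."""
    for _ in range(max_hops):
        record = lookup(url)
        if record is None or not record["status"].startswith("3"):
            return record  # a 200 capture (or 404 etc.), or nothing at all
        payload = download_warc_record(record)
        m = re.search(r"^Location:\s*(\S+)", payload,
                      re.MULTILINE | re.IGNORECASE)
        if m is None:
            return None
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(record["url"], m.group(1))
    return None  # redirect chain too long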
Hope that helps you to implement the solution.
Best,
Sebastian
> I am using the function search_domain(domain) with domains extracted via Athena, and getting the
> redirects example.com -> www.example.com and / -> /index.html.
>
> But it's also possible to retrieve the "Location" header from the Common Crawl data.
> Could you share how you've retrieved the 50% of home pages that were fetched successfully (status 200)?
> That would make it easier to proceed with the redirects.
>
> Thanks,
> Sebastian
>
> On 05/17/2018 01:27 PM, Roxana Danger wrote:
> > Hi all,
> > The CC work is great! Thanks for democratizing the data!
> >
> > I am trying to classify the pages in the first-level links per domain. To do so, my idea is to
> > retrieve the home page of each domain and then extract all of the hrefs inside this page.
> >
> > For now, I am just using the UK region, but at least half of the home pages (amongst the 10
> > biggest sites) have status 301. So there is no page to download, and therefore I can't retrieve
> > their direct links.
> >
> > Is there any better way to retrieve all of the URLs referenced directly by the home page of each domain?
> >
> > Thank you very much in advance.
> > Roxana