Discovering 301s / 302s?


Soren Flexner

Oct 18, 2015, 9:29:40 PM
to common...@googlegroups.com

  Hello,

  Hopefully someone out there with knowledge of the actual crawling process can help me out.  I'm interested in "discovering" redirects (301, 302, etc.), and I'm not sure whether this is possible using the Common Crawl index.

  From what I can tell, only the final 200 responses are saved and available.  Is this true?  Is the actual 301 response saved anywhere?
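
  To make the question concrete, here is a minimal sketch of how I would check for stored redirects, assuming the index answers CDX-style queries and returns one JSON record per line with a "status" field. The endpoint, the collection name (CC-MAIN-2015-40-index), and the sample records are assumptions for illustration, not confirmed output; nothing is fetched over the network here.

```python
import json
from urllib.parse import urlencode

# Assumption: public index endpoint and a 2015 collection name.
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2015-40-index"


def build_query(url, endpoint=INDEX_ENDPOINT):
    """Return the lookup URL for one page, asking for JSON-lines output."""
    return endpoint + "?" + urlencode({"url": url, "output": "json"})


def redirect_records(json_lines):
    """Keep only the records whose 'status' field is a 3xx code."""
    records = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if str(record.get("status", "")).startswith("3"):
            records.append(record)
    return records


# Made-up sample response, only to demonstrate the filtering step.
sample = "\n".join([
    json.dumps({"url": "http://example.com/", "status": "301"}),
    json.dumps({"url": "https://example.com/", "status": "200"}),
])

print(build_query("www.meraki.com/"))
print(redirect_records(sample))
```

  If the index only ever holds 200 records, redirect_records() would always come back empty for real responses, which is exactly what I am trying to find out.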

  Below is an example of a redirecting domain (www.meraki.com redirects to meraki.cisco.com).


  Using the Common Crawl index, there are no results for the domain originally requested, only for the redirect target.

  Here's a command-line version using wget.  Notice that the 301 redirect from www.meraki.com to meraki.cisco.com is reported during the fetch.
  Is this type of metadata stored anywhere, or is it discarded?

--2015-10-19 01:17:56--  http://www.meraki.com/
Resolving www.meraki.com (www.meraki.com)... 190.93.240.4, 190.93.241.4, 141.101.112.4, ...
Connecting to www.meraki.com (www.meraki.com)|190.93.240.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://meraki.cisco.com/ [following]
--2015-10-19 01:17:56--  https://meraki.cisco.com/
Resolving meraki.cisco.com (meraki.cisco.com)... 190.93.240.221, 141.101.123.221
Connecting to meraki.cisco.com (meraki.cisco.com)|190.93.240.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

    [ <=>                                                            ] 29,580      --.-K/s   in 0.02s   

2015-10-19 01:17:57 (1.59 MB/s) - ‘index.html’ saved [29580]
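
  The redirect hop in a log like the one above can also be pulled out mechanically. This is a rough sketch that assumes the wget output format shown here (request lines starting with a "--timestamp--" marker, an "awaiting response..." status line, and a "Location:" line); the function name redirect_hops is my own, and other wget versions or locales may print something different.

```python
import re

# Patterns matched against the wget log format shown in this post.
REQUEST_RE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")
STATUS_RE = re.compile(r"awaiting response\.\.\. (\d{3})")
LOCATION_RE = re.compile(r"^Location: (\S+)")


def redirect_hops(log_text):
    """Return (source_url, status, target_url) for each 3xx hop in the log."""
    hops = []
    current_url = None
    status = None
    for line in log_text.splitlines():
        match = REQUEST_RE.match(line)
        if match:
            # A new request starts: remember its URL, forget the old status.
            current_url = match.group(1)
            status = None
            continue
        match = STATUS_RE.search(line)
        if match:
            status = int(match.group(1))
            continue
        match = LOCATION_RE.match(line)
        if match and status is not None and 300 <= status < 400:
            hops.append((current_url, status, match.group(1)))
    return hops


# Excerpt of the log from this post.
log = """\
--2015-10-19 01:17:56--  http://www.meraki.com/
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://meraki.cisco.com/ [following]
--2015-10-19 01:17:56--  https://meraki.cisco.com/
HTTP request sent, awaiting response... 200 OK
"""

print(redirect_hops(log))
```

  Doing this per URL with live fetches obviously does not scale the way an index lookup would, which is why I am hoping the crawl already keeps this information.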


  Thanks for any advice!!
  Soren