Discovering 301s / 302s?

Soren Flexner

Oct 18, 2015, 9:29:40 PM

to common...@googlegroups.com

Hello,

Hopefully someone out there with knowledge of the actual crawling process can help me out. I'm interested in "discovering" redirects (301 / 302 / etc). I'm not sure if this is possible using the Common Crawl index.

From what I can tell, only the final 200 responses are saved and available. Is this true? Is the actual 301 response saved anywhere?

Below is an example of a redirecting domain (www.meraki.com redirects to www.meraki.cisco.com).

Using the Common Crawl index, there are no results for the originally requested domain, only for the redirect target:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://www.meraki.com : Returns zero results

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://www.meraki.cisco.com : Returns lots of results
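In case it helps anyone reproduce this, here is a rough Python sketch of the same two lookups. It assumes the index accepts an `output=json` parameter and returns one JSON object per line with a `status` field, and that an empty result may come back as HTTP 404. I have not confirmed either against this particular collection, and the function names are just my own.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from collections import Counter

INDEX_HOST = "http://index.commoncrawl.org"  # host taken from the URLs above


def build_query(collection, url):
    """Return the index query URL for `url`, asking for JSON lines (assumed option)."""
    params = urllib.parse.urlencode({"url": url, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{params}"


def parse_records(body):
    """Parse a response body that holds one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def status_counts(records):
    """Count records per HTTP status; 'status' is an assumed field name."""
    return Counter(str(r.get("status", "unknown")) for r in records)


def fetch_records(collection, url):
    """Query the index over the network and return the parsed records."""
    try:
        with urllib.request.urlopen(build_query(collection, url)) as resp:
            return parse_records(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: "no captures" is reported as 404
            return []
        raise
```

Calling `status_counts(fetch_records("CC-MAIN-2015-11", "http://www.meraki.com"))` and the same for `http://www.meraki.cisco.com` would show whether any 3xx records turn up for either URL.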

Here's a command line version using wget. Notice that the 301 redirect from meraki.com to meraki.cisco.com is reported during the fetch.

Is this type of metadata stored anywhere, or is it discarded?

    wget http://www.meraki.com
    --2015-10-19 01:17:56--  http://www.meraki.com/
    Resolving www.meraki.com (www.meraki.com)... 190.93.240.4, 190.93.241.4, 141.101.112.4, ...
    Connecting to www.meraki.com (www.meraki.com)|190.93.240.4|:80... connected.
    HTTP request sent, awaiting response... 301 Moved Permanently
    Location: https://meraki.cisco.com/ [following]
    --2015-10-19 01:17:56--  https://meraki.cisco.com/
    Resolving meraki.cisco.com (meraki.cisco.com)... 190.93.240.221, 141.101.123.221
    Connecting to meraki.cisco.com (meraki.cisco.com)|190.93.240.221|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: ‘index.html’
    [ <=> ] 29,580 --.-K/s in 0.02s
    2015-10-19 01:17:57 (1.59 MB/s) - ‘index.html’ saved [29580]
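To make clear what I mean by "discovering" a redirect, the wget trace above boils down to recording each hop's status code and Location header instead of silently following it. Below is a minimal Python sketch of that idea using only the standard library. The function names and the `fetch` parameter are my own invention for illustration, and this is in no way a description of how the Common Crawl crawler itself works.

```python
import http.client
import urllib.parse

REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_redirect(status):
    """True for the HTTP status codes that normally carry a Location header."""
    return status in REDIRECT_CODES


def fetch_hop(url, timeout=10):
    """Send one HEAD request without following redirects; return (status, location)."""
    parts = urllib.parse.urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, max_hops=5, fetch=fetch_hop):
    """Return [(url, status), ...] for each hop, stopping at the first non-redirect."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if not is_redirect(status) or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urllib.parse.urljoin(url, location)
    return chain
```

Running `redirect_chain("http://www.meraki.com/")` would list every intermediate hop with its status code, which is exactly the metadata I am hoping is kept somewhere in the crawl data.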

Thanks for any advice!!

Soren
