crawler still hitting my site despite months of 301s

burton...@gmail.com

Nov 25, 2014, 4:24:23 PM
to common...@googlegroups.com
I run a fairly large and well-known ecommerce site. The site used to live at one URL and, about a year ago, moved to a new one. We still see hits, all the time, for the old URL from CCBot, despite months of sending it 301 redirects. How do we get it to stop? We get a request once every 3 seconds.

Sample:

GET /family/index.jsp?categoryId=2788201&cp=2788201&f=Brand%252F2601%252F&fbc=1&lmdn=Apparel+Type&f=PAD%2FApparel+Type%2FSweatshirts+%26%23047%3B+Fleece&fbc=1&fbn=Apparel+Type%7CSweatshirts+%26%23047%3B+Fleece&fbx=0 HTTP/1.0
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

HTTP/1.0 301 Moved Permanently
Server: BigIP
Connection: close
Content-Length: 0

burton...@gmail.com

Nov 25, 2014, 4:27:46 PM
to common...@googlegroups.com
More samples, with less detail but showing the request frequency.

Nov 25 16:24:51 slot1/tmm info tmm[9211]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:24:54 slot3/tmm1 info tmm1[6335]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:24:56 slot2/tmm7 info tmm7[6583]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:24:59 slot3/tmm4 info tmm4[6338]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:01 slot3/tmm info tmm[6334]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:04 slot2/tmm6 info tmm6[6582]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:06 slot2/tmm5 info tmm5[6581]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:09 slot2/tmm7 info tmm7[6583]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:12 slot2/tmm2 info tmm2[6578]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:14 slot3/tmm4 info tmm4[6338]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:22 slot3/tmm2 info tmm2[6336]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:24 slot2/tmm info tmm[6576]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:27 slot1/tmm info tmm[9211]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:32 slot3/tmm info tmm[6334]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:35 slot2/tmm2 info tmm2[6578]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:37 slot3/tmm info tmm[6334]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:40 slot2/tmm2 info tmm2[6578]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:42 slot2/tmm info tmm[6576]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:45 slot3/tmm6 info tmm6[6340]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:48 slot2/tmm5 info tmm5[6581]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:50 slot3/tmm4 info tmm4[6338]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:53 slot2/tmm7 info tmm7[6583]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:55 slot2/tmm5 info tmm5[6581]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:25:58 slot2/tmm1 info tmm1[6577]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:26:00 slot3/tmm4 info tmm4[6338]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:26:03 slot2/tmm5 info tmm5[6581]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:26:06 slot1/tmm6 info tmm6[9219]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:26:08 slot1/tmm4 info tmm4[9217]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:26:11 slot1/tmm info tmm[9211]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
Nov 25 16:26:13 slot3/tmm5 info tmm5[6339]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp
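
For anyone who wants to check the request rate from a log like the one above, here is a minimal Python sketch. It assumes syslog-style timestamps in the first 15 characters of each line and takes the year 2014 from the date of this post; `request_gaps` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
from datetime import datetime

# Three lines copied from the log sample above.
SAMPLE = [
    "Nov 25 16:24:51 slot1/tmm info tmm[9211]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp",
    "Nov 25 16:24:54 slot3/tmm1 info tmm1[6335]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp",
    "Nov 25 16:24:56 slot2/tmm7 info tmm7[6583]: Rule store.nba.com <HTTP_REQUEST>: #NBA# 54.227.41.242 redirect nbastore.com /family/index.jsp",
]


def request_gaps(log_lines, year=2014):
    """Return the gaps, in seconds, between consecutive log lines."""
    # Syslog timestamps carry no year, so one is supplied explicitly (assumed).
    stamps = [
        datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        for line in log_lines
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    gaps = request_gaps(SAMPLE)
    print(gaps, sum(gaps) / len(gaps))
```

Run over the full sample, this gives gaps of mostly 2 to 3 seconds, which is consistent with the "once every 3 seconds" figure.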

Stephen Merity

Nov 25, 2014, 7:19:57 PM
to common...@googlegroups.com
Hi there,

My name's Stephen and I run the crawler at Common Crawl. Thanks for reaching out to us.

---
Regarding your questions

The crawler refreshes each of these URLs individually rather than per domain, because a 301 applies only to the individual URL that returned it. This explains why we make one request every three seconds, for a period of time, to the store.nba.com domain. As an organisation that hopes to benefit the wider web community, we take care in how we crawl and believe this low request rate is fine for most domains.

Because we may reach the URL from a different source that has not yet updated its links, we assume the link may now be valid and that the page no longer redirects, so we follow the URL to the old domain again. This is necessary because 301 redirects are heavily misused across the web and can't always be considered authoritative. Serving a 301 is generally a cheap request, so this is usually not a burden on the web server.

---
Solutions

I've now explicitly told the crawler to follow only current links that point to nbastore.com, avoiding anything that goes via the old store.nba.com domain. The change should take effect in less than half an hour, after which you should no longer see any requests to store.nba.com that result in a 301 redirect.

This should solve the issue you're seeing above, but if this still is not a reasonable solution for you, we follow the Robots Exclusion Standard (robots.txt) and provide details in our FAQ on how to either throttle the crawler to a lower request rate or block it altogether.
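
As a rough illustration of what throttling or blocking via robots.txt could look like, here is a small Python sketch that checks a candidate robots.txt with the standard library's `urllib.robotparser` before deploying it. The rules themselves are assumptions made for the example: the 10-second `Crawl-delay` and the `/family/` path are illustrative values, and `ccbot_policy` is a hypothetical helper. Please check the FAQ for the directives the crawler actually honours.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the old domain: slow CCBot down and keep it
# out of /family/, while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 10
Disallow: /family/

User-agent: *
Disallow:
"""


def ccbot_policy(robots_txt, url, agent="CCBot"):
    """Return (allowed, crawl_delay) for `agent` fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


if __name__ == "__main__":
    print(ccbot_policy(ROBOTS_TXT, "http://store.nba.com/family/index.jsp"))
```

To block the crawler from the whole site instead, the CCBot section would use `Disallow: /`.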

If you have any other questions, feel free to email me directly at ste...@commoncrawl.org. I do hope this email resolves any concerns you might have had.

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl