I just submitted the following reconsideration request to Google (and
would appreciate anyone's thoughts in the meantime):
Dear Google,
At HollywoodChicago.com, we have worked hard since early 2007 to
create a wide array of high-quality and original content. At its peak,
our Google PageRank climbed to about 5/10. A while back, I authorized
Examiner.com (7/10 PageRank) and Starpulse.com (6/10 PageRank) to
publish some of our content (i.e. film reviews and interviews) in
full.
When full HollywoodChicago.com content appears on Starpulse.com, a
link to the full HollywoodChicago.com article is always included.
While this was initially the case on Examiner.com as well, we now only
include a link to HollywoodChicago.com.
Either way, all full HollywoodChicago.com content on Starpulse.com and
Examiner.com is the rightful property of HollywoodChicago.com and is
originated by HollywoodChicago.com. A couple weeks ago, though, the
HollywoodChicago.com PageRank plummeted to 0/10. We are now losing
nearly all visibility and search traffic from Google.
This is a mistake and is not justified. We would appreciate a swift
correction to our search visibility. Jennie Johnson
([email address]) said on Sept. 30, 2008 that she was looking into
the issue and would get back in touch with me. She hasn't, though, and
hasn't responded to several follow-up requests.
Once this issue gets corrected, I am interested in learning why this
happened and writing about this on the Huffington Post so others can
make sure not to do the same thing. Thank you.
~ Adam M. Fendelman
Film Critic/Publisher, HollywoodChicago.com
Cell Phones Guide, About.com (part of the New York Times Co.)
Blogger, The Huffington Post
From what I can tell, it looks like your site is blocking the
Googlebot from accessing the content. You will likely see this in your
Webmaster Tools account, most likely under Diagnostics / Web Crawl /
HTTP errors. I wrote about similar situations in a recent blog post at
http://googlewebmastercentral.blogspot.com/2008/09/advanced-website-d...
If you are blocking IP addresses on your side, I would recommend
double-checking to make sure that you aren't blocking search engine
crawlers. Most search engines provide a simple way to verify crawlers
- the information for Googlebot is at
http://www.google.com/support/webmasters/bin/answer.py?answer=80553
Wow. That is incredibly useful. It may be the problem.
I just checked under Web crawl / Diagnostics and see 11,260 HTTP
errors due to a "4xx error" along with 2,684 URLs restricted by
robots.txt and 2,480 errors for URLs in Sitemaps. I wish Google had
e-mailed me this information so I would have known it was a
problem.
You mention that I may be blocking IP addresses on my side. Yes, I am.
When spammers post spam messages to my discussion forums, I sometimes
block those IP addresses. I've sometimes used the IP block format, for
example, 55.55.55.55 and I've sometimes used this format: 55.55.%
Could it be that I've inadvertently blocked Googlebot while trying to
block one of these spammers? If so, how would I know which IP address
unintentionally blocked Googlebot without deleting all the other ones
that are legit spammers?
Also, if this is the problem and I delete the IP address(es) in
question, will Google automatically start indexing me again? How long
might that take? Will my request for index reconsideration today cause
an issue with all this?
Thank you for all your help!
~ Adam M. Fendelman
Publisher, HollywoodChicago.com
> Wow. That is incredibly useful. It may be the problem.
> I just checked ...
I agree, WOW.
> You mention that I may be blocking IP addresses on my side. Yes, I am.
> When spammers post spam messages to my discussion forums, I sometimes
> block those IP addresses. I've sometimes used the IP block format, for
> example, 55.55.55.55 and I've sometimes used this format: 55.55.%
> Could it be that I've inadvertently blocked Googlebot while trying to
> block one of these spammers?
Well, maybe. It really depends on how your software works.
When you block an IP address in forum software, it typically checks
the IP address of someone trying to post something to the forum.
It normally does not block access to your site for those IP
addresses, just the ability to post. To be sure, check with the
designer or vendor of your forum/CMS.
Anyway, since Googlebot would not be trying to post anything
to your forums, it would not be affected by this type of block,
if it works the way I described above.
I believe what John Mu was referring to is a block by a firewall
or perhaps by the server itself. For that you likely need to check
with your hosting provider and their IT/networking staff.
I see you are on an Apache server so there could also be
blocking rules (Allow, Deny) in your .htaccess. Be sure
you check there also.
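
For the .htaccess check, a rough way to see whether any "Deny from" line
covers a given address is sketched below in Python. It is an illustration
only: the function name deny_rules_covering is made up, it handles full
IPs, partial IPs and CIDR/netmask ranges, and it ignores host names,
"Deny from all", env= forms and the Order/Allow interplay. A hit means
"look at this line", not "this is definitely the block".

```python
import ipaddress
import re

# Matches Apache 2.2-style "Deny from ..." lines (case-insensitive).
DENY_RE = re.compile(r"^\s*deny\s+from\s+(.+)$", re.IGNORECASE)


def _spec_covers(spec, addr):
    """True if one 'Deny from' argument covers addr (IP forms only)."""
    if "/" in spec:
        # CIDR (10.0.0.0/8) or network/netmask (10.0.0.0/255.0.0.0).
        try:
            return addr in ipaddress.ip_network(spec, strict=False)
        except ValueError:
            return False
    octets = spec.rstrip(".").split(".")
    if not all(o.isdigit() for o in octets) or not 1 <= len(octets) <= 4:
        return False  # host names and other forms are not handled here
    if len(octets) == 4:
        return str(addr) == ".".join(octets)
    # Partial IP such as "66.249.73" covers every address below it.
    return str(addr).startswith(".".join(octets) + ".")


def deny_rules_covering(htaccess_text, ip):
    """Hypothetical helper: list the Deny lines that cover the given IP."""
    addr = ipaddress.ip_address(ip)
    hits = []
    for line in htaccess_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # comment line
        match = DENY_RE.match(line)
        if match and any(_spec_covers(s, addr) for s in match.group(1).split()):
            hits.append(line.strip())
    return hits
```

Running it against the contents of your .htaccess with an address you
know belongs to a crawler shows which lines, if any, deserve a second look.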
> [..] Will my request for index reconsideration today cause
> an issue with all this?
When did you file a request for reconsideration?
What issues did you specify as rectified?
I'm not sure if you really even need to file a reconsideration
request, at least until the causes for all your HTTP errors
and Sitemap errors are addressed.
As far as the URLs blocked by robots.txt, what you specified
seems OK so they may not be real errors, they may be just
informational messages. Take a look at the actual URLs blocked
to determine if the blocking is appropriate. If not you may need to
adjust your robots.txt.
-- I think you need one blank line after the
Disallow: /?q=user/login/ directive in your robots.txt.
Maybe I'm being picky, but it wouldn't hurt to add one.
Also for any appropriately blocked URLs, make sure those URLs
are not included in your Sitemap.
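
One way to cross-check the Sitemap against robots.txt is sketched below,
using Python's standard urllib.robotparser module. The function name
sitemap_urls_blocked_by_robots is invented for this example, and the
standard-library parser is only an approximation of how Googlebot reads
robots.txt, so treat the result as a hint rather than the final word.

```python
from urllib import robotparser


def sitemap_urls_blocked_by_robots(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Hypothetical helper: return the Sitemap URLs that robots.txt disallows."""
    parser = robotparser.RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]
```

Any URL this returns is listed in the Sitemap but disallowed for the
crawler, so it should either come out of the Sitemap or be unblocked.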
> On Oct 7, 6:24 pm, JohnMu wrote:
> > Hi Adam and welcome to the groups!
> > From what I can tell, it looks like your site is blocking the
> > Googlebot from accessing the content. You will likely see this in your
> > Webmaster Tools account, most likely under Diagnostics / Web Crawl /
> > HTTP errors. I wrote about similar situations in a recent blog post at
> > http://googlewebmastercentral.blogspot.com/2008/09/advanced-website-d...
> > If you are blocking IP addresses on your side, I would recommend
> > double-checking to make sure that you aren't blocking search engine
> > crawlers. Most search engines provide a simple way to verify crawlers
> > - the information for Googlebot is at
> > http://www.google.com/support/webmasters/bin/answer.py?answer=80553
If you have allowed other sites with a higher PR to publish your
content then as far as Google is concerned it is going to appear as
duplicate content and they will ultimately select one source from
which to deliver results, and it would appear that they have not
chosen the original in your case.
I wouldn't allow anyone to use my content word for word - a digest
maybe, but not the same piece of work; presumably the trade-off for
you was links back from a trusted site, but it appears to have
backfired. From now on I'd be inclined to keep my content very firmly
on my site.
> Hope that helps,
> Abracadabra
> On Oct 8, 12:15 am, Adam Fendelman wrote:
It does a simple reverse IP lookup to determine the host name and then
checks that host name against the IP address again. I have a simple
python script that does this, I can clean it up and post it if you're
interested :).
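
While waiting for that script, here is a minimal sketch of the same idea
(this is not John's script, which isn't posted in this thread). It does a
reverse lookup, checks the host name's domain, then does a forward lookup
and confirms the original IP comes back. The function name
is_verified_googlebot and the two injectable lookup parameters are made up
for the example; the ".google.com" suffix is an assumption based on
Google's crawler verification guidance, while ".googlebot.com" matches the
host name seen in this thread.

```python
import socket

# Assumed crawler domains; adjust if Google's guidance differs.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Hypothetical helper: reverse-then-forward DNS check for one IP."""
    try:
        # Step 1: reverse lookup, IP address -> host name.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # Step 2: the host name must end in one of the crawler domains.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False
    try:
        # Step 3: forward lookup, host name -> list of IP addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # Step 4: the original IP must be among the forward results.
    return ip in addresses
```

The lookup functions are parameters only so the check can be tried
without network access; called with just an IP it uses the real resolver.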
Once Google is allowed to crawl your site again, things should pick up
fairly quickly. As Tim mentioned, you wouldn't need a reconsideration
request for this since it's not really an issue with the Webmaster
Guidelines.
Also, as Tim mentioned, the URLs blocked by the robots.txt are
possibly correctly blocked - you'd have to check your robots.txt and
those URLs to make sure, but in general they are fine (if you're happy
with your robots.txt file). We list URLs there to help you
double-check those disallow statements.
Not sure if this is a problem, but your site also returns a 401
Unauthorized for sitemap.xml in IE 7. The file displays in Firefox but
does not appear to use the Google Sitemap protocol. If you have a
Sitemap, make sure you're not blocking Google's access to it too.
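
A quick way to tell whether a file at least looks like a Sitemap is to
check its root element, as in the Python sketch below. It assumes the
sitemaps.org 0.9 namespace; a file written against an older schema would
be reported as not matching. The function name sitemap_kind is invented
for this example, and it checks only the root element, not the entries.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol (assumed to be the target).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_kind(xml_text):
    """Return 'urlset', 'sitemapindex', or None if the root is neither."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # not well-formed XML, e.g. an HTML error page
    for kind in ("urlset", "sitemapindex"):
        if root.tag == "{%s}%s" % (SITEMAP_NS, kind):
            return kind
    return None
```

A result of None for the body your server returns at sitemap.xml would
fit either problem mentioned above: an error page or a non-Sitemap format.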
To make sure this is indeed the problem and to make sure Google and
Google News (I assume that's being affected, too?) both start indexing
me again, I just deleted all the IP addresses that I had been blocking
because I thought they were spammers. This was my list of blocked IP
addresses:
I had never thought to do a reverse IP lookup on these. The way I got
these was from clear spam posts to my discussion forums. I just did a
reverse IP lookup on all of these now from http://remote.12dt.com/ and
I think we indeed found the problem! I was blocking 66.249.73.%, and
when I do a reverse IP lookup for that by filling in any number for
the last number (e.g. 66.249.73.11), it comes up with this:
66.249.73.22 resolves to
"crawl-66-249-73-22.googlebot.com"
Top Level Domain: "googlebot.com"
Country IP Address: UNITED STATES
Blocking 72.167.250.165 also appears to be bad, because I know
that secureserver.net has to do with my GoDaddy Web host:
72.167.250.165 resolves to
"ip-72-167-250-165.ip.secureserver.net"
Top Level Domain: "secureserver.net"
Country IP Address: UNITED STATES
So, I've just unblocked all of these and can now confirm that I
clearly was inadvertently blocking Googlebot. This isn't something
that would have ever crossed my mind. Now that that's deleted, this
should fix all my problems and I shouldn't have to do anything else,
right? Googlebot should automatically index me again soon, right?
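
For anyone who would rather not delete the whole block list, the check
can be narrowed to the entries that actually cover a known crawler
address. The Python sketch below assumes that "%" in a block entry means
"any run of characters" (as in an SQL LIKE pattern); that is a guess about
the forum software, not something confirmed in this thread. The function
names are made up for the example.

```python
import re


def like_pattern_matches(pattern, ip):
    """True if ip matches pattern, with % assumed to match any characters."""
    # Escape the literal parts and join them with ".*" where % appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, ip) is not None


def patterns_blocking(patterns, ip):
    """Hypothetical helper: return the block entries that cover the given IP."""
    return [p for p in patterns if like_pattern_matches(p, ip)]
```

Given the block list and 66.249.73.22, this would point at the
66.249.73.% entry alone, leaving the other entries in place.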
That does look like a Googlebot -- I'm glad you were able to spot and
remove the IP addresses. You should be able to see Googlebot's
activity in your server logs first. Once it starts crawling again,
things should return to normal over time (I can't give a specific
timeframe though, it depends on many factors). Anyway, I'm happy to
see it sorted out!
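
To watch for that in the server logs, something like the Python sketch
below could tally the status codes returned to requests that identify
themselves as Googlebot. It assumes the Apache "combined" log format,
which may not match your host's configuration, and the function name
googlebot_status_counts is invented for this example. The user-agent
string can be faked, so the IPs are still worth verifying through DNS as
described earlier in the thread.

```python
import re
from collections import Counter

# Combined format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(log_lines):
    """Hypothetical helper: count HTTP status codes served to Googlebot."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        # Only the claimed user agent is checked here, not the IP.
        if match and "googlebot" in match.group("agent").lower():
            counts[match.group("status")] += 1
    return counts
```

A shift from mostly 4xx codes to mostly 200s would suggest the crawler
is getting through again.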
Thanks so much! I'll continue monitoring this to make sure my PageRank
returns and Google indexing resumes. I'll also write up this story on
the Huffington Post so others can learn from it and avoid making the
same mistake.
Going forward, I now know to always do a reverse IP lookup before
banning an IP address to make sure I'm not inadvertently banning one I
shouldn't be.
Hi Adam
I'd love to see the link when you're done :)
Regarding PageRank, keep in mind that we only update the Toolbar
PageRank a few times a year and we just recently updated it. I
wouldn't really look at PageRank for the next couple of months. It may
catch up with indexing, but it will likely just be updated in the next
round.
Thanks again, John. As these posts are public and your profile is as
well, you're in it. It's currently pending editor approval at the
HuffPost. I'll link it here when it goes live. To blow some time in
the meantime, here's the conclusion to a somewhat similar but also
very different case I recently had with GoDaddy:
> Thanks, Adam! That's a great write-up. I'll pass it on to the team!
> Sending email alerts is a great idea, but I don't know how easy it
> would be to get that right. It's definitely something to look at
> though.
I haven't gotten any response. I have no documented evidence, but it
seems like our indexing in Google Blogs is preventing the majority of
our indexing in Google News. We are an approved news provider for
Google News, but when we publish new content, it gets immediately
indexed in Google Blogs and then often doesn't in Google News. It does
sometimes get indexed by Google News, but rarely. It pretty much
always just goes to Google Blogs and then not to Google News.
We'd obviously prefer indexing in Google News. It seems like it's a
"one or the other" kind of thing. I did just find this form to request
removal from Google Blogs:
My question is whether or not I should do that removal request and if
doing so would help us get indexed more by Google News. Any help? Much
appreciated.
I just received a good answer from Google about my Google News
indexing issue. It says two things:
1) My theory of an indexing conflict between Google News and Google
Blogs is wrong. They can be indexed by both without hurting the other.
That's good to know.
2) They've made some sort of update to their system to include
indexing for more of our articles within a few weeks. I don't know why
this was necessary -- as our site hasn't changed its format since the
beginning of the year -- but hopefully that'll help.
---------------------------------------------------------------------------
Hi Adam,
Thank you for bringing this to our attention. There is no "either/or"
relationship between Google News and Google Blogsearch; it's perfectly
possible for an article to be included in both services.
You can visit Google Webmaster Tools at
https://www.google.com/webmasters/tools/siteoverview, which offers
error reports specific to Google News. These error reports will explain
any problems we experienced crawling or extracting news articles from
your site.
In addition, we've updated your site's information in our system. We
should begin to crawl more articles from your site within a few weeks.
We appreciate your patience during this process. Thanks for your
interest in Google News.
Regards,
The Google Team
---------------------------------------------------------------------------