You didn't mention sitemap.xml and how that affects a page in the
Google index. If we don't list a page in our sitemap, will it still be
included in the Google index? It's very possible we'll have links to a
page not in our sitemap.
I am splitting my web site into two sites. I've moved the content, and
now I'm using .htaccess to rewrite the URLs from PlanetMike.com to the
new URLs at MichaelClark.name. I'm returning a 301 status code and
updating my sitemap.xml. What will happen when Googlebot next sees one
of my pages that has been moved? I assume the 301 will tell Googlebot
that the page isn't at PlanetMike.com any more, effectively removing
that page from Google, and at the same time will tell Googlebot to add
the new page to its records for MichaelClark.name.
I've been watching both domains pretty closely in the Webmaster Tools
area, and it looks like everything is working smoothly.
As for password-protected content, are you sure that you don't index
it based on third-party signals like ODP listings or strong inbound
links?
You totally forgot to mention the neat X-Robots-Tag that allows
outputting REP tags like "noindex" even for non-HTML resources like
PDFs or videos in the HTTP header. That's an invention Google can be
very proud of. :)
@Ian M, who in the comments asks for a "Noindex:" statement in
robots.txt
Currently Google interprets Noindex: in robots.txt as (Disallow: +
Noindex:). I think that's completely wrong, because:
1. It's not compliant with the Robots Exclusion Standard.
2. It confuses Webmasters because "noindex" in robots.txt means
something completely different than "noindex" in meta tags or HTTP
headers.
3. Mixing crawler directives and indexer directives this way is a
plain weak point that will produce misunderstandings, resulting in
traffic losses for Webmasters and less compelling content available
to searchers. All indexer directives (noindex, nofollow, noarchive,
noodp, unavailable_after etc.) do require crawling when put elsewhere.
I've been doing Webmaster support for ages and I assure you that
Webmasters will not get it. If nobody understands it and adopts it,
it's as useless as Yahoo's robots-nocontent class name that only 500
sites on the whole Web make use of.
4. The REP's "noindex" tag has an implicit "follow" that Google
ignores in robots.txt for technical reasons (it's impossible to follow
links from uncrawled pages). When I put a robots meta tag with a
"noindex" value, then Google rightly follows my links, passes PageRank
and anchor text to those, and just doesn't list the URL on the SERPs.
When I do the same in robots.txt, Google behaves totally differently, for
no apparent reason. (Of course there's a reason but I want to keep
this statement simple.)
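To make point 4 concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt rules, the example.com URLs, and the helper name may_crawl are made up for illustration, and this does not model how Googlebot itself works. It only shows why a Disallow rule and a robots meta "noindex" cannot behave the same way: a compliant crawler never fetches a disallowed page, so it cannot read a meta tag on it or follow its links.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/private/page.html",
                "http://example.com/public/page.html"):
        # A page that may not be crawled cannot have its meta tags read
        # or its links followed, whatever those tags say.
        print(url, "crawlable:", may_crawl(ROBOTS_TXT, "Googlebot", url))
```

A page under /public/ can be fetched, so a "noindex" meta tag on it is seen and its links can still be followed; a page under /private/ is never fetched at all.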
Having said all that, I appreciate it very much that Google is working
on evolving robots.txt. Kudos to Google! However, please don't assign
the semantics of crawler directives to established indexer directives;
that doesn't work out. I see the PageRank problem, and I think I know
a better procedure to solve it. If you're interested, please read my
"RFC" linked above. ;)
@all
Do not make use of experimental robots.txt directives unless you
really know what you're doing, and that includes monitoring Google's
experiment very closely. If you have the programming skills, you're
better off using X-Robots-Tags to steer indexing and deindexing of
your resources at the site level. X-Robots-Tags work with HTML content
as well as with all other content types.
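For those who want to try the X-Robots-Tag route, here is a rough sketch of one way to do it in Python, as a small WSGI wrapper that adds the header to responses of chosen content types. The names add_x_robots_tag, demo_app and response_headers are hypothetical, and the choice of application/pdf is just an example; your server setup may call for a different approach entirely.

```python
def add_x_robots_tag(app, value="noindex", content_types=("application/pdf",)):
    """Wrap a WSGI app so that responses with one of the given content
    types also carry an X-Robots-Tag header with the given value."""
    def middleware(environ, start_response):
        def tagged_start_response(status, headers, exc_info=None):
            content_type = ""
            for name, header_value in headers:
                if name.lower() == "content-type":
                    # Drop parameters such as "; charset=utf-8".
                    content_type = header_value.split(";")[0].strip().lower()
            if content_type in content_types:
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)
        return app(environ, tagged_start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical app: a placeholder PDF for *.pdf paths, HTML otherwise."""
    if environ.get("PATH_INFO", "").endswith(".pdf"):
        start_response("200 OK", [("Content-Type", "application/pdf")])
        return [b"%PDF-1.4 (placeholder)"]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>Hello</body></html>"]


def response_headers(app, path):
    """Call a WSGI app for the given path and return its headers as a dict."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    list(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return captured


app = add_x_robots_tag(demo_app)

if __name__ == "__main__":
    print(response_headers(app, "/report.pdf"))
    print(response_headers(app, "/index.html"))
```

Passing content_types=("text/html",) instead would tag HTML responses the same way, which is the point about the header working for every content type.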
@Riona
I hope you don't mind the cross posting. :)
Thanks for your time and have a nice day!
Sebastian
It's entirely possible for pages not listed in a Sitemap to get
crawled and indexed; we use data from Sitemaps to supplement our usual
crawl and discovery procedures, but they're not the only way that we
find out about URLs to crawl.
It sounds like you're on the right track, though--301 redirecting each
page on the old site to the corresponding page on the new site is the
way to go. As we crawl the old pages we'll see the 301 redirects and
know that that content is now found on the corresponding URL on your
new site.
However, it looks like you haven't yet implemented the 301 redirects
on planetmike.com, right? (If so, you might want to double-check,
because I tried a couple URLs and didn't get redirected anywhere.)
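If it helps with the double-checking, a small script along these lines could test each old URL without following the redirect. This is only a sketch: the function names and the sample path in URL_MAP are placeholders, not real pages on either site, and it uses Python's standard http.client module.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical old-to-new mapping; replace with your real URLs.
URL_MAP = {
    "http://planetmike.com/some/page.html":
        "http://michaelclark.name/some/page.html",
}


def fetch_status_and_location(url, timeout=10):
    """Send a single HEAD request (redirects are not followed) and
    return the status code and the Location header, if any."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe_redirect(status, location, expected_url):
    """Classify one response against the redirect we expect to see."""
    if status == 301 and location == expected_url:
        return "ok"
    if status == 301:
        return "301, but to an unexpected URL"
    if 300 <= status < 400:
        return "3xx response, but not a 301"
    return "no redirect"


if __name__ == "__main__":
    for old_url, new_url in URL_MAP.items():
        status, location = fetch_status_and_location(old_url)
        print(old_url, "->", describe_redirect(status, location, new_url))
```

Anything that does not print "ok" is a page where the redirect is missing, temporary, or pointing somewhere other than the corresponding new page.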
> It sounds like you're on the right track, though--301 redirecting each
> page on the old site to the corresponding page on the new site is the
> way to go. As we crawl the old pages we'll see the 301 redirects and
> know that that content is now found on the corresponding URL on your
> new site.
> However, it looks like you haven't yet implemented the 301 redirects
> on planetmike.com, right? (If so, you might want to double-check,
> because I tried a couple URLs and didn't get redirected anywhere.)
Thanks for looking. I'm slowly getting everything tweaked around.
There is a lot of cruft that has accumulated over 8 years. It's very
likely I've moved some files and directories and haven't got the
redirect working nicely yet. I should have gone a bit slower than I
did.
Mike
Thanks for the video, but can you please explain a bit about removing
https pages using Webmaster Tools, as the URL removal console
starts with http? Is there a help file I can read for it?
Thanks, looking for an answer; we've been discussing it at WMW for a few
days without a proper answer.
Regarding finding your indexed https pages, you could look at your
logs or analytics data to see which URLs are getting referrals from
Google search results.
Regarding the URL removal requests getting denied: if you request
removal of a URL that isn't indexed, the request should be marked
'Removed' since that URL isn't in our index ("removed" and "not
indexed" are basically synonymous in this case). If your requests are
getting denied, it's likely that the URL(s) in question don't meet the
criteria for removal. Take a close look at what you need to do to make
a URL eligible for removal:
>> Regarding finding your indexed https pages, you could look at your logs or analytics data to see which URLs are getting referrals from Google search results.
It may not be getting any traffic yet; I want to avoid the duplicate
pages to be on the safe side. Does Google have a command like
site:www.domain.com:443? If not, could Google develop one? It would
be really helpful.
>> Take a close look at what you need to do to make a URL eligible for removal:
Can you please add a section where we can check if a page satisfies
the conditions for removal or not? I think I have done my part by
adding it to the robots.txt
> Regarding finding your indexed https pages, you could look at your
> logs or analytics data to see which URLs are getting referrals from
> Google search results.
> Regarding the URL removal requests getting denied: if you request
> removal of a URL that isn't indexed, the request should be marked
> 'Removed' since that URL isn't in our index ("removed" and "not
> indexed" are basically synonymous in this case). If your requests are
> getting denied, it's likely that the URL(s) in question don't meet the
> criteria for removal. Take a close look at what you need to do to make
> a URL eligible for removal:
Right now the https version of your site, including the robots.txt
file, doesn't seem to be allowing connections; if Googlebot can't
access your robots.txt file, it won't be able to see whether your
URL(s) have been disallowed in the file, which may have something to
do with your removal requests being denied.
Have you considered using methods other than the URL removal tool to
remove any https URLs from the index? The URL removal tool is good for
urgent requests, but you can accomplish the same thing in many other
ways, most of which would probably be easier to implement since you
don't know exactly what URLs you want to remove. You could 301
redirect https pages to their http version, add an X-Robots-Tag header
with "noindex", or a variety of other methods:
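As a rough illustration of the first option, 301 redirecting https pages to their http versions, here is a minimal Python sketch written as a WSGI wrapper. The names redirect_https_to_http and demo_app are hypothetical, the host handling is simplified (it just drops any port and does not handle IPv6 literals), and in practice the same thing is often done in the web server configuration instead.

```python
from urllib.parse import quote


def redirect_https_to_http(app):
    """Wrap a WSGI app so that any request arriving over https gets a
    301 redirect to the same host, path and query string over http."""
    def middleware(environ, start_response):
        if environ.get("wsgi.url_scheme") != "https":
            return app(environ, start_response)
        # Simplified host handling: strip a port such as ":443".
        host = environ.get("HTTP_HOST", environ.get("SERVER_NAME", ""))
        host = host.split(":")[0]
        path = quote(environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", ""))
        query = environ.get("QUERY_STRING", "")
        location = "http://" + host + (path or "/")
        if query:
            location += "?" + query
        start_response("301 Moved Permanently",
                       [("Location", location), ("Content-Length", "0")])
        return [b""]
    return middleware


def demo_app(environ, start_response):
    """Hypothetical app standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def call(app, environ):
    """Call a WSGI app and return (status, headers as a dict)."""
    result = {}

    def start_response(status, headers, exc_info=None):
        result["status"] = status
        result["headers"] = dict(headers)

    list(app(environ, start_response))
    return result["status"], result["headers"]


app = redirect_https_to_http(demo_app)

if __name__ == "__main__":
    print(call(app, {"wsgi.url_scheme": "https",
                     "HTTP_HOST": "www.example.com:443",
                     "PATH_INFO": "/page.html",
                     "QUERY_STRING": ""}))
```

Requests that already arrive over http pass straight through to the wrapped app, so only the https duplicates are redirected.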