How Do I Know Which Pages Are Not Indexed? Delisted?


cyberkid

May 3, 2008, 12:51:09 PM
to SOFTplus GSiteCrawler
I need to figure out which pages are not listed at Google and when.

Briefly, I see where the crawler can download the pages that are
indexed.

But is there some place where it compares the sitemap it can download
to the one that was uploaded?

Lately, either at Google or MSN, the bot has started to delist pages,
with forum moderators going on about relevance issues...


Thank you!

webado

May 3, 2008, 1:20:32 PM
to SOFTplus GSiteCrawler
Google does not index pages because they are in the sitemap. Similarly
it does not omit pages from the index because they are not in the
sitemap.
All indexing starts with a crawl from the root. All pages found that
way can get indexed and will be indexed unless they are so badly
broken as to provide no indexable content.

Being indexed and appearing in searches are different things. Being
indexed is only the first prerequisite to appearing in searches.
Another one is to actually have the search terms on the page or in
links pointing to the page. But to actually appear in a search your
page has to be well indexed and rank better (have more, better
backlinks) than most pages that are also indexed and which also
qualify for the same search.


A site query tells you what is indexed. Not how well.

Whatever is missing means it's not indexed. Whether it's because it's
been delisted or simply not found to be relevant at all, cannot be
said in general. It's always in the specifics, after you analyze
everything.

Delisting, unless it is because the page cannot be crawled, is more
likely due to breaking the Webmaster Guidelines than anything else:
hidden content, spamminess, etc.

cyberkid

May 3, 2008, 3:44:29 PM
to SOFTplus GSiteCrawler
Hi, thanks, that makes sense and the logic holds as a rule.

The question is how to use GSiteCrawler to compare what Google has
indexed against what was uploaded.

In that way I can identify the pages that are problematic.

There are many factors and thank you for pointing out the most common.

In some way it is at times profound to see how a page appears that is
not relevant to the search term but is in fact a page from the site.

It surprises me to see how the Google bot creates such an anomaly as to
show pudding pie when a recipe for apple pie was the question.

In that way I have found that the search word phrases can cause the
bot to move away from the most obvious cue, and that is the page title.
Why it ignores the title is baffling if it determines instead to show
another subject or content page.
The site is purely a listing of items as generated from the
distributors or manufacturers that handle them. Little if any of the
content or language on the site is my own.

That is a fair question to most, I'm sure. We can create Web sites
regardless of our direct knowledge of the content or items listed.

Or put another way, I'm not a chef or a dessert cook. But I do get
stuck with the dishes.

So again, I think it is relevant to know which pages are not listed
and to also know when pages are delisted.

I've seen indexing go from one extreme to another since introducing
the new site which is virtually a dynamic mirror of its old obsolete
static site.

It's the obsolete site, which Google once indexed and no longer does,
that has caused all of the problems.

I failed to realize that the backlinks or old links to the site
required that I input a 302 header or some such protocol so the
integrity of those links was maintained.

There is still tremendous bias even in asking questions, that makes it
difficult to sort through or resolve the computational problems from
logic.

Again, I only want to find a way in which to use GSitecrawler to see
which pages are not listed that were submitted on the sitemap.

webado

May 3, 2008, 6:15:10 PM
to SOFTplus GSiteCrawler
You should have 301 redirected the urls from the obsolete site to the
new corresponding urls on the new site.

Until and unless you do that the new site's pages will not be indexed
as they will be duplicates or near duplicates of the old ones.

In particular, if the old urls no longer exist, yet when you try to
access them the server does not produce a 404 - not found (or 410 -
gone forever), then the robots do not know that these urls are no
longer to be used and should be removed from the index. The 301
redirection however is best, since it transfers the value of the old
pages (pagerank) to the new ones where they are getting redirected.

Note that a 302 is not good in this instance, as it represents a
temporary redirection. It has to be a 301 (which is a permanent
redirection) or else let them return a 404, which will allow you to
request removal of those pages from the index if you don't have the
patience to wait until they drop out by themselves (could take very
long). A 200 response is fatal because it says the page is still there.
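
To see what the old urls actually return, something like the following could be used. It is only a rough sketch in Python, not anything GSC does: `fetch_status` and `classify_status` are made-up helper names, the example url is a placeholder, and it assumes the server answers HEAD requests the same way it answers normal page requests (not every server does).

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Return (status code, Location header) for a url WITHOUT following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify_status(status):
    """Map a status code to the advice given in this thread."""
    if status == 301:
        return "ok: permanent redirect"
    if status in (404, 410):
        return "ok: reported as gone, can be removed from the index"
    if status == 302:
        return "problem: temporary redirect, use a 301 instead"
    if status == 200:
        return "problem: old url still answers as a live page"
    return "check manually"


# Example use (placeholder url, needs network access):
# status, location = fetch_status("http://www.example.com/old-page.html")
# print(status, location, classify_status(status))
```

Run it over the list of old urls and look at anything marked "problem" first.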

I cannot see a way to use GSC in any direct manner to find out which
urls are not yet listed.

I can recommend you download the table "pages with internal links"
from your Webmaster Tools account (the Links tab > Internal Links).
Import that into an Access database. Then run GSC to generate a
sitemap - if you pick a simple url list in text form, you can then
import that into your Access database as well, as a second table. Then
you may be able to create some query that compares the two tables to
show you what's in one and not the other, and then vice versa.
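
If Access is not at hand, the same comparison can be sketched in a few lines of Python. This assumes both lists have already been saved as plain text files with one url per line (the Webmaster Tools download would need its url column extracted first); `load_urls`, `compare_url_sets` and the file names are made up for the example. Urls are compared as exact strings.

```python
def load_urls(path):
    """Read a text file with one url per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_url_sets(sitemap_urls, google_urls):
    """Show what is in one list and not the other, in both directions."""
    sitemap_urls = set(sitemap_urls)
    google_urls = set(google_urls)
    return {
        "only_in_sitemap": sorted(sitemap_urls - google_urls),
        "only_in_google": sorted(google_urls - sitemap_urls),
    }


# Example use (hypothetical file names):
# result = compare_url_sets(load_urls("gsc_urllist.txt"),
#                           load_urls("webmaster_tools_urls.txt"))
# for url in result["only_in_sitemap"]:
#     print("in sitemap, not reported by Google:", url)
```

Keep in mind the internal links table only approximates what Google knows about, so "only_in_sitemap" is a list of pages to investigate, not proof of delisting.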

cyberkid

May 5, 2008, 11:40:48 AM
to SOFTplus GSiteCrawler
The old site is the same URL as the new site. I'm sorry that it is not
easy for me to clarify such things or follow the protocol as directed.
That, and I had to consolidate the sub-webs, sub-domains, and our
stand-alone site to all fit into the new database-driven site.

At any rate, that was this time last year and things went steadily
downhill from there. I had presumed that the system was much more
nimble and would recognize a URL it has indexed for the past ten years.
That, and it should have picked up a couple of other things, like the
site being on a different server running in an ASP.NET environment
with aspx pages instead of HTML.

It doesn't. I would have thought, given the degree of technology, that
it would be logical for the system to track its own indexing paths.
Checking against the sitemap, as I have tried to do, is still an idea
that I wish I could follow through on.

I'd like to know which pages are no longer indexed. Again, it's just
my own sense of logic and it is obvious from all the discussions not
to be.

webado

May 5, 2008, 11:55:15 AM
to SOFTplus GSiteCrawler
Page content is always HTML regardless of the extension the url has.
A url where even one character is different from another one is a
different url.
Period.

You have to 301 redirect all urls that have changed.
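
To illustrate the idea only: a 301 is just a lookup from the exact old url to the new one. The sketch below uses Python's WSGI convention, which is not what an ASP.NET site would use, and the paths in `REDIRECTS` are invented examples. Anything not in the table gets a 404 here, whereas a real site would serve its normal pages.

```python
# Hypothetical old path -> new path table; exact string match only.
REDIRECTS = {
    "/recipes/apple-pie.html": "/recipes/apple-pie.aspx",
}


def redirect_app(environ, start_response):
    """Minimal WSGI app: 301 for known old paths, 404 for everything else."""
    path = environ.get("PATH_INFO", "")
    target = REDIRECTS.get(path)
    if target is not None:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]
```

Note that "/Recipes/apple-pie.html" with a capital R would not match the table entry, because to a robot it is a different url.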