This could be perceived as Internal Duplication.
This could also be a possible issue for Link Loss (the PR value of
inbound links is split across different URLs rather than consolidated).
Employ a server-based 301 redirect from the unwanted format to the
preferred format.
Select the preferred format in GWMT (Google Webmaster Tools).
Ensure any/all links use the same format.
Ensure that the search engine sitemap uses the preferred format.
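As a rough sketch of the redirect decision (Python purely for illustration; your site looks like ASP, so treat this as logic to port, not code to drop in). Treating the www host as the preferred format is my assumption, and the names PREFERRED_HOST, ALTERNATE_HOSTS and canonical_redirect are made up for the example:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the www host is the preferred format; swap the two if you prefer the bare domain.
PREFERRED_HOST = "www.gadgetguy.com.au"
ALTERNATE_HOSTS = {"gadgetguy.com.au"}


def canonical_redirect(url):
    """Return (301, target_url) when the URL uses an unwanted host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        # Keep path, query and fragment; only the host changes (any port is dropped).
        target = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, target
    return None


print(canonical_redirect("http://gadgetguy.com.au/product.asp?id=14"))
print(canonical_redirect("http://www.gadgetguy.com.au/"))
```

The point is that the redirect keeps the path and query string intact, so each old URL maps to exactly one preferred URL instead of everything landing on the homepage.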
.
::: Internal Duplication / Multiple URL issues :::
I seem to be able to reach certain content through different URLs....
This could be perceived as Internal Duplication.
This could also be a possible issue for Link Loss (the PR value of
inbound links is split across different URLs rather than consolidated).
Try to ensure only 1 URL is used with any/all links.
If absolutely necessary (such as after fixing this issue and to cover
older items), employ a server-based 301 permanent redirect to the
preferred item URL.
Include the preferred URL in the search engine sitemap (and exclude
the unwanted one).
.
::: Strange URLs :::
Some URLs contain unusual characters...
These URLs also seem to have an assortment of words in them that don't
seem very 'focused'... instead they look 'stuffed' with possible
keywords/terms.
.
::: Invalid Code :::
You seem to be mixing HTML and XHTML formatting in some of the markup... a
prime example is in the Head section...
<meta name="robots" content="all" />
<meta name="description" content="GadgetGuy.com.au - Reviews, news,
comparisons and buyers guides on the latest gadgets, computers, home
theatre, phones, games and cool stuff!" />
Your DocType says HTML ... so those items should not be ending with a
/>
only with a
>
.
::: You MAY have hidden content :::
Looking through the source code, I see this...
Thanks for the reply Autocrat, I'll speak to our developer to go over
the points you've raised. Sorry about the re-posting, I didn't know
about the 'bump' option.
And here's the feedback I received on your comments:
1. No external links, internal links, or sitemap links use http://gadgetguy.com.au so this would not be an issue.
2. Internal Duplication is a very recent issue; however again the
Sitemap and all the Index and Section home pages use the latest links
which are being crawled, so this should not be a significant issue.
Duplication does not explain why neither page would be indexed, or why
deeper indexing other than Homepage and Section Home is not occurring.
3. Commas in URLs we will investigate.
4. Long/keyword stuffed URLs. Have edited to fix issues of length and
keyword stuffing.
5. Invalid Code - mmm... I very much doubt this is an issue. The
DocType is used by the browser to work out how to interpret the HTML
at render time. Google has no interest in this, so I see no reason why
this would cause a problem.
6. We don't have hidden content. The DIV layer being hidden is an
empty formatting control.
7. Google is happy the sitemap is valid. We also validated the sitemap
independently before submitting. However, as this is dynamically
generated there could be new content or very old content in it that is
appearing for the first time; nonetheless if there was a problem the
Google Webmaster report would report it - and there are no issues
showing at the moment.
How about Hidden links/text?
Viewing the page without images shows the top nav bar becomes
'invisible' ....
Viewing the page without images shows a lot of text that is hard to
read.
(hell, even if it isn't an SE issue, should be fixed for certain user-
types.)
.
How about applying rel="nofollow" to the links in noscript tags?
Review...
http://gadgetguy.com.au/product.asp?id=14&m_review=0 Erm... not really seeing anything important on here...
nothing specific to reviewing the Russel Hobbs rice cooker anyway
(inc. irrelevant title, h1 etc.)
Weak Image Alt attributes...
Why does it just say 'Product' ?
Not exactly useful to blind people (including bots ;))...
so missing an opportunity to not only be helpful, but possibly get an
extra point or two with the SE's.
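If you want to find those weak alt attributes in bulk rather than by eye, something along these lines might do (a minimal sketch using Python's standard html.parser; the WEAK_ALTS list and the function names are my own invention, so adjust to taste):

```python
from html.parser import HTMLParser

# Assumption: these placeholder values count as "weak" alt text; extend as needed.
WEAK_ALTS = {"", "product", "image", "photo", "picture"}


class AltAudit(HTMLParser):
    """Collects the src of every <img> whose alt text is missing or generic."""

    def __init__(self):
        super().__init__()
        self.weak = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        alt = (attr.get("alt") or "").strip().lower()
        if alt in WEAK_ALTS:
            self.weak.append(attr.get("src"))


def find_weak_alts(html):
    audit = AltAudit()
    audit.feed(html)
    audit.close()
    return audit.weak


sample = (
    '<img src="a.jpg" alt="Product">'
    '<img src="b.jpg" alt="Rice cooker, front view">'
    '<img src="c.jpg">'
)
print(find_weak_alts(sample))
```

Run it over the saved source of a few product pages and you get a list of images to fix. It only flags missing or generic text; it can't judge whether a longer alt is actually descriptive.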
.
Any external duplicate content...
Is the content 'yours' or is it from other sources?
Is it being 'fed' to other sites?
Is it being scraped/copied by other sites?
.
.
.
Hope all that helps.
(May be worth getting the developer in as well, saves you running back
and forth ;))
Now I'm all for having descriptive URLs, but .... this seems to be
taking it a bit too far and I have a bit of trouble identifying
anything that matches in the content of your page.
The problem with URLs like this is that they almost appear to be
random and in fact I can get exactly the same page by using something
like: http://www.gadgetguy.com.au/xyzzy-42.html . In general, you
should make sure that you have only one URL that leads to your content
-- all others should either redirect to the proper URL or return HTTP
result code 404 to signal that the URL is invalid. Without that, your
site is leading us (and all other crawlers) on a wild goose chase.
If your CMS is not able to handle this properly (one URL per piece of
content), I would recommend not using rewritten URLs so that we can
recognize and skip over unimportant parameters in your URL query
string.
2. Broken HTML code
In general, we try to get it right regardless of what a webmaster uses
on his page. However, there are limits to what we can guess at.
Although this is definitely not as important as the first point, you
can see this happening when you search for something like:
http://www.google.com/search?q=site:www.gadgetguy.com.au+intitle:shor...
Thanks for the welcome, and for the feedback - I'm sure to be a
regular here to read up on the finer points of SE stuff.
OK, I have almost finished paring back the URLs of all the site
sections, the pages that have actually been indexed. Of the pages that
aren't being indexed, the URLs are only the article title or the
product name.
I'm told that via Webmaster Tools it appears that Google has indexed
about 70% of our sitemap. Can you tell me what the relationship is, or
lag is, between a sitemap being indexed and the pages making it into
the search index.
I have also passed along the URL of this thread to our developer, so
either he will chime in or I'll continue to be the point man in
hopefully getting us more SE-friendly.
Don't worry too much about what appears in the SiteMap indexation
information...
it tends to be a little 'off target' (putting it nicely ;)).
.
For a more accurate idea, you may want to use the site: operator.
Enter this in the Google Search...
site:http://gadgetguy.com.au
Then browse to the very last page of results.
The figure in the top right is a pretty good idea of the number of
pages indexed.
You may also want to try some variants...
site:gadgetguy.com.au
site:gadgetguy.com.au/*
Also, if you see the paragraph about 'omitted results', do try that...
and again go to the end.
.
Please be aware of Google DataCenters...
Google has info on numerous computers/networks.
Some of these are a little less up-to-date than others... and you may
end up connecting to one of those, and get different results.
The DC contacted can be random... can be influenced by ISP, Location,
whether using local Google or .com Google, possible Browser Googles
(like in MFF) etc.
Always perform the same search several times...
And don't panic if you see changes in the results.
I work at the company which developed the GadgetGuy web site.
We would really appreciate advice on the real issue we've been
battling with here.
There are over 2,000 links to this website around the Internet, many
from respectable sites with good pagerank.
We have two sitemap feeds to Google. Both report about 75% of the
pages as Indexed, however we are only seeing pages from one sitemap
showing on Google. This is despite the fact the Googlebot is also
constantly crawling the site.
This has been the case for some time and some of the issues now of
multiple URLs pointing to one page have come about due to attempts to
fix this issue - such as making URLs shorter (which is controlled by
content authors using the CMS by the way).
It seems many of our pages are on the Google supplementary index.
Interestingly, on Google's new website trends page, our plot seems to
have tanked to ZERO even though the site still gets Google traffic,
it's like we're not measured anymore, or we're on some kind of black
list we can't explain.
Has anyone else experienced this and resolved the cause?
We're not interested in hearing about a whole lot of incremental
tweaks. We're quite well versed in best practice for SEO. What we're
trying to uncover here is a fundamental issue that is stopping pages
getting indexed.
> Now I'm all for having descriptive URLs, but .... this seems to be
> taking it a bit too far and I have a bit of trouble identifying
> anything that matches in the content of your page.
> The problem with URLs like this is that they almost appear to be
> random and in fact I can get exactly the same page by using something
> like: http://www.gadgetguy.com.au/xyzzy-42.html . In general, you
> should make sure that you have only one URL that leads to your content
> -- all others should either redirect to the proper URL or return HTTP
> result code 404 to signal that the URL is invalid. Without that, your
> site is leading us (and all other crawlers) on a wild goose chase.
> If your CMS is not able to handle this properly (one URL per piece of
> content), I would recommend not using rewritten URLs so that we can
> recognize and skip over unimportant parameters in your URL query
> string.
> 2. Broken HTML code
> In general, we try to get it right regardless of what a webmaster uses
> on his page. However, there are limits to what we can guess at.
> Although this is definitely not as important as the first point, you
> can see this happening when you search for something like:
> http://www.google.com/search?q=site:www.gadgetguy.com.au+intitle:shor...
Not being funny... but don't you think someone (such as JM) would have
at least 'hinted' at a serious or concerning issue?
Not saying that there isn't a major problem, and that we've all missed
it so far...
but the algo changes... and sometimes some of those changes hit hard.
Every so often, there seem to be some tweaks to the algo, and all of a
sudden a huge rise in 'issues' walks in here... and there are no major
problems... it's all 'little' things that add up!
So, not wanting to sound harsh... but you've been given some issues
already... maybe make sure those are resolved and see how things go?
The least that does is ensure that it's not those things causing the
issue.
Whilst making those changes, it also gives you the chance to look at
the code and see if anything else is showing up.
.
Your URLs all seem to include an item number at the end?
If so, you could set up a simple script to check whether the URL
matches the preferred format.
If not, send an HTTP response with a 301, and redirect to the shorter
URL.
Of course, it would greatly help if that was done 'after' ensuring all
links point to just one format (which currently they don't?).
Ensure the same format is in the sitemap, so the bots get an idea of
the preferred format.
Also ensure those URLs are 'cleaner'... none of the 'possibly looking
stuffed' urls,
and that they are 'relevant' ... containing something to do with the
page title/h1 etc.
(I think this could be the biggest issue?)
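To give an idea of the sort of script I mean (a sketch only, in Python for readability). I'm assuming the URLs look like /some-words-42.html with the item number last, as in the xyzzy-42.html example earlier in the thread; PREFERRED_SLUGS is a made-up stand-in for whatever lookup your CMS provides:

```python
import re

# Hypothetical lookup: item number -> the one preferred slug for that item.
# In practice this would come from the CMS database.
PREFERRED_SLUGS = {42: "rice-cooker-review"}

# Assumed URL shape: /<any-words>-<item number>.html
URL_PATTERN = re.compile(r"^/(?P<slug>[^/]*)-(?P<id>\d+)\.html$")


def resolve(path):
    """Return (status, location) for a requested path."""
    match = URL_PATTERN.match(path)
    if match is None:
        return 404, None  # not an item URL at all
    item_id = int(match.group("id"))
    slug = PREFERRED_SLUGS.get(item_id)
    if slug is None:
        return 404, None  # no such item
    preferred = f"/{slug}-{item_id}.html"
    if path == preferred:
        return 200, None  # already the preferred URL, serve the page
    return 301, preferred  # any other wording redirects to the one true URL


print(resolve("/xyzzy-42.html"))
print(resolve("/rice-cooker-review-42.html"))
print(resolve("/xyzzy-99.html"))
```

That way the item number still finds the content, but only one spelling of the URL ever returns a 200.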
.
I'd ensure the links are all working too (seems to irritate the GBot
no end!).
So that's all links go somewhere....
then once that's resolved...
make sure that the 'somewhere' has some content (as pointed out above,
some of those pages are basically 'empty').
After that... sort out the server response for non-existent URLs...
a 302 to the homepage is not exactly the 'best' approach!
It should respond with a proper 404 or 410.
Provide either a custom error page... or then set a delayed redirect.
So long as the bots get that 404/410, so they know not to index it.
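For illustration, here is roughly what that looks like as a tiny WSGI app in Python (a sketch, not your actual setup; LIVE_PATHS and REMOVED_PATHS are hypothetical stand-ins for a real lookup against the CMS):

```python
# Hypothetical path sets; a real site would query its CMS or database instead.
LIVE_PATHS = {"/", "/reviews.html"}
REMOVED_PATHS = {"/old-promo.html"}

# Custom error page: friendly for users, but still served with an error status.
ERROR_PAGE = (
    b"<html><body><h1>Page not found</h1>"
    b"<p><a href='/'>Back to the homepage</a></p></body></html>"
)


def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in LIVE_PATHS:
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>content</body></html>"]
    # 410 for pages known to be gone for good, 404 for everything else.
    status = "410 Gone" if path in REMOVED_PATHS else "404 Not Found"
    start_response(status, [("Content-Type", "text/html")])
    return [ERROR_PAGE]
```

The visitor still gets a helpful page with a link home, but the status line tells the bots the URL is invalid. To try it locally you could serve it with wsgiref.simple_server.make_server from the standard library.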
.
Try fixing those...
see how things go...
failing that, ask again (preferably in this topic/thread... or at
least link to it so we know where to look ;))
> 7. Google is happy the sitemap is valid. We also validated the sitemap
> independently before submitting. However, as this is dynamically
> generated there could be new content or very old content in it that is
> appearing for the first time; nonetheless if there was a problem the
> Google Webmaster report would report it - and there are no issues
> showing at the moment.
It may be 'valid' but it's not useful. Every priority but one set to
the default, and every lastmod set to the same time.
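To show what I mean, here is a rough Python sketch of a generator that writes a real per-page lastmod and a varied priority. The page list, dates and priorities below are invented sample values; a real generator would pull them from the CMS:

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (url, last_modified_date, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, modified, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # Use the date the content actually changed, not the generation time.
        ET.SubElement(url, "lastmod").text = modified.isoformat()
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical sample data for illustration only.
pages = [
    ("http://www.gadgetguy.com.au/", date(2008, 1, 15), 1.0),
    ("http://www.gadgetguy.com.au/xyzzy-42.html", date(2007, 11, 2), 0.5),
]
print(build_sitemap(pages))
```

If every lastmod is simply the time the file was generated, it tells the crawler nothing about which pages changed, which is the problem with the current one.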