This could be perceived as Internal Duplication.
This could also be a possible issue for Link Loss (the PR value of
inbound links is split across different URLs rather than consolidated).
Employ a server-based 301 redirect from the unwanted format to the
preferred format.
Select the preferred format in the GWMT.
Ensure any/all links use the same format.
Ensure that the search engine sitemap uses the preferred format.
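As a rough sketch of the redirect rule above (not your actual server setup), here is the decision in Python. It assumes the www host is the preferred format; PREFERRED_HOST, UNWANTED_HOSTS and canonical_redirect are names made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the www host is the preferred format.
PREFERRED_HOST = "www.gadgetguy.com.au"
UNWANTED_HOSTS = {"gadgetguy.com.au"}


def canonical_redirect(url):
    """Return the URL to 301 to, or None if the host is already preferred."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in UNWANTED_HOSTS:
        # Keep the path and query so the visitor lands on the same page.
        return urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path or "/", parts.query, parts.fragment)
        )
    return None
```

The server would answer with status 301 and a Location header set to the returned URL whenever the result is not None.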
.
::: Internal Duplication / Multiple URL issues :::
I seem to be able to reach certain content through different URLs....
This could be perceived as Internal Duplication.
This could also be a possible issue for Link Loss (the PR value of
inbound links is split across different URLs rather than consolidated).
Try to ensure only 1 URL is used with any/all links.
If absolutely necessary (such as after fixing this issue and to cover
older items), employ a server-based 301 permanent redirect to the
preferred item URL.
Include the preferred URL in the search engine sitemap (and exclude
the unwanted one).
.
::: Strange URLs :::
Some URLs contain unusual characters...
These URLs also seem to have an assortment of words in them that don't
seem very 'focused'... instead they look 'stuffed' with possible
keywords/terms.
.
::: Invalid Code :::
You seem to be mixing HTML and XHTML formatting on some markup... a
prime example is in the Head section...
<meta name="robots" content="all" />
<meta name="description" content="GadgetGuy.com.au - Reviews, news,
comparisons and buyers guides on the latest gadgets, computers, home
theatre, phones, games and cool stuff!" />
Your DocType says HTML ... so those items should not be ending with a
/>
only with a
>
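If you want a quick way to spot these, something like the following could list the tags written XHTML-style. It is only a regex sketch, not a validator; the tag list and the function name find_xhtml_style_tags are my own choices.

```python
import re

# Void elements that are often written XHTML-style with a trailing "/>".
SELF_CLOSED = re.compile(r"<(meta|link|img|br|hr|input)\b[^>]*/>", re.IGNORECASE)


def find_xhtml_style_tags(html):
    """Return the names of void elements written with an XHTML-style '/>'."""
    return [m.group(1).lower() for m in SELF_CLOSED.finditer(html)]
```

A proper validator run against the page is still the better check; this only gives a quick list of candidates.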
.
::: You MAY have hidden content :::
Looking through the source code, I see this...
Thanks for the reply Autocrat, I'll speak to our developer to go over
the points you've raised. Sorry about the re-posting, I didn't know
about the 'bump' option.
And here's the feedback I received on your comments:
1. No external links, internal links, or sitemap links use http://gadgetguy.com.au so this would not be an issue.
2. Internal Duplication is a very recent issue; however again the
Sitemap and all the Index and Section home pages use the latest links
which are being crawled, so this should not be a significant issue.
Duplication does not explain why neither page would be indexed, or why
deeper indexing other than Homepage and Section Home is not occurring.
3. Commas in URLs we will investigate.
4. Long/keyword stuffed URLs. Have edited to fix issues of length and
keyword stuffing.
5. Invalid Code - mmm... I very much doubt this is an issue. The
DocType is used by the browser to work out how to interpret the HTML
at render time. Google has no interest in this, so I see no reason how
this could cause a problem.
6. We don't have hidden content. The DIV layer being hidden is an
empty formatting control.
7. Google is happy the sitemap is valid. We also validated the sitemap
independently before submitting. However, as this is dynamically
generated there could be new content or very old content in it that is
appearing for the first time; nonetheless if there was a problem the
Google Webmaster report would report it - and there are no issues
showing at the moment.
How about Hidden links/text?
Viewing the page without images shows the top nav bar becomes
'invisible' ....
Viewing the page without images shows a lot of text that is hard to
read.
(hell, even if it isn't an SE issue, should be fixed for certain user-
types.)
.
How about applying rel="nofollow" to the links in noscript tags?
Review...
http://gadgetguy.com.au/product.asp?id=14&m_review=0
Erm... not really seeing anything important on here...
nothing specific to Reviewing the Russell Hobbs rice cooker anyway
(inc. irrelevant title, h1 etc.)
Weak Image Alt attributes...
Why does it just say 'Product' ?
Not exactly useful to blind people (including bots ;))...
so missing an opportunity to not only be helpful, but possibly get an
extra point or two with the SE's.
.
Any external duplicate content...
Is the content 'yours' or is it from other sources?
Is it being 'fed' to other sites?
Is it being scraped/copied by other sites?
.
.
.
Hope all that helps.
(May be worth getting the developer in as well, saves you running back
and forth ;))
Now I'm all for having descriptive URLs, but .... this seems to be
taking it a bit too far and I have a bit of trouble identifying
anything that matches in the content of your page.
The problem with URLs like this is that they almost appear to be
random and in fact I can get exactly the same page by using something
like: http://www.gadgetguy.com.au/xyzzy-42.html . In general, you
should make sure that you have only one URL that leads to your content
-- all others should either redirect to the proper URL or return HTTP
result code 404 to signal that the URL is invalid. Without that, your
site is leading us (and all other crawlers) on a wild goose chase.
If your CMS is not able to handle this properly (one URL per piece of
content), I would recommend not using rewritten URLs so that we can
recognize and skip over unimportant parameters in your URL query
string.
2. Broken HTML code
In general, we try to get it right regardless of what a webmaster uses
on his page. However, there are limits to what we can guess at.
Although this is definitely not as important as the first point, you
can see this happening when you search for something like:
http://www.google.com/search?q=site:www.gadgetguy.com.au+intitle:shor...
Thanks for the welcome, and for the feedback - I'm sure to be a
regular here to read up on the finer points of SE stuff.
OK, I have almost finished paring back the URLs of all the site
sections, the pages that have actually been indexed. Of the pages that
aren't being indexed, the URLs are only the article title or the
product name.
I'm told that via Webmaster Tools it appears that Google has indexed
about 70% of our sitemap. Can you tell me what the relationship is, or
lag is, between a sitemap being indexed and the pages making it into
the search index.
I have also passed along the URL of this thread to our developer, so
either he will chime in or I'll continue to be the point man to
hopefully getting us more SE-friendly.
Don't worry too much about what appears in the SiteMap indexation
information...
it tends to be a little 'off target' (putting it nicely ;)).
.
For a more accurate idea, you may want to use the site: operator.
Enter this in the Google Search...
site:http://gadgetguy.com.au
Then browse to the very last page of results.
The figure in the top right is a pretty good idea of the number of
pages indexed.
You may also want to try some variants...
site:gadgetguy.com.au
site:gadgetguy.com.au/*
Also, if you see the paragraph about 'omitted results', do try that...
and again go to the end.
.
Please be aware of Google DataCenters...
Google has info on numerous computers/networks.
Some of these are a little less up-to-date than others... and you may
end up connecting to one of those, and get different results.
The DC contacted can be random... can be influenced by ISP, Location,
whether using local Google or .com Google, possible Browser Googles
(like in MFF) etc.
Always perform the same search several times...
And don't panic if you see changes in the results.
I work at the company which developed the GadgetGuy web site.
We would really appreciate advice on the real issue we've been
battling with here.
There are over 2,000 links to this website around the Internet, many
from respectable sites with good pagerank.
We have two sitemap feeds to Google. Both report about 75% of the
pages as Indexed, however we are only seeing pages from one sitemap
showing on Google. This is despite the fact the Googlebot is also
constantly crawling the site.
This has been the case for some time and some of the issues now of
multiple URLs pointing to one page have come about due to attempts to
fix this issue - such as making URLs shorter (which is controlled by
content authors using the CMS by the way).
It seems many of our pages are on the Google supplementary index.
Interestingly, on Google's new website trends page, our plot seems to
have tanked to ZERO even though the site still gets Google traffic,
it's like we're not measured anymore, or we're on some kind of black
list we can't explain.
Has anyone else experienced this and resolved the cause?
We're not interested in hearing about a whole lot of incremental
tweaks. We're quite well versed in best practice for SEO. What we're
trying to uncover here is a fundamental issue that is stopping pages
getting indexed.
Not being funny... but don't you think someone (such as JM) would have
at least 'hinted' at a serious or concerning issue?
Not saying that there isn't a major problem, and that we've all missed
it so far...
but the algo changes... and sometimes some of those changes hit hard.
Every so often, there seem to be some tweaks to the algo, and all of a
sudden a huge rise in 'issues' walks in here... and there are no major
problems... it's all 'little' things that add up!
So, not wanting to sound harsh... but you've been given some issues
already... maybe make sure those are resolved and see how things go?
The least that does is ensure that it's not those things causing the
issue.
Whilst making those changes, it also gives you the chance to look at
the code and see if anything else is showing up.
.
Your URLs all seem to include an item number at the end?
If so, you could set up a simple script to check the URL, to see if it
matches the preferred format.
If not, send an HTTP response with a 301, and redirect to the shorter
URL.
Of course, it would greatly help if that was done 'after' ensuring all
links point to just one format (which currently they don't?).
Ensure the same format is in the sitemap, so the bots get an idea of
the preferred format.
Also ensure those URLs are 'cleaner'... none of the 'possibly looking
stuffed' urls,
and that they are 'relevant' ... containing something to do with the
page title/h1 etc.
(I think this could be the biggest issue?)
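A minimal sketch of that kind of check, assuming the item number at the end of the URL identifies the page. The PREFERRED_SLUGS table and the check_url name are hypothetical; a real version would look the slug up in your CMS.

```python
import re

# Hypothetical catalogue: item number -> preferred slug.
PREFERRED_SLUGS = {5: "photo-and-video", 7: "home-theatre"}

# Anything, then "-<item number>.html" at the end of the path.
ITEM_URL = re.compile(r"^/(?P<slug>.*)-(?P<item>\d+)\.html$")


def check_url(path):
    """Return (status, location) for a requested path."""
    match = ITEM_URL.match(path)
    if not match:
        return 404, None
    item = int(match.group("item"))
    slug = PREFERRED_SLUGS.get(item)
    if slug is None:
        # Unknown item number: say so instead of serving some page.
        return 404, None
    if match.group("slug") == slug:
        return 200, None
    # Right item, wrong wording: 301 to the one preferred URL.
    return 301, f"/{slug}-{item}.html"
```

With this, /xyzzy-5.html and the long 'stuffed' version both end up at /photo-and-video-5.html, and only that URL answers with a 200.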
.
I'd ensure the links are all working too (seems to irritate the GBot
no end!).
So that's all links go somewhere....
then once that's resolved...
make sure that the 'somewhere' has some content (as pointed out above,
some of those pages are basically 'empty').
After that... sort out the server response for non-existent URLs...
a 302 to the homepage is not exactly the 'best' approach!
It should respond with a proper 404 or 410.
Provide either a custom error page... or then set a delayed redirect.
So long as the bots get that 404/410, so they know not to index it.
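To illustrate the 404 point, here is a bare-bones WSGI sketch (not your actual server code). KNOWN_PATHS is a hypothetical stand-in for whatever the CMS knows about, and the 10 second meta refresh is just one way to do the 'delayed redirect'.

```python
# Hypothetical set of paths that exist on the site.
KNOWN_PATHS = {"/", "/photo-and-video-5.html"}

# Custom error page; the meta refresh sends people home after 10 seconds.
ERROR_PAGE = (
    b"<html><head><meta http-equiv=\"refresh\" content=\"10;url=/\"></head>"
    b"<body><h1>Page not found</h1><p><a href=\"/\">Home</a></p></body></html>"
)


def app(environ, start_response):
    """Serve known paths; answer everything else with a real 404."""
    path = environ.get("PATH_INFO", "/")
    if path in KNOWN_PATHS:
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>content</body></html>"]
    # The status line is what the bots read, whatever the page body says.
    start_response("404 Not Found", [("Content-Type", "text/html")])
    return [ERROR_PAGE]
```

Visitors still get a friendly page and end up on the homepage, but the bots see the 404 first.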
.
Try fixing those...
see how things go...
failing that, ask again (preferably in this topic/thread... or at
least link to it so we know where to look ;))
> 7. Google is happy the sitemap is valid. We also validated the sitemap
> independently before submitting. However, as this is dynamically
> generated there could be new content or very old content in it that is
> appearing for the first time; nonetheless if there was a problem the
> Google Webmaster report would report it - and there are no issues
> showing at the moment.
It may be 'valid' but it's not useful. Every priority but one set to
the default, and every lastmod set to the same time.
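For what it's worth, a sketch of generating entries where each page carries its own lastmod and priority. The pages, dates and priorities below are invented for illustration; the real values would come from the CMS.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (loc, lastmod, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: each has its own modification date and priority.
pages = [
    ("http://www.gadgetguy.com.au/", "2008-03-01", 1.0),
    ("http://www.gadgetguy.com.au/photo-and-video-5.html", "2008-02-20", 0.8),
]
print(build_sitemap(pages))
```

The point is only that lastmod should reflect when each page really changed, not when the sitemap was generated.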
It seems that someone did a complete redesign around mid January. You
can see some of the old design at
http://web.archive.org/web/20070823193356/http://www.gadgetguy.com.au/ .
In general, the best practice would be to have all the old URLs 301
redirect to the appropriate new ones. However, in this case, there
were a few things done in a suboptimal way:
1. Old URLs are 302 redirected to the homepage
2. For a period of about 3 weeks, it looks like you had robots meta
tags with a value of "none" across the site.
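To guard against point 2 happening again, a small check along these lines could be run against page source before a release. It is a sketch using Python's standard html.parser; the class and function names are my own. A robots value of "none" is equivalent to "noindex, nofollow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def blocks_indexing(html):
    """True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & {"none", "noindex"})
```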
It contains a link like this:
<a href="photo-and-video-photography digital camera camcorder
videocamera handycam canon sony panasonic nikon channel 7 sunrise
australia-5.html">Photo and Video</a>
This ties in with my previous comment on long, difficult to understand
URLs. There are a lot of URLs that could end up showing that
content... this means we might spend a lot of time crawling through
URLs that are really just duplicates. URL length is not an issue
(apart from making it close to impossible for users to link to your
pages without copy&pasting the URL).
At this time, I would work on designing a very simple URL structure
that allows you to use relevant keywords in your URL so that the user
can understand what might be shown on the page. Also, you would want
to make sure that all canonical versions of your existing URLs
(including the old style ones) are 301 redirected to the new & simple
URL structure.
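One possible way to produce such URLs, sketched in Python: build the slug from the page title and the item number. The make_slug name and the five-word cap are assumptions for illustration, chosen to keep the URL focused rather than because length itself matters.

```python
import re


def make_slug(title, item_id, max_words=5):
    """Build a short, readable file name from a page title and item number."""
    # Keep only letters and digits, so commas and spaces never reach the URL.
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words[:max_words] + [str(item_id)]) + ".html"
```

For example, make_slug("Photo and Video", 5) gives "photo-and-video-5.html".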
1. We've removed any remaining 302s and changed them to 301s.
2. The robots none tag has us stumped. Can you see when this was?
3. We're already well on top of the URL issue too and have redirected
(301) the canonical domains to be sure.
1. The old URLs should redirect to the appropriate new URLs for
maximum effect. By redirecting it to the root URL, we tend to lose the
context provided by the old URL.
2. The robots "none" appears to have started when the new structure
was put up (around mid January). However, since that has long been
resolved, there's not much that can be done at this point to change it
-- except making sure that things work better from now on :)
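On point 1, a rough sketch of mapping each old URL to its own new URL instead of the root. It assumes the old URLs look like product.asp?id=N; the NEW_URL_BY_OLD_ID table and the target path in it are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from old product ids to new URLs.
NEW_URL_BY_OLD_ID = {"14": "/russell-hobbs-rice-cooker-14.html"}


def redirect_for_old_url(url):
    """Return (301, new_path) for a mapped old URL, else (404, None)."""
    parts = urlsplit(url)
    if parts.path == "/product.asp":
        ids = parse_qs(parts.query).get("id", [])
        if ids and ids[0] in NEW_URL_BY_OLD_ID:
            return 301, NEW_URL_BY_OLD_ID[ids[0]]
    # No matching new page: better a 404 than a redirect to the homepage.
    return 404, None
```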