In order to diagnose problems on the Web Crawl Errors pages, it would be useful to know not just the link that is a problem but the referrer. If the referrer is a page on my site, I need to know which one so I can fix it. If the referrer was from the google cache and I know that cached page is fixed on my site, I won't worry about it, and if the referrer was a third party then I'd like to be able to go to them and let them know they have a bad link to my site.
That would allow me to eliminate my own site as the suspected culprit, but doesn't answer the third point in particular, the possibility that google got the link from an outside referrer who I would like to contact to make sure that the link is fixed.
It isn't a problem at this time, as all of the errors have been located and repaired, but in general it would be useful to know how the google bot got to the page that returned an error.
You cannot find where the link was found in that case. You'd have to do your own on-site tracking and capture all 404s, logging what was requested and from what referrer. But chances are you won't get a meaningful referrer if it's a robot, since they tend to collect links to crawl at a later date rather than visit a site and immediately follow outgoing links to another site (which would give a referrer).
Unless the robot keeps track of where link X was found as referrer in order to later report to you, I don't see anything else you can do.
There are a bazillion pseudo-directories popping up all the time on the web with links to all kinds of sites, harvested from many places. That's where most errors seem to be: obsolete links on sites that don't care about verifying their own outgoing links.
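For what it's worth, the on-site tracking mentioned above could look roughly like the sketch below. It assumes the site runs behind a Python WSGI server, which is not the case for every setup in this thread; `NotFoundLogger` and its `record` callback are made-up names for illustration. It wraps an application and notes the requested URL, the Referer header and the user agent whenever a 404 goes out.

```python
import logging

log = logging.getLogger("notfound")


class NotFoundLogger:
    """WSGI middleware (hypothetical name) that records every 404 response."""

    def __init__(self, app, record=None):
        self.app = app
        # `record` is an assumed callback taking (url, referrer, agent);
        # by default the hit is written through the logging module.
        self.record = record or self._log

    @staticmethod
    def _log(url, referrer, agent):
        log.warning("404 url=%s referrer=%s agent=%s", url, referrer, agent)

    def __call__(self, environ, start_response):
        def tracking_start_response(status, headers, exc_info=None):
            if status.startswith("404"):
                path = environ.get("PATH_INFO", "")
                query = environ.get("QUERY_STRING", "")
                url = path + ("?" + query if query else "")
                # Robots often send no Referer at all, so "-" will be common.
                referrer = environ.get("HTTP_REFERER", "-")
                agent = environ.get("HTTP_USER_AGENT", "-")
                self.record(url, referrer, agent)
            return start_response(status, headers, exc_info)

        return self.app(environ, tracking_start_response)
```

Wrapping is a one-liner, e.g. `application = NotFoundLogger(application)`. As noted above, hits from crawlers will usually show an empty referrer, so this mostly helps with links that real visitors follow.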
Errrr... I believe the 301 takes precedence. The redirection will happen without existence being tested, so the 404 will not be returned. I don't think you can output 2 headers: 404 followed by 301. But then I may be totally off. Sorry.
Maybe with server-side scripting you can capture server responses and decide to issue the 404 or the 301 based on your own criteria. I'd not venture into how that would be coded myself. Not experienced enough with this.
So on the assumption that you have to pick 404 or 301 to return, I'd say which one you use depends on whom you cater to.
If you want robots to accept the page does not exist, you must return a 404. Robots don't care what the page returned says once a 404 is returned. For human visitors, the page that you display while returning a 404 can be a custom page, perhaps made like your home page, but I'd opt for a more distinctive error page, keeping with the general look of the site, all the same menus and all that, but which explicitly states the page was not found, and giving options for further navigation.
If you want to recapture whatever value (e.g. PR) may have existed in the old urls which no longer exist, a 301 redirection will do that. You can channel it all to the homepage in the absence of a new equivalent url.
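A rough sketch of that "pick one" logic, written in Python purely for illustration. The page lists and the name `choose_response` are invented for the example, not taken from anyone's site. A response carries exactly one status, so each request gets either a 301 with a Location header or a 404, never both.

```python
# Hypothetical data: pages that exist now, and old URLs with a known new home.
LIVE_PAGES = {"/", "/index.php", "/contact.php"}
MOVED_PAGES = {
    "/index.shtml": "/index.php",
    "/contact.shtml": "/contact.php",
}


def choose_response(path, live=LIVE_PAGES, moved=MOVED_PAGES):
    """Return (status_line, headers) for a requested path."""
    if path in live:
        return "200 OK", []
    if path in moved:
        # The old URL had value worth keeping: send a permanent redirect.
        return "301 Moved Permanently", [("Location", moved[path])]
    # Anything else is really gone. Return a genuine 404 status; the body
    # sent with it can still be a friendly custom error page.
    return "404 Not Found", []
```

For example, `choose_response("/index.shtml")` gives the 301 to `/index.php`, while an unknown URL falls through to the 404. Redirecting only the URLs listed in the map, rather than everything that is missing, keeps robots able to see that the other pages no longer exist.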
In general, a bad link to your site is only a problem when it passes traffic, when users use it to try to access your site. If you are tracking 404 pages, then you'll know which bad links are out there, which links make a difference.
A bad link to your site which does not pass traffic generally also has little value, you can usually ignore it.
When Google finds such a bad link, it might list it in the webmaster console, but if you know that it is not passing traffic, you can usually ignore it. If you are worried about a bad link that might bring visitors, why not just put a page up at that URL and either give them targeted information ("the page you were looking for is actually here ...") or even ask them what they were looking for and where they came from.
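To see which bad links actually pass traffic, something like the following could be run over the server's access log. It is only a sketch: it assumes the log is in the common "combined" format, the `BOT_HINTS` list is a crude guess at spotting crawlers by user agent, and `count_visitor_404s` is a made-up name.

```python
import re
from collections import Counter

# Combined log format:
# host ident user [time] "METHOD path protocol" status bytes "referrer" "agent"
LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: crawlers identify themselves with one of these substrings.
BOT_HINTS = ("bot", "crawler", "spider")


def count_visitor_404s(lines):
    """Count 404 hits per (path, referrer), skipping obvious crawlers."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match or match.group("status") != "404":
            continue
        agent = match.group("agent").lower()
        if any(hint in agent for hint in BOT_HINTS):
            continue
        counts[(match.group("path"), match.group("referrer"))] += 1
    return counts
```

Calling `count_visitor_404s(open("access.log"))` and looking at `.most_common(20)` would list the broken URLs that visitors hit most, along with the pages that sent them. The file name here is just a placeholder.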
I didn't mean to say use an error page with a 301 or 302. Just to redirect somewhere if the page that's gone had any value to preserve.
I 301 redirect a few pages from the original site I had when I was using shtml whereas now I'm using php (and for a while I even had a Mambo site, grave error LOL), because they kept coming up as 404's from god knows where. I had removed them several times from the Console, to no avail. If you can't beat them you join them - thus the 301.
I have left as 404's several other url's which I cannot redirect since I have no clue what they were: old garbage so-called search engine friendly url's (but totally meaningless anyway) from my site's prior existence as a Mambo site. No big loss and good riddance. At least they are not in any index, not even as supplemental, so they are really in some of those silly pseudo-directories. Can't be bothered with them.
>If you can't beat them you join them - thus the 301.
That's a good idea :-). Doing it for single pages is certainly a valid way to handle it - just don't do it globally for all URLs :-)). I do the same for some URLs which people like to mis-type.
> Hey Ioldanach, we've indeed heard this request before, and so it's definitely on the Webmaster Tools folks' radar!
I have no idea where these links are located. There is one particular link referencing a php program that simply does not exist and as far as I know never existed. There are others that look like other websites have their names tacked onto the end of one of my directories.
I mean, if I have no clue where you got the URL, how can I possibly fix the problem?
How badly are these NOT FOUNDs hurting rankings? I certainly want to fix them. I simply have zero idea where to start. I've searched all my source code and see nothing.
> In order to diagnose problems on the Web Crawl Errors pages, it would be useful to know not just the link that is a problem but the referrer. If the referrer is a page on my site, I need to know which one so I can fix it. If the referrer was from the google cache and I know that cached page is fixed on my site, I won't worry about it, and if the referrer was a third party then I'd like to be able to go to them and let them know they have a bad link to my site.
I have automatically searched my coding with a search and replace utility. The errors reported by Google simply are NOT within my web pages. Thus, this feature is totally misleading and should either be fixed or removed entirely.
They would have to be coming from the Google Search Engine index, as they are not on my web pages as being reported by Google.