Aborted URLs are actually code snippets.

Gav...

Dec 12, 2006, 4:48:52 AM
to SOFTplus GSiteCrawler
Hi guys, excellent program! :)

The Aborted URLs statistics list a few 404 Not Found errors. Some of
these are correct: the pages are not there, and I will remove the links
to them. Others come from a code sample on one of my tutorial pages:

<pre><code>
&lt;div id='navigation'&gt;
&lt;ul class='level1'&gt;
&lt;li&gt;&lt;a href='home.html'&gt;Home&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href='about.html'&gt;About
us&lt;/a&gt;&lt;/li&gt;
&lt;li class='submenu' &gt;Services
&lt;ul class='level2' id='sub1'&gt;
&lt;li&gt;&lt;a href='design.html'&gt;Web site
design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href='hosting.html'&gt;Web site
hosting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href='search.html'&gt;Search engine
submission&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href='contact.html'&gt;Contact
us&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</code></pre>

So, the above is part of a tutorial; the URLs do not go anywhere, and
they are not active links in any way. I guess your program crawls for
URLs and ignores the fact that they may be code and not actual links.

Two questions: can you sort it? :) And does the actual Google crawler
act in this way? (I would hope not.)

Thanks

Gav...

webado

Dec 14, 2006, 12:28:36 AM
to SOFTplus GSiteCrawler
Your site does have some broken links as found by Xenu.

http://www.minitutorials.com/firewalls/%5C%5Cwww.zonelog.com
error code: 404 (not found), linked from page(s):
http://www.minitutorials.com/firewalls/utb_part3_4.shtml

http://www.minitutorials.com/forums/favicon.ico
error code: 404 (not found), linked from page(s):
http://www.minitutorials.com/forums/index.php

http://www.minitutorials.com/issue_7/tutorials/PHPTut2.zip
error code: 404 (not found), linked from page(s):
http://www.minitutorials.com/webdesign/php/introtophp_2.php

http://www.minitutorials.com/issue_7/tutorials/simple_2.php
error code: 404 (not found), linked from page(s):
http://www.minitutorials.com/webdesign/php/introtophp_2.php

http://www.minitutorials.com/rss/feeds/itgfeed.xml
error code: 404 (not found), linked from page(s):
http://www.minitutorials.com/rss_xml/rss_display.php

5 broken link(s) reported

Also a whole slew of broken named anchors.

You are also mixing www and non-www links on your site. Or perhaps some
are stated absolutely as http://minitutorials.com...... whereas the
others are relative, so they end up being www.minitutorials.com/.......
when starting from www.minitutorials.com/ .
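To illustrate the point about relative versus absolute links, here is a minimal sketch using Python's standard `urljoin`. The base page is one of the URLs from the Xenu report above; the absolute non-www link is a made-up example, not a link confirmed to be on the site.

```python
from urllib.parse import urljoin

# Page the links are found on (taken from the Xenu report above).
base = "http://www.minitutorials.com/forums/index.php"

# A relative link inherits the host of the page it sits on.
relative = urljoin(base, "favicon.ico")

# An absolute link keeps its own host, www or not.
# This particular non-www URL is a hypothetical example.
absolute = urljoin(base, "http://minitutorials.com/rss/feeds/itgfeed.xml")

print(relative)  # http://www.minitutorials.com/forums/favicon.ico
print(absolute)  # http://minitutorials.com/rss/feeds/itgfeed.xml
```

So a site that mixes both styles ends up being crawled under two host names.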

We can assume that whatever Xenu sees, Google sees the same way.

Gav...

Dec 14, 2006, 5:07:38 AM
to SOFTplus GSiteCrawler
Thanks very much for those links. The Google Sitemap reports have so far
spotted two of those.
I will correct these shortly.

However, I think I need to rephrase my original question, as you never
answered what I was asking. My fault.

GSiteCrawler is reporting broken links in the 'Aborted URLs' list.

These links are:

Failed at 12/12/2006 18:25:
URL:
http://www.minitutorials.com/webdesign/css/home.html
Error: HTTP-Error 404 Not Found
Linked from:
http://www.minitutorials.com/webdesign/css/create_css_menu_system2.php

Failed at 12/12/2006 18:25:
URL:
http://www.minitutorials.com/webdesign/css/about.html
Error: HTTP-Error 404 Not Found
Linked from:
http://www.minitutorials.com/webdesign/css/create_css_menu_system2.php

Failed at 12/12/2006 18:25:
URL:
http://www.minitutorials.com/webdesign/css/design.html
Error: HTTP-Error 404 Not Found
Linked from:
http://www.minitutorials.com/webdesign/css/create_css_menu_system2.php

Failed at 12/12/2006 18:25:
URL:
http://www.minitutorials.com/webdesign/css/hosting.html
Error: HTTP-Error 404 Not Found
Linked from:
http://www.minitutorials.com/webdesign/css/create_css_menu_system2.php

Failed at 12/12/2006 18:25:
URL:
http://www.minitutorials.com/webdesign/css/contact.html
Error: HTTP-Error 404 Not Found
Linked from:
http://www.minitutorials.com/webdesign/css/create_css_menu_system2.php

Failed at 12/12/2006 18:25:
URL:
http://www.minitutorials.com/webdesign/css/search.html
Error: HTTP-Error 404 Not Found
Linked from:
http://www.minitutorials.com/webdesign/css/create_css_menu_system2.php

These are all within the content of the 'create_css_menu_system2.php'
page. They are not actually links; they are code placed within <code>
tags, and as such the angle brackets are written as &lt; and &gt;
around a quoted URL, as in a href='home.html'.

These are not real links. They don't go anywhere and, yes, the pages do
not actually exist. They're not meant to; they are code samples.

So IMO these should not be treated as live links, but as code data, and
as such ignored by GSiteCrawler.

It seems Xenu and Google are interpreting these correctly, as neither
is reporting these as broken links.

Thanks

Gav...

webado

Dec 14, 2006, 9:47:36 AM
to SOFTplus GSiteCrawler
Sorry, I did not answer that because I don't know how to answer it. I
have not had a similar problem.

When I include code on a page I also use <pre> ....</pre> but no
<code> ... </code> tags.


Perhaps if you use &quot; instead of the apostrophe ' it might "fool"
the GSiteCrawler into ignoring the contents? I realize that's not the
answer you expect though.

I'd wait until John sees this however, as he'd know best what's
happening and why and might be able to fix it.

softplus

Dec 14, 2006, 6:43:24 PM
to SOFTplus GSiteCrawler
That's a problem with the crawler as it is now -- it tries to grab
links wherever it sees them (it looks for the "href=" part and tries
to find the matching URL). I use this over a strict HTML parser because
so many sites are not really parsable ... this system mimics the way
Google does it when the pages are not cleanly coded (as far as I can
tell from outside :-)).

One thing you might be able to do is obfuscate a part of that text.
Instead of
href='url'
you could try
hre&#102;='url'
(I hope I got it right :-))
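
For readers following along, here is a rough sketch of why that workaround helps. It assumes the crawler does something like a plain text scan for `href=`; GSiteCrawler's real code is not shown in this thread, and `naive_href_scan` is a hypothetical name used only for illustration.

```python
import re

# Hypothetical stand-in for a crawler that scans raw page source for
# "href=" instead of parsing the HTML. Not GSiteCrawler's actual code.
HREF_PATTERN = re.compile(r"""href\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def naive_href_scan(page_source: str) -> list:
    """Return every quoted value that follows a literal 'href='."""
    return [match.group(2) for match in HREF_PATTERN.finditer(page_source)]


# Entity-escaped tutorial code, as it sits in the page source.
escaped_sample = "&lt;li&gt;&lt;a href='home.html'&gt;Home&lt;/a&gt;&lt;/li&gt;"

# Same line with the "f" written as the character reference &#102;.
obfuscated_sample = "&lt;li&gt;&lt;a hre&#102;='home.html'&gt;Home&lt;/a&gt;&lt;/li&gt;"

print(naive_href_scan(escaped_sample))     # ['home.html']
print(naive_href_scan(obfuscated_sample))  # []
```

A browser still renders `hre&#102;` as `href`, so the tutorial text looks the same to visitors, while a raw text scan no longer finds the literal string.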

John

Gav...

Dec 17, 2006, 6:35:22 AM
to SOFTplus GSiteCrawler
Hi John,

Yes, I did the obfuscation and it works fine.
Then I did the same for code snippets of "img src..."
links using &#99; etc., but this did not work for these.

Google, Xenu and other link checkers do not have
this problem, so I am basically using this
workaround just to please GSiteCrawler, which is not
the ideal solution. I don't think I am prepared to do this,
so I think I will just ignore this part of GSiteCrawler.

Perhaps you could enhance this by using the DOM
and checking the parent container against the namespace
or similar. That way you'll be able to ignore
anything inside <code> tags etc.
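
As a rough sketch of that idea (not a DOM in the strict sense, and not how GSiteCrawler works today), Python's standard `html.parser` can collect links only from real `<a>` tags while skipping anything nested in `<code>` or `<pre>`. The class name `LinkCollector` and the sample page, including `index.php` and `inside-code.html`, are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from real <a> tags outside <code>/<pre>."""

    SKIP_TAGS = {"code", "pre"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many <code>/<pre> containers are open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1


# Hypothetical page: one live link, one escaped code sample,
# and one real <a> tag placed inside <code>.
page = (
    "<p><a href='index.php'>Tutorials</a></p>"
    "<pre><code>&lt;a href='home.html'&gt;Home&lt;/a&gt;"
    "<a href='inside-code.html'>x</a></code></pre>"
)

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)  # ['index.php']
```

The escaped sample never reaches `handle_starttag` because the parser treats `&lt;a ...&gt;` as text, and the parent check covers the case where a real tag sits inside a code container. The trade-off John mentions still applies: a parser-based approach can behave unpredictably on badly broken markup.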

Gav...
