Where am I going wrong


Andy Seabrook

Oct 21, 2011, 10:23:48 AM
to gsitec...@googlegroups.com
Crawling the site on my local development machine, I get hundreds of 404 errors.

As an example:
   URL:                http://localhost/brightermobility/0.0.12/includes/bootscooters.php
   Error:              HTTP-Error 404 Not Found
   Linked from:        http://localhost/brightermobility/0.0.12/includes/mastandnav.inc.htm

The element is <a href="bootscooters.php">Boot/travel scooters</a>

Does the software have some sort of cache that needs to be cleared?
The same site, uploaded onto the host domain yesterday, was tested and the crawl uncovered some errors. These were fixed and the offending files re-uploaded, after which GSiteCrawler was still reporting the same errors. The site was functioning correctly with the new changes. All links validate on the W3C link checker.

webado

Oct 22, 2011, 12:15:01 AM
to SOFTplus GSiteCrawler
Well you can see that what is reported as missing are scripts to be
included, which should never be found in any navigation. Something is
wrong, perhaps your base href is off.

In any case you should first clear everything before starting to
recrawl. Delete all URLs from the URL list, flush the crawler queue
too and start a fresh crawl.



After such a crawl you need to delete all dodgy


Andy Seabrook

Oct 22, 2011, 5:49:11 AM
to gsitec...@googlegroups.com
Yes I can see what is reported - and it is false!

Perhaps it wasn't clear from my post: the PHP scripts are where they are supposed to be, in the locations reported as not found. If I follow the link it navigates as expected.

What do you mean the href is off? I guess you mean I have defined it erroneously. The target is in the same directory as the calling script, hence in the bootscooters example above there is no path in the href, just the filename. As I said, the site is functioning as it should, both locally and published, and link checkers on both domains report all links as valid. Physically following the link takes you to the appropriate script.

I had deleted all URLs from the list several times. I had also pressed "clear total queue" several times, if that's how you flush the crawler queue (if it isn't, please advise me).

On restarting the application today I get the message:
Warning the size of your database is over 900MB - please compact it. With a database this large, the crawlers will be disabled. 

I compacted the database, cleared the URLs again, cleared the total queue again, and repeated the compact, all to no avail.
I wonder if the database size points to some issue. This was a new install, and the only two projects in the tree are the localhost and published versions of the one site, which is not large: 30 pages, nothing constructed dynamically. Why should the database be so large? I ran it no more than 6 or so times in each domain.

If it helps, all the links reported as failing are constructed in JavaScript menus, though it seems this should not be an issue as the URLs listed all show correctly.


Andy Seabrook

Oct 22, 2011, 6:12:29 AM
to gsitec...@googlegroups.com
OK, I have woken up. I was not reading the URLs correctly, just assuming they were what I expected them to be, but they are in fact not being interpreted correctly.

I can also guess where things are going wrong in the application's code.

For your info
This is an extract from MastAndNav.inc.htm

<!-- start of navbar-->
<ul id="nav">
    <li><a href="contact.php" class="nav_top" >CONTACT US</a></li>
    <li><a href="products.php" class="nav_top" data-flexmenu="flexmenu1">PRODUCTS</a>
            <ul id="flexmenu1" class="flexdropdownmenu">
            <li><a href="bootscooters.php">Boot/travel scooters</a>
                <ul>

                <li><a href="item.php?item=Pride Go-Go Elite Traveller 4">Pride Go-Go Elite Traveller 4</a></li>
                <li><a href="item.php?item=Pride Libre LX">Pride Libre LX</a></li>
                <li><a href="item.php?item=TGA Buddy">TGA Buddy</a></li>
                <li><a href="item.php?item=TGA Eclipse">TGA Eclipse</a></li>
                </ul>
            </li>
            <li ><a href="msscooters.php">Medium Size Scooters</a>
                <ul>

                <li><a href="item.php?item=TGA Sonnet">TGA Sonnet</a></li>
                <li><a href="item.php?item=Pride Colt Plus">Pride Colt Plus</a></li>
                <li><a href="item.php?item=Pride Colt Deluxe">Pride Colt Deluxe</a></li>
                </ul>
            </li>
            <li><a href="lrgscooters.php">Large(8mph) Road Going Scooters</a>
                <ul>

                <li><a href="item.php?item=Rascal 329 LE">Rascal 329 LE</a></li>
                <li><a href="item.php?item=TGA Mystere">TGA Mystere</a></li>
                <li><a href="item.php?item=TGA Vita">TGA Vita</a></li>
                </ul>
            </li>

            <li><a href="powerchairs.php">Power Chairs</a>
                <ul>

                <li><a href="item.php?item=Pride Jazzy Select 6">Pride Jazzy Select 6</a></li>
                <li><a href="item.php?item=Pride LX">Pride LX</a></li>
                </ul>
            </li>
            </ul>
    </li>
    <li><a href="index.php" class="nav_top">HOME</a></li>
</ul>

The script is included as a navigation element on every page of the site, the main index.php for instance. All of the main pages are in the site root directory, while MastAndNav is in the subdirectory "includes". If this bit of HTML lived in the root, your resolution of the hrefs would be right. As it stands, the hrefs are written for PHP: the script that includes this file is in the root, so once the include is merged into the page the links resolve relative to the primary script, not to the location of the included file.
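
A minimal Python sketch of the mismatch described here, assuming the crawler resolves each relative href against the URL of the document it fetched (standard URL behaviour, not confirmed GSiteCrawler internals). The index.php location is an assumption based on the description of the site root; the include URL is the one from the crawl report at the top of the thread.

```python
from urllib.parse import urljoin

# Assumed location of a page in the site root (inferred from the thread).
SITE_ROOT_PAGE = "http://localhost/brightermobility/0.0.12/index.php"
# URL of the include file, as shown under "Linked from" in the crawl report.
INCLUDE_FILE = "http://localhost/brightermobility/0.0.12/includes/mastandnav.inc.htm"

HREF = "bootscooters.php"  # href exactly as written in the navbar

# Resolved from a page that pulled the navbar in via a PHP include.
via_page = urljoin(SITE_ROOT_PAGE, HREF)
# Resolved from the include file when it is fetched directly.
via_include = urljoin(INCLUDE_FILE, HREF)

print(via_page)
print(via_include)
```

The second result is the URL from the 404 report, which suggests the crawler reached the include file directly rather than only through the pages that include it.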

Hope this helps,

Andy

Christina S

Oct 22, 2011, 9:20:07 AM
to gsitec...@googlegroups.com
JavaScript is not read by robots, so the crawl only uses the links found in the actual HTML code.
 
Are you starting the crawl at the homepage of this site on localhost?
Are you using relative addressing? If your site's homepage is NOT http://localhost/ but rather something in a lower folder, do you have links to the homepage expressed as "/"? That would resolve them as http://localhost/ rather than your lower folder.
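
A short Python sketch of the "/" case, assuming the homepage sits in a subfolder of localhost (the exact homepage URL is an assumption based on the crawl report). It shows how a root-relative link escapes the subfolder while a document-relative one stays inside it.

```python
from urllib.parse import urljoin

# Assumed homepage of the local copy of the site.
HOMEPAGE = "http://localhost/brightermobility/0.0.12/index.php"

# A link written as "/" points at the server root, not the site folder.
root_relative = urljoin(HOMEPAGE, "/")
# The same applies to any path with a leading slash.
root_relative_file = urljoin(HOMEPAGE, "/contact.php")
# A link without a leading slash stays in the homepage's own folder.
doc_relative = urljoin(HOMEPAGE, "contact.php")

print(root_relative)
print(root_relative_file)
print(doc_relative)
```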
 
Are you perhaps treating a robot differently in your local .htaccess file? Is parsing .html as .php in effect (if applicable)?
 
Have you used Xenu as well to crawl the local site? Has it worked any better? Doubtful, but I have to ask.
 
 
--
You received this message because you are subscribed to the Google Groups "SOFTplus GSiteCrawler" group.
To view this discussion on the web visit https://groups.google.com/d/msg/gsitecrawler/-/69lMv6Df1z4J.
To post to this group, send email to gsitec...@googlegroups.com.
To unsubscribe from this group, send email to gsitecrawler...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/gsitecrawler?hl=en.

Andy Seabrook

Oct 24, 2011, 6:48:01 AM
to gsitec...@googlegroups.com
Hi,

I realise that JavaScript is not read by robots. The menu's functionality is scripted in JavaScript, but the links themselves are plain HTML, included in the page via a PHP include. This is the same on the published site, which has no problem. In retrospect, of course, that discounts my guess as to why it might have been happening.
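
To illustrate the point that the links are ordinary HTML, here is a small Python sketch that collects hrefs from a shortened copy of the navbar without running any JavaScript. `LinkCollector` is a hypothetical helper built on the standard html.parser module, not the way GSiteCrawler or Xenu is known to work.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Shortened copy of the navbar extract quoted earlier in the thread.
NAV_SNIPPET = """
<ul id="nav">
    <li><a href="contact.php" class="nav_top">CONTACT US</a></li>
    <li><a href="products.php" class="nav_top" data-flexmenu="flexmenu1">PRODUCTS</a>
        <ul id="flexmenu1" class="flexdropdownmenu">
            <li><a href="bootscooters.php">Boot/travel scooters</a></li>
        </ul>
    </li>
    <li><a href="index.php" class="nav_top">HOME</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(NAV_SNIPPET)
links = collector.hrefs
print(links)
```

The data-flexmenu attribute only drives the dropdown behaviour; the hrefs are visible to any parser that reads the markup.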

Yes crawl starts at homepage.
Relative addressing yes.


>If your site's homepage is NOT http://localhost/ but rather something in a lower folder, do you have links to the homepage expressed as "/"?

Are you asking whether the address begins with a forward slash? If so, as far as I am aware there are none on the site expressed in this fashion, certainly not those in the menu, as you can see in the snippet of the script above.


>Are you perhaps treating a robot differently in your  local .htaccess file?
It is not a mirrored environment (localhost is on windows)

As regards Xenu, I am not familiar with it, but I installed it anyway in case it might help. It does indeed report the same issues.

I can't follow the links interactively to a dead end as reported by either your application or Xenu.
Two examples that Xenu is throwing up are: Title - "HOME" and "CONTACT US"
The only place on the site where such items exist is the menu structure shown in the previously quoted script snippet,
where the href is as simple as it could be: "index.php" and "contact.php" respectively. Both files are in the same directory as any page that calls them and, as mentioned previously, the menu elements are "included" from the includes subdirectory.

One thing I am now suspicious of, though I am not clear on the workings or the differences, is the way addressing works in the more recent version of PHP. It is perhaps a candidate culprit, again due to differences between the environments.

On the published site: PHP 5.2.14
On localhost: PHP 5.3.6

Hope this helps.

Andy

Christina S

Oct 24, 2011, 8:44:21 AM
to gsitec...@googlegroups.com
Hi,
 
When a php or SSI include is stated as:
<?php include("/some-script.inc"); ?>
 
the script to be included is looked for at the root of the filesystem, which is well above the actual root directory of the site.
 
On an Apache webserver (with cPanel) the absolute path to the site root (public_html) is, for instance, /home/user/public_html/ . So including something with a / in front would look for the file in the top-level folder / , not in public_html.
 
On a localhost installation root may be (depending what you installed and where):  /xampp/htdocs/ so the first / is referencing something in the folder above /xampp/  .
 
So in fact never use include with a path starting with /. To mean the root of the site, you'd need to use on localhost:
<?php include("/xampp/htdocs/some-script.inc"); ?>
 
On a real webserver it would be different again, so your code needs to change. Messy.
 
 
Also most Apache webservers do not allow using the include function with a full URL starting with http:// . Some may. It's a PHP security setting (allow_url_include in php.ini, which also requires allow_url_fopen). If your local installation is not configured the same way as your hosting webserver you'll have trouble.
 
So always assume a php include is to be relative to (in or below)  the current folder where the currently running script is being executed.
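
The leading-slash rule can be illustrated with plain path joining. This is a Python analogy using posixpath, not PHP's actual include lookup (which also consults include_path). `resolve_include` is a hypothetical helper, and the paths are the example ones used in this post.

```python
import posixpath

# Example document root from the cPanel case described above.
DOC_ROOT = "/home/user/public_html"


def resolve_include(script_dir, include_arg):
    """Join the way a filesystem does: a leading '/' discards script_dir."""
    return posixpath.normpath(posixpath.join(script_dir, include_arg))


# Leading slash: resolved from the filesystem root, outside the site.
absolute = resolve_include(DOC_ROOT, "/some-script.inc")
# No leading slash: resolved below the running script's folder.
relative = resolve_include(DOC_ROOT, "includes/mastandnav.inc.htm")

print(absolute)
print(relative)
```

In PHP itself, building the path from dirname(__FILE__) keeps an include anchored to the file that contains it, and works on both PHP versions mentioned in this thread.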
 
Best.
 
Christina
 