Xenu will give you a broken links report. It can also report orphan
pages (files on the server that no page links to).
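If it helps with the report comparison you describe below, a quick
script can diff the two lists for you. Just a sketch (Python, not part
of Xenu), and the file names are placeholders for whatever reports you
export:

# Compare a crawled-url list against a listing of files on the server.
# Anything on the server that the crawl never reached is an orphan
# (unlinked) candidate.
def load_paths(filename):
    """Read one url or path per line, ignoring blank lines."""
    with open(filename) as f:
        return {line.strip() for line in f if line.strip()}

crawled = load_paths("crawled_urls.txt")    # e.g. Xenu's exported list
on_server = load_paths("server_files.txt")  # e.g. a directory listing

for path in sorted(on_server - crawled):
    print(path)

(You'd have to normalize both lists to the same form first, full urls
or site-relative paths, for the comparison to mean anything.)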
On Sep 22, 9:36 pm, Nic <nic.thi...@gmail.com> wrote:
> Thanks webado, I'm taking a look at the Xenu package. For others that
> might have a solution with gsitecrawler, I should add that I'm just
> trying to generate a report to compare against another report to find
> unlinked files and broken links. I'm using the URL list feature
> rather than generating a full sitemap.
>
> On Sep 22, 9:26 pm, webado <web...@gmail.com> wrote:
>
> > Well it sounds tricky.
> > But the main purpose of building a sitemap is also for you to see how
> > robots will crawl your site. The sitemap does not limit what robots
> > crawl and index, nor does it guarantee that all the urls listed in it
> > will be indexed.
>
> > If you cannot find a way to limit this program's crawl to certain
> > folders, then you will also be unable to stop Googlebot or any other
> > robot from crawling beyond what you want crawled.
>
> > In your case and for such a structure I'd not include pdfs in the
> > sitemap at all. Googlebot will find them in any case. Thus start the
> > crawl with the folder www.mydomain.com.au/webcontent/ and only make a
> > sitemap of that.
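To illustrate what "only make a sitemap of that" could look like: once
you have a flat list of crawled urls, filtering to the /webcontent/
folder and writing a minimal sitemap takes only a few lines. A Python
sketch, assuming the crawl results sit in a hypothetical urls.txt; the
<urlset> skeleton is the standard sitemaps.org format:

from xml.sax.saxutils import escape

# Keep only the urls under /webcontent/ and write a minimal sitemap.
PREFIX = "http://www.mydomain.com.au/webcontent/"

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip().startswith(PREFIX)]

with open("sitemap.xml", "w") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        # escape() handles &, < and > in case a url contains them
        out.write("  <url><loc>%s</loc></url>\n" % escape(url))
    out.write("</urlset>\n")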
>
> > You can help see the structure and hierarchy of the site by using Xenu
> > (but note that it does not obey robots.txt or any other robots
> > directives). One way to crawl using Xenu is to start on
> > www.mydomain.com.au/webcontent/ and add www.mydomain.com.au/pdfs/ to
> > the section "consider urls starting with this as part of the same
> > site". Then Xenu will crawl just www.mydomain.com.au/webcontent/ plus
> > any urls from www.mydomain.com.au/pdfs/.
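For anyone reading along without Xenu handy, that crawl rule boils down
to "follow links only under these two prefixes". A rough standard-library
Python sketch of the same idea; the real Xenu does far more, and the urls
are just the examples from this thread:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

START = "http://www.mydomain.com.au/webcontent/"
ALLOWED = ("http://www.mydomain.com.au/webcontent/",
           "http://www.mydomain.com.au/pdfs/")

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

seen, queue = set(), [START]
while queue:
    url = queue.pop()
    if url in seen or not url.startswith(ALLOWED):
        continue  # outside the two folders we care about
    seen.add(url)
    if url.startswith(ALLOWED[1]):
        continue  # record pdf urls, but don't parse them for links
    try:
        html = urlopen(url).read().decode("utf-8", "replace")
    except OSError:
        continue  # unreachable url -- what Xenu reports as broken
    collector = LinkCollector()
    collector.feed(html)
    queue.extend(urljoin(url, link) for link in collector.links)

for url in sorted(seen):
    print(url)

The prefix check on the queue is the whole trick: everything else on
the domain simply never gets fetched.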
>
> > Xenu can be downloaded from: http://home.snafu.de/tilman/xenulink.html
>
> > On Sep 22, 3:03 am, Nic <nic.thi...@gmail.com> wrote:
>
> > > I've been trying to work out how to limit a crawl so that it only
> > > shows pages that are either within a specific URL path, or within a
> > > second path and linked from a page in the original path. Any ideas?
>
> > > As an example, say my web content is in: www.mydomain.com.au/webcontent/
>
> > > The PDFs that I want to list are linked from several pages under the
> > > above path, but they reside in a different path, similar to the
> > > following: www.mydomain.com.au/pdfs/
>
> > > I want to find all the content in /webcontent and then any pdfs that
> > > are linked from /webcontent/ to /pdfs/, and nothing else. I've tried
> > > setting the "Main URL" to www.mydomain.com.au/webcontent and then
> > > listing www.mydomain.com.au/pdfs/ in the URL list and ticking
> > > "include", but it doesn't seem to list any of the content under /pdfs/.
>
> > > Am I missing something? Do I have to set the "Main URL" to
> > > www.mydomain.com? If so, is there a way that I can exclude the
> > > (multiple) other paths that I'm not interested in (such as
> > > www.mydomain.com/otherstuff) without specifically listing each one?