limit crawl to links between one folder and another without crawling entire domain


Nic

Sep 22, 2011, 3:03:38 AM
to SOFTplus GSiteCrawler

I've been trying to work out how to limit a crawl to only show pages
that are either within a specific URL path or within a second path and
linked from a page in the original path. Any ideas?

As an example, say my web content is in:
www.mydomain.com.au/webcontent/

The PDFs that I want to list are linked from several pages under the
above path, but they reside in a different path, similar to the
following:
www.mydomain.com.au/pdfs/

I want to find all the content in /webcontent and then any pdfs that
are linked from /webcontent/ to /pdfs/ and nothing else. I've tried
setting the "Main URL" to www.mydomain.com.au/webcontent and then
listing www.mydomain.com.au/pdfs/ in the URL list and ticking
"include" but it doesn't seem to list any of the content under /pdfs/.

Am I missing something? Do I have to set the "Main URL" to www.mydomain.com.au?
If so, is there a way that I can exclude the (multiple) other paths
that I'm not interested in (such as www.mydomain.com.au/otherstuff)
without specifically listing each one?
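What this question describes can be expressed as a two-prefix rule: follow links only under /webcontent/, and record (but do not crawl beyond) anything linked under /pdfs/. Below is a minimal sketch of that logic in Python. It is not how GSiteCrawler works internally; the link graph is a hypothetical in-memory stand-in for real HTTP fetches, using the example URLs from this thread:

```python
from collections import deque

# Hypothetical in-memory site standing in for real HTTP fetches:
# maps a page URL to the list of links found on that page.
SITE = {
    "http://www.mydomain.com.au/webcontent/": [
        "http://www.mydomain.com.au/webcontent/page1.html",
        "http://www.mydomain.com.au/otherstuff/skip.html",
    ],
    "http://www.mydomain.com.au/webcontent/page1.html": [
        "http://www.mydomain.com.au/pdfs/report.pdf",
        "http://www.mydomain.com.au/webcontent/",
    ],
}

CRAWL_PREFIX = "http://www.mydomain.com.au/webcontent/"  # follow links under this
LIST_PREFIX = "http://www.mydomain.com.au/pdfs/"         # list, but don't crawl past

def crawl(start):
    seen, queue, results = set(), deque([start]), []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        results.append(url)
        # Only pages under the crawl prefix are "fetched" for more links;
        # a /pdfs/ URL is recorded above but never expanded.
        if not url.startswith(CRAWL_PREFIX):
            continue
        for link in SITE.get(url, []):
            # Keep links inside the two allowed prefixes; drop everything else.
            if link.startswith(CRAWL_PREFIX) or link.startswith(LIST_PREFIX):
                queue.append(link)
    return results
```

Running `crawl("http://www.mydomain.com.au/webcontent/")` lists the /webcontent/ pages plus the linked PDF, while /otherstuff/ is never visited.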

webado

Sep 22, 2011, 8:26:36 AM
to SOFTplus GSiteCrawler
Well, it sounds tricky.
But the main purpose of building a sitemap is also for you to see how
robots will be crawling your site. The sitemap does not limit what
robots crawl and index, nor does it guarantee that all the URLs
listed in the sitemap will be indexed.

If you cannot find a way to limit crawls to certain folders in
this program, then you will also be unable to limit Googlebot or any
other robots to crawling just what you want crawled.

In your case, with such a structure, I would not include the PDFs in the
sitemap at all. Googlebot will find them in any case. So start the
crawl at www.mydomain.com.au/webcontent/ and only make a sitemap of
that.
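On the last part of the original question (excluding the multiple other paths without listing each one): for robots that honor the Allow directive (Googlebot does; many simpler crawlers do not, and precedence rules vary between crawlers), a blanket Disallow plus specific Allows is one option. A sketch only, using the example paths from this thread:

```text
User-agent: *
# Googlebot applies the longest-matching rule, so these Allows
# override the blanket Disallow below for just these two folders.
Allow: /webcontent/
Allow: /pdfs/
Disallow: /
```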

You can see the structure and hierarchy of the site by using Xenu
(note that it does not obey robots.txt or any other robots directives).
One way to crawl with Xenu is to start at www.mydomain.com.au/webcontent/
and add www.mydomain.com.au/pdfs/ to the section "consider URLs
starting with this as part of the same site". Xenu will then crawl just
www.mydomain.com.au/webcontent/ plus any URLs from www.mydomain.com.au/pdfs/.

Xenu can be downloaded from: http://home.snafu.de/tilman/xenulink.html



Nic

Sep 22, 2011, 9:36:50 PM
to SOFTplus GSiteCrawler
Thanks webado, I'm taking a look at the Xenu package. For others who
might have a solution with GSiteCrawler, I should add that I'm just
trying to generate a report to compare against another report to find
unlinked files and broken links. I'm using the URL list feature
rather than generating a full sitemap.


webado

Sep 22, 2011, 9:56:53 PM
to SOFTplus GSiteCrawler
Xenu will give you a broken links report.

It can also give orphan pages.
