Re: A Filter Problem.


webado

May 19, 2013, 1:21:18 AM
to gsitec...@googlegroups.com
Unfortunately, your URL structure does not lend itself easily to filtering. Like the robots.txt protocol, the filter is prefix-based.

GSiteCrawler does, however, allow pattern matching.
Go to File > Global options > General and select Pattern matching for Ban Urls ...
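To illustrate the difference between the two matching modes, here is a minimal Python sketch. It is not GSiteCrawler's own code, and the function names are made up for the example; it only assumes that prefix matching compares the ban entry literally against the start of the URL, while pattern matching treats "*" as a wildcard (emulated here with Python's fnmatch module).

```python
from fnmatch import fnmatchcase

# The two ban entries quoted in the original question.
BAN_ENTRIES = [
    "http://www.example.com/*/page/*",
    "http://www.example.com/*/category/*",
]

def banned_by_prefix(url, entries):
    # Prefix matching: the entry is compared literally, so "*" is just a character.
    return any(url.startswith(entry) for entry in entries)

def banned_by_wildcard(url, entries):
    # Wildcard matching: "*" stands for any run of characters, including "/".
    return any(fnmatchcase(url, entry) for entry in entries)

listing = "http://www.example.com/environment/category/news-items-english/page/2960/"
article = "http://www.example.com/environment/a-new-category-of-invertebrates-discovered/"

print(banned_by_prefix(listing, BAN_ENTRIES))    # prefix mode never sees a literal "*" in the URL
print(banned_by_wildcard(listing, BAN_ENTRIES))  # wildcard mode catches the listing page
print(banned_by_wildcard(article, BAN_ENTRIES))  # the article has no "/category/" segment
```

Under these assumptions the listing URL slips through prefix matching but is caught by the wildcard entries, while the article URL is left alone.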




On Saturday, May 18, 2013 8:43:20 PM UTC-4, Ludo Rubben wrote:
Dear Christina,

I have been using GSiteCrawler for a few days now, and in general I am extremely happy with the program. I have 9 websites, some small, some larger.
I have a few small problems that I have not been able to solve, and I hope you can help me out.
First of all, it seems that my filters do not work as I expected.
One example is:
I have put "http://www.example.com/*/page/*" and "http://www.example.com/*/category/*" in the Ban Urls page.
But in the URL List, I have dozens and dozens of URLs like "http://www.example.com/environment/category/news-items-english/page/2960/" (with a different page number at the end).
So it means that my two filters are in fact ignored.
I have to manually select all URLs containing "page" and "category" in the URL List and unselect them from "include".
That works, and they are now not included in the sitemap, but it would be easier if they were not included in the URL List in the first place.
In your help file, you say that just putting a word is enough to block the complete URL, but I'm afraid of blocking more than I want that way.
Just putting "category" in the ban list would also block an article titled "http://www.example.com/environment/a-new-category-of-invertebrates-discovered/".
Did I do something wrong? Or do you have a solution?
Thanks in advance for your help.
Ludo Rubben.

Ludo Rubben

May 19, 2013, 9:50:48 AM
to gsitec...@googlegroups.com
Oh, sorry, I didn't see that option. I will use regex; that will solve my problem.
Thanks a lot for your quick reply.
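
For anyone reading along, a regex for this case could look roughly like the sketch below. It is written for Python's re module; the pattern and the is_banned helper are hypothetical, and the regex dialect GSiteCrawler accepts is not stated in this thread, so treat it as an illustration of the idea only. Anchoring "category" and "page" between slashes bans them as whole path segments and leaves article slugs that merely contain the word alone.

```python
import re

# Hypothetical pattern: match "category" or "page" only as a complete path segment.
BAN_PATTERN = re.compile(r"/(?:category|page)/")

def is_banned(url):
    # search() looks for the segment anywhere in the URL, not just at the start.
    return BAN_PATTERN.search(url) is not None

listing = "http://www.example.com/environment/category/news-items-english/page/2960/"
article = "http://www.example.com/environment/a-new-category-of-invertebrates-discovered/"

print(is_banned(listing))
print(is_banned(article))
```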




webado

May 19, 2013, 7:28:08 PM
to gsitec...@googlegroups.com
OK, you will have to delete all manual URLs from the URL list and let the crawler crawl all over again; this time your filter will apply (I assume you set it up using either wildcards or regex). Then, after a full crawl, set the frequencies and priorities for the fresh URL list to whatever you want if the default values are not OK.

But this is not really enough. You also need to exclude those URLs on the website itself, either with robots.txt directives that target them in some way or, better, by adding robots noindex meta tags to them (if your site's software platform allows it).

You need to do that because, regardless of what is or isn't in the sitemap, Google crawls and indexes everything it is allowed to on the site anyway.
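
As a rough way to check a robots.txt rule before deploying it, here is a small Python sketch using the standard-library urllib.robotparser. The Disallow line is an assumed example built from the "environment" section in the URLs quoted earlier in the thread, not a rule taken from the actual site. Because plain robots.txt rules are prefix-based, one such line per top-level section would be needed.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules; the real site's robots.txt is not shown in this thread.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /environment/category/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse the rules from memory instead of fetching a file

listing = "http://www.example.com/environment/category/news-items-english/page/2960/"
article = "http://www.example.com/environment/a-new-category-of-invertebrates-discovered/"

# can_fetch() reports whether the given user agent may crawl the URL.
print(parser.can_fetch("*", listing))
print(parser.can_fetch("*", article))
```

The meta tag alternative is a line such as <meta name="robots" content="noindex"> in the head of each page to be kept out of the index; whether it can be added only to the category and page listings depends on the site's platform.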