Using hounder


marian...@gmail.com

Apr 30, 2009, 7:32:18 PM
to hounder
Hi, thanks for all your support. So far things are working very well, and it was pretty easy to get started.
I am currently trying to make the crawler retrieve only the pages whose host ends with .ar. So, for example, if I have http://www.clarin.com.ar/noticias/daily, I guess I need a regular expression that accepts that URL but rejects www.ar.com.
I understand that this configuration is changed in the file called regex-urlfilter.txt. Mine currently looks like this:
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip wikipedia
-.*wikipedia.*
-.*wikimedia.*
-.*gnu.org.*
-.*wiktionary.*
-.*wikiquote.*

# accept anything else
+.


So I need to change the line "+." to something that accepts only the .ar URLs. I tried to write a regular expression for this, but it did not work. Can you please explain what the line below "# accept anything else" would have to look like?
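Would a rule along these lines be the right idea? I am only guessing at the pattern, so it may well be wrong:

# accept only URLs whose host ends in .ar
+^https?://[^/?#]*\.ar(/.*)?$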


I also have another doubt. Given that only a subset of web pages will be crawled due to the restricting condition, is it possible that the crawler reaches, and therefore downloads, fewer pages, because it is not queuing the links found on pages that no longer comply with the regular expression? Please let me know how this works.

I really appreciate all your help
Mariana

Jorge Handl

Apr 30, 2009, 8:15:51 PM
to hou...@googlegroups.com
Mariana, the regex-urlfilter.txt file is mainly used to quickly discard unwanted links like emails and images. The hotspots.regex file, on the other hand, is optimized for directing the crawler to a subset of the web. If you ever start adding a long list of sites that you want the crawler to pay special attention to, like sites that don't end in .ar but that you want crawled anyway, the hotspots.regex file is the way to go. In your case it would start with something like "http:// | [^/]*\.ar(/.*|)" (sans quotes). And don't forget to limit the crawler to the number of pages that will fit on your hard drive; otherwise you will need to configure a distributed crawler, and that's a whole 'nother story.
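As a rough, untested sketch, the whole hotspots.regex file could start out as just that one line:

http:// | [^/]*\.ar(/.*|)

with extra lines added later for any non-.ar sites you want crawled anyway. It is worth checking the pattern against a couple of sample URLs (the clarin.com.ar one, and www.ar.com, which should be rejected) before launching a long crawl.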

Hope that helps.
Jorge

marian...@gmail.com

May 5, 2009, 5:57:22 PM
to hounder
Thank you very much for the reminder. I think I will do that in the future, since this is going pretty well, but you are right, I'd better not crash the server.
Once again, thanks for everything.
