Hi, thanks for all your support. So far things are working very well,
and it was pretty easy to get started.
I am currently trying to make the crawler retrieve only pages whose
URLs end in .ar. For example, if I have
http://www.clarin.com.ar/noticias/daily, I guess I need to write a
regular expression so that this URL is accepted but
www.ar.com is not.
As far as I understand, this configuration is changed in the file
called regex-urlfilter.txt; mine currently looks like this:
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# skip wikipedia
-.*wikipedia.*
-.*wikimedia.*
-.*gnu.org.*
-.*wiktionary.*
-.*wikiquote.*
# accept anything else
+.
So I need to change the line "+." to something that accepts only the
.ar URLs. I tried to write a regular expression for this, but it did
not work. Could you please explain what the line below
"# accept anything else" would have to look like?
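
For reference, this is roughly the kind of rule I have been trying in
place of the "+." line (just my own guess at the pattern and the
syntax, so it may well be wrong):

+^https?://([a-z0-9-]+\.)+ar(/|$)

My intention was to accept only URLs whose host name ends in .ar,
such as www.clarin.com.ar, and let everything else be ignored since
no pattern would match it, as the comments at the top of the file
describe.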
Then I have another doubt. Given that only a subset of web pages will
be crawled because of this restriction, is it possible that the
crawler reaches, and therefore downloads, fewer pages overall,
because it no longer queues the links found on the pages that the
regular expression now rejects? Please let me know how this works.
I really appreciate all your help.
Mariana