more questions

marian...@gmail.com

May 19, 2009, 7:05:38 PM
to hounder
Hi, it's me again, with a second list of questions:

1. Do you have a recommended list of sites from Argentina to ban, so that
the crawler does not get caught in black holes?

2. Do you have a recommended list of seed URLs from Argentina to start
from? Or a recommended number of seeds?

3. What is the command to see the text belonging to a webpage I
downloaded? Is there one, or do I need to deal with the database
directly?


Thank you once again
Mariana



Jorge Handl

May 20, 2009, 11:45:00 AM
to hou...@googlegroups.com
I recommend looking at Google results for queries like "lista de sitios". This will give you plenty of starting points.

As for "black holes", meaning sites that work in such a way that a crawler will retrieve an infinite number of URLs that all refer to the same set of pages (with session ids, for example), that depends heavily both on the sites you crawl and on the configuration of the crawler. For example, banning any URL containing the "?" character will get you out of most loops, but it will also limit the scope of the crawl. You need to analyze your page database regularly to find such loops and filter them out with the hotspots.regex and regex-urlfilter.txt files, and with the blacklist.
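
To give an idea of the kind of rule this means (a minimal sketch, not Hounder's actual hotspots.regex / regex-urlfilter.txt syntax), the snippet below checks a plain-text dump of URLs, one per line, against a couple of assumed ban patterns (query strings and common session-id markers):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;

// Sketch: flag URLs that look like potential black holes before they enter
// the crawl. The patterns are illustrative assumptions, not Hounder defaults.
public class UrlFilterSketch {

    private static final Pattern[] BANNED = {
        Pattern.compile("\\?"),                          // any query string
        Pattern.compile("(?i)(jsessionid|phpsessid)=")   // common session-id markers
    };

    static boolean accept(String url) {
        for (Pattern p : BANNED) {
            if (p.matcher(url).find()) {
                return false;   // matches a ban pattern, skip it
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: a text file with one URL per line, e.g. a dump of the pagedb.
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String url;
        while ((url = in.readLine()) != null) {
            System.out.println((accept(url) ? "KEEP   " : "BANNED ") + url);
        }
        in.close();
    }
}

Whatever shows up as BANNED here is the kind of URL you would then express as a rule in hotspots.regex / regex-urlfilter.txt, or add to the blacklist.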

The pagedb does not hold the contents of each page, but you can get the text from either the index or the cache, if you configured the crawler to use a cache.
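
To actually pull the text out of the index, something along these lines should work (a minimal sketch, assuming a Lucene 2.9/3.x-style index directory and that the page body and URL are stored in fields named "text" and "url"; the field names Hounder really uses may differ, so check your index first):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Sketch: dump the stored text of every page in a Lucene index directory.
// Field names "url" and "text" are assumptions; adjust to what Hounder stores.
public class DumpPageText {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the index directory
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;   // skip deleted documents
                Document doc = reader.document(i);
                System.out.println("== " + doc.get("url") + " ==");
                System.out.println(doc.get("text"));
            }
        } finally {
            reader.close();
        }
    }
}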

Regards,
- Jorge

marian...@gmail.com

May 22, 2009, 5:10:47 PM
to hounder
Thank you once again, everything is perfectly clear!

M
