On May 20, 12:45 pm, Jorge Handl <jha...@gmail.com> wrote:
> I recommend searching Google for queries like "lista de sitios" (Spanish for
> "list of sites"); that will give you plenty of starting points. As for "black
> holes", meaning sites set up in such a way that a crawler retrieves an
> infinite number of urls that all refer to the same set of pages (through
> session ids, for example), that depends heavily on both the sites you crawl
> and the configuration of the crawler. For example, banning any url containing
> the "?" character will get you out of most loops, but it will also limit the
> scope of the crawl. You need to analyze your page database regularly to find
> such loops and filter them out with the hotspots.regex and regex-urlfilter.txt
> files, and with the blacklist (see the sketch below).
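>
> For instance, a rule file along these lines would implement that "?" ban. This
> is only a sketch, assuming the common one-rule-per-line format for
> regex-urlfilter.txt: a line starting with "-" rejects urls, a line starting
> with "+" accepts them, a rule applies when its regex matches anywhere in the
> url, and the first matching rule wins.
>
>   # reject any url containing a "?" (probable query string or session id loop)
>   -[?]
>   # accept everything else
>   +.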
>
> The pagedb does not hold the contents of each page, but you can get the text
> from either the index or the cache, if you configured the crawler to use a
> cache.
>
> Regards,
> - Jorge
>