list of questions

marian...@gmail.com

May 19, 2009, 7:05:11 PM
to hounder
Hi my Flaptor friends, I am really sorry to bother you again, I swear
I wouldn't if I could avoid it.

Here is the list of questions I have. By the way, your crawler works
pretty well on its own. Congrats.

1. Why does the stats file always show something like pages=1000001,
while the number of urls it downloaded is slightly higher? Is it a
mistake, or a problem on my end?

2. The crawler has been retrieving very few pages a day since I reached
a certain number of downloaded urls, and I do not understand why.

3. Now that the crawl has been running for a while, I want to add some
new urls to the seeds file.
Can I do this without losing all the previously crawled pages? How?


Thanks for your time
Mariana

Jorge Handl

May 20, 2009, 11:36:42 AM
to hou...@googlegroups.com
Hi Mariana!


> 1. Why does the stats file always show something like pages=1000001,
> while the number of urls it downloaded is slightly higher? Is it a
> mistake?

How are you checking the number of urls you downloaded?


> 2. The crawler has been retrieving very few pages a day since I reached
> a certain number of downloaded urls, and I do not understand why.

This is probably because your crawler is configured to only fetch newly discovered pages and it has already discovered most of the pages in the hotspot area.

Check the value of the "priority.percentile.to.fetch" configuration variable in the crawler.properties file. A value of zero means no page will be fetched twice. A value of 100 means all pages will be fetched in each cycle, even if they have already been fetched. A value in between will select the most promising pages for refetch, for example those that have changed more frequently, but also those that have not been refetched in a long time (otherwise they would never be revisited).

You can also use the "discovery.front.size" variable, which sets the number of pages the crawler will fetch in each cycle outside of the hotspot area. It only makes sense if you are using modules that can mark a page as a hotspot based on something other than its url, for example a Bayesian filter trained to detect pages about a certain subject.
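
For example, the relevant part of crawler.properties could look something like this (the key names are the ones above; the values are only illustrative and should be tuned for your own crawl):

    # refetch the most promising 10% of already-known pages each cycle
    # (0 = never refetch, 100 = refetch everything every cycle)
    priority.percentile.to.fetch=10

    # how many pages outside the hotspot area to fetch per cycle
    discovery.front.size=10000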


> 3. Now that the crawl has been running for a while, I want to add some
> new urls to the seeds file. Can I do this without losing all the
> previously crawled pages? How?

Yes, you can. The crawler is constantly looking for a dir named "injectdb", and as soon as it finds one it will read and delete it, adding the pages to the current fetch cycle. Assuming you have a file named "new.urls" containing the new urls, run the following commands:

    ./db.sh create tempdb new.urls    # build a page db from the urls in new.urls
    mv tempdb injectdb                # rename it so the crawler picks it up
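
In case you script this, new.urls here is assumed to be a plain text file with one url per line (that is an assumption about the format db.sh expects, so adjust it to your setup), and it is worth checking that a previous injectdb has already been consumed before renaming. Something like:

    # create the url list, one url per line (placeholder urls)
    echo "http://www.example.com/" >  new.urls
    echo "http://www.example.org/" >> new.urls

    # build the db and hand it to the crawler; if a previous injectdb
    # is still sitting there, wait for the crawler to consume it first,
    # otherwise mv would move tempdb inside it
    ./db.sh create tempdb new.urls
    if [ -d injectdb ]; then
        echo "previous injectdb not consumed yet, try again later" >&2
    else
        mv tempdb injectdb
    fi

Building into tempdb first and only renaming it at the end means the crawler never picks up a half-built injectdb.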

Regards,
- Jorge

marian...@gmail.com

May 22, 2009, 5:08:52 PM
to hounder
Excellent and quick responses, thank you very much! So far everything
is going very, very well.