incremental crawling and/or recrawl


dans...@gmail.com

Jan 17, 2020, 3:31:59 PM
to DigitalPebble
Does StormCrawler run continuously and recrawl?

That is, to achieve incremental crawling, do I give it the same seeds along with filters that say when to recrawl, or do I just keep running it?

Does it support some concept of snapshot and resume, so it can be set up to run only between certain hours?

DigitalPebble

Jan 20, 2020, 4:18:03 AM
to DigitalPebble
Hi

Please use StackOverflow for questions like these; you'll get a wider audience.

Does StormCrawler run continuously and recrawl?

Yes to both: it runs continuously and it recrawls.
 
That is, to achieve incremental crawling, do I give it the same seeds along with filters that say when to recrawl, or do I just keep running it?

You simply let it run. You can specify how frequently a page should be revisited in the configuration with fetchInterval.*
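
As a rough illustration, the revisit settings could look like the fragment below. The key names are the ones I recall from StormCrawler's default configuration, with values in minutes and -1 meaning "never refetch"; the specific values are only examples, so check the crawler-default.yaml shipped with your version before relying on them.

```yaml
config:
  # revisit a successfully fetched page once a day (value in minutes)
  fetchInterval.default: 1440

  # retry a page that failed with a fetch error after 2 hours
  fetchInterval.fetch.error: 120

  # never revisit a page that ended in an error status
  fetchInterval.error: -1
```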
 

Does it support some concept of snapshot and resume, so it can be set up to run only between certain hours?

If you use a proper backend (Elasticsearch, Solr, etc.) to persist the information about the URLs, i.e. you don't keep it only in memory, then yes: you can pause or stop the topology at any time, and it will resume where it left off when you restart it.
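
To make the point about persistence concrete, here is a minimal Python sketch of the idea. It is not StormCrawler code: `StatusStore`, its methods, the JSON file and the example URLs are all hypothetical stand-ins for a real status backend. It only shows why storing a next-fetch date per URL outside of memory lets a crawl stop and later pick up where it left off.

```python
import json
import os
import tempfile
from datetime import datetime, timedelta


class StatusStore:
    """Toy persistent URL status store (hypothetical, not StormCrawler's API)."""

    def __init__(self, path):
        self.path = path
        self.next_fetch = {}
        # Reload any state left behind by a previous run.
        if os.path.exists(path):
            with open(path) as f:
                raw = json.load(f)
            self.next_fetch = {u: datetime.fromisoformat(t) for u, t in raw.items()}

    def save(self):
        with open(self.path, "w") as f:
            json.dump({u: t.isoformat() for u, t in self.next_fetch.items()}, f)

    def add_seed(self, url, now):
        # A new URL is due immediately; a known URL keeps its schedule.
        self.next_fetch.setdefault(url, now)
        self.save()

    def due(self, now):
        # URLs whose next fetch date has passed, in a stable order.
        return sorted(u for u, t in self.next_fetch.items() if t <= now)

    def mark_fetched(self, url, now, interval_minutes):
        # Schedule the revisit, as a fetch interval setting would.
        self.next_fetch[url] = now + timedelta(minutes=interval_minutes)
        self.save()


path = os.path.join(tempfile.mkdtemp(), "status.json")
t0 = datetime(2020, 1, 20, 9, 0)

store = StatusStore(path)
store.add_seed("https://example.com/a", t0)
store.add_seed("https://example.com/b", t0)
store.mark_fetched("https://example.com/a", t0, interval_minutes=1440)
del store  # simulate stopping the crawl

resumed = StatusStore(path)  # simulate restarting it later
print(resumed.due(t0 + timedelta(hours=1)))  # ['https://example.com/b']
print(resumed.due(t0 + timedelta(days=1)))   # both URLs are due again
```

Because the schedule lives in the store rather than in the running process, restricting the crawl to certain hours is just a matter of stopping and starting the process; nothing has to be snapshotted explicitly.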

Hope this helps

Julien
 
