Distributed Crawler


Robert King

Aug 26, 2015, 2:37:31 PM
to DataparkSearch Engine
Hi Maxim,

I am wondering if anyone has ever distributed DataparkSearch over several machines to increase its crawling capacity and the number of URLs it can process?

Any insight on this topic would be appreciated.

thanks

Robert

Maxim Zakharov

Aug 26, 2015, 9:45:39 PM
to datapar...@googlegroups.com
Hi Robert,

I assume you're already using cache mode storage, which is the default for dpsearch.

To distribute indexing/crawling, you need to configure and start the cached daemon on the machine where searchd is running.
See http://www.dataparksearch.org/dpsearch-cachemode.en.html

On the remote machines, you need to add the cached= parameter to the DBAddr command, pointing it at the machine where cached is running; see
http://www.dataparksearch.org/dpsearch-indexcmd.en.html#DBADDR_CMD
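For illustration, the DBAddr line on a remote crawling machine could look like the sketch below. The database driver, credentials, hostnames, and the cached port are placeholders I've made up, not values from this thread; adjust them to your own setup.

# indexer.conf on a remote crawler (all names and the port are assumptions)
# dps_mode=cache selects cache mode storage; cached= points at the host
# where the cached daemon is running
DBAddr mysql://dpuser:dppass@db.example.com/dpsearch/?dps_mode=cache&cached=central.example.com:7000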

Robert King

Sep 6, 2015, 3:07:25 PM
to DataparkSearch Engine
Hi Maxim,

Thank you for the information.

If I decide to crawl on a per-server basis, is it safe to have different indexer.conf files on independent machines?

E.g., one machine with Server commands for one list of sites and another machine with Server commands for a different list of sites, with both machines talking to the central server and database?

thanks

Robert

Maxim Zakharov

Sep 7, 2015, 7:10:04 PM
to DataparkSearch Engine

Hi Robert,
Indexing with different indexer.conf files against the same database will likely lead to a situation where one server adds new URLs to the database while another deletes them, because its config has no rules allowing them.
You may put a

Server skip *

command at the end of your indexer.conf files so that such URLs are not deleted. But watch out for unwanted URLs getting into the database: there would then be no command deleting them, and you would need an explicit

Server delete <url>

command to remove them.
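As a sketch of that layout (the site URLs here are invented for illustration), each machine's indexer.conf would list only its own sites and end with the skip rule:

# indexer.conf on machine A: crawl only machine A's sites
Server http://www.site-a1.example/
Server http://www.site-a2.example/
# leave URLs owned by the other crawlers alone instead of deleting them
Server skip *
# remove an unwanted URL explicitly if one slips into the database
Server delete http://www.unwanted.example/

Machine B's config would do the same, with its own list of Server commands above the trailing Server skip * line.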

Maxim
