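As far as I understand, if you want to do incremental updates of the index,
you can take a look at Zoie, written by LinkedIn:
http://javasoze.github.io/zoie/
Or evaluate SolrCloud:
http://wiki.apache.org/solr/SolrCloud
Nutch itself does appear to implement a merge as well:
http://wiki.apache.org/nutch/bin/nutch%20mergedb
How well it works in practice probably still needs some testing, though.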
/opt/crawlzilla/nutch/bin# ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
readdb read / dump crawl db
convdb convert crawl db from pre-0.9 format
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
index run the indexer on parsed segments and linkdb
solrindex run the solr indexer on parsed segments and linkdb
merge merge several segment indexes
dedup remove duplicates from a set of segment indexes
solrdedup remove duplicates from solr
plugin load a plugin and run one of its classes main()
server run a search server
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
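
For what it's worth, here is a rough dry-run sketch of how a mergedb call might be put together. It only prints the command and does not run it. The crawldb paths are made-up placeholders, and the argument order (output crawldb first, then the input crawldbs) is my assumption; running ./nutch mergedb with no arguments should print the real usage, so check that first.

```shell
#!/bin/sh
# Sketch only: the crawldb paths used below are hypothetical placeholders.
NUTCH_BIN="${NUTCH_BIN:-/opt/crawlzilla/nutch/bin/nutch}"

# Print a mergedb command line for one output crawldb and one or more inputs.
# Assumed argument order: mergedb <output_crawldb> <crawldb1> [<crawldb2> ...]
mergedb_cmd() {
    out="$1"
    shift
    printf '%s mergedb %s' "$NUTCH_BIN" "$out"
    for db in "$@"; do
        printf ' %s' "$db"
    done
    printf '\n'
}

# Dry run: show the command instead of executing it.
mergedb_cmd crawl/merged_crawldb crawl_old/crawldb crawl_new/crawldb
```

Once the printed command looks right against the actual usage text, it can be pasted into the shell and run by hand.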
Beyond that, you will still run into the problem of a single index file growing too large later on.
- Jazz