On implementing index merging

Jim T. Tang

Oct 13, 2013, 11:17:22 PM

to crawlzi...@googlegroups.com

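Hello,

Recently I have been trying to implement incremental MapReduce on Hadoop so that computed results keep accumulating. I read a few papers, such as Incoop and IncMR, but found that our needs are not that complex: we only need to merge each day's index. That is how we found the powerful Crawlzilla.

However, Lucene index files are all binary. From what I found searching the forum, you seem to recommend using Luke to convert the index to SQLite. Since I am very unfamiliar with databases, I would like to ask: as far as you know, if I want to build an incremental index of my own website every day, should I be looking into Luke + SQLite? Or are there other approaches I could consider?

Thank you for your answer!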

Jim

Jazz Yao-Tsung Wang

Oct 14, 2013, 12:41:37 PM

to crawlzi...@googlegroups.com

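As far as I understand, if you want incremental updates to the index,
you could look at Zoie, written by LinkedIn:
http://javasoze.github.io/zoie/

Or evaluate SolrCloud:
http://wiki.apache.org/solr/SolrCloud

Nutch itself does appear to implement merging as well:
http://wiki.apache.org/nutch/bin/nutch%20mergedb
You may need to experiment a bit to see how it works in practice, though.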

/opt/crawlzilla/nutch/bin# ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl             one-step crawler for intranets
readdb            read / dump crawl db
convdb            convert crawl db from pre-0.9 format
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
parse             parse a segment's pages
readseg           read / dump segment data
mergesegs         merge several segments, with optional filtering and slicing
updatedb          update crawl db from segments after fetching
invertlinks       create a linkdb from parsed segments
mergelinkdb       merge linkdb-s, with optional filtering
index             run the indexer on parsed segments and linkdb
solrindex         run the solr indexer on parsed segments and linkdb
merge             merge several segment indexes
dedup             remove duplicates from a set of segment indexes
solrdedup         remove duplicates from solr
plugin            load a plugin and run one of its classes main()
server            run a search server
or
CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
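
For the daily-merge case, the "merge" command in the list above looks like the relevant one. Here is a minimal Python sketch that only builds (and prints) such a command line. The argument order (output index first, then the input index directories) is an assumption and not confirmed in this thread; running ./nutch merge without parameters should print the real usage. The directory names and the helper build_merge_command are hypothetical, and the nutch path is taken from the shell prompt above.

```python
import shlex

# Path taken from the shell prompt shown in this post; adjust for your install.
NUTCH_BIN = "/opt/crawlzilla/nutch/bin/nutch"


def build_merge_command(output_index, daily_index_dirs, nutch_bin=NUTCH_BIN):
    """Return the argv list for merging several index directories into one.

    ASSUMPTION: `nutch merge` takes the output index first, followed by the
    input index directories. Verify by running `./nutch merge` with no
    parameters before relying on this.
    """
    dirs = [str(d) for d in daily_index_dirs]
    if not dirs:
        raise ValueError("need at least one index directory to merge")
    return [nutch_bin, "merge", str(output_index), *dirs]


if __name__ == "__main__":
    # Hypothetical directory names, one index per day.
    cmd = build_merge_command("index-merged", ["index-20131013", "index-20131014"])
    # Print only; use subprocess.run(cmd, check=True) to actually execute it.
    print(shlex.join(cmd))
```

Since daily crawls of the same site will probably re-index many of the same URLs, it may also be worth trying the "dedup" command from the list above on the result.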

Also, down the road you will still run into the problem of a single index file growing too large.

- Jazz

Jim T. Tang

Oct 15, 2013, 11:54:20 AM

to crawlzi...@googlegroups.com

Thank you very much for the valuable suggestions! I will look into these directions.

Many thanks!

On Monday, October 14, 2013 11:17:22 AM UTC+8, Jim T. Tang wrote:
