How to implement merging indexes


Jim T. Tang

Oct 13, 2013, 11:17:22 PM
to crawlzi...@googlegroups.com
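Dear teacher,

Recently I have been wanting to implement incremental MapReduce on Hadoop so that computed results keep accumulating. I read several papers, such as Incoop and IncMR, but found that our needs are not that complex: we only need to merge each day's index. That is how we found the powerful Crawlzilla.

However, Lucene index files are all binary. From reading through the forum, it seems you usually suggest using Luke to convert the index to SQLite. Since I am very unfamiliar with databases, I would like to ask: as far as you know, if I want to build an incremental index of my own website every day, should I be looking into the Luke + SQLite direction? Or are there other approaches I could look at?

Thank you for your answer!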

Jim

Jazz Yao-Tsung Wang

Oct 14, 2013, 12:41:37 PM
to crawlzi...@googlegroups.com
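As far as I understand, if you want to do incremental updates of an index,
you can look at Zoie, written by LinkedIn:
http://javasoze.github.io/zoie/

Or evaluate SolrCloud:
http://wiki.apache.org/solr/SolrCloud

Nutch itself does appear to implement a merge:
http://wiki.apache.org/nutch/bin/nutch%20mergedb
You may need to experiment to see how it works in practice, though.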

/opt/crawlzilla/nutch/bin# ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  solrdedup         remove duplicates from solr
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
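
To make the idea behind the `merge` and `dedup` entries in the listing above more concrete, here is a minimal Python sketch of combining an older daily index with a newer one while dropping stale entries for re-crawled pages. It is a toy in-memory model, not Nutch or Lucene code: the functions `build_index` and `merge_indexes`, the dict-of-sets layout, and the example URLs are all hypothetical and chosen only for illustration.

```python
def build_index(docs):
    """Build a toy inverted index: term -> set of URLs containing that term."""
    index = {}
    for url, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(url)
    return index


def merge_indexes(older, newer, newer_urls):
    """Merge two toy indexes; URLs in `newer_urls` are taken only from `newer`."""
    merged = {}
    for term, urls in older.items():
        kept = urls - newer_urls  # drop stale postings for re-indexed URLs
        if kept:
            merged[term] = set(kept)
    for term, urls in newer.items():
        merged.setdefault(term, set()).update(urls)
    return merged


# Hypothetical crawl results for two consecutive days.
day1 = {
    "http://example.com/a": "hadoop index old",
    "http://example.com/b": "lucene merge",
}
day2 = {"http://example.com/a": "hadoop incremental index"}

merged = merge_indexes(build_index(day1), build_index(day2), set(day2))
for term in sorted(merged):
    print(term, sorted(merged[term]))
```

Removing the older postings for every URL that appears in the newer batch is what keeps the merged index from returning a page under terms it no longer contains. A real index would also need to handle pages deleted from the site, which this sketch ignores.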


Also, you will still eventually run into the problem of a single index file growing too large.

- Jazz

Jim T. Tang

Oct 15, 2013, 11:54:20 AM
to crawlzi...@googlegroups.com
Thank you very much for your valuable suggestions! I will try looking into these options.
Much appreciated!


On Monday, October 14, 2013 11:17:22 AM UTC+8, Jim T. Tang wrote: