How to: Store BUbiNG crawler data in MG4J for full-text search

Samba Siva Rao Kolusu

2013/12/30 16:30:01
To: mg...@googlegroups.com
Hi all,

I am investigating various web crawlers and search engines, and found BUbiNG and MG4J interesting: BUbiNG is comparable to Nutch/Heritrix, and MG4J to Lucene/Sphinx.

However, I could not work out a few things from the LAW website, such as how to programmatically index the data crawled by BUbiNG (and parsed by my application) into MG4J. Could you please give an example snippet of code to help me understand how to accomplish this?

Thanks and Regards,
Samba


Sebastiano Vigna

2013/12/31 5:17:09
To: mg...@googlegroups.com
Good point. Actually, at this very moment you can't. The LAW software contains a WarcDocumentSequence class, based on the old WARC library, that made it straightforward to index a WARC file. The same class, with very minor modifications, will work for the WARC files written by BUbiNG. If you're interested, I can put together a patched version.

Ciao,

seba
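
For readers who want a feel for what a document sequence over a WARC file has to do before any indexing happens, here is a minimal stand-alone sketch of iterating over WARC records. It is not the WarcDocumentSequence class mentioned above and uses no LAW, BUbiNG, or MG4J API: the class name WarcRecordSketch and all of its methods are hypothetical, and it assumes an uncompressed file with well-formed headers (a compressed file would have to be decompressed first). Each record is a "WARC/..." version line, a block of "Name: value" fields, an empty line, and then exactly Content-Length bytes of content.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-alone reader for uncompressed WARC records (illustration only). */
public class WarcRecordSketch {

    /** One WARC record: its named header fields and its raw content block. */
    public static final class Record {
        public final Map<String, String> headers = new LinkedHashMap<>();
        public byte[] content = new byte[0];
    }

    /** Reads one line up to '\n', dropping '\r' characters; returns null at end of stream. */
    public static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return null;
        StringBuilder sb = new StringBuilder();
        while (b != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
            b = in.read();
        }
        return sb.toString();
    }

    /** Reads the next record, or returns null when the stream is exhausted. */
    public static Record readRecord(InputStream in) throws IOException {
        String line = readLine(in);
        // Skip the empty lines that separate consecutive records.
        while (line != null && line.isEmpty()) line = readLine(in);
        if (line == null) return null;
        if (!line.startsWith("WARC/")) throw new IOException("Not a WARC version line: " + line);

        Record record = new Record();
        // Header fields are "Name: value" lines, terminated by an empty line.
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                record.headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        // Simplification: field names are matched exactly, although WARC treats them case-insensitively.
        String lengthField = record.headers.get("Content-Length");
        if (lengthField == null) throw new IOException("Missing Content-Length");
        int length = Integer.parseInt(lengthField);

        // Read exactly `length` bytes of content, looping because read() may return fewer.
        byte[] content = new byte[length];
        int offset = 0;
        while (offset < length) {
            int n = in.read(content, offset, length - offset);
            if (n == -1) throw new EOFException("Truncated record");
            offset += n;
        }
        record.content = content;
        return record;
    }

    /** Collects every record in the stream. */
    public static List<Record> readAll(InputStream in) throws IOException {
        List<Record> records = new ArrayList<>();
        for (Record r; (r = readRecord(in)) != null; ) records.add(r);
        return records;
    }

    public static void main(String[] args) throws IOException {
        String payload = "<html><body>hello</body></html>";
        String warc = "WARC/1.0\r\n"
                + "WARC-Type: resource\r\n"
                + "WARC-Target-URI: http://example.com/\r\n"
                + "Content-Length: " + payload.length() + "\r\n"
                + "\r\n" + payload + "\r\n\r\n";
        byte[] bytes = warc.getBytes(StandardCharsets.ISO_8859_1);
        for (Record r : readAll(new ByteArrayInputStream(bytes))) {
            System.out.println(r.headers.get("WARC-Target-URI") + " -> " + r.content.length + " bytes");
        }
    }
}
```

A real document sequence would go further: for "response" records the content block still contains the HTTP status line and headers, which have to be stripped before the page body is handed to a parser and then to the indexer. That part depends on the library classes discussed in this thread and is deliberately left out here.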
