How to: Store BUbiNG crawler data in MG4J for full-text search

Samba Siva Rao Kolusu

Dec 30, 2013, 4:30:01 PM
to mg...@googlegroups.com
Hi all,

I am investigating various web crawlers and search engines, and found MG4J and BUbiNG interesting: BUbiNG is comparable to Nutch/Heritrix, and MG4J to Lucene/Sphinx.

However, I could not work out a few things from the LAW website, such as how to programmatically index the data crawled by BUbiNG (and parsed by my application) into MG4J. Could you please give an example code snippet to help me understand how to accomplish this?

Thanks and Regards,
Samba


Sebastiano Vigna

Dec 31, 2013, 5:17:09 AM
to mg...@googlegroups.com
Good point. Actually, at this very moment you can't. The LAW software contains a WarcDocumentSequence class, based on the old WARC library, that made it straightforward to index a WARC file. The same class, with very minor modifications, will work for a WARC file produced by BUbiNG. If you're interested, I can put together a patched version.

Ciao,

seba
