How to: Store BuBing crawler data in MG4J for full-text search
Samba Siva Rao Kolusu
Dec 30, 2013, 4:30:01 PM
to mg...@googlegroups.com
Hi all,
I am investigating various web crawlers and search engines, and found BuBing and MG4J interesting: BuBing is comparable to Nutch/Heritrix, and MG4J to Lucene/Sphinx.
However, I could not work out a few things from the LAW website, such as how to programmatically index the data crawled by BuBing (and parsed by the application) into MG4J. Could you please give an example code snippet to help me understand how to accomplish this?
Thanks and Regards, Samba
Sebastiano Vigna
Dec 31, 2013, 5:17:09 AM
to mg...@googlegroups.com
Good point. Actually, at this very moment you can't. The LAW software contains a WarcDocumentSequence class, based on the old warc library, that made it straightforward to index a WARC file. The same class, with very minor modifications, will work for a WARC file produced by BuBing. If you're interested, I can put together a patched version.
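
For readers who want a feel for what a WARC-based document sequence has to do before anything reaches the indexer, here is a minimal sketch that walks the records of an uncompressed WARC stream using only the Java standard library. It is not the LAW WarcDocumentSequence class and it calls no MG4J or BuBing API; the names WarcRecordSketch, Record and readAll are hypothetical, and the code assumes plain (non-gzipped) records with a Content-Length header and no folded header lines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not part of LAW, MG4J or BuBing.
class WarcRecordSketch {

    /** One parsed record: header fields (names lower-cased) plus the raw content block. */
    static final class Record {
        final Map<String, String> headers = new LinkedHashMap<>();
        byte[] block = new byte[0];
    }

    /** Reads one LF-terminated line, dropping a trailing CR; returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') return toLine(buf);
            buf.write(b);
        }
        return buf.size() == 0 ? null : toLine(buf);
    }

    private static String toLine(ByteArrayOutputStream buf) {
        String s = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }

    /** Reads exactly n bytes or fails if the stream ends early. */
    private static byte[] readBlock(InputStream in, int n) throws IOException {
        byte[] block = new byte[n];
        int off = 0;
        while (off < n) {
            int read = in.read(block, off, n - off);
            if (read < 0) throw new EOFException("Truncated WARC record");
            off += read;
        }
        return block;
    }

    /** Parses every record in the stream: version line, header fields, blank line, content block. */
    static List<Record> readAll(InputStream in) throws IOException {
        List<Record> records = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (line.isEmpty()) continue; // blank lines separating records
            if (!line.startsWith("WARC/")) throw new IOException("Expected WARC version line, got: " + line);
            Record r = new Record();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon < 0) throw new IOException("Malformed header line: " + line);
                r.headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                              line.substring(colon + 1).trim());
            }
            String length = r.headers.get("content-length");
            if (length == null) throw new IOException("Missing Content-Length");
            r.block = readBlock(in, Integer.parseInt(length));
            records.add(r);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory example; a real crawl would be read from a file instead.
        String sample = "WARC/1.0\r\nWARC-Type: response\r\n"
                + "WARC-Target-URI: http://example.com/\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n";
        for (Record r : readAll(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(r.headers.get("warc-target-uri") + " -> " + r.block.length + " bytes");
        }
    }
}
```

In an actual setup, each record of type response would be handed to the indexer as one document (URI, metadata, content), which is the job the document sequence class mentioned above performs.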