I know there have been at least a few people looking for something like this on the mailing list, so please check it out. It's a port of a Java project so I'd particularly like to hear how I can make it more Ruby like. Enjoy!
Dave Balmain
== Description
Ferret is a full port of the Java Lucene searching and indexing library. It's available as a gem so try it out! To get started quickly read the quick start at the project homepage;
  {
    :title => "Agile Web Development with Rails",
    :author => "Dave Thomas, David Heinemeier Hansson, Leon Breedt, Mike Clark, Thomas Fuchs, Andreas Schwarz",
    :tags => "Ruby, Rails, Web Development",
    :published => "2005-07-13",
    :content => "Yada yada yada ..."
  }

  puts "\nFind all documents on ruby:-"
  index.search_each("tags:Ruby") do |doc, score|
    puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
  end

  puts "\nFind all documents on ruby published this year:-"
  index.search_each("tags:ruby AND published: >= 2005") do |doc, score|
    puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
  end

  puts "\nFind all documents by the Pragmatic Programmers:-"
  index.search_each('author:("dave Thomas" AND "Andy hunt")') do |doc, score|
    puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
  end
This is a tremendous gift to the Ruby on Rails crowd. Lucene was probably *the* library most missed in the Ruby world.
Cheers for such an amazing port.
On 10/20/05, David Balmain <dbalmain...@gmail.com> wrote:
> Hi Folks,
> I know there have been at least a few people looking for something like this
> on the mailing list, so please check it out. It's a port of a Java project
> so I'd particularly like to hear how I can make it more Ruby like. Enjoy!
Thanks for pointing that out. I missed it. I do think it's a great idea for a quiz though. It will be interesting to see what people come up with in say 50 lines as opposed to 10,000. And I'll be able to see who to hit up for some help. ;-)
> The link to the tutorial embedded on your intro page is incorrect
> (the one in the TOC at the top right works though)
Bob, thanks for that, the link is fixed now. Unfortunately my packaging system is a bit broken. I think a few files got left out so please wait for version 0.1.1. It'll be ready in a couple of hours.
How does this compare to (Estraier/Hyper Estraier/Ruby-Odeum/SimpleSearch/other 'IR' systems with Ruby bindings?) on (ease of learning/ease of use/ease of maintenance/speed/any other noteworthy attributes)? To put it simply, which one should I choose?* :)
(Well, speed's pretty well covered on the home page, though I'm not sure how much faster Hyper Estraier is than Lucene, and not sure how much slower SimpleSearch is than Ferret. I only ask because it's the thing people in the position of questioneer are /supposed/ to do.)
Feel free to answer whatever part of that you (want/know), or just tell me to fork off... a thread.
*For those actually interested in answering that question, it'll be an intranet app that won't likely get a major amount of hits, but will likely have a major amount of data. Right now, I'm just looking to make a rough prototype in a week, but wouldn't mind picking a contender, if quickly-pickuppable.
At the moment I'm using JRuby to run Lucene searches, but this definitely sounds great! Do you have any plans to include Snowball stemmers? My documents are not in English, so I could use them ;)
I just recalled Ruby's stemmer4r, which wraps the Snowball stemmers. I can't wait to use this ;)
Hi Norjee,

I've looked at the Snowball parser and I don't think it would be too hard to do a pure Ruby version of it if enough people are interested. But that is pretty low on my to-do list, so I hope stemmer4r will do for now. I also hope that you won't be needing Unicode support, as that is one of the things missing in Ferret. Speaking of which, does anyone know of any good Ruby Unicode tutorials?
On 10/21/05, Devin Mullins <twif...@comcast.net> wrote:
> Question for those (soon to be) in the know:
> How does this compare to (Estraier/Hyper
> Estraier/Ruby-Odeum/SimpleSearch/other 'IR' systems with Ruby
> bindings?) on (ease of learning/ease of use/ease of
> maintenance/speed/any other noteworthy attributes)? To put it simply,
> which one should I choose?* :)
Hi Devin,

I'm afraid I've only briefly looked at those other IR systems, but I'll try to answer your question as best I can. I think Ferret is currently pretty easy to learn and use through the Index interface, as described in my original post, so ease of use shouldn't turn you off. Once I've done a bit more work on the documentation, I think it'll be a lot easier to find your way around than some of the other ones. But it'll be significantly slower than the search engines backed by C libraries. I'm certainly not the type of person to say speed isn't important; however, I think Ferret should easily handle the kind of website you are talking about.

Ferret should be a lot faster than SimpleSearch for large document sets. Having said that, there is a Ruby quiz coming up for which I intend to write a quick and simple search engine that will easily outperform SimpleSearch, so if people are interested, I might make that a project too.
== As for the others, the main advantages of Ferret are;
* a more powerful extendable query language. You can do boolean, phrase, range, fuzzy (for misspellings etc), wildcard, sloppy phrase (out of order phrases) and more. Check out the Query Parser in the API for more info on the query language. http://ferret.davebalmain.com/api/classes/Ferret/QueryParser.html
* a more powerful document structure. I could be wrong about this, so someone please correct me if I am, but I think most of the other IR systems just take a string as a document. Ferret's documents can have multiple fields. Each field can have a different analyzer (which parses the field into tokens). You can store binary fields like images, or compress your data. In fact, you could do away with a database altogether and just use Ferret. (You can also store term vectors if you want to compare document similarities, but that's getting pretty technical.)
* Ferret is pure ruby (at least it can be if you don't install the C extension) so it'll run anywhere Ruby does.
* If you are patient, Ferret will one day match or beat the speed of those other search engines. Hopefully by Christmas but it all depends how much help I can get between now and then.
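To make the "multiple fields, each with its own analyzer" point a bit more concrete, here is a minimal plain-Ruby sketch. It is not Ferret's API or implementation: the TinyIndex class, its methods, and the sample documents are all hypothetical and exist only to illustrate the idea of per-field analysis feeding a per-field term lookup.

```ruby
# Illustrative only: TinyIndex is a hypothetical toy, not Ferret's API.
class TinyIndex
  # analyzers maps a field name to a lambda that turns a string into tokens.
  # Fields without an entry fall back to a lowercase word splitter.
  def initialize(analyzers, default = ->(text) { text.downcase.scan(/\w+/) })
    @analyzers = analyzers
    @default = default
    @docs = []
    # postings[field][term] => ids of the documents containing that term
    @postings = Hash.new { |h, field| h[field] = Hash.new { |t, term| t[term] = [] } }
  end

  # Add a document (a hash of field => string); returns self so calls can chain.
  def <<(doc)
    id = @docs.size
    @docs << doc
    doc.each do |field, text|
      analyzer = @analyzers.fetch(field, @default)
      analyzer.call(text).uniq.each { |term| @postings[field][term] << id }
    end
    self
  end

  # Fetch a stored document by id.
  def [](id)
    @docs[id]
  end

  # Ids of documents whose field contains every given term (boolean AND).
  def search(field, *terms)
    terms.map { |term| @postings[field].fetch(term, []) }.reduce(:&) || []
  end
end

# The :tags analyzer splits on commas, so "web development" stays one term.
index = TinyIndex.new(:tags => ->(text) { text.split(/,\s*/).map(&:downcase) })

# Sample data for illustration only.
index << { :title => "Agile Web Development with Rails", :tags => "Ruby, Rails, Web Development" }
index << { :title => "Programming Ruby", :tags => "Ruby" }

index.search(:tags, "ruby").each { |id| puts index[id][:title] }
```

The same text would be tokenized differently depending on which field it sits in, which is the property a single-string document model can't offer.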
== And the main disadvantages;
* Ferret is still alpha and has not been put into production yet. Hopefully that will change soon.
* Ferret is currently slower than the C-backed IR systems.
Anyway, sorry for such a long email. It's really hard to describe all the features available. In fact, there is a whole book on Lucene by Erik Hatcher and Otis Gospodnetic which I highly recommend if you want to take full advantage of all the features in Ferret. Most of the examples should translate pretty easily into Ruby.
Please let me know if you have any more questions. Regards, Dave
On 10/22/05, Hal Fulton <hal9...@hypermetrics.com> wrote:
> I'm probably wrong, but I thought we already had > a port of this? A thing called Rucene?
Yes and also one called rubylucene. Unfortunately Erik Hatcher never had the time to get those projects off the ground. Hopefully he'll have time to help me out now that the port is finished though. ;)
Have you looked into what it would take to allow all text to be UTF-8? Would it take a complete overhaul or a somewhat-minor tweak? If a tweak, what would you charge to make that lovely update? :-)
On 10/22/05, Miles Keaton <mileskea...@gmail.com> wrote:
> Have you looked into what it would take to allow all text to be UTF-8?
> Would it take a complete overhaul or a somewhat-minor tweak?
> If a tweak, what would you charge to make that lovely update? :-)
I'm looking into it now. I'll send you the bill. ;-)
On 10/22/05, David Balmain <dbalmain...@gmail.com> wrote:
> On 10/22/05, Miles Keaton <mileskea...@gmail.com> wrote:
> > Have you looked into what it would take to allow all text to be UTF-8?
> > Would it take a complete overhaul or a somewhat-minor tweak?
> > If a tweak, what would you charge to make that lovely update? :-)
Hi Miles,

Currently the query parser struggles with UTF-8, but apart from that you should be able to use UTF-8. You'll need to write your own analyzer for whatever language you are using, and implement your own sort procedure if you want to sort results by strings. Here is an example I tried. The strings are Chinese, so apologies if your browser can't display them. (I have no idea what they mean.)
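  require 'rubygems'
  require 'ferret'
  include Ferret

  class ChineseAnalyzer
    def token_stream(field, string)
      tokenizer = Analysis::RegExpTokenizer.new(string)
      class <<tokenizer
        def token_re() /./ end
      end
      return tokenizer
    end
  end

  docs = ["道德經", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]
  index = Index::Index.new(:analyzer => ChineseAnalyzer.new)
  docs.each { |doc| index << doc }
  puts index[3][""]
  tq = Search::TermQuery.new(Index::Term.new("", "网"))
  index.search_each(tq) do |doc, score|
    puts "Document #{doc} found with score #{score}"
  end
  index.close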
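For anyone who wants to see what a one-character token pattern like /./ does without installing anything, here is a small plain-Ruby sketch. It does not use Ferret at all: char_tokens and the postings hash are hypothetical names for illustration, and it assumes a Ruby where string literals are UTF-8 by default (2.0 or later).

```ruby
# Plain Ruby, no Ferret: shows what a /./ token pattern yields on the sample strings.
docs = ["道德經", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]

# One token per character, mirroring the /./ pattern in the analyzer.
def char_tokens(string)
  string.scan(/./)
end

# Map each character to the ids of the documents that contain it.
postings = Hash.new { |hash, char| hash[char] = [] }
docs.each_with_index do |doc, id|
  char_tokens(doc).uniq.each { |char| postings[char] << id }
end

# Look up a single-character term, as the TermQuery for "网" does.
matches = postings.fetch("网", [])
matches.each { |id| puts "Document #{id} contains 网" }
```

Per-character tokens are a blunt instrument since they ignore word boundaries, but they are enough for single-character term lookups like the one in the example.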
> require 'rubygems'
> require 'ferret'
> include Ferret
>
> class ChineseAnalyzer
>   def token_stream(field, string)
>     tokenizer = Analysis::RegExpTokenizer.new(string)
>     class <<tokenizer
>       def token_re() /./ end
>     end
>     return tokenizer
>   end
> end
>
> docs = ["道德經", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]
> index = Index::Index.new(:analyzer => ChineseAnalyzer.new)
> docs.each { |doc| index << doc }
> puts index[3][""]
> tq = Search::TermQuery.new(Index::Term.new("", "网"))
> index.search_each(tq) do |doc, score|
>   puts "Document #{doc} found with score #{score}"
> end
> index.close
On 10/23/05, Miles Keaton <mileskea...@gmail.com> wrote:
> one correction:
> RegExpTokenizer should be RETokenizer
> (at least in the version you've released publicly)
Doh, I meant to change that for the email. Anyway, in case you're interested, I also fixed the problem with the query parser so that it should also parse UTF-8 queries. That'll be out in the next release.
I have to say that the combination of Ruby, Eclipse, and Ferret have just blown me away today. I've gone from knowing that Ruby existed (for several years in fact), to thinking that I should check it out over the last couple of months, to reading some of the doc yesterday, to having a complete running installation of Ruby (through the One Click Windows installer), with Eclipse layered on top, and Ferret installed and operational - at least for this test example - in the space of 4 hours!
Really looking forward to the UTF-8 support too - some Google searching on Ruby and UTF-8 led me to Ferret in the first place. Looks like an absolutely great platform to do experiments in information retrieval with - which is just what I need.
Welcome aboard. With all the people who are coming to ruby because of rails, it's great to hear my project has helped to convert someone. I hope you enjoy it here.
Cheers, Dave
On 10/28/05, peter.r.bai...@gmail.com <peter.r.bai...@gmail.com> wrote:
> I have to say that the combination of Ruby, Eclipse, and Ferret have
> just blown me away today. I've gone from knowing that Ruby existed (for
> several years in fact), to thinking that I should check it out over the
> last couple of months, to reading some of the doc yesterday, to having
> a complete running installation of Ruby (through the One Click Windows
> installer), with Eclipse layered on top, and Ferret installed and
> operational - at least for this test example - in the space of 4 hours!
>
> Really looking forward to the UTF-8 support too - some Google searching
> on Ruby and UTF-8 led me to Ferret in the first place. Looks like an
> absolutely great platform to do experiments in information retrieval
> with - which is just what I need.