[ANN] Ferret 0.1.0 (Port of Java Lucene) released

David Balmain

Oct 20, 2005, 10:36:15 PM

Hi Folks,

I know there have been at least a few people looking for something like this
on the mailing list, so please check it out. It's a port of a Java project
so I'd particularly like to hear how I can make it more Ruby like. Enjoy!

Dave Balmain

== Description

Ferret is a full port of the Java Lucene searching and indexing library.
It's available as a gem, so try it out! To get started quickly, read the
quick start at the project homepage:

http://ferret.davebalmain.com/trac/

== Quick (Very Simple) Example

require 'ferret'

include Ferret

docs = [
  { :title => "The Pragmatic Programmer",
    :author => "Dave Thomas, Andy Hunt",
    :tags => "Programming, Broken Windows, Boiled Frogs",
    :published => "1999-10-13",
    :content => "Yada yada yada ..."
  },
  { :title => "Programming Ruby",
    :author => "Dave Thomas, Chad Fowler, Andy Hunt",
    :tags => "Ruby",
    :published => "2004-10-06",
    :content => "Yada yada yada ..."
  },
  { :title => "Agile Web Development with Rails",
    :author => "Dave Thomas, David Heinemeier Hansson, Leon Breedt, Mike Clark, Thomas Fuchs, Andreas Schwarz",
    :tags => "Ruby, Rails, Web Development",
    :published => "2005-07-13",
    :content => "Yada yada yada ..."
  },
  { :title => "Ruby, Developer's Guide",
    :author => "Robert Feldt, Lyle Johnson, Michael Neumann",
    :tags => "Ruby, Racc, GUI, FOX",
    :published => "2002-10-06",
    :content => "Yada yada yada ..."
  },
  { :title => "Lucene In Action",
    :author => "Otis Gospodnetic, Erik Hatcher",
    :tags => "Lucene, Java, Search, Indexing",
    :published => "2004-12-01",
    :content => "Yada yada yada ..."
  }
]

index = Index::Index.new()

docs.each {|doc| index << doc }

puts index.size

puts "\nFind all documents on ruby:-"
index.search_each("tags:Ruby") do |doc, score|
  puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
end

puts "\nFind all documents on ruby published this year:-"
index.search_each("tags:ruby AND published: >= 2005") do |doc, score|
  puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
end

puts "\nFind all documents by the Pragmatic Programmers:-"
index.search_each('author:("dave Thomas" AND "Andy hunt")') do |doc, score|
  puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
end
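
As a rough illustration of what the last query asks for, here is a plain-Ruby sketch of case-insensitive phrase matching on a single field. It does not call Ferret; tokens and phrase? are hypothetical helpers, and it assumes (as the mixed-case queries above suggest) that terms are lowercased before they are compared.

```ruby
# Plain-Ruby illustration of case-insensitive phrase matching on one field.
# Not Ferret code: tokens and phrase? are hypothetical helpers.
def tokens(text)
  # Lowercase the text and split it into word tokens.
  text.downcase.scan(/\w+/)
end

# True when the phrase's tokens appear consecutively in the text's tokens.
def phrase?(text, phrase)
  wanted = tokens(phrase)
  tokens(text).each_cons(wanted.size).any? { |window| window == wanted }
end

authors = {
  "The Pragmatic Programmer" => "Dave Thomas, Andy Hunt",
  "Lucene In Action"         => "Otis Gospodnetic, Erik Hatcher"
}

# Both phrases must match the author field, mirroring the AND in the query.
authors.each do |title, author|
  if phrase?(author, "dave Thomas") && phrase?(author, "Andy hunt")
    puts "Document <#{title}> matches"
  end
end
```

Word order matters for a phrase, so phrase?("Dave Thomas, Andy Hunt", "Thomas Dave") is false even though both words are present.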

Ryan Leavengood

Oct 21, 2005, 12:46:24 AM

On 10/20/05, David Balmain <dbalm...@gmail.com> wrote:
>
> Ferret is a full port of the Java Lucene searching and indexing library.

Uh oh, there goes another Ruby Quiz idea[1] torpedoed by a diligent
library author ;)

I mean that with the utmost respect as this looks cool. Love the name too.

Ryan

P.S. Eh, we can do the Ruby Quiz either way :D

1. http://ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/161261

Tobias Luetke

Oct 21, 2005, 1:13:35 AM

Amazing,

This is a tremendous gift to the Ruby on Rails crowd. Lucene was
probably *the* library which was most missed in the Ruby world.

Cheers for such an amazing port.

On 10/20/05, David Balmain <dbalm...@gmail.com> wrote:

> Hi Folks,
>
> I know there have been at least a few people looking for something like this
> on the mailing list, so please check it out. It's a port of a Java project
> so I'd particularly like to hear how I can make it more Ruby like. Enjoy!
>
> Dave Balmain
>

--
Tobi
http://jadedpixel.com - modern e-commerce software
http://typo.leetsoft.com - Open source weblog engine
http://blog.leetsoft.com - Technical weblog

David Balmain

Oct 21, 2005, 1:21:08 AM

On 10/21/05, Ryan Leavengood <leave...@gmail.com> wrote:

> P.S. Eh, we can do the Ruby Quiz either way :D
>
> 1. http://ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/161261

Thanks for pointing that out. I missed it. I do think it's a great idea for
a quiz though. It will be interesting to see what people come up with in say
50 lines as opposed to 10,000. And I'll be able to see who to hit up for
some help. ;-)

Dave

Sean O'Halpin

Oct 21, 2005, 3:22:15 AM

On 10/21/05, David Balmain <dbalm...@gmail.com> wrote:
> Ferret is a full port of the Java Lucene searching and indexing library.

Superb! Thanks!

Sean

Bob Hutchison

Oct 21, 2005, 7:19:17 AM

This is great news! I'm installing it now, and will have a go at it
whenever the gem shows up :-)

The link to the tutorial embedded on your intro page is incorrect
(the one in the TOC at the top right works though)

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>

George Moschovitis

Oct 21, 2005, 7:58:58 AM

Can't wait to try this!

thanks,
George.

On 10/21/05, David Balmain <dbalm...@gmail.com> wrote:

--
http://www.gmosx.com
http://www.navel.gr
http://www.nitrohq.com

David Balmain

Oct 21, 2005, 8:26:26 AM

> The link to the tutorial embedded on your intro page is incorrect
> (the one in the TOC at the top right works though)

Bob, thanks for that, the link is fixed now. Unfortunately my
packaging system is a bit broken. I think a few files got left out so
please wait for version 0.1.1. It'll be ready in a couple of hours.

Regards,
Dave

James Edward Gray II

Oct 21, 2005, 8:30:50 AM

On Oct 20, 2005, at 11:46 PM, Ryan Leavengood wrote:

> P.S. Eh, we can do the Ruby Quiz either way :D
>
> 1. http://ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/161261

I still think the quiz will be fun. Our goals are humble compared to
a library like this.

James Edward Gray II

Devin Mullins

Oct 21, 2005, 8:51:36 AM

Question for those (soon to be) in the know:

How does this compare to (Estraier/Hyper
Estraier/Ruby-Odeum/SimpleSearch/other 'IR' systems with Ruby
bindings?) on (ease of learning/ease of use/ease of
maintenance/speed/any other noteworthy attributes)? To put it simply,
which one should I choose?* :)

(Well, speed's pretty well covered on the home page, though I'm not sure
how much faster Hyper Estraier is than Lucene, and not sure how much
slower SimpleSearch is than Ferret. I only ask because it's the thing
people in the position of questioneer are /supposed/ to do.)

Feel free to answer whatever part of that you (want/know), or just tell
me to fork off... a thread.

*For those actually interested in answering that question, it'll be an
intranet app that won't likely get a major amount of hits, but will
likely have a major amount of data. Right now, I'm just looking to make
a rough prototype in a week, but wouldn't mind picking a contender, if
quickly-pickuppable.

(Devin/twifkak)

Norjee

Oct 21, 2005, 9:24:14 AM

Atm i'm using jruby to use lucene search, but this definitely sounds
great!!! Do you have any plans to include snowball stemmers? As my
documents are not english, i could use them ;)

I just recall ruby's stemmer4r, which wraps the snowball stemmers. I
can't wait to use this ;)

Hal Fulton

Oct 21, 2005, 3:30:34 PM

I'm probably wrong, but I thought we already had
a port of this? A thing called Rucene?

Hal
(goes off to Google)

David Balmain

Oct 21, 2005, 3:30:50 PM

On 10/21/05, Norjee <Nor...@gmail.com> wrote:
>
> Atm i'm using jruby to use lucene search, but this definitely sounds
> great!!! Do you have any plans to include snowball stemmers? As my
> documents are not english, i could use them ;)
>

Hi Norjee,
I've looked at the snowball parser and I don't think it would be too hard to
do a pure Ruby version of it if enough people are interested. But that is
pretty low on my to-do list, so I hope stemmer4r will do for now. I also hope
that you won't be needing Unicode support, as that is one of the things that
is missing in Ferret. Speaking of which, does anyone know of any good Ruby
Unicode tutorials?

Dave

David Balmain

Oct 21, 2005, 4:05:34 PM

On 10/21/05, Devin Mullins <twi...@comcast.net> wrote:
>
> Question for those (soon to be) in the know:
>
> How does this compare to (Estraier/Hyper
> Estraier/Ruby-Odeum/SimpleSearch/other 'IR' systems with Ruby
> bindings?) on (ease of learning/ease of use/ease of
> maintenance/speed/any other noteworthy attributes)? To put it simply,
> which one should I choose?* :)

Hi Devin,
I'm afraid I've only briefly looked at those other IR systems but I'll try
and answer your question as best I can. I think Ferret is currently pretty
easy to learn and use through the Index interface as described in my
original post. I don't think ease of use should turn you off. Once I've done
a bit more work on the documentation, I think it'll be a lot easier to find
your way around than some of the other ones. But it'll be significantly
slower than the C library backed search engines. I'm certainly not the type
of person to say speed isn't important; however, I think Ferret should
easily handle the kind of website you are talking about.

Ferret should be a lot faster than SimpleSearch for large document sets.
Having said that, there is a Ruby Quiz coming up for which I intend to write
a quick and simple search engine that will easily outperform SimpleSearch,
so if people are interested, I might make that a project too.

== As for the others, the main advantages of Ferret are:

* a more powerful extendable query language. You can do boolean, phrase,
range, fuzzy (for misspellings etc), wildcard, sloppy phrase (out of order
phrases) and more. Check out the Query Parser in the API for more info on
the query language.
http://ferret.davebalmain.com/api/classes/Ferret/QueryParser.html

* a more powerful document structure. I could be wrong about this so someone
please correct me if I am, but I think most of the other IR systems just take
a string as a document. Ferret's documents can have multiple fields. Each
field can have a different analyzer (which parses the field into tokens). You
can store binary fields like images or compress your data. In fact, you could
do away with a database altogether and just use Ferret. (You can also store
term vectors if you want to compare document similarities, but that's getting
pretty technical.)

* Ferret is pure ruby (at least it can be if you don't install the C
extension) so it'll run anywhere Ruby does.

* If you are patient, Ferret will one day match or beat the speed of those
other search engines. Hopefully by Christmas but it all depends how much
help I can get between now and then.
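
To make the multi-field point concrete, here is a toy sketch in plain Ruby of an index where each field can have its own analyzer. It is not Ferret's API or implementation; ToyIndex, add and search are names made up for the illustration, and there is no scoring or query parsing.

```ruby
# Toy illustration only: this is NOT Ferret's API or implementation.
# ToyIndex, add, and search are hypothetical names invented for this sketch.
class ToyIndex
  # analyzers maps a field name to a callable that turns a string into tokens.
  def initialize(analyzers = {})
    @analyzers = analyzers
    # Default analyzer: lowercase the text and split it into words.
    @default = ->(text) { text.downcase.scan(/\w+/) }
    @docs = []
    # postings[field][token] => array of document ids containing that token
    @postings = Hash.new { |h, field| h[field] = Hash.new { |t, tok| t[tok] = [] } }
  end

  # Store the document and index every field with that field's analyzer.
  def add(doc)
    id = @docs.size
    @docs << doc
    doc.each do |field, text|
      analyzer = @analyzers.fetch(field, @default)
      analyzer.call(text).uniq.each { |token| @postings[field][token] << id }
    end
    id
  end

  # Return the ids of documents whose given field contains the token.
  def search(field, token)
    return [] unless @postings.key?(field)
    @postings[field].fetch(token, [])
  end

  # Look up a stored document by id.
  def [](id)
    @docs[id]
  end
end

# Tags are split on commas and kept as whole phrases; every other field
# falls back to the default word analyzer.
index = ToyIndex.new(:tags => ->(text) { text.split(",").map { |t| t.strip.downcase } })
index.add(:title => "Programming Ruby", :tags => "Ruby")
index.add(:title => "Lucene In Action", :tags => "Lucene, Java, Search, Indexing")
index.add(:title => "Agile Web Development with Rails", :tags => "Ruby, Rails, Web Development")

index.search(:tags, "ruby").each { |id| puts index[id][:title] }
```

Because the tags analyzer keeps "web development" as one token while the title analyzer splits on words, the same text can be searched differently depending on the field it lives in.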

== And the main disadvantages:

* Ferret is still alpha and has not been put into production yet. Hopefully
that will change soon.

* Ferret is currently slower than the C-backed IR systems.

Anyway, sorry for such a long email. It's really hard to describe all the
features available. In fact, there is a whole book on Lucene by Erik Hatcher
and Otis Gospodnetic which I highly recommend if you want to take full
advantage of all the features in Ferret. Most of the examples should
translate pretty easily into Ruby.

Please let me know if you have any more questions.
Regards,
Dave

David Balmain

Oct 21, 2005, 4:09:22 PM

On 10/22/05, Hal Fulton <hal...@hypermetrics.com> wrote:

>
> I'm probably wrong, but I thought we already had
> a port of this? A thing called Rucene?

Yes and also one called rubylucene. Unfortunately Erik Hatcher never had the
time to get those projects off the ground. Hopefully he'll have time to help
me out now that the port is finished though. ;)

Miles Keaton

Oct 21, 2005, 5:14:05 PM

On 10/20/05, David Balmain <dbalm...@gmail.com> wrote:

> Ferret is a full port of the Java Lucene searching and indexing library.

> http://ferret.davebalmain.com/trac/

David -

Have you looked into what it would take to allow all text to be UTF-8?
Would it take a complete overhaul or a somewhat-minor tweak?
If a tweak, what would you charge to make that lovely update? :-)

David Balmain

Oct 21, 2005, 11:27:45 PM

On 10/22/05, Miles Keaton <miles...@gmail.com> wrote:

> Have you looked into what it would take to allow all text to be UTF-8?
> Would it take a complete overhaul or a somewhat-minor tweak?
> If a tweak, what would you charge to make that lovely update? :-)

I'm looking into it now. I'll send you the bill. ;-)

David Balmain

Oct 22, 2005, 3:56:32 AM

On 10/22/05, David Balmain <dbalm...@gmail.com> wrote:
>
> On 10/22/05, Miles Keaton <miles...@gmail.com> wrote:
>
> > Have you looked into what it would take to allow all text to be UTF-8?
> > Would it take a complete overhaul or a somewhat-minor tweak?
> > If a tweak, what would you charge to make that lovely update? :-)

Hi Miles,
Currently the query parser struggles with UTF-8, but apart from that you
should be able to use UTF-8. You'll need to write your own analyzer for
whatever language you are using, and implement your own sort procedure if
you want to sort results by strings. Here is an example I tried. The strings
are Chinese, so apologies if your browser can't display them. (I have no
idea what they mean.)

require 'rubygems'
require 'ferret'
include Ferret

class ChineseAnalyzer
  def token_stream(field, string)
    tokenizer = Analysis::RegExpTokenizer.new(string)
    class <<tokenizer
      def token_re() /./ end
    end
    return tokenizer
  end
end

docs = ["道德經", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]

index = Index::Index.new(:analyzer => ChineseAnalyzer.new)

docs.each { |doc| index << doc }

puts index[3][""]

tq = Search::TermQuery.new(Index::Term.new("", "网"))

index.search_each(tq) do |doc, score|
  puts "Document #{doc} found with score #{score}"
end

index.close
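
The analyzer above treats every character as its own token. Seen in isolation, the idea might look like the following plain-Ruby sketch. It makes no Ferret calls, char_tokens is a hypothetical helper, and it assumes a Ruby with encoding-aware strings (1.9 or later), where /./ matches a whole character and not a single byte.

```ruby
# encoding: utf-8
# Plain-Ruby sketch of a one-character-per-token analyzer. No Ferret calls;
# char_tokens is a hypothetical helper. Assumes Ruby 1.9+ with UTF-8 strings.
def char_tokens(text)
  # /./ matches one character (not one byte) on a UTF-8 string.
  text.scan(/./)
end

docs = ["道德經", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]

# Report which documents contain the single-character term.
term = "网"
docs.each_with_index do |doc, id|
  puts "Document #{id} matches" if char_tokens(doc).include?(term)
end
```

Single-character tokens are crude, since they ignore word boundaries, but they let a term query find any document that contains the character.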

Miles Keaton

Oct 22, 2005, 6:25:55 PM

one correction:
RegExpTokenizer should be RETokenizer
(at least in the version you've released publicly)

David Balmain

Oct 22, 2005, 11:29:49 PM

On 10/23/05, Miles Keaton <miles...@gmail.com> wrote:
>
> one correction:
> RegExpTokenizer should be RETokenizer
> (at least in the version you've released publicly)

Doh, I meant to change that for the email. Anyway, in case you're
interested, I also fixed the problem with the query parser so that it should
also parse UTF-8 queries. That'll be out in the next release.

Dave

peter.r...@gmail.com

Oct 28, 2005, 2:51:03 AM

I have to say that the combination of Ruby, Eclipse, and Ferret has
just blown me away today. I've gone from knowing that Ruby existed (for
several years in fact), to thinking that I should check it out over the
last couple of months, to reading some of the docs yesterday, to having
a complete running installation of Ruby (through the One Click Windows
installer), with Eclipse layered on top, and Ferret installed and
operational - at least for this test example - in the space of 4 hours!

Really looking forward to the UTF-8 support too - some Google searching
on Ruby and UTF-8 led me to Ferret in the first place. Looks like an
absolutely great platform to do experiments in information retrieval
with - which is just what I need.

Well done and many thanks!

Peter

David Balmain

Oct 28, 2005, 3:46:48 AM

Hi Peter,

Welcome aboard. With all the people who are coming to Ruby because of Rails,
it's great to hear my project has helped to convert someone. I hope you
enjoy it here.

Cheers,
Dave