I see two choices: pony up for the O'Reilly mini-PDF and
tweak Ferret settings, or scrap Ferret and go with Sphinx (and hope it
handles cases like this better). I'm not sure how much time the
latter would take me but, assuming that I'm going to spend somewhere
around 40 hours anyway, which route would you all recommend?
Thanks for your time,
Vince
We've used ferret on past projects... and now use sphinx. We're not
likely going back to ferret. ;-)
Robby
--
Robby Russell
Can you elaborate on why? I'm mostly just curious :)
To the parent...
the Ferret PDF booklet is pretty full of good information
if you stick with Ferret. I don't, however, remember if it discusses how to
handle words with apostrophes in them. It does talk about how to handle
plurals via the StemFilter though.
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StemFilter.html
-philip
Ferret is unstable in production. Segfaults, corrupted indexes
galore. We've switched around 40 clients from Ferret to Sphinx and
solved their problems this way. I will never use Ferret again after
all the problems I have seen it cause in people's production apps.
Plus Sphinx can reindex many, many times faster than Ferret and uses
less CPU and memory as well.
Cheers-
- Ezra Zygmuntowicz
-- Founder & Software Architect
-- ez...@engineyard.com
-- EngineYard.com
A decent search option is Lucene via acts_as_solr plugin.
I have never used Sphinx though. Can anyone with firsthand experience of
both Lucene and Sphinx give their opinion?
--
Alexey Verkhovsky
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]
RubyWorks [http://rubyworks.thoughtworks.com]
Huh. I must be lucky. Or not have that much to index (true) or users
don't complain about not finding anything (probably very true)
:-)
I'll have to give Sphinx a go next time around... thanks Ezra
>
>> Ferret is unstable in production
> Very true.
>
> A decent search option is Lucene via acts_as_solr plugin.
> I have never used Sphinx though. Can anyone with firsthand experience of
> both Lucene and Sphinx give their opinion?
>
> --
> Alexey Verkhovsky
We have a bunch of clients using Solr as well. In general it is more
powerful than Sphinx but a lot slower to reindex and query. Also, it
uses 50 times the memory of Sphinx. If you have a box or VM to put
Solr on by itself then it is a good option as well, but if Sphinx can
do everything you need from a search indexer then it is a way better
option cost-wise.
Just out of interest, were corrupted indexes seen even with only one
process writing to the index (via DRb as is recommended)? Multiple
writers are unsupported and cause these kinds of problems.
Segfaults were quite common in older versions too, but it's settled down
now and I've had it rather stable in a few small production sites
(though I'm not talking Twitter-like load :).
John.
--
http://www.brightbox.co.uk - UK Ruby on Rails hosting
Yes we have tried every way possible of running ferret, by itself,
DRb server etc. I really like Ferret's interface and integration with
rails but unfortunately it causes nothing but problems for so many
people that I cannot recommend it with a straight face. Not meaning to
bash on the ferret devs here at all, just stating what I've seen
across hundreds of deployments.
Hi Vince,
They're different tools really. I've found the flexibility of Ferret to
be really quite awesome. I can (in Ruby):
* set boost values independently per field and per record
* write custom text tokenizers, stemmers and stop lists (and use
different ones per field even)
* highlight matches in results using the same engine that does the
searching
* manage my own indexes, merging them at will, or just merging results
from them.
* Index content generated on the fly, without having to store it in my
sql database (pull in all the associated tags for a post as you index it
for example).
* Store original data in the index (though most people use it to index
an SQL database anyway).
* other awesome stuff I can't remember right now.
Looking at the documentation for Sphinx (and its usual usage, with
MySQL), many (if not all) of those features are missing. But Sphinx is
reportedly quicker, supports distributed searching, and appears to be
undergoing more development than Ferret is at the moment, so I think it
depends on your needs.
I'd recommend you ask on the Ferret mailing list about your search
result issues though - I'm surprised you're having problems with that.
I'm sure it can be solved.
I don't have firsthand experience with Sphinx, but I can confirm
that, given a decent hardware setup, Solr (with acts_as_solr) is really good
(not only in terms of performance but also of flexibility and
functionality). We used it for miojob.it and it powers almost every
aspect of that site, which is built around faceted browsing of job
postings and has only a few spots where caching was appropriate.
It doesn't break a sweat under traffic in the multi-hundred-K hits
per day range (I don't have the real numbers).
Anyhow, given the lower system requirements, I'd like to give
Sphinx a try to see what it can do!
cheers,
Luca Mearelli
Anyways, can anyone recommend a sphinx plugin for Rails?
There are three so far that I found: acts_as_sphinx, ultrasphinx, and
sphincter. Are they all actively updated?
Thanks,
Ray
How difficult would it be to change over to Sphinx?
> How difficult would it be to change over to Sphinx?
The overall process? Not hard, with the caveat Adrian mentioned (ie:
advanced Ferret features).
But keep in mind Sphinx does not allow updating fields of index
records (Ferret does) - you have to re-index to get the latest changes
into Sphinx. There are ways around this, to some extent - delta
indexes, containing just the recent changes - but it doesn't seem to
be critical to everyone.
Essentially, though:
- Choose a sphinx plugin, and install it.
- Set up the configuration and indexes, either manually, or within
your models (depending on the plugin)
- Install sphinx
- Index your data
- Switch your ferret-specific search calls to use the sphinx plugin's
search calls.
- Start the sphinx daemon (searchd)
- Confirm everything works
Or something along those lines. I'm sure the EngineYard crew have a
better idea though.
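For anyone unfamiliar with the delta-index idea mentioned above, here is a toy sketch in plain Ruby. It is not how Sphinx or any of the plugins implement it: `ToyIndex`, `reindex!` and `record_change` are made-up names, and the "index" is just a hash of words to record ids. It only shows the principle, assuming a big index that is rebuilt in bulk plus a small one that catches changes in between.

```ruby
require 'set'

# Toy model of a main index plus a small delta index (hypothetical names).
class ToyIndex
  def initialize
    @main  = Hash.new { |h, k| h[k] = Set.new } # word => ids, rebuilt in bulk
    @delta = Hash.new { |h, k| h[k] = Set.new } # word => ids changed since the last rebuild
    @stale = Set.new                            # ids whose main entries are outdated
  end

  # Full rebuild from all records ({id => text}); empties the delta.
  def reindex!(records)
    @main.clear
    @delta.clear
    @stale.clear
    records.each { |id, text| add(@main, id, text) }
  end

  # Cheap update right after a single record is saved.
  def record_change(id, text)
    @delta.each_value { |ids| ids.delete(id) }
    @stale << id
    add(@delta, id, text)
  end

  # Main hits (minus outdated records) merged with delta hits.
  def search(word)
    w = word.downcase
    main_hits  = @main.key?(w) ? @main[w] - @stale : Set.new
    delta_hits = @delta.key?(w) ? @delta[w] : Set.new
    (main_hits | delta_hits).to_a.sort
  end

  private

  def add(index, id, text)
    text.downcase.scan(/\w+/).each { |w| index[w] << id }
  end
end

index = ToyIndex.new
index.reindex!({ 1 => "Ruby developer", 2 => "Java developer" })
index.record_change(2, "Rails developer")
index.search("java")      # => []
index.search("rails")     # => [2]
index.search("developer") # => [1, 2]
```

The periodic full rebuild is still needed in this model, otherwise the delta keeps growing and loses its speed advantage.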
How would you do the integration into Rails 2 ?
I tried the acts_as_tsearch plugin
http://code.google.com/p/acts-as-tsearch/
and the first line of the example works, but it really
does not seem to be ready for prime time to me and at
this moment ...
Thanks for any insights,
Peter Vandenabeele
(new to rails)
>
> Ericson Smith wrote:
>> If you consider using Postgresql, then tsearch2 is awesome. Its built
>> into the latest version of Postgresql.
>
> How would you do the integration into Rails 2 ?
>
> I tried the acts_as_tsearch plugin
>
> http://code.google.com/p/acts-as-tsearch/
>
> and the first line of the example works, but it really
> does not seem to be ready for prime time to me and at
> this moment ...
I haven't used the plugin, but interfacing with tsearch2 is easy
enough so you can write your own in a day: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
We did that back in early '06 and since talking with tsearch2 is
basically normal SQL, all you have to do is to write a custom finder
method.
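To give a rough idea of what such a custom finder boils down to, here is a minimal sketch in plain Ruby. It only builds the conditions array; `tsearch_conditions` and the `search_vector` column are hypothetical names I made up for illustration, and it assumes you already maintain a tsvector column. The `@@` operator and `to_tsquery` are the standard tsearch2 pieces.

```ruby
# Build an ActiveRecord-style conditions array for a tsearch2 match.
# `column` is a hypothetical tsvector column name, not something from a plugin.
def tsearch_conditions(user_input, column = 'search_vector')
  raise ArgumentError, "bad column name" unless column =~ /\A[a-z_][a-z0-9_]*\z/

  # Keep only alphanumeric words so user input cannot inject tsquery operators.
  words = user_input.downcase.scan(/[[:alnum:]]+/)
  return nil if words.empty?

  # All words must match: join them with the tsquery AND operator.
  ["#{column} @@ to_tsquery(?)", words.join(' & ')]
end

tsearch_conditions("Ruby on Rails!")
# => ["search_vector @@ to_tsquery(?)", "ruby & on & rails"]
```

In a model you would then pass the result to a normal finder, something like `Job.find(:all, :conditions => tsearch_conditions(params[:q]))`, where `Job` is again just an example model.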
I have no idea how the performance compares to other engines but I
find it pretty cool that everything happens transparently inside the
database so you have one less process to monitor and keep fresh. So if
you're using PostgreSQL, it should definitely be worth a shot. It's
been around forever, so it should be void of most pediatric diseases.
Cheers,
//jarkko
--
Jarkko Laine
http://jlaine.net
http://dotherightthing.com
http://www.railsecommerce.com
http://odesign.fi
I had trouble with it using version 0.11.6. I was having intermittent
problems every time I tagged a store. When I removed my rescue I
found it was ferret (don't have the exact error on me.. sorry).
Stepping back to 0.11.3 seems to have resolved this (this is the last
version I can remember that worked for me somewhat reliably). With
0.11.6, removing my index solves it temporarily (6 or 7 tag actions)
but then it comes back.
Feel free to move this to the ferret talk list, I'll go check on it there.
-Vince
support independent business -- http://www.buyindie.net/
Do you mean the "*" feature (prefix* and *infix*) ? Where
the search term "program*" matches the database text
"program", "programmer", "programs" ...
Those work for me in version sphinx-0.9.8-svn-r1065 and
sphinx-0.9.8-svn-r1112 ... I have done quite some testing
on r1065 (still testing the newest r1112) and that seems
to work OK for me. Set "enable_star" to 1 and set a
min_prefix_len or a min_infix_len.
> - No automatic updates - must rebuild entire index using cron jobs.
Indeed. But automatic rotation of indexes seems to work OK.
Indexing on my dataset takes 15 seconds (37000 records,
28 MByte) on a desktop PC.
> Again using straight SQL, not the current state of your models
At least in one (limited) test, I have just used the :include
feature of ultrasphinx and that automatically created the SQL
for the sphinx configuration file. So, if I understand well,
that did use the AR model ?
From:
http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/classes/ActiveRecord/Base.html
* Including a field from an association
Use the :include key.
Accepts an array of hashes.
:include => [{:association_name => 'category', :field => 'name', :as => 'category_name'}]
Each should contain an :association_name key (the association name for
the included model), a :field key (the name of the field to include),
and an optional :as key (what to name the field in the parent).
So in my Model for jobs, that is as simple as e.g.:
class Job < ActiveRecord::Base
  is_indexed :fields => ['title'],
             :include => [
               {:association_name => 'employer', :field => 'name'}]
  belongs_to :employer
  ...
The Sphinx config file that ultrasphinx generates automatically then
contains:
...
sql_query = SELECT (jobs.id * 1 + 0) AS id, 'Job' AS class, 0 AS class_id, jobs.title AS title, employer.name AS name, ...
...
index complete
{
source = jobs
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, -, _, &, a..z,
U+410..U+42F->U+430..U+44F, ... and a lot more ...
min_word_len = 2
# min_infix_len = 4
stopwords =
enable_star = 1
path = /var/sphinx//sphinx_index_complete
docinfo = extern
morphology = none
min_prefix_len = 4
}
All of this seems to work for me (no production experience yet ...).
>
>Jeff Cc wrote:
>> - No wildcards at all (Sphinx doesn't support them)
>
>Do you mean the "*" feature (prefix* and *infix*) ? Where
>the search term "program*" matches the database text
>"program", "programmer", "programs" ...
>
>Those work for me in version sphinx-0.9.8-svn-r1065 and
>sphinx-0.9.8-svn-r1112 ... I have done quite some testing
>on r1065 (still testing the newest r1112) and that seems
>to work OK for me. Set "enable_star" to 1 and set a
>min_prefix_len or a min_infix_len.
Yup, works for me too - it's just not turned on by default. The
enable_star feature has been around in at least the last 4 releases of
0.9.8. Fairly certain it's not in 0.9.7 though (the last 'production'
release).
>
>> - No automatic updates - must rebuild entire index using cron jobs.
>
>Indeed. But automatic rotation of indexes seems to work OK.
>Indexing on my dataset takes 15 seconds (37000 records,
>28 MByte) on a desktop PC.
Thinking Sphinx has delta indexes, which keep track of changes between
explicit indexes. I know Evan's working on adding something like this
to UltraSphinx as well.
Because the delta indexes are so small, they get indexed really quickly,
straight after a model is updated.
>
>> Again using straight SQL, not the current state of your models
>
You're correct that in UltraSphinx model methods (as opposed to
standard attributes) are not accessible for index generation
- that's the same with Thinking Sphinx and perhaps all of the other
plugins as well.
Because you're dealing with MySQL directly when the data is indexed,
there's no instantiation of models (and no Ruby at all), so it's not
really an option. If the data you want isn't available somewhere in the
database, you're out of luck.
Ferret uses model methods, I believe, if that's an option available to
you.
>At least in one (limited) test, I have just used the :include
>feature of ultrasphinx and that automatically created the SQL
>for the sphinx configuration file. So, if I understand well,
>that did use the AR model ?
>
>From:
>http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/classes/ActiveRecord/Base.html
>
>* Including a field from an association
>
>Use the :include key.
>
>Accepts an array of hashes.
>
> :include => [{:association_name => 'category', :field => 'name', :as => 'category_name'}]
>
>Each should contain an :association_name key (the association name for
>the included model), a :field key (the name of the field to include),
>and an optional :as key (what to name the field in the parent).
>
>So in my Model for jobs, that is as simple as e.g.:
>
>class Job < ActiveRecord::Base
>
>  is_indexed :fields => ['title'],
>             :include => [
>               {:association_name => 'employer', :field => 'name'}]
>
> belongs_to :employer
>...
>
Thinking Sphinx equivalent (just to provide a comparison):
class Job < ActiveRecord::Base
  define_index do |index|
    index.includes.title
    index.includes.employer.name
  end
  # ...
end
Cheers
--
Pat
I have the impression that enable_star _is_ really the feature that
allows a search for "*@gmail.com" to find all emails @ gmail.com (if you
add the '@' sign to the charset table, that is, which is another problem,
since '@' also has a special meaning as a field indicator for
field-specific search).
With enable_star the user must explicitly give a '*'. Without a '*'
the match is only an exact match. I give an example at the end of
my blog (http://www.vandenabeele.com/Ultrasphinx-performance), where
I tested with and without the enable_star feature and always without
stemming (since I had no stemmer for the Dutch language).
0.001 sec [ext/0/rel 1409 (0,20)] [complete] c
0.001 sec [ext/0/rel 1409 (0,20)] [complete] c*
0.000 sec [ext/0/rel 35 (0,20)] [complete] co
0.000 sec [ext/0/rel 35 (0,20)] [complete] co*
0.000 sec [ext/0/rel 5 (0,20)] [complete] com
0.000 sec [ext/0/rel 5 (0,20)] [complete] com*
0.000 sec [ext/0/rel 10 (0,20)] [complete] comp
0.003 sec [ext/0/rel 5343 (0,20)] [complete] comp*
0.000 sec [ext/0/rel 0 (0,20)] [complete] compl
0.000 sec [ext/0/rel 1473 (0,20)] [complete] compl*
0.000 sec [ext/0/rel 0 (0,20)] [complete] comple
0.000 sec [ext/0/rel 1214 (0,20)] [complete] comple*
0.000 sec [ext/0/rel 0 (0,20)] [complete] complet
0.000 sec [ext/0/rel 793 (0,20)] [complete] complet*
0.000 sec [ext/0/rel 458 (0,20)] [complete] complete
0.000 sec [ext/0/rel 642 (0,20)] [complete] complete*
0.000 sec [ext/0/rel 30 (0,20)] [complete] completed
0.000 sec [ext/0/rel 30 (0,20)] [complete] completed*
0.000 sec [ext/0/rel 0 (0,20)] [complete] completel
0.000 sec [ext/0/rel 130 (0,20)] [complete] completel*
0.000 sec [ext/0/rel 10 (0,20)] [complete] completely.
What happens is that with fewer than 4 characters the * has no effect,
but from 4 characters on the * expands to all words that start with the
typed prefix. And that is an interesting feature the major public
search engines do not offer. At this time, with the relatively small
database I expect initially for our project (< 10 MByte or so), it
should not be a problem to keep indices with star expansion after 4
letters in memory.
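To make the behaviour in the log above a bit more concrete, here is a toy model in plain Ruby of what I observe with min_prefix_len = 4. This is only my reading of the results, not Sphinx's actual implementation; `toy_match` and the word list are invented for the example.

```ruby
MIN_PREFIX_LEN = 4 # same value as min_prefix_len in the config shown earlier

# Returns the indexed words that a search term matches in this toy model.
def toy_match(term, indexed_words, min_prefix_len = MIN_PREFIX_LEN)
  if term.end_with?('*')
    prefix = term.chomp('*')
    if prefix.length >= min_prefix_len
      # Long enough: the star expands to every word starting with the prefix.
      return indexed_words.select { |w| w.start_with?(prefix) }
    end
    term = prefix # too short: the star is ignored
  end
  # No (effective) star: exact match only.
  indexed_words.select { |w| w == term }
end

words = %w[com comp company complete completed completely]
toy_match('com*', words)      # => ["com"]
toy_match('comp', words)      # => ["comp"]
toy_match('comp*', words)     # => ["comp", "company", "complete", "completed", "completely"]
toy_match('complete*', words) # => ["complete", "completed", "completely"]
```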
An issue that I still have is that a final '.' of a sentence is attached
to the index data and so not found without attaching a '.' or '*' to the
search term.
++++
I solved the '.' issue in the meantime with a crude solution of
removing the '.' character from the charset_table list (which causes other
problems ...).
Stemming will reduce e.g. 'companies' and 'company' to a stem of 'compani'
(both in the search term and in the database index), without the user
needing to add a special * to the search, so any combination of
'company' and 'companies' will match.
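As a rough illustration of why that works, here is a toy stemmer in plain Ruby with just two suffix rules in the spirit of the Porter stemmer. Real stemmers (including the ones Sphinx ships) have many more rules; `toy_stem` and `same_stem?` are names I made up.

```ruby
# Two suffix rules only; a real stemmer handles far more cases.
def toy_stem(word)
  w = word.downcase
  if w.end_with?('ies')
    w.sub(/ies\z/, 'i')   # companies -> compani
  elsif w.length > 2 && w.end_with?('y')
    w.sub(/y\z/, 'i')     # company -> compani
  else
    w
  end
end

# A search term matches an indexed word when both reduce to the same stem.
def same_stem?(search_term, indexed_word)
  toy_stem(search_term) == toy_stem(indexed_word)
end

toy_stem('companies')              # => "compani"
toy_stem('company')                # => "compani"
same_stem?('company', 'companies') # => true
```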
HTH,
Peter
Hi,
This is a very interesting thread. I am currently deploying a Rails app
with aaf. I am having a lot of trouble and basically cannot get it working. I
am surprised because everything was so simple in development.
I must admit that I understand nothing about the DRb server. (I am still
learning this new thing.)
My app is on a shared host. I do not even know whether the DRb server
can run on it...
When I run : script/ferret_server -e production start
I get : starting ferret server...
That is all
But when I stop it (script/ferret_server -e production stop) I get :
ferret_server doesn't appear to be running
I guess it is not normal (can someone confirm please ?)...
Then when I do script/console production
Article.rebuild_index
I get the first time :
DRb::DRbConnError: druby://ferret.myhost.com:9010 - #<SocketError:
getaddrinfo: Name or service not known>
And if I do it a second time :
LoadError: Expected article.rb to define Article
However bad I may be at this, that is just awful behavior for a piece of
software. Sorry to say that, because I loved aaf in dev.
(I tried a chmod -R 777 index without success)
For info I am deploying with Capistrano, in case it rings a bell to
someone.
My options:
- more help from my host (I am currently discussing it with them)
- help from you with the aaf configuration; but even if I make it work,
  from what I have read here I should not build my app with it...
- try another search engine: Sphinx; but I read from brfsa "FERRET is in
  my second choice only because shared hosts won't support sphinx...."
  Can anyone give more detail on that? Or alternatively, for those of
  you on a shared host, how do you manage your search?
Finally, I am listening to your suggestions about:
- the web host: which ones allow a search engine such as
  ferret/sphinx/other?
- how to configure aaf?
- which plugins for the engine? (Do not worry, I will read the whole
  thread again!)
Thx !
H
Adrian Madrid wrote:
> Don't even try running ferret on a shared host. I don't think you really
> have any other option but MySQL fulltext indexes in a shared hosting
> environment.
>
> AEM
You might take a look at tsearch2 on postgresql (for a shared host
solution).
IIRC, it only requires special indexes in the database, but no daemon
process (like e.g. sphinx does). This was mentioned higher up in this
thread too, by
Ericson Smith.
I did some experiments with tsearch2 and it worked OK (but then I
switched to sphinx, mainly because MySQL is more common as a Rails
back-end and because a clean and full plug-in (Ultrasphinx) was
available). In older versions of Postgresql it is a plug-in, since 8.2
(IIRC) it is built-in by default.
HTH,
Peter
About ferret on a shared host there is this solution which could be a
temporary solution.
http://boonedocks.net/mike/archives/151-Rails-acts_as_ferret-without-DRb.html
H
"Is there something that fulltext mysql indexes won't give you that you
desperately need? If MySQL won't cut it then you probably need to move
into a VPS."
Well that is a good question I was wondering about. And basically the
answer is that it was so easy to run aaf that it is a pity to go without
it to search in different models, for different fields.
By the way I do not really understand why ferret could not use the db to
write its index (performance issue?). At least the db knows how not to
corrupt a file system.
H
If you want to stay out of all this debate about which search engine to
use, avoid troubleshooting your search feature, keep it zero
maintenance and still get great speed at indexing and searching (pros
and cons below), I would suggest you go for Solr and the acts_as_solr plugin.
I have compiled some points that I have come across in my experience with
RoR to date.
Ferret:-
Advantages:
1. Easy to implement.
2. Indexing on ActiveRecord save - It hooks up with the life cycle of an
object.
Disadvantages:-
1. Corrupts indexes if used with transactions in your app because of
its after_update filter. (It updates the index before the actual save to
the database.)
2. Unstable on the production server if you use load balancing
such as a round-robin scheme and you have instances of Mongrel on
different machines.
(Added burden of running a separate DRb server.)
3. Faster at indexing but slower at searching.
Sphinx:-
Advantages:-
1. Great at speed of indexing and searching.
2. It's at the database level, so just one copy of the indexes, unlike Ferret.
Disadvantages:-
1. Difficult to integrate as compared to Ferret or Solr.
2. You have to write a lot of sql code in the configuration file for
indexing and searching data.
3. Not hooked with the ActiveRecord save or the life cycle of an object,
so you need a cron job to rebuild the index periodically.
Solr:-
Advantages:-
1. Easy to implement
2. Runs on a separate Java server(Solr server), so just one copy of
indexes.
3. Hooked up with the object life cycle, so index update with
ActiveRecord save.
4. Good speed at indexing and searching
5. No gem required, no engine installation......just get the
Acts_as_solr plugin.
6. In-built support for highlighting search keywords like you see in
Google Search and many more advanced features.
7. NONE of the disadvantages mentioned above
Disadvantages:-
1. It costs you just some extra memory but not an unbearable amount
though.(I would say that now-a-days memory is cheaper, so you can afford
it)
I personally would suggest you to go for Acts_As_Solr plugin.
You could also refer to the following links:-
http://bloggingrails.wordpress.com/2007/05/31/implementing-full-text-search-for-your-rails-application/
http://blog.aisleten.com/2007/04/14/getting-started-with-acts_as_solr/
If you decide to use Acts_as_solr on windows, this would be helpful:-
http://www.webonrails.com/2007/09/13/acts_as_solr-starting-solr-server-on-windows/
> Sphinx:-
> Advantages:-
> 1. Great at speed of indexing and searching.
> 2. It's at the database level so just one copy of indexes unlike
> ferret.
> Disadvantages:-
> 1. Difficult to integrate as compared to Ferret or Solr.
Arguable, but each to their own.
> 2. You have to write a lot of sql code in the configuration file for
> indexing and searching data.
This very much depends on the plugin you use. I'm reasonably sure this
isn't required for Ultrasphinx, and it's definitely not for Thinking
Sphinx (my own plugin, as mentioned earlier in this thread - yes, I've
got some level of bias).
> 3. Not hooked with the ActiveRecord save or the life cycle of an
> object,
> so you need a cron job to rebuild the index periodically.
Yeah, that's pretty much true. Both of the above plugins support delta
indexes, so model changes are automatically put into the live indexes,
but regular periodic reindexing is still needed.
> Solr:-
<snip>
> Disadvantages:-
> 1. It costs you just some extra memory but not an unbearable amount
> though.(I would say that now-a-days memory is cheaper, so you can
> afford
> it)
2. It's Java - which is extra overhead for some people - I certainly
don't use any other Java tools, and I've not dealt with Java since
Uni. Again, each to their own, but that may push non-Java people away
from Solr.
Cheers
You don't need to know Java to use the acts_as_solr plugin.
You just install the plugin and build the index for the first time.
From then on, you just have to start the Solr server by issuing a
command:
rake solr:start
Now tell me, where's Java?
That's pretty much it.
Ultrasphinx works only on Rails 2.0.
Sorry - first off, I have complete ignorance about acts_as_solr and
Solr. My Java comment was in reference to the latter though, since you
mentioned you can run it on 'a separate Java server', I assumed that's
all it runs in.
>>> 2. You have to write a lot of sql code in the configuration file for
>>> indexing and searching data.
>>
>> This very much depends on the plugin you use. I'm reasonably sure
>> this
>> isn't required for Ultrasphinx, and it's definitely not for Thinking
>> Sphinx (my own plugin, as mentioned earlier in this thread - yes,
>> I've
>> got some level of bias).
>
> Ultrasphinx works only on Rails 2.0.
Granted, this is a problem if you're shoehorning Sphinx into an
existing app - but I'm guessing most people starting new projects
would be using 2.0 (or even edge in preparation for 2.1?)
Cheers
--
Pat