My starting data sets will be in the ballpark of 20,000 records and
probably will not grow beyond about 100,000 records. I need to
support indexing records with multiple fields and allowing searches of
specific fields, sorting results by various columns, boolean
operators, etc. I'm willing to accept a steeper learning curve if
the tool is flexible enough that I won't have to learn another one
for the next few projects. :-)
A few options I'm considering:
* Solr. This is a Java-based search application (built with Lucene)
with an extensive web services interface allowing you to use it from
other languages. Likely the most memory intensive option, but my
initial research indicates it may also be the most full-featured and
fastest option.
* Zend's PHP port of Lucene. Apparently it is not a complete port
and performance for both indexing and searching is substantially worse
than Solr. However, it would probably be more convenient and does not
introduce a Java dependency for the project.
* Sphinx. I haven't looked into this one in depth but it appears to
have mature PHP bindings. I'm not in love with the query syntax
compared to Lucene's, though.
Thanks for any pointers,
David Brewer
david....@gmail.com
I had a great experience using Solr. The community around it is astounding. It is nice because it keeps you out of the Java Lucene code, yet you get all the benefits and extensibility of the engine.
One very cool feature was the ability to simply point Solr at RSS feeds and have them indexed.
Sphinx: Used by sites like http://nowpublic.com and http://www.information.dk/.
See http://www.drupal4hu.com/node/129 for a quick overview of the
PHP interface.
Solr: In use by many Drupal sites, thanks to an awesome integration module,
and the availability of hosted Solr. I guess the thinking is that many
PHP developers may shy away from Java out of habit or lack of
experience, so Acquia has announced hosted Solr for any Drupal site. I
don't know if it's flexible enough or robust enough to work with any
site, but it may be an option worth looking into. http://acquia.com/products-services/acquia-search
-Mike
__________________
Michael Prasuhn
503.488.5433 office
971.244.2595 cell
503.661.7574 home
mi...@mikeyp.net
http://mikeyp.net
"hello world" title:"example program"~5 body:"python" -"php" -"perl" "code"
With most of these search engines, you'll probably need to write a
custom front-end to protect your users from having to type these ugly
low-level queries.
Unfortunately, the code that transforms high-level user input into
low-level search engine queries is often some of the most cryptic code
in an app.
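
To make that concrete, here is a minimal sketch of such a translation layer in Python. It assumes the front-end hands over structured input (phrases, per-field phrases, exclusions) and emits a Lucene-style query string like the one above. The function names `quote` and `build_query` and the parameter names are hypothetical, invented for this illustration, and the sketch does not validate field names, which a real front-end would want to check against a known list.

```python
def quote(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(all_fields=(), by_field=None, excluded=(), proximity=None):
    """Build a Lucene-style query string from structured user input.

    all_fields: phrases to match in the default field
    by_field:   dict mapping a field name to the phrase to find in it
    excluded:   phrases that must not appear
    proximity:  dict mapping a field name to the maximum word distance
                allowed between the words of that field's phrase
    """
    by_field = by_field or {}
    proximity = proximity or {}

    parts = [quote(phrase) for phrase in all_fields]
    for field, phrase in by_field.items():
        clause = f"{field}:{quote(phrase)}"
        if field in proximity:
            clause += f"~{proximity[field]}"
        parts.append(clause)
    parts.extend(f"-{quote(phrase)}" for phrase in excluded)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(
        all_fields=["hello world", "code"],
        by_field={"title": "example program", "body": "python"},
        excluded=["php", "perl"],
        proximity={"title": 5},
    ))
    # "hello world" "code" title:"example program"~5 body:"python" -"php" -"perl"
```

Quoting every value as a phrase keeps the escaping rules small, at the cost of giving up wildcards and other operators inside user input.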
> I may have to create some kind
> of adapter layer that allows me to plug in different search engines
> relatively easily so that as the scene evolves I can easily switch
> later on.
I like this idea, but it'll be tricky because most of these search
engines require invasive changes to the database schema and model
classes that are hard to abstract away.
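
For what it's worth, the query-side half of such an adapter might look roughly like the Python sketch below. The names `SearchBackend` and `InMemoryBackend` are hypothetical, and the in-memory class is only a naive substring matcher standing in for a real engine. It does nothing about the schema and model changes mentioned above, which remain the hard part.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface the application codes against."""

    @abstractmethod
    def index(self, doc_id, fields):
        """Add or replace a document; fields maps field name to text."""

    @abstractmethod
    def search(self, terms, field=None, sort_by=None):
        """Return ids of documents containing every term."""


class InMemoryBackend(SearchBackend):
    """Naive stand-in: case-insensitive substring matching, no ranking."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, fields):
        self._docs[doc_id] = dict(fields)

    def search(self, terms, field=None, sort_by=None):
        wanted = [t.lower() for t in terms]
        hits = []
        for doc_id, fields in self._docs.items():
            if field:
                haystack = str(fields.get(field, ""))
            else:
                haystack = " ".join(str(v) for v in fields.values())
            haystack = haystack.lower()
            if all(t in haystack for t in wanted):
                hits.append(doc_id)
        if sort_by:
            hits.sort(key=lambda d: str(self._docs[d].get(sort_by, "")))
        return hits


if __name__ == "__main__":
    backend = InMemoryBackend()
    backend.index(1, {"title": "Example program", "body": "python code"})
    backend.index(2, {"title": "Another example", "body": "php code"})
    print(backend.search(["python"], field="body"))   # [1]
    print(backend.search(["code"], sort_by="title"))  # [2, 1]
```

A Solr- or Sphinx-backed class would implement the same two methods, so the calling code would not need to change when the engine does.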
-igal
Sphinx has fantastic query speed, and its n-gram search algorithm
seems more accurate in many cases. But the index size is gargantuan.
I chatted with the author about this, and he recommends using word
lists to prime the index, but that code is not fully realized yet, so
you would need to roll your own.
In addition, Solr uses AND and OR in place of '&' and '|', so I think
that example could also be written as:
"hello world" title:"example program"~5 body:"python" -(php OR perl) "code"
My theory (admittedly, not in any way tested) is that advanced search
engine users are already familiar with Google's approach and may try
to type in Google-style searches no matter what syntax your system
uses, so you might as well go with the flow. On the flip side, users
who are not advanced are unlikely to try to do advanced searches using
just a query string, and will probably need some kind of visual
interface no matter what the query language looks like.
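
As a rough illustration of going with the flow, the Python sketch below rewrites a Google-style query (implicit AND between terms, uppercase OR, a leading minus for exclusion, double quotes for phrases) into an explicit Lucene-style string. The function name `google_to_lucene` is made up for this example, and the sketch assumes well-formed input: it does not escape special characters or handle unbalanced quotes.

```python
import re

# A token is an optionally negated quoted phrase, or a run of non-space characters.
_TOKEN = re.compile(r'-?"[^"]*"|\S+')


def google_to_lucene(query):
    """Rewrite a Google-style query with explicit AND/OR operators."""
    clauses = []     # each clause is a list of alternatives joined by OR
    excluded = []    # terms or phrases that must not appear
    join_next = False

    for tok in _TOKEN.findall(query):
        if tok == "OR":
            # Only meaningful if there is something to the left of it.
            join_next = bool(clauses)
            continue
        if tok.startswith("-") and len(tok) > 1:
            excluded.append(tok[1:])
            join_next = False
        elif join_next:
            clauses[-1].append(tok)
            join_next = False
        else:
            clauses.append([tok])

    parts = [
        alts[0] if len(alts) == 1 else "(" + " OR ".join(alts) + ")"
        for alts in clauses
    ]
    positive = " AND ".join(parts)
    negative = " ".join(f"-{term}" for term in excluded)
    return " ".join(p for p in (positive, negative) if p)


if __name__ == "__main__":
    print(google_to_lucene('"hello world" python OR ruby -php -perl'))
    # "hello world" AND (python OR ruby) -php -perl
```

Spelling out AND means the result does not depend on the engine's default operator.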
> Sphinx has fantastic query speed, and its n-gram search algorithm
> seems more accurate in many cases. But the index size is gargantuan.
I can confirm this. My Solr indexes are about three-quarters the size of
the data they're indexing. My Sphinx indexes are about 5x larger than
the data they're indexing.
-igal