Full-featured search engines in PHP


David Brewer

Jun 6, 2009, 8:51:53 PM
to pdx...@googlegroups.com
I'm about to embark on a couple of php projects (symfony based) which
will involve integrating fairly sophisticated search engine
functionality. The search will be one of the core features of the
site so performance is important. I recently completed a project
which used the Perl-based search engine "KinoSearch" and I'm hoping to
find something similarly full-featured for PHP. Does anyone have any
good or bad experiences with PHP search engines to report?

My starting data sets will be in the ballpark of 20,000 records and
probably will not grow beyond about 100,000 records. I need to
support indexing records with multiple fields and allowing searches of
specific fields, sorting results by various columns, boolean
operators, etc. I'm willing to put more time into the learning curve if
the tool is flexible enough that I may not have to learn another one
for a few projects. :-)

A few options I'm considering:
* Solr. This is a Java-based search application (built with Lucene)
with an extensive web services interface allowing you to use it from
other languages. Likely the most memory intensive option, but my
initial research indicates it may also be the most full-featured and
fastest option.
* Zend's PHP port of Lucene. Apparently it is not a complete port
and performance for both indexing and searching is substantially worse
than Solr. However, it would probably be more convenient and does not
introduce a Java dependency for the project.
* Sphinx. I haven't looked into this one in depth but it appears to
have mature PHP bindings. I'm not in love with the query syntax
compared to Lucene's, though.

Thanks for any pointers,

David Brewer
david....@gmail.com

Franz Maruna

Jun 6, 2009, 9:22:27 PM
to pdx...@googlegroups.com
I can tell ya we were not happy with Lucene for the built-in concrete5
search, for any number of reasons.

Sent from my iPhone

David Brewer

Jun 6, 2009, 9:46:26 PM
to pdx...@googlegroups.com
You were using the Zend PHP port? What did you end up with? Roll your own?

Sam Keen

Jun 6, 2009, 11:01:52 PM
to pdx...@googlegroups.com

I had a great experience using Solr. The community around it is astounding. It is nice as it keeps you out of the Java Lucene code but you get all the benefit and extensibility of the engine.
One very cool feature was the ability to simply point Solr to RSS feeds and have them indexed.

David Brewer

Jun 6, 2009, 11:59:46 PM
to pdx...@googlegroups.com
Sam, thanks for sharing your experience. Did you run Solr using the
default Jetty container or use something else like Tomcat? And, was
it a big memory/cpu hog? My Java servlet experience is pretty rusty
these days...

Sam Keen

Jun 7, 2009, 1:18:53 AM
to pdx...@googlegroups.com
Ran it with Jetty.
It was just a proof-of-concept type use. I think it was indexing about
1000 records. Ran fine on a VPS with 256 MB RAM.
I would think you would probably move to Tomcat for any sort of
'production' use.
Here is an interesting article concerning Rackspace's use of Solr with Hadoop
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
--
Sam Keen
@samkeen

Igal Koshevoy

Jun 7, 2009, 2:38:25 AM
to pdx...@googlegroups.com, David Brewer
David,

I've used Solr as the search engine for Calagator and some other
projects. It's very stable, offers excellent search features and works
well. I've only used it with the default Jetty container.

However, there are some big gotchas:
* Slow indexing: It takes a fast CPU up to a minute to index a megabyte
of database content, so you should avoid reindexing and build your index
on a fast machine and upload it to your slow, virtualized servers. In
comparison, Sphinx can index the same amount of data orders of magnitude faster.
* Memory use: My typical small Solr instances consume 100MB of resident
memory and 700MB of virtual memory. By tinkering with Java memory flags,
I can make them run in as little as 50MB resident and 300MB virtual, but
this hurts performance. In comparison, my typical small Sphinx instances
typically use 0.5MB resident and 80MB virtual.
* Bindings: The server process itself is fine, but the Ruby on Rails
bindings were horribly broken and required much rework. Hope the PHP
bindings are better.

I find it hard to recommend any of the search systems I've used:
* MySQL full-text search is easy but very limited.
* PostgreSQL's tsearch2 is ludicrously slow.
* Solr works great if you can live with its severe gotchas.
* Sphinx is fast and light, but has limited features, requires the most
invasive changes to your application and database, and makes it harder
to do basic things like update an existing index than it should.
* Estraier seems to be broken by design and no longer maintained.
* Xapian is new to me, but worth investigating further.

I'd be interested in hearing more from anyone that likes their full text
search system.

-igal

Andrew Embler

Jun 8, 2009, 11:10:25 AM
to pdx...@googlegroups.com, David Brewer
Hey David!

Just wanted to chime in on what Franz said last week. We did indeed
use the Zend Framework port of Lucene, and I was NOT happy with it.
This was for concrete5. I found a few things to be the case:

1. Challenging when working with different character sets.
2. VERY slow (obviously, much slower than Java Lucene or an alternate
solution), both when searching and indexing (although we were lazy
and the only indexing routine we ever built was a reindex of the
entire index, rather than one that updated progressively).
3. I was never all that impressed with the results, to be honest with you.

I did like the syntax, and the ease of drop-in, which was primarily
the reason I was attracted to it in the first place. This was for
concrete5, and we needed something that

1. could be run on most LAMP installations without any tuning -
something that "just worked."
2. had a permissive license for inclusion w/MIT licensed software.

So ZF's Lucene port fit the bill, but technically just wasn't working
out (plus, even it resulted in a number of installation headaches for
people on certain hosts). To that end, we ripped out Zend's Lucene and
replaced it with a custom MySQL-based indexer that uses MySQL's
fulltext searching. I've always been happy with this. Our search isn't
perfect, but I've found that, if your tables are properly indexed, and
you take the time to write any advanced search logic (search between
dates, etc...) MySQL fulltext will handle boolean searching of various
full text data very well. Its only annoyances (to me) are its minimum
character length and stop words, both of which it sounds like could be
tweaked by you, since you're working on something you'll have total
control over.

Please let us know what you decide on!

best,
Andrew
--
Andrew Embler
CTO, concrete5
concrete5.org | Twitter: aembler

David Brewer

Jun 8, 2009, 1:27:27 PM
to pdx...@googlegroups.com
Thanks for the interesting link. That's quite a bit heavier an
application than we are likely to ever consider, so it's good to hear
that Solr is up to the task. However, I'm coming to realize that we
are in a bit of an uncomfortable middle ground here... we want the
flexibility and scalability of a higher-end search solution, but we
also want something that could comfortably be set up for much smaller
projects without a huge effort. I'm somewhat uncomfortable with
adding Java to our list of requirements for our PHP-based
installations, too.

I was hoping to find a full-featured Lucene port implemented as a PHP
extension, like this only not so alpha:
http://pecl.php.net/package/clucene. But I guess that's not in the
cards just yet.

David Brewer

Jun 8, 2009, 1:35:28 PM
to Igal Koshevoy, pdx...@googlegroups.com
Igal --

Thanks for the insight. It's great to get some feedback from someone
who has tried multiple different solutions.

Sphinx looks somewhat tempting when I look at its feature list, but I
just can't get over its ugly query syntax. Check out this example:

"hello world" @title "example program"~5 @body python -(php|perl) @* code

Not so bad for developers, but I can't imagine asking end users of the
site to dig through that!

I am coming to terms with the fact that this is apparently a "pick the
least-worst" kind of situation. :-) I may have to create some kind
of adapter layer that allows me to plug in different search engines
relatively easily so that as the scene evolves I can easily switch
later on.
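
One way such an adapter layer might look, as a minimal sketch. The names `SearchBackend`, `InMemoryBackend`, and `build_backend` are made up for illustration, and Python is used only to keep the example short. A real backend would wrap Solr, Zend Lucene, or another engine instead of a dictionary, and would need far more than single-term lookups.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface the application codes against."""

    @abstractmethod
    def index(self, doc_id, fields):
        """Add or replace one record; `fields` maps field name to text."""

    @abstractmethod
    def search(self, field, term):
        """Return the sorted ids of records whose `field` contains `term`."""


class InMemoryBackend(SearchBackend):
    """Toy stand-in for a real engine; stores lowercased word sets per field."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, fields):
        self._docs[doc_id] = {
            name: set(text.lower().split()) for name, text in fields.items()
        }

    def search(self, field, term):
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, fields in self._docs.items()
            if term in fields.get(field, set())
        )


def build_backend(name):
    """Pick a backend by configuration name (only the toy one exists here)."""
    backends = {"memory": InMemoryBackend}
    return backends[name]()


backend = build_backend("memory")
backend.index(1, {"title": "Example program", "body": "written in python"})
backend.index(2, {"title": "Another example", "body": "written in perl"})
print(backend.search("body", "python"))
```

The application only ever talks to `SearchBackend`, so swapping engines later would mean adding one more class and one more entry in `build_backend`.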


David Brewer

Jun 8, 2009, 1:43:50 PM
to Andrew Embler, pdx...@googlegroups.com
Andrew:

Thanks for the very valuable information about your experience with
Zend Lucene. The things that attracted you to it in the first place
are exactly the reasons I was considering it, but the downsides seem
significant. When it comes to its speed... how big a data set were
you attempting to index where the speed became a problem? A lot of
people complain about Zend's speed but I don't have a clear sense of
what size of data set is big enough to start noticing the problem.

MySQL Fulltext searching is something I considered, but there are a
couple of things which have always turned me away from it:
* We sometimes have to use other databases such as MSSQL or Oracle,
so a solution which does not directly depend upon a specific database
is preferable.
* I have a sense (possibly incorrect) that it's trickier setting up
the kind of multiple-field queries I want to do. But to be honest, I
haven't looked into this in depth because the first issue is big
enough for me to rule it out.

At this time I am leaning toward creating some kind of simplified
indexing and query layer which plugs into different backends. Perhaps
Zend Lucene will be sufficient for smaller projects where I have fewer
server resources to work with, while Solr will be a better choice in
more substantial projects with better hardware.

I will likely be doing some more in-depth investigation and
comparative tests at some point in the next few weeks. I'll share
this with the list when I'm done. Thanks to everyone that helped
steer my direction of investigation!

David

Michael Prasuhn

Jun 8, 2009, 2:01:34 PM
to pdx...@googlegroups.com
I'm a Drupal guy here, and Drupal is not known for its built-in
search, so I've looked at a lot of alternatives:

Sphinx: Used by sites like http://nowpublic.com and http://www.information.dk/.
See http://www.drupal4hu.com/node/129 for a quick overview of the
PHP interface.

Solr: In use by many Drupal sites, due to an awesome integration module,
and the availability of hosted Solr. I guess the thinking is that many
PHP developers may shy away from Java out of habit or lack of
experience, so Acquia has announced hosted Solr for any Drupal site. I
don't know if it's flexible enough or robust enough to work with any
site, but it may be an option worth looking into. http://acquia.com/products-services/acquia-search

-Mike


__________________
Michael Prasuhn
503.488.5433 office
971.244.2595 cell
503.661.7574 home
mi...@mikeyp.net
http://mikeyp.net


David Brewer

Jun 8, 2009, 2:10:48 PM
to Igal Koshevoy, pdx...@googlegroups.com
Your mention of Xapian encouraged me to look into this option further
and it looks like a pretty strong contender to me:
* Implemented in C++ and available as packages for many Linux
distributions, including my preferred Ubuntu.
* Seems to have good PHP bindings (not to mention Ruby, Python, and
Tcl!), also available as Ubuntu packages.
* Their query syntax is reasonable.
* Their performance seems reasonable.
* No need to run a separate Xapian server application, unlike Solr.

At this point, if I get the time to compare three contenders in more
detail, those contenders will likely be Xapian, Zend Lucene, and Solr.

David


Sam Keen

Jun 8, 2009, 2:12:34 PM
to pdx...@googlegroups.com
" I'm somewhat uncomfortable with
adding Java to our list of requirements for our PHP-based
installations, too."

To mitigate that, you could consider building a 'Google appliance' with
Solr (or something similar discussed here).
Then you can offer search as a service for all your PHP builds. You
will just need feeds from those sites to be indexed by the box you
built. With some management of those feeds, you can present just
updated/new content to keep you from re-indexing the entire set.
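
A rough sketch of that feed management, assuming each record carries a last-modified timestamp and the indexer remembers when it last ran. The record layout and the function name `changed_since` are hypothetical, and Python is used only for brevity.

```python
from datetime import datetime


def changed_since(records, last_indexed):
    """Return only the records modified after the previous indexing run."""
    return [r for r in records if r["modified"] > last_indexed]


records = [
    {"id": 1, "title": "Old page", "modified": datetime(2009, 5, 1)},
    {"id": 2, "title": "Edited page", "modified": datetime(2009, 6, 7)},
    {"id": 3, "title": "New page", "modified": datetime(2009, 6, 8)},
]
last_run = datetime(2009, 6, 6)

# Only records 2 and 3 would be sent to the search box for indexing.
feed = changed_since(records, last_run)
print([r["id"] for r in feed])
```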

random thoughts,
sam
--
Sam Keen
@samkeen

David Brewer

Jun 8, 2009, 2:15:13 PM
to pdx...@googlegroups.com
Interesting... I wasn't aware there were hosted search providers based
on Solr, but I can see that it would make a lot of sense for some
applications. Unfortunately it's not something that I can consider as
our projects typically have to be more self-contained.

David Brewer

Jun 8, 2009, 2:19:58 PM
to pdx...@googlegroups.com
That is a good suggestion, but most of our clients want a finished
product which is more self-contained. We usually have to run on their
own server infrastructure and I think they would be uncomfortable with
the idea of sharing a single search service infrastructure with other
clients. And, I personally would be uncomfortable with the single
point of failure for multiple otherwise unrelated projects.

Igal Koshevoy

Jun 8, 2009, 2:38:33 PM
to David Brewer, pdx...@googlegroups.com
David Brewer wrote:
> Sphinx looks somewhat tempting when I look at its feature list, but I
> just can't get over its ugly query syntax. Check out this example:
>
> "hello world" @title "example program"~5 @body python -(php|perl) @* code
>
> Not so bad for developers, but I can't imagine asking end users of the
> site to dig through that!
>
This sort of low-level query syntax is reasonable and common. Here's an
equivalent Solr query:

"hello world" title:"example program"~5 body:"python" -"php" -"perl" "code"

With most of these search engines, you'll probably need to write a
custom front-end to protect your users from having to type these ugly
low-level queries.

Unfortunately, the code that transforms high-level user input into
low-level search engine queries is often some of the most cryptic code
in an app.
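
A minimal sketch of that transformation layer, assuming a hypothetical search form with a free-text box, an optional title phrase, and a list of excluded words. The function name `build_query` and its parameters are invented for illustration. The output uses the `field:"phrase"` and `-"term"` style of the Solr query above, and the escaping here covers only quotes and backslashes, not every special character a real query parser cares about.

```python
def quote(text):
    """Wrap text in double quotes, escaping backslashes and quotes inside it."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_query(all_words="", title_phrase="", exclude=()):
    """Turn hypothetical form fields into one Lucene-style query string."""
    parts = []
    # Each free-text word becomes its own quoted term.
    for word in all_words.split():
        parts.append(quote(word))
    # A phrase restricted to the title field.
    if title_phrase.strip():
        parts.append("title:" + quote(title_phrase.strip()))
    # Excluded words are prefixed with a minus sign.
    for word in exclude:
        if word.strip():
            parts.append("-" + quote(word.strip()))
    return " ".join(parts)


print(build_query("python code", "example program", ["php", "perl"]))
```

Keeping this in one small function, rather than scattered through form handlers, is one way to stop it from becoming the cryptic part of the app.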

> I may have to create some kind
> of adapter layer that allows me to plug in different search engines
> relatively easily so that as the scene evolves I can easily switch
> later on.

I like this idea, but it'll be tricky because most of these search
engines require invasive changes to the database schema and model
classes that are hard to abstract away.

-igal

Chris Fortune

Jun 8, 2009, 3:38:23 PM
to pdx...@googlegroups.com
MySQL fulltext has always been satisfying, but tends to be slow on
data sets > 5M records. I managed to squeeze out a lot more performance
by maximizing the stopwords list and using a "divide and conquer"
strategy: first, select * into a temporary table using a WHERE criterion
to limit the data scope; second, search using fulltext on the temporary table.
Results come back in under a second for 10M records. Not bad bang for your buck, and
easy to maintain.
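
The two-step approach described above might be sketched as follows. This is an illustration under assumptions, not the code actually used: the table name `records`, the columns `title`, `body`, and `created`, and the helper `divide_and_conquer_sql` are all hypothetical. The post also does not say whether the temporary table got its own fulltext index or which storage engine it used, so the `ENGINE=MyISAM` clause and the `ALTER TABLE` step are guesses. The statements are only assembled as strings here (Python is used just to show the order of steps) and would still need to be run through a MySQL client.

```python
def divide_and_conquer_sql(scope_where, columns=("title", "body")):
    """Return the MySQL statements for a narrow-then-search fulltext query.

    `scope_where` is a trusted, developer-written WHERE clause that limits
    the data scope; the user's search terms go in via the %s placeholder.
    """
    cols = ", ".join(columns)
    return [
        # Step 1: copy only the rows in scope into a temporary table.
        "CREATE TEMPORARY TABLE tmp_search ENGINE=MyISAM AS "
        "SELECT * FROM records WHERE " + scope_where,
        # Assumption: give the temporary table its own fulltext index.
        "ALTER TABLE tmp_search ADD FULLTEXT INDEX ft_tmp (" + cols + ")",
        # Step 2: run the fulltext match against the much smaller table.
        "SELECT * FROM tmp_search WHERE MATCH (" + cols + ") "
        "AGAINST (%s IN BOOLEAN MODE)",
    ]


for statement in divide_and_conquer_sql("created >= '2009-01-01'"):
    print(statement + ";")
```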

Sphinx has fantastic query speed, and its n-gram search algorithm
seems more accurate in many cases. But, the index size is gargantuan.
I chatted with the author about this, and he recommends using word
lists to prime the index, but that code is not fully realized yet, so
you would need to roll your own.

David Brewer

Jun 8, 2009, 4:16:33 PM
to Igal Koshevoy, pdx...@googlegroups.com
I find the Solr (Lucene) example a lot easier to read and more
natural... maybe this is just because it's very close to the way
Google handles advanced queries and my expectations have been
thoroughly trained. :-)

In addition, Solr uses AND and OR in place of '&' and '|' so I think
that example could also be written as:
"hello world" title:"example program"~5 body:"python" -(php OR perl) "code".

My theory (admittedly, not in any way tested) is that advanced search
engine users are already familiar with Google's approach and may try
to type in Google-style searches no matter what syntax your system
uses, so you might as well go with the flow. On the flip side, users
who are not advanced are unlikely to try to do advanced searches using
just a query string, and will probably need some kind of visual
interface no matter what the query language looks like.

Igal Koshevoy

Jun 9, 2009, 10:47:45 AM
to pdx...@googlegroups.com
On Mon, Jun 8, 2009 at 12:38 PM, Chris Fortune<chris....@gmail.com> wrote:
> mysql fulltext has always been satisfying, but tends to be slow on
> data sets > 5M records.   I managed to squeeze a lot more performance
> by maximizing stopwords list and using a "divide and conquer"
> strategy.  1st select * into a temporary table using a WHERE criterion
> to limit data scope, 2nd search using fulltext on temporary table.
> Results in < second for 10M records.  Not bad bang for your buck, and
> easy to maintain.
Thanks for sharing this tip.

> sphinx has fantastic query speed, and it's n-gram search algorithm
> seems more accurate in many cases.  But, the index size is gargantuan.

I can confirm this. My Solr indexes are about 3/4ths of the size of
the data they're indexing. My Sphinx indexes are about 5x larger than
the data they're indexing.

-igal
