Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

SQL server FTS vs Lucene.NET

866 views

Skip to first unread message

Remi Blanchette

unread,

Jan 26, 2010, 2:43:22 PM1/26/10

People,
we are currently investigating, prototyping and comparing the architecture
of a full text search capability in a project. Our project has many data
repository, one of which is SQL server 2008. My personal vision for FTS
support was to use whatever was available at the repository level. As such,
for our SQL server database, we would have used SQL server FTS without even
looking into it.

It is argued internally however that this selection is a poor candidate even
for the SQL server repository. Which I definitely sound weird...

Here is my developer arguments:
"This product (sql server FTS) is lacking in term of the features we need to
properly rank search results. Even if we only had only one repository, the
limitations of Microsoft SQL Server Full-Text Search would make it a poor
candidate to have proper ranking.
...
It cannot be used to rank results for different kind of repository
(databases), it cannot even be used across tables in the same database,
since the number is normalized according to the number of occurrences of the
search terms in the table.
...
In addition to the problems with the aggregation of ranks in SQL Server
search queries, the other important limitation is the lack of flexibility to
influence the relevance. It is possible to use a weight operator to give
more or less importance to keywords, but it is not possible to do so for
fields. It means it is not possible natively to give more weight to a match
in the title of a document than a match in its content. It is the same
problem for the weighting from geographical position. In both case, it would
be possible to try to tweak the results with some kind of multiplication
factor to the rank results, but it would be rather fragile, skewing the
results in non-coherent ways and probably not offer good performance.
...
Many products, both open source and paid products are available the need of
a centralized, external index. The Apache Software Foundation offers one
that seems to stand out, Solr, based on the Lucene search engine. It is a
standalone server, that can be interacted with through HTTP queries. There
is also a C# client, with NHibernate integration through NHibernate Search
available."

I offer the general public, and especially those at microsoft, the chance to
mitigate that argumentation or to add to it. Either way I don't care
personally, but I don't want us choosing the solution for the wrong reasons.
We are both a little bit biased...

Hilary Cotter

unread,

Jan 26, 2010, 9:50:37 PM1/26/10

answers inline.

"Remi Blanchette" <remi_bl...@hotmail.com> wrote in message
news:3E32C724-C451-4D21...@microsoft.com...

> People,
> we are currently investigating, prototyping and comparing the architecture
> of a full text search capability in a project. Our project has many data
> repository, one of which is SQL server 2008. My personal vision for FTS
> support was to use whatever was available at the repository level. As
> such, for our SQL server database, we would have used SQL server FTS
> without even looking into it.
>
> It is argued internally however that this selection is a poor candidate
> even for the SQL server repository. Which I definitely sound weird...
>
> Here is my developer arguments:
> "This product (sql server FTS) is lacking in term of the features we need
> to properly rank search results. Even if we only had only one repository,
> the limitations of Microsoft SQL Server Full-Text Search would make it a
> poor candidate to have proper ranking.

I have gone back and forth between recommending lucene over SQL FTS. You
really need to make your decision based on your goals and test to see what
works best for you.

Basically for tight integration with SQL Server where you need close to real
time indexing and to join relational queries against full-text queries you
can't beat SQL FTS.

Now if you use lucene you have more features and a richer querying language.
You have hit high-lighting, but you have to roll your own indexing mechanism
and build some sort of transactional features to the index. Consider your
index getting corrupt while indexing 100 docs per second. How to do roll
back to a previous version of the index. How to you index 1 billion rows
quickly should you need to reinidex. These are difficult problems to get
around in lucene, but part of the engine in SQL FTS.

> ...
> It cannot be used to rank results for different kind of repository
> (databases), it cannot even be used across tables in the same database,
> since the number is normalized according to the number of occurrences of
> the search terms in the table.

ok, I don't really understand the point here. If this is important to you
you can consolidate all the data in a single table or index an index view.
With a distributed repostory you have to worry about a synchronization
problem. For example if you index 20 document stores you have to worry that
all 20 document stores are complete otherwise you will have dead links.

> ...
> In addition to the problems with the aggregation of ranks in SQL Server
> search queries, the other important limitation is the lack of flexibility
> to influence the relevance. It is possible to use a weight operator to
> give more or less importance to keywords, but it is not possible to do so
> for fields. It means it is not possible natively to give more weight to a
> match in the title of a document than a match in its content. It is the
> same problem for the weighting from geographical position. In both case,
> it would be possible to try to tweak the results with some kind of
> multiplication factor to the rank results, but it would be rather fragile,
> skewing the results in non-coherent ways and probably not offer good
> performance.
> ...

As I said previously the querying language of lucene is more feature rich
than SQL FTS. If this is important to you this makes the decision an easy
one.

However there are problems doing date sorting and searching in lucene and
you need to code in Java - I am not sure if this is important to you. The
date issues may have been solved by now - I haven't worked in a large
lucene search deployment in over a year now.

> Many products, both open source and paid products are available the need
> of a centralized, external index.

This adds another point of failure you have to work around. This may or may
not be a problem for you.

> The Apache Software Foundation offers one that seems to stand out, Solr,
> based on the Lucene search engine. It is a standalone server, that can be
> interacted with through HTTP queries. There is also a C# client, with
> NHibernate integration through NHibernate Search available."

Solr is a wrapper around lucene. Basically you have to look at what SQL FTS
delivers and then what Lucene delivers and then build your own search engine
around this, or use what ships in Solr.

>
> I offer the general public, and especially those at microsoft, the chance
> to mitigate that argumentation or to add to it. Either way I don't care
> personally, but I don't want us choosing the solution for the wrong
> reasons. We are both a little bit biased...

You have to define your goals carefully and test to see which solution
offers the best solution. Both solutions will involve considerable coding
effort.

I should mention that the search algorithim used by SQL FTS cleans up at
trec. Search is commodotized today, most search vendors are trying to gain
market share by adding features. When you make your selection it should not
be driven by search itself, but rather the code effort to support your
solution and the features provided.

>
>
>
>

pauliom

unread,

Aug 23, 2011, 8:26:06 AM8/23/11

This is something I'm looking at too. I thought I'd add my thoughts into the pot.

First off, there is lucene.net. So it is possible to avoid Java, but is it a mature enough port? It's something that worries me.

As for FTS (Full Text Search) not running across databases then I'm looking at accepting that searching uses stale data. Once I accept that I can look at constructing more of a publish model. If I have x databases then I can create a (conceptually) single database that consolodates data into one or more views that I wish to search on. Although this creates comms problems it does provide adv. when it comes to CDN (Content Delivery Network).

I'm no nearer selecting one, as both have pro's/con's, but hopefully this can further the discussion.

> On Tuesday, January 26, 2010 2:43 PM Remi Blanchette wrote:

> People,
> we are currently investigating, prototyping and comparing the architecture
> of a full text search capability in a project. Our project has many data
> repository, one of which is SQL server 2008. My personal vision for FTS
> support was to use whatever was available at the repository level. As such,
> for our SQL server database, we would have used SQL server FTS without even
> looking into it.
>
> It is argued internally however that this selection is a poor candidate even
> for the SQL server repository. Which I definitely sound weird...
>
> Here is my developer arguments:
> "This product (sql server FTS) is lacking in term of the features we need to
> properly rank search results. Even if we only had only one repository, the
> limitations of Microsoft SQL Server Full-Text Search would make it a poor
> candidate to have proper ranking.

> ...
> In addition to the problems with the aggregation of ranks in SQL Server
> search queries, the other important limitation is the lack of flexibility to
> influence the relevance. It is possible to use a weight operator to give
> more or less importance to keywords, but it is not possible to do so for
> fields. It means it is not possible natively to give more weight to a match
> in the title of a document than a match in its content. It is the same
> problem for the weighting from geographical position. In both case, it would
> be possible to try to tweak the results with some kind of multiplication
> factor to the rank results, but it would be rather fragile, skewing the
> results in non-coherent ways and probably not offer good performance.
> ...

> Many products, both open source and paid products are available the need of

> a centralized, external index. The Apache Software Foundation offers one

> that seems to stand out, Solr, based on the Lucene search engine. It is a
> standalone server, that can be interacted with through HTTP queries. There
> is also a C# client, with NHibernate integration through NHibernate Search
> available."
>

> I offer the general public, and especially those at microsoft, the chance to

> mitigate that argumentation or to add to it. Either way I do not care
> personally, but I do not want us choosing the solution for the wrong reasons.

> We are both a little bit biased...

>> On Tuesday, January 26, 2010 9:50 PM Hilary Cotter wrote:

>> answers inline.

>>
>>
>> I have gone back and forth between recommending lucene over SQL FTS. You
>> really need to make your decision based on your goals and test to see what
>> works best for you.
>>
>> Basically for tight integration with SQL Server where you need close to real
>> time indexing and to join relational queries against full-text queries you

>> cannot beat SQL FTS.

>>
>> Now if you use lucene you have more features and a richer querying language.
>> You have hit high-lighting, but you have to roll your own indexing mechanism
>> and build some sort of transactional features to the index. Consider your
>> index getting corrupt while indexing 100 docs per second. How to do roll
>> back to a previous version of the index. How to you index 1 billion rows
>> quickly should you need to reinidex. These are difficult problems to get
>> around in lucene, but part of the engine in SQL FTS.
>>
>>

>> ok, I do not really understand the point here. If this is important to you

>> you can consolidate all the data in a single table or index an index view.
>> With a distributed repostory you have to worry about a synchronization
>> problem. For example if you index 20 document stores you have to worry that
>> all 20 document stores are complete otherwise you will have dead links.
>>
>>

>> As I said previously the querying language of lucene is more feature rich
>> than SQL FTS. If this is important to you this makes the decision an easy
>> one.
>>
>> However there are problems doing date sorting and searching in lucene and
>> you need to code in Java - I am not sure if this is important to you. The

>> date issues may have been solved by now - I have not worked in a large

>> lucene search deployment in over a year now.
>>
>>

>> This adds another point of failure you have to work around. This may or may
>> not be a problem for you.
>>
>>

>> Solr is a wrapper around lucene. Basically you have to look at what SQL FTS
>> delivers and then what Lucene delivers and then build your own search engine
>> around this, or use what ships in Solr.
>>

0 new messages