Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

SQL Server Full-Text search problem in XML

1 view
Skip to first unread message

Pjotr

unread,
Jul 18, 2003, 4:42:16 AM7/18/03
to
Hello,

I have a problem with a SQL server full-text search on a
ntext field containing XML data. The XML Data is something
like this:

<COURSES>
<COURSE>
<NAME>Functiepunt Analyse Workshop</NAME>
<INSTITUTE>B-Company</INSTITUTE>
<FINISHED>2003</FINISHED>
</COURSE>
<COURSE>
<NAME>Microsoft TechEd Europe 2002</NAME>
<INSTITUTE>Microsoft</INSTITUTE>
<FINISHED>2002</FINISHED>
</COURSE>
</COURSES>

When i search for "TechEd" this record is selected.
When i search for "Workshop" this record is not selected.
When i search for "Workshop*" this record is selected.

I know SQL Server Full-text search uses a word breaker and
i think it uses a space as a delimiter for individual
words. The text "Workshop" is concatenated in the XML with
the Tag: "</NAME>". Might this be the reason why it is not
selected when i search for "Workshop"? If so does anybody
know a work-a-round for this problem?

Thanks in advance!

Peter

John Kane

unread,
Jul 20, 2003, 11:06:13 AM7/20/03
to
Pjotr,
Could you provide the SQL Server version as well as the OS platform
information via SELECT @@version ?
Both the SQL Server version and OS platform (wordbreaker dll) information
are very important in troubleshooting SQL FTS issues.

Also, if you are using SQL Server 2000 are you storing the XML data in an
IMAGE (BLOB) datatype or in a char/varchar/TEXT datatype? SQL FTS does not
use "space as a delimiter", but depending upon your version info and how you
are storing the XML data, "space as a delimiter" can be a factor.

Note, you can also post FTS related questions to the newsgroup:
microsoft.public.sqlserver.fulltext.
Regards,
John

"Pjotr" <peter....@atosorigin.com> wrote in message
news:073501c34d08$7deb5e00$a501...@phx.gbl...

Pjotr

unread,
Jul 21, 2003, 3:01:03 AM7/21/03
to
John and others,

Sorry about not providing the version info. I can imagine
it's important for this issue.

The version is:
Microsoft SQL Server 2000 - 8.00.760 (Intel X86) Dec 17
2002 14:22:05 Copyright (c) 1988-2003 Microsoft
Corporation Enterprise Edition on Windows NT 5.0 (Build
2195: Service Pack 3)

I'm putting the data in a NTEXT datatype field. I don't
know how to retrieve the information about the
wordbreaker.dll.

Regards,
Peter

>.
>

Bill Cheng [MSFT]

unread,
Jul 21, 2003, 5:21:13 AM7/21/03
to
Hi,

I use the following table to reproduce the problem on Windows 2000 + SQL
Server 2000 SP3.
create table testftsxml (c1 int identity, descrip ntext)

insert testftsxml values(' <COURSES>


<COURSE>
<NAME>Functiepunt Analyse Workshop</NAME>
<INSTITUTE>B-Company</INSTITUTE>
<FINISHED>2003</FINISHED>
</COURSE>
<COURSE>
<NAME>Microsoft TechEd Europe 2002</NAME>
<INSTITUTE>Microsoft</INSTITUTE>
<FINISHED>2002</FINISHED>
</COURSE>
</COURSES>

')

insert testftsxml values('2 values')

After I fulltext the table, I can use the following query to select the
first record.
select * from testftsxml where contains(descrip, 'Workshop')

Could you perform the same steps on your side and check the result, so that
we can work on the same data?


Bill Cheng
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.
--------------------
| Content-Class: urn:content-classes:message
| From: "Pjotr" <peter....@atosorigin.com>
| Sender: "Pjotr" <peter....@atosorigin.com>
| References: <073501c34d08$7deb5e00$a501...@phx.gbl>
<#zVD2BtT...@TK2MSFTNGP12.phx.gbl>
| Subject: Re: SQL Server Full-Text search problem in XML
| Date: Mon, 21 Jul 2003 00:01:03 -0700
| Lines: 92
| Message-ID: <043201c34f55$d94e55f0$a501...@phx.gbl>
| MIME-Version: 1.0
| Content-Type: text/plain;
| charset="iso-8859-1"
| Content-Transfer-Encoding: 7bit
| X-Newsreader: Microsoft CDO for Windows 2000
| X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4910.0300
| Thread-Index: AcNPVdlOMefK63Y+TXKupFFbaK20sg==
| Newsgroups: microsoft.public.sqlserver.xml
| Path: cpmsftngxa06.phx.gbl
| Xref: cpmsftngxa06.phx.gbl microsoft.public.sqlserver.xml:19289
| NNTP-Posting-Host: TK2MSFTNGXA13 10.40.1.165
| X-Tomcat-NG: microsoft.public.sqlserver.xml

John Kane

unread,
Jul 21, 2003, 6:57:02 PM7/21/03
to
Bill,
Of course, you're able to repo! Note, he's NOT using an IMAGE datatype and
therefore NOT using an installed XML IFiler from either MS or from 3rd party
providers. You might also want to look at the replies in the fulltext
newsgroup too <G>

John (jkane)

"Bill Cheng [MSFT]" <bill...@online.microsoft.com> wrote in message
news:q85Uxl2T...@cpmsftngxa06.phx.gbl...

Bill Cheng [MSFT]

unread,
Jul 21, 2003, 10:10:52 PM7/21/03
to
Hi John,

I am not sure how you reproduced the problem on your side. Could you try my
steps and let me know the results?


Bill Cheng
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.
--------------------

| From: "John Kane" <jt-...@comcast.net>
| References: <073501c34d08$7deb5e00$a501...@phx.gbl>
<#zVD2BtT...@TK2MSFTNGP12.phx.gbl>
<043201c34f55$d94e55f0$a501...@phx.gbl>
<q85Uxl2T...@cpmsftngxa06.phx.gbl>


| Subject: Re: SQL Server Full-Text search problem in XML

| Date: Mon, 21 Jul 2003 15:57:02 -0700
| Lines: 167
| X-Priority: 3
| X-MSMail-Priority: Normal
| X-Newsreader: Microsoft Outlook Express 6.00.3790.0
| X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.0
| Message-ID: <OJJamt9T...@TK2MSFTNGP11.phx.gbl>
| Newsgroups: microsoft.public.sqlserver.xml
| NNTP-Posting-Host: 12-228-206-224.client.attbi.com 12.228.206.224
| Path: cpmsftngxa06.phx.gbl!TK2MSFTNGP08.phx.gbl!TK2MSFTNGP11.phx.gbl
| Xref: cpmsftngxa06.phx.gbl microsoft.public.sqlserver.xml:19307
| X-Tomcat-NG: microsoft.public.sqlserver.xml

Pjotr

unread,
Jul 22, 2003, 6:01:52 AM7/22/03
to
Hello,

I've tried it out and followed the steps on 4 different
machines. All have the same configuration according to the
results of the: "SELECT @@Version" statement.

On two of them i find no result on 2 of them i find
results. I have used this XML
string: '<COURSES><COURSE><NAME>Functiepunt Analyse

Workshop</NAME><INSTITUTE>B-
Company</INSTITUTE><FINISHED>2003</FINISHED></COURSE><COURS

E><NAME>TechEd Europe

2002</NAME><INSTITUTE>Microsoft</INSTITUTE><FINISHED>2002</
FINISHED></COURSE></COURSES>'

Weird isn't it?

On the two machines that do not work i only find results
when i search for 'Analyse'.

Regards,
Peter

>.
>

Bill Cheng [MSFT]

unread,
Jul 22, 2003, 7:55:18 AM7/22/03
to
Hi,

It is really strange. It seems that the problem does not happen on all
computers? Am I right? Could you post the difference between
normally-working and abnormally-working system, e.g. OS?

What if you install a new SQL Server 2000 system to test?


Bill Cheng
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.
--------------------
| Content-Class: urn:content-classes:message
| From: "Pjotr" <peter....@atosorigin.com>
| Sender: "Pjotr" <peter....@atosorigin.com>
| References: <073501c34d08$7deb5e00$a501...@phx.gbl>
<#zVD2BtT...@TK2MSFTNGP12.phx.gbl>

<043201c34f55$d94e55f0$a501...@phx.gbl>
<q85Uxl2T...@cpmsftngxa06.phx.gbl>
<OJJamt9T...@TK2MSFTNGP11.phx.gbl>
<9MJO#Z$TDHA...@cpmsftngxa06.phx.gbl>


| Subject: Re: SQL Server Full-Text search problem in XML

| Date: Tue, 22 Jul 2003 03:01:52 -0700
| Lines: 269
| Message-ID: <080101c35038$4634ec50$a501...@phx.gbl>


| MIME-Version: 1.0
| Content-Type: text/plain;
| charset="iso-8859-1"
| Content-Transfer-Encoding: 7bit
| X-Newsreader: Microsoft CDO for Windows 2000
| X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4910.0300

| Thread-Index: AcNQOEY0SbbxAOYjRmuFqfPGUK07/Q==
| Newsgroups: microsoft.public.sqlserver.xml
| Path: cpmsftngxa06.phx.gbl
| Xref: cpmsftngxa06.phx.gbl microsoft.public.sqlserver.xml:19311

John Kane

unread,
Jul 22, 2003, 5:16:23 PM7/22/03
to
Bill & Peter,
This has almost nothing to do with SQL Server (either version), and more to
do with the OS platform and specifically, the OS platform supplied "word
breaker" dll and for Windows 2000 (Win2K) that dll is infosoft.dll.
Additionally, the "solution" for this issue is NOT to store XML data in a
column defined as NTEXT, such as "create table testftsxml (c1 int identity,
descrip ntext)" and instead to use an IMAGE column and an XML Ifilter. Note,
the MS IFilter can be downloaded from
http://www.microsoft.com/sharepoint/server/techinfo/reskit/xml_filter.asp


Below is Hilary Cotter's post in the fulltext newsgroup on this subject:
===============================================
The short answer is that the word breakers break on white
space at indexing time. In some circumstances they will
apply language specific rules on the words they index and
when on your search phrase.

The word breakers were designed for pure textual data.
The iFilters (which take the blob of data handed back to
them from MSDmn or CiDaemon) send only text to the word
breakers, they strip out tags like <>.

If you chose to circumvent the design intentions of SQL
FTS or Indexing Service or any of the other Microsoft
Search Products you will run into problems like this.

Store your xml docs in columns using the image data type,
and use a document type column of char(3) or char(4) to
contain the document extension - in your case xml or .xml.

Then install the Microsoft XML iFilter, keeping in mind
that to index attributes/elements which have values
greater than 32 characters you need to install OfficeXP
locally on your IS machine.

The XML iFilter will index
* values of sub-elements of the root element when
the sub elements have no child elements.
* attributes of the root element and attributes of
sub-elements of the root element
Consider a sample XML document:
<?xml version="1.0" ?>
<book title="MyBook">
This a sample book
<author>
First Last
<AGE>
20
</AGE>
</author>
<ISBN>
11111111111
</ISBN>
</book>
The attribute Title of the element Book, and ISBN would be
the only values that would be indexed and query able in
this case.

You might want to consider alternatives to MS's XML iFiltr

QuiLogic http://www.quilogic.cc/ifilter.htm XML IFilter
enables crawling of documents containing xml based data.
Our filter technology enables users to search and index
arbitrary structured xml files based on content. Likewise
VISIO or Excel files can be saved as xml files and indexed
with the help of our XML IFilter component.
Microsoft

3 Tier Technology http://www.3tt.com.au. Contact Robert
Stoneman (rsto...@3tt.com.au). 3TT&#8217;s XML iFilter has a
configuration file that allows you to fine tune the
properties you wish to index.

-- My reply to the above:
=================
Hilary, are you pre-assuming that Peter is using the Neutral wordbreaker or
just providing a very simple explanation?

Peter, when you defined the FT Index and FT Catalog on your XML data in a
NTEXT column, what "Language for Word Breaker" did you use? If you don't
know, you can use the system stored procedure sp_help_fulltext_columns to
get this info. Could you post it?

Best Regards,
John


"Bill Cheng [MSFT]" <bill...@online.microsoft.com> wrote in message

news:wgTjfgEU...@cpmsftngxa06.phx.gbl...

Pjotr

unread,
Jul 23, 2003, 6:24:17 AM7/23/03
to
John Hilary & Bill,

On the four machines i've checked the infosoft.dll. They
all are version 5.0.2195.4091.

I also checked the "Language for Word Breaker". On all
machines it is 1033 (English US).

Regards,
Peter


Pjotr

unread,
Jul 23, 2003, 11:38:38 AM7/23/03
to
Hello,

I have switched the "Language for Word Breaker" to
neutral. This solved my problem but not the mystery about
the different behaviour on the four machines.

I found some guidance in an excelent whitepaper about SQL
FTS
(http://download.microsoft.com/download/sqlsvr2000/utility/
2.7/win98mexp/en-us/Full-Text%20Search_Publish.exe):

START QUOTE:
If the data stored in a text column contains special
formatting (such as HTML or XML tags), Microsoft
recommends that you set the language of that column to
neutral.
END QUOTE

To all, thanks a lot for your help. Special thanks to John
Kane. I believe he contributed to this Microsoft
whitepaper.

Regards,
Peter

>.
>

0 new messages