Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

CONTAINS and non-wildcard asterisk

8 views
Skip to first unread message

Johnny Nielsen

unread,
Aug 1, 2004, 4:21:32โ€ฏAM8/1/04
to
To quote the MS SQL Server 2000's online book on CONTAINS:

"If the text and asterisk are not delimited by double quotation marks, as in
CONTAINS (column, 'text*'), full-text search considers the asterisk as a
character and will search for exact matches to text*."

Can anybody confirm the above to be true ?

It is my experience that MS documentation in general - and in particular
their examples - contains a lot of errors, but here they specifically
mention that is should be possible to full-text search for 'text*' not
counting the asterisk as a wildcard. When I try to full-text search for
'text*' using CONTAINS the result includes records containing 'text' without
the asterisk.


Best regards
Johnny Nielsen


Hilary Cotter

unread,
Aug 1, 2004, 8:47:11โ€ฏAM8/1/04
to
This is correct for a single search argument in your search phrase. The
behavior changes if you have multiple search arguements in your search
phrase.

With double quotation marks wrapping around your search phrase it will
wildcard, so results will match the prefix or beginnings (what comes before
the *), but will have different suffixes or endings, and it will match with
words or tokens in your table that has an * at the end. The * will be
applied to all search arguments.

so a search on "al anon*" will match with al anon, alcoholics anonymous,
allan anont, but not and mike anon*.

If you search without the double quotation marks and you have multiple
search arguments

ie

select *from authors where contains(*,'al anon*')

your query will bomb with

Server: Msg 7631, Level 15, State 1, Line 1
Syntax error occurred near 'anon*'. Expected ''''' in search condition 'al
anon*'.

If you query on a single word or token with no double quotation marks you
will get not wildcarding, but will get an exact match to the anon, and
anon*.

If you query on a single word or token and wrap it in double quotation marks
you will get wildcarding and it will match with anon and anon*.

So for a single search word or token in your search phrase you need to wrap
your search phrase in double quotes to get wildcarding.

For multiple search arguments, words or tokens in your search phrase you
need to wrap your search phrase in double quotes to get wildcarding, but all
search arguments will be wildcarded and you will get matches to tokens or
words in your table ending with the * and matching the search argument
exactly.

In FreeText the * will be ignored, but will be return hits to matches to the
search arguments which are stemmed or unstemmed, and end in an * and don't
end with the *.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message
news:unzSUC6d...@tk2msftngp13.phx.gbl...

Johnny Nielsen

unread,
Aug 1, 2004, 3:12:46โ€ฏPM8/1/04
to
"Hilary Cotter" <hil...@att.net> wrote in message
news:Oz0AhW8d...@TK2MSFTNGP12.phx.gbl...

> This is correct for a single search argument in your search phrase.

So, just to be sure I understand you correctly - if you perform a SELECT *
FROM SOMETABLE WHERE CONTAINS(*,'test*') you will only match words
containing the five characters 'test*' and not the four-character word
'test' ?

On my system CONTAINS(*,'test*') and CONTAINS(*,' "test*" ') seems to return
the same results (it treats the asterisk in 'test*' as a wildcard which is
not what I want).


> The behavior changes if you have multiple search arguements in your search
phrase.


Can the behavior of the single-quoted simple term 'test*' change by joining
it to other arguments ?

I know the meaning of double-quotes in prefix terms, but I'm only interested
in the single-quoted non-prefix cases (I'm writing a search-condition parser
which is why the specific workings are important to me).


Best regards
Johnny Nielsen

Hilary Cotter

unread,
Aug 1, 2004, 3:59:24โ€ฏPM8/1/04
to
No, I am saying Select * from tablename where contains(*,'test*') will
return hits to test and test*.

With Select * from tablename where contains(*,'"test*"') I will get hits to
test, test*, and tested as it is now wildcarding

You can't really change the behavior of the wildcarding by joining it to
other clauses, unless perhaps you do a join to a like that looks like this

like '% test %'

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message

news:OXHRNu$dEHA...@TK2MSFTNGP10.phx.gbl...

Johnny Nielsen

unread,
Aug 1, 2004, 5:48:02โ€ฏPM8/1/04
to
"Hilary Cotter" <hil...@att.net> wrote in message
news:%23I4gCIA...@TK2MSFTNGP11.phx.gbl...

> No, I am saying Select * from tablename where contains(*,'test*') will
> return hits to test and test*.

I'm sorry Hillary, but now I'm really confused. According to the MS
documentation that you seem to agree upon the second asterisk in
CONTAINS(*,'test*') isn't acting as a wildcard but is a part of the word
searched for.

Then how come it will match the word test without an appended asterisk ?

That would be similar to CONTAINS(*,'server') matching the word serve
without the appended 'r'.


> With Select * from tablename where contains(*,'"test*"') I will get hits
to
> test, test*, and tested as it is now wildcarding

I agree and I have now problem understanding the functionality of the
double-quotes - they seem to work as expected :-)

It is only the functionality of the single-quoted cases that troubles me.


> You can't really change the behavior of the wildcarding by joining it to
> other clauses, unless perhaps you do a join to a like that looks like this
>
> like '% test %'

OK, I just tried to find out what you meant by "the behavior changes if you
have multiple search arguements in your search phrase" .. you didn't mean
that the behavior of the single-quoted term then.


PS: This has nothing to do with the current subject, but do you happen to
know the valid range of full-text searchable unicode characters ? I can't
seem to find a clear definition in the MS documentation .. not even a
definition of 'punctuation' symbols.


Best regards
Johnny Nielsen

Hilary Cotter

unread,
Aug 2, 2004, 9:58:29โ€ฏAM8/2/04
to
Yes that is correct. The subtlety is that non alphanumeric characters with
some exceptions are treated as white space.

So test and test* are both indexed as test. So a search on test will return


hits to test and test*.

All text (even ascii) is indexed as its unicode equivalent. Its punctuation
marks which are treated as white space and you get this problem with.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message

news:u5Jc8EBe...@TK2MSFTNGP09.phx.gbl...

Johnny Nielsen

unread,
Aug 2, 2004, 12:39:33โ€ฏPM8/2/04
to
"Hilary Cotter" <hil...@att.net> wrote in message
news:%23itsZkJ...@TK2MSFTNGP12.phx.gbl...

> Yes that is correct. The subtlety is that non alphanumeric characters with
> some exceptions are treated as white space.

..but as I quoted from the MS online books the asterisk is *not* treated as
white space when used inside single-quotes - that's my point. To repeat the
MS statement from the online books, that I originally questioned: [If the


text and asterisk are not delimited by double quotation marks, as in
CONTAINS (column, 'text*'), full-text search considers the asterisk as a

character and will search for exact matches to text*.]


> So test and test* are both indexed as test. So a search on test will
return
> hits to test and test*.

Well, according to MS the asterisk in single-quotes is treated as a
character and therefore test and test* should be considered two different
words. That is, contains(*,'test*') should never match the word test.


> All text (even ascii) is indexed as its unicode equivalent. Its
punctuation
> marks which are treated as white space and you get this problem with.

Sadly this is rather an additional problem. Building a parser is hard job
when the precise definition of what is considered punctuation symbols and
indexable characters isn't available to the public :-(

Hilary Cotter

unread,
Aug 2, 2004, 3:49:25โ€ฏPM8/2/04
to
According to my tests with the neutral and US Englishword breaker the * is
treated as white space when you have a single search argument and you
wildcard it, and you wrap it in single quotes.

In other words a search on test will return the same hits on a search on
test*.

If you wrap the phrase with double quotes you will get wildcarding, but it
will still match with test*.

Think of it this way. While indexing when MSSearch sees non white space, non
alphanumeric characters it strips them out and indexes the alphanumeric
characters.

So test* is indexed as test and is indistinguishable from test.

When you search on test* the * is stripped off and the search is performed
on test.

Confusing I know.

HTH


--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message

news:%236Y8P9K...@tk2msftngp13.phx.gbl...

Johnny Nielsen

unread,
Aug 2, 2004, 4:44:56โ€ฏPM8/2/04
to
OK, thanks for spending your time Hilary.

Having my observations confirmed I now know how not to proceed :-)

I really should stop trusting those MS examples...


Best regards
Johnny*Nielsen


"Hilary Cotter" <hil...@att.net> wrote in message

news:O$0froMeE...@TK2MSFTNGP12.phx.gbl...

Hilary Cotter

unread,
Aug 2, 2004, 9:02:03โ€ฏPM8/2/04
to
just to clarify further you get the same result searching on test* as you
get searching on test!, or test#.

Here is another stupid SQL FTS trick

try this

select * from tablename where contains(*,'#test*$')

or

select * from tablename where contains(*,'$test*$')

or

select * from tablename where contains(*,'"test*"')

Whatever delimiter you use it still works, even if you mix the delimiter
types.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message

news:uqCdXGNe...@TK2MSFTNGP10.phx.gbl...

John Kane

unread,
Aug 2, 2004, 10:50:04โ€ฏPM8/2/04
to
Hilary,
An interesting deduction... Did you test on both Win2K and Win2003 (or
WinXP) as there are different OS-supplied wordbreakers for these OS
platforms that might result in different indexing strategies.

Johnny, what is the OS platform that your SQL Server (select @@version) is
installed on?

Regards,
John


"Hilary Cotter" <hil...@att.net> wrote in message

news:O$0froMeE...@TK2MSFTNGP12.phx.gbl...

Johnny Nielsen

unread,
Aug 3, 2004, 5:24:45โ€ฏAM8/3/04
to
I have only tested the CONTAINS clause on a Windows Server 2003 (SQL 2000
SP3a).

In the absense of proper ducumentation I have decided to use the .NET
frameworks char.IsLetterOrDigit to determine if a character is searchable.
But having found a '$' in the noise file I'm not sure if this is the right
approach...


Best regards
Johnny Nielsen

"John Kane" <jt-...@comcast.net> wrote in message
news:OHtNUSQe...@TK2MSFTNGP09.phx.gbl...

Johnny Nielsen

unread,
Aug 3, 2004, 6:18:05โ€ฏPM8/3/04
to
Actually you have to be careful choosing the delimiter.

I have just discovered that '|' and '&' in a CONTAINS clause is treated as
OR and AND operators (undocumented of cause). And if you think that was to
be expected, then guess what '~' does ... no, it isn't the NOT operator ;-)

The fact that the Books Online isn't to be taken too seriously raises some
thoughts as to how well protected all those SQL/MSSearch based applications
out there really is when it comes to SQL-injection attacks or the developers
imagination of how things are working...


Best regards
Johnny Nielsen

"Hilary Cotter" <hil...@att.net> wrote in message

news:%23RRH1VP...@TK2MSFTNGP10.phx.gbl...

Johnny Nielsen

unread,
Aug 3, 2004, 7:05:26โ€ฏPM8/3/04
to
Be VERY carefull in choosing the delimiters. The CONTAINS clause seems to
react on a lot of special function characters.

The '!' operator in CONTAINS(*,'!something') will result in a syntax error
if the word 'something' exists. I'm not sure what '!' does excatly but it
probably has some use.


Best regards
Johnny Nielsen


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message

news:%23wo4Efa...@TK2MSFTNGP11.phx.gbl...

Hilary Cotter

unread,
Aug 3, 2004, 10:39:46โ€ฏPM8/3/04
to
!, &,| are all well known operators in SQL FTS, although undocumented.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Johnny Nielsen" <DONT.WRITE....@megabit.dk> wrote in message

news:e7aqh5ae...@TK2MSFTNGP11.phx.gbl...

John Kane

unread,
Aug 3, 2004, 10:38:16โ€ฏPM8/3/04
to
Johnny,
You can always remove the $ 'dollar sign' from the noise.enu (US_English)
file and then run a full population and see if you get different results.
Note, using the .NET framework's char.IsLetterOrDigit to determine if a
character is searchable is a very good idea!

Regards,
John


"Johnny Nielsen" <DONT_WRITE_...@megabit.dk> wrote in message
news:#F121uTe...@TK2MSFTNGP10.phx.gbl...

Johnny Nielsen

unread,
Aug 4, 2004, 2:51:43โ€ฏAM8/4/04
to
Somehow 'well known' and 'undocumented' doesn't go hand in hand very well
:-)

What DOES the '!' do ? (or maybe - if the operators is that well known - you
could tell me, which operators the CONTAINS clause supports beside the few
mentioned in the Books Online ?)


Best regards
Johnny Nielsen

"Hilary Cotter" <hil...@att.net> wrote in message

news:%237dQRxc...@tk2msftngp13.phx.gbl...

Hilary Cotter

unread,
Aug 4, 2004, 12:25:47โ€ฏPM8/4/04
to
Ummm, perhaps I mispoke. &, |, ~(near), ! (NOT) are Boolean operators
available in other search products.

You can use ! in conjunction with AND or & (not that NOT, OR, and AND are
not case sensitive), but not | or OR.

check out the following links for more info

http://members.microsoft.com/partner/help/searchtips.aspx

http://www.microsoft.com/windows2000/en/advanced/help/default.asp?url=/windows2000/en/advanced/help/ismain-concepts_64.htm

discussion of ~ as a symbol for NEAR

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/acdata/ac_8_qd_15_0kdw.asp

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html

"Johnny Nielsen" <DONT_WRITE_...@megabit.dk> wrote in message

news:u9zG69ee...@TK2MSFTNGP09.phx.gbl...

Johnny Nielsen

unread,
Aug 4, 2004, 2:28:59โ€ฏPM8/4/04
to
It turned out that $ is interpreted as punctuation.

The char.IsLetterOrDigit seems to do a fair job, but it isn't bulletprof. An
example: the searchable unicode Currency Sign 'ยค' = U+00A4 (decimal 164)
evaluates to false by char.IsLetterOrDigit.


Best regards
Johnny Nielsen


"John Kane" <jt-...@comcast.net> wrote in message

news:u2%239Ywce...@TK2MSFTNGP10.phx.gbl...

0 new messages