Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Does MySQL have fuzzy matching or no?

3,318 views
Skip to first unread message

bit-n...@hotmail.com

unread,
Dec 15, 2017, 7:22:48 AM12/15/17
to
I want to know - if there is a string, say, "poop" in a field in a table, and I want to match it against the string "doop", and have the search return successful - can I? How, please?


Thanks.

Axel Schwenke

unread,
Dec 15, 2017, 8:39:53 AM12/15/17
to
On 15.12.2017 13:22, bit-n...@hotmail.com wrote:
> I want to know - if there is a string, say, "poop" in a field in a table, and I want to match it against the string "doop", and have the search return successful - can I? How, please?

MySQL 5.7 now supports pluggable parsers for fulltext search. One is the
n-gram parser that breaks field and search phrase into n-grams. A match on
such a fulltext index would then find similar texts, too. In your case
"doop" would match the trigram "oop" or the bigrams "oo" and "op" in "poop".

https://mysqlserverteam.com/innodb-full-text-n-gram-parser/

The Natural Philosopher

unread,
Dec 15, 2017, 9:46:01 AM12/15/17
to
old fashioned [LIKE '%oop'] ??


--
"When a true genius appears in the world, you may know him by this sign,
that the dunces are all in confederacy against him."

Jonathan Swift.

Axel Schwenke

unread,
Dec 15, 2017, 10:50:38 AM12/15/17
to
On 15.12.2017 15:45, The Natural Philosopher wrote:
> On 15/12/17 13:39, Axel Schwenke wrote:
>> On 15.12.2017 13:22, bit-n...@hotmail.com wrote:
>>> I want to know - if there is a string, say, "poop" in a field in a table,
>>> and I want to match it against the string "doop", and have the search
>>> return successful - can I? How, please?
>>
>> MySQL 5.7 now supports pluggable parsers for fulltext search. One is the
>> n-gram parser that breaks field and search phrase into n-grams. A match on
>> such a fulltext index would then find similar texts, too. In your case
>> "doop" would match the trigram "oop" or the bigrams "oo" and "op" in "poop".
>>
>> https://mysqlserverteam.com/innodb-full-text-n-gram-parser/
>>
> old fashioned [LIKE '%oop'] ??

It's not the same. Not even near.

Come up with the LIKE clauses for finding a trigram match for the search
phrase "naturalistic philosophy". And don't forget to calculate a rank(!)
for all the matches.

Lew Pitcher

unread,
Dec 15, 2017, 11:23:04 AM12/15/17
to
If you are looking at strings that might sound similar if pronounced, then
you could use the SOUNDEX() function to match to a pre-determined value, or
the SOUNDS LIKE operator (which is, effectively, SOUNDEX(a) = SOUNDEX(b) )
to match to something else.
SELECT * FROM table
WHERE field1 SOUNDS LIKE field2 ;

If, OTOH, you are looking at strings that might be spelled similarly, you
could use a REGEXP() function to match to variations in character sequence.
SELECT * FROM table
WHERE field1 REGEXP '[pd]oop' ;

If you are looking for substrings within freeform strings, you can MATCH
AGAINST IN NATURAL LANGUAGE MODE.
SELECT * FROM table
WHERE MATCH (field1) AGAINST ('poop' IN NATURAL LANGUAGE MODE) ;

--
Lew Pitcher
"In Skills, We Trust"
PGP public key available upon request

Lew Pitcher

unread,
Dec 15, 2017, 12:46:41 PM12/15/17
to
Lew Pitcher wrote:

> bit-n...@hotmail.com wrote:
>
>> I want to know - if there is a string, say, "poop" in a field in a table,
>> and I want to match it against the string "doop", and have the search
>> return successful - can I? How, please?
>
> If you are looking at strings that might sound similar if pronounced, then
> you could use the SOUNDEX() function to match to a pre-determined value,
> or the SOUNDS LIKE operator (which is, effectively, SOUNDEX(a) =
> SOUNDEX(b) ) to match to something else.
> SELECT * FROM table
> WHERE field1 SOUNDS LIKE field2 ;

As in
SELECT * FROM table
WHERE field1 SOUNDS LIKE 'doop';

Luuk

unread,
Dec 15, 2017, 3:20:30 PM12/15/17
to
On 15-12-17 18:46, Lew Pitcher wrote:
> Lew Pitcher wrote:
>
>> bit-n...@hotmail.com wrote:
>>
>>> I want to know - if there is a string, say, "poop" in a field in a table,
>>> and I want to match it against the string "doop", and have the search
>>> return successful - can I? How, please?
>>
>> If you are looking at strings that might sound similar if pronounced, then
>> you could use the SOUNDEX() function to match to a pre-determined value,
>> or the SOUNDS LIKE operator (which is, effectively, SOUNDEX(a) =
>> SOUNDEX(b) ) to match to something else.
>> SELECT * FROM table
>> WHERE field1 SOUNDS LIKE field2 ;
>
> As in
> SELECT * FROM table
> WHERE field1 SOUNDS LIKE 'doop';
>

What am i missing?

[root@test]> select * from test15;
+------+
| test |
+------+
| doop |
| pood |
| poop |
+------+
3 rows in set (0.00 sec)

[root@test]> select * from test15 where test sounds like 'doop';
+------+
| test |
+------+
| doop |
+------+
1 row in set (0.00 sec)
[root@test]> show create table test15;
+--------+-----------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table
|
+--------+-----------------------------------------------------------------------------------------------------------------------------+
| test15 | CREATE TABLE `test15` (
`test` varchar(20) DEFAULT NULL,
FULLTEXT KEY `t` (`test`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+--------+-----------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)


Axel Schwenke

unread,
Dec 15, 2017, 5:45:06 PM12/15/17
to
On 15.12.2017 21:20, Luuk wrote:
> On 15-12-17 18:46, Lew Pitcher wrote:

>>> If you are looking at strings that might sound similar if pronounced, then
>>> you could use the SOUNDEX() function to match to a pre-determined value

> What am i missing?
>
> [root@test]> select * from test15;
> +------+
> | test |
> +------+
> | doop |
> | pood |
> | poop |
> +------+
> 3 rows in set (0.00 sec)
>
> [root@test]> select * from test15 where test sounds like 'doop';
> +------+
> | test |
> +------+
> | doop |
> +------+

Nothing.

SOUNDEX() is pretty poor when it comes to generic fuzzy matching. It works
only for phonetically similar words (like "Meyer" vs. "Mayer"), only for
English and even then it yields ... interesting ... results from time to time.

N-gram matching on the other hand is a very nice tool to find similar texts
- especially when those texts are rather long. I had big success with it
when I consolidated my joke collection (collected from various sources
around the - then new - internet). Back those days I wrote a tool in C
(after having read an article about trigram matching in a computer mag) and
it helped excellently in weeding out (near-)duplicates.

Oracle added an n-gram parser to the FULLTEXT indexer for completely
different reasons - to support languages that don't have the concept of
separation characters in written text. But it will work nicely for fuzzy
matching in western languages, too.

bit-n...@hotmail.com

unread,
Dec 16, 2017, 9:07:11 AM12/16/17
to
What I gave was just an example - it's not actually ONLY "doop" that must be returned, obviously.
As in Luuk's table, I would like it to return ALL of "doop" and "pood" as well as "oop".

"SOUNDS LIKE" sounds like it's useless :)

I have no idea what an "n-gram" is - is it hard to set up?

Luuk

unread,
Dec 16, 2017, 9:36:30 AM12/16/17
to
Ok, so writing a function which will give a higher result on better
matches is maybe a good way to solve this.

Mymatch(match, against)
returnvalue:
The sum of
1, or every matching character in the right place,
0.4, for every charachter matching in the wrong place.


This function should ('almost') do that:
DROP FUNCTION IF EXISTS MyMatch;
delimiter //
CREATE FUNCTION MyMatch (v_match TEXT, v_against TEXT)
RETURNS real
BEGIN
DECLARE v_score REAL;
DECLARE v_index INT;

SET v_score = 0;

SET v_index = 1;
WHILE v_index <= LENGTH(v_against) DO
IF substr(v_match,v_index,1) = substr(v_against,v_index,1) THEN
SET v_score = v_score + 1.0;
ELSE
IF LOCATE(substr(v_match,v_index,1),
concat(substr(v_against,1,v_index-1),substr(v_against,v_index+1)))>0 THEN
SET v_score = v_score + 0.4;
END IF;
END IF;

SET v_index = v_index + 1;
END WHILE;

RETURN v_score;

END//
delimiter ;
SELECT Mymatch('door','poop'), 2, '';
SELECT Mymatch('poop','poop'), 4, '';
SELECT Mymatch('poop','poor'), 3, 'wrong result 3.4';
SELECT Mymatch('roop','poor'), 2.8, '';
SELECT Mymatch('rood','poor'), 2.4, '';







0 new messages