matchfuzzy scoring

69 views
Skip to first unread message

Maxim Kim

unread,
Oct 15, 2020, 2:21:32 AM10/15/20
to vim_dev
Hi, I wonder why "better" candidate has lower score than the others:


tried to search `vimrc` and `.vim/vimrc` has lower score than `.vim/vimrc_colors`.

Maxim Kim

unread,
Oct 15, 2020, 2:28:04 AM10/15/20
to vim_dev
Minimal repro:

let g:testfuzzy = ['vimrc', 'vimrc_colors']
echom g:testfuzzy->matchfuzzy('vimrc')

result:
['vimrc_colors', 'vimrc']


четверг, 15 октября 2020 г. в 09:21:32 UTC+3, Maxim Kim:

Dominique Pellé

unread,
Oct 15, 2020, 2:45:57 AM10/15/20
to vim_dev
Maxim Kim wrote:

> Minimal repro:
>
> let g:testfuzzy = ['vimrc', 'vimrc_colors']
> echom g:testfuzzy->matchfuzzy('vimrc')
>
> result:
> ['vimrc_colors', 'vimrc']

I don't really know how it's scored, but here I would
have thought that the first entry "vimrc" should have
a better score than "vimrc_colors" because "vimrc"
is a full match:
- "vimrc" matches 100%
- "vimrc_colors" matches only 42% of the
characters (5 chars out of 12).

Regards
Dominique

Christian Brabandt

unread,
Oct 15, 2020, 2:54:01 AM10/15/20
to vim_dev
The neighbor '_' adds an additional bonus for scoring. Therefore it
scores higher. Perhaps we need to add an additional bonus, if the match
ends at the string. Something like this perhaps:

diff --git a/src/search.c b/src/search.c
index 349eae035..5ccc35564 100644
--- a/src/search.c
+++ b/src/search.c
@@ -4257,6 +4257,8 @@ typedef struct
// bonus if match occurs after a separator
#define SEPARATOR_BONUS 30
// bonus if match is uppercase and prev is lower
+#define ENDMATCH_SEQUENTIAL_BONUS 35
+// bonus if match is sequential and ends the match string
#define CAMEL_BONUS 30
// bonus if the first letter is matched
#define FIRST_LETTER_BONUS 15
@@ -4318,6 +4320,10 @@ fuzzy_match_compute_score(
// Sequential
if (currIdx == (prevIdx + 1))
score += SEQUENTIAL_BONUS;
+
+ // End of match
+ if ((int)currIdx == strSz-1)
+ score += ENDMATCH_SEQUENTIAL_BONUS;
}

// Check for bonuses based on neighbor character value

I am unsure, whether this bonus should only apply, if we had a previous
match as well, or in general only for the end of match.

Best,
Christian
--
Lieber ein warmes Bad, als eine kalte Dusche.

Dominique Pellé

unread,
Oct 15, 2020, 7:25:29 AM10/15/20
to vim_dev
If I understand this solution, searching for "vimrc" will
prefer "foo_vimrc" over "vimrc_foo", which does not
look ideal.

Regards
Dominique

Christian Brabandt

unread,
Oct 15, 2020, 7:40:20 AM10/15/20
to vim_dev

On Do, 15 Okt 2020, Dominique Pellé wrote:

> If I understand this solution, searching for "vimrc" will
> prefer "foo_vimrc" over "vimrc_foo", which does not
> look ideal.

correct. In your mentioned case, I would actually argue that there is no
best case that everybody agrees on, either one do look sensible for me
(although I would actually have a slight preference for foo_vimrc).

I would just add an additional case, that prefers complete matches.

Best,
Christian
--
Die Hölle kann auch produktiv sein, der Himmel ist nur langweilig.
-- Herbert Achternbusch

Yegappan Lakshmanan

unread,
Oct 15, 2020, 11:53:15 AM10/15/20
to vim_dev
Hi,

The reason is that 'vimrc' matches in multiple places in '.vim/vimrc' and
'.vim/vimrc_colors'. The scores for the various matches is below:

For '.vim/vimrc':

    'xxxxxvimrc'   140
    'xvxxxximrc'   135
    'xvimxxxxrc'   135

For '.vim/vimrc_colors':

    xxxxxvimrxxcxxxxx 148
    xxxxxvimrcxxxxxxx 133
    xvimxxxxrcxxxxxxx 128

As you can see from the scores above, the match in '.vim/vimrc_colors'
has a higher score because of the "c" that occurs after an underscore.

The match that occurs immediately after an underscore is given an additional
bonus score.

- Yegappan

Christian Brabandt

unread,
Oct 15, 2020, 11:55:21 AM10/15/20
to vim_dev
Yeah, but shouldn't a complete match get an additional bonus?

Best,
Christian
--
Der Ignorant des Tages: Mein Auto fährt auch ohne Wald.

Yegappan Lakshmanan

unread,
Oct 15, 2020, 11:59:40 AM10/15/20
to vim_dev
Hi Dominique,

On Thu, Oct 15, 2020 at 4:25 AM Dominique Pellé <dominiq...@gmail.com> wrote:
Christian Brabandt <cbl...@256bit.org> wrote:

> On Mi, 14 Okt 2020, Maxim Kim wrote:
>
> > Hi, I wonder why "better" candidate has lower score than the others:
> >
> > https://i.imgur.com/NPhKNsZ.png
> >
> > tried to search `vimrc` and `.vim/vimrc` has lower score than `.vim/vimrc_colors`.
>

If I understand this solution, searching for "vimrc" will
prefer "foo_vimrc" over "vimrc_foo", which does not
look ideal.


In this case, both the matches have the same score. So the order
depends on how the matches are sorted.

- Yegappan 

Yegappan Lakshmanan

unread,
Oct 15, 2020, 12:14:55 PM10/15/20
to vim_dev
Hi Christian,

Yes. A complete sequential match should get an additional bonus. But the
changes you posted only checks for whether the previous character is matched
and it doesn't check for a complete match. For example, when searching
for 'abc' in 'xabcy' and 'axxxxbc', it will give preference to axxxxbc than xabcy.

Regards,
Yegappan

Christian Brabandt

unread,
Oct 15, 2020, 12:33:54 PM10/15/20
to vim_dev
Indeed. Initially I thought adding a bonus for a word boundary would be
needed, therefore my naive attempt to score additionally on the end of
the match. So how about this, which adds an additional bonus only for a
complete match.

iff --git a/src/search.c b/src/search.c
index 349eae035..d7adcc59f 100644
--- a/src/search.c
+++ b/src/search.c
@@ -4257,6 +4257,8 @@ typedef struct
// bonus if match occurs after a separator
#define SEPARATOR_BONUS 30
// bonus if match is uppercase and prev is lower
+#define ENDMATCH_SEQUENTIAL_BONUS 35
+// bonus if match is sequential and ends the match string
#define CAMEL_BONUS 30
// bonus if the first letter is matched
#define FIRST_LETTER_BONUS 15
@@ -4292,6 +4294,7 @@ fuzzy_match_compute_score(
int i;
char_u *p = str;
matchidx_T sidx = 0;
+ int full_match = 0;

// Initialize score
score = 100;
@@ -4317,7 +4320,10 @@ fuzzy_match_compute_score(

// Sequential
if (currIdx == (prevIdx + 1))
+ {
score += SEQUENTIAL_BONUS;
+ full_match++;
+ }
}

// Check for bonuses based on neighbor character value
@@ -4358,6 +4364,10 @@ fuzzy_match_compute_score(
score += FIRST_LETTER_BONUS;
}
}
+
+ if (full_match == numMatches-1)
+ score += ENDMATCH_SEQUENTIAL_BONUS;
+ // End of match
return score;
}



Mit freundlichen Grüßen
Christian
--
Jede Bedrängnis ist nur ein Engpass zu einer Weite.
-- Johannes Müller

Yegappan Lakshmanan

unread,
Oct 15, 2020, 2:52:08 PM10/15/20
to vim_dev
Hi Christian,

On Thu, Oct 15, 2020 at 9:33 AM Christian Brabandt <cbl...@256bit.org> wrote:

>
>     Yeah, but shouldn't a complete match get an additional bonus?
>
>
> Yes. A complete sequential match should get an additional bonus. But the
> changes you posted only checks for whether the previous character is matched
> and it doesn't check for a complete match. For example, when searching
> for 'abc' in 'xabcy' and 'axxxxbc', it will give preference to axxxxbc than
> xabcy.

Indeed. Initially I thought adding a bonus for a word boundary would be
needed, therefore my naive attempt to score additionally on the end of
the match. So how about this, which adds an additional bonus only for a
complete match.

The change looks good to me. Some tests need to be updated after this
change as they are failing with this change.

Note that when searching for 'vimrc' in '_vim_rc' and '.vimrc', the
_vim_rc string will have a higher score than .vimrc because of the two
underscores.

Some minor comments below.
 

iff --git a/src/search.c b/src/search.c
index 349eae035..d7adcc59f 100644
--- a/src/search.c
+++ b/src/search.c
@@ -4257,6 +4257,8 @@ typedef struct
 // bonus if match occurs after a separator
 #define SEPARATOR_BONUS 30
 // bonus if match is uppercase and prev is lower
+#define ENDMATCH_SEQUENTIAL_BONUS 35
+// bonus if match is sequential and ends the match string

Can you move the comment before the #define?
 
 #define CAMEL_BONUS 30
 // bonus if the first letter is matched
 #define FIRST_LETTER_BONUS 15
@@ -4292,6 +4294,7 @@ fuzzy_match_compute_score(
     int                i;
     char_u     *p = str;
     matchidx_T sidx = 0;
+    int                full_match = 0;

These functions are using camel case variable names. To be consistent,
can you change the name to fullMatch?

Regards,
Yegappan

Bram Moolenaar

unread,
Oct 15, 2020, 3:24:11 PM10/15/20
to vim...@googlegroups.com, Christian Brabandt
An extra score for a whole match makes a big jump. It might work better
when giving a match a bit of extra score if the previous character also
matched. Thus the more consequtive characters match the better. This
seems to make sense: matches spread out over a line of text score lower
than when some matches group together. And reach the maximum when they
are all together.

--
From "know your smileys":
:-)-O Smiling doctor with stethoscope

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Yegappan Lakshmanan

unread,
Oct 15, 2020, 10:31:18 PM10/15/20
to vim_dev, Christian Brabandt
Hi Bram,

On Thu, Oct 15, 2020 at 12:24 PM Bram Moolenaar <Br...@moolenaar.net> wrote:

Christian wrote:

> Indeed. Initially I thought adding a bonus for a word boundary would be
> needed, therefore my naive attempt to score additionally on the end of
> the match. So how about this, which adds an additional bonus only for a
> complete match.
>

An extra score for a whole match makes a big jump.  It might work better
when giving a match a bit of extra score if the previous character also
matched.  Thus the more consequtive characters match the better.  This
seems to make sense: matches spread out over a line of text score lower
than when some matches group together.  And reach the maximum when they
are all together.


The current implementation already supports this. Each sequential matching
character is given a bonus score (SEQUENTIAL_BONUS).

A matching character after an underscore or space is given a bonus
score (SEPARATOR_BONUS). Because of this a string with a matching
letter after an underscore is given precedence over a fully matched sequence
of characters.

Regards,
Yegappan

Christian Brabandt

unread,
Oct 16, 2020, 2:14:26 AM10/16/20
to vim_dev

On Do, 15 Okt 2020, Yegappan Lakshmanan wrote:

Yes currently, SEQUENTIAL_BONUS is 15, while SEPARATOR_BONUS is 30.
This makes `v_i_m_r_c` be the preferred match against e.g. vimrc.

So I think SEQUENTIAL_BONUS should probably score higher than
SEPARATOR_BONUS. I think this might be a better way to continue here.

I'll make a PR for that.

On a related note, I wonder whether SEPARATOR_BONUS should also be added
for path delimiters (`/` on unix `\` on windows).


Best,
Christian
--
Der schlimmste aller Fehler ist, sich keines solchen bewußt zu sein.
-- Thomas Carlyle

Maxim Kim

unread,
Oct 16, 2020, 2:36:23 AM10/16/20
to vim...@googlegroups.com

On 16.10.2020 9:14, Christian Brabandt wrote:
> On a related note, I wonder whether SEPARATOR_BONUS should also be added
> for path delimiters (`/` on unix `\` on windows).


I think it makes sense:

let g:tdata = ['hello/world','hellopworld', 'helloworld']
echo g:tdata->matchfuzzy('llwod')

current result:

['helloworld', 'hello/world', 'hellopworld']

better result would probably be

['hello/world', 'helloworld', 'hellopworld']


Maxim Kim

unread,
Oct 16, 2020, 3:14:51 AM10/16/20
to vim_dev


пятница, 16 октября 2020 г. в 09:36:23 UTC+3, Maxim Kim:

On 16.10.2020 9:14, Christian Brabandt wrote:
> On a related note, I wonder whether SEPARATOR_BONUS should also be added
> for path delimiters (`/` on unix `\` on windows).


I think it makes sense:



let g:tdata = ['autoload/select.vim','autoload/vital/WindowSelector']
echo g:tdata->matchfuzzy('autoselect')


Bram Moolenaar

unread,
Oct 16, 2020, 6:13:24 AM10/16/20
to vim...@googlegroups.com, Christian Brabandt

Christian wrote:

> On Do, 15 Okt 2020, Yegappan Lakshmanan wrote:
>
> > On Thu, Oct 15, 2020 at 12:24 PM Bram Moolenaar <Br...@moolenaar.net> wrote:
> >
> >
> > Christian wrote:
> >
> > > Indeed. Initially I thought adding a bonus for a word boundary would be
> > > needed, therefore my naive attempt to score additionally on the end of
> > > the match. So how about this, which adds an additional bonus only for a
> > > complete match.
> > >
> >
> > An extra score for a whole match makes a big jump.  It might work better
> > when giving a match a bit of extra score if the previous character also
> > matched.  Thus the more consequtive characters match the better.  This
> > seems to make sense: matches spread out over a line of text score lower
> > than when some matches group together.  And reach the maximum when they
> > are all together.
> >
> >
> >
> > The current implementation already supports this. Each sequential matching
> > character is given a bonus score (SEQUENTIAL_BONUS).
> >
> > A matching character after an underscore or space is given a bonus
> > score (SEPARATOR_BONUS). Because of this a string with a matching
> > letter after an underscore is given precedence over a fully matched sequence
> > of characters.
>
> Yes currently, SEQUENTIAL_BONUS is 15, while SEPARATOR_BONUS is 30.
> This makes `v_i_m_r_c` be the preferred match against e.g. vimrc.
>
> So I think SEQUENTIAL_BONUS should probably score higher than
> SEPARATOR_BONUS. I think this might be a better way to continue here.

Aha. So it might just be a matter of adjusting the weights.

> I'll make a PR for that.
>
> On a related note, I wonder whether SEPARATOR_BONUS should also be added
> for path delimiters (`/` on unix `\` on windows).

I wonder why that "after underscore" bonus was added. Isn't this
similar to a "start of word" bonus, like after a space or at the start
of the text? I haven't looked into the whole algorithm, but to me it
seems that this bonus should only be given when the start of a word in
the search string matches the start of a word in the text. E.g, when
searching for "hello world" then a match in "hello worldwize" would
match a bit better than "hello xxxworldwize".

I wonder why underscore was made so important. When matching in
keywords in programming name_with_underscores still has "name" at the
start as a more imporant boundary than the underscores. Or a slash, dot
and other separator characters.

--
Where do you want to crash today?
Reply all
Reply to author
Forward
0 new messages