List of Keywords for apropos(1) Which Should Not be Stemmed

Abhinav Upadhyay

unread,

Jul 11, 2016, 9:29:43 AM7/11/16

to

Hi,

Currently we are using the in-built Porter stemming tokenizer of
SQLite, which by default stems all the keywords while indexing. It
does this by removing suffixes like 's', 'es', 'ing', 'ed' from the
end of the words and various other similar heuristics. This is useful
for full text search because if you search for 'directories', you will
find matches for both 'directory' and 'directories'.

But the downside is that technical keywords (e.g. kms, lfs, ffs), are
also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
search for kms, you will see results for both kms and km.

The solution is to write a custom tokenizer where we check in an
ignore list to decide whether to stem a token or not. I'm looking how
best to obtain this ignore list of keywords. The discussion on
current-users [1] had two suggestions:

1. If a word is not in /usr/share/dict/words, don't stem.
2. Look for .Tn macros (and probably other similar macros) and don't stem those.

Doing (1) is simple but that file is huge and it would require
building a huge hash table to search in it for ever keyword while
parsing the man pages.
With (2), the list will not be available before makemandb(8) runs, so
it is hard to implement.

There is another option of building a list by hand and by using
/usr/data/src/usr.bin/spell/spell/{special.netbsd, special.math} as a
starting point. If you have any better alternatives, please let me
know :)

[1]: http://mail-index.netbsd.org/current-users/2016/07/08/msg029732.html

-
Abhinav

Thomas Klausner

unread,

Jul 11, 2016, 10:01:04 AM7/11/16

to

On Mon, Jul 11, 2016 at 06:59:25PM +0530, Abhinav Upadhyay wrote:
> 1. If a word is not in /usr/share/dict/words, don't stem.
> 2. Look for .Tn macros (and probably other similar macros) and don't stem those.

3. Don't stem the file names?
cd /usr/share/man && find . -name *.[0-9] | sed -e "s,.*/,," -e "s/.[0-9]$//" | sort -u |less

Thomas

Abhinav Upadhyay

unread,

Jul 11, 2016, 11:29:13 AM7/11/16

to

Thanks, that would be a good starting point too. I guess we will still
have to add few words to the list manually later, but it should be
good to begin with.

-
Abhinav

Brett Lymn

unread,

Jul 11, 2016, 8:24:00 PM7/11/16

to

On Mon, Jul 11, 2016 at 08:59:05PM +0530, Abhinav Upadhyay wrote:
>
> Thanks, that would be a good starting point too. I guess we will still
> have to add few words to the list manually later, but it should be
> good to begin with.
>

How about checking the length of the word - technical abbreviations tend
to be short (<= 4 characters predominantly). According to grep there
are 155 two letter words, 1358 three letter words and 5124 four letter
words (assuming my driving of grep is correct) in /usr/share/dict/words.
So it could be feasible to hash just the short words in the dictionary
and then stem if you find a match otherwise assume it is a technical
abbreviation and don't stem.

--
Brett Lymn

Robert Elz

unread,

Jul 11, 2016, 8:40:46 PM7/11/16

to

It would also be good to have an option to not stem at all (for a lookup, and
I assume that would mean building two databases for the search, one with
words stemmed, and one with just the words as they appear.)

I know that often when I do a search, the reason is that I remember some
phrase from a man page I read in the past, but don't remember which man
page I read it in ... when I do a search like that, I want to look for
exactly what I recall reading, so if, for example, I remember "compare
directories" or something like that, and I search for "directories" I want
to see only entries that exactly match that, and I do not want to get told
of some man page just because it has "directory" in it.

I also don't mind doing two or three searches with slightly different
keys, if I don't find what I am looking for on the first (or second)
attempt. That's usually far better than being overwhelmed with dozens
(or hundreds) of matches that are "kind of like" what I was looking for
in the expectation that one of them will be what I really wanted - if I
have to send the output of "man -k" through grep, I consider that a failure...

kre

David Young

unread,

Jul 11, 2016, 8:54:14 PM7/11/16

to

On Mon, Jul 11, 2016 at 06:59:25PM +0530, Abhinav Upadhyay wrote:

> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
> search for kms, you will see results for both kms and km.

Interesting problem.

I expect the set of documents that contain a word ("directories") and
the set of documents containing its true stem ("directory") to overlap
widely. I also expect the set of documents that contain a word ("kms")
and an incorrect stem ("km") to scarcely overlap. Do the manual pages
meet these expections? If so, then maybe you can decide whether or not
to keep a stem by looking at the document-set overlap?

Dave

--
David Young //\ Trestle Technology Consulting
(217) 721-9981 Urbana, IL http://trestle.tech/

Abhinav Upadhyay

unread,

Jul 12, 2016, 2:04:25 AM7/12/16

to

Yes, but there are other keywords which are probably not
abbreviations, and longer than 3/4 letters. For example, drmkms,
usbdevs, scan_ffs etc :)

-
Abhinav

Abhinav Upadhyay

unread,

Jul 12, 2016, 2:11:29 AM7/12/16

to

On Tue, Jul 12, 2016 at 6:24 AM, David Young <dyo...@pobox.com> wrote:
> On Mon, Jul 11, 2016 at 06:59:25PM +0530, Abhinav Upadhyay wrote:
>> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
>> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
>> search for kms, you will see results for both kms and km.
>
> Interesting problem.
>
> I expect the set of documents that contain a word ("directories") and
> the set of documents containing its true stem ("directory") to overlap
> widely. I also expect the set of documents that contain a word ("kms")
> and an incorrect stem ("km") to scarcely overlap. Do the manual pages
> meet these expections? If so, then maybe you can decide whether or not
> to keep a stem by looking at the document-set overlap?

Yes, usually when the stem is incorrect, the overlap is not that much.
But the only way to figure out such cases is manually comparing the
output of apropos, unless we have a pre-built list of expected
document-set and we can compare those. :)

-
Abhinav

David Holland

unread,

Jul 12, 2016, 2:56:25 AM7/12/16

to

On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
> >> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
> >> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
> >> search for kms, you will see results for both kms and km.
> >
> > Interesting problem.
> >
> > I expect the set of documents that contain a word ("directories") and
> > the set of documents containing its true stem ("directory") to overlap
> > widely. I also expect the set of documents that contain a word ("kms")
> > and an incorrect stem ("km") to scarcely overlap. Do the manual pages
> > meet these expections? If so, then maybe you can decide whether or not
> > to keep a stem by looking at the document-set overlap?
>
> Yes, usually when the stem is incorrect, the overlap is not that much.
> But the only way to figure out such cases is manually comparing the
> output of apropos, unless we have a pre-built list of expected
> document-set and we can compare those. :)

You could build such a list from the current set of man pages, and
refresh it once in a while, and that would probably work well enough.

I'm wondering though if there's some characteristic of the document
sets you can use to automatically reject wrong stemmings without
having to precompute. What comes to mind though is some kind of
diameter or breadth metric on the image of the document set on the
crossreference graph. Or maybe something like the average
crossreference pagerank of the document set, which if it's too high
means you aren't retrieving useful information. But I guess these
notions aren't much use because I'm sure we don't currently build the
crossreference graph.

(Also, as far as longer vs. shorter words, there's not much harm
besides performance in searching for nonsense words like "resize_ff"
as they generally won't match anything.)

--
David A. Holland
dhol...@netbsd.org

Abhinav Upadhyay

unread,

Jul 17, 2016, 12:41:41 PM7/17/16

to

On Tue, Jul 12, 2016 at 12:26 PM, David Holland
<dholla...@netbsd.org> wrote:
> On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
> > >> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
> > >> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
> > >> search for kms, you will see results for both kms and km.
> > >
> > > Interesting problem.
> > >
> > > I expect the set of documents that contain a word ("directories") and
> > > the set of documents containing its true stem ("directory") to overlap
> > > widely. I also expect the set of documents that contain a word ("kms")
> > > and an incorrect stem ("km") to scarcely overlap. Do the manual pages
> > > meet these expections? If so, then maybe you can decide whether or not
> > > to keep a stem by looking at the document-set overlap?
> >
> > Yes, usually when the stem is incorrect, the overlap is not that much.
> > But the only way to figure out such cases is manually comparing the
> > output of apropos, unless we have a pre-built list of expected
> > document-set and we can compare those. :)
>
> You could build such a list from the current set of man pages, and
> refresh it once in a while, and that would probably work well enough.

That's one of the things I want to do. It would be nice to create a
labeled dataset, probably something like a set of queries and an
expected list of documents in the top 10 for each of them. It could
then be used as a training data for tasks such as
- evaluating performance of various ranking algorithms,
- using machine learning to learn an optimal ranking algorithm
- determining which keywords should be stemmed by comparing the
overlap of the actual and expected results

> I'm wondering though if there's some characteristic of the document
> sets you can use to automatically reject wrong stemmings without
> having to precompute. What comes to mind though is some kind of
> diameter or breadth metric on the image of the document set on the
> crossreference graph. Or maybe something like the average
> crossreference pagerank of the document set, which if it's too high
> means you aren't retrieving useful information. But I guess these
> notions aren't much use because I'm sure we don't currently build the
> crossreference graph.

We haven't tried exploring this aspect. Probably if we have a hand
labeled dataset as mentioned above, we could compare performance of
page rank as well.

-
Abhinav