GSoC Project Progress Update: Apropos Replacement


Abhinav Upadhyay

Jun 22, 2011, 4:38:59 PM
Hello NetBSD

Posting an update on my third week's work.

I mostly worked on fixing some major issues in makemandb to get the
indexing going. Besides that I also implemented a barebones apropos(1)
which takes the user query as input and simply looks up the database.

I have started a new branch, "search", on my GitHub repository:
https://github.com/abhinav-upadhyay/apropos_replacement/branches/search
I am adding experimental code for search-related features there. As I
get feedback and reviews, I will either merge the changes into master
or revert them.

Currently the search branch has the following two features:

1. A stopword filter: If we are doing full text search, then we should
expect users to enter normal queries made up of ordinary English
words. We therefore need to filter the stopwords out of the user query,
so that results match the actual keywords in the query and not the
stopwords.

2. A ranking function: A ranking function is necessary so that
SQLite returns the most useful results at the top. If you compare the
apropos in the master branch with the one in the search branch, you
will notice a drastic difference in the quality of the search results.
Even so, a lot of work is still needed to improve it.
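
As a rough illustration of the stopword filter in item 1, here is a
minimal Python sketch. It is not the project's C code: the stopword
list and the function name filter_stopwords are assumptions made only
for this example.

```python
# Hypothetical stopword list, for illustration only; the list actually
# used in the search branch is not shown in this thread.
STOPWORDS = {"a", "an", "and", "are", "about", "also",
             "how", "to", "the", "of", "in"}


def filter_stopwords(query):
    """Drop stopwords from a free-form query, keeping keyword order."""
    keywords = [word for word in query.lower().split()
                if word not in STOPWORDS]
    return " ".join(keywords)


if __name__ == "__main__":
    # Only the keywords are left to be matched against the database.
    print(filter_stopwords("how to add a user"))
```

A query made up entirely of stopwords comes back empty under this
scheme, so a real implementation would need to decide what to do in
that case.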

If you would like to see the output of some sample searches:
http://pastebin.com/qhQBRNd5

I wrote a more detailed report on my blog, where I have discussed the
issues fixed and how to get the code and try it out:
http://abhinav-upadhyay.blogspot.com/2011/06/netbsd-gsoc-weekly-report-3.html

At this point, community feedback would be very valuable.
For example:
- What are the most important things you would look for when
performing a search across man pages?
- What information should the output include?
- Do you like the current results?
- Would you like any changes to the interface of apropos(1)?

Thanks for your interest and time :-)

Regards
Abhinav


Thomas Klausner

Jun 22, 2011, 4:59:39 PM
On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:
> I made a more detailed report on my blog where I have discussed the
> issues fixed, and also how to get the code and try it out.
> http://abhinav-upadhyay.blogspot.com/2011/06/netbsd-gsoc-weekly-report-3.html

I tried building it:
# git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
Cloning into apropos_replacement...
remote: Counting objects: 56, done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 56 (delta 26), reused 19 (delta 5)
Receiving objects: 100% (56/56), 1.20 MiB | 338 KiB/s, done.
Resolving deltas: 100% (26/26), done.
# cd apropos_replacement
# make
rm -f .gdbinit
touch .gdbinit
# compile apropos_replacement/makemandb.o
gcc -O2 -std=gnu99 -Werror -I/usr/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c makemandb.c
# compile apropos_replacement/sqlite3.o
gcc -O2 -std=gnu99 -Werror -I/usr/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c sqlite3.c
# link apropos_replacement/makemandb
gcc -o makemandb makemandb.o sqlite3.o -L/usr/src/external/bsd/mdocml/lib/libmandoc -lmandoc -Wl,-rpath-link,/lib -L=/lib
ld: cannot find -lmandoc
*** Error code 1

Stop.
make: stopped in /path/apropos_replacement

It seems I need to compile and install libmandoc? Which version?
Thomas

Abhinav Upadhyay

Jun 22, 2011, 5:05:22 PM
On Thu, Jun 23, 2011 at 2:29 AM, Thomas Klausner <w...@netbsd.org> wrote:
> On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:
>> I made a more detailed report on my blog where I have discussed the
>> issues fixed, and also how to get the code and try it out.
>> http://abhinav-upadhyay.blogspot.com/2011/06/netbsd-gsoc-weekly-report-3.html
>
> I tried building it:
> # git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
> Cloning into apropos_replacement...
> remote: Counting objects: 56, done.
> remote: Compressing objects: 100% (46/46), done.
> remote: Total 56 (delta 26), reused 19 (delta 5)
> Receiving objects: 100% (56/56), 1.20 MiB | 338 KiB/s, done.
> Resolving deltas: 100% (26/26), done.
> # cd apropos_replacement
> # make
> rm -f .gdbinit
> touch .gdbinit
> #   compile  apropos_replacement/makemandb.o
> gcc -O2  -std=gnu99 -Werror    -I/usr/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS  -c    makemandb.c
> #   compile  apropos_replacement/sqlite3.o
> gcc -O2  -std=gnu99 -Werror    -I/usr/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS  -c    sqlite3.c
> #      link  apropos_replacement/makemandb
> gcc        -o makemandb  makemandb.o sqlite3.o     -L/usr/src/external/bsd/mdocml/lib/libmandoc -lmandoc   -Wl,-rpath-link,/lib  -L=/lib
> ld: cannot find -lmandoc
> *** Error code 1
>
> Stop.
> make: stopped in /path/apropos_replacement
>
> It seems I need to compile and install libmandoc? Which version?
>  Thomas
>

Yes, I forgot to mention the dependency on libmandoc :-| make will
search for libmandoc in /usr/src/external/bsd/mdocml/lib/libmandoc.
Running make && make install in /usr/src/external/bsd/mdocml should
build it :-)

I am using the version in -current, which is 1.11.1. The -current
version of the man pages is also required, because with the 5.1 man
pages libmandoc hit an assertion failure on some particular pages.

Thanks

Jukka Ruohonen

Jun 22, 2011, 6:05:31 PM
On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:
> 1. A Stopword filter: If we are doing full text search, then we also
> expect users to enter normal queries consisting of usual English
> words

Indeed, free form queries seem like the way to go.

> 2. A ranking function: A ranking function is very necessary, so that
> Sqlite ranks and gives back the most useful results at the top. If you
> try the apropos in the master branch and the one in search branch, you
> will notice drastic difference in the quality of search results. But
> even after this lots of effort is required to improve it.

Besides the usual frequency, possible ranking scores (or "static weights")
could involve the earlier mentioned .Nm and .Nd. Say, if a word "string"
appears already in the title, it may be a better result than several
appearances of the word "string" in the body of the text.

> - What are the most important things you would look for when
> performing a search across man pages ?

It may be difficult to say because we have never had a reasonable search
utility for man pages ;-). But I think the examples you noted were pretty
much spot on; from "how to add user" and "package installation" to "kernel
memory" or "vnode locking".

> - Do you like the current results ?

Yes, the results were very reasonable already.

- Jukka.

haad

Jun 22, 2011, 6:47:08 PM
On Thu, Jun 23, 2011 at 12:05 AM, Jukka Ruohonen <jruo...@iki.fi> wrote:
> On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:
>> 1. A Stopword filter: If we are doing full text search, then we also
>> expect users to enter normal queries consisting of usual English
>> words
>
> Indeed, free form queries seem like the way to go.
>
>> 2. A ranking function: A ranking function is very necessary, so that
>> Sqlite ranks and gives back the most useful results at the top. If you
>> try the apropos in the master branch and the one in search branch, you
>> will notice drastic difference in the quality of search results. But
>> even after this lots of effort is required to improve it.
>
> Besides the usual frequency, possible ranking scores (or "static weights")
> could involve the earlier mentioned .Nm and .Nd. Say, if a word "string"
> appears already in the title, it may be a better result than several
> appearances of the word "string" in the body of the text.
>
>> - What are the most important things you would look for when
>> performing a search across man pages ?
>
> It may be difficult to say because we have never had a reasonable search
> utility for man pages ;-). But I think the examples you noted were pretty
> much spot on; from "how to add user" and "package installation" to "kernel
> memory" or "vnode locking".

Usually when I want to search the man pages, I don't know the exact
name of the function or of the man page.

>
> - Do you like the current results ?
>
> Yes, the results were very reasonable already.
>
> - Jukka.
>

--


Regards.

Adam

Joerg Sonnenberger

Jun 22, 2011, 6:58:46 PM
On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:
> 1. A Stopword filter: If we are doing full text search, then we also
> expect users to enter normal queries consisting of usual English
> words, so we need to filter out the stopwords out of the user query in
> order to get only those results which match the actual keywords in the
> user query and not the stopwords.

Be careful here. At least .Nm should *not* get filtered. Consider
"apropos who"...

Joerg

Abhinav Upadhyay

Jun 23, 2011, 2:43:04 AM
On Thu, Jun 23, 2011 at 3:35 AM, Jukka Ruohonen <jruo...@iki.fi> wrote:
> On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:

>> 2. A ranking function: A ranking function is very necessary, so that
>> Sqlite ranks and gives back the most useful results at the top. If you
>> try the apropos in the master branch and the one in search branch, you
>> will notice drastic difference in the quality of search results. But
>> even after this lots of effort is required to improve it.
>
> Besides the usual frequency, possible ranking scores (or "static weights")
> could involve the earlier mentioned .Nm and .Nd. Say, if a word "string"
> appears already in the title, it may be a better result than several
> appearances of the word "string" in the body of the text.

Yes, although the current ranking function already gives a static
weight to each column:
name column --> 1.50
name_desc column --> 1.25
desc column --> 0.75
After calculating the term frequency in a column, we multiply it by
the static weight of that column.
Beyond this, calculating the inverse document frequency and using it
as an additional factor in the ranking should improve the results.
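
To make the arithmetic concrete, here is a small Python sketch of that
weighting. It is only an illustration, not the ranking function in the
search branch: the column weights are the ones quoted above, while the
whitespace tokenizer, the smoothed IDF formula, the function names and
the sample page are all assumptions.

```python
import math

# Static column weights as quoted in this message.
COLUMN_WEIGHTS = {"name": 1.50, "name_desc": 1.25, "desc": 0.75}


def term_frequency(term, text):
    """Count whole-word occurrences of term in text (naive tokenizer)."""
    return text.lower().split().count(term.lower())


def inverse_document_frequency(term, pages):
    """Smoothed IDF; this particular formula is an assumption."""
    containing = sum(
        1 for page in pages
        if any(term_frequency(term, page.get(col, ""))
               for col in COLUMN_WEIGHTS)
    )
    return math.log((1 + len(pages)) / (1 + containing)) + 1.0


def score(query_terms, page, pages=None):
    """Sum weight * term frequency over columns, optionally times IDF."""
    total = 0.0
    for term in query_terms:
        idf = inverse_document_frequency(term, pages) if pages else 1.0
        for col, weight in COLUMN_WEIGHTS.items():
            total += weight * term_frequency(term, page.get(col, "")) * idf
    return total


# Hypothetical page contents, for illustration only.
string_page = {
    "name": "string",
    "name_desc": "string specific functions",
    "desc": "the string functions manipulate string data",
}

if __name__ == "__main__":
    # 1.50 * 1 (name) + 1.25 * 1 (name_desc) + 0.75 * 2 (desc)
    print(score(["string"], string_page))
```

With these particular weights, one hit in the name column counts as
much as two hits in the desc column.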

>> - What are the most important things you would look for when
>> performing a search across man pages ?
>
> It may be difficult to say because we have never had a reasonable search
> utility for man pages ;-). But I think the examples you noted were pretty
> much spot on; from "how to add user" and "package installation" to "kernel
> memory" or "vnode locking".

Yes, although those were the queries that produced the best results.
It seems to me that the more elaborate the user's query is, the better
the results should be (provided the query still mentions the right
keywords). A single-keyword query might lead nowhere. But then we are
at a very early stage of the project.

> - Do you like the current results ?
>
> Yes, the results were very reasonable already.

Thanks for liking it and taking time to provide feedback. I appreciate it. :-)

Regards
Abhinav

Abhinav Upadhyay

Jun 23, 2011, 3:15:08 AM
On Thu, Jun 23, 2011 at 4:17 AM, haad <haa...@gmail.com> wrote:

> Usually when I want to use search in man pages I don't know exact name
> of function or global man page name.

Yes, that is pretty much expected. I am going to work on including
the section number in the search results, so that, for example, we
can tell whether a search result is a system call, a standard
library function, or an


Thanks
Abhinav

Abhinav Upadhyay

Jun 23, 2011, 3:11:02 AM
On Thu, Jun 23, 2011 at 4:28 AM, Joerg Sonnenberger
<jo...@britannica.bec.de> wrote:
> On Thu, Jun 23, 2011 at 02:08:59AM +0530, Abhinav Upadhyay wrote:
>> 1. A Stopword filter: If we are doing full text search, then we also
>> expect users to enter normal queries consisting of usual English
>> words, so we need to filter out the stopwords out of the user query in
>> order to get only those results which match the actual keywords in the
>> user query and not the stopwords.
>
> Be careful here. At least .Nm should *not* get filtered. Consider
> "apropos who"...
>
At the moment I have built a static list of stopwords, but yeah I did
not consider scenarios like "apropos who" (although 'who' is not on
the list).

I have a pretty basic (and perhaps lame) approach in mind for this:

1. First eliminate the very obvious stopwords from the query (like a,
an, and, are, about, also, etc.).
2. Run a query against only the name and name_desc columns.
3. Filter any remaining stopwords from the query, then search the
desc column.
4. Take the union of all the results and rank them.

This does make things somewhat more complicated, though.
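
The approach above could look roughly like the following Python
sketch. This is only an illustration under stated assumptions, not code
from the search branch: the second-stage stopword list (which includes
"who" purely to demonstrate the "apropos who" case), the hit-count
ranking, the function names and the sample pages are all hypothetical.

```python
# Very obvious stopwords, as listed in this message.
OBVIOUS_STOPWORDS = {"a", "an", "and", "are", "about", "also"}
# Hypothetical second-stage list, applied only to the desc column.
REMAINING_STOPWORDS = {"who", "the", "is", "how", "to"}


def count_matches(terms, text):
    """Total whole-word occurrences of the given terms in text."""
    words = text.lower().split()
    return sum(words.count(term) for term in terms)


def two_pass_search(query, pages):
    """Return page names ranked by combined name and desc hits."""
    terms = [w for w in query.lower().split()
             if w not in OBVIOUS_STOPWORDS]
    body_terms = [t for t in terms if t not in REMAINING_STOPWORDS]
    scores = {}
    for page in pages:
        # Pass 1: name and name_desc, with only obvious stopwords removed.
        name_hits = count_matches(
            terms, page["name"] + " " + page["name_desc"])
        # Pass 2: desc, after the remaining stopwords are removed too.
        body_hits = count_matches(body_terms, page["desc"])
        # Union of both passes; rank by total hits (an assumed ranking).
        if name_hits or body_hits:
            scores[page["name"]] = name_hits + body_hits
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical page contents, for illustration only.
PAGES = [
    {"name": "who", "name_desc": "display who is logged in",
     "desc": "the who utility displays information about users"},
    {"name": "finger", "name_desc": "user information lookup program",
     "desc": "shows who is logged in and information about a user"},
]

if __name__ == "__main__":
    print(two_pass_search("who", PAGES))
    print(two_pass_search("user information", PAGES))
```

In this sketch a query for "who" still finds the page named who,
because the name columns are searched before the second filter runs,
while stray occurrences of "who" in other pages' descriptions are
ignored.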

David Holland

Jun 23, 2011, 2:54:02 PM
On Thu, Jun 23, 2011 at 02:35:22AM +0530, Abhinav Upadhyay wrote:
> I am using the version in -current which is 1.11.1. Also with this the
> -current version of man pages will be required because with the
> version 5.1 man pages, libmandoc was leading to an assertion failure
> for some particular man pages.

Which ones? That sounds like they should be fixed...

--
David A. Holland
dhol...@netbsd.org

Abhinav Upadhyay

Jun 23, 2011, 3:01:50 PM
On Fri, Jun 24, 2011 at 12:24 AM, David Holland
<dholla...@netbsd.org> wrote:
> On Thu, Jun 23, 2011 at 02:35:22AM +0530, Abhinav Upadhyay wrote:
>  > I am using the version in -current which is 1.11.1. Also with this the
>  > -current version of man pages will be required because with the
>  > version 5.1 man pages, libmandoc was leading to an assertion failure
>  > for some particular man pages.
>
> Which ones? That sounds like they should be fixed...

Ah, I did not really make a complete list of such pages, but I have a
partial list (after failing so many times, I decided to just
update the man pages):
/usr/share/man/man1/atari/edahdi.1
/usr/share/man/man4/arc/intro.4
/usr/share/man/man4/amiga/*
/usr/share/man/man4/alpha/*
I think most of them were in /usr/share/man/man4/<arch>/*
Some pages in /usr/pkg/man/man3/ or so also caused failures.

The error message was as follows:

$ mandoc /usr/share/man/man4/atari/floppy.4
assertion "' ' != buf[*pos]" failed: file
"/usr/src/external/bsd/mdocml/lib/libmandoc/../../dist/mdoc_argv.c",
line 282, function "mdoc_argv"
Abort trap (core dumped)

Kristaps Dzonsons

Jun 23, 2011, 3:10:53 PM
On 23/06/2011 21:01, Abhinav Upadhyay wrote:
> On Fri, Jun 24, 2011 at 12:24 AM, David Holland
> <dholla...@netbsd.org> wrote:
>> On Thu, Jun 23, 2011 at 02:35:22AM +0530, Abhinav Upadhyay wrote:
>> > I am using the version in -current which is 1.11.1. Also with this the
>> > -current version of man pages will be required because with the
>> > version 5.1 man pages, libmandoc was leading to an assertion failure
>> > for some particular man pages.
>>
>> Which ones? That sounds like they should be fixed...
>
> Ah, I did not really make a complete list of such pages. But I have a
> somewhat partial list (after failing so many times, I decided to just
> update the man pages)
> /usr/share/man/man1/atari/edahdi.1
> /usr/share/man/man4/arc/intro.4
> /usr/share/man/man4/amiga/*
> /usr/share/man/man4/alpha/*
> I think most of them were in /usr/share/man/man4/<arch>/*
> Also some pages caused failure in /usr/pkg/man/man3/ or so.
>
> Following was the error message:
>
> $ mandoc /usr/share/man/man4/atari/floppy.4
> assertion "' ' != buf[*pos]" failed: file
> "/usr/src/external/bsd/mdocml/lib/libmandoc/../../dist/mdoc_argv.c",
> line 282, function "mdoc_argv"
> Abort trap (core dumped)

This has been fixed in the CVS version of mandoc, to which joerg@ has
access.
