Posting an update on my third week's work.
I mostly worked on fixing some major issues in makemandb to get the
indexing going. Besides that I also implemented a barebones apropos(1)
which takes the user query as input and simply looks up the database.
I have started a new branch, "search", in my GitHub repository:
https://github.com/abhinav-upadhyay/apropos_replacement/branches/search
This is where I am adding experimental code for search-related
features. As I get feedback and reviews, I will either merge the
changes into master or revert them.
Currently the search branch has the following two features:
1. A stopword filter: With full text search, we expect users to enter
natural queries made up of ordinary English words. We therefore need
to filter the stopwords out of the user query, so that the results
match the actual keywords in the query and not the stopwords.
2. A ranking function: A ranking function is necessary so that Sqlite
returns the most useful results at the top. If you compare the apropos
in the master branch with the one in the search branch, you will
notice a drastic difference in the quality of the search results.
Even so, a lot of work is still needed to improve it.
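
To illustrate the stopword filter described above, here is a minimal
sketch in Python. It is not the code in the search branch; the word
list and the function name filter_stopwords are assumptions made up
for this example.

```python
# Hypothetical stopword list; the list used in the search branch may differ.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "about", "also",
    "how", "in", "is", "of", "the", "to",
})


def filter_stopwords(query, stopwords=STOPWORDS):
    """Return the query with stopwords removed, keeping keyword order."""
    keywords = [word for word in query.lower().split()
                if word not in stopwords]
    return " ".join(keywords)


if __name__ == "__main__":
    print(filter_stopwords("how to add a user"))
```

With this list, the query "how to add a user" is reduced to the two
keywords "add" and "user" before it is handed to the database.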
If you would like to see the output of some sample searches:
http://pastebin.com/qhQBRNd5
I wrote a more detailed report on my blog, where I discuss the
issues fixed and also how to get the code and try it out:
http://abhinav-upadhyay.blogspot.com/2011/06/netbsd-gsoc-weekly-report-3.html
At this point, some community feedback would be very valuable.
For example:
- What are the most important things you would look for when
performing a search across man pages?
- What information should the output contain?
- Do you like the current results?
- Would you like any changes in the interface of apropos(1)?
etc.
Thanks for your interest and time :-)
Regards
Abhinav
I tried building it:
# git clone git://github.com/abhinav-upadhyay/apropos_replacement.git
Cloning into apropos_replacement...
remote: Counting objects: 56, done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 56 (delta 26), reused 19 (delta 5)
Receiving objects: 100% (56/56), 1.20 MiB | 338 KiB/s, done.
Resolving deltas: 100% (26/26), done.
# cd apropos_replacement
# make
rm -f .gdbinit
touch .gdbinit
# compile apropos_replacement/makemandb.o
gcc -O2 -std=gnu99 -Werror -I/usr/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c makemandb.c
# compile apropos_replacement/sqlite3.o
gcc -O2 -std=gnu99 -Werror -I/usr/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c sqlite3.c
# link apropos_replacement/makemandb
gcc -o makemandb makemandb.o sqlite3.o -L/usr/src/external/bsd/mdocml/lib/libmandoc -lmandoc -Wl,-rpath-link,/lib -L=/lib
ld: cannot find -lmandoc
*** Error code 1
Stop.
make: stopped in /path/apropos_replacement
It seems I need to compile and install libmandoc? Which version?
Thomas
Yes, I forgot to mention the dependency on libmandoc :-| make will
search for libmandoc in /usr/src/external/bsd/mdocml/lib/libmandoc.
Running make && make install in /usr/src/external/bsd/mdocml should
build it :-)
I am using the version in -current, which is 1.11.1. The -current
version of the man pages is also required, because with the 5.1 man
pages libmandoc hit an assertion failure on some particular man
pages.
Thanks
Indeed, free form queries seem like the way to go.
> 2. A ranking function: A ranking function is very necessary, so that
> Sqlite ranks and gives back the most useful results at the top. If you
> try the apropos in the master branch and the one in search branch, you
> will notice drastic difference in the quality of search results. But
> even after this lots of effort is required to improve it.
Besides the usual frequency, possible ranking scores (or "static weights")
could involve the earlier mentioned .Nm and .Nd. Say, if a word "string"
appears already in the title, it may be a better result than several
appearances of the word "string" in the body of the text.
> - What are the most important things you would look for when
> performing a search across man pages ?
It may be difficult to say because we have never had a reasonable search
utility for man pages ;-). But I think the examples you noted were pretty
much spot on; from "how to add user" and "package installation" to "kernel
memory" or "vnode locking".
> - Do you like the current results ?
Yes, the results were very reasonable already.
- Jukka.
Usually when I want to search the man pages, I don't know the exact
name of the function or of the man page itself.
>
> - Do you like the current results ?
>
> Yes, the results were very reasonable already.
>
> - Jukka.
>
--
Regards.
Adam
Be careful here. At least .Nm should *not* get filtered. Consider
"apropos who"...
Joerg
>> 2. A ranking function: A ranking function is very necessary, so that
>> Sqlite ranks and gives back the most useful results at the top. If you
>> try the apropos in the master branch and the one in search branch, you
>> will notice drastic difference in the quality of search results. But
>> even after this lots of effort is required to improve it.
>
> Besides the usual frequency, possible ranking scores (or "static weights")
> could involve the earlier mentioned .Nm and .Nd. Say, if a word "string"
> appears already in the title, it may be a better result than several
> appearances of the word "string" in the body of the text.
Yes, although the current ranking function does give a static weight
to each column.
name column      --> 1.50
name_desc column --> 1.25
desc column      --> 0.75
So after calculating the term frequency in a column we multiply it by
the static weight of the column.
Besides this, calculating the Inverse Document Frequency and using it
as an additional factor in the ranking should improve the results.
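
As a rough illustration of the weighting described above, here is a
sketch in Python rather than the actual Sqlite ranking function. The
column weights are the ones listed above; the helper names, the
whitespace tokenisation and the log(N / df) form of the Inverse
Document Frequency are assumptions made for this example.

```python
import math

# Static per-column weights, as listed above.
COLUMN_WEIGHTS = {"name": 1.50, "name_desc": 1.25, "desc": 0.75}


def term_frequency(term, text):
    """Count whole-word occurrences of term in text."""
    return text.lower().split().count(term)


def weighted_tf(term, page):
    """Sum of (term frequency in a column) * (static column weight)."""
    return sum(weight * term_frequency(term, page.get(column, ""))
               for column, weight in COLUMN_WEIGHTS.items())


def idf(term, pages):
    """Assumed IDF: log(N / number of pages containing the term)."""
    containing = sum(1 for page in pages if weighted_tf(term, page) > 0)
    if containing == 0:
        return 0.0
    return math.log(len(pages) / containing)


def score(term, page, pages):
    """Weighted term frequency scaled by the IDF of the term."""
    return weighted_tf(term, page) * idf(term, pages)
```

With these weights, a single hit in the name column (1.50) outranks a
single hit in the desc column (0.75), while a term that occurs in
every page gets an IDF of zero under this particular formula.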
>> - What are the most important things you would look for when
>> performing a search across man pages ?
>
> It may be difficult to say because we have never had a reasonable search
> utility for man pages ;-). But I think the examples you noted were pretty
> much spot on; from "how to add user" and "package installation" to "kernel
> memory" or "vnode locking".
Yes, although those were the queries that produced the best results.
It seems to me that the more elaborate the user's query is, the
better the results should be (the user still has to mention the
right keywords, of course). A single-keyword query might lead
nowhere. But then, we are at a very early stage of the project.
> - Do you like the current results ?
>
> Yes, the results were very reasonable already.
Thanks for liking it and taking time to provide feedback. I appreciate it. :-)
Regards
Abhinav
> Usually when I want to use search in man pages I don't know exact name
> of function or global man page name.
Yes, that is pretty much expected. I am going to work on showing
the section number in the search results as well, so that, for
example, we can tell whether a search result is a system call, a
standard library function, or an
Thanks
Abhinav
I have a pretty basic (and perhaps lame) approach in mind for this:
We first eliminate very obvious stopwords from the query (like a, an,
and, are, about, also, etc.).
After this, we run a query only against the name and name_desc columns.
Then we again filter any remaining stopwords from the query and then
perform search against the desc column.
In the end we take a union of all the results and rank them.
Although it makes things somewhat complicated.
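
The steps above could look roughly like the following Python sketch.
It works on an in-memory list of pages instead of the Sqlite database;
the second stopword list, the function names and the scoring (a name
hit counts twice as much as a desc hit) are assumptions made for this
example.

```python
# Stage 1: only the very obvious stopwords are removed.
OBVIOUS_STOPWORDS = frozenset({"a", "an", "and", "are", "about", "also"})
# Stage 2: hypothetical list of the remaining stopwords.
REMAINING_STOPWORDS = frozenset({"how", "in", "of", "the", "to", "who"})


def count_hits(terms, text):
    """Number of query terms that occur as whole words in text."""
    words = set(text.lower().split())
    return sum(1 for term in terms if term in words)


def two_stage_search(query, pages):
    """Search name/name_desc first, then desc, and rank the union."""
    stage1 = [w for w in query.lower().split()
              if w not in OBVIOUS_STOPWORDS]
    stage2 = [w for w in stage1 if w not in REMAINING_STOPWORDS]
    scores = {}
    for page in pages:
        name_hits = count_hits(stage1,
                               page["name"] + " " + page["name_desc"])
        desc_hits = count_hits(stage2, page["desc"])
        if name_hits or desc_hits:
            # Assumed ranking: a name hit counts twice as much.
            scores[page["name"]] = 2 * name_hits + desc_hits
    # Highest score first; ties are broken by page name.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

Because "who" is only removed in the second stage, a query such as
"who" still matches a page whose name is who, but it is not searched
for in the desc column.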
Which ones? That sounds like they should be fixed...
--
David A. Holland
dhol...@netbsd.org
Ah, I did not really make a complete list of such pages. But I have a
somewhat partial list (after failing so many times, I decided to just
update the man pages)
/usr/share/man/man1/atari/edahdi.1
/usr/share/man/man4/arc/intro.4
/usr/share/man/man4/amiga/*
/usr/share/man/man4/alpha/*
I think most of them were in /usr/share/man/man4/<arch>/*
Some pages in /usr/pkg/man/man3/ or so also caused failures.
The error message was as follows:
$ mandoc /usr/share/man/man4/atari/floppy.4
assertion "' ' != buf[*pos]" failed: file
"/usr/src/external/bsd/mdocml/lib/libmandoc/../../dist/mdoc_argv.c",
line 282, function "mdoc_argv"
Abort trap (core dumped)
This has been fixed in the CVS version of mandoc, to which joerg@ has
access.