xapwrap to xappy migration


dimazest

Jul 25, 2009, 3:03:31 PM
to xappy-discuss
Hello,

I am refactoring the Moin search code to use xappy instead of xapwrap. The
first thing I tried was querying an existing database with xappy. The
indexing had been done by xapwrap.

Here is the code that queries the database:
#!/usr/bin/env python

import os
import re
import sys
import xappy


_whitespace_re = re.compile(r'\s+')

def open_index(dbpath):
    return xappy.SearchConnection(dbpath)

def main(request, argv):
    dbpath = os.path.join(request.cfg.cache_dir, 'xapian/index')
    search = ' '.join(argv[1:])
    sconn = open_index(dbpath)
    print "Searching %d documents for \"%s\"" % (
        sconn.get_doccount(),
        search
    )
    q = sconn.query_parse(search, default_op=sconn.OP_AND)
    print q
    results = sconn.search(q, 0, 10)
    if results.estimate_is_exact:
        print "Found %d results" % results.matches_estimated
    else:
        print "Found approximately %d results" % results.matches_estimated
    for result in results:
        print result.id

if __name__ == '__main__':
    from MoinMoin.web.contexts import ScriptContext
    request = ScriptContext()
    main(request, sys.argv)

The output is:

$ python search.py SystemInfo
2009-07-25 20:53:02,207 WARNING MoinMoin.log:139 using logging configuration read from built-in fallback in MoinMoin.log module!
2009-07-25 20:53:02,418 INFO MoinMoin.config.multiconfig:127 using wiki config: /Users/dimazest/Documents/gsoc/2009/moinmoin/src/1.9-search/wikiconfig.pyc
Searching 1771 documents for "SystemInfo"
Xapian::Query((systeminfo:(pos=1) AND_MAYBE systeminfo:(pos=1)))
Found 73 results
None
None
None
None
None
None
None
None
None
None

The problem is that it finds some documents, but I cannot get their
IDs. Any ideas how I can get the IDs and other fields?

Thanks,
--
Dima

Richard Boulton

Jul 28, 2009, 3:22:46 AM
to xappy-discuss
On Jul 25, 8:03 pm, dimazest <dimaz...@gmail.com> wrote:
> The problem is that it finds some documents, but I cannot get IDs of
> them. Any ideas how can i get IDs and other fields?

It very much depends on how xapwrap stores its data, but I'm afraid I
don't know anything about how it does that.

I suspect that it's going to be quite hard to make a database built
with xapwrap searchable with xappy. Even if you can get the IDs and
other field prefix mappings set correctly (which would involve hacking
into the internals of xappy to directly set its prefix map - not a
very robust approach), the way in which text is indexed with xapwrap
is unlikely to be identical to xappy, which will lead to poor search
performance at best; often searches simply won't return the right
results.

Instead, I think you'd be much better off trying to build new indexes
from scratch with xappy.

--
Richard

dimazest

Jul 28, 2009, 6:57:10 AM
to xappy-discuss
Hello,
We decided to build the index with xappy from scratch. Now I need to add
terms to fields::

pdoc = connection.process(doc)
pdoc.add_term('revision', 'XREV')
pdoc.add_term('mimetype', 'T')
pdoc.add_term('title', 'S')
pdoc.add_term('fulltitle', 'XFT')
pdoc.add_term('domain', 'XDOMAIN')

But I get:
...
  File "/Volumes/RamDisk/moin/src/1.9-xapian-dmilajevs/MoinMoin/search/Xapian.py", line 628, in _index_page_rev
    pdoc.add_term('revision', 'XREV')
  File "/Volumes/RamDisk/moin/src/1.9-xapian-dmilajevs/MoinMoin/support/xappy/datastructures.py", line 116, in add_term
    prefix = self._fieldmappings.get_prefix(field)
  File "/Volumes/RamDisk/moin/src/1.9-xapian-dmilajevs/MoinMoin/support/xappy/fieldmappings.py", line 94, in get_prefix
    return self._prefixes[fieldname]
KeyError: 'revision'

Am I right that terms are added to the processed documents? Could you
suggest some documentation describing terms?

--
Dima

Richard Boulton

Jul 28, 2009, 9:49:44 AM
to xappy-...@googlegroups.com
2009/7/28 dimazest <dima...@gmail.com>
> Hello,
>
> On Jul 28, 9:22 am, Richard Boulton <boulton...@googlemail.com> wrote:
> > On Jul 25, 8:03 pm, dimazest <dimaz...@gmail.com> wrote:
> >
> > Instead, I think you'd be much better off trying to build new indexes
> > from scratch with xappy.
>
> We decided to build index with xappy from scratch. Now I need to add
> terms to fields::
>
>            pdoc = connection.process(doc)
>            pdoc.add_term('revision', 'XREV')
>            pdoc.add_term('mimetype', 'T')
>            pdoc.add_term('title', 'S')
>            pdoc.add_term('fulltitle', 'XFT')
>            pdoc.add_term('domain', 'XDOMAIN')
>
> Am I right that terms are added to the processed documents? Could you
> suggest some documentation describing terms.

You probably don't want to work at the term level at all. Instead, set up field actions on the database (via an IndexerConnection), create UnprocessedDocuments, and add them to the IndexerConnection. The terms will be generated from the text automatically.

See docs/introduction.rst for an introduction to the concepts.
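
A minimal sketch of that workflow, for illustration only. The field names are taken from the thread, but which action each field gets (free text, exact match, stored content) and the 'content' field are assumptions, as are the helper names configure_fields, index_page and build_index. The traceback above suggests the KeyError comes from 'revision' having no prefix, which is what one would expect when no action has been registered for that field. The xappy import is deferred so the helpers can be read and exercised without the library installed.

```python
# Sketch only: the action chosen for each field is an assumption, not Moin's schema.
FIELD_ACTIONS = [
    # (field name, name of a xappy.FieldActions constant, keyword arguments)
    ("title", "INDEX_FREETEXT", {"language": "en"}),
    ("title", "STORE_CONTENT", {}),
    ("content", "INDEX_FREETEXT", {"language": "en"}),  # hypothetical field
    ("revision", "INDEX_EXACT", {}),
    ("mimetype", "INDEX_EXACT", {}),
    ("domain", "INDEX_EXACT", {}),
]


def configure_fields(iconn, actions):
    """Register every field action on an indexer connection.

    `iconn` is expected to behave like xappy.IndexerConnection and
    `actions` like xappy.FieldActions.
    """
    for field, action_name, kwargs in FIELD_ACTIONS:
        iconn.add_field_action(field, getattr(actions, action_name), **kwargs)


def index_page(iconn, xappy_module, page_id, fields):
    """Build an UnprocessedDocument from (name, value) pairs and add it."""
    doc = xappy_module.UnprocessedDocument()
    doc.id = page_id
    for name, value in fields:
        doc.fields.append(xappy_module.Field(name, value))
    # No add_term() calls: terms are generated from the field text.
    return iconn.add(doc)


def build_index(dbpath, pages):
    """Create an index at `dbpath` from (page_id, fields) pairs."""
    import xappy  # deferred so the helpers above work without xappy
    iconn = xappy.IndexerConnection(dbpath)
    try:
        configure_fields(iconn, xappy.FieldActions)
        for page_id, fields in pages:
            index_page(iconn, xappy, page_id, fields)
        iconn.flush()
    finally:
        iconn.close()
```

Keeping the field/action table as plain data means the same list can be reviewed, or reused when the index is rebuilt, without touching the indexing loop.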

-- 
Richard

dimazest

Jul 28, 2009, 12:43:39 PM
to xappy-discuss
Thank you for the reply.

Another question: do I need to care about stemming, or should I just set
the lang parameter for FREE_TEXT actions? Is it possible to set different
languages for different documents?

On Jul 28, 3:49 pm, Richard Boulton <rich...@tartarus.org> wrote:
> 2009/7/28 dimazest <dimaz...@gmail.com>

Richard Boulton

Jul 28, 2009, 12:52:05 PM
to xappy-...@googlegroups.com
2009/7/28 dimazest <dima...@gmail.com>


> Thank you for reply.
>
> Another question, do I need care about stemming or I should just set
> lang parameter for FREE_TEXT actions? Is it possible for different
> documents set different languages?

If you set the "language" parameter for free text actions, xappy will take care of stemming for you.

It's not really possible for different documents to have different languages. You can, however, set different fields to have different languages (so one field could be text_en and be in English, and another could be text_fr and be in French). If you do this, you'll need to decide which language to search in at query construction time, and use the appropriate field (e.g., with query_parse(default_allow="text_fr")). You can't easily mix French and English queries (for example), because the stemming algorithm used at search time needs to be the same as the one applied to the field at index time.
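
A rough sketch of the per-language field idea, under stated assumptions: the LANGUAGE_FIELDS mapping and the helper names are hypothetical, the fallback to English is an arbitrary choice, and default_allow is passed as a plain string only because that is the form shown above. The connection objects are expected to behave like xappy's IndexerConnection and SearchConnection.

```python
# Hypothetical mapping: language code -> field indexed with that language's stemmer.
LANGUAGE_FIELDS = {"en": "text_en", "fr": "text_fr"}


def configure_language_fields(iconn, actions):
    """Register one free-text field per language on an indexer connection.

    `actions` is expected to behave like xappy.FieldActions.
    """
    for lang, field in sorted(LANGUAGE_FIELDS.items()):
        iconn.add_field_action(field, actions.INDEX_FREETEXT, language=lang)


def field_for_language(lang, default="en"):
    """Return the field name for `lang`, falling back to the default language."""
    return LANGUAGE_FIELDS.get(lang, LANGUAGE_FIELDS[default])


def parse_in_language(sconn, text, lang):
    """Parse `text` against the one field whose stemmer matches `lang`."""
    field = field_for_language(lang)
    # Restricting the query to a single field keeps the search-time
    # stemmer the same as the one used when that field was indexed.
    return sconn.query_parse(text, default_allow=field)
```

The caller has to know the query language up front, for example from a user preference, and each page's text would go into the field matching its language at index time.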

-- 
Richard 