is EXACT_MATCH working?

J5

Sep 29, 2011, 6:44:10 PM
to xappy-discuss
Hello,

We are using Xappy to create indexes for package searching in Fedora.
Right now the results are a bit skewed because freetext searches simply
rank by the number of times a term shows up. I want to fix this
using exact matching on the package name, so that if an exact match is
found we return it as the top result. This does not seem to work.
If I do this and remove all of the other matching fields, we always get
an empty result:

iconn.add_field_action('exact_name', xappy.FieldActions.INDEX_EXACT)
iconn.add_field_action('exact_name', xappy.FieldActions.STORE_CONTENT)
doc.fields.append(xappy.Field('exact_name', 'dbus', weight=100.0))
.
.
.

Searching for 'dbus' using xapian should then return that match, but we
get an empty set:

query = qp.parse_query('dbus')
enquire.set_query(query)
matches = enquire.get_mset(0, 10)
count = matches.get_matches_estimated()
print count

> 0

How do we get count working? BTW we are using xappy for indexing
because it presents a nice interface but xapian is simple enough on
the query side that we decided to use that for stability.

Richard Boulton

Sep 30, 2011, 5:22:21 AM
to xappy-...@googlegroups.com
On 29 September 2011 23:44, J5 <john.j5....@gmail.com> wrote:
> How do we get count working?  BTW we are using xappy for indexing
> because it presents a nice interface but xapian is simple enough on
> the query side that we decided to use that for stability.

I'm not sure what stability you mean, but ok; it's possible to do
this, but you'll need to understand a bit more about xapian internals.
I think you'll end up replicating chunks of xappy, so I wouldn't take
this approach, personally.

I think the problem in this case is that the INDEX_EXACT action
doesn't store an unprefixed version of the term. For an
INDEX_FREETEXT action, the text "dbus" will get indexed both as "dbus"
(for non field-specific searches) and also as something like "XAdbus"
for field specific searches. (dbus may also be stemmed, depending on
settings). For an INDEX_EXACT field, you'll just get something like
the "XAdbus" term.
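
To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Xappy's actual indexing code: the dictionary-based index and the helper names are hypothetical, and "XA" is just the illustrative prefix used above.

```python
from collections import defaultdict


def index_freetext(index, docid, text, prefix):
    # A freetext field gets both an unprefixed and a prefixed posting
    # for every word (stemming is ignored in this toy model).
    for word in text.lower().split():
        index[word].add(docid)
        index[prefix + word].add(docid)


def index_exact(index, docid, value, prefix):
    # An exact field gets only the prefixed term, never a bare one.
    index[prefix + value].add(docid)


def lookup(index, term):
    # Return the set of document ids posted under a term.
    return index.get(term, set())


# Toy inverted index: term -> set of document ids.
index = defaultdict(set)
index_exact(index, 1, 'dbus', 'XA')

bare_hits = lookup(index, 'dbus')        # nothing is indexed as plain 'dbus'
prefixed_hits = lookup(index, 'XAdbus')  # the exact field is found here
```

With only the exact field indexed, the bare term has no postings at all, which is consistent with the count of 0 reported above.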

To search this using pure xapian, you'll have to look up what the
prefix to insert is by reading and unpacking the metadata key stored
by xappy which holds this configuration, and give it to the query
parser by calling qp.add_boolean_prefix().
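
As a rough sketch of that last step: the helper below is hypothetical, and it assumes you have already worked out the prefix for the field (the "XA" in the usage note is only the illustrative value from above, not something read from a real database). It relies only on the `add_boolean_prefix()` and `parse_query()` methods of `xapian.QueryParser`.

```python
def parse_with_exact_field(qp, query_string, field_name, term_prefix):
    """Register a boolean prefix on the parser, then parse the query.

    qp is expected to be a xapian.QueryParser. term_prefix must be the
    prefix that the index really uses for field_name; finding it is
    left to the caller and is not shown here.
    """
    # After this call, "field_name:value" in the query string is mapped
    # to the term term_prefix + value and treated as a boolean filter.
    qp.add_boolean_prefix(field_name, term_prefix)
    return qp.parse_query(query_string)
```

It could then be called as `parse_with_exact_field(qp, 'exact_name:dbus', 'exact_name', 'XA')` and the result passed to `enquire.set_query()`. Note that a boolean filter on its own restricts the match set but does not contribute to ranking.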

Really, I recommend you use xappy for searches too.

--
Richard

john palmieri

Sep 30, 2011, 12:10:48 PM
to xappy-...@googlegroups.com
On Fri, Sep 30, 2011 at 5:22 AM, Richard Boulton <ric...@tartarus.org> wrote:
> On 29 September 2011 23:44, J5 <john.j5....@gmail.com> wrote:
>> How do we get count working?  BTW we are using xappy for indexing
>> because it presents a nice interface but xapian is simple enough on
>> the query side that we decided to use that for stability.
>
> I'm not sure what stability you mean, but ok; it's possible to do
> this, but you'll need to understand a bit more about xapian internals.
>  I think you'll end up replicating chunks of xappy, so I wouldn't take
> this approach, personally.

Thanks for your reply. I hope I didn't come off as offensive. As it
is, the stable version of xappy doesn't quite have everything we need,
so I am using the version in svn, and even then our requirements are
somewhat different, which is always the issue between high level
interfaces and low level capabilities. Because the documentation of
xapian itself is sparse, I do want to know the internals and how
it is doing its matching so that we can tweak it if need be. Xappy
has given us a great jumping off point in that respect. For
instance, though, I don't need the stored document to be pickled; json
would work better for us as we are simply storing strings, lists and
hashes. This seems easy to switch out by not marking any fields as
STORE_CONTENT and setting the data in the xapian document before it is
saved to the db.
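
A minimal sketch of what I mean, using only the standard json module. The helper names and the example field names are made up for illustration; the packed string would go to `xapian.Document.set_data()` and come back from `get_data()`.

```python
import json


def pack_document_data(fields):
    # Serialise a dict of strings, lists and hashes to a JSON string.
    # sort_keys keeps the stored form stable between runs.
    return json.dumps(fields, sort_keys=True)


def unpack_document_data(data):
    # Reverse of pack_document_data; takes what was stored.
    return json.loads(data)


# Hypothetical package record, for illustration only.
record = {
    'name': 'dbus',
    'summary': 'message bus',
    'sub_packages': ['dbus-libs', 'dbus-devel'],
}
packed = pack_document_data(record)
restored = unpack_document_data(packed)
```

On the indexing side this would look like `doc.set_data(pack_document_data(record))`, assuming `doc` is the underlying xapian document.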

> I think the problem in this case is that the INDEX_EXACT action
> doesn't store an unprefixed version of the term.  For an
> INDEX_FREETEXT action, the text "dbus" will get indexed both as "dbus"
> (for non field-specific searches) and also as something like "XAdbus"
> for field specific searches.  (dbus may also be stemmed, depending on
> settings).  For an INDEX_EXACT field, you'll just get something like
> the "XAdbus" term.

> To search this using pure xapian, you'll have to look up what the
> prefix to insert is by reading and unpacking the metadata key stored
> by xappy which holds this configuration, and give it to the query
> parser by calling qp.add_boolean_prefix().

Ah, that makes sense. I was looking at the Debian xapian package
search and they do something like this. I hadn't worked out how it
worked, but now I understand the matching a bit better. Thanks.

> Really, I recommend you use xappy for searches too.

J5

Sep 30, 2011, 2:11:27 PM
to xappy-discuss
On Sep 30, 5:22 am, Richard Boulton <rich...@tartarus.org> wrote:
Hmm, ok, so I got this sort of working. If I search for
exact_name:dbus it finds the package, but that isn't exactly what I am
trying to achieve here. Say I have a list of package names with
descriptions which mention dbus to a varying degree:

dbus-python - 2 hits for dbus in description
dbus - 1 hit for dbus in description
Perl-DBus - 5 hits for dbus in description

I want the search for "dbus" to weight exact matches higher than non-
exact matches so it would return:

dbus
Perl-DBus
dbus-python

Right now my searches for dbus do not even show dbus in the top 10.
I've tried querying for "exact_name:dbus dbus" with various default ops
and it either returns nothing at all or the same results as dbus. I
noticed that INDEX_EXACT doesn't do weighting, so it is most likely not
what I am looking for. I could possibly add my own prefix, weight
that, and add it to the search terms, but then PREFIXdbus-python may
still be ranked above PREFIXdbus (though it would fix PREFIXPerl-DBus
showing up at the top).

I'm slowly grasping how to tweak things but it isn't all that
intuitive yet. Thanks for the help.
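
To pin down the ordering I am after, here is a toy scoring sketch in plain Python. It is not how xapian computes weights: the score is just the description hit count plus a fixed boost when the package name equals the search term, and the boost of 100.0 simply echoes the weight I tried on the field. In xapian terms this might correspond to OR-ing a scaled exact-name query (xapian.Query.OP_SCALE_WEIGHT) with the freetext query, but I have not verified that.

```python
def rank_packages(packages, term, exact_boost=100.0):
    """Order package names by hit count, boosting an exact name match.

    packages is a list of (name, description_hits) pairs.
    """
    term = term.lower()
    scored = []
    for name, description_hits in packages:
        score = float(description_hits)
        if name.lower() == term:
            # The whole name equals the search term: push it to the top.
            score += exact_boost
        scored.append((score, name))
    # Highest score first; sort() is stable, so ties keep input order.
    scored.sort(key=lambda pair: -pair[0])
    return [name for _score, name in scored]


packages = [('dbus-python', 2), ('dbus', 1), ('Perl-DBus', 5)]
ranked = rank_packages(packages, 'dbus')
# ranked == ['dbus', 'Perl-DBus', 'dbus-python']
```

Without the boost (`exact_boost=0.0`) the same data comes back as Perl-DBus, dbus-python, dbus, which is the kind of result I am seeing now.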