Confused about case-sensitivity


Michael Elsdörfer

Mar 17, 2008, 2:07:30 AM
to xappy-discuss
I have two fields in a xappy index, both using INDEX_FREETEXT, one
having a "language" option applied, the other not.

The stemmed fields work as expected: case does not matter at all.

The unstemmed fields mostly seem to ignore case as well - except for
the first character:

For example, the term "gee" in that field will be matched by "GEE",
"GeE" etc., but not by "gee" or "geE". Even more strange, the case of
the term I originally indexed doesn't seem to matter either. If the
term is "buffed", I have to query for "Buffed" to find the document.

Any idea what might be wrong here? If not, any suggestions on how to
debug this?

Using SVN checkout, xapian retrieved via libs\get_xapian.py, Windows

Richard Boulton

Mar 17, 2008, 4:12:16 AM
to xappy-...@googlegroups.com

This doesn't seem like the correct behaviour to me... The intended
behaviour is that:

- In the unstemmed case, capitalisation should be entirely irrelevant.

- In the stemmed case, a query word which is uncapitalised should
match any word with the same stem, but should give a higher weight to an
exact match for the word. A query word which has an initial capital is
assumed to represent a proper noun, and will only match an exact match
for the word.
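
As a rough illustration of the rules above, here is a small pure-Python sketch. It is not Xappy's or Xapian's actual implementation: `toy_stem` and `matches` are made-up names, the stemmer is a deliberately crude stand-in, and the weighting of exact matches is not modelled at all.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def matches(query_word, indexed_word, stemmed):
    """Return True if query_word should match indexed_word under the intended rules."""
    q, w = query_word.lower(), indexed_word.lower()
    if not stemmed:
        # Unstemmed field: capitalisation is irrelevant, only the word itself counts.
        return q == w
    if query_word[:1].isupper():
        # Initial capital: treated as a proper noun, so only the exact word matches.
        return q == w
    # Uncapitalised query on a stemmed field: any word with the same stem matches.
    return toy_stem(q) == toy_stem(w)
```

Under these rules, `matches("gee", "gee", stemmed=False)` and `matches("GeE", "gee", stemmed=False)` should both be true, which is the behaviour expected for the unstemmed field in the original report.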

If you're able to put together a minimal example demonstrating the
behaviour you're seeing, that would be very helpful - I'll try and look
into this shortly, and such an example would save me a bit of time.

--
Richard

Michael Elsdörfer

Mar 17, 2008, 11:50:45 AM
to xappy-discuss
Hi Richard,

here's an example:

http://dpaste.com/39820/

On my system, the output is:

(term, num results)
buffed 0
Buffed 1
gee 0
Gee 1
GeE 1
shock 1
evolution 0

Normally, and according to your explanation, I would expect to see
exactly one result for each query.

Also (I didn't mention this in my original post), as you can see the
fields "title" and "text" are defined exactly the same way, but appear
to behave differently. The all-lowercase query "shock" finds the
document through the "title" field, while "evolution" through the
"text" field doesn't seem to work.

Thanks for your help,

Michael


Richard Boulton

Mar 17, 2008, 1:03:47 PM
to xappy-...@googlegroups.com
Michael Elsdörfer wrote:
> Hi Richard,
>
> here's an example:
>
> http://dpaste.com/39820/

Thanks - that was very helpful.

> Normally, and according to your explanation, I would expect to see
> exactly one result for each query.

Yes, that would be reasonable. I've just done a quick investigation of
what happens, and found the problem; we don't currently cope with mixed
stemming settings correctly.

If you try setting all the field actions to use the same language, or
all of them to use no language (so no stemming), it works as expected.
However, when some of the fields have a stemmer and others don't, the
query parser fails to build the search terms for the unstemmed fields
correctly.

I can see a "quick hack" solution, but I'm not certain it won't degrade
performance elsewhere, so I'll do a few tests to check on that. I'm
hoping to have time in the near future to clean up the way in which the
field settings are set, which will make this kind of conflict
impossible, so I'm not going to spend too much effort on a short-term
solution.

For now, I suggest you use the same stemming strategy for all free text
fields.
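
One way to keep the stemming strategy identical across fields is to register them all through a single helper. The sketch below is only an illustration: `add_freetext_fields` is a made-up name, and the connection and the action constant are passed in (assumed to be an `xappy.IndexerConnection` and `xappy.FieldActions.INDEX_FREETEXT`) so that the snippet itself does not need to import xappy.

```python
def add_freetext_fields(conn, freetext_action, field_names, language=None):
    """Register every named field with the same stemming setting.

    conn is assumed to be an xappy.IndexerConnection and freetext_action
    xappy.FieldActions.INDEX_FREETEXT; both are passed in by the caller.
    """
    # Only pass the "language" option when stemming is wanted at all.
    options = {"language": language} if language else {}
    for name in field_names:
        conn.add_field_action(name, freetext_action, **options)
```

For example, `add_freetext_fields(conn, xappy.FieldActions.INDEX_FREETEXT, ["title", "text"], language="en")` would give both fields the same stemmer, and leaving `language` out would leave both unstemmed.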

Thanks very much for your feedback - it came at a good time, since I'm
currently thinking about how to do this restructuring.

> Also (I didn't mention this in my original post), as you can see the
> fields "title" and "text" are defined exactly the same way, but appear
> to behave differently. The all-lowercase query "shock" finds the
> document through the "title" field, while "evolution" through the
> "text" field doesn't seem to work.

The search for "evolution" in the text field doesn't work because you
missed the line

    document.fields.append(xappy.Field('text', d.get('text', '')))

in other words, you don't actually add the contents of the text field to
the UnprocessedDocument anywhere!

(Easy mistake to make - it took me a while to spot it...)
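
A small check along these lines could catch that kind of omission before indexing. This is just a sketch: `unpopulated_fields` is a made-up helper, and it assumes each entry in `document.fields` exposes `.name` and `.value` attributes.

```python
def unpopulated_fields(configured_names, document_fields):
    """Return configured field names that have no non-empty value in the document.

    document_fields is any iterable of objects with .name and .value
    attributes (assumed here to be how the document's fields look).
    """
    populated = {f.name for f in document_fields if f.value}
    return [name for name in configured_names if name not in populated]
```

Calling `unpopulated_fields(["title", "text"], document.fields)` on the document from the example would be expected to report `"text"` as missing.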

--
Richard

Michael Elsdörfer

Mar 18, 2008, 12:38:35 PM
to xappy-discuss
Richard,

> For now, I suggest you use the same stemming strategy for all free text
> fields.

That did the trick, and is good enough for now.

> in other words, you don't actually add the contents of the text field to
> the UnprocessedDocument anywhere!
> (Easy mistake to make - it took me a while to spot it...)

Oh - sorry about that. Too much experimenting with the search in my
actual app must have had me mixing things up.

Thank you very much for investigating this. Xappy is great, keep up
the good work.

Regards,

Michael
