How to search documents with stemmed terms

Mika Mustalahti

Aug 20, 2013, 4:25:52 PM
to xapi...@googlegroups.com
Hi!

Does Xapian_db support Finnish stemming out of the box when I set the language to "fi"?
And how should I query the documents so that it uses stemming?

Now I'm querying the documents like this:
Question.search "election_id:#{election_id} AND (#{terms})"

And that produces a query like this:
XapianDb search (0.214413ms) indexed_class:question AND (election_id:2 AND (title:kysymystä OR title:ei))

This query, however, does not find a document whose title contains the word "kysymys", so it is not using stemming in the search. The search only finds a document if it contains an exact match of one of the keywords.

I don't have any enabled or disabled query flags set in my configuration file, so it is using the defaults. Also, when does a change in the configuration file take effect? Is a server restart needed, or just a reindex?

-Mika

Gernot

Aug 21, 2013, 11:54:08 AM
to xapi...@googlegroups.com
Hi Mika

You'll have to try. We do not use stemming in our apps. You may want to read the xapian (not xapian_db) docs to get more information.

Greets Gernot

Mika Mustalahti

Aug 21, 2013, 6:24:01 PM
to xapi...@googlegroups.com
Hi Gernot!

Which bindings does Xapian_db use for Xapian?

Could you point me to the right documentation?

I tried to see if the stemmer works by adding the line
  p XapianDb::Config.stemmer.operator(terms)
inside the loop in the method index_text, so that it would print the stemmed form of each word in my document. Instead I get the error message "undefined method `operator' for #<Xapian::Stem:0x9c771a4>".

But the Xapian documentation at http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html says that it should have that method, so I guess I'm reading the wrong documentation.

-Mika

Gernot

Aug 22, 2013, 1:06:04 AM
to xapi...@googlegroups.com
Hi Mika

xapian_db uses the Ruby bindings. Here's the documentation: http://xapian.org/docs/bindings/ruby/rdocs/. Unfortunately it's not very complete.
Have you tried stemming again since renaming your config file? Before that you were running the xapian_db default configuration, because xapian_db couldn't find your config file, so the language fi was never applied.

Greez Gernot

Mika Mustalahti

Aug 22, 2013, 6:30:42 AM
to xapi...@googlegroups.com
Hi Gernot!

Stemming works. I had just tested it with a word that has Scandinavian letters (ä or ö) near the end, and it didn't work then. When I try it with words that contain only ASCII characters, it works. Looks like this is a string encoding issue after all.

I made xapian_db log the terms it processes, and they look like this: "kysymyst\xC3\xA4". It should be "kysymystä".

Do you have any suggestions where this kind of mix-up could happen? In my database the text is correct.

-Mika

Gernot

Aug 22, 2013, 9:37:04 AM
to xapi...@googlegroups.com
Hi Mika

All I can tell you is that the xapian library uses UTF-8 encoding. We use UTF-8 in our apps and our postgres databases and have never had any encoding problems. "German" umlauts should not be a problem at all.
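
On the UTF-8 point, a minimal Ruby sketch may help when reading the logged term. It assumes (the thread does not confirm this) that the logger printed a string tagged as binary, in which case Ruby's inspect shows every non-ASCII byte as a \x escape. The bytes C3 A4 are the UTF-8 encoding of "ä", so the logged form and "kysymystä" would then be the same data. The sketch uses only core Ruby and does not touch xapian or xapian_db.

```ruby
# The term as it appeared in the log. `.b` returns a copy tagged as binary
# (ASCII-8BIT), which is an assumption about how the logger saw the string.
logged = "kysymyst\xC3\xA4".b

# Re-tag the very same bytes as UTF-8; no bytes are changed.
decoded = logged.dup.force_encoding(Encoding::UTF_8)

puts logged.inspect            # "kysymyst\xC3\xA4"
puts decoded.valid_encoding?   # true
puts decoded                   # kysymystä
puts decoded == "kysymystä"    # true
```

If this is what happened, the escaped form in the log is only a display artifact and not evidence of corrupted text.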

Greez Gernot

Mika Mustalahti

Aug 22, 2013, 4:09:41 PM
to xapi...@googlegroups.com
Hi Gernot

The mystery is finally solved. The issue is in the stemming algorithm: "kysymys" is stemmed to "kysymys", but "kysymystä" is stemmed to "kysymy", and that is why my test was failing. If the algorithm worked perfectly, the stemmed form would be the same for both. Finnish is a complicated language.
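
A small Ruby sketch of why the search missed, for anyone who lands here later. It does not call Xapian; the two stems are hard-coded from what I observed above, and stems_match? is a hypothetical helper written only for this illustration. The point is that a stemmed search hits only when the stem of the query word is identical to the stem stored for the indexed word.

```ruby
# Stems as reported in this thread for the Finnish stemmer.
# Hard-coded on purpose: this sketch does not load or call Xapian.
REPORTED_STEMS = {
  "kysymys"   => "kysymys",
  "kysymystä" => "kysymy"
}.freeze

# Hypothetical helper: true only if both words reduce to the same stem,
# which is the condition for a stemmed term match.
def stems_match?(indexed_word, query_word, stems = REPORTED_STEMS)
  stems.fetch(indexed_word) == stems.fetch(query_word)
end

puts stems_match?("kysymys", "kysymys")     # true
puts stems_match?("kysymys", "kysymystä")   # false
```

So a document indexed with "kysymys" is found by the query "kysymys" but not by "kysymystä", even with stemming switched on.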

-Mika