Stemming

Jacek Śliwerski

Aug 23, 2010, 4:09:19 PM
to Open-Tran
I have implemented stemming in Open-Tran. The following languages are
supported through Snowball stemmers from NLTK: Danish, German,
Spanish, Finnish, French, Hungarian, Italian, Dutch, Norwegian,
Portuguese, Romanian, Russian and Swedish. English is supported
through Porter stemmer.

You won't see any improvements yet. The new import process is
scheduled to execute in 6 or 7 hours and will import the phrases with
stemmed words. However, the search engine does not stem the input
yet. If the import process completes without errors, you will see
degraded responses tomorrow morning.
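
To illustrate why responses degrade while only the import side stems, here is a minimal sketch. It is not Open-Tran's actual code: `toy_stem` is a crude hypothetical stand-in for the Snowball/Porter stemmers, and the phrases and index layout are assumptions made up for the example.

```python
import re

def toy_stem(word: str) -> str:
    """Crude suffix stripper; a stand-in for a real Snowball/Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(phrases):
    """Map each stemmed word to the set of phrases that contain it."""
    index = {}
    for phrase in phrases:
        for word in re.findall(r"\w+", phrase.lower()):
            index.setdefault(toy_stem(word), set()).add(phrase)
    return index

# Hypothetical sample data, not taken from the real database.
phrases = ["Filtering options", "Apply filters", "Filter by name"]
index = build_index(phrases)

# Query used exactly as typed (search side does not stem yet): the key
# "filtering" is absent because only the stem "filter" was stored.
raw_hits = index.get("filtering", set())

# Query stemmed the same way as the import: every phrase is found again.
stemmed_hits = index.get(toy_stem("filtering"), set())

print(raw_hits)
print(stemmed_hits)
```

Any inflected query misses until the search side applies the same stemmer that the import used.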

I've got a question: does it make sense to use Spanish stemmer for
Catalan? How about Portuguese for Galician?

Thanks!
Jacek

Leandro Regueiro

Aug 24, 2010, 4:58:43 AM
to open...@googlegroups.com, Trasno - Lista
2010/8/23 Jacek Śliwerski <sli...@googlemail.com>:

I don't really know about this. I'll forward this message to the
Galician free software localization mailing list.

mvillarino

Aug 24, 2010, 7:04:02 AM
to prox...@trasno.net, open...@googlegroups.com
>> I've got a question: does it make sense to use Spanish stemmer for
>> Catalan?  How about Portuguese for Galician?
>
> I don't really know about this. I'll forward this message to the
> galician free software localization mailing list.

I have ever tryed to stem Galician with Portuguese stemmer, however it
should fail for verbs with enclitic particles (amabachellela ->
amava-che-lhe-la)

Regarding stemmers and CAT tools, Lokalize from trunk (I have not yet
tested KDE SC 4.5) stems prior to searching the glossary; that is, it
stems the source text before searching for matches in the stemmed
version of the glossary (because not everybody uses the glossary for
terminology only). Although Mikola initially considered using
Snowball, it has finally been done through Hunspell's stemmer, thus
supporting a wider set of languages.
The results? Well, it is so fresh that possibly most of Lokalize's
users have not yet noticed this functionality. I did some
testing on a pre-release checkout of the sources and I personally like
the result, but...
...but please take into account that by "stemming" Hunspell means
doing a "reverse spellchecking": for each word it offers as stems
whatever words in the dictionary can be derived into the word in the
text, so please expect a lot of "false matches". By the way, I find
them very useful both to check the quality of the glossary and
to pray for this process to use additional information from a POS
tagger some day in the near future.
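
The "reverse spellchecking" idea can be sketched roughly as follows. This is only a toy model under stated assumptions: the dictionary entries and suffix rules are hypothetical, and real Hunspell reads far richer rules from its .dic/.aff files.

```python
# Hypothetical mini-dictionary; entries are invented for the example.
DICTIONARY = {"leaf", "leave", "filter", "filtering"}

# Hypothetical suffix rules as (strip, add) pairs.
SUFFIX_RULES = [
    ("", "s"),     # filter -> filters, leave -> leaves
    ("f", "ves"),  # leaf -> leaves
    ("", "ing"),   # filter -> filtering
    ("", "ed"),    # filter -> filtered
]

def derive(entry):
    """All surface forms the toy rules can generate from one dictionary entry."""
    forms = {entry}
    for strip, add in SUFFIX_RULES:
        if entry.endswith(strip):
            forms.add(entry[: len(entry) - len(strip)] + add)
    return forms

def candidate_stems(word):
    """Every dictionary entry that can be derived into `word`."""
    return sorted(entry for entry in DICTIONARY if word in derive(entry))

print(candidate_stems("leaves"))
print(candidate_stems("filtering"))
```

In this toy setup a single word can come back with more than one candidate stem, which is the kind of ambiguity a POS tagger might help to resolve.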

All the best,
Marce Villarino

Leandro Regueiro

Sep 8, 2010, 10:17:03 AM
to prox...@trasno.net, open...@googlegroups.com

Now that you have implemented stemming, we can't search for the
translation of words like "filtering", which is both a substantive and
a verb. Have you considered any way to solve this, Jacek?

Leandro Regueiro

Sep 8, 2010, 2:42:17 PM
to prox...@trasno.net, open...@googlegroups.com
On Wed, Sep 8, 2010 at 8:15 PM, mvillarino <mvill...@gmail.com> wrote:
> 2010/9/8, Leandro Regueiro <leandro....@gmail.com>:

>>> I have ever tryed to stem Galician with Portuguese stemmer, however it
>>> should fail for verbs with enclitic particles (amabachellela ->
>>> amava-che-lhe-la)
> [typo]: I have Never ...

>
>> Now you have implemented stemming we couldn't search the translation
>> for some words like "filtering" that is a substantive and also a verb.
>> Have you considered some idea to solve this, Jacek?
>
> Sorry, I do not understand: please, can you explain in more details
> what the problem is?

Right now Open-Tran searches Galician using stemming. That is
good. Or it was good until I tried to search for "filtering" and it gave
me all the results for "filters", "filtering", "filter" and "filtered",
when I only wanted the substantive "filtering" to see the most
common Galician translation (maybe "filtro", "filtrado" or
"filtraxe"). So now Open-Tran is unusable for searches like that
one, because you stemmed all the words _before_ saving them in the
database.

Now the question: Jacek, did you consider this issue before
recoding Open-Tran to stem the raw data prior to dumping it into the
database? How can we do searches like that one (getting results only
for "filtering" and not for "filters")?

Bye,
Leandro Regueiro

mvillarino

Sep 8, 2010, 2:15:29 PM
to prox...@trasno.net, open...@googlegroups.com
2010/9/8, Leandro Regueiro <leandro....@gmail.com>:

>> I have ever tryed to stem Galician with Portuguese stemmer, however it
>> should fail for verbs with enclitic particles (amabachellela ->
>> amava-che-lhe-la)
[typo]: I have Never ...

> Now you have implemented stemming we couldn't search the translation
> for some words like "filtering" that is a substantive and also a verb.
> Have you considered some idea to solve this, Jacek?

Sorry, I do not understand: please, can you explain in more details
what the problem is?

Jacek Śliwerski

Sep 8, 2010, 3:24:56 PM
to open...@googlegroups.com, prox...@trasno.net
On 08.09.2010 20:42, Leandro Regueiro wrote:
>
> Right now open-tran does searches for galician using stemming. That is
> good. Or it was good until I tried to search "filtering" and it give
> me all the results for "filters", "filtering", "filter", "filtered"
> and I only searched for the substantive "filtering" to see the most
> common galician translation (maybe "filtro", "filtrado" or
> "filtraxe"). So now open-tran is unusable for doing searches like that
> one, because you stemmed all the words _before_ saving them in the
> database.

I understand the issue. I know how to fix it. For now, I suggest the
following workaround: put the word in quotes, i.e. instead of this:

http://en.gl.open-tran.eu/suggest/filtering

Try this:

http://en.gl.open-tran.eu/suggest/"filtering"
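
For illustration only, the difference between the two queries could look roughly like the sketch below. It is not the real Open-Tran implementation: `toy_stem`, `suggest` and the sample phrases are hypothetical, and it assumes the original (unstemmed) phrase text is still available next to the stems.

```python
import re

def toy_stem(word: str) -> str:
    """Crude suffix stripper; a stand-in for a real Snowball/Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def suggest(query, phrases):
    """Quoted query: match the literal word form. Unquoted: match by stem."""
    quoted = len(query) >= 2 and query[0] == '"' and query[-1] == '"'
    term = query.strip('"').lower()
    results = []
    for phrase in phrases:
        words = re.findall(r"\w+", phrase.lower())
        if quoted:
            hit = term in words  # exact form only
        else:
            hit = toy_stem(term) in {toy_stem(w) for w in words}
        if hit:
            results.append(phrase)
    return results

# Hypothetical sample data.
PHRASES = ["Filtering options", "Apply filters", "Filter by name", "Filtered results"]

print(suggest("filtering", PHRASES))
print(suggest('"filtering"', PHRASES))
```

In this sketch the unquoted query matches every inflected form, while the quoted one keeps only phrases that contain the literal word "filtering".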

Jacek

Leandro Regueiro

Sep 9, 2010, 6:21:51 AM
to open...@googlegroups.com, prox...@trasno.net
2010/9/8 Jacek Śliwerski <sli...@googlemail.com>:

It works. Thanks.
