Porting the Snowball stemmers to NLTK

Peter Stahl

Jun 1, 2010, 7:24:51 AM
to nltk-dev
Hi again everyone,

perhaps you have already read my thread about developing a
morphological analyzer for NLTK. But before I start on
that, I'd like to do another thing first. As computational morphology
is one of my favorite areas, I've dealt with stemmers, too. I'd like
to extend the collection of stemmers which are already included in the
toolkit. In particular, I'm talking about the Snowball stemmers that
have been developed by Martin Porter and Richard Boulton. They are
available on http://snowball.tartarus.org/.

The algorithms for every language are explained in detail, so porting
them to Python shouldn't take long, I think. I've
already implemented the stemmer for German and compared its results to
those provided on the website for testing purposes. So testing the
code is pretty easy.

In particular, my idea is to write a module with the new class
SnowballStemmer(). Each instance of this stemmer class should be
initialized with a string describing the language that should be used
for stemming. Something like this:

stemmer = SnowballStemmer("german")
stemmed_word = stemmer.stem("...")

What do you think about this? Please let me know. Thanks again.


Best regards,
Peter
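
The proposed interface, one class initialized with a language string, can be sketched roughly as follows. This is only a toy illustration: the class name `ToySnowballStemmer` and the suffix lists are hypothetical placeholders and are not the Snowball rules, which are considerably more involved.

```python
# Toy sketch of a stemmer that is selected by a language string.
# The suffix lists are illustrative placeholders, NOT the Snowball algorithms.

class ToySnowballStemmer(object):
    # Hypothetical per-language suffix lists, longest suffix first.
    _SUFFIXES = {
        "german": ("ungen", "ung", "en", "er", "e"),
        "english": ("ing", "ed", "s"),
    }

    def __init__(self, language):
        if language not in self._SUFFIXES:
            raise ValueError("unsupported language: %r" % (language,))
        self.language = language

    def stem(self, word):
        word = word.lower()
        for suffix in self._SUFFIXES[self.language]:
            # Strip the first matching suffix, keeping at least 3 characters.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word


stemmer = ToySnowballStemmer("german")
stemmed_word = stemmer.stem("Zeitungen")
```

Rejecting an unknown language in the constructor means a typo in the language name fails immediately rather than at the first call to `stem()`.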

Gregg Lind

Jun 1, 2010, 9:16:36 AM
to nltk...@googlegroups.com
Be aware that there is already a Python port of the Porter stemmer,
though it is not completely efficient.

At the risk of derailing, it would be nice to turn that into a caching /
memoized stemmer :)

GL

Peter Stahl

Jun 1, 2010, 9:49:14 AM
to nltk-dev
Hi Gregg,

I know that there is already an implementation of the Porter stemmer
included in NLTK. Thus I want to focus on the other languages apart
from English.

Can you please tell me what a caching stemmer is? Does it mean that a
word and its stemmed form are saved in a database for future use or
something? Thanks.
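
For context, a caching (memoized) stemmer usually keeps the word-to-stem pairs in an in-memory dictionary rather than a database. A minimal sketch, assuming only that the wrapped object has a `stem(word)` method; `CachingStemmer` and `CountingStemmer` are hypothetical names used for illustration:

```python
class CachingStemmer(object):
    """Wrap any object with a stem(word) method and remember its results."""

    def __init__(self, stemmer):
        self._stemmer = stemmer
        self._cache = {}  # word -> stem, kept in memory only

    def stem(self, word):
        try:
            return self._cache[word]
        except KeyError:
            # First time we see this word: compute the stem and store it.
            result = self._cache[word] = self._stemmer.stem(word)
            return result


class CountingStemmer(object):
    """Hypothetical stand-in stemmer that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def stem(self, word):
        self.calls += 1
        return word.lower().rstrip("s")


inner = CountingStemmer()
cached = CachingStemmer(inner)
results = [cached.stem(w) for w in ["Cats", "dogs", "Cats", "Cats"]]
```

The repeated word reaches the underlying stemmer only once; later lookups are answered from the dictionary.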






Peter Ljunglöf

Jun 1, 2010, 1:08:50 PM
to nltk...@googlegroups.com

On 1 Jun 2010, at 15:49, Peter Stahl wrote:

> I know that there is already an implementation of the Porter stemmer
> included in NLTK. Thus I want to focus on the other languages apart
> from English.

There is also a Python wrapper for Snowball. So, the easy way would be to integrate that wrapper into NLTK. The harder way would be to translate the C code to Python.

/Peter

Peter Stahl

Jun 1, 2010, 2:17:16 PM
to nltk...@googlegroups.com
Hi Peter (a nice name, isn't it ;-),

I know of the wrapper. Both ways you mention are possible, but I prefer writing the code from scratch. The time that I would need for understanding the C code (I hate C, by the way) is, for me personally, better spent in starting with Python directly. Anyway, thanks for the hint. :-)

Peter Stahl

Jun 10, 2010, 8:09:46 AM
to nltk-dev
Hi everyone,

I just want to let you know that I've finished porting the Snowball
stemmers for German, Dutch, Swedish, Norwegian, Danish and Spanish. As
soon as I've done the same for French, Portuguese, Italian and
Romanian, I'm going to release the module. By the way, I've also
implemented a caching feature, so that a previously stemmed word can
be accessed at any time without having to repeat the stemming process
again and again. I hope you will like it. :-)

Best regards,
Peter

Steven Bird

Jun 10, 2010, 10:49:54 AM
to nltk-dev
On 10 June 2010 05:09, Peter Stahl <pemi...@googlemail.com> wrote:
> I just want to let you know that I've finished porting the Snowball
> stemmers for German, Dutch, Swedish, Norwegian, Danish and Spanish. As
> soon as I've done the same for French, Portuguese, Italian and
> Romanian, I'm going to release the module.

Great -- do you want to contribute this to NLTK? If so, please submit
it as an attachment to a new issue in our issue tracker. Also, please
take a look at our coding guidelines on this page:

http://code.google.com/p/nltk/wiki/DevelopersGuide

> By the way, I've also
> implemented a caching feature, so that a previously stemmed word can
> be accessed at any time without having to repeat the stemming process
> again and again.

NB memoization decorators:
http://wiki.python.org/moin/PythonDecoratorLibrary

-Steven
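
A memoization decorator of the kind found in such libraries can be sketched like this. It is a minimal hand-written version for single-argument functions, not the code from the linked page; the `stem` function below is a hypothetical placeholder. On current Python versions, `functools.lru_cache` from the standard library serves the same purpose.

```python
import functools


def memoize(func):
    """Cache the results of a one-argument function in a dictionary."""
    cache = {}

    @functools.wraps(func)
    def wrapper(word):
        if word not in cache:
            cache[word] = func(word)
        return cache[word]

    wrapper.cache = cache  # exposed so the cache can be inspected or cleared
    return wrapper


calls = []  # records which words actually reached the undecorated function


@memoize
def stem(word):
    # Placeholder rule for illustration only.
    calls.append(word)
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


stems = [stem("eating"), stem("eating"), stem("ice")]
```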

Peter Stahl

Jun 10, 2010, 11:36:23 AM
to nltk-dev
Hi Steven,

> Great -- do you want to contribute this to NLTK?  If so, please submit
> it as an attachment to a new issue in our issue tracker.

Of course I want to contribute this to NLTK. This is the reason why
I'm doing it. :-) And thanks for telling me HOW to contribute new
code, actually. I don't think it is explicitly mentioned anywhere that new
code should be attached to a new issue in the issue tracker. Perhaps
you can make that a bit clearer on the Google Code page for beginners
like me.
Oh, I wasn't aware that there is a library with examples for things
like this. It is a funny coincidence that I nearly did it in the same
way as in this library. But I didn't make use of decorators, though.
Thanks for the hint, anyway. :-)

Peter Stahl

Jun 20, 2010, 7:24:05 PM
to nltk-dev
Hi everyone,

finally, my work is done. :-) I've ported nearly all the Snowball
stemmers to NLTK. At the moment, the module's stemmers support 12
languages: Danish, Dutch, Finnish, French, German, Hungarian, Italian,
Norwegian, Portuguese, Romanian, Spanish and Swedish. The Russian and
the Turkish stemmers are missing. The first one is a bit too
complicated for me at the moment, and for the second one the
description of the algorithm is missing on the Snowball website.
Perhaps, I will add this functionality later. I configured the module
to be put easily into the nltk.stem package.

I would be very glad if my module was included into NLTK. Please give
it a try and let me know what you think. You can see the code here:
http://code.google.com/p/nltk/issues/detail?id=567

Thanks and best regards from Germany,
Peter

Peter Stahl

Jul 1, 2010, 9:15:17 AM
to nltk-dev
Hi,

unexpectedly, I've still found the time to port the Russian stemmer as
well. This stemmer works both with words consisting of the Cyrillic
alphabet and also with transliterated forms consisting of the Roman
alphabet. I've also done some minor code improvements.

In order to avoid confusion, I have deleted the two modules that I
posted earlier in the code repository. Instead, please use the module
that is attached to the comment I've posted there. It should work with
Python 2.4, 2.5 and 2.6. It can still be found here:
http://code.google.com/p/nltk/issues/detail?id=567

By the way, what about more feedback? Until now, your response is
quite disappointing. If you haven't found time so far, then it's OK.
But please tell me when you make use of it and what you think about
it.

Thanks. :-)

dmtr

Jul 1, 2010, 9:50:13 PM
to nltk-dev
> unexpectedly, I've still found the time to port the Russian stemmer as
> well. This stemmer works both with words consisting of the Cyrillic
> alphabet and also with transliterated forms consisting of the Roman
> alphabet. I've also done some minor code improvements.

Sounds interesting. I know some Russian and I'll give it a try. :)
BTW - which api.py is your module trying to import?

> By the way, what about more feedback? Until now, your response is
> quite disappointing. If you haven't found time so far, then it's OK.
> But please tell me when you make use of it and what you think about
> it.

Summer break in academia. :) Wish I was there.... Happy 4th :-)

Peter Stahl

Jul 3, 2010, 8:30:37 AM
to nltk-dev
Hi Dmitry,

> BTW - which api.py is your module trying to import?

I have configured the module so that it can be easily put into the
nltk.stem package where the other stemmers are. It imports the api.py
that is in this package. So just put it into this package and it
should then work in the way I explained in the module's docstrings. Of
course, you can use it independently, too. In this case you have to
comment the api.py import out and the class 'SnowballStemmer' has to
inherit the class 'object' and not 'StemmerI'.

Best regards,
Peter
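
As a variation on commenting out the import by hand, the same module could try the package import first and fall back to a plain base class. This is a sketch under the assumption that the interface lives at `nltk.stem.api.StemmerI`; `DemoStemmer` is a hypothetical class used only to show the inheritance.

```python
try:
    # Inside an NLTK installation the real interface is available.
    from nltk.stem.api import StemmerI
except ImportError:
    # Standalone fallback: a plain base class exposing the same single method.
    class StemmerI(object):
        def stem(self, token):
            raise NotImplementedError()


class DemoStemmer(StemmerI):
    """Hypothetical stemmer; it only lowercases, to keep the sketch short."""

    def stem(self, token):
        return token.lower()


result = DemoStemmer().stem("Haus")
```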

Peter Stahl

Jul 28, 2010, 2:53:18 PM
to nltk-dev
Hi,

for those of you who haven't noticed it yet: My port of the Snowball
stemmers is now a part of NLTK 2.0b9. After you have installed it, you
can use the stemmer easily:

>>> import nltk
>>> stemmer = nltk.SnowballStemmer(language)
>>> stemmer.stem(word)

I got mails some time ago from people asking how to install and use
the stemmers. So, with the new beta of NLTK installed, there shouldn't
be any more problems for you.


Best regards,
Peter

Nitin Madnani

Aug 3, 2010, 6:04:59 PM
to nltk-dev
Hi Peter,

Thank you for doing all the hard work of making Snowball stemmers for
various languages available into NLTK. However, I have a question: why
did you not include the English version as well?

I know that there is a version of the Porter Stemmer already included
in NLTK, but it does not always work correctly. Porter himself says on his
website that most of the unofficial implementations are not fully
correct. See: http://snowball.tartarus.org/texts/introduction.html.

One can also check that the official Snowball version of the stemmer
operates differently from the one included in NLTK:

>>> from nltk.stem import PorterStemmer

>>> s = PorterStemmer()
>>> s.stem('I am eating ice cream')
'I am eating ice cream'

However, if you use the same example on the Snowball demo page
(http://snowball.tartarus.org/demo.php), the results are different:

i -> i
am -> am
eating -> eat
ice -> ice
cream -> cream

It might be better to include the official Snowball implementation of
the Porter Stemmer Algorithm and have the PorterStemmer class just
point to that.

Thoughts?

Nitin

Peter Stahl

Aug 4, 2010, 6:31:59 AM
to nltk-dev
Hi Nitin,

yes, you guessed it. I didn't include the English version because
there's already Steven Bird's implementation of the Porter Stemmer
available in NLTK. But I wasn't aware that his implementation operates
differently than intended by Martin Porter. (But I must admit that I
haven't even tested it.)

I think it would be easier and less time-consuming if Steven Bird
reviewed his own code and corrected it according to Martin Porter's
definition of the algorithm. Afterwards, maybe it could be included
into my Snowball module. Nitin, I would say, just contact Steven Bird
and ask him what he thinks about it.


Peter

Peter Ljunglöf

Aug 5, 2010, 11:26:41 AM
to nltk...@googlegroups.com
Hi Peter and Nitin,

1. The StemmerI class has the single method .stem() which stems *a single word*, not a list of words (or a space-separated string). So Nitin calls the PorterStemmer incorrectly:

>>> from nltk.stem import PorterStemmer
>>> s = PorterStemmer()
>>> s.stem('I am eating ice cream')
'I am eating ice cream'

Instead you should do this:

>>> [s.stem(w) for w in 'I am eating ice cream'.split()]
['I', 'am', 'eat', 'ice', 'cream']

2. According to the docstring, nltk.stem.porter is a port (maintained by Vivake Gupta, not Steven Bird) of Martin Porter's original algorithm from 1980:

Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.

However, the English stemmer in the current Snowball project is much better. In fact, the old stemmer is included in Snowball, under the name "porter". If you install the Snowball python wrapper (available from the Snowball website), you get this:

>>> import Stemmer
>>> help(Stemmer.algorithms)
algorithms(...)
Get a list of the names of the available stemming algorithms.
[...]
Note that the the classic Porter stemming algorithm for English is
available by default: although this has been superceded by an improved
algorithm, the original algorithm may be of interest to information
retrieval researchers wishing to reproduce results of earlier
experiments. Most users will want to use the "english" algorithm,
instead of the "porter" algorithm.

Note that all three implementations differ, with nltk.stem.porter being in the middle:

>>> Stemmer.Stemmer('porter').stemWord('generously')
'gener'
>>> Stemmer.Stemmer('english').stemWord('generously')
'generous'
>>> nltk.PorterStemmer().stem('generously')
'gener'
>>> Stemmer.Stemmer('porter').stemWord('succeed')
'succe'
>>> Stemmer.Stemmer('english').stemWord('succeed')
'succeed'
>>> nltk.PorterStemmer().stem('succeed')
'succeed'

3. So, it would be nice to also include the latest English Snowball stemmer in nltk.stem.snowball; but of course, someone has to do it...

Also, as a side note: since Snowball is actively maintained, it would be good if the docstring of nltk.stem.snowball said something about which Snowball version it was ported from.

best, Peter
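
The two points above (stem one word at a time, and compare implementations word by word) can be combined in a small comparison helper. This is a sketch only: `compare_stemmers` is a hypothetical helper, and the two stand-in functions are placeholders for real stem functions such as the ones shown in the session above.

```python
def compare_stemmers(text, stemmers):
    """Stem each whitespace-separated token with every stemmer.

    `stemmers` maps a label to a function that takes ONE word.
    Returns a list of (word, {label: stem}) for words whose stems differ.
    """
    differences = []
    for word in text.split():
        stems = dict((label, stem(word)) for label, stem in stemmers.items())
        if len(set(stems.values())) > 1:
            differences.append((word, stems))
    return differences


# Hypothetical stand-ins; replace them with real single-word stem functions.
def strip_ly(word):
    return word[:-2] if word.endswith("ly") else word


def strip_ously(word):
    return word[:-5] if word.endswith("ously") else word


diffs = compare_stemmers("generously succeed", {"a": strip_ly, "b": strip_ously})
```

Only words on which the stemmers disagree are returned, which keeps the output short on large vocabularies.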

Nitin Madnani

Aug 5, 2010, 12:45:06 PM
to nltk...@googlegroups.com
Peter,
Thank you for catching my poorly constructed example.

However, despite the typo, I have definitely seen examples where the nltk version is inferior to the snowball version. I had switched to using the Snowball version by default for my work. However, when I saw that Peter (Stahl) had done the hard work of bringing Snowball into NLTK, I was excited only to find that English was not included.

I agree with your comments regarding maintaining a record of the Snowball algorithm version. I also think we should point PorterStemmer in NLTK to the 'porter' algorithm under Snowball for backwards compatibility.

- Nitin

Peter Stahl

Aug 5, 2010, 1:15:36 PM
to nltk-dev
Hey guys,

as far as I'm concerned, I can add the English Snowball stemmer (named
Porter2 on the website) to my module. I should have time for it this
weekend. I just thought it would be kind of repetitive to have two
English stemmers in NLTK. Perhaps users get confused and don't know
which stemmer to choose. But since the Porter2 algorithm seems to be
better, wouldn't it make sense to remove the older implementation
from NLTK? Anyway, I'm going to deal with it this weekend if you don't
mind.


Peter





Peter Ljunglöf

Aug 5, 2010, 3:49:35 PM
to nltk...@googlegroups.com
I agree with Peter and Nitin. Here's my suggestion:

1) Add the improved EnglishStemmer to nltk.stem.snowball

2a) Move the PorterStemmer implementation to nltk.stem.snowball
2b) Remove nltk.stem.porter (or deprecate it)
2c) [if Peter has the time] Remove the changes to the original Porter algorithm (marked --DEPARTURE-- and --NEW--), and test it on the sample at http://tartarus.org/~martin/PorterStemmer/

I don't think we should remove the old implementation, because of this quote from the above site:

"The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. (...) The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable."

/Peter


On 5 Aug 2010, at 19:15, Peter Stahl wrote:

> Hey guys,
>
> as far as I'm concerned, I can add the English Snowball stemmer (named
> Porter2 on the website) to my module. I should have time for it this
> weekend. I just thought it would be kind of repetitive to have two
> English stemmers in NLTK. Perhaps users get confused and don't know
> which stemmer to choose. But since the Porter2 algorithm seems to be
> better, wouldn't it make sense to remove the older implementation out
> of NLTK? Anyway, I'm going to deal with it this weekend if you don't
> mind.

Of course we don't mind. It's also perfectly okay if you have more important things to do. After all, you don't get paid for doing this:)

Steven Bird

Aug 5, 2010, 8:16:02 PM
to nltk...@googlegroups.com
Thanks Peter -- I think this is an excellent suggestion.

Jacob Perkins

Aug 5, 2010, 10:45:23 PM
to nltk-dev
Just to chime in - I made an online demo of all the current NLTK
stemmers at http://text-processing.com/demo/stem/. It includes the
Snowball stemmer with all the currently supported languages. Assuming
the English Snowball implementation is included in the 2.0 release,
I'll be sure to add it to the demo.

Jacob
---
http://streamhacker.com/
http://twitter.com/japerk


Peter Ljunglöf

Aug 6, 2010, 4:21:39 AM
to nltk...@googlegroups.com
Very nice!

/Peter

PS. In the example text, perhaps you meant "funnier" and not "funner"?

Peter Ljunglöf

Aug 6, 2010, 4:38:00 AM
to nltk...@googlegroups.com
Okay,

> On 6 Aug 2010, at 02:16, Steven Bird wrote:
>
>> Thanks Peter -- I think this is an excellent suggestion.
>>
>> On 6 August 2010 05:49, Peter Ljunglöf <peter.l...@heatherleaf.se> wrote:
>>> I agree with Peter and Nitin. Here's my suggestion:
>>>
>>> 1) Add the improved EnglishStemmer to nltk.stem.snowball
>>>
>>> 2a) Move the PorterStemmer implementation to nltk.stem.snowball
>>> 2b) Remove nltk.stem.porter (or deprecate it)
>>> 2c) [if Peter has the time] Remove the changes to the original Porter algorithm (marked --DEPARTURE-- and --NEW--), and test it on the sample at http://tartarus.org/~martin/PorterStemmer/

I can do 2a and 2b today (if that doesn't interfere with what the other Peter is doing today).

Also, I found a problem with the current structure of stemmers:

>>> nltk.stem.snowball.SwedishStemmer()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __new__() takes exactly 2 arguments (1 given)
>>> nltk.stem.snowball.SwedishStemmer("russian")
<RussianStemmer>

The problem is that the inherited __init__ method needs a language argument, but we don't want to specify that when we specify the stemmer class. I see two solutions:

1. Define SnowballStemmer as a function returning a language specific stemmer. The downside is that it can't have the attribute 'languages' (or at least it's strange and non-pythonic with functions having attributes)

2. Let the class SnowballStemmer have a 'stemmer' attribute, which holds the language specific stemmer instance.

I think I vote for the second solution. I can do the necessary changes today, together with some optimizations for selecting language, if Peter thinks it's okay... Instead of the following long if-then-else:

    if language == "danish":
        self.stemmer = DanishStemmer()
    elif language == "dutch":
        ...

We can do this:

    stemmerclass = eval(language.capitalize() + "Stemmer")
    self.stemmer = stemmerclass()

A similar thing can be done in the demo function to get rid of the long if-then-else. Peter, are all these changes okay with you?

/Peter
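
The second solution (a wrapper that holds the language-specific stemmer in a `stemmer` attribute) might look roughly like the sketch below. The language classes here are placeholders that only tag the word, and the `_classes` table is an assumption made for illustration; it is not the code in nltk.stem.snowball.

```python
class _TaggingStemmer(object):
    """Placeholder base: tags the word instead of really stemming it."""
    tag = ""

    def stem(self, word):
        return self.tag + ":" + word


class DanishStemmer(_TaggingStemmer):
    tag = "da"


class SwedishStemmer(_TaggingStemmer):
    tag = "sv"


class SnowballStemmer(object):
    languages = ("danish", "swedish")
    _classes = {"danish": DanishStemmer, "swedish": SwedishStemmer}

    def __init__(self, language):
        if language not in self.languages:
            raise ValueError("The language %r is not supported." % (language,))
        # Composition: hold the language-specific instance, don't become it.
        self.stemmer = self._classes[language]()

    def stem(self, word):
        return self.stemmer.stem(word)
```

With this layout the language classes take no constructor argument, so `SwedishStemmer()` works on its own, while `SnowballStemmer("swedish")` keeps the `languages` attribute and simply delegates.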

Peter Ljunglöf

Aug 6, 2010, 5:07:19 AM
to nltk...@googlegroups.com
Oops, there's a small problem here: porter.py is still GPL. I don't know if I'm allowed to move PorterStemmer to snowball.py (which is Apache-licensed like the rest of NLTK). So, I import PorterStemmer into snowball.py, and leave porter.py intact.

/Peter


Steven Bethard

Aug 6, 2010, 5:22:53 AM
to nltk...@googlegroups.com
On Fri, Aug 6, 2010 at 10:38 AM, Peter Ljunglöf
<peter.l...@heatherleaf.se> wrote:
> Instead of the following long if-then-else:
>
>        if language == "danish":
>            self.stemmer = DanishStemmer()
>        elif language == "dutch":
>            ...
>
> We can do this:
>
>        stemmerclass = eval(language.capitalize() + "Stemmer")
>        self.stemmer = stemmerclass()
>
> A similar thing can be done in the demo function to get rid of the long if-then-else. Peter, are all these changes okay with you?

Don't use eval - people could then stick arbitrary code into
"language" and make NLTK execute it. Probably not a huge risk since
users of NLTK are mostly running on their own machines. Still, it
would be better to do something like:

stemmerclass = globals()[language.capitalize() + "Stemmer"]

Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
        --- The Hiphopopotamus

Peter Stahl

Aug 6, 2010, 7:29:01 AM
to nltk-dev
Hi Peter,

> I can do 2a and 2b today (if that doesn't interfere with what the other Peter is doing today).

No, it's not a problem. Just do this if you like. Could you then post
the updated Snowball module in the respective issue of the code
repository, please? (http://code.google.com/p/nltk/issues/detail?id=567)
Afterwards, I will add the English stemmer to it.

> Also, I found a problem with the current structure of stemmers:
> >>> nltk.stem.snowball.SwedishStemmer()
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> TypeError: __new__() takes exactly 2 arguments (1 given)
> >>> nltk.stem.snowball.SwedishStemmer("russian")
> <RussianStemmer>

Is it really so important to fix that problem? I've specified in the
docstrings how to correctly invoke the stemmers. If someone doesn't
want to follow these instructions, then he or she should not be
surprised when exceptions occur. In my opinion, one doesn't need to do
needless work and handle all theoretically possible but practically
unlikely problems and exceptions, if the user sticks to the given
instructions. Don't make things more complicated than they have to
be.

Peter Ljunglöf

Aug 9, 2010, 4:30:26 AM
to nltk...@googlegroups.com
Thanks Peter for your excellent contributions. Now we have all the Snowball stemmers natively in NLTK!

Here's a comparison between the NLTK snowball stemmers and the Python wrapper from the snowball site:

* NLTK stemmers:
             Words  Result  Seconds  Words/s
danish      169600  697200     3.47    48915
dutch       169800  732800     3.10    54725
english     178100  721400     4.27    41746
finnish     134400  661100     3.61    37196
french      193500  682300     8.37    23125
german      152100  725500     2.55    59628
hungarian   135800  636400     5.69    23853
italian     172300  695300     9.03    19074
norwegian   172400  715500     2.73    63162
porter      178100  714300     4.88    36511
portuguese  187400  674300     8.82    21242
romanian    171900  682500    10.66    16124
russian      86000  371600    10.14     8484
spanish     176300  680900     8.53    20662
swedish     169200  714400     2.89    58496

* Python wrapper around C stemmers:
             Words  Result  Seconds  Words/s
danish      169600  705700     0.92   183928
dutch       169800  734000     0.89   189799
english     178100  721800     0.96   184852
finnish     134400  662400     0.79   169507
french      193500  682300     1.04   186000
german      152100  730400     0.86   176687
hungarian   135800  636600     0.84   161817
italian     172300  696400     0.91   189127
norwegian   172400  725500     0.94   183562
porter      178100  711300     0.96   184691
portuguese  187400  674300     0.97   192615
romanian    171900  685400     0.97   177472
russian      86000  371800     0.59   146672
spanish     176300  680900     0.96   184516
swedish     169200  725400     0.92   183339

I think it's not too bad: half of the stemmers are 2.5--5 times slower than optimized C. But there is room for optimizations, perfect for a student project!

I tested on the udhr corpus (Universal Declaration of Human Rights), and the "Result" column is the sum of all stem lengths. As you can see, the results differ between the NLTK and C stemmers (except for French, Portuguese and Spanish). That's something to look into, but not very important for now. Another student project perhaps?

best,
/Peter
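
For anyone who wants to repeat this kind of measurement, the columns above (word count, sum of stem lengths, seconds, words per second) can be computed with a small harness like the following. It is a sketch, not the script used for the figures above; `benchmark` and `toy_stem` are hypothetical names, and the word list is a short stand-in for the corpus.

```python
import time


def benchmark(stem, words):
    """Return (word_count, total_stem_length, seconds, words_per_second)."""
    start = time.perf_counter()
    total_length = sum(len(stem(word)) for word in words)
    seconds = time.perf_counter() - start
    rate = len(words) / seconds if seconds > 0 else float("inf")
    return len(words), total_length, seconds, rate


def toy_stem(word):
    # Placeholder rule; pass a real stemmer's stem method instead.
    return word[:-1] if word.endswith("s") else word


# Stand-in word list; a real run would use the words of a corpus.
sentence = "all human beings are born free and equal in dignity and rights"
words = sentence.split() * 100

count, total, seconds, rate = benchmark(toy_stem, words)
```

The sum of stem lengths is a cheap checksum: two implementations that produce different totals on the same input must disagree on at least one word, although equal totals do not prove that they agree.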



Peter Stahl

Aug 9, 2010, 8:13:39 AM
to nltk-dev
> Thanks Peter for your excellent contributions.

My pleasure! :-) I'm glad that I was able to contribute something
useful. NLTK is really great and I hope it will grow further.
Moreover, it was good practice for my programming skills ahead of
my master's studies, which start in two months.

> I tested on the udhr corpus (Universal Declaration of Human Rights), and the "Result" column is the sum of all stem lengths. As you can see, the results differ between the NLTK and C stemmers (except for French, Portuguese and Spanish). That's something to look into, but not very important for now.

This is strange. It means that in nearly every language many words
were not stemmed correctly. But why? I definitely followed the
descriptions of the various algorithms and tested every stemmer on the
sample vocabularies. Maybe there should have been more words for
testing to increase accuracy. Anyway, feel free to configure and
optimize the stemmers to your liking.

Best regards,
Peter