At the risk of derailing, it would be nice to turn that into a caching /
memoized stemmer :)
GL
> I know that there is already an implementation of the Porter stemmer
> included in NLTK. Thus I want to focus on the other languages apart
> from English.
There is also a Python wrapper for Snowball. So, the easy way would be to adapt that wrapper for NLTK. The harder way would be to translate the C code to Python.
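As a sketch of the "easy way": the Snowball Python bindings (the `Stemmer` module, a.k.a. PyStemmer) could be hidden behind an NLTK-style `stem()` method. The adapter below is only an illustration; the import is guarded because PyStemmer may not be installed, and `_FallbackStemmer` is a toy stand-in, not a real algorithm.

```python
# Adapter sketch: wrap PyStemmer's stemWord() in an NLTK-style stem().
try:
    import Stemmer as pystemmer  # the Snowball bindings from the website
except ImportError:
    pystemmer = None

class _FallbackStemmer:
    """Toy stand-in used only when PyStemmer is not installed."""
    def stemWord(self, word):
        return word.rstrip("s")  # placeholder rule, not a real algorithm

class SnowballStemmerAdapter:
    """Expose PyStemmer's stemWord() through an NLTK-style stem() method."""
    def __init__(self, language="english"):
        if pystemmer is not None:
            self._impl = pystemmer.Stemmer(language)
        else:
            self._impl = _FallbackStemmer()

    def stem(self, word):
        return self._impl.stemWord(word)

print(SnowballStemmerAdapter().stem("cats"))  # -> cat
```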
/Peter
I know of the wrapper. Both ways you mention are possible, but I prefer writing the code from scratch. The time I would need to understand the C code (I hate C, by the way) is, for me personally, better spent starting with Python directly. Anyway, thanks for the hint. :-)
Great -- do you want to contribute this to NLTK? If so, please submit
it as an attachment to a new issue in our issue tracker. Also, please
take a look at our coding guidelines on this page:
http://code.google.com/p/nltk/wiki/DevelopersGuide
> By the way, I've also
> implemented a caching feature, so that a previously stemmed word can
> be accessed at any time without having to repeat the stemming process
> again and again.
NB memoization decorators:
http://wiki.python.org/moin/PythonDecoratorLibrary
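For instance, a minimal memoized stemmer needs nothing beyond the standard library; the suffix-stripper below is a toy stand-in for a real stemmer, and `lru_cache` provides the caching for free:

```python
from functools import lru_cache

# The toy suffix-stripper is only a placeholder for a real stemmer;
# lru_cache memoizes it, so repeated words skip the stemming work.
@lru_cache(maxsize=None)
def stem(word):
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("eating"))          # -> eat (computed)
print(stem("eating"))          # -> eat (served from the cache)
print(stem.cache_info().hits)  # -> 1
```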
-Steven
1. The StemmerI class has the single method .stem() which stems *a single word*, not a list of words (or a space-separated string). So Nitin calls the PorterStemmer incorrectly:
>>> from nltk.stem import PorterStemmer
>>> s = PorterStemmer()
>>> s.stem('I am eating ice cream')
'I am eating ice cream'
Instead you should do this:
>>> [s.stem(w) for w in 'I am eating ice cream'.split()]
['I', 'am', 'eat', 'ice', 'cream']
2. According to the docstring, nltk.stem.porter is a port (maintained by Vivake Gupta, not Steven Bird) of Martin Porter's original algorithm from 1980:
Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
However, the English stemmer in the current Snowball project is much better. In fact, the old stemmer is included in Snowball, under the name "porter". If you install the Snowball python wrapper (available from the Snowball website), you get this:
>>> import Stemmer
>>> help(Stemmer.algorithms)
algorithms(...)
Get a list of the names of the available stemming algorithms.
[...]
Note that the classic Porter stemming algorithm for English is
available by default: although this has been superseded by an improved
algorithm, the original algorithm may be of interest to information
retrieval researchers wishing to reproduce results of earlier
experiments. Most users will want to use the "english" algorithm,
instead of the "porter" algorithm.
Note that all three implementations differ, with nltk.stem.porter being in the middle:
>>> Stemmer.Stemmer('porter').stemWord('generously')
'gener'
>>> Stemmer.Stemmer('english').stemWord('generously')
'generous'
>>> nltk.PorterStemmer().stem('generously')
'gener'
>>> Stemmer.Stemmer('porter').stemWord('succeed')
'succe'
>>> Stemmer.Stemmer('english').stemWord('succeed')
'succeed'
>>> nltk.PorterStemmer().stem('succeed')
'succeed'
3. So, it would be nice to also include the latest English Snowball stemmer in nltk.stem.snowball; but of course, someone has to do it...
Also, as a side note: since Snowball is actively maintained, it would be good if the docstring of nltk.stem.snowball said which Snowball version it was ported from.
best, Peter
However, despite the typo, I have definitely seen examples where the nltk version is inferior to the snowball version, and I had switched to using the Snowball version by default for my work. When I saw that Peter (Stahl) had done the hard work of bringing Snowball into NLTK, I was excited, only to find that English was not included.
I agree with your comments regarding maintaining a record of the Snowball algorithm version. I also think we should point PorterStemmer in NLTK to the 'porter' algorithm under Snowball for backwards compatibility.
- Nitin
1) Add the improved EnglishStemmer to nltk.stem.snowball
2a) Move the PorterStemmer implementation to nltk.stem.snowball
2b) Remove nltk.stem.porter (or deprecate it)
2c) [if Peter has the time] Remove the changes to the original Porter algorithm (marked --DEPARTURE-- and --NEW--), and test it on the sample at http://tartarus.org/~martin/PorterStemmer/
I don't think we should remove the old implementation, because of this quote from the above site:
"The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. (...) The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable."
/Peter
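A middle ground for 2b would be to keep nltk.stem.porter importable but have it warn and delegate. A sketch only (the warning message is an assumption, and the base class uses placeholder logic, not the real Porter algorithm):

```python
import warnings

class _SnowballPorterStemmer:
    """Stand-in for the Snowball 'porter' stemmer (placeholder logic)."""
    def stem(self, word):
        return word[:-2] if word.endswith("ed") else word

class PorterStemmer(_SnowballPorterStemmer):
    """Deprecated alias kept so old code keeps running unchanged."""
    def __init__(self):
        warnings.warn(
            "nltk.stem.porter is deprecated; see nltk.stem.snowball",
            DeprecationWarning, stacklevel=2)
        super().__init__()

# Instantiating still works, but a DeprecationWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    stemmer = PorterStemmer()

print(caught[0].category.__name__)  # -> DeprecationWarning
print(stemmer.stem("jumped"))       # -> jump
```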
On 5 Aug 2010, at 19:15, Peter Stahl wrote:
> Hey guys,
>
> as far as I'm concerned, I can add the English Snowball stemmer (named
> Porter2 on the website) to my module. I should have time for it this
> weekend. I just thought it would be kind of repetitive to have two
> English stemmers in NLTK. Perhaps users get confused and don't know
> which stemmer to choose. But since the Porter2 algorithm seems to be
> better, wouldn't it make sense to remove the older implementation
> from NLTK? Anyway, I'm going to deal with it this weekend if you don't
> mind.
Of course we don't mind. It's also perfectly okay if you have more important things to do. After all, you don't get paid for doing this. :)
/Peter
PS. In the example text, perhaps you meant "funnier" and not "funner"?
> On 6 Aug 2010, at 02:16, Steven Bird wrote:
>
>> Thanks Peter -- I think this is an excellent suggestion.
>>
>> On 6 August 2010 05:49, Peter Ljunglöf <peter.l...@heatherleaf.se> wrote:
>>> I agree with Peter and Nitin. Here's my suggestion:
>>>
>>> 1) Add the improved EnglishStemmer to nltk.stem.snowball
>>>
>>> 2a) Move the PorterStemmer implementation to nltk.stem.snowball
>>> 2b) Remove nltk.stem.porter (or deprecate it)
>>> 2c) [if Peter has the time] Remove the changes to the original Porter algorithm (marked --DEPARTURE-- and --NEW--), and test it on the sample at http://tartarus.org/~martin/PorterStemmer/
I can do 2a and 2b today (if that doesn't interfere with what the other Peter is doing today).
Also, I found a problem with the current structure of stemmers:
>>> nltk.stem.snowball.SwedishStemmer()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __new__() takes exactly 2 arguments (1 given)
>>> nltk.stem.snowball.SwedishStemmer("russian")
<RussianStemmer>
The problem is that the inherited __init__ method needs a language argument, but we don't want to specify that when we specify the stemmer class. I see two solutions:
1. Define SnowballStemmer as a function returning a language-specific stemmer. The downside is that it can't have the attribute 'languages' (or at least it would be strange and non-Pythonic for a function to carry attributes).
2. Let the class SnowballStemmer have a 'stemmer' attribute, which holds the language-specific stemmer instance.
I think I vote for the second solution. I can do the necessary changes today, together with some optimizations for selecting language, if Peter thinks it's okay... Instead of the following long if-then-else:
if language == "danish":
self.stemmer = DanishStemmer()
elif language == "dutch":
...
We can do this:
stemmerclass = eval(language.capitalize() + "Stemmer")
self.stemmer = stemmerclass()
A similar thing can be done in the demo function to get rid of the long if-then-else. Peter, are all these changes okay with you?
/Peter
On 6 Aug 2010, at 02:16, Steven Bird wrote:
Don't use eval - people could then stick arbitrary code into
"language" and make NLTK execute it. Probably not a huge risk since
users of NLTK are mostly running on their own machines. Still, it
would be better to do something like:
stemmerclass = globals()[language.capitalize() + "Stemmer"]
Steve
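Putting the two pieces together, the delegation (solution 2) plus the globals() lookup could look like this. The class names follow the convention in the thread, but the suffix rules are toy placeholders, not the real Snowball algorithms:

```python
class DanishStemmer:
    def stem(self, word):
        return word.rstrip("e")  # placeholder rule, not real Snowball

class SwedishStemmer:
    def stem(self, word):
        return word.rstrip("a")  # placeholder rule, not real Snowball

class SnowballStemmer:
    languages = ("danish", "swedish")

    def __init__(self, language):
        if language not in self.languages:
            raise ValueError("unknown language: %r" % language)
        # globals() lookup instead of eval() or a long if/elif chain
        self.stemmer = globals()[language.capitalize() + "Stemmer"]()

    def stem(self, word):
        return self.stemmer.stem(word)

print(SnowballStemmer("swedish").stem("flicka"))  # -> flick
```

Unlike eval(), the globals() lookup can only name objects that already exist in the module, so arbitrary code in the language string is never executed.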
Here's a comparison between the NLTK snowball stemmers and the Python wrapper from the snowball site:
* NLTK stemmers:
Language     Words    Result  Seconds  Words/s
danish 169600 697200 3.47 48915
dutch 169800 732800 3.10 54725
english 178100 721400 4.27 41746
finnish 134400 661100 3.61 37196
french 193500 682300 8.37 23125
german 152100 725500 2.55 59628
hungarian 135800 636400 5.69 23853
italian 172300 695300 9.03 19074
norwegian 172400 715500 2.73 63162
porter 178100 714300 4.88 36511
portuguese 187400 674300 8.82 21242
romanian 171900 682500 10.66 16124
russian 86000 371600 10.14 8484
spanish 176300 680900 8.53 20662
swedish 169200 714400 2.89 58496
* Python wrapper around C stemmers:
Language     Words    Result  Seconds  Words/s
danish 169600 705700 0.92 183928
dutch 169800 734000 0.89 189799
english 178100 721800 0.96 184852
finnish 134400 662400 0.79 169507
french 193500 682300 1.04 186000
german 152100 730400 0.86 176687
hungarian 135800 636600 0.84 161817
italian 172300 696400 0.91 189127
norwegian 172400 725500 0.94 183562
porter 178100 711300 0.96 184691
portuguese 187400 674300 0.97 192615
romanian 171900 685400 0.97 177472
russian 86000 371800 0.59 146672
spanish 176300 680900 0.96 184516
swedish 169200 725400 0.92 183339
I think it's not too bad: half of the stemmers are 2.5--5 times slower than optimized C. But there is room for optimizations, perfect for a student project!
I tested on the udhr corpus (Universal Declaration of Human Rights); the "Result" column is the sum of all stem lengths. As you can see, the results differ between the NLTK and C stemmers (except for French, Portuguese, and Spanish). That's something to look into, but not very important for now. Another student project, perhaps?
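For reference, the harness behind such a table can be quite small. This is a reconstruction of the idea, not the original script; the lambda at the bottom is a toy stand-in for a real stemmer:

```python
import time

def benchmark(stem, words):
    """Return (word count, sum of stem lengths, seconds, words/sec),
    i.e. the columns of the tables above."""
    start = time.perf_counter()
    result = sum(len(stem(w)) for w in words)
    seconds = time.perf_counter() - start
    rate = len(words) / seconds if seconds else float("inf")
    return len(words), result, seconds, rate

# Toy run: 5000 words through a trivial "stemmer".
words = ["alle", "mennesker", "er", "født", "frie"] * 1000
n, result, seconds, rate = benchmark(lambda w: w.rstrip("e"), words)
print(n, result)  # -> 5000 21000
```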
best,
/Peter