Lemmatization module in NLTK


Carlos Rodriguez

19.03.2010, 12:53:45
to nltk...@googlegroups.com
I am pushing the idea of compatibility/integration with the TALN
people (a Python-based framework, currently for Italian NLP processing
through SWIG, that includes the DesR parser), and I'll keep everyone
posted.

By the way, I don't think there's a module for lemmatization in
NLTK, maybe because in English this is not as critical as in more
highly inflected languages (for example, for dimensionality reduction). I
coded something quickly that uses an almost million-entry word-lemma
shelved dictionary with associated POS tags, and takes the (word, POS)
tuples returned by NLTK taggers as input. I bet it can be made more
efficient, but it is quite quick on my machine, and understandable.
Do you believe it could be a worthwhile addition to the arsenal? Maybe
other, more sophisticated statistical disambiguation methods could be
incorporated into this.

Carlos Rodríguez

Here's the code

========================

# -*- coding: utf-8 -*-

import codecs, shelve


class LEMMA:
    """
    Class to lemmatize (word, POS) tuples using a shelved dictionary.
    """

    def __init__(self, dict_file, unknown=None):
        """
        Initialize the lemmatizer from a shelved dictionary file.
        If unknown is set, words not found in the dictionary get None
        as their lemma; otherwise the word itself is returned as the
        lemma.
        """
        self.dict = dict_file
        self.unk_flag = unknown
        # open the shelved word -> [(POS, lemma), ...] dictionary
        self.d = shelve.open(self.dict)

    def CreateDictionaryShelved(self, input, output):
        """Create the shelved lemma dictionary from a UTF-8,
        tab-separated lexicon file."""
        lista = [x.strip().split("\t") for x in
                 codecs.open(input, encoding="UTF-8").readlines()]
        d = shelve.open(output, "c")
        n = 0
        for i in lista:
            n += 1
            # keys and values are stored UTF-8 encoded; GetLemma
            # decodes the lemma again on lookup
            entry = i[0].encode("UTF-8")
            tuples = [tuple(y.encode("UTF-8") for y in x.split())
                      for x in i[1:]]
            d[entry] = tuples
        print "processed", n, "entries"
        d.close()

    def GetLemma(self, pair):
        """Assign a lemma to a (word, POS) pair.
        If the word/POS combination is not in the dictionary, return
        the word itself as the lemma (or None if unknown was set)."""
        w, t = pair
        try:
            tuples = self.d[w.lower().encode('UTF-8')]
            for cada in tuples:
                if cada[0] == t:
                    return (w, t, cada[-1].decode('UTF-8'))
            if self.unk_flag:
                return (w, t, None)
            else:
                return (w, t, w)
        except KeyError:
            if self.unk_flag:
                return (w, t, None)
            else:
                return (w, t, w)

    def lemmatize(self, TupleList):
        """Lemmatize a list of (word, POS) tuples."""
        return [self.GetLemma(each) for each in TupleList]
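
For reference, a minimal usage sketch (file names, tags and the expected output are made up for illustration):

# Hypothetical usage: assumes a shelve file "lemmas.db" that was built
# beforehand (e.g. via the CreateDictionaryShelved method from a
# tab-separated lexicon).
lemmatizer = LEMMA("lemmas.db")
tagged = [("Perros", "NC"), ("ladran", "VM")]
print lemmatizer.lemmatize(tagged)
# e.g. [('Perros', 'NC', u'perro'), ('ladran', 'VM', u'ladrar')]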

On Fri, Mar 19, 2010 at 2:13 AM, Steven Bird <steve...@gmail.com> wrote:
> Unfortunately the NLTK application for GSoC funding was unsuccessful.
> We'll need to find some other way to encourage all these good project
> ideas forward.  NLTK really needs more support for stat NLP.
>
> -Steven

Peter Ljunglöf

19.03.2010, 16:45:12
to nltk...@googlegroups.com
Hi Carlos,

On 19 Mar 2010, at 17:53, Carlos Rodriguez wrote:

> I am pushing the idea of compatibility/integration with the TALN
> people (a Python-based framework, currently for Italian NLP processing
> through SWIG, that includes the DesR parser), and I'll keep everyone
> posted.
>
> By the way, I don't think there's a module for lemmatization in
> NLTK, maybe because in English this is not as critical as in more
> highly inflected languages (for example, for dimensionality reduction). I
> coded something quickly that uses an almost million-entry word-lemma
> shelved dictionary with associated POS tags, and takes the (word, POS)
> tuples returned by NLTK taggers as input. I bet it can be made more
> efficient, but it is quite quick on my machine, and understandable.
> Do you believe it could be a worthwhile addition to the arsenal? Maybe
> other, more sophisticated statistical disambiguation methods could be
> incorporated into this.

This is cool - the last few days I have been working on a Swedish lemmatizer, based on the SALDO lexicon. My idea is exactly like yours - using the POS tag to filter out possible lemmas. I will certainly look into your code and use some of it. I didn't know about the shelve module; it looks very useful.
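
(For anyone else who hasn't used it: shelve is in the standard library and behaves like a dict that is persisted to disk, which is what makes a near million-entry lemma table practical. A quick sketch, with a made-up file name and entry:)

import shelve

d = shelve.open("lemmas.db", "c")              # created on first use
d["perro"] = [("NC", "perro"), ("NCmp", "perro")]
d.close()

d = shelve.open("lemmas.db")                   # reopened later, entries are still there
print d["perro"]
d.close()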

If we add something to NLTK, we should think about the API. It's similar to the TaggerI interface, which has the methods tag and batch_tag. E.g., tag takes a list (of words) and returns a list (of word-tag pairs). A lemmatizer takes a list (of word-tag pairs) and returns a list (of word-tag-lemma tuples).

In principle we could reuse the tagger API, but perhaps the name would be misleading. Is there any more general term covering both tagging and lemmatization (and other similar procedures)?

best, Peter

PS. The SALDO lexicon is GPL and can be found at http://spraakbanken.gu.se/saldo/ (only in Swedish). We're planning to add it to NLTK when we have the time.

Steven Bird

19.03.2010, 19:38:11
to nltk-dev
On 19 March 2010 12:53, Carlos Rodriguez <crodr...@gmail.com> wrote:
> I am pushing the idea of compatibilization/integration with the TALN
> people (a python-based framework currently for italian NLP processing
> through Swig, and that includes the DesR parser), and I'll keep all
> posted.


Sounds good, thanks.

> By the way, I don't think there's a module for lemmatization in
> NLTK, [...]

But note that there are stemmers, and the interface must be very similar.

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem-module.html

There's also a lemmatizer based on WordNet:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.wordnet.WordNetLemmatizer-class.html

Can we build out from this?
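
For reference, a minimal sketch of how those existing pieces are called (the WordNet data must be installed; the outputs in the comments are indicative):

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print stemmer.stem("running")               # -> 'run'
print lemmatizer.lemmatize("geese")         # -> 'goose' (default POS is noun)
print lemmatizer.lemmatize("running", "v")  # -> 'run' (takes a WordNet POS, not a Penn tag)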

-Steven Bird

Carlos Rodriguez

20.03.2010, 03:48:04
to nltk...@googlegroups.com
I'll take a look at it. BTW, is there an interface for EuroWordNets?
It would also be a good idea/project...

Carlos

Steven Bird

20.03.2010, 07:25:15
to nltk-dev
On 20 March 2010 03:48, Carlos Rodriguez <crodr...@gmail.com> wrote:
> BTW, is there an interface for EuroWordNets?
> It would also be a good idea/project...

Yes, that would be good to have...

-Steven

JAGANADH G

20.03.2010, 05:51:27
to nltk...@googlegroups.com
Carlos


On Sat, Mar 20, 2010 at 1:18 PM, Carlos Rodriguez <crodr...@gmail.com> wrote:
I'll take a look at it. BTW, is there an interface for EuroWordNets?
It would also be a good idea/project...


A Python interface for Euro WordNet is available at http://ilk.uvt.nl/~marsi/software/ewnpy.html


--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Peter Ljunglöf

21.03.2010, 16:18:42
to nltk...@googlegroups.com
Hi,

There's one problem with StemmerI and the WordNetLemmatizer: they both work on single words only. I can imagine several lemmatization (and stemming) algorithms that take context into account, just as POS tagging does. So my take on a Lemmatizer interface would be something like this:

class LemmatizerI(object):
    def lemmatize_word(self, word, pos=None):
        # default implementation:
        return self.lemmatize_sentence([(word, pos)])[0]

    def lemmatize_sentence(self, words):
        # default implementation:
        return [self.lemmatize_word(word, pos) for (word, pos) in words]

(And similarly for StemmerI.)

But what I really would like is a generic interface for all three: tagging, lemmatization and stemming. They are very similar, and one could see them all as instances of a general kind of tagging -- POS-tagging, lemma-tagging and stem-tagging. In fact, I think that TaggerI is a fine interface as it is, if we also add a tag_token method. The methods stem (from StemmerI) and lemmatize (from WordNetLemmatizer) can be kept as (deprecated) methods.

Here's a suggestion, with slightly modified (generalized) comments:

class TaggerI(object):
    """
    A processing interface for assigning a tag to each token in a list.
    Tags are case sensitive strings that identify some property of each
    token, such as its part of speech, its sense, its lemma, its word stem,
    or its compound analysis.

    Some taggers require specific types for their tokens. This is
    generally indicated by the use of a sub-interface to C{TaggerI}.
    For example, I{featureset taggers}, which are subclassed from
    L{FeaturesetTaggerI}, require that each token be a I{featureset}.

    Subclasses must define:
      - at least one of L{tag()}, L{tag_token()} or L{batch_tag()}
    """

    def tag(self, tokens):
        """
        Determine the most appropriate tag sequence for the given
        token sequence, and return a corresponding list of tagged
        tokens. A tagged token is encoded as a tuple C{(token, tag)}.

        @rtype: C{list} of C{(token, tag)}
        """
        if overridden(self.batch_tag):
            return self.batch_tag([tokens])[0]
        elif overridden(self.tag_token):
            return [self.tag_token(token) for token in tokens]
        else:
            raise NotImplementedError()

    def tag_token(self, token):
        """
        Determine the most appropriate tag for the given token.

        @rtype: C{(token, tag)}
        """
        return self.tag([token])[0]

    def batch_tag(self, sentences):
        """
        Apply L{self.tag()} to each element of C{sentences}. I.e.:

            >>> return [self.tag(sent) for sent in sentences]

        @rtype: C{list} of C{list} of C{(token, tag)}
        """
        return [self.tag(sent) for sent in sentences]
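
To make the lemma-tagging-as-tagging idea concrete, here is a minimal sketch of a dictionary-backed lemma tagger against this interface (the class name, toy lexicon and the fall-back to the word itself are illustrative, not part of the proposal):

class DictionaryLemmaTagger(TaggerI):
    """Toy lemma tagger: tokens are (word, pos) pairs and the assigned
    'tag' is the lemma, so tag() returns ((word, pos), lemma) tuples."""

    def __init__(self, lexicon):
        self.lexicon = lexicon   # maps (word, pos) -> lemma

    def tag(self, tokens):
        # fall back to the word itself when the pair is unknown
        return [(tok, self.lexicon.get(tok, tok[0])) for tok in tokens]

lemmas = {("dogs", "NNS"): "dog", ("barked", "VBD"): "bark"}
tagger = DictionaryLemmaTagger(lemmas)
print tagger.tag([("dogs", "NNS"), ("barked", "VBD")])
# -> [(('dogs', 'NNS'), 'dog'), (('barked', 'VBD'), 'bark')]
print tagger.batch_tag([[("dogs", "NNS")]])   # the inherited default works too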
