Tool for generating inflected forms of words


Sandra Derbring

Feb 3, 2010, 9:48:44 AM
to nltk-...@googlegroups.com
Hi,

Basically, this is the same problem discussed here a while ago, concerning how to get inflected forms out of WordNet. In my case, I don't need it to be WordNet's tools, I am just wondering if there is any other tool in NLTK that can take a lemma and generate the most common inflected forms - either by an algorithm or by comparing with some list that stores this information?

Best regards,
Sandra

Pedro Marcal

Feb 3, 2010, 10:58:21 AM
to nltk-...@googlegroups.com
Hi Sandra,
In practice I found it more useful to go from inflected words to the base words; the attached algorithms were designed to go both ways and can be modified to do so. The algorithms accept the first modified word in a dictionary. They can work off your own dictionary or, as in this case, off a list of words I culled from the Penn + Brown + CoNLL corpora (roughly 3 million words of raw data). If you are interested in my list, I can put a pickled file on my website for you, but that would only work for Python on Windows (it is about 9 MB). If you do not work in Windows, I would have to generate a text file from my pickled file and put it on my website along with a short piece of code to convert it back to a pickled file.
Regards,
Pedro


inflectedWords.py
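
A minimal sketch of the list-based approach described above (this is not the attached inflectedWords.py, whose contents are not reproduced here; the corpus slice, the POS mapping and the names are illustrative): walk a tagged corpus, map each token back to its lemma with WordNet's morphy(), and record the surface forms observed for each lemma.

from collections import defaultdict
from nltk.corpus import brown, wordnet as wn   # requires the 'brown' and 'wordnet' data packages

# Map Brown tag prefixes onto WordNet POS constants; anything else is skipped.
POS_MAP = {'N': wn.NOUN, 'V': wn.VERB, 'J': wn.ADJ, 'R': wn.ADV}

forms = defaultdict(set)
for word, tag in brown.tagged_words()[:200000]:    # a slice keeps the example quick
    pos = POS_MAP.get(tag[:1])
    if pos is None:
        continue
    lemma = wn.morphy(word.lower(), pos)           # inflected form -> lemma
    if lemma:
        forms[lemma].add(word.lower())             # remember the surface form

print(sorted(forms['run']))    # typically something like ['ran', 'run', 'running', 'runs']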

Samir Bilal

Feb 3, 2010, 6:20:36 PM
to nltk-...@googlegroups.com
Hi All,

Is there a way to distinguish the determiner from the relative pronoun ("which", for example) in part-of-speech tagging? I used this code:

import nltk
nltk.pos_tag(nltk.word_tokenize("which city has a cinema which is near a train station") )

But "which" is always tagged as a determiner.

Thanks
Samir


Charles Hartman

Feb 4, 2010, 11:13:49 AM
to nltk-users
It may be relevant that when I tried the high-powered parsers available from the Stanford Natural Language Processing Group a couple of years ago, just about the only thing that could break them was "that." Misconstruing a relative as a demonstrative could throw the trees out of closure. How we make that distinction so apparently effortlessly must be an interesting question.

Daniel Ridings

Feb 4, 2010, 1:23:50 PM
to nltk-...@googlegroups.com
I think this particular tagger in NLTK tags both "which" and "that" as
WDT. I checked the rules of a Brill tagger, and it doesn't really
try to make the distinction either. The present tagger in NLTK simplifies
the task by calling all relative pronouns WDT.

In a practical NLP application, when would it really matter?

Daniel

Richard Careaga

Feb 4, 2010, 1:29:50 PM
to nltk-...@googlegroups.com
My non-professional feeling is that it probably doesn't matter in a practical application against most corpora. The pristine distinction between "that", used with restrictive (defining) relative clauses, and "which", used with non-restrictive ones, is all but gone in contemporary common usage.

Pedro Marcal

Feb 4, 2010, 2:34:53 PM
to nltk-...@googlegroups.com
I am trying to parse
"That police officer saluted me."
as part of an English to Chinese translation.
I am using a Church-based statistical parser with some Design of Experiments improvements, which gives me about 99.5% accuracy.
The parser gave me
that(PRO) police(V) officer(N)
which is wrong. In cases such as these, I provide an override that tests for
that(PRO) police(V) officer(N)
and substitutes
that(PRO) police(N) officer(N)
I agree that it is ugly, but if I had to contend with that(DET) as well, it would require more overrides. The DET tag had already been eliminated by the parser.
I guess my point is that no matter how good you think your parser is, you always have to provide a back-up in case of failure.
Regards,
Pedro
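
A minimal sketch of the kind of post-hoc override described above (the tag names, the rule table and the function are illustrative only, not the parser's actual internals): scan the tagged sequence for a known-bad tag pattern and rewrite it in place.

def apply_overrides(tagged, overrides):
    """Rewrite known-bad tag patterns in a [(word, tag), ...] sequence.
    overrides maps a tuple of bad tags to the corrected tuple of tags."""
    tags = [t for _, t in tagged]
    for i in range(len(tagged)):
        for bad, good in overrides.items():
            if tuple(tags[i:i + len(bad)]) == bad:
                for j, new_tag in enumerate(good):
                    tagged[i + j] = (tagged[i + j][0], new_tag)
                    tags[i + j] = new_tag
    return tagged

# Pedro's example: that(PRO) police(V) officer(N) -> that(PRO) police(N) officer(N)
sentence = [('That', 'PRO'), ('police', 'V'), ('officer', 'N'),
            ('saluted', 'V'), ('me', 'PRO'), ('.', '.')]
print(apply_overrides(sentence, {('PRO', 'V', 'N'): ('PRO', 'N', 'N')}))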

Daniel Ridings

Feb 4, 2010, 2:56:20 PM
to nltk-...@googlegroups.com
Sorry Pedro, I will never believe your parser is 99.5% correct.

If you ask the whole list for a correct parse of the sentence you
give, you are not going to get 99.5% agreement.

What if I say that (PRO) is wrong, and that it is a DET?

That police officer saluted me
The police officer saluted me

same construction. "The" is not a PRO.

Your parse is only 80% correct :-)

Daniel

Stuart Robinson

Feb 4, 2010, 4:05:02 PM
to nltk-...@googlegroups.com
The issue is what kind of coverage is provided by a parser that is 99.5%
correct. If you're only handling a very limited domain, that's believable.
But if you're handling anything reasonably comprehensive (even just one
genre of articles from one newspaper), it becomes a lot less believable.

+---------------------------------------
Stuart Robinson
Email: stuart at zapata dot org
Homepage: www.zapata.org/stuart
Twitter: twitter.com/stuartrobinson

Samir Bilal

Feb 4, 2010, 6:50:15 PM
to nltk-...@googlegroups.com

Hi All,

Is there a temporal tagger in NLTK? For example, a tagger that produces "2009" for "last year", so that the output can be used in a formal query.

Thanks
Samir.

Pedro Marcal

Feb 4, 2010, 9:29:24 PM
to nltk-...@googlegroups.com
Hi Daniel,
That was the result I got in testing on the CoNLL corpora. The result is not that great; it means 1 tag wrong in every 200 words. I note that with the Design of Experiments procedure I have to do an order of magnitude more computing than the standard Church approach, but the efficiency of the method is such that it is the equivalent of at least two orders of magnitude over a simple combinatorial search (2**n, where n is the number of words in the sentence).
I interpreted the DET tag as an ADJ, as given by Samir's example.

you wrote:-
"That police officer saluted me
The police officer saluted me

same construction. "The" is not a PRO.

Your parse is only 80% correct :-)"

This is true for grammar-rule parsing, but in statistical parsing the actual words used have a large influence on the result. In fact the parsing is statistically dependent on the words.
My parser was trained on Brown + selected Penn corpora + misc., a total of 3 million words. As far as I can tell, the coverage is pretty good.
In any case I must protest at the harshness of the grading. Even had I got the second sentence wrong, which I did not, that would have lowered my score by only 1/14,000 (the number of words in the CoNLL corpora).
In Chinese the article 'the' gets thrown away, and after another Design of Experiments operation on the possible Chinese words for each POS, I get the following.
*** DOE solution for meaning (max prob) *** 
 police 民警 
 officer 司令 
 salute 立 
 le 了 
 me 余 
 . 。 
As far as I can tell it's a good translation, though I cannot read Chinese; it follows the grammar rules of my book Introduction to Mandarin and a check with my Chinese word processor. The 了 (le) indicates a tensed verb.

Pedro Marcal

Feb 4, 2010, 10:18:12 PM
to nltk-...@googlegroups.com
Hi Stuart,
I think we have had this discussion before.
The parser is quite general and is defined by the corpora used to train it (or rather, to serve as its basis). The whole point of a statistical parser is that it gets better the more tagged corpora you throw into it. The problem is getting hold of more tagged corpora. My corpus is a mixture of Brown + Penn + CoNLL (which, after testing, gave the 99.5%) + my own additions, circa 3 million words. I would like to get to 5 million words, but I watched Francis and Kučera struggle to build their corpus at Brown with limited computing but lots of student power, and I do not wish to get bogged down in the same way.
I wish someone would invent a machine-learning process to develop more tagged corpora.

The same parsing process works for other languages. I was fortunate enough to get hold of a tagged Chinese corpus from the University of Lancaster, and also a Japanese WordNet, so I have a Chinese-to-English and a Japanese-to-English translator. The code is essentially the same except for the transformation of the source grammar to English grammar.
You may look at my statistical process as a middle ground relative to the Google process (essentially translation memory), which uses an untagged corpus of 50 million or more words. In Japanese, Google scores about 10%, and in Chinese translation Google gets about 35% because of the closeness of Chinese grammar to English grammar. Please see www.lifecyclevnv.com for more details. Incidentally, anyone in the NLTK group is welcome to a free copy of the translators. With some tweaking and knowledge of my code I can get close to about 95% of the meaning in English.
The whole purpose of this exercise is to learn how to get an unambiguous parse of the English sentence, since the CJK (Chinese, Japanese, Korean) written words are in principle unique.
Regards,
Pedro

Daniel Ridings

Feb 5, 2010, 12:38:15 AM
to nltk-...@googlegroups.com
On Fri, Feb 5, 2010 at 3:29 AM, Pedro Marcal <pedrov...@gmail.com> wrote:

> This is true for grammar rule parsing, but in statistical parsing, the
> actual words used have a large influence on the result. In fact the parsing
> is statistically dependent on the words.

In fact, language and its constructions are dependent on the actual words used.

> My parser was trained by Brown+selected Penn corpora + misc. A total of 3
> million words. As far as I can tell, the coverage is pretty good.
> In any case I must protest at the harshness of the grading. Even had I got
> the second sentence wrong which I did not. That would have lowered my score
> by 1/14000 (the number of words in the conll corpora).

What Stuart and I are saying is that your parser does not get 99.5% of
the English language correct. I think what you are saying is that you
can re-tag the corpus you trained on, and produce the same result
that the training corpus had.

In the real world, once you are no longer looking at training corpus,
you will never get a 99.5% correct result. My point is that you will
not be able to get 99.5% of the population of linguists to agree on
what is correct, so you will never be able to validate your 99.5%.

The only way you could do that is to have the entire corpus of English,
or of any other language (which unfortunately is infinite), tagged and
agreed upon, and then parse an untagged version of the same corpus and
produce the same tags as the original. Unfortunately such a parser would
be pretty pointless, since it would not be able to tell us anything we
didn't already know. If the entire corpus of English is already tagged,
we do not need to re-tag it.

You can train your parser on 50 million words, and it will not get a
99.5% correct tagging of the first page of next week's New York Times.

Your way of calculating correctness is not one I buy. I want 99.5%
correct on unseen text. So if you mean you can feed new text into your
parser, new text the same size as your training corpus, and that the
results of the new parse are 99.5% correct, then I am impressed,
but I don't believe it.

Daniel Ridings

Feb 5, 2010, 12:46:05 AM
to nltk-...@googlegroups.com
I can answer this one easily, without looking.

No.

But the work you are doing sounds interesting, so it will be nice to follow it, and I hope some of the results make it into the toolkit.

Daniel


Steven Bird

Feb 5, 2010, 2:09:01 AM
to nltk-users
But note that there is a temporal expression tagger:
http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py

This identifies temporal expressions, a step in the direction of what you want to do (resolve temporal deixis).

-Steven
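
The contrib module above only identifies (marks up) temporal expressions; turning "last year" into an actual value, as Samir wants, still needs a grounding step against a reference date. A tiny hand-rolled sketch of that step, independent of timex.py and assuming a fixed reference date (the function name and the pattern inventory are illustrative):

from datetime import date, timedelta

def ground(expression, ref=date(2010, 2, 5)):
    """Resolve a few relative temporal expressions against a reference date.
    A toy only: real grounding needs a much larger inventory of patterns."""
    expression = expression.lower().strip()
    if expression == "last year":
        return str(ref.year - 1)
    if expression == "this year":
        return str(ref.year)
    if expression == "next year":
        return str(ref.year + 1)
    if expression == "yesterday":
        return (ref - timedelta(days=1)).isoformat()
    if expression == "today":
        return ref.isoformat()
    return None   # unresolved

print(ground("last year"))    # '2009' relative to the 2010 reference date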

Daniel Ridings

Feb 5, 2010, 2:47:19 AM
to nltk-...@googlegroups.com
Some good stuff in the NLTK, no doubt about it.

I'm working in an applied-linguistics environment, commercially, and I've found myself in workplaces that have more and more decided to base their resource and development platform on NLTK. Most of us come from academic life and knew of it before, but we've been pleasantly surprised at the maturity the toolkit has now.

Hope to be able to contribute back to it. I'm working mostly in the Scandinavian languages.

Daniel

Pedro Marcal

Feb 5, 2010, 4:57:07 AM
to nltk-...@googlegroups.com
Hi Daniel,
Thank you for your email.
If the estimate of accuracy offends you, forget about it. It's more important for you to understand the nature of the statistical parser combined with the Design of Experiments methodology that has evolved. I take little credit for it because I learnt it all from the NLTK book. I can report that I have been using the method to do context-free parsing followed by semantic parsing (in another language) for six months; the text is acquired at random. In my communication with Stuart, I realized that "training" is the wrong word to use for a statistical parser. Rather, we extract the essence of each sentence, combining words and tagging. This basis is carried along and applied selectively at all times, depending on the words in the sentence.

We need not have infinite coverage. I think 3 million words is pretty good, and 5 million would be more than enough. The Japanese have shown that their Tōyō kanji, about 1,780 characters, are sufficient for all their written text; I understand that their newspapers and other periodicals are confined to these characters by law. My experience of building a Chinese-English dictionary from a tagged corpus was that, after a million words were used to tag the (untagged) CEDICT dictionary, it would take another hundred thousand words to improve coverage of the remaining missing words by ten per cent, using Google's word-translation capability. Finally I ended up with a coverage of about 99.4%.
To paraphrase Galileo, "Nevertheless it works."
Regards,
Pedro 

Daniel Ridings

Feb 5, 2010, 5:07:46 AM
to nltk-...@googlegroups.com
Thanks Pedro,

I'll dig deeper into your combination of methodologies. It is appealing
since I work with so many different languages.

Hope to hear more about your work,
Daniel

Justin Olmanson

Feb 5, 2010, 1:02:43 PM
to nltk-...@googlegroups.com
Hi,

I am building a modest grammar based on grammar2, found on p. 301 of the 2009 NLTK book.

grammar2 = nltk.parse_cfg("""
            S -> NP VP
            NP -> Det Nom | PropN
            Nom -> Adj Nom | N
            VP -> V Adj | V NP | V S | V NP PP | V Adv | Adv V | M V | V
            PP -> P NP
            PropN -> 'NNP' | 'NNPS' | 'WP' | 'WP$'
            Det -> 'DT' | 'PDT' |'WDT'
            N -> 'NN' | 'NNS'
            Adj -> 'JJ' | 'JJR' | 'JJS'
            V -> 'VB' | 'VBD' | 'VBG' | 'VBN' | 'VBP' | 'VBZ'
            M -> 'MD'
            P -> 'IN' | 'TO' | 'RP'
            Adv -> 'RB' | 'RBS' | 'WRB'
            """)

When I load it into Python (based on p. 306), I get the following warnings:

>>> sr_parse  = nltk.ShiftReduceParser(grammar2)
Warning: VP -> V Adj will never be used
Warning: VP -> V NP will never be used
Warning: VP -> V S will never be used
Warning: VP -> V NP PP will never be used
Warning: VP -> V Adv will never be used

Thoughts?

Thanks,

Justin


Peter Ljunglöf

Feb 5, 2010, 5:23:08 PM
to nltk-...@googlegroups.com
Hi,

according to the code in nltk/parse/sr.py, method _check_grammar, on line 264:

# Any production whose RHS is an extension of another production's RHS
# will never be used.

The shift-reduce parser which is implemented in NLTK has only one parse stack. This means that on ambiguous grammars (and even some deterministic grammars such as yours), it will sometimes come to a point where it has to decide whether to shift or reduce. In these cases, the implemented parser always reduces.

The problem in your particular grammar is that you have a rule VP -> V. So, the parser will never ever reduce the rule VP -> V Adj, since it will reduce the first rule before seeing the Adj.

But, don't try to change the grammar; change the parsing algorithm instead. Try a RecursiveDescentParser - it cannot handle left-recursive grammars, but yours isn't left-recursive. Or, even better, try a ChartParser; it can handle all possible grammars.

best, Peter Ljunglöf
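
A quick sketch of the ChartParser suggestion applied to Justin's grammar, hedged on the NLTK version: newer releases spell the constructor nltk.CFG.fromstring and have parse() return an iterator of trees, while the 2010-era API used nltk.parse_cfg and nbest_parse(). Since the grammar's terminals are POS tags, the parser is fed a tag sequence rather than words (the example sequence is illustrative):

import nltk

grammar2 = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Nom | PropN
    Nom -> Adj Nom | N
    VP -> V Adj | V NP | V S | V NP PP | V Adv | Adv V | M V | V
    PP -> P NP
    PropN -> 'NNP' | 'NNPS' | 'WP' | 'WP$'
    Det -> 'DT' | 'PDT' | 'WDT'
    N -> 'NN' | 'NNS'
    Adj -> 'JJ' | 'JJR' | 'JJS'
    V -> 'VB' | 'VBD' | 'VBG' | 'VBN' | 'VBP' | 'VBZ'
    M -> 'MD'
    P -> 'IN' | 'TO' | 'RP'
    Adv -> 'RB' | 'RBS' | 'WRB'
""")

parser = nltk.ChartParser(grammar2)
tags = ['DT', 'JJ', 'NN', 'VBD']       # e.g. the tag sequence for "the big dog barked"
for tree in parser.parse(tags):        # prints every parse the chart parser finds
    print(tree)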

William Johnston

Feb 9, 2010, 3:23:39 AM
to nltk-...@googlegroups.com

Peter:

May I get a copy of your grammar when you have completed it?

For my needs, I would like a decent immediate-constituent grammar.

Are there existing grammars that can produce immediate-constituent analyses?

williamj




Peter Ljunglöf

Feb 9, 2010, 4:22:33 AM
to nltk-...@googlegroups.com
Hi,

I have no grammar, and I'm not working on one at the moment. In fact it's a very difficult task to write a grammar with decent coverage. Read about it in sections 8.6-8.8 of the NLTK book.

The first question to ask before you want to write a grammar is: do I want a grammar for parsing running text (e.g., crawling the web), or do I have a more controlled language that I want to parse (e.g., a simple dialogue system)? The second question is: what kind of parse result do I want? Different kinds of grammars/parsers (and development strategies) are good for different tasks.

best, Peter


Justin Olmanson

Feb 9, 2010, 10:14:22 AM
to nltk-...@googlegroups.com
Hi,


Thanks again, Peter L., for your help. I moved on from ShiftReduceParser, or possibly regressed... to the RegexpParser. I'm using it to group user-typed strings into phrasal groups during the writing/editing process (checking for new input every 3 seconds or so). The users are 2nd-6th graders.

The grammar is quite imperfect, but my needs are modest. I'm sharing the function I wrote for the application since WilliamJ requested something, and because others may possibly suggest improvements.  :-)

best,

-Justin


------------------------------------------------------------------------------------------------------------------

import nltk

def get_phrase_chunks(unTaggedText):
    """This function takes the passed string and returns a string with phrasal boundaries
       delineated. Return example:
       '(S\n  (CL\n    (NP my/PRP$ dad/NN)\n    (VP jumped/VBD (PP over/IN (NP the/DT fence/NN)))))'
    """
    #(to do) ensure that the string is a string, ensure that it has at least a single character in it
    #this grammar is modified from the NLTK 2009 book, p. 278 - it still needs work; I need to ensure that all tags are represented, or scrub/substitute some tags
    #NP  one or zero PDT (all, half), one or zero DT (the, a) OR ,WDT (that, which), OR, WP$ (whose), OR, PRP$ (her, my, our, your),
    #       zero or more CD (zero, 132, 1979), OR, JJ (third, regrettable, multilingual), OR, JJS (calmest, deepest, cutest), OR, JJR (bleaker, colder, cuter)
    #       one or more PRP (hers, me, them, us), OR WP (that, what, who, whom), OR,
    #                            NN (shed, slide, wind, humor), OR, NNP (Liverpool, Edith, Hank), OR NNPS (Americas, Animals, Angels), OR, NNS (muses, undergraduates, faucets)
    # PP  one IN (among, if, like, into), OR, TO (to), OR, RP (about, across, go, teeth)  AND one NP
    #VP  zero or more RB (occasionally, swiftly, periodically, technically), OR RBS (best, biggest, farthest, first), OR WRB (how, where, why),
    #       zero or more MD (can, may, will, shall), AND zero or more TO (to) AND
    #       one VB (ask, assess, break), OR, VBD (diped, swipped, strode), OR, VBG (stiring, angering, erasing), OR, VBN (chaired, reunified, dubbed), OR VBP (wrap, sue, spill), OR, VBZ (bases, marks, uses)
    #       zero or more RB (occasionally, swiftly, periodically, technically), OR RBS (best, biggest, farthest, first), OR WRB (how, where, why),
    #       zero or one RP (about, at, by, go, with)
    #       zero or one  NP OR PP OR CL
    #CL   one NP AND one VP
    grammar = r"""
        NP: {<PDT>?<DT|WDT|WP\$|PRP\$>?<CD|JJ.*>*<PRP|WP|NN.*>+}
        PP: {<IN|TO|RP><NP>}
        VP: {<RB|RBS|WRB>*<MD>*<TO>*<VB.*>+<RB|RBS|WRB>*<RP>*<NP|PP|CL>*$}
        CL: {<NP><VP>}
        """
    taggedText = nltk.pos_tag(nltk.word_tokenize(unTaggedText))
    j = nltk.RegexpParser(grammar, loop=2)
    k = str(j.parse(taggedText))
    return k

----------------------------------------------------------------------------------------------------------------------

Justin Olmanson

Feb 9, 2010, 12:51:28 PM
to nltk-...@googlegroups.com
Hi,

The way I have written my RegexpParser function, it returns a string:

k = str(j.parse(taggedText))


       '(S\n  (CL\n    (NP my/PRP$ dad/NN)\n    (VP jumped/VBD (PP over/IN (NP the/DT fence/NN)))))'

I'd prefer to return a list like this (below). Is there some nltk function or option that will do this for me? 

       ['S', ['CL', ['NP', 'my/PRP$', 'dad/NN'], ['VP', 'jumped/VBD', ['PP', 'over/IN', ['NP', 'the/DT', 'fence/NN']]]]]


thanks,

Justin

Peter Ljunglöf

Feb 11, 2010, 3:07:55 AM
to nltk-...@googlegroups.com
Hi Justin,

perhaps you should read more about the class nltk.Tree?

>>> help(nltk.Tree)
class Tree(__builtin__.list)
(...)

There you can see that a tree is in itself a list of its daughters, with a special attribute 'node' for getting the mother:

>>> t = nltk.Tree("(S (NP my/PRP dad/NN) (VP jumped/VBD (PP over/IN (NP the/DT fence/NN))))")
>>> t
Tree('S', [Tree('NP', ['my/PRP', 'dad/NN']), Tree('VP', ['jumped/VBD', Tree('PP', ['over/IN', Tree('NP', ['the/DT', 'fence/NN'])])])])
>>> t.node
'S'
>>> t[0]
Tree('NP', ['my/PRP', 'dad/NN'])
>>> t[1]
Tree('VP', ['jumped/VBD', Tree('PP', ['over/IN', Tree('NP', ['the/DT', 'fence/NN'])])])

So, all you have to do is write a recursive function which converts a tree to a list; it has to call itself on the children and prepend the mother node to the list of daughters. Since you won't always know whether the argument really is a Tree, you have to test it with isinstance(t, nltk.Tree).

Read more in the NLTK book - I'm sure there are exercises for modifying trees and other recursive structures.

best, Peter

Justin Olmanson

Feb 11, 2010, 12:09:43 PM
to nltk-...@googlegroups.com
Thanks Peter

I'll do as you say and continue reading. Thanks for the code examples to get me started!

:-)

-J

Justin Olmanson

Feb 11, 2010, 2:07:52 PM
to nltk-...@googlegroups.com
Thanks again for your help, Peter. You were right, there was some info about Trees on pages 279-281 of the NLTK (2009) book; that and your examples/suggestions got me there.   :-)

-------------------------------------------------------------------------

def organize_phrase_chunks(raw_phrase_chunk):
    """This function takes a chunked string and preps it for phrase-by-phrase processing
       '(S\n  (CL\n    (NP my/PRP$ dad/NN)\n    (VP jumped/VBD (PP over/IN (NP the/DT fence/NN)))))'
       into
       ['S', ['CL', ['NP', 'my/PRP$', 'dad/NN'], ['VP', 'jumped/VBD', ['PP', 'over/IN', ['NP', 'the/DT', 'fence/NN']]]]]
       or
       (S (NP The/DT big/JJ black/NN dog/NN))
       into
       ['S', ['NP', 'The/DT', 'big/JJ', 'black/NN', 'dog/NN']]
       or
       (S (NP I/PRP) am/VBP the/DT best/JJS (PP at/IN (NP basketball/NN)))
       into
       ['S', ['NP', 'I/PRP'], 'am/VBP', 'the/DT', 'best/JJS', ['PP', 'at/IN', ['NP', 'basketball/NN']]]
    """
    l = raw_phrase_chunk.replace('\n', '')  #get rid of newline \n elements
    m = nltk.Tree(l)     #convert string to a tree structure
    n = recursive_tree_2_list(m)     #convert to a list
    return n


def recursive_tree_2_list(theTree):
    """This recursive function takes a tree and returns a list
       calling itself if len > 1
    """
    ret_list = []  #initialize list
    #confirm that what is passed is a tree
    if isinstance(theTree, nltk.Tree):
        ret_list.append(theTree.node)    #add  mother / phrase type
        for element in theTree:          #iterate through the tree
            if isinstance(element, nltk.Tree):     #check if list item is also a tree
                ret_list.append(recursive_tree_2_list(element))     #send for processing
            else:
                ret_list.append(element)     #add non-tree element (string) to list
    else:
        ret_list.extend(['tree expected but not given', theTree])     #checkable issue: record the problem instead of failing
    return ret_list

-------------------------------------------------------------------

Peter Ljunglöf

Feb 12, 2010, 1:16:49 AM
to nltk-...@googlegroups.com

On 11 Feb 2010, at 20:07, Justin Olmanson wrote:

> l = raw_phrase_chunk.replace('\n', '') #get rid of newline \n elements
> m = nltk.Tree(l) #convert string to a tree structure

Instead of first converting to a string and then back into a tree, why not keep the tree in the first place? (The chunk parser returns a tree, so you can simply use it instead of its string representation.)

/Peter
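
A sketch of Peter's simplification, assuming Justin's recursive_tree_2_list from the earlier message is in scope and the same 2010-era Tree API his converter uses (the wrapper name is illustrative): pass the Tree that RegexpParser.parse() returns straight to the converter and skip the string round-trip.

import nltk

def get_phrase_chunk_list(untagged_text, grammar, loop=2):
    """Chunk the text and return the nested-list form directly;
    RegexpParser.parse() already returns an nltk.Tree, so there is no need
    to serialize it to a string and re-parse it."""
    tagged = nltk.pos_tag(nltk.word_tokenize(untagged_text))
    tree = nltk.RegexpParser(grammar, loop=loop).parse(tagged)
    return recursive_tree_2_list(tree)    # Justin's converter from earlier in the thread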
