Hi All,
Is there a way to distinguish the determiner and the relative pronoun ("which", for example) in part-of-speech tagging? I used this code:

import nltk
nltk.pos_tag(nltk.word_tokenize("which city has a cinema which is near a train station"))

But "which" is always tagged as a determiner.
Thanks,
Samir
In a practical NLP application, when would it really matter?
Daniel
If you ask the whole list for a correct parse of the sentence you
give, you are not going to get 99.5% agreement.
What if I say that "That" (PRO) is wrong, and that it is a DET?
That police officer saluted me
The police officer saluted me
Same construction, and "The" is not a PRO.
Your parse is only 80% correct :-)
Daniel
Hi All,
Is there a temporal tagger in NLTK? For example, a tagger that produces "2009" for "last year", so that the output can be used in a formal query.
Thanks,
Samir
> This is true for grammar rule parsing, but in statistical parsing, the
> actual words used have a large influence on the result. In fact the parsing
> is statistically dependent on the words.
In fact, language and its constructions are dependent on the actual words used.
> My parser was trained on Brown + selected Penn corpora + misc., a total of 3
> million words. As far as I can tell, the coverage is pretty good.
> In any case I must protest at the harshness of the grading. Even had I got
> the second sentence wrong, which I did not, that would have lowered my score
> by 1/14000 (the number of words in the CoNLL corpora).
What Stuart and I are saying is that your parser does not get 99.5% of
the English language correct. I think what you are saying is that you
can re-tag the corpus you trained on and produce the same result
that the training corpus had.
In the real world, once you are no longer looking at the training corpus,
you will never get a 99.5% correct result. My point is that you will
not be able to get 99.5% of the population of linguists to agree on
what is correct, so you will never be able to validate your 99.5%.
The only way you could do that is to have the entire corpus of English,
or of any other language (which unfortunately is infinite), tagged and
agreed upon, and then parse an untagged version of the same corpus and
produce the same tags as the original.
Unfortunately such a parser would be pretty pointless, since it would
not be able to tell us anything we didn't already know. If the entire
corpus of English is already tagged, we do not need to re-tag it.
You can train your parser on 50 million words, and it will not get a
99.5% correct tagging of the first page of next week's New York Times.
Your way of calculating correctness is not one I buy. I want 99.5%
correct on unseen text. So if you mean you can feed new text to your
parser, new text the same size as your training corpus, and that the
result of the new parse is 99.5% correct, then I am impressed, but I
don't believe it.
I'll dig deeper into your combination of methodologies. It is appealing
since I work with so many different languages.
Hope to hear more about your work,
Daniel
According to the code in nltk/parse/sr.py, method _check_grammar, on line 264:
# Any production whose RHS is an extension of another production's RHS
# will never be used.
The shift-reduce parser which is implemented in NLTK has only one parse stack. This means that on ambiguous grammars (and even some deterministic grammars such as yours), it will sometimes come to a point where it has to decide whether to shift or reduce. In these cases, the implemented parser always reduces.
The problem in your particular grammar is that you have a rule VP -> V. So, the parser will never ever reduce the rule VP -> V Adj, since it will reduce the first rule before seeing the Adj.
But don't try to change the grammar; change the parsing algorithm instead. Try a RecursiveDescentParser - it cannot handle left-recursive grammars, but yours is not left-recursive. Or, even better, try a ChartParser; it can handle all possible grammars.
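For illustration, a minimal sketch (using NLTK 3's API; the toy grammar only mirrors the VP -> V versus VP -> V Adj conflict described above and is not the original poster's grammar):

import nltk

# Toy grammar with the same conflict: VP -> V competes with VP -> V Adj.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'dinner'
    VP -> V | V Adj
    V -> 'is'
    Adj -> 'ready'
""")
sent = ['dinner', 'is', 'ready']

# The shift-reduce parser reduces V to VP before it ever sees 'ready',
# so it finds no parse at all.
print(list(nltk.ShiftReduceParser(grammar).parse(sent)))   # []

# A chart parser explores both alternatives and finds the full parse.
for tree in nltk.ChartParser(grammar).parse(sent):
    print(tree)   # (S (NP dinner) (VP (V is) (Adj ready)))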
best, Peter Ljunglöf
I have no grammar, and I'm not working on one at the moment. In fact, it's a very difficult task to write a grammar with decent coverage. Read about it in sections 8.6-8.8 of the NLTK book.
The first question to ask before writing a grammar is: do I want a grammar for parsing running text (e.g., crawling the web), or do I have a more controlled language that I want to parse (e.g., a simple dialogue system)? The second question is: what kind of parse result do I want? Different kinds of grammars/parsers (and development strategies) are good for different tasks.
best, Peter
On 9 Feb 2010, at 09:23, William Johnston wrote:
>
> Peter:
>
> May I get a copy of your grammar when you have completed it?
>
> For my needs, I would like a decent immediate constituent grammar.
>
> Are there existing grammars that can produce immediate-constituent analyses?
>
> williamj
Perhaps you should read more about the class nltk.Tree?
>>> help(nltk.Tree)
class Tree(__builtin__.list)
(...)
There you can see that a tree is in itself a list of its daughters, with a special attribute 'node' for getting the mother:
>>> t = nltk.Tree("(S (NP my/PRP dad/NN) (VP jumped/VBD (PP over/IN (NP the/DT fence/NN))))")
>>> t
Tree('S', [Tree('NP', ['my/PRP', 'dad/NN']), Tree('VP', ['jumped/VBD', Tree('PP', ['over/IN', Tree('NP', ['the/DT', 'fence/NN'])])])])
>>> t.node
'S'
>>> t[0]
Tree('NP', ['my/PRP', 'dad/NN'])
>>> t[1]
Tree('VP', ['jumped/VBD', Tree('PP', ['over/IN', Tree('NP', ['the/DT', 'fence/NN'])])])
So, all you have to do is write a recursive function which converts a tree to a list; it has to call itself on the children and prepend the mother node to the list of converted daughters. Since you won't always know whether the argument really is a Tree, you have to test it with isinstance(t, nltk.Tree).
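For example, a minimal sketch (the name tree_to_list is only illustrative; it follows the .node attribute and the string-accepting Tree constructor shown in the session above, which newer NLTK releases replace with t.label() and Tree.fromstring):

import nltk

def tree_to_list(t):
    """Recursively convert an nltk.Tree into a nested list."""
    if isinstance(t, nltk.Tree):
        # Prepend the mother node to the converted daughters.
        return [t.node] + [tree_to_list(child) for child in t]
    # Leaves (plain strings like 'dad/NN') are returned unchanged.
    return t

t = nltk.Tree("(S (NP my/PRP dad/NN) (VP jumped/VBD (PP over/IN (NP the/DT fence/NN))))")
print(tree_to_list(t))
# ['S', ['NP', 'my/PRP', 'dad/NN'], ['VP', 'jumped/VBD', ['PP', 'over/IN', ['NP', 'the/DT', 'fence/NN']]]]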
Read more in the NLTK book - I'm sure there are exercises for modifying trees and other recursive structures.
best, Peter
> Thanks again for your help, Peter. You were right: there was some info about Trees on pages 279-281 of the NLTK (2009) book. That and your examples/suggestions got me there. :-)
>
> -------------------------------------------------------------------------
>
> def organize_phrase_chunks(raw_phrase_chunk):
> """This function takes a chunked string and preps it for phrase-by-phrase processing
> '(S\n (CL\n (NP my/PRP$ dad/NN)\n (VP jumped/VBD (PP over/IN (NP the/DT fence/NN)))))'
> into
> ['S', ['CL', ['NP', 'my/PRP$', 'dad/NN'], ['VP', 'jumped/VBD', ['PP', 'over/IN', ['NP', 'the/DT', 'fence/NN']]]]]
> or
> (S (NP The/DT big/JJ black/NN dog/NN))
> into
> ['S', ['NP', 'The/DT', 'big/JJ', 'black/NN', 'dog/NN']]
> or
> (S (NP I/PRP) am/VBP the/DT best/JJS (PP at/IN (NP basketball/NN)))
> into
> ['S', ['NP', 'I/PRP'], 'am/VBP', 'the/DT', 'best/JJS', ['PP', 'at/IN', ['NP', 'basketball/NN']]]
> """
> l = raw_phrase_chunk.replace('\n', '') #get rid of newline \n elements
> m = nltk.Tree(l) #convert string to a tree structure
Instead of first converting to a string and then back into a tree, why not keep the tree in the first place? (The chunk parser returns a tree, so you can simply use it instead of its string representation.)
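For illustration, a minimal sketch of that suggestion (the chunk grammar and sentence are assumptions, not taken from the thread):

import nltk

# The chunk parser already returns an nltk.Tree, so there is no need to
# print it and re-parse its string representation.
tagged = nltk.pos_tag(nltk.word_tokenize("my dad jumped over the fence"))
chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN>}")
tree = chunker.parse(tagged)   # this is already a Tree, not a string

print(type(tree))              # shows that this is an nltk.Tree
for subtree in tree:           # iterate the daughters directly
    print(subtree)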
/Peter