Do I always have to mention tagged words to define a grammar?


saik...@gmail.com

Feb 8, 2018, 3:19:03 PM
to nltk-users
Hi,
  I noticed that in order to define a grammar for sentence parsing, I need to specify each tagged word as a leaf entry in the CFG. Otherwise, the parser raises an error saying the grammar doesn't cover some of the words in the sentence. For example, I had to specify the grammar below for the sentence "I am learning NLP". Given that, how do I use a general grammar to parse various sentences in a long text?

grammar = nltk.CFG.fromstring("""
          S  -> NP VP 
          NP -> PRP VBP
          VP -> VBG NNP
          PRP -> 'I'
          VBP -> 'am'
          VBG -> 'learning'
          NNP -> 'NLP'
          """)

thanks
   saikat

Dimitriadis, A. (Alexis)

Feb 8, 2018, 6:02:43 PM
to <nltk-users@googlegroups.com>
NLTK's CFG module is not meant for parsing free text; it's a teaching tool. You need to include all the words in the grammar, or hack up some work-around. E.g., you can write a grammar that terminates at the POS level, then POS-tag your sentence and parse the string of POS tags instead of the original words. It's a hack, but it will do the job.

However, you'll still have a hard time parsing real text because of the large number of CFG rules it will involve, and because without the lexical content it's impossible to choose the right parse among the alternatives. If your goal is to parse real text, use a statistical parser.
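For illustration, the POS-tag hack described above might look something like this. This is a minimal sketch: the rule inventory is illustrative rather than a real grammar, and the tag sequence is hard-coded here; in practice it would come from a tagger such as nltk.pos_tag (which needs its model data downloaded).

```python
import nltk

# The grammar's terminals are POS tags, not words, so it never needs
# to enumerate the vocabulary.
pos_grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'PRP'
    VP -> 'VBP' 'VBG' 'NNP'
""")

parser = nltk.ChartParser(pos_grammar)

# Tag sequence for "I am learning NLP" (hard-coded in place of a tagger run).
tags = ['PRP', 'VBP', 'VBG', 'NNP']

# Parse the tag sequence instead of the words.
for tree in parser.parse(tags):
    print(tree)
```

The resulting tree covers the tags only; to recover a tree over the words you would have to substitute them back in afterwards.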

Alexis


Jordi Carrera

Feb 9, 2018, 10:45:20 AM
to nltk-users
Couldn't you simply generate the leaf node rules programmatically by iterating over some dictionary?

See the following code for illustration:


import nltk

#   Declare the word lists to be expanded into rules as a Python dictionary,
#   with all the words grouped by their part of speech (to be used as the
#   left-hand side of the leaf rules). Alternatively, you may load this
#   information from separate files, e.g. a file with VBGs, another with
#   NNs, etc.
LEXICON = {
    'VBG': 'learning studying understanding discovering',
    'PRP': 'i you he she it we they',
    'NN': 'NLP DS CS CL NLG IR LDA DL',
    'VBZ': 'is',
    'VBP': 'am are'
}

#   Define a function to automatically expand the items in the array above
#   into a grammar-like string:
def generate_lexicon():
    lexicon = ''
    for pos, words in LEXICON.items():
        for word in words.split():
            lexical_rule = "          %s -> '%s'\n" % (pos, word)
            lexicon += lexical_rule
    return lexicon


if __name__ == '__main__':

    #   At runtime, declare the rules section of your grammar normally
    #   but *without* the leaf rules...
    grammar = """S -> NP VP
              NP -> PRP
              V -> V VBG
              VP -> VBZ
              VP -> VBP
              VP -> V NN
              NP -> PRP VBP
              V -> VBZ
              V -> VBP"""

    #   ... then call our function to obtain the string containing
    #   the automatically expanded leaf node rules...
    lexicon = generate_lexicon()

    #   ... and simply combine both at the end:
    semiautomatic_grammar = '%s\n%s' % (grammar, lexicon)


    #   You can now parse normally:
    grammar = nltk.CFG.fromstring(semiautomatic_grammar)

    sentences = [
        'i am studying NLP',
        'i am discovering DS',
        'he is understanding CL',
        'they are understanding LDA',
        'it is NLG',
    ]

    parser = nltk.ChartParser(grammar)

    for sent in sentences:
        print(sent)
        for i, tree in enumerate(parser.parse(sent.split())):
            print(i + 1, tree)
        print()


    # You should get the following output:

# i am studying NLP
# 1 (S (NP (PRP i)) (VP (V (V (VBP am)) (VBG studying)) (NN NLP)))
# i am discovering DS
# 1 (S (NP (PRP i)) (VP (V (V (VBP am)) (VBG discovering)) (NN DS)))
# he is understanding CL
# 1 (S (NP (PRP he)) (VP (V (V (VBZ is)) (VBG understanding)) (NN CL)))
# they are understanding LDA
# 1 (S
#   (NP (PRP they))
#   (VP (V (V (VBP are)) (VBG understanding)) (NN LDA)))
# it is NLG
# 1 (S (NP (PRP it)) (VP (V (VBZ is)) (NN NLG)))

Dimitriadis, A. (Alexis)

Feb 12, 2018, 4:22:23 PM
to <nltk-users@googlegroups.com>
> On 9 Feb 2018, at 16:45, Jordi Carrera <jordi.carr...@gmail.com> wrote:
>
> Couldn't you simply generate the leaf node rules programmatically by iterating over some dictionary?

Have you thought about parsing arbitrary text, e.g. a random paragraph from the newspaper? Your approach could be an alternative to removing the lexemes, as I suggested, but for each parsing task the grammar would have to be enlarged with the lexemes from the specific text to be parsed (probably after it has been POS-tagged).
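Concretely, that per-text enlargement might be sketched as follows. The (word, tag) pairs are hard-coded here as an assumption; in practice they would come from running a tagger such as nltk.pos_tag over the input text.

```python
import nltk

# Structural rules stay fixed across texts; only the lexical rules vary.
structural_rules = """
    S  -> NP VP
    NP -> PRP
    VP -> VBP VBG NNP
"""

# Tagged pairs for the text to be parsed (stand-in for a tagger run).
tagged = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP')]

# One lexical rule per (word, tag) pair, appended to the fixed rules.
lexical_rules = '\n'.join("%s -> '%s'" % (tag, word) for word, tag in tagged)
grammar = nltk.CFG.fromstring(structural_rules + '\n' + lexical_rules)

parser = nltk.ChartParser(grammar)
for tree in parser.parse([word for word, _ in tagged]):
    print(tree)
```

The grammar is rebuilt per text, which is cheap compared to parsing itself.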

Alexis

Jordi Carrera

Feb 13, 2018, 11:25:54 AM
to nltk-users
Yes, for arbitrary input you're right. However, we don't know whether that applies to the current use case; the original question makes no stipulation to that effect.

PoS-taggers are relatively easy to come by, so under some circumstances it may actually be feasible to run one on the input, as you suggest, and automatically derive the dictionaries for the grammar. That would ensure exhaustive or near-exhaustive recall.

Alternatively (and more simplistically), one may remove from the input sentence any tokens not in the grammar (or, ideally, replace them with some kind of entity label) and attempt to parse the remaining words. If using entity tags, the grammar can have rules predicated on those; if using the filtered word sequence, a minimum threshold can be set so that parses of sentences where more than a third of the tokens have been removed are ignored or set aside for special treatment.
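A minimal sketch of that filtering strategy, using a plain word set as a stand-in for the grammar's terminal vocabulary (with an nltk.CFG you could collect the terminals from the string right-hand sides of grammar.productions()):

```python
def filter_tokens(tokens, known, max_removed_ratio=1/3):
    """Drop tokens not in `known`; give up if too many were dropped."""
    kept = [t for t in tokens if t in known]
    removed = len(tokens) - len(kept)
    if removed > max_removed_ratio * len(tokens):
        return None  # too many unknown tokens: set the sentence aside
    return kept

# Stand-in for the grammar's terminal vocabulary.
known_words = {'i', 'am', 'learning', 'NLP', 'studying'}

print(filter_tokens('i am really learning NLP'.split(), known_words))
# -> ['i', 'am', 'learning', 'NLP'] (one unknown token of five removed)
print(filter_tokens('colorless green ideas sleep furiously'.split(), known_words))
# -> None (all five tokens unknown, above the 1/3 threshold)
```

The surviving token list can then be handed to the parser as usual, and a `None` result routed to whatever special treatment is chosen.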

