Couldn't you simply generate the leaf node rules programmatically by iterating over some dictionary?
See the following code for an illustration. Note that the print statements use Python 2 syntax; if you're using Python 3, replace each "print x" with "print(x)" and the bare "print" with "print()":
import nltk
# Declare the word lists to be expanded into rules as a Python dictionary,
# with all the words grouped by their part of speech (the part-of-speech tag
# is then used as the left-hand side of the leaf rules). Alternatively,
# you may also load this information from separate files, e.g. a file with VBGs,
# another with NNs, etc.
LEXICON = {
    'VBG': 'learning studying understanding discovering',
    'PRP': 'i you he she it we they',
    'NN': 'NLP DS CS CL NLG IR LDA DL',
    'VBZ': 'is',
    'VBP': 'am are'
}
# Define a function to automatically expand the items in the dictionary above
# into a grammar-like string:
def generate_lexicon():
    lexicon = ''
    for pos, words in LEXICON.items():
        for word in words.split():
            lexical_rule = " %s -> '%s'\n" % (pos, word)
            lexicon += lexical_rule
    return lexicon
if __name__ == '__main__':
    # At runtime, declare the rules section of your grammar normally
    # but *without* the leaf rules...
    grammar = """S -> NP VP
    NP -> PRP
    V -> V VBG
    VP -> VBZ
    VP -> VBP
    VP -> V NN
    NP -> PRP VBP
    V -> VBZ
    V -> VBP"""
    # ... then call our function to obtain the string containing
    # the automatically expanded leaf node rules...
    lexicon = generate_lexicon()
    # ... and simply combine both at the end:
    semiautomatic_grammar = '%s\n%s' % (grammar, lexicon)
    # You can now parse normally:
    grammar = nltk.CFG.fromstring(semiautomatic_grammar)
    sentences = [
        'i am studying NLP',
        'i am discovering DS',
        'he is understanding CL',
        'they are understanding LDA',
        'it is NLG',
    ]
    parser = nltk.ChartParser(grammar)
    for sent in sentences:
        print sent
        for i, tree in enumerate(parser.parse(sent.split())):
            print i + 1, tree
        print
# You should get the following output:
# i am studying NLP
# 1 (S (NP (PRP i)) (VP (V (V (VBP am)) (VBG studying)) (NN NLP)))
#
# i am discovering DS
# 1 (S (NP (PRP i)) (VP (V (V (VBP am)) (VBG discovering)) (NN DS)))
#
# he is understanding CL
# 1 (S (NP (PRP he)) (VP (V (V (VBZ is)) (VBG understanding)) (NN CL)))
#
# they are understanding LDA
# 1 (S
#   (NP (PRP they))
#   (VP (V (V (VBP are)) (VBG understanding)) (NN LDA)))
#
# it is NLG
# 1 (S (NP (PRP it)) (VP (V (VBZ is)) (NN NLG)))
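
The alternative mentioned in the comments above, loading each word list from its own file, might look roughly like the sketch below. The file layout (one file per tag, e.g. "VBG.txt", holding whitespace-separated words) and the helper names load_lexicon and lexicon_to_rules are assumptions made up for this illustration; they are not part of NLTK. The code avoids print statements with multiple arguments so that it runs on both Python 2 and Python 3.

```python
import os
import shutil
import tempfile


def load_lexicon(directory):
    """Read every '<TAG>.txt' file in `directory` into a {tag: [words]} dict.

    Assumed (hypothetical) layout: one file per part-of-speech tag,
    each containing whitespace-separated words.
    """
    lexicon = {}
    for filename in sorted(os.listdir(directory)):
        tag, ext = os.path.splitext(filename)
        if ext != '.txt':
            continue  # ignore anything that is not a word-list file
        with open(os.path.join(directory, filename)) as handle:
            lexicon[tag] = handle.read().split()
    return lexicon


def lexicon_to_rules(lexicon):
    """Expand {tag: [words]} into one "TAG -> 'word'" line per word."""
    return '\n'.join(
        "%s -> '%s'" % (tag, word)
        for tag in sorted(lexicon)
        for word in lexicon[tag]
    )


# Demonstration with a throwaway directory holding two sample files.
demo_dir = tempfile.mkdtemp()
try:
    for tag, words in {'VBZ': 'is', 'VBP': 'am are'}.items():
        with open(os.path.join(demo_dir, tag + '.txt'), 'w') as handle:
            handle.write(words)
    lexicon = load_lexicon(demo_dir)
    rules = lexicon_to_rules(lexicon)
finally:
    shutil.rmtree(demo_dir)  # clean up the sample files

print(rules)
```

The resulting string can be appended to the hand-written rules in the same way as the output of generate_lexicon() above, before passing the combined text to nltk.CFG.fromstring.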