Using regular expressions in feature grammars

Cerin

Nov 15, 2012, 12:42:17 AM

to nltk-...@googlegroups.com

While reading about the toy English-to-SQL translator in chapter 10 (http://nltk.googlecode.com/svn/trunk/doc/book/ch10.html), I noticed that the literal numeric values are hardcoded in a production. Is there any way to specify a regexp-style numeric "pattern" in a production so it could match an arbitrary number?

David Gerő

Nov 16, 2012, 3:25:19 PM

to nltk-...@googlegroups.com

Hi Cerin,

If you need this, you can do it with a tagger. Tag your text before parsing it with the FCFG grammar. The tagger can use regex-style numeric patterns, and your grammar can then refer to the new tags.

Best,
David

Cerin

Nov 16, 2012, 4:30:04 PM

to nltk-...@googlegroups.com

I don't understand. The grammar won't understand a tagged number any better than an untagged number.

As I understand it, in order for the parser to work, its grammar needs to include a terminal for every single token in the input text. Otherwise, check_coverage() throws an exception. Therefore, if I don't want my parser to choke, I need to create a terminal for every single number I plan on seeing, which isn't a practical solution. I don't see how using a regex tagger would change this.
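
A minimal sketch of the coverage problem described above, assuming NLTK 3 (where `nltk.CFG.fromstring` builds a grammar from text). The toy grammar and the helper `is_covered` are made up for illustration. Only two numerals are listed as terminals, so any other number fails the coverage check before parsing even starts.

```python
import nltk

# Hypothetical toy grammar: every number must be spelled out as a terminal.
grammar = nltk.CFG.fromstring("""
    S -> 'ages' 'over' NUM
    NUM -> '30' | '40'
""")


def is_covered(tokens):
    """Return True if every token is a terminal of the grammar."""
    try:
        # check_coverage raises ValueError when a token has no terminal.
        grammar.check_coverage(tokens)
        return True
    except ValueError:
        return False


covered = is_covered("ages over 30".split())    # '30' is a listed terminal
uncovered = is_covered("ages over 31".split())  # '31' is not in the grammar
print(covered, uncovered)
```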

David Gerő

Nov 17, 2012, 8:48:07 AM

to nltk-...@googlegroups.com

Hi,

As I understand your question, you would like to use regexes during parsing. You can't use a regex inside the grammar itself; you can only run a regex tagger before parsing. A RegexpTagger will tag your tokens, and you can then use its tags in your grammar. This solution doesn't put any regex in the grammar: you only tag the tokens you want the grammar to accept, and then write a grammar that handles the new number tags.

The analysis is split into two parts: before parsing, you tag the text so that the parser can handle it.

Or have I badly misunderstood what you want? If so, I'm sorry, but I would like to understand your question.

Best,
David

Cerin

Nov 17, 2012, 10:52:47 AM

to nltk-...@googlegroups.com

> You can write a grammar which handle your new tagged number.

This is what I don't understand. Can you provide an example of such a feature-based grammar? I don't believe this is possible with NLTK.

David Gerő

Nov 17, 2012, 3:58:08 PM

to nltk-...@googlegroups.com

Hi,

I was only suggesting a workaround for this problem. As I wrote, you can't use a regex in the grammar itself. Here is an example:

import nltk

test_sentence = ["123", "0x007b"]
regexp_tagger = nltk.RegexpTagger(
    [
    (r"^[0-9]+$", "decimal"),
    (r"^0x[0-9A-Fa-f]+$", "hexadecimal"),
    ])
tagged_text = regexp_tagger.tag(test_sentence)
only_tags = [tag for text, tag in tagged_text]

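# NLTK 3 names: CFG.fromstring replaces parse_cfg, and parse() replaces nbest_parse.
grammar = nltk.CFG.fromstring("""
    S -> Dec Hex
    Dec -> "decimal"
    Hex -> "hexadecimal"
    """)

parser = nltk.ChartParser(grammar)
# parse() returns an iterator of trees, so collect it into a list before printing.
res = list(parser.parse(only_tags))
print(res)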

Best,
David

Cerin

Nov 17, 2012, 5:57:00 PM

to nltk-...@googlegroups.com

Thank you, this is what I suspected you meant. However, I think you misunderstood my question. I'm using a feature-based grammar, because I need to capture the number via a node pattern. What you're doing is replacing all numbers with a common word, which defeats this purpose.

Jonathan VK

Nov 20, 2012, 4:42:06 PM

to nltk-...@googlegroups.com

I've run into this issue while playing with NLTK for a small project. In my case I'm capturing positive integers, but it can easily be extended. It's more of a hack around the problem than a proper solution, but other solutions I found would require overriding part of the behavior of the chart parsers.

1. I have a grammar with a NUM nonterminal which isn't on the LHS anywhere.
2. I load this grammar.
3. I export the productions.
4. I then tokenize the text I want to parse, filter it for integers using a regex, and for each different number, I add a production to the list of productions.
5. After that, I build a new grammar from the extended productions.
6. I load this new grammar into a chart parser.

It might be sufficient to add the productions to an existing grammar rather than to build a new grammar from a list of productions, but I need to figure out where the indexing happens and whether this is a safe option. This would remove steps 3 and 5.

You can find the code here, but basically, the relevant part is the following:

grammar = data_load('file:commandParser.fcfg')
productions = grammar.productions()

RE_INT = re.compile(r'\d+$')

feature_parser = FeatStructParser()

def num_production(n):
    """ Return a production NUM -> n """
    lhs = FeatStructNonterminal('NUM')
    # Raw string, so the backslash in the lambda term \V is not read as an escape.
    lhs.update(feature_parser.parse(r'[NUM=pl, SEM=<\V.V({num})(identity)>]'.format(num=n)))
    return Production(lhs, [n])

def parse():
    """ Parse a command and return a json string.
    If parse is successful, returns a tuple (true, [instructions]).
    If parse is not successful, returns a tuple (false, [errors]).
    """

    # preprocess
    command = request.forms.get('command').strip(' .?!')
    tokens = command.split()

    # Make a local copy of productions
    lproductions = list(productions)

    # find all integers
    ints = set(filter(RE_INT.match, tokens))
    # Add a production for every integer
    lproductions.extend(map(num_production, ints))

    # Make a local copy of the grammar with extra productions
    lgrammar = FeatureGrammar(grammar.start(), lproductions)

    # Load grammar into a parser
    parser = FeatureEarleyChartParser(lgrammar, trace=0)
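
For readers on NLTK 3, the same idea can be sketched roughly as follows. This is not the code above, only an illustration under assumptions: the toy grammar, the `VAL` feature, and the helpers `grammar_for` and `parse_value` are hypothetical, `FeatureChartParser` is swapped in for the Earley parser, and the extra productions are read from generated rule text instead of being built by hand.

```python
import re

from nltk.grammar import FeatureGrammar
from nltk.parse.featurechart import FeatureChartParser

# Hypothetical toy grammar: NUM never appears on a left-hand side.
BASE = FeatureGrammar.fromstring("""
    % start S
    S[VAL=?n] -> 'delete' NUM[VAL=?n] 'rows'
""")


def grammar_for(tokens):
    """Return a copy of BASE extended with one NUM production per integer."""
    ints = sorted({t for t in tokens if re.fullmatch(r"\d+", t)})
    if not ints:
        return BASE
    # One rule per distinct integer, e.g.  NUM[VAL=42] -> '42'
    rules = "\n".join("NUM[VAL={0}] -> '{0}'".format(n) for n in ints)
    extra = FeatureGrammar.fromstring(rules).productions()
    return FeatureGrammar(BASE.start(), list(BASE.productions()) + list(extra))


def parse_value(sentence):
    """Parse a sentence and return the number captured in the VAL feature."""
    tokens = sentence.split()
    parser = FeatureChartParser(grammar_for(tokens))
    trees = list(parser.parse(tokens))
    return trees[0].label()["VAL"] if trees else None


value = parse_value("delete 42 rows")
print(value)
```

The grammar is rebuilt for every input, which keeps the shared base grammar untouched at the cost of re-indexing the productions each time.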

Best,

Jonathan

Chris Spencer

Nov 20, 2012, 5:05:06 PM

to nltk-...@googlegroups.com

Wow, thanks. Yeah, that's definitely a hack, but still a clever workaround.

Alexis Dimitriadis

Nov 20, 2012, 6:20:17 PM

to nltk-...@googlegroups.com

On Saturday, November 17, 2012 5:57:00 PM UTC-5, Cerin wrote:
> Thank you, this is what I suspected you meant. However, I think you
> misunderstood my question. I'm using a feature-based grammar, because
> I need to capture the number via a node pattern. What you're doing is
> replacing all numbers with a common word, which defeats this purpose.

It doesn't defeat the purpose if you reverse it afterwards. To avoid
rewriting your grammars on the fly, I would suggest the following:

1. Tokenize and normalize your sentence through a preliminary pass,
using regular expressions or even a real tagger; all numbers can be
tagged as NUM, and you could do the same for literal strings and all
open-vocabulary items (such as SQL column names).

2. Use a CFG to parse the sequence of tags.

3. Substitute the actual words into the parse tree.

The CFG formalism is defined on sequences built from a finite "alphabet"
(read: vocabulary), so to use it with open-vocabulary strings, some
adjustment is necessary, and the nltk cfg module doesn't do it for you.
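
The three steps above can be sketched as follows, assuming NLTK 3. The tag name `NUM`, the toy grammar, and the helpers `normalize` and `parse_with_words` are made up for illustration; a real system would need more patterns and would have to guard against an input that literally contains the token "NUM".

```python
import re

import nltk

# Step 1 patterns: any token made only of digits is replaced by the tag NUM.
TAG_PATTERNS = [(re.compile(r"\d+"), "NUM")]

# Step 2 grammar: written over tags, so 'NUM' stands for any number.
grammar = nltk.CFG.fromstring("""
    S -> 'ages' 'over' Num
    Num -> 'NUM'
""")
parser = nltk.ChartParser(grammar)


def normalize(tokens):
    """Replace open-vocabulary tokens by their tag; keep other tokens as-is."""
    tags = []
    for tok in tokens:
        for pattern, tag in TAG_PATTERNS:
            if pattern.fullmatch(tok):
                tags.append(tag)
                break
        else:
            tags.append(tok)
    return tags


def parse_with_words(tokens):
    """Parse the tag sequence, then put the original words back as leaves."""
    tree = next(iter(parser.parse(normalize(tokens))), None)
    if tree is None:
        return None
    # Step 3: leaf positions come back left to right, aligned with the tokens.
    for position, word in zip(tree.treepositions("leaves"), tokens):
        tree[position] = word
    return tree


tree = parse_with_words("ages over 31".split())
print(tree)
```

Because the grammar only ever sees the tag, its vocabulary stays finite while the restored tree still carries the actual number.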

Best,

Alexis
