Hi,
I've decided to bare some of my thoughts on NLP++ blog-style on the
comp.ai.nat-lang newsgroup, and perhaps will discuss VisualText,
TAIParse, and related topics in threads such as this one. The
newsgroup is not very active, to say the least, and yet Natural
Language Engineering (NLE) and "Text Analytics" are large and
fast-growing industries.
As VisualText is now free and unencumbered for non-commercial use, I
hope that others may use this forum to critique, ask questions, and so
on.
I'm jumping into the middle of my thoughts, taken from emails of the
past few days. If someone needs intro material, the VisualText Help
files are all online at
http://www.textanalysis.com/help/help.htm
and I can split off tutorial threads for anyone who has questions.
I made the bold statement to someone recently that NLP (or at least
NLE) is a solved problem: with enough effort, you can reach whatever
accuracy and completeness you want. An accelerating environment such
as VisualText/NLP++ multiplies what that effort yields, and putting it
in the hands of experienced practitioners multiplies it further.
The notes below were originally addressed to David de Hilster, my co-
architect on all things VisualText, as a sounding board.
Enjoy!
Amnon Meyers
CTO
Text Analysis International, Inc.
http://www.textanalysis.com
* NLP++ (R) is a registered trademark of Text Analysis International,
Inc.
====================================
In working on a current app, I realize I want a rule that matches
anything up to the longest match. For example, given the rule
S <- A B C D
I want these and only these to match:
A B C D
A B C
A B
A
The [opt] or optional match doesn't do this, but some new keyword
could implement this. Eg,
@RULES
_S <-
A
B
C [step]
D [step]
@@
"step" would mean, if you match me, you can add me to the matched
nodes and step forward (or stop here). It's ok not to reach me, but you
can't pass over me. (For A alone to match, B would need [step] too.) So
this rule as written would match
A B C D
A B C
A B
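A minimal Python sketch of how the proposed [step] keyword might behave. It assumes tokens are plain strings and a rule is a list of (element, is_step) pairs; the name match_step and this representation are illustrative only and not part of NLP++.

```python
def match_step(rule, tokens):
    """Return how many tokens the rule consumes, or 0 if it fails.

    rule: list of (element, is_step) pairs. A [step] element may end the
    match early when it is absent, but it can never be skipped over.
    """
    matched = 0
    for element, is_step in rule:
        if matched < len(tokens) and tokens[matched] == element:
            matched += 1      # element matched: add it and step forward
        elif is_step:
            break             # [step] element missing: stop here, keep the match so far
        else:
            return 0          # mandatory element missing: the rule fails
    return matched

# The rule from the post: A and B mandatory, C and D marked [step].
RULE = [("A", False), ("B", False), ("C", True), ("D", True)]
```

For example, match_step(RULE, ["A", "B", "D"]) consumes only A B, because D cannot be reached by passing over C.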
Thinking about this further, would be nice to go in the reverse
direction. [trigger] could do that, but also yet another keyword
could work without requiring trigger. Eg,
_S <-
A [back]
B [back]
C [back]
D
@@
would match only these:
A B C D
B C D
C D
D
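The reverse direction can be sketched the same way. This is a toy Python model, assuming the match is anchored at the position of the last element and extended leftward; match_back and the (element, is_back) pairs are hypothetical names for illustration.

```python
def match_back(rule, tokens, end):
    """Match the rule right-to-left so its last element sits at tokens[end].

    rule: list of (element, is_back) pairs. Returns the start index of the
    match, or None if the rule fails.
    """
    pos = end
    for element, is_back in reversed(rule):
        if pos >= 0 and tokens[pos] == element:
            pos -= 1          # matched: extend the match one node to the left
        elif is_back:
            break             # [back] element missing: the match starts here
        else:
            return None       # mandatory element missing: the rule fails
    return pos + 1

# The rule from the post: A, B, C marked [back], D mandatory.
RULE = [("A", True), ("B", True), ("C", True), ("D", False)]
```

With tokens ["X", "B", "C", "D"] and end=3, the match starts at index 1, i.e. B C D.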
====================================
Frankly, after all these years, I'm not totally thrilled with NLP++
rules as compared with the old TexUS FSA-like rules. Eg,
modal have be verb
I'd like to match every possible occurrence of that. So something
like
_vg <-
modal [opt]
have [opt]
be [opt]
verb [opt]
@@
would do it, except it's currently disallowed, since it can match
"zero nodes". One quick enhancement to NLP++ would be to allow such
rules and require:
Every rule must match at least one parse tree node to succeed.
Ok, good. But as I said previously, "the longest match" is a pain in
the butt with [opt] as the only tool. To get the above rule to work
right, I'd need (messy, inefficient) conditions like
@CHECK
if (!N(2) && (N(3) || N(4)) )   # "have" skipped, yet "be" or "verb" matched.
fail();
if (!N(3) && N(4))              # "be" skipped, yet "verb" matched.
fail();
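What the two conditions above test for is a "gap": an element that matched after an earlier one was skipped. A small Python sketch (has_gap and check_fails are illustrative names, with booleans standing in for N(2), N(3), N(4)) shows that a single general test covers the same cases as the hand-written conditions.

```python
from itertools import product

def has_gap(present):
    """True if some element matched after an earlier element did not."""
    seen_missing = False
    for p in present:
        if not p:
            seen_missing = True
        elif seen_missing:
            return True
    return False

def check_fails(n2, n3, n4):
    """The two hand-written @CHECK conditions, transcribed from the post."""
    if not n2 and (n3 or n4):
        return True
    if not n3 and n4:
        return True
    return False

# Compare the general test with the hand-written one on all 8 combinations.
AGREE = all(has_gap([n2, n3, n4]) == check_fails(n2, n3, n4)
            for n2, n3, n4 in product([False, True], repeat=3))
```

The hand-written version needs another condition for every element added to the rule, which is the messiness being complained about.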
So, looking back to TexUS rules, we could do something like
_vg <-
modal [end]
have [end]
be [end]
verb
@@
This could match
modal have be verb
modal have be
modal have
modal
We can implement [START] [END] and [NEXT=(1 2)] to do what we had in
TexUS with optional starts, optional ends, and optional skips.
Doesn't look hard to do at all.
Note that [NEXT=( 1 3 ... )] (or call it SKIP if you like) can be used
to go BACKWARD as well as forward in the pattern, allowing loops and
cycles, pretty much a full FSA. That's even more general than the TexUS
rule syntax, and it addresses people's complaint about the lack of
regular expression capability in NLP++ rules.
Would be nice to put <min,max> number of times to loop on those SKIP
arcs as well. Something like
_name <-
_firstname [next=(4)]
_letter
\. [next=( 2 <0,3> )]
_lastname
@@
could match (looping from 0 to 3 times)
John Smith
John A. Smith
John A. B. Smith
John A. B. C. Smith
but no more. An interesting question is how to structure the parse
tree after such a match! We can set up syntax for alternative ways to
group the matches:
John ((A.) (B.) (C.)) Smith
John (A. B. C.) Smith
John A. B. C. Smith
We are natural language engineers, after all. Eg
@POST
cycle(2,3,"_letterperiod"); # Group each cycle into a node.
group(2,3,"_middlenames"); # Standard NLP++.
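A rough Python sketch of the _name example, under stated assumptions: <0,3> is read as "at most three passes through elements 2-3", the name sets are toy lexicons, and match_name, group_each_cycle and group_all are hypothetical helpers that only imitate what cycle() and group() might produce.

```python
FIRST_NAMES = {"John"}     # toy lexicon standing in for _firstname
LAST_NAMES = {"Smith"}     # toy lexicon standing in for _lastname

def is_letter(tok):
    """Stand-in for _letter: a single uppercase letter."""
    return len(tok) == 1 and tok.isupper()

def match_name(tokens, max_cycles=3):
    """Return the list of (letter, ".") cycles matched, or None on failure."""
    if not tokens or tokens[0] not in FIRST_NAMES:
        return None
    pos, cycles = 1, []
    # Make at most max_cycles passes through elements 2-3.
    while (len(cycles) < max_cycles and pos + 1 < len(tokens)
           and is_letter(tokens[pos]) and tokens[pos + 1] == "."):
        cycles.append(tokens[pos:pos + 2])
        pos += 2
    # Exactly one node must remain, and it must be the last name.
    if pos == len(tokens) - 1 and tokens[pos] in LAST_NAMES:
        return cycles
    return None

def group_each_cycle(cycles, name="_letterperiod"):
    """Imitates the proposed cycle(2,3,...): one node per cycle."""
    return [(name, pair) for pair in cycles]

def group_all(cycles, name="_middlenames"):
    """Imitates group(2,3,...): one node over everything the cycles matched."""
    return (name, [tok for pair in cycles for tok in pair])
```

Applying group_each_cycle corresponds to the (A.) (B.) (C.) grouping, and group_all to the (A. B. C.) grouping; a name with four initials is rejected because the loop bound is exhausted before the last name is reached.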
Rule Editor: We could get a graphical rule editor to do the same types
of arcs and so on that we had in TexUS. That's kind of ambitious till
we can get you working at it full time! [Addressed to David]
Triggering: While triggering says "match me first" and that works ok
in the context of the linearized matching of current NLP++ rules, I
don't think we can jump into the middle of an FSA and "back up" to
make a pattern match work in the general case. So triggering may lose
some of its luster (hasn't had much to begin with).
_vg <-
modal [start next=(3 4)]
have [start next=(4)]
be [start]
verb [start end trigger]
@@
The above looks good: triggering can work in "well behaved" FSAs. The
above would match
modal have be verb
modal have verb
modal be verb
modal verb
have be verb
have verb
be verb
verb
as we might like.
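The list above can be checked mechanically. Here is a minimal Python sketch that enumerates every sequence the _vg rule accepts, assuming each element keeps a default arc to the following element and that next=(...) only adds arcs. The dictionary layout and the name accepted_sequences are illustrative, not NLP++.

```python
RULE = [
    {"name": "modal", "start": True, "end": False, "next": [3, 4]},
    {"name": "have",  "start": True, "end": False, "next": [4]},
    {"name": "be",    "start": True, "end": False, "next": []},
    {"name": "verb",  "start": True, "end": True,  "next": []},
]

def accepted_sequences(rule):
    """List every element sequence the rule accepts (forward arcs only)."""
    results = []

    def walk(i, path):
        elt = rule[i - 1]                      # elements are numbered from 1
        path = path + [elt["name"]]
        if elt["end"]:
            results.append(" ".join(path))     # an [end] element may finish the match
        arcs = set(elt["next"])
        if i < len(rule):
            arcs.add(i + 1)                    # assumed default arc to the next element
        for j in sorted(arcs):
            walk(j, path)

    for i, elt in enumerate(rule, start=1):
        if elt["start"]:                       # a [start] element may begin the match
            walk(i, [])
    return results
```

Calling accepted_sequences(RULE) yields the eight sequences listed above. The recursion terminates only because every arc here points forward; backward arcs would need a loop bound such as the <min,max> idea.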
Triggering has been a pain in NLP++, because if you have two rules:
_vg <- modal have be verb [trigger] @@
_vg <- modal @@
then the second rule matches and not the first. (When "modal" is
encountered, the first rule isn't triggered yet, but the second can
match.)
Triggering could be fixed by (an extra) pre-traversal of the current
phrase to see which triggered rules will fire. Since the idea of
triggering is efficiency, not sure that this will be a good thing.
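The two-rule problem and the pre-traversal idea can be illustrated with a toy Python model. It is not the actual NLP++ matcher: it assumes a rule is a (pattern, trigger index) pair, that a rule is only tried when the scan reaches its trigger element, and that a pre-traversal would prefer the leftmost start and then rule order. All names are hypothetical.

```python
RULES = [
    (["modal", "have", "be", "verb"], 3),   # triggered on "verb" (0-based index 3)
    (["modal"], 0),                          # default trigger: the first element
]

def candidates(tokens, rules):
    """All (start, rule_index) pairs where a rule matches, found via its trigger."""
    found = []
    for pos, tok in enumerate(tokens):
        for idx, (pattern, trig) in enumerate(rules):
            start = pos - trig
            if (pattern[trig] == tok and start >= 0
                    and tokens[start:start + len(pattern)] == pattern):
                found.append((start, idx))
    return found

def scan_without_pretraversal(tokens, rules):
    """Fire the first rule whose trigger is reached, scanning left to right."""
    hits = candidates(tokens, rules)
    return hits[0][1] if hits else None      # hits are already in scan order

def scan_with_pretraversal(tokens, rules):
    """Collect every triggered rule first, then prefer leftmost start, then rule order."""
    hits = candidates(tokens, rules)
    return min(hits)[1] if hits else None
```

On "modal have be verb" the plain scan fires the second rule (index 1) as soon as it reaches "modal", while the pre-traversal version picks the first rule (index 0), at the price of looking at the whole phrase before firing anything.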
====================================
Two more related issues.
One is that I hate the renumbering of elements that happens when one
zaps or reduces nodes. I'd like to keep the numbering as is, which
removes confusion. Eg,
L("node") = group(2,3,"_letterperiod");
Rather than renumbering the elements, just grab the new node you
created, and that's how you can address it. This way the element
numbers always address the stuff they matched, unless it's been
zapped.
For the FSA stuff, we can automatically create an array!
L("nodes") = cycle(2,3,"_letterperiod");
grabs the node built in each cycle into an array, and so it can be
addressed that way!
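A small Python sketch of the stable-numbering idea, assuming a match is recorded as a map from element number to the nodes that element matched on each pass, and that every element in a cycled range matched the same number of times. The Match class is a hypothetical stand-in, not an NLP++ structure.

```python
class Match:
    """Toy record of a rule match with stable element numbers."""

    def __init__(self, elements):
        # elements: {element number: [node matched on pass 1, pass 2, ...]}
        self.elements = elements

    def cycle(self, first, last, name):
        """Build one node per pass through elements first..last; return them as a list."""
        passes = len(self.elements[first])
        return [(name, [self.elements[i][p] for i in range(first, last + 1)])
                for p in range(passes)]

    def group(self, first, last, name):
        """Build a single node over everything elements first..last matched."""
        children = [child for _, kids in self.cycle(first, last, name)
                    for child in kids]
        return (name, children)

# "John A. B. Smith" matched by the _name rule: elements 2 and 3 matched twice.
m = Match({1: ["John"], 2: ["A", "B"], 3: [".", "."], 4: ["Smith"]})
letter_nodes = m.cycle(2, 3, "_letterperiod")
middle = m.group(2, 3, "_middlenames")
```

Neither call renumbers anything: element 4 still addresses "Smith" afterwards, and the new nodes are reached only through the returned values.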
====================================
Just a note that I'd like full power to manipulate the parse tree from
code regions. That's tough to do in the middle of a rule match, and
there may be limitations there. But certainly in the @CODE regions one
should be able to add, remove, and move around nodes. (It would be cool
to switch the ordering of a sentence that was split when OCRing a line
skewing upward, then merge the lines in proper order!)
====================================
[snip]
> ====================================
David's response via email was:
Would this cause the parser to start slowing down with combinatorics?
-David
And my response:
No, all these cases for mhbv (modal have be verb) have to be handled,
and right now that's done with too many rules (and rematching the same
stuff). So this should be more efficient. Remember also that our rules
are hashed and triggered off the first element, etc.
Eg, if the current node isn't _modal, _have, _be, or _verb, then the
mhbv rule isn't even tried, because nothing hashes to it.
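The hashing argument can be pictured with a short Python sketch. It assumes each rule is indexed under every element that can begin it; the rule names and the functions build_index and rules_to_try are made up for illustration and do not describe the actual NLP++ internals.

```python
from collections import defaultdict

def build_index(rules):
    """Map each element that can begin a rule to the names of those rules."""
    index = defaultdict(list)
    for name, start_elements in rules.items():
        for element in start_elements:
            index[element].append(name)
    return index

# Hypothetical rule table: the mhbv rule can begin at any of its four elements.
RULES = {
    "mhbv": ["_modal", "_have", "_be", "_verb"],
    "name": ["_firstname"],
}
INDEX = build_index(RULES)

def rules_to_try(node):
    """Only rules hashed under the current node are candidates at all."""
    return INDEX.get(node, [])
```

A node such as "_noun" hashes to nothing, so no rule is attempted there, whether mhbv is written as one FSA-style rule or as many separate ones.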
Amnon