FWIW, RelEx now has a unit test for the "Stanford compatibility mode",
covering about 50 sentences (and they all pass). Run it with "ant test"
or "ant check".
While I put this together, I discovered that the Stanford parser was far
from perfect; when the two parsers disagreed, it was often because
Stanford just did something crazy. My gut impression was that it made
errors about twice as often as RelEx, but I did not try to measure an
error rate.
One big problem is that different parsers seem to generate different
kinds of output, making direct comparison difficult or impossible :-(
This is why I created the "compatibility mode".
--linas
So, my suggestion for the student interested in an NLP project
was to use some scripts to transform a large corpus into dependency
form (tweaked to agree with the Stanford output format, say), and then
test RelEx vs. Stanford on this large dependency corpus...
If RelEx really does work better, that's a nice result...
If not, it tells us some places we need to improve RelEx...
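A comparison like this could be scored roughly as follows. This is only a
sketch: it assumes both the reference corpus and the parser output have
already been converted to Stanford-style lines such as nsubj(ate-2, John-1),
and the names parse_deps and score are made up for illustration; they are
not part of RelEx or the Stanford parser.

```python
import re

# Matches one Stanford-style typed dependency, e.g. "nsubj(ate-2, John-1)".
# Trailing primes after an index (copy markers) are tolerated and ignored.
DEP_RE = re.compile(r"^\s*(\S+?)\((.+?)-(\d+)'*,\s*(.+?)-(\d+)'*\)\s*$")


def parse_deps(text):
    """Return the set of (relation, head_index, dependent_index) triples in text.

    Lines that do not look like a dependency are silently skipped.
    """
    deps = set()
    for line in text.splitlines():
        m = DEP_RE.match(line)
        if m:
            rel, _head, head_idx, _dep, dep_idx = m.groups()
            deps.add((rel, int(head_idx), int(dep_idx)))
    return deps


def score(gold, predicted):
    """Return (precision, recall, f1) of predicted triples against gold triples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Running score() once per parser over the same reference set, sentence by
sentence, would give a measured error rate in place of a gut impression.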
- Ben
--
Ben Goertzel, PhD
I'm not sure, but this "tweaking" may be non-trivial. For example,
I found that sometimes Stanford made distinctions where RelEx didn't,
and vice versa, and so much of my work was reproducing these kinds of quirks.
I assume that other corpora will have similar characteristics.
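One way to see how much of a disagreement comes from such label
distinctions, rather than from genuinely different parses, is to compare
twice: once strictly, and once after collapsing fine-grained labels. The
sketch below assumes relations are held as (relation, head_index,
dependent_index) triples; the COARSEN table is a made-up illustration, not
the actual set of quirks handled by the compatibility mode.

```python
# Hypothetical mapping from a fine-grained label to a coarser one that both
# parsers could be expected to agree on. Illustrative only.
COARSEN = {
    "nsubjpass": "nsubj",
    "dobj": "obj",
    "iobj": "obj",
}


def coarsen(deps, mapping=COARSEN):
    """Rewrite relation labels through mapping; unknown labels pass through."""
    return {(mapping.get(rel, rel), head, dep) for rel, head, dep in deps}


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def agreement(a, b, mapping=COARSEN):
    """Return (strict, relaxed) agreement between two sets of triples."""
    strict = jaccard(a, b)
    relaxed = jaccard(coarsen(a, mapping), coarsen(b, mapping))
    return strict, relaxed
```

A large gap between the strict and relaxed numbers suggests the two outputs
differ mostly in labelling conventions; a small gap suggests the parses
themselves differ.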
> If RelEx really does work better, that's a nice result...
>
> If not, it tells us some places we need to improve RelEx...
Well, I have a huge number of test sentences that are parsed
incorrectly; there's no shortage of things to fix :-)
The most urgent work that remains undone is the RelEx
side of the new/improved conjunction handling. When adding
conjunction support, I added six or so new link types to link-grammar.
RelEx doesn't know about any of them, and so it will now mangle
parses with conjunctions.
One of the places where RelEx and Stanford were "fundamentally
incompatible" was exactly in the handling of conjunctions. The
new link-grammar conjunction links fix this, but the rules needed
to generate the correct output are still missing.
I think that fixing/finishing this might be the most important thing to
hack on in relex these days.
--linas
So for G Meera: perhaps Linas's suggestion would be a good place to
start. The task would be to modify RelEx to make use of the link
parser's new conjunction handling... Linas could provide specific
guidance on this as needed...
Following this, testing against a larger corpus as I suggested would
be valuable.
And all this will help NLGen as well as RelEx...
-- Ben G
Simply finding one that is compatible or almost-compatible, and determining
just how different it is, would be useful!
But yes, I think that fixing/finishing conjunction handling would be best.
--linas
The task then becomes to try to tweak this script to give
Stanford-parser-style output, I guess...
ben