I'm new to LinkGrammar, but over the past few days, I've read a number of the papers about it, including the originals, and have been playing with the API using Ruby.
While I understand it only supports parsing complete sentences, I'd like to know whether there is a way to hook into the parsing engine as it evaluates the text, so you can get information about what's there.
Ideally, I'd like to be able to register a callback so that when given something like:
"bears in the woods"
I'd at least like to be able to determine that "in the woods" was a prepositional phrase and "woods" was the object. Having the POS tagging would be nice, but isn't crucial. Getting access to "bears" is also on the nice-to-have list.
Like I said, I know it doesn't do this now, but how hard would it be to implement, or is there something out there that's better suited to this sort of problem? Ideally, I need the solution in C or C++, but I could cobble something together otherwise. I looked at the RelEx library, but I'm not sure that it fits the bill either (and it's in Java).
Any pointers or info would be greatly appreciated.
Cheers,
ast
--
Andrew S. Townley <a...@atownley.org>
http://atownley.org
On 20 February 2011 08:47, Andrew S. Townley <a...@atownley.org> wrote:
> Hi Everyone,
>
> I'm new to LinkGrammar, but over the past few days, I've read a number of the papers about it, including the originals, and have been playing with the API using Ruby.
>
> While I understand it only supports parsing complete sentences, I'd like to know whether there is a way to hook into the parsing engine as it evaluates the text, so you can get information about what's there.
>
> Ideally, I'd like to be able to register a callback so that when given something like:
>
> "bears in the woods"
>
> I'd at least like to be able to determine that "in the woods" was a prepositional phrase and "woods" was the object.
I think you want to enable "constituent printing". There are three
styles of constituent printing; the third one uses parens:
linkparser> !con=3
constituents set to 3
linkparser> there are bears in the woods
Found 2 linkages (2 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=0 FAT=0 AND=0 LEN=8)
+-----MVp----+---Jp---+
+--SFp-+--Opt-+ | +-Dmc-+
| | | | | |
there.r are.v bears.n in the woods.n
(S (NP there) (VP are (NP bears) (PP in (NP the woods))))
In the constituent tree, PP identifies the prep phrase, but I think
that, overall, the Jp link is "more accurate": J really does say that
the thing on the right is the prep object.
Besides the J link, there are several other link types that
indicate prep constructions. The code in link-grammar that
generates the (PP in (NP the woods)) style markup takes the
output of the core link-parse and applies a set of rules to
generate it. Relex does something similar: it uses a different
set of rules, and transforms the core link-parse into
dependency-grammar style output.
(FWIW, relex rules are in a set of files that are parsed & applied
by java code; the rules themselves have nothing to do with java.
You could write ruby code that reads these same files and applies
these rules. More generally, you could steal the entire concept:
a set of rules is applied to transform a graph of one shape into
a graph of a different shape.)
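(For example, if you want the same information programmatically
rather than at the linkparser prompt, the sketch below scans the
links of the first linkage for the J family. This is a minimal,
untested sketch against the public C API in
<link-grammar/link-includes.h>; error handling is omitted.)

    #include <stdio.h>
    #include <link-grammar/link-includes.h>

    int main(void)
    {
        Dictionary    dict = dictionary_create_lang("en");
        Parse_Options opts = parse_options_create();
        Sentence      sent = sentence_create("there are bears in the woods", dict);

        if (sentence_parse(sent, opts) > 0)
        {
            Linkage lkg = linkage_create(0, sent, opts);
            int i, n = linkage_get_num_links(lkg);
            for (i = 0; i < n; i++)
            {
                const char *label = linkage_get_link_label(lkg, i);
                /* Jp, Js, etc. connect a preposition to its object. */
                if (label[0] == 'J')
                {
                    printf("prep: %s  object: %s\n",
                           linkage_get_word(lkg, linkage_get_link_lword(lkg, i)),
                           linkage_get_word(lkg, linkage_get_link_rword(lkg, i)));
                }
            }
            linkage_delete(lkg);
        }
        sentence_delete(sent);
        parse_options_delete(opts);
        dictionary_delete(dict);
        return 0;
    }

For the example sentence this should print the preposition "in" and
its object "woods.n" (the words come back with their dictionary
subscripts attached).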
> Having the POS tagging would be nice, but isn't crucial.
Again, relex has a set of rules to determine POS tags. So,
for starters, the suffix ".n" on "bears.n" above says, more or
less, that "bears" is a noun. These tags are very rough (as POS
tags usually are).
In a deeper sense, the linkage disjuncts themselves can
be thought of (indeed, should be thought of) as "very fine-grained
POS tags". So, for example:
linkparser> !disj
Display of disjunct used turned on.
linkparser> there are bears in the woods
Found 2 linkages (2 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=0 FAT=0 AND=0 LEN=8)
+-----MVp----+---Jp---+
+--SFp-+--Opt-+ | +-Dmc-+
| | | | | |
there.r are.v bears.n in the woods.n
there.r 0.0 Wd- SFp+
are.v 0.0 SFp- Opt+ MVp+
bears.n 0.0 Opt-
in 0.0 MVp- Jp+
the 0.0 Dmc+
woods.n 0.0 Jp- Dmc-
(S (NP there) (VP are (NP bears) (PP in (NP the woods))))
Here, "bears Opt-" says that bears is that part of speech which
can be an object. Similarly, "are SFp- Opt+ MVp+" should be
interpreted as "'are' is that part of speech which takes a subject,
an object, and has a modifier". So, for example, these
"fine-grained" parts of speech allow you to distinguish
between transitive and intransitive verbs, and to find
di-transitive verbs, etc. which most POS taggers will not
tell you about. It is not hard to write a filter (in ruby, or
in c/c++) that says "Oh, if the disjunct contains S- O+ then
the verb is transitive", (or more, generally, if there's an S- in
the disjunct, then its a verb.) (and, in fact, this is typical of
the rules that relex applies to find POS tags. Although, for
starters, it first notes that the suffix ".v" in "are.v" already
strongly hints that its probably a verb; other rules, based on
actual disjunct usage, correct this if some subtlety requires it).
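Here's a rough, untested sketch of such a filter in C. It treats the
disjunct as a plain string like "SFp- Opt+ MVp+"; how you obtain that
string depends on the version (newer releases expose per-word
disjuncts through the API, otherwise you can capture the !disj
output). The matching itself is just string handling:

    #include <stdbool.h>
    #include <string.h>

    /* Does the disjunct contain a connector whose name starts with
       the given prefix and points in the given direction ('+'/'-')? */
    static bool has_connector(const char *disjunct, const char *prefix, char dir)
    {
        const char *p = disjunct;
        size_t plen = strlen(prefix);
        while ((p = strstr(p, prefix)) != NULL)
        {
            /* Only accept matches at the start of a connector. */
            if (p == disjunct || p[-1] == ' ')
            {
                const char *end = p + plen;
                /* Skip subscripts, e.g. the "s" and "*b" in "Ss*b-". */
                while ((*end >= 'a' && *end <= 'z') || *end == '*') end++;
                if (*end == dir) return true;
            }
            p += plen;
        }
        return false;
    }

    /* "If there's an S- in the disjunct, it's (probably) a verb."
       NB: a real filter would also check related subject families
       such as SF. */
    static bool looks_like_verb(const char *disjunct)
    {
        return has_connector(disjunct, "S", '-');
    }

    /* "If the disjunct contains S- and O+, the verb is transitive." */
    static bool looks_transitive(const char *disjunct)
    {
        return looks_like_verb(disjunct) && has_connector(disjunct, "O", '+');
    }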
> Like I said, I know it doesn't do this now, but how hard would it be to implement, or is there something out there that's better suited to this sort of problem?
I'm assuming that the above answered your question. If it didn't,
then I don't understand your question. :-)
--linas
Thanks for your replies. Josh is correct: parsing incomplete, perhaps not even sentences, is actually what I need to do. The most important things for me to know are any valid prepositional phrases, so that's why I'm keen on having a callback option for when it recognizes those linked words. I'm not exactly super excited about trying to write a prepositional-phrase parser on my own when something like LG already exists and obviously already does part of what I want.
On Josh's point about trying to make complete sentences: I did try that, but a few times my added verb conflicted with a verb in the input, making nonsensical sentences that wouldn't parse anyway. For the case of the first four words, that's actually OK, as I have a fallback option. I'm just trying to extract as much context and meaning as possible from arbitrary strings of words. I can't require that people use complete sentences, or even things that can be cajoled into complete sentences.
After playing with your suggestion of wrapping the original in John said "...", it just might be enough of what I need to start with. Thanks! :)
Still, in reading the original papers, it seems like it should theoretically be possible to supply some kind of callback on partial or matched linkages. I can see obvious uses here for my application, but this could also help disambiguate usage or help parse otherwise "confusing" texts as mentioned in some of the papers on biomedical research and software documentation. Has anyone ever looked into this before?
Josh's quoted-phrase idea seems much more robust than my original approach of trying to guess what might turn essentially opaque input into a complete sentence just so I could extract prepositional phrases. The quoted-phrase trick even seems to work with the following input, which is an unexpected bonus:
>> links = parser.parse 'he said "brown bears in the woods"'
=> #<LinkParser::Sentence:0x404cc5c6 "LEFT-WALL he said brown bears in the woods RIGHT-WALL"/2 linkages/0 nulls>
>> puts links.diagram
+---------MVp---------+
+-------Op------+ +---Jp---+
+--Ss-+ +---A---+ | +-Dmc-+
| | | | | | |
he said.v-d brown.a bears.n in the woods.n
>> puts links.constituent_tree_string
(S (NP he)
   (VP said
       (NP brown bears)
       (PP in
           (NP the woods))))
=> nil
You guys may have just breathed some life into the glowing embers of what I was trying to do. Thanks for your help!
Any thoughts/recommendations on how to have access to the constituent tree AND the parts-of-speech information in the same pass? Seems like the constituent tree is the easiest way to identify the sentence parts, but you lose the POS unless you use the links. Maybe this is a stupid question, as I'm pretty new to the whole NLP thing, but why isn't there a grand, unified data structure that has all of the information at once?
Cheers,
ast
> The quote method is a trick that utilizes what's already in the system. As such, while it might handle a large set of incomplete phrases, there is no guarantee that the parses will be correct, or even partially correct. I haven't tested it thoroughly - caveat emptor.
Absolutely understood! :)
> If it does work out well, though, then that should be good news, since it doesn't require any changes to link-grammar.
That's certainly a bonus given my current schedule (and the rustiness of my C programming chops).
Thanks again!
On 20 February 2011 12:07, Josh Rowe <jro...@gmail.com> wrote:
> He's asking if you can parse partial, incomplete sentences - literally,
> "bears in the woods", not "there are bears in the woods."
No, currently, link-grammar is not designed to parse incomplete
or ungrammatical sentences.
> One strategy might be to encapsulate the phrase in a statement:
> John said "bears in the woods."
> This parses to: [NP bears NP] [PP in [NP the woods NP] PP]
No, because link-grammar currently ignores/mishandles quotations.
Your example is actually a good illustration of why quoted
phrases are hard to handle correctly.
> Another method might be to prepend text that forms a complete sentence with
> the phrase, like Linas did with his example.
As you point out, this is fraught with potential difficulties.
This is a good time to step back, and discuss "what are you trying
to do?"
--linas
On 20 Feb 2011, at 8:54 PM, Linas Vepstas wrote:
>
> As you point out, this is fraught with potential difficulties.
>
> This is a good time to step back, and discuss "what are you trying
> to do?"
>
As I said in the original message, I need to be able to identify *potential* prepositional phrases in free-form, unstructured, ugly, fragmented (and sometimes demented) user input. Getting modifiers (adjectives) and noun phrases would be a bonus, but the prepositional phrases would go a long way on their own.
Currently, I'm doing a very crude, English-only subset, but I would like (without having to craft such a beast myself) to leverage the link-grammar rules that already work very well for complete sentences to *attempt* to extract any meaningful phrases from this ugly input.
To be clear, it doesn't have to parse, and, actually, it doesn't have to be semantically sensical. I just need to be able to do a "best guess" pass of NLP on arbitrary input. That's really why I thought adding some kind of eventing/callback mechanism during the link checking phase might work.
I don't have it to hand at the moment, but I read a paper sometime last week which discussed attempting to use conversational context to complete meaningful statements in free-form text where there are no grammatical constraints defined but where people often use meaningful sentence fragments or phrases (did I just write that?) ;)
My issue with the current approach of whole sentence parsing is that it's effectively binary. That's not good enough for what I want to do, and a number of the papers I've seen where LG was applied to a specific domain had to go through sometimes complex pre-processing steps to ensure that the LG processing wasn't confused (the examples I saw were biotech/medical and software documentation processing). I was trying to avoid this if at all possible.
I think the LG approach is incredibly powerful and it performs well enough for interactive use, but it seems like there's a reasonable case for supporting incomplete processing of natural language text. Especially with the eventing/callback model, the application doing the processing can potentially provide the additional context required based on what's missing. Unfortunately, there don't seem to be any ways to do this, so the processing just fails and you (as a client of the parser) don't have this opportunity.
As I said earlier, maybe I'm barking up the wrong tree here as I'm not terribly familiar with NLP yet, but, given the google results I saw over the last few days, being able to (partially?) parse incomplete sentences is something people would like to do.
I have more testing to do, but Josh's suggestion (while ugly, I admit) might just get me enough functionality to do what I need (extract the prepositional phrases). Whether the input is a grammatically correct sentence is not only irrelevant, but also pretty unlikely.
Very open to other alternatives and approaches.
Thanks again for your reply.
> Thanks for your replies. Josh is correct: parsing incomplete, perhaps not even sentences, is actually what I need to do. The most important things for me to know are any valid prepositional phrases, so that's why I'm keen on having a callback option for when it recognizes those linked words. I'm not exactly super excited about trying to write a prepositional-phrase parser on my own when something like LG already exists and obviously already does part of what I want.
It would be useful if you described what you are really trying to do,
instead of proposing a design and asking how to implement it.
(I don't see what callbacks have to do with anything)
Anyway, if you want to parse prep phrases, and know that your
input will always be just prep phrases, then you can always
add a special "word" to the dictionary, for example:
myspecialword: J+;
Then, whenever the parser got the following as input:
"myspecialword bears in the woods"
it would parse and give you what you want.
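To make that concrete, here is a short, untested C sketch of the
trick; it assumes "myspecialword: J+;" has been added to the
dictionary file, and that dict/opts were created as usual (error
handling omitted):

    #include <stdio.h>
    #include <link-grammar/link-includes.h>

    static void parse_fragment(const char *fragment,
                               Dictionary dict, Parse_Options opts)
    {
        char input[512];
        snprintf(input, sizeof input, "myspecialword %s", fragment);

        Sentence sent = sentence_create(input, dict);
        if (sentence_parse(sent, opts) > 0)
        {
            Linkage lkg = linkage_create(0, sent, opts);
            int i, n = linkage_get_num_links(lkg);
            for (i = 0; i < n; i++)
            {
                /* A real filter would skip links that involve the
                   stub word and the walls. */
                printf("%s: %s <--> %s\n",
                       linkage_get_link_label(lkg, i),
                       linkage_get_word(lkg, linkage_get_link_lword(lkg, i)),
                       linkage_get_word(lkg, linkage_get_link_rword(lkg, i)));
            }
            linkage_delete(lkg);
        }
        sentence_delete(sent);
    }

Calling parse_fragment("bears in the woods", dict, opts) would then
print whatever links the parser found for the phrase.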
> I'm just trying to extract as much context and meaning as possible from arbitrary strings of words. I can't require that people use complete sentences, or even things that can be cajoled into complete sentences.
Is there any structure? Are these IRC chat-room dialogs?
You can always add new rules to the dictionary that loosen
parsing significantly. The biggest problem is that this tends to
lead to a combinatorial explosion of alternatives. It's a problem
I idly think about but haven't tried to tackle.
> help parse otherwise "confusing" texts as mentioned in some of the papers on biomedical research and software documentation. Has anyone ever looked into this before?
There's been a fair amount of attention to biomed parsing
w/ link-grammar, where "biomed" means "technical & academic
journal articles". Current versions long ago folded in
Peter Szolovits' biomed dictionary from MIT, as well as all of
BioLG; the current version does much better than the original
BioLG on its own test corpus.
But most academic papers are written in more-or-less
grammatical, proper English using full sentences, and not
phrase fragments, so I don't see where this connects up .. ?
> Josh's quoted-phrase idea seems much more robust than my original approach of trying to guess what might turn essentially opaque input into a complete sentence just so I could extract prepositional phrases. The quoted-phrase trick even seems to work with the following input, which is an unexpected bonus:
100% illusion. LG discards quotes, and does no special
processing of quoted phrases whatsoever.
> Any thoughts/recommendations on how to have access to the constituent tree AND the parts-of-speech information in the same pass? Seems like the constituent tree is the easiest way to identify the sentence parts, but you lose the POS unless you use the links.
? Huh?
Maybe the ruby interfaces are doing something funny, dunno; the info
is available in the C language interfaces.
--linas
It's purely an illusion. LG does nothing with quotes; it discards them.
--linas
> My issue with the current approach of whole sentence parsing is that it's effectively binary. That's not good enough for what I want to do, and a number of the papers I've seen where LG was applied to a specific domain had to go through sometimes complex pre-processing steps to ensure that the LG processing wasn't confused (the examples I saw were biotech/medical and software documentation processing).
Can you provide references to these papers? I haven't seen any
papers that described this. Now, I think the guys from BioLG
attempted to do this, but I think that approach was fundamentally
flawed, and just didn't work. Anyway, the current parser is
considerably more accurate than BioLG ever was.
> I think the LG approach is incredibly powerful and it performs well enough for interactive use, but it seems like there's a reasonable case for supporting incomplete processing of natural language text.
Thanks. A lot of people would like to be able to parse & understand
fragmentary text, but doing so is not an easy task.
> Especially with the eventing/callback model, the application doing the processing can potentially provide the additional context required based on what's missing.
Can you elaborate? A "callback" has nothing to do with parsing,
I have no clue what you are trying to say here.
--linas
If you just want to recognize and parse prepositional phrases, I
believe there are some chunk parsers out there that will do what you
want better than the link parser *currently* does.
If you're looking for a tool that will parse incomplete sentences, in
its current form, the link parser ain't it...
The link parser could be modified to parse incomplete sentences, but
this would require some original computational linguistics
development. For instance, the SAT link parsing algorithm (which is
part of the link parser release and can be enabled via the
configuration options) could "straightforwardly" be modified to
handle incomplete sentences. But the SAT link parsing algorithm itself
is still experimental (though functional), so this becomes a
fascinating, nontrivial research project...
-- Ben Goertzel
--
Ben Goertzel, PhD
CEO, Novamente LLC and Biomind LLC
CTO, Genescient Corp
Chairman, Humanity+
Adjunct Professor of Cognitive Science, Xiamen University, China
Advisor, Singularity University and Singularity Institute
b...@goertzel.org
"My humanity is a constant self-overcoming" -- Friedrich Nietzsche
On 20 Feb 2011, at 10:41 PM, Linas Vepstas wrote:
> On 20 February 2011 16:25, Andrew S. Townley <a...@atownley.org> wrote:
>
>
>> My issue with the current approach of whole sentence parsing is that it's effectively binary. That's not good enough for what I want to do, and a number of the papers I've seen where LG was applied to a specific domain had to go through sometimes complex pre-processing steps to ensure that the LG processing wasn't confused (the examples I saw were biotech/medical and software documentation processing).
>
> Can you provide references to these papers? I haven't seen any
> papers that described this. Now, I think the guys from BioLG
> attempted to do this, but I think that approach was fundamentally
> flawed, and just didn't work. Anyway, the current parser is
> considerably more accurate than BioLG ever was.
Admittedly, all I could find easily through Google/CiteSeer/etc. was kinda old. I let my ACM subscription lapse, so I couldn't do a more detailed search there.
This was one: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.4635
This was another one: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.94.1506
Some discussion of the issues in here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.2626
Those were the ones I kept. I sifted through a bunch of papers last week, so I don't remember if there were others or not.
I wasn't really trying to solve the same problem, but these were the only ones google led me to that dealt with sentence fragment parsing with LG.
>> I think the LG approach is incredibly powerful and it performs well enough for interactive use, but it seems like there's a reasonable case for supporting incomplete processing of natural language text.
>
> Thanks. A lot of people would like to be able to parse & understand
> fragmentary text, but doing so is not an easy task.
I can imagine. That's why I'm trying to see how close I can get "faking it" with what's already there. :)
>
>> Especially with the eventing/callback model, the application doing the processing can potentially provide the additional context required based on what's missing.
>
> Can you elaborate? A "callback" has nothing to do with parsing,
> I have no clue what you are trying to say here.
You're right, generally it doesn't. I've written parsers for structured languages using handwritten top-down, recursive-descent parsers as well as using parser generator tools such as lex/yacc and derivatives several times. It used to be a particular area of interest.
However, one of the things which occurred to me was that each parser was generally tightly coupled to the particular application, and everyone was writing their own parsers because they had to. In today's OSS world, this is kinda silly.
Since I had this particular epiphany, I've tried to write reusable, modular parser frameworks that provide clear abstractions between the grammar processing (lexing and parsing) and what the application actually needs to do to meet its goals. Taking some cues from the SAX XML parser, I've implemented every parser I've written since then (where I could) so that you provide a set of event-driven callbacks for the production-matching "business logic", so you don't need to wait for the generation of the complete parse tree. It also allows you to provide some in-context "help" to the parser when resolving conflicts and ambiguities.
Want to write a totally different application that processes the same grammar (documentation generation vs. compilation is one example), then fine, you don't need to write a separate parser. You just plug in to the productions you care about and job done.
Given the way I understand the LG parser works, once it has created links between certain words, there's a reasonable chance that these links are correct subtrees of the overall parse tree. If you only care about these correct subtrees, it would be nice to have access to them during the parse phase (via a callback or event-driven architecture) rather than having them be discarded when the complete parsing phase determines that not enough information is present to recognize a grammatically correct complete sentence.
I wasn't trying to ruffle any feathers with my question, and, based on your background, you've much more parser/compiler generation experience than me. I have found the above approach to be extremely useful in practice though.
My goal is to extract as many "best guess" collections of grammatically correct (as far as LG is concerned) phrases from an unstructured, bag-of-words type of input, independent of whether they form complete sentences. Any phrases that are grammatically correct, e.g. "in the woods", can be used by the application to give additional semantics and context to the user input request, and potentially even to other words within the input text itself. Within the application, I have a good bit of context in which the text is being processed, so not everything needs to come from the input string itself. If no additional context is provided, e.g. they really have just typed a bag of words, then that triggers another type of text analysis to try and figure out what to do with the input.
Does this make sense?
To your other point about the Ruby library, I'd suspected this, but I hadn't gone through the source yet. The original API docs did seem to indicate that it should be possible to go from one to the other. I'll investigate a bit further and contact the maintainer separately.
I also take on board that the quotes issue is coincidental. Even without the quotes, the approach Josh suggested probably has a 60-70% chance of giving me enough of a complete sentence to extract some additional information that's useful.
Thanks for your comments.
Cheers,
On 20 Feb 2011, at 11:09 PM, Ben Goertzel wrote:
> Hi,
>
> If you just want to recognize and parse prepositional phrases, I
> believe there are some chunk parsers out there that will do what you
> want better than the link parser *currently* does.
Do you have any examples/links? Worth checking out.
> If you're looking for a tool that will parse incomplete sentences, in
> its current form, the link parser ain't it...
I'd pretty much determined that from my own experimentation, but before I set it aside completely, I thought I'd try and ask the experts! :)
> The link parser could be modified to parse incomplete sentences, but
> this would require some original computational linguistics
> development. For instance, the SAT link parsing algorithm (which is
> part of the link parser release and can be enabled via the
> configuration options) could "straightforwardly" be modified to
> handle incomplete sentences. But the SAT link parsing algorithm itself
> is still experimental (though functional), so this becomes a
> fascinating, nontrivial research project...
Fair enough. Somehow the words "nontrivial" and "research project" are the antithesis of my deadline of less than two weeks (and this isn't all I'm trying to do). ;)
At this stage, I think this is something for a rev 2 of what I'm doing. Even with my hard-coded, minimal matching regex-based approach, what I could find made a big difference, so I had to see if it was something obvious that I was missing.
Cheers,
ast
There are a variety of generic parsers; as you point out, besides yacc,
there are all sorts of others, e.g. LALR parsers seem to have been
created in most programming languages. As to applicability to
LG, realize that LG's dictionary is not written in BNF form, i.e. it does
not consist of "production rules"; it's not a Chomsky-style
"generative grammar" of the sort that is quite popular. In principle,
the LG dictionary can be converted to BNF; in practice, I suspect this
would lead to some intractable combinatoric explosion (but
I'm not really sure, that's only my gut intuition). As Ben points out,
it seems more amenable to Boolean SAT. I also think that
backward-forward-style algorithms, e.g. Viterbi decoding, or other
Markov-chain-inspired methods, are closer to what link-grammar
is than the usual top-down BNF-style production-rule grammars
of classic comp sci.
> Given the way I understand the LG parser works, once it has created links between certain words, there's a reasonable chance that these links are correct subtrees of the overall parse tree.
Yes. For more complex sentences, there can be a combinatoric
explosion of these, so a callback might get called thousands if
not millions of times.
The algorithm currently implemented in the code is correct, and
fast, but strikes me as kind of goofy and not "natural"; I've long
wanted to re-do it, but it's not something that can be whipped out
in a few days.
Hmm. Now that you've made me think about this, yes, this is yet
another reason to explore a Viterbi-style algo (and/or Boolean SAT).
> I wasn't trying to ruffle any feathers
Sorry, sometimes I get into cranky moods.
--linas
> You may want to look at the chunk parsers integrated into OpenNLP...
Know of any in C or C++? I've seen people do crazy things with gcj to leverage Java libraries, but the last time I touched Java was pre-Java SE 5. I seem to be getting back to my C/C++ roots lately (but with bindings to Ruby).
I did some searching and found the University of Illinois one and one from Madrid, but they were either in Java or unavailable to the general public. Another potential issue is licensing entanglements, but OpenNLP doesn't have that issue for sure.
Thanks for the pointer.
Cheers,
ast
On 21 Feb 2011, at 4:48 AM, Linas Vepstas wrote:
> On 20 February 2011 17:24, Andrew S. Townley <a...@atownley.org> wrote:
>> Hi Linas,
>>
>> However, one of the things which occurred to me was that each parser was generally tightly coupled to the particular application, and everyone was writing their own parsers because they had to. In today's OSS world, this is kinda silly.
>
> There are a variety of generic parsers; as you point out, besides yacc,
> there are all sorts of others, e.g. LALR parsers seem to have been
> created in most programming languages. As to applicability to
> LG, realize that LG's dictionary is not written in BNF form, i.e. it does
> not consist of "production rules"; it's not a Chomsky-style
> "generative grammar" of the sort that is quite popular. In principle,
> the LG dictionary can be converted to BNF; in practice, I suspect this
> would lead to some intractable combinatoric explosion (but
> I'm not really sure, that's only my gut intuition). As Ben points out,
> it seems more amenable to Boolean SAT. I also think that
> backward-forward-style algorithms, e.g. Viterbi decoding, or other
> Markov-chain-inspired methods, are closer to what link-grammar
> is than the usual top-down BNF-style production-rule grammars
> of classic comp sci.
I guess I'm not making myself very clear here. I'm not talking about compiler construction kits or generic parsers, I'm talking about off-the-shelf parsers with plug-in APIs for particular languages (and we're digressing, but I want you to understand where I'm coming from. If this still doesn't do it, I give up, I promise!).
For example, if I wanted to parse Java, SQL, C or C++, what I'm talking about would let you take a functional, tested parser for those languages (expressed in whatever tools and notation made the most sense) and simply implement callbacks for things like:
void method_definition_start(...)
void method_definition_end(...)
void method_call(...)
void function_call(...)
void variable_definition(...)
void variable_reference(...)
... etc
That's why I mentioned the SAX API for XML parsing in an earlier email. Working at the ANTLR/BNF/LEX/YACC level is lower than I meant. I'm talking about specific language applications here based on pre-built parsers for particular languages/domains.
I understand that LG isn't expressed in terms of BNF or the like (I got that from the original papers). However, I can easily see the potential for something like this in NLP processing, where you could provide an event-driven API something like:
void noun_phrase_matched(...)
void prepositional_phrase_matched(...)
...
This would allow you to hook into the semantics of the parsing rather than the mechanics of particular parsing algorithms, recognition and the like. Whether they're traditional "productions" or something different, they are semantically relevant events in terms of processing the input into meaningful structures. From my perspective, they're still interesting parsing events within the scope of the NLP domain, and, as such, I thought it'd be handy to have a domain-specific event API.
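For concreteness, here is a purely hypothetical sketch of such an API
in C; nothing like this exists in link-grammar today, and all of the
names below are made up:

    /* HYPOTHETICAL -- no such interface exists in link-grammar today.
       A SAX-style event API: the client registers callbacks, and the
       parser fires them as soon as a constituent is recognized, even
       if no complete-sentence linkage is ever found. */
    typedef struct
    {
        /* A span of words has been linked as a noun phrase. */
        void (*noun_phrase_matched)(int start_word, int end_word, void *user_data);
        /* A preposition has taken its object (a J-family link). */
        void (*prep_phrase_matched)(int start_word, int end_word, void *user_data);
        void *user_data;
    } LGParseEvents;

    /* Hypothetical entry point: parse as far as possible, firing
       events for each matched constituent along the way. */
    int sentence_parse_with_events(Sentence sent, Parse_Options opts,
                                   const LGParseEvents *events);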
Hopefully this helps at least understand the kind of thing I was suggesting and why it might be useful.
>
>> Given the way I understand the LG parser works, once it has created links between certain words, there's a reasonable chance that these links are correct subtrees of the overall parse tree.
>
> Yes. For more complex sentences, there can be a combinatoric
> explosion of these, so a callback might get called thousands if
> not millions of times.
Understood. I wasn't suggesting that it be the only parsing interface provided, but that when the default all-or-nothing parse API wasn't good enough, you at least had hooks (a la the Interceptor pattern) where you could respond to certain events that might be relevant to your application.
From an implementation perspective, I'm not sure providing such a mechanism would add much overhead since you could just check a structure for a non-NULL function pointer. Of course, I've never peeked under the hood of LG, so I'm just guessing.
The real advantage I see is being able to allow better support for the edge/boundary cases without having to implement your own LG-ish parser algorithms from scratch just to tweak the way certain instances were handled.
Anyway, I guess it's a pretty moot point. :)
> The algorithm currently implemented in the code is correct, and
> fast, but strikes me as kind of goofy and not "natural"; I've long
> wanted to re-do it, but it's not something that can be whipped out
> in a few days.
>
> Hmm. Now that you've made me think about this, yes, this is yet
> another reason to explore a Viterbi-style algo (and/or Boolean SAT).
Well, at least my question wasn't totally wasted. :)
>> I wasn't trying to ruffle any feathers
>
> Sorry, sometimes I get into cranky moods.
No worries! :)
Take care,
The Ruby LinkParser::Linkage#constituent_tree method returns an Array of
Structs that are based on link-grammar's 'struct CNode_s' (with a few
concessions to Ruby idiom):
From the example in the API documentation:
sent = dict.parse( "He is a big dog." )
link = sent.linkages.first
ctree = link.constituent_tree
# => [#<struct Struct::LinkParserLinkageCTree label="S",
#        children=[#<struct Struct::LinkParserLinkageCTree label="NP">, ...],
#        start=0, end=5>]
I assume by 'part of speech' you mean 'label', and I don't see anything
else it could be in the constituent tree C API
(http://www.abiword.org/projects/link-grammar/api/index.html#cons), but
if I'm missing something please do let me know and I'll fix it.
--
Michael Granger <ruby...@gmail.com>
http://deveiate.org/
Several reasons:
1) Linguists are not programmers.
2) Academics seem to think that Java is a great language.
It's certainly taught a lot in school.
3) If they are good programmers, then they are probably
coding in Python.
So 1+2+3 == no C/C++
--linas
Err, what I said above is wrong. The constituent tree interface
in link-grammar stinks. It should provide at least the word index, etc.
I'll try to fix this someday ...
Other parts of the API could also stand some more cleanup & uniformity.
> The Ruby LinkParser::Linkage#constituent_tree method returns an Array of
> Structs that are based on link-grammar's 'struct CNode_s'
FWIW, CNode_s is not really a part of the public API; you should
use linkage_constituent_node_get_*() to get the actual values.
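For example, here is a small untested sketch that walks the tree
through those accessors (the getter names follow the constituent-tree
section of the API docs linked earlier; start/end are word indices):

    #include <stdio.h>
    #include <link-grammar/link-includes.h>

    /* Print the constituent tree using the public accessors instead
       of reading struct CNode_s directly. */
    static void print_ctree(const CNode *n, int depth)
    {
        for (; n != NULL; n = linkage_constituent_node_get_next(n))
        {
            printf("%*s%s [words %d-%d]\n", depth * 2, "",
                   linkage_constituent_node_get_label(n),
                   linkage_constituent_node_get_start(n),
                   linkage_constituent_node_get_end(n));
            print_ctree(linkage_constituent_node_get_child(n), depth + 1);
        }
    }

    /* Usage:
           CNode *root = linkage_constituent_tree(lkg);
           print_ctree(root, 0);
           linkage_free_constituent_tree(root);                    */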
--linas
Ah, I misspoke (mistyped?) slightly; the members of the Ruby struct are
based on the members of 'struct CNode_s', but I actually use the
linkage_constituent_node_get_*() functions to extract the members when
populating it.
In the unlikely case you (or anyone) are curious:
http://deveiate.org/projects/Ruby-LinkParser/browser/ext/linkage.c#L782
As far as cleanup and uniformity are concerned, I'm still just grateful
for how nice it is to use now versus the original CMU code.
> As far as cleanup and uniformity are concerned, I'm still just grateful
> for how nice it is to use now versus the original CMU code.
Ohh, but I do like compliments! However, I'm confused ... the API should
be very nearly unchanged since then. So what's nicer now (aside from
the fact that you can code in ruby instead of C)?
--linas
p.s. Would you be interested in shipping the ruby bindings along with the main
distro? I don't know how ruby people like to get their code; I'm thinking that
shipping w/ the main distro makes it easier for users & developers?
The API has remained pretty consistent, but (if I recall correctly) the
original was very difficult to bind against because of inconsistent
memory-management functions, no shared library option, no autoconf, poor
locale/charset handling, etc.
Also, the constituent-tree struct functions aren't in the original. :)
> p.s. Would you be interested in shipping the ruby bindings along with the main
> distro? I don't know how ruby people like to get their code; I'm thinking that
> shipping w/ the main distro makes it easier for users & developers?
If you have the link-grammar library installed, the Ruby code is
distributed using the standard packaging system called 'Rubygems'.
Installing it is (usually) just a matter of running:
# gem install linkparser
That said, I wouldn't be opposed to shipping the Ruby bindings with the
library if it's something people want.