Hi guys,
The thread on MediaWiki extensions morphed into something more
interesting, so I decided to change the Subject line ;)
I thought I'd say a few words about why I think having a Lojban
comprehension/generation system for OpenCog is a really good idea
(which is why I recruited Alyn to help with this).
Crudely, we can decompose the NLP problem into two aspects:
A)
More "pure linguistics" aspects: syntax parsing, part of speech
tagging, word segmentation, word sense disambiguation, anaphora
resolution, etc.; surface realization in language generation (i.e.
figuring out how to turn a specific set of relatively unambiguous
semantic relations into a sentence)
B)
More "linguistic cognition" type aspects: interpreting the meaning of
a sentence in context, figuring out when two differently expressed
linguistic utterances mean the same thing; figuring out how to take a
medium-sized network of thoughts and dice it into smaller chunks
suitable for surface realization... figuring out what doesn't need to
be explicitly said to a given person because they already know it...
...
In principle, there is no rigid distinction between these two aspects.
In practice, modern computational linguistics AND old-style
generative grammar both deal almost exclusively with type A issues,
"pure linguistics."
There are theoretical papers written about type B issues, but not that
much concrete work. This is not because people feel linguistic
cognition is unimportant. Rather, it seems to be because people feel
like they need to get the type A questions solved first ... and while
getting the type A issues "80% solved" is fairly straightforward given
a host of available methodologies, getting them "95% solved" seems to
involve an endless stream of niggling little issues....
For example, parsing English or other natural languages is
straightforward using a variety of approaches -- if you stick to
simple sentences. But when you get into conjunctions, comparatives,
reflexives and other complex constructs, you find that current methods
all make mistakes. Fixing the mistakes seems tractable, but then
turns into a long-winded complex task.... And since there are many
such "apparently tractable, but long-winded" tasks involved in dealing
with the type A "pure linguistics" issues, nobody has moved on to the
type B tasks.
You can see where I'm going with this. Dealing with Lojban instead of
traditional natural languages, the type A issues are either
nonexistent or minimal. Syntax parsing is purely deterministic and
mathematically straightforward in every case (unless the user makes an
error), just as with computer programming languages. Semantic ambiguity
can be reduced to a rather low level via appropriate usage patterns,
so that a person chatting with an AI using Lojban could
straightforwardly avoid problematic ambiguity in nearly any case where
the AI seemed to get confused. Complex constructs like conjunctions,
comparatives, reflexives and anaphors are dealt with using mechanisms
derived from mathematical logic, so they can be translated into
OpenCog Atom constructs immediately and straightforwardly.
So, if Lojban is used in place of a natural language, then the type A
issues almost disappear. We are left only with the type B, cognitive
linguistics, issues. Of course, these issues are not easy. In some
senses, they are harder than the type A issues. But they have been
very little explored, because the computational and theoretical
linguistics community has spent basically its entire history grappling
with the mess of type A issues.
My intuition (based on a lot of thought) is that PLN (OpenCog's
probabilistic reasoning component) can be very helpful with cognitive
linguistics issues. I would like to explore this hypothesis in
detail, but it's hard to get to that point when doing English language
processing, because the mess caused by not-quite-good-enough handling of
the type A issues always seems to get in the way...
It may be that Ruiting's new approach to the type A issues will get
around this problem, via producing sufficiently clean Atom structures
from English sentences. But I suspect that even if her approach works
great, there will still be many more frustrating issues in dealing with
English-generated Atom structures than with Lojban-generated Atom
structures.
...
Now, about the practicalities of Alyn's code and its integration with OpenCog.
The fact that two different dialects of Scheme are involved is mildly
awkward, but not really a big problem (as I guess everyone has already
recognized...)
What I am hoping for Alyn to do is to write Scheme code carrying out
the two transformations:
Lojban sentence ==> Scheme tree of relationships (labeled via Lojban words)
[for language comprehension]
Scheme tree of relationships (labeled via Lojban words) ==> Lojban sentence
[for language generation]
[plus some tweaks to the above to deal with anaphora, which Alyn and I
haven't discussed yet]
Once this works well, then I will work with Ruiting and/or Jade to
create Guile code carrying out the transformations
Scheme tree of OpenCog Atoms ==> Scheme tree of relationships (labeled
via Lojban words)
[for language comprehension]
Scheme tree of relationships (labeled via Lojban words) ==> Scheme
tree of OpenCog Atoms
[for language generation]
(These will be similar to the rules Ruiting has recently coded in
Java, for translating sets of RelEx relationships into OpenCog Atoms.
She prototyped some of these rules in Scheme before shifting to Java
due to her faster coding speed in the latter.)
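For illustration, a single such rule might look like this -- a hypothetical Python sketch of mine (the real rules will live in Scheme/Guile), using the standard EvaluationLink / PredicateNode / ListLink Atom types:

```python
def tree_to_atoms(tree):
    """Tree of relationships (labeled via Lojban words) ==> Atom structure.

    Hypothetical sketch: renders the Atoms as an S-expression string,
    mapping the selbri to a PredicateNode and each sumti to a ConceptNode.
    """
    selbri, args = tree
    nodes = " ".join('(ConceptNode "%s")' % a for a in args)
    return '(EvaluationLink (PredicateNode "%s") (ListLink %s))' % (selbri, nodes)
```

So ("prami", ["mi", "do"]) becomes an EvaluationLink asserting prami(mi, do).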
Given these rules, we will have a fairly complete "Type A language
pipeline" for OpenCog, using Lojban.
This will open up a bunch of interesting possibilities, including:
-- a Lojban dialogue system
-- experiments having the system ground Lojban words in its
virtual-world experiences
-- conversing with the system in parallel in English and Lojban. As
Lojban is much less ambiguous, this may give the system data it can
use for disambiguating English. This may be useful for teaching the
system the meaning of complex, confusing English constructs (with
commas, comparatives, quantifiers, etc.), for which statistical
methods of disambiguation aren't very useful at present.
-- automated translation between English and Lojban
-- later on: automated translation between English and Chinese using
OpenCog, using Lojban as an intermediate language
Note that the last three points, involving using English and Lojban
together, would be easier if some additional preliminary work besides
Alyn's current Scheme project were completed. Namely: we would want
the English-Lojban dictionary to get ported into OpenCog. The best
way to do this isn't yet 100% clear to me, though I've thought about
it a lot. But we can cross that bridge when we come to it.
I realize this is sort of an odd, edgy direction -- and I'm certainly
not suggesting to abandon OpenCog work on the good old fashioned
natural languages. But I think that if any of you studies Lojban for
a couple dozen hours, enough to learn to read and write simple
sentences, then you will start to see why I think this is a really
cool direction...
-- Ben
On Thu, Sep 13, 2012 at 2:42 PM, Linas Vepstas <linasv...@gmail.com> wrote:
> Hi,
>
> On 13 September 2012 12:56, .alyn.post. <alyn...@lodockikumazvati.org> wrote:
>>
>> On Thu, Sep 13, 2012 at 12:16:31PM -0500, Linas Vepstas wrote:
>> > I don't know Guile well enough to know what libraries are
>> > available,
>> >
>> > It's more or less standard R6RS plus all the SRFIs plus various
>> > historical baggage like ice-9, which you would avoid in new code.
>> > The only real difference is in modules, since Guile had a module
>> > system that is much older and more feature-full than what was
>> > adopted in R6RS, which causes many headaches.
>> >
>>
>> Probably the biggest pain would be that I use the sandbox egg to
>> provide safe evaluation of Scheme code:
>>
>> http://wiki.call-cc.org/eggref/4/sandbox
>>
>> I could strip this out and do a bare eval so long as my input wasn't
>> coming from another user.
>
>
> Why would you need to accept Scheme input from another user? Google doesn't
> know of a sandbox for Guile. Can't be hard ...
>
>>
>> After that I need matchable:
>>
>> http://wiki.call-cc.org/eggref/4/matchable
>>
>> Which is portable, but I'm not sure has been packaged for Guile.
>
>
> http://www.gnu.org/software/guile/manual/html_node/Pattern-Matching.html
>
> "The (ice-9 match) module provides a pattern matcher, written by Alex Shinn,
> and compatible with Andrew K. Wright's pattern matcher found in many Scheme
> implementations."
>
>> Finally, I need a hash table or some kind of tree for memoization, which
>> I assume Guile has.
>
>
> Yeah, it has hashes and various other data structures.
>
>>
>> > I suppose I could too, having gone through this for the English
>> > language.
>> > *I could tell you what I learned, but this is the wrong thread.
>> > *Anyway,
>> > the short answer is that the difficulty of understanding english has
>> > nothing to do with syntactic or semantic ambiguity, or with any
>> > difficulty
>> > in parsing. *It is at quite another level, and lojban would promptly
>> > stumble there as well.
>>
>> I agree with you, though I don't have the depth of understanding you
>> do with OpenCog, and therefore can't speak articulately on the topic.
>> Lojban happens to be in my skillset, doing this work also accomplishes
>> some other goals of mine, and I'd honestly rather try and fail
>> but produce a demonstrated, documented outcome than not move forward
>> with this work.
>
>
> Well, the issue is not really an opencog issue, but a much deeper, general
> question: roughly: "great, you have a syntax tree, now what?"
>
> The obvious first answer is "let's do question answering". This ranges
> from deceptively simple to fiendishly complex, e.g.:
>
> John threw the ball.
> Who threw the ball?
>
> You don't even have to parse anything to figure that out: just match up the
> words.
>
> John threw the green ball.
> Who threw the ball?
>
> Slightly trickier: your parser needs to recognize that "green" is a noun
> modifier, and your pattern matcher needs to know that, in this case,
> ignoring the modifier when building the answer is acceptable.
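(That matching step can be sketched in a few lines of Python -- a toy illustration, not anyone's actual code, with facts and questions encoded as made-up (verb, subject, (head_noun, modifiers)) triples:)

```python
def answer(fact, question):
    # fact and question are (verb, subject, (head_noun, modifiers)) triples.
    verb_f, subj_f, (noun_f, _mods) = fact
    verb_q, subj_q, (noun_q, _) = question
    # A "Who ...?" question leaves the subject slot empty (None); the
    # object's modifiers are ignored when matching, as described above.
    if subj_q is None and verb_f == verb_q and noun_f == noun_q:
        return subj_f
    return None

fact = ("throw", "John", ("ball", ["green"]))
question = ("throw", None, ("ball", []))   # "Who threw the ball?"
```
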
>
> From there, it only gets harder. Many sentences have different surface
> structures, but the act of parsing will normalize them into rather similar
> or even identical "semantic relation" trees. The real problems set in
> when the known facts and the question have very different "semantic"
> structures. This requires having rules that can modify one structure into
> another, and then determine if the modification can answer the question.
>
> I started hand-coding rules to perform these transformations. After a
> half-dozen or so, I realized that this was nuts: I would have to hand-code
> thousands of pattern transformations. The result would be fragile and
> buggy. I contemplated some ways to use machine learning to automatically
> extract these, but never got a chance to try them out.
>
> --linas
--
Ben Goertzel, PhD
http://goertzel.org
"My humanity is a constant self-overcoming" -- Friedrich Nietzsche