Hi guys,
The thread on MediaWiki extensions morphed into something more
interesting, so I decided to change the Subject line ;)
I thought I'd say a few words about why I think having a Lojban
comprehension/generation system for OpenCog is a really good idea
(which is why I recruited Alyn to help with this).
Crudely, we can decompose the NLP problem into two aspects:
A)
More "pure linguistics" aspects: syntax parsing, part of speech
tagging, word segmentation, word sense disambiguation, anaphora
resolution, etc.; surface realization in language generation (i.e.
figuring out how to turn a specific set of relatively unambiguous
semantic relations into a sentence)
B)
More "linguistic cognition" type aspects: interpreting the meaning of
a sentence in context, figuring out when two differently expressed
linguistic utterances mean the same thing; figuring out how to take a
medium-sized network of thoughts and dice it into smaller chunks
suitable for surface realization... figuring out what doesn't need to
be explicitly said to a given person because they already know it...
...
In principle, there is no rigid distinction between these two aspects.
In practice, modern computational linguistics AND old-style
generative grammar both deal almost exclusively with type A issues,
"pure linguistics."
There are theoretical papers written about type B issues, but not that
much concrete work. This is not because people feel linguistic
cognition is unimportant. Rather, it seems to be because people feel
like they need to get the type A questions solved first ... and while
getting the type A issues "80% solved" is fairly straightforward given
a host of available methodologies, getting them "95% solved" seems to
involve an endless stream of niggling little issues....
For example, parsing English or other natural languages is
straightforward using a variety of approaches -- if you stick to
simple sentences. But when you get into conjunctions, comparatives,
reflexives and other complex constructs, you find that current methods
all make mistakes. Fixing the mistakes seems tractable, but then
turns into a long-winded complex task.... And since there are many
such "apparently tractable, but long-winded" tasks involved in dealing
with the type A "pure linguistics" issues, nobody has moved on to the
type B tasks.
You can see where I'm going with this. Dealing with Lojban instead of
traditional natural languages, the type A issues are either
nonexistent or minimal. Syntax parsing is purely deterministic and
mathematically straightforward in every case (unless the user makes an
error), just as with computer programming languages. Semantic ambiguity
can be reduced to a rather low level via appropriate usage patterns,
so that a person chatting with an AI using Lojban could
straightforwardly avoid problematic ambiguity in nearly any case where
the AI seemed to get confused. Complex constructs like conjunctions,
comparatives, reflexives and anaphors are dealt with using mechanisms
derived from mathematical logic, so they can be translated into
OpenCog Atom constructs immediately and straightforwardly.
So, if Lojban is used in place of a natural language, then the type A
issues almost disappear. We are left only with the type B, cognitive
linguistics, issues. Of course, these issues are not easy. In some
senses, they are harder than the type A issues. But they have been
very little explored, because the computational and theoretical
linguistics community has spent basically its entire history grappling
with the mess of type A issues.
My intuition (based on a lot of thought) is that PLN (OpenCog's
probabilistic reasoning component) can be very helpful with cognitive
linguistics issues. I would like to explore this hypothesis in
detail, but it's hard to get to that point when doing English language
processing, because the mess caused by not-quite-good-enough handling of
the type A issues always seems to get in the way...
It may be that Ruiting's new approach to the type A issues will get
around this problem, via producing sufficiently clean Atom structures
from English sentences. But I suspect that even if her approach works
great, there will still be many more frustrating issues in dealing with
English-generated Atom structures than with Lojban-generated Atom
structures.
...
Now, about the practicalities of Alyn's code and its integration with OpenCog.
The fact that two different dialects of Scheme are involved is mildly
awkward, but not really a big problem (as I guess everyone has already
recognized...)
What I am hoping for Alyn to do is to write Scheme code carrying out
the two transformations:
Lojban sentence ==> Scheme tree of relationships (labeled via Lojban words)
[for language comprehension]
Scheme tree of relationships (labeled via Lojban words) ==> Lojban sentence
[for language generation]
[plus some tweaks to the above to deal with anaphora, which Alyn and I
haven't discussed yet]
Once this works well, then I will work with Ruiting and/or Jade to
create Guile code carrying out the transformations
Scheme tree of OpenCog Atoms ==> Scheme tree of relationships (labeled
via Lojban words)
[for language comprehension]
Scheme tree of relationships (labeled via Lojban words) ==> Scheme
tree of OpenCog Atoms
[for language generation]
(These will be similar to the rules Ruiting has recently coded in
Java, for translating sets of RelEx relationships into OpenCog Atoms.
She prototyped some of these rules in Scheme before shifting to Java
due to her faster coding speed in the latter.)
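For illustration, a single such rule might look like this -- a hypothetical Python sketch of mine (the real rules will live in Scheme/Guile), using the standard EvaluationLink / PredicateNode / ListLink Atom types:

```python
def tree_to_atoms(tree):
    """Tree of relationships (labeled via Lojban words) ==> Atom structure.

    Hypothetical sketch: renders the Atoms as an S-expression string,
    mapping the selbri to a PredicateNode and each sumti to a ConceptNode.
    """
    selbri, args = tree
    nodes = " ".join('(ConceptNode "%s")' % a for a in args)
    return '(EvaluationLink (PredicateNode "%s") (ListLink %s))' % (selbri, nodes)
```

So ("prami", ["mi", "do"]) becomes an EvaluationLink asserting prami(mi, do).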
Given these rules, we will have a fairly complete "Type A language
pipeline" for OpenCog, using Lojban.
This will open up a bunch of interesting possibilities, including:
-- a Lojban dialogue system
-- experiments having the system ground Lojban words in its
virtual-world experiences
-- conversing with the system in parallel in English and Lojban. As
Lojban is much less ambiguous, this may give the system data it can
use for disambiguating English. This may be useful for teaching the
system the meaning of complex, confusing English constructs (with
commas, comparatives, quantifiers, etc.), for which statistical
methods of disambiguation aren't very useful at present.
-- automated translation between English and Lojban
-- later on: automated translation between English and Chinese using
OpenCog, using Lojban as an intermediate language
Note that the last three points, involving using English and Lojban
together, would be easier if some additional preliminary work besides
Alyn's current Scheme project were completed. Namely: we would want
the English-Lojban dictionary to get ported into OpenCog. The best
way to do this isn't yet 100% clear to me, though I've thought about
it a lot. But we can cross that bridge when we come to it.
I realize this is sort of an odd, edgy direction -- and I'm certainly
not suggesting to abandon OpenCog work on the good old fashioned
natural languages. But I think that if any of you studies Lojban for
a couple dozen hours, enough to learn to read and write simple
sentences, then you will start to see why I think this is a really
cool direction...
-- Ben
On Thu, Sep 13, 2012 at 2:42 PM, Linas Vepstas <linasv...@gmail.com> wrote:
> Hi,
>
> On 13 September 2012 12:56, .alyn.post. <alyn...@lodockikumazvati.org> wrote:
>>
>> On Thu, Sep 13, 2012 at 12:16:31PM -0500, Linas Vepstas wrote:
>> > I don't know Guile well enough to know what libraries are
>> > available,
>> >
>> > It's more or less standard R6RS plus all the SRFIs plus various
>> > historical baggage like ice-9, which you would avoid in new code.
>> > The only real difference is in modules, since Guile had a module
>> > system that is much older and more feature-full than what was
>> > adopted in R6RS, which causes many headaches.
>> >
>>
>> Probably the biggest pain would be that I use the sandbox egg to
>> provide safe evaluation of Scheme code:
>>
>> http://wiki.call-cc.org/eggref/4/sandbox
>>
>> I could strip this out and do a bare eval so long as my input wasn't
>> coming from another user.
>
>
> Why would you need to accept Scheme input from another user? Google doesn't
> know of a sandbox for Guile. Can't be hard ...
>
>>
>> After that I need matchable:
>>
>> http://wiki.call-cc.org/eggref/4/matchable
>>
>> Which is portable, but I'm not sure has been packaged for Guile.
>
>
> http://www.gnu.org/software/guile/manual/html_node/Pattern-Matching.html
>
> "The (ice-9 match) module provides a pattern matcher, written by Alex Shinn,
> and compatible with Andrew K. Wright's pattern matcher found in many Scheme
> implementations."
>
>> Finally, I need a hash table or some kind of tree for memoization, which
>> I assume Guile has.
>
>
> Yeah, it has hashes and various other data structures.
>
>>
>> > I suppose I could too, having gone through this for the English
>> > language.
>> > *I could tell you what I learned, but this is the wrong thread.
>> > *Anyway,
>> > the short answer is that the difficulty of understanding english has
>> > nothing to do with syntactic or semantic ambiguity, or with any
>> > difficulty
>> > in parsing. *It is at quite another level, and lojban would promptly
>> > stumble there as well.
>>
>> I agree with you, though I don't have the depth of understanding you
>> do with OpenCog, and therefore can't speak articulately on the topic.
>> Lojban happens to be in my skillset, doing this work also accomplishes
>> some other goals of mine, and I'd honestly rather try and fail
>> but produce a demonstrated, documented outcome than not move forward
>> with this work.
>
>
> Well, the issue is not really an opencog issue, but a much deeper, general
> question: roughly: "great, you have a syntax tree, now what?"
>
> The obvious first answer is "let's do question answering". This ranges
> from deceptively simple to fiendishly complex, e.g.:
>
> John threw the ball.
> Who threw the ball?
>
> You don't even have to parse anything to figure that out: just match up the
> words.
>
> John threw the green ball.
> Who threw the ball?
>
> Slightly trickier: your parser needs to recognize that "green" is a noun
> modifier, and your pattern matcher needs to know that, in this case,
> ignoring the modifier when building the answer is acceptable.
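(That matching step can be sketched in a few lines of Python -- a toy illustration, not anyone's actual code, with facts and questions encoded as made-up (verb, subject, (head_noun, modifiers)) triples:)

```python
def answer(fact, question):
    # fact and question are (verb, subject, (head_noun, modifiers)) triples.
    verb_f, subj_f, (noun_f, _mods) = fact
    verb_q, subj_q, (noun_q, _) = question
    # A "Who ...?" question leaves the subject slot empty (None); the
    # object's modifiers are ignored when matching, as described above.
    if subj_q is None and verb_f == verb_q and noun_f == noun_q:
        return subj_f
    return None

fact = ("throw", "John", ("ball", ["green"]))
question = ("throw", None, ("ball", []))   # "Who threw the ball?"
```
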
>
> From there, it only gets harder. Many sentences have different surface
> structures, but the act of parsing will normalize them into rather similar
> or even identical "semantic relation" trees. The real problems set in
> when the known facts and the question have very different "semantic"
> structures. This requires having rules that can modify one structure into
> another, and then determine if the modification can answer the question.
>
> I started hand-coding rules to perform these transformations. After a
> half-dozen or so, I realized that this was nuts: I would have to hand-code
> thousands of pattern transformations. The result would be fragile and
> buggy. I contemplated some ways to use machine learning to automatically
> extract these, but never got a chance to try them out.
>
> --linas
--
Ben Goertzel, PhD
http://goertzel.org
"My humanity is a constant self-overcoming" -- Friedrich Nietzsche