Extract Alternative Sentences

Roman Treutlein

Nov 17, 2016, 4:33:45 AM

to opencog, linasv...@gmail.com

Hey Linas,

I have the following definition for a Lojban predicate:

bacru: x₁ utters verbally/says/phonates/speaks [vocally makes sound] x₂.

ignoring the things in brackets I get:

x₁ utters verbally/says/phonates/speaks x₂.

Now I would like to extract individual sentences from this:

x1 utters verbally x2

x1 says x2

x1 phonates x2

x1 speaks x2

Do you know of any tool that allows me to do this easily?

best Regards

/Roman

Ben Goertzel

Nov 17, 2016, 8:33:45 AM

to opencog, Linas Vepstas

perl ;p

> --
> You received this message because you are subscribed to the Google Groups
> "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to opencog+u...@googlegroups.com.
> To post to this group, send email to ope...@googlegroups.com.
> Visit this group at https://groups.google.com/group/opencog.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/6b9fa1a6-995c-44ff-9d0a-9f2c2cdfaeb4%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
Ben Goertzel, PhD
http://goertzel.org

“I tell my students, when you go to these meetings, see what direction
everyone is headed, so you can go in the opposite direction. Don’t
polish the brass on the bandwagon.” – V. S. Ramachandran

Roman Treutlein

Nov 17, 2016, 11:42:36 AM

to opencog, linasv...@gmail.com

Not sure if that's a serious suggestion, but since not all alternatives consist of exactly one word, a simple parser won't suffice.

I might use a simple parser and then fix the results by hand but if possible I would prefer not to do that.

Linas Vepstas

Nov 17, 2016, 5:48:59 PM

to Roman Treutlein, opencog

Not sure I understand the question.

You can use WordNet to look up synonymous words/phrases. It has Perl, Python, Java and other APIs.

NLTK is Python-only, but it's a huge Swiss-army knife of tools.

Perhaps you want to create a lojban parser? There are probably off-the-shelf solutions for this, but before you go there, we should talk more.

Finally, there is Apertium and other open-source NL translators. Set it up right, and you get a lojban to/from English converter.

So one way to translate from lojban to English would be to create a mapping between lojban words and wordnet-senses. Then for each wordnet-sense, you can look up all the English words that embody that sense, as well as a numeric measure of just how good a fit each one is.

But that gives you only word-by-word translation. I presume that Apertium has a more sophisticated framework for handling phrase-by-phrase translation chunks.

--linas

Ben Goertzel

Nov 17, 2016, 5:55:59 PM

to opencog, Roman Treutlein

Hi Linas,

Roman already has built a parser (in Haskell) that maps Lojban
sentences into Atomese structures

I think I know how to use this to make a replacement for RelEx2Logic,
by using a parallel English-Lojban corpus, and then using the pattern
miner to find patterns from the set of pairs of the form

(link parser output for English sentence S,
Lojban->Atomese parser output for Lojban cognate S' of English sentence S)

What Roman is now doing is assembling the parallel English-Lojban
corpus, from various existing fragmentary parallel English-Lojban
corpora...

One of these corpora is a Lojban dictionary with definitions that look like

> x1 utters verbally/says/phonates/speaks x2.

for each Lojban word.... So he is simply facing the small programming
task of translating these definitions into sets of English sentences,
i.e. in the above example

> Now I would like to extract individual sentences from this:
>
> x1 utters verbally x2
> x1 says x2
> x1 phonates x2
> x1 speaks x2

This could be done using a fairly simple script but he is wondering if
there's some elegant parsing framework that could be used to deal with
the forward-slashes between phrases... I guess ...

ben

Ben Goertzel

Nov 17, 2016, 5:57:50 PM

to opencog, Roman Treutlein

On Fri, Nov 18, 2016 at 7:55 AM, Ben Goertzel <b...@goertzel.org> wrote:
> I think I know how to use this to make a replacement for RelEx2Logic,
> by using a parallel English-Lojban corpus, and then using the pattern
> miner to find patterns from the set of pairs of the form
>
> (link parser output for English sentence S,
> Lojban->Atomese parser output for Lojban cognate S' of English sentence S)

I am going to write a couple pages explaining the above in detail,
sometime in the next week, for use by Roman and Rui Ting...

Note that if we changed to a different link grammar dictionary (e.g.
one that was learned by unsupervised learning, hint hint) then we
would just need to re-run this pattern-mining process and we'd get a
new R2L-like-system-of-transformations transforming the output from
the new link grammar dictionary into nice PLN-friendly logical
links...

Linas Vepstas

Nov 17, 2016, 6:00:57 PM

to opencog, Roman Treutlein

On Thu, Nov 17, 2016 at 4:57 PM, Ben Goertzel <b...@goertzel.org> wrote:

> Note that if we changed to a different link grammar dictionary (e.g.
> one that was learned by unsupervised learning, hint hint)

Yeah, I've recently started laying a plan to restart that.

--linas

Linas Vepstas

Nov 17, 2016, 6:09:30 PM

to opencog, Roman Treutlein

On Thu, Nov 17, 2016 at 4:55 PM, Ben Goertzel <b...@goertzel.org> wrote:

> > x1 utters verbally/says/phonates/speaks x2.
>
> for each Lojban word.... So he is simply facing the small programming
> task of translating these definitions into sets of English sentences,
> i.e. in the above example
>
> > Now I would like to extract individual sentences from this:
> >
> > x1 utters verbally x2
> > x1 says x2
> > x1 phonates x2
> > x1 speaks x2
>
> This could be done using a fairly simple script but he is wondering if
> there's some elegant parsing framework that could be used to deal with
> the forward-slashes between phrases... I guess ...

Well, if it's really just that, then it's not "fairly simple", it's "almost trivial". This kind of string mangling is something that Perl excels at. You can do it in about 7 lines of code. Here:

#! /usr/bin/env perl

while (<>) {                      # angle brackets mean "read from standard input"
    ($x1, $v, $x2) = split;       # split on whitespace, into three parts
    @verblist = split /\//, $v;   # split the middle bit, based on slashes
    foreach $verb (@verblist) {   # loop over the list
        print "$x1 $verb $x2\n";  # one output sentence per alternative
    }
}

Untested, might have bugs.

--linas

Amen Belayneh

Nov 17, 2016, 6:37:35 PM

to opencog

From a discussion I had with Roman, I thought he wanted a lexical function. But, I might have misunderstood.

Roman Treutlein

Nov 18, 2016, 3:21:54 AM

to opencog

Maybe I should have given more examples. Because while your script might work for this example it won't work for this:

x₁ comes/goes to destination x₂ from origin x₃ via route x₄ using means/vehicle x₅.

This would at least have the advantage of the alternatives consisting of only 1 word, but as in the first example and the following this is not always the case.

x₁ is a quantity of/is made of chalk from source x₂ in form x₃.

But I probably will have to use a simple solution like the one you provided and then just fix everything it gets wrong by hand.
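A minimal sketch of the "simple solution" for the single-word case, written in Python rather than Perl and not taken from anyone's actual script. The function name is hypothetical, and the places are assumed to be written in ASCII as x1, x2, and so on. It treats every whitespace-separated token as a list of slash-separated options and emits every combination:

```python
import itertools


def expand_single_word_alternatives(definition):
    """Expand tokens such as 'comes/goes' into one sentence per combination."""
    # Each whitespace-separated token becomes the list of its slash-separated options.
    choices = [token.split("/") for token in definition.split()]
    # The Cartesian product yields every way of picking one option per token.
    return [" ".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    text = "x1 comes/goes to destination x2 from origin x3 via route x4 using means/vehicle x5"
    for sentence in expand_single_word_alternatives(text):
        print(sentence)
```

This yields four sentences for the comes/goes example. As expected, it breaks on multi-word alternatives: "x1 utters verbally/says/phonates/speaks x2" comes out as "x1 utters verbally x2", "x1 utters says x2", and so on, which is exactly the kind of output that would need fixing by hand.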

Ben Goertzel

Nov 18, 2016, 8:34:13 AM

to opencog

Hmm... yeah for cases like

On Fri, Nov 18, 2016 at 5:21 PM, Roman Treutlein <lordm...@gmail.com> wrote:
> x1 is a quantity of/is made of chalk from source x2 in form x3.

you either need to get pretty fancy or fix them by hand...

I mean, in this case "is a quantity of" happens to start with the same
word as "is made of" so you could use a heuristic based on that; but
what if you had

x1 is a quantity of/ gets made from chalk from source x2 in form x3.

Then you'd need to use grammar to tell that the phrase boundary was
around "is a quantity of" rather than around, say, "a quantity of" ...
you'd need to try various options and parse them...

A fun idea would be to replace / with "or", so that e.g.

x1 is a quantity of/is made of chalk from source x2 in form x3.

would become

x1 is a quantity of or is made of chalk from source x2 in form x3.

But one suspects the link parser might choke on many of these perverse
sentences, and improving the link dictionary in this way would be a
lot more work than just writing a perl (or whatever) script handling
an ugly list of special cases...
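The "or" substitution itself is a one-line rewrite. A sketch in Python, with a hypothetical function name, that also tidies any stray spaces around the slash; whether the link parser accepts the resulting sentences is a separate question that this does not address:

```python
import re


def slashes_to_or(definition):
    """Replace each slash, plus any surrounding spaces, with the word 'or'."""
    return re.sub(r"\s*/\s*", " or ", definition)


if __name__ == "__main__":
    print(slashes_to_or("x1 is a quantity of/is made of chalk from source x2 in form x3."))
    # prints: x1 is a quantity of or is made of chalk from source x2 in form x3.
```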

Linas Vepstas

Nov 18, 2016, 12:53:11 PM

to opencog

This is still very easy: modify the script to split on the x's, instead of splitting on whitespace, and only then split on the slashes.

I assume the x's really are the letter x. If the x's are just some strings of random words, then the problem is not really solvable deterministically by any algorithm.

--linas


Roman Treutlein

Nov 19, 2016, 1:32:41 AM

to opencog, linasv...@gmail.com

Linas, if I understand your suggestion correctly, it would do this:

x1 is a quantity of/is made of chalk from source x2 in form x3.

=> split on X

["x1", "is a quantity of/is made of chalk from source" , "x2" , "in form" , "x3"]

=> split on / and distribute

["x1", "is a quantity of" , "x2" , "in form" , "x3"]

["x1", "is made of chalk from source" , "x2" , "in form" , "x3"]

=> join again

x1 is a quantity of x2 in form x3 // WRONG should be "x1 is a quantity of chalk from source x2 in form x3"

x1 is made of chalk from source x2 in form x3

So it's not that easy. Of course, it seems I am going to use something like this anyway and then fix the errors by hand.
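The walk-through above can be reproduced with a short Python sketch. It assumes the places are written in ASCII as x1, x2, and so on (not with Unicode subscripts), and the function name is made up for illustration:

```python
import itertools
import re

# Assumption: places are ASCII "x" followed by digits, standing alone as a word.
PLACEHOLDER = re.compile(r"\b(x\d+)\b")


def split_on_places_then_slashes(definition):
    """Split on the x's, split each piece on slashes, distribute, join again."""
    # The capturing group makes re.split keep the placeholders in the result.
    pieces = PLACEHOLDER.split(definition.rstrip("."))
    parts = [piece.strip() for piece in pieces if piece.strip()]
    # Every part becomes a list of alternatives (usually of length one).
    choices = [[alt.strip() for alt in part.split("/")] for part in parts]
    return [" ".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    text = "x1 is a quantity of/is made of chalk from source x2 in form x3."
    for sentence in split_on_places_then_slashes(text):
        print(sentence)
    # prints:
    # x1 is a quantity of x2 in form x3
    # x1 is made of chalk from source x2 in form x3
```

The first output line loses "chalk from source", because nothing in the text marks where the second alternative ends.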

/roman



Linas Vepstas

Nov 19, 2016, 4:39:33 AM

to Roman Treutlein, opencog

Hi Roman,

if the text has more than 2 or 3 constructions of the form

"is a quantity of/is made of chalk from source"

then it is easier to "fix this by hand" not in the final output, but in the middle of the script: look for all patterns that start with "is" and end in "of", and peel off everything after the "of" into its own token before you distribute.
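One possible reading of this special case, sketched in Python rather than Perl. The regular expression and the function name are assumptions made for illustration, not part of any existing script. It only recognises a pair of alternatives that each start with "is" and end with "of", and it treats everything after the second "of" as shared text:

```python
import re

# Hypothetical pattern: "is ... of" / "is ... of", followed by the shared remainder.
IS_OF_PAIR = re.compile(r"\b(is\b[^/]*?\bof)\s*/\s*(is\b.*?\bof)\b\s*(.*)")


def expand_is_of(definition):
    """Distribute an 'is ... of/is ... of' pair over the text that follows it."""
    match = IS_OF_PAIR.search(definition)
    if match is None:
        return [definition]  # no such pair: leave the definition untouched
    prefix = definition[:match.start()]
    left, right, rest = match.groups()
    # Both alternatives share everything that was peeled off after the second "of".
    return [f"{prefix}{alt} {rest}".rstrip() for alt in (left, right)]


if __name__ == "__main__":
    text = "x1 is a quantity of/is made of chalk from source x2 in form x3."
    for sentence in expand_is_of(text):
        print(sentence)
    # prints:
    # x1 is a quantity of chalk from source x2 in form x3.
    # x1 is made of chalk from source x2 in form x3.
```

A variant such as "is a quantity of/ gets made from chalk" does not match the pattern and is returned unchanged, so each further construction needs its own rule or a manual fix.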

What makes Perl both good and bad is that it was designed (optimized) for creating this kind of single-use, throw-away code: to solve some task that you only have to do once or twice, so that hackery and sloppy quick code is acceptable, adequate, and especially easy to write. This is the reason for some of the cryptic shorthand: it's simply less typing. No one else will ever maintain or read or use the code, so undocumented and cryptic is OK.

--linas
