Re: LG 5.5.1


Linas Vepstas

Nov 15, 2018, 1:09:51 AM
to akol...@aigents.com, link-grammar, Amir P, lang-...@googlegroups.com, Andres Suarez


On Wed, Nov 14, 2018 at 11:23 PM Anton Kolonin @ Aigents <akol...@aigents.com> wrote:
Hi Linas,

I would expect that those links are not returned because "Mom" has
nothing to do with "telescope" in the first sentence

The Mp connects "Mom" to "with" (not to telescope). Why does it connect? Because, no matter which way you read the sentence (either Dad has the telescope, or Mom has the telescope), in either case the telescope is instrumentally associated with "Mom" (either she is carrying it, or she is being seen with it). So there is a deep relationship between Mom and the telescope: the instrumentality relationship.

My hope is that the next step of the language learning project is that such instrumentalities can be fished out of the soup of words. By "instrumentality", I mean, in a casual sense, the concept of a "lexical function" (LF) and related ideas taken from Meaning-Text Theory.

https://en.wikipedia.org/wiki/Lexical_function
http://www.coli.uni-saarland.de/courses/syntactic-theory-09/literature/MTT-Handbook2003.pdf

The way that I envision this happening is that patterns such as

                   +-----MVp-------+----Js--- (instrument)
  (subject)---Ss*s-+--Os--+--Mp----+        
                   |      |        |     
              (action) (object) (prep)        


can be spotted in text as "frequently occurring patterns" or, more accurately, "patterns with high MI", and that, furthermore, they can be detected as synonymous with other similar structures (Dekang Lin, Poon and Domingos) such as "Mom was seen by Dad with the telescope", "Dad used the telescope to see Mom" and "Mom had the telescope when Dad saw her". The goal is to recognize all of these as synonymous sentences, and then to recognize the ambiguity in the first example.

The Mp link helps make the linkage "well-connected", "firm" or "strong" -- graphs with loops are more "rigid" or "stronger" than graphs without loops.

I think this is possible. I think it's a very worthwhile goal. If you want to understand meaning, you have to go beyond vector-space formalisms, and extract synonymous phrases, lexical functions, all of that.
 
and "chalk" has
nothing to do with "on" in the second sentence, to my understanding.

But it's much like the first sentence: where is the chalk going? It's going "on" to something. The M link indicates where it is going. The lower-case "p" just indicates that the noun (chalk) is connecting to a preposition (on). It's another instrumentality relationship.

Again, read through the documentation for the M link: https://www.abisource.com/projects/link-grammar/dict/section-M.html. Perhaps if you study the examples of the Ma, Mv, Mg and Mr connectors documented there, you will see why Mp is the analogous case for prepositions.



We have had discussions on this involving Andres, so we had to create
manually edited "gold standard" parses (which do not have "extra links"):
http://88.99.210.144/data/andres_parses/poc-english_ex-parses-gold.txt

It is not an "extra link". It is a vitally important link. It captures a large part of the meaning, the semantics of the sentence. If you edit it out, you damage the parse.

It might also be useful for you to take a look at how the Stanford parser handles these sentences. I'm pretty sure they handle it the same way that link grammar does; this is a more-or-less standardized relationship that pretty much all dependency parsers are going to generate.


compared to LG "silver standard" parses (with those extra links):
http://langlearn.singularitynet.io/data/parses/English/POC-English/poc_english-LG-silver.txt.ull

I recall we had a discussion that the future version of LG will have this fixed

There is nothing to fix; nothing is broken. The discussion was not with me.

--linas
 
and I hoped we can get the "gold" (manual) and "silver" (LG) standards
merged. Sorry if I misunderstood on that.


On the new LG version and performance on long sentences: cool, we would
love to have it, because we have major performance problems with LG
parsing Gutenberg Children. You can get the dictionary here:
http://langlearn.singularitynet.io/data/clustering_2018/Gutenberg-Children-Books-1000-disjuncts-2018-10-29/Gutenberg-Children-Books-Caps-50-clusters-1000-disjuncts-2018-10-29_/Gutenberg-Children-Books-Caps_LG-English_cALEd_no-LW_no-RW_no-gen/

and try to parse the corpus that was used to create this dictionary:
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/

Right now, it takes technical infinity to do the parse for the above; we
have never gotten the parse results with the current version of LG.

Cheers,
-Anton


15.11.2018 11:21, Linas Vepstas:
>
> On Wed, Nov 14, 2018 at 3:25 AM Anton Kolonin @ Aigents
> <akol...@aigents.com <mailto:akol...@aigents.com>> wrote:
>
>     Hi Amir and Linas,
>
>     We have finally upgraded to LG 5.5.1 and see that some sentences in our
>     reference corpus are not parsed right:
>     http://langlearn.singularitynet.io/data/poc-english/poc_english.txt
>
>
>     For one example:
>
>     link-parser
>     link-grammar: Info: Dictionary found at
>     /home/akolonin/miniconda3/envs/ull-lg55/share/link-grammar/en/4.0.dict
>     link-grammar: Info: Dictionary version 5.5.1, locale en_US.UTF-8
>     link-grammar: Info: Library version link-grammar-5.5.1. Enter "!help"
>     for help.
>     linkparser> Dad saw Mom with a telescope.
>     Found 18 linkages (18 had no P.P. violations)
>              Linkage 1, cost vector = (UNUSED=0 DIS=-0.61 LEN=9)
>
>           +---------------------Xp---------------------+
>           +----->WV----->+-----MVp----+----Js---+      |
>           +-->Wd--+-Ss*s-+--Os--+--Mp-+  +-Ds**c+      |
>           |       |      |      |     |  |      |      |
>     LEFT-WALL dad.m saw.v-d Mom.l with a telescope.n .
>
>     Press RETURN for the next linkage.
>
>
>     linkparser> Mom writes with chalk on the board.
>     Found 32 linkages (32 had no P.P. violations)
>              Linkage 1, cost vector = (UNUSED=0 DIS=-0.61 LEN=11)
>
>           +-------------------------Xp-------------------------+
>           +------>WV----->+---------MVp--------+----Ju---+     |
>           +-->Wd--+--Ss*s-+--MVp-+--Ju--+--Mp--+  +--Dmu-+     |
>           |       |       |      |      |      |  |      |     |
>     LEFT-WALL Mom.l writes.v with chalk.n-u on the board.n-u .
>
>     Press RETURN for the next linkage.
>     linkparser>
>
>
>
>     In the sentences above, links
>
>     Mom.l--Mp-with
>
>     and
>
>     chalk.n-u--Mp--on
>
>     seem unexpected.
>
>
> Unexpected, perhaps, but correct, as far as I can tell, and documented in
> great detail:
>
> https://www.abisource.com/projects/link-grammar/dict/section-M.html
>
>     As I recall from some earlier discussions, issues like those should
>     no longer exist in the latest LG version.
>
>     Do we misunderstand something, or should we rather create an issue on
>     this matter?
>
>
> I don't see the issue. It appears to be 100% correct. What were you
> expecting to happen, instead?
>
> --linas
>
> p.s. it would be more convenient if you used the link-grammar mailing list.
>
> --
> cassette tapes - analog TV - film cameras - you
>
> --
> You received this message because you are subscribed to the Google
> Groups "lang-learn" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to lang-learn+...@googlegroups.com
> <mailto:lang-learn+...@googlegroups.com>.
> To post to this group, send email to lang-...@googlegroups.com
> <mailto:lang-...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/lang-learn/CAHrUA36%2BQMxMjwRz493siJVtzjBGDOHF2p3PsYGXkrHiSa_cMA%40mail.gmail.com
> <https://groups.google.com/d/msgid/lang-learn/CAHrUA36%2BQMxMjwRz493siJVtzjBGDOHF2p3PsYGXkrHiSa_cMA%40mail.gmail.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.

--
-Anton Kolonin
skype: akolonin
cell: +79139250058
akol...@aigents.com
https://aigents.com
https://www.youtube.com/aigents
https://www.facebook.com/aigents
https://plus.google.com/+Aigents
https://medium.com/@aigents
https://steemit.com/@aigents
https://golos.blog/@aigents
https://vk.com/aigents


--
cassette tapes - analog TV - film cameras - you

Anton Kolonin @ Gmail

Nov 15, 2018, 1:15:27 AM
to link-g...@googlegroups.com, Linas Vepstas, Amir P, lang-...@googlegroups.com, Andres Suarez
Hi Linas,

The Stanford Parser does not behave the way you describe, for either of the
two sentences; this can very easily be tried:

http://corenlp.run/

Cheers,
-Anton


15.11.2018 13:09, Linas Vepstas wrote:

Linas Vepstas

Nov 15, 2018, 1:21:54 AM
to Anton Kolonin, link-grammar, Amir P, lang-...@googlegroups.com, Andres Suarez
It doesn't generate anything at all .. the website is broken or down.

Please double-check. The relationship is obviously correct, and I am at a loss as to why you are arguing about this. If you have some explanation that actually makes sense, that can be stated in factual terms, and that is anchored in a reality sufficiently plausible that one could say "gosh, that might possibly be true", then please make that statement. It's pointless to just badger me for no particular reason, other than that you feel like badgering me.

--linas

Linas Vepstas

Nov 15, 2018, 9:23:51 AM
to Anton Kolonin, link-grammar, Amir P, lang-...@googlegroups.com, Andres Suarez


On Thu, Nov 15, 2018 at 1:41 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:

Hi Linas, the site is up and running - attaching the screenshots.

I see your reasoning, actually.

Thank you.

But since you suggested referring to the Stanford Parser, I have just checked, and it does it the other way.

OK, so here's how that works. The Stanford parser is designed to only produce planar trees. Since link grammar is producing parses with loops (the MV-J-M loop in one case, and an MV-O-M loop in the other), of course, the two will disagree.  To get a reasonable comparison, you'd have to find some other sentence, of a similar style, to find the Stanford version of the M dependency. Some of the other sentences from the M-link documentation page will produce those dependencies.

I feel I should make a few statements about loops.  Link Grammar has a number of them. The most basic one is this:

    +------>WV----->+
    +-->Wd--+--Ss*s-+
    |       |       |
LEFT-WALL Mom.l writes.v


Here, there are three dependencies:
-- from the root (left-wall) to the head-noun (Wd link)
-- from the root to the head-verb (WV link)
-- between the subject and the verb.

The Stanford parser, because it generates a tree, has two choices: omit the connection to the head-noun, or omit the connection to the head-verb (or omit the SV dependency, but that would be crazy). Their default has historically been to usually connect to the head-verb, and rarely to the head-noun. Thus, they have a tree, but at the expense of missing an "obvious" dependency.

You can trick Stanford into connecting to the head-noun by giving it sentences without a verb. Exclamations:  "Holy Cow, what a mess!"
Simple Replies: "Yes!"
Prepositional replies: "Just past the big one, on the left."
Questions: "Why not?"
Ellipses: "... like wedding cakes."

Link Grammar has other loops, the second-most important being the relative clauses, using the R and B links:

         +-------B-------+
         +--R--+---RS----+
         |     |         |
    The dog   who     chased me was black

Of the three links, the Stanford parser will drop one; I'm not sure which.

Next, an important theoretical-philosophical question: why cycles? The answer is that cycles add rigidity and force correctness. There are a number of ambiguous parses of sentences, and it's hard, using only tree rules, to obtain the correct parse. By forcing the use of more links, by requiring that **all** of the desired relationships be present, instead of only some of them, you can be more confident that the parse is correct. Otherwise, you are in a situation where some desired dependency is missing, and you don't know why... is it because the parse is incorrect, or is it because that dependency is actually missing?

For language-learning, the MST parser always produces ... trees. I've long planned an extension to it that would produce loops, but only if all of the links in the loop were above some threshold, say MI=4 (and none of the links cross). I've been too busy to do that. The easy part is modifying the parser; the hard part is exploring what happens downstream.
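In case it helps to make the idea concrete, here is a hypothetical sketch of that planned extension (this is not the actual parser; the function, the sample tree, and the MI values are all made up for illustration): given the tree an MST pass produced, admit extra loop-forming edges only when their MI clears the threshold and they do not cross an existing link.

```python
def add_loop_edges(words, tree_edges, mi, threshold=4.0):
    """After an MST parse, add extra (loop-forming) edges whose MI is
    above the threshold, skipping any edge that would cross an edge
    already in the graph. Edges are (i, j) word-index pairs, i < j."""
    def crosses(a, b, edges):
        # (a,b) and (c,d) cross iff exactly one endpoint of one edge
        # lies strictly inside the span of the other.
        return any(a < c < b < d or c < a < d < b for c, d in edges)

    edges = list(tree_edges)
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in edges:
                continue
            # Only left-to-right pairs are looked up; unseen pairs fail.
            if (mi.get((words[i], words[j]), float("-inf")) >= threshold
                    and not crosses(i, j, edges)):
                edges.append((i, j))
    return edges

# Made-up example: a tree for "Dad saw Mom with telescope" plus a strong
# Mom-with pair, which closes the MV-J-M style loop discussed above.
words = ["Dad", "saw", "Mom", "with", "telescope"]
tree = [(0, 1), (1, 2), (1, 3), (3, 4)]
loops = add_loop_edges(words, tree, {("Mom", "with"): 5.0})
```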

--linas

Anton Kolonin @ Gmail

Nov 16, 2018, 7:09:06 AM
to linasv...@gmail.com, link-grammar, Amir P, lang-...@googlegroups.com, Andres Suarez

Linas, thank you!

15.11.2018 21:23, Linas Vepstas:
The dog who chased me was black
Here you go:

Linas Vepstas

Nov 16, 2018, 4:02:43 PM
to Anton Kolonin, link-grammar, Amir P, lang-...@googlegroups.com, Andres Suarez
OK, so the analysis for this case is the same as before.
 "The dog who chased me was black" contains the relative
clause "who chased me". The root/left-wall of this relative
clause is "dog". The head-verb of the relative clause is
"chased", the head-noun is "who". Finally, there is a
subject-object relation between "who" and "chased".

If you draw all three arrows, you get a loop. So, to get a tree,
drop one of them: by convention, omit the root-to-the-head-noun
arrow. That is what the "basic dependencies" graph does.

The Enhanced++ dependencies graph does show the
root-to-the-head-noun arrow; it's labeled "ref". It's also
bizarrely mis-drawn: the "nsubj" arrow should have been
exactly the same as in the "basic dependencies" graph, but
it's mis-drawn, for some reason.

By comparing the graphs, you can try to make a table mapping
LG link types to Stanford relations. That is what relex did.
Superficially it works great, but eventually drowns in details
and subtle distinctions. One giant example is that LG never
generates the copula ("cop" in Stanford). Now, one *could* write
an LG dictionary that had a COP link-type in it. But no one has
done this. Why? Historical accident, I suppose.

What is interesting about MI values is that MI gives you an
"objective" way of knowing which links are the strongest; these
will hopefully be the same as in LG, but might not be. For example,
this parse makes sense, and seems right:

     +-------->WV------->+
    +---->Wd-----+      |
    |      +Ds**c+-Ss*s-+---Pa--+
    |      |     |      |       |
LEFT-WALL the  dog.n was.v-d black.a


but there is another possibility, that kind-of makes sense
(and perhaps language learning will find):

    +---->Wd---->+     
    |            +-->adjcomp--->+
    |      +Ds**c+      +<-cop<-+
    |      |     |      |       |
LEFT-WALL the  dog.n   was    black


Here, adjcomp is "adjectival complement" and "cop" is the
copula. Some dependency grammars draw this graph.
Some call it "predicative adjectival modifier". Let's not quibble.
Note that I did not draw an arrow from subject to verb. I could,
I suppose. Note that it is now IMPOSSIBLE to draw an arrow
from root/left-wall to the verb, because it would require a
link-crossing: it would have to cross over the adjcomp arrow.

Thus, if you want to draw an arrow from root to head-verb, and also
get a planar graph, you are not allowed to draw the adjcomp/predadj
arrow.  That helps explain what LG does.

It also helps make clear that the no-links-crossing constraint is imperfect.
It seems reasonable, but clearly, there is a violation in the above rather
trivial sentence!

-- Linas

Linas Vepstas

Nov 16, 2018, 5:05:38 PM
to Anton Kolonin, link-grammar, Amir P, lang-...@googlegroups.com, Andres Suarez, opencog
I hit "send" too soon, without finishing the thought:

On Fri, Nov 16, 2018 at 3:02 PM Linas Vepstas <linasv...@gmail.com> wrote:
For example, this parse makes sense, and seems right:

     +-------->WV------->+
    +---->Wd-----+      |
    |      +Ds**c+-Ss*s-+---Pa--+
    |      |     |      |       |
LEFT-WALL the  dog.n was.v-d black.a


but there is another possibility, that kind-of makes sense (and perhaps language learning will find):

    +---->Wd---->+     
    |            +-->adjcomp--->+
    |      +Ds**c+      +<-cop<-+
    |      |     |      |       |
LEFT-WALL the  dog.n   was    black


Here, adjcomp is "adjectival complement" and "cop" is the copula. Some dependency grammars draw this graph. Some call it "predicative adjectival modifier". Let's not quibble. Note that I did not draw an arrow from subject to verb. I could, I suppose. Note that it is now IMPOSSIBLE to draw an arrow from root/left-wall to the verb, because it would require a
link-crossing: it would have to cross over the adjcomp arrow.

Thus, if you want to draw an arrow from root to head-verb, and also get a planar graph, you are not allowed to draw the adjcomp/predadj arrow.  That helps explain what LG does.

It also helps make clear that the no-links-crossing constraint is imperfect. It seems reasonable, but clearly, there is a violation in the above rather
trivial sentence!

OK, to finish this thought. Let us speculate what an MST parse of this sentence might look like. It depends on the MI values for the word-pairs: MI(dog, was), MI(was, black) and MI(dog, black). I don't know what these are, but clearly they will be different for a corpus of kids-lit than for a corpus of math texts.

Next question: what happens when words are sorted into categories?  What is MI(dog, some color)? What is MI(some animal, some color)? What is MI(physical object, some color)? 

I don't have a good story here, except to say that copulas and predicative adjectives present maybe the simplest possible example of the difficulty of moving from surface syntax (SSynt, what LG does) to deep syntax (DSynt, what MTT does). Yet, this move is a critical one.

I'm currently thinking of it as a graph-rewrite rule that converts the SSynt graph into a PLN graph:

EvaluationLink
     PredicateNode "has color"
     ListLink
         Concept "dog"
         Concept "black"

Or, perhaps as Nil might like to write:

LambdaLink
    VariableList
        Variable $PHY
        Variable $COL
    AndLink
        EvaluationLink
            PredicateNode "has color"
            ListLink
                Variable $PHY
                Variable $COL
        InheritanceLink
            Variable $PHY
            Concept "physical object"
        InheritanceLink
            Variable $COL
            Concept "color"

Of course, even the above representation is wrong, in several ways, but nit-picking it at this stage is counter-productive.

The question is: given a learned grammar, with statistics, how do we get to the DSynt or the opencog variant? Well, the now-quite-old Dekang Lin DIRT paper, and the newer-but-still-old Poon & Domingos unsupervised learning paper, show the way.

Onward ho!

Linas

Hudson, Richard

Nov 17, 2018, 6:09:24 AM
to link-g...@googlegroups.com, Linas Vepstas, Anton Kolonin, Amir P, lang-...@googlegroups.com, Andres Suarez, opencog

Hello Linas. If you leave it to the learning mechanism, aren't you inevitably going to get crossed links? To take an even simpler example, "It was raining", your learning mechanism should work out three predictions:

  • that "was" needs a subject (i.e. a preceding noun or pronoun).
  • that any form of the verb RAIN needs the pronoun "it" as its subject (as in "It rained").
  • that "was" needs (or at least accepts) an ing-form verb after it.

When you put these expectations together, you find a dependency triangle, with subject links from both verbs to "it" and a dependency from "was" to "raining". Since both of the "it" links are the same ('subject'), there's no reason for assigning them to different levels of structure (deep vs. surface), so you get a topological tangle.

Dick

-- 
Richard Hudson (dickhudson.com)


Linas Vepstas

Nov 17, 2018, 9:19:26 PM
to Richard Hudson, link-grammar, Anton Kolonin, Amir P, lang-...@googlegroups.com, Andres Suarez, opencog
Hi Dick,

Well, yes, but "it depends". What you describe is found, more or less. There are some (relatively simple) mechanical processes that generate this. Since they are mechanical, they are meant to be taken as non-judgmental, non-subjective lab instruments for examining syntax collected from nature, in the wild. Like 17th-century telescopes, they are blurry, and allow subjective interpretation. You see something, but it's not always clear what you see. Different ones give different views. There is a fairly broad selection, each of which gives different details, even as they agree on the overall structure. The good news is that they agree on the overall structure, and that this overall structure agrees with classical symbolic linguistics, at a general level; the game is now to get to the next level of detail. The details are currently too blurry to say "ah ha, this linguist was exactly right, and that one was exactly wrong". It seems likely that everyone was a little bit right, and a little bit wrong. So it goes.

Let me give a concrete example, the MST example. This is a one-page recap of Deniz Yuret's PhD thesis, circa 1998. I hope this is not too off-track.

Here, one starts with some reasonably large corpus, say Wikipedia, or Project Gutenberg. (You eventually discover that Wikipedia is very, very deficient in action verbs, like run, jump, cry, sing, sail, think. But that's for much, much later. It does, however, affect the statistics very deeply.)

One then counts the co-occurrence of word-pairs. How often is word w seen to the left of word v, in a window of size 6 or 8 or so? (Window size mostly doesn't matter much.) Call this count N(w,v). This is a "real" quantity, based on "facts"; it's a measurement of "reality". Corpus-dependent, but based on language captured in the wild.

Next: compute a magical quantity, the "point-wise mutual information", MI or PMI. I can explain/motivate why it's correct, or "best", just not here, not now. There are other possibilities too, but the other ones are less coherent; they don't quite make sense. The MI is a simple, explicit formula:

    MI(w,v) = log_2 N(w,v) N(*,*) / N(w,*) N(*,v)

where N(w,*) = sum-over-all-v N(w,v) and N(*,*) = the sum total of all word-pairs that were counted. There is a very long history, rooted in mathematics and physics and information theory, that explains what MI is, and why it is a "good thing" suitable for this task. (That is, MI has nothing to do with language: it works for chemistry too, and astronomy, etc. It's generic.)

For linguistics, MI is nice because ... when two words co-occur, it has a large value, and when they don't, it has a small (or negative) value. Typical range for MI is from minus 20 to plus 40 or so (depending on corpus size).  Examples:

     MI(Northern, Ireland) = +25
     MI(the, and) = -10
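For what it's worth, the formula is trivial to compute from raw pair counts. A minimal Python sketch (the toy counts below are invented purely for illustration; real values come from a large corpus):

```python
import math
from collections import Counter

def pmi(pair_counts):
    """MI(w,v) = log2( N(w,v) * N(*,*) / (N(w,*) * N(*,v)) )
    for every observed left-right word pair."""
    n_total = sum(pair_counts.values())        # N(*,*)
    n_left, n_right = Counter(), Counter()
    for (w, v), n in pair_counts.items():
        n_left[w] += n                         # N(w,*)
        n_right[v] += n                        # N(*,v)
    return {(w, v): math.log2(n * n_total / (n_left[w] * n_right[v]))
            for (w, v), n in pair_counts.items()}

# Invented toy counts: "Northern" and "Ireland" co-occur far more often
# than chance predicts; "the" and "and" do not.
counts = {("Northern", "Ireland"): 10,
          ("the", "dog"): 40, ("big", "dog"): 40,
          ("the", "and"): 5, ("big", "and"): 5}
scores = pmi(counts)
```

Pairs that strongly predict each other get a high MI; frequent-but-independent pairs hover near zero or go negative, which is exactly the behavior the examples above describe.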

Yuret's Ansatz: we can, and should, use MI to tell us which links in a dependency parse are the correct links. The highest-MI links are correct, in some objective sense, and the lowest-MI ones are garbage, nonsense.

The algorithm: MST, "Maximum Spanning Tree". Take a sentence. Draw an edge connecting every possible word to every other, i.e. a clique, a big tangle, and then remove the links with the lowest MI until a tree is left. (Alternatively, start with no edges at all, and add the highest-MI edge, then the second highest, etc., until you have a tree and no unconnected words.) Then declare this to be the "correct parse", brush the dust off your overalls, and call it a day. Here's what happens when you do this, plus some critiques, and how to do better:

-- Yuret does this, and finds 85% accuracy or thereabouts, vs. a hand-annotated corpus. (Which I think needs to be acknowledged as a huge success! Viz: linguists are not hallucinating; the structure is "actually there", in "true reality".)
-- Prepositions cause problems for MST.
-- During the search for the tree, you can (arbitrarily) choose to reject crossing links. Or not.
-- During the search for the tree, you can arbitrarily choose to connect all words (this might not make sense for interjections, coughs, sneezes, non-verbal hand-motions, etc.)
-- During the search for the tree, you can explicitly exclude loops (but perhaps loops are desirable, so...)
-- The above did not describe a link from "root" to head-word. (there's a way to fix this).
-- The links are unlabeled: the algo does not tell you if they are subj, obj, etc.

The last criticism is perhaps the deepest, most significant.  I claim I know exactly how to get past it. Also, I claim I know how to get past the 85% accuracy.  I will not explain in this email, though.
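For concreteness, the greedy variant of the algorithm described above can be sketched in a few lines of Python. This is a toy (the MI table in the example is made up, unseen pairs get an arbitrary low score, and a real implementation would also handle the root link and much more), but it shows the mechanics:

```python
def mst_parse(words, mi):
    """Greedy MST parse: repeatedly add the highest-MI edge that neither
    forms a cycle nor crosses an already-chosen edge, until every word
    is connected (n - 1 edges)."""
    n = len(words)
    parent = list(range(n))          # union-find, for cycle detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def crosses(a, b, edges):
        # (a,b) and (c,d) cross iff exactly one endpoint of one edge
        # lies strictly inside the span of the other.
        return any(a < c < b < d or c < a < d < b for c, d in edges)

    # Score every left-to-right word pair; unseen pairs get a low MI.
    candidates = sorted(
        ((mi.get((words[i], words[j]), -20.0), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True)

    edges = []
    for score, i, j in candidates:
        if find(i) != find(j) and not crosses(i, j, edges):
            parent[find(i)] = find(j)
            edges.append((i, j))
            if len(edges) == n - 1:
                break
    return edges

# Made-up MI values for a short sentence:
words = ["Mom", "writes", "with", "chalk"]
mi = {("Mom", "writes"): 10.0, ("writes", "with"): 8.0,
      ("with", "chalk"): 12.0}
tree = mst_parse(words, mi)
```

The no-crossing test implements the "arbitrarily choose to reject crossing links" option from the list above; delete that check and you get the unconstrained variant.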

The moral of the story:
-- One can objectively measure the existence of dependencies.
-- One has a lot of alternatives to explore (tree or loops allowed? cross or no-cross allowed? Use MI or use something else? (others have explored "something else", were less successful, but more famous. Standard story of fame and prestige in academia))
-- The MST or MST-like approaches are a way-point, not the final end-point. A step on the path.

Oh, I should mention: some of the neural-net stuff, like word2vec and GloVe, can be kind-of understood as sort-of MST-like things, if you look at them the right way. There's a lot to be said, but it does offer a bridge between the "here" of symbolic linguistics and the "there" of the deep-learning crowd, a unification of the two.

So, my ruminations about "shallow" and "deep" are more along these lines: let's accept what MST does (or some variant of it, according to taste and evidence), and call this "shallow", so that "shallow" is a way-marker on the map from here to there. So, shallow is giving us some kind of dependency parse, mostly-ish accurate, with deficiencies, but it's "unarguable" because it is based on measured statistics. Variations of the algorithm give somewhat different results, but they are all in the same ballpark.

So what's the "deep structure"? Well, it's the structure we want to actually have. Say, your life's work. Or perhaps Melcuk's MTT. Or maybe predicate-argument structure. Or Sowa's concept nets. Or some mashup of these. I don't particularly care: all I know is that it's the general direction for the next way-point on the journey.

How do we get there? Well, there has to be some relatively simple collection of formulas and algorithms that are mechanical in their action. The quality of these mechanisms will be judged on how closely they line up with the more sophisticated theories of syntax+semantics. My laboratory bench has a bunch of these mechanisms lying about. I cannot assemble them and evaluate them fast enough. I am totally certain that they will work: preliminary evidence is very good, and besides, most or all of them are already based on tricks and techniques that many others have described, and have found to be useful and successful.

To get back to your example: it's not so simple, because it includes morphology, which I did not talk about above. How can one find out that "rain", "rains", "rained" and "raining" are somehow the same word, sharing a stem, but with different suffixes? Well, there is a way to do this, but it's another, different mechanism to be bolted on. How can one discover that "it was raining" and "it rained" are vaguely synonymous? They don't even have the same word-count. Well, that is yet another mechanism, that goes elsewhere, attaching in a different way. There's no particular graph to rule them all. There's a morpho-graph that draws an edge between "rain" and "ing". There's a semantic graph that treats "wasraining" as a single unit. There's a third graph that attaches "it" to its referent. Except, in this example, "it" is a pleonastic it, referring to an implicit, non-specified imaginary place-time, rather than to some explicit word in a previous sentence. The three graphs are related, but have different functions; they illustrate different relationships.

-- Linas


Hudson, Richard

Nov 19, 2018, 4:33:13 AM
to linasv...@gmail.com, link-grammar, Anton Kolonin, Amir P, lang-...@googlegroups.com, Andres Suarez, opencog

Wow! Thanks for the teach-in, Linas. Very interesting, and you make it all as clear as it could be, I guess.

I think of your projects as an experiment whose goal is to find how far it's possible to get in learning a language using nothing but written records - a bit like deciphering a dead language simply by spotting patterns in the available corpus of texts. To some extent, your success will reflect the quality and 'depth' of the raw data, which (in the case of English texts) already reflect quite a sophisticated linguistic analysis thanks to the word spaces, the punctuation and the spelling (which distinguishes some homophones). I'm not sure what it will tell us about human language, but it will presumably tell us a lot about the limits of AI. Would you agree?

Anyway, I'm very impressed by what you and your colleagues in this field have achieved already.

Best wishes, Dick

-- 
Richard Hudson (dickhudson.com)

Linas Vepstas

Nov 26, 2018, 2:26:49 AM
to Anton Kolonin, Alexei Glushchenko, Amir P, lang-...@googlegroups.com, link-grammar

A combinatorial explosion suggests that the grammar specified in the dict files is far too loose; it needs to be stricter. So, for example, the sentence you supplied parses in a few milliseconds with the normal English dict file, with no combinatorial explosion (it does find 120 linkages).

There are several immediate work-arounds you can try:
1) Shorten the parse timeout from 30 seconds to 5 or 2 seconds (use the !timeout variable, or the -timeout option). You will still get a combinatorial explosion; it's just that it will give up after 5 seconds, instead of 30. After hitting a timeout, it goes into panic-parsing, which eats another 15 seconds; perhaps disable panic parsing.

2) Lower the max number of linkages from 1000 to maybe 30 (use the !limit variable or the -limit flag). This will have very little effect on runtime, but might help in other ways.

3) Try !use-sat. You will still get a combinatorial explosion; you just might get it more quickly.

So these are obvious work-arounds you can try right now.  In the long run, however, you need to create a dictionary that is far more strict in what it accepts as valid grammar.

But of course, this is the whole point of the language-learning project: a loose grammar that accepts anything is not much of a grammar, and a strict grammar that understands nothing is useless. There is a very narrow middle window: a grammar that describes language as it is actually written (or spoken), no less and no more.
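As a back-of-the-envelope illustration (mine, not from the thread) of why a grammar that accepts anything explodes: if every word may link to every other word, the number of candidate planar diagrams grows roughly like the Catalan numbers, i.e. on the order of 4^n:

```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number: counts the non-crossing ways to pair up
    2n points -- a rough stand-in for the planar linkage diagrams
    allowed by a maximally loose ("anything links to anything") grammar."""
    return comb(2 * n, n) // (n + 1)

# Growth is ~4**n / n**1.5: already billions of candidates at n = 20.
for n in (5, 10, 20):
    print(n, catalan(n))  # 5 -> 42, 10 -> 16796, 20 -> 6564120420
```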

--linas



On Mon, Nov 26, 2018 at 12:53 AM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:
Hi Amir,

We have found that the same corpus may be parsed with the same LG
version 5.5.1 either in 19/53 minutes or hang "forever", depending on
the nature of the machine-generated dict file (obtained with our
unsupervised learning pipeline).

We have identified that the "forever" hangs happen due to the
"combinatorial explosions" that you explained in an earlier issue:
https://github.com/opencog/link-grammar/issues/798

Now we are implementing the work-around that you suggested, skipping
the sentences causing the "combinatorial explosions"; however, we would
like to ask you whether this is really the case.

For example, using the dictionary
http://langlearn.singularitynet.io/data/aglushchenko_parses/test-lg-5.1.1/dict.tar.gz
while parsing the file
http://langlearn.singularitynet.io/data/aglushchenko_parses/test-lg-5.1.1/corpus.tar.gz
there is "combinatorial explosion" on the sentence:
"Now not far from the music master's house there dwelt a lady who
possessed a most lovely little pussy cat called Koma."
There are many more combinatorial explosions in the same file.

Do you think this is expected?

If so, do you think the best solution is to skip the sentence, or to
introduce costs in the LG dictionary?

Note that, using other machine-generated dictionaries, the whole following
batch of files, including the file mentioned above, is processed decently
fast (19 or 53 minutes, see details below):
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/

Thanks,
-Anton


23.11.2018 11:41, Anton Kolonin:
>
>
> 22.11.2018 20:50, Alexey Glushchenko:
>> There are two more in Gutenberg-Children-Books-500-disjuncts-2018-10-31:
>>
>> Gutenberg-Children-Books-Caps-20-clusters-2018-10-31
>> Gutenberg-Children-Books-Caps_LG-ANY-all-parses-agm-opt_cALEd_no-LW_no-RW_no-gen
>> - complete (11hrs 28 min 49 sec)
>
>
> Maximum disjunct length    16
> dict_20C_2018-10-31_0006.4.0.dict    2018-10-31 15:47    198K
> Average sentence parse:             59.63%
> Recall:         31.05%
> Precision:     37.99%
> F1:         34.17%
>
>
>> Gutenberg-Children-Books-Caps_LG-English_cALEd_no-LW_no-RW_no-gen -
>> complete (12 min 48 sec)
>
> Maximum disjunct length    10
> dict_20C_2018-10-31_0006.4.0.dict    2018-10-31 10:32    182K
> Average sentence parse:             55.29%
> Recall:         29.55%
> Precision:     34.44%
> F1:         31.81%
>
>
>>     Gutenberg-Children-Books-Caps-10-clusters-1000-disjuncts-2018-10-29_
>>
>>     
>> Gutenberg-Children-Books-Caps_LG-ANY-all-parses-agm-opt_cALEd_no-LW_no-RW_no-gen
>>
>>     - in progress since Nov 20 02:10 (server time)
>>
>
> Maximum disjunct length    16
> dict_10C_2018-10-29_0006.4.0.dict    2018-10-29 18:51    229K
>
>
>>     Gutenberg-Children-Books-Caps_LG-English_cALEd_no-LW_no-RW_no-gen  -
>>     in progress since Nov 20 02:10 (server time)
>>
>
> Maximum disjunct length    10
> dict_10C_2018-10-29_0006.4.0.dict    2018-10-29 16:15    231K
>
>
>>     Gutenberg-Children-Books-Caps-20-clusters-1000-disjuncts-2018-10-29_
>>
>>     
>> Gutenberg-Children-Books-Caps_LG-ANY-all-parses-agm-opt_cALEd_no-LW_no-RW_no-gen
>>
>>     - in progress since Nov 20 02:10 (server time)
>>
>
> Maximum disjunct length    16
> dict_20C_2018-10-31_0006.4.0.dict    2018-10-31 12:00    265K
>
>
>>     Gutenberg-Children-Books-Caps_LG-English_cALEd_no-LW_no-RW_no-gen -
>>     complete (53 min 55 sec)
>>
>
> Maximum disjunct length    10
> dict_20C_2018-10-29_0006.4.0.dict    2018-10-29 18:21    283K
> Average sentence parse:             58.19%
> Recall:         32.07%
> Precision:     35.23%
> F1:         33.58%
>
>>
>>     Gutenberg-Children-Books-Caps-50-clusters-1000-disjuncts-2018-10-29_
>>       Gutenberg-Children-Books-Caps_LG-English_cALEd_no-LW_no-RW_no-gen
>>     - complete (19 min 40 sec)
>>
>
> Maximum disjunct length    10
> dict_50C_2018-10-29_0006.4.0.dict    2018-10-29 18:29    517K
> Average sentence parse:             44.35%
> Recall:         21.89%
> Precision:     32.01%
> F1:         26.00%
--
You received this message because you are subscribed to the Google Groups "lang-learn" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lang-learn+...@googlegroups.com.
To post to this group, send email to lang-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lang-learn/f07cfe6b-1459-c163-0788-ee4ed61fdb04%40gmail.com.

For more options, visit https://groups.google.com/d/optout.



Linas Vepstas

unread,
Nov 26, 2018, 3:03:10 AM11/26/18
to Anton Kolonin, Alexei Glushchenko, Amir P, lang-...@googlegroups.com, link-grammar
Oh, I should add: hangs/infinite time should be impossible. If you can isolate the sentence and the dictionary leading to an infinite hang, that can very clearly be fixed.

The maximum time should be the number of sentences times 90 seconds. The normal timeout is 30 seconds, the panic timeout is 60 seconds; the worst case is that you hit one, and then the other.
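Spelling out that bound (a small sketch of the arithmetic above, using the 30 s + 60 s figures quoted):

```python
# Worst case per sentence: normal timeout (30 s), then panic timeout (60 s).
NORMAL_TIMEOUT_S = 30
PANIC_TIMEOUT_S = 60

def worst_case_hours(num_sentences: int) -> float:
    """Upper bound on total batch parse time, in hours."""
    return num_sentences * (NORMAL_TIMEOUT_S + PANIC_TIMEOUT_S) / 3600.0

print(worst_case_hours(1000))  # a 1000-sentence batch is bounded by 25.0 hours
```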

--linas

Linas Vepstas

unread,
Nov 26, 2018, 3:18:33 AM11/26/18
to Alexei Glushchenko, Amir P, lang-...@googlegroups.com, link-grammar, Anton Kolonin


On Mon, Nov 26, 2018 at 1:54 AM Alexey Glushchenko <ale...@mail.ru> wrote:
!use-sat does not help much. In one case it smashes the stack:

Huh. OK, that's new, and interesting. And not very fixable. The current implementation uses minisat, which, it would seem, is using the stack for recursion, and is recursing so deeply that it is blowing up. We've tried other SAT solvers; they work, but I don't know whether they would also use the stack. If you feel like killing time, you can try using minisat2. I forget which other ones were tried.
 

while in another case it can not parse with null links:

Ah, right. This is a known limitation; we've never gotten around to trying to fix it, since it seemed low-urgency. A "null link" arises in a sentence for which there is a linkage, but only if the parser ignores one of the words. It's easy for the regular parser to ignore a word, but the SAT solver would have to try ignoring each one, one at a time, and that would be painful. Sentences can have 2, 3, or more null links.
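A toy sketch (my own, not LG code) of why null links are painful for an all-or-nothing solver: to allow up to k ignored words, it has to retry every combination of dropped words, which is combinatorial in k:

```python
from itertools import combinations

def parses(words):
    """Stand-in for an all-or-nothing solver: here, the sentence 'parses'
    iff it does not contain the unknown token 'xyzzy'. The point is only
    that the solver either finds a complete linkage or finds nothing."""
    return "xyzzy" not in words

def parse_with_nulls(words, max_nulls=2):
    """Retry the solver, dropping every combination of up to max_nulls
    words -- combinatorial in max_nulls, hence 'painful'."""
    for k in range(max_nulls + 1):
        for drop in combinations(range(len(words)), k):
            kept = [w for i, w in enumerate(words) if i not in drop]
            if parses(kept):
                return kept, k
    return None

print(parse_with_nulls("hello xyzzy world".split()))  # (['hello', 'world'], 1)
```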

--linas 

Alexey Glushchenko

unread,
Nov 26, 2018, 3:22:22 AM11/26/18
to linasv...@gmail.com, Amir P, lang-...@googlegroups.com, link-grammar, Anton Kolonin
!use-sat does not help much. In one case it smashes the stack:

```

$ echo "Now not far from the music master's house there dwelt a lady who possessed a most lovely little pussy cat called Koma." | link-parser dict_10C_2018-10-29_0006/ -timeout=1 -postscript=1 -graphics=0 -verbosity=1 -use-sat=1
timeout set to 1
postscript set to 1
graphics set to 0
verbosity set to 1
use-sat set to 1
link-grammar: Info: Dictionary found at ./dict_10C_2018-10-29_0006/4.0.dict
link-grammar: Info: Dictionary version 0.0.6, locale en_US.UTF-8


link-grammar: Info: Library version link-grammar-5.5.1. Enter "!help" for help.

*** stack smashing detected ***: link-parser terminated
Aborted

```


while in another case it can not parse with null links:

```

$ echo "Hello world!" | link-parser dict_10C_2018-10-29_0006/ -timeout=1 -postscript=1 -graphics=0 -verbosity=1 -use-sat=1
timeout set to 1
postscript set to 1
graphics set to 0
verbosity set to 1
use-sat set to 1
link-grammar: Info: Dictionary found at ./dict_10C_2018-10-29_0006/4.0.dict
link-grammar: Info: Dictionary version 0.0.6, locale en_US.UTF-8


link-grammar: Info: Library version link-grammar-5.5.1. Enter "!help" for help.

No complete linkages found.
link-grammar: Info: use-sat: Cannot parse with null links (yet).
Set the "null" option to 0 to turn off parsing with null links.
link-grammar: Info: Freeing dictionary dict_10C_2018-10-29_0006/4.0.dict
link-grammar: Info: Freeing dictionary dict_10C_2018-10-29_0006/4.0.affix
Bye.

```

Monday, November 26, 2018, 16:26 +09:00, from Linas Vepstas <linasv...@gmail.com>:




--
Alexey Glushchenko