Hi Linas,
>I'd call it "interesting", but maybe not "golden"
These are randomly selected sentences from "Gutenberg Children" corpus:
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
"Gutenberg Children silver standard" is LG-English parses:
"Gutenberg Children gold standard" is subset of "silver standard" with semi-random selection of sentences skipping direct speech and doing manual verification of the links.
So as long as we are training on "Gutenberg Children" corpus, having the test on the same "Gutenberg Children" seems reasonable, right?
But thanks, we may put more effort into the removal of archaic constructions and words, even if they are present in the corpus.
>Anyway -- you only indicate pair-wise word-links. Is the omission of disjuncts intentional?
If you have all the links in a sentence, you can construct all of the disjuncts with no ambiguity, correct?
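For what it's worth, here is a minimal sketch of that reconstruction (not our actual pipeline code; the link format and connector ordering are simplified, and the link labels are invented for the example):

```python
# Sketch: rebuild each word's disjunct from the full set of labelled
# word-pair links of one sentence. A link to the right becomes a "+"
# connector on the left word; a link to the left becomes a "-" connector
# on the right word.

def disjuncts_from_links(num_words, links):
    """links: list of (left_index, right_index, label) with left < right.
    Returns one disjunct string per word."""
    disjuncts = [[] for _ in range(num_words)]
    for l, r, label in sorted(links):
        disjuncts[l].append(label + "+")  # right-going connector on the left word
        disjuncts[r].append(label + "-")  # left-going connector on the right word
    return [" & ".join(d) if d else "()" for d in disjuncts]

# "the cat ran": the--cat (D), cat--ran (S)
print(disjuncts_from_links(3, [(0, 1, "D"), (1, 2, "S")]))
# → ['D+', 'D- & S+', 'S-']
```

So given the complete link set, each word's disjunct falls out deterministically, which is the point above.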
>Also -- no hint of any word-classes or part-of-speech tagging? This is surely important to evaluate as well, or is this to be done in some other way? i.e. to evaluate if "Pivi" was correctly clustered with other given names? Or that lama/llama was clustered with other four-legged animals?
We don't have that in MST-Parsing, right? We need this corpus to assess the quality of the MST-Parsing so we don't need part-of-speech information for that.
The clustering is able to do that anyway - see the graphs at the end of last year's report:
>Also -- I can't tell -- is it free of loops, or are loops allowed? Allowing loops tends to provide stronger, more accurate parses. Loops act as constraints.
Loops and crossing links are not allowed in the MST-Parser now. If we allowed them in the test corpus, how would that make the assessment of MST parses better?
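Just to make the two constraints concrete, here is a minimal sketch (not the parser code) of checking a link set for both of them, with links given as (i, j) word-index pairs, i < j:

```python
# Two checks the current MST-Parser output must pass:
# 1) projectivity: no two links cross;
# 2) acyclicity: the links form a forest, no loops.

def is_projective(links):
    """Links (a, b) and (c, d) cross iff a < c < b < d."""
    for a, b in links:
        for c, d in links:
            if a < c < b < d:
                return False
    return True

def is_acyclic(links):
    """Union-find: merging two indices already in one set closes a loop."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # this link would close a loop
        parent[ra] = rb
    return True

print(is_projective([(0, 2), (1, 3)]))       # → False (the links cross)
print(is_acyclic([(0, 1), (1, 2), (0, 2)]))  # → False (a loop)
```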
Note that we ARE working on MST parses now, according to Ben's directions.
We have your MST-Parser-less idea on the map but we are NOT trying it now:
https://github.com/singnet/language-learning/issues/170
We may try it after we explore accounting for costs:
https://github.com/singnet/language-learning/issues/183
Thanks,
-Anton
24.03.2019 9:24, Linas Vepstas wrote:
Also, BTW, link-grammar cannot parse "I just stood there, my hand on the knob, trembling like a leaf." correctly. It is one of a class of sentences it does not know about. Which is maybe OK, because ideally, the learned grammar will be able to do this. But today, LG cannot.
--linas
On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <linasv...@gmail.com> wrote:
Anton,
It's certainly an unusual corpus, and it might give you rather low scores. I'd call it "interesting", but maybe not "golden". Although I suppose it depends on your training corpus. Here are some problems that pop out:
First sentence --"the old beast was whinnying on his shoulder" -- the word "whinnying" is a fairly rare English verb -- you could read half-a-million wikipedia articles, and not see it once. You could read lots of 19th-century or early-20th century cowboy/adventure novels, (like what you'd find on Project Gutenberg) and maybe see it some fair amount. Even then -- to "whinny on a shoulder" seems bizarre.. I guess he's hugging the horse? How often does that happen, in any cowboy novel? "to whinny on something" is an extremely rare construction. It will work only if you've correctly categorized "whinny" as a verb that can take a preposition. Are your clustering algos that good, yet, to correctly cluster rare words into appropriate verb categories?
Second sentence .. "Jims" is a very uncommon name. Frankly, I've never heard of it as a name before. Your training data is going to be extremely slim on this. And lack of training data means poor statistics, which means low scores. Unless -- again, your clustering code is good enough to place "Jims" in a "proper name" cluster...
"the lama snuffed blandly" -- "snuffed" is a very uncommon, almost archaic verb. These days, everyone spells llama with two ll's not one. Unless your talking about Buddhist monks, its a typo.
"you understand?" is .. awkward. Common in speech, uncommon in writing. Unlikely that you'll have enough training data for this.
"Willard" is an uncommon name. Does your training corp[us have a sufficient number of mentions of Willard? Do you have clustering working well enough to stick "Willard" into a cluster with other names?
"it is so with Sammy Jay" is clearly archaic English.
"he hasn't any relations here" is clearly archaic, an olde-fashioned construction.
"Pivi said not one word" - again, a clearly old-fashioned construction. Does the training set contain enough examples of "Pivi" to recognize it as a name? Are names clustering correctly?
Any sentence with an inversion is going to sound old-fashioned. All of the sentences in that corpus sound old-fashioned. Which maybe is OK if you are training on 19th century Gutenberg texts .. but it's certainly not modern English. Even when I was a child, and I read those old crumbly-yellow paper adventure books, part of the fun was that no one actually talked that way -- not at school, not at home, not on TV. It was clearly from a different time and place -- an adventure.
Anyway -- you only indicate pair-wise word-links. Is the omission of disjuncts intentional? Also -- no hint of any word-classes or part-of-speech tagging? This is surely important to evaluate as well, or is this to be done in some other way? i.e. to evaluate if "Pivi" was correctly clustered with other given names? Or that lama/llama was clustered with other four-legged animals?
Also -- I can't tell -- is it free of loops, or are loops allowed? Allowing loops tends to provide stronger, more accurate parses. Loops act as constraints.
-- Linas
On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <akol...@gmail.com> wrote:
Hi Linas, Andes, and whoever understands both LG and English well enough.
Attached are the first 100 sentences of the GC "gold standard" - manually checked based on LG parses.
We are expecting more to come in the next two weeks.
To enable that, please do a cursory review of the corpus and let us know if any corrections are still needed, so your corrections can be used as a reference to fix the rest and keep going further.
Thank you,
-Anton
You received this message because you are subscribed to the Google Groups "lang-learn" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lang-learn+...@googlegroups.com.
To post to this group, send email to lang-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com.
For more options, visit https://groups.google.com/d/optout.
--
cassette tapes - analog TV - film cameras - you
--
-Anton Kolonin
skype: akolonin
cell: +79139250058
akol...@aigents.com
https://aigents.com
https://www.youtube.com/aigents
https://www.facebook.com/aigents
https://medium.com/@aigents
https://steemit.com/@aigents
https://golos.blog/@aigents
https://vk.com/aigents
Ben, Linas,
>But we know that MST parsing is shit. Stop wasting time on MST or trying to "improve" it.
I think that sounds like a kind of support for the concept of "dumb explosive parsing" advocated 1+ year ago:
I also agree with Linas's reasoning in this thread. I would consider giving it a try starting next month if we don't have a breakthrough with DNN-MI-milking-based MST-Parsing by that time.
> can be done generically, and not just on language
I think everyone in bio-informatics dreams of extracting secrets of "dark side of the genome" with something like that ;-)
Cheers,
-Anton
28.03.2019 1:24, Linas Vepstas wrote:
To view this discussion on the web visit https://groups.google.com/d/msgid/lang-learn/CAHrUA36dE5ihtcCaqPv_q4qgmbEy-yX6kTkUHyLZmjk6d4VfOg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Linas Vepstas wrote:
>... knowledge extraction can be done generically, and not just on language.
If link grammar were Turing complete, this might be possible right away.
But somehow, I suspect... Isn't this why OpenCog has the "unified rule engine" (URE) instead of link grammar at its core? And with URE, things get much more complicated. I'm sorry, but that is still a Gordian knot to me, given my modest knowledge.
On the other hand, if someone really smart were to provide automatic grammar extraction by means of an unrestricted grammar, I believe that would be it.
Hi Linas, I like this thread more and more :-)
>But somehow, I suspect... Isn't this why OpenCog has "unified rule engine" (URE) instead of link grammar at its core,
Linas, approaching the "extraction of phrasemes" goal was discussed exactly in terms of MST->GL->URE last fall in Hong Kong: https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit
That is:
1) Do MST-parsing to get word links (proto-disjuncts)
2) Do Grammar Learning to cluster words and derive word categories and rules with disjuncts
3) Do URE-kind-of-thing to build the rules into "phrasemes" or "sections" or "patterns".
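To make step 1 concrete, here is a toy sketch, assuming we already have mutual-information (MI) scores for word pairs: greedily keep the highest-MI links that neither close a loop nor cross an already-chosen link (a Kruskal-style maximum spanning tree with a projectivity filter). The MI numbers below are made up; this is not the pipeline's actual algorithm, just an illustration of the idea.

```python
# Toy MST parse: pick word-pair links in decreasing MI order, rejecting
# any link that would create a loop or cross an already-accepted link.

def mst_parse(num_words, mi):
    """mi: dict {(i, j): score} with i < j. Returns the chosen links."""
    parent = list(range(num_words))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    chosen = []
    for (i, j), _ in sorted(mi.items(), key=lambda kv: -kv[1]):
        if find(i) == find(j):
            continue  # would create a loop
        if any(a < i < b < j or i < a < j < b for a, b in chosen):
            continue  # would cross an existing link
        parent[find(i)] = find(j)
        chosen.append((i, j))
    return sorted(chosen)

# "the cat ran" with invented MI scores:
mi = {(0, 1): 3.2, (1, 2): 2.7, (0, 2): 1.1}
print(mst_parse(3, mi))  # → [(0, 1), (1, 2)]
```

The output of step 1 in this picture is exactly the link set from which step 2 collects proto-disjuncts.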
However, your current discourse and our current results just show that "no one is able to do reasonable MST-parsing", so the above is just a waste of time, correct?
As we speak, Ben, Alexey, Sergey and Asuares are trying to use DNN/BERT magic to do trick 1.
To my mind, that may become possible only if the DNN/BERT magic does the trick with steps 2 and 3 done under the hood. If so, we don't need to do 2 and 3 after we have the DNN/BERT-based model, because we can simply "milk out" the grammar rules from the DNN/BERT mycelium. And we don't need the ULL either, by the way, because we just need DNN/BERT and rows of different sorts of milking machines around it.
So, instead of solving the problem of constructing a pipeline for learning grammar from raw text, we need to solve the problem of milking the grammar out of a DNN/BERT model trained on those texts, right?
However, either way, we need to understand the algorithmic machinery of how links assemble into disjuncts and disjuncts assemble into sections, through the universe-scale combinatorial explosion.
And I agree that clustering and categorizing words and links (and then disjuncts and sections, right?) is part of the process - explicitly in the ULL pipeline or implicitly deep in the DNN/BERT darkness.
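As a minimal illustration of the explicit clustering idea (the words, disjunct labels and counts below are invented): represent each word by the counts of disjuncts it was observed with, and group words whose count vectors are similar.

```python
# Greedy single-pass clustering of words by cosine similarity of their
# disjunct-count vectors. A toy sketch, not the actual GL clustering.
import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(words, threshold=0.9):
    """words: dict {word: {disjunct: count}}. Each word joins the first
    cluster whose seed word is similar enough, else starts a new one."""
    clusters = []
    for w, vec in words.items():
        for c in clusters:
            if cosine(vec, words[c[0]]) >= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

words = {
    "cat": {"D- & S+": 12, "D- & O-": 5},
    "dog": {"D- & S+": 10, "D- & O-": 6},
    "ran": {"S-": 9, "S- & MV+": 4},
}
print(cluster(words))  # → [['cat', 'dog'], ['ran']]
```

The same vector-space view extends from words to disjuncts and sections; whether a DNN/BERT model learns an equivalent of it implicitly is exactly the open question here.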
Cheers,
-Anton
01.04.2019 9:17, Linas Vepstas wrote:
Hi,
On 4/1/19 5:17 AM, Linas Vepstas wrote:
MOSES, URE and OpenPsi ...
Linas, what is a good starting point to understand what you're trying
to accomplish?
Here?
https://github.com/opencog/atomspace/blob/master/opencog/sheaf/README.md
Hi Linas,
Are you saying that "while the ULL team has found a strong linear correlation between A) quality (F1) of the input parses and B) quality (F1) of the output parses based on the grammar learned from the input parses, this phenomenon is due to the fact that they test on the entire input corpus, so it should go away once they test on a gold-standard corpus consisting only of sentences with high-frequency words"?
If so, I hope we will have this premise verified instrumentally.
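For reference, the F1 mentioned above can be computed by treating each parse as a set of word-index links and scoring predicted links against the gold-standard links. A sketch, with invented link sets:

```python
# F1 over parses: precision and recall of predicted links vs gold links.

def parse_f1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # links present in both parses
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 1), (1, 2), (2, 4), (3, 4)]
pred = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(round(parse_f1(gold, pred), 2))  # → 0.75
```

Measuring this same score once on the full corpus and once on the high-frequency-word gold standard would be one way to verify the premise instrumentally.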
Best regards,
-Anton
02.04.2019 5:38, Linas Vepstas wrote:
OK, there's clearly a lot of work happening in linguistics these days that I have fallen behind on reading.
>I don't understand what that "something"
Hi Linas, last year's paper is here:
http://langlearn.singularitynet.io/data/docs/
This year's paper draft is attached.
Cheers,
-Anton
03.04.2019 0:09, Linas Vepstas wrote: