Bootstrap seed for a grammar in natural language


Amirouche Boubekki

Feb 11, 2020, 10:16:40 AM
to opencog
I am wondering whether there is existing material about how to
bootstrap an LG-like dictionary using a seed of natural-language
elements: grammar, words, punctuation...

The idea is to use such a seed to teach the program the rest of the
full grammar through almost-natural conversations.
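
To make the idea a bit more concrete, here is a toy sketch of my own (not an existing OpenCog or LG component): start from a few hand-written word classes and sentence frames, and guess the class of an unknown word from the slot it fills. A real seed would of course use LG-style connectors rather than flat frames.

    # Toy illustration of bootstrapping from a seed vocabulary: the word
    # classes and the single frame below are invented for this example.
    SEED_CLASSES = {
        "det":  {"the", "a"},
        "noun": {"dog", "cat", "ball"},
        "verb": {"chased", "saw"},
    }

    # One sentence frame; a real LG-like seed would use connectors instead.
    FRAME = ["det", "noun", "verb", "det", "noun"]

    def classify_unknown(sentence):
        """Return {word: guessed_class} for words the seed does not know yet."""
        words = sentence.lower().rstrip(".").split()
        if len(words) != len(FRAME):
            return {}
        guesses = {}
        for word, slot in zip(words, FRAME):
            known = any(word in members for members in SEED_CLASSES.values())
            if not known:
                guesses[word] = slot
        return guesses

    print(classify_unknown("The dog chased a squirrel."))  # {'squirrel': 'noun'}
    print(classify_unknown("A cat saw the mouse."))        # {'mouse': 'noun'}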

Linas Vepstas

Feb 11, 2020, 12:45:44 PM
to opencog, link-grammar
Salut Amirouche,

What you describe was/is the goal of the language-learning project. It is stalled because there is no easy way to evaluate whether it is making forward progress and learning a good grammar, or just learning junk.

The proposed solution is to create "random" grammars, so that what the system learns can be compared to a precise, exactly-known grammar. The only problem is that generating a corpus of sentences drawn from a given grammar is surprisingly hard (i.e., it is not an afternoon project, or even a one-week project).
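
To be clear about where the difficulty lies: for a plain context-free grammar, a naive sentence sampler is only a few lines, as in the toy sketch below (grammar invented for illustration). The hard part is doing the analogous thing for a link-grammar-style dictionary, with realistic vocabulary sizes and disjunct distributions.

    # Toy sampler: draw random sentences from a small, known context-free grammar.
    import random

    GRAMMAR = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "VP":  [["V", "NP"], ["V"]],
        "Det": [["the"], ["a"]],
        "N":   [["dog"], ["cat"], ["ball"]],
        "V":   [["chased"], ["slept"], ["saw"]],
    }

    def sample(symbol="S"):
        """Recursively expand a symbol into a list of terminal words."""
        if symbol not in GRAMMAR:          # terminal
            return [symbol]
        expansion = random.choice(GRAMMAR[symbol])
        return [word for part in expansion for word in sample(part)]

    random.seed(1)
    for _ in range(3):
        print(" ".join(sample()))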

I would love to work on this, but well, good old capitalistic considerations are currently blocking my efforts.

--linas



--
cassette tapes - analog TV - film cameras - you

Adrian Borucki

Feb 12, 2020, 6:18:42 AM
to opencog


On Tuesday, 11 February 2020 18:45:44 UTC+1, linas wrote:
> Salut Amirouche,
>
> What you describe was/is the goal of the language-learning project. It is stalled because there is no easy way to evaluate whether it is making forward progress and learning a good grammar, or just learning junk.
>
> The proposed solution is to create "random" grammars, so that what the system learns can be compared to a precise, exactly-known grammar. The only problem is that generating a corpus of sentences drawn from a given grammar is surprisingly hard (i.e., it is not an afternoon project, or even a one-week project).

Is using English as the target too limited (overfitting)? There are datasets for grammatical tasks like part-of-speech tagging; the quality of the grammar could be judged by testing performance on those tasks.
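
To make that concrete, the scoring itself is simple; for example (toy data, invented for illustration), map each induced word cluster to its most frequent gold POS tag and measure many-to-one accuracy:

    # Many-to-one mapping accuracy of an induced clustering against gold POS tags.
    # The gold tags and cluster ids below are made-up toy data.
    from collections import Counter, defaultdict

    gold    = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "NOUN"]
    induced = [0,      1,      2,      0,     1,      1,      1]      # learner's clusters

    by_cluster = defaultdict(Counter)
    for tag, cluster in zip(gold, induced):
        by_cluster[cluster][tag] += 1

    # Each cluster is credited with its most frequent gold tag.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    print(correct / len(gold))   # 6/7 for this toy example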



Ben Goertzel

Feb 12, 2020, 6:34:06 AM
to opencog, link-grammar
Linas, we haven't been discussing this much on public lists, but Andres Suarez and I are actually making some interesting progress on the language-learning project lately... we are both here in HK, so most of the discussion is F2F... we are using transformer-NN language models as "sentence probability oracles" (they can estimate the probability of a sentence according to a language model) and using these oracles to estimate the probabilities of various sentences proposed by the symbolic grammar-rule and POS-learning algorithms... will have more to say on this in the coming months, but it's looking like a pretty cool example of neural-symbolic methodology.
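
Roughly, such an oracle can be as simple as summing token log-probabilities under a pretrained causal language model. A minimal sketch, assuming the HuggingFace transformers package and GPT-2 (an illustration of the idea only, not a description of the actual setup):

    # Score a sentence by its total token log-probability under GPT-2.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def sentence_log_prob(sentence):
        """Sum of token log-probabilities; higher means more plausible."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids the model returns mean cross-entropy per predicted token.
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.shape[1] - 1)

    print(sentence_log_prob("The dog chased the cat."))
    print(sentence_log_prob("Dog the the chased cat."))   # should score lower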



--
Ben Goertzel, PhD
http://goertzel.org

“The only people for me are the mad ones, the ones who are mad to
live, mad to talk, mad to be saved, desirous of everything at the same
time, the ones who never yawn or say a commonplace thing, but burn,
burn, burn like fabulous yellow roman candles exploding like spiders
across the stars.” -- Jack Kerouac

Linas Vepstas

Feb 12, 2020, 1:57:41 PM
to opencog, link-grammar
On Wed, Feb 12, 2020 at 5:18 AM Adrian Borucki <gent...@gmail.com> wrote:



> Is using English as the target too limited (overfitting)?

No, not at all.

> There are datasets for grammatical tasks like part-of-speech tagging; the quality of the grammar could be judged by testing performance on those tasks.

That's what we thought too, initially, and it turns out that it doesn't work. Sometimes the problems are minor: the datasets contain errors, which is annoying. More difficult are issues surrounding small grammars and small corpora, e.g. "child-directed speech" (CDS). One can naively think that CDS is just adult-English grammar with a smaller vocabulary, but there are hints that, mathematically, that is just not true. It becomes particularly obvious when the training corpus is so small that you can run the calculations with pencil and paper: the probabilities really come out quite different.
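
For instance, a corpus of three toy sentences is already enough to compute word-pair statistics exactly and check them by hand (a sketch only; the real pipeline's statistics are much richer than adjacent-pair counts):

    # Exact pointwise mutual information of adjacent word pairs on a tiny corpus.
    from collections import Counter
    from math import log2

    corpus = ["the dog ran", "the cat ran", "a dog slept"]

    pair_counts = Counter()
    word_counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))   # adjacent pairs only

    n_pairs = sum(pair_counts.values())
    n_words = sum(word_counts.values())

    def mi(left, right):
        """Pointwise mutual information of an adjacent word pair."""
        p_pair = pair_counts[(left, right)] / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        return log2(p_pair / (p_left * p_right))

    print(mi("the", "dog"))   # small enough to verify with pencil and paper
    print(mi("dog", "ran"))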

One soon trips over other problems too. Some are at the front end: capitalization, punctuation, anything to do with tokenization (imagine irregular French verbs!). Others are near the back end: multiple meanings, synonyms, and so on. One sees many interesting things that are both correct and wrong at the same time: it turns out, for example, that "houses" and "cities" are quite similar: both can be entered, both provide shelter, both can be a home... and yet a house is not a city. I saw many unexpected but interesting groupings like this, reminiscent of corpus-linguistics results.

Oh, and there are corpora problems: Wikipedia lacks "action verbs" (run, jump, kick, punch), since Wikipedia describes things (X is a Y which Z when W). It also has vast amounts of foreign words, product names, geographical names, and obscure terminology: words which appear only once or twice, ever.

Project Gutenberg is mostly 19th- and early-20th-century English, which is very unlike modern English. We can read it and understand it, but there are many quite strange and unusual sentences in there which no one would ever say today. Jane Austen's Pride and Prejudice is a good example.

It became clear that existing corpora and datasets are almost useless for evaluating quality. I want to be able to control for the size of the vocabulary, the size of the grammar, the density and distribution of different parts of speech, and, most importantly, the distribution of different meanings ("I saw": "to see" vs. "to cut"), and then evaluate the algorithms as each of these parameters is changed.

The idea is that the learning system is like a transducer, grammar -> corpus -> grammar, so it's like a microphone, or a high-fidelity stereo system: we want to evaluate how well it works. Of course, you can "listen to it" and "see if it sounds good", but really, I'd rather measure and be precise, and for that we need inputs that can be controlled, so that we can see how the system responds as we "turn the knob".
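
One simple way to put a number on that fidelity (just an illustration, not a settled metric for the project) is pairwise F1 between the reference word-class partition and the learned one:

    # Pairwise precision/recall/F1 between a reference partition of words into
    # classes and a learned partition. Both partitions below are toy data.
    from itertools import combinations

    reference = {"the": "det", "a": "det", "dog": "noun", "cat": "noun", "ran": "verb"}
    learned   = {"the": 0,     "a": 0,     "dog": 1,      "cat": 2,      "ran": 2}

    def same_class_pairs(assignment):
        words = sorted(assignment)
        return {(w1, w2) for w1, w2 in combinations(words, 2)
                if assignment[w1] == assignment[w2]}

    ref_pairs = same_class_pairs(reference)
    hyp_pairs = same_class_pairs(learned)

    precision = len(ref_pairs & hyp_pairs) / len(hyp_pairs)
    recall    = len(ref_pairs & hyp_pairs) / len(ref_pairs)
    f1 = 2 * precision * recall / (precision + recall)
    print(precision, recall, f1)   # 0.5 0.5 0.5 for this toy data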

-- Linas




Amirouche Boubekki

Feb 19, 2020, 9:59:18 AM
to link-grammar, opencog
I am hopeful that the language-learning project will indeed work.

Here is a small description I found of what I have in mind:
http://learnthesewordsfirst.com/about/what-is-a-multi-layer-dictionary.html



--
Amirouche ~ https://hyper.dev