Re: [opencog-dev] Bootstrap seed for a grammar in natural language


Linas Vepstas

Feb 11, 2020, 12:45:44 PM
to opencog, link-grammar
Salut Amirouche,

What you describe was/is the goal of the language-learning project. It is stalled because there is no easy way to evaluate whether it is making forward progress and learning a good grammar, or learning junk.

The proposed solution is to create "random" grammars, and then compare what the system learned to the precise, exactly-known grammar.  The only problem here is that generating a corpus of sentences drawn from a given grammar is surprisingly hard (i.e., it is not an afternoon project, or even a one-week project).
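
To make that concrete, here is a toy sketch in Python of the "known grammar -> corpus" direction. Caveats: the grammar below is hand-written rather than randomly generated, and a real Link Grammar dictionary uses typed connectors, not rewrite rules; this only illustrates drawing sentences from an exactly-known grammar.

import random

random.seed(42)  # fixed seed, so the corpus drawn from the "known" grammar is reproducible

# Terminals: parts of speech mapped to tiny vocabularies.
VOCAB = {
    "N": ["dog", "cat", "house"],
    "V": ["sees", "enters", "chases"],
    "D": ["the", "a"],
}

# Rewrite rules standing in for an exactly-known grammar.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP"], ["V"]],
}

def generate(symbol="S"):
    # Terminal: pick a word.  Nonterminal: pick a rule and recurse.
    if symbol in VOCAB:
        return [random.choice(VOCAB[symbol])]
    expansion = random.choice(RULES[symbol])
    return [word for part in expansion for word in generate(part)]

for _ in range(5):
    print(" ".join(generate()))

The hard part, of course, is doing this at scale, with realistic Zipfian word distributions and link-grammar-style disjuncts, which is why it's not an afternoon project.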

I would love to work on this, but well, good old capitalistic considerations are currently blocking my efforts.

--linas

On Tue, Feb 11, 2020 at 9:16 AM Amirouche Boubekki <amirouche...@gmail.com> wrote:
I am wondering whether there is existing material about how to
bootstrap an LG-like dictionary using a seed of natural-language
elements: grammar, words, punctuation....

The idea is to use such a seed to teach the program more about the
full grammar using almost natural conversations.



--
cassette tapes - analog TV - film cameras - you

Linas Vepstas

Feb 12, 2020, 1:57:41 PM
to opencog, link-grammar


On Wed, Feb 12, 2020 at 5:18 AM Adrian Borucki <gent...@gmail.com> wrote:


On Tuesday, 11 February 2020 18:45:44 UTC+1, linas wrote:
Salut Amirouche,

What you describe was/is the goal of the language-learning project. It is stalled because there is no easy way to evaluate whether it is making forward progress and learning a good grammar, or learning junk.

The proposed solution is to create "random" grammars, and then compare what the system learned to the precise, exactly-known grammar.  The only problem here is that generating a corpus of sentences drawn from a given grammar is surprisingly hard (i.e., it is not an afternoon project, or even a one-week project).

Is using the English language as the target too limited (overfitting)?
No, not at all.

There are datasets for grammatical tasks like part-of-speech tagging; the quality of the grammar could be judged by testing performance on those tasks.

That's what we thought too, initially, and it turns out that doesn't work.  Sometimes the problems are minor: the datasets contain errors, which is annoying.  More difficult are issues surrounding small grammars and small corpora, e.g. "child-directed speech" (CDS).  One can naively think that CDS is just adult-English grammar with a smaller vocabulary, but there's a hint that, mathematically, that is just not true.  It becomes particularly obvious when the training corpus is so small that you can run the calculations by pencil-n-paper: the probabilities really come out quite different.
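
To illustrate the pencil-and-paper point, here is a toy calculation (my own, not project code): count adjacent word pairs in a three-sentence corpus and compute pointwise mutual information.  With counts this small, every probability is dominated by one or two observations.

import math
from collections import Counter

# A corpus small enough to check by hand.
corpus = ["the dog runs", "the cat runs", "a dog sleeps"]

words, pairs = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    words.update(tokens)
    pairs.update(zip(tokens, tokens[1:]))  # adjacent word pairs

n_words = sum(words.values())
n_pairs = sum(pairs.values())
for (a, b), count in pairs.items():
    # Pointwise mutual information: log2 of p(a,b) / (p(a) * p(b)).
    pmi = math.log2((count / n_pairs) /
                    ((words[a] / n_words) * (words[b] / n_words)))
    print(f"{a} {b}: count={count}, PMI={pmi:+.2f}")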

One soon trips over other problems too.  Some are at the front end: capitalization, punctuation, anything to do with tokenization (imagine irregular French verbs!).  Others are near the back end: multiple meanings, synonyms, etc.  One sees many interesting things that are both correct and wrong at the same time: e.g. it turns out that "houses" and "cities" are quite similar: both can be entered, both provide shelter, both can be a home... and yet a house is not a city.  I saw many unexpected but interesting groupings like this... reminiscent of corpus-linguistics results.

Oh, and corpora problems: Wikipedia lacks "action verbs" (run, jump, kick, punch), since Wikipedia describes things (X is a Y which Z when W).  It also has vast amounts of foreign words, product names, geographical names, and obscure terminology: words which appear only once or twice, ever.

Project Gutenberg is mostly 19th- and early-20th-century English, which is very unlike modern English.  We can read it and understand it, but there are many quite strange and unusual sentences in there that no one would ever say today.  Jane Austen's Pride & Prejudice is a good example.

It became clear that existing corpora and datasets are almost useless for evaluating quality.  I want to be able to control for the size of the vocabulary, the size of the grammar, the density and distribution of different parts of speech, and, most importantly, the distribution of different meanings ("I saw": "to see" vs. "to cut")... and then evaluate the algos as each of these parameters is changed.

The idea is that the learning system is like a transducer: grammar -> corpora -> grammar.  It's like a microphone, or a high-fidelity stereo system: we want to evaluate how well it works.  Of course, you can "listen to it" and "see if it sounds good", but really, I'd rather measure and be precise, and for that we need inputs that can be controlled, so we can see how the system responds as we "turn the knob".
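
Here is the sort of measurement I mean, as a sketch only (the link sets below are invented for illustration): score a learned grammar against the known reference grammar with F1 over link types, and then imagine re-running this as each knob is turned.

def f1(reference: set, learned: set) -> float:
    # Harmonic mean of precision and recall over grammar links.
    tp = len(reference & learned)
    if tp == 0:
        return 0.0
    precision = tp / len(learned)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# Invented example: grammars represented as sets of (left, right) link types.
reference = {("D", "N"), ("N", "V"), ("V", "N")}
learned   = {("D", "N"), ("N", "V"), ("V", "D")}
print(f"F1 = {f1(reference, learned):.2f}")  # 0.67: two of three links recovered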

-- Linas




Amirouche Boubekki

Feb 19, 2020, 9:59:17 AM
to link-grammar, opencog
Certainly the language-learning project will work; I am hopeful.

Here is a small description I found of what I have in mind:
http://learnthesewordsfirst.com/about/what-is-a-multi-layer-dictionary.html
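
As a rough sketch of how I read the multi-layer idea (the layer contents below are invented, just to show the constraint): every definition in a layer may only use words from strictly earlier layers, so layer 1 acts as the seed, and the constraint is mechanically checkable.

seed = {"thing", "you", "do", "good", "with"}  # layer 1: the seed vocabulary
layer2 = {"tool": "thing you do good with"}    # layer 2: defined via layer 1 only

def uses_only_known_words(definitions, known):
    # True if every definition draws only on already-known words.
    return all(set(text.split()) <= known for text in definitions.values())

print(uses_only_known_words(layer2, seed))                 # True
print(uses_only_known_words({"city": "big place"}, seed))  # False: "big", "place" unknown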



--
Amirouche ~ https://hyper.dev