Re: Chinese language learning pipeline

31 views

Skip to first unread message

Linas Vepstas

unread,

Jul 13, 2017, 4:07:34 AM7/13/17

to Ruiting Lian, Ben Goertzel, Curtis M. Faith, Andres Suarez, link-grammar, Shujing Ke, Nil Geisweiller, opencog

OK,I guess I still didn't explain enough. Here's a longer explanation.

So, for the case of word boundaries, you have sequences of characters, hanzi, that are next to each other. For circumpositions, they do NOT have to be next to each other: there might be many words in between.

This is why I was trying to talk about sheaves: a sheaf tells you how to handle the connectivity of circumpositions, where there are "holes" in between the blocks. So again: imagine a single disjunct, all by itself. This contains only local connections. But two of them together still look like a section of a sheaf. In diagrams:

Here is a single disjunct:

         +Cs+
         | |
         if ?

Think of it as being anchored by the single germ/gerbe "if" from one sentence:

    +------->WV------->+--MVs-+---CV->+
    +--Wd--+-Sp*i+--I--+Osm+ +Cs+-Sp-+--O-+
    |      |     |     |   | | |    |    |
LEFT-WALL I.p will.v do.v it if you say.v so

Here is one section of a sheaf whose germ/gerbe is the word "if", but now has been extended multiple steps farther out, to a greater distance:

   +--MVs-+---CV->+
   |      +Cs+-Sp-+--O--+
   |      | |    |     |
   ?     if ? say     ?

Note that this section occurs in the sentence above, but it might also occur in other sentences. This is a very complex section. If you observe it many many times, if you observe it more than "average" (I think Ben and Shujing call it "surprising".) then you say "oh, a ha!! this is an idiom or an institutional phrase or a set phrase! or a lexical chunk!" Because there is a blank spot between "if... say''', this would be a circumposition.

Segmenting text into morphemes, or, in the case of Chinese, groupimg multiple hanzi into a single word, works **exactly the same way**, except, for the case of Chinese, you do not allow circumpositions if you just want to find single words. But otherwise it is the same algorithm.

I was originally hoping that Shujing's pattern miner could do this, but I don't think it is ready, and I don't have the time to figure it out.

The reason you *stil* want to read the papers on word segmentation and morpho-syntax is that all of those ideas and algorithms in those papers still apply, except that they now apply to sections of sheaves, instead of adjacent words, or adjacent letters, or n-grams. The point is that the sections of the sheves tell you exactly how to work with complex structure, instead of having to work with n-grams.

--linas

On Thu, Jul 13, 2017 at 2:20 AM, Linas Vepstas <linasv...@gmail.com> wrote:

On Thu, Jul 13, 2017 at 1:53 AM, Ruiting Lian <lianli...@gmail.com> wrote:

Ah, it's not that I don't like them. I just didn't see how they can be 100% correct. Maybe I didn't explain well. Let me try again:

4 .学   <--> 5 . 社 cost= 1.2513192358165952

So if you think this is a perfect word, which means, every link with MI higher than 1.25.... should be considered as one word.

No, that is not correct. That's not how it works.

Then the followings should be also:

8 . 與   <--> 9 . 實 cost= 1.5154639602692228

Yes, I guess o.

23 . 鹼   <--> 24 . （ cost= 2.0228678213402027

Yes I guess so

2 . 把   <--> 3 . 蕾 cost= 3.301384076256177

Yes I guess so.

4 . ，   <--> 5 . 此 cost= 1.9045829576234112

Yes I guess so.

but none of them is supposed to be a word, and these links are not good (no syntax relation, no semantic relation).

Sure, but you have exactly *one* example of each. Its not "statistics" if you have only one example. We would have to collect hundreds of these. And I bet that there won't be hundreds, that you won't be able to find that many.

Doing the MI between characters in morden Chinese is more like doing the MI between letters in English words, you will get some strong patterns that are frequently used in words, but that's very far away from understanding the syntax relations and semantic relations.

Its also just plain not how it works. No one does it this way, and I am certainly not proposing that we should do this. That would be .. dumb. Its just not how it works.

There are a number of good papers on how to do word segmentation correctly, and how to do morphosyntax correctly. I don't have them here at the tip of my fingers I would have to search for them.   I'll try to send them tomorrow.

--linas

把蕾當作戀人看待，記憶力比一般人優異。

6 . 戀   <--> 17 . 人 cost= 3.346336621332618
6 . 戀   <--> 7 . 人 cost= 3.346336621332618

The second link can indicate "戀(love) 人(people)" (lover/sweetheart) is one word, but the same MI implies to the first link doesn't make sense anymore, as they don't have strong relation in the sentence.

So 6-7 is one word, and 6-17 clearly cannot be one word, since the symbols are not next to each-other. So what's the problem? We have one word that is 100% correct, and a hint that maybe other things are wrong.

So it sounds like the first parse is perfect, and the second parse might have a problem ... and so? What's the actual problem?

What I meant is, there shouldn't be link between 6-17, as they don't have syntax relation from the grammar point of view.
Ahh!

Yes, it was obvious in the earlier email that there might be something wrong here.

If you want to consider the semantic relation in this case, the link should between 7-17.

OK, but so far that would still be linking to the same word, so its linking to the wrong morphome, but in the right word...
Recall that so far, these are jus spanning tree parses, not disjunct parses, so they are not going to provide word segmentation, and they are just a partial view of the syntax. Getting the word segmentation, and going over to disjuncts will presumably get better results. So far, it seems like we are on the correct path word segmentation, and I assume the syntactic links might be OK.

So since everything seems to be more or less correct, why don't you like it?

--linas

I think if you look at how Russian is done, maybe these parses will be more clear. For example, standard link-grammar generates

    +------------------------------------------------Xp------------------------------------------------+
    +---------------------Wd---------------------+                                                     |
    |       +-----------------EIw----------------+                                                     |
    |       +-------Jp-------+                   |                         +----------Mg---------+     |
    |       |      +--LLAAQ--+        +---LLGJV--+-----SIm3----+-----Mg----+           +--LLACG--+     |
    |       |      |         |        |          |             |           |           |         |     |
LEFT-WALL в.jp коридор.= =е.ndmsp послыш.= =ался.vsndpms грохот.ndmsi сапог.ndmpg стражник.= =ов.nlmpg .

The LL links are the links between morphemes.

The link between Chinese characters in one word is really not anything like morphemes.

--linas

On Tue, Jul 11, 2017 at 11:39 PM, Linas Vepstas <linasv...@gmail.com> wrote:
At a minimum, what the word segmentation for these sentences should have been? And what the linkages should have been, at least approximately?

--linas

On Tue, Jul 11, 2017 at 11:35 PM, Linas Vepstas <linasv...@gmail.com> wrote:
ah, it would be great if I got a translation. I know that's tedious, but I don't know of an easier way. Is it junkier than the English?

--linas

On Tue, Jul 11, 2017 at 10:48 PM, Ben Goertzel <b...@goertzel.org> wrote:
The MST parses you sent contain quite a lot of gibberish, according to
what Ruiting observed to me yesterday... (she's not at the computer
right now)

On Wed, Jul 12, 2017 at 11:39 AM, Linas Vepstas <linasv...@gmail.com> wrote:
> Yes, that's fine.
>
> Did you look at the MST parses I sent you? I would much rather use those, if
> they aren't terrible, because this would save a lot of time and effort.
>
> --linas
>
> On Tue, Jul 11, 2017 at 8:26 AM, Ruiting Lian <lianli...@gmail.com>
> wrote:
>>
>> Hi Linas,
>>
>> Can you check if the attached format is OK with you? Or you want a cleaner
>> format?
>>
>> The encoding is GBK.
>>
>>
>> The current format is:
>> ===============
>> Sentence #2 (14 tokens):
>> 陈清扬当时二十六岁，就在我插队的地方当医生。
>> [Text=陈清扬 CharacterOffsetBegin=50 CharacterOffsetEnd=53]
>> [Text=当时 CharacterOffsetBegin=53 CharacterOffsetEnd=55]
>> [Text=二十六 CharacterOffsetBegin=55 CharacterOffsetEnd=58]
>> [Text=岁 CharacterOffsetBegin=58 CharacterOffsetEnd=59]
>> [Text=， CharacterOffsetBegin=59 CharacterOffsetEnd=60]
>> [Text=就 CharacterOffsetBegin=60 CharacterOffsetEnd=61]
>> [Text=在 CharacterOffsetBegin=61 CharacterOffsetEnd=62]
>> [Text=我 CharacterOffsetBegin=62 CharacterOffsetEnd=63]
>> [Text=插队 CharacterOffsetBegin=63 CharacterOffsetEnd=65]
>> [Text=的 CharacterOffsetBegin=65 CharacterOffsetEnd=66]
>> [Text=地方 CharacterOffsetBegin=66 CharacterOffsetEnd=68]
>> [Text=当 CharacterOffsetBegin=68 CharacterOffsetEnd=69]
>> [Text=医生 CharacterOffsetBegin=69 CharacterOffsetEnd=71]
>> [Text=。 CharacterOffsetBegin=71 CharacterOffsetEnd=72]
>> =========================
>>
>> The simple format can be like English sentences using space to separate
>> the words:
>>
>> ===============
>> Sentence #2 (14 tokens):
>> 陈清扬当时二十六岁，就在我插队的地方当医生。
>>
>>
>> ----
>> Ruiting Lian
>>
>>
>> On Mon, Jul 10, 2017 at 12:13 PM, Linas Vepstas <linasv...@gmail.com>
>> wrote:
>>>
>>> Here's what I have for non-segmented data:
>>>
>>> The current Mandarin dataset has enough English in it to parse even
>>> simple English, not well, but not terribly badly, either. Basically, the
>>> wikipedia articles contain scattered English, and this is treated just like
>>> everything else.
>>>
>>> The cost is the MI of that pair. The integers are the word-ordinals.
>>>
>>> (print-mst "this is a test")
>>> 2 . this <--> 3 . is cost= 12.596325047561628
>>> 3 . is <--> 4 . a cost= 10.684454066752686
>>> 1 . ###LEFT-WALL### <--> 3 . is cost= 0.5236859955283215
>>> 1 . ###LEFT-WALL### <--> 5 . test cost= -0.8277888080021967
>>>
>>> (print-mst "it is surprsing that this works")
>>> 5 . that <--> 7 . works cost= 12.824735708193995
>>> 5 . that <--> 6 . this cost= 11.23043475606367
>>> 3 . is <--> 5 . that cost= 8.50983391898189
>>> 2 . it <--> 3 . is cost= 12.720811706832116
>>> 1 . ###LEFT-WALL### <--> 7 . works cost= 0.7545942265555112
>>>
>>>
>>> Some random sentences:
>>>
>>> 也常在工業上與實驗室中，用於有機合成中的強鹼（超強鹼）。
>>>
>>> 長時間都跟男人接觸，不擅長對待女孩子。
>>>
>>> 把蕾當作戀人看待，記憶力比一般人優異。
>>>
>>> 道德学社创始人。
>>>
>>> 然而，此混战亦可坚持三十六日。
>>>
>>>
>>> These are segmented so that its two hanzi per pair. There is NO word
>>> segmentation!! I am hoping that word segmentation will happen
>>> "automatically". See below.
>>>
>>> I do not yet have software to draw the graphs. You will have to do this
>>> by hand, for now.
>>>
>>> The parses:
>>>
>>> (print-mst "也常在工業上與實驗室中，用於有機合成中的強鹼（超強鹼）。")
>>> 9 . 實 <--> 10 . 驗 cost= 8.67839880455297
>>> 10 . 驗 <--> 11 . 室 cost= 7.808708535957656
>>> 5 . 工 <--> 11 . 室 cost= 3.628109329166861
>>> 5 . 工 <--> 6 . 業 cost= 5.267964316861255
>>> 8 . 與 <--> 9 . 實 cost= 1.5154639602692228
>>> 11 . 室 <--> 18 . 合 cost= 1.3112275389754586
>>> 18 . 合 <--> 19 . 成 cost= 2.8487390406796376
>>> 2 . 也 <--> 19 . 成 cost= 1.3284690162639947
>>> 2 . 也 <--> 3 . 常 cost= 2.670074279865112
>>> 2 . 也 <--> 4 . 在 cost= 1.5999478270616105
>>> 19 . 成 <--> 27 . 鹼 cost= 1.0648751602654372
>>> 23 . 鹼 <--> 27 . 鹼 cost= 7.397179300260998
>>> 22 . 強 <--> 23 . 鹼 cost= 6.6278644991783935
>>> 26 . 強 <--> 27 . 鹼 cost= 6.6278644991783935
>>> 25 . 超 <--> 26 . 強 cost= 4.233373422153647
>>> 21 . 的 <--> 23 . 鹼 cost= 2.189776253921316
>>> 23 . 鹼 <--> 24 . （ cost= 2.0228678213402027
>>> 20 . 中 <--> 21 . 的 cost= 0.7882106238577222
>>> 11 . 室 <--> 13 . ， cost= 0.7204578405457767
>>> 13 . ， <--> 15 . 於 cost= 1.3787498338128241
>>> 14 . 用 <--> 15 . 於 cost= 2.3097280192674408
>>> 13 . ， <--> 16 . 有 cost= 0.8940307468617856
>>> 16 . 有 <--> 17 . 機 cost= 1.5798315464927484
>>> 12 . 中 <--> 13 . ， cost= 0.702836797299172
>>> 19 . 成 <--> 29 . 。 cost= 0.5752920624938298
>>> 28 . ） <--> 29 . 。 cost= 1.1524305408117694
>>> 6 . 業 <--> 7 . 上 cost= -0.02469597819435876
>>> 1 . ###LEFT-WALL### <--> 2 . 也 cost= -0.5802694975440783
>>>
>>>
>>> scheme@(guile-user)> (print-mst "長時間都跟男人接觸，不擅長對待女
>>> 孩子。")
>>> 9 . 接 <--> 10 . 觸 cost= 8.386123586043663
>>> 9 . 接 <--> 16 . 待 cost= 5.500641386527644
>>> 15 . 對 <--> 16 . 待 cost= 4.527961561831734
>>> 4 . 間 <--> 9 . 接 cost= 3.011954338244294
>>> 3 . 時 <--> 4 . 間 cost= 5.53719025855275
>>> 2 . 長 <--> 4 . 間 cost= 1.4431523444018381
>>> 2 . 長 <--> 19 . 子 cost= 1.6192113159084691
>>> 18 . 孩 <--> 19 . 子 cost= 6.867211688834326
>>> 17 . 女 <--> 18 . 孩 cost= 7.309507215753307
>>> 13 . 擅 <--> 16 . 待 cost= 1.4396781889010661
>>> 13 . 擅 <--> 14 . 長 cost= 7.846183142809686
>>> 12 . 不 <--> 13 . 擅 cost= 4.148490791614584
>>> 11 . ， <--> 13 . 擅 cost= 2.360114646965709
>>> 4 . 間 <--> 5 . 都 cost= 0.9693867850594948
>>> 5 . 都 <--> 6 . 跟 cost= 2.3990108227984237
>>> 6 . 跟 <--> 7 . 男 cost= 1.3528584177181244
>>> 7 . 男 <--> 8 . 人 cost= 2.705451519003308
>>> 2 . 長 <--> 20 . 。 cost= 0.9405658895019222
>>> 1 . ###LEFT-WALL### <--> 2 . 長 cost= -0.11680557768830013
>>>
>>>
>>> scheme@(guile-user)> (print-mst "把蕾當作戀人看待，記憶力比一般人優異
>>> 。")
>>> 11 . 記 <--> 12 . 憶 cost= 9.984319885371686
>>> 6 . 戀 <--> 12 . 憶 cost= 4.479455917846746
>>> 6 . 戀 <--> 17 . 人 cost= 3.346336621332618
>>> 6 . 戀 <--> 7 . 人 cost= 3.346336621332618
>>> 6 . 戀 <--> 19 . 異 cost= 3.1326057184954585
>>> 18 . 優 <--> 19 . 異 cost= 7.429036103043035
>>> 12 . 憶 <--> 13 . 力 cost= 2.9092434467610424
>>> 7 . 人 <--> 8 . 看 cost= 1.5817340934392554
>>> 8 . 看 <--> 9 . 待 cost= 6.40562791854984
>>> 6 . 戀 <--> 16 . 般 cost= 1.532540145438297
>>> 15 . 一 <--> 16 . 般 cost= 6.011468981031394
>>> 14 . 比 <--> 16 . 般 cost= 2.133858423252395
>>> 19 . 異 <--> 20 . 。 cost= 1.062610265554028
>>> 5 . 作 <--> 20 . 。 cost= 0.9473091730162793
>>> 4 . 當 <--> 5 . 作 cost= 1.8844566876556712
>>> 2 . 把 <--> 4 . 當 cost= 1.601964785223224
>>> 2 . 把 <--> 3 . 蕾 cost= 3.301384076256177
>>> 1 . ###LEFT-WALL### <--> 4 . 當 cost= 1.4277603296921786
>>> 8 . 看 <--> 10 . ， cost= 0.7854131431579994
>>>
>>>
>>> scheme@(guile-user)> (print-mst "道德学社创始人。")
>>> 6 . 创 <--> 7 . 始 cost= 5.70773284687867
>>> 4 . 学 <--> 6 . 创 cost= 3.061665951119931
>>> 6 . 创 <--> 8 . 人 cost= 2.1724657173546227
>>> 4 . 学 <--> 5 . 社 cost= 1.2513192358165952
>>> 8 . 人 <--> 9 . 。 cost= 0.9996485775966537
>>> 2 . 道 <--> 9 . 。 cost= 0.566680489393292
>>> 2 . 道 <--> 3 . 德 cost= 2.4802112908263467
>>> 1 . ###LEFT-WALL### <--> 2 . 道 cost= -0.4271902589965446
>>>
>>>
>>> scheme@(guile-user)> (print-mst "然而，此混战亦可坚持三十六日。")
>>> 10 . 坚 <--> 11 . 持 cost= 8.090136242433331
>>> 2 . 然 <--> 10 . 坚 cost= 3.986785521916751
>>> 2 . 然 <--> 3 . 而 cost= 5.260819401185174
>>> 1 . ###LEFT-WALL### <--> 2 . 然 cost= 1.960750908016161
>>> 2 . 然 <--> 5 . 此 cost= 1.6401055467853318
>>> 4 . ， <--> 5 . 此 cost= 1.9045829576234112
>>> 5 . 此 <--> 8 . 亦 cost= 1.6941756040836786
>>> 8 . 亦 <--> 9 . 可 cost= 3.9818559895630496
>>> 5 . 此 <--> 7 . 战 cost= 0.7150479036642032
>>> 6 . 混 <--> 7 . 战 cost= 3.4641015339868524
>>> 11 . 持 <--> 16 . 。 cost= 0.667356443539326
>>> 11 . 持 <--> 12 . 三 cost= -0.082652373159668
>>> 12 . 三 <--> 13 . 十 cost= 4.910027341075681
>>> 13 . 十 <--> 14 . 六 cost= 6.198613466629329
>>> 13 . 十 <--> 15 . 日 cost= 1.2911435674834095
>>> scheme@(guile-user)>
>>>
>>>
>>> On Thu, Jul 6, 2017 at 7:57 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>>>
>>>> >> The primary issue I'm still very much struggling with is that I am
>>>> >> not happy
>>>> >> with the classification of words into word-groups. I've run multiple
>>>> >> experiments, all of what are a tad underwhelming. They're not bad,
>>>> >> they're
>>>> >> just not yet very good, either.
>>>> >
>>>> > Andres Suarez (an intern here), together with me and Curtis, is
>>>> > experimenting with a modification of Adagram (itself an extension of
>>>> > word2vec to handle disambiguation) for this... We'll share any
>>>> > meaningful results we get...
>>>>
>>>> but... yeah... it's a hard problem and surely the right place to be
>>>> stuck...
>>>>
>>>>
>>>>
>>>> --
>>>> Ben Goertzel, PhD
>>>> http://goertzel.org
>>>>
>>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>>>> boundary, I am the peak." -- Alexander Scriabin
>>>
>>>
>>
>

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin

Reply all

Reply to author

Forward

0 new messages