On Thu, Jul 13, 2017 at 1:53 AM, Ruiting Lian <lianli...@gmail.com> wrote:

> Ah, it's not that I don't like them. I just didn't see how they can be
> 100% correct. Maybe I didn't explain well. Let me try again:
>
> 4 . 学 <--> 5 . 社 cost= 1.2513192358165952
>
> So if you think this is a perfect word, that means every link with MI
> higher than 1.25... should be considered as one word.

No, that is not correct. That's not how it works.

> Then the following should also be words:
>
> 8 . 與 <--> 9 . 實 cost= 1.5154639602692228

Yes, I guess so.

> 23 . 鹼 <--> 24 . ( cost= 2.0228678213402027

Yes, I guess so.

> 2 . 把 <--> 3 . 蕾 cost= 3.301384076256177

Yes, I guess so.

> 4 . , <--> 5 . 此 cost= 1.9045829576234112

Yes, I guess so.

> but none of them is supposed to be a word, and these links are not
> good (no syntax relation, no semantic relation).

Sure, but you have exactly *one* example of each. It's not "statistics"
if you have only one example. We would have to collect hundreds of
these. And I bet that there won't be hundreds, that you won't be able
to find that many.

> Doing the MI between characters in modern Chinese is more like doing
> the MI between letters in English words: you will get some strong
> patterns that are frequently used in words, but that's very far from
> understanding the syntax relations and semantic relations.

It's also just plain not how it works. No one does it this way, and I
am certainly not proposing that we should do this. That would be ...
dumb. It's just not how it works.

There are a number of good papers on how to do word segmentation
correctly, and how to do morphosyntax correctly. I don't have them here
at the tip of my fingers; I would have to search for them. I'll try to
send them tomorrow.

--linas
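To make the "one example isn't statistics" point concrete, the check being asked for could be sketched like this: tally how often each suspect character pair is actually linked across many parses, instead of judging isolated examples. This is a minimal illustration with invented data and an invented function name, not part of the actual pipeline.

```python
# Hedged sketch: count how often each character pair is linked across
# many MST parses. A pair that is linked hundreds of times is a
# systematic pattern worth worrying about; a pair seen once is noise.
from collections import Counter

def tally_links(parses):
    """parses: iterable of parses, each a list of (left, right, cost)."""
    counts = Counter()
    for parse in parses:
        for left, right, cost in parse:
            counts[(left, right)] += 1
    return counts

# Toy data in the shape of the transcripts below:
parses = [
    [("学", "社", 1.25), ("與", "實", 1.52)],
    [("学", "社", 1.25), ("把", "蕾", 3.30)],
]
print(tally_links(parses).most_common(1))  # the most frequently linked pair
```

With real data, one would then inspect only the high-count pairs by hand.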
把 蕾 當 作 戀 人 看 待 , 記 憶 力 比 一 般 人 優 異 。

> 6 . 戀 <--> 17 . 人 cost= 3.346336621332618
> 6 . 戀 <--> 7 . 人 cost= 3.346336621332618
>
> The second link can indicate that "戀 (love) 人 (person)" (i.e.
> lover, sweetheart) is one word, but the same MI applied to the first
> link doesn't make sense any more, as those two don't have a strong
> relation in the sentence.

So 6-7 is one word, and 6-17 clearly cannot be one word, since the
symbols are not next to each other. So what's the problem? We have one
word that is 100% correct, and a hint that maybe other things are
wrong. So it sounds like the first parse is perfect, and the second
parse might have a problem ... and so? What's the actual problem?

> What I meant is, there shouldn't be a link between 6-17, as they
> don't have a syntax relation from the grammar point of view.

Ahh! Yes, it was obvious in the earlier email that there might be
something wrong here.

> If you want to consider the semantic relation in this case, the link
> should be between 7-17.

OK, but so far that would still be linking to the same word, so it's
linking to the wrong morpheme, but in the right word...

Recall that, so far, these are just spanning-tree parses, not disjunct
parses, so they are not going to provide word segmentation, and they
are just a partial view of the syntax. Getting the word segmentation,
and going over to disjuncts, will presumably get better results. So
far, it seems like we are on the correct path for word segmentation,
and I assume the syntactic links might be OK.

So since everything seems to be more or less correct, why don't you
like it?

--linas

I think if you look at how Russian is done, maybe these parses will be
clearer. For example, standard link-grammar generates:
+------------------------------------------------Xp------------------------------------------------+
+---------------------Wd---------------------+ |
| +-----------------EIw----------------+ |
| +-------Jp-------+ | +----------Mg---------+ |
| | +--LLAAQ--+ +---LLGJV--+-----SIm3----+-----Mg----+ +--LLACG--+ |
| | | | | | | | | | |
LEFT-WALL в.jp коридор.= =е.ndmsp послыш.= =ался.vsndpms грохот.ndmsi сапог.ndmpg стражник.= =ов.nlmpg .

The LL links are the links between morphemes.

The links between Chinese characters in one word are really not
anything like the links between morphemes.
--linas
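The spanning-tree parses traded back and forth in this thread can be sketched as a maximum spanning tree over MI-weighted word pairs. The following is a toy illustration in that spirit (Kruskal's algorithm, invented MI scores); it is not the actual OpenCog code, and as stated above it does no word segmentation at all.

```python
# Hedged sketch of an MST "parse" in the style of the print-mst output
# quoted below: keep the highest-MI edges that do not form a cycle.

def mst_parse(tokens, mi):
    """tokens: list of words; mi: dict mapping (i, j) index pairs to MI."""
    # Sort candidate edges by MI, highest first.
    edges = sorted(mi.items(), key=lambda kv: kv[1], reverse=True)
    parent = list(range(len(tokens)))

    def find(x):  # union-find root lookup, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), cost in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j, cost))
    return tree

# Toy example with invented MI scores (0-based indices internally,
# printed 1-based to match the transcript format):
toks = ["###LEFT-WALL###", "this", "is", "a", "test"]
scores = {(1, 2): 12.6, (2, 3): 10.7, (0, 2): 0.52, (0, 4): -0.83, (1, 3): 0.1}
for i, j, c in mst_parse(toks, scores):
    print(i + 1, ".", toks[i], "<-->", j + 1, ".", toks[j], "cost=", c)
```

Note that the low-MI edge to "test" is kept anyway, because a spanning tree must reach every token; this is why a single low-cost link in the transcripts does not by itself mean the parse is wrong.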
On Tue, Jul 11, 2017 at 11:39 PM, Linas Vepstas <linasv...@gmail.com> wrote:

At a minimum, what should the word segmentation for these sentences
have been? And what should the linkages have been, at least
approximately?

--linas

On Tue, Jul 11, 2017 at 11:35 PM, Linas Vepstas <linasv...@gmail.com> wrote:

ah, it would be great if I got a translation. I know that's tedious,
but I don't know of an easier way. Is it junkier than the English?

--linas

On Tue, Jul 11, 2017 at 10:48 PM, Ben Goertzel <b...@goertzel.org> wrote:

The MST parses you sent contain quite a lot of gibberish, according to
what Ruiting observed to me yesterday... (she's not at the computer
right now)
On Wed, Jul 12, 2017 at 11:39 AM, Linas Vepstas <linasv...@gmail.com> wrote:
> Yes, that's fine.
>
> Did you look at the MST parses I sent you? I would much rather use those, if
> they aren't terrible, because this would save a lot of time and effort.
>
> --linas
>
> On Tue, Jul 11, 2017 at 8:26 AM, Ruiting Lian <lianli...@gmail.com>
> wrote:
>>
>> Hi Linas,
>>
>> Can you check if the attached format is OK with you? Or you want a cleaner
>> format?
>>
>> The encoding is GBK.
>>
>>
>> The current format is:
>> ===============
>> Sentence #2 (14 tokens):
>> 陈清扬当时二十六岁,就在我插队的地方当医生。
>> [Text=陈清扬 CharacterOffsetBegin=50 CharacterOffsetEnd=53]
>> [Text=当时 CharacterOffsetBegin=53 CharacterOffsetEnd=55]
>> [Text=二十六 CharacterOffsetBegin=55 CharacterOffsetEnd=58]
>> [Text=岁 CharacterOffsetBegin=58 CharacterOffsetEnd=59]
>> [Text=, CharacterOffsetBegin=59 CharacterOffsetEnd=60]
>> [Text=就 CharacterOffsetBegin=60 CharacterOffsetEnd=61]
>> [Text=在 CharacterOffsetBegin=61 CharacterOffsetEnd=62]
>> [Text=我 CharacterOffsetBegin=62 CharacterOffsetEnd=63]
>> [Text=插队 CharacterOffsetBegin=63 CharacterOffsetEnd=65]
>> [Text=的 CharacterOffsetBegin=65 CharacterOffsetEnd=66]
>> [Text=地方 CharacterOffsetBegin=66 CharacterOffsetEnd=68]
>> [Text=当 CharacterOffsetBegin=68 CharacterOffsetEnd=69]
>> [Text=医生 CharacterOffsetBegin=69 CharacterOffsetEnd=71]
>> [Text=。 CharacterOffsetBegin=71 CharacterOffsetEnd=72]
>> =========================
>>
>> The simple format can be like English sentences using space to separate
>> the words:
>>
>> ===============
>> Sentence #2 (14 tokens):
>> 陈清扬 当时 二十六 岁 , 就 在 我 插队 的 地方 当 医生 。
>>
>>
>> ----
>> Ruiting Lian
>>
>>
>> On Mon, Jul 10, 2017 at 12:13 PM, Linas Vepstas <linasv...@gmail.com>
>> wrote:
>>>
>>> Here's what I have for non-segmented data:
>>>
>>> The current Mandarin dataset has enough English in it to parse even
>>> simple English, not well, but not terribly badly, either. Basically, the
>>> wikipedia articles contain scattered English, and this is treated just like
>>> everything else.
>>>
>>> The cost is the MI of that pair. The integers are the word-ordinals.
>>>
>>> (print-mst "this is a test")
>>> 2 . this <--> 3 . is cost= 12.596325047561628
>>> 3 . is <--> 4 . a cost= 10.684454066752686
>>> 1 . ###LEFT-WALL### <--> 3 . is cost= 0.5236859955283215
>>> 1 . ###LEFT-WALL### <--> 5 . test cost= -0.8277888080021967
>>>
>>> (print-mst "it is surprsing that this works")
>>> 5 . that <--> 7 . works cost= 12.824735708193995
>>> 5 . that <--> 6 . this cost= 11.23043475606367
>>> 3 . is <--> 5 . that cost= 8.50983391898189
>>> 2 . it <--> 3 . is cost= 12.720811706832116
>>> 1 . ###LEFT-WALL### <--> 7 . works cost= 0.7545942265555112
>>>
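The "cost" numbers in these transcripts are the pairwise MI. As a rough sketch of where such a number comes from (the function name and the exact normalization here are assumptions; the actual pipeline's counting may differ, e.g. it may count ordered pairs observed in parses rather than raw co-occurrences):

```python
# Hedged sketch: MI(l, r) = log2( p(l, r) / (p(l) * p(r)) ),
# computed from raw counts. High MI means the pair co-occurs far
# more often than chance; MI near zero means statistical independence.
from math import log2

def pair_mi(n_pair, n_left, n_right, n_total_pairs, n_total_words):
    p_pair = n_pair / n_total_pairs
    p_l = n_left / n_total_words
    p_r = n_right / n_total_words
    return log2(p_pair / (p_l * p_r))

# A frequent pairing of two individually rare words gets a high MI,
# comparable to the ~12.6 seen for "this is" above:
print(pair_mi(50, 100, 80, 1_000_000, 1_000_000))
```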
>>>
>>> Some random sentences:
>>>
>>> 也 常 在 工 業 上 與 實 驗 室 中 , 用 於 有 機 合 成 中 的 強 鹼 ( 超 強 鹼 ) 。
>>>
>>> 長 時 間 都 跟 男 人 接 觸 , 不 擅 長 對 待 女 孩 子 。
>>>
>>> 把 蕾 當 作 戀 人 看 待 , 記 憶 力 比 一 般 人 優 異 。
>>>
>>> 道 德 学 社 创 始 人 。
>>>
>>> 然 而 , 此 混 战 亦 可 坚 持 三 十 六 日 。
>>>
>>>
>>> These are segmented so that it's two hanzi per pair. There is NO word
>>> segmentation!! I am hoping that word segmentation will happen
>>> "automatically". See below.
>>>
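One guess at how the hoped-for "automatic" segmentation could fall out of these parses: merge adjacent hanzi into a single word whenever the MST links them with sufficiently high MI. Everything here is invented for illustration (including the threshold knob); the actual scheme was still undecided at this point in the thread.

```python
# Hedged sketch: derive a word segmentation from adjacent high-MI
# links in an MST parse. `threshold` is an invented tuning knob.

def segment(tokens, links, threshold=4.0):
    """tokens: one hanzi each; links: dict of (i, j) index pairs -> MI.
    Merge tokens i and i+1 when link (i, i+1) has MI above threshold."""
    strong = {(i, j) for (i, j), mi in links.items()
              if j == i + 1 and mi > threshold}
    words, current = [], tokens[0]
    for i in range(1, len(tokens)):
        if (i - 1, i) in strong:
            current += tokens[i]  # glue onto the previous hanzi
        else:
            words.append(current)
            current = tokens[i]
    words.append(current)
    return words

# MI values taken from the 女/孩/子 links in the transcript below:
toks = ["女", "孩", "子", "。"]
links = {(0, 1): 7.31, (1, 2): 6.87, (2, 3): 0.5}
print(segment(toks, links))  # expect ['女孩子', '。']
```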
>>> I do not yet have software to draw the graphs. You will have to do this
>>> by hand, for now.
>>>
>>> The parses:
>>>
>>> (print-mst "也 常 在 工 業 上 與 實 驗 室 中 , 用 於 有 機 合 成 中 的 強 鹼 ( 超 強 鹼 ) 。")
>>> 9 . 實 <--> 10 . 驗 cost= 8.67839880455297
>>> 10 . 驗 <--> 11 . 室 cost= 7.808708535957656
>>> 5 . 工 <--> 11 . 室 cost= 3.628109329166861
>>> 5 . 工 <--> 6 . 業 cost= 5.267964316861255
>>> 8 . 與 <--> 9 . 實 cost= 1.5154639602692228
>>> 11 . 室 <--> 18 . 合 cost= 1.3112275389754586
>>> 18 . 合 <--> 19 . 成 cost= 2.8487390406796376
>>> 2 . 也 <--> 19 . 成 cost= 1.3284690162639947
>>> 2 . 也 <--> 3 . 常 cost= 2.670074279865112
>>> 2 . 也 <--> 4 . 在 cost= 1.5999478270616105
>>> 19 . 成 <--> 27 . 鹼 cost= 1.0648751602654372
>>> 23 . 鹼 <--> 27 . 鹼 cost= 7.397179300260998
>>> 22 . 強 <--> 23 . 鹼 cost= 6.6278644991783935
>>> 26 . 強 <--> 27 . 鹼 cost= 6.6278644991783935
>>> 25 . 超 <--> 26 . 強 cost= 4.233373422153647
>>> 21 . 的 <--> 23 . 鹼 cost= 2.189776253921316
>>> 23 . 鹼 <--> 24 . ( cost= 2.0228678213402027
>>> 20 . 中 <--> 21 . 的 cost= 0.7882106238577222
>>> 11 . 室 <--> 13 . , cost= 0.7204578405457767
>>> 13 . , <--> 15 . 於 cost= 1.3787498338128241
>>> 14 . 用 <--> 15 . 於 cost= 2.3097280192674408
>>> 13 . , <--> 16 . 有 cost= 0.8940307468617856
>>> 16 . 有 <--> 17 . 機 cost= 1.5798315464927484
>>> 12 . 中 <--> 13 . , cost= 0.702836797299172
>>> 19 . 成 <--> 29 . 。 cost= 0.5752920624938298
>>> 28 . ) <--> 29 . 。 cost= 1.1524305408117694
>>> 6 . 業 <--> 7 . 上 cost= -0.02469597819435876
>>> 1 . ###LEFT-WALL### <--> 2 . 也 cost= -0.5802694975440783
>>>
>>>
>>> scheme@(guile-user)> (print-mst "長 時 間 都 跟 男 人 接 觸 , 不 擅 長 對 待 女
>>> 孩 子 。")
>>> 9 . 接 <--> 10 . 觸 cost= 8.386123586043663
>>> 9 . 接 <--> 16 . 待 cost= 5.500641386527644
>>> 15 . 對 <--> 16 . 待 cost= 4.527961561831734
>>> 4 . 間 <--> 9 . 接 cost= 3.011954338244294
>>> 3 . 時 <--> 4 . 間 cost= 5.53719025855275
>>> 2 . 長 <--> 4 . 間 cost= 1.4431523444018381
>>> 2 . 長 <--> 19 . 子 cost= 1.6192113159084691
>>> 18 . 孩 <--> 19 . 子 cost= 6.867211688834326
>>> 17 . 女 <--> 18 . 孩 cost= 7.309507215753307
>>> 13 . 擅 <--> 16 . 待 cost= 1.4396781889010661
>>> 13 . 擅 <--> 14 . 長 cost= 7.846183142809686
>>> 12 . 不 <--> 13 . 擅 cost= 4.148490791614584
>>> 11 . , <--> 13 . 擅 cost= 2.360114646965709
>>> 4 . 間 <--> 5 . 都 cost= 0.9693867850594948
>>> 5 . 都 <--> 6 . 跟 cost= 2.3990108227984237
>>> 6 . 跟 <--> 7 . 男 cost= 1.3528584177181244
>>> 7 . 男 <--> 8 . 人 cost= 2.705451519003308
>>> 2 . 長 <--> 20 . 。 cost= 0.9405658895019222
>>> 1 . ###LEFT-WALL### <--> 2 . 長 cost= -0.11680557768830013
>>>
>>>
>>> scheme@(guile-user)> (print-mst "把 蕾 當 作 戀 人 看 待 , 記 憶 力 比 一 般 人 優 異
>>> 。")
>>> 11 . 記 <--> 12 . 憶 cost= 9.984319885371686
>>> 6 . 戀 <--> 12 . 憶 cost= 4.479455917846746
>>> 6 . 戀 <--> 17 . 人 cost= 3.346336621332618
>>> 6 . 戀 <--> 7 . 人 cost= 3.346336621332618
>>> 6 . 戀 <--> 19 . 異 cost= 3.1326057184954585
>>> 18 . 優 <--> 19 . 異 cost= 7.429036103043035
>>> 12 . 憶 <--> 13 . 力 cost= 2.9092434467610424
>>> 7 . 人 <--> 8 . 看 cost= 1.5817340934392554
>>> 8 . 看 <--> 9 . 待 cost= 6.40562791854984
>>> 6 . 戀 <--> 16 . 般 cost= 1.532540145438297
>>> 15 . 一 <--> 16 . 般 cost= 6.011468981031394
>>> 14 . 比 <--> 16 . 般 cost= 2.133858423252395
>>> 19 . 異 <--> 20 . 。 cost= 1.062610265554028
>>> 5 . 作 <--> 20 . 。 cost= 0.9473091730162793
>>> 4 . 當 <--> 5 . 作 cost= 1.8844566876556712
>>> 2 . 把 <--> 4 . 當 cost= 1.601964785223224
>>> 2 . 把 <--> 3 . 蕾 cost= 3.301384076256177
>>> 1 . ###LEFT-WALL### <--> 4 . 當 cost= 1.4277603296921786
>>> 8 . 看 <--> 10 . , cost= 0.7854131431579994
>>>
>>>
>>> scheme@(guile-user)> (print-mst "道 德 学 社 创 始 人 。")
>>> 6 . 创 <--> 7 . 始 cost= 5.70773284687867
>>> 4 . 学 <--> 6 . 创 cost= 3.061665951119931
>>> 6 . 创 <--> 8 . 人 cost= 2.1724657173546227
>>> 4 . 学 <--> 5 . 社 cost= 1.2513192358165952
>>> 8 . 人 <--> 9 . 。 cost= 0.9996485775966537
>>> 2 . 道 <--> 9 . 。 cost= 0.566680489393292
>>> 2 . 道 <--> 3 . 德 cost= 2.4802112908263467
>>> 1 . ###LEFT-WALL### <--> 2 . 道 cost= -0.4271902589965446
>>>
>>>
>>> scheme@(guile-user)> (print-mst "然 而 , 此 混 战 亦 可 坚 持 三 十 六 日 。")
>>> 10 . 坚 <--> 11 . 持 cost= 8.090136242433331
>>> 2 . 然 <--> 10 . 坚 cost= 3.986785521916751
>>> 2 . 然 <--> 3 . 而 cost= 5.260819401185174
>>> 1 . ###LEFT-WALL### <--> 2 . 然 cost= 1.960750908016161
>>> 2 . 然 <--> 5 . 此 cost= 1.6401055467853318
>>> 4 . , <--> 5 . 此 cost= 1.9045829576234112
>>> 5 . 此 <--> 8 . 亦 cost= 1.6941756040836786
>>> 8 . 亦 <--> 9 . 可 cost= 3.9818559895630496
>>> 5 . 此 <--> 7 . 战 cost= 0.7150479036642032
>>> 6 . 混 <--> 7 . 战 cost= 3.4641015339868524
>>> 11 . 持 <--> 16 . 。 cost= 0.667356443539326
>>> 11 . 持 <--> 12 . 三 cost= -0.082652373159668
>>> 12 . 三 <--> 13 . 十 cost= 4.910027341075681
>>> 13 . 十 <--> 14 . 六 cost= 6.198613466629329
>>> 13 . 十 <--> 15 . 日 cost= 1.2911435674834095
>>> scheme@(guile-user)>
>>>
>>>
>>> On Thu, Jul 6, 2017 at 7:57 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>>>
>>>> >> The primary issue I'm still very much struggling with is that I am
>>>> >> not happy
>>>> >> with the classification of words into word-groups. I've run multiple
>>>> >> experiments, all of which are a tad underwhelming. They're not bad,
>>>> >> they're
>>>> >> just not yet very good, either.
>>>> >
>>>> > Andres Suarez (an intern here), together with me and Curtis, is
>>>> > experimenting with a modification of Adagram (itself an extension of
>>>> > word2vec to handle disambiguation) for this... We'll share any
>>>> > meaningful results we get...
>>>>
>>>> but... yeah... it's a hard problem and surely the right place to be
>>>> stuck...
>>>>
>>>>
>>>>
>>>> --
>>>> Ben Goertzel, PhD
>>>> http://goertzel.org
>>>>
>>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>>>> boundary, I am the peak." -- Alexander Scriabin
>>>
>>>
>>
>
--
Ben Goertzel, PhD
http://goertzel.org
"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin