calculating Mandarin MLU in words

Ying

Dec 22, 2014, 11:37:47 PM

to chib...@googlegroups.com

Hello,

I want to calculate Mandarin-English bilinguals' Mandarin MLU in words (instead of in morphemes, to allow cross-linguistic comparison). I segmented the transcript into words and used the following commands (transcript attached):

mor +t*CHI MEV001_retell.cha +1
post +t*CHI MEV001_retell.cha +1
mlu +s"[+ G]" +s"[+ U]" +t*CHI MEV001_retell.cha > retell_001_mlu.cha

But I noticed that the number of morphemes shown in retell_001_mlu.cha differs from the total number of items (tokens) obtained from the following command:

freq +t*CHI +s"[+ G]" +s"[+ U]" +t%mor -t* MEV001_retell.cha > retell_001_TTR.cha

Does this mean CLAN distinguishes between words and morphemes in Mandarin? I thought the spaces between words were the only way for CLAN to recognize words (treated the same as morphemes) in Mandarin.

Which number should I use if I want to calculate MLU in words in Mandarin?

Thank you very much, and happy holidays!

Sincerely,
Ying Lu

MEV001_retell.cha

Leonid Spektor

Dec 23, 2014, 11:39:51 AM

to chib...@googlegroups.com

Ying Lu,

The number of morphemes differs from the number of tokens because of two instances of the word "pro|ta1-PL=he" in your sample data file. MLU counts those two words as four morphemes because of the '-' character, but FREQ counts them as two tokens. If you want MLU to count words instead of morphemes, add the "-b" option to your MLU command:

mlu +s"[+ G]" +s"[+ U]" +t*CHI -b MEV001_retell.cha > retell_001_mlu.cha
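To make the difference between the two counts concrete, here is a minimal Python sketch. It is not CLAN's actual code; it assumes, as a simplification of the explanation above, that each '-' inside a %mor item adds one morpheme. The function name and the second item in the sample utterance are hypothetical; only "pro|ta1-PL=he" comes from this thread.

```python
def count_words_and_morphemes(mor_items):
    """Return (words, morphemes) for a list of %mor items.

    Assumption: every '-' inside an item marks one extra morpheme.
    This is a simplification for illustration, not CLAN's algorithm.
    """
    words = len(mor_items)
    morphemes = sum(1 + item.count("-") for item in mor_items)
    return words, morphemes


# Hypothetical utterance: only "pro|ta1-PL=he" is taken from the thread.
utterance = ["pro|ta1-PL=he", "v|zou3=walk"]
words, morphemes = count_words_and_morphemes(utterance)
print(words, morphemes)
```

Under that assumption, each occurrence of "pro|ta1-PL=he" contributes one word but two morphemes, which is why two such items account for the gap between the MLU and FREQ totals.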

Leonid.

--
You received this message because you are subscribed to the Google Groups "chibolts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+u...@googlegroups.com.
To post to this group, send email to chib...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/chibolts/0e01d974-5b46-4ef8-8c5a-1a290cb040c0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ying Lu

Dec 23, 2014, 12:19:28 PM

to ChiBolts

Thank you very much Leonid! I still have some questions:

1. Can I use a similar command with "-b" to calculate English MLU in words (instead of morphemes)? For example, can I use the following commands to calculate MLU in words for the attached transcript?

mor +t*CHI MEV001_E_retell.cha +1
post +t*CHI MEV001_E_retell.cha +1
mlu +t*CHI +s"[+ G]" +s"[+ U]" +k -b MEV001_E_retell.cha > 001_E_retell_mlu.cha

2. I also want to distinguish between regular verbs (e.g., work - worked) and irregular verbs (e.g., find - found) when calculating the number of different words (NDW).

I understand that +s@r-*,o-% will find all stems and erase all other markers. But I want to treat an irregular verb form (e.g., found) as a separate lexical entry from its stem (e.g., find), while counting a regular verb form (e.g., worked) and its stem (e.g., work) as one entry.

With the following two commands, however, I got results that treat irregular and regular verbs the same way (either preserving the suffix on regular verbs or keeping only the stem):

freq +t*CHI +s"[+ G]" +s"[+ U]" +k MEV001_E_retell.cha > 001_E_retell_NDW_different.cha
freq +t*CHI +s"[+ G]" +s"[+ U]" +t%mor -t* +s"@r-*,o-%" +k MEV001_E_retell.cha > 001_E_retell_NDW_same.cha

3. If I am going to use "+d3" to output type/token information in Excel format, is it possible to also output MLU, total number of utterances, and even other information (e.g., code calculation) in the same spreadsheet?

Thanks a lot!!!

Sincerely,
Ying

MEV001_E_retell.cha

Leonid Spektor

Dec 23, 2014, 12:57:41 PM

to chib...@googlegroups.com

1. Yes, you can use the "-b" option with any language, including English.

2. Try using the +s"@r-*,&+*,o%" option. In English, at least, the '&' marker is used with irregular forms, and including it in the search string will give you two separate entries for "find" and "found". For example, the result will be "v|find" for the word "find" and "v|find&past" for the word "found". Your command line will be:

freq +t*CHI +s"[+ G]" +s"[+ U]" +t%mor -t* +s"@r-*,&+*,o%" +k MEV001_E_retell.cha
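The intended grouping can be illustrated with a short Python sketch. It does not reproduce how FREQ parses the search string; it only assumes a simplified %mor item shape in which regular suffixes follow '-' and irregular (fused) markers follow '&'. The function name and the sample items are hypothetical.

```python
def ndw_key(mor_item):
    """Return the entry under which a %mor item is counted for NDW.

    Assumption: anything from the first '-' onward is a regular suffix
    and is dropped, while an '&' marker (irregular form) is kept.
    """
    return mor_item.split("-", 1)[0]


# Hypothetical %mor items for: work, worked, find, found.
items = ["v|work", "v|work-PAST", "v|find", "v|find&PAST"]
types = sorted({ndw_key(item) for item in items})
print(types)       # ['v|find', 'v|find&PAST', 'v|work']
print(len(types))  # 3
```

Here "work" and "worked" collapse into one entry, while "find" and "found" remain two, which is the distinction asked about in question 2.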

3. There is no way to get both TTR and MLU results into the same spreadsheet while keeping the search flexibility you are using, for example the +s"[+ G]" +s"[+ U]" options. But the MLU command has its own "+d" option that creates Excel output, and you can then use Excel to join the FREQ and MLU outputs.
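As an alternative to joining the two outputs by hand in Excel, the merge could be scripted. The Python sketch below is only an illustration: it assumes both outputs have been saved as CSV and share a column identifying the file. The column names ("File", "Types", "Tokens", "Utterances", "MLU") and all values are invented placeholders, not the actual layout that FREQ or MLU produces.

```python
import csv
import io


def join_rows(freq_rows, mlu_rows, key="File"):
    """Merge two lists of row dicts on a shared key column."""
    mlu_by_key = {row[key]: row for row in mlu_rows}
    merged = []
    for row in freq_rows:
        combined = dict(row)
        # Add the MLU columns for the same file, if there are any.
        combined.update(mlu_by_key.get(row[key], {}))
        merged.append(combined)
    return merged


# Placeholder CSV text: column names and numbers are invented.
freq_csv = "File,Types,Tokens\nsample.cha,10,20\n"
mlu_csv = "File,Utterances,MLU\nsample.cha,5,4.0\n"

freq_rows = list(csv.DictReader(io.StringIO(freq_csv)))
mlu_rows = list(csv.DictReader(io.StringIO(mlu_csv)))
merged = join_rows(freq_rows, mlu_rows)
print(merged[0])
```

With real data, the two strings would be replaced by the exported files, and the key column would need to match whatever header the exports actually use.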

Leonid.

Ying Lu

Dec 23, 2014, 1:54:07 PM

to ChiBolts

Very helpful and informative! Thank you so much, Leonid!

Merry Christmas!

Best wishes!

Ying
