Freq in a bilingual conversation


Kevin Donnelly

Jun 23, 2011, 8:05:07 AM
to chib...@googlegroups.com
Hi

I'm trying to run basic freq commands on a bilingual conversation marked up
with the current CLAN default (i.e. with precodes). What I'm trying to do is to
get figures for the total number of words in each language. This would be:
eng: words marked @s:eng, and unmarked words where the precode is [- eng];
spa: unmarked words, and words marked @s:spa where the precode is [- eng];
indeterminate: words marked @s:eng&spa.

The command:
clan/unix/bin/freq -s"@s:eng" clan/chats/myfile.cha
gets the ones marked @s:eng, but also includes the ones marked @s:eng&spa.
Using:
clan/unix/bin/freq +s"@s:eng&spa" clan/chats/myfile.cha
produces no results. I assume & has to be escaped, but \& doesn't work.
Using
clan/unix/bin/freq +s"@s:eng" +s"[- eng]" clan/chats/myfile.cha
(to try and get all the English words, including the ones with precodes) also
produces no results.

I'd be grateful if someone could tell me the magic switches here. I suppose
in more general terms the question is, how far can standard regular
expressions be used in the CLAN command line - is there a special syntax, or
are they not really expected to be used there?

Thanks.

--
Pob hwyl / Best wishes

Kevin Donnelly
kevindonnelly.org.uk

Leonid Spektor

Jun 23, 2011, 11:01:23 AM
to chib...@googlegroups.com
Kevin,

The +s"@s:eng&spa" option needs a star character to match the actual word. So, the right command is +s"*@s:eng&spa".

A better command would be "freq +l myfile.cha +s@s&eng" for English words and
"freq +l myfile.cha +s@s&spa" for Spanish words.
For more information about the +s@s option, type "freq +s@s" in the Commands window. The +l option assigns an explicit language tag to every word, thus making the +s"[- eng]" option unnecessary.

Leonid.

> --
> You received this message because you are subscribed to the Google Groups "chibolts" group.
> To post to this group, send email to chib...@googlegroups.com.
> To unsubscribe from this group, send email to chibolts+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/chibolts?hl=en.

Kevin Donnelly

Jun 24, 2011, 4:42:33 AM
to chib...@googlegroups.com
Hi Leonid

::::On Thursday 23 June 2011 Leonid Spektor said::::


> The +s"@s:eng&spa" option needs a star character to match the actual word.
> So, the right command is +s"*@s:eng&spa".

Great - this works fine.

> A better command would be "freq +l myfile.cha +s@s&eng" for English words
> and command "freq +l myfile.cha +s@s&spa" for Spanish words.
> For more information about the +s@s option type "freq +s@s" in commands
> window. The +l option assigns explicit language tag to every word, thus
> making the use of +s"[- eng]" option unnecessary.

This would indeed be useful, but unfortunately it doesn't work here - I get:
=====
clan/unix/bin/freq +l clan/chats/myfile.cha +s@s&spa
[1] 17700

+s@s Followed by search pattern
r word
& stem language marker
+ suffix language marker
$ part-of-speech marker
o all other elements not specified by user
followed by - or + and/or the following
* find any match
% erase any match
word -find "word"

For example:
+s"@r-*,&-it"
find all words with Italian stems
+s"@r*,&it,$n"
find all words with Italian stems and part of speech tag "n"
+s"@r-*,&-en,o-%"
find all words with English stems and erase all other markers
+s"@r*,&it,+en"
find all words with Italian stems and English suffix
No command 'spa' found, did you mean:
<snip>
=====

I've tried various permutations of +s@s&spa, but no luck. :-(

Leonid Spektor

Jun 24, 2011, 11:27:35 AM
to chib...@googlegroups.com
Kevin,

You must be using a Unix system. In that case the second command needs to have the +s option surrounded with quotes. So, the command is:

clan/unix/bin/freq +l +s"@s&eng" clan/chats/myfile.cha
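
Editorial note: the quoting issue can be seen without CLAN installed. The sketch below uses show_args, a hypothetical stand-in for freq that only prints the arguments it receives. Unquoted, a Unix shell treats & as "run in the background", so everything after it (here, spa) is read as a separate command; that matches the "[1] 17700" job line and the "No command 'spa' found" message in the earlier output. Quoted, the pattern arrives intact. The last line shows one possible reason an escaped \& might fail, assuming it was typed inside double quotes: the backslash is then passed through literally.

```shell
# show_args is a hypothetical stand-in for freq: it prints each argument on its own line.
show_args() { printf '<%s>\n' "$@"; }

# Quoted, the shell passes the pattern through as a single argument, & included.
quoted=$(show_args +l '+s@s&eng' clan/chats/myfile.cha)
printf '%s\n' "$quoted"

# Inside double quotes a backslash before & is kept, so the program sees \& literally.
escaped=$(show_args "+s@s:eng\&spa")
printf '%s\n' "$escaped"
```

This only illustrates how the shell hands arguments over; it says nothing about how freq then interprets the pattern.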

Leonid.

Kevin Donnelly

Jun 24, 2011, 11:42:41 AM
to chib...@googlegroups.com
Hi Leonid

::::On Friday 24 June 2011 Leonid Spektor said::::


> You must be using unix system.

Of course. :-)

> In this case the second command needs to
> have +s option surrounded with quotes. So, the command is:
> clan/unix/bin/freq +l +s"@s&eng" clan/chats/myfile.cha

No, I'd already tried:
clan/unix/bin/freq +l clan/chats/myfile.cha +s"@s&eng"
and it gives me a printout of most of the lines in the .cha file. Same with
your variant above.

A Cristia

Jul 8, 2015, 9:57:34 PM
to chib...@googlegroups.com, ke...@dotmon.com
Dear all,

This is an old thread, but relevant to my own questions, as I'm also trying to get some freq counts in bilingual conversations.

The CLAN manual (p. 95) states:

freq +l +s"<- spa>" *.cha 


However, when I run it on my file, I don't get anything (counts of zero). Note that 

freq +l *.cha 


does yield the expected results (frequencies by language by speaker). I thought there may be something wrong with my file, so I downloaded the CUHK corpus, because the "angela.cha" file has some '[- zho]' and tried

freq +l +s"<- zho>" *.cha 

freq +l +s"[- zho]" *.cha 


Still zero counts. I also tried online:

freq +l +s"[- zho]" angela.cha

Wed Jul 8 21:55:43 2015
freq (21-Apr-2015) is conducting analyses on:
  ALL speaker tiers
****************************************
From file "angela.cha"
Speaker: *AUN:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *CHI:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *UNC:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *MOT:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

 And the same issues with the [ version.

What am I doing wrong?

Thank you in advance,

Alex Cristia


Leonid Spektor

Jul 9, 2015, 12:16:29 AM
to chib...@googlegroups.com
Alex,

Unfortunately, the manual has combined two mutually exclusive options into one. The correct commands for the example file angela.cha from the CUHK corpus are:

freq +l +s*@s:zho angela.cha

freq +s"<- zho>" angela.cha


The combination of options "+l +s*@s:zho" will locate all words spoken in the zho language. The option +s"<- zho>" will only locate utterances that have either all or most words spoken in the zho language.
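
Editorial note: if the first command is typed into a Unix shell rather than CLAN's Commands window, the unquoted * in +s*@s:zho is a glob character and may be expanded by the shell before freq sees it. The sketch below shows the effect using printf and a contrived file name (both are assumptions for the demonstration, not part of CLAN); quoting the option avoids it.

```shell
# Work in a scratch directory so the demonstration file does not pollute anything.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# A contrived file name that happens to match the unquoted pattern.
touch '+sword@s:zho'

unquoted=$(printf '%s\n' +s*@s:zho)     # the shell expands * before the program runs
quoted=$(printf '%s\n' '+s*@s:zho')     # the pattern is passed through literally

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

cd / && rm -rf "$tmpdir"
```

In sh and bash the unquoted pattern usually stays literal when no file matches, so the problem can go unnoticed; some shells, such as zsh, report an error instead.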

Leonid.




Leonid Spektor

Jul 9, 2015, 12:50:06 PM
to chib...@googlegroups.com
Alex,

I want to add one more command line which will count zho words located only on utterances that have all or most words spoken in zho language:

freq +s"[- zho]" angela.cha


Leonid.

A Cristia

Jul 10, 2015, 5:21:43 PM
to chib...@googlegroups.com
Leonid,

Thank you so much for the response - I see now how the syntax changes between the two commands you first mentioned.

Surprisingly, though, I'm still getting zero counts with both:
freq +s"[- zho]" angela.cha
freq +s"<- zho>" angela.cha

freq +s"[- zho]" angela.cha

Fri Jul 10 12:08:27 2015
freq (21-Apr-2015) is conducting analyses on:
  ALL speaker tiers
****************************************
From file "angela.cha"
Speaker: *AUN:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *CHI:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *UNC:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *MOT:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Yet on lines 31, 174, 303, etc, there are some [- zho].

Thank you once more,

-alex

Brian MacWhinney

Jul 10, 2015, 6:29:05 PM
to ChiBolts
Dear Alex,

Are you using the angela.cha file in the /Biling/CUHK segment of CHILDES?  If so, there is only a minimal
amount of Mandarin (zho) in that file, but the one utterance that is relevant comes up fine with this command:

freq +s"[- zho]" angela.cha

> freq +s"[- zho]" angela.cha
freq +s"[- zho]" angela.cha
Fri Jul 10 18:26:51 2015
freq (08-May-2015) is conducting analyses on:
  ALL speaker tiers
****************************************
From file <angela.cha>
Speaker: *AUN:
------------------------------
    0  Total number of different item types used
    0  Total number of items (tokens)

Speaker: *CHI:
  1 cookie@s
  1 gwaai3sau2
  1 ji4zoeng6
  1 tung4waa6
------------------------------
    4  Total number of different item types used
    4  Total number of items (tokens)
1.000  Type/Token ratio

Perhaps you need a new version of CLAN?

—Brian MacWhinney

Leonid Spektor

Jul 10, 2015, 9:42:25 PM
to chib...@googlegroups.com
Alex,

You are right about CLAN on the childes web site not working correctly. When we updated to a new OS and therefore a new web server, something went wrong and options with spaces in them, such as +s"[- zho]", stopped working. I have fixed the web CLAN script to work correctly again. The CLAN application, however, was always working correctly, and we always encourage people to use the CLAN application, which you can download from http://childes.talkbank.org/clan/, instead of CLAN on the web, because CLAN on the web is very limited in its functionality. Primarily, you cannot create a file with any web CLAN command.
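
Editorial note: the thread does not say what actually broke in the web script. One common way an option containing a space can stop working, offered here purely as an assumption, is that a wrapper expands it without quotes, so sh or bash splits it into two arguments. The sketch uses count_args, a hypothetical helper that reports how many arguments it received.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

opt='+s[- zho]'

split=$(count_args $opt angela.cha)     # unquoted: the space splits the option in two
kept=$(count_args "$opt" angela.cha)    # quoted: the option stays in one piece

echo "unquoted: $split arguments, quoted: $kept arguments"
```

With the option split, a program would see "+s[-" and "zho]" as two separate arguments instead of one search pattern.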

Leonid.



A Cristia

Jul 11, 2015, 8:22:15 AM
to chib...@googlegroups.com, ke...@dotmon.com
Wonderful, thank you for all your explanations! 
I was using the web version in case I had messed up anything in my local CLAN - I'll trust the local more than the web one henceforth.
