extracting utterances from specified tier ID


Mingyu Yuan

Jul 7, 2025, 3:58:04 PM
to chibolts
Hi everyone, 

I have a question about extracting participants' utterances using CLAN commands and was wondering if I'm thinking along the right lines. I'd appreciate it if you could take a look. Thanks!

I'm working with DementiaBank, specifically the ADReSS dataset, a subset of the Pitt corpus. I used the following command to extract the 'flow' tier of participants' utterances: `flo +cr +tPAR*`. Here, I have the asterisk * placed after the PAR identifier. But I noticed that in the CLAN manual, the asterisk typically precedes it, as in `t*PAR`. 

I got the following output after running `t*PAR`:

flo (13-Apr-2023) is conducting analyses on:
  ONLY speaker main tiers matching: *PAR;


And here's the output after running `tPAR*`:

flo (13-Apr-2023) is conducting analyses on:
  ONLY speaker main tiers matching: *PAR*;


It looks like the asterisk is used to search for tier ID patterns. Since all my files contain only INV and PAR tiers, I assume tier matching would only affect the selection of the PAR tier. I also used a Python function to verify that the utterances extracted by these two commands were identical (attached below, in case it's helpful). 

Both commands appear to work, but I don't fully understand why. Please let me know your thoughts. Thank you very much!

Best,
Mingyu

def check_clan_command(id, file_old, file_new):
    # file_old and file_new are the paths to the two .cex files being compared
    # Read the .cex file created by the old command (i.e. with tPAR*)
    with open(file_old, 'r') as file_old_cmd:
        file_o = file_old_cmd.read().splitlines()
    # Read the .cex file created by the new command (i.e. with t*PAR)
    with open(file_new, 'r') as file_new_cmd:
        file_n = file_new_cmd.read().splitlines()
    # Print the file ID and whether the two outputs are line-for-line identical
    print(id, file_o == file_n)

Leonid Spektor

Jul 7, 2025, 4:40:13 PM
to chib...@googlegroups.com
Hi Mingyu,

The convention for the +/-t option is as follows:

+/-t%mor - include or exclude all %mor tiers.

The * (star) right after "+/-t" is a literal star. All speaker tiers start with a star character, e.g. *PAR: text.

+/-tPAR - a shortcut: if the star is missing after the "t", it is assumed you want speaker tiers.

+/-t*PAR - no shortcut, just the explicit way to specify a particular speaker tier.

+/-tPAR* - a star at the end is a wildcard character that matches anything that follows.
Some corpora have speakers *PAR-one:, *PAR-two:, and so on.

+/-t*PAR* - the same as above, just more explicit.
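
The rules above can be modeled with a small Python sketch. This is only an illustration of the behavior described in this thread, not CLAN's actual implementation; the function names `normalize_tier_spec` and `tier_matches` are hypothetical, and the case-insensitive comparison and optional trailing colon follow the short follow-up note later in the thread.

```python
import re


def normalize_tier_spec(spec: str) -> str:
    """Expand the shortcut form: 'PAR' becomes '*PAR'; '*PAR' and '%mor' stay as given."""
    spec = spec.rstrip(":")  # assumption: a trailing colon is optional
    if spec.startswith(("*", "%")):
        return spec
    return "*" + spec


def tier_matches(spec: str, tier_name: str) -> bool:
    """Return True if a tier name such as '*PAR' or '%mor' is selected by the +t argument."""
    pattern = normalize_tier_spec(spec)
    marker, body = pattern[0], pattern[1:]
    # The first character is the literal tier marker; any later '*' is a wildcard.
    body_regex = ".*".join(re.escape(part) for part in body.split("*"))
    regex = re.escape(marker) + body_regex
    return re.fullmatch(regex, tier_name.rstrip(":"), flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    for spec in ("PAR", "*PAR", "PAR*", "*PAR*"):
        for tier in ("*PAR", "*PAR-one", "*INV"):
            print(f"+t{spec:6} {tier:10} {tier_matches(spec, tier)}")
```

Under this model, all four forms select *PAR, only the two forms with a trailing star also select *PAR-one, and none select *INV. That would explain why `+tPAR*` and `+t*PAR` give identical output on files that contain only INV and PAR tiers.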

Hope this helps,

Leonid.


Leonid Spektor

Jul 7, 2025, 4:46:29 PM
to chib...@googlegroups.com
One more thing about +/-t options. The name(s)/code(s) are not case sensitive.

As a further shortcut, "+tpar" is the same as "+t*PAR:".


Leonid.

Mingyu Yuan

Jul 8, 2025, 3:59:42 PM
to chibolts
Hi Leonid, 

This is helpful! Thank you for the clarification. It looks like tPAR and t*PAR are what I intended to use. As for tPAR*, the wildcard character at the end matches anything that might follow PAR, as I understand it. Does it also match 'nothing', i.e. when the tier name is exactly PAR? Thank you!

Best,
Mingyu

Leonid Spektor

Jul 8, 2025, 4:36:23 PM
to chib...@googlegroups.com
Hi Mingyu,

The * (star) is a wildcard in CLAN. It matches zero or more characters of any kind, so it also matches nothing. There is also the "_" character, which matches exactly one character of any kind. Combining the two as _* matches one or more characters. These wildcard characters are most useful with the +/-s option. To specify a literal * (star) character, use the \* combination.

The position right after the +/-t option is the only place in CLAN where * (star) is always a literal star character and not a wildcard.
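
The wildcard semantics described above can be illustrated by translating a pattern into a regular expression. This is a minimal sketch of the stated rules only (`*` for zero or more characters, `_` for exactly one, `\*` for a literal star); it is not how CLAN itself is implemented, the helper names `clan_wildcard_to_regex` and `wildcard_match` are hypothetical, and details such as case handling or whole-word matching in +/-s are not modeled.

```python
import re


def clan_wildcard_to_regex(pattern: str) -> str:
    """Translate '*', '_' and the escape sequence backslash-star into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\*", i):
            out.append(re.escape("*"))  # escaped star: literal '*'
            i += 2
            continue
        ch = pattern[i]
        if ch == "*":
            out.append(".*")  # zero or more of any character
        elif ch == "_":
            out.append(".")  # exactly one of any character
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the wildcard pattern."""
    regex = clan_wildcard_to_regex(pattern)
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


if __name__ == "__main__":
    print(wildcard_match("PAR*", "PAR"))       # star may match nothing
    print(wildcard_match("PAR_*", "PAR"))      # '_*' needs at least one character
    print(wildcard_match("PAR_*", "PAR-one"))
    print(wildcard_match(r"a\*b", "a*b"))      # escaped star is literal
```

In this model `PAR*` matches the bare name `PAR`, which is the "matches nothing" case asked about, while `PAR_*` does not.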


Leonid.
