Hi everyone,
I have a question about extracting participants' utterances using CLAN commands and was wondering if I'm thinking along the right lines. I'd appreciate it if you could take a look. Thanks!
I'm working with DementiaBank, specifically the ADReSS dataset, a subset of the Pitt corpus. I used the following command to extract the 'flow' tier of participants' utterances: `flo +cr +tPAR*`. Here the asterisk comes after the PAR identifier, but I noticed that in the CLAN manual it typically precedes it, as in `+t*PAR`.
I got the following output after running the command with `+t*PAR`:
flo (13-Apr-2023) is conducting analyses on:
ONLY speaker main tiers matching: *PAR;
And here's the output after running it with `+tPAR*`:
flo (13-Apr-2023) is conducting analyses on:
ONLY speaker main tiers matching: *PAR*;
It looks like the asterisk is used to search for tier ID patterns. Since all my files contain only INV and PAR tiers, I assume tier matching would only affect the selection of the PAR tier. I also used a Python function to verify that the utterances extracted by these two commands were identical (attached below, in case it's helpful).
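To make my assumption concrete, here is a small sketch of how I picture the matching (separate from the verification function at the end of this post). It uses Python's `fnmatch` as a stand-in for glob-style wildcards. This is only my guess at the behaviour, not CLAN's actual tier-selection code, and `select_speakers` is a name I made up for illustration.

```python
from fnmatch import fnmatchcase

# Speaker codes present in my transcripts (main tiers *INV and *PAR).
SPEAKERS = ["INV", "PAR"]


def select_speakers(pattern, speakers=SPEAKERS):
    """Return the speaker codes matched by a glob-style pattern.

    Assumption: this mimics generic wildcard matching only; it is not
    how CLAN is implemented internally.
    """
    return [s for s in speakers if fnmatchcase(s, pattern)]


# Exact speaker code, as I understand +t*PAR
print(select_speakers("PAR"))
# Speaker code followed by a wildcard, as I understand +tPAR*
print(select_speakers("PAR*"))
```

If this reading is right, the two patterns would only differ in a file that had another speaker code beginning with PAR (say, a hypothetical PAR2), which none of my files do.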
Both commands appear to work, but I don't fully understand why. Please let me know your thoughts. Thank you very much!
Best,
Mingyu
def check_clan_command(id, file_old, file_new):
    # Read the .cex file created by the old command (i.e. with +tPAR*)
    with open(file_old, 'r') as file_old_cmd:
        file_o = file_old_cmd.read().splitlines()
    # Read the .cex file created by the new command (i.e. with +t*PAR)
    with open(file_new, 'r') as file_new_cmd:
        file_n = file_new_cmd.read().splitlines()
    # Print the file ID and whether the two outputs match line by line
    print(id, file_o == file_n)
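
In case anyone wants to repeat the check on a whole folder, the same comparison could be run over every output file at once. This is a rough sketch under my own assumptions: the two commands wrote their `.cex` files into separate folders under the same file names, the files are UTF-8, and `compare_cex_dirs` and the folder names are hypothetical.

```python
from pathlib import Path


def compare_cex_dirs(dir_old, dir_new):
    """Compare same-named .cex files in two folders line by line.

    Returns a dict mapping each file name found in dir_old to
    True (identical), False (different), or None (missing in dir_new).
    """
    results = {}
    for old_path in sorted(Path(dir_old).glob("*.cex")):
        new_path = Path(dir_new) / old_path.name
        if not new_path.exists():
            results[old_path.name] = None
            continue
        # Assumption: the CLAN output files are UTF-8 encoded.
        old_lines = old_path.read_text(encoding="utf-8").splitlines()
        new_lines = new_path.read_text(encoding="utf-8").splitlines()
        results[old_path.name] = old_lines == new_lines
    return results
```

Calling something like `compare_cex_dirs("out_tPAR_star", "out_t_star_PAR")` (placeholder folder names) would then list any files whose contents differ.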