Large scale combining CHILDES files


Amanda Owen Van Horne

Mar 21, 2017, 1:10:46 PM
to chib...@googlegroups.com
Hi, 

I'm trying to combine all available English, non-clinical CHILDES files based on the target child's age. I've organized my files (by hand) into folders binned by month, using the child's age reported in the header information. Now I would like to strip CHILDES codes from the speaker tier and output all of those files into a TEMP folder; I will then use the files in TEMP to create a single file of only adult or only child speakers. The trouble is that, as the number of files I am working with gets larger, CLAN seems to skip files. When I run FLO for 0-12 months I get the (expected) 192 files. When I run it for 0-15 months I get 340 files in the TEMP folder, when I should be getting 372. The dropping of files continues and becomes more problematic as we move to broader and broader age ranges, and it's hard to track down the individual files that might be responsible because so many files are involved. Can anyone provide any guidance?
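As an aside on the by-hand binning: the age lookup could in principle be scripted. The following is only a minimal sketch, assuming each transcript carries a CHAT `@ID:` header whose pipe-separated fields are language, corpus, speaker code, age, and so on, with age written as `years;months.days`. The function names `age_in_months` and `target_child_age` are hypothetical, and headers that deviate from this layout would need separate handling.

```python
import re

# Age as written in CHAT headers, e.g. "1;03.14", "1;03." or "2;"
AGE_RE = re.compile(r"^(\d+);(\d{1,2})?(?:\.(\d{1,2})?)?$")


def age_in_months(age_field):
    """Convert a 'years;months.days' string to whole months, or None if unparseable."""
    match = AGE_RE.match(age_field.strip())
    if match is None:
        return None
    years = int(match.group(1))
    months = int(match.group(2) or 0)  # a missing month counts as 0
    return years * 12 + months


def target_child_age(chat_text, speaker="CHI"):
    """Return the age in months from the first @ID line for the given speaker code."""
    for line in chat_text.splitlines():
        if line.startswith("@ID:"):
            fields = line[len("@ID:"):].strip().split("|")
            # Assumed field order: language|corpus|code|age|...
            if len(fields) > 3 and fields[2] == speaker:
                return age_in_months(fields[3])
    return None
```

Days are ignored here, so a child aged 1;03.14 lands in the 15-month bin; a different rounding rule would change only `age_in_months`.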

Amanda 

working directory: CHILDES by Age Folder
output directory: TEMP

FLO *.cha -t% +d +r1 +re +ffin
  • FLO -- command to strip codes from main tier
  • *.cha -- apply to all files in working directory
  • -t% -- get rid of non-speaker related tiers like mor and spa
  • +d -- output in chat format
  • +r1 -- if something is in () remove () and keep content (e.g., (be)cause = because)
  • +re -- work recursively through subfolders
  • +ffin -- output to a file with the code .fin before .cex

output (TEMP) will fill with *.fin.cex files (one per original file) 
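
When the number of files in TEMP falls short of the number of inputs, one way to find the culprits is to compare names outside CLAN. The sketch below is not part of CLAN; it assumes that each `name.cha` produces `name.fin.cex` and that all outputs land directly in the TEMP folder, as described above. The function name `find_missing_outputs` is hypothetical. Under that same assumption, two inputs with the same name in different subfolders would map to one output name, so the sketch reports those as well.

```python
from collections import defaultdict
from pathlib import Path


def find_missing_outputs(source_root, output_dir, in_ext=".cha", out_suffix=".fin.cex"):
    """Return (missing, collisions) for a recursive FLO run.

    missing:    input files with no matching output file in output_dir
    collisions: file stems that occur more than once under source_root
    """
    source_root = Path(source_root)
    output_dir = Path(output_dir)

    # Group every input transcript by its name without the extension.
    by_stem = defaultdict(list)
    for path in sorted(source_root.rglob("*")):
        if path.is_file() and path.suffix == in_ext:
            by_stem[path.stem].append(path)

    # Stems of the output files actually present (assumed naming: stem + out_suffix).
    produced = {
        p.name[: -len(out_suffix)]
        for p in output_dir.iterdir()
        if p.is_file() and p.name.endswith(out_suffix)
    }

    missing = sorted(
        p for stem, paths in by_stem.items() if stem not in produced for p in paths
    )
    collisions = {stem: paths for stem, paths in by_stem.items() if len(paths) > 1}
    return missing, collisions
```

Calling `find_missing_outputs("CHILDES by Age Folder", "TEMP")` with the two folder paths from this post would return the inputs that have no counterpart in TEMP, which narrows the search to a short list of files.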

Then change your working directory to the TEMP folder and reset your output directory to someplace memorable.

KWAL *.cex -t*CHI +d +r1 +x>0w +u +f

  • KWAL -- keyword analysis; with no keyword specified it outputs all content
  • *.cex -- all files in working directory
  • -t*CHI -- exclude the *CHI tier, leaving only adult speakers
  • +d -- in chat format
  • +r1 -- if something is in () remove () and keep content (e.g., (be)cause = because)
  • +x>0w -- only lines with 1 or more words; no empty utterances or utterances that only have info on other tiers
  • +u -- combine all output into one file
  • +f -- print to file (not to the screen)

Final output from these two processes will end with *.fin.kwal.cex (a single combined file) 



Amanda J. Owen Van Horne, PhD CCC-SLP
Associate Professor
University of Iowa

Brian MacWhinney

Mar 21, 2017, 1:23:12 PM
to chib...@googlegroups.com, Leonid Spektor

Dear Amanda,

 

  From what you write, the problem occurs during your use of FLO.  For us (Leonid or me) to replicate the problem, we would need the complete collection of 340 files for this 0-15 months period.  It could be that some particular file is causing the problem, but it could also be the case that you are running up against a machine limitation or a CLAN limitation.  In any case, we would need to receive the collection that triggers the problem, along with the command you are using to replicate the problem.  You could send this to me or, better, Leonid (spe...@andrew.cmu.edu) as a zipped email attachment, preserving the folder structure you are using.  Before sending to us,  please make sure that this problem is replicable on your side.  You might also want to test on a second computer.  Also please make sure you are using a current version of CLAN.

 

--Brian


Leonid Spektor

Mar 21, 2017, 2:48:02 PM
to chib...@googlegroups.com

Amanda,

    I want you to try two things.

1. Please set the working directory (CHILDES by Age Folder) to 0-15 months and run the command "dir -r *.cha". At the end of the output in the "CLAN Output" window you will see how many files CLAN has found. If the number is 340 files, then for some reason, maybe a bad file extension, a bad directory name, or file protection, CLAN can't see the other files as .cha files. In this case run the command "dir -r -n *.cha" and you will see the files that CLAN doesn't recognize as .cha files.

2. If the "dir -r *.cha" command finds 372 files, then the problem might be with the FLO command or the "+re" function. Please get data from our server at URL "http://childes.talkbank.org/data/Eng-NA/Braunwald.zip". Unzip it and in CLAN set the working directory to the unzipped Braunwald directory. Set the output to an empty TEMP directory and run the command "FLO *.cha -t% +d +r1 +re +ffin". On my Mac and Windows 10 PC I get 900 .fin.cex files in the TEMP directory. If you get the same number, then something is wrong with the files in your 0-15 months set. If you get a different number, then make sure you have the latest CLAN. Maybe even reboot your computer and try the same command again.

If you still get fewer than 900 files in the TEMP directory, then please email me directly the full output of the CLAN Output window after you run the "FLO *.cha -t% +d +r1 +re +ffin" command, and tell me whether you are using a Mac or a PC.

If you get 900 files in TEMP, but you still can't figure out why in step 1 you get 340 files, then zip and email your 0-15 months directory to me and I will see if I can figure out what is wrong.
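
For a cross-check outside CLAN, the file count from step 1 can be approximated with a short script. This is only a sketch and does not reproduce what "dir -r" does internally: it assumes that a usable transcript is a non-empty file whose name ends in exactly ".cha", and it lists everything else (for example ".CHA", ".cha.txt", or zero-byte files) for manual inspection. The function name `survey_transcripts` is hypothetical.

```python
from pathlib import Path


def survey_transcripts(root, expected_ext=".cha"):
    """Walk root recursively and split files into (matched, suspicious).

    matched:    non-empty files whose extension is exactly expected_ext
    suspicious: every other non-hidden file, for manual inspection
    """
    root = Path(root)
    matched, suspicious = [], []
    for path in sorted(root.rglob("*")):
        # Skip directories and hidden files such as .DS_Store.
        if not path.is_file() or path.name.startswith("."):
            continue
        if path.suffix == expected_ext and path.stat().st_size > 0:
            matched.append(path)
        else:
            suspicious.append(path)
    return matched, suspicious
```

If `len(matched)` agrees with the number CLAN reports, the entries in `suspicious` are the first files to look at; if the two counts differ, the difference itself is worth reporting.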

Leonid.

Amanda Owen Van Horne

Mar 21, 2017, 3:35:03 PM
to chibolts
Hi Leonid, 
  Thank you - the commands in step 1 identified a set of corrupted files. I'm very grateful. It will take me some time to restore those files to the right directories, but I suspect that will solve the problem. 

Amanda 