CLAN: Text Extraction

Snigdha Khanna

Feb 6, 2024, 3:08:36 PM
to chibolts
Hello!

I am trying to extract "clean" text from annotated transcripts that I have. Is there a way to use CLAN to export plain text (.txt), or a simpler method to remove the annotations from the transcripts, so that I can parse them with NLP tools?

Any help is appreciated!

Thanks,
Snigdha

Brian Macwhinney

Feb 6, 2024, 4:10:32 PM
to ChiBolts
CLAN’s FLO program does most of this. Alternatively, you could grab all the <w> tags from the XML version of the database.

What kind of NLP do you want to use? You could apply Universal Dependencies directly.

— Brian MacWhinney
Teresa Heinz Professor of Cognitive Psychology,
Language Technologies and Modern Languages, CMU
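
For readers who want to try the XML route mentioned above, here is a minimal Python sketch; it is not taken from the thread. It assumes that utterances are `u` elements and words are `w` elements, that the file may declare an XML namespace, and that only the text sitting directly inside a `w` is wanted (nested annotation elements are skipped). The function names are made up for illustration, so check the element names against your own XML files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def words_per_utterance(xml_text):
    """Return one space-joined string of words per utterance.

    Assumption: utterances are <u> elements and words are <w> elements.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for u in root.iter():
        if local_name(u.tag) != "u":
            continue
        words = []
        for w in u.iter():
            if local_name(w.tag) != "w":
                continue
            # Keep only text directly under <w>; skip the contents of
            # child elements (assumed to hold annotations).
            pieces = [w.text or ""] + [child.tail or "" for child in w]
            word = "".join(pieces).strip()
            if word:
                words.append(word)
        lines.append(" ".join(words))
    return lines
```

To use it on a file, read the file contents and pass them in, for example `words_per_utterance(open("sample.xml", encoding="utf-8").read())`, where `sample.xml` is a placeholder name.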

Snigdha Khanna

Feb 6, 2024, 4:14:57 PM
to chibolts
I want to remove all annotations, such as the gestures and errors, so I would like a txt file containing just the transcribed text without annotations.

Any idea how to do that?

Leonid Spektor

Feb 6, 2024, 4:39:03 PM
to chib...@googlegroups.com
The command flo +ca +t* *.cha should work.


Leonid.

Giulia Sanguedolce

Feb 6, 2024, 5:31:16 PM
to chib...@googlegroups.com
Hello Snigdha :) I used some Python code to extract the text lines from the .cha files. Let me know if this would help and I'll send you the code!

Regards, 
Giulia
________________
Giulia Sanguedolce 
PhD student - AI & Machine Learning for Healthcare https://ai4health.io 
Department of Electrical & Electronic Engineering | Department of Brain Science
Imperial College London
e-mail: gs2...@ic.ac.uk | sangue...@gmail.com
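
The code Giulia mentions was not posted to the thread. As a rough illustration of the general idea only, here is a minimal Python sketch. It assumes the usual .cha layout: main tiers start with `*`, headers with `@`, dependent tiers with `%`, and continuation lines with a tab. The handful of regular expressions covers only a few common codes (bracketed codes, `&` events, `(.)` pauses, `xxx`-style placeholders), and all names are hypothetical. It is not a substitute for FLO.

```python
import re

# Simplified patterns; real transcripts use many more codes than these.
BRACKETED = re.compile(r"\[[^\]]*\]")              # e.g. [/], [=! laughs]
EVENTS = re.compile(r"&\S+")                       # e.g. &=points
PAUSES = re.compile(r"\(\.+\)")                    # (.), (..), (...)
PLACEHOLDERS = re.compile(r"\b(?:xxx|yyy|www)\b")  # unintelligible material


def clean_utterance(text):
    """Remove a few common annotation codes and normalize whitespace."""
    for pattern in (BRACKETED, EVENTS, PAUSES, PLACEHOLDERS):
        text = pattern.sub(" ", text)
    text = re.sub(r"[<>]", "", text)  # scope markers around words
    return " ".join(text.split())


def extract_main_tiers(cha_text):
    """Return the cleaned text of every main tier (lines starting with '*')."""
    utterances = []
    current = None  # text of the main tier being assembled, if any
    for line in cha_text.splitlines():
        if line.startswith("*"):
            if current is not None:
                utterances.append(clean_utterance(current))
            # Drop the speaker code, keep what follows the first colon.
            _, _, current = line.partition(":")
        elif line.startswith("\t") and current is not None:
            current += " " + line.strip()  # continuation of a main tier
        else:
            # A header, dependent tier, or its continuation ends the tier.
            if current is not None:
                utterances.append(clean_utterance(current))
                current = None
    if current is not None:
        utterances.append(clean_utterance(current))
    return utterances
```

Writing one utterance per line is then just `"\n".join(extract_main_tiers(text))`. Anything the patterns do not cover will pass through unchanged, so spot-check the output against the original transcripts.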

Xiaowei Zhao

Aug 13, 2024, 9:13:06 AM
to chib...@googlegroups.com
Hello,

First of all, sorry to revive this conversation from so long ago!

I am also trying to use the "flo" command to extract "clean" text from .cha files, and it works very well except for one small thing: it seems to wrap long lines automatically, breaking any line that exceeds a certain length into several lines.

For example, for a file (060002c.cha) in the MacWhinney database, I ran
flo +cr +t* 060002c.cha

and for a long line in the original .cha file 
"
*MAR: no (.) it's not Mr Munsters (.) it's only the Munsters (.) what if the monsters won't be on anymore and xxx will be with other movie (.) what if it's at with the other program . 
"

I got three lines  
"
no it's not Mr Munsters it's only the Munsters what if the monsters won't
be on anymore and will be with other movie what if it's at with the other
program.
"    
I am wondering whether there is any command, option, or switch within CLAN to avoid this and keep each utterance on a single line. I tried "LONGTIER", but it did not work.

Many thanks!

Sincerely,
Xiaowei

Xiaowei Zhao, Ph.D.
Professor of Psychology
Emmanuel College
400 The Fenway | Boston | MA 02115
www.emmanuel.edu

Brian Macwhinney

Aug 13, 2024, 9:37:46 AM
to ChiBolts, Xiaowei Zhao, Leonid Spektor
Yes, I see the problem. The longest lines get wrapped and a tab is added. You could replace each carriage return followed by a tab (\r\t) with nothing. Better yet, Leonid may be able to fix this problem.

— Brian MacWhinney
Teresa Heinz Professor of Cognitive Psychology,
Language Technologies and Modern Languages, CMU
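
For output that was produced by a FLO version that still wraps, the replacement described above could be sketched in Python as follows. This is an illustration, not something posted in the thread. It assumes that every wrapped continuation line begins with a tab, as described, and it accepts \r, \n, or \r\n as the line ending. It substitutes a single space rather than nothing so that the words on either side of the break do not fuse; the function name is hypothetical.

```python
import re

# A line break followed by a tab marks a wrapped continuation line.
# Spaces or tabs around the break are absorbed into the single space.
WRAP = re.compile(r"[ \t]*(?:\r\n|\r|\n)\t[ \t]*")


def unwrap(text):
    """Join tab-indented continuation lines onto the line before them."""
    return WRAP.sub(" ", text)
```

For example, `unwrap("what if the monsters won't\n\tbe on anymore\n")` gives `"what if the monsters won't be on anymore\n"`. Line breaks that are not followed by a tab are left alone, so each utterance stays on its own line.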




Leonid Spektor

Aug 13, 2024, 2:50:39 PM
to Brian Macwhinney, ChiBolts, Xiaowei Zhao
I have changed FLO to not wrap long lines and updated everything on the web.


Leonid.