Excluding stories and songs from corpus

sit...@g.harvard.edu

Apr 8, 2019, 5:00:42 PM
to chibolts
Hi all,

I am doing a corpus study using the Providence corpus right now. For the purposes of this study, I am interested in analyzing only the utterances that are produced by the speakers during their natural conversational exchanges, but the corpus also includes many stretches of talk that consist of the stories that parents read to the children, or songs and nursery rhymes they sing, etc. Is there a practical way to weed out these parts from the corpus or do I have to face the gargantuan task of eliminating them manually?

Thanks in advance for your help!

Simge Topaloglu

Brian MacWhinney

Apr 8, 2019, 5:11:19 PM
to ChiBolts
Dear Simge,

Yes, you would have to do this semi-manually. You could rely on the system of GEM markers, as described in the manual: mark each story or song as a gem, and then exclude those stretches from your analyses. If you decide to do this, it would probably be a good idea to add your gem markers to the version of the corpus in the database.
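
Once a transcript carries gem markers, the filtering step itself can be scripted outside CLAN. Here is a minimal sketch in Python, assuming each story or song is bracketed by @Bg: and @Eg: header lines. The labels "story" and "song" and the function name strip_gems are hypothetical choices for illustration, not something already present in the Providence files.

```python
import re

# Matches CHAT gem headers such as "@Bg:\tstory" or "@Eg:\tstory".
GEM_HEADER = re.compile(r"@(Bg|Eg):\s*(.+?)\s*$")

def strip_gems(lines, excluded_labels):
    """Return the lines that fall outside any excluded gem.

    `excluded_labels` is a set of gem labels (hypothetical here) whose
    contents, including main and dependent tiers, should be dropped.
    """
    kept = []
    open_gems = set()  # excluded gems we are currently inside
    for line in lines:
        match = GEM_HEADER.match(line)
        if match and match.group(2) in excluded_labels:
            kind, label = match.groups()
            if kind == "Bg":
                open_gems.add(label)
            else:
                open_gems.discard(label)
            continue  # drop the gem header line itself
        if not open_gems:
            kept.append(line)
    return kept

sample = [
    "@Begin",
    "*MOT:\twhat do you want ?",
    "@Bg:\tstory",
    "*MOT:\tonce upon a time .",
    "@Eg:\tstory",
    "*CHI:\tmore .",
]
filtered = strip_gems(sample, {"story", "song"})
```

The manual pass of deciding where each story or song begins and ends is still needed; the sketch only covers what happens after the markers are in place.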

-- Brian MacWhinney
Teresa Heinz Professor of Cognitive Psychology 
Language Technologies, and Modern Languages
ma...@andrew.cmu.edu





--
You received this message because you are subscribed to the Google Groups "chibolts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+u...@googlegroups.com.
To post to this group, send email to chib...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/chibolts/6ad9aaca-c4e1-4d5c-b398-b95f490374fb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

sit...@g.harvard.edu

Apr 9, 2019, 8:38:46 PM
to chibolts
Hi Prof. MacWhinney,

Thanks for your reply! Well, I guess it will take me a while to do this.

I have another question regarding the same study. Right now, I am using the command kwal +sX -w10 +w5 -t*CHI, where X is a placeholder for the words that I want to search for in the input. Ideally, however, I would select a stretch of talk like this only if the target utterance containing X is not a repetition of the immediately preceding line (e.g., the parent uses X only because another speaker said X in the immediately preceding line). My question is much the same as above: is there a practical way to exclude repetitive utterances of this sort?
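
To make the criterion concrete, here is a rough sketch in Python of the narrowest reading: a hit is skipped only when the target word itself also occurs in the immediately preceding line. It works on main-tier lines outside CLAN; the function names, the simple tokenizer, and the sample utterances are all hypothetical, and skipping *CHI is only meant to mirror the intent of -t*CHI.

```python
import re

def words(utterance):
    """Lowercased word tokens of a main-tier line, without the speaker code."""
    text = utterance.split(":", 1)[1] if utterance.startswith("*") else utterance
    return re.findall(r"[a-z']+", text.lower())

def non_repeated_hits(utterances, target, skip_speaker="*CHI"):
    """Indices of utterances containing `target` that do not repeat it
    from the immediately preceding line."""
    hits = []
    for i, utt in enumerate(utterances):
        if utt.startswith(skip_speaker + ":"):
            continue  # do not search the child's own utterances
        if target not in words(utt):
            continue
        if i > 0 and target in words(utterances[i - 1]):
            continue  # target was just said in the preceding line
        hits.append(i)
    return hits

utts = [
    "*CHI:\tdoggie .",
    "*MOT:\tyes a doggie .",
    "*MOT:\twhere is the ball ?",
    "*FAT:\tthe doggie is outside .",
]
hits = non_repeated_hits(utts, "doggie")
```

This checks the target word only, so a shared function word such as "the" in both lines does not trigger an exclusion.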

Thank you so much!

Simge

Brian MacWhinney

Apr 9, 2019, 9:25:28 PM
to ChiBolts
Dear Simge,
    I'm not sure that I fully understand your criteria for excluding utterances with repeated words.  For example, what if a common word like "the" or "of" is used in both utterances?  Do you then really want to exclude the second one?  There is a program called CHIP that carefully analyzes the overlap between sentences in terms of repeated words, but it might not do exactly what you want.
    I am curious why you think it is important to conduct these different types of exclusions.  What exactly are you looking for?  What hypothesis might you be testing?

-- Brian MacWhinney
