question about batchalign2

Janet Bang

May 22, 2024, 6:28:36 PM

to ChiBolts, Teresita Garduno, krittin.s...@sjsu.edu

Hello,

I am currently working with transcripts that are multilingual (e.g., English/Spanish, English/Korean). Each contains roughly 70-100 utterances of parent-reported first words/phrases from children between 12 and 26 months, so most utterances are 1-3 words, though occasionally longer. We asked parents to report what their child said across multiple days, in whichever language the child used.

We would like to extract lemmas and consider unilemmas (e.g., Mommy, Mamá - Spanish, 엄마 - Korean), both across children who speak different languages and within a child who might use multiple languages. To facilitate this, I was wondering whether Batchalign works with multilingual transcripts?

Thank you!

Janet

Brian Macwhinney

May 22, 2024, 8:39:52 PM

to ChiBolts, Janet Bang, Teresita Garduno, krittin.s...@sjsu.edu

Dear Janet,
Not yet, I am afraid. As my colleague Houjun Liu puts it “code-switching multilingual ASR is still an active and unstable area of research”.

— Brian MacWhinney

> --
> You received this message because you are subscribed to the Google Groups "chibolts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/chibolts/CAC5V4hg5uf152so6ALFk9RyNR_aX6uFb0jEsbH7yYKs2utHD2A%40mail.gmail.com.

Janet Bang

May 23, 2024, 1:54:39 AM

to Brian Macwhinney, ChiBolts, Teresita Garduno, krittin.s...@sjsu.edu

Hi Brian,

Thanks for the quick response. We are still in the world of transcripts but I definitely look forward to the day when we can have multilingual ASR!

Since Houjun mentioned that intra-utterance code-switching isn't yet available, would you recommend that we first run Batchalign and then handle the code-switched utterances by hand? We don't have many for now and we're still working out some processes, but we are thinking about what we could build up moving forward.

Janet

Brian Macwhinney

May 23, 2024, 10:17:14 AM

to Janet Bang, ChiBolts, Teresita Garduno, krittin.s...@sjsu.edu

Janet,

It seems that I didn’t understand your question. If you are talking about tagging using the UD taggers used by Batchalign, then I believe this is possible, but Houjun will need to confirm. However, you would have to mark each utterance in a CHAT file with the language tag if it were not the primary language of the file. Please take a look at section 16.1 of the CHAT manual about that type of coding.

—Brian

Janet Bang

May 23, 2024, 1:23:14 PM

to Brian Macwhinney, Houjun Liu, ChiBolts, Teresita Garduno, krittin.s...@sjsu.edu

Hi Brian,

Yes! What we've done follows the 16.1 conventions, using precodes for whole utterances and @s for single words in intra-utterance switching (e.g., @s:yue or @s:spa for Cantonese or Spanish, respectively). Would Batchalign be able to recognize these codes?
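
For anyone wanting to check their coding before running a tagger, here is a rough Python sketch of how these two kinds of markers could be read off a main-tier line. It is not Batchalign's parser and says nothing about what Batchalign itself recognizes. The function name `tag_languages`, the default primary language `eng`, and the sample utterances are all hypothetical, and the sketch assumes the utterance-level precode looks like `[- spa]` and that word-level switches always carry an explicit code (`@s:spa`). A bare `@s` and compound codes are not handled.

```python
import re

# Utterance-level precode, assumed to look like "[- spa]" at the start of the utterance.
PRECODE = re.compile(r"^\[-\s*([a-z]{2,3})\]\s*")
# Word-level switch with an explicit code, e.g. "agua@s:spa".
WORD_TAG = re.compile(r"^(?P<word>[^@\s]+)@s:(?P<lang>[a-z]{2,3})$")


def tag_languages(line, primary="eng"):
    """Return (speaker, [(word, language), ...]) for one main-tier line.

    `primary` stands in for the file's primary language (hypothetical default).
    """
    speaker, _, text = line.partition(":")
    speaker = speaker.lstrip("*").strip()
    text = text.strip()

    # A precode switches the default language for the whole utterance.
    base = primary
    match = PRECODE.match(text)
    if match:
        base = match.group(1)
        text = text[match.end():]

    tagged = []
    for token in text.split():
        if token in {".", "?", "!"}:
            continue  # skip utterance terminators
        word = WORD_TAG.match(token)
        if word:
            tagged.append((word.group("word"), word.group("lang")))
        else:
            tagged.append((token, base))
    return speaker, tagged


if __name__ == "__main__":
    for sample in ("*CHI:\tmore agua@s:spa .", "*CHI:\t[- spa] mamá ven ."):
        print(tag_languages(sample))
```

A per-word language label like this would make it easy to count how many utterances are mixed and to route only those to hand coding.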