AI4LAM Speech-to-Text WG August meeting: Language identification and ASR for low-resource languages (non-standard time slot)


Owen King

Aug 13, 2025, 10:57:23 AM
to AI4LAM group
You're invited to join the AI4LAM Speech-to-Text Working Group for our August call.  It is on August 26 or August 27 (depending on your time zone).

Topic:  Language identification and ASR for low-resource languages

We'll have two speakers, followed by discussion.  In our discussion, we'd like to talk about how everyone in this group is handling transcription of audio where several languages are spoken.

Speaker 1:  Yangyang Chen (Brandeis University)
Abstract: Many voices in the American Archive of Public Broadcasting speak low-resource languages that state-of-the-art ASR systems cannot transcribe or even recognize.  In a new collaboration between GBH Archives and the Lab for Linguistics and Computation at Brandeis University, funded by the Mellon Foundation, we are developing a lightweight system for spoken language identification for low-resource languages, especially Samoan and Yup’ik.  Our approach begins with the language identification functionality built into Whisper, tested against a synthetic benchmark based on the FLEURS dataset.  I will also describe plans to collect new data for experimenting with, and possibly tuning, other ASR models to perform language identification for these languages, with the goal of eventually releasing a software pipeline for language identification.

Speaker 2: Saliha Muradoğlu (Australian National University)
Abstract: Half of the world’s languages are predicted to become extinct within the next century, and many remain largely undocumented.  Language documentation faces major bottlenecks in transcription and annotation.  This talk focuses on annotation, specifically morphological inflection, and the challenge of the large data requirements of deep learning approaches.  I will explore two questions: (1) How much data is needed to capture a language?  Using the Papuan language Nen as a case study, we propose model-based paradigm generation as a supplementary way to measure completeness, where accuracy is analogous to coverage of the paradigm.  (2) Can model-intrinsic metrics guide data collection?  We explore active learning to minimise annotation costs by prioritising difficult cases: selecting data where the model has low confidence (high prediction entropy) improves performance more rapidly than random selection.  Our experiments show that these metrics are robust to language typology, with the same behaviour observed across 30 languages.  We also present a 10-cycle iteration with the Austronesian language Natügu.
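For anyone curious ahead of the call, the entropy-based selection criterion mentioned in the second abstract can be sketched in a few lines.  This is only an illustration, not the speaker's actual code; the function names and the toy model outputs below are hypothetical:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(candidates, k):
    """Pick the k unannotated items whose predicted distributions have
    the highest entropy (i.e. lowest model confidence).
    `candidates` maps item id -> list of class probabilities."""
    ranked = sorted(candidates, key=lambda item: entropy(candidates[item]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model predictions for three unannotated word forms.
preds = {
    "form_a": [0.98, 0.01, 0.01],  # confident -> low entropy
    "form_b": [0.34, 0.33, 0.33],  # uncertain -> high entropy
    "form_c": [0.70, 0.20, 0.10],
}
print(select_for_annotation(preds, 2))  # → ['form_b', 'form_c']
```

The idea is simply that annotator time goes first to the forms the model is least sure about, rather than to a random sample.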

Note the unusual meeting time:
Sydney (Australia – New South Wales) - August 27 at 07:00 AEST (UTC+10)
Madrid (Spain) - August 26 at 23:00 CEST (UTC+2)
London (United Kingdom – England) - August 26 at 22:00 BST (UTC+1)
Boston (USA – Massachusetts) - August 26 at 17:00 EDT (UTC-4)
San Francisco (USA – California) - August 26 at 14:00 PDT (UTC-7)
UTC - August 26 at 21:00

ICS invitation attached.

This should be an interesting and useful session for us.  We hope to see you there!

All the best,
Owen
(on behalf of the Speech-to-Text WG organizers)

Owen King (he/him)
Metadata Operations Specialist
E: owen...@wgbh.org
One Guest Street, Boston, MA 02135


speech-to-text_august_2025.ics