Webinar: Text Language Identification with Common Crawl and Mozilla Data Collective

Santiago M

May 20, 2026, 9:09:49 AM

to ai4lam

Dear AI4LAM community,

I'm happy to share this webinar, which I think will interest many of you.

Mozilla Data Collective is partnering with Common Crawl Foundation for a hands-on webinar on Text Language Identification for under-represented languages, featuring two new open benchmarks on the Mozilla Data Collective platform: CommonLID and CommonVoiceLID.

As you well know, most language identification models work well for English. For many of the world’s other languages, they still fall short. This gap matters because it shapes what data enters AI systems, what tools work for which communities, and whose languages are treated as first-class citizens online.

In this session, Laurie Burchell and Pedro Ortiz Suarez from Common Crawl Foundation and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective will compare frontier LLMs with standard out-of-the-box tools, train local-first low-resource models from scratch, and show how to extend the pipeline to a language you care about.

Join us as we work toward technology that is more inclusive, multilingual, and multicultural!