Be Part of the Mozilla Data Collective: Supporting Arabic and Underrepresented Languages


Joseph Attieh

Oct 13, 2025, 4:36:48 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Dear members of the SIGARAB community,

My name is Joseph, and I’m a PhD student at the University of Helsinki as well as a regional researcher with the Mozilla Data Collective (MDC).

MDC is a new, community-driven initiative from the Mozilla Foundation. You might know Mozilla from Common Voice, the world’s largest open speech dataset project, which has supported dozens of languages. Building on that success, MDC expands the mission to a broader range of language resources (text, audio, and more), while addressing some of the key challenges in our field:
  • Bridging the data gap: High-quality datasets for non-English languages, especially Arabic and its dialects, remain scarce, limiting research and innovation.
  • Respecting ownership and rights: Unlike many web-scraped datasets, MDC ensures that all resources are shared directly by their owners, with transparent licensing and use controls.
  • Empowering researchers: MDC is built to support the global research community, providing reliable, well-documented datasets.
  • Ensuring visibility and credit: Every contribution is properly attributed, giving contributors recognition and exposure for their work. Because MDC operates on a global scale, it can also help researchers (for instance, those who do not work with Arabic) find relevant datasets more easily.
This initiative covers communities across the globe. My colleagues and I work with researchers supporting a wide range of minority and underrepresented languages to ensure everyone has a place in the open data ecosystem. I am helping out with Arabic since it is one of my native languages.

Before the public launch of the MDC platform, I had already reached out to many individuals and organizations across the Arabic-speaking world who expressed strong interest in contributing their datasets and collaborating on future projects. I’m reaching out here to continue that conversation and explore further collaborations. The first public version of the MDC platform has just launched (here), and we’re eager to involve the Arabic research and data community in contributing to the platform, so that Arabic and its dialects (as well as other low-resource minority languages in the region) are well-represented on MDC.

If you or your organisation have Arabic datasets (or other language data) that could be shared, we would love to include them on the platform. Contributors receive full credit and visibility for their work. Please feel free to reach out if you’re interested or would like to learn more about the initiative. You can reach me at josepha...@gmail.com or the MDC team at mozilladat...@mozillafoundation.org.

Best regards,

Joseph Attieh
PhD Student, University of Helsinki
Regional Researcher (Arabic), Mozilla Data Collective

P.S.: I have posted the same text in Arabic, but it was not formatted properly. Apologies for the spam.