HamBam: The Hamedan-Bamberg Corpus of Contemporary Spoken Persian


Behnam Esfahbod

Apr 18, 2022, 6:34:12 PM
to Persian Computing
Hi P-C'ers,

There's been a new development in the area of Persian-language corpora, which I believe will interest some of you:

The HamBam corpus, a corpus of annotated recordings of contemporary spoken Persian, is being developed jointly by Bu-Ali Sina University (Hamedan) and the University of Bamberg (Germany). Their work is released as audio with time-aligned text, under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).

"The texts gathered in this corpus are predominantly monological in nature, and represent colloquial spoken Persian, as the neutral lingua franca used throughout Iran by educated Iranians. The speakers are of both genders, various ages, different educational levels and occupations. The recordings include radio interviews on a variety of topics, as well as less formal oral history recounted in a domestic setting among family members."

As of today, the corpus has published 34 recordings, and the wav/mp3/eaf/xml/tsv files can be downloaded from its homepage.
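For anyone who wants to poke at the time-aligned text, here is a rough Python sketch of pulling (tier, start, end, text) rows out of an .eaf file. It assumes the files follow the usual ELAN annotation layout (TIME_SLOT entries referenced by ALIGNABLE_ANNOTATION elements); I have not checked this against the actual HamBam files, and the sample document, the tier name "utterance", and the function name read_aligned_annotations are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written EAF-like sample for illustration only (not HamBam data).
# The tier name "utterance" is an assumption, not taken from the corpus.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>salam</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>chetori</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_aligned_annotations(root):
    """Return (tier_id, start_ms, end_ms, text) tuples from an EAF element tree."""
    # Map each time-slot ID to its offset in milliseconds.
    # Slots without a TIME_VALUE (unaligned) are skipped.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


rows = read_aligned_annotations(ET.fromstring(SAMPLE_EAF))
for tier_id, start, end, text in rows:
    print(tier_id, start, end, text)
```

For a downloaded file you would swap ET.fromstring(SAMPLE_EAF) for ET.parse(path).getroot(), where path points at the .eaf file on disk.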

Example preview pages of the recordings:
The project was made public in 2022, and so far nothing has been published about its roadmap or about external contributions. If anyone has more information, or would like to contact the project leads, it would be great to share the findings here.


Behnam Esfahbod | بهنام اسفهبد | Q4880939 Q2064908 | behnam.es
