More free data for French


Alexander Gorodetski

Nov 28, 2022, 8:51:02 AM
to kaldi-help
Hello guys,

I am using the following free data to train a French model (~1200 hours in total):

African
LibriSpeech
TedX
CommonVoice

Can someone recommend more free data? I am especially interested in narrowband data.

Is it possible to download Netflix movies with subtitles, for French only? Can someone recommend a suitable tool for downloading them?

Thanks,
AlexG.

entn-at

Nov 28, 2022, 10:03:08 AM
to kaldi-help
Here are some datasets (disclaimer: I haven't used them myself, so I can't tell how useful they are):

VoxPopuli (https://github.com/facebookresearch/voxpopuli) contains 211 hours of transcribed French speech, according to the paper.
MPF (Multicultural Paris French, ~19 hours): https://www.ortolang.fr/market/corpora/mpf/v3
CFPP2000 (Corpus de Français Parlé Parisien, ~16 hours): https://www.ortolang.fr/market/corpora/cfpp2000
TCOF (Traitement de Corpus Oraux en Français, apparently ~146 hours, mixed adult and child): https://www.ortolang.fr/market/corpora/tcof/v2.1
CLAPI (Corpus de LAngue Parlée en Interaction): https://www.ortolang.fr/market/corpora/clapi
Europarl-ST (https://www.mllp.upv.es/europarl-st/) also contains French speech, but might overlap with VoxPopuli.