aurora4 database


jiang....@gmail.com

Aug 27, 2015, 11:07:28 AM
to kaldi-help
I was following the link (http://aurora.hsnr.de/aurora-4.html) to download the aurora4 tool (fant.tar.gz), but the file turned out to be broken when I ran gunzip on it. Where else can I download the aurora4 data preparation package?

Thanks!

Daniel Povey

Aug 27, 2015, 1:27:10 PM
to kaldi-help
I can't find anything in the Kaldi example scripts that refers to or attempts to download that tool.  How does this question relate to Kaldi?
Dan


--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

J Jiang

Aug 27, 2015, 2:21:54 PM
to kaldi...@googlegroups.com
Thanks Daniel!

I was reading the README in the Kaldi Aurora4 recipe and saw "for detailed information, please refer to: http://aurora.hsnr.de/aurora-4.html." I took that as a pointer to where the data can be obtained.

Actually, my question is whether your group can offer advice or pointers on how to get the Aurora4 data. (We have already purchased the LDC wsj0 and wsj1 datasets.)


Jan Trmal

Aug 27, 2015, 2:33:26 PM
to kaldi-help
I checked and we have the database already "prepared"  (here at CLSP) -- I'm not sure if it was distributed as such, or you need some software to generate it from WSJ corpora. It does not, however, mention the licence.
So I think it would be better to contact the authors of the database or the people on the papers.
y.

Jan Trmal

Aug 27, 2015, 2:34:46 PM
to kaldi-help
BTW, I was able to download the fant.tar.gz without any problem.
y.

Jan Trmal

Aug 27, 2015, 2:41:05 PM
to kaldi-help
I see now that Aurora is available from ELRA (the European counterpart of the LDC), and it's even free. You will still need to sign a licence agreement, so we cannot provide it to you -- you will have to go through ELRA directly.
y.

J Jiang

Aug 27, 2015, 2:49:18 PM
to kaldi...@googlegroups.com
Thanks for the pointer! (fant.tar.gz can be downloaded; but it cannot be unzipped correctly.)
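
For anyone who hits the same gunzip failure, one way to tell a truncated or corrupted download from a tooling problem is to read the archive end to end before unpacking it. The following is a minimal Python sketch, not part of Kaldi or the Aurora4 tooling; the function name check_tar_gz is hypothetical, and it assumes the file is meant to be a gzip-compressed tar archive.

```python
import gzip
import tarfile
import zlib


def check_tar_gz(path, chunk_size=1 << 20):
    """Return (True, member_names) if the archive reads cleanly,
    otherwise (False, error_message). Hypothetical helper for illustration."""
    try:
        # Pass 1: decompress the whole gzip stream. Reading to the end
        # forces the CRC/length trailer to be checked, so a truncated or
        # corrupted download raises here.
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass
        # Pass 2: confirm the decompressed payload parses as a tar archive
        # and collect the member names without extracting anything.
        with tarfile.open(path, mode="r:gz") as archive:
            names = archive.getnames()
    except (OSError, EOFError, zlib.error, tarfile.TarError) as exc:
        # gzip.BadGzipFile is a subclass of OSError, so it is covered too.
        return False, f"{type(exc).__name__}: {exc}"
    return True, names
```

Calling check_tar_gz("fant.tar.gz") on the downloaded file returns (True, [member names]) if the archive reads cleanly, or (False, message) otherwise. A failure here often points to an incomplete download, so fetching the file again is a reasonable first step.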

J Jiang

Aug 27, 2015, 3:10:28 PM
to kaldi...@googlegroups.com
No further questions on this. Thanks again!

Daniel Povey

Aug 27, 2015, 3:44:28 PM
to kaldi-help, Chao Weng
We definitely need to have better instructions on how to download the aurora corpus, or how to obtain it.
Cc'ing Chao Weng who seems to have created that recipe.  Chao, do you know how one is supposed to get the aurora4 corpus, and do you know anything about the license?
Dan

Chao Weng

Aug 27, 2015, 4:40:30 PM
to Daniel Povey, kaldi-help
We used the contact info on the website to get the data. But that was a long time ago, and I'm not sure whether they are still maintaining it. In terms of licensing, the WSJ licenses should suffice to get the Aurora4 data.

-Chao

Daniel Povey

Aug 27, 2015, 4:40:49 PM
to Chao Weng, kaldi-help
Chao, how about if I change the README to read as follows? Would this be accurate?

About aurora4
    The aurora4 database contains a) clean wsj0 data (Wall Street Journal)
                                  b) artificially added noise with clean wsj0 data
    for detailed information, please refer to: http://aurora.hsnr.de/aurora-4.html.
    To obtain the data you should use the contact info on the website above.
    If you already have the WSJ license from LDC, you should not need any
    additional licenses (but they may want to check that you have a license for
    WSJ).

About the Wall Street Journal corpus:
    This is a corpus of read
    sentences from the Wall Street Journal, recorded under clean conditions.
    The vocabulary is quite large.   About 80 hours of training data.
    Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
    or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
    The latter option is cheaper and includes only the Sennheiser
    microphone data (which is all we use in the example scripts)


Jan Trmal

Aug 27, 2015, 4:44:29 PM
to kaldi-help, Chao Weng
Looking at the Aurora 4a and 4b corpora listed at
http://catalog.elra.info/index.php?cPath=37_40
it seems that ELRA distributes the corpus, although I don't know how (or if) this relates to the fact that LDC distributes WSJ.
y.

Chao Weng

Aug 27, 2015, 4:44:35 PM
to Daniel Povey, kaldi-help
LGTM.

BTW, I would recommend using CHIME instead of Aurora4, since CHIME is a similar dataset but is well maintained. But people have their own reasons to use Aurora anyway.

-Chao

Daniel Povey

Aug 27, 2015, 4:51:25 PM
to Chao Weng, kaldi-help
Which CHIME do you have in mind?  There are 3 in the Kaldi scripts.
Dan

Chao Weng

Aug 27, 2015, 4:54:38 PM
to Daniel Povey, kaldi-help
CHIME2, which is also a noisy and reverberant version of WSJ, with the same number of training/dev/eval utterances.

CHIME3 is also a very good one for far-field, but AFAIK the dev/eval sets contain only 4 speakers.

-Chao