aurora4 database

jiang....@gmail.com

Aug 27, 2015, 11:07:28 AM

to kaldi-help

I followed the link (http://aurora.hsnr.de/aurora-4.html) to download the aurora4 tool (fant.tar.gz), but the file turned out to be broken when I ran gunzip on it. Where else can I download the aurora4 data preparation package?

Thanks!

Daniel Povey

Aug 27, 2015, 1:27:10 PM

to kaldi-help

I can't find anything in the Kaldi example scripts that refers to or attempts to download that tool. How does this question relate to Kaldi?
Dan


J Jiang

Aug 27, 2015, 2:21:54 PM

to kaldi...@googlegroups.com

Thanks Daniel!

I was reading the README in the Kaldi Aurora4 recipe and saw "for detailed information, please refer to: http://aurora.hsnr.de/aurora-4.html." I thought that was a pointer to where to get the data.

Actually, my question is whether your group can offer advice or pointers on how to get the Aurora4 data. (We have already purchased the LDC wsj0 and wsj1 datasets.)


Jan Trmal

Aug 27, 2015, 2:33:26 PM

to kaldi-help

I checked and we have the database already "prepared" (here at CLSP) -- I'm not sure whether it was distributed as such, or whether you need some software to generate it from the WSJ corpora. Our copy does not, however, mention the licence.
So I think it would be better to contact the authors of the database or the people on the papers.

y.

Jan Trmal

Aug 27, 2015, 2:34:46 PM

to kaldi-help

BTW, I was able to download the fant.tar.gz without any problem.

y.

Jan Trmal

Aug 27, 2015, 2:41:05 PM

to kaldi-help

I see now that Aurora is available from ELRA (the European counterpart of LDC), and it's even free. You will still need to sign a licence agreement, so we cannot provide it to you -- you will have to go through ELRA directly.

y.

J Jiang

Aug 27, 2015, 2:49:18 PM

to kaldi...@googlegroups.com

Thanks for the pointer! (fant.tar.gz can be downloaded, but it cannot be unzipped correctly.)
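
A download that completes but then fails under gunzip can be truncated, or may not be gzip data at all (for example an HTML error page, or a tarball that was already decompressed on the way down). Here is a minimal Python sketch for telling these cases apart, assuming the file has been saved locally; the function name diagnose_archive and the labels it returns are illustrative only and are not part of any Aurora or Kaldi tooling:

```python
import gzip
import tarfile
import zlib


def diagnose_archive(path):
    """Return a short label describing the state of a supposed .tar.gz file.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        magic = f.read(2)

    if magic != b"\x1f\x8b":
        # Not gzip. It may still be a valid uncompressed tar archive.
        if tarfile.is_tarfile(path):
            return "plain tar, not gzip"
        return "not gzip"

    # Decompress the whole stream and discard the output, so that a
    # truncated or damaged file raises before we report success.
    try:
        with gzip.open(path, "rb") as gz:
            while gz.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error):
        return "corrupt or truncated gzip"

    return "ok"


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "->", diagnose_archive(name))
```

Running the script with fant.tar.gz as its argument prints one of the labels above. A "plain tar, not gzip" result would suggest extracting with tar directly instead of gunzip; "not gzip" would suggest opening the file in a text editor to see what was actually downloaded.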

J Jiang

Aug 27, 2015, 3:10:28 PM

to kaldi...@googlegroups.com

No further questions on this. Thanks again!

Daniel Povey

Aug 27, 2015, 3:44:28 PM

to kaldi-help, Chao Weng

We definitely need to have better instructions on how to download the aurora corpus, or how to obtain it.

Cc'ing Chao Weng who seems to have created that recipe. Chao, do you know how one is supposed to get the aurora4 corpus, and do you know anything about the license?

Dan

Chao Weng

Aug 27, 2015, 4:40:30 PM

to Daniel Povey, kaldi-help

We used the contact info on the website to get the data. But that was a long time ago, and I'm not sure whether they are still maintaining it. In terms of license, the WSJ licenses should suffice to get the Aurora4 data.

-Chao

Daniel Povey

Aug 27, 2015, 4:40:49 PM

to Chao Weng, kaldi-help

Chao, how about if I change the README to read as follows? Will this be accurate?

About aurora4

The aurora4 database contains
  a) clean wsj0 data (Wall Street Journal)
  b) artificially added noise with clean wsj0 data
for detailed information, please refer to: http://aurora.hsnr.de/aurora-4.html.

To obtain the data you should use the contact info on the website above.
If you already have the WSJ license from LDC, you should not need any
additional licenses (but they may want to check that you have a license for
WSJ).

About the Wall Street Journal corpus:

This is a corpus of read sentences from the Wall Street Journal, recorded
under clean conditions. The vocabulary is quite large. About 80 hours of
training data.
Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
The latter option is cheaper and includes only the Sennheiser
microphone data (which is all we use in the example scripts)

Jan Trmal

Aug 27, 2015, 4:44:29 PM

to kaldi-help, Chao Weng

Looking at the Aurora 4a and 4b corpora listed at
http://catalog.elra.info/index.php?cPath=37_40
it seems that ELRA distributes the corpus -- although I don't know how (or if) this relates to the fact that LDC distributes WSJ.

y.

Chao Weng

Aug 27, 2015, 4:44:35 PM

to Daniel Povey, kaldi-help

LGTM.

BTW, I would recommend using CHIME instead of Aurora4, since CHIME is a similar dataset but is well maintained. But people may have their own reasons to use Aurora anyway.

-Chao

Daniel Povey

Aug 27, 2015, 4:51:25 PM

to Chao Weng, kaldi-help

Which CHIME do you have in mind? There are 3 in the Kaldi scripts.
Dan

Chao Weng

Aug 27, 2015, 4:54:38 PM

to Daniel Povey, kaldi-help

CHIME2, which is also a noisy and reverberant version of WSJ and has the same number of training/dev/eval utterances.

CHIME3 is also a very good one for far-field, but AFAIK the dev/eval set contains only 4 speakers.

-Chao
