I trained a 16kHz ASR model and also want to test its performance on some commonly used test sets, like eval2000 and rt03, whose sampling rate is 8kHz.
So here is my `conf/mfcc_hires.conf`:
```
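# High-resolution MFCC config. --high-freq=-400 means 400 Hz below Nyquist;
# --allow-upsample lets Kaldi upsample audio whose rate is below the
# configured --sample-frequency (default 16 kHz) during feature extraction.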
--use-energy=false
--num-mel-bins=40
--num-ceps=40
--low-freq=20
--high-freq=-400
--allow-upsample=true
```
and a line in `wav.scp`:
```
en_4156-A sph2pipe -f wav -p -c 1 /path/to/LDC/LDC2002S09/hub5e_00/english/en_4156.sph |
```
In this way, resampling is done by Kaldi while extracting the MFCCs. The WER is 28.17.
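For reference, the features are extracted with the standard hires MFCC pass; a minimal sketch (the data/log/feature directory names and job count here are illustrative, not the exact ones I used):
```
steps/make_mfcc.sh --nj 20 --mfcc-config conf/mfcc_hires.conf \
  data/eval2000_hires exp/make_hires/eval2000 mfcc
steps/compute_cmvn_stats.sh data/eval2000_hires exp/make_hires/eval2000 mfcc
```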
But Kaldi's resampling is not straightforward to use elsewhere, so I modified the `wav.scp` to:
```
en_4156-A sph2pipe -f wav -p -c 1 /path/to/LDC/LDC2002S09/hub5e_00/english/en_4156.sph | sox - -t wav -r 16000 - |
```
or, more explicitly,
```
en_4156-A sph2pipe -f wav -p -c 1 /path/to/LDC/LDC2002S09/hub5e_00/english/en_4156.sph | sox -t wav -e signed-integer -r 8k -b 16 -c 1 - -t wav -e signed-integer -r 16k -b 16 -c 1 -G - |
```
In this way, resampling is done by sox, and Kaldi extracts the features directly. I believe there should be no difference between these two approaches.
However, this time the WER is 35.13, much worse than with Kaldi's resampling.
These results are on eval2000, and I believe they are quite simple to reproduce with any 16kHz model.
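One way to narrow this down might be to extract features for a single utterance through both pipelines and compare them directly; a rough sketch (the .scp and output file names are made up, and it assumes the Kaldi binaries are on the PATH):
```
# wav_kaldi.scp: one utterance with the plain sph2pipe pipeline (Kaldi upsamples);
# wav_sox.scp:   the same utterance resampled to 16 kHz by sox.
compute-mfcc-feats --config=conf/mfcc_hires.conf scp:wav_kaldi.scp ark,t:feats_kaldi.txt
compute-mfcc-feats --config=conf/mfcc_hires.conf scp:wav_sox.scp ark,t:feats_sox.txt
# Quick sanity checks: same number of frames? grossly different values?
feat-to-len ark:feats_kaldi.txt ark,t:-
feat-to-len ark:feats_sox.txt ark,t:-
diff feats_kaldi.txt feats_sox.txt | head
```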
Any insight? I would much appreciate any help.
Thanks in advance.
Best,
Guanbo