Converting G.fst to G.carpa


Sage Khan

Jul 26, 2022, 12:36:44 AM
to kaldi-help
Hello

I'm trying to assemble the model files needed to run the Vosk API for a speech-to-text (STT) system.

I came across this

One of the requirements is G.carpa along with G.fst.

I have the G.fst file. Is there any way to convert this FST file into G.carpa?

Daniel Povey

Jul 26, 2022, 4:21:56 AM
to kaldi-help
No, you have to start from the original arpa.gz file that was used to generate the G.fst.


Daniel Povey

Jul 26, 2022, 4:22:16 AM
to kaldi-help
... but usually the .carpa rescoring phase would be optional anyway.

Sage Khan

Jul 26, 2022, 7:49:00 AM
to kaldi-help
Hi Dan

I started with only the text file (audio file names and corresponding transcripts), words.txt, lexicon.txt, and the nonsilence and silence phones text files.

The prepare-lang bash script was used to generate oov, phones.txt, etc. No arpa.gz file came out of that. It did produce files such as G.fst and L.fst.

How do I produce the lm.arpa.gz file?

Jan Yenda Trmal

Jul 26, 2022, 11:05:33 AM
to kaldi-help
you need to create a language model from the text you have

Did you go over these materials?

I don't mean it in a bad way, but your questions look like you are jumping from topic to topic, perhaps without resolving one topic before moving to another. That might confuse you more than help. Not sure of course, you do you.
y.

Sage Khan

Jul 26, 2022, 11:55:21 AM
to kaldi-help
I started off by simply following Kaldi for Dummies. Since the data set was small, most issues didn't occur.

Then I went on to train a Hindi ASR as done by Kunal Dhawan.

Then I did the same following Ohm Vikrant.

Then I moved on to Panjabi ASR.

Then I did Urdu ASR (with Arabic-style transcripts).

All of those used data that was already prepared. The LM was already available, and lexicon.txt etc. were already set up properly.

Now I have collected and compiled my own data and made my own files. Everything from SCRATCH. Now I find a whole new set of errors.

My previous work was surface level. I do understand the whole pipeline now, but this is the first time in my learning process that I have dived deep.

I've also gone through some tutorials from Medium, AssemblyAI, Kunal, Ohm Vikrant, Eleanor, etc.

Regards

Jan Yenda Trmal

Jul 26, 2022, 12:06:09 PM
to kaldi-help
yeah, understood. I'm currently not sure what would be a good source for an LM tutorial.
Perhaps look at the iban recipe? It should be quite straightforward (it was done for a tutorial) and contains an LM section.
y.

Jan Yenda Trmal

Jul 26, 2022, 12:09:50 PM
to kaldi-help
btw, in the tutorial for dummies this is the LM part, I just checked


echo
echo "===== LANGUAGE MODEL CREATION ====="
echo "===== MAKING lm.arpa ====="
echo
# Locate SRILM's ngram-count; fall back to the copy under $KALDI_ROOT/tools.
loc=`which ngram-count`;
if [ -z "$loc" ]; then
  if uname -a | grep 64 >/dev/null; then
    sdir=$KALDI_ROOT/tools/srilm/bin/i686-m64
  else
    sdir=$KALDI_ROOT/tools/srilm/bin/i686
  fi
  if [ -f "$sdir/ngram-count" ]; then
    echo "Using SRILM language modelling tool from $sdir"
    export PATH=$PATH:$sdir
  else
    echo "SRILM toolkit is probably not installed."
    echo "Instructions: tools/install_srilm.sh"
    exit 1
  fi
fi
local=data/local
# -p avoids an error when the directory already exists.
mkdir -p $local/tmp
# Train an n-gram LM of order $lm_order (set earlier in the script) on corpus.txt.
ngram-count -order $lm_order -write-vocab $local/tmp/vocab-full.txt \
  -wbdiscount -text $local/corpus.txt -lm $local/tmp/lm.arpa
echo
echo "===== MAKING G.fst ====="
echo
lang=data/lang
# Compile the ARPA LM into the grammar FST.
arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/words.txt \
  $local/tmp/lm.arpa $lang/G.fst
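
The script above writes an uncompressed lm.arpa, while the earlier question asked for lm.arpa.gz. A minimal sketch of one way to get the gzipped file, assuming the same data/local/tmp/lm.arpa path as the script; the helper name compress_arpa is hypothetical and not part of Kaldi:

```shell
#!/usr/bin/env bash
# Sketch only: compress an existing ARPA LM into lm.arpa.gz.
# compress_arpa is a hypothetical helper, not a Kaldi tool.
compress_arpa() {
  # Default path mirrors $local/tmp/lm.arpa from the script above (assumption).
  local arpa="${1:-data/local/tmp/lm.arpa}"
  if [ ! -f "$arpa" ]; then
    echo "ARPA file not found: $arpa" >&2
    return 1
  fi
  # -c writes to stdout, so the uncompressed lm.arpa stays available for arpa2fst.
  gzip -c "$arpa" > "$arpa.gz" || return 1
  # Print the path of the compressed file for use by later steps.
  echo "$arpa.gz"
}
```

Calling compress_arpa with no argument would compress data/local/tmp/lm.arpa if it exists and print the path of the new .gz file.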

Sage Khan

Jul 26, 2022, 12:21:04 PM
to kaldi-help
Hello

Thank you so much for your help. I have the above script available (probably in a refined form in one of the recipes).

I have the G.fst file available.

However, the Vosk API model files require G.carpa as well.

I have not been able to find a script that converts G.fst into G.carpa.


"Depending on your needs you might pick some result files from the compilation folder. Remember, that if you changed the graph you also need to change the rescoring/rnnlm part, otherwise they will go out of sync and accuracy will be low.

For large model pick the following parts:

  • exp/chain/tdnn/graph
  • data/lang_test_rescore/G.fst and data/lang_test_rescore/G.carpa into rescore folder
  • exp/rnnlm_out into rnnlm folder, you can delete some unnecessary files from rnnlm too.

If you don’t want to use RNNLM, delete rnnlm folder from the model.

If you don’t want to use rescoring, delete the rescore folder from the model, that will save you some runtime memory, but accuracy will be lower.

For small model, just pick the required files from exp/chain/tdnn/lgraph."
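
The "large model" list quoted above could be sketched as a small copy script. This is only an illustration: the source paths come from the quoted text, the destination folder name graph and the helper name package_large_model are assumptions, and any other files a Vosk model may need are not covered here.

```shell
#!/usr/bin/env bash
# Sketch only: collect the "large model" parts from the quoted list into one directory.
# package_large_model is a hypothetical helper; the "graph" destination name is an assumption.
package_large_model() {
  local src="$1"    # Kaldi recipe directory containing exp/ and data/
  local dest="$2"   # output model directory (hypothetical, e.g. model)
  mkdir -p "$dest/graph" || return 1
  cp -r "$src/exp/chain/tdnn/graph/." "$dest/graph/" || return 1
  # Rescoring needs both G.fst and G.carpa; skip the rescore folder if either is missing.
  local rescore="$src/data/lang_test_rescore"
  if [ -f "$rescore/G.fst" ] && [ -f "$rescore/G.carpa" ]; then
    mkdir -p "$dest/rescore" || return 1
    cp "$rescore/G.fst" "$rescore/G.carpa" "$dest/rescore/" || return 1
  fi
  # RNNLM is optional; copy it only when exp/rnnlm_out exists.
  if [ -d "$src/exp/rnnlm_out" ]; then
    mkdir -p "$dest/rnnlm" || return 1
    cp -r "$src/exp/rnnlm_out/." "$dest/rnnlm/" || return 1
  fi
}
```

Leaving out the rescore folder when G.carpa is absent matches the quoted advice that rescoring is optional, at the cost of some accuracy.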


Much Respect and Regards

KHAN

Sage Khan

Jul 26, 2022, 12:22:40 PM
to kaldi-help
I've gone through and understood building an LM with the KenLM script and with the SRILM script.

I'm just trying to understand how to get these files: data/lang_test_rescore/G.fst and data/lang_test_rescore/G.carpa. I have G.fst; G.carpa is the issue. How do we generate that?

Jan Yenda Trmal

Jul 26, 2022, 12:33:06 PM
to kaldi-help
G.carpa will get generated by 
utils/build_const_arpa_lm.sh <your-lm> data/lang_test_rescore data/lang_test_rescore

but Dan was originally correct: although neither of us is familiar with Vosk, you shouldn't need that file
y.
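
A minimal sketch of the command above wrapped with a few input checks. The wrapper name build_carpa and the example paths are hypothetical. The argument order (ARPA LM, existing lang directory, output lang directory) and the expectation that the LM is gzipped are assumptions about utils/build_const_arpa_lm.sh that should be verified against your Kaldi checkout.

```shell
#!/usr/bin/env bash
# Sketch only: build G.carpa from an ARPA LM via Kaldi's utils/build_const_arpa_lm.sh.
# build_carpa is a hypothetical wrapper; run it from a Kaldi recipe directory (where utils/ lives).
build_carpa() {
  local lm_gz="$1"      # ARPA LM, assumed gzipped (e.g. data/local/tmp/lm.arpa.gz)
  local old_lang="$2"   # existing lang directory with words.txt
  local new_lang="$3"   # output lang directory that should receive G.carpa
  local f
  if [ ! -x utils/build_const_arpa_lm.sh ]; then
    echo "utils/build_const_arpa_lm.sh not found; run from a Kaldi recipe directory" >&2
    return 1
  fi
  # Check the inputs up front so a missing file fails with a clear message.
  for f in "$lm_gz" "$old_lang/words.txt"; do
    [ -f "$f" ] || { echo "missing input: $f" >&2; return 1; }
  done
  utils/build_const_arpa_lm.sh "$lm_gz" "$old_lang" "$new_lang" || return 1
  # Report the result only if the file was actually produced.
  [ -f "$new_lang/G.carpa" ] && echo "$new_lang/G.carpa"
}
```

For example, build_carpa data/local/tmp/lm.arpa.gz data/lang_test_rescore data/lang_test_rescore would correspond to the invocation given in the message above, with the LM path being an assumed location.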

Sage Khan

Jul 26, 2022, 2:54:33 PM
to kaldi-help
Roger That. I'll work further on it.

Thank you so much Dan and Yenda :)

Sage Khan

Jul 26, 2022, 2:55:16 PM
to kaldi-help
Before running utils/build_const_arpa_lm.sh, do I first have to run some kind of rescore script?

Jan Yenda Trmal

Jul 26, 2022, 3:03:34 PM
to kaldi-help
Not at all. The rescoring is done by the carpa model.
But in your case, all the scores will be the same. That's why we think G.carpa does not make sense in your case: it will just be a different representation of G.fst.
y.

Sage Khan

Jul 26, 2022, 3:39:33 PM
to kaldi-help
Got it, Thanks :)