Grammars with on-the-fly parts.


Ďoďo Ivanecký

Apr 20, 2022, 5:42:10 PM
to kaldi-help
Hi,
I am trying to build a dynamic grammar as described here:
and in simple_demo.sh

I created a small test.
The main grammar is:
0       1       Hallo    Hallo    1.69314694
1       2       or    or               1.38629401
1       2       and     and        1.38629401
1
2       3       again again
3       4       dear dear
3       5       #nonterm:Dlist  <eps>
4       5       Mike Mike
5

The dynamic one is:
0       1       #nonterm_begin  <eps>
1       2       <eps>   <eps>   5
1       2       Carlos Carlos 1.60943794
1       2       Josef   Josef   1.60943794
1       2       Sam  Sam  1.60943794
1       2       John  John  1.60943794
1       2       Wei     Wei     1.60943794
2       3       #nonterm_end    <eps>
3

prepare_lang.pl was used to prepare the language data with nonterminal symbols:
tail phones.txt
}:_E 682
}:_I 683
}:_S 684
#0 685
#1 686
#nonterm_bos 687
#nonterm_begin 688
#nonterm_end 689
#nonterm_reenter 690
#nonterm:Dlist 691

 tail words.txt
or 10
and 11
again 12
NOISE 13
#0 14
<s> 15
</s> 16
#nonterm_begin 17
#nonterm_end 18
#nonterm:Dlist 19

In the final step - after compilation of the grammars - when I run make-grammar-fst, I am getting an error:
/opt/kaldi/bin/make-grammar-fst --write-as-grammar=true --nonterm-phones-offset=687 MainC.fst 691 DynC.fst aaa.fst

ERROR (make-grammar-fst[5.5]:InitEntryOrReentryArcs():decoder/grammar-fst.cc:143) There is something wrong with the graph; did you forget to add #nonterm_begin and #nonterm_end to the non-top-level FSTs before compiling?

But I did not forget. I also checked the LG.fst of the non-top-level FST, and I see the symbols there.

Am I missing something? 

Thanks for any hint.

Josef







Ďoďo

Apr 25, 2022, 11:02:45 AM
to kaldi...@googlegroups.com
OK, so I found out what the issue was. Actually, there were two.
1. During CLG creation, "fstcomposecontext" was run without the "--nonterm-phones-offset" parameter. That was critical, but it was still crashing because of #2.
2. I used a demo script to build HCLG.fst, and it builds the disambiguation list with "grep '#'". Such a grep also picks up all the #nonterm symbols, which results in a crash in grammar-context-fst. The fix is to replace "grep '#'" with "grep '[0-9]+'".

Josef
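
The second fix above can be sketched in Python. This is only a minimal sketch, assuming phones.txt uses the usual "symbol id" layout shown earlier in the thread and that ordinary disambiguation symbols are "#" followed only by digits (#0, #1, ...). The function name disambig_ids is hypothetical and not part of Kaldi. (With plain grep, a "+" in a basic regular expression is matched literally, so an extended-regex mode would be needed for the pattern to mean "one or more digits".)

```python
import re

# Assumption: ordinary disambiguation symbols are "#" followed only by digits.
# Symbols such as #nonterm_begin or #nonterm:Dlist must NOT be in the list.
DISAMBIG_RE = re.compile(r"^#[0-9]+$")


def disambig_ids(phones_txt_lines):
    """Return the integer ids of disambiguation symbols found in phones.txt lines."""
    ids = []
    for line in phones_txt_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        symbol, sym_id = fields
        if DISAMBIG_RE.match(symbol):
            ids.append(int(sym_id))
    return ids


# Sample taken from the phones.txt tail posted earlier in the thread.
sample = [
    "}:_S 684",
    "#0 685",
    "#1 686",
    "#nonterm_bos 687",
    "#nonterm_begin 688",
    "#nonterm_end 689",
    "#nonterm_reenter 690",
    "#nonterm:Dlist 691",
]
print(disambig_ids(sample))  # [685, 686]
```

A plain match on "#" would have returned all seven "#..." ids here, including the nonterminal ones.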
 


Anh Nguyễn Mạnh Tiến

Mar 23, 2023, 5:41:24 AM
to kaldi-help
I have the same problem too, but I can't find a demo script that contains "grep '#'". Can you tell me where it is?

On Monday, April 25, 2022 at 22:02:45 UTC+7, Ďoďo wrote:

Ďoďo

Mar 23, 2023, 10:03:47 AM
to kaldi...@googlegroups.com
Uff, I am not able to find it after 1 year. But tell me what is in your phones/disambig.txt file. Just to see if it's really the same problem.

Jozef

Anh Nguyễn Mạnh Tiến

Mar 23, 2023, 10:57:26 PM
to kaldi-help
Thank you very much for your response. I have actually found the reason. I used SRILM to create an ARPA LM (for the sub-LM) and arpa2fst to create G.fst, so G.fst does not contain the #nonterm_begin and #nonterm_end symbols. I think that is the reason. Do you have a solution for adding #nonterm_begin and #nonterm_end to G.fst when it is created with arpa2fst? Thanks in advance!

On Thursday, March 23, 2023 at 21:03:47 UTC+7, Ďoďo wrote:

Ďoďo

Mar 24, 2023, 3:15:27 AM
to kaldi...@googlegroups.com
Please read https://kaldi-asr.org/doc/grammar.html

There is a section which says:
The user should never need to explicitly add these symbols to the words.txt and phones.txt files; they are automatically added by utils/prepare_lang.sh. All the user has to do is to create the file 'nonterminals.txt' in the 'dict dir' (the directory containing the dictionary, as validated by validate_dict_dir.pl).

I did not play with LMs (ARPA), only with grammars. So I just generated nonterminals.txt from my simple grammar compiler, one nonterminal per line:
cat graph/nonterminals.txt
#nonterm:Dlist

Jozef



Anh Nguyễn Mạnh Tiến

Mar 25, 2023, 3:07:31 AM
to kaldi-help
But what if my grammar is not a unigram (e.g. Dlist contains names such as Lionel Messi, Bruno Fernandes, ...)? How could I make a grammar from such n-grams without building an LM (using SRILM or KenLM)?

On Friday, March 24, 2023 at 14:15:27 UTC+7, Ďoďo wrote:

Daniel Povey

Mar 25, 2023, 3:50:51 AM
to kaldi...@googlegroups.com
You have to understand the basic concepts of FSTs. It's a simple graph with a loop state (say state 0), and you'd have
  "lionel" state 0->2, "messi" state 2->1
  "bruno" state 0->3, "fernandes" state 3->1
and so on

Daniel Povey

Mar 25, 2023, 3:50:59 AM
to kaldi...@googlegroups.com
.. and state 1 would be final
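
The graph described above can be generated mechanically. The following is a minimal sketch, not a Kaldi tool: the function name names_to_fst_text and the state numbering are choices made for this example, and it assumes every word already appears in words.txt with exactly the same spelling. It wraps the names in #nonterm_begin and #nonterm_end in the same way as the dynamic grammar posted at the top of the thread, and it emits no weights.

```python
def names_to_fst_text(names):
    """Build OpenFst text-format lines for a sub-grammar of multi-word names.

    State layout (chosen for this sketch):
      0 -> 1 : #nonterm_begin
      1      : every name starts here
      2      : every name ends here
      2 -> 3 : #nonterm_end
      3      : final state
    Intermediate states (4, 5, ...) are allocated per name as needed.
    """
    start, end, final = 1, 2, 3
    next_state = 4
    lines = ["0 1 #nonterm_begin <eps>"]
    for name in names:
        words = name.split()
        src = start
        for i, word in enumerate(words):
            if i == len(words) - 1:
                dst = end  # the last word of a name returns to the shared end state
            else:
                dst = next_state  # fresh intermediate state inside this name
                next_state += 1
            lines.append(f"{src} {dst} {word} {word}")
            src = dst
    lines.append(f"{end} {final} #nonterm_end <eps>")
    lines.append(str(final))
    return "\n".join(lines)


print(names_to_fst_text(["Lionel Messi", "Bruno Fernandes"]))
```

For the two example names this prints one arc per word, e.g. "1 4 Lionel Lionel" followed by "4 2 Messi Messi", so each first name can only be followed by its own surname. The resulting text would still have to be compiled with the same symbol tables as the rest of the setup.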