Invalid n-gram data line

ashwinraj...@gmail.com

Jan 26, 2017, 10:26:15 AM
to kaldi-help

I was trying to combine two or more public datasets and train an HMM-GMM model. While creating G.fst, I am getting this error:


ERROR (arpa2fst[5.0.20~1-1dabf]:Read():arpa-file-parser.cc:185) line 6 [-3.782042       ]: Invalid n-gram data line

[ Stack-Trace: ]
arpa2fst() [0x533672]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::ArpaFileParser::Read(std::istream&, bool)
void kaldi::ReadKaldiObject<kaldi::ArpaLmCompiler>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::ArpaLmCompiler*)
main
__libc_start_main
_start

fstisstochastic outputs_VoxForge/lang_prep/lang_build0/G.fst
ERROR (fstisstochastic[5.0.20~1-1dabf]:Input():kaldi-io.cc:742) Error opening input stream outputs_VoxForge/lang_prep/lang_build0/G.fst

[ Stack-Trace: ]
fstisstochastic() [0x53eb0a]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::Input::Input(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool*)
fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
main
__libc_start_main
_start

ERROR: ReadFst: Can't open file: outputs_VoxForge/lang_prep/lang_build0/G.fst

Daniel Povey

Jan 26, 2017, 3:18:16 PM
to kaldi-help
It looks like something went wrong when creating the arpa file that
you are giving to arpa2fst. You didn't even say what tool you were
using for that.

Dan
> --
> You received this message because you are subscribed to the Google Groups
> "kaldi-help" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kaldi-help+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

ashwinraj...@gmail.com

Jan 26, 2017, 7:32:24 PM
to kaldi-help, dpo...@gmail.com
I am using this command to create G.fst

arpa2fst --disambig-symbol=#0 --read-symbol-table=outputs_VoxForge/lang_prep/lang_build2/words.txt - outputs_VoxForge/lang_prep/lang_build2/G.fst

Could a blank space or an empty line in the transcript text have caused this error? I am not sure.

Daniel Povey

Jan 26, 2017, 7:35:34 PM
to ashwinraj...@gmail.com, kaldi-help
I was talking about the ARPA-format input. That comes from stdin and
you didn't say where it came from. There was an error in that stage.
Read online about how the ARPA format works.
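
For readers unfamiliar with the format: in an ARPA file, each line under a "\N-grams:" header normally holds a log-probability, then N words, then an optional backoff weight. The following is a minimal sketch of a field-count check along those lines. It is not Kaldi's parser; the function name and the sample data are made up for illustration, and it assumes fields are separated by ordinary whitespace.

```python
def find_invalid_ngram_lines(lines):
    """Yield (line_number, line) for n-gram lines with an unexpected field count."""
    order = None  # current n-gram order, set when a "\N-grams:" header is seen
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        stripped = line.strip()
        if stripped.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            if stripped.endswith("-grams:"):
                order = int(stripped[1:-len("-grams:")])
            else:
                order = None
            continue
        if order is None or not stripped:
            continue  # header count lines or blank separator lines
        fields = stripped.split()
        # Expected: log-probability, `order` words, optional backoff weight.
        if len(fields) not in (order + 1, order + 2):
            yield lineno, line


# Hypothetical sample; line 7 has a log-probability but no word after it.
sample_lines = [
    "\\data\\",
    "ngram 1=3",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\t<s>\t-0.5",
    "-8.702529\t",
    "-1.2\thello\t-0.3",
    "",
    "\\2-grams:",
    "-0.4\t<s> hello",
    "",
    "\\end\\",
]

bad = list(find_invalid_ngram_lines(sample_lines))
for lineno, line in bad:
    print(f"line {lineno}: {line!r}")
```

Running a check like this on the text that is piped into arpa2fst, before the conversion, would show whether the problem is already present in the ARPA file itself.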

xtluo

Dec 5, 2019, 10:09:15 PM
to kaldi-help
Hi, everyone.

I encountered the same issue, "Invalid n-gram data line", and it turns out the line reported by Kaldi has nothing after the log-probability, something like:

"-8.702529 "

The language model was trained using the KenLM toolkit on a large corpus (about 155G) and pruned at 5e-9. Any idea what the problem may be?

Thanks.

Daniel Povey

Dec 5, 2019, 10:16:35 PM
to kaldi-help
Probably an encoding issue. Grep for the line and examine it in an editor; possibly there is some weird Unicode space in there that KenLM did not recognize as a space. I don't know how KenLM deals with encodings.
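
One way to look for such characters without relying on how an editor renders them is to list every whitespace-like, control, or format character in the line other than a plain space, tab, or newline. This is only a sketch: the function name and the choice of Unicode categories are assumptions, not something KenLM or Kaldi defines.

```python
import unicodedata

# Unicode general categories treated as suspicious here (an assumption):
# space separators, line/paragraph separators, control and format characters.
SUSPECT_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}
PLAIN_WHITESPACE = {" ", "\t", "\n"}


def describe_odd_characters(line):
    """Return (index, code point, name) for unusual invisible characters."""
    findings = []
    for index, ch in enumerate(line):
        if ch in PLAIN_WHITESPACE:
            continue
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings


# A line with only a tab and a newline after the number has nothing to report.
print(describe_odd_characters("-8.702529\t\n"))
# A hypothetical line where the "word" is an ideographic space.
print(describe_odd_characters("-8.7\t\u3000\n"))
```

An empty result means the line contains only ordinary characters, which points away from an odd-space explanation for that particular line.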

--
Go to http://kaldi-asr.org/forums.html find out how to join

xtluo

Dec 5, 2019, 10:35:45 PM
to kaldi-help
Thanks, Dan. 

Here is what I did next:

- zcat lm.gz | head -n 56 > temp (since the error was "line 55 [-8.702529 ]: Invalid n-gram data line")
- Opened the file 'temp' in python3 with UTF-8 encoding and printed the line; the output is listed below:

'-8.702529\t\n'

I didn't see any encoding issue. Since my text corpus is huge, debugging may take some time. Is there a quick fix for this kind of issue, such as deleting that single line?

Daniel Povey

Dec 5, 2019, 10:37:54 PM
to kaldi-help
Yeah you could probably delete that line.
Can't help much further, it's probably a bug that should be reported to the KenLM people.
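
A minimal sketch of that workaround, assuming the bad lines are exactly those n-gram lines with a log-probability and no words. The function name is hypothetical. Note that the sketch does not adjust the "ngram N=..." counts in the \data\ header, so the header may no longer match the number of lines; whether a given tool tolerates that is not confirmed here and should be checked.

```python
def drop_wordless_ngram_lines(lines):
    """Split lines into (kept, dropped), dropping n-gram lines that have no words."""
    kept, dropped = [], []
    in_ngram_section = False
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("\\"):
            # Only "\N-grams:" sections contain n-gram data lines.
            in_ngram_section = stripped.endswith("-grams:")
        elif in_ngram_section and stripped and len(stripped.split()) < 2:
            # A lone log-probability with nothing after it.
            dropped.append((lineno, line))
            continue
        kept.append(line)
    return kept, dropped


# Hypothetical sample; line 6 is the kind of line reported in this thread.
sample_lines = [
    "\\data\\",
    "ngram 1=2",
    "",
    "\\1-grams:",
    "-1.0\t<s>",
    "-8.702529\t",
    "",
    "\\end\\",
]

kept, dropped = drop_wordless_ngram_lines(sample_lines)
for lineno, line in dropped:
    print(f"dropped line {lineno}: {line!r}")
```

For a gzipped model, the lines could be read with Python's gzip.open(path, "rt", encoding="utf-8"). This only hides the symptom; the cause on the KenLM side would still need to be tracked down.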


To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/912cdfc5-b0ad-40b6-adc7-fe0a1368b8f8%40googlegroups.com.

xtluo

Dec 5, 2019, 10:47:22 PM
to kaldi-help
> Yeah you could probably delete that line.
I think it's worth a try.
> Can't help much further, it's probably a bug that should be reported to the KenLM people.
Appreciate your help; I'm reporting this to the KenLM repository.

Have a nice day, Dan.
