Problem with running LatticeWordSegmentation for multibyte characters

Yuki Ikeshita

Jun 6, 2017, 2:25:57 AM
to latticeword...@googlegroups.com
Dear Oliver!

Thanks a lot for updating the script. I'm now trying it on Japanese text; however, I get the following error:

*************
./LatticeWordSegmentation -KnownN 3 -UnkN 1 -InputFilesList train.scp -NumIter 10 -OutputDirectoryBasename results/ -ReferenceTranscription train.ref -EvalInterval 1 -InitLM train.word.txt -CalculateWER -UseViterby 1 -BeamWidth 1000 -PDicConcent 0.1 -NoThreads 4
-----------------------------------------------------------------------------------------------------------
LatticeWordSegmentation: build date: May 29 2017 - time: 18:54:19
-----------------------------------------------------------------------------------------------------------
WARNING: Reference transcription contains more characters than input transcription!
<eps> 0
<phi> 1
<sigma> 2



 Starting word segmentation!
 Initializing empty language model with KnownN=3, UnkN=1!

 Parsing initialization sentences and initializing language model!

  Sentence: 1 of 1

Word length: 0, word length probability: 0.0000
Word length: 1, word length probability: 0.0000
Word length: 2, word length probability: 0.3854
Word length: 3, word length probability: 0.4773
Word length: 4, word length probability: 0.0907
Word length: 5, word length probability: 0.0290
Word length: 6, word length probability: 0.0088
Word length: 7, word length probability: 0.0050
Word length: 8, word length probability: 0.0025
Word length: 9, word length probability: 0.0000
Word length: 10, word length probability: 0.0013
Mean word length: 2.8312, number of word: 794

 Perplexity: 4.58

 CHPYLM statistics:
  Contexts:             1
  Tables:             540
  Characters:        2248
  Concentration:     0.14
  Discount:          0.66
 WHPYLM statistics:
  Contexts:             1     490    1573
  Tables:             794    1797    2404
  Words:             1797    2404    2808
  Concentration:     0.08    0.20    0.14
  Discount:          0.66    0.77    0.74

 Training language model!

<eps> 0
<phi> 1
<sigma> 2


Params.PDicConcent = 0.10
Error : word あっ not found in language model.

*****************
I think this error occurs because of how multibyte characters are handled (Japanese characters are 3 bytes each in UTF-8).
How can I enable the tool to handle 3-byte characters?
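
To illustrate what I suspect: here is a minimal sketch (not taken from the LatticeWordSegmentation sources; Utf8SeqLen and SplitUtf8Chars are made-up helper names) of splitting a UTF-8 string into whole characters rather than single bytes.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence that starts with lead byte c.
static std::size_t Utf8SeqLen(unsigned char c) {
  if (c < 0x80)         return 1;  // 0xxxxxxx: ASCII
  if ((c >> 5) == 0x6)  return 2;  // 110xxxxx
  if ((c >> 4) == 0xE)  return 3;  // 1110xxxx: Japanese kana/kanji land here
  if ((c >> 3) == 0x1E) return 4;  // 11110xxx
  return 1;                        // invalid lead byte: fall back to one byte
}

// Split a UTF-8 string into one std::string per character (code point).
static std::vector<std::string> SplitUtf8Chars(const std::string& s) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < s.size(); ) {
    const std::size_t len = Utf8SeqLen(static_cast<unsigned char>(s[i]));
    chars.push_back(s.substr(i, len));
    i += len;
  }
  return chars;
}

int main() {
  // "あっ" is 2 characters but 6 bytes (E3 81 82 E3 81 A3) in UTF-8;
  // written as hex escapes so the example does not depend on source encoding.
  for (const std::string& c : SplitUtf8Chars("\xE3\x81\x82\xE3\x81\xA3")) {
    std::cout << c << " (" << c.size() << " bytes)\n";
  }
  // Prints:
  //   あ (3 bytes)
  //   っ (3 bytes)
}

If the input were indexed byte by byte instead, あっ would become six one-byte fragments, none of which could match a language model entry such as あっ.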

Thanks and kind regards,
Yuki.

---------------------------------------------------------

Tokyo Institute of Technology
School of Engineering, Department of Information and Communications Engineering, first-year master's student
Yuki Ikeshita

Mail: ikeshi...@m.titech.ac.jp

---------------------------------------------------------
Attachments: train.word.txt, prondic.txt