Problem with running LatticeWordSegmentation for multibyte characters

Yuki Ikeshita

Jun 6, 2017, 2:25:57 AM
to latticeword...@googlegroups.com
Dear Oliver!

Thanks a lot for updating the script. I'm now trying it on Japanese text; however, I get the following error:

*************
./LatticeWordSegmentation -KnownN 3 -UnkN 1 -InputFilesList train.scp -NumIter 10 -OutputDirectoryBasename results/ -ReferenceTranscription train.ref -EvalInterval 1 -InitLM train.word.txt -CalculateWER -UseViterby 1 -BeamWidth 1000 -PDicConcent 0.1 -NoThreads 4
-----------------------------------------------------------------------------------------------------------
LatticeWordSegmentation: build date: May 29 2017 - time: 18:54:19
-----------------------------------------------------------------------------------------------------------
WARNING: Reference transcription contains more characters than input transcription!
<eps> 0
<phi> 1
<sigma> 2



 Starting word segmentation!
 Initializing empty language model with KnownN=3, UnkN=1!

 Parsing initialization sentences and initializing language model!

  Sentence: 1 of 1

Word length: 0, word length probability: 0.0000
Word length: 1, word length probability: 0.0000
Word length: 2, word length probability: 0.3854
Word length: 3, word length probability: 0.4773
Word length: 4, word length probability: 0.0907
Word length: 5, word length probability: 0.0290
Word length: 6, word length probability: 0.0088
Word length: 7, word length probability: 0.0050
Word length: 8, word length probability: 0.0025
Word length: 9, word length probability: 0.0000
Word length: 10, word length probability: 0.0013
Mean word length: 2.8312, number of word: 794

 Perplexity: 4.58

 CHPYLM statistics:
  Contexts:             1
  Tables:             540
  Characters:        2248
  Concentration:     0.14
  Discount:          0.66
 WHPYLM statistics:
  Contexts:             1     490    1573
  Tables:             794    1797    2404
  Words:             1797    2404    2808
  Concentration:     0.08    0.20    0.14
  Discount:          0.66    0.77    0.74

 Training language model!

<eps> 0
<phi> 1
<sigma> 2


Params.PDicConcent = 0.10
Error : word あっ not found in language model.

*****************
I think this error occurs because of how multibyte characters are handled (Japanese characters are 3 bytes each in UTF-8).
How can I enable the tool to handle 3-byte characters?
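
To illustrate what I suspect: here is a minimal sketch (not taken from the LatticeWordSegmentation sources; Utf8SeqLen and SplitUtf8Chars are made-up helper names) of splitting a UTF-8 string into whole characters rather than single bytes.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence that starts with lead byte c.
static std::size_t Utf8SeqLen(unsigned char c) {
  if (c < 0x80)         return 1;  // 0xxxxxxx: ASCII
  if ((c >> 5) == 0x6)  return 2;  // 110xxxxx
  if ((c >> 4) == 0xE)  return 3;  // 1110xxxx: Japanese kana/kanji land here
  if ((c >> 3) == 0x1E) return 4;  // 11110xxx
  return 1;                        // invalid lead byte: fall back to one byte
}

// Split a UTF-8 string into one std::string per character (code point).
static std::vector<std::string> SplitUtf8Chars(const std::string& s) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < s.size(); ) {
    const std::size_t len = Utf8SeqLen(static_cast<unsigned char>(s[i]));
    chars.push_back(s.substr(i, len));
    i += len;
  }
  return chars;
}

int main() {
  // "あっ" is 2 characters but 6 bytes (E3 81 82 E3 81 A3) in UTF-8;
  // written as hex escapes so the example does not depend on source encoding.
  for (const std::string& c : SplitUtf8Chars("\xE3\x81\x82\xE3\x81\xA3")) {
    std::cout << c << " (" << c.size() << " bytes)\n";
  }
  // Prints:
  //   あ (3 bytes)
  //   っ (3 bytes)
}

If the input were indexed byte by byte instead, あっ would become six one-byte fragments, none of which could match a language model entry such as あっ.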

Thanks and kind regards,
Yuki.

---------------------------------------------------------

Tokyo Institute of Technology
School of Engineering, Department of Information and Communications Engineering, first-year master's student
Yuki Ikeshita

Mail: ikeshi...@m.titech.ac.jp

---------------------------------------------------------
Attachments: train.word.txt, prondic.txt