I was able to build LatticeWordSegmentation and run the test scripts fine. However, I can't seem to run the program on the Sighan 2005 Bakeoff (Chinese segmentation) data, even though I formatted it to match the sample data in the test/Text folder (i.e. with spaces between the characters and the <\unk> and <\s> tokens added; a simplified sketch of the conversion I used is further down in this post). The terminal output is as follows:
./StartSim_text.bash Text/cityu_test.txt 2 6 2
./LatticeWordSegmentation -KnownN 2 -UnkN 6 -InputFilesList Text/cityu_test.txt -NumIter 2 -OutputDirectoryBasename Results/Text/cityu_test/ -ReferenceTranscription Text/cityu_test.txt.ref -CalculateWER -EvalInterval 1 -WordLengthModulation 0 -UseViterby 151 -DeactivateCharacterModel 175
-----------------------------------------------------------------------------------------------------------
LatticeWordSegmentation: build date: Apr 2 2018 - time: 17:21:26
-----------------------------------------------------------------------------------------------------------
Starting word segmentation!
Initializing empty language model with KnownN=2, UnkN=6!
Iteration: 1 of 2
Sentence: 1492 of 1492
Mean length of observed words at base of WHPYLM: 43.069, number of word: 1609
Mean length of generated words by CHPYLM: 45.7743, number of words: 100000
Perplexity: 397.15
terminate called after throwing an instance of 'std::runtime_error'
what(): End of word symbol with empty buffer
./StartSim_text.bash: line 79: 28883 Aborted (core dumped) ./LatticeWordSegmentation ${KnownN} ${UnkN} ${NoThreads} ${PruneFactor} ${InputFilesList} ${InputType} ${SymbolFile} ${Debug} ${LatticeFileType} ${ExportLattices} ${NumIter} ${OutputDirectoryBasename} ${OutputFilesBasename} ${ReferenceTranscription} ${CalculateLPER} ${CalculatePER} ${CalculateWER} ${SwitchIter} ${AmScale} ${InitLM} ${InitLmNumIterations} ${PruningStep} ${BeamWidth} ${OutputEditOperations} ${EvalInterval} ${WordLengthModulation} ${UseViterby} ${DeactivateCharacterModel} ${HTKLMScale}
The second time, it gave a different error:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
./StartSim_text.bash: line 79: 15299 Aborted (core dumped)
There seems to be a previous post in this group (https://groups.google.com/forum/#!topic/latticewordsegmentation/4ZLHrvCic3I) about being unable to run the model on Japanese characters. However, I'm not sure whether that poster ran into the same issue as I did, since his error was:
Error : word あっ not found in language model.
Interestingly, iteration 1 did finish, but the output consisted of whole sentences with no segmentation at all (no spaces anywhere within a sentence).
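In case the problem is in my preprocessing, here is a simplified sketch of the kind of conversion I applied to the Bakeoff files. It is only illustrative (my actual script differs), and the sentence-end token and its placement are assumptions I copied by eye from the files in test/Text:

// conversion sketch (illustrative only): one sentence per line in, one
// space-separated character sequence per line out, sentence-end token appended
#include <fstream>
#include <iostream>
#include <string>

// number of bytes in the UTF-8 sequence that starts with byte c
static int Utf8Length(unsigned char c) {
  if (c < 0x80) return 1;         // ASCII
  if ((c >> 5) == 0x06) return 2; // 110xxxxx
  if ((c >> 4) == 0x0E) return 3; // 1110xxxx (most CJK characters)
  if ((c >> 3) == 0x1E) return 4; // 11110xxx
  return 1;                       // invalid byte, pass through as a single symbol
}

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: convert <infile>\n"; return 1; }
  const std::string kSentenceEnd = "</s>"; // assumed token, copied by eye from test/Text
  std::ifstream in(argv[1]);
  std::string line;
  while (std::getline(in, line)) {
    std::string out;
    for (std::size_t i = 0; i < line.size();) {
      if (line[i] == ' ') { ++i; continue; } // drop any existing spaces
      int len = Utf8Length(static_cast<unsigned char>(line[i]));
      if (!out.empty()) out += ' ';
      out.append(line, i, len); // copy one whole UTF-8 character
      i += len;
    }
    if (!out.empty()) std::cout << out << ' ' << kSentenceEnd << '\n';
  }
  return 0;
}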
I'm just curious whether the program is supposed to work on UTF-8 input data (as far as I know, C++ doesn't have great native support for UTF-8), or whether the problem is with my input data. If UTF-8 text isn't supposed to be supported, I might look into modifying the code so that it runs on Chinese and Japanese text, as described in the original Mochihashi paper.
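To illustrate what I mean about UTF-8 in C++ (a toy example, nothing to do with the tool's internals): a single Chinese character occupies several bytes in a std::string, so any code that walks strings byte by byte will cut characters apart.

#include <iostream>
#include <string>

int main() {
  std::string hanzi = "中";          // assumes this source file is saved as UTF-8
  std::cout << hanzi.size() << '\n'; // prints 3 (bytes), not 1 (character)
  return 0;
}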
Thank you.