Support for UTF-8 characters?


Xiang Ji

Apr 2, 2018, 1:56:42 PM
to LatticeWordSegmentation
I was able to build LatticeWordSegmentation and run the test scripts fine. However, I don't seem to be able to run the program on the SIGHAN 2005 Bakeoff (Chinese segmentation) data, even though I formatted the data to match the sample data in the test/Text folder (i.e., with spaces between the characters and the </unk> and </s> tokens added). The terminal output is as follows:

./StartSim_text.bash Text/cityu_test.txt 2 6 2
./LatticeWordSegmentation -KnownN 2 -UnkN 6 -InputFilesList Text/cityu_test.txt -NumIter 2 -OutputDirectoryBasename Results/Text/cityu_test/ -ReferenceTranscription Text/cityu_test.txt.ref -CalculateWER -EvalInterval 1 -WordLengthModulation 0 -UseViterby 151 -DeactivateCharacterModel 175
-----------------------------------------------------------------------------------------------------------
LatticeWordSegmentation: build date: Apr  2 2018 - time: 17:21:26
-----------------------------------------------------------------------------------------------------------
 
Starting word segmentation!

Initializing empty language model with KnownN=2, UnkN=6!

Iteration: 1 of 2
Sentence: 1492 of 1492

Mean length of observed words at base of WHPYLM: 43.069, number of word: 1609
Mean length of generated words by CHPYLM: 45.7743, number of words: 100000

Perplexity: 397.15

terminate called after throwing an instance of 'std::runtime_error'
  what():  End of word symbol with empty buffer
./StartSim_text.bash: line 79: 28883 Aborted                 (core dumped) ./LatticeWordSegmentation ${KnownN} ${UnkN} ${NoThreads} ${PruneFactor} ${InputFilesList} ${InputType} ${SymbolFile} ${Debug} ${LatticeFileType} ${ExportLattices} ${NumIter} ${OutputDirectoryBasename} ${OutputFilesBasename} ${ReferenceTranscription} ${CalculateLPER} ${CalculatePER} ${CalculateWER} ${SwitchIter} ${AmScale} ${InitLM} ${InitLmNumIterations} ${PruningStep} ${BeamWidth} ${OutputEditOperations} ${EvalInterval} ${WordLengthModulation} ${UseViterby} ${DeactivateCharacterModel} ${HTKLMScale}


The second time, it gave a different error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
./StartSim_text.bash: line 79: 15299 Aborted                 (core dumped)

There is a previous post in this group (https://groups.google.com/forum/#!topic/latticewordsegmentation/4ZLHrvCic3I) about being unable to run the model on Japanese characters. However, I'm not sure whether that poster encountered the same issue as mine, since his error was:

 Error : word あっ not found in language model.

Interestingly, iteration 1 did finish, but the results produced were whole sentences with absolutely no segmentation/spaces within a sentence.

I'm just curious whether the program is supposed to run on UTF-8 input data (as far as I know, C++ doesn't natively support UTF-8 very well), or whether the problem is with my input data. If it isn't supposed to support UTF-8 text, I might look at how to modify the code so that it runs on Chinese and Japanese text, just as the original Mochihashi paper describes.

Thank you.

Xiang Ji

Apr 2, 2018, 2:01:00 PM
to LatticeWordSegmentation
Correction: the raw input text consists only of single characters separated by spaces and newlines, while the reference text contains the </unk> and </s> tokens.

Xiang Ji

Apr 3, 2018, 11:13:04 AM
to LatticeWordSegmentation
Of course, this should be a trivial problem to solve, in the sense that one could map the UTF-8 characters to numeric IDs in a preprocessing step, without involving C++, before feeding the transformed data into the program. Something along the lines of the sketch below.
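A rough sketch of that preprocessing idea in Python (the file layout and token handling here are my own assumptions, not something LatticeWordSegmentation provides):

import sys

def encode_corpus(in_path, out_path, vocab_path):
    vocab = {}  # UTF-8 token -> integer ID
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # every whitespace-separated UTF-8 character/token gets a numeric ID
            ids = [str(vocab.setdefault(tok, len(vocab))) for tok in line.split()]
            fout.write(" ".join(ids) + "\n")
    # save the mapping so the segmented output can be mapped back to characters
    with open(vocab_path, "w", encoding="utf-8") as fvoc:
        for tok, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
            fvoc.write(f"{idx}\t{tok}\n")

if __name__ == "__main__":
    encode_corpus(sys.argv[1], sys.argv[2], sys.argv[3])

After segmentation, the IDs in the output could be replaced by the original characters using the saved vocabulary file.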

Oliver Walter

Apr 3, 2018, 11:42:05 AM
to Xiang Ji, LatticeWordSegmentation
Hi Xiang,

Unfortunately, I'm not sure whether the code can handle UTF-8. I'm also not completely sure about the error you are getting and whether it might be related to UTF-8.

The error in the first run seems to indicate that an empty word was sampled.

Please make sure that your input data is clean and all sentences contain characters.

The easiest would be to check the reading functions (https://github.com/fgnt/LatticeWordSegmentation/blob/1221e7e1eafc0d4f534585bf210ed6cc702c0605/src/FileReader/FileReader.cpp#L690) and see if they can deal with UTF-8 characters. Since we use a std::string, this might not be the case.
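For illustration (Python here just for brevity): UTF-8 stores each CJK character as multiple bytes, so any code that treats a std::string as one character per byte will see several "characters" where there is only one:

text = "保安局"                # three Chinese characters
encoded = text.encode("utf-8")
print(len(text))              # 3 -- code points
print(len(encoded))           # 9 -- bytes, which is what a std::string holds
print(encoded[:1])            # b'\xe4' -- a lone byte is not a valid character

If the reading functions index or split the string byte by byte, multi-byte characters could be torn apart, which might produce errors like the ones above.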


   Oliver
