Problem with running on our lattices

Ekaterina Egorova

Apr 21, 2016, 10:31:09 AM
to LatticeWordSegmentation
Dear Oliver!

First, thanks a lot for updating the script; we are now able to run the tool on your HTK lattice example. However, when I try to run the tool on the lattices that I've generated, I get the following error:


./LatticeWordSegmentation -KnownN 1 -UnkN 2 -PruneFactor 16 -InputFilesList short_test_list.txt -InputType fst -LatticeFileType htk -NumIter 100 -OutputDirectoryBasename Results_SVite/ -ReferenceTranscription short_test_list.txt.ref -CalculateLPER -CalculatePER -CalculateWER -AmScale 1 -EvalInterval 1 -WordLengthModulation -1 -UseViterby 151 -DeactivateCharacterModel 175 -HTKLMScale 0
-----------------------------------------------------------------------------------------------------------
LatticeWordSegmentation: build date: Apr 19 2016 - time: 10:42:59
-----------------------------------------------------------------------------------------------------------
Reading nBest file [0/1] from HTK FST 011_011c0201.lat
142459 States | 7698208 Arcs (7442 States | 19851 Arcs after pruning)
WARNING: Reference transcription contains more characters than input transcription!
 Calculating LPER with pruning from inf with step size 0 to inf!

  Pruning factor: inf
 Lattice phoneme error rate:
  PER: -nan %, Precision: -nan %, Recall: -nan %, F-score: -nan %
  Ins: 0, Del: 0, Sub: 0, Corr: 0, NFound: 0, NRef: 0

 Starting word segmentation!
 Initializing empty language model with KnownN=1, UnkN=2!

  Iteration: 1 of 100
Segmentation fault


I ran it on just one lattice; the lattice and reference transcription can be found here: https://drive.google.com/folderview?id=0B6Dx-6Uhwb_jVVB6amh6aWpHY1k&usp=sharing. To me they look exactly like the example lattices/transcriptions. Do you know what the problem might be?

Thanks and kind regards,
Kate

Oliver Walter

Apr 21, 2016, 4:03:31 PM
to LatticeWordSegmentation
Hi Kate,

The problem was the sentence start <s> and sentence end </s> symbols in your lattice file. Unfortunately, the software treats these as special symbols and usually does not accept them as input. I've fixed this: during parsing of the SLF file they are now mapped to "<eps>". I've updated the version on GitHub, and your lattice should now work with it.

In addition, the symbols "!SENT_START", "!SENT_END", "!NULL", "sil", "!ENTER", "!EXIT" and "NSN" are mapped to "<eps>". If any of those symbols occur in your input, they won't appear in the output. The symbol "sil" in particular might be of interest to you. If you think those symbols should be configurable, let us know. You will find the corresponding routine (IsSilence) at the end of "FileReader/FileReader.cpp".
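
To see which labels in your own lattices will disappear from the output, the mapping described above can be sketched in a few lines of Python. This is only an illustration, not the actual IsSilence routine from FileReader.cpp: the names EPS, MAPPED_TO_EPS, map_label and visible_labels are made up for this example, and exact, case-sensitive matching is an assumption.

```python
# Illustrative sketch only; not the tool's C++ IsSilence code.
# Labels the thread says are mapped to "<eps>" while parsing the SLF file.
EPS = "<eps>"
MAPPED_TO_EPS = frozenset({
    "<s>", "</s>",
    "!SENT_START", "!SENT_END", "!NULL",
    "sil", "!ENTER", "!EXIT", "NSN",
})


def map_label(label):
    """Return "<eps>" for a mapped label (exact match assumed), else the label."""
    return EPS if label in MAPPED_TO_EPS else label


def visible_labels(labels):
    """Keep only the labels that would still appear in the output."""
    return [label for label in labels if map_label(label) != EPS]
```

For example, visible_labels(["<s>", "h", "sil", "i", "</s>"]) keeps only "h" and "i".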

A more detailed explanation: Internally the software considers several symbols as special symbols and usually does not accept them as input symbols. Amongst them are:
"<eps>", "<phi>", "<unk>", "</unk>", "<s>", "</s>" (see includes/definitions.hpp)

The symbols "<eps>" and "</unk>" (word end marker) are accepted as input. The symbol "<phi>" is not accepted as input; it is only used for the fallback arcs in the language model and to perform a phi composition with lexicon and input. </s> is added internally as a sequence end character. Together with </unk>, it forms the sequence </s> </unk>, which is the sentence end word. This keeps things consistent with the definition of a word as a character sequence that always ends with an "</unk>" symbol. The symbols <s> and <unk> are unused at the moment.
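
As a small illustration of this convention (a Python sketch, not code from the tool), a sentence can be written as a flat symbol sequence in which every word ends with "</unk>" and the sentence end word "</s> </unk>" comes last. The function name encode_sentence and the example words are hypothetical.

```python
# Illustrative sketch of the word and sentence end convention described above.
WORD_END = "</unk>"
SENTENCE_END = "</s>"


def encode_sentence(words):
    """Flatten words (each a list of characters or phonemes) into one sequence.

    Every word is followed by "</unk>"; the sentence end word "</s> </unk>"
    is appended at the end.
    """
    symbols = []
    for units in words:
        symbols.extend(units)
        symbols.append(WORD_END)
    symbols.extend([SENTENCE_END, WORD_END])
    return symbols
```

Calling encode_sentence([["h", "i"], ["y", "o"]]) gives the sequence h i </unk> y o </unk> </s> </unk>.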

An additional remark on your reference file: it should begin with the first character (no <s>) and end with </s> </unk>. Furthermore, silences should not appear in the reference file (this is the reason for the message "WARNING: Reference transcription contains more characters than input transcription!"). A word end has to be marked by </unk>.
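
These rules are easy to check before running the tool. The following Python sketch is a hypothetical helper, not part of LatticeWordSegmentation. It assumes that symbols in a reference line are separated by whitespace and that "silences" means the silence-like symbols listed earlier; both are assumptions for illustration.

```python
# Hypothetical pre-check for one line of a reference transcription.
# Assumes whitespace-separated symbols; the silence list is an assumption.
SILENCE_SYMBOLS = {
    "sil", "!SENT_START", "!SENT_END", "!NULL", "!ENTER", "!EXIT", "NSN",
}


def check_reference_line(line):
    """Return a list of problems found in the line (empty if it looks fine)."""
    tokens = line.split()
    problems = []
    if tokens and tokens[0] == "<s>":
        problems.append("starts with <s>")
    if tokens[-2:] != ["</s>", "</unk>"]:
        problems.append("does not end with </s> </unk>")
    found = sorted(set(tokens) & SILENCE_SYMBOLS)
    if found:
        problems.append("contains silence symbols: " + ", ".join(found))
    return problems
```

A line such as "h e </unk> l o </unk> </s> </unk>" passes, while "<s> h i </unk> sil </s>" is reported for all three problems.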

I hope that helps. If you have any comments, questions, wishes or whatever, let us know.

Apologies for my late reply. I saw your message a little too late.


   Oliver

Ekaterina Egorova

Apr 23, 2016, 1:40:37 PM
to Oliver Walter, LatticeWordSegmentation
Dear Oliver,
Thanks a lot for your explanation! I'm traveling now, but I'll try it as soon as I have stable internet and electricity.
Have a nice weekend!
Kate
