The WER for the first three modes is 42.9, whereas the bigram HCLG decoding yielded 43.0, so it stays essentially the same. Mode 4 is still running, but by producing the 1-best paths from the partial lattices I can see that its results are effectively very similar to the other modes; I'd expect the WER to be the same or maybe slightly worse, for reasons I'll explain below.
On the other hand, as with 3-gram rescoring, processing is slow and the output lattices are much bigger than the input lattices.
Looking at the alignment between hypotheses and references, the first three modes produce basically identical outputs, hence the same WER. There are, though, some slight differences in mode 4 that probably explain what looked like a worse WER.
What I consistently observe is that mode 4 produces output like:
"J' ai" or "Je t' appelle" (which respectively mean "I have" and "I call you" in French)
while the first three modes usually generate something like:
"Je ai" and "Je tu appelle"
These are the same expressions, so it's no big deal, but the first form is definitely the more correct French: it's the obvious elision of "je" and "tu" into "j'" and "t'" before a following vowel. The strange thing is that the manual transcriber has consistently written the references in the second way, so mode 4 looks worse on these cases, which are relatively frequent, while IMO it should look better.
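One way to stop this from hurting mode 4 would be a small normalization pass over both hyp and ref before scoring, so that the two surface forms compare equal. Just a sketch of the idea, nothing from the recipe; the ELISIONS mapping below is a hypothetical, illustrative subset of the French elision rules:

# Hypothetical normalization: expand French elisions so that
# "J' ai"/"Je ai" and "t' appelle"/"tu appelle" score identically.
ELISIONS = {
    "j'": "je",
    "t'": "tu",   # following the transcribers' convention seen in the refs
    "n'": "ne",
    "d'": "de",
    "qu'": "que",
}

def expand_elisions(text: str) -> str:
    """Rewrite elided tokens like "j'" to their full form before scoring."""
    return " ".join(ELISIONS.get(tok, tok) for tok in text.lower().split())

print(expand_elisions("J' ai"))           # -> "je ai"
print(expand_elisions("Je t' appelle"))   # -> "je tu appelle"

Since the scoring is already done with the NIST tools, the same effect could presumably also be obtained at scoring time with a GLM mapping file rather than by rewriting the texts.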
I also see a consistent difference in the handling of hesitations. Since I'm decoding spontaneous telephone conversations, there are frequent expressions like "uhm... ben..." ("ben" is used in French roughly the way "uhm... well..." is used in English).
In mode 4 I see "ben" more frequently than in all the other modes, and most of the time it is not correct, in the sense that "ben" has not really been uttered; there is still some kind of hesitation sound from the speaker, which the other modes usually don't transcribe at all. So it's really the language model doing something there, because acoustically there cannot be a good score for "ben".
Besides that, I see occasional differences on other words (and I must say my impression is that mode 4 is doing a bit better there), but the only consistent differences are the ones mentioned above.
Overall, mode 3 seems to behave the same as modes 1 and 2; mode 4 does not really look worse, but its processing time is not reasonable.
Actually, my bigger concern right now is getting lm_rescore_const_arpa to work, because it behaves like mode 1 and should be faster and more memory-efficient. There's definitely no vocabulary mismatch, since many words are correctly recognized, but the insertion rate is very high: I get a total of 145215 words in the hyp against only 100191 in the ref, whereas for the other modes with lm_rescore I get a more reasonable ~103K.
Indeed, here is what NIST ctm scoring gives me for lm_rescore_const_arpa and mode 1 lm_rescore respectively:
        Corr  Sub   Del   Ins   Err   (all %)
const   63.3  32.8   3.9  33.7  70.4
mode1   68.8  20.7  10.5   7.4  38.6
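As a rough back-of-the-envelope check (not the official sclite computation), the percentages are relative to the reference word count and the error columns sum to Err, which lines up with the raw word counts above:

# Rough consistency check on the numbers above.
ref_words = 100191   # words in the ref
hyp_words = 145215   # words in the const-arpa hyp

# ~45% more words in the hyp than in the ref, i.e. massive insertion.
print(f"hyp/ref ratio: {hyp_words / ref_words:.3f}")   # -> 1.449

rows = {
    "const": (32.8, 3.9, 33.7, 70.4),
    "mode1": (20.7, 10.5, 7.4, 38.6),
}
for name, (sub, dele, ins, err) in rows.items():
    assert abs((sub + dele + ins) - err) < 0.05, name
    print(f"{name}: Sub + Del + Ins = {sub + dele + ins:.1f} = Err")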