If you are following
is fairly easy
Let's say you have a language model LM that you have trained on punctuated text.
Build a words.txt that also contains the punctuation symbols, say
cat words.txt <(echo ", $(cat words.txt | wc -l)")\
<(echo ": $(($(cat words.txt | wc -l)+1))")\
<(echo "; $(($(cat words.txt | wc -l)+2))")\
<(echo "? $(($(cat words.txt | wc -l)+3))")\
> words.punct.txt
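if you have ',' ':' ';' '?' for punctuation (to get the period in the example below, add '.' in the same way, here and in the @punct list of the script).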
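The commands above rely on the ids in words.txt running from 0 to N-1 without gaps, so that the line count equals the next free id. If you are not sure that holds, here is a minimal sketch that numbers from the largest id instead. The function name add_punct and the toy vocabulary are made up for illustration.

```shell
# Hypothetical helper: append punctuation symbols to a "symbol id" table,
# numbering them from (largest existing id + 1).
add_punct() {
    # $1 = symbol table, remaining arguments = punctuation symbols
    table=$1; shift
    cat "$table"
    next=$(awk 'BEGIN { max = -1 } $2 + 0 > max { max = $2 + 0 } END { print max + 1 }' "$table")
    for p in "$@"; do
        printf '%s %s\n' "$p" "$next"
        next=$((next + 1))
    done
}

# Toy example (made-up vocabulary)
tmp=$(mktemp -d)
printf '<eps> 0\nParis 1\ncity 2\n' > "$tmp/words.txt"
add_punct "$tmp/words.txt" ',' ':' ';' '?' > "$tmp/words.punct.txt"
cat "$tmp/words.punct.txt"
```

For the toy table the four symbols get ids 3 to 6.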
Then, let's say you have a 1-best for an utterance that your ASR system has already output:
"I want to go to Paris Paris is a beautiful city" (I guess you would like a period after the first occurrence of Paris)
echo "I want to go to Paris Paris is a beautiful city" > utt.txt
Build the "hyper-string FSA", as it is called in that paper. I'll put some simple Perl code at the end of this post; it takes the input and output file names as arguments:
hyperstringFSA.pl utt.txt utt.fsa.txt
This produces an FST in text format that you compile into binary with
fstcompile --isymbols=words.punct.txt --osymbols=words.punct.txt --keep_isymbols=false --keep_osymbols=false utt.fsa.txt |\
fstarcsort --sort_type=olabel > utt.fst
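fstcompile stops with an error if the text FST contains a symbol that is missing from the symbol table, so it can be worth checking the utterance for out-of-vocabulary words first. A small sketch of such a check follows. The function name oov_words and the toy files are made up for illustration.

```shell
# Hypothetical helper: print the words of an utterance file that do not
# appear in a "symbol id" table (first column = symbol).
oov_words() {
    # $1 = symbol table, $2 = utterance file
    awk 'NR == FNR { known[$1] = 1; next }
         { for (i = 1; i <= NF; i++) if (!($i in known)) print $i }' "$1" "$2"
}

# Toy example (made-up vocabulary and utterance)
tmp=$(mktemp -d)
printf '<eps> 0\nI 1\nwant 2\nParis 3\n, 4\n' > "$tmp/words.punct.txt"
echo "I want Paris Berlin" > "$tmp/utt.txt"
oov_words "$tmp/words.punct.txt" "$tmp/utt.txt"
```

If this prints anything, those words have to be mapped to something the LM knows (for example its unknown-word symbol, if it has one) before compiling.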
Then build your LM (with punctuation) FST like Kaldi does, using arpa2fst. Just remember one important thing: the <eps> of the hyperstring and of your LM must be on matching sides. If Kaldi replaces the <eps> on the input side of the LM with #0, just invert the FST:
fstinvert KaldiLM.fst > KaldiLMforcompositionwithHyperstring.fst
Then you compose the hyperstring with the LM FST and extract the 1-best; hopefully you will have the period you were looking for after the first occurrence of Paris. (fsttopsort puts the states of the best path in order before printing; int2sym.pl is the script in Kaldi's utils directory.)
fstcompose --compose_filter=auto --connect=false --v=5 utt.fst KaldiLMforcompositionwithHyperstring.fst |\
fstshortestpath | fsttopsort | fstprint | utils/int2sym.pl -f 3 words.punct.txt | perl -lane 'print $F[2] if @F > 2 and not /<eps>/'
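The pipeline above prints one symbol per line. Below is a minimal sketch for gluing the symbols back into a sentence, attaching each punctuation mark to the preceding word. The function name join_punct is made up, and the input is hand-written for illustration, not real decoder output.

```shell
# Hypothetical helper: read one symbol per line, print a single line in which
# punctuation marks are attached to the previous word.
join_punct() {
    awk 'BEGIN { p[","]; p[":"]; p[";"]; p["?"] }   # extend as needed
         NF == 0 { next }                            # skip empty lines
         { out = out ((($1 in p) || out == "") ? "" : " ") $1 }
         END { print out }'
}

printf 'I\nwant\nParis\n,\nParis\nis\nbeautiful\n?\n' | join_punct
```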
I have not used this extensively, and I don't believe much in restoring punctuation and capitalization, but it's true that NLP people often ask for it.
You might also want to look at methods that exploit speech features, not just text-based approaches.
look at this:
I'll put the perl code here
#!/usr/bin/perl
# Usage: hyperstringFSA.pl utt.txt [utt.fsa.txt]
# Writes a text-format FST in which every word of the utterance is followed
# by a choice between each punctuation symbol and <eps> (no punctuation).
# Without a second argument the FST goes to STDOUT.
use strict;
use warnings;

my ($utterance_fn, $hstringfsa_fn) = @ARGV;
die "usage: $0 utterance.txt [hyperstring.fsa.txt]\n" unless defined $utterance_fn;
open(my $utterance_fh, "<", $utterance_fn) || die "cannot open file $utterance_fn in reading mode\n";
my $hstringfsa_fh;
if (defined $hstringfsa_fn) {
    open($hstringfsa_fh, ">", $hstringfsa_fn) || die "cannot open file $hstringfsa_fn in writing mode\n";
} else {
    $hstringfsa_fh = \*STDOUT;
}
# one utterance per file: all lines are joined into a single word sequence
my @words;
while (my $line = <$utterance_fh>) {
    push @words, split(' ', $line);
}
my @punct = (',', ':', ';', '?', '<eps>');
my $state = 0;
foreach my $word (@words) {
    # arc for the word in the 1-best
    print $hstringfsa_fh "$state ", $state + 1, " $word $word\n";
    $state++;
    # parallel arcs for all punctuation symbols + <eps>
    foreach my $p (@punct) {
        print $hstringfsa_fh "$state ", $state + 1, " $p $p\n";
    }
    $state++;
}
# the last state reached is the final state
print $hstringfsa_fh "$state\n";
close($utterance_fh);
close($hstringfsa_fh);
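To see at a glance what the text FST is meant to look like, here is a sketch of the same construction as an awk filter: a word arc, then parallel arcs for each punctuation symbol and <eps>, repeated for every word, with the last state as final state. The function name hyperstring is made up, and this is only an illustration of the intended layout, not a replacement for the Perl script.

```shell
# Hypothetical helper: read one utterance on stdin, write the hyperstring
# as a text-format FST (src dest ilabel olabel) on stdout.
hyperstring() {
    awk -v punct=', : ; ? <eps>' '
        BEGIN { n = split(punct, p, " "); s = 0 }
        { for (i = 1; i <= NF; i++) {
              print s, s + 1, $i, $i              # the word itself
              s++
              for (k = 1; k <= n; k++)            # punctuation or nothing
                  print s, s + 1, p[k], p[k]
              s++
          } }
        END { print s }                           # final state
    '
}

echo "hello world" | hyperstring
```

For a two-word utterance this gives 13 lines: the arc "0 1 hello hello", five arcs from state 1 to state 2, the arc "2 3 world world", five arcs from state 3 to state 4, and the final state 4.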