Kaldi repunctuation/recapitalization?

Phil

Jan 29, 2019, 2:08:14 PM

to kaldi-help

Hi, is there any Kaldi functionality for restoring capitalization and punctuation to ASR output? Looking at some FST-based papers that seem like they'd fit in well with current framework.

Daniel Povey

Jan 29, 2019, 2:10:58 PM

to kaldi-help

No, we don't have any examples for that.

Armando

Jan 30, 2019, 3:30:36 AM

to kaldi-help

If you are following

https://www.researchgate.net/publication/220732929_Restoring_punctuation_and_capitalization_in_transcribed_speech/download

it is fairly easy.

Let's say you have a language model LM that you trained on punctuated text.

Build a words.txt with those additional punctuation symbols, say

cat words.txt <(echo ", $(cat words.txt | wc -l)")\
                       <(echo ": $(($(cat words.txt | wc -l)+1))")\
                       <(echo "; $(($(cat words.txt | wc -l)+2))")\
                       <(echo "? $(($(cat words.txt | wc -l)+3))")\
                       > words.punct.txt

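if you have ',' ':' ';' '?' for punctuation (add '.' the same way, and to @punct in the Perl script, if you want the period from the example below).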
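
The line-count trick above assumes the IDs in words.txt run from 0 to N-1 with no gaps, and that none of the punctuation symbols is already in the table. A slightly more defensive sketch numbers from the largest existing ID and skips symbols that are already present. The function name append_symbols is made up for illustration, and the table is assumed to be the usual "symbol id" per line.

```shell
# Hypothetical helper: append symbols to a "symbol id" table,
# numbering from (largest existing id + 1) and skipping known symbols.
append_symbols() {
  table=$1; shift
  awk -v syms="$*" '
    BEGIN { max = -1 }
    { print; seen[$1] = 1; if ($2 + 0 > max) max = $2 + 0 }
    END {
      n = split(syms, s, " ")
      for (i = 1; i <= n; i++)
        if (!(s[i] in seen)) print s[i], ++max
    }' "$table"
}

# Usage (quote ';' and '?' so the shell leaves them alone):
#   append_symbols words.txt , : ';' '?' > words.punct.txt
```

The result goes to stdout, so the original words.txt is left untouched.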

Then, let's say you have a 1-best for an utterance that your ASR system has already output:

"I want to go to Paris Paris is a beautiful city" (I guess you would like a period after the first occurrence of Paris)

echo "I want to go to Paris Paris is a beautiful city" > utt.txt

Build the "hyper-string FSA", as it is called in that paper. I'll put a simple Perl script at the end of this post; its usage is

hyperstringFSA.pl utt.txt utt.fsa.txt

which writes an FST in text format to utt.fsa.txt; you compile it into binary with

fstcompile --isymbols=words.punct.txt --osymbols=words.punct.txt --keep_isymbols=false --keep_osymbols=false utt.fsa.txt |\
fstarcsort --sort_type=olabel > utt.fst
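
For a pipeline that prefers stdin/stdout, the same construction can be sketched in awk. This is only an illustration of the text format that goes into fstcompile above, not the script from this post: the function name hyperstring_fsa is made up, it handles only the first input line, and the punctuation list is hard-coded to ',' ':' ';' '?' plus <eps>.

```shell
# Hypothetical stdin-to-stdout hyperstring builder for ONE utterance.
# Each word gets one arc, followed by parallel arcs for every
# punctuation candidate plus <eps> (meaning "no punctuation here").
hyperstring_fsa() {
  awk -v puncts=", : ; ? <eps>" '
    BEGIN { np = split(puncts, p, " ") }
    NR == 1 {
      s = 0
      for (i = 1; i <= NF; i++) {
        print s, s + 1, $i, $i          # word arc
        s++
        for (k = 1; k <= np; k++)
          print s, s + 1, p[k], p[k]    # punctuation-or-nothing arcs
        s++
      }
      print s                           # final state
    }'
}

# Usage: hyperstring_fsa < utt.txt > utt.fsa.txt
```

For the input "a b" this prints 13 lines: one word arc and five parallel punctuation arcs per word, then the final state 4 on its own line.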

Then build your LM (with punctuation) FST like Kaldi does, using arpa2fst. Just remember one important thing: the <eps> of the hyperstring and of your LM must be on matching sides. If Kaldi replaces the <eps> on the input side of the LM with #0, just invert the FST:

fstinvert KaldiLM.fst > KaldiLMforcompositionwithHyperstring.fst

Then compose the hyperstring FST with the LM and extract the one-best; hopefully you will get the period you were looking for after the first occurrence of Paris. The fsttopsort step makes fstprint list the arcs in path order.

fstcompose --compose_filter=auto --connect=false --v=5 utt.fst KaldiLMforcompositionwithHyperstring.fst |\
fstshortestpath | fsttopsort | fstprint | utils/int2sym.pl -f 3 words.punct.txt | perl -lane 'print $F[2] if @F >= 4 and $F[2] ne "<eps>"'
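
If the printed arcs do not come out in path order (fstprint lists them by state number, and the start state of a shortest-path result is often not state 0), the labels can also be read off by following the arcs from the start state. A rough sketch, assuming the input is fstprint text for a single linear path with labels already mapped to symbols; path_labels is a made-up name.

```shell
# Hypothetical helper: read fstprint text for one linear path on stdin and
# print its non-<eps> input labels on one line, in path order.
path_labels() {
  awk '
    NF >= 4 {
      # In OpenFst text format the first arc line names the start state.
      if (!started) { state = $1; started = 1 }
      next_state[$1] = $2
      label[$1] = $3
    }
    END {
      sep = ""
      while (state in next_state) {
        if (label[state] != "<eps>") { printf "%s%s", sep, label[state]; sep = " " }
        cur = state
        state = next_state[cur]
        delete next_state[cur]          # guards against looping forever
      }
      if (started) print ""
    }'
}
```

It would take the place of the final perl filter in the pipeline above.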

I have not used this extensively, and I don't believe much in restoring punctuation and capitalization, but it's true that NLP people often ask for it.

But you might also want to look at methods that exploit speech features, not just text-based approaches.

Look at this:

https://github.com/ottokart/punctuator

I'll put the Perl code here:

#!/usr/bin/perl
# Build a "hyperstring" FSA in OpenFst text format from a one-utterance text file.
# Usage: hyperstringFSA.pl utt.txt utt.fsa.txt
use strict;
use warnings;

die "usage: $0 utterance.txt hyperstring.fsa.txt\n" unless @ARGV == 2;
my ($utterance_fn, $hstringfsa_fn) = @ARGV;

open(my $utterance_fh, "<", $utterance_fn) || die "cannot open file $utterance_fn in reading mode\n";
open(my $hstringfsa_fh, ">", $hstringfsa_fn) || die "cannot open file $hstringfsa_fn in writing mode\n";

chomp(my @utterance = <$utterance_fh>);

# punctuation candidates; <eps> means "no punctuation here"
my @punct = (',', ':', ';', '?', '<eps>');
my ($src_state, $dest_state) = (0, 1);

# NOTE: state numbering restarts on every line, so the file should hold a single utterance
for my $i (0 .. $#utterance) {

my @line = grep { length } split(/\s+/, $utterance[$i]);
($src_state, $dest_state) = (0, 1);
my $nwords = scalar @line;

for my $j (0 .. $nwords - 1) {

    # arc for the j-th word of the 1-best (debug output goes to STDERR)
    my $word = $line[$j];
    print STDERR "$j-th word is [$word] src=$src_state dest=$dest_state $nwords\n";
    print $hstringfsa_fh "$src_state $dest_state $word $word\n";
    $src_state = $dest_state;
    $dest_state++;

    # parallel arcs: all punctuation symbols + <eps>
    print $hstringfsa_fh "$src_state $dest_state $_ $_\n" for @punct;

    # now the following word in the 1-best
    $src_state = $dest_state;
    $dest_state++;
}
}

# the last state reached is the final state
print $hstringfsa_fh "$src_state\n";

close($utterance_fh);
close($hstringfsa_fh);

Phil

Jan 30, 2019, 1:37:05 PM

to kaldi-help

Wow, thank you very much. Was just starting to implement this same paper ;).

Phil

Jan 30, 2019, 2:10:21 PM

to kaldi-help

I think to add back in capitalization you need to add in these modifications (assuming input is all lowercase):

print STDOUT "$j-th word is [$line[$j]] src=$src_state dest=$dest_state $nwords\n";
$word = $line[$j];
+ $word_upper = ucfirst $word;
+ print $hstringfsa_fh "$src_state $dest_state $word $word_upper\n";
+ print $hstringfsa_fh "$src_state $dest_state $word $word\n";
$src_state = $dest_state;
$dest_state++;
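
One thing to watch with the capitalized arcs: fstcompile stops with an error on any symbol that is missing from the symbol table, and ucfirst will also produce forms such as "To" that the LM vocabulary may not contain. A possible guard, sketched in shell/awk, emits the capitalized alternative only when that form is in the table. word_arcs is a made-up helper name, and the symbol table is assumed to be the usual "symbol id" per line.

```shell
# Hypothetical helper: print the arcs for one word between states SRC and DEST,
# adding a word:Word arc only if the capitalized form is a known symbol.
# Usage: word_arcs SYMTAB SRC DEST WORD
word_arcs() {
  awk -v src="$2" -v dst="$3" -v w="$4" '
    { known[$1] = 1 }
    END {
      cap = toupper(substr(w, 1, 1)) substr(w, 2)
      if (cap != w && (cap in known)) print src, dst, w, cap
      print src, dst, w, w
    }' "$1"
}
```

With a table containing both "paris" and "Paris", `word_arcs words.punct.txt 0 1 paris` prints the capitalized arc followed by the lowercase one; for a word with no capitalized entry it prints only the lowercase arc.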

Saranya V

Nov 17, 2021, 1:38:10 AM

to kaldi-help

Hi Phil,

I am trying to implement this paper.

https://www.researchgate.net/publication/220732929_Restoring_punctuation_and_capitalization_in_transcribed_speech/download

Do we need to create a hyperstring FSA for the test data and compose it with the LM FST (G.fst or HCLG.fst), or with the LM text combined with L.fst?