
bigram vs trigram


Roy Jerden

Apr 15, 1997

What is the advantage of the trigram model used by IBM over the bigram
model used by some other recognition engines? Is it merely speed,
accuracy, both, or what?

Tony Robinson

Apr 16, 1997

rje...@mindspring.com (Roy Jerden) writes:

Trigrams predict the next word better and so the resulting speech
recognition system has potentially better accuracy.

However, in simple implementations the number of distinct word histories
that must be kept separate increases by a factor which is the vocabulary
size (e.g. it will run 60,000 times slower and use 60,000 times more
memory!). So clearly, simple implementations can't be used. However,
when smart search techniques are used then the situation can be
reversed. Consider the case where your trigram language model was so
much better you only had to perform the acoustic matching on a very
small number of words - there is a trade off here.
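The "predicts the next word better" point can be sketched with maximum-likelihood counts over a toy corpus (illustrative only; a real recogniser trains on millions of words):

```python
from collections import Counter

# Toy corpus; the vocabulary here is tiny, purely for illustration.
words = "the cat sat on the mat the dog sat on the log".split()

uni = Counter(words)
bi = Counter(zip(words, words[1:]))
tri = Counter(zip(words, words[1:], words[2:]))

def p_bigram(w, prev):
    """Maximum-likelihood P(w | prev)."""
    return bi[(prev, w)] / uni[prev]

def p_trigram(w, prev2, prev1):
    """Maximum-likelihood P(w | prev2, prev1)."""
    return tri[(prev2, prev1, w)] / bi[(prev2, prev1)]

# After "the" alone, four continuations are equally likely; the extra
# word of trigram context ("on the") narrows the choice considerably.
print(p_bigram("mat", "the"))         # 0.25
print(p_trigram("mat", "on", "the"))  # 0.5
```

The sharper the predicted distribution, the fewer word candidates need full acoustic matching, which is exactly the trade-off described above.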

In summary, trigrams should be more accurate than bigrams, but the
speed/memory problems are an active area of research. ICASSP is a good
conference with reasonably accessible proceedings to find out more.

Tony Robinson

--
http://www.softsound.demon.co.uk/
email a...@softsound.com
Fax +44-1223-740026

Joseph S. Wisniewski

Apr 16, 1997

Tony Robinson wrote:
>
> rje...@mindspring.com (Roy Jerden) writes:
>
> > What is the advantage of the trigram model used by IBM over the bigram
> > model used by some other recognition engines? Is it merely speed,
> > accuracy, both, or what?
>
> Trigrams predict the next word better and so the resulting speech
> recognition system has potentially better accuracy.
>
> However, in simple implementations the number of distinct
> word histories that must be kept separate increases by a factor
> which is the vocabulary size (e.g. it will run 60,000 times slower
> and use 60,000 times more memory!).

I'm confused. Shouldn't this only increase proportionately to perplexity
for an exhaustive search, and even less for a constrained search?

> So clearly, simple implementations can't be used. However,
> when smart search techniques are used then the situation can be
> reversed. Consider the case where your trigram language model was so
> much better you only had to perform the acoustic matching on a very
> small number of words - there is a trade off here.

There's also a tradeoff in the amount of data it takes to generate a
trigram grammar vs. a bigram grammar for a dictation application,
especially if it's a statistical grammar. You may find yourself looking
at needing to analyze hundreds of times more text to derive a trigram
grammar than you would for a bigram.
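The data-requirement point follows from how sparsely text fills an n-gram table; a rough sketch of the effect (toy corpus, purely illustrative):

```python
words = "the cat sat on the mat the dog sat on the log".split()
V = len(set(words))  # 7 word types in this toy text

# Distinct n-gram types actually observed in the text:
bigram_types = len(set(zip(words, words[1:])))
trigram_types = len(set(zip(words, words[1:], words[2:])))

# Fraction of each model's parameter table that the text fills in.
# The trigram table has V**3 cells vs V**2 for the bigram table, so
# the same text covers a far smaller fraction of it.
print(bigram_types / V**2)   # ~0.18 of the bigram cells seen
print(trigram_types / V**3)  # ~0.026 of the trigram cells seen
```

Since the trigram table is V times larger but the text yields roughly the same number of observed types, reliable trigram estimates need vastly more training text.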



> In summary, trigrams should be more accurate than bigrams, but the
> speed/memory problems are an active area of research. ICASSP is a
> good conference with reasonably accessible proceedings to find out
> more.
>
> Tony Robinson
>
> --
> http://www.softsound.demon.co.uk/
> email a...@softsound.com
> Fax +44-1223-740026

--
Joseph S. Wisniewski | Views expressed are my own, and don't reflect
Ford Motor Company | those of the Ford Motor Co. or affiliates.
Project Sapphire | LeMans, Daytona, Bonneville, and Sebring are
jwis...@ford.com | just races, won by people driving Ford cars!

Tony Robinson

Apr 16, 1997

"Joseph S. Wisniewski" <jwis...@ford.com> writes:

> Tony Robinson wrote:
> >
> > However, in simple implementations the number of distinct
> > word histories that must be kept separate increases by a factor
> > which is the vocabulary size (e.g. it will run 60,000 times slower
> > and use 60,000 times more memory!).
>
> I'm confused. Shouldn't this only increase proportionately to perplexity
> for an exhaustive search, and even less for a constrained search?

[ What I missed from my first post was the fact that I assumed the
vocabulary size is 60,000 ]

The search space for an exhaustive search increases by a factor of the
vocabulary size, V, every time N is increased in an N-gram (that is,
going from a 2-gram/bigram to a 3-gram/trigram increases the search
space by a factor of V).

The whole interaction of phone models, word models and language models
is something I found took several years to grasp, and thus I'm in a
sticky spot when I have to teach this in eight one-hour lectures on
search and language modelling. My latest attempt at an explanation can
be found at URL http://svr-www.eng.cam.ac.uk/~ajr/network.eps

This picture shows a set of major nodes in a finite state grammar, each
of which is labelled by a word context; the links are labelled by the
words in the vocabulary. By taking a link transition you recognise a
word and thus update the word history (major node). The word
transitions are themselves composed of minor nodes which form a
standard HMM.

I hope you'll agree that the number of major nodes is V^(N-1) and thus
the search space without pruning increases proportional to V^(N-1).
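This structure is easy to enumerate for a toy vocabulary (a sketch, assuming the network described above: one major node per (N-1)-word history, one link per vocabulary word):

```python
from itertools import product

vocab = ["a", "b", "c"]   # toy vocabulary, V = 3
N = 3                     # trigram

# Major nodes: one per (N-1)-word history.
histories = list(product(vocab, repeat=N - 1))

def step(history, word):
    # Taking a link labelled `word` recognises it and shifts the
    # history window along by one word.
    return history[1:] + (word,)

print(len(histories))               # 9, i.e. V**(N-1)
print(step(("a", "b"), "c"))        # ('b', 'c')

# At the assumed 60,000-word vocabulary, moving from bigram to trigram
# multiplies the number of major nodes by V:
V = 60_000
print(V ** (3 - 1) // V ** (2 - 1))  # 60000
```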

Of course, as I said, the exhaustive search is unreasonable and often
can't be used, so most implementations use a constrained search.

> > So clearly, simple implementations can't be used. However,
> > when smart search techniques are used then the situation can be
> > reversed. Consider the case where your trigram language model was so
> > much better you only had to perform the acoustic matching on a very
> > small number of words - there is a trade off here.
>
> There's also a tradeoff in the amount of data it takes to generate a
> trigram grammar vs. a bigram grammar for a dictation application,
> especially if it's a statistical grammar. You may find yourself looking
> at needing to analyze hundreds of times more text to derive a trigram
> grammar than you would for a bigram.

Indeed, but only because we don't properly understand model
combination/mixture methods yet (such as backoff or deleted
interpolation). For the same amount of language model training data a
trigram should (theoretically) always equal or outperform a bigram.
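A minimal sketch of the deleted-interpolation idea mentioned above, mixing trigram, bigram and unigram maximum-likelihood estimates (the lambda weights are illustrative constants; a real system estimates them on held-out data):

```python
from collections import Counter

words = "the cat sat on the mat the dog sat on the log".split()
uni = Counter(words)
bi = Counter(zip(words, words[1:]))
tri = Counter(zip(words, words[1:], words[2:]))
total = len(words)

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated P(w | u, v): a weighted mixture of the trigram,
    bigram and unigram maximum-likelihood estimates."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

# The trigram ("dog", "the", "mat") was never seen, but the lower-order
# models still give the word nonzero probability mass:
print(p_interp("mat", "dog", "the"))
```

This is why the trigram can only help: where its own estimate is unreliable, the mixture falls back on what the bigram already provides.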

The problem becomes even more complex because, for any reasonable
amount of data, you can only estimate a small fraction of your N-grams
and the rest must back off to an (N-1)-gram (and some of the
(N-1)-grams back off to an (N-2)-gram, etc). Thus the *exhaustive*
search space for a practical trigram is considerably reduced if you
take this into account.

[ Not that you have these problems in cars yet! ]

Tony

--
Tony Robinson
http://www.SoftSound.demon.co.uk/
email a...@softsound.com
Fax +44-1223-740026
