How does the reestimation of the emission probabilities really work in an HMM?


Julio Aguilar

May 20, 2014, 2:35:14 PM
to accor...@googlegroups.com
Hello Cesar and Accord users,

I've been trying to properly understand how to use and train HMMs. I've been reading, among other papers and books, Rabiner's tutorial paper. I'm stuck at the training phase.

I do understand the forward and backward variables (I used conditional probability theory to derive the equations in the paper in order to understand them), which are used to get the \xi and the \gamma variables, which I also understand. These last variables are used for the reestimation of the parameters.

My problem lies with the reestimation of the emission probabilities. It seems to me that you don't need to choose a probability distribution function to model the observations, since the parameters will be reestimated based on the counting approach!

From paper:

reestimated emission prob of symbol k at state i = (expected number of times in state i and observing symbol k) / (expected number of times in state i)
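
In terms of the \gamma variables, this is (in Rabiner's notation, with O_t the observation at time t and v_k the k-th symbol):

   \hat{b}_i(k) = \sum_{t : O_t = v_k} \gamma_t(i) / \sum_{t=1}^{T} \gamma_t(i)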

At the end my emission probabilities will be determined by this counting procedure, so why do I need to choose a distribution function?

Is it used for an initial estimation of the parameters, to increase the probability of reaching a global maximum with Baum-Welch? That's the only thing I can come up with, and I don't think it is true ...

I would really appreciate any help :-)

Thanks and best regards,
Julio


César

May 20, 2014, 3:17:23 PM
to accor...@googlegroups.com
Hi Julio!

In fact, if you are working with discrete symbols only, you don't need to specify a probability distribution. The hidden Markov models in Accord.NET can be created either with a generic argument specifying the observation distribution or without this argument, in which case the distribution is assumed to be a general discrete distribution based on symbol observation counts.
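
For instance (just a sketch; the exact constructor signatures may vary between Accord.NET versions), both of the following describe the same kind of discrete model:

   // Plain discrete HMM: emissions are estimated by symbol counting.
   var discrete = new HiddenMarkovModel(states: 2, symbols: 4);

   // Generic version with an explicit discrete distribution: equivalent.
   var generic = new HiddenMarkovModel<GeneralDiscreteDistribution>(
       2, new GeneralDiscreteDistribution(symbols: 4));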

A distribution only needs to be specified when you are dealing with non-discrete observations, or, in other words, observations that you can't simply count. Whenever this happens, instead of computing the emission probability of symbol k at state i, we will have to model the emission probability of a continuous sample x at state i. For this, we can assume each state has a probability density function that follows a given distribution.
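
For example (again just a sketch, with placeholder parameters), a model whose states emit real-valued samples from Gaussian densities could be created as:

   // Continuous-density HMM: each state emits according to a Gaussian,
   // so Baum-Welch refits means and variances instead of counting symbols.
   var model = new HiddenMarkovModel<NormalDistribution>(
       2, new NormalDistribution(mean: 0, stdDev: 1));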

Hope it helps!

Best regards,
Cesar

Julio Aguilar

May 21, 2014, 4:54:10 AM
to accor...@googlegroups.com
Hi Cesar,

Thank you very much for your reply.

Ok, so this is the case for a single discrete distribution. But then, what about a vector of 3 elements, where each element can take discrete values? (This is actually my case.)

double[] obsVector = { X1, X2, X3 };
// where X1 can take the following values: 0, 1, 2, 3
// X2: 0, 1, 2
// X3: 0, 1, 2

In that case, I would need to use a generic HMM with independent (or joint) emissions with three components, where each component is a categorical distribution (i.e. a discrete distribution). Something like this:

var emissions = new Independent<GeneralDiscreteDistribution>(
    new GeneralDiscreteDistribution(symbols: 4),
    new GeneralDiscreteDistribution(symbols: 3),
    new GeneralDiscreteDistribution(symbols: 3));


For this case, even though the observations are discrete and I am able to count them (to some extent), I won't be able to use the counting approach, will I? If I am right, what I still don't understand is why they also use some kind of counting in the new reestimation equations.

Ohhhh, I think I got it! The joint probability of N discrete (or also continuous) distributions can be modeled by some kind of Gaussian (or whatever), which has parameters: mean and variance. The counting in this case will be to estimate these parameters?

In the case where continuous distributions make up the mixture, the counting serves to estimate the mean and variance of those distributions? And from those, compute the mean and variance of the joint distribution (but without counting, just computing the joint probabilities and estimating these two final values)?

In the case where discrete distributions make up the mixture, the counting serves to estimate the probability of getting one single element in each component? And from those, compute the mean and variance of the joint distribution (again without counting, just computing the joint probabilities and estimating these two final values)?

I think I'm still missing something important.

Best Regards,
Julio


César

May 21, 2014, 6:21:55 AM
to accor...@googlegroups.com
Hi Julio!

Your example is perfect: if you have that kind of data, you would generally want to specify an independent or joint distribution of underlying discrete distributions. However, what Baum-Welch does internally will be exactly equivalent to the counting approach. I will explain more in an instant.

Just a little observation: if you have a multivariate discrete distribution, then you also have a univariate discrete distribution. You can assume every possible combination of symbols corresponds to a single, univariate symbol. For example, if you have only (0, 0, 1), (0, 0, 2), (1, 2, 3) in your data, you could assume those are actually (0, 0, 1) = 0, (0, 0, 2) = 1, (1, 2, 3) = 2, forming a single discrete distribution. Thus, a discrete Markov model for your original multivariate discrete data and for this new discrete distribution would be equivalent. But, as you might have noted, it would be very difficult to estimate this kind of distribution directly (which is just a joint distribution of every component, by the way), because the number of possible combinations of symbols can grow very fast, and you might not have enough examples in your training data to cover all of those possibilities.
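
Just to make that idea concrete, a hypothetical encoding (my own sketch, not part of the framework) could assign each distinct observed combination the next free univariate symbol:

   var codebook = new Dictionary<Tuple<int, int, int>, int>();

   int Encode(int x1, int x2, int x3)
   {
       var key = Tuple.Create(x1, x2, x3);
       if (!codebook.TryGetValue(key, out int symbol))
           codebook[key] = symbol = codebook.Count; // next free index
       return symbol;
   }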

In order to avoid this problem, what we can do is assume another distribution for the data, one that is more constrained but hopefully still able to capture enough of the data to be useful. As in your example, instead of assuming a joint distribution, we could use an Independent<GeneralDiscreteDistribution> as you said, which would be far easier to estimate. You don't need to use mixtures if you don't want to; in case you really have multivariate discrete symbols, I would say the Independent route would be one of the best options.

Now, please note that if you create a HiddenMarkovModel<GeneralDiscreteDistribution>, it will work in exactly the same way as a HiddenMarkovModel (without generic parameters). The point is that the learning algorithm (such as Baum-Welch) will call the discrete distribution's "Fit" method, which in the case of the GeneralDiscreteDistribution just applies the counting approach in exactly the same way as it would be done internally by the discrete hidden Markov model. Baum-Welch is a kind of expectation-maximization algorithm, so what it does during the "counting" phase is just the maximization step: it fits the underlying distributions to the data, passing a set of sample weights to each distribution so they can specialize only in a given segment of the training samples.
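
Roughly, the maximization step then looks like this (a simplified sketch; gammaPerState is a hypothetical name for the per-state weight vectors computed in the expectation step, and the real implementation differs in details):

   // Refit each state's distribution to the whole observation sequence,
   // weighted by the probability of being in that state at each time step.
   for (int i = 0; i < model.States; i++)
       model.Emissions[i].Fit(observations, gammaPerState[i]);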

In the continuous models, instead of counting, the underlying distributions will have to do something else in order to fit themselves to the weighted data. If they are Gaussian distributions, they will compute weighted means and weighted variances; if they are general discrete, they will compute weighted counts (as Baum-Welch does internally); if they are Bernoulli distributions, they will use weighted parameter estimation to estimate the distribution's parameter. As you can see, you can assume any probability distribution for your emission models. You can also do as you say and model the joint probability of your discrete distributions as Gaussians; however, whether that would work or not depends on the problem you are trying to solve.
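
To make the Gaussian case concrete, here is a minimal, framework-free illustration of what such a weighted fit computes (my own sketch, not the library code):

   // Weighted maximum-likelihood estimates for a Gaussian: each sample
   // contributes to the mean and variance in proportion to its weight.
   static void WeightedGaussianFit(double[] x, double[] w,
       out double mean, out double variance)
   {
       double sum = 0, mu = 0;
       for (int t = 0; t < x.Length; t++) { sum += w[t]; mu += w[t] * x[t]; }
       mu /= sum;

       double ss = 0;
       for (int t = 0; t < x.Length; t++)
           ss += w[t] * (x[t] - mu) * (x[t] - mu);

       mean = mu;
       variance = ss / sum;
   }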

Julio Aguilar

May 21, 2014, 8:07:47 AM
to accor...@googlegroups.com
Hi Cesar,

again, thank you very much for helping me!

I just want to check if I correctly understood.

Let's say I only have one single discrete random variable that can take any value within the discrete range [0, 4]. => counting approach

b_i(k) = probability of observing symbol k at state i
       = (expected number of times in state i and observing symbol k) / (expected number of times in state i)
       => counting approach


Now, let's do the same with an observation vector, where each element can take a discrete value (exactly like the one used before)

double[] obsVector = { X1, X2, X3 };

b_i(k) = probability of observing symbol k = (k1, k2, k3) at state i = b_i(k1) * b_i(k2) * b_i(k3)
b_i(kj) = probability of observing symbol kj at state i => counting approach

So, basically, what I understood from your last reply is the following:

I will need to use the counting approach to estimate the probability of observing a given value of one element (kj) at state i, which is what I did for a single random variable. And finally, to get the probability of observing (k1, k2, k3), which is the joint probability, I will just multiply them as in the equation above! At the end, I will have a single discrete distribution where k = number of possible combinations of symbols = k1 * k2 * k3 = 4 * 3 * 3 = 36 (in my case).

Question 1. Did I understand correctly?
Question 2. The counting step is the maximization, but where is the expectation step?
Question 3. How does this method know which state produced any given observation? Is this where the expectation step takes place?

Oh I see,
counting/fitting data => maxStep
reestimation => expStep 

Thanks again for the help.

Best Regards,
Julio

Julio Aguilar

May 24, 2014, 2:16:53 PM
to accor...@googlegroups.com
Hi Cesar,

I know you must be busy and I'm sorry to disturb you, but I would really like to know your comments on my last questions. Or perhaps you could point me in the right direction (maybe with links) apart from Rabiner's paper.

Best Regards,
Julio

César

May 25, 2014, 9:11:31 AM
to accor...@googlegroups.com
Hi Julio!

From what you described, it seems your assertions are correct: if you create a single distribution for all possible combinations of symbols, it will be the same as creating a joint distribution for all components of your observation vector. However, looking at your pseudo-code, it seems that you are not computing a joint distribution, but an independent distribution instead: b_i(k) = b_i(k1) * b_i(k2) * b_i(k3).

If this were a true joint distribution, it would be something more like the following. Let's assume you have created a distribution for all possible combinations of symbols. This can be seen as a single vector of k positions, where k = k1*k2*k3. Each position in this vector contains the probability of one possible combination. To check/store the probability of a combination of symbols, you would first have to compute the combination index x, as in:

   x = k1*x1 + k2*x2 + k3*x3

where k1 is the number of different symbols in the first component of your feature vector, k2 the number in the second, and k3 the number in the third. The values x1, x2, x3 are just the components of your feature vector. The final value x is the position in the joint distribution that contains the probability you are looking for.

Regarding your question about the Baum-Welch algorithm: the expectation step is when you compute the forward and backward matrices of probabilities (which then give the weights for maximizing the individual state distributions in the last step). The maximization is when you apply the weights you got before, plus the symbol counts, to fit your state distributions. Well, at least that is how I understand it, but please note that I could also be a bit mistaken on that.
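
In symbols, the expectation step combines the two matrices into the state posteriors that are later used as weights:

   \gamma_t(i) = \alpha_t(i) \beta_t(i) / \sum_j \alpha_t(j) \beta_t(j)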

César

May 25, 2014, 9:16:13 AM
to accor...@googlegroups.com
Also, I think this material from Carlos Guestrin explains the Baum-Welch algorithm a little better in light of the E-M algorithm:

 - http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/baumwelch.pdf

I will try to answer your other question soon!

Best regards,
Cesar

César

May 26, 2014, 5:17:49 AM
to accor...@googlegroups.com
Oops,

Sorry Julio; the index calculation I gave you was incorrect. Perhaps it should be more or less something like this (but please check):

  x = (k2*k3)*x1 + k3*x2 + x3
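
For example, with the symbol counts from your observation vector (k1 = 4, k2 = 3, k3 = 3) and a made-up observation (x1, x2, x3) = (2, 1, 0):

   int k1 = 4, k2 = 3, k3 = 3;              // sizes of each component
   int x1 = 2, x2 = 1, x3 = 0;              // hypothetical observation
   int x = (k2 * k3) * x1 + k3 * x2 + x3;   // = 9*2 + 3*1 + 0 = 21, in 0..35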

If you wish, there is an actual implementation for building those indices/distributions in the JointDistribution source code.

Best regards,
Cesar

Julio Aguilar

May 26, 2014, 5:48:41 AM
to accor...@googlegroups.com
Hi Cesar, 

Thank you for your patience and help :-)

I think I misled you at some point in our conversation. I am not trying to implement an HMM with a single discrete distribution containing all possible observations. I have already implemented two HMMs: one with an independent distribution and another with a joint distribution (I did that to see the differences between the two). Each of these has the same observation vector, which is the discrete one I wrote about in previous posts.

I need to understand what happens inside Baum-Welch (or how your code does it). I was confused by Baum-Welch because it seemed to me that all it did was count, and I asked myself why I needed to tell it which distribution function to use when, in the end, it just counts. Then, thanks to you, I understood that this is only the case for a single discrete distribution. But I still don't quite understand how it deals with multiple discrete (or continuous) distributions.

Does your code, or the Baum-Welch algorithm, create a single discrete distribution from the independent or joint distribution?
1. In relation to my case, where I have 3 discrete distributions as components of an independent distribution or of a joint distribution: does it use the counting approach to get the probabilities of each individual discrete distribution? And finally, does it multiply them independently or dependently according to which distribution is in use?
2. Or is it something else?

The reason I want to know about this is actually my thesis, but the reason I want to know how to deal with discrete and continuous distributions is to learn and to satisfy my curiosity :-)

I will give your link about the Baum-Welch algorithm a try :-)

Thanks again and best regards,
Julio