Your example is perfect: if you have that kind of data, you would generally want to specify either an independent or a joint distribution of the underlying discrete distributions. However, what Baum-Welch does internally will be exactly equivalent to the counting approach. I will explain more in a moment.
Just a little observation: if you have a multivariate discrete distribution, then you also have a univariate discrete distribution. You can assume every possible combination of symbols corresponds to a single, univariate symbol. For example, if you have only (0, 0, 1), (0, 0, 2) and (1, 2, 3) in your data, you could map those as (0, 0, 1) = 0, (0, 0, 2) = 1, (1, 2, 3) = 2, forming a single discrete distribution. Thus, a discrete Markov model for your original multivariate data and for this new univariate distribution would be equivalent. But, as you might have noted, it would be very difficult to estimate this kind of distribution directly (which is just the joint distribution of all the components, by the way), because the number of possible symbol combinations can grow very fast, and you might not have enough examples in your training data to cover all of them.
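Just to make the tuple-to-symbol mapping concrete, here is a minimal sketch in plain C# using an ordinary dictionary (the observations here are hypothetical):

```csharp
using System;
using System.Collections.Generic;

class SymbolMapping
{
    static void Main()
    {
        // Hypothetical multivariate observations: each row is a tuple of discrete values.
        int[][] observations =
        {
            new[] { 0, 0, 1 },
            new[] { 0, 0, 2 },
            new[] { 1, 2, 3 },
            new[] { 0, 0, 1 }, // a repeated tuple maps to the same univariate symbol
        };

        // Assign a univariate symbol to each distinct tuple as it is first seen.
        var symbols = new Dictionary<string, int>();
        foreach (int[] tuple in observations)
        {
            string key = string.Join(",", tuple);
            if (!symbols.ContainsKey(key))
                symbols[key] = symbols.Count;
            Console.WriteLine("({0}) -> {1}", key, symbols[key]);
        }
    }
}
```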
In order to avoid this problem, what we can do is assume another distribution for the data, one that is more constrained but hopefully still able to capture enough of its structure to be useful. As in your example, instead of estimating the joint distribution, we could use an Independent<GeneralDiscreteDistribution> as you said, which is far easier to estimate. You don't need to use mixtures if you don't want to; in case you really have multivariate discrete symbols, I would say the Independent route is one of the best options.
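For example, this is roughly how such a density could be constructed, assuming Accord.NET's Independent and GeneralDiscreteDistribution classes (the component sizes below are hypothetical):

```csharp
using Accord.Statistics.Distributions.Multivariate;
using Accord.Statistics.Distributions.Univariate;

// Suppose each observation is a vector of three discrete components taking
// values in {0,1}, {0,1,2} and {0,1,2,3}: one discrete distribution per
// component, combined under an independence assumption.
var density = new Independent<GeneralDiscreteDistribution>(
    new GeneralDiscreteDistribution(2),
    new GeneralDiscreteDistribution(3),
    new GeneralDiscreteDistribution(4));
```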
Now, please note that if you create a HiddenMarkovModel<GeneralDiscreteDistribution>, it will work in exactly the same way as a HiddenMarkovModel (without generic parameters). The point is that the learning algorithm (such as Baum-Welch) will call the discrete distribution's "Fit" method, which in the case of GeneralDiscreteDistribution just applies the counting approach in exactly the same way as it would be done internally by the discrete hidden Markov model. Baum-Welch is a kind of expectation-maximization algorithm, so what it does during the "counting" phase is just the maximization step: it fits the underlying distributions to the data, passing a set of sample weights to each distribution so it can specialize in a given segment of the training samples.
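Here is a rough sketch of both constructions side by side, assuming the classic Accord.NET learning API (exact constructor and method signatures may differ between framework versions, and the sequences are made up):

```csharp
using Accord.Statistics.Distributions.Univariate;
using Accord.Statistics.Models.Markov;
using Accord.Statistics.Models.Markov.Learning;

// Hypothetical univariate discrete training sequences over 4 symbols.
int[][] sequences =
{
    new[] { 0, 1, 1, 2 },
    new[] { 0, 1, 2, 3, 3 },
};

// The non-generic discrete model (2 states, 4 symbols)...
var discrete = new HiddenMarkovModel(2, 4);

// ...and the equivalent generic model with explicit discrete emissions.
var generic = new HiddenMarkovModel<GeneralDiscreteDistribution>(
    2, new GeneralDiscreteDistribution(4));

// Baum-Welch on the discrete model performs the counting directly; on the
// generic model it would delegate the M-step to the distribution's Fit.
var teacher = new BaumWelchLearning(discrete)
{
    Tolerance = 0.0001,
    Iterations = 100
};
double logLikelihood = teacher.Run(sequences);
```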
In the continuous models, instead of counting, the underlying distributions have to do something else to fit themselves to the weighted data. If they are Gaussian distributions, they compute the weighted mean and weighted variance; if they are general discrete distributions, they compute the weighted counts (as Baum-Welch does internally); if they are Bernoulli distributions, they use weighted estimation of the distribution's parameter. As you can see, you can assume any probability distribution for your emission models. You could also do as you say and model the joint probability of your discrete variables as Gaussian; however, whether that works will depend on the problem you are trying to solve.
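As a rough illustration of that weighted maximization step, assuming the Fit(observations, weights) overloads that Accord.NET distributions expose (the numbers below are made up):

```csharp
using Accord.Statistics.Distributions.Univariate;

// Hypothetical observations and the per-sample weights (responsibilities)
// that Baum-Welch would hand to one state's distribution; weights sum to 1.
double[] samples = { 0, 1, 1, 2, 2, 2 };
double[] weights = { 0.05, 0.10, 0.15, 0.20, 0.25, 0.25 };

// Weighted counting for a discrete distribution over 3 symbols...
var d = new GeneralDiscreteDistribution(3);
d.Fit(samples, weights);

// ...and weighted mean/variance estimation for a Gaussian.
var g = new NormalDistribution();
g.Fit(samples, weights);
```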