> Dear all,
> I have been facing the following learning/optimization problem:
> Consider a vector of observed data y = {y1,y2,...yn}, which is assumed to
> be generated by a model p(y|v), where v is a parameter vector.
> Now, assume that we do not have a deterministic expression for the above
> model, but we have a computational method that generates synthetic data x
> ={x1,x2,...xn} given a parameter vector v'.
> The aim is to find the parameter vector v' that is closest to the real one
> v so that both x and y have the same distribution, given our limitation
> that we can only work on comparisons between the real data and the
> synthetic data generated by our computational model.
> Could you please give me some hint on which kind of optimization/learning
> algorithms should I study to find out an update rule for the parameter
> vector v'?
> To make things clearer I have the Matlab script below considering observed
> data coming from a mixture of 3 equal-variance Gaussians, but each of them
> having a different contribution in the observed data.
> We know the model for the data and we know the centers of the Gaussians,
> but we would like to estimate their contributions. To this end, we first
> generate synthetic data coming from 3 Gaussians, and we compare the
> resulting distribution with the observed one.
> To update the estimated contributions, we add the error between the real
> distribution and the generated one at the Gaussians centers, and it seems
> to work!.
> The problem is that I do not have clear why this is working and which
> method I am (unconsciently) using to make this work.
> Thank you!
> Maco
> %-------- MATLAB SCRIPT -----
> %Total number of samples
> N = 1000000;
> %Model parameters
> center = [-2; -1; -0.0];
> contrib = [0.45; 0.35; 0.2];
> %Generate observed data as a mixture of three Gaussians
> d1 = sqrt(0.1)*randn(contrib(1)*N,1)+center(1);
> d2 = sqrt(0.1)*randn(contrib(2)*N,1)+center(2);
> d3 = sqrt(0.1)*randn(contrib(3)*N,1)+center(3);
> d_t = [d1;d2;d3];
> %Plot observed data distribution
> y = -4:0.1:4; %histogram bins
> p_y = hist(d_t,y);
> scale = sum(p_y*(y(2)-y(1)));
> p_y = p_y./scale;
> plot(y,p_y,'r');
> %Now, we want to estimate the contributions from each gaussian. We assume
> %now that we know in advance the centers of the gaussians, corresponding
> %to histogram bins c1, c2 and c3:
> c1 = find(abs(y-center(1))==min(abs(y-center(1))));
> c2 = find(abs(y-center(2))==min(abs(y-center(2))));
> c3 = find(abs(y-center(3))==min(abs(y-center(3))));
> %First, we initialize our estimated contributions
> cont_est = [1/3; 1/3; 1/3];
> n_iter = 10;
> for n=1:n_iter
> %Generate synthetic data from current parameters
> d1_s = sqrt(0.1)*randn(round(cont_est(1)*N),1)+center(1);
> d2_s = sqrt(0.1)*randn(round(cont_est(2)*N),1)+center(2);
> d3_s = sqrt(0.1)*randn(round(cont_est(3)*N),1)+center(3);
> d_s = [d1_s; d2_s; d3_s];
> %Plot and compare current the distributions of the current synthetic
> %data and the observed data
> p_x = hist(d_s,y);
> scale = sum(p_x*(y(2)-y(1)))
> p_x = p_x./scale;
> plot(y,p_x), hold on, plot(y,p_y,'r');
> %Get amplitudes of the real distribution and the synthetic
> distribution
> %at the centers of the 3 distributions
> amp_x = [p_x(c1); p_x(c2); p_x(c3)];
> amp_y = [p_y(c1); p_y(c2); p_y(c3)];
> %Update estimated contributions with the error between normalized
> %amplitudes
> cont_est = cont_est + (amp_y./sum(amp_y) - amp_x./sum(amp_x));
> pause
> close
> end