Biased parameters for truncated, noisy data


Ben Maughan
Sep 6, 2016, 4:13:16 PM
to Stan users mailing list
Dear Stan Users,

Thanks for the great software and documentation!

I was hoping someone could give me some advice with the following problem. First a caveat; I am an astrophysicist by training, not a statistician so I might use incorrect terminology and/or be making an obvious mistake!

I am trying to model a situation where I have noisy observed data points sampled from some underlying distribution, and my data are truncated such that points with low observed values are excluded. I would like to recover the parameters of the underlying distribution.

I have tried modelling the truncation as described in 9.3 of the reference manual. However, when I include the extra level of scatter in the data, the modelling returns biased parameters.
 
Here is a simplified model to generate some toy data in R

## start with 1000 data points
J <- 1000

## the underlying distribution of values is normal
mu <- 0
sigma <- 1

## generate a rate for each data point
x <- rnorm(J, mu, sigma)

## I observe a noisy version y of these data with known amount of
## noise, ysig
ysig <- 1
y <- rnorm(J, x, ysig)

## but I only include those above some threshold
ymin <- 0
x <- x[y > ymin]
y <- y[y > ymin]
J <- length(y)
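The same generating process can be mirrored in Python to see the selection effect directly (a hypothetical standard-library sketch, not part of the original post; names follow the R code):

```python
# Mirror of the R generating process above (illustrative sketch).
import random
import statistics

random.seed(1)
J, mu, sigma, ysig, ymin = 1000, 0.0, 1.0, 1.0, 0.0

x = [random.gauss(mu, sigma) for _ in range(J)]   # latent "true" values
y = [random.gauss(xi, ysig) for xi in x]          # noisy observations
kept = [yi for yi in y if yi > ymin]              # keep only y > ymin

# The surviving observations are biased high relative to mu = 0:
print(round(statistics.mean(kept), 2))
```

The kept sample mean lands far above 0, which is the selection effect the fit then has to contend with.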


Now I write my model in Stan:

data {
  int<lower=0> J;         // number of observed points
  real ymin;              // minimum allowed observed value
  real<lower=0> ysig;     // known amount of noise
  real<lower=ymin> y[J];  // observed values
}

parameters {
  real mu;                // mean of underlying normal
  real<lower=0> sigma;    // sd of underlying normal
  real x[J];              // "true" values of each data point
}

model {
  x ~ normal(mu, sigma);  // points are normally distributed
  for (i in 1:J) {
    y[i] ~ normal(x[i], ysig) T[ymin,]; // with truncated normal noise
  }
}


and then run Stan

## make list for stan
sim_dat <- list(J = J,
                y = y,
                ymin = ymin,
                ysig = ysig)

## initial values
inits <- list(list(mu = mu, sigma = sigma, x = x))

## fit
niter <- 5000
fit <- stan(file = "my-model.stan", data = sim_dat, iter = niter,
            chains = 1, seed = 666, init = inits)

## examine results
fit_summary <- summary(fit, pars = c("mu", "sigma"),
                       probs = c(0.025, 0.5, 0.975))$summary
print(fit_summary)


When I run this, I recover biased values of mu and sigma; e.g. for ymin=0:

           mean     se_mean         sd      2.5%       50%     97.5%    n_eff     Rhat
mu    0.5434274 0.003793941 0.08376895 0.3653829 0.5473229 0.7010205 487.5117 1.000211
sigma 0.8635829 0.008732197 0.11146207 0.6399063 0.8624748 1.0797541 162.9323 1.001197


If I set ymin to be very low so there is effectively no truncation (e.g. ymin=-10), the parameters are not biased, as expected. The bias increases as I raise ymin.

Can anyone suggest what I might be doing wrong here?

Cheers,

 Ben



Daniel Lee
Sep 7, 2016, 9:48:43 AM
to stan-users mailing list
Hi Ben,

The issue is that the Stan program you've written doesn't match the data generating process, so the mu that you're fitting with Stan won't match the true mu. If you add a generated quantities block, you'll see that the posterior predictive ys line up with the data. Try adding this to your Stan program and comparing y_rep to the data y.

generated quantities {
  real<lower = ymin> y_rep[J];
  y_rep = rep_array(ymin - 1, J);
  for (j in 1:J) {
    while (y_rep[j] < ymin) {
      y_rep[j] = normal_rng(x[j], ysig);
    }
  }
}

So first, the Stan program is working as you've written, but what you've written isn't what you really want. In your program, mu is being estimated as the marginal mean of the xs (with that much data).

Here's a Stan program that will work:
data {
  int<lower = 0> J;
  real ymin;
  real<lower = 0> ysig;
  real<lower = ymin> y[J];
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
transformed parameters {
  real<lower = 0> sigma_combined;
  sigma_combined = sqrt(square(sigma) + square(ysig));
}
model {
  for (j in 1:J)
    y[j] ~ normal(mu, sigma_combined) T[ymin, ];
}

This will correctly estimate mu and sigma. I'll let you add in a generated quantities block.

I haven't broken it down into an x component and an error term, but I'm sure it could be parameterized that way. Here, I'm assuming your data generating process where you first generate xs then the error terms as normal(0, ysig). Independent variances add, so sigma_combined is just the square root of the combined variance.
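The variances-add step can be checked with a quick simulation (a hypothetical Python aside, not part of the original reply):

```python
# Independent variances add: if x ~ N(mu, sigma) and y | x ~ N(x, ysig),
# then marginally y ~ N(mu, sqrt(sigma^2 + ysig^2)).
import math
import random
import statistics

random.seed(0)
mu, sigma, ysig = 0.0, 1.0, 1.0

y = [random.gauss(random.gauss(mu, sigma), ysig) for _ in range(50000)]

sigma_combined = math.sqrt(sigma ** 2 + ysig ** 2)  # sqrt(2) here
print(round(statistics.stdev(y), 2), round(sigma_combined, 2))
```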

Hopefully that makes sense. If you're able to explain it clearer than what I've written out, please send a better explanation out to the list. It's a bit early and I need more coffee before being much more lucid.



Daniel








Ben Maughan
Sep 7, 2016, 5:49:30 PM
to Stan users mailing list
Hi Daniel,

Thanks for the quick and detailed reply. I think I only partly understand your explanation. Firstly, the generated quantities example was useful and I see how that demonstrates that the program is working properly even if my modelling is not doing what I want.


So first, the Stan program is working as you've written, but what you've written isn't what you really want. In your program, mu is being estimated as the marginal mean of the xs (with that much data).


I'm afraid I don't understand this point. It seems to me that I want mu to be the mean of the xs, as that is what mu is when I generate my data. Similarly, I don't understand why my model does not match the data generation - to me it seems identical so I am clearly missing something important!
 
Here's a Stan program that will work:
data {
  int<lower = 0> J;
  real ymin;
  real<lower = 0> ysig;
  real<lower = ymin> y[J];
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
transformed parameters {
  real<lower = 0> sigma_combined;
  sigma_combined = sqrt(square(sigma) + square(ysig));
}
model {
  for (j in 1:J)
    y[j] ~ normal(mu, sigma_combined) T[ymin, ];
}

This will correctly estimate mu and sigma. I'll let you add in a generated quantities block.

I haven't broken it down into an x component and an error term, but I'm sure it could be parameterized that way. Here, I'm assuming your data generating process where you first generate xs then the error terms as normal(0, ysig). Independent variances add, so sigma_combined is just the square root of the combined variance.


Thanks for the example code. I can see why this works, but not why my original model did not. In reality, my data are not so simple as the two independent normal distributions I used in my example, so I cannot combine them into a single normal distribution and must model the distribution of x and the errors separately. For my data, the "x" values are modelled with a power-law and the error term is Poissonian - I set up the toy model with two normal distributions as the simplest way I could think of to demonstrate my problem. I'd find it really helpful if anyone could show me how to model this with x component and error term separately, including the truncation.

Many thanks,
 Ben
 






Bob Carpenter
Sep 7, 2016, 6:41:22 PM
to stan-...@googlegroups.com
Daniel's right. Your model doesn't match your generating
process. In the generating process, you censor out
items where y <= ymin.

If you don't record how many get censored and account for
that in the model, you won't be able to accurately estimate
the parameters.

Here's a simple example:

> y <- rnorm(100, 0, 1)
> mean(y)
[1] 0.0642081

> y <- y[y > 0]
> mean(y)
[1] 0.8642496

Check out the section on Censored Data in the Truncated or
Censored Data chapter. You just have a lower truncation instead
of an upper truncation. So you'd use the normal_lcdf instead
of the normal_lccdf if you want to marginalize out the censored
values.

- Bob
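Bob's empirical 0.86 can be compared with the closed form: for Y ~ N(mu, sigma) truncated below at ymin, E[Y | Y > ymin] = mu + sigma * phi(a) / (1 - Phi(a)) with a = (ymin - mu) / sigma (the inverse Mills ratio). A hypothetical Python check, not from the original message:

```python
# Closed-form mean of a lower-truncated normal (inverse Mills ratio).
import math
from statistics import NormalDist

def truncated_normal_mean(mu, sigma, ymin):
    a = (ymin - mu) / sigma
    nd = NormalDist()
    return mu + sigma * nd.pdf(a) / (1.0 - nd.cdf(a))

# For mu = 0, sigma = 1, ymin = 0 this is sqrt(2/pi) ~ 0.798,
# consistent with Bob's 0.86 from only ~50 surviving draws.
print(round(truncated_normal_mean(0.0, 1.0, 0.0), 3))
```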

Ben Maughan
Sep 8, 2016, 5:34:29 AM
to Stan users mailing list
Dear Bob,

Thanks for your comments.


    If you don't record how many get censored and account for
    that in the model, you won't be able to accurately estimate
    the parameters.


Maybe I'm using the wrong terminology (and maybe it isn't important)
but although in my toy model I know the number of missing data points
that fell below ymin, in my real data I do not. For this reason I was
following the model for truncated data rather than censored data in
the "Truncated or censored data chapter". In my Stan model above I
used


    y[i] ~ normal(x[i],ysig) T[ymin,]; // with truncated normal noise

to model the effect of truncation on the observed y values, while my
"intrinsic/true" (not sure what the correct term is) x values are not
truncated. I'm still unclear why this does not match the data
generation, and this seems to be an important gap in my understanding.

Following from your comment I tried applying the approach for censored data:

  data {
    int<lower=0> N_obs;      // number of observed points
    int<lower=0> N_cen;      // known number of missing points
    real<lower=0> ysig;      // known amount of noise
    real y[N_obs];           // observed values
    real<upper=min(y)> ymin; // known minimum allowed observed value
  }

  parameters {
    real mu;                 // mean of underlying normal
    real<lower=0> sigma;     // sd of underlying normal
    real x[N_obs];           // "true" values of each data point
    real x_cen[N_cen];       // "true" values of each missing data point
  }

  model {
    // observed points
    x ~ normal(mu, sigma);   // points are normally distributed
    y ~ normal(x, ysig);     // errors are normally distributed
    // missing points
    x_cen ~ normal(mu, sigma); // points are normally distributed
    for (i in 1:N_cen) {
      target += normal_lcdf(ymin | x_cen[i], ysig); // marginalise over missing points
    }
  }

This works correctly, which is great. However, I'm unclear how to
proceed in the case that I do not know the number of missing points.

Many thanks,

Ben

Bob Carpenter
Sep 8, 2016, 1:25:24 PM
to stan-...@googlegroups.com

> On Sep 8, 2016, at 5:34 AM, Ben Maughan <benma...@gmail.com> wrote:
>
> Dear Bob,
>
> Thanks for your comments.
>
> If you don't record how many get censored and account for
> that in the model, you won't be able to accurately estimate
> the parameters.
>
>
> Maybe I'm using the wrong terminology (and maybe it isn't important)

It's important if you want to communicate!

> but although in my toy model I know the number of missing data points
> that fell below ymin, in my real data I do not.

Then you won't be able to recover the true values.

> For this reason I was
> following the model for truncated data rather than censored data in
> the "Truncated or censored data chapter".

Your data generating process is not truncated, though, it's
censored.

> In my Stan model above I
> used
>
> y[i] ~ normal(x[i],ysig) T[ymin,]; // with truncated normal noise
>
> to model the effect of truncation on the observed y values,

That's not what this does.

> while my
> "intrinsic/true" (not sure what the correct term is) x values are not
> truncated. I'm still unclear why this does not match the data
> generation, and this seems to be an important gap in my understanding.

You want to think in terms of the density function and the generative
model. With the truncation above, you get the following contribution:

PROD_i normal(y[i] | x[i], ysig) / (1 - normal_cdf(y_min | x[i], ysig))

But that's not how your data's generated in the R program. You're
trying to apply what looks like a measurement error model (there's
another chapter in the manual for those), whereas what you really have
is censoring (you remove data points below ymin).

> Following from your comment I tried applying the approach for censored data:
>
> data {
> int<lower=0> N_obs; // number of observed points
> int<lower=0> N_cen; // known number of missing points
> real<lower=0> ysig; // known amount of noise
> real y[N_obs]; // observed values
> real<upper=min(y)> ymin; // known minimum allowed observed value
> }
>
> parameters {
> real mu; // mean of underlying normal
> real<lower=0> sigma; // sd of underlying normal
> real x[N_obs]; // "true" values of each data point
> real x_cen[N_cen]; // "true" values of each missing data point
> }
>
> model {
> // observed points
> x ~ normal(mu, sigma); // points are normally distributed
> y ~ normal(x, ysig); // errors are normally distributed
> // missing points
> x_cen ~ normal(mu, sigma); // points are normally distributed
> for (i in 1:N_cen){
> target += normal_lcdf(ymin | x_cen[i],ysig); // marginalise over missing points
> }
> }
>
> This works correctly, which is great. However, I'm unclear how to
> proceed in the case that I do not know the number of missing points.

You're out of luck, I'm afraid. If you don't know the number
of censored data points you have no way of recovering the true
parameters.

- Bob

Ben Maughan
Sep 8, 2016, 4:49:46 PM
to Stan users mailing list
Dear Bob,

Thanks for the continuing help. I think I am being quite obtuse here, but your distinction between truncation and censoring is unclear to me. I should start by saying that I am sure you are right, since I can see for myself that my truncated model doesn't work!

My understanding from the manual and elsewhere was that the only difference between censorship and truncation is that in the case of censorship you know the number of missing points, while for truncation you do not. If that is correct, then can I just pretend that I did not know how many points I generated before rejecting those below ymin? I could also ask a colleague to generate the data in the same way and not tell me how many points they started with. In that case, have I not generated a truncated rather than censored dataset, and shouldn't the truncation model work?

In my actual data, I am looking at a class of astrophysical object (active galactic nuclei or AGN) that has a distribution of X-ray emission rates (let's say it is lognormal, and that I want to know the mean and variance). I observe the sky with a photon counting detector which measures the number of X-ray photons in some length of time (i.e. sampling a Poisson distribution), and I detect an AGN if I receive more than some minimum number of photons in a pixel of the detector (say, 4 photons). I am trying to use the observations of photon counts for a large number of such AGN to infer the parameters of the distribution of rates, but I don't know how many AGN are out there that did not meet the detection threshold. My toy model was an attempt to replicate this in the simplest useful way, with x representing the rates, y representing the detected counts and ymin the detection threshold. I believe this is a truncated dataset rather than censored - is that the case, and if so would you maintain that there is no way to recover the parameters I am interested in? 

The way I would try to replicate the data described above would be something like

x <- rlnorm(1000, log(mu), sigma)
y <- rpois(1000, x)

y <- y[y > ymin]

but does this somehow fall into the trap of generating censored rather than truncated data and therefore not reproducing my experiment (even if I pretend I don't know that I started with 1000 points)? If so, I'd be really grateful for any suggestion of how to generate data correctly.

(N.B. I started by generating and modelling data along the lines above - lognormal and poisson - to more closely match my observations, but after failing with that I boiled it down to the toy model with two normal distributions in my original question as I thought that would be the simplest case to understand!)

Thanks for your patience!

 Ben

Avraham Adler
Sep 8, 2016, 5:03:15 PM
to Stan users mailing list
On Thursday, September 8, 2016 at 4:49:46 PM UTC-4, Ben Maughan wrote:
Dear Bob,

Thanks for the continuing help. I think I am being quite obtuse here, but your distinction between truncation and censoring is unclear to me. I should start by saying that I am sure you are right, since I can see for myself that my truncated model doesn't work!

My understanding from the manual and elsewhere was that the only difference between censorship and truncation is that in the case of censorship you know the number of missing points, while for truncation you do not. If that is correct, then can I just pretend that I did not know how many points I generated before rejecting those below ymin? I could also ask a colleague to generate the data in the same way and not tell me how many points they started with. In that case, have I not generated a truncated rather than censored dataset, and shouldn't the truncation model work?
 

Simplistically, perhaps, censoring means you know that there is an observation outside the zone but you don't know what it is. Truncation means you don't know anything outside the zone. For example, in actuarial work, if a loss hits a policy limit, it is censored from above. We know it exists, and it is at least X, but we don't know how much larger. Losses that fail to breach a deductible are truncated from below—no one knows that they exist, even though NEXT TIME, that event may cause a large-enough loss to register.

Hope that helps,

Avi

Michael Betancourt
Sep 8, 2016, 5:28:29 PM
to stan-...@googlegroups.com
Firstly, and somewhat tangentially: no, you did not measure any measurement
noise ysig.  Physicists make this mistake all the time because they’re used
to Poisson asymptotic estimators of the form lambda = N +/- \sqrt{N}.  But
if your noise is Poisson then model it generatively as

y ~ poisson(lambda);

or if there are measurement effects that corrupt the simple Poissonian variation

y ~ neg_binomial_2(lambda, phi);

etc, etc.  Or make a normal approximation as

y ~ normal(mu, sigma);

and _fit_ sigma as a parameter.

Anyways, unless you have some other measurements sensitive to how many
observations that you are missing then you cannot infer how many you missed
as there is literally no information about that (i.e. you can’t compute an efficiency
if you don’t have counts before and after some process!).  

BUT you can do something even better if you think generatively.  Really what
you have here is a generative model with two components: a latent population
of AGN and a measurement model for those AGN with sufficient magnitude.
If you fit both of those together then you can infer the shape of the AGN population
distribution even with the truncation, and then you can recover the overall rate
of the observations you missed by integrating the tail of that distribution from
the truncation.

Very, very prototypically, something like

data {
  int<lower=1> N;
  real y[N];
}

parameters {
  real agn_mu; // AGN population log mean
  real<lower=0> agn_sigma; // AGN population deviation

  real y_latent[N]; // Latent AGN magnitudes

  real thresh; // measurement threshold
  real<lower=0> sigma; // heteroskedastic measurement deviation
}

model {
  // priors on everything!  No uniform distributions!

  // AGN population model
  for (n in 1:N)
    y_latent[n] ~ normal(agn_mu, agn_sigma) T[thresh,];

  // Measurement model
  for (n in 1:N)
    y[n] ~ normal(y_latent[n], sigma);
}

generated quantities {
  real true_rate;
  real missing_rate;

  true_rate = N / (1 - normal_cdf(thresh, agn_mu, agn_sigma));
  missing_rate = true_rate - N;
}

In practice you’ll want to vectorize everything and probably switch
to a non-centered parameterization for the y_latent.

Generative models — they are the best!

Andrew Gelman
Sep 8, 2016, 9:35:45 PM
to stan-...@googlegroups.com
Hi, particularly relevant to this discussion is section 2 of this cult classic paper:
You _can_ model truncated data as censored data with an unknown number of censored data points.  I don't know how easy this is to do in Stan, though, as Stan does not allow discrete parameters.
A

Bob Carpenter
Sep 8, 2016, 11:48:25 PM
to stan-...@googlegroups.com
What your R code is doing is censoring.

To generate truncated data with y > ymin, you could
build a simple rejection sampler

y <- normal_rng(mu, sigma);
while (y <= ymin)
  y <- normal_rng(mu, sigma);

or you can invert a CDF:

z <- uniform_rng(normal_cdf(ymin, mu, sigma), 1)
y <- normal_inverse_cdf(z, mu, sigma);

(I don't know the R calls to do this.)
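(In R the corresponding calls would be runif, pnorm, and qnorm; a self-contained Python version of the inverse-CDF sampler, offered here as a hypothetical sketch rather than part of the original message, looks like:)

```python
# Inverse-CDF draw from a normal truncated below at ymin.
import random
from statistics import NormalDist

def rtruncnorm_lower(mu, sigma, ymin, rng=random):
    nd = NormalDist(mu, sigma)
    z = rng.uniform(nd.cdf(ymin), 1.0)  # uniform over the upper tail in CDF space
    return nd.inv_cdf(z)

random.seed(2)
draws = [rtruncnorm_lower(0.0, 1.0, 0.0) for _ in range(10000)]
print(min(draws) > 0.0)  # every draw respects the truncation
```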

But this isn't your generative process, so just having
truncation won't get you the right answer.

- Bob

Ben Maughan
Sep 9, 2016, 4:11:58 AM
to Stan users mailing list
Hi,

Thanks for this. If I understand your model correctly, it models the case where I might measure a y value to be below thresh even though its y_latent must be above thresh. However, I have the opposite case; my data may have any value of y_latent, but the measured y must be above thresh. In the above model, the probability that would be assigned to (y_latent<thresh & y>thresh) is zero, but this is perfectly possible in nature.

Specifically, an AGN might have a low magnitude that makes it unlikely to be detected, but in a particular observation the Poisson fluctuations in photon counts might just get over the detection threshold. In that case y_latent < thresh but y > thresh. This is what I tried to set up with my toy model by using the truncated normal for my measured y, but this did not work.

Am I misunderstanding your model?

Cheers,
 Ben

Michael Betancourt
Sep 9, 2016, 5:05:52 AM
to stan-...@googlegroups.com
Then move the truncation to the measurement model for the y.
The point is that you can’t infer anything about the missing
measurements unless you model the latent population distribution
exactly.

Ben Maughan
Sep 9, 2016, 7:02:11 AM
to Stan users mailing list
Hi Michael,

That is precisely what I did in my initial model (see first post, x is the latent population, ymin is thresh):

  x ~ normal(mu, sigma);

  for (i in 1:J) {
    y[i] ~ normal(x[i], ysig) T[ymin,];
  }

and this didn't work - or are you suggesting something different from my first model?

Cheers,
 Ben

Michael Betancourt
Sep 9, 2016, 8:27:57 AM
to stan-...@googlegroups.com
As Bob and Dan very clearly explained, your original results
were invalid because you weren’t simulating data correctly.
You simulated data assuming that you knew the total number
of trials, but such truncation models assume that you only
know the number of measured trials.  Hence you have to
do rejection sampling or, in the Gaussian case, inverse
CDF transform sampling to generate appropriate simulation
data.

Even once you do that, however, you can only infer the
latent rate of missingness (or the latent true rate).  This is
actually more informative than the actual number of missing
events as it isn’t corrupted by measurement noise.

Bob Carpenter
Sep 9, 2016, 10:42:24 AM
to stan-...@googlegroups.com
The paper Andrew linked has a clear explanation of
what's going on in section 2.1:

http://www.stat.columbia.edu/~gelman/research/published/parameterization.pdf

I was wrong in saying you couldn't infer the parameters,
though. And now that I think about it, you could probably infer
the amount of missingness from the CDFs given the parameters,
because you know the mass < ymin and the mass > ymin, which
gives you the expected ratio of missing to observed, and then
it's just binomial stuff after that to do inference on number
of missing items. At least I think that's right. I love
having my work checked on this list!

- Bob
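Bob's "binomial stuff" can be written out explicitly (a hypothetical Python sketch using the toy model's values): with p = P(y <= ymin) under the fitted marginal, each latent point survives with probability 1 - p, so the expected number censored given N_obs survivors is N_obs * p / (1 - p).

```python
# Expected number of censored points from the fitted marginal distribution.
from statistics import NormalDist

mu, sigma_combined, ymin = 0.0, 2.0 ** 0.5, 0.0  # toy-model values
N_obs = 500                                      # about half of 1000 survive

p = NormalDist(mu, sigma_combined).cdf(ymin)     # mass below the threshold
expected_missing = N_obs * p / (1.0 - p)
print(round(expected_missing))  # 500: ymin sits at the median here
```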

Ben Maughan
Sep 15, 2016, 4:38:51 AM
to Stan users mailing list
Hi,

Thanks to everyone for the comments. I have been pondering this for
the last few days and believe I understand the nature of my problem
now. As several of you said, the issue was that my data generation was
not consistent with my Stan model.

I'll summarise my understanding in the hope that (a) it is correct and
(b) it is of some use to others.

In my examples, I have some latent quantity x sampled from a
distribution of known form, with parameters θ_x, and then some
observed quantity sampled from a known distribution centred on x, with
parameters θ_y. The data are only collected if y is greater than
some threshold y_min.

I'll denote the pdf of x as P_x, and the pdf of y as P_y. In my first
toy model, P_x and P_y were both normal while in my semi-realistic
model, P_x was lognormal and P_y was Poissonian.

I generated my data using something like

x = rnorm(N_tot,mu,sigma_x)
y = rnorm(N_tot,x,sigma_y)

and applied my detection threshold with

y = y[y>ymin]
N_obs = length(y)

Now consider the likelihood for y. First, neglecting the truncation at
y_min, the likelihood of a particular pair of x and y is

P_1(y,x|θ_x,θ_y) = P_x(x|θ_x) * P_y(y|x,θ_y)

and we can marginalise out x to get

P_2(y|θ_x,θ_y) = \int dx P_x(x|θ_x) * P_y(y|x,θ_y)

Now, taking the truncation into account, we set P_2(y|θ_x,θ_y)=0 for y<=y_min and
renormalise the distribution to give

P_trunc(y|θ_x,θ_y,y_min) = P_2(y|θ_x,θ_y) / \int_y_min^∞ dy P_2(y|θ_x,θ_y)

[equation 1]

The problem I had was that my Stan model for truncated data had the form

    x ~ normal(mu,sigma_x);
    for (i in 1:N_obs)
      y[i] ~ normal(x[i], sigma_y) T[ymin,];

but this does not correspond to the likelihood above. Instead, this
corresponds to

P_trunc(y|θ_x,θ_y,y_min) = \int dx P_x(x|θ_x) * P_y(y|x,θ_y) / \int_y_min^∞ dy P_y(y|x,θ_y)

i.e. in this case it is only the distribution of y that is truncated,
while for the first case [equation 1], the x-marginalised distribution
of y is truncated, as in my generated data.

In the simple case of P_x and P_y being normal, then P_trunc
simplifies to a single normal with the variances added to give
sigma_tot^2 (as pointed out earlier in the thread), and you can just
write

y[i] ~ normal(mu,sigma_tot) T[ymin,]

For the general case where P_x and P_y are not both normal, I don't
know a way of writing a model corresponding to [equation 1] concisely
in Stan (i.e. keeping the x's as latent parameters rather than
integrating them out). An inelegant version, in which I compute the
integrals numerically in user-defined functions, correctly recovers the
parameters in my lognormal-Poisson generated data.
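For the normal-normal case, the inner marginal in [equation 1] can be checked numerically against the sigma_combined shortcut (a hypothetical Python sketch of the kind of integral involved, not from the original message):

```python
# Numerically marginalise x out of P_x(x|mu,sx) * P_y(y|x,sy) and compare
# with the closed form normal(y | mu, sqrt(sx^2 + sy^2)).
import math
from statistics import NormalDist

mu, sx, sy = 0.0, 1.0, 1.0
y = 0.7

# midpoint quadrature over a wide grid of x
n, lo, hi = 4000, mu - 10.0, mu + 10.0
h = (hi - lo) / n
marginal = 0.0
for i in range(n):
    x = lo + (i + 0.5) * h
    marginal += NormalDist(mu, sx).pdf(x) * NormalDist(x, sy).pdf(y) * h

closed_form = NormalDist(mu, math.sqrt(sx ** 2 + sy ** 2)).pdf(y)
print(abs(marginal - closed_form) < 1e-6)  # the two agree
```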

Any thoughts on whether it is possible to write the general case in
Stan would be gratefully received!

Thanks for all of your time and input!

Ben