# weighted log-likelihood


### Ben Goodrich

Jun 23, 2015, 7:00:54 PM
Andrew (mostly): Do we want stan_glm() to throw an error if weights are passed? In general, we want stan_glm() to accept all the same arguments as glm(), but we could make an exception if something would be objectively wrong from a Bayesian perspective. The glm() function allows weights to be any non-negative vector and clarifies that

> Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.

So which, if any, of those possibilities should stan_glm() allow? It seems that negative or non-integer weights should definitely not be allowed.

Similarly, for stan_lm vs. lm, the weights are described as

> If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).

Ben

### Bob Carpenter

Jun 23, 2015, 9:42:36 PM
Are we just talking about something like a regression
with predictors x[n], outcomes y[n] and weights w[n] that mean
data point n is treated as w[n] observations? We can easily
code that in Stan as:

```stan
for (n in 1:N)
  increment_log_prob(w[n] * normal_log(y[n], x[n] * beta, sigma));
```

What's non-Bayesian about it? That you don't model the weights?

Even negative weights make sense algorithmically in this context,
though I'm not sure what they'd mean conceptually.
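For positive integer weights, the increment above is numerically the same as replicating each observation w[n] times; a quick sanity check in Python, with made-up values:

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of Normal(mu, sigma) evaluated at y
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

y = [1.2, -0.3, 2.5]      # made-up outcomes
w = [3, 1, 2]             # positive integer weights
mu, sigma = 0.5, 1.0

# weighted log-likelihood, as in the increment_log_prob() line above
weighted = sum(wn * normal_lpdf(yn, mu, sigma) for yn, wn in zip(y, w))

# log-likelihood of the dataset with each y[n] replicated w[n] times
replicated = sum(normal_lpdf(yn, mu, sigma)
                 for yn, wn in zip(y, w) for _ in range(wn))

assert abs(weighted - replicated) < 1e-9
```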

- Bob

### Andrew Gelman

Jun 23, 2015, 9:44:02 PM
No no no no no.

The easiest way to see why it’s not Bayesian is to realize there’s no generative model here. You can’t do fake-data simulation or posterior predictive checks.

### Ben Goodrich

Jun 23, 2015, 9:47:55 PM
I tend to think that non-negative integer weights should be allowed too if they actually correspond to people in your sample, but Andrew didn't mention that case this morning. But to say that some person in the sample represents 1000.23456789 identical people in the population is non-Bayesian if you condition on those 999.23456789 people that you didn't actually observe.

Ben

### Andrew Gelman

Jun 23, 2015, 11:07:01 PM
Yah, good point.  Perhaps best to just disallow weights for now.  Zero weights are easy enough, of course, but it seems like more trouble than we need to allow them.

Alternatively we could allow weights, as Bob suggests, but then spit out lots of warnings.  The trouble with disallowing weights is that a lot of users will want them.  Also, if weights are fairly mild, it shouldn’t be so horrible to include them.

As I told Bob, suppose someone supplies a dataset with weights of 1000, 1000, 1000, 1000, etc.  Compared to weights of 1, 1, 1, 1, etc., the MLE won’t change, but any informative-prior Bayes estimate will (the likelihood gets 1000x the weight), and the standard errors will be tiny (and, depending on the model, HMC may converge poorly).  Bob suggested rescaling the weights to sum to N, but that won’t work in general, as can be seen from the example of a simple logistic regression where three items have weights of 1 and the other N-3 items have weights of 1e-6 each.  Rescaling gives each of the three items a weight of roughly N/3, implying much stronger data than actually exist.  (Things are a bit trickier with linear regression because of the free-floating variance parameter.)
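A small numeric illustration of the rescaling problem (N = 1000 here is arbitrary):

```python
# Three real observations (weight 1) plus N - 3 observations with
# near-zero weight; rescaling all weights to sum to N inflates each
# genuine observation to roughly N/3 effective observations.
N = 1000
w = [1.0, 1.0, 1.0] + [1e-6] * (N - 3)

scale = N / sum(w)                       # sum(w) is about 3.001
w_rescaled = [scale * wi for wi in w]

assert abs(w_rescaled[0] - N / 3) < 1    # each real item now "counts" ~333 times
assert abs(sum(w_rescaled) - N) < 1e-6
```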

The long story is that, from a Bayesian perspective, there’s really no such thing as weighted regression.  So we’re in a bit of a bind over how to compromise.

We have another such compromise issue when we want to do point estimation.  For logistic regression, ordered logit, probit, Poisson, neg binomial, etc etc etc., standard practice is MLE and we can do that too, or even better we can do posterior mode.  But for linear regression, standard practice is to use the 1/(N-K) denominator on the variance, and I’ve struggled to see how to come up with this answer generically (without “cheating” and having the program know that it’s linear regression).  Not an issue with full Bayes but it comes up if we want to have point estimation for compatibility.

For now I guess we should follow Bob’s advice and use increment_log_prob() with weights, and when weights are specified in stan_regression, we spit out a warning and also print out the sum of the weights.


### Ben Goodrich

Jun 24, 2015, 12:00:34 AM
On Tuesday, June 23, 2015 at 11:07:01 PM UTC-4, Andrew Gelman wrote:
> For now I guess we should follow Bob’s advice and use increment_log_prob() with weights, and when weights are specified in stan_regression, we spit out a warning and also print out the sum of the weights.

I guess there are several situations:
• For MLE that happens to use Stan's optimization algorithm, I think it is self-evident that stan_foo() should handle the weights in the same way as foo().
• For penalized MLE, MCMC, and things like that, we could follow Bob's suggestion (presumably disallowing negative weights), but I have a hard time coming up with any rationale for non-integer weights in a Bayesian GLM.
• For the case where glm() does "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes", that also makes complete sense from a Bayesian perspective (provided weight[i] * y[i] is an integer). Sometimes medical data comes this way because they don't want to individually identify the successes and failures.
• More generally, in any case where the weights are non-negative integers and represent the number of cases observed in the sample with that data, I wouldn't mind throwing an error that says, in effect, "duplicate each row i weight[i] times and try again without weights".
• Generalized Least Squares is Ordinary Least Squares on transformed data, so it doesn't seem particularly problematic from a Bayesian perspective to do a Gaussian model on the transformed data.
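The binomial case in the third bullet can be checked directly: with the weight as the number of trials and the response a proportion, the weighted Bernoulli kernel matches the binomial log-pmf up to a term that is constant in p. A sketch in Python (the numbers are invented):

```python
import math

# glm()'s binomial convention: response y is a proportion of successes,
# the prior weight n is the number of trials.  The weighted Bernoulli
# log-likelihood then equals the Binomial(n, p) log-pmf up to a term
# that does not involve p.
n, k = 20, 7              # 20 trials, 7 successes
y = k / n                 # response recorded as a proportion
p = 0.3                   # some parameter value

weighted_bernoulli = n * (y * math.log(p) + (1 - y) * math.log(1 - p))
binomial_lpmf = (math.log(math.comb(n, k))
                 + k * math.log(p) + (n - k) * math.log(1 - p))

diff = binomial_lpmf - weighted_bernoulli
assert abs(diff - math.log(math.comb(n, k))) < 1e-9
```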

Other opinions?

Ben

### Bob Carpenter

Jun 24, 2015, 12:36:36 AM
I still don't see the problem.

> On Jun 23, 2015, at 11:06 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
>
> Yah, good point. Perhaps best to just disallow weights for now. Zero weights are easy enough, of course, but it seems like more trouble than we need to allow them.
>
> Alternatively we could allow weights, as Bob suggests, but then spit out lots of warnings. The trouble with disallowing weights is that a lot of users will want them. Also, if weights are fairly mild, it shouldn’t be so horrible to include them.
>
> As I told Bob, suppose someone supplies a dataset with weights of 1000, 1000, 1000, 1000, etc. Compared to weights of 1, 1, 1, 1, etc., the result won’t affect the MLE, but they will change any informative-prior Bayes estimate (by giving 1000x the weight to the likelihood) and also it will result in tiny standard errors (and, depending on the model, possible poor convergence of HMC).

Isn't that the whole point of weighting --- if something counts
as 1000 observations, it counts as 1000 observations relative to
the prior, too.

> Bob suggested rescaling the weights to sum to N,

I think what I was trying to ask was whether an ordinary linear
regression is Bayesian given that we don't model N. (And
we don't model x in most cases, either, which is also problematic
for fake data generation if we want to, say, generate a fake
data set to test whether our logistic regression estimator is
working.)

> but that won’t in general work, as can be seen from the example of a simple logistic regression where three items each have weights of 1 and the other N-3 items have weight of 1e-6 each. The result will be to give each of the three items a weight of N/3 each, implying much stronger data than actually exist.

Right --- that's why I think it makes sense to treat the weights as
counts.

> (Things are a bit trickier with linear regression because of the free-floating variance parameter.)
>
> The long story is that, from a Bayesian perspective, there’s really no such thing as weighted regression. So we’re in a bit of a bind in how we compromise this.

I still don't understand why, or more importantly, why
it's such a big deal. It's sort of like MI, right? It's
not a Bayesian procedure because there's no big joint model
governing everything (there could be, though --- I realize
it's a subtle point).

> We have another such compromise issue when we want to do point estimation. For logistic regression, ordered logit, probit, Poisson, neg binomial, etc etc etc., standard practice is MLE and we can do that too, or even better we can do posterior mode. But for linear regression, standard practice is to use the 1/(N-K) denominator on the variance, and I’ve struggled to see how to come up with this answer generically (without “cheating” and having the program know that it’s linear regression).

And I don't see why that's cheating or if we call it
cheating, why we wouldn't want to cheat!

> Not an issue with full Bayes but it comes up if we want to have point estimation for compatibility.
>
> For now I guess we should follow Bob’s advice and use increment_log_prob() with weights, and when weights are specified in stan_regression, we spit out a warning and also print out the sum of the weights.

I just figured that's what you were doing already! Is
there another way to do weighting?

It's come up on the list a few times.

- Bob

### Ben Goodrich

Jun 24, 2015, 2:10:44 AM
On Wednesday, June 24, 2015 at 12:36:36 AM UTC-4, Bob Carpenter wrote:
> I still don't see the problem.

For example, this survey has a column called "weight" with real positive numbers that sum to N = 8198 (excluding missing values). These weights are a multiplicative function of the gender, age, and region categories, so that when weighted, the sample has the same proportions on these three variables as the US adult population of 200+ million. But utilizing them would put you in violation of the likelihood principle.

Ben

### Andrew Gelman

Jun 24, 2015, 4:07:27 PM
Bob:
The trouble is that weights are used in many different ways for different purposes. When a survey respondent has a weight of 3.1, this does not mean that there were 3.1 respondents who gave these exact same responses. What it means is that this person represents (based on some model) 3.1 people in the population. You can perhaps already see the problem here.

The analogy with MI is not a good one. With MI we can see the imputations as an approximation to Bayesian inference given a poorly specified model. With weighting there’s no implicit model at all. Also, yes, linear regression is Bayesian even though N and x are not modeled. The implicit assumption is separability of (N,x) and y in the posterior. We discuss this near the beginning of the regression chapter of BDA.
A
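The separability point can be sketched as follows (notation mine, following the usual regression-chapter argument): if the joint model and the prior factor as

```latex
p(y, x \mid \theta, \phi) = p(y \mid x, \theta)\, p(x \mid \phi),
\qquad
p(\theta, \phi) = p(\theta)\, p(\phi)
\;\Longrightarrow\;
p(\theta \mid y, x) \propto p(\theta)\, p(y \mid x, \theta),
```

so inference about the regression parameters θ can ignore the model for x (and N).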

> On Jun 24, 2015, at 12:35 AM, Bob Carpenter <ca...@alias-i.com> wrote:
>
> I still don't see the problem.
>
>> Bob suggested rescaling the weights to sum to N,
>
> I think what I was trying to ask was whether an ordinary linear
> regression is Bayesian given that we don’t model N.

No, we are just assuming separability of N, x, and y in the model. This is discussed near the beginning of the regression chapter of BDA.


### Andrew Gelman

Jun 24, 2015, 8:53:35 PM
I think we should allow non-integer weights.  R returns an error if you give it negative weights, so we can do that too.

The only trouble is, I don’t quite know what R is doing with the weights.

Here’s an example:

```r
> y <- 1:5
> w <- 1:5
> display(lm(y~1))
lm(formula = y ~ 1)
            coef.est coef.se
(Intercept) 3.00     0.71
---
n = 5, k = 1
residual sd = 1.58, R-Squared = 0.00
> display(lm(y~1, weight=w))
lm(formula = y ~ 1, weights = w)
            coef.est coef.se
(Intercept) 3.67     0.62
---
n = 5, k = 1
residual sd = 2.42, R-Squared = 0.00
```

OK, so far so good.  We gave higher weights to the big values, so beta.hat goes up.

But then check this out:

```r
> display(lm(y~1, weight=10*w))
lm(formula = y ~ 1, weights = 10 * w)
            coef.est coef.se
(Intercept) 3.67     0.62
---
n = 5, k = 1
residual sd = 7.64, R-Squared = 0.00
```

I multiplied all the weights by 10, and what happened?  The coefficient estimate did not change (fine), but the se did not change either!  Which suggests it’s doing some sort of normalization with the weights, which I don’t actually want, because now I don’t know what’s going on.  Also, the residual sd increased, which suggests that it’s a weight-scaled sd, not a residual sd at all.

OK, now let’s try logistic regression:

```r
glm(formula = y ~ 1, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) 0.41     0.91
---
n = 5, k = 1
residual deviance = 6.7, null deviance = 6.7 (difference = 0.0)
glm(formula = y ~ 1, family = binomial(link = "logit"), weights = w)
            coef.est coef.se
(Intercept) -0.41     0.53
---
n = 5, k = 1
residual deviance = 20.2, null deviance = 20.2 (difference = 0.0)
glm(formula = y ~ 1, family = binomial(link = "logit"), weights = 10 * w)
            coef.est coef.se
(Intercept) -0.41     0.17
---
n = 5, k = 1
residual deviance = 201.9, null deviance = 201.9 (difference = 0.0)
```

_This_ time when we multiply all weights by 10, the se of the coefficient decreases by the expected factor of sqrt(10).

So . . . it seems to me that R is not coherent in what it does with lm and glm.  (By the way, if you run glm with family="gaussian", you reproduce the lm results.)  Which suggests to me that we do not need to preserve compatibility with R here!
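Both behaviors can be reproduced from first principles, which pins down what each function is doing. A Python sketch (the binary outcome for the glm() runs isn't shown in the session above; c(1, 1, 1, 0, 0) is a reconstruction that matches the reported estimates):

```python
import math

# Reproduce the lm() output: weighted least squares with an intercept
# only.  lm() treats weights as relative precisions, so the "residual
# sd" is on the weighted scale and the SE is invariant to rescaling.
y = [1, 2, 3, 4, 5]
w = [1, 2, 3, 4, 5]

def wls_intercept(y, w):
    b = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)   # weighted mean
    rss = sum(wi * (yi - b) ** 2 for wi, yi in zip(w, y))
    sd = math.sqrt(rss / (len(y) - 1))                  # n - k = 4
    se = sd / math.sqrt(sum(w))
    return b, sd, se

b1, sd1, se1 = wls_intercept(y, w)
b2, sd2, se2 = wls_intercept(y, [10 * wi for wi in w])
assert abs(b1 - 3.67) < 0.01                       # matches lm() output
assert abs(sd2 / sd1 - math.sqrt(10)) < 1e-9       # "residual sd" 2.42 -> 7.64
assert abs(se2 - se1) < 1e-9                       # SE 0.62 unchanged

# glm() instead treats weights as counts, so information accumulates:
# the intercept SE is 1 / sqrt(sum(w) * p * (1 - p)) at the weighted
# success rate p.
yb = [1, 1, 1, 0, 0]    # assumed binary outcome (not shown above)

def weighted_logit_intercept_se(yb, w):
    p = sum(wi * yi for wi, yi in zip(w, yb)) / sum(w)
    return 1.0 / math.sqrt(sum(w) * p * (1 - p))

s1 = weighted_logit_intercept_se(yb, w)
s2 = weighted_logit_intercept_se(yb, [10 * wi for wi in w])
assert abs(s1 - 0.53) < 0.01                       # matches glm() output
assert abs(s2 - 0.17) < 0.01                       # shrinks by sqrt(10)
```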

My suggestion is that we disallow negative or missing weights, and that when weights are supplied we use increment_log_prob with the specified weights.  This is clean and clear.  The only concern is that the autogenerated Stan code will now be uglier.  I suppose we’ll have to have stan_regression generate different Stan programs, depending on whether it is called with or without weights.

### Ben Goodrich

Jun 24, 2015, 9:02:48 PM
On Wednesday, June 24, 2015 at 8:53:35 PM UTC-4, Andrew Gelman wrote:
> I think we should allow non-integer weights.  R returns an error if you give it negative weights, so we can do that too.

Even for MCMC with a non-Gaussian GLM? If there is any question for which that would yield a correct Bayesian answer, I think we should allow it, but I can't think of one.

Ben

### Andrew Gelman

Jun 24, 2015, 9:30:08 PM
I agree that it’s not Bayesian; it’s what is sometimes called a quasi-likelihood in that it acts mathematically as a likelihood function but there is no generative model.  Still, it is a well-defined target distribution and I think we should allow it.  Cos people are gonna want to do it.

### Bob Carpenter

Jun 24, 2015, 11:21:37 PM

> On Jun 24, 2015, at 4:07 PM, Andrew Gelman <Gel...@stat.columbia.edu> wrote:
>
> Bob:
> The trouble is that weights are used in many different ways for different purposes. When a survey respondent has a weight of 3.1, this does not mean that there were 3.1 respondents who gave these exact same responses. What it means is that this person represents (based on some model) 3.1 people in the population.

I understand this.

> You can perhaps already see the problem here. The analogy with MI is not a good one. With MI we can see the imputations as an approximation to Bayesian inference given a poorly specified model. With weighting there’s no implicit model at all. Also, yes, linear regression is Bayesian even though N and x are not modeled. The implicit assumption is separability of (N,x) and y in the posterior. We discuss this near the beginning of the regression chapter of BDA.

Putting your two paragraphs together, I take it there's no way
to treat the weights as "data" (like N or x) in such a way that they're
separable in the same way as (N, x) in regression?

- Bob

### Ben Goodrich

Jun 24, 2015, 11:32:14 PM
On Wednesday, June 24, 2015 at 9:30:08 PM UTC-4, Andrew Gelman wrote:
> I agree that it’s not Bayesian; it’s what is sometimes called a quasi-likelihood in that it acts mathematically as a likelihood function but there is no generative model.  Still, it is a well-defined target distribution and I think we should allow it.  Cos people are gonna want to do it.

Perhaps another way to ask the question is: What do you want the print to look like in the case of a non-Gaussian GLM with non-integer weights under MCMC? For bona fide Bayesian analysis, it would be something like the mean, standard deviation, and credible interval of the margins of the posterior distribution. But if there is no posterior distribution, then what are the users supposed to use to make a statistical inference? Maybe the multivariate mode of this target distribution? But if there is no valid statistical inference, I don't see why we don't just return an error that says "if you want to use non-integer weights, you have to do MLE".

Ben

### Andrew Gelman

Jun 25, 2015, 12:08:28 AM
I’m just thinking we take the quasi-likelihood and multiply it by the prior, treat the combined thing as a posterior and go from there.  That’s basically what R does with glm to get estimates and standard errors.
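As a sketch of what that target distribution looks like in the simplest case, here is a weighted Bernoulli quasi-likelihood multiplied by a Beta prior, with the mode found by grid search (all values invented):

```python
import math

# Weighted Bernoulli "quasi-likelihood" times a Beta(a, b) prior; the
# quasi-posterior mode is found by grid search and checked against the
# closed form for this conjugate-style target.
y = [1, 1, 0, 0, 0]
w = [2.5, 1.0, 0.5, 1.0, 2.0]    # non-integer weights
a, b = 2.0, 2.0                  # Beta prior hyperparameters

def log_quasi_posterior(p):
    ll = sum(wi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
             for yi, wi in zip(y, w))
    return ll + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)

grid = [i / 10000 for i in range(1, 10000)]
mode = max(grid, key=log_quasi_posterior)

# closed form: (sum(w*y) + a - 1) / (sum(w) + a + b - 2)
expected = (sum(wi * yi for yi, wi in zip(y, w)) + a - 1) / (sum(w) + a + b - 2)
assert abs(mode - expected) < 1e-3
```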

### Andrew Gelman

Jun 25, 2015, 12:09:50 AM
No, weights are trickier. We have ideas of how to treat them fully Bayesianly but it’s tough. See the following paper:
http://www.stat.columbia.edu/~gelman/research/published/Si_et_al-BA14.pdf

I don’t actually think this is the right thing to do, but the complexity of that analysis indicates the difficulty that’s involved here.

### Jonah

Jul 4, 2015, 5:30:20 PM
Thought I'd chime in here to give an update on where things stand right now regarding user-supplied weights to stan_glm:

We should be able to handle the weights without needing separate .stan files by using if clauses. For example, here's how it currently looks for discrete outcomes (right now this means the likelihood is bernoulli or poisson, but this will eventually be extended to others too):

```stan
if (has_weights == 0) { // unweighted log-likelihoods
  if (family == 1) { // family = binomial
    if (link == 1) y ~ bernoulli_logit(eta);
    else {
      vector[N] pi;
      pi <- linkinv_binom(eta, link);
      y ~ bernoulli(pi);
    }
  }
  else { // family = poisson
    if (link == 1) y ~ poisson_log(eta);
    else {
      vector[N] phi;
      phi <- linkinv_pois(eta, link);
      y ~ poisson(phi);
    }
  }
}
else { // weighted log-likelihoods
  vector[N] summands;
  if (family == 1) // bernoulli
    summands <- pw_binom(y, eta, link);
  else // poisson
    summands <- pw_pois(y, eta, link);
  increment_log_prob(dot_product(weights, summands));
}
```

(The linkinv_binom and linkinv_pois functions apply the appropriate inverse link function to the linear predictor, and the pw_binom and pw_pois functions compute the pointwise log-likelihoods.)

### Andrew Gelman

Jul 4, 2015, 5:36:09 PM
Just one man’s opinion, but . . . to me, it’s actually cleaner just to have two sets of Stan models!

### Jonah Sol Gabry

Jul 4, 2015, 5:46:51 PM
I know what you mean. I go back and forth on this.

On one hand, having fewer .stan files means having fewer files to edit every time something needs to be changed that's common to more than one of them. It also means less compilation. On the other hand, it means that each file is longer and more convoluted.

### Andrew Gelman

Jul 4, 2015, 5:48:40 PM
Perhaps you could have a master stan file and then a script that takes it and writes the weighted version?

One reason I want separate files is that I like the idea of stan_regression being a gateway drug for Stan.  The cleaner the stan_regression files are, the easier it will be for people to customize them.

### Jonah Sol Gabry

Jul 4, 2015, 6:26:38 PM
If we really want users to be able to look at the code, make sense of it, and easily customize it, then I suppose we'd want to give them a version that doesn't have lots of conditional clauses for the family, link function, weights, offsets, the presence of an intercept term, all the different possibilities for the prior distributions, etc.

Maybe the best way to have pre-compiled models while keeping the code readable to new users is to view these as separate issues. That is, the Stan code used behind the scenes can be different from the code the user sees. It might be best to have these long, convoluted .stan files from the perspective of the functionality of the R package, but then we could have a function that writes a clean, readable .stan file for the user's specified model. This file wouldn't be used by the package (otherwise it couldn't be pre-compiled), but it would produce the same results if the user ran it with RStan.

### Andrew Gelman

Jul 4, 2015, 6:42:57 PM
Yes, I was thinking there’d be a different Stan model for each link function etc.

### Ben Goodrich

Jul 4, 2015, 6:44:02 PM
On Saturday, July 4, 2015 at 5:48:40 PM UTC-4, Andrew Gelman wrote:
> The cleaner the stan_regression files are, the easier it will be for people to customize them.

I don't think that goal is compatible with the more pressing goals for the package. The precompiled .stan files have lots of tricks, optimizations, and edge-case logic (like not having an intercept term, having offset terms, possible stratification, etc.).

If you are trying to teach people to write .stan files, then they would be better served by looking at generated Stan code via the rethinking / glmer2stan stuff that Rob has been working on.

Ben

### Jonah Sol Gabry

Jul 4, 2015, 7:06:11 PM
Yeah, what Ben said. It would be really hard (probably impossible) to have all the functionality and performance we want from the package using Stan programs intelligible to newcomers.

Maybe what we should do is really improve the pedagogical content in the example models library (repository) so that the R package can point users there. What would be cool is if stan_glm would print a link to a webpage with an example model using the same outcome variable type, likelihood, and link as the user's model. It would be pretty straightforward to implement this. It would just take time, as we'd need to make more and better example models, but that's probably something we should do regardless.

### Ben Goodrich

Jul 4, 2015, 7:27:03 PM
On Saturday, July 4, 2015 at 7:06:11 PM UTC-4, Jonah wrote:
> Maybe what we should do is really improve the pedagogical content in the example models library (repository) so that the R package can point users there.

That is always a good idea, can be done by almost anyone, and enhances stan_demo (plus shinyStan), but ...

> What would be cool is if stan_glm would print a link to a webpage with an example model using the same outcome variable type, likelihood, and link as the user's model.

I think it would be better to just silently replace the Stan code in the stanfit object with autogenerated code for that model (assuming no weights and whatnot).

Ben