Log-log regression


Mario Henrique

Feb 21, 2016, 6:54:47 PM
to julia-stats
How can I run a log-log linear regression in Julia?
Like "lm(log(y) ~ log(x))" in R

Milan Bouchet-Valat

Feb 22, 2016, 4:45:25 AM
to julia...@googlegroups.com
On Sunday, February 21, 2016 at 15:54 -0800, Mario Henrique wrote:
> How can I run a log-log linear regression in Julia?
> Like "lm(log(y) ~ log(x))" in R
AFAIK this is perfectly equivalent to taking the log of both x and y,
and applying a linear regression on the resulting variables. So that
should be quite straightforward with GLM.jl (see its documentation).


Regards

Michael Borregaard

Feb 22, 2016, 6:30:19 AM
to julia-stats
The documentation is not very explicit about the preferred way to do something like that. It looks to me as if you have to put the variables in a DataFrame, update the DataFrame with log versions, then do the lm.
using GLM, DataFrames
x = collect(1:5) + rand(5)
y = collect(1:5) + rand(5)
test = DataFrame(x = x, y = y)
test[:logx] = log(test[:x])
test[:logy] = log(test[:y])
lm(logy ~ logx, test)

Is that the preferred method?

Mario Silveira

Feb 22, 2016, 6:48:46 AM
to julia...@googlegroups.com
Yes, I have to put the logs in the data frame first; with "lm(log(y) ~ log(x), data)" I get an error.
Thanks for the help.




--
Mario Henrique

Milan Bouchet-Valat

Feb 22, 2016, 7:54:11 AM
to julia...@googlegroups.com
On Monday, February 22, 2016 at 08:48 -0300, Mario Silveira wrote:
> Yes, I have to put the logs in the data frame first; with "lm(log(y) ~
> log(x), data)" I get an error.
Yes, sorry I wasn't clear (I thought your question was about
statistical theory). Adding transformed variables to the data frame is
the preferred method AFAIK. And it's not too annoying to do either.

That said, transformation inside formulas will likely be supported at
some point. See this old issue:
https://github.com/JuliaStats/DataFrames.jl/issues/19
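
For what it's worth, once that lands the call would presumably look just like the R version. This is purely hypothetical at the time of writing, since GLM.jl does not accept it yet:

lm(log(y) ~ log(x), test)  # hypothetical: transformation inside the formula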


Regards

> Thanks for the help.
>
> 2016-02-22 8:30 GMT-03:00 Michael Borregaard <mkborr...@gmail.com>:
> > The documentation is not very explicit about the preferred way to
> > do something like that. It looks to me as if you have to put the
> > variables in a DataFrame, update the DataFrame with log versions,
> > then do the lm. 
> > using GLM, DataFrames
> > x = collect(1:5) + rand(5)
> > y = collect(1:5) + rand(5)
> > test = DataFrame(x = x, y = y)
> > test[:logx] = log(test[:x])
> > test[:logy] = log(test[:y])
> > lm(logy ~ logx, test)
> >
> > Is that the preferred method?
> >
> >

Michael Krabbe Borregaard

Feb 22, 2016, 8:07:31 AM
to julia...@googlegroups.com
Thanks for the clarification! And great news on the potential for functions to be included in formulas. It is not too annoying to add the variables, but I feel it will help create a more fluid and natural analytical workflow.


Mario Silveira

Feb 22, 2016, 8:59:29 AM
to julia...@googlegroups.com
Yes, Michael. I'm an economist, and in most econometrics software I know you have to add the log variables first to run the regression. That's no problem for me.
Thanks for the help, Julia is great!!

Mario Henrique

Cedric St-Jean

Mar 1, 2016, 8:39:52 PM
to julia-stats
This page seems relevant:

  1. Abusing linear regression makes the baby Gauss cry. Fitting a line to your log-log plot by least squares is a bad idea. It generally doesn't even give you a probability distribution, and even if your data do follow a power-law distribution, it gives you a bad estimate of the parameters. You cannot use the error estimates your regression software gives you, because those formulas incorporate assumptions which directly contradict the idea that you are seeing samples from a power law. And no, you cannot claim that because the line "explains" (really, describes) a lot of the variance that you must have a power law, because you can get a very high R^2 from other distributions (that test has no "power"). And this is without getting into the additional errors caused by trying to fit a line to binned histograms.
    It's true that fitting lines on log-log graphs is what Pareto did back in the day when he started this whole power-law business, but "the day" was the 1890s. There's a time and a place for being old school; this isn't it.
  2. Use maximum likelihood to estimate the scaling exponent. It's fast! The formula is easy! Best of all, it works! The method of maximum likelihood was invented in 1922 [parts 1 and 2], by someone who studied statistical mechanics, no less. The maximum likelihood estimators for the discrete (Zipf/zeta) and continuous (Pareto) power laws were worked out in 1952 and 1957 (respectively). They converge on the correct value of the scaling exponent with probability 1, and they do so efficiently. You can even work out their sampling distribution (it's an inverse gamma) and so get exact confidence intervals. Use the MLEs!
I don't usually work with power laws, so I don't have an opinion on this. But I believe the issue is that the log on the y-axis distorts the Gaussian error distribution on which the least-squares fit is predicated.

HTH,

Cédric

Mario Silveira

Mar 1, 2016, 9:17:54 PM
to julia...@googlegroups.com
Cédric,
in economics, log-log models are widely used for various reasons, the main one being that they let you interpret the model in terms of elasticities.
Log-log with OLS appears in the majority of econometrics books and has been shown, empirically and mathematically, to be useful when used with the right premises.
The author may be right that log-log has its problems, but I think it is wrong to generalize; at the least he should present some citations with proof and evidence, otherwise it is merely an allegation without scientific value.

Thanks for your attention, Cédric.

Stefan Karpinski

Mar 1, 2016, 9:27:24 PM
to julia-stats
Cedric's point isn't that you can't have models that appear linear on a log-log plot – power laws are precisely this kind of model. The point is that you should not use linear regression on the log-transformed data to estimate the model parameters. If that's what's done in econometrics books, they may need revision.

Mario Silveira

Mar 1, 2016, 9:49:29 PM
to julia...@googlegroups.com
Econometrics eventually "developed independently" of mainstream statistics. The reason is the type of data we deal with. In a basic statistics course, linear regression takes only a few chapters of the book. In a first course in econometrics, the whole course is about regression models: when the estimators are biased, multicollinearity, heteroscedasticity, static or dynamic interpretation...
I believe that no other area of study is as concerned with the accuracy of regression as econometrics.
PS: in econometrics, statistical estimation happens in a second stage; first the model must be validated theoretically. Of course, there are several problems, but log-log regression for elasticities is one of the most basic and reliable tools, used every day worldwide.
If you are interested in seeing a little of how econometrics works, a suggestion is Econometric Analysis by Greene.

Mario Silveira

Mar 1, 2016, 9:58:11 PM
to julia...@googlegroups.com
Maybe I did not express myself properly. Here is an example of what I meant: http://www.dummies.com/how-to/content/econometrics-and-the-loglog-model.html
--
Mario Henrique

Stefan Karpinski

Mar 1, 2016, 11:21:30 PM
to julia-stats
I think the key claim here is:

You can estimate this model with OLS by simply using natural log values for the variables instead of their original scale.

There are people far more qualified than I am on this list to confirm (or deny) this, but statistical best practice seems to be that this is not a good way to estimate the model parameters. The attached paper gives a nice overview – I found it very informative in any case. They suggest using equation (5) to estimate the exponent of a power law. The preceding paragraph explains why using OLS on the log-transformed values is a bad idea, giving an example with synthetic data where that method yields a confidence interval that fails to contain the true parameter value. Using equation (5), on the other hand, gives a confidence interval perfectly centered on the true value.
0412004v3.pdf
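
For reference, equation (5) of that paper is the maximum-likelihood estimator for the continuous power-law exponent, α̂ = 1 + n / Σ ln(xᵢ/x_min). It is short enough to sketch in Julia; the function name and the explicit x_min argument are my own choices:

# eq. (5) of the attached paper: MLE for the continuous power-law exponent;
# the usual approximate standard error for this estimator is (alpha - 1) / sqrt(n)
function alpha_hat(x, xmin)
    tail = x[x .>= xmin]            # keep only observations above the cutoff
    n = length(tail)
    a = 1 + n / sum(log(tail / xmin))
    return a, (a - 1) / sqrt(n)     # point estimate and approximate std. error
end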

Mario Silveira

Mar 1, 2016, 11:35:03 PM
to julia...@googlegroups.com
Thanks Stefan for the article.
I'll study it calmly and see how I can apply it. It is always good to have feedback from people from other fields.

Andreas Noack

Mar 1, 2016, 11:35:22 PM
to julia...@googlegroups.com
There is some confusion here. The context for the blog post is not really explained. The kind of log-log plot the blog post is talking about is very different from the log-log regressions made in econometrics.

In the blog post, the author considers plots of P(X>x) against x with log scales on both axes, i.e. a single variable. Power laws would give a straight downward-sloping line. By construction, the dots will cross the y-axis at one regardless of the distribution that generated the data, but a regression line fitted to the data might not do that, which is why the author of the blog post writes that "It [the least squares fit of the log transformed data] generally doesn't even give you a probability distribution".
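
Concretely, the construction is something like this sketch (the synthetic data and variable names are mine):

using Distributions
data = rand(Pareto(2.5), 1000)   # synthetic sample; any sample works here
x = sort(data)
n = length(x)
p = collect(n:-1:1) / n          # p[i] = P(X >= x[i]); starts at one, as noted
# the plot in question is log(p) against log(x); a power law gives a line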

The model often used in econometrics, and probably many other places, relates **two** or more log-transformed variables. As Mario points out, the regression coefficients are then interpreted as elasticities. There are probably also bad things to say about this, but that would be something different from what the blog post is criticizing.

Economists are usually not that interested in power laws. The main exception is that the Pareto distribution is often used for modelling the upper tail of the income distribution.

By the way, the blog seems quite good if you are interested in statistics. I didn't know about it so thanks for the link.

Mario Silveira

Mar 1, 2016, 11:41:09 PM
to julia...@googlegroups.com
You got the point Andreas. Thank you!
This group is wonderful.

Cedric St-Jean

Mar 2, 2016, 10:53:38 AM
to julia-stats
Thanks Andreas, you're right, I confused "power law" and "power law distribution".

Stefan Karpinski

Mar 2, 2016, 11:04:20 AM
to julia-stats
Glad we resolved that. Andreas' explanation makes sense.

Jason Merrill

Mar 4, 2016, 11:23:31 AM
to julia-stats
On Tuesday, March 1, 2016 at 11:35:22 PM UTC-5, Andreas Noack wrote:
There is some confusion here. The context for the blog post is not really explained. The kind of log-log plot the blog post is talking about is very different from the log-log regressions made in econometrics.

...
 
The model often used in econometrics, and probably many other places, relates **two** or more log-transformed variables. As Mario points out, the regression coefficients are then interpreted as elasticities. There are probably also bad things to say about this, but that would be something different from what the blog post is criticizing.

I would like to take a crack at this. I guess it's a bit of a hobby horse. I don't know enough economics to criticize the economics scenario, but I have seen a lot of questionable log-transformed least-squares in the hard sciences.

The typical scenario that justifies standard least squares is a model that looks like this:

(1) y_i = f(x_i; a) + σ ϵ_i

where f is some deterministic model function, x_i and y_i are the independent and dependent data variables, a is a free parameter or a whole collection of free parameters, σ is the standard deviation of the error (which is typically unknown or uncertain), and the ϵ_i are independent samples from a standard normal distribution.

The known (standard normal) joint distribution of the ϵ_i justifies taking the likelihood in terms of y_i - f(x_i; a) to be multivariate normal, which in turn justifies least squares as maximum likelihood. This story is told in various notations at the beginning of essentially every treatment of maximum likelihood.

If your error is multiplicative and log-normally distributed instead of additive and normally distributed, then the model instead looks like

(2) y_i = f(x_i; a)exp(σ ϵ_i)

where all the symbols have exactly the same meaning as in (1) (note that exponentiating a normally distributed variable gets you a log-normally distributed variable). Then, taking logs gets you back to something that looks like (1) in terms of transformed variables

log(y_i) = log(f(x_i; a)) + σ ϵ_i

which justifies log-transformed least squares as maximum likelihood in the same way as before.

This is fine in theory, but there are a couple problems in practice:

1. People very frequently decide to do log-transformed least-squares based on the algebraic form of f: if f is exponential or a power law, the log transformation turns a non-linear least-squares problem into linear least squares. Linear least squares is easier to execute, so that's what people frequently do. But the algebraic form of f is a totally independent issue from the question of whether the errors are additive or multiplicative (or enter in some even more complicated way). Therefore, the algebraic form of f is totally independent from the statistical justification for log-transformed least-squares, contrary to folklore and popular practice (see the sketch after this list).

2. Additive noise of some kind almost always exists in real measurements, even if there is *also* multiplicative noise. If you put something through an electronic circuit, you'll end up with at least some additive Johnson noise. Additionally, there is very commonly an uncertain additive background of some kind. So even if multiplicative noise is the dominant effect, more realistic models look like

(3) y_i = f(x_i; a)exp(σ ϵ_i) + b + ω δ_i

where b is an uncertain additive background, ω is the standard deviation of the additive noise, and δ_i represents independent samples from a standard normal distribution, just like ϵ_i.

If you ignore the additive noise, and try to account for the additive background by subtracting off an uncertain estimate of it, and then perform log-transformed least-squares, you end up with big problems if you have any data where y_i is small compared to the uncertainty in b (the background), or compared to ω (the size of the additive noise). Poorly accounted-for additive effects might make some of your data negative (maybe only after background subtraction), which makes the log-transformed procedure totally blow up. You sometimes see people try to fix this up by clamping the data to be above some very small positive value. Even when nothing ends up negative, very small y_i often end up with very large relative error. Because log-transformed least-squares essentially assumes constant relative error, your whole fit may end up being dominated by very small data values and their anomalously large relative error.
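
To make point 1 concrete, here is a sketch (simulated data; Optim.jl is just one convenient way to do the nonlinear fit): the model function is a power law, but because the noise is additive, the statistically justified fit is nonlinear least squares on the original scale, not linear least squares on the logs.

using Optim
srand(42)
x = collect(1.0:0.5:20.0)
y = 2.5 * x .^ 1.7 + 0.5 * randn(length(x))  # power-law f with *additive* noise
# direct nonlinear least squares, matching the additive-error model (1):
sse(a) = sum(abs2, y - a[1] * x .^ a[2])
res = optimize(sse, [1.0, 1.0])              # Nelder-Mead; recovers ~ [2.5, 1.7]
# versus the log-transformed linear fit, which silently assumes model (2):
B = hcat(ones(length(x)), log(x)) \ log(y)   # B = [log(a), exponent]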

Doing standard least-squares when the error is actually multiplicative is often less bad than doing log-transformed least squares when the error is actually additive, because it's often preferable to have your fit dominated by large values and their (possibly) anomalously large absolute error than it is to have your fit dominated by small values and their (possibly) anomalously large relative error. You usually want to err on the side of accurately modeling the part of your data that is not very close to zero.

But it's also possible to turn the maximum-likelihood crank on the full model (3), and with the help of software, I don't think this actually has to be so much more onerous than any other regression procedure.
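
For instance, here is a rough sketch of turning that crank (entirely my own construction: simulated data, a Monte Carlo approximation of the marginal likelihood, and Nelder-Mead via Optim.jl; a careful analysis would want something better):

using Distributions, Optim

srand(7)
x = collect(linspace(1.0, 10.0, 50))
# simulate model (3): y_i = f(x_i; a) exp(σ ϵ_i) + b + ω δ_i
a0 = [2.0, 1.5]; σ0 = 0.1; b0 = 3.0; ω0 = 0.5
f(xi, a) = a[1] * xi ^ a[2]
y = [f(xi, a0) * exp(σ0 * randn()) + b0 + ω0 * randn() for xi in x]

# marginal density of each y_i: integrate out the multiplicative noise ϵ by
# Monte Carlo over one fixed set of draws (common random numbers)
const EPS = randn(300)
function negloglik(θ)
    a = θ[1:2]; σ = exp(θ[3]); b = θ[4]; ω = exp(θ[5])  # exp keeps scales positive
    nll = 0.0
    for i in 1:length(x)
        means = f(x[i], a) * exp(σ * EPS) + b           # conditional means per draw
        nll -= log(mean([pdf(Normal(m, ω), y[i]) for m in means]))
    end
    return nll
end

res = optimize(negloglik, [1.0, 1.0, log(0.2), 0.0, log(1.0)])  # Nelder-Mead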

I'm not sure this sketch is enough to convince anyone who doesn't already know about all of this, and I also think that various aspects of it don't apply directly to the economics scenario. But I wanted to mention it since log-transformed least-squares does run into big problems even in the variable-response scenario, albeit somewhat different problems from the ones covered by Shalizi and Newman for the distribution-fitting scenario.

Stefan Karpinski

Mar 4, 2016, 2:05:17 PM
to julia-stats
Thanks for writing that up. I really enjoyed reading it. Would actually make a rather nice blog post.


Ken B

Apr 29, 2016, 9:58:35 AM
to julia-stats
For fitting power-law distributions, I've just stumbled across PowerLaws.jl.
I believe these are the algorithms mentioned by Cedric St-Jean earlier (i.e. from Clauset et al. 2009).