[R] Fitting a curve to weibull distribution in R using nls

289 views
Skip to first unread message

Aditya Bhatia

unread,
Oct 13, 2015, 11:02:40 PM10/13/15
to r-h...@r-project.org
I am trying to fit this data to a weibull distribution:

My y variable is:1 1 1 4 7 20 7 14 19 15 18 3 4 1 3 1 1 1
1 1 1 1 1 1

and x variable is:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24

The plot looks like this:http://i.stack.imgur.com/FrIKo.png and I want
to fit a weibull curve to it. I am using the nls function in R like
this: nls(y ~ ((a/b) * ((x/b)^(a-1)) * exp(- (x/b)^a)))

This function always throws up an error saying: Error in
numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning message:
In nls(y ~ ((a/b) * ((x/b)^(a - 1)) * exp(-(x/b)^a))) :
No starting values specified for some parameters.
Initializing ‘a’, ‘b’ to '1.'.
Consider specifying 'start' or using a selfStart model

So first I tried different starting values without any success. I
cannot understand how to make a "good" guess at the starting values.
Then I went with the SSweibull(x, Asym, Drop, lrc, pwr) function which
is a selfStart function. Now the SSWeibull function expects values for
Asym,Drop,lrc and pwr and I don't have any clue as to what those
values might be.

I would appreciate if someone could help me figure out how to proceed.

Background of the data: I have taken some data from bugzilla and my
"y" variable is number of bugs reported in a particular month and "x"
variable is the month number after release.

______________________________________________
R-h...@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius

unread,
Oct 14, 2015, 12:15:12 AM10/14/15
to Aditya Bhatia, r-h...@r-project.org

On Oct 13, 2015, at 2:42 PM, Aditya Bhatia wrote:

> I am trying to fit this data to a weibull distribution:
>
> My y variable is:1 1 1 4 7 20 7 14 19 15 18 3 4 1 3 1 1 1
> 1 1 1 1 1 1
>
> and x variable is:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> 19 20 21 22 23 24
>
> The plot looks like this:http://i.stack.imgur.com/FrIKo.png and I want
> to fit a weibull curve to it. I am using the nls function in R like
> this: nls(y ~ ((a/b) * ((x/b)^(a-1)) * exp(- (x/b)^a)))

So despite the fact that the Weibull function has a continuous domain, you want to fit a set of integers to something like the Weibull with these values as case weights with the "x-values" being the position of these integers in a sequence?


--
David.
>
> This function always throws up an error saying: Error in
> numericDeriv(form[[3L]], names(ind), env) :
> Missing value or an infinity produced when evaluating the model
> In addition: Warning message:
> In nls(y ~ ((a/b) * ((x/b)^(a - 1)) * exp(-(x/b)^a))) :
> No starting values specified for some parameters.
> Initializing ‘a’, ‘b’ to '1.'.
> Consider specifying 'start' or using a selfStart model
>
> So first I tried different starting values without any success. I
> cannot understand how to make a "good" guess at the starting values.
> Then I went with the SSweibull(x, Asym, Drop, lrc, pwr) function which
> is a selfStart function. Now the SSWeibull function expects values for
> Asym,Drop,lrc and pwr and I don't have any clue as to what those
> values might be.
>
> I would appreciate if someone could help me figure out how to proceed.
>
> Background of the data: I have taken some data from bugzilla and my
> "y" variable is number of bugs reported in a particular month and "x"
> variable is the month number after release.
>
> ______________________________________________
> R-h...@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

Aditya Bhatia

unread,
Oct 14, 2015, 12:30:10 AM10/14/15
to David Winsemius, r-h...@r-project.org
Yes. I do.I'm trying to repeat the methodology of a paper. They have fitted
their data to a weibull curve and so I want to do the same too, but I'm
unable to figure out how..

On Wed, Oct 14, 2015, 9:44 AM David Winsemius <dwins...@comcast.net>
wrote:

[[alternative HTML version deleted]]

peter dalgaard

unread,
Oct 14, 2015, 7:39:25 AM10/14/15
to Aditya Bhatia, r-h...@r-project.org
There's a number of issues with this:

(a) your data appear to be binned counts, not measurements along a curve.
(b) The function you are trying to fit is the Weibull _density_ This has integral 1, by definition, whereas any curve anywhere near your y's would have integral near sum(y)=127
(c) SSweibull is for growth curves which are proportional to the cumulative Weibull distribution.
(d) SelfStart functions do *not* need starting values, that is the whole point

So you need to study things a bit more...

The expedient way would be this:

> MASS::fitdistr(rep(x,y), "Weibull")
shape scale
2.4207659 10.5078293
( 0.1530137) ( 0.4079979)
Warning message:
In densfun(x, parm[1], parm[2], ...) : NaNs produced

> plot(y~x, ylim=c(0,20), xlim=c(0,24))
> curve(127*dweibull(x,2.42,10.5), add=TRUE)

It doesn't actually fit very well, but there are quite a few observations out in what was supposed to be the tail of the distribution.


If you want to play with SSweibull, you might do something like

> yy <- cumsum(y)
> nls(yy~SSweibull(x, Asym, Drop, lrc, pwr))
Nonlinear regression model
model: yy ~ SSweibull(x, Asym, Drop, lrc, pwr)
data: parent.frame()
Asym Drop lrc pwr
122.417 122.471 -6.944 3.129
residual sum-of-squares: 187

This gives a nonlinear least squares fit to the cumulative distribution (I am _not_ advocating this, but you said that you were trying to figure out what others had been up to...). If you plot it, you get

> plot(yy~x)
> curve(SSweibull(x, 122.42, 122.47, -6.94, 3.13), add=TRUE)

which _looks_ nicer, but beware! Everything looks nicer when cumulated and the fit strongly underemphasizes that the data curve is clearly growing past x=15.

Notice also that there is a parametrization difference. SSweibull has Asym and Drop which are F(inf) and F(inf)-F(0) respectively; in this case one could fix both at 127. pwr is equal to a in the Weibull density, whereas lrc (log rate constant) comes from writing exp(-(x/b)^a) as exp(-exp(lrc)*x^a), so
b = exp(-lrc)^(1/a) -- i.e. exp(6.94)^(1/3.13) = 9.18 which is in the same range as the estimate from fitdistr().

You could also fit the weibull density directly using least squares

> nls(y~127*dweibull(x,shape,scale), start=c(shape=3,scale=10))
Nonlinear regression model
model: y ~ 127 * dweibull(x, shape, scale)
data: parent.frame()
shape scale
3.419 9.574
residual sum-of-squares: 230.6

Number of iterations to convergence: 6
Achieved convergence tolerance: 6.037e-06

> plot(y~x, ylim=c(0,20), xlim=c(0,24))
> curve(127*dweibull(x,2.42,10.5), add=TRUE)
> curve(127*dweibull(x,3.419,9.574), add=TRUE)

This fits the peak quite a bit better than the fitdistr() version, but notice again that there are also more observations in regions where there shouldn't really be any according to the fitted curve. This is a generic difference between maximum likelihood and the curve fitting approaches.

-pd
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk Priv: PDa...@gmail.com

Therneau, Terry M., Ph.D.

unread,
Oct 14, 2015, 8:52:23 AM10/14/15
to r-h...@r-project.org, Aditya Bhatia


On 10/14/2015 05:00 AM, r-help-...@r-project.org wrote:
> I am trying to fit this data to a weibull distribution:
>
> My y variable is:1 1 1 4 7 20 7 14 19 15 18 3 4 1 3 1 1 1
> 1 1 1 1 1 1
>
> and x variable is:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> 19 20 21 22 23 24
>

One could always use existing R functions that fit a Weibull regression, instead of
reinventing the wheel.

library(survival)
y <- scan()
1 1 1 4 7 20 7 14 19 15 18 3
4 1 3 1 1 1
1 1 1 1 1 1

wfit <- survreg(Surv(1:24) ~ 1, weights=y, dist="weibull")
wfit

Call:
survreg(formula = Surv(1:24) ~ 1, weights = y, dist = "weibull")

Coefficients:
(Intercept)
2.352121

Scale= 0.4130924

Loglik(model)= -351.4 Loglik(intercept only)= -351.4
n= 24


zz <- seq(0, 25, length=100)
plot(zz, dsurvreg(zz, 2.352121, 0.4130924), col=2, type='l', ylim=c(0, .15),
xlab="Value", ylab="Density")
points(1:24, y/sum(y))

-----
There are a half dozen ways to parameterize a Weibull distribution; the location-scale
form used by survreg is one of the less common. See help(survreg) for more information --
look at the example near the bottom of the page.

Terry Therneau

Aditya Bhatia

unread,
Oct 14, 2015, 8:26:44 PM10/14/15
to peter dalgaard, r-h...@r-project.org
Thank you for the amazing response. You are right;I definitely have to
study a bit more. I am just trying to copy the procedure in a paper so
I didn't give it much thought.

for point (a) : yes the data is binned counts; and my aim is to find
out which curve best approximates these counts.

I am going to try and see if I can fit something like :
nls(y~d*dweibull(x,shape,scale), start=c(shape=3,scale=10,d=127))
instead of just putting it to be the sum of the observations.
Hopefully I will get a better result. And the tail of distribution is
not very important to me so I don't mind that the result may not fit
the tail of the curve correctly.

Thanks again for your help. I can make some progress now.

-Aditya Bhatia

Reply all
Reply to author
Forward
0 new messages