Transform the data y=log(x) then estimate the mean and variance of y.
With the appropriate transformations (which you will have to look up
depending on the convention of the log-normal calculations that you
are using), these are reasonable estimates of the log-normal
distribution for your data.
Or you could just stay in the transformed space.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
Theory wise:
-Do a linear regression on your data.
-Apply a logrithmic transform to your data's dependent variable, and do another linear regression.
-Apply a logrithmic transform to your data's independent variable, and do another linear regression.
-Take the best regression (highest r^2 value) and execute a back transform.
Then, to get your desired extrapolation, simply substitute in the size for the independent variable to get the expected value.
If, however, you're looking for how to implement this in NumPy or SciPy, I can't really help :-P
Ian
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
I don't think I understand the connection to the log-normal distribution.
You seem to have a non-linear relationship
conc = f(size) where you want to find a non-linear relationship f
If conc where just lognormal distributed, then you would not get any
relationship between conc and size.
If you have many observations with conc, size pairs then you could
estimate a noisy model
conc = f(size) + u where the noise u is for example log-normal
distributed but you would still need to get an expression for the
non-linear function f.
Extending a non-linear function outside of the observed range
is essentially always just a guess or by assumption.
If you want to fit a curve f that has the same shape as the pdf of
the log-normal, then you cannot do it with lognorm.fit, because that
just assumes you have a random sample independent of size.
So, it's not clear to me what you really want, or what your sample data
looks like (do you have only one 15 element sample or lots of them).
Josef
>
> Thanks
>
>
> --
> Gökhan
> So, it's not clear to me what you really want, or what your sample data
> looks like (do you have only one 15 element sample or lots of them).
I'm guessing that they aren't really samples of (conc, size) pairs so
much as binned data. Particles with sizes between 0.1 and 0.3 um (for
example; I don't know where the bin edges actually are in his data)
have a concentration of 119.7681 particles/<some unit of volume>. This
can be normalized to a more proper histogrammed distribution, except
that the lower end of the distribution below 0.1 um has been censored
by his measuring process. He then wants to infer the continuous
distribution that generated that censored histogram so he can predict
what the distribution is in the censored region.
So, I would say that it's a bit trickier than fitting the log-normal
PDF to the data for a couple of reasons.
1) Directly fitting PDFs to histogram values is usually not a great
idea to begin with.
2) We don't know how much probability mass is in the censored region.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
I don't think I understand the connection to the log-normal distribution.
You seem to have a non-linear relationship
conc = f(size) where you want to find a non-linear relationship f
If conc where just lognormal distributed, then you would not get any
relationship between conc and size.
If you have many observations with conc, size pairs then you could
estimate a noisy model
conc = f(size) + u where the noise u is for example log-normal
distributed but you would still need to get an expression for the
non-linear function f.
Extending a non-linear function outside of the observed range
is essentially always just a guess or by assumption.
If you want to fit a curve f that has the same shape as the pdf of
the log-normal, then you cannot do it with lognorm.fit, because that
just assumes you have a random sample independent of size.
So, it's not clear to me what you really want, or what your sample data
looks like (do you have only one 15 element sample or lots of them).
Josef
>
> Thanks
>
>
> --
> Gökhan
>
> _______________________________________________
> SciPy-User mailing list
> SciPy...@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
On Tue, Nov 17, 2009 at 12:38, <josef...@gmail.com> wrote:
> So, it's not clear to me what you really want, or what your sample data
> looks like (do you have only one 15 element sample or lots of them).
I'm guessing that they aren't really samples of (conc, size) pairs so
much as binned data.
Particles with sizes between 0.1 and 0.3 um (for
example; I don't know where the bin edges actually are in his data)
have a concentration of 119.7681 particles/<some unit of volume>.
This can be normalized to a more proper histogrammed distribution, except
that the lower end of the distribution below 0.1 um has been censored
by his measuring process. He then wants to infer the continuous
distribution that generated that censored histogram so he can predict
what the distribution is in the censored region.
So, I would say that it's a bit trickier than fitting the log-normal
PDF to the data for a couple of reasons.
1) Directly fitting PDFs to histogram values is usually not a great
idea to begin with.
2) We don't know how much probability mass is in the censored region.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
>> If conc where just lognormal distributed, then you would not get any
>> relationship between conc and size.
>>
>> If you have many observations with conc, size pairs then you could
>> estimate a noisy model
>> conc = f(size) + u where the noise u is for example log-normal
>> distributed but you would still need to get an expression for the
>> non-linear function f.
>
> I don't understand why I can't get a relation between sizes and conc values
> if conc is log-normally distributed. Can I elaborate this a bit more? The
> non-linear relationship part is also confusing me. If say to test the linear
> relationship of x and y data pairs we just fit a line, in this case what I
> am looking is to fit a log-normal to get a relation between size and conc.
It's a language issue. Your concentration values are not log-normally
distributed. Your particle sizes are log-normally distributed (maybe).
The concentration of a range of particle sizes is a measurement that
is related to particle size the distribution, but you would not say
that the measurements themselves are log-normally distributed. Josef
was taking your language at face value.
>> If you want to fit a curve f that has the same shape as the pdf of
>> the log-normal, then you cannot do it with lognorm.fit, because that
>> just assumes you have a random sample independent of size.
>
> Could you give an example on this?
x = stats.norm.rvs()
stats.norm.fit(x)
>> So, it's not clear to me what you really want, or what your sample data
>> looks like (do you have only one 15 element sample or lots of them).
>
> I have many sample points (thousands) that are composed of this 15 elements.
> But the whole data don't look much different the sample I used. Most peaks
> are around 3rd - 4th channel and decaying as shown in the figure.
Do you need to fit a different distribution for each 15-vector? Or are
all of these measurements supposed to be merged somehow?
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
>> So, I would say that it's a bit trickier than fitting the log-normal
>> PDF to the data for a couple of reasons.
>>
>> 1) Directly fitting PDFs to histogram values is usually not a great
>> idea to begin with.
>> 2) We don't know how much probability mass is in the censored region.
>
> So we agree that it is easy to implement a log-normal fit than a discrete
> one?
No, none of the things we have suggested will work well for you. You
have a more complicated task ahead of you. I have ideas that might
work, but explaining them will take more time than I have.
The way I see it, you have to variables, size and counts (or concentration).
My initial interpretation was you want to model the relationship between
these two variables.
When the total number of particles is fixed, then the conditional size
distribution is univariate, and could be modeled by a log-normal
distribution. (This still leaves the total count unmodelled.)
If you have the total particle count per bin, then it
should be possible to write down the likelihood function that is
discretized to the bins from the continuous distribution.
Given a random particle, what's the probability of being in bin 1,
bin 2 and so on. Then add the log-likelihood over all particles
and maximize as a function of the log-normal parameters.
(There might be a numerical trick using fraction instead of
conditional count, but I'm not sure what the analogous discrete
distribution would be. )
Once the parameters of the log-normal distribution are
estimated, the distribution would be defined over all of
the real line (where the out of sample pdf is determined
by assumption not data).
Josef
> The way I see it, you have to variables, size and counts (or concentration).
> My initial interpretation was you want to model the relationship between
> these two variables.
> When the total number of particles is fixed, then the conditional size
> distribution is univariate, and could be modeled by a log-normal
> distribution. (This still leaves the total count unmodelled.)
>
> If you have the total particle count per bin, then it
> should be possible to write down the likelihood function that is
> discretized to the bins from the continuous distribution.
> Given a random particle, what's the probability of being in bin 1,
> bin 2 and so on. Then add the log-likelihood over all particles
> and maximize as a function of the log-normal parameters.
> (There might be a numerical trick using fraction instead of
> conditional count, but I'm not sure what the analogous discrete
> distribution would be. )
I usually use the multinomial as the likelihood for such
"histogram-fitting" exercises. The two problem points here are that we
have real-valued concentrations, not integer-valued counts, and that
we don't have a measurement for the censored region. For the former, I
would suggest simply multiplying by the concentrations by a factor of
10 (equivalently, changing the units to particles/<10^n larger
volume>) such that the resolution of the measurements is 1
particle/<volume>. Then just apply the multinomial. It should be a
close enough approximation.
I'm not entirely sure what to do about the censored probability mass.
I think there might be a simple correction factor that you can apply
to the multinomial likelihood, but I haven't worked it out.
> Once the parameters of the log-normal distribution are
> estimated, the distribution would be defined over all of
> the real line (where the out of sample pdf is determined
> by assumption not data).
Since we are extrapolating to the censored region, it would probably
be a good idea to estimate the uncertainty of the estimate. I would
probably suggest using PyMC to do a Bayesian model. A parametric
bootstrap might also serve.
I think, for the continuous distribution it would be just dividing by
the probability of the not-censored region (which is also a function of
the distribution parameters). This would then just be a truncated
log-normal. multinomial might work the same, as long as the
probabilities are defined by the discretization.
Would you apply the multinomial directly? I don't see in that case
how you would recover the parameters of the continuous distribution.
Josef
>
>> Once the parameters of the log-normal distribution are
>> estimated, the distribution would be defined over all of
>> the real line (where the out of sample pdf is determined
>> by assumption not data).
>
> Since we are extrapolating to the censored region, it would probably
> be a good idea to estimate the uncertainty of the estimate. I would
> probably suggest using PyMC to do a Bayesian model. A parametric
> bootstrap might also serve.
I would use bootstrap, since I still haven't figured out how to use MCMC.
Josef
If this is the case, I suspect you might be able to get away with just doing a least-squares fit of a log-normal model function to your measured values, potentially with weights which reflect the estimated error in each bin (obtained either by taking the std. deviation of repeated measurements, or by analysing the noise characteristics of the measurement instrument). Although it's not strictly optimal, and you ought to be aware of the potential hiccups, it's often good enough for the task at hand (I use it routinely to fit 2D Gaussians to point like objects in image data).
cheers,
David
You would just be using the multinomial to build the likelihood. For
each iteration in the likelihood maximization, you are given the
parameters of the continuous distribution. Given the bin edges and
those parameters, you compute the probability mass within each bin for
that specific distribution (the difference of the CDF between bin
edges). That is the p-vector for the multinomial. The probability of
getting the observed counts is the likelihood for the given parameters
of the continuous distribution.
And now that I think about it, you don't need to apply any correction
to the multinomial in the likelihood. The number of counts in the
censored region is just another unknown parameter to optimize along
with the continuous distribution's parameters.
Lorenzo
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
On Tue, Nov 17, 2009 at 13:36, Gökhan Sever <gokha...@gmail.com> wrote:
> On Tue, Nov 17, 2009 at 12:57 PM, Robert Kern <rober...@gmail.com> wrote:
>> So, I would say that it's a bit trickier than fitting the log-normalNo, none of the things we have suggested will work well for you. You
>> PDF to the data for a couple of reasons.
>>
>> 1) Directly fitting PDFs to histogram values is usually not a great
>> idea to begin with.
>> 2) We don't know how much probability mass is in the censored region.
>
> So we agree that it is easy to implement a log-normal fit than a discrete
> one?
have a more complicated task ahead of you. I have ideas that might
work, but explaining them will take more time than I have.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
> Besides, what is wrong with using the spline interpolation technique? It
> fits nicely on my sample data. See the resulting image here:
> http://img197.imageshack.us/img197/9638/sizeconcsplinefit.png (Green line
> represents the fit spline)
What spline interpolation technique? That certainly doesn't look like
a good spline fit. In any case, splines may be fine for
*interpolation*, but you need *extrapolation*, and splines are useless
for that. You need a physically-motivated model like the distributions
recommended by your textbook.
Lorenzo
On Tue, Nov 17, 2009 at 16:21, Gökhan Sever <gokha...@gmail.com> wrote:What spline interpolation technique?
> Besides, what is wrong with using the spline interpolation technique? It
> fits nicely on my sample data. See the resulting image here:
> http://img197.imageshack.us/img197/9638/sizeconcsplinefit.png (Green line
> represents the fit spline)
That certainly doesn't look like
a good spline fit.
In any case, splines may be fine for
*interpolation*, but you need *extrapolation*, and splines are useless
for that.
You need a physically-motivated model like the distributions
recommended by your textbook.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
Don't judge it based on its smoothness at many points. The smooth
appearance is simply a function of the number of points you choose to
sample it with, not how well it fits the data.
Even if you weren't dealing with an extrapolation problem, you
shouldn't use spline interpolation* on noisy data. You would do
something like least-squares fitting to a low-order spline. The spline
should not go through the observed data points exactly.
* And this brings up another terminological issue. I may have used the
term "interpolation" in a couple of different ways. There is a general
sense in which "interpolate" means "to make predictions about certain
inputs (e.g. the concentration [prediction] for the given particle
size [input]) within the range of observed inputs". Whereas,
"interpolate" can also mean something much more specific: finding a
curve that exactly goes through the given observations. "Spline
interpolation" would be a form of the latter, and is not related to
what you need.
>> In any case, splines may be fine for
>> *interpolation*, but you need *extrapolation*, and splines are useless
>> for that.
>>
>> You need a physically-motivated model like the distributions
>> recommended by your textbook.
>
> Using spline-interp is a test case to see how good it will do on my data.
Good. I just wanted to make sure that you knew what was wrong with
using splines in this case. :-)
> I
> will use log-normal way as was in the original intention. Let me check with
> someone else in the department to get some feedback on this before I
> completely get lost in the matter.
Always wise. :-)
> One quick question: "extrapolation" means to estimate a data both "beyond"
> and "below" the given limits, right? (For my example to guess less than
> 0.1um should I say downward-extrapolation and above 3.0 um
> upward-extrapolation or just extrapolation is enough?)
Just "extrapolation" can describe either case, yes.
Just "extrapolation" can describe either case, yes.
> One quick question: "extrapolation" means to estimate a data both "beyond"
> and "below" the given limits, right? (For my example to guess less than
> 0.1um should I say downward-extrapolation and above 3.0 um
> upward-extrapolation or just extrapolation is enough?)
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
> I asked this in one of my early replies just repeating what is the way to
> get log-normal sample using scipy.stats? I will use it for a demonstrative
> case.
> For some reason, this never looks an expected log-normal sample to me:
>
> stats.lognorm.rvs(1,size=15)
>
> What am I missing here?
Are you expecting that to look like your 15-vector concentration data?
That's not what you should expect. Instead,
x = stats.lognorm.rvs(1, size=10000)
h = np.histogram(x, bins=15)
Now, the *histogram* of the samples should look like roughly like the
shapes that you are expecting. .rvs() produces the samples themselves.
I.e. pretend like each element is the size of an individual particle,
not the concentration of a size class of particles. Taking the
histogram "simulates" what your instrument does: it finds the amont of
particles in each size class.
> Thanks for your time and explanations Robert. I really appreciate your help.
> Probably I will include you in the acknowledgements part of my presentation.
Entirely unnecessary, of course.
On Tue, Nov 17, 2009 at 17:52, Gökhan Sever <gokha...@gmail.com> wrote:Are you expecting that to look like your 15-vector concentration data?
> I asked this in one of my early replies just repeating what is the way to
> get log-normal sample using scipy.stats? I will use it for a demonstrative
> case.
> For some reason, this never looks an expected log-normal sample to me:
>
> stats.lognorm.rvs(1,size=15)
>
> What am I missing here?
That's not what you should expect. Instead,
x = stats.lognorm.rvs(1, size=10000)
h = np.histogram(x, bins=15)
Now, the *histogram* of the samples should look like roughly like the
shapes that you are expecting. .rvs() produces the samples themselves.
I.e. pretend like each element is the size of an individual particle,
not the concentration of a size class of particles. Taking the
histogram "simulates" what your instrument does: it finds the amont of
particles in each size class.
Entirely unnecessary, of course.
> Thanks for your time and explanations Robert. I really appreciate your help.
> Probably I will include you in the acknowledgements part of my presentation.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user