[SciPy-User] Fitting a curve on a log-normal distributed data

31 views
Skip to first unread message

Gökhan Sever

unread,
Nov 17, 2009, 12:44:17 AM11/17/09
to Discussion of Numerical Python, SciPy Users List
Hello,

I have a data which represents aerosol size distribution in between 0.1 to 3.0 micrometer ranges. I would like extrapolate the lower size down to 10 nm. The data in this context is log-normally distributed. Therefore I am looking a way to fit a log-normal curve onto my data. Could you please give me some pointers to solve this problem?

Thank you.

--
Gökhan

Robert Kern

unread,
Nov 17, 2009, 12:51:19 AM11/17/09
to SciPy Users List

Transform the data y=log(x) then estimate the mean and variance of y.
With the appropriate transformations (which you will have to look up
depending on the convention of the log-normal calculations that you
are using), these are reasonable estimates of the log-normal
distribution for your data.

Or you could just stay in the transformed space.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Gökhan Sever

unread,
Nov 17, 2009, 12:29:10 PM11/17/09
to Discussion of Numerical Python, SciPy Users List


On Tue, Nov 17, 2009 at 12:13 AM, Ian Mallett <geome...@gmail.com> wrote:
Theory wise:
-Do a linear regression on your data.
-Apply a logrithmic transform to your data's dependent variable, and do another linear regression. 
-Apply a logrithmic transform to your data's independent variable, and do another linear regression.
-Take the best regression (highest r^2 value) and execute a back transform.

Then, to get your desired extrapolation, simply substitute in the size for the independent variable to get the expected value.

If, however, you're looking for how to implement this in NumPy or SciPy, I can't really help :-P
Ian

_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


OK, before applying your suggestions. I have a few more questions. Here is 1 real-sample data that I will use as a part of the log-normal fitting. There is 15 elements in this arrays each being a concentration for corresponding 0.1 - 3.0 um size ranges.

I[74]: conc
O[74]:
array([ 119.7681,  118.546 ,  146.6548,   96.5478,  109.9911,   32.9974,
         20.7762,    6.1107,   12.2212,    3.6664,    3.6664,    1.2221,
          2.4443,    2.4443,    3.6664])

For now not calibrated size range I just assume a linear array:

I[78]: sizes = linspace(0.1, 3.0, 15)

I[79]: sizes
O[79]:
array([ 0.1       ,  0.30714286,  0.51428571,  0.72142857,  0.92857143,
        1.13571429,  1.34285714,  1.55      ,  1.75714286,  1.96428571,
        2.17142857,  2.37857143,  2.58571429,  2.79285714,  3.        ])


Not a very ideal looking log-normal, but so far I don't know what else besides a log-normal fit would give me a better estimate:
I[80]: figure(); plot(sizes, conc)
http://img406.imageshack.us/img406/156/sizeconc.png

scipy.stats has the lognorm.fit

    lognorm.fit(data,s,loc=0,scale=1)
        - Parameter estimates for lognorm data

and applying this to my data. However not sure the right way of calling it, and not sure if this could be applied to my case?

I[81]: stats.lognorm.fit(conc)
O[81]: array([ 2.31386066,  1.19126064,  9.5748391 ])

Lastly, what is the way to create a ideal log-normal sample using the stats.lognorm.rvs?

Thanks


--
Gökhan

josef...@gmail.com

unread,
Nov 17, 2009, 1:38:01 PM11/17/09
to SciPy Users List

I don't think I understand the connection to the log-normal distribution.
You seem to have a non-linear relationship
conc = f(size) where you want to find a non-linear relationship f

If conc where just lognormal distributed, then you would not get any
relationship between conc and size.

If you have many observations with conc, size pairs then you could
estimate a noisy model
conc = f(size) + u where the noise u is for example log-normal
distributed but you would still need to get an expression for the
non-linear function f.
Extending a non-linear function outside of the observed range
is essentially always just a guess or by assumption.

If you want to fit a curve f that has the same shape as the pdf of
the log-normal, then you cannot do it with lognorm.fit, because that
just assumes you have a random sample independent of size.

So, it's not clear to me what you really want, or what your sample data
looks like (do you have only one 15 element sample or lots of them).

Josef


>
> Thanks
>
>
> --
> Gökhan

Robert Kern

unread,
Nov 17, 2009, 1:57:46 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 12:38, <josef...@gmail.com> wrote:

> So, it's not clear to me what you really want, or what your sample data
> looks like (do you have only one 15 element sample or lots of them).

I'm guessing that they aren't really samples of (conc, size) pairs so
much as binned data. Particles with sizes between 0.1 and 0.3 um (for
example; I don't know where the bin edges actually are in his data)
have a concentration of 119.7681 particles/<some unit of volume>. This
can be normalized to a more proper histogrammed distribution, except
that the lower end of the distribution below 0.1 um has been censored
by his measuring process. He then wants to infer the continuous
distribution that generated that censored histogram so he can predict
what the distribution is in the censored region.

So, I would say that it's a bit trickier than fitting the log-normal
PDF to the data for a couple of reasons.

1) Directly fitting PDFs to histogram values is usually not a great
idea to begin with.
2) We don't know how much probability mass is in the censored region.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco

Gökhan Sever

unread,
Nov 17, 2009, 2:28:45 PM11/17/09
to SciPy Users List

R. Kern has nicely summarized my intention. Let me add some more onto his description.
 
I don't think I understand the connection to the log-normal distribution.
You seem to have a non-linear relationship
conc = f(size)  where you want to find a non-linear relationship f

Here I am directly quoting from on of my cloud physics books:

"Once a discrete model size distribution has been laid out, the initial particle number,
volume, and mass concentrations must be distributed among model size bins. This
can be accomplished by fitting measurements to a continuous size distribution,
then discretizing the continuous distribution over the model bins. Three continuous
distributions available for this procedure are the lognormal, Marshall–Palmer, and
modified gamma distributions."

My data are discrete in its nature, since have only 15 channels in between (0.1 to 3.0 um ranges).
Say that (from the sample data that I used in my previous e-mail) the first channel is in between
0.10 to 0.31 um and I read the number concentration for this size-range as 119.77 #/cm^3 so on so forth.

Since I am interested to estimate the number concentrations below the 0.1 um (preferably down to 0.01 um or 10 nm)
I would like to fit a continuous distribution onto my dataset. Among the all three continuous distributions lognormal seems
to be the easiest to implement, and log-normal distribution is commonly used to represent aerosol size distribution in the
atmosphere. If there is a way to do this discretely I would like to know very much.
 

If conc where just lognormal distributed, then you would not get any
relationship between conc and size.

If you have many observations with conc, size pairs then you could
estimate a noisy model
conc = f(size) + u  where the noise u is for example log-normal
distributed but you would still need to get an expression for the
non-linear function f.

I don't understand why I can't get a relation between sizes and conc values if conc is log-normally distributed. Can I elaborate this a bit more? The non-linear relationship part is also confusing me. If say to test the linear relationship of x and y data pairs we just fit a line, in this case what I am looking is to fit a log-normal to get a relation between size and conc.

 
Extending a non-linear function outside of the observed range
is essentially always just a guess or by assumption.

Yes, I am aware of this. Just trying to put my guesses into a well-defined form. So when I am describing the analysis process I will be able tell to others that this extrapolation is a result of log-normal fitting.
 

If you want to fit a curve f that has the same shape as the pdf of
the log-normal, then you cannot do it with lognorm.fit, because that
just assumes you have a random sample independent of size.

Could you give an example on this?
 

So, it's not clear to me what you really want, or what your sample data
looks like (do you have only one 15 element sample or lots of them).

I have many sample points (thousands) that are composed of this 15 elements. But the whole data don't look much different the sample I used. Most peaks are around 3rd - 4th channel and decaying as shown in the figure.
 

Josef



 


>
> Thanks
>
>
> --
> Gökhan
>
> _______________________________________________
> SciPy-User mailing list
> SciPy...@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user



--
Gökhan

Gökhan Sever

unread,
Nov 17, 2009, 2:36:36 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 12:57 PM, Robert Kern <rober...@gmail.com> wrote:
On Tue, Nov 17, 2009 at 12:38,  <josef...@gmail.com> wrote:

> So, it's not clear to me what you really want, or what your sample data
> looks like (do you have only one 15 element sample or lots of them).

I'm guessing that they aren't really samples of (conc, size) pairs so
much as binned data.

Correct. These are discrete sample points.
 
Particles with sizes between 0.1 and 0.3 um (for
example; I don't know where the bin edges actually are in his data)
have a concentration of 119.7681 particles/<some unit of volume>.

True, in particles/cm^3 units
 
This can be normalized to a more proper histogrammed distribution, except
that the lower end of the distribution below 0.1 um has been censored
by his measuring process. He then wants to infer the continuous
distribution that generated that censored histogram so he can predict
what the distribution is in the censored region.

Exactly. Where later I am hoping to find a critical size point using another equation, and
integrating upwards to obtain total concentration from that point on and do a comparison
with another instrument.

The 0.1 um threshold comes from the instrument limit. It can't measure below this level
due to the constraint of the Mie scattering theory.

 

So, I would say that it's a bit trickier than fitting the log-normal
PDF to the data for a couple of reasons.

1) Directly fitting PDFs to histogram values is usually not a great
idea to begin with.
2) We don't know how much probability mass is in the censored region.


So we agree that it is easy to implement a log-normal fit than a discrete one?


 
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________



--
Gökhan

Robert Kern

unread,
Nov 17, 2009, 2:37:33 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 13:28, Gökhan Sever <gokha...@gmail.com> wrote:

>
>
> On Tue, Nov 17, 2009 at 12:38 PM, <josef...@gmail.com> wrote:

>> If conc where just lognormal distributed, then you would not get any
>> relationship between conc and size.
>>
>> If you have many observations with conc, size pairs then you could
>> estimate a noisy model
>> conc = f(size) + u  where the noise u is for example log-normal
>> distributed but you would still need to get an expression for the
>> non-linear function f.
>
> I don't understand why I can't get a relation between sizes and conc values
> if conc is log-normally distributed. Can I elaborate this a bit more? The
> non-linear relationship part is also confusing me. If say to test the linear
> relationship of x and y data pairs we just fit a line, in this case what I
> am looking is to fit a log-normal to get a relation between size and conc.

It's a language issue. Your concentration values are not log-normally
distributed. Your particle sizes are log-normally distributed (maybe).
The concentration of a range of particle sizes is a measurement that
is related to particle size the distribution, but you would not say
that the measurements themselves are log-normally distributed. Josef
was taking your language at face value.

>> If you want to fit a curve f that has the same shape as the pdf of
>> the log-normal, then you cannot do it with lognorm.fit, because that
>> just assumes you have a random sample independent of size.
>
> Could you give an example on this?

x = stats.norm.rvs()
stats.norm.fit(x)

>> So, it's not clear to me what you really want, or what your sample data
>> looks like (do you have only one 15 element sample or lots of them).
>
> I have many sample points (thousands) that are composed of this 15 elements.
> But the whole data don't look much different the sample I used. Most peaks
> are around 3rd - 4th channel and decaying as shown in the figure.

Do you need to fit a different distribution for each 15-vector? Or are
all of these measurements supposed to be merged somehow?

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco

Robert Kern

unread,
Nov 17, 2009, 2:40:33 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 13:36, Gökhan Sever <gokha...@gmail.com> wrote:
> On Tue, Nov 17, 2009 at 12:57 PM, Robert Kern <rober...@gmail.com> wrote:

>> So, I would say that it's a bit trickier than fitting the log-normal
>> PDF to the data for a couple of reasons.
>>
>> 1) Directly fitting PDFs to histogram values is usually not a great
>> idea to begin with.
>> 2) We don't know how much probability mass is in the censored region.
>
> So we agree that it is easy to implement a log-normal fit than a discrete
> one?

No, none of the things we have suggested will work well for you. You
have a more complicated task ahead of you. I have ideas that might
work, but explaining them will take more time than I have.

josef...@gmail.com

unread,
Nov 17, 2009, 3:04:20 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 2:37 PM, Robert Kern <rober...@gmail.com> wrote:
> On Tue, Nov 17, 2009 at 13:28, Gökhan Sever <gokha...@gmail.com> wrote:
>>
>>
>> On Tue, Nov 17, 2009 at 12:38 PM, <josef...@gmail.com> wrote:
>
>>> If conc where just lognormal distributed, then you would not get any
>>> relationship between conc and size.
>>>
>>> If you have many observations with conc, size pairs then you could
>>> estimate a noisy model
>>> conc = f(size) + u  where the noise u is for example log-normal
>>> distributed but you would still need to get an expression for the
>>> non-linear function f.
>>
>> I don't understand why I can't get a relation between sizes and conc values
>> if conc is log-normally distributed. Can I elaborate this a bit more? The
>> non-linear relationship part is also confusing me. If say to test the linear
>> relationship of x and y data pairs we just fit a line, in this case what I
>> am looking is to fit a log-normal to get a relation between size and conc.
>
> It's a language issue. Your concentration values are not log-normally
> distributed. Your particle sizes are log-normally distributed (maybe).
> The concentration of a range of particle sizes is a measurement that
> is related to particle size the distribution, but you would not say
> that the measurements themselves are log-normally distributed. Josef
> was taking your language at face value.

The way I see it, you have to variables, size and counts (or concentration).
My initial interpretation was you want to model the relationship between
these two variables.
When the total number of particles is fixed, then the conditional size
distribution is univariate, and could be modeled by a log-normal
distribution. (This still leaves the total count unmodelled.)

If you have the total particle count per bin, then it
should be possible to write down the likelihood function that is
discretized to the bins from the continuous distribution.
Given a random particle, what's the probability of being in bin 1,
bin 2 and so on. Then add the log-likelihood over all particles
and maximize as a function of the log-normal parameters.
(There might be a numerical trick using fraction instead of
conditional count, but I'm not sure what the analogous discrete
distribution would be. )
Once the parameters of the log-normal distribution are
estimated, the distribution would be defined over all of
the real line (where the out of sample pdf is determined
by assumption not data).

Josef

Robert Kern

unread,
Nov 17, 2009, 3:41:47 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 14:04, <josef...@gmail.com> wrote:

> The way I see it, you have to variables, size and counts (or concentration).
> My initial interpretation was you want to model the relationship between
> these two variables.
> When the total number of particles is fixed, then the conditional size
> distribution is univariate, and could be modeled by a log-normal
> distribution. (This still leaves the total count unmodelled.)
>
> If you have the total particle count per bin, then it
> should be possible to write down the likelihood function that is
> discretized to the bins from the continuous distribution.
> Given a random particle, what's the probability of being in bin 1,
> bin 2 and so on. Then add the log-likelihood over all particles
> and maximize as a function of the log-normal parameters.
> (There might be a numerical trick using fraction instead of
> conditional count, but I'm not sure what the analogous discrete
> distribution would be. )

I usually use the multinomial as the likelihood for such
"histogram-fitting" exercises. The two problem points here are that we
have real-valued concentrations, not integer-valued counts, and that
we don't have a measurement for the censored region. For the former, I
would suggest simply multiplying by the concentrations by a factor of
10 (equivalently, changing the units to particles/<10^n larger
volume>) such that the resolution of the measurements is 1
particle/<volume>. Then just apply the multinomial. It should be a
close enough approximation.

I'm not entirely sure what to do about the censored probability mass.
I think there might be a simple correction factor that you can apply
to the multinomial likelihood, but I haven't worked it out.

> Once the parameters of the log-normal distribution are
> estimated, the distribution would be defined over all of
> the real line (where the out of sample pdf is determined
> by assumption not data).

Since we are extrapolating to the censored region, it would probably
be a good idea to estimate the uncertainty of the estimate. I would
probably suggest using PyMC to do a Bayesian model. A parametric
bootstrap might also serve.

josef...@gmail.com

unread,
Nov 17, 2009, 4:01:56 PM11/17/09
to SciPy Users List

I think, for the continuous distribution it would be just dividing by
the probability of the not-censored region (which is also a function of
the distribution parameters). This would then just be a truncated
log-normal. multinomial might work the same, as long as the
probabilities are defined by the discretization.

Would you apply the multinomial directly? I don't see in that case
how you would recover the parameters of the continuous distribution.

Josef

>
>> Once the parameters of the log-normal distribution are
>> estimated, the distribution would be defined over all of
>> the real line (where the out of sample pdf is determined
>> by assumption not data).
>
> Since we are extrapolating to the censored region, it would probably
> be a good idea to estimate the uncertainty of the estimate. I would
> probably suggest using PyMC to do a Bayesian model. A parametric
> bootstrap might also serve.

I would use bootstrap, since I still haven't figured out how to use MCMC.

Josef

David Baddeley

unread,
Nov 17, 2009, 4:11:59 PM11/17/09
to SciPy Users List
I guess it depends on how accurately you want to estimate the missing bin, and whether you can get any information about the amount of error in the individual measurements. Just looking at the curve you posted it looks like the variability at low particle sizes is a lot higher than at larger particle sizes. Although you would expect a similar effect due to the Poisson nature of counting, I'd expect it to be smaller. This might suggest that there is additional structure in your size distribution at these sizes, and that the best you can hope for with a log-normal model is a fairly rough approximation.

If this is the case, I suspect you might be able to get away with just doing a least-squares fit of a log-normal model function to your measured values, potentially with weights which reflect the estimated error in each bin (obtained either by taking the std. deviation of repeated measurements, or by analysing the noise characteristics of the measurement instrument). Although it's not strictly optimal, and you ought to be aware of the potential hiccups, it's often good enough for the task at hand (I use it routinely to fit 2D Gaussians to point like objects in image data).

cheers,
David

Robert Kern

unread,
Nov 17, 2009, 4:12:17 PM11/17/09
to SciPy Users List

You would just be using the multinomial to build the likelihood. For
each iteration in the likelihood maximization, you are given the
parameters of the continuous distribution. Given the bin edges and
those parameters, you compute the probability mass within each bin for
that specific distribution (the difference of the CDF between bin
edges). That is the p-vector for the multinomial. The probability of
getting the observed counts is the likelihood for the given parameters
of the continuous distribution.

And now that I think about it, you don't need to apply any correction
to the multinomial in the likelihood. The number of counts in the
censored region is just another unknown parameter to optimize along
with the continuous distribution's parameters.

Lorenzo Isella

unread,
Nov 17, 2009, 5:03:14 PM11/17/09
to gokha...@gmail.com, scipy...@scipy.org

> Date: Mon, 16 Nov 2009 23:44:17 -0600
> From: G?khan Sever <gokha...@gmail.com>
> Subject: [SciPy-User] Fitting a curve on a log-normal distributed data
> To: Discussion of Numerical Python <numpy-di...@scipy.org>, SciPy
> Users List <scipy...@scipy.org>
> Message-ID:
> <49d6b3500911162144x119...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"

>
> Hello,
>
> I have a data which represents aerosol size distribution in between 0.1 to
> 3.0 micrometer ranges. I would like extrapolate the lower size down to 10
> nm. The data in this context is log-normally distributed. Therefore I am
> looking a way to fit a log-normal curve onto my data. Could you please give
> me some pointers to solve this problem?
>
> Thank you.
>
>
Hello,
I have not followed the many replies to this long post in detail, but by
chance I happen to know quite in detail what you are talking about
(probably SMPS data or similar).
I normally resort to R for this kind of tasks
(http://www.r-project.org/), but nothing prevents you from using Python
instead. You just want to compare your empirical data binning with what
would be expected from a lognormal distribution. Please have a look at
http://tinyurl.com/ygmw4lc
and at the functions defined there (A1, mu1 and myvar1 are the overall
concentration, the geometric mean and the std of the number-size
distribution, respectively).
Cheers

Lorenzo

Gökhan Sever

unread,
Nov 17, 2009, 5:07:17 PM11/17/09
to SciPy Users List

For my comparison case I will use an hour length of data, which are composed of 3600 sample points. At each minute I will average these points. This is because I am comparing data from two different instruments and by averaging I am trying to eliminate intrinsic measurement error. It is really not an easy task to make point by point comparison in my case. So in the end I will have 60 averaged data-points where each point composed of 15-elements in them. Later use the same fitting technique to guess the out-ouf-the-measurement-limits parts.

 

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________



--
Gökhan

Gökhan Sever

unread,
Nov 17, 2009, 5:21:27 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 1:40 PM, Robert Kern <rober...@gmail.com> wrote:
On Tue, Nov 17, 2009 at 13:36, Gökhan Sever <gokha...@gmail.com> wrote:
> On Tue, Nov 17, 2009 at 12:57 PM, Robert Kern <rober...@gmail.com> wrote:

>> So, I would say that it's a bit trickier than fitting the log-normal
>> PDF to the data for a couple of reasons.
>>
>> 1) Directly fitting PDFs to histogram values is usually not a great
>> idea to begin with.
>> 2) We don't know how much probability mass is in the censored region.
>
> So we agree that it is easy to implement a log-normal fit than a discrete
> one?

No, none of the things we have suggested will work well for you. You
have a more complicated task ahead of you. I have ideas that might
work, but explaining them will take more time than I have.

Looking at some recent replies and re-reading them a couple times, I should say the techniques mentioned in them are beyond my technical skills or at least I need a professor to help me or a good statistics book to study further. I should also note that this is just a feasibility study comparing actual observed cloud condensation nuclei concentration measurements to the modelled concentrations using another instrument's size distribution data with the help of a thermodynamic particle activation equation which I will be able to infer an activation size limit. The results that are found in this study will not be placed on a journal, they will just be presented in my cloud physics class presentation. I am trying to assess the sources of errors and testing the usability of the size distributions from that particular instrument in this comparison study. Extending the size distribution beyond and below the instruments measurement limit is one of the biggest source of errors to represent the reality, but of course there other simplifications and assumptions that add uncertainties.

Besides, what is wrong with using the spline interpolation technique? It fits nicely on my sample data. See the resulting image here: http://img197.imageshack.us/img197/9638/sizeconcsplinefit.png    (Green line represents the fit spline)



 

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user



--
Gökhan

Robert Kern

unread,
Nov 17, 2009, 5:27:06 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 16:21, Gökhan Sever <gokha...@gmail.com> wrote:

> Besides, what is wrong with using the spline interpolation technique? It
> fits nicely on my sample data. See the resulting image here:
> http://img197.imageshack.us/img197/9638/sizeconcsplinefit.png    (Green line
> represents the fit spline)

What spline interpolation technique? That certainly doesn't look like
a good spline fit. In any case, splines may be fine for
*interpolation*, but you need *extrapolation*, and splines are useless
for that. You need a physically-motivated model like the distributions
recommended by your textbook.

Gökhan Sever

unread,
Nov 17, 2009, 5:30:56 PM11/17/09
to Lorenzo Isella, scipy...@scipy.org

Hey Lorenzo,

Finally someone who knows the heart of the subject :) Thanks for stopping by.

The data that I am using is Passive Cavity Aerosol Spectrometer (PCASP) measured size-distributions. Unfortunately even if we had the mains part of the SMPS instrument we couldn't fly it since the radioactive element was not reached during campaign. It is always an issue to deliver the radioactive parts out of the US :)

Anyways assuming that the relative humidity was quite low in the measurement region I am not expecting a huge deviation from the dry-particle size definition. But as I said above this is just a feasibility study. I will test and see how much an error I will get with this method. Besides there is no information regarding to the chemical composition of the aerosols, therefore I am basing on kappa-kohler theory and making another simplification at that point.

Could you please send this script off-list and the data file associated with it?

Would be greatly appreciated.

 

Lorenzo



--
Gökhan

Gökhan Sever

unread,
Nov 17, 2009, 5:42:44 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 4:27 PM, Robert Kern <rober...@gmail.com> wrote:
On Tue, Nov 17, 2009 at 16:21, Gökhan Sever <gokha...@gmail.com> wrote:

> Besides, what is wrong with using the spline interpolation technique? It
> fits nicely on my sample data. See the resulting image here:
> http://img197.imageshack.us/img197/9638/sizeconcsplinefit.png    (Green line
> represents the fit spline)

What spline interpolation technique?

From here http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html

Spline interpolation in 1-d (interpolate.splXXX)
 

That certainly doesn't look like
a good spline fit.

True, because I used only 30 points. It looks much smoother with alot more point as you might expected.
 
In any case, splines may be fine for
*interpolation*, but you need *extrapolation*, and splines are useless
for that.
You need a physically-motivated model like the distributions
recommended by your textbook.


Using spline-interp is a test case to see how good it will do on my data. I will use log-normal way as was in the original intention. Let me check with someone else in the department to get some feedback on this before I completely get lost in the matter.

One quick question: "extrapolation" means to estimate a data both "beyond" and "below" the given limits, right? (For my example to guess less than 0.1um should I say downward-extrapolation and above 3.0 um upward-extrapolation or just extrapolation is enough?)

 
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user



--
Gökhan

Robert Kern

unread,
Nov 17, 2009, 5:58:43 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 16:42, Gökhan Sever <gokha...@gmail.com> wrote:
>
> On Tue, Nov 17, 2009 at 4:27 PM, Robert Kern <rober...@gmail.com> wrote:
>>
>> On Tue, Nov 17, 2009 at 16:21, Gökhan Sever <gokha...@gmail.com> wrote:
>>
>> > Besides, what is wrong with using the spline interpolation technique? It
>> > fits nicely on my sample data. See the resulting image here:
>> > http://img197.imageshack.us/img197/9638/sizeconcsplinefit.png    (Green
>> > line
>> > represents the fit spline)
>>
>> What spline interpolation technique?
>
> From here
> http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
>
> Spline interpolation in 1-d (interpolate.splXXX)
>
>> That certainly doesn't look like
>> a good spline fit.
>
> True, because I used only 30 points. It looks much smoother with alot more
> point as you might expected.

Don't judge it based on its smoothness at many points. The smooth
appearance is simply a function of the number of points you choose to
sample it with, not how well it fits the data.

Even if you weren't dealing with an extrapolation problem, you
shouldn't use spline interpolation* on noisy data. You would do
something like least-squares fitting to a low-order spline. The spline
should not go through the observed data points exactly.

* And this brings up another terminological issue. I may have used the
term "interpolation" in a couple of different ways. There is a general
sense in which "interpolate" means "to make predictions about certain
inputs (e.g. the concentration [prediction] for the given particle
size [input]) within the range of observed inputs". Whereas,
"interpolate" can also mean something much more specific: finding a
curve that exactly goes through the given observations. "Spline
interpolation" would be a form of the latter, and is not related to
what you need.

>> In any case, splines may be fine for
>> *interpolation*, but you need *extrapolation*, and splines are useless
>> for that.
>>
>> You need a physically-motivated model like the distributions
>> recommended by your textbook.
>
> Using spline-interp is a test case to see how good it will do on my data.

Good. I just wanted to make sure that you knew what was wrong with
using splines in this case. :-)

> I
> will use log-normal way as was in the original intention. Let me check with
> someone else in the department to get some feedback on this before I
> completely get lost in the matter.

Always wise. :-)

> One quick question: "extrapolation" means to estimate a data both "beyond"
> and "below" the given limits, right? (For my example to guess less than
> 0.1um should I say downward-extrapolation and above 3.0 um
> upward-extrapolation or just extrapolation is enough?)

Just "extrapolation" can describe either case, yes.

Gökhan Sever

unread,
Nov 17, 2009, 6:52:34 PM11/17/09
to SciPy Users List

Talking to another guy creating second modal (probably a normal distributed way) might be the other approach to take in addition to log-normally extrapolating the data. In any case, I should be able to parametrize the fits since I will do integration once I am done with the extrapolation part.

I asked this in one of my early replies just repeating what is the way to get log-normal sample using scipy.stats? I will use it for a demonstrative case.
For some reason, this never looks an expected log-normal sample to me:

stats.lognorm.rvs(1,size=15)

What am I missing here?
 

> One quick question: "extrapolation" means to estimate a data both "beyond"
> and "below" the given limits, right? (For my example to guess less than
> 0.1um should I say downward-extrapolation and above 3.0 um
> upward-extrapolation or just extrapolation is enough?)

Just "extrapolation" can describe either case, yes.


Thanks for your time and explanations Robert. I really appreciate your help. Probably I will include you in the acknowledgements part of my presentation.
 
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user



--
Gökhan

Robert Kern

unread,
Nov 17, 2009, 7:00:40 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 17:52, Gökhan Sever <gokha...@gmail.com> wrote:

> I asked this in one of my early replies just repeating what is the way to
> get log-normal sample using scipy.stats? I will use it for a demonstrative
> case.
> For some reason, this never looks an expected log-normal sample to me:
>
> stats.lognorm.rvs(1,size=15)
>
> What am I missing here?

Are you expecting that to look like your 15-vector concentration data?
That's not what you should expect. Instead,

x = stats.lognorm.rvs(1, size=10000)
h = np.histogram(x, bins=15)

Now, the *histogram* of the samples should look like roughly like the
shapes that you are expecting. .rvs() produces the samples themselves.
I.e. pretend like each element is the size of an individual particle,
not the concentration of a size class of particles. Taking the
histogram "simulates" what your instrument does: it finds the amont of
particles in each size class.

> Thanks for your time and explanations Robert. I really appreciate your help.
> Probably I will include you in the acknowledgements part of my presentation.

Entirely unnecessary, of course.

Gökhan Sever

unread,
Nov 17, 2009, 7:36:42 PM11/17/09
to SciPy Users List
On Tue, Nov 17, 2009 at 6:00 PM, Robert Kern <rober...@gmail.com> wrote:
On Tue, Nov 17, 2009 at 17:52, Gökhan Sever <gokha...@gmail.com> wrote:

> I asked this in one of my early replies just repeating what is the way to
> get log-normal sample using scipy.stats? I will use it for a demonstrative
> case.
> For some reason, this never looks an expected log-normal sample to me:
>
> stats.lognorm.rvs(1,size=15)
>
> What am I missing here?

Are you expecting that to look like your 15-vector concentration data?
That's not what you should expect. Instead,

 x = stats.lognorm.rvs(1, size=10000)
 h = np.histogram(x, bins=15)

Now, the *histogram* of the samples should look like roughly like the
shapes that you are expecting. .rvs() produces the samples themselves.
I.e. pretend like each element is the size of an individual particle,
not the concentration of a size class of particles. Taking the
histogram "simulates" what your instrument does: it finds the amont of
particles in each size class.

Now, I see it better. Makes much more sense now.
 

> Thanks for your time and explanations Robert. I really appreciate your help.
> Probably I will include you in the acknowledgements part of my presentation.

Entirely unnecessary, of course.

Not at all ;)
 

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user



--
Gökhan

Gökhan Sever

unread,
Nov 21, 2009, 4:57:10 PM11/21/09
to Discussion of Numerical Python, SciPy Users List
One more update on this subject. I have been looking through some of the papers on this topic, and I have finally found exactly what I need in this paper:

Hussein, T., Dal Maso, M., Petaja, T., Koponen, I. K., Paatero, P., Aalto, P. P., Hameri, K., and Kulmala, M.: Evaluation of an automatic algorithm for fitting the particle number size distributions, Boreal Environ. Res., 10, 337–355, 2005.

Here is the abstract:

"The multi log-normal distribution function is widely in use to parameterize the aerosol particle size distributions. The main purpose of such a parameterization is to quantitatively describe size distributions and to allow straightforward comparisons between different aerosol particle data sets. In this study, we developed and evaluated an algorithm to parameterize aerosol particle number size distributions with the multi log-normal distribution function. The current algorithm is automatic and does not need a user decision for the initial input parameters; it requires only the maximum number of possible modes and then it reduces this number, if possible, without affecting the fitting quality. The reduction of the number of modes is based on an overlapping test between adjacent modes. The algorithm was evaluated against a previous algorithm that can be considered as a standard procedure. It was also evaluated against a long-term data set and different types of measured aerosol particle size distributions in the ambient atmosphere. The evaluation of the current algorithm showed the following advantages: (I) it is suitable for different types of aerosol particles observed in different environments and conditions, (2) it showed agreement with the previous standard algorithm in about 90% of long-term data set, (3) it is not time-consuming, particularly when long-term data sets are analyzed, and (4) it is a useful tool in the studies of atmospheric aerosol particle formation and transformation."

The full-text is freely available at: http://www.borenv.net/BER/pdfs/ber10/ber10-337.pdf
--
Gökhan
Reply all
Reply to author
Forward
0 new messages