Hi Serge,
sum( (x_i - average)^2 for i=1:n) / n is a biased estimator of variance.
One must divide by n-1 to obtain an unbiased estimator.
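To make the bias point concrete, here is a small Python illustration (my own example, not the code under discussion), using the standard library's two variance functions:

```python
# Biased (divide by n) vs unbiased (divide by n-1) variance estimators.
from statistics import pvariance, variance

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)

biased = pvariance(xs)    # sum((x - mean)^2) / n      -> 4.0
unbiased = variance(xs)   # sum((x - mean)^2) / (n-1)  -> 32/7 ~ 4.57

# The two differ by exactly the factor n/(n-1):
assert abs(unbiased - biased * n / (n - 1)) < 1e-12
```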
That's probably what Werner means.
Apart from the bias, I have verified the formula, and it looks correct.
But it would deserve an explanation (or a reference to an explanation, maybe in Didier Besset's book?), because it is non-trivial.
delta_{n+1} is the difference (average_estimate_n - vector_{n+1}), divided by (n+1).
We want to use (average_estimate_{n+1} - vector_{n+1}) in the covariance estimator.
But then we should also compensate for the evolution of the average estimate in the previous accumulations...
Let's do it in scalar first:
sum( (x_i - average_estimate_{n+1})^2/n ) = ( sum( x_i^2)/n - 2*sum(x_i)/n*average_estimate_{n+1}+sum(average_estimate_{n+1}^2)/n )
... = sum( x_i^2)/n + average_estimate_{n+1}^2 - 2*average_estimate_{n+1}*average_estimate_n
Since we have computed:
variance_estimate_n = sum( (x_i - average_estimate_n)^2/n ) = sum( x_i^2)/n - average_estimate_n^2
Then compensating the error requires taking:
variance_estimate_n_corrected = variance_estimate_n + (average_estimate_{n+1}-average_estimate_n)^2
... = variance_estimate_n + delta_{n+1}^2
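The compensation identity above is easy to check numerically. A quick Python sketch (variable names are mine; the data is an arbitrary example):

```python
# Check: variance_n + (mu_{n+1} - mu_n)^2 == (1/n) * sum((x_i - mu_{n+1})^2),
# i.e. adding the squared shift of the average re-centers the old sum of squares.
xs = [1.0, 2.0, 4.0, 8.0]   # the first n samples
x_new = 16.0                # the (n+1)-th sample
n = len(xs)

mu_n = sum(xs) / n
mu_np1 = (sum(xs) + x_new) / (n + 1)

variance_n = sum((x - mu_n) ** 2 for x in xs) / n
corrected = variance_n + (mu_np1 - mu_n) ** 2

# Same n samples, but centered on the new average estimate:
recomputed = sum((x - mu_np1) ** 2 for x in xs) / n
assert abs(corrected - recomputed) < 1e-12
```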
Then, updating the variance with new accumulated value, with biased estimator:
variance_estimate_{n+1} = (variance_estimate_n_corrected * n + (average_estimate_{n+1} - x_{n+1})^2) / (n+1)
average_estimate_{n+1} = average_estimate_n - delta_{n+1}
average_estimate_{n+1} - x_{n+1} = average_estimate_n - x_{n+1} - delta_{n+1}
... = (n+1)*delta_{n+1} - delta_{n+1}    (since average_estimate_n - x_{n+1} = (n+1)*delta_{n+1})
... = n * delta_{n+1}
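These two identities on delta can also be checked numerically (again my own names and data, not the original code):

```python
# delta = (mu_n - x_new)/(n+1);  mu_{n+1} = mu_n - delta;  mu_{n+1} - x_new = n*delta
xs = [1.0, 2.0, 4.0, 8.0]
x_new = 16.0
n = len(xs)

mu_n = sum(xs) / n
delta = (mu_n - x_new) / (n + 1)
mu_np1 = mu_n - delta

# The update reproduces the true average over n+1 samples:
assert abs(mu_np1 - (sum(xs) + x_new) / (n + 1)) < 1e-12
# And the residual against the new sample is n*delta:
assert abs((mu_np1 - x_new) - n * delta) < 1e-12
```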
So, if I did not mess up so far:
variance_estimate_{n+1} = (n * variance_estimate_n + n * delta_{n+1}^2 + n^2 * delta_{n+1}^2) / (n+1)
IOW:
variance_estimate_{n+1} = variance_estimate_n * n/(n+1) + n*delta_{n+1}^2
This can be extended to covariance, and we indeed recover the iterative formula that is programmed.
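For completeness, the whole iteration can be sketched in a few lines of Python (my own rendering of the scalar case, not the actual Smalltalk code; class and method names are hypothetical), checked against the batch biased variance:

```python
# Incremental average and biased (divide-by-n) variance, one sample at a time:
#   delta_{n+1}    = (average_n - x) / (n+1)
#   average_{n+1}  = average_n - delta_{n+1}
#   variance_{n+1} = variance_n * n/(n+1) + n * delta_{n+1}^2
from statistics import pvariance

class MomentAccumulator:
    """Accumulates average and biased variance without storing the samples."""
    def __init__(self):
        self.n = 0
        self.average = 0.0
        self.variance = 0.0

    def accumulate(self, x):
        n1 = self.n + 1
        delta = (self.average - x) / n1
        self.average -= delta
        self.variance = self.variance * self.n / n1 + self.n * delta * delta
        self.n = n1

acc = MomentAccumulator()
xs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
for x in xs:
    acc.accumulate(x)

# Matches the batch results computed from all samples at once:
assert abs(acc.average - sum(xs) / len(xs)) < 1e-12
assert abs(acc.variance - pvariance(xs)) < 1e-12
```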