
Can straight line est fit with error measure be made in single pass?


Alf P. Steinbach

Oct 18, 2013, 9:57:36 PM
It would be nice if one could make do with a non-copyable forward
iterator for the (x,y) data points, without storing the points.

However, my current code requires the ability to iterate a second time,
from a copy of the start iterator.

Any suggestions for how to avoid the second iteration, or should I
perhaps just forget it (I only need this to check algorithm complexity
in a unit test, so it's not a current problem, just "not ideal" code...)?


[code]
#pragma once
// Copyright (c) 2013 Alf P. Steinbach

// <url: http://terpconnect.umd.edu/~toh/spectrum/CurveFitting.html#MathDetails>
// n = number of x,y data points
// sumx = Σx
// sumy = Σy
// sumxy = Σx*y
// sumx2 = Σx*x
// meanx = sumx / n
// meany = sumy / n
// slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
// intercept = meany - (slope*meanx)
// ssy = Σ(y - meany)^2
// ssr = Σ(y - intercept - slope*x)^2
// R2 = 1 - (ssr/ssy)
// Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 - sumx*sumx))
// Standard deviation of the intercept = SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
//
// Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>

#include <rfc/cppx/calc/Statistics.h>       // cppx::calc::Statistics

#include <cmath>                            // sqrt

namespace cppx{ namespace calc{

    class Best_line_fit
        : private Statistics
    {
    private:
        double  intercept_;
        double  slope_;
        double  ssy_;       // Sum of squares of y-expressions.
        double  ssr_;       // Sum of squares of residual expressions.

    public:
        auto stats() const -> Statistics const& { return *this; }

        auto slope() const -> double        { return slope_; }
        auto intercept() const -> double    { return intercept_; }

        auto ssy() const -> double          { return ssy_; }
        auto ssr() const -> double          { return ssr_; }
        auto r_squared() const -> double
        { return (ssr_ == 0 || ssy_ == 0? 1.0 : 1.0 - ssr_/ssy_); }
        auto r() const -> double            { return sqrt( r_squared() ); }

        template< class It >
        Best_line_fit( It const first, It const end )
            : Statistics( first, end )
            , ssy_( 0.0 )
            , ssr_( 0.0 )
        {
            slope_ = (n()*sum_xy() - sum_x()*sum_y()) / (n()*sum_xx() - square( sum_x() ));
            intercept_ = mean_y() - slope_*mean_x();

            // The second pass over the data, the one the question asks about:
            for( It it = first; it != end; ++it )
            {
                double const x = x_of( *it );
                double const y = y_of( *it );

                //ssy_ += square( y - mean_y() );
                ssr_ += square( y - (slope_*x + intercept_) );
            }
            // ssy via the expanded sum; no extra pass needed for this one:
            ssy_ = sum_yy() - 2*sum_y()*mean_y() + n()*square( mean_y() );
        }
    };

} }  // namespace cppx::calc
[/code]

Disclaimer 1: I do not KNOW that the code is correct, in particular the
r_squared() implementation.

Disclaimer 2: In the mathworld reference I do not understand eq. 17.


Cheers,

- Alf

Victor Bazarov

Oct 19, 2013, 9:54:38 AM
On 10/18/2013 9:57 PM, Alf P. Steinbach wrote:
> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.

Looking at the code below, I cannot really answer this question without
knowing what 'Statistics' is. Presuming that it does iterate over the
given range to collect the "sum of" and "mean" and the other values,
your other choice is to fix the 'Statistics' class to either calculate
your 'ssr' to boot, or to make it store the x and y values so you can
extract them later at your leisure.

As for eq. 17: it's the simplification of eq. 16. X' (x with a line over
it) is the mean x value, isn't it?

SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =

= SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =

= SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )

= SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'

Same with eq's 18 and 20.
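
A quick numeric sanity check of that identity (a standalone sketch, not
from the original post):

[code]
#include <cassert>
#include <cmath>
#include <vector>

int main()
{
    std::vector<double> const xs = { 1.0, 2.5, 4.0, 4.5, 7.0 };

    // One pass: accumulate SUMi(Xi) and SUMi(Xi^2).
    double sum_x = 0.0, sum_xx = 0.0;
    for( double const x : xs ) { sum_x += x; sum_xx += x*x; }

    double const n = xs.size();
    double const mean = sum_x/n;

    // Eq. 17 form: SUMi(Xi^2) - N*X'X'.
    double const ss_from_sums = sum_xx - n*mean*mean;

    // Eq. 16 form: SUMi(Xi - X')^2, which needs a second pass.
    double ss_direct = 0.0;
    for( double const x : xs ) { ss_direct += (x - mean)*(x - mean); }

    assert( std::fabs( ss_from_sums - ss_direct ) < 1e-9 );
}
[/code]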

V
--
I do not respond to top-posted replies, please don't ask

Paavo Helde

Oct 19, 2013, 11:29:27 AM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> wrote in
news:l3souu$4j1$1...@dont-email.me:

> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.
>
> However, my current code requires the ability to iterate a second
> time, from a copy of the start iterator.
>
> Any suggestions for how to avoid the second iteration, or should I
> perhaps just forget it (I only need this to check algorithm complexity
> in unit test, so it's not a current problem, just "not ideal"
> code...)?

Depends on what you mean by "ideal": code simplicity, performance,
accuracy, reducing the number of passes? These goals might be in
conflict with each other. For example, standard deviation and variance
can easily be calculated by a one-pass algorithm. However, a naive and
simple implementation produces big round-off errors; there is a
numerically stable one-pass algorithm from Knuth, but it involves a
division operation inside the loop, so it might be slow. There is a
stable two-pass algorithm that can actually be faster, but well, it has
two passes.

This was about variance; for linear regression it seems to be yet more
complicated. First google search turns up

http://stats.stackexchange.com/questions/23481/are-there-algorithms-for-computing-running-linear-or-logistic-regression-param

Alf P. Steinbach

Oct 19, 2013, 11:50:52 AM
On 19.10.2013 15:54, Victor Bazarov wrote:
> On 10/18/2013 9:57 PM, Alf P. Steinbach wrote:
>> It would be nice if one could make do with a non-copyable forward
>> iterator for the (x,y) data points, without storing the points.
>
> Looking at the code below, I cannot really answer this question without
> knowing what 'Statistics' is. Presuming that it does iterate over the
> given range to collect the "sum of" and "mean" and the other values,

Yes.


> your other choice is to fix the 'Statistics' class to either calculate
> your 'ssr' to boot,

This would be preferable.

However, the 'ssr' value depends on values that in turn depend on the
whole collection of numbers, plus those original numbers themselves
again:

> // slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
> // intercept = meany-(slope*meanx)
> // ssr = Σ(y-intercept-slope*x)^2

I know there is a neat trick for updating a mean value incrementally, so
I thought maybe someone knew about something similar here?
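
(For concreteness, the trick referred to is the running update
mean += (x - mean)/k; a minimal sketch, not code from the thread:)

[code]
#include <vector>

// Incremental mean: after k values,
// mean_k = mean_{k-1} + (x_k - mean_{k-1})/k.
// Hypothetical standalone helper, for illustration only.
double running_mean( std::vector<double> const& values )
{
    double mean = 0.0;
    int k = 0;
    for( double const x : values )
    {
        ++k;
        mean += (x - mean)/k;
    }
    return mean;
}
[/code]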

It seems like a math/C++ tradeoff: more of one to get less of the other.


> or to make it store the x and y values so you can
> extract them later at your leisure.

Yes.


[snip]
>> Disclaimer 1: I do not KNOW that the code is correct, in particular the
>> r_squared() implementation.
>>
>> Disclaimer 2: In the mathworld reference I do not understand eq. 17.
>
> It's the simplification of eq 16. X' (x with line over it) is the mean
> x value, isn't it?
>
> SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =
>
> = SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =
>
> = SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )
>
> = SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'
>
> Same with eq's 18 and 20.

Thanks! :-)

I got some help with that already this morning, on Facebook, and yes I
felt pretty stupid... :-( Not to mention whether that feeling now
reflects reality. Which it may.

Anyway, for my unit testing this now works: using 40 000 time
measurement points I get an R goodness of fit between 0.99 and 1
(perfect) for linear complexity, and forcing O(n^2) behavior[1] I get an
R around 0.98. Apparently values tight up against 0.99 or thereabouts
are normal, i.e. apparently this is a very finely tuned goodness-of-fit
criterion. But again I do not know, and the texts I've found neglect to
say anything about which numerical values decide one way or the other.


Cheers,

- Alf (still in uncharted territory)

Notes:
[1] By the simple expedient of allocating just the required size of the
internal buffer in the string, instead of using the buffer-doubling idea.

Richard Damon

Oct 19, 2013, 2:57:19 PM
On 10/18/13 9:57 PM, Alf P. Steinbach wrote:
> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.
>
> However, my current code requires the ability to iterate a second time,
> from a copy of the start iterator.
>
> Any suggestions for how to avoid the second iteration, or should I
> perhaps just forget it (I only need this to check algorithm complexity
> in unit test, so it's not a current problem, just "not ideal" code...)?
>
>
...
> // ssy = Σ(y-meany)^2
> // ssr = Σ(y-intercept-slope*x)^2
> // R2 = 1-(ssr/ssy)
> // Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 - sumx*sumx))
> // Standard deviation of the intercept = SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
> //
> // Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>
...
> - Alf

Sounds like you just need to precompute any needed terms to compute ssy
and ssr.

For example, ssy = sum( (y-my)^2 ) = sum( y^2 - 2*my*y + my^2 );

since my is a constant over the sum, this is

sum(y^2) - 2*my*sum(y) + my^2*sum(1),

so you need to have computed sum(y^2) and sum(y).


Do the same math for ssr, and Bob's your uncle, you have your solution.

At first glance, ssr will need the sums of y^2, y, x*y, x, x^2, and 1.
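
In code, that ssy computation from accumulated sums looks like this (a
sketch with hypothetical parameter names, not Alf's actual class):

[code]
// ssy = sum(y^2) - 2*my*sum(y) + my^2*n, where my = sum(y)/n.
// Algebraically this collapses to sum(y^2) - n*my^2.
double ssy_from_sums( double sum_y, double sum_yy, double n )
{
    double const my = sum_y/n;
    return sum_yy - 2*my*sum_y + n*my*my;
}
[/code]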

Alf P. Steinbach

Oct 20, 2013, 1:16:47 AM
Thanks, but the current (presented) code already does this for ssy. It's
ssr that is the problem, not easily updated.

Cheers,

- Alf

Alf P. Steinbach

Oct 20, 2013, 1:25:06 AM
On 19.10.2013 17:29, Paavo Helde wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> wrote in
> news:l3souu$4j1$1...@dont-email.me:
>
>> It would be nice if one could make do with a non-copyable forward
>> iterator for the (x,y) data points, without storing the points.
>>
>> However, my current code requires the ability to iterate a second
>> time, from a copy of the start iterator.
>>
>> Any suggestions for how to avoid the second iteration, or should I
>> perhaps just forget it (I only need this to check algorithm complexity
>> in unit test, so it's not a current problem, just "not ideal"
>> code...)?
>
> Depends on what you mean by ideal? Code simplicity, performance, accuracy,
> reducing the number of passes? These goals might be in conflict with each
> other. For example std. deviation and variance can be easily calculated by
> one-pass algorithm. However, a naive and simple implementation produces big
> round-off errors; there is a numerically stable one-pass algorithm from
> Knuth, but it involves a division operation inside loop, so might be slow.
> There is a stable two-pass algorithm that can actually be faster, but well,
> it has two passes.

Well, unfortunately my Mr. Knuth (the first three volumes; I never got
around to obtaining the TeX works or the current fourth volume stuff)
resides at the bottom of some unidentified, anonymous-looking box, among
numerous other unidentified anonymous-looking boxes.

But thanks anyway, it really helps to have some background.


> This was about variance; for linear regression it seems to be yet more
> complicated. First google search turns up
>
> http://stats.stackexchange.com/questions/23481/are-there-algorithms-for-computing-running-linear-or-logistic-regression-param

Hm, among the answers there, the last one, with 0 votes, seemed the most
promising.

Still, the upshot seems to be "can't be done".


Cheers,

- Alf

PS: Donald Knuth did most of the first volume's work while he was (doing
postdoc work?) at Norsk Regnesentral in Oslo. I can't help but connect
that somehow to the Simula OO work being done at that time at the
University of Oslo, even though I never read about any connection. I
once interviewed there, and mentioned this, just to show I'd done my bit
of background research. But they had not even heard of him. :(

Paavo Helde

Oct 20, 2013, 3:10:56 AM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> wrote in
news:l3vpfj$3tc$1...@dont-email.me:

> On 19.10.2013 17:29, Paavo Helde wrote:
>> For example std. deviation and variance can
>> be easily calculated by one-pass algorithm. However, a naive and
>> simple implementation produces big round-off errors; there is a
>> numerically stable one-pass algorithm from Knuth, but it involves a
>> division operation inside loop, so might be slow.
>
> Well, unfortunately my Mr. Knuth (the three first volumes, I never got
> around to obtaining the TeX works or the current fourth volume stuff)
> resides at the bottom of some unidentified, anonymous-looking box,
> among numerous other unidentified anonymous-looking boxes.

No need to dig into boxes, this is all present in wikipedia:

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
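
For reference, the Knuth/Welford one-pass algorithm described on that
page looks roughly like this (a sketch; note the division inside the
loop mentioned earlier):

[code]
#include <vector>

// One-pass, numerically stable sample variance (Welford/Knuth):
// mean_k = mean_{k-1} + delta/k;  M2_k = M2_{k-1} + delta*(x_k - mean_k).
double welford_variance( std::vector<double> const& values )
{
    double mean = 0.0;
    double m2 = 0.0;
    int k = 0;
    for( double const x : values )
    {
        ++k;
        double const delta = x - mean;
        mean += delta/k;                    // the division inside the loop
        m2 += delta*(x - mean);
    }
    return (k > 1? m2/(k - 1) : 0.0);       // sample variance
}
[/code]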

Cheers
Paavo

Richard Damon

Oct 20, 2013, 8:48:55 PM
But if you do the math, you can derive similar equations for ssr.

ssr = sum((y-b-m*x)^2)
= sum( y^2 + b^2 + m^2*x^2 - 2*y*b - 2*m*x*y + 2*b*m*x )

= sum(y^2) + b^2*sum(1) + m^2*sum(x^2) - 2*b*sum(y) - 2*m*sum(x*y) +
2*b*m*sum(x)

Remember, for THIS summation, intercept (called b above) and slope
(called m above) are constants and can be factored out of the sums.

No promise I haven't made a dumb math error above, but the principle is
solid.
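
Putting the two expansions together, a single-pass version of the whole
fit could look like this (a sketch built on the thread's formulas, not
Alf's actual Best_line_fit):

[code]
// One pass: accumulate the six sums, then derive slope m, intercept b,
// ssy and ssr without revisiting the data. Hypothetical illustration.
struct One_pass_line_fit
{
    double n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0, syy = 0;

    void add( double x, double y )
    {
        n += 1; sx += x; sy += y;
        sxx += x*x; sxy += x*y; syy += y*y;
    }

    double slope() const        { return (n*sxy - sx*sy)/(n*sxx - sx*sx); }
    double intercept() const    { return sy/n - slope()*sx/n; }

    double ssy() const          { return syy - sy*sy/n; }

    double ssr() const
    {
        double const m = slope(), b = intercept();
        return syy + b*b*n + m*m*sxx - 2*b*sy - 2*m*sxy + 2*b*m*sx;
    }

    double r_squared() const    { return 1.0 - ssr()/ssy(); }
};
[/code]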

Alf P. Steinbach

Oct 20, 2013, 9:30:20 PM
Thanks a bunch! :-)


Cheers,

- Alf

Oh, why did that seem impossible to me, why didn't I think of it...

SG

Oct 23, 2013, 8:59:36 AM
On Saturday, October 19, 2013 3:57:36 AM UTC+2, Alf P. Steinbach wrote:
> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.

Let me just mention that this extends to any linear least squares
problem. The amount of memory you really need is just determined by
the number of unknowns, not the number of "data points". Suppose

A x = b

is an overdetermined linear equation system with very few unknowns
(and hence very few columns in A) but many equations (rows in A).
Then, you can, for example, compute A^T A and A^T b of

A^T A x = A^T b // Gauss' normal equation system

using a single loop over all rows of A and b. A^T A will be a small
square matrix with n^2 entries where n is the number of unknowns.
This can then be solved with a Cholesky factorization, for example.

Other approaches are possible, too. If you care about numerical
accuracy you could "accumulate" a couple of rows and periodically
reduce the whole system to O(n^2) memory via Householder
transformations.
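
For the straight-line case the normal-equations bookkeeping is tiny; a
sketch of the idea (a hypothetical helper, not from the post):

[code]
#include <utility>
#include <vector>

// Each point contributes a row [1, x] to A and a value y to b.
// A^T A = [[n, sx], [sx, sxx]] and A^T b = [sy, sxy] are accumulated in
// one pass; the 2x2 system is then solved for intercept c0 and slope c1.
std::pair<double, double> fit_line_via_normal_equations(
    std::vector< std::pair<double, double> > const& points )
{
    double n = 0, sx = 0, sxx = 0, sy = 0, sxy = 0;
    for( auto const& p : points )
    {
        double const x = p.first, y = p.second;
        n += 1; sx += x; sxx += x*x; sy += y; sxy += x*y;
    }
    double const det = n*sxx - sx*sx;           // assumed nonzero
    double const c0 = (sxx*sy - sx*sxy)/det;    // intercept
    double const c1 = (n*sxy - sx*sy)/det;      // slope
    return { c0, c1 };
}
[/code]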

Cheers!
SG

Richard Damon

Oct 26, 2013, 1:55:12 PM
Note that least-squares fits are NOT overdetermined linear equations.
You do not get an "equation" for each data point when processing, but a
multidimensional optimization problem, with one dimension for each
unknown to find. This tends to be converted to a set of equations by
taking the partial derivative of the error with respect to each
coefficient to be solved for, and finding the location where all the
derivatives are zero.

What is helpful is that, for a large class of problems, you don't need
to keep all the data points to solve the system, but can reduce the data
points to a limited number of moments that the solution is based on.