On 6/22/12 8:00 AM, in article
de637d5a-efbc-45e4...@googlegroups.com, "cc"
We have already discussed how you missed the change in the trend... you call
them "outliers" (though you have also used the term "erroneous" and
others... all rather silly of you). Let us be more specific about why your
claim that the data from the latter half of 2011 should be seen as
"outliers" does not hold up. Start with the definition:
<http://en.wikipedia.org/wiki/Outlier>
-----
An outlying observation, or outlier, is one that appears to
deviate markedly from other members of the sample in which it
occurs.
-----
But if you look at the data from the latter half of 2011:
<http://tmp.gallopinginsanity.com/LinuxTrend2011-2ndhalf.png>
Those data points show a clear and very strong trend (even if nobody
predicted that trend would continue unchanged for any great length of time).
Those data points do *not* deviate "markedly from other members of the
sample". This can be seen with the high R^2 value. Even looking at the
greater set of data:
<http://tmp.gallopinginsanity.com/LinuxTrendMar2012Snit-vs-cc.png>
It is *very* clear that there is an upward trend at the latter half of
2011... those data points are forming a pattern. The same Wikipedia page
speaks of using caution that you did not:
-----
Caution: Unless it can be ascertained that the deviation is
not significant, it is ill-advised to ignore the presence of
outliers.
-----
From your description I have not understood what you did to ascertain "that
the deviation is not significant". Maybe you can explain that.
From my view, the fact that it was not a single data point that seemed "off"
but a set of at least six consecutive ones in a very clear trend rules out
dismissing them as meaningless "outliers". But I am open to your
explanation... what makes you think those six data points with such a strong
and clear trend (an R^2 value of over 0.98, and this is *without* weighting
or assuming any outliers, etc.) are "*ONE* that appears to deviate markedly
from other members of the sample" (emphasis mine... but keep in mind the
importance of it being *ONE* data point). This does not mean there cannot be
more than one outlier in a sample - but points that make a trend of their
own are not each occurring as a single "outlier".
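To make the R^2 point concrete, here is a minimal sketch of how R^2 is
computed for a straight-line fit through a short run of points. The six
values below are made up for illustration (they are *not* the actual 2011
figures), and the function name r_squared is my own, not from any library.

```python
# Minimal sketch: R^2 of an ordinary least-squares line through a few points.
# The data are hypothetical, NOT the real usage-share numbers.

def r_squared(xs, ys):
    """Return R^2 of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares (scatter around the fitted line)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares (scatter around the mean)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

months = [1, 2, 3, 4, 5, 6]
steady_rise = [0.95, 1.05, 1.12, 1.24, 1.31, 1.42]  # made-up values

print(round(r_squared(months, steady_rise), 3))
```

A value near 1 only says these points sit close to a straight line of their
own. It says nothing by itself about how they relate to the rest of a larger
data set, which is why the plot of the full data matters too.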
In case you do not want to accept the single definition from Wikipedia, I
found these for you so you can better understand what an outlier is:
<http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm>
-----
An outlier is an observation that lies an abnormal distance
from other values in a random sample from a population. In a
sense, this definition leaves it up to the analyst (or a
consensus process) to decide what will be considered
abnormal. Before abnormal observations can be singled out, it
is necessary to characterize normal observations.
-----
Again, "an observation"... and again it makes it clear that the
determination of whether something is "an outlier" (one) is subjective.
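One way to see that subjectivity is the common box-plot rule, where a point
is flagged if it lies more than k times the interquartile range beyond the
quartiles. The sketch below assumes that rule and a made-up sample; the
function name tukey_outliers and the choice of k values are mine, for
illustration only.

```python
# Minimal sketch of the box-plot (interquartile range) rule for flagging
# outliers. The multiplier k is a choice made by the analyst.
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 30]  # made-up sample

print(tukey_outliers(sample, k=1.5))  # conventional multiplier
print(tukey_outliers(sample, k=6.0))  # a much looser, arbitrary multiplier
```

The same data give a different answer depending on the multiplier picked,
which is the sense in which the call is left "up to the analyst".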
<http://www.statsoft.com/textbook/basic-statistics/#Correlationse>
-----
Outliers. Outliers are atypical (by definition), infrequent
observations.
-----
But you called these 6 of 24 data points "outliers"... and to keep your
claim of 1% at all times, you might also include Feb 2012. Even if not, you
are deeming 25% of the data as being "outliers". This is not consistent
with the idea that they would be "atypical".
-----
Needless to say, one should never base important conclusions
on the value of the correlation coefficient alone (i.e.,
examining the respective scatterplot is always recommended).
-----
This is what I have been telling you. Looking just at the linear trend line
is *not* sufficient, esp. when you are assuming that 25% of your data points
are "outliers". The same link gets even more clear:
-----
Nonlinear Relations between Variables. Another potential
source of problems with the linear (Pearson r) correlation is
the shape of the relation. As mentioned before, Pearson r
measures a relation between two variables only to the extent
to which it is linear; deviations from linearity will
increase the total sum of squared distances from the
regression line even if they represent a "true" and very
close relationship between two variables. The possibility of
such non-linear relationships is another reason why examining
scatterplots is a necessary step in evaluating every
correlation. For example, the following graph demonstrates an
extremely strong correlation between the two variables which
is not well described by the linear function.
-----
As I have been telling you: when the data is non-linear, as the data in this
case is, then one *must* look at the data itself. You did not - hence
the reason why you missed the upward trend of the latter half of 2011.
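The quoted point about Pearson r can be checked with a toy case. The sketch
below uses a made-up, perfectly curved relation (y = x^2 on points placed
symmetrically around zero), not the data under discussion; pearson_r is a
hand-written helper, not a library call.

```python
# Minimal sketch: Pearson r only measures the *linear* part of a relation.
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
curved = [x * x for x in xs]        # exact relation, but not a line
straight = [2 * x + 1 for x in xs]  # exact linear relation

print(pearson_r(xs, curved))
print(pearson_r(xs, straight))
```

The curved case is an exact relationship, yet its linear correlation
vanishes. That is the reason the textbook says to look at the scatterplot
and not just at the coefficient.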
But there are more resources to help you understand this:
<http://mathworld.wolfram.com/Outlier.html>
-----
An outlier is an observation that lies outside the overall
pattern of a distribution (Moore and McCabe 1999). Usually,
the presence of an outlier indicates some sort of problem.
This can be a case which does not fit the model under study,
or an error in measurement.
Outliers are often easy to spot in histograms. For example,
the point on the far left in the above figure is an outlier.
-----
If you look at the graph, you can see it shows what appears to be a *true*
outlier... a single point that is significantly different from the rest of
the data. Your "outliers" are 25% of the data and form a clear trend. This
means they are not "outliers" at all, but a trend that is seen in the
overall data. A trend that fit my vague prediction (which, to remind you,
does not prove causation).
But there is more:
<http://www.experiment-resources.com/statistical-outliers.html>
-----
Statistical outliers are data points that are far removed and
numerically distant from the rest of the points.
-----
Calling 25% of the data points "outliers" is a bit silly... esp. when they
show such a strong trend. Points that form such a strong trend *cannot* be
"far removed and numerically distant from the rest of the points".
And I found many more examples... pretty much any reasonable page that talks
about outliers will make it clear that:
1) Such determinations are largely subjective - contrary to your claim that
they are "fact"
2) Outliers cannot include 25% of the data - esp. when that 25% of the data
are points in a direct series which show a *very* clear trend (even if a
non-lasting trend).
Your "outlier" claim is a bit absurd - and, again, shows how you do not
really get the concept of what you are talking about. This is the same as
when you insisted sigma lines could not be based on the distance from the
mean (they can - they are based on the distance from the mean to the
inflection points) and your claim that the depictions I showed you were fine
when it was *very* clear they were not. And you *know* this... hence the
reason you repeatedly snip your own comments and refuse to answer questions
on these topics.
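On the sigma point, the relation to the inflection points can be checked
numerically, assuming the curve in question is a normal (bell) curve. The
sketch below uses arbitrary values for the mean and sigma and a simple
finite-difference estimate of the curvature; all names in it are my own.

```python
# Minimal sketch, assuming a normal curve: the curvature changes sign at
# mean +/- sigma, i.e. the inflection points sit one sigma from the mean.
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

mu, sigma = 5.0, 2.0  # arbitrary illustrative values

def curve(x):
    return normal_pdf(x, mu, sigma)

just_inside = second_derivative(curve, mu + 0.9 * sigma)
just_outside = second_derivative(curve, mu + 1.1 * sigma)

# Negative curvature inside one sigma, positive curvature outside it
print(just_inside < 0, just_outside > 0)
```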
You were shown to be wrong about sigma lines. Now you have shown yourself
to be wrong about outliers in data.
Really, is there anything you can point to where you can claim to be right?
> The latter half of 2011, which you love to point to, is mostly made up of
> outliers. This is a fact.
A "fact"? Based on what? Also from the same Wikipedia page:
-----
There is no rigid mathematical definition of what constitutes
an outlier; determining whether or not an observation is an
outlier is ultimately a subjective exercise.
-----
In other words, it is not a "fact" but a subjective *opinion*. And I find
that opinion to be rather absurd given that we are *not* talking about *ONE*
point but a set - and that set has a very clear trend. Amazingly clear,
really.
> You have changed datasets to just using the later half of 2011 to try and
> prove your point, since my trendline refuted your original point using your
> original data set. This is a fact.
Incorrect.
> You refuse to acknowledge that the 2011 data you now insist on using is made
> up of almost entirely outliers, even though it's not a matter of opinion. This
> is a fact.
Incorrect.
> You confuse R^2 with outliers and try to point to the high R^2 value for your
> 2011 data as some sort of proof of non-existence of outliers. This is a fact.
Incorrect.
> There have been no lies from me at all this entire time. I cannot say the same
> for you though. You consistently try to pretend I've said things I have not,
> and continue to repeat those things even though you've been correct numerous
> times. I'm sorry you got your ass handed to you, but unfortunately for you,
> it's math and cannot really be refuted.
Your claims are incorrect. I am not suggesting the math is wrong.
> Desktop Linux has been flatlined for quite some time now. Most people with
> common sense realize this to be true, and now I've proven it to you.
Let's try a different tack here. I have been very open about the places
where I see that I have been wrong or did not handle things as well as I
should have. I am an honest and open person. For example: I was wrong in
my predictions for the trend in 2012 and I did not handle things as well as
I should have when I did not note the non-linear nature of the trend before
I did. In one I was wrong - in the other I did not handle things as well as
I should have.
Let us test your honesty and openness: where do you think *you* have been
wrong... and no back-handed insults with that... just a sincere statement of
where you admit you were wrong. Can you think of *any* place in this whole
debate? Any at all?
My guess: you will not be willing to admit to any. I sincerely hope you
prove me wrong... (that would give me something else to add to my list). My
guess though is you are so tied to having to "prove" you are right - no
matter how wrong you have been - that you will simply avoid this question.