Bug in scales::zero_range?


joran

Jun 5, 2012, 1:44:14 PM
to ggplo...@googlegroups.com
We've had two questions recently on StackOverflow about behavior where, when the y values are all extremely small, ggplot interprets them as identical and plots them in a single horizontal line:


[links to the two StackOverflow questions]

Both SO posts contain reproducible examples. Some of the comments there point to older discussions on the ggplot mailing list that mention this problem and seem to suggest it was fixed. Because of that, I wanted to raise the issue here first before filing a bug report on github.

Thoughts?

Brandon Hurr

Jun 5, 2012, 1:59:02 PM
to joran, ggplo...@googlegroups.com
I remember a recent example of this, and I don't think there was a resolution. The values were tiny and ggplot put them on the same line, almost as if they were a factor. I think it had to do with the way the values were interpreted as doubles.

joran

Jun 5, 2012, 2:11:59 PM
to ggplo...@googlegroups.com
Just to put a finer point on it, this is scales::zero_range:

function (x) 
{
    length(x) == 1 || isTRUE(all.equal(x[1] - x[2], 0))
}

and what's happening is that the default tolerance in all.equal is treating the difference as zero. I'm not confident enough in my knowledge of floating point edge cases to be certain what the best alternative is. Just check x[1] == x[2]?
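
For instance, a quick sketch using the tiny magnitudes from the test cases further down this thread (reproducible in any R session):

isTRUE(all.equal(5.63e-147 - 5.93e-123, 0))
# TRUE -- the difference is far below the default tolerance
# of .Machine$double.eps^0.5 (about 1.5e-08)

5.63e-147 == 5.93e-123
# FALSE -- the values are genuinely distinct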

Winston Chang

Jun 5, 2012, 2:22:09 PM
to joran, ggplo...@googlegroups.com
This is probably related to https://github.com/hadley/ggplot2/issues/550.

I'm not sure of a good general solution, though. The question is, how do you distinguish between cases where the data really does have a small range, and cases where the data all (ideally) have the same value, but with some rounding error?
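
To illustrate the dilemma (a rough sketch, not data from the SO posts; exact printed values depend on platform rounding):

real  <- c(1e-11, 2e-11)          # genuinely distinct measurements
noise <- 1e6 * (1 + c(0, 2e-16))  # one value plus representation error
diff(real)   # 1e-11  -- a real, meaningful range
diff(noise)  # ~2e-10 -- pure floating point noise, yet even larger

Any fixed absolute tolerance treats both ranges the same way, even though only one of them is real.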

joran

Jun 5, 2012, 2:23:33 PM
to ggplo...@googlegroups.com
I should have checked on github first, sorry for the duplication.

Brandon Hurr

Jun 5, 2012, 2:25:14 PM
to joran, ggplo...@googlegroups.com
Crap, I forgot I filed the bug... Credit mostly goes to David Kahle for providing a reproducible example.

joran

Jun 5, 2012, 2:33:21 PM
to ggplo...@googlegroups.com, joran
On Tuesday, June 5, 2012 11:22:09 AM UTC-7, Winston Chang wrote:
This is probably related to https://github.com/hadley/ggplot2/issues/550.

I'm not sure of a good general solution, though. The question is, how do you distinguish between cases where the data really does have a small range, and cases where the data all (ideally) have the same value, but with some rounding error?

As a way out of that dilemma, would it make sense to simply switch which case ggplot handles "unexpectedly"? That is, have ggplot assume by default that the data really do have a very small range, so that the bug is shifted to the case where the values really are all the same but appear different due to rounding error. Isn't that what R's base graphics does?

That seems like a more sensible default behavior to me, I suppose. 

Brian Diggs

Jun 5, 2012, 4:03:17 PM
to joran, public-ggplot2-dev-/...@plane.gmane.org
A previous fix (issue #6 in the scales package, commit e130daa) changed
the algorithm from:

if (length(x) == 1) return(TRUE)
x <- x / mean(x)
isTRUE(all.equal(x[1], x[2]))

to:

length(x) == 1 || isTRUE(all.equal(x[1] - x[2], 0))

Presumably to deal with "large numbers with small differences" (based on
the added tests).

The use of all.equal is to say "equal to within what is probably just
differences due to floating point representation" (my paraphrase). From
the actual documentation of all.equal:

Numerical comparisons for scale = NULL (the default) are done by first
computing the mean absolute difference of the two numerical vectors. If
this is smaller than tolerance or not finite, absolute differences are
used, otherwise relative differences scaled by the mean absolute
difference.

By default, tolerance is .Machine$double.eps ^ 0.5 (which is
1.490116e-08 on my machine).

The old way only looked at relative differences; the new way looks at
absolute or relative differences (though reading the documentation of
all.equal, I'm still not sure I understand when it does which).
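
A couple of quick probes of the behavior (a sketch; outcomes checked with the default tolerance):

isTRUE(all.equal(1e-10, 2e-10))    # TRUE:  compared absolutely; 1e-10 < 1.5e-08
isTRUE(all.equal(1e10, 1e10 + 1))  # TRUE:  compared relatively; 1e-10 < 1.5e-08
isTRUE(all.equal(1e10, 1.1e10))    # FALSE: relative difference is 0.1

So for tiny values the comparison is effectively absolute, which is exactly the case zero_range gets wrong.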

Ultimately, we need more test cases. Pulling from the existing tests,
the two StackOverflow questions, and some others, I think these are right:

expect_false(zero_range(c(5.63e-147, 5.93e-123)))
expect_false(zero_range(c(-7.254574e-11, 6.035387e-11)))
expect_false(zero_range(c(1330020857.8787, 1330020866.8787)))
expect_true(zero_range(c(1330020857.8787, 1330020857.8787)))
expect_true(zero_range(c(1330020857.8787, 1330020857.8787*(1+1e-20))))
expect_false(zero_range(c(0,10)))
expect_true(zero_range(c(0,0)))
expect_false(zero_range(c(-1,1)))
expect_false(zero_range(c(-1,1*(1+1e-20))))
expect_false(zero_range(c(-1,1)*1e-100))
expect_false(zero_range(c(0,1)*1e-100))
expect_true(zero_range(c(1)))
expect_true(zero_range(c(0)))
expect_true(zero_range(c(1e100)))
expect_true(zero_range(c(1e-100)))


If so, here is a function which meets those criteria:

zero_range <- function(x) {
  m <- mean(x)
  if (m == 0L) {
    if (x[[1]] == 0L) {
      m <- 1
    } else {
      m <- x[[1]]
    }
  }
  length(x) == 1 || isTRUE(all.equal(diff(x / m), 0,
                                     tolerance = .Machine$double.eps))
}

First, I change the tolerance to .Machine$double.eps, not the square
root of it. Second, I rescale the vector based on the mean value, after
checking that it is not exactly 0L to take care of absolute size
differences. If the mean is exactly 0L, then rescale by the first value
(if it is not 0L); if both the mean and first value are exactly 0L,
don't rescale (since all are zero).
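
To see the rescaling at work on the "large numbers with small differences" case (a sketch run by hand; printed values are approximate):

x <- c(1330020857.8787, 1330020866.8787)
m <- mean(x)   # ~1330020862
x / m          # ~c(0.9999999966, 1.0000000034)
diff(x / m)    # ~6.8e-09 -- well above .Machine$double.eps, so not a zero range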

Any other tests to throw at this? Can anyone break it? (Other than with
length(x) > 2, but that is part of the documentation).

--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University


Hadley Wickham

Jun 5, 2012, 5:39:34 PM
to Brian Diggs, joran, public-ggplot2-dev-/...@plane.gmane.org


> If so, here is a function which meets that criteria:
>
> zero_range <- function(x) {

if (length(x) == 1) return(TRUE)

>    m <- mean(x)
>    if(m == 0L) {
>        if (x[[1]] == 0L) {
>            m <- 1

I don't follow this bit - could you add a comment explaining what's going on?


>        } else {
>            m <- x[[1]]
>        }
>    }
>    length(x) == 1 || isTRUE(all.equal(diff(x/m), 0,
> tolerance=.Machine$double.eps))

With early termination above, I think this could be simplified to

diff(x/m) < .Machine$double.eps

Hadley


--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Winston Chang

Jun 5, 2012, 6:06:51 PM
to Brian Diggs, ggplo...@googlegroups.com
As I understand it, you're making the zero_range calculation relative to the magnitude of the x values. It might be better, but I think there are still some tough cases.

Here's an example with your version of zero_range():
zero_range(c(0, 1e-20))
# FALSE

How can you know whether this tiny range is due to FP error or if it's actually part of the data?


It's not entirely clear to me why you want the mean. Maybe it would be better to use the max value, like max(abs(x[1]), abs(x[2]))?
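
For concreteness, a sketch of that variant (zero_range_max is a hypothetical name, not code from the pull request):

zero_range_max <- function(x) {
  if (length(x) == 1) return(TRUE)
  m <- max(abs(x[1]), abs(x[2]))
  if (m == 0) return(TRUE)  # both endpoints are exactly zero
  abs(diff(x / m)) < .Machine$double.eps
}

As far as I can tell it passes the expect_true/expect_false cases above, but it has the same open question about non-finite input.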


Some other test cases you may want to consider:
zero_range(c(1,NA))
zero_range(c(0,NaN))
zero_range(c(1,Inf))
zero_range(c(-Inf,Inf))
zero_range(c(Inf,Inf))


-Winston

Hadley Wickham

Jun 5, 2012, 6:10:53 PM
to Winston Chang, Brian Diggs, ggplo...@googlegroups.com
I wonder if the problem is that zero_range simply shouldn't be used
when calculating the scale limits - it's more obviously useful in the
other places where it's used (detecting overlaps in position
adjustments and when computing the resolution)

Hadley

Brian Diggs

Jun 5, 2012, 6:30:04 PM
to Hadley Wickham, ggplo...@googlegroups.com
On 6/5/2012 2:39 PM, Hadley Wickham wrote:
>
>
>> If so, here is a function which meets that criteria:
>>
>> zero_range <- function(x) {
>
> if (length(x) == 1) return(TRUE)
>
>>   m <- mean(x)
>>   if (m == 0L) {
>>     if (x[[1]] == 0L) {
>>       m <- 1
>
> I don't follow this bit - could you add a comment explaining what's going on?

# Determine the relative scale for the difference of the endpoints.
# Either endpoint, or the mean of the endpoints, is a good candidate
# for the scale, but any of those may be 0, which would cause problems
# when dividing by it. Therefore, check if the mean is 0, and if so use
# the lower point of the range unless it too is zero. If both the mean
# and the lower bound are zero, then do not rescale (set the scale to 1).

>
>
>>     } else {
>>       m <- x[[1]]
>>     }
>>   }
>>   length(x) == 1 || isTRUE(all.equal(diff(x/m), 0,
>>     tolerance = .Machine$double.eps))
>
> With early termination above, I think this could be simplified to
>
> diff(x/m) < .Machine$double.eps

Actually, the absolute value of that.

> Hadley

OK, time to roll this into a pull request for further comments.

https://github.com/hadley/scales/pull/21

Brian Diggs

Jun 5, 2012, 6:42:18 PM
to Winston Chang, ggplo...@googlegroups.com
On 6/5/2012 3:06 PM, Winston Chang wrote:
> As I understand it, you're making the zero_range calculation relative to the
> magnitude of the x values. It might be better, but I think there are still
> some tough cases.
>
> Here's an example with your version of zero_range():
> zero_range(c(0, 1e-20))
> # FALSE
>
> How can you know whether this tiny range is due to FP error or if it's
> actually part of the data?

Ultimately, I suppose you can't know. But at least this version errs on
the conservative side and assumes that the difference is real.

> It's not entirely clear to me why you want the mean. Maybe it would be
> better to use the max value, like max(abs(x[1]), abs(x[2]))?

Maybe. Any of those gives the appropriate magnitude of the data. And
that would have to be checked against 0 as well.

>
> Some other test cases you may want to consider:
> zero_range(c(1,NA))
> zero_range(c(0,NaN))
> zero_range(c(1,Inf))
> zero_range(c(-Inf,Inf))
> zero_range(c(Inf,Inf))

Those are good cases (and my version fails most of them). Are we
guaranteed that only finite, non-missing values will be passed to the
function? If not, the code will have to be modified to take these
possibilities into account.
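
Something along these lines, perhaps (a sketch only; the name and the NA return convention are illustrative, not part of the pull request):

zero_range_guarded <- function(x) {
  if (length(x) == 1) return(TRUE)
  if (any(is.na(x))) return(NA)                   # NA/NaN: no meaningful answer
  if (any(is.infinite(x))) return(x[1] == x[2])   # Inf vs Inf is TRUE; 1 vs Inf and -Inf vs Inf are FALSE
  m <- max(abs(x))                                # the max-based scale suggested above
  if (m == 0) return(TRUE)                        # both endpoints exactly zero
  abs(diff(x / m)) < .Machine$double.eps
}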

> -Winston

Brian Diggs

Jun 5, 2012, 6:48:01 PM
to Hadley Wickham, Winston Chang, ggplo...@googlegroups.com
On 6/5/2012 3:10 PM, Hadley Wickham wrote:
> I wonder if the problem is that zero_range simply shouldn't be used
> when calculating the scale limits - it's more obviously useful in the
> other places where it's used (detecting overlaps in position
> adjustments and when computing the resolution)

Maybe that is the best answer.

This also made me realize that I did not test my pull request against
ggplot. It passed all the tests in scales, but I didn't check that it
didn't break anything in ggplot. In a quick count, zero_range is called
directly in 5 places in ggplot2.

> Hadley