comparison


Stefan Karpinski

May 18, 2012, 3:22:57 PM
to Julia Dev
There are a lot of different notions of equality and ordering floating around. I just wanted to start a thread for clarifying, discussing, and making some decisions about this subject. I've made a spreadsheet (!!) of the different notions that I can think of:


We've recently been encountering a lot of confusion and trouble over the following pairs of equality comparisons:
  • intuitive & numeric equality (row 1) vs. element-wise array equality (row 2)
  • universal array equality (row 3) vs. strict equality (row 4)
The first pair causes trouble because we're often tempted to write `a == b` in generic boolean contexts to check for the equality of two objects, but when one of those objects is an array (or other container), the result is a similar container and our generic code blows up. Usually what we want is universal array equality (row 3) — `all(a == b)` — but sometimes we want strict equality (row 4) — `isequal(a,b)`.

Which leads to our second confusing pair — precisely those two equality operators. They're very, very similar, which is what causes the confusion. It's tempting to use isequal even when you mean `all(a == b)` because isequal has a short-circuiting implementation, potentially saving time, whereas `all(a == b)` computes a temporary array of booleans and then reduces over it, which is lousy.
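For concreteness, here's the failure mode and both fixes at the REPL — shown with modern Julia's `.==` syntax, which behaves the way `==` on arrays did at the time of this thread:

    a = [1, 2, 3]
    b = [1, 2, 3]

    a .== b        # elementwise: Bool[1, 1, 1] — a container, not a Bool

    # so generic code like `if a .== b ... end` dies with
    # "non-boolean used in boolean context"

    all(a .== b)   # universal array equality: true, but builds a temporary
    isequal(a, b)  # strict equality: true, and free to short-circuit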

There are similar issues with inequality/ordering operators. Note that I express the inequalities in terms of <= and its analogues because that's how math typically talks about orderings, although our core operations are defined in terms of the strict version: < and isless.

So the open questions are whether we should introduce/change operators and/or syntax for various comparisons. Major considerations are:
  1. Make it easier to do generic programming.
  2. Provide convenient fast alternatives to constructs like `all(a == b)` and `all(a <= b)`.
  3. Potentially alleviate confusion by collapsing multiple rows into one?
The last point is particularly tempting in the case of universal array equality and strict equality, but the corresponding universal array inequality and strict ordering are very, very different — the former is a very partial order on arrays, whereas the latter uses lexicographic ordering and is total. This relates to generic programming as well, because when you write `a <= b` where the operands are arrays, do you generically want to ensure that all the elements in a are ≤ the corresponding elements in b, or do you want lexicographical ordering of arrays? For sorting, you want lexicographical ordering. For other situations, you probably want the all-pairwise-comparison behavior.
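To make the gap concrete (a sketch in modern Julia, where isless on vectors is the lexicographic total order — not a proposal):

    a = [1, 100]
    b = [2, 1]

    all(a .<= b)   # false — elementwise, a and b are simply incomparable
    isless(a, b)   # true — lexicographically a sorts before b, since 1 < 2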

Feedback and thoughts, please. We really need to straighten this out.

Stefan Karpinski

May 18, 2012, 3:33:15 PM
to Julia Dev
As if that wasn't enough text, here's some clarification about isequal and isless that I recently wrote in a conversation with Carlo:

The isequal function is an equivalence relation on all objects. The == function is not an equivalence relation: it doesn't always return a Bool and it's not reflexive since NaN != NaN. People could introduce other non-equivalence behaviors, although obviously we want it to remain at least symmetric and mostly reflexive and transitive. The NA object is a good candidate for breaking the equivalenceness of ==: we probably want NA == NA to be NA.

The isless function is a strict partial order, whereas < isn't, for the same reasons that == isn't an equivalence relation. Moreover, the relation (a,b)->!isless(a,b)&!isless(b,a) is a partial equivalence relation that is compatible with isequal(a,b) in the sense that whenever the former is defined, it agrees with the latter, and whenever it isn't, the latter is false.

To make matters even more complicated, isequal is intimately related to hashing: one of its core purposes is to determine which elements are hashed the same way. Similarly, isless is the defining operator for sorting things. So isequal(NaN,NaN), for example, and isequal(1,1.0), but interestingly, !isequal(-0.0,0.0) — because -0.0 sorts before 0.0, implying that isless(-0.0,0.0) must be true, and therefore isequal(-0.0,0.0) cannot be true. Therefore neither the == relation nor the isequal relation is strictly finer than the other:
  • NaN != NaN but isequal(NaN,NaN)
  • -0.0 == 0.0 but !isequal(-0.0,0.0).
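At the REPL (these still hold in today's Julia):

    julia> NaN == NaN
    false

    julia> isequal(NaN, NaN)
    true

    julia> -0.0 == 0.0
    true

    julia> isequal(-0.0, 0.0)
    false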
Sigh.

Stefan Karpinski

May 18, 2012, 3:46:08 PM
to Julia Dev
I've filled in some possible syntaxes, but they're in no way written in stone.

Jeff Bezanson

May 18, 2012, 4:02:31 PM
to juli...@googlegroups.com
== is only finer in the sense that it means "equal and ordered" where
isequal only deals with "equal". So isequal's notion of equality is
strictly finer, it just doesn't consider orderedness. It seems odd
when you put it that way, but it's the most useful behavior.

I would be fine with !isequal(1,1.0). We need that if 1 and 1.0 behave
differently to a meaningful extent. !isequal(-0.0,0.0) because they
give different answers for 1/x. Also !isequal(-0.0, 0) because there
is no integer negative zero. Is it worth shortening "x==1 &&
isinteger(x)" to "isequal(x,1)"? If so, it is consistent with
disallowing floats as indexes.

John Cowan

May 18, 2012, 4:41:33 PM
to juli...@googlegroups.com
On Fri, May 18, 2012 at 3:22 PM, Stefan Karpinski <ste...@karpinski.org> wrote:

> Feedback and thoughts, please. We really need to straighten this out.

Here's what I think (and this is radical, so it may need a couple of
rereads -- ask me questions if it seems too funky):

0) Julia needs to introduce a notion of *mutable* vs. *immutable*
composite types. Alternatively, specific fields of a composite type
can be marked mutable or immutable, in which case a composite is
immutable only if all its fields are. All bits types are immutable.
This distinction will be essential for distributed processing anyway
-- you can copy immutable types but not mutable ones, and it also
motivates the rest of this account.

1) The distinction between object equality and identity is a Bad
Thing. There should only be the relationship that Henry Baker calls
EGAL <http://www.pipeline.com/~hbaker1/ObjectIdentity.html> and I'll
call identity. This means that bits objects are compared by bits,
immutable composite objects recursively by their fields, and mutable
objects are compared by address. This means that two objects are
never just contingently identical, because they are either the same
value or the same memory location.

2) Numerical equality is different from identity. Numerical equality
should use the familiar mathematical symbols = < > etc. Identity
should use something else, preferably avoiding the word "equal"; the
word "is" reads well.

3) Identity is an equivalence relation (symmetric, reflexive,
transitive) on all objects with the possible exception of NaN (just
because there are so many possible internal representations of NaN).
Numerical equality applies to numbers only and is symmetric,
reflexive, and transitive with the exceptions of NaN and NA.

4) Numerical equality should not distinguish between any of 0, 0.0,
and -0.0. Identity should distinguish all three. From this and the
trichotomy law (which doesn't quite apply because of NaN, but never
mind) it follows that -0.0 < 0.0 is false. This makes sense, since
even though most of the exact range that -0.0 represents is less than
most of the exact range that 0.0 represents, they do overlap at
exactly 0.

5) Relationships can be lifted over arrays in three different ways
(where R is a relationship like = or < or is):

a) Some element in one array is R at least one element in the other =>
the arrays are R.

b) Each element in one array is R to the corresponding element in the
other => the arrays are R.

c) Each element in one array is R to some (not necessarily
corresponding) element in the other => the arrays are R.

There should be "adverbs" to modify each kind of for at least the
first two, preferably the third as well. Explicit is better than
implicit: "some x = some y, each x = some y/some x = each y, each x =
each y" would be impossible to confuse. This also works well in the
degenerate case of a scalar.
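As a sketch, those three liftings written as plain higher-order functions (hypothetical names, modern Julia syntax — nothing here is proposed API):

    # (a) some x R some y: at least one cross pair satisfies R
    some_some(R, a, b) = any(R(x, y) for x in a, y in b)

    # (b) each x R corresponding y: pairwise over matching positions
    each_each(R, a, b) = length(a) == length(b) && all(map(R, a, b))

    # (c) each x R some y: every element of a relates to something in b
    each_some(R, a, b) = all(any(R(x, y) for y in b) for x in a)

    some_some(==, [1, 2], [3, 1])   # true: 1 == 1
    each_each(<=, [1, 2], [2, 3])   # true: 1 <= 2 and 2 <= 3
    each_some(==, [1, 1], [1, 9])   # true: every element of a matches the 1 in b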

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

david tweed

May 18, 2012, 5:16:23 PM
to julia-dev
On May 18, 8:22 pm, Stefan Karpinski <ste...@karpinski.org> wrote:

> The first pair causes trouble because we're often tempted to write `a == b`
> in generic boolean contexts to check for the equality of two objects, but
> when one of those objects is an array (or other container), the result is a
> similar container and our generic code blows up. Usually what we want is
> universal array equality (row 3) — `all(a == b)` — but sometimes we want
> strict equality (row 4) — `isequal(a,b)`.

On a related note, I wonder if it's possible to somehow make . a
lifting operator that turns an

oplus :: a x a -> b

into a

.oplus :: Container1[a] x Container2[a] -> max(Container1, Container2)[b]

(so that if it had a standard type it'd be

. :: (a x a -> b) -> (Container1[a] x Container2[a] -> max(Container1, Container2)[b])

) Then it might be easier for an IR optimizer to recognise and
optimise the patterns.

I imagine that actually implementing that literally might be a parsing
nightmare, but maybe it could define .+ = lift (+).
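In Julia terms that's just a higher-order function — a sketch, where lift is hypothetical and broadcast is the machinery modern Julia eventually grew for this role:

    lift(op) = (a, b) -> broadcast(op, a, b)   # scalar op -> elementwise op

    dotplus = lift(+)          # the ".+ = lift (+)" idea
    dotplus([1, 2], [3, 4])    # [4, 6]
    dotplus([1, 2], 10)        # [11, 12] — broadcast handles scalars too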

> Which leads to our second confusing pair — precisely those two equality
> operators. They're very, very similar, which is what causes the confusion.
> It's tempting to use isequal even when you mean `all(a == b)` because
> isequal has a short-circuiting implementation, potentially saving time,
> whereas `all(a == b)` computes a temporary array of booleans and then
> reduces over it, which is lousy.

This ties in with the question of the continuum between debugging and
optimization: it would be nice to be able to write

smaller = a .<= b
if all(smaller) ....

(with smaller not referred to anywhere else) and have something that
you could set a breakpoint on to view smaller in an unoptimized build,
while not generating the array in an optimized function.

With regard to short-circuiting you presumably mean "the
implementation is entitled to pick any arbitrary commutation/
reassociation and stop that evaluation once the result is
determined" (since that allows implicit parallelisation if the runtime
can do it).

Jeff Bezanson

May 18, 2012, 5:18:52 PM
to juli...@googlegroups.com
Yes, I like EGAL and I think we want to move to this model of mutable
vs. immutable composites.

The biggest source of confusion is probably notation, in the presence
of "operator overloading". == is numerical equality, but is
traditionally lifted to arrays in some arbitrarily-chosen way when
given arrays as arguments. I've never been a huge fan of this but I
hoped we could support it in addition to a saner model of equality.

The only genuinely "built in" notions of equality in current julia are
is() and bit-level comparison. is() is broken for bits types; it's
more or less like eq in common lisp. From that perspective, does this
argument only really apply to is()? Should is() implement EGAL, the
standard library provide numerically-appropriate == and <, then
declare victory and write off the rest as "details"?

One cannot totally avoid contingent equality. You might want to
independently build two arrays or tables, then ask if they ended up
with the same contents. The programmer may also choose to hallucinate
things such as "unenforced immutable arrays". My current understanding
is that people can write/use such functions if they want, as long as
they are backed up by solid is() and ==, so one can write "map(is, x,
y)", "map(==, x, y)", etc. The question is what the convention should
be around such functions. For example, tuples are immutable, but how
should they be compared? Writing e.g. "x[1]==y[1] && x[2]==y[2]" is
the only safe thing, but people want a less-verbose default to turn
to.
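Spelled out, that less-verbose default might look like this (a sketch; modern Julia's Base does end up defining == on tuples elementwise, roughly along these lines):

    # hypothetical helper: elementwise equality over two tuples
    function tuple_eq(x::Tuple, y::Tuple)
        length(x) == length(y) || return false
        for i in 1:length(x)
            x[i] == y[i] || return false   # bail out on the first mismatch
        end
        return true
    end

    tuple_eq((1, "a"), (1, "a"))   # true
    tuple_eq((1, 2), (1, 3))       # false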

Jeffrey Sarnoff

May 19, 2012, 1:19:03 PM
to juli...@googlegroups.com
EGAL is good .. seems better with recognition of 3 sorts {immutable, stable (persistent), mutable} (presupposes stableness is deducible or inheritable & enforced).
Generally, my syntactic preference would give the simpler (less adorned) relational syntax to the more immediate relational gestalt,
then use embellished or augmented relational syntax to support other common intents:
    `thisThing > thatThing`  ascertains something about the considered-together nature of thisThing (itself) and of thatThing (itself)
    `thisThing_itsParts .> thatThing_itsParts`  asks the same of their corresponding parts

Jeffrey Sarnoff

May 22, 2012, 3:05:41 PM
to juli...@googlegroups.com
@StefanKarpinski explains: ...tempted to write `a==b` in generic boolean contexts to check for the equality of two objects, but when one of those objects is an array (or other container), the result is a similar container and our generic code blows up...

Knowing nothing, writing `a==b` where one or both is a container, I would expect the boolean test to examine type consistency, then shape(a)==shape(b), then either equality of nested type&shape or element matching.  Also, sometimes, it is useful to ask e.g. shape(a)>=shape(b).  Why should it blow up generic code?
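That expectation can be written down directly — shape first, then elements (a sketch of a boolean container ==; modern Julia's == on arrays ended up behaving essentially this way):

    function array_eq(a::AbstractArray, b::AbstractArray)
        size(a) == size(b) || return false    # type & shape consistency
        for (x, y) in zip(a, b)
            x == y || return false            # then element matching
        end
        return true
    end

    array_eq([1 2; 3 4], [1 2; 3 4])   # true
    array_eq([1, 2, 3], [1 2 3])       # false: same elements, different shape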

R. Michael Weylandt

May 22, 2012, 3:14:46 PM
to juli...@googlegroups.com
On Tue, May 22, 2012 at 3:05 PM, Jeffrey Sarnoff
<jeffrey...@gmail.com> wrote:
> @StefanKarpinski explains ..

Stefan is apparently now a macro -- that's how you know when you've
made it big ;-)

Michael

Stefan Karpinski

May 22, 2012, 4:24:05 PM
to juli...@googlegroups.com
I do often generate code before run-time.

Jeff Bezanson

May 22, 2012, 6:42:24 PM
to juli...@googlegroups.com
It blows up if the result is sometimes a boolean and sometimes an
array. In the latter case, you need to do extra work like any() or
all() to reduce the answer to a single decision. Bottom line,
mathematically you can ask if two matrices are equal, and the result
is a boolean, not another matrix.


Toivo Henningsson

May 24, 2012, 8:41:51 AM
to juli...@googlegroups.com

I also like EGAL, and I agree that you can't avoid contingent comparison completely. I also think this is a really hard question.
I will try to share some of my thoughts, from the starting point that
* EGAL is made the finest equality predicate in the language (named `is`)
* .== .<= etc are used for elementwise comparison of arrays

I think there are some important distinctions to make between different
concepts of equality/ordering (I know some have come up already):
* Invariant / Contingent:
    Will eq(x,y) always be constant over the lifetime of the values x, y?
* Coarseness: To what extent does eq(x,y) group together values that
    are similar but not the same?
* Whole-value / Multi-value comparison:
    In eq(x,y), are x and y considered to be single values,
    or can they represent some kind of sets of values?

    It seems to me that it is the multivalued case that causes eq(x,y)
    not to be an equivalence relation, or not to be boolean. Examples:
    - A .== B is non-boolean since it compares multiple values
    - NA can be thought of as the set of all possible values, so the
      answer to NA==1 is both yes and no, i e NA.
    - NaNs usually arise when the numeric result would have many (or no?)
      possible values:
      * Inf - Inf has the entire real line as possible value, if you take Inf
        to mean any variable that goes to infinity. (so Inf is also multivalued)
      * Likewise with 0/0, taken as the solution to 0*x = 0
      * I think 1/0 should have been NaN, since 1x = 0 has no solutions.
        (But it gives Inf)
      From this point of view, it would have been preferable if (NaN==x) == NA,
      but there's no changing IEEE 754...


Some thoughts:
* In a given situation, it's preferable to use an equality concept that is as close to EGAL as possible.

* == <= etc (and .== .<= etc) should be considered multivalued. Still, it's nice if they can produce a boolean result whenever that's meaningful. I can't seem to find a case for invariant ==; when you write A == B you probably expect to compare the current values.

* Keys for sorting/hashing etc should be compared using invariant, boolean predicates; e g mutable keys compare by identity.
The cleanest way would be to use EGAL, but then 1 and int8(1) would hash differently. Is this desirable?

* With an invariant comparison for sorting/hashing, is there a motivation for keeping the current (contingent) isequal?

* In my research, I use some symbolic expressions. A simple example as it would look in julia:

    @var x, y
    feasible = (x >= y)

feasible will be a Bool value parametrized by the symbolic variables x, y.
Is it ok to use >= for this kind of thing (we already know >= is not strictly boolean), or should I use something like .>= instead?
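(For concreteness, a minimal sketch of that kind of symbolic comparison — hypothetical types, modern syntax:)

    struct SymVar
        name::Symbol
    end

    struct SymExpr          # an unevaluated relation over symbolic variables
        head::Symbol
        args::Vector{Any}
    end

    Base.:>=(a::SymVar, b::SymVar) = SymExpr(:>=, Any[a, b])

    x, y = SymVar(:x), SymVar(:y)
    feasible = x >= y       # a SymExpr, not a Bool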

 

Kevin Squire

May 24, 2012, 1:37:00 PM
to juli...@googlegroups.com
 
    It seems to me that it is the multivalued case that causes eq(x,y)
    not to be an equivalence relation, or not to be boolean. Examples:
    - A .== B is non-boolean since it compares multiple values
    - NA can be thought of as the set of all possible values, so the
      answer to NA==1 is both yes and no, i e NA.
    - NaNs usually arise when the numeric result would have many (or no?)
      possible values:
      * Inf - Inf has the entire real line as possible value, if you take Inf
        to mean any variable that goes to infinity. (so Inf is also multivalued)
      * Likewise with 0/0, taken as the solution to 0*x = 0
      * I think 1/0 should have been NaN, since 1x = 0 has no solutions.
        (But it gives Inf)

You probably meant 0x = 1 here...

Toivo Henningsson

May 24, 2012, 1:50:19 PM
to juli...@googlegroups.com


On Thursday, May 24, 2012 7:37:00 PM UTC+2, Kevin Squire wrote:

      * I think 1/0 should have been NaN, since 1x = 0 has no solutions.
        (But it gives Inf)

You probably meant 0x = 1 here...

oops :) Yes, of course.

Stefan Karpinski

May 24, 2012, 1:52:02 PM
to juli...@googlegroups.com
Also, changing IEEE 754 semantics is simply not up for debate, regardless of how much or little sense alternate proposals make. Just saying.

Toivo Henningsson

May 24, 2012, 2:00:40 PM
to juli...@googlegroups.com


On Thursday, May 24, 2012 7:52:02 PM UTC+2, Stefan Karpinski wrote:
Also, changing IEEE 754 semantics is simply not up for debate, regardless of how much or little sense alternate proposals make. Just saying.

Yes, that was what I meant.
 

John Cowan

May 24, 2012, 2:11:23 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 8:41 AM, Toivo Henningsson <toiv...@gmail.com> wrote:

>     It seems to me that it is the multivalued case that causes eq(x,y)
>     not to be an equivalence relation, or not to be boolean. Examples:
>     - A .== B is non-boolean since it compares multiple values
>     - NA can be thought of as the set of all possible values, so the
>       answer to NA==1 is both yes and no, i e NA.

Three-valued logic (as used in SQL) is even less intuitive than NaN.
I would stay as far away from it as possible.

>     - NaNs usually arise when the numeric result would have many (or no?)
>       possible values:
>       * Inf - Inf has the entire real line as possible value, if you take
> Inf
>         to mean any variable that goes to infinity. (so Inf is also
> multivalued)
>       * Likewise with 0/0, taken as the solution to 0*x = 0
>       * I think 1/0 should have been NaN, since 1x = 0 has no solutions.
>         (But it gives Inf)

That's because lim 1/x as x goes to 0 approaches infinity. In
floating-point terms, 0.0 means the range from exact 0 (inclusive) to
the smallest representable positive number (exclusive), and Inf means
the range from the largest representable positive number (exclusive)
on up. So when you divide 1 (or any "normal" value) by a very small
quantity, you get a very large quantity. Note that the reciprocal of
the smallest denormal numbers is Inf, because there is no overflow
analogue of denormal numbers.

However, sometimes NaN actually does mean "no such real number". For
example, there is no real number whose cosine is 2, and consequently
acos(2) => NaN. In the complex field, acos(2+0im) => 0.0 +
1.31695789692482im.

> * Keys for sorting/hashing etc should be compared using invariant, boolean
> predicates; e g mutable keys compare by identity.
> The cleanest way would be to use EGAL, but then 1 and int8(1) would hash
> differently. Is this desirable?

Yes, it is. Consider the case of memoizing functions, ones which
store already-computed results in a hash table so that when you call
the same function again, the result is just looked up. A
general-purpose memoization facility wants to use EGAL as the identity
function for keys in the table. A function that is sensitive to the
type (not just the value) of its argument should still be memoizable.
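A small sketch of such a memoizer (hypothetical helper; the cache key includes the argument's type explicitly, standing in for an EGAL-keyed table, since Julia's Dict keys on isequal):

    function memoize(f)
        cache = Dict{Any,Any}()
        return x -> get!(cache, (typeof(x), x)) do   # type-sensitive key
            f(x)
        end
    end

    square = memoize(x -> x * x)
    square(2)     # computed, cached under (Int, 2)
    square(2.0)   # cached separately under (Float64, 2.0)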

> * With an invariant comparison for sorting/hashing, is there a motivation for
> keeping the current (contingent) isequal?

The trouble is that what counts as contingent equality is different
for each different data structure. So if we keep it, we should be
clear that people are expected to add new methods for
equality-testing.

Francois Pepin

May 24, 2012, 2:34:10 PM
to juli...@googlegroups.com
>>     It seems to me that it is the multivalued case that causes eq(x,y)
>>     not to be an equivalence relation, or not to be boolean. Examples:
>>     - A .== B is non-boolean since it compares multiple values
>>     - NA can be thought of as the set of all possible values, so the
>>       answer to NA==1 is both yes and no, i e NA.
>
> Three-valued logic (as used in SQL) is even less intuitive than NaN.
> I would stay as far away from it as possible.

Yes, but this is something that we're going to deal with anyway. A lot
of data analysis needs to be able to deal with missing values.

The current idea (as per Harlan's DataFrame) is to have standalone NA
objects that overload the various comparison operators. I'd
personally be happier if NAs were more tightly integrated, because it
means that their behaviour would have been properly vetted over a
wider range of situations. I understand that it comes with complexity
and performance costs even in cases where no NAs are present.

Keeping NAs as an add-on bolted onto the DataFrame objects should work
for most cases and it could propagate as necessary, but not having any
type of NAs would make life a lot harder for a number of people.

All of that to say that this is an issue that will need to be
tackled, even if it's not a core element of the language.

Francois

Toivo Henningsson

May 24, 2012, 2:50:36 PM
to juli...@googlegroups.com


On Thursday, May 24, 2012 8:11:23 PM UTC+2, John Cowan wrote:

>     It seems to me that it is the multivalued case that causes eq(x,y)
>     not to be an equivalence relation, or not to be boolean. Examples:
>     - A .== B is non-boolean since it compares multiple values
>     - NA can be thought of as the set of all possible values, so the
>       answer to NA==1 is both yes and no, i e NA.

Three-valued logic (as used in SQL) is even less intuitive than NaN.
I would stay as far away from it as possible.

I'm not arguing for or against it, just trying to point out what seems like the common denominator for the cases when equality-like operators are sometimes non-boolean or not equivalence relations (reflexive, symmetric and transitive).
 
>     - NaNs usually arise when the numeric result would have many (or no?)
>       possible values:
>       * Inf - Inf has the entire real line as possible value, if you take
> Inf
>         to mean any variable that goes to infinity. (so Inf is also
> multivalued)
>       * Likewise with 0/0, taken as the solution to 0*x = 0
>       * I think 1/0 should have been NaN, since 1x = 0 has no solutions.
>         (But it gives Inf)

That's because lim 1/x as x goes to 0 approaches infinity.  In
floating-point terms, 0.0 means the range from exact 0 (inclusive) to
the smallest representable positive number (exclusive), and Inf means
the range from the largest representable positive number (exclusive)
on up.  So when you divide 1 (or any "normal" value) by a very small
quantity, you get a very large quantity.  Note that the reciprocal of
the smallest denormal numbers is Inf, because there is no overflow
analogue of denormal numbers.

That makes some sense, but usually when I write 0.0, I mean +/- 0, without a preference for either sign. Same for 1.0-1.0. But I guess that's really beside the point here, as IEEE 754 is apparently not up for discussion :)
 
However, sometimes NaN actually does mean "no such real number".  For
example, there is no real number whose cosine is 2, and consequently
acos(2) => NaN.  In the complex field, acos(2+0im) => 0.0 +
1.31695789692482im.

> * Keys for sorting/hashing etc should be compared using invariant, boolean
> predicates; e g mutable keys compare by identity.
> The cleanest way would be to use EGAL, but then 1 and int8(1) would hash
> differently. Is this desirable?

Yes, it is.  Consider the case of memoizing functions, ones which
store already-computed results in a hash table so that when you call
the same function again, the result is just looked up.  A
general-purpose memoization facility wants to use EGAL as the identity
function for keys in the table.  A function that is sensitive to the
type (not just the value) of its argument should still be memoizable.

I agree.
In more loosely typed languages, I guess 1 and 1.0 are usually hashed together since you never really know what types of numbers will be used as keys. But in Julia, you can easily use a specific numeric type for the keys of your Dict if you want to avoid type influencing hashing. I think it's a pretty small gotcha, and the ability to do things like general-purpose memoization is quite compelling.

> * With an invariant comparison for sorting/hashing, is there a motivation for
> keeping the current (contingent) isequal?

The trouble is that what counts as contingent equality is different
for each different data structure.  So if we keep it, we should be
clear that people are expected to add new methods for
equality-testing.

Maybe I'm not quite clear on the terminology.
My understanding is that EGAL does recursive comparison of values as long as they are immutable; when it encounters a mutable cell it is compared based on identity, and the recursion stops.
I would then define invariant comparison to be anything that is no finer than EGAL, and stops in the same way at mutable cells.
Anything else, i e that does dereference mutable cells, is contingent.
Or is this an oversimplification?
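(In code, that understanding might look like the following — a hypothetical egal, modern syntax; === is what Julia ultimately adopted for this role:)

    function egal(x, y)
        typeof(x) === typeof(y) || return false
        isbits(x)    && return x === y   # bits types: compare by bits
        ismutable(x) && return x === y   # mutable cells: identity; recursion stops
        # immutable composites: recurse field by field
        return all(egal(getfield(x, i), getfield(y, i)) for i in 1:nfields(x))
    end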

John Cowan

May 24, 2012, 3:13:08 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 2:50 PM, Toivo Henningsson <toiv...@gmail.com> wrote:

> That makes some sense, but usually when I write 0.0, I mean +/- 0, without a
> preference for either sign. Same for 1.0-1.0. But I guess that's really
> beside the point here, as IEEE 754 is apparently not up for discussion :)

Indeed, since the sole point of supporting it in a programming
language is speed. Exact rational arithmetic (that is, arithmetic
closed over rational operations and limited only by memory size) would
be nice-to-have, but not essential in a scientific programming
language, since it can be extraordinarily slow. (Notoriously, some
numerical Lisp programs can be speeded up by two orders of magnitude
by inserting a single judiciously placed decimal point in order to
force floating-point arithmetic throughout.) Julia's rationals, which
allow only limited precision, are a reasonable compromise.

> Maybe I'm not quite clear on the terminology.
> My understanding is that EGAL does recursive comparison of values as long as
> they are immutable; when it encounters a mutable cell it is compared based
> on identity, and the recursion stops.
> I would then define invariant comparison to be anything that is no finer
> than EGAL, and stops in the same way at mutable cells.
> Anything else, i e that does dereference mutable cells, is contingent.
> Or is this an oversimplification?

No, that's exactly right. See my other post for what I think
contingent array equality (which is an important notion once you get
it disentangled from EGAL and from arithmetic equality) should look
like.

Jeff Bezanson

May 24, 2012, 3:25:31 PM
to juli...@googlegroups.com

For purposes of forward error analysis at least, floating point numbers are not ranges. If f(0.0) can give the right answer for zero, it must not give the answer for the smallest denormal, or eps/2, or anything else.

John Cowan

May 24, 2012, 3:42:59 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 3:25 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:

> For purposes of forward error analysis at least, floating point numbers are
> not ranges. If f(0.0) can give the right answer for zero, it must not give
> the answer for the smallest denormal, or eps/2, or anything else.

Well, every float represents a range of exact numbers: there is no
getting away from that. float64(3.141592653589793) represents the
exact rational number 3141592653589793/1000000000000000, but also
31415926535897936/10000000000000000 and
31415926535897942/10000000000000000 and infinitely more exact rational
numbers; what is more, it also represents pi and uncountably many
irrationals. The only anomalies are that the ranges of 0.0 and -0.0
overlap at exact 0, that Inf and -Inf have only one bound, and that
NaN is the union of two degenerate ranges, the "every real number"
range and the "no real number" range, depending on how it was
produced.

So f(0.0) will necessarily give the same value as f(1e-324), because
both have the same representation in float64, namely 0.0. So since
the reciprocal of exact 1e-324 is exact 1e324 (which is represented as
Inf), the reciprocal of 0.0 must be Inf also.

Jeff Bezanson

May 24, 2012, 3:56:20 PM
to juli...@googlegroups.com

Sure infinitely many numbers round to a given float, but once the rounding is done the float represents only that number. You are not free to make up extra digits unless you only care about reverse error.

Stefan Karpinski

May 24, 2012, 4:05:15 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 2:34 PM, Francois Pepin <fpe...@gmail.com> wrote:

Yes, but this is something that we're going to deal with anyway. A lot
of data analysis needs to be able to deal with missing values.

The current idea (as per Harlan's DataFrame) is to have standalone NA
objects that overwrite the various comparison operators. I'd
personally be happier if NAs were more tightly integrated, because it
means that their behaviour would have been properly vetted over a
wider range of situations. I understand that it comes with complexity
and performance costs even in cases where no NAs are present.

Keeping NAs as an add-on bolted onto the DataFrame objects should work
for most cases and it could propagate as necessary, but not having any
type of NAs would make life a lot harder for a number of people.

All of that to say that this is an issue that will need to be
tackled, even if it's not a core element of the language.

I'm not quite sure I get what you're saying here.

The current proposal for NA and DataFrames is that NA is an actual singleton value, much like Nothing (used to indicate that an expression has nothing interesting to return). It can be used like any other value. When building DataFrames, one uses NA to indicate places where no data is available. Also, when pulling single values out of DataFrames, NA is what you get when accessing a value that isn't available. However, NA is never stored in the DataFrame object. The fact that a value is not available is stored internally with a mask, allowing efficient computations on both data and NA masks. To the casual user, however, it *looks* like DataFrames actually contain NA values.

Is there some part of that approach that you object to? If so, what, specifically?

How NA interacts with values — in terms of addition, equality, etc. — simply needs to be reflected in how bitmask operations are done. It's not very hard to do; we do, however, need to think through what that behavior should be. Following R's lead there seems sensible to me, and R treats NA as poisonous: NA == NA => NA, NA == 1 => NA, etc. That can be implemented with a simple ORing of NA masks.
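A minimal sketch of that masked representation and the OR rule (hypothetical type, modern syntax — not the actual DataFrame code):

    struct NAVec{T}
        data::Vector{T}
        na::BitVector        # true where the value is "NA"
    end

    # elementwise ==: compare the data, then poison the result wherever
    # either operand was NA, by ORing the masks
    naeq(a::NAVec, b::NAVec) = NAVec{Bool}(Vector(a.data .== b.data), a.na .| b.na)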

Stefan Karpinski

May 24, 2012, 4:08:33 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 3:13 PM, John Cowan <johnw...@gmail.com> wrote:
 
Julia's rationals, which allow only limited precision, are a reasonable compromise.

Rational{BigInt} works too:

julia> BigInt("1287318263871263871236817236")//BigInt("8768762387462873468273648723468273") 
1287318263871263871236817236//8768762387462873468273648723468273

Speaking of which, should we maybe move bigint.jl into base?

Jeffrey Sarnoff

May 24, 2012, 4:18:15 PM
to juli...@googlegroups.com

from R help doc:
<key point is that R has 5 different (and differently represented) NA values, one for each core type>
<they look alike, which has caused some confusion, but it allows NA everywhere typeright>

NA is a logical constant of length 1 which contains a missing value indicator.
NA can be coerced to any other vector type except raw.
There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ 
of the other atomic vector types which support missing values: all .. are reserved words in the 
R language.
The generic function is.na indicates which elements are missing.
The generic function is.na<- sets elements to NA.

Stefan Karpinski

May 24, 2012, 4:23:52 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 4:18 PM, Jeffrey Sarnoff <jeffrey...@gmail.com> wrote:

from R help doc:
<key point is that R has 5 different (and differently represented) NA values, one for each core type>
<they look alike, which has caused some confusion, but it allows NA everywhere typeright>

I think we're going to have to depart from that because NA is never going to be a valid Int value, for example. I'm not sure what cramming NA into each core type gives you other than slowing everything down. I don't really see any down side to just having a single NA value.

Jeffrey Sarnoff

May 24, 2012, 5:05:09 PM
to juli...@googlegroups.com
oh, I do not like what R does with NA -- they multitype it for internal convenience --
afaic 'not available' is a repeatedly recognizable, but otherwise undifferentiated, state-of-affairs.
There should be only one -- or, better still, none [as you suggest, via masking], as absence best reflects this entity.

Francois Pepin

May 24, 2012, 5:06:22 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 1:05 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
>
> I'm not quite sure I get what you're saying here.
>
> The current proposal for NA and DataFrames is that NA is an actual singleton
> value, much like Nothing (used to indicate that an expression has nothing
> interesting to return). It can be used like any other value. When building
> DataFrames, one uses NA to indicate places where no data is available. Also,
> when pulling single values out of DataFrames, NA is what you get when
> accessing a value that isn't available. However, NA is never stored in the
> DataFrame object. The fact that a value is not available is stored
> internally with a mask, allowing efficient computations on both data and NA
> masks. To the casual user, however, it *looks* like DataFrames actually
> contain NA values.
>
> Is there some part of that approach that you object to? If so, what,
> specifically?

I don't specifically object to this. I'm mostly saying that this would
complicate things when stepping out of the DataFrame. Who has the
responsibility of dealing with an NA value if one wants to, for
example, fit a distribution to a set of values that includes an NA, or
a new machine learning algorithm? R puts NAs front and center, making
sure that all the basic code is aware and can deal with NAs
appropriately. In this case, either we'd need for the function to be
dispatched with the NA type or have the programmer check for NAs and
feed in the appropriate non-NA'd data.

I think the current approach is interesting and it makes a lot of
sense for Julia. I just wanted to point out that not all kinks have
been resolved yet.

For example, what is the length of NA? In a string context, having 0
makes sense, but as an integer, we should count NA as 1. One way is to
have different types of NAs. (I haven't checked Harlan's branch
recently, so maybe this has changed since then.)

> How NA interacts with values — in terms of addition, equality, etc. —
> simply needs to be reflected in how bitmask operations are done.
> It's not very hard to do; we do, however, need to think through what that
> behavior should be. Following R's lead there seems sensible to me, and R
> treats NA as poisonous: NA == NA => NA, NA == 1 => NA, etc. That can be
> implemented with a simple ORing of NA masks.

That was the point of my post: this is something that we can't ignore.
I would tend to agree with R in this case. It's annoying to deal with,
but such is the problem with missing data.

Francois

Stefan Karpinski

May 24, 2012, 5:06:30 PM
to juli...@googlegroups.com
Right, the NA object is really just a user-facing convenience. Internally, no NA objects are used. Seems like we're on the same page here.

John Cowan

May 24, 2012, 6:04:15 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 4:08 PM, Stefan Karpinski <ste...@karpinski.org> wrote:

> Rational{BigInt} works too:

Ah. Well, bigints aren't in the manual, so I didn't know they existed.

Stefan Karpinski

May 24, 2012, 6:11:18 PM
to juli...@googlegroups.com
Right, they're not in the base distribution yet, but still in extras. Hence my followup question :-)

John Cowan

May 24, 2012, 10:36:42 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 3:56 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:

> Sure infinitely many numbers round to a given float, but once the rounding
> is done the float represents only that number. You are not free to make up
> extra digits unless you only care about reverse error.

Granted, but when you are talking about 0.0 and -0.0, you have to take
their signs into account in two places:

The sign rule for division requires that 1.0/0.0 => +Inf, but 1.0/-0.0 => -Inf.

Log and other transcendental functions based on it have a branch cut
at zero, so log(0.0+0.0im) => -Inf+0.0im, but log(-0.0-0.0im) =>
-Inf-3.141592653589793im. (Julia currently reports the real part as
NaN, which is a bug.)

Harlan Harris

May 28, 2012, 9:16:48 PM
to juli...@googlegroups.com
Just to follow up on this thread, after returning from a vacation... Yes, in my stab at DataVec/DataFrames, NAs are only real in user-facing code. It's worth noting that I've now defined an AbstractDataVec that is currently implemented by two types, DataVec, which uses a Boolean mask to deal with missing data, and PooledDataVec. PooledDataVec is a generalization of the way that R implements vectors of strings/factors, with a single global string pool and indexes into that pool. What I did is to have a PooledDataVec be indexes into a vector of the parameterized type, and missing data represented as 0s in the index. I haven't done any benchmarking, but it should take much less memory in some very common cases, plus if ref counts are used, you get unique() for free. The code that loads CSVs from disk uses PooledDataVec for columns of non-numeric types, if there aren't too many distinct values. Very incomplete code, though...
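Roughly, the layout described above (field names hypothetical):

    struct PooledDataVec{T}
        refs::Vector{Int}   # indexes into pool; 0 marks missing data
        pool::Vector{T}     # the distinct values
    end

    # unique() really is nearly free: the pool already holds the distinct values
    distinct(d::PooledDataVec) = copy(d.pool)

    # element access, with `nothing` standing in for NA in this sketch
    getvalue(d::PooledDataVec, i::Int) =
        d.refs[i] == 0 ? nothing : d.pool[d.refs[i]]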

(Another DataVec type I'd like to play with at some point would be BlockedDataVec, which would map onto disk blocks for eventual memory-map support, and would also allow operations such as deletion of rows with O(k) instead of O(n) cost, where k is the number of elements in a block.)

Philosophically, I think that Julia should not take R's approach, in which statistical-type data is the fundamental data structure, and all of the heavy number-crunching works on those data structures. Instead, I think it would be better if data is loaded into DataFrame-like structures, manipulated/merged/subset/reshaped as necessary, then translated into a raw Julia numeric matrix for the actual computation. That translation step would involve the appropriate action for NA-handling, whether that's filtering out those rows, replacing NAs with something else, or even something complex like multiple imputation.

 -Harlan