# Distinguishing between Null and NA

### John Myles White

Jan 8, 2015, 9:05:36 PM
After considering Milan's arguments in https://github.com/JuliaLang/julia/pull/9446, I've come to feel that we should consider having both a concept like R's NULL and a concept like R's NA in Julia. The distinction would allow us to express what I've often called the ontological and epistemological interpretations of missingness, which are already starting to be conflated by proposals to use Nullables in Base for things like the return value of parseint(). In particular, I think that the use cases for NULL and NA lead to opposite preferences for the automatic propagation of nullability. I've found my own views about proper behavior to be less firm than I'd like, so I'd like to split the concept of Nullable into two distinct concepts that properly implement the contradictory interpretations of missing values.

The ontological interpretation says that an object x = Nullable{T}() is equivalent to an assertion that x, which might have been of type T, in fact has no value at all and cannot be safely used in downstream computations. One way to think of a Nullable object is as a pair of a value and an error code, where the error code indicates whether the value is nonsense or not. Since the Nullable object represents an error, we should not propagate nullability by default: programmers should be encouraged to handle errors immediately rather than defer their resolution.

The epistemological interpretation says that an object x = Nullable{T}() is equivalent to an assertion that x, which is in fact of type T, has a value that is unknown to the person who might have observed it. One way to think of a Nullable object is as a pair of a value and a statement about the knowledge of the observer who recorded the value. Since a Nullable object represents uncertainty rather than an error, one can propagate uncertainty forward automatically -- although the uncertainty that is propagated is always a strict overestimate of the amount of uncertainty present in the system, because computations whose results are invariant across all values of type T will still propagate a null forward, despite those results being known with complete certainty. (The current shaky implementation of three-valued logic for NAtype is one example of a failed attempt to exploit this kind of invariant.)
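As a rough illustration of the two interpretations, here is a minimal self-contained sketch in Julia. All names here (`Checked`, `Propagating`, `unwrap`, `lift`) are invented for this example; they are not the Base API being discussed:

```julia
# Ontological reading: a missing value is an error; using it must fail loudly.
struct Checked{T}
    hasvalue::Bool
    value::T
    Checked{T}() where {T} = new{T}(false)
    Checked(x::T) where {T} = new{T}(true, x)
end
unwrap(x::Checked) = x.hasvalue ? x.value : error("value is missing")

# Epistemological reading: a missing value is unknown; uncertainty propagates.
struct Propagating{T}
    hasvalue::Bool
    value::T
    Propagating{T}() where {T} = new{T}(false)
    Propagating(x::T) where {T} = new{T}(true, x)
end
# Lift f over wrapped arguments of a common type T (assumes f(T...)::T).
function lift(f, xs::Propagating{T}...) where {T}
    all(x -> x.hasvalue, xs) ? Propagating(f((x.value for x in xs)...)) :
                               Propagating{T}()
end

lift(+, Propagating(1), Propagating(2))      # a wrapped 3
lift(+, Propagating(1), Propagating{Int}())  # still missing; no error raised
```

Under the ontological reading you would be expected to call `unwrap` (and handle the error) at the point of use; under the epistemological reading, nulls flow through `lift` silently.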

Obviously this proposal adds complexity to Julia. But I think it simplifies the interpretation of missingness by enforcing two distinct interpretations of missing values and relieving the tension between them. And it makes it much easier to resolve the debates from #9446 about the automatic propagation of missingness, since it provides a functional programming approach for each interpretation.

-- John

### ele...@gmail.com

Jan 8, 2015, 10:29:07 PM

Agree with John: two semantics need two constructs.

Cheers
Lex

### Steven G. Johnson

Jan 9, 2015, 5:20:43 PM

On Thursday, January 8, 2015 at 9:05:36 PM UTC-5, John Myles White wrote:
> The ontological interpretation says that an object x = Nullable{T}() is equivalent to an assertion that x, which might have been of type T, in fact has no value at all and cannot be safely used in downstream computations. One way to think of a Nullable object is as a pair of a value and an error code, where the error code indicates whether the value is nonsense or not. Since the Nullable object represents an error, we should not propagate nullability by default: programmers should be encouraged to handle errors immediately rather than defer their resolution.

My concern is that this interpretation is basically a hack to work around exception-handling performance in inner-loop functions.  I'm not sure if it is worth a whole separate type.

Instead, we could just say that Nullable is generally for the epistemological case, with a couple of exceptions like parseint where you are responsible for checking immediately for errors.

### John Myles White

Jan 9, 2015, 5:25:02 PM
Do you think the analogs of this interpretation are just hacks when they occur in Haskell or other FP languages? I've obviously drunk the Maybe Kool-Aid deeply, but I really like allowing methods to indicate that they might fail using a mechanism that involves types rather than mechanisms that involve control flow. Given how problematic try/catch is in Julia, this is particularly pressing right now, but I'd like to see the use of types rather than control flow continue even if try/catch were made more performant.

-- John

### ele...@gmail.com

Jan 9, 2015, 7:43:15 PM
Agree, the "maybe" paradigm is a useful way for code that can detect a failure to signal code that has an idea what to do about the failure.  Whilst exceptions can only signal upwards, maybes can also signal horizontally within the same piece of code.

Cheers
Lex

### Milan Bouchet-Valat

Jan 21, 2015, 12:23:04 PM
I feel like this thread has died down too fast. Any other opinions on
the subject, especially from the core devs?

I'm personally undecided on this issue, as I noted on the GitHub issue.
One practical argument that occurred to me is that if we choose to
create two distinct types, one with propagation semantics, and another
with a stricter behavior, it would be quite easy later to reunite them
if it turns out it was a bad idea. All it would take would be to make
the strict version a type alias for the permissive one, and nothing
would break.

Regards

### John Myles White

Jan 21, 2015, 4:26:01 PM
Agree that it's a shame more wasn't said.

My plan is the following:

(1) Change Nullable to have propagation semantics with the following exact semantics:

Given a definition of f(x...) for non-Nullable arguments, we lift this definition such that f(xs...) returns a null-valued Nullable if all of the arguments are Nullable and any of the arguments is null-valued. So Nullable(1) + Nullable(2) produces Nullable(3), Nullable(1) + Nullable{Int}() produces Nullable{Int}(), but Nullable(1) + 2 raises an error.
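The three cases can be sketched with a stand-in wrapper, here called `NB` (an invented name; this lifted method is not actually defined in Base, so the sketch defines it by hand):

```julia
# Stand-in for a Nullable{T}: a value plus a "present" flag.
struct NB{T}
    hasvalue::Bool
    value::T
    NB{T}() where {T} = new{T}(false)
    NB(x::T) where {T} = new{T}(true, x)
end

import Base: +
# Lifted method, defined only when *all* arguments are wrapped:
+(x::NB{T}, y::NB{T}) where {T} =
    (x.hasvalue && y.hasvalue) ? NB(x.value + y.value) : NB{T}()

NB(1) + NB(2)       # a wrapped 3
NB(1) + NB{Int}()   # a null NB{Int}
# NB(1) + 2         # MethodError: no method matches the mixed call
```

Because the lifted method requires every argument to be wrapped, the mixed call fails with a MethodError, which is exactly the "raises an error" behavior described above.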

I still need to confirm this doesn't have a performance cost because of indirection through call overloading, but I think it should be fine.

(2) Create a PR for a new type Maybe that shares the current Nullable interface, but does not implement propagation. This would be useful for the NULL pointer case where you want to prevent propagation through the type system, rather than through run-time checks.

I'd then use Maybe for things like regex matches, but Nullable for statistical applications.

-- John

### Stefan Karpinski

Jan 21, 2015, 4:38:31 PM
to Julia Dev
I'm still mulling this over. I feel like something just hasn't clicked for me on this whole subject.

### John Myles White

Jan 21, 2015, 4:45:03 PM
I really like this piece for the Maybe side of things: http://nickknowlson.com/blog/2013/04/16/why-maybe-is-better-than-null/

I also find it very helpful to make sure you talk separately about whether you care about propagating nullability or propagating null values.
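That distinction can be made concrete with a small sketch (`QB`, `half_or_nothing`, and `double` are invented names for illustration only):

```julia
# Minimal invented wrapper for the sketch.
struct QB{T}
    hasvalue::Bool
    value::T
    QB{T}() where {T} = new{T}(false)
    QB(x::T) where {T} = new{T}(true, x)
end

# Propagating *nullability*: the result type admits missingness, even for
# calls that happen to produce a value.
half_or_nothing(x::Int) = iseven(x) ? QB(div(x, 2)) : QB{Int}()

# Propagating *null values*: a null input forces a null output.
double(x::QB{Int}) = x.hasvalue ? QB(2 * x.value) : QB{Int}()
```

The first function makes nullability part of its signature; the second decides, at run time, what a null argument does to the result.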

-- John

### Erik Schnetter

Jan 21, 2015, 4:46:51 PM
On Jan 21, 2015, at 16:24, John Myles White <johnmyl...@gmail.com> wrote:
>
> Agree that it's a shame more wasn't said.
>
> My plan is the following:
>
> (1) Change Nullable to have propagation semantics with the following exact semantics:
>
> Given a definition of f(x...) for non-Nullable arguments, we lift this definition such that f(xs...) returns a null-valued Nullable if all of the arguments are Nullable and any of the arguments is null-valued. So Nullable(1) + Nullable(2) produces Nullable(3), Nullable(1) + Nullable{Int}() produces Nullable{Int}(), but Nullable(1) + 2 raises an error.
>
> I still need to confirm this doesn't have a performance cost because of indirection through call overloading, but I think it should be fine.
>
> (2) Create a PR for a new type Maybe that shares the current Nullable interface, but does not implement propagation. This would be useful for the NULL pointer case where you want to prevent propagation through the type system, rather than through run-time checks.
>
> I'd then use Maybe for things like regex matches, but Nullable for statistical applications.

+1

There's an argument to be made that Nullable(1) + 2 should return Nullable(3). If this isn't done by the regular call operator, then a new call operator "nullable_call" (with a better name, obviously) would be helpful.

-erik

--
Erik Schnetter <schn...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/

My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from https://sks-keyservers.net.

### John Myles White

Jan 21, 2015, 4:58:43 PM
There certainly is an argument for Nullable(1) + 2. I'm going to ignore it for the first round, since I think you might write that kind of code by accident, and it's nice to hit an immediate type error.

-- John

### ele...@gmail.com

Jan 21, 2015, 6:11:26 PM

On Thursday, January 22, 2015 at 7:26:01 AM UTC+10, John Myles White wrote:
> Agree that it's a shame more wasn't said.
>
> My plan is the following:
>
> (1) Change Nullable to have propagation semantics with the following exact semantics:
>
> Given a definition of f(x...) for non-Nullable arguments, we lift this definition such that f(xs...) returns a null-valued Nullable if all of the arguments are Nullable and any of the arguments is null-valued.

I agree that having this baked into the language semantics is a convenient way of propagating Nullable, but I am not sure there aren't f()s for which sensible values could be returned despite some null-valued parameters. A user function that ignores null Nullables would not be possible with those semantics, IIUC -- e.g. min or max over the non-null Nullables.

OTOH, if propagation isn't automatic, it often won't be done properly, due to forgetfulness/laziness/inexperience etc.

> So Nullable(1) + Nullable(2) produces Nullable(3), Nullable(1) + Nullable{Int}() produces Nullable{Int}(), but Nullable(1) + 2 raises an error.
>
> I still need to confirm this doesn't have a performance cost because of indirection through call overloading, but I think it should be fine.
>
> (2) Create a PR for a new type Maybe that shares the current Nullable interface, but does not implement propagation. This would be useful for the NULL pointer case where you want to prevent propagation through the type system, rather than through run-time checks.
>
> I'd then use Maybe for things like regex matches, but Nullable for statistical applications.

Agree that two separate things are the right way to do it.

Cheers
Lex

### Milan Bouchet-Valat

Jan 23, 2015, 4:25:58 PM
On Wednesday, January 21, 2015 at 15:11 -0800, ele...@gmail.com wrote:
>
>
> On Thursday, January 22, 2015 at 7:26:01 AM UTC+10, John Myles White
> wrote:
> Agree that it's a shame more wasn't said.
>
> My plan is the following:
>
> (1) Change Nullable to have propagation semantics with the
> following exact semantics:
>
> Given a definition of f(x...) for non-Nullable arguments, we
> lift this definition such that f(xs...) returns a null-valued
> Nullable if all of the arguments are Nullable and any of the
> arguments is null-valued.
>
>
> I agree that having this baked into the language semantics is a
> convenient way of propagating Nullable, but I am not sure that there
> are not f()s where there are sensible values that could be returned
> despite some null-valued parameters, a user function ignoring null
> Nullables would not be possible with those semantics IIUC, eg min or
> max of the non-null Nullables.
If I understand correctly John's proposal, this would only be a
fallback: "for non-Nullable arguments". A function accepting a Nullable
argument would be free to handle it as it likes.

Regards

### John Myles White

Jan 23, 2015, 4:28:54 PM
That's certainly my goal. Right now it's not possible to do the relevant call overloading at all, let alone make the fallback lower precedence than other, more specific definitions.

-- John

### ele...@gmail.com

Jan 23, 2015, 6:37:09 PM

So with your proposal, if I passed a Nullable to a function taking (all?) non-Nullable types, I would silently get a null Nullable at runtime (if the passed Nullable is null) instead of a method-not-found error at compile time? What would be the type parameter of this Nullable?

I'm not sure that's a good idea at all, and as a side effect it will silently cost performance to test and get() the Nullable.

An explicit instruction would be better. Maybe a @propagate_nulls macro (OK, pick a shorter name for sure) that accepted your f(x...) and conveniently produced an automatically propagating version of f() (I guess it would actually be a staged function), which checks at its call sites whether any Nullables are passed and generates the test-and-get code.
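One hypothetical shape for such an opt-in mechanism, sketched here as a higher-order function rather than a macro or staged function (all names, including `propagate_nulls` and the wrapper `MB`, are invented for illustration):

```julia
# Minimal invented wrapper for the sketch.
struct MB{T}
    hasvalue::Bool
    value::T
    MB{T}() where {T} = new{T}(false)
    MB(x::T) where {T} = new{T}(true, x)
end

# propagate_nulls(f) returns a version of f that unwraps MB arguments and
# short-circuits to null when any argument is null (assumes f(T...)::T).
function propagate_nulls(f)
    function (xs::MB...)
        all(x -> x.hasvalue, xs) ? MB(f((x.value for x in xs)...)) :
                                   typeof(first(xs))()
    end
end

nmin = propagate_nulls(min)
nmin(MB(1), MB(2))      # a wrapped 1
nmin(MB(1), MB{Int}())  # null: min is never called
```

The point of the sketch is that propagation is opted into per function by whoever wants it, rather than being wired into every call site or into the language itself.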

Cheers
Lex

### ele...@gmail.com

Jan 23, 2015, 7:13:08 PM

On Saturday, January 24, 2015 at 9:37:09 AM UTC+10, ele...@gmail.com wrote:
> So with your proposal if I passed a Nullable to a function taking (all?) non-nullable types I would silently get a null Nullable at runtime if the passed Nullable is null instead of a method not found at compile time?  What would be the generic parameter for this Nullable?
>
> Not sure that's a good idea at all and as a side effect it will silently cost performance to test and get() the Nullable.

A coffee later and what was nagging me became obvious :)

Effectively, the proposal is *sometimes* not calling a function, yet continuing to run, when something the function doesn't understand is passed to it. Silently and automagically, for all functions, the proposal changes the semantics of the program, especially if the function that was sometimes not called has side effects. The magic really should only be applied to functions that the programmer has explicitly said are safe to treat this way.

If explicit marking is required then over time sensible Base and package functions will get marked as such.

Cheers
Lex

### John Myles White

Jan 23, 2015, 7:18:12 PM
That's not a bad proposal, but I'm pretty happy with having exactly two types: one that never propagates automatically and one that always propagates automatically. It's certainly annoying if you call A_mul_B!(NULL) and don't get a method error, but it's also annoying if you can't do things like Nullable(1) + Nullable(2). Having functions opt in to that behavior is too hard, since you don't want to have to augment every numeric function to handle Nullable{Float64}(). We've tried that approach already and it doesn't work well.

-- John

### ele...@gmail.com

Jan 23, 2015, 8:23:13 PM

On Saturday, January 24, 2015 at 10:18:12 AM UTC+10, John Myles White wrote:
> That's not a bad proposal, but I'm pretty happy with having exactly two types: one that never propagates automatically and one that always propagates automatically. It's certainly annoying if you call A_mul_B!(NULL) and don't get a method error, but it's also annoying if you can't do things like Nullable(1) + Nullable(2). Having functions opt in to that behavior is too hard, since you don't want to have to augment every numeric function to handle Nullable{Float64}(). We've tried that approach already and it doesn't work well.

The idea of having @propagate_nulls is that a programmer who wants the behaviour can apply it to the function; it doesn't need to be included in the original definition. The benefit is that there is no need to immediately go and opt in half of Base and heaps of packages before you can sensibly use propagating nulls; as you say, that won't happen. But a commonly used function like +, which is pure and so safe, will probably get a @propagate_nulls in Base at some point; it just doesn't need to be immediate.

Note that as @propagate_nulls applies to the function, not to individual methods, only one annotation is needed no matter how many methods exist.

Cheers
Lex

PS: not sure which of my posts you replied to, but as I said in the second, universally applying propagation to functions with side effects is a very bad idea. Accidentally passing a propagating nullable as a parameter to a function with side effects will change the behaviour of the program in undefined ways.

### Milan Bouchet-Valat

Jan 24, 2015, 9:44:47 AM
"Undefined ways" is too strong: nothing would happen, even if you
expected a side-effect. For example, if `x` happens to be `null`, then
the following example would have absolutely no effect:
f = open(x, "w")
write(f, "some string")
close(f)

I agree this is not ideal. Maybe for non-pure functions raising a no
method error would make more sense. This could be an argument in favor
of adding a `pure` declaration to Julia, which is an old debate:
https://github.com/JuliaLang/julia/issues/414

OTOH the behavior above is not a big deal. `Nullable` is intended to be
used with data, not for general programming: if you start taking
`Nullable` values beyond data handling and analysis, you should call
`get` on it to avoid the uncertainty about missingness.

Regards

### Erik Schnetter

Jan 24, 2015, 10:56:13 AM
The current proposal is to have two versions of Nullable: one that automatically "does nothing", and another that requires explicitly handling missing values. `Nullable` would be used with data, whereas the other type (e.g. called `Maybe`) could be used for control flow. There are good use cases for both.

-erik

### ggggg

Jan 24, 2015, 6:25:07 PM
I recently read http://www.juliabloggers.com/whats-wrong-with-statistics-in-julia-2/, which says there is going to be a 60% performance penalty for using Nullable{Float64} rather than just Float64 and the NaN value. I also recently became aware of Traits.jl.

I wonder if this system could be designed such that Float64, and possibly other types that have a NaN-like value, could be used directly rather than being wrapped in Nullable. I know there are now two concepts, Nullable and Maybe, so this would probably only work for one of the two.

So, very roughly, rather than defining something like

    iteratecolumn(x::Nullable) = for ... end

you might define something like

    iteratecolumn(x) = iteratecolumn_(x, nullability(typeof(x)))
    nullability(::Type{Float64}) = NativelyNullable
    nullability{T<:Any}(::Type{T}) = WrapNullable

I'm not sure if this would work, but it seems like it may be able to recover the 60% speed penalty for Floats and still play nicely with the proposed system. A further extension might be letting you specify a particular value as the null marker for a specific dataset:

    immutable FastNullable{T,N}
        v::T
    end
    isnull{T<:FastNullable}(x::T) = x.v == nullvalue(T)
    nullvalue{N}(::Type{FastNullable{Int64,N}}) = N
    nullvalue(::Type{FastNullable{MyType,0}}) = MyType(otherwise, useless, values, here)

Or maybe the speed hit is too small for the extra complication?

### John Myles White

Jan 24, 2015, 6:39:11 PM
This is something we may do. For now, I think we have more pressing performance issues.

-- John

### ele...@gmail.com

Jan 24, 2015, 7:36:35 PM
In this case, yes, it's definable, but I was thinking of the side effect being modification of a global; it's undefined because you don't know what uses that global or what the effect of not updating it will be.

> I agree this is not ideal. Maybe for non-pure functions raising a no
> method error would make more sense. This could be an argument in favor
> of adding a `pure` declaration to Julia, which is an old debate:
> https://github.com/JuliaLang/julia/issues/414
>
> OTOH the behavior above is not a big deal. `Nullable` is intended to be
> used with data, not for general programming: if you start taking
> `Nullable` values beyond data handling and analysis, you should call
> `get` on it to avoid the uncertainty about missingness.

Yes, agreed, but I said "accidentally" pass the nullable and get undefined behaviour. If the programmer does the right thing, it works just fine, but the chance of an accident, and the result (continuing to run an undefined program), seem quite high. The programmer who is using Nullable data should check every function they pass that data to, and, well, will that happen? Especially if the only way of telling is to read and understand the source, since functions won't be annotated "nullable safe".

I should emphasise that my problem is the program changing its behaviour and continuing to run, meaning the results are silently undefined. If it threw an exception, failed to compile, etc., that would be fine. But other than having the default be to not allow automatic propagation, with programmers manually marking the functions where it is OK, I can't see how to make it safe, for all the reasons noted in the "pure" thread above.

Cheers
Lex

### John Myles White

Feb 2, 2015, 11:22:51 AM
Coming back to this thread, I think we've ended up stuck with the same incompatible design goals again.

As I see it, propagation can work in a few ways:

(1) Always off: you can't propagate nullability without explicitly constructing a Nullable object.
(2) Opt-in at the call-site: you can propagate nullability by using something like map(f, Nullable).
(3) Opt-in near the call-site: you can propagate nullability within a block of code using something like @propagate begin x = Nullable(1) + Nullable(2); sin(x) end
(4) Opt-in at the definition site: you can propagate nullability by annotating a function as propagating in the same way that we annotate functions as vectorizable.
(5) Always on: you can't prevent propagation without explicitly checking for nullability before calling any function.

There are strengths and weaknesses to all of these approaches. We currently have (1), which is the most verbose, but also the safest.

There was a proposal to move to (2), which would make things less verbose and comparably safe, but would still be more verbose than R. The verbosity is especially severe in complex expressions like sin(cos(x)) + cos(sin(x - pi)).

There was a proposal to move to (3), which requires specifying the semantics of @propagate in greater detail, but makes things much less verbose. It does so at the cost of some safety since you might wrap too large a block with @propagate. But it's always explicit, which is a virtue.

There was a proposal to move to (4), which requires a lot of ongoing labor to ensure that all relevant functions are annotated properly. There's also the issue that the annotations one person believes are reasonable differ from the annotations preferred by others. Should, for example, length(Nullable{Array}) work by default?

Finally, there was a proposal to move to (5), which requires no effort, but allows strange things like push!(Nullable{Int}()) to execute without errors.

There are also technical impediments to be dealt with:

(1) No problems here. We have this working already.
(2) Higher-order functions are slow. Until functions have a richer place in the type system, map will likely continue to be a performance trap.
(3) The semantics of propagate are pretty vague. If the code expands to uses of map, we have the performance problems noted in (2).
(4) We've tried doing annotations before in DataArrays. It requires a huge amount of work -- more than anyone can reasonably expect to do. Given my very negative experiences doing this work for DataArrays, I'm pretty strongly opposed to this proposal.
(5) Call overloading currently doesn't work this way. It could be extended, but that changes Base in a deep way just to provide this feature.

Given that we need a clear plan, I'm going to make a judgment call here and rescind my previous plan for trying to implement (5) one day. Instead, let's push for a version of (3) that does not involve calling map. We can also implement map, with the understanding that map cannot be used in high-performance code until the type system gets another iteration.
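To make option (3) concrete, here is a hand-written sketch of what the hypothetical `@propagate begin x = Nullable(1) + Nullable(2); sin(x) end` block might expand to. The stand-in wrapper `PB` and the lifting helper `lifted` are both invented names; neither exists, and this only illustrates the intended rewriting, not a real macro:

```julia
# Minimal invented wrapper for the sketch.
struct PB{T}
    hasvalue::Bool
    value::T
    PB{T}() where {T} = new{T}(false)
    PB(x::T) where {T} = new{T}(true, x)
end

# Each call inside the @propagate block would be rewritten into a lifted call
# (assumes f maps T arguments to a T result, as + and sin do on Float64).
function lifted(f, xs::PB{T}...) where {T}
    all(x -> x.hasvalue, xs) ? PB(f((x.value for x in xs)...)) : PB{T}()
end

# Hand-written expansion of the block above, on Float64 values:
x = lifted(+, PB(1.0), PB(2.0))   # a wrapped 3.0
y = lifted(sin, x)                # a wrapped sin(3.0)
z = lifted(sin, PB{Float64}())    # null in, null out; sin never runs
```

Crucially, the rewriting is confined to the annotated block, so the rest of the program keeps ordinary call semantics, and no map over a higher-order function is needed in the generated code.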

-- John

### ele...@gmail.com

Feb 2, 2015, 6:10:07 PM

John,

Very nice summary.  Agree with most of what you said.

Option (1) also has the downside that it doesn't actually improve the current situation at all.

Option (2) is way too verbose to be practical.

Option (3) still puts the onus on the programmer to determine the safety of the functions they use, but making it local makes that more tractable. It would probably be good to ask programmers who determine functions to be safe to submit PRs documenting that, so it becomes less onerous to check, and so the developers of the called function are warned that they need to tell everyone if they change the null safety of an implementation. With no indication that propagation safety is expected, the implementation may change with no warning to users.

Option (4) definitely requires a lot of work, though it is one-time work rather than requiring every user of nullability to do the checking. Which is why I suggested allowing the user to declare propagation safety as well, but only for the functions they use. But that clearly has the same problem of the implementation changing as in (3).

Option (5): agree, and it silently changes the behaviour of existing programs if they are accidentally passed nullable data.

Also agree that your proposal for (3) is the most practical. It provides a not-too-verbose means of explicitly opting into propagation of nullable data without changing the semantics of the rest of the language. And having to write @propagate hopefully reminds you to check that the functions you use are safe.

Cheers
Lex