the real reason Julia doesn't have a type-proper null reference

745 views
Skip to first unread message

Deacon Blues

unread,
Oct 21, 2016, 10:53:22 AM10/21/16
to julia-dev

I'm sure there are lots of reasons why Julia doesn't have a type-proper null reference (i.e. why `!(Nullable{T}() <: T)`), but I'm trying to understand what they are.  When I start to think about what most code would have to do to handle this, I certainly see that there'd be a fair amount of work.  However, I'm haunted by two things: the existence of IEEE floating point NaN and of C++ `nullptr`.  These are both examples of types of null values which are type stable and do not cause any kind of performance penalty (though, to be clear, I understand that at least NaNs are COMPLETELY different than null references in Julia, though in many cases they'd be handled the same way algorithmically).  The existence of these things makes me seriously question why Julia doesn't have null references.  It seems clear to me that if there is a fundamental limitation it is certainly performance.  For instance, when working with standard numeric types, Julia would need some way of dealing with the possibility of the null reference, creating an extra layer of abstraction away from the actual numeric operations.

Could somebody explain to me what is the most important reason why we can't have this?  I still feel pretty confident that we can't, I'd just like to make sure I understand why.

Stefan Karpinski

unread,
Oct 21, 2016, 10:56:08 AM10/21/16
to juli...@googlegroups.com

Scott Jones

unread,
Oct 28, 2016, 4:13:32 AM10/28/16
to julia-dev
That was a very good decision for Julia.

Unfortunately, Julia repeated C's own "billion dollar" mistake, which is mentioned in the first article: using \0 nul byte terminated strings.

Jameson Nash

unread,
Oct 28, 2016, 10:52:24 AM10/28/16
to juli...@googlegroups.com
Julia's strings aren't nul-terminated, except when they have to be passed to a C function that only accepts c-style nul-terminated strings.

Scott Jones

unread,
Oct 28, 2016, 6:01:14 PM10/28/16
to julia-dev
Since when?  I still see code that allocates an extra byte in the v0.6 source code.
It was one of the major problems I had with the string support in Julia, the inconsistency of an implicit \0 byte for ASCIIString and UTF8String, but an explicit 16-bit or 32-bit word of 0 to terminate UTF16String and UTF32String.
(It made a lot of the code that I wrote last year for strings much less generic)
I'd be very happy if the places that I saw in the code to reserve space for the implicit \0 byte were simply no longer necessary.

Simon Byrne

unread,
Oct 29, 2016, 10:35:46 AM10/29/16
to julia-dev
Well, none of those types exist anymore, so it's hard to say what would need to be changed.

Internally, when Julia arrays are allocated they may stick an extra null byte on the end so that the Cstring conversion doesn't require allocating a new array (since the cost of this is pretty much negligible), but this is not exposed from within Julia.

-simon

Scott Jones

unread,
Oct 29, 2016, 5:37:48 PM10/29/16
to julia-dev
It does mean that there is a hidden cost, because looking at the code, it sticks on the extra byte for any allocation of an array whose elementsize is 1.
I'm concerned now, that if I allocate a buffer of 8192, it actually will allocate 8193 (and need to round that up to a much larger size).
It seems like a bit of a mess, just trying to cater to an out-of-date C convention.

Stefan Karpinski

unread,
Oct 31, 2016, 11:21:21 AM10/31/16
to juli...@googlegroups.com
The problem with null-terminated strings has never been that the extra byte is wasteful. Quite the opposite: the primary reasons C chose null-termination over length prefixing was that it uses less storage and allows arbitrary length. The actual problems with null-termination are:
  1. Finding the length in bytes of a null-terminated string is an O(n) operation;
  2. Strings cannot contain arbitrary binary data since NUL bytes are not allowed.
Neither of these problems apply to Julia, so it's hard to credit any claim that Julia is making C's mistake again as anything but ill-informed FUD.

Scott Jones

unread,
Nov 1, 2016, 12:46:31 PM11/1/16
to julia-dev
You seem to have missed my point entirely.  It is not one extra byte that concerns me, it is that *any* allocation with elementsize == 1 will ask for one extra byte, which, depending on how things are allocated (which I had been informed when I first started was by doubling the size), may end up allocating twice as much space.

The other major problems in Julia with catering to nul termination were that:
      1. You could not take a substring of a string, without performing a copy (so as to be able to add that hidden nul termination)
      2. The constructors for ASCIIString and UTF8String did not require a terminating nul byte, but the constructors for UTF16String and UTF32String required a terminating 0 16-bit or 32-bit word.
          That inconsistency made it much harder to write generic code dealing with all 4 string types.

The situation may have been improved now in v0.5 and later, however it would be ill-informed to say that Julia did not have a serious problem in v0.4 and previously (and still does with the LegacyStrings package),
because of unnecessarily catering to that C mistake (and it was already considered to be a mistake back when I first learned C in 1980).

Jeff Bezanson

unread,
Nov 1, 2016, 9:07:26 PM11/1/16
to juli...@googlegroups.com
The extra 0 byte is only added as a safety net in case somebody passes
a pointer to a byte array to a C string function without being
careful. It's just defensive programming. Nothing like "the
AbstractString interface requires nul termination" is true. The
SubString constructor has never copied data or required nul
termination as far as I can remember.

We could take out all hacks related to nul-termination, and it might
end up being the best way to go, but that would mean copying any time
you pass a string to C, and you do seem concerned about unnecessary
copying.


On Mon, Oct 31, 2016 at 10:02 PM, Scott Jones

Scott Jones

unread,
Nov 3, 2016, 8:28:10 PM11/3/16
to julia-dev
It just seems like optimizing a case that isn't really very common, and causing a penalty for all allocations (not just Vectors, but any dimension array) with elementsize == 1, is not the best way to go.
The copying with substrings happened whenever a substring was passed by ccall, to add the nul terminating byte or word, however, I *think* that is now only if you use Cstring (but I haven't verified that yet).

Most decent C/C++ APIs avoid using nul terminated strings any more anyway (too many cases of security holes with buffer overruns, etc.), and what remains is typically for fairly short things, like names, paths, etc.

I am concerned about both problems with poor memory utilization and *unnecessary* copying, however there are ways of avoiding most all of the copying, without causing a tax on all allocations with elementsize == 1.

Taking out the penalty for all elementsize == 1 allocations would not mean that you would always have to do a copy when you pass a string to C.  Here is a fairly simple way of doing so:

1) Have the String type add a \0, *if* there is room in the memory allocated to the underlying Vector (because of allocating an aligned size)
2) Have the ccall Cstring conversion from String only make a copy if there is no room in the passed vector for a \0 to have been added (which would likely be rather infrequent)
3) Possibly make Cstring into a concrete type, so that you could completely avoid the checking for embedded nuls / copying if you were passing Cstrings frequently to ccall.

I think currently there are likely some holes due to expecting that nul byte to be there in a Vector{UInt8}, because (as Fengyang Wang [@TotalVerb on GitHub] pointed out yesterday on Gitter), you can create a Vector{UInt16}, and then reinterpret it.
I've verified that in that case, you have a Vector{UInt8} that does not have a nul terminating byte.
There is also a minor bug in the code in _new_array_ (in array.c), where the extra byte is added *after* the length has been checked for the maximum (MAXINTVAL).
(I think that is not a problem in practice though, since MAXINTVAL is pretty large)

Bumping up to the next allocation size is small potatoes though, compared to the huge amount of memory used to store immutable strings with a structure optimized for growable, mutable, multi-dimensional arrays.
I hope that will be addressed in v0.6.

Thanks for the response!
Reply all
Reply to author
Forward
0 new messages