There's a current open issue for NullableArrays here:
https://github.com/JuliaStats/NullableArrays.jl/issues/112#issuecomment-225366836
The gist of the issue is that there are cases like CSV.jl, where it can be incredibly efficient to first mmap a delimited file like:
# returns a Vector{UInt8}
m = Mmap.mmap(file)
and then parse individual column values out of the byte array. For bits types (floats, integers) there's no problem, because they end up being stack allocated anyway while parsing. For Strings, however, it can be much more efficient to just "point" to the original mmapped memory instead of copying the string bytes out. This is possible through the `unsafe_wrap(String, ptr, len)` method, but it poses a concern: strings created this way are, unsurprisingly, "unsafe" in the sense that they don't own their underlying memory. If the original mmapped memory gets unmapped by its finalizer, all of these strings become invalid, because their underlying memory has been taken out from under them.
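To make the "point instead of copy" idea concrete, here is a minimal sketch, assuming Julia 1.x. The sample file and its contents are made up for illustration. It uses a `view` rather than a raw pointer, so the field keeps a reference to the mmapped buffer and only copies bytes when a String is actually requested:

```julia
using Mmap

# Hypothetical sample file, written only so the sketch is self-contained.
path, io = mktemp()
write(io, "id,name\n1,alice\n2,bob\n")
close(io)

m = Mmap.mmap(path)   # Vector{UInt8} backed by the file

# Locate the second field of the first data row by scanning bytes.
line_start = findfirst(==(UInt8('\n')), m) + 1
comma      = findnext(==(UInt8(',')), m, line_start)
line_end   = findnext(==(UInt8('\n')), m, comma)

# No copy here: the SubArray holds a reference to `m`, so `m` stays reachable.
field = view(m, comma+1:line_end-1)

# The bytes are copied only when a String is materialized.
name = String(copy(field))
```

The difference from the raw-pointer approach is that `field` carries its parent along with it, which is exactly the property the unsafe strings lack.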
Another newer and more pressing example is that of feather files. We now have a native implementation working, but there the same issue is even worse. Feather defines a common binary format for dataframes across languages. The general idea is that you can mmap a feather file and then "reinterpret" the mmapped bytes as full-blown dataframe columns, no parsing necessary. While this works wonderfully in Julia (probably even better than in other languages), we still run into the same problem of needing to keep a reference to the underlying memory around somewhere. For example, we might mmap a feather file and learn from the serialized metadata that bytes 5 through 1204 represent 150 Float64 values. We can build a Julia array by saying `unsafe_wrap(Array, convert(Ptr{Float64}, pointer(bytes, 5)), 150)`, which does no copying (the pointer has to be converted to `Ptr{Float64}`, and the length is the element count, 150, not the byte count, 1200). We now just need to make sure we keep a reference to the mmapped Vector{UInt8} to ensure our Vector{Float64} doesn't become invalid.
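A rough sketch of that reinterpretation, assuming Julia 1.x. The file layout is made up and is not the real Feather format: it has an 8-byte header so that the Float64 data starts on an 8-byte boundary, which as far as I know `unsafe_wrap` expects.

```julia
using Mmap

# Hypothetical layout: an 8-byte header followed by 150 Float64 values.
expected = collect(1.0:150.0)
path, io = mktemp()
write(io, zeros(UInt8, 8))
write(io, expected)
close(io)

bytes = Mmap.mmap(path)   # Vector{UInt8} of length 1208

# Raw-pointer version: convert the pointer's element type and pass the
# element count (150), not the byte count (1200).
ptr = convert(Ptr{Float64}, pointer(bytes, 9))
col = unsafe_wrap(Array, ptr, 150)   # Vector{Float64}, no copy
# `col` does not keep `bytes` alive; something else must hold that reference.

# Alternative: a reinterpreted view carries a reference to its parent array.
col2 = reinterpret(Float64, view(bytes, 9:1208))
```

`col` is a plain Vector{Float64} but is only valid while `bytes` is reachable. `col2` avoids that hazard at the cost of not being a plain `Array`.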
My current proposal is to add a single `ref` field to the NullableArray type, so that a NullableArray can itself hold a reference to a shared parent memory region. Discussion so far has revolved around whether there are other ways of dealing with this kind of problem. My proposal is based on a similar pattern I've seen in the Arrow/Feather projects.
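As an illustration of the `ref` idea, here is a hypothetical sketch, assuming Julia 1.x. `RefColumn` and `wrap_column` are made-up names, and this is not the actual NullableArrays.jl definition. The column aliases a parent byte buffer and stores that buffer in a field, so the buffer cannot be collected while the column is alive:

```julia
# Hypothetical stand-in for a nullable column with a parent reference.
struct RefColumn{T}
    values::Vector{T}     # may alias memory owned by `ref`
    isnull::Vector{Bool}
    ref::Vector{UInt8}    # parent buffer, kept reachable by this field
end

# Wrap `n` elements of type T starting at 1-based byte `offset` of `buf`.
# Assumes the offset is suitably aligned for T.
function wrap_column(::Type{T}, buf::Vector{UInt8}, offset::Int, n::Int) where {T}
    ptr = convert(Ptr{T}, pointer(buf, offset))
    values = unsafe_wrap(Array, ptr, n)   # no copy
    return RefColumn{T}(values, fill(false, n), buf)
end

# Made-up buffer: 8 header bytes followed by three Float64 values.
buf = Vector{UInt8}(undef, 8 + 3 * sizeof(Float64))
buf[9:end] .= reinterpret(UInt8, [1.5, 2.5, 3.5])

col = wrap_column(Float64, buf, 9, 3)
```

As long as `col` is reachable, so is `buf`, and several columns can share the same parent buffer by storing the same `ref`.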
Another idea that has been suggested involves some kind of runtime interaction with the GC: somehow letting it know that, for example, the Vector{Float64} is backed by the mmapped Vector{UInt8}, so that the GC doesn't run the "unmap" finalizer on the parent mmap memory until all references to arrays reinterpreted from it are gone.
I'd definitely like to hear any ideas for handling situations like this, and what is or isn't feasible.
-Jacob