Ideas for safely re-using memory in types?

Jacob Quinn

Jun 11, 2016, 4:19:57 PM
to julia-dev
There's a current open issue for NullableArrays here: https://github.com/JuliaStats/NullableArrays.jl/issues/112#issuecomment-225366836

The gist of the issue is that there are cases like CSV.jl, where it can be incredibly efficient to first mmap a delimited file like:

# returns a Vector{UInt8}
m = Mmap.mmap(file) 

and then parse out individual column values from the byte array. For bits types (floats, integers) there's no problem, because the parsed values end up being stack allocated anyway. For Strings, however, it can be much more efficient to just "point" to the original mmapped memory instead of copying the string bytes out. This is possible through the `unsafe_wrap(String, ptr, len)` method, but it poses a concern: strings created this way are, unsurprisingly, "unsafe" in the sense that they don't own their underlying memory. If the original mmapped memory gets "unmapped" through its finalizer, all of these strings become invalid because their underlying memory has been taken out from under them.
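As a rough illustration of the bits-type case, here is a minimal sketch in current Julia (1.x) syntax, which postdates this thread. `parse_uint` is a hypothetical helper written for this example, not a CSV.jl function, and the temporary file stands in for a real delimited file.

```julia
using Mmap

# Hypothetical helper: parse an unsigned decimal integer from bytes[r]
# without allocating an intermediate String.
function parse_uint(bytes::AbstractVector{UInt8}, r::UnitRange{Int})
    x = 0
    for i in r
        b = bytes[i]
        UInt8('0') <= b <= UInt8('9') || throw(ArgumentError("not a digit at byte $i"))
        x = 10x + Int(b - UInt8('0'))
    end
    return x
end

# Write a tiny delimited file to a temporary path (illustration only).
path, io = mktemp()
write(io, "12,345\n")
close(io)

m = Mmap.mmap(path)        # Vector{UInt8} backed by the file
a = parse_uint(m, 1:2)     # first field
b = parse_uint(m, 4:6)     # second field
```

The parsed integers are plain values and do not refer back to `m`, which is why only the String case raises the ownership question.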

Another newer and more pressing example is that of feather files. We now have a native implementation working, and there the same issue is even more pronounced. Feather defines a common binary format for dataframes across languages. The general idea is that you can mmap a feather file and then "reinterpret" the mmapped bytes into full-blown dataframe columns, no parsing necessary. While this works wonderfully in Julia (probably even better than in other languages), we still run into the same problem of needing to keep a reference to the underlying memory around somewhere. For example, we might mmap a feather file and know from the serialized metadata that bytes 5 through 1204 (1200 bytes) represent 150 Float64 values. We can build a Julia array by saying `unsafe_wrap(Array, convert(Ptr{Float64}, pointer(bytes, 5)), 150)`, which does no copying. We then just need to make sure we keep a reference to the mmapped Vector{UInt8} so that our Vector{Float64} doesn't become invalid.

My current proposal is to add a single `ref` field to the NullableArray type, so that a NullableArray can itself hold a reference to a shared parent memory region. The discussion so far has revolved around whether there are other ways of dealing with this kind of problem. My proposal is based on a similar pattern I've seen in the Arrow/Feather projects.
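A minimal sketch of the `ref`-field pattern, written in current Julia (1.x) syntax. `OwnedVector` and `wrap_floats` are hypothetical names for illustration only; this is not the NullableArrays implementation, and a plain `Vector{UInt8}` stands in for the mmapped buffer.

```julia
# Hypothetical wrapper: pairs a non-owning array with the buffer that owns its memory.
struct OwnedVector{T} <: AbstractVector{T}
    values::Vector{T}     # wraps memory inside `ref`; does not own it
    ref::Vector{UInt8}    # parent buffer, reachable for as long as the wrapper is
end

Base.size(v::OwnedVector) = size(v.values)
Base.IndexStyle(::Type{<:OwnedVector}) = IndexLinear()
Base.getindex(v::OwnedVector, i::Int) = v.values[i]

# Wrap `n` Float64 values starting at 1-based byte index `offset`, without copying.
function wrap_floats(bytes::Vector{UInt8}, offset::Int, n::Int)
    (1 <= offset && offset - 1 + n * sizeof(Float64) <= length(bytes)) ||
        throw(ArgumentError("requested range does not fit in the buffer"))
    ptr = convert(Ptr{Float64}, pointer(bytes, offset))
    values = unsafe_wrap(Array, ptr, n)
    return OwnedVector{Float64}(values, bytes)
end

bytes = collect(reinterpret(UInt8, [1.0, 2.0, 3.0]))  # stand-in for mmapped bytes
v = wrap_floats(bytes, 9, 2)                          # skips the first 8 bytes
```

Because `v.ref` points at `bytes`, the parent buffer cannot be collected (and its finalizer cannot run) while `v` is still reachable.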

Another idea that has been suggested would involve some kind of runtime interaction with the GC: somehow letting the GC know that, for example, the Vector{Float64} is based on the mmapped Vector{UInt8}, so that the GC doesn't run the "unmap" finalizer on the parent mmap memory until all references to arrays reinterpreted from it are gone.

I'd definitely like to hear any ideas for handling situations like this, and what's feasible or not.

-Jacob

Erik Schnetter

Jun 11, 2016, 4:28:22 PM
to juli...@googlegroups.com
By way of suggesting an alternative approach that may or may not work for you:

You could define a new array type, similar to `SubArray`, that knows that its data are owned by another entity, and that holds a reference to that other entity. In fact, since a `SubArray` can hold a reference to any `AbstractArray`, you would only need to define a pseudo `AbstractArray` that corresponds to the mmapped file. You can then use a regular `SubArray`, which is (so one hopes) supported everywhere in Julia where a regular `Array` is supported as well.

Thus the price you'd pay is that you get a type that is different from `Array`, but one that is (or really should be) equally well supported. The advantage you'd gain is that you re-use an existing mechanism that was created for a very similar scenario.
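A small sketch of the mechanism being re-used, in current Julia (1.x) syntax, where `view` plays the role that `sub` had at the time of this thread. The `parent_data` array is only a stand-in for memory owned by some other entity.

```julia
parent_data = collect(1:10)       # stand-in for data owned elsewhere
col = view(parent_data, 3:6)      # SubArray over elements 3..6; no copy is made

col[1] = 30                       # writes through to parent_data[3]

# Generic code written against AbstractVector accepts the SubArray unchanged.
total(xs::AbstractVector) = sum(xs)
t = total(col)
```

The `SubArray` stores its parent, so `parent(col) === parent_data` and the parent stays reachable for as long as `col` does.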

-erik

Jacob Quinn

Jun 11, 2016, 4:42:32 PM
to juli...@googlegroups.com
The main issue I see in taking the SubArray approach is how to deal with the "cross types" problem. When I mmap a file, I get a Vector{UInt8}, so the SubArray either has to be a SubArray{UInt8}, or I have to reinterpret the original mmap array to the desired eltype and then take a SubArray of that. The problem there is that I might mmap 9 bytes, yet need to reinterpret them as a Vector{Float64}. This somewhat goes back to an issue I recently opened here: https://github.com/JuliaLang/julia/issues/16652.

I think the pseudo-SubArray type in this case would need to be able to not only hold a reference to its parent array, but also define a different eltype from its parent. Maybe that's additional functionality we can extend SubArray with, but I'm not sure how easy that would be to do. Definitely something worth looking into.
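A sketch of the cross-type case in current Julia (1.x) syntax, which postdates this thread. It assumes the 1.x behaviour in which `reinterpret` returns a lazy wrapper that holds its parent, and it trims the buffer explicitly so nothing depends on how a trailing partial element is treated.

```julia
bytes = zeros(UInt8, 9)                     # 9 bytes: one Float64 plus one leftover byte
bytes[1:8] = reinterpret(UInt8, [1.5])      # store the bit pattern of 1.5

n = div(length(bytes), sizeof(Float64))     # number of whole Float64 values that fit
trimmed = view(bytes, 1:n * sizeof(Float64))
floats = reinterpret(Float64, trimmed)      # different eltype, same memory, no copy

floats[1] = 2.5                             # writes through to the original bytes
roundtrip = reinterpret(Float64, bytes[1:8])[1]
```

Here `parent(parent(floats)) === bytes`, so the wrapper both carries a different eltype and keeps the parent buffer reachable.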

-Jacob

Yichao Yu

Jun 11, 2016, 4:46:00 PM
to Julia Dev
On Sat, Jun 11, 2016 at 4:42 PM, Jacob Quinn <quinn....@gmail.com> wrote:
> The main issue I see in taking the SubArray approach is how to deal with the
> "cross types" problem; i.e. when I mmap a file, I get a Vector{UInt8}, and
> SubArray then has to either be SubArray{UInt8} or I reinterpret the original
> mmap array to the desired eltype and then do a SubArray. The problem there
> is I might mmap 9 bytes, yet need to reinterpret a Vector{Float64}. This
> somewhat goes back to an issue I recently opened here:
> https://github.com/JuliaLang/julia/issues/16652.
>
> I think the pseudo-SubArray type in this case would need to be able to not
> only hold a reference to its parent array, but also be able to define a
> separate eltype than its parent. Maybe that's additional functionality we
> can extend SubArray with, but I'm not aware of how easy that would be to do.
> Definitely something worth looking into.

julia> sub(reinterpret(Float64, zeros(UInt8, 100)), 2:10)
9-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

Jacob Quinn

Jun 11, 2016, 4:51:02 PM
to juli...@googlegroups.com
Ok, looks like that would probably work. Out of curiosity, what happens to the "extra bytes" that don't fit the size of the type you're reinterpreting to? I.e., if I have 9 bytes and reinterpret them as Float64 (8 bytes), does my reinterpreted array just not include those extra bytes?

Yichao Yu

Jun 11, 2016, 5:00:54 PM
to Julia Dev
On Sat, Jun 11, 2016 at 4:50 PM, Jacob Quinn <quinn....@gmail.com> wrote:
> Ok, looks like that would probably work. Out of curiosity, what happens to
> the "extra bytes" that don't fit in type size of what you're
> re-interpreting? I.e. If I have a 9 bytes and reinterpret a Float64 (8
> bytes), my reinterpreted array just doesn't include those extra bytes?

It is ignored.

Stefan Karpinski

Jun 11, 2016, 8:05:49 PM
to juli...@googlegroups.com
We really should introduce a MemoryBlock type (name to be bikeshedded) that represents a contiguous region of bytes, and define Arrays and Strings in terms of it, allowing them to reference arbitrary portions of a MemoryBlock object.
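One way such a design might look, sketched in current Julia (1.x) syntax. Everything here is hypothetical: `MemoryBlock` and `BlockVector` are placeholder names, and this is not an actual or planned Base API, only an illustration of an array defined as a typed window onto a shared byte region.

```julia
# Hypothetical: a contiguous region of bytes.
struct MemoryBlock
    bytes::Vector{UInt8}
end

# Hypothetical: `len` elements of type T starting `offset` bytes into a block.
struct BlockVector{T} <: AbstractVector{T}
    block::MemoryBlock    # keeps the underlying memory reachable
    offset::Int           # number of bytes to skip
    len::Int              # number of elements of type T

    function BlockVector{T}(block::MemoryBlock, offset::Int, len::Int) where {T}
        isbitstype(T) || throw(ArgumentError("T must be a bits type"))
        (offset >= 0 && len >= 0 &&
         offset + len * sizeof(T) <= length(block.bytes)) ||
            throw(ArgumentError("window does not fit in the block"))
        new{T}(block, offset, len)
    end
end

Base.size(v::BlockVector) = (v.len,)
Base.IndexStyle(::Type{<:BlockVector}) = IndexLinear()

function Base.getindex(v::BlockVector{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    bytes = v.block.bytes
    GC.@preserve bytes begin
        p = convert(Ptr{T}, pointer(bytes, v.offset + 1))
        unsafe_load(p, i)     # read the i-th T from the window
    end
end

block = MemoryBlock(collect(reinterpret(UInt8, [1.0, 2.0, 3.0])))
v = BlockVector{Float64}(block, 8, 2)    # skip the first Float64
w = BlockVector{UInt8}(block, 0, 4)      # a second window onto the same block
```

Both `v` and `w` reference the same `MemoryBlock`, so the block stays alive until every window onto it is unreachable.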

Scott Jones

Jun 13, 2016, 9:32:35 AM
to julia-dev
Is this the same as the array buffer ideas that have been talked about previously?
This would help a lot with things that I'm working on - for large memory mapped arrays and strings.
+100 to getting this done for v0.6 (the earlier the better, IMO)

Jeffrey Sarnoff

Jun 16, 2016, 9:07:01 PM
to julia-dev
In case of shed overflow, shed bikes.  (as a type, a contiguous region of bytes may be named Bytes)

