Reference versus value semantics

814 views
Skip to first unread message

haymo kutschbach

unread,
May 1, 2012, 7:40:24 AM5/1/12
to juli...@googlegroups.com
This is a follow up to this post https://groups.google.com/forum/?fromgroups#!topic/julia-dev/thF0GbNMz-0 and several others, but it is more general, so I decided to open a new thread. 

I am still concerned about reference semantics in Julia. It appears (obviously ;), there is a strong bias towards references on assignments and function calls among Julias developers. Besides the respect this deserves, I want to count some potential problems this may introduces. 

For me, reference semantics is an artefact from the movement of numerical applications towards general purpose languages. At first, nobody really wanted reference semantics, but languages like C and Java simply did not offer the option. Mathematical DSLs (e.g. FORTAN, Matlab) usually provide value semantics (if the underlying GPL allows it, see numpy, where it does not). I think, it is even fair to state, that a reasonable number of potential users of a numerical DSL doesn't even know about the concept of pointers in C. If one goal of a good language design is to give convenience _and_ performance, what's left is: value semantics with lazy copies on write.  

IMO, the advantages of a copy on write system outweight its disadvantages (more complexity, but it can be hidden from the user). I understand that every developer/designer likes to keep things simple. But for me, forcing reference semantics for numerical arrays is keeping things simpler only for the language designer. Reference semantics violate an important rule in SW development: decoupling. The user is faced with much higher complexity. If algorithms grow, the bug potential does as well. I also like simple things though, but this value/reference semantics thing is just not simple.

One statement came up in this group several times: "copy on write introduces hidden performance costs and kills performance on surprise" 

Could someone please clarify why this ? It is important, because the performance is the only argument left against value semantics AFAICS. Copy on write - if done right - does only copy in all cases where a copy is really needed. This corresponds to all situations where the user should have used copy() anyway. So why would copy on write use more memory / be slower than ? The "surprise" - if I get its meaning correct - arises, if mutating an array leads to a hidden copy of its storage. But, again, this happens only, when otherwise an explicit copy() was necessary. Forgetting explicit things (like "copy()") seems to introduce more potential for surprises. 

There are other arguments against reference semantics:
* Functions can alter their arguments on surprise. This has been discussed here already but IMO not to an end. Marking mutating functions with ! is not feasable: i) it is hard to force this convention and in reality (just like in Julias current code base) will not be consistent. ii) new problems arise for functions with multiple arguments. Which is the one which is potentially mutated? Which one is used for inplace operations?  iii) Nested functions. When writing a function the user is faced with a new complexity: she must take care which functions to use: ! or without !. Now from a users point of view, it is not only important, _what_ a function does, but also _how_. (The last point is a minor problem, I guess.)

* The cell issue: Having cells which are capable of storing arrays of arrays, it gets hard to think of any case where storing references inside cells would be needed. If one stores an array in a cell she wants to have it safe and be sure external modifications to the source array will not alter the copy inside the cell. So, storing into cells should always make a copy. The same is true for fetching an array from a cell. Problems arise, if cells are stored into cells. Without copy on write, one will end up getting cascading copies - which _really_ kills performance. 

Regarding slices: I personally dont care, if a slice produces a copy or not. Cases, where creating _and working with_ a "reference slice" efficiently are limited. (see: strided storage in numpy). So I rather have a copy coming out of subarrays than having to keep in mind potential side effects on write. Having built a system with such reference slices ourself, we turned away from it and since than concentrate on improving the memory management. The higher cost of most copies can be kept very small, if the target memory comes from a pool (and hence still stays in the cache). This holds for a general copy on write mechanism as well. 

I apologize if this topic is pushed too much here. But I still think, it deserves a reasonable, conscious decision. At the end, both schemes are usable of course (not sure about cells though). And I promise, once the Julia staff declares a design lock on the issue - I will keep silence once for all .. :) 



John Myles White

unread,
May 1, 2012, 8:04:12 AM5/1/12
to juli...@googlegroups.com
For me, the usefulness of value semantics for a general purpose scientific language can be dispelled with a single example: I have a matrix that occupies N% of RAM (N > 50) and I want to perform some trivial operation on it that could succeed if no copy is ever made. But if even one copy is made, we must use disk instead of RAM and the computation becomes unusably slow.

-- John

haymo kutschbach

unread,
May 1, 2012, 8:20:22 AM5/1/12
to juli...@googlegroups.com
John, value semantics can handle this situation just the same as reference semantics. This is, what 'copy on write' is for. I admit, it introduces a larger overall memory footprint even if the storage gets pooled (which is advisable). In practice, however, this turns out to be of much less importance. Lets take your example: if the problem size is larger than a certain %P of the RAM part of virtual memory, one cannot reliably prevent from swapping anyway. This is also true, because you never reliably know, how large 100% is with virtual memory systems. In practice, a feasible P appears to lay rather somewhere around 10%. A matrix with the size of 50% of the process' virtual memory which currently exists in RAM is also hard to create, transfer, work with and store without swapping. Lastly, one would have to make sure, all necessary functions will work inplace properly... I'd suggest to use either a considerably larger RAM or a distributed memory system in such cases. 

Krzysztof Kamieniecki

unread,
May 1, 2012, 8:30:10 AM5/1/12
to juli...@googlegroups.com
Would/could Julia check methods and declare an error if the methods modifies a parameter and does not have a ! in it's name?

david tweed

unread,
May 1, 2012, 9:26:31 AM5/1/12
to julia-dev
On May 1, 1:30 pm, Krzysztof Kamieniecki <k...@kamieniecki.com> wrote:
> Would/could Julia check methods and declare an error if the methods
> modifies a parameter and does not have a ! in it's name?

From experience in C++, the problems tend to occur the other way: with
lots of generic programming (sometimes metaprogramming) it can be
difficult to prove if some combination of functions and higher-order
functions actually don't mutate it's argument, and whether it does may
conceivably vary with the actual types, (ie, it's easy to write a
conservative analysis that returns "does modify" if anything looks
vaguely modificatory), so almost no non-trivial funcions can be
declared non-modifying.

Andrei Jirnyi

unread,
May 1, 2012, 3:20:54 PM5/1/12
to julia-dev
On May 1, 6:40 am, haymo kutschbach <h.kutschb...@ilnumerics.net>
wrote:
> not). I think, it is even fair to state, that a reasonable number of
> potential users of a numerical DSL doesn't even know about the concept of
> pointers in C. If one goal of a good language design is to give convenience
> _and_ performance, what's left is: value semantics with lazy copies on
> write.

Thanks for support! :)
Realistically, I totally understand how switching to copy-on-write and
especially getting rid of pass by reference could be very problematic
in terms of modifying and maintaining the code base; even though I do
like this part of Matlab a lot, Julia developers are probably not
quite on par with Mathworks in terms of resources yet :)

Fortran has pass by reference by default everywhere, and it works
fine, especially with intent() qualifiers. Having some explicit
qualifiers could be nice btw! Even if entirely optional, and not very
strictly enforced -- e.g. say if "intent(in)::X" produced an error
simply if X is modified within a function itself -- it might still be
helpful reading and parallelizing code. I can also see how smth like
that could be helpful for the optimizer as well.

It seems like a lot of potential problems with value semantics are
related to complicated data structures -- I think people who design
such structures are likely to be more familiar with concept of
pointers and references, and thus more comfortable using special
syntax for them, than the average Matlab script writer who wants to do
some basic array manupulations.

--aj

Toivo Henningsson

unread,
May 1, 2012, 3:40:00 PM5/1/12
to julia-dev
> For me, reference semantics is an artefact from the movement of numerical
> applications towards general purpose languages.

I think having a pass by reference array type in Julia is an
inescapable consequence of being able to write things which would
otherwise have been done in e g C. I think various kinds of low level
operations (including linear algebra) need mutable arrays of this
kind.

Now, Julia is meant to span many levels, so I guess there's nothing
stopping someone from implementing an array type with copy on write
semantics on top of Array (might need a little support from the
developers to keep track of which values are live). Then those who
like it better could use it. Another possibility could be a
microlanguage with copy on write semantics.
It would of course be unfortunate if this created a divide in the
community. But Julia is so powerful I think we have to constantly face
these diversification pressures.

Tim Holy

unread,
May 1, 2012, 4:41:56 PM5/1/12
to juli...@googlegroups.com
On Tuesday, May 01, 2012 12:20:54 PM Andrei Jirnyi wrote:
> Realistically, I totally understand how switching to copy-on-write and
> especially getting rid of pass by reference could be very problematic
> in terms of modifying and maintaining the code base; even though I do
> like this part of Matlab a lot, Julia developers are probably not
> quite on par with Mathworks in terms of resources yet

Resources matter, but many of us have written versions of specific algorithms
that are more performant and/or capable than anything Mathworks provides. It's
more a question of whether it's a good idea. If we literally "got rid of" pass
by reference, I would complain loudly. One cannot write efficient algorithms
without it.

--Tim

dslate

unread,
Jun 26, 2012, 2:25:56 AM6/26/12
to juli...@googlegroups.com
This is a bit of an aside from the argument over pass-by-reference vs
pass-by-value, but I find the manual a bit unclear on the subject.
The section on Arrays says:

  In Julia, all arguments to functions are passed by reference.  Some
  technical computing languages pass arrays by value, and this is
  convenient in many cases. In Julia, modifications made to input
  arrays within a function will be visible in the parent function.

The fact that scalar variables (as opposed to arrays and other
collections) are passed by value and not by reference isn't clear from
this passage, and may not be obvious to a julia newbie.  Of course
this behavior is easy enough to verify (see log below), but perhaps
the manual should mention this explicitly.

---------------------------------------------
Script started on Tue Jun 26 01:05:01 2012
david@LC2430HD:~$ cat var-ref-tst.jl
function modx( x::Int64)
x = 3
println( x)
end
y = 7
println( y)
modx( y)
println( y)

function moda( a::Array{Int64,1})
a[ 1] = 5
println( a[ 1])
end
z = Array( Int64, 10)
z[ 1] = 7
println( z[ 1])
moda( z)
println( z[ 1])
david@LC2430HD:~$ julia -v
julia version 0.0.0+89509897.r69b4
david@LC2430HD:~$ julia var-ref-tst.jl
7
3
7
7
5
5
david@LC2430HD:~$ exit
exit

Script done on Tue Jun 26 01:06:35 2012
---------------------------------------------

Stefan Karpinski

unread,
Jun 26, 2012, 2:39:36 AM6/26/12
to juli...@googlegroups.com
The passage is accurate: everything is semantically passed by reference. However, bits types (such as the various integer and float types), are immutable — there's no way to modify an integer value, so there's no difference in behavior between passing by reference and passing by value. When you write a variable assignment like `x = 3` you aren't modifying the integer value that was passed in, you are rebinding x to a different integer value. When you write `a[1] = 5` you are actually modifying the contents of a, while leaving it bound to the same array value as was passed in. If you wrote `a = [5]` then you would be rebinding a to a new array value and that change would have no effect outside of the function body; `x = 3` is exactly the same.

--

Jeff Bezanson

unread,
Jun 27, 2012, 3:32:09 AM6/27/12
to juli...@googlegroups.com
I agree with Tim. I find pass by reference more expressive, since
updating in-place becomes a tool available when you want it. Much code
naturally follows a functional style anyway, where arguments are not
modified, and can continue to do so. I believe the performance of
reference vs. copy-on-write ends up being comparable. Except that with
copy-on-write you can't implement data structures that need O(1)
in-place updates.

The problem with "If one stores an array in a cell she wants to have
it safe and be sure external modifications to the source array will
not alter the copy inside the cell" is quite simply: how do you know?
What if you want to keep track of a set of objects that are changing?
Immutable data and purely functional data structures have their place,
and it would be great to support them well eventually, but
copy-on-write is not necessarily the best way to achieve that. In fact
many functional data structures do updates by allocating small
incremental objects rather than by copying.

I don't find that we constantly worry about whether things will be
mysteriously mutated. This requires only the convention that you don't
mutate objects you don't own unless the owner wants you to. I believe
most programmers follow this without really thinking about it.

But, I don't mean to debate, because we are not considering changing
this. You can rest assured that very little in julia was not a
conscious decision.

haymo kutschbach

unread,
Jun 29, 2012, 4:25:30 AM6/29/12
to juli...@googlegroups.com
Tor, thanks for your support. Your ideas are indeed exciting :). I also consider the optimal solution to be one with the best out of both worlds. Most "everyday use" of arrays would profit from value semantics. For performance critical inner loops, however, the option of inplace operations must exist. I am not familiar enough with the code base of Julia to judge the feasibility of your solution for it. 

The solutions we found for ILNumerics pretty much archive the best of both worlds. However, they are based on a static typed language (C#) and utilize strongly typed arrays in function declarations. Every function returns a special "return array" type - which will either be disposed after the first use or implicitly assigned (converted) to a local array. We implemented "input array" types which realize value semantics and "output array" types - which are similar to references in Julia. For the function declaration the user simply chooses the array type matching the intended semantic for the array. All this plays nicely together with copy on write, external (C/Fortran) modules and even enables automatic inplace operations on temporary storages - without compiler support. However, obviously something like that cannot work for Julia. Julia would require much different solutions. 

But Julia has so many other advantages - the absence of those comfort features must/will not stop its success. Especially, because the better compiler support can serve as a partial substitute. And so, cells are the only place where I miss the conscious decision Jeff was talking about. The only cell use case mentioned reduces cells to features already provided by simple lists. Why not get rid of cells than? Every user coming from Matlab would certainly get confused by the reference behavior of Julias current cells?

Anyway - congratulations again for your great efforts! Criticizing the one part does in no way means I don't appreciate having Julia around...


On Friday, June 29, 2012 6:20:54 AM UTC+2, tor wrote:
The OP warmed my heart. I completely agree with his general sentiment. Taking references of mutable objects and passing them around in (large) programs, by default and 'for efficiency' seems to be premature optimization and to introduce unnecessary coupling.

However, I think Julia would be a completely different language with only copy-on-write. What, for instance, should we pass to ccall?

I think there somehow must exist a good compromise, though.

> Toivo Henningson
> ... I guess there's nothing
> stopping someone from implementing an array type with copy on write
> semantics on top of Array ...

I think language design should be concerned with communication between people (and machines), not just with technical possibilities. It's about convenience and conventions, and what the language encourages. For example, C programmers can also subscript-check their own arrays, but this is rarely done. If it had been supported by the language and could be turned off for final releases, it might have been widely adopted. In C, subscript checking is simply not encouraged.

Here's my first thought: One could categorize references into four kinds, according to whether they 1) address immutable data or not, and 2) are uniquely bound or not.

The first two, namely bindings to immutable data are unproblematic. The number of bindings doesn't matter. The two kinds that address mutable data are more interesting.

Unique bindings to mutable data does not cause coupling. So these are also 'good guys', but require more thought if incorporated into the language. Non-unique bindings to mutable data are... well, let's just call them exciting! They are necessary for external calls and for implementing certain data structures, such as doubly linked lists.

What I (tentatively) propose is that non-unique bindings to mutable data are not allowed to cross module boundaries (in the hopefully upcoming module system for Julia). I believe this would reduce coupling and make Julia more scalable.

With unique bindings, you could for example make an array and then mutate it:

a = zeros(Float, 10, 10)
a[5,5] += 1

So far, this is conceptually the same as handling a single float:

x = 1.0
x += 1

As Stefan says, x is just rebound to a fresh immutable float. So we can THINK of the array in the same way.

I think Julia should exploit this. You could do a LOT of useful programming without ever using anything but immutable data and unique bindings to mutable data.

One could for example write quick-sort in one module and call it from another. This would all be in-place and efficient. But you could not mess with the array while it was being sorted (why would you want to?). You could import a doubly-linked list from a module and access all of it's functionality via a suitable interface. However, you would not get access to the non-unique bindings that glue the list nodes together (why would you want to?).  

For the real men out there, I suggest using '&' for making simple bindings to mutable objects. So, y = &x would make both y and x non-unique bindings, even if x was originally mutable. More or less what y = x does in Julia at the moment.

Simply y = x should work as if a copy is made, but should normally entail copy-on-write, or similarly efficient strategies.
For a lot of everyday, outside-of-inner-loops programming, you wouldn't need anything else.

To pass a unique reference, I suggest something like y = give x. This would bind y to x's object and then unbind x, which may seem meaningless (well, you could use it to rename and object). But in a function call, it may make more sense: a = qsort(give a). Or, to reduce an array and have it reclaimed by the gc: s = sum(give a).

Note. If the compiler sees something like x = f(x), it might sometimes be able to change it to x = f(give x) when the effect is provably the same.  

The thing is, to prevent the 'exciting' references from crossing module boundaries, the compiler or runtime must keep track of uniqueness. This might get tricky if those references are stored in arrays. Still, I do think it should be possible somehow, without any increase in asymptotic performance. E.g. if such a reference is made a member of an array, mark the entire array as 'exciting'.

Also note that 'give' is not guaranteed to be super-efficient. For example:

Julia> a = rand(1000)
Julia> b = a # setting up for copy-on-write
Julia> a = qsort(give a) # a's object is written to, so b must first get a copy.

But this is natural. Do you want a copy or not? If not, don't assign a to b!

I've also played with the thought of 'giving' slices of arrays. Example:

Julia> a = [1,2,3,4,5,6]

Julia> b = give a[1:3]

Julia> a
[4,5,6]

Julia> b
[1,2,3]

Since Julia supports multiple return values, one might perhaps give away the middle of an array:

Julia> a = [1:10]
Julia> (a, b, c) = give a[3:7]
Julia> a
[1, 2]

Julia>b
[3,4,5,6,7]

Julia>c
[8,9,10]

None of this should lead to linear time copying of course. Giving slices might be nice when implementing qsort:

function qsort (a)
    if length(a) < 10
        return insertSort (give a)
    end
    pivot = getAGoodPivot(a)
    left, right = partition (give a, pivot)
    return [qsort(give left), pivot, qsort(give right)]
end    

One issue here is that the concatenation at the end must just coalesce the array memory back together, rather than start reallocating. I'm not sure how realistic this is.

In summary. I think the cases where one really needs to hook up code with the internals of data structures written in other files are either rare or they should be rare. And in those cases one can use good old load() in order to put everything in one module (as is probably logical).   

PLEASE NOTE: I don't think for a second that anything of what I've suggested would simply work the way I describe it. I have no experience implementing any of it. I merely wanted to hear if anyone thought that some kind of compromise might work. Also forgive me for writing so much in my first post. I guess I feel strongly about this.

david tweed

unread,
Jun 29, 2012, 10:59:42 AM6/29/12
to juli...@googlegroups.com
It looks like having COW on some kinds of arguments still remains something a lot of people want. Others, such as Stefan, don't like the non-visible copying making it difficult to figure out performance. So, to ask a question:

does anyone know of a language/runtime that has done a good job of providing "debugging support" for figuring out when and why COW is triggering? If there was an existing success case we could look at it and see how well it succeeds in "making visible the invisible things affecting performance".

(Personally I'm open to COW IF it's easy to figure out when it's kicking in for really big objects, so that the code can be rewritten with either modification-in-place or explicit copying.)
Reply all
Reply to author
Forward
0 new messages