
copy-on-write


Paul

Feb 24, 2017, 2:52:16 PM
Please could someone link to (or provide) an example of copy-on-write in C++.
I know what it means, but I'd like to see a simple example. I didn't
really find one that looked convincing on a Google search.

I was asked about it in a job interview and got stumped.
Because I did well enough on the other questions, I have now been
invited back for a second interview, and I think they may test
how much I've learned in the meantime.

Many thanks,

Paul

Öö Tiib

Feb 24, 2017, 5:59:40 PM
If you just need an implementation, then I believe the CString of Microsoft's
ATL / MFC is copy-on-write.

If you need an idea of its usefulness, then it seems that CoW is not too
good in a multi-threaded program and can cause a major hidden performance
hit there. Most programs written nowadays are multi-threaded, and so CoW
is not too good an idea most of the time.

Alf P. Steinbach

Feb 24, 2017, 6:30:23 PM
On 24.02.2017 23:59, Öö Tiib wrote:
> Most programs written nowadays are multi-threaded and so CoW
> is not too good idea most of the time.

Uhm, thread safety is an orthogonal issue. Adding illusory thread safety
to a string class is going to cost, no matter the internal implementation.

I write “illusory” because if one shares a mutable string object between
threads, it would be impractical to lock down the object any time its
`operator[]` returns a reference that possibly could be stored by the
thread and used to modify the string at arbitrary times.

The standard library's approach of most classes being thread-agnostic is
fine, I think. It's the responsibility of the code that uses those
classes to ensure thread-safe operation, e.g. by simply not sharing
mutable objects between threads.


Cheers!,

- Alf (opinionated, for the occasion)

Öö Tiib

Feb 24, 2017, 7:01:35 PM
On Saturday, 25 February 2017 01:30:23 UTC+2, Alf P. Steinbach wrote:
> On 24.02.2017 23:59, Öö Tiib wrote:
> > Most programs written nowadays are multi-threaded and so CoW
> > is not too good idea most of the time.
>
> Uhm, thread safety is an orthogonal issue. Adding illusory thread safety
> to a string class is going to cost, no matter the internal implementation.
>
> I write “illusory” because if one shares a mutable string object between
> threads, it would be impractical to lock down the object any time its
> `operator[]` returns a reference that possibly could be stored by the
> thread and used to modify the string at arbitrary times.

Thread safety is not orthogonal when we assume that particular data can
be accessed only by one thread. That is "embarrassingly parallel".
For example, if a task needs a string then it is copied into the input data
of the task, and then the task can be given to another thread to solve.
However, if the copy was lazy (with CoW) then we now have either the race
conditions of naive CoW or the inefficiency of non-naive CoW.

> The standard library's approach of most classes being thread-agnostic is
> fine, I think. It's the responsibility of the code that uses those
> classes, to ensure thread safe operation. E.g. by simply not sharing
> mutable objects between threads.

We can't if the CoW is done behind the scenes. That means locks at the
wrong level of granularity inside the string AND locks outside as well
if we really plan to use the same string from multiple threads concurrently.

Alf P. Steinbach

Feb 24, 2017, 7:40:16 PM
On 25.02.2017 01:01, Öö Tiib wrote:
> On Saturday, 25 February 2017 01:30:23 UTC+2, Alf P. Steinbach wrote:
>> On 24.02.2017 23:59, Öö Tiib wrote:
>>> Most programs written nowadays are multi-threaded and so CoW
>>> is not too good idea most of the time.
>>
>> Uhm, thread safety is an orthogonal issue. Adding illusory thread safety
>> to a string class is going to cost, no matter the internal implementation.
>>
>> I write “illusory” because if one shares a mutable string object between
>> threads, it would be impractical to lock down the object any time its
>> `operator[]` returns a reference that possibly could be stored by the
>> thread and used to modify the string at arbitrary times.
>
> Thread safety is not orthogonal when we assume that particular data can
> be accessed only by one thread. That is "embarrassingly parallel".
> For example if a task needs string then it is copied into input data of
> task and then the task can be given to other thread to solve.
> However if copy was lazy (with CoW) then we now either have race conditions
> of naive CoW or inefficiency of non-naive CoW.

To be clear, the situation you envision is that in thread B there is

const Cow_string sb = ...

initialized as a copy of thread A's

Cow_string sa = ...

Then when thread A modifies its `sa` it will have to stop the shared
buffer ownership, which modifies some reference counter or whatever
that's shared with `sb`, and might be concurrently accessed by B.

But this is only a problem with an incorrect CoW implementation, because
the cost of ensuring atomicity of the reference counting is minuscule,
completely insignificant, compared to the cost of copying the data,
which is the other part of that un-sharing-of-buffer operation.


>> The standard library's approach of most classes being thread-agnostic is
>> fine, I think. It's the responsibility of the code that uses those
>> classes, to ensure thread safe operation. E.g. by simply not sharing
>> mutable objects between threads.
>
> We can't if the CoW is done behind scenes. Then that means locks at
> wrong level of granularity inside of string AND locks outside as well
> if we really plan to use same string from multiple threads concurrently.

I think we can.

Yes, it involves atomic operations or locking for the reference counting.
That's IMHO at the right level. It doesn't make the class generally
thread-safe, which I see as an impractical goal (and one that, on the way to
discovering how impractical it is, introduces awesome inefficiency), but it
makes it just thread-safe enough to allow simple logical copying over to
another thread, without having to explicitly clone the object or check
its reference count – which is of course another option, to just be
completely explicit instead of relying on the implicit, but then in a very
non-ideal way "leaking implementation details".
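
To make that concrete, here is a minimal sketch of the design being described
(hypothetical names, untested, not any library's actual implementation): only
the reference count is atomic, and the buffer itself is left unsynchronized,
so the usual rule of not mutating one and the same Cow_string object from two
threads still applies.

#include <atomic>
#include <cstddef>
#include <cstring>

struct Buffer {
    std::atomic<int> refs{1};
    std::size_t      len{0};
    char*            data{nullptr};
};

class Cow_string {
    Buffer* buf;

    static void release(Buffer* b) {
        if (b->refs.fetch_sub(1) == 1) { delete[] b->data; delete b; }
    }

    void unshare() {                        // called before any mutation
        if (buf->refs.load() == 1) return;  // sole owner: nothing to copy
        Buffer* copy = new Buffer;
        copy->len  = buf->len;
        copy->data = new char[buf->len];
        std::memcpy(copy->data, buf->data, buf->len);
        release(buf);                       // drop our share of the old buffer
        buf = copy;
    }

public:
    Cow_string() : buf(new Buffer) {}
    Cow_string(const char* s) : buf(new Buffer) {
        buf->len  = std::strlen(s);
        buf->data = new char[buf->len];
        std::memcpy(buf->data, s, buf->len);
    }
    Cow_string(const Cow_string& o) : buf(o.buf) { buf->refs.fetch_add(1); }
    ~Cow_string() { release(buf); }
    Cow_string& operator=(const Cow_string&) = delete;  // omitted for brevity

    char  operator[](std::size_t i) const { return buf->data[i]; }            // read path
    char& operator[](std::size_t i)       { unshare(); return buf->data[i]; } // write path
};

The non-const operator[] is also where the scheme leaks: the reference it
hands out is exactly the hole discussed above.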


Cheers!,

- Alf (expressive, so far without gibberish diagnostics, yay!)

Öö Tiib

Feb 25, 2017, 9:38:41 AM
On Saturday, 25 February 2017 02:40:16 UTC+2, Alf P. Steinbach wrote:
> On 25.02.2017 01:01, Öö Tiib wrote:
> > On Saturday, 25 February 2017 01:30:23 UTC+2, Alf P. Steinbach wrote:
> >> On 24.02.2017 23:59, Öö Tiib wrote:
> >>> Most programs written nowadays are multi-threaded and so CoW
> >>> is not too good idea most of the time.
> >>
> >> Uhm, thread safety is an orthogonal issue. Adding illusory thread safety
> >> to a string class is going to cost, no matter the internal implementation.
> >>
> >> I write “illusory” because if one shares a mutable string object between
> >> threads, it would be impractical to lock down the object any time its
> >> `operator[]` returns a reference that possibly could be stored by the
> >> thread and used to modify the string at arbitrary times.
> >
> > Thread safety is not orthogonal when we assume that particular data can
> > be accessed only by one thread. That is "embarrassingly parallel".
> > For example if a task needs string then it is copied into input data of
> > task and then the task can be given to other thread to solve.
> > However if copy was lazy (with CoW) then we now either have race conditions
> > of naive CoW or inefficiency of non-naive CoW.
>
> To be clear, the situation you envision is that in thread B there is
>
> const Cow_string sb = ...
>
> initialized as a copy of tread A's
>
> Cow_string sa = ...
>

About like that. That sb is most likely a non-const member of some
object (task input data) and may later be copied by B into some other
non-const string, but in essence that was what I meant. Both sa and sb
are first owned by A, then ownership of sb is transferred to B. CoW,
however, keeps that shared ownership alive behind the scenes.

> Then when thread A modifies its `sa` it will have to stop the shared
> buffer ownership, which modifies some reference counter or whatever
> that's shared with `sb`, and might be concurrently accessed by B.

Yes, and all I'm saying is that there is an additional internal synchronized
check of the reference count on every write access, just in case, because
figuring out which two strings may have shared content is non-trivial
from the outside. That makes such a string slow compared to a string that
does not do that but instead has the small string optimization (strings of
up to 22 bytes have one level of indirection less).

>
> But this is only problem with an incorrect COW implementation, because
> the cost of ensuring atomicity of the reference counting is miniscule,
> completely insignificant, compared to the cost of copying the data,
> which is the other part of that un-sharing-of-buffer operation.

How recently have you profiled that? The lock-free atomic thingies seem
more expensive than good old mutexes on multiprocessor systems in my tests.

There is a surface somewhere in the space of high copy frequency and
rare writes into large objects beyond which CoW starts to win.
However, making lots of copies of large objects can likely be reduced by
other means instead of making the copy lazy.

>
> >> The standard library's approach of most classes being thread-agnostic is
> >> fine, I think. It's the responsibility of the code that uses those
> >> classes, to ensure thread safe operation. E.g. by simply not sharing
> >> mutable objects between threads.
> >
> > We can't if the CoW is done behind scenes. Then that means locks at
> > wrong level of granularity inside of string AND locks outside as well
> > if we really plan to use same string from multiple threads concurrently.
>
> I think we can.
>
> Yes it involves atomic operations or locking for the reference counting.
> That's IMHO at the right level. It doesn't make the class generally
> thread safe, which I see as an impractical goal (and on the way to
> discovering how impractical, introducing awesome inefficiency), but it
> makes it just thread safe enough to allow simple logical copying over to
> another thread, without having to explicitly clone the object or check
> its reference count – which is of course another option, to just be
> completely explicit instead of relying on implicit, but then in a very
> non-ideal way “leaking implementation details”.

You are correct that the synchronized access to reference counts at a low
level does not make the class thread-safe. It is still mandatory for
CoW even if we have external locks outside.
I do not understand what scenario you meant with leaking details.
If something is thread-safe then that is not an implementation detail but
an important feature.

Chris M. Thomasson

Feb 25, 2017, 10:46:12 PM
A lot of lock-free techniques are way too expensive, and require too many
damn atomics and/or membars to get the job done.

A typical generic mutex usually needs two atomic ops and an
acquire/release membar pairing to keep the protected data
visible. If a lock-free technique needs more membars and atomics than its
mutex counterpart, then you are not really going to see a performance
enhancement over a mutex that never hits its slow path, blocking. However,
mutexes can block, and that can ruin their performance. Use locking
patterns wisely; deadlock no more...

There are asymmetric locks, and non-blocking algorithms, but they are
more exotic.

;^)


Now, try to beat this sucker with a mutex based queue in a high
pressure, loaded environment. Stress the queue out to see what happens.

https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion

You can use the original CAS based version by Dmitry Vyukov.

Or, my fast-pathed wait-free alteration:

<pseudo code, membars aside for a moment>
______________________________________________
struct cell { uint32_t ver; double state; };

uint32_t head = 0;
uint32_t tail = 0;
cell cells[N]; // N must be a power of 2

void init() {
    for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
}

void producer(double state) {
    uint32_t ver = XADD(&head, 1);
    cell& c = cells[ver & (N - 1)];
    while (LOAD(&c.ver) != ver) backoff();
    c.state = state;
    STORE(&c.ver, ver + 1);
}

double consumer() {
    uint32_t ver = XADD(&tail, 1);
    cell& c = cells[ver & (N - 1)];
    while (LOAD(&c.ver) != ver + 1) backoff();
    double state = c.state;
    STORE(&c.ver, ver + N);
    return state;
}
______________________________________________

Keep in mind, the tweak is wait-free on the atomics when fetch-and-add
is hardware based, and has no retry loops, unlike most CAS and all LL/SC.
Keep in mind that LOCK XADD is an example of a wait-free op, vs. using
CMPXCHG in a software-based loop on Intel...
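
For reference, a rough C++11 rendering of the same pseudo code might look as
follows (a sketch, not the original: it assumes N is a power of two, maps
XADD/LOAD/STORE onto std::atomic operations, and stands in for backoff() with
a simple yield):

#include <atomic>
#include <cstdint>
#include <thread>

constexpr std::uint32_t N = 1024;   // must be a power of 2

struct cell {
    std::atomic<std::uint32_t> ver;
    double state;
};

std::atomic<std::uint32_t> head{0};
std::atomic<std::uint32_t> tail{0};
cell cells[N];

void init() {   // run before any producer/consumer threads are started
    for (std::uint32_t i = 0; i < N; ++i)
        cells[i].ver.store(i, std::memory_order_relaxed);
}

void producer(double state) {
    std::uint32_t ver = head.fetch_add(1, std::memory_order_relaxed);  // XADD
    cell& c = cells[ver & (N - 1)];
    while (c.ver.load(std::memory_order_acquire) != ver)               // wait for our slot
        std::this_thread::yield();                                     // backoff()
    c.state = state;
    c.ver.store(ver + 1, std::memory_order_release);                   // publish to a consumer
}

double consumer() {
    std::uint32_t ver = tail.fetch_add(1, std::memory_order_relaxed);  // XADD
    cell& c = cells[ver & (N - 1)];
    while (c.ver.load(std::memory_order_acquire) != ver + 1)           // wait for data
        std::this_thread::yield();                                     // backoff()
    double state = c.state;
    c.ver.store(ver + N, std::memory_order_release);                   // hand the cell back
    return state;
}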

Now, I also have a version of this with conditional waits here:

https://groups.google.com/forum/#!original/lock-free/acjQ3-89abE/idSNj77HsIIJ


JiiPee

Feb 26, 2017, 5:16:00 AM
On 24/02/2017 23:28, Alf P. Steinbach wrote:
> E.g. by simply not sharing mutable objects between threads.


or using a mutex....

Marcel Mueller

Feb 26, 2017, 5:58:05 AM
On 25.02.17 00.28, Alf P. Steinbach wrote:
> On 24.02.2017 23:59, Öö Tiib wrote:
>> Most programs written nowadays are multi-threaded and so CoW
>> is not too good idea most of the time.
>
> Uhm, thread safety is an orthogonal issue. Adding illusory thread safety
> to a string class is going to cost, no matter the internal implementation.

True.

A workaround is to make only volatile instances thread-safe. And of
course, these instances do not provide all operations.
(Of course, this is some kind of abuse of the volatile keyword.)
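
A minimal sketch of that workaround, assuming a made-up SharedText class
(the volatile overloads lock; the plain ones don't; objects you intend to
share are declared volatile):

#include <mutex>
#include <string>

class SharedText {
    std::string data;
    mutable std::mutex m;
public:
    // Plain overloads: full speed, single-thread use only.
    void append(const std::string& s) { data += s; }
    std::string copy() const          { return data; }

    // Volatile overloads: cast volatile away and serialize access.
    void append(const std::string& s) volatile {
        auto& self = const_cast<SharedText&>(*this);
        std::lock_guard<std::mutex> lock(self.m);
        self.data += s;
    }
    std::string copy() const volatile {
        auto& self = const_cast<const SharedText&>(*this);
        std::lock_guard<std::mutex> lock(self.m);
        return self.data;
    }
    // Deliberately no operator[] on the volatile interface: handing out
    // references into shared data is exactly what cannot be made safe here.
};

// volatile SharedText shared;   // gets only the locking interface
// SharedText local;             // gets the full, unsynchronized interface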

> I write “illusory” because if one shares a mutable string object between
> threads, it would be impractical to lock down the object any time its
> `operator[]` returns a reference that possibly could be stored by the
> thread and used to modify the string at arbitrary times.

There are some operations where thread safety makes no sense at all.
Probably no one expects a single mutable string instance to be thread
safe under modification.

But when using CoW or deduplication another threading issue arises, since
now different string instances may share the same data. And of course,
the programmer expects the instance methods of his own instance to
require no global lock.

In fact it is good advice to separate mutable strings from immutable
strings like Java or .NET do (StringBuffer/StringBuilder vs. String).
The C++ string class unfortunately represents the mutable version, so
introducing an immutable string class is a breaking change for many APIs.
And no performance benefit arises when the efficient immutable class is
converted to std::string all the time.
However, in any larger program where I could control all the APIs I have
done this job - with reasonable success. This concept is superior to CoW
since the copy operations are now a visible part of the type system. In
practice such copying rarely happens at all, since the use cases for mutable
and immutable strings are quite different.
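
As a rough sketch of that split (a hypothetical class, not from any
particular library): an immutable string can simply share its buffer on copy,
and any mutation has to go through an explicit conversion back to
std::string.

#include <cstddef>
#include <memory>
#include <string>

class ImmutableString {
    std::shared_ptr<const std::string> data;
public:
    explicit ImmutableString(std::string s)
        : data(std::make_shared<const std::string>(std::move(s))) {}

    // Copies just share the buffer; no copy-on-write machinery is needed,
    // because nobody can ever write through this type.
    const std::string& str() const       { return *data; }
    char operator[](std::size_t i) const { return (*data)[i]; }
    std::size_t size() const             { return data->size(); }
};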


> The standard library's approach of most classes being thread-agnostic is
> fine, I think. It's the responsibility of the code that uses those
> classes, to ensure thread safe operation. E.g. by simply not sharing
> mutable objects between threads.

Ack. But this does not apply to objects that are shared internally like
with CoW, since the programmer typically can't control this with
reasonable effort.


Marcel

Vir Campestris

Feb 26, 2017, 4:04:36 PM
On 24/02/2017 22:59, Öö Tiib wrote:
> If you need idea about its usefulness then it seems that CoW is not too
> good in multi-threaded program and can cause major hidden performance
> hit there. Most programs written nowadays are multi-threaded and so CoW
> is not too good idea most of the time.

Isn't CoW the standard method Linux uses when implementing fork()? Give
the forked process a page table pointing at all the same pages as the
parent, but read-only, and only copy them when they are written?

That seems reasonably efficient. Although the idea that any page fault
could result in an out-of-memory error does bother me!

Andy

Öö Tiib

Feb 26, 2017, 6:32:03 PM
That is all true but non-constructive. The CoW there is not comparable.
How would you use that fact in the implementation of your C++ CoW class?

The CoW of fork() is part of how the kernel handles virtual memory on the
Intel architecture. The pages are large and fixed-size (typically 4 KiB), and
the per-process write protection is supported by the Intel MMU.

There are no such run-time write-protection features in C++, and it
would be quite pricey to implement them on a lot of embedded systems.

Paavo Helde

Feb 26, 2017, 6:32:14 PM
Yes, this is efficient. And the reason why it is efficient is that there
are few false positives when detecting if the page needs to be copied.
There are separate machine instructions for modifying and reading the
memory, and modifying instructions typically do not appear unless
needed. Also, the amount of data to copy is relatively large, like 2MB,
not some 10 bytes as typical for std::string.

The other problem with std::string (apart from prevailing small strings
which do not deserve COW) is that because of its interface one often
needs to perform COW also in cases where there is no modification
planned or happening. This makes COW bad with std::string in
multi-threaded programs. Other classes with better interfaces and larger
data can potentially benefit from COW also in multi-threaded programs,
similar to fork().
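
For example, something as innocent as the following forces the un-share in a
CoW std::string, because the non-const operator[] has to assume that the
returned reference may be written through:

#include <iostream>
#include <string>

void print_first(std::string& s) {   // s is non-const, so
    char c = s[0];                   // the non-const operator[] is selected and a
    std::cout << c << '\n';          // CoW implementation must un-share the buffer here
}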

BTW, what makes the COW implementation of fork() very suitable for C++ is
that in a multi-threaded C++ program one basically cannot do *anything*
between fork() and the subsequent exec(). One cannot, for example, construct
a std::string or anything else which might require dynamic memory
allocation, because the memory allocator might contain locks held by
other threads which are absent in the new process and so can never release
those locks. So about the only thing which you can do after fork() in a
multithreaded C++ program is exec(), which means that all the old pages
get discarded anyway and there would have been no point in copying them.
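
Which is to say, the only really safe pattern looks roughly like this (POSIX
calls, error handling trimmed):

#include <unistd.h>

void spawn(const char* path, char* const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {            // child: address space is CoW-shared with the parent
        execv(path, argv);     // replaces the image; the shared pages are never touched
        _exit(127);            // exec failed; only async-signal-safe calls are allowed here
    }
    // parent continues; waitpid(pid, ...) elsewhere if needed
}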

Chris M. Thomasson

Feb 26, 2017, 11:34:48 PM
On 2/25/2017 7:46 PM, Chris M. Thomasson wrote:
> On 2/25/2017 6:38 AM, Öö Tiib wrote:
[...]

> Now, try to beat this sucker with a mutex based queue in a high
> pressure, loaded environment. Stress the queue out to see what happens.
>
> https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion
>
> You can use the original CAS based version by Dmitry Vyukov.
>
> Or, my fast-pathed wait-free alteration:

Here is a direct link to the code:

http://pastebin.com/raw/mtCh5Zxu

Ian Collins

Feb 27, 2017, 2:15:54 AM
On 02/27/17 10:04 AM, Vir Campestris wrote:
> On 24/02/2017 22:59, Öö Tiib wrote:
>> If you need idea about its usefulness then it seems that CoW is not too
>> good in multi-threaded program and can cause major hidden performance
>> hit there. Most programs written nowadays are multi-threaded and so CoW
>> is not too good idea most of the time.
>
> Isn't CoW the standard method Linux uses when implementing fork()?

It's the way Unix implements fork()!

--
Ian

Bo Persson

Feb 27, 2017, 7:55:19 AM
When you create a new process you get a new virtual address space, not
shared by the original process. So you now have MORE (virtual) memory
than before.

What is shared is the physical memory, but that can be remapped or swapped
out as needed.


When you create a thread it is part of the same process, and it competes
for the same memory resources. And C++ requires that when copying a
std::string you should get any potential std::bad_alloc exception
immediately, not some time later when using non-const operator[] on one of
the strings.


Bo Persson

SG

Feb 28, 2017, 7:17:52 AM
On Friday, February 24, 2017 at 8:52:16 PM UTC+1, Paul wrote:
> Please could someone link to (or provide) an example of copy-on-write in C++.
> I know what it means, but I'd like to see a simple example. I didn't
> really find one that looked convincing on a Google search.

A copy-on-write implementation of a type might be useful if otherwise
copying is too expensive. Internally, you could use std::shared_ptr<T>
for this. It offers a uniqueness check that allows you to make sure
that you won't modify something that is shared.

As a generic wrapper it might look like this:

#include <memory>

template<class T>
class cow {
    std::shared_ptr<T> ptr;
    void make_unique();
public:
    cow() : ptr(std::make_shared<T>()) {}
    cow(T const& x) : ptr(std::make_shared<T>(x)) {}
    cow(T && x) : ptr(std::make_shared<T>(std::forward<T>(x))) {}
    T const& read() const { return *ptr; }
    T& write() { if (!ptr.unique()) make_unique(); return *ptr; }
};

template<class T>
void cow<T>::make_unique() {
    auto old = std::move(this->ptr);
    this->ptr = std::make_shared<T>(*old);
}

If necessary, make_unique is called to create a copy. A copy is needed
if you want to modify the pointee while it is shared among multiple cow
instances. Note: I didn't test this. But I hope this helps anyway.
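
With the same caveat (untested), usage would presumably look something like:

#include <iostream>
#include <string>

int main() {
    cow<std::string> a(std::string("hello"));
    cow<std::string> b = a;           // cheap copy: both share one buffer
    b.write() += " world";            // un-shares first, then modifies only b
    std::cout << a.read() << '\n';    // hello
    std::cout << b.read() << '\n';    // hello world
}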

The use of std::shared_ptr allows you to follow "the rule of zero"
(you don't have to define any special member functions) and also
avoids some data race issues. Internally, the reference counter is
"atomic" which makes it safe to copy and destroy such shared_ptr
instances even if multiple threads share the same pointee. On that
note: const-access of such a T should also be thread-safe. If your T
has a mutable data member then chances are small that you can safely
use it inside cow<> in a multi-threaded program.

For special types (such as strings) you could probably do better. For
example, I liked the old libstdc++ (GCC) std::string implementation in
terms of memory layout. The size, capacity and reference counter
preceded the string content in heap memory and a private data member
pointed directly to the first character. Only one allocation was
necessary. Their current implementation is different and does not use
copy-on-write anymore as far as I know.
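
The layout described there amounts to something like this (names made up; the
real libstdc++ code differs in detail):

#include <atomic>
#include <cstddef>

struct rep_header {
    std::size_t      length;
    std::size_t      capacity;
    std::atomic<int> refcount;
    // the characters follow immediately, in the same allocation
};

class cow_string {
    char* p;   // points at the first character, i.e. just past the header
    rep_header* header() const {
        return reinterpret_cast<rep_header*>(p) - 1;
    }
    // ...
};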

Cheers!
sg

Juha Nieminen

Feb 28, 2017, 8:56:58 AM
SG <s.ges...@gmail.com> wrote:
> cow(T const& x) : ptr(std::make_shared<T>(x)) {}
> cow(T && x) : ptr(std::make_shared<T>(std::forward<T>(x))) {}

I thought those were generated automatically by the compiler, unless
you disable them either explicitly (with =delete) or implicitly
(by implementing one but not the other, or implementing a copy or
assignment operator, or a destructor.)

Chris M. Thomasson

Mar 1, 2017, 5:04:56 PM
On 2/25/2017 7:46 PM, Chris M. Thomasson wrote:
Here are the membars required wrt the pseudo code of the
multi-producer/consumer atomic bounded queue:
> ______________________________________________
> struct cell { uint32_t ver; double state; };
>
> uint32_t head = 0;
> uint32_t tail = 0;
> cell cells[N]; // N must be a power of 2
>
> void init() {
> for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;

// RELEASE (depending on how this was created! dynamic, and/or before
threads?)

> }
>
> void producer(double state) {
> uint32_t ver = XADD(&head, 1);
> cell& c = cells[ver & (N - 1)];
> while (LOAD(&c.ver) != ver) backoff();
// ACQUIRE

> c.state = state;
// RELEASE

> STORE(&c.ver, ver + 1);
> }
>
> double consumer() {
> uint32_t ver = XADD(&tail, 1);
> cell& c = cells[ver & (N - 1)];
> while (LOAD(&c.ver) != ver + 1) backoff();
// ACQUIRE

> double state = c.state;
// RELEASE

> STORE(&c.ver, ver + N);
> return state;
> }
> ______________________________________________


This is the basic acquire/release model.

Chris M. Thomasson

Mar 1, 2017, 5:07:27 PM
On 3/1/2017 2:04 PM, Chris M. Thomasson wrote:
>>
>> void init() {
>> for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
>
> // RELEASE (depending on how this was created! dynamic, and/or before
> threads?)
>
>> }

Ummm, that before should read as:

before/after

SG

Mar 3, 2017, 2:46:36 AM
Juha Nieminen wrote:
This *is* a "rule-of-zero" implementation of a copy-on-write wrapper.
What you are looking at are conversion constructors, not copy/move
constructors. Otherwise they would take a cow<T> instead of a T as
parameter. :-)

Alf P. Steinbach

Mar 3, 2017, 2:55:26 AM
On 3/3/2017 8:46 AM, SG wrote:
> Juha Nieminen wrote:
>> SG wrote:
>>> cow(T const& x) : ptr(std::make_shared<T>(x)) {}
>>> cow(T && x) : ptr(std::make_shared<T>(std::forward<T>(x))) {}

Just a note: `T` is not a deduced type here, it's specified in the class
template instantiation. So `T&&` is not a forwarding reference (a.k.a.
universal reference), it's just a good old rvalue reference. Hence the
`std::forward` reduces to a `std::move`, I believe, and anyway I'd write
`std::move` there in order to not mislead myself and others – I find
the magic of `std::forward` a bit impenetrable.


Cheers!,

- Alf

SG

Mar 4, 2017, 11:37:35 AM
If you write std::move in such a context, you have to be sure that T
is never going to be a reference type. Template argument deduction is
not necessary to turn T into an lvalue reference type. Of course, in
this case, it makes little sense to try to instantiate something like
a cow<int&>. But I can point you to an example where the use of
std::forward over std::move in such contexts actually prevents
accidental moves: tuple<string&>. Consider this situation:

string a = "a";
string b;
tie(b) = tie(a); // does this invalidate a ?

The relevant assignment operator is this one:

tuple<string&>& tuple<string&>::operator=(tuple<string&>&& rhs)

and it uses std::forward<string&> instead of std::move so that an
lvalue string expression is assigned to b and a is not moved from.

My suggestion for what to use in such general template contexts is
one of the following two options:

(1) std::forward (also works if T can be an lvalue reference)

(2) std::move with an in-class static assert that ensures T is
never going to be an lvalue reference.

I wouldn't use std::move without ensuring that T is an object type.
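
Applied to the cow<> wrapper from earlier in the thread, option (2) might
look like this (sketch):

#include <memory>
#include <type_traits>

template<class T>
class cow {
    static_assert(!std::is_reference<T>::value,
                  "cow<T> requires an object type, not a reference type");
    std::shared_ptr<T> ptr;
public:
    cow(T&& x) : ptr(std::make_shared<T>(std::move(x))) {}
    // ... rest as before ...
};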

> Cheers!,
>
> - Alf

Cheers!
sg