Possibility of a CAS API

Showing 1-13 of 13 messages
Possibility of a CAS API Armon Dadgar 2/24/12 4:41 PM
As part of a new feature we are working on, we've run into
a situation where it would be incredibly convenient to have a 
check-and-set (CAS) API for Riak KV. In short, we are trying to build
a unique index of a bucket, using a second bucket which acts as a 
reverse index.

The CAS API would operate in the same manner as a PUT, except it
should take a "last vclock". The new value + last vclock are submitted
to the responsible vnodes. The vnodes respond if the last vclock
for the key matches the specified last value. If we get "r" nodes responding
that the last value matches, then we should commit the write. This method
is basically a two-phase commit.

It would also be great if no-value sentinel could be specified to indicate
the CAS should only succeed if there is not already a key. We need this
to make sure uniqueness constraints are not violated.

I wanted to gauge the interest from the community in something like this,
and see if I could get thoughts from the Basho team on if this could be
implemented.

Best Regards,

Armon Dadgar

Re: Possibility of a CAS API Ian Plosker 2/24/12 5:37 PM
Armon,

The feature you are describing is very similar to the conditional header, `If-Unmodified-Since`, provided by the client APIs (http://wiki.basho.com/HTTP-Store-Object.html & http://wiki.basho.com/PBC-Store-Object.html).

Is this similar to what you're describing or am I missing something?

-- 
Ian Plosker <i...@basho.com>
Developer Advocate
Basho Technologies

_______________________________________________
riak-users mailing list

Re: Possibility of a CAS API Dietrich Featherston 2/24/12 6:09 PM
If you need CAS semantics, then coordinate that outside of riak. Any check-then-act type of operation where atomicity is important is going to leave some room for a data race in a system with the distribution semantics of riak. Would suggest thinking about the problem in such a way that handling of siblings is tolerant of duplicate writes and eventually the correct value bubbles up to the readers. That or do the coordination of unique indexes in something not dynamo shaped.

I can't say I'm intimately familiar with the work yet, but others have prototyped/postulated consistency layers on top of riak (a la zab) that might more closely match what you're trying to do. None of this is in a released / supported version of riak to my knowledge though.

Thanks,
D


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Possibility of a CAS API Armon Dadgar 2/24/12 7:21 PM
It sounds like the "If-Unmodified-Since" and "If-None-Match" flags could do what I
need, but the docs specify "it is possible for the condition to evaluate to true for
multiple requests if the requests occur at the same time."

From my understanding, the KV vnode's process their requests in a serial fashion.
I'm not sure I fully understand how It could be that the request evaluates to true
for multiple requests, if the PUTs are handled serially.

If it is a matter of the vnodes being interleaved, would it be solvable by
setting w = r = n?

I'm not convinced that a CAS operation is inevitably subject to data races.
There are proven techniques for avoiding races at the cost of latency,
which is acceptable in certain situations.

I will take a look at Zab, thanks for the reference!

Best Regards,

Armon Dadgar
Re: Possibility of a CAS API Reid Draper 2/24/12 7:46 PM

On Feb 24, 2012, at 10:21 PM, Armon Dadgar wrote:

It sounds like the "If-Unmodified-Since" and "If-None-Match" flags could do what I
need, but the docs specify "it is possible for the condition to evaluate to true for
multiple requests if the requests occur at the same time."

From my understanding, the KV vnode's process their requests in a serial fashion.
I'm not sure I fully understand how It could be that the request evaluates to true
for multiple requests, if the PUTs are handled serially.

If it is a matter of the vnodes being interleaved, would it be solvable by
setting w = r = n?
The problem is that amongst all replicas for a particular key,
operations are not serializable. Put another way, if there
are concurrent writes to three replicas,
there is no way to figure out a total ordering for the actions.

It's also important to note that even when you set W=N
for a write, it's possible that 1 write could succeed
and 2 could fail. The succeeding write is _not_ "rolled back"
when this happens. The user will see an error message
that the write didn't succeed on all replicas.

I'm not convinced that a CAS operation is inevitably subject to data races.
There are proven techniques for avoiding races at the cost of latency,
which is acceptable in certain situations.
Correct, but as far as I know, there is no way to build a CAS system
on top of the primitives provided by the Riak public API. You need
a point of serialization amongst all of the replicas (for a particular key),
which Riak does not provide, for availability reasons.


I will take a look at Zab, thanks for the reference!
Zab and Paxos are going to be your best references.
It's also worth noting that if you don't need high availability,
there are other ways of gaining durability that will give
you strong consistency and the ability to do CAS operations.

Re: Possibility of a CAS API Martin Bruse 2/24/12 11:43 PM

Take a look at https://github.com/ProjectDaisy/blodsband#L480 where I have implemented a CAS-like method on top of Riak. It works by (slightly simplified):

0) Reading current value
  1) Checking current value
  2) Writing with a "request id", W = all and readbody = true if the current value is what we expected (vclock is right, or value didn't exist)
  3) OR overwriting siblings with the one with highest "request id" or "winner" if multiple siblings
  4) OR overwriting value with "winner" if it was the one we wrote and has no siblings

and looping around that until we have an entry marked "winner".

I have tested it extensively on single node systems, and it seems to work. What I would like is someone reviewing the algorithm and possibly trying it on their multi node system :)

On Feb 25, 2012 4:21 AM, "Armon Dadgar" <armon....@gmail.com> wrote:
>
> It sounds like the "If-Unmodified-Since" and "If-None-Match" flags could do what I
> need, but the docs specify "it is possible for the condition to evaluate to true for
> multiple requests if the requests occur at the same time."
>
> From my understanding, the KV vnode's process their requests in a serial fashion.
> I'm not sure I fully understand how It could be that the request evaluates to true
> for multiple requests, if the PUTs are handled serially.
>
> If it is a matter of the vnodes being interleaved, would it be solvable by
> setting w = r = n?
>
> I'm not convinced that a CAS operation is inevitably subject to data races.
> There are proven techniques for avoiding races at the cost of latency,
> which is acceptable in certain situations.
>
> I will take a look at Zab, thanks for the reference!
>

Re: Possibility of a CAS API Martin Bruse 2/24/12 11:46 PM
Here is the working link: https://github.com/ProjectDaisy/blodsband/blob/master/lib/blodsband/riak/bucket.rb#L484

The universe will always punish you if you send long emails via smartphone :/
Re: Possibility of a CAS API Armon Dadgar 2/25/12 5:18 PM
On Feb 24, 2012, at 7:46 PM, Reid Draper wrote:
> The problem is that amongst all replicas for a particular key,
> operations are not serializable. Put another way, if there
> are concurrent writes to three replicas,
> there is no way to figure out a total ordering for the actions.
>
> It's also important to note that even when you set W=N
> for a write, it's possible that 1 write could succeed
> and 2 could fail. The succeeding write is _not_ "rolled back"
> when this happens. The user will see an error message
> that the write didn't succeed on all replicas.

Got it. I was unsure how "If-Unmodified-Since" and "If-None-Match"
were implemented, but that makes sense given Riak's architecture.

>>
>> I'm not convinced that a CAS operation is inevitably subject to data races.
>> There are proven techniques for avoiding races at the cost of latency,
>> which is acceptable in certain situations.
> Correct, but as far as I know, there is no way to build a CAS system
> on top of the primitives provided by the Riak public API. You need
> a point of serialization amongst all of the replicas (for a particular key),
> which Riak does not provide, for availability reasons.

I think a point of serialization is not needed here. It should be sufficient
to add a new internal API for nodes to handle a two-phase commit, and
then the transaction coordinator (running on whatever node you make
the request to) can contain the logic to carry out the transaction.

I totally understand that under the current architecture and exposed
API's of Riak KV it is not possible.


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Possibility of a CAS API Reid Draper 2/25/12 5:24 PM

On Feb 25, 2012, at 8:18 PM, Armon Dadgar wrote:

> On Feb 24, 2012, at 7:46 PM, Reid Draper wrote:
>> The problem is that amongst all replicas for a particular key,
>> operations are not serializable. Put another way, if there
>> are concurrent writes to three replicas,
>> there is no way to figure out a total ordering for the actions.
>>
>> It's also important to note that even when you set W=N
>> for a write, it's possible that 1 write could succeed
>> and 2 could fail. The succeeding write is _not_ "rolled back"
>> when this happens. The user will see an error message
>> that the write didn't succeed on all replicas.
>
> Got it. I was unsure how "If-Unmodified-Since" and "If-None-Match"
> were implemented, but that makes sense given Riak's architecture.
>
>>>
>>> I'm not convinced that a CAS operation is inevitably subject to data races.
>>> There are proven techniques for avoiding races at the cost of latency,
>>> which is acceptable in certain situations.
>> Correct, but as far as I know, there is no way to build a CAS system
>> on top of the primitives provided by the Riak public API. You need
>> a point of serialization amongst all of the replicas (for a particular key),
>> which Riak does not provide, for availability reasons.
>
> I think a point of serialization is not needed here. It should be sufficient
> to add a new internal API for nodes to handle a two-phase commit, and
> then the transaction coordinator (running on whatever node you make
> the request to) can contain the logic to carry out the transaction.
If I understand you correctly, the "transaction coordinator" would
be a point of serialization.

>
> I totally understand that under the current architecture and exposed
> API's of Riak KV it is not possible.


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Possibility of a CAS API Armon Dadgar 2/25/12 5:29 PM
On Feb 25, 2012, at 5:24 PM, Reid Draper wrote:
>>>>
>>>> I'm not convinced that a CAS operation is inevitably subject to data races.
>>>> There are proven techniques for avoiding races at the cost of latency,
>>>> which is acceptable in certain situations.
>>> Correct, but as far as I know, there is no way to build a CAS system
>>> on top of the primitives provided by the Riak public API. You need
>>> a point of serialization amongst all of the replicas (for a particular key),
>>> which Riak does not provide, for availability reasons.
>>
>> I think a point of serialization is not needed here. It should be sufficient
>> to add a new internal API for nodes to handle a two-phase commit, and
>> then the transaction coordinator (running on whatever node you make
>> the request to) can contain the logic to carry out the transaction.
> If I understand you correctly, the "transaction coordinator" would
> be a point of serialization.

Not a global point of serialization, other transactions would be unaffected.
It is my understanding that there is a new transaction coordinator for
each client request. So it is already always "serial" in a sense.


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Possibility of a CAS API Reid Draper 2/25/12 5:38 PM

On Feb 25, 2012, at 8:29 PM, Armon Dadgar wrote:

> On Feb 25, 2012, at 5:24 PM, Reid Draper wrote:
>>>>>
>>>>> I'm not convinced that a CAS operation is inevitably subject to data races.
>>>>> There are proven techniques for avoiding races at the cost of latency,
>>>>> which is acceptable in certain situations.
>>>> Correct, but as far as I know, there is no way to build a CAS system
>>>> on top of the primitives provided by the Riak public API. You need
>>>> a point of serialization amongst all of the replicas (for a particular key),
>>>> which Riak does not provide, for availability reasons.
>>>
>>> I think a point of serialization is not needed here. It should be sufficient
>>> to add a new internal API for nodes to handle a two-phase commit, and
>>> then the transaction coordinator (running on whatever node you make
>>> the request to) can contain the logic to carry out the transaction.
>> If I understand you correctly, the "transaction coordinator" would
>> be a point of serialization.
>
> Not a global point of serialization, other transactions would be unaffected.
> It is my understanding that there is a new transaction coordinator for
> each client request. So it is already always "serial" in a sense.

What I'm suggesting you need is a point of serialization for
at least a quorum of the replicas for a particular key. It doesn't
have to be global across the key-space.

It might be possible to do CAS with replicas without a synchronization
point, but I imagine such a discovery would be publication-worthy.


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Possibility of a CAS API Armon Dadgar 2/25/12 5:46 PM
On Feb 25, 2012, at 5:38 PM, Reid Draper wrote:

> On Feb 25, 2012, at 8:29 PM, Armon Dadgar wrote:
>
>> On Feb 25, 2012, at 5:24 PM, Reid Draper wrote:
>>>>>>
>>>>>> I'm not convinced that a CAS operation is inevitably subject to data races.
>>>>>> There are proven techniques for avoiding races at the cost of latency,
>>>>>> which is acceptable in certain situations.
>>>>> Correct, but as far as I know, there is no way to build a CAS system
>>>>> on top of the primitives provided by the Riak public API. You need
>>>>> a point of serialization amongst all of the replicas (for a particular key),
>>>>> which Riak does not provide, for availability reasons.
>>>>
>>>> I think a point of serialization is not needed here. It should be sufficient
>>>> to add a new internal API for nodes to handle a two-phase commit, and
>>>> then the transaction coordinator (running on whatever node you make
>>>> the request to) can contain the logic to carry out the transaction.
>>> If I understand you correctly, the "transaction coordinator" would
>>> be a point of serialization.
>>
>> Not a global point of serialization, other transactions would be unaffected.
>> It is my understanding that there is a new transaction coordinator for
>> each client request. So it is already always "serial" in a sense.
>
> What I'm suggesting you need is a point of serialization for
> at least a quorum of the replicas for a particular key. It doesn't
> have to be global across the key-space.

Yes, that sounds correct. I think we are saying the same thing.
It certainly wouldn't be as performant as doing a non-CAS operation,
but for situations where you are willing to make that trade off, it would be nice to have support.

> It might be possible to do CAS with replicas without a synchronization
> point, but I imagine such a discovery would be publication-worthy.

Haha, indeed that would be.


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Possibility of a CAS API Reid Draper 2/25/12 5:53 PM

Agreed. We've talked about support for strong consistency, and I personally
think CAS would be a natural thing to do afterward.


>
>> It might be possible to do CAS with replicas without a synchronization
>> point, but I imagine such a discovery would be publication-worthy.
>
> Haha, indeed that would be.
>
>
>


_______________________________________________
riak-users mailing list
riak-...@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com