how to delete key/values?

Roman.Shestakov

unread,

Feb 16, 2009, 4:41:30 PM2/16/09

to scalaris

hello,

I know it was briefly discussed before but I don't think that the
definite answer was given on the question how to delete values by
key.

Is there any way of doing it in scalaris now and if not, are there any
plans to implement this feature?

To use Scalaris for any practical purposes, key deletion is required
( as well as ability to create multiple indexes)

Regards, Roman

Thorsten Schuett

unread,

Feb 17, 2009, 10:07:56 AM2/17/09

to scal...@googlegroups.com

On Monday 16 February 2009, Roman.Shestakov wrote:
> hello,
>
> I know it was briefly discussed before but I don't think that the
> definite answer was given on the question how to delete values by
> key.
>
> Is there any way of doing it in scalaris now and if not, are there any
> plans to implement this feature?

- Right now, there is no function for deleting a key.
- I remember that we had internal discussions on this topic. IIRC, one problem
with deleting an item is, that under rare circumstances this item could
reappear. Another issue is, when the delete misses one replica.

Assume, that you have 4 replicas, all at version number 99.

Replica: r1 r2 r3 r4
Version: 99 99 99 99
Value: foo foo foo foo

Now, you try to delete the item, but miss one replica.

Replica: r1 r2 r3 r4
Version: 99 - - -
Value: foo - - -

Now, you try to create the item again with version number 1 and a new value,
but again you miss the first replica.

Replica: r1 r2 r3 r4
Version: 99 1 1 1
Value: foo bar bar bar

The next read can again return the item with version number 99.

We could internally keep the item with version number 100 and mark it as
deleted, but it will still consume storage. Which defeats the purpose of
deleting the item.

> To use Scalaris for any practical purposes, key deletion is required
> ( as well as ability to create multiple indexes)

I agree with you on key deletion. But I rather leave this feature out than
provide an implementation which behaves randomly, also known as buggy.

Regarding the multiple indexes, it is not possible to build multiple indexes
in Scalaris which allow join-like operations. However, you can denormalize
your database-scheme and store several copies of the same item. Then you can
simulate multiple indexes.

Thorsten

Roman.Shestakov

unread,

Feb 18, 2009, 3:37:34 PM2/18/09

to scalaris

Thanks for answering Thorsten,

I can say from first hand experience that ability to delete values is
indispensable , without it I wouldn't even start considering of using
scalaris in practice.
( I have an experience working with a certain propitiatory key/value
distributed system which implements "optimistic locking" of replicas
and handles millions of critical financial transactions per day). I
know how incredibly powerful distributed key/value database is (subj.
to the proper implementation :-) . Key deletion is critical.

for your example:
1. I can only imagine that this scenario would be possible in some
setup when you have geographically distributed nodes with some
replication delays because of WAN. Why else the deletion of values in
close connected environment wouldn't work?

2. In case when you have diff. versions of the same key in diff.
replicas - you have a conflict which has to be resolves manually,
there is no other way around it, and also there is nothing wrong with
it. I would choose manual conflict resolution against not having the
ability to delete keys, 11 times out of 10, for me it is not a
brainer...

3. so, you have a version conflict - just create an object with ref.
to conflicted objects and send a notification to the db administrator.
He will have to choose which object to keep.

4. as I said - this is a common problem / practical solution with key/
value distributed replicas and there is nothing wrong with manual
conflict resolution.

Scalaris is a beautiful idea - me want it bad :-), pls, pls, pls
consider implementing key deletion

I could probably live without multiple indexes but I'd cry inside...
again, from a practical experience - you can't do SQL type selects
against the db, so you have to have an ability to quickly select a
group of objects by some custom index and then do some processing on
the client side. Object duplication doesn't seem like elegant solution
for this problem.

Regards, Roman

Thorsten Schuett

unread,

Feb 25, 2009, 9:29:33 AM2/25/09

to scal...@googlegroups.com

Hi all,

I added a delete function on the Erlang side and Nico added it to the Java-
API. You can now delete items, but on rare occasions you will see the side
effects described in this thread (the javadoc contains a big warning).

On the API:
the java function expects as the first parameter a key and returns an integer.
The integer indicates how many replicas were successfully deleted. In the best
case, this value is 4. If you see any other value, you can call
getLastDeleteResult() to get more information.

The DeleteResult object will tell you how many replicas were deleted (ok), how
many replicas weren't deleted because locks were set on the key (locks_set),
and how many replicas weren't found (undef).

Thorsten

Roman.Shestakov

unread,

Feb 26, 2009, 3:59:58 PM2/26/09

to scalaris

wow, thanks guys for super fast response.

I'll definitely test the new feature ASAP

Regards, Roman

pietro

unread,

Mar 19, 2009, 5:04:20 PM3/19/09

to scalaris

I'm not sure I'm understanding the delete operation correctly. Maybe
I'm just applying the quorum replication algorithm where I shouldn't.

If I have it right, in the case where one replica maintains the value
for the key after the delete, there is no way to then read that value,
as the quorum algorithm will not receive the old value from n/2+1
nodes. So in the case where delete() is called *without being followed
by a set() for the same key* it is guaranteed not to return a value.

If you do set() the same key after a delete(), then you can read() the
old value from before the delete() due to the versioning as explained.
The piece that has me a bit confused is how you can read the old value
from a single replica when there is no quorum agreeing to the new
value. Will the value from that single node be copied to the other
nodes in the replica, thus forming a quorum? And if that is true, then
wouldn't it be possible to read() the key from the DHT after a delete
() without having re-set() the key?

Again, thanks for your time and your responses!

--Peter

Florian Schintke

unread,

Mar 20, 2009, 4:25:16 AM3/20/09

to scal...@googlegroups.com

[pietro]

>
>
> I'm not sure I'm understanding the delete operation correctly. Maybe
> I'm just applying the quorum replication algorithm where I shouldn't.
>
> If I have it right, in the case where one replica maintains the value
> for the key after the delete, there is no way to then read that value,
> as the quorum algorithm will not receive the old value from n/2+1
> nodes. So in the case where delete() is called *without being followed
> by a set() for the same key* it is guaranteed not to return a value.

Yes.

> If you do set() the same key after a delete(), then you can read() the
> old value from before the delete() due to the versioning as explained.

say better '... then you could read() the old value...' The outcome of the
read depends on whether the undeleted replica is in the quorum set of the
fastest responding (for this read()) replicas or not. You may read the newly
set value, or the old one, with its high version number. Nobody knows.

> The piece that has me a bit confused is how you can read the old value
> from a single replica when there is no quorum agreeing to the new
> value. Will the value from that single node be copied to the other
> nodes in the replica, thus forming a quorum? And if that is true, then
> wouldn't it be possible to read() the key from the DHT after a delete
> () without having re-set() the key?

For a quorum read, valid answers are requested from a majority of the
replicas. The value with the highest version number is taken. It may only
occur once. The reason is the following worst case scenario:

(1) A minority of the replicas is temporarily not available.
(2) A write() is done with a minimal quorum.
(3) The minority of (1) becomes available again (and has an older version number)
(4) All but one node from (2) becomes temporarily not available
(5) We can read() the latest value from a majority, because one node of (2) is
still alive and provides the highest version number and thereby the latest
value written.

At all times, we allow an arbitrary minority of replicas to be not available.

Florian

Peter Molettiere

unread,

Mar 24, 2009, 5:26:02 PM3/24/09

to scal...@googlegroups.com

On Fri, Mar 20, 2009 at 1:25 AM, Florian Schintke <schi...@gmail.com> wrote:

For a quorum read, valid answers are requested from a majority of the
replicas. The value with the highest version number is taken.

Thanks very much for the response. This quote explains the issue. I had misunderstood the quorum algorithm. I thought it required agreement from a quorum, not simply a response from a quorum, from which the highest version would be chosen. If all nodes agree on the version, is the selection of the value randomly chosen from the quorum?

This makes me wonder, though, how consistency is achieved. Consider this scenario:

Three nodes, x, y, and z.

We write A to nodes x and y, creating version 1 on each node, with value A: x.1.A, y.1.A.

We write B to nodes y and z, creating versions y.2.B and z.1.B.

We write C to nodes x and z, creating versions x.2.C and z.2.C.

At this point, the nodes have the following versions and values: x.2.C, y.2.B, z.2.C.

If we receive a quorum which includes node y, then it seems possible to read value B instead of C.

I've been reading transstore/transaction.erl to figure out how the quorum reads are happening, but haven't seen any code which compares values read from different nodes yet. I haven't seen where the node from the quorum is selected, either, but I'm new to erlang, so I'm sure I'm just missing it.

Again, thanks very much for your time and effort.

--Peter

Florian Schintke

unread,

Mar 25, 2009, 4:10:40 AM3/25/09

to scal...@googlegroups.com

> Thanks very much for the response. This quote explains the issue. I had
> misunderstood the quorum algorithm. I thought it required agreement from a
> quorum, not simply a response from a quorum, from which the highest version
> would be chosen. If all nodes agree on the version, is the selection of the
> value randomly chosen from the quorum?

Values with the same version number are the same. As can be seen at line 143
in transaction.erl (SVN revision 229), if a value, which is not yet in the
transaction log, is written, a quorum read is done first to determine the
highest version number in the system. Then in line 155 this version number is
incremented. Thereby the consistency is achieved. So, your given scenario will
not happen.

Cheers,

Florian

>
> This makes me wonder, though, how consistency is achieved. Consider this
> scenario:
>
> Three nodes, x, y, and z.
>
> We write A to nodes x and y, creating version 1 on each node, with value A:
> x.1.A, y.1.A.
>
> We write B to nodes y and z, creating versions y.2.B and z.1.B.
>
> We write C to nodes x and z, creating versions x.2.C and z.2.C.
>
> At this point, the nodes have the following versions and values: x.2.C,
> y.2.B, z.2.C.
>
> If we receive a quorum which includes node y, then it seems possible to read
> value B instead of C.
>
> I've been reading transstore/transaction.erl to figure out how the quorum
> reads are happening, but haven't seen any code which compares values read
> from different nodes yet. I haven't seen where the node from the quorum is
> selected, either, but I'm new to erlang, so I'm sure I'm just missing it.
>
> Again, thanks very much for your time and effort.
>
> --Peter
>
> >

Florian

Reply all

Reply to author

Forward