Assume that you have 4 replicas, all at version number 99.
Replica:  r1   r2   r3   r4
Version:  99   99   99   99
Value:    foo  foo  foo  foo
Now, you try to delete the item, but miss one replica.
Replica:  r1   r2   r3   r4
Version:  99   -    -    -
Value:    foo  -    -    -
Now, you try to create the item again with version number 1 and a new value,
but again you miss the first replica.
Replica:  r1   r2   r3   r4
Version:  99   1    1    1
Value:    foo  bar  bar  bar
The next read can again return the item with version number 99.
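To make the effect concrete, here is a small toy model in Python. It is not
Scalaris code: the dictionary layout and the quorum_read() helper are
assumptions made up for this illustration. It enumerates every majority of the
4 replicas and shows which value a read through that majority would return.

```python
from itertools import combinations

# State after the failed delete and the re-create from the example above.
replicas = {
    "r1": (99, "foo"),  # missed by both the delete and the re-create
    "r2": (1, "bar"),
    "r3": (1, "bar"),
    "r4": (1, "bar"),
}

def quorum_read(responders):
    """Return the (version, value) pair with the highest version among responders."""
    return max(replicas[r] for r in responders)

majority = len(replicas) // 2 + 1  # 3 of 4

# One entry per possible majority: which (version, value) the read returns.
results = {q: quorum_read(q) for q in combinations(sorted(replicas), majority)}

for q, (version, value) in results.items():
    print(q, "->", value, "at version", version)
```

In this model only the majority without r1 returns bar. Every majority that
contains r1 returns foo, because version 99 beats version 1.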
We could internally keep the item with version number 100 and mark it as
deleted, but it would still consume storage, which defeats the purpose of
deleting the item.
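A rough Python sketch of that alternative follows. It is a toy model, not what
Scalaris does: the TOMBSTONE marker and the write() and all_majority_reads()
helpers are hypothetical names introduced here. The delete is stored as
version 100, the re-create as version 101, and both miss r1 as before.

```python
from itertools import combinations

TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"

def write(replicas, reached, version, value):
    """Store (version, value) on the replicas that the operation reached."""
    for r in reached:
        replicas[r] = (version, value)

def all_majority_reads(replicas):
    """Collect the result of a highest-version read for every possible majority."""
    majority = len(replicas) // 2 + 1
    outcomes = set()
    for quorum in combinations(sorted(replicas), majority):
        outcomes.add(max((replicas[r] for r in quorum), key=lambda e: e[0]))
    return outcomes

replicas = {r: (99, "foo") for r in ("r1", "r2", "r3", "r4")}

# Delete as a versioned tombstone; r1 is missed.
write(replicas, ["r2", "r3", "r4"], 100, TOMBSTONE)
after_delete = all_majority_reads(replicas)

# Re-create with the next version; r1 is missed again.
write(replicas, ["r2", "r3", "r4"], 101, "bar")
after_recreate = all_majority_reads(replicas)

# The price: every replica still holds an entry after the delete.
entries_kept = len(replicas)
```

In this model every majority sees the tombstone after the delete and bar after
the re-create, so the stale foo on r1 never wins. The cost is the one named
above: the deleted key still occupies an entry on every replica.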
> To use Scalaris for any practical purposes, key deletion is required
> ( as well as ability to create multiple indexes)
I agree with you on key deletion. But I would rather leave this feature out
than provide an implementation that behaves randomly, also known as buggy.
Regarding multiple indexes: it is not possible to build multiple indexes in
Scalaris that allow join-like operations. However, you can denormalize your
database schema and store several copies of the same item, each under a
different key. That way you can simulate multiple indexes.
Thorsten
I added a delete function on the Erlang side and Nico added it to the Java
API. You can now delete items, but on rare occasions you will see the side
effects described in this thread (the javadoc contains a big warning).
On the API:
The Java function expects a key as its first parameter and returns an integer.
The integer indicates how many replicas were successfully deleted. In the best
case, this value is 4. If you see any other value, you can call
getLastDeleteResult() to get more information.
The DeleteResult object will tell you how many replicas were deleted (ok), how
many replicas weren't deleted because locks were set on the key (locks_set),
and how many replicas weren't found (undef).
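As an illustration of how the three counters could be interpreted, here is a
small Python stand-in. It is not the Java API: only the counter names ok,
locks_set and undef come from the description above, while the dataclass, the
summarize() helper and the "unaccounted for" bucket are assumptions made for
this sketch.

```python
from dataclasses import dataclass

@dataclass
class DeleteResult:
    """Python stand-in for the counters described above (not the Java class)."""
    ok: int         # replicas that were deleted
    locks_set: int  # replicas not deleted because locks were set on the key
    undef: int      # replicas that were not found

def summarize(result, replication_factor=4):
    """Turn the counters into a short human-readable summary."""
    if result.ok == replication_factor:
        return "deleted on all replicas"
    parts = []
    if result.locks_set:
        parts.append(f"{result.locks_set} locked")
    if result.undef:
        parts.append(f"{result.undef} not found")
    # Assumption: replicas missing from all three counters did not answer.
    unaccounted = replication_factor - result.ok - result.locks_set - result.undef
    if unaccounted > 0:
        parts.append(f"{unaccounted} unaccounted for")
    return (f"deleted on {result.ok} of {replication_factor} replicas ("
            + ", ".join(parts) + ")")

print(summarize(DeleteResult(ok=4, locks_set=0, undef=0)))
print(summarize(DeleteResult(ok=3, locks_set=1, undef=0)))
```

Any summary other than "deleted on all replicas" means that the side effects
described in this thread are possible for that key.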
Thorsten
Yes.
> If you do set() the same key after a delete(), then you can read() the
> old value from before the delete() due to the versioning as explained.
Better to say '... then you could read() the old value ...'. The outcome of
the read depends on whether the undeleted replica is among the fastest
responding replicas that form the quorum for this read(). You may read the
newly set value, or the old one with its high version number. Nobody knows.
> The piece that has me a bit confused is how you can read the old value
> from a single replica when there is no quorum agreeing to the new
> value. Will the value from that single node be copied to the other
> nodes in the replica, thus forming a quorum? And if that is true, then
> wouldn't it be possible to read() the key from the DHT after a delete
> () without having re-set() the key?
For a quorum read, valid answers are requested from a majority of the
replicas. The value with the highest version number is taken, even if it
occurs in only one of the answers. The reason is the following worst-case
scenario:
(1) A minority of the replicas is temporarily unavailable.
(2) A write() is done with a minimal quorum.
(3) The minority from (1) becomes available again (with an older version number).
(4) All but one node from (2) become temporarily unavailable.
(5) We can still read() the latest value from a majority, because one node of
(2) is still alive and provides the highest version number and thereby the
latest value written.
At all times, we allow an arbitrary minority of replicas to be unavailable.
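The five steps can be replayed in a toy Python model. Everything here is an
assumption made for illustration: the store layout, the write() and read()
helpers, and in particular the choice of 5 replicas, which keeps a majority
available throughout steps (1) to (4). With the 4 replicas used elsewhere in
this thread a majority is 3, so only one replica could be down at a time.

```python
N = 5
MAJORITY = N // 2 + 1  # 3 of 5

# Every replica starts with the same old value at version 1.
store = {f"r{i}": (1, "old") for i in range(1, N + 1)}
up = set(store)  # replicas that are currently reachable

def write(version, value):
    """Write to a minimal quorum of the reachable replicas."""
    targets = sorted(up)[:MAJORITY]
    if len(targets) < MAJORITY:
        raise RuntimeError("no majority available")
    for r in targets:
        store[r] = (version, value)
    return targets

def read():
    """Ask all reachable replicas and take the highest version."""
    responders = sorted(up)
    if len(responders) < MAJORITY:
        raise RuntimeError("no majority available")
    return max(store[r] for r in responders)

up -= {"r4", "r5"}          # (1) a minority is unavailable
written = write(2, "new")   # (2) write with a minimal quorum: r1, r2, r3
up |= {"r4", "r5"}          # (3) the minority returns with the old version
up -= {"r1", "r2"}          # (4) all but one node of the write quorum fail
result = read()             # (5) r3 alone supplies the highest version

print(written, result)
```

In step (5) the responders are r3, r4 and r5. Two of the three answers carry
the old version, yet the single answer from r3 decides the read.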
Florian