A vector clock is a data structure used for determining the partial ordering of events in a distributed system and for detecting causality violations. As with Lamport timestamps, inter-process messages carry the state of the sending process's logical clock. A vector clock for a system of N processes is an array/vector of N logical clocks, one clock per process; each process keeps a local copy of the global clock array, holding the largest values it has seen so far.
Vector clocks extend scalar (Lamport) time to give a causally consistent view of the distributed system: they can detect whether one event causally influenced another, capturing all the causal relationships. The algorithm labels every process with a vector (a list of integers), one integer for the local clock of each process in the system, so for N processes each label is a vector/array of size N.
The above example depicts the vector clock mechanism: the vectors are updated after each internal event is executed, and the arrows indicate how the vector values are passed between the processes (P1, P2, P3).
To sum up, vector clock algorithms are used in distributed systems to provide a causally consistent ordering of events, but at a cost: the entire vector is sent to each process with every message in order to keep the vector clocks in sync.
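The update rules described above can be sketched in a few lines of Python. This is a minimal model, not a production implementation: the function names and the convention of incrementing the receiver's own slot on delivery are illustrative assumptions.

```python
# Minimal vector clock sketch (assumed model):
# - each of N processes keeps a vector of N counters
# - a local event increments the process's own slot
# - a message carries the sender's vector; the receiver takes the
#   element-wise max with its own vector, then increments its own slot

def local_event(clock, i):
    """Process i performs an internal event."""
    clock = clock.copy()
    clock[i] += 1
    return clock

def receive(clock, i, msg_clock):
    """Process i receives a message stamped with msg_clock."""
    merged = [max(a, b) for a, b in zip(clock, msg_clock)]
    merged[i] += 1
    return merged

# P1 and P2 in a 3-process system (indices 0..2):
p1 = local_event([0, 0, 0], 0)   # P1 after a local event: [1, 0, 0]
p2 = receive([0, 0, 0], 1, p1)   # P2 after receiving P1's message: [1, 1, 0]
```

Note that every message must carry the full N-entry vector, which is exactly the synchronization cost mentioned above.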
You might want to use a version vector for a leaderless distributed store. You could use vector clocks for the same job, although they are a worse fit; the article also suggests using them for consistent snapshots, for implementing causal ordering in general distributed systems, and so on.
Same question here, and it's still not absolutely clear to me, but what I've found is that version vectors are better suited to determining the causality of events within a specific network of replicated nodes in a distributed system, where the only thing you are interested in is what happened first and what happened after.
In that sense, using unbounded integers for version vectors is overly complicated, because all we want is to determine which node, A or B, is more up to date. Take a situation where initially A[2,2] and B[2,2] (therefore in sync).
From the version vector perspective, A[3,2] > B[2,2] means the same as A[10,2] > B[2,2]. That would explain why we can use a fixed set of values for version vectors and the only important operation is just sync versions.
From the vector clock perspective, there is a difference between A[10,2] and A[3,2]: it means that 7 more events happened in the meantime. That would explain why we need to keep track of all the events, and why there are send and receive operations to keep all the vector clocks in the network in sync.
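The point about the two perspectives can be made concrete with a small dominance check in Python (a sketch; the helper name is illustrative):

```python
def dominates(a, b):
    """True if vector a causally dominates b: a >= b everywhere and a != b."""
    return all(x >= y for x, y in zip(a, b)) and a != b

# From the version-vector perspective both comparisons give the same answer:
print(dominates([3, 2], [2, 2]))    # A[3,2]  > B[2,2]
print(dominates([10, 2], [2, 2]))   # A[10,2] > B[2,2] -- same conclusion;
# the extra +7 events change nothing about the causal order between A and B
```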
In this case there is only one DB instance, to keep the illustration simple, but you can think of it as several DB instances with replication delay between them. When you read the data, you get the version vector back, and you resolve conflicts between writes that went to different replicas during the replication lag by comparing the version vectors on each.
To make it clearer, imagine a case where, for the same object (or key, if it is a key-value store), two replicas now hold the so-called version vectors [1,0] and [0,1] respectively. Now how should they synchronize?
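Comparing the two vectors shows why this case is special: neither dominates the other. A small classification helper sketches the idea (names are illustrative, not any store's real API):

```python
def compare(a, b):
    """Classify two version vectors: 'before', 'after', 'equal', 'concurrent'."""
    a_le_b = all(x <= y for x, y in zip(a, b))
    b_le_a = all(y <= x for x, y in zip(a, b))
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b; b's version wins
    if b_le_a:
        return "after"       # b happened before a; a's version wins
    return "concurrent"      # neither dominates: a genuine conflict

# [1,0] vs [0,1]: the writes are concurrent, so the store must keep both
# versions as siblings or apply some conflict-resolution policy.
print(compare([1, 0], [0, 1]))   # "concurrent"
```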
Version vectors order VERSIONS of replicas in a distributed data store, which is a specific kind of distributed system. Note that not all events change the version of a replica: only write requests/events do. So instead of concerning ourselves with events, it is conceptually better to focus only on the versions of the replicas, which are the result of those events.
Suppose the writes to a single key can come to any node. Your client writes "1" as the value for some key, then decides to write "2". The first write goes to node#1, which issues a replication request to node#2. However, your request to store "2" reaches node#2 (we can store on any node, remember) before that replication request. Node#2 stores "2", issues a replication request with "2" to node#1, then receives the replication request with "1" from node#1 and changes its "2" to "1", while node#1 changes its "1" to "2". Now you have inconsistent data across the storage nodes.

Also, if node#1 dies, all you have is node#2 holding "1", while you remember very well that you sent "2" after "1" and the storage system confirmed that it was saved. Actually, many things might go "wrong", depending on what you expect from your storage system (read your writes? monotonic reads? etc.), so you need a way to find out what the true, good, actual value for the key is, or even to prevent the system from "corrupting" data in this way. For that, the storage system needs to know what happened before what, either between its own nodes, or possibly also taking your clients' view of the order of events into consideration. Vector clocks and version vectors are some of the techniques used in practice to achieve that, or to determine that two events happened concurrently, in which case you need some other way to decide between their results.
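The scenario above can be sketched with each stored value tagged by a version vector (slot 0 for node#1, slot 1 for node#2). This is a hypothetical illustration, not a real storage API; the point is that when the replication messages cross, neither vector dominates, so the nodes can flag a conflict instead of silently overwriting each other:

```python
def concurrent(a, b):
    """True if neither version vector dominates the other."""
    return (not all(x <= y for x, y in zip(a, b)) and
            not all(y <= x for x, y in zip(a, b)))

# Client writes "1" via node#1, then "2" via node#2, before replication runs.
# Each node increments its own slot when it accepts a write:
node1 = {"value": "1", "vv": [1, 0]}
node2 = {"value": "2", "vv": [0, 1]}

# When the replication requests finally arrive, each node compares vectors:
print(concurrent(node1["vv"], node2["vv"]))   # True: a genuine conflict,
# so keep both versions as siblings rather than blindly taking the last write
```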
Why not always choose a single leader (and hence a consensus protocol) over a leaderless scheme (and hence an ordering mechanism like version vectors)? Negotiating leadership takes time (think up to seconds, or tens of seconds), during which your system is unavailable or only partially available in some special mode. Leaderless designs can also perform better under other conditions (e.g. if the leader becomes slow due to software or network problems, other nodes in a leaderless system can take over its duties). Consensus becomes harder as the number of participants increases, so leaderless systems can potentially scale better.
No. Both can be used for solving the same tasks (e.g. distributed storage). They can also be combined: for example, Paxos for cluster membership, with that knowledge then used to determine which nodes form a quorum in an eventually consistent (via version vectors) system.
Both of these mechanisms are practical ways of capturing causality in distributed systems. Causality (in distributed systems) is an abstract concept that was formalized in 1978 by Leslie Lamport in one of the most cited articles in computer science. In 1983, version vectors were developed to track the causal relations among data items that can be concurrently updated. Some years later, around 1988, vector clocks were developed to track the causality between events in a distributed computation. In both cases a vector of integers, one per source of concurrency, is used. There is, however, a fundamental difference.
Let's consider a simple example that shows vectors of integers are over-expressive. We have two recently synchronized replicas A and B with identical state and vectors A[2,3] and B[2,3]. Now replica A suffers an update, and its new vector is A[3,3]. We see that A is more updated than B, since [3,3] > [2,3] (each integer on the left is greater than or equal to its counterpart in the same position). Now replica A suffers 7 more updates: A[10,3]. We still have [10,3] > [2,3], and this increase in the integer conveys no extra information for the task of tracking the causal order between the two replicas. The number of individual updates is typically not important, especially in systems where a single update can change either a lot or a little. What is important is how the updates change the causal-order relation among the replicas.
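A sync between replicas can be sketched as an element-wise max of their vectors (again an illustrative helper, not the author's code). Whether A sat at [3,3] or [10,3], the outcome of syncing with B[2,3] is the same in causal terms: A dominated B, so B simply adopts A's vector, and the magnitude of the counters added nothing to that conclusion.

```python
def sync(a, b):
    """Synchronize two replica vectors: the merged version is the element-wise max."""
    return [max(x, y) for x, y in zip(a, b)]

print(sync([3, 3], [2, 3]))    # [3, 3]  -- B adopts A's vector
print(sync([10, 3], [2, 3]))   # [10, 3] -- same causal outcome, bigger numbers
```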
I had always thought that vector clocks and version vectors were the same thing before reading this article. It is the Charron-Bost theorem that reminded me they are actually different: otherwise your dotted version vectors could not be right. I am wondering whether there is some theorem on version vectors analogous to the Charron-Bost theorem on vector clocks.
Is my understanding right: for a distributed storage system, what the dotted version vector tracks is the causality among *replica states* rather than among the *update events* that modify those states. In a vector clock with one entry per client, the causal history is a set of *events*. In a dotted version vector, however, the causal history is a set of *replica states*, from which we cannot recover the original *update events* that brought the initial state to the current one.
Both logical clocks order events in a way that is consistent with causality, because every causal dependency results in an increased timestamp. For both clocks, you can assert that if A "happens before" B, then Clock(A) < Clock(B).
Vector clocks take this a step further by allowing one to compare any two events and check to see if they are causally dependent or concurrent. Put another way, not only can you show that if A "happens before" B, then VectorClock(A) < VectorClock(B), you can also assert the converse: if VectorClock(A) < VectorClock(B) then A "happened before" B.
Lamport timestamps employ a single per-node counter to provide a causal ordering of events, plus a unique node ID to break ties and provide a total ordering, so the overhead of a Lamport timestamp does not vary with the number of nodes. Vector clocks employ per-node counters for every node, so the size of a vector clock is proportional to the number of nodes.
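The constant-size Lamport scheme can be sketched as follows (an assumed minimal model; the function names are illustrative). Each node keeps one counter, and a stamp is a `(counter, node_id)` pair, where the node ID is used only to break ties:

```python
# Lamport timestamp rules:
# - increment the local counter before stamping a send
# - on receive, take max(local, message counter) and increment

def lamport_send(counter):
    return counter + 1

def lamport_receive(counter, msg_counter):
    return max(counter, msg_counter) + 1

def before(stamp_a, stamp_b):
    """Total order on (counter, node_id) stamps; node_id breaks counter ties."""
    return stamp_a < stamp_b   # Python tuple comparison does both steps

a = lamport_send(0)                  # node 1 sends: counter becomes 1
b = lamport_receive(0, a)            # node 2 receives: max(0, 1) + 1 = 2
print(before((a, 1), (b, 2)))        # True: the send is ordered before the receive
print(before((2, 1), (2, 2)))        # True: equal counters, node ID decides
```

Each stamp is two integers regardless of cluster size, whereas a vector clock needs one integer per node.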