I'm starting a new thread for this Version syntax discussion.
Seph:
You've raised some interesting problems here which are worth addressing (And I've changed the title to hopefully bump this into a separate thread).
- Does the version field in a patch specify the "operation version" or "resultant version" after the client has applied a change? These can be different - eg, with vector clocks the document version might be {a:1, b:2, c:3}. My operation version is {c:4}. Should the patch say {a:1, b:2, c:4} or just {c:4} ?
The Braid specification is fine either way. It only requires that each version uniquely references a particular state of the resource, at a unique point in time, and that when you follow versions back through their parents, you get a DAG that says which versions were known to occur before other versions:
Each version string must be unique, to differentiate any two points in time. ...<snip>... The full graph of parents forms a Directed Acyclic Graph (DAG), representing the known partial order of all versions. A version A is known to have occurred before a version B if and only if A is an ancestor of B in the partial order.
So that's all that's required for the Braid spec, and all that is needed to synchronize perfectly. But you might be asking more about the tradeoffs in a particular implementation, so let's look at those.
First, rather than calling these "Operation Version" and
"Resultant Version", I'd propose we use the conventional academic
terms for these things:
- {c:4} is a "Lamport Timestamp"
- {a:1, b:2, c:4} is a "Vector Clock"Reference: https://www.cs.rutgers.edu/~pxk/417/notes/logical-clocks.html
The typical advantage of a Vector Clock is that you can compare
any two versions' partial order just by checking each entry in the
clock. You don't need to look at the DAG! The clocks themselves
carry enough info to imply a time DAG.
This means that the information in a Vector Clock is redundant with the "Parents:" field. You could recompute the parents for every version in time if you know that every version is a vector clock.
However, the Vector Clock itself takes up space proportional to
O(num_peers). This kinda sucks, because a Lamport Timestamp only
takes up O(1) space. And you can encode the same information using
Lamport Timestamps by just including the preceding versions in the
Parents: field. Then you get the whole DAG.
So if you know for a fact that you have a small number of peers, then you might find it handy to use Vector Clocks, and have a unified representation to compare any two versions. However, if you are interested in scaling your network to a large number of editing peers, then it behooves you to use just Lamport Timestamps for each version ID, and encode the DAG directly using the Parents: field.
The Academic work discussing this tradeoff is called Hash Histories:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.9074
ByungHoon Kang, Robert Wilensky, and John Kubiatowicz. The Hash History Approach for Reconciling Mutual Inconsistency. ICDCS, pp. 670-677, IEEE Computer Society, 2003.
...because it applies not just to Lamport Timestamps, but also Hashes, and in fact any unique identifier.
This availability of improved performance (in situations of many
peers) is a big reason that the Braid spec allows changes to be
expressed as a Time DAG instead of just Vector Clocks.
If we're sending operation versions (which was my understanding from our conversation earlier on discord):
- How does a client compare {c:4} with the "up to date version" field named at the start of the subscription stream, in the spec? Surely that up to date version must say {a:1, b:2, c:4} - but in that case, how will a client know that they're up to date? (They can't directly compare {c:4} with {a:1,b:2,c:4} if those strings are supposed to be opaque)
Can you clarify what you mean by "up to date version"? Are you referring to this? https://github.com/braid-work/braid-spec/pull/79/files
Or are you referring to the information provided in the
"Parents:" field for a GET?
Are you using Lamport Timestamps in this example, or Vector Clocks?- Actually how does the client even recieve a snapshot at {a:1,b:2,c:3} ? What does the version header in the first message in the subscription stream say? I'd love to see a worked example of this!
If Lamport Timestamps, you'd do:
GET /foo
Version: "{a: 1}"
...or whatever the timestamp is for the editor who created that version.
If Vector Clock, you'd do:
GET /foo
Version: "{a:1,b:2,c:3}"
What content-ranges draft are you referring to? The range-patch draft?
- This might be resolved in the content-ranges draft, but lets say I have versions like this:
{a:0,b:0}
/ \
{a:1} {b:1}
The former. Is the braid-http spec not clear about this?I subscribe. I get a copy of the document at version {a:1,b:0}. Then the server sends me the {b:1} operation with an insert. Is the insert location in the {b:1} patch specified with respect to {a:0,b:0} (as it was when the operation was created) or with respect to {a:1,b:0} (ie, transformed to meet the client's expected world)?
They interpret them with respect to {a:0,b:0}. Its up to the client to merge that change with the {b:1} patch. It can use a CRDT or OT to do that.If its with respect to {a:0,b:0}, how does the client interpret the contents and content-range locations, since presumably they're missing an explicit copy of the {b:1} patch?
I'm not sure what you are trying to do, there. I see three ways of representing versions here:Maybe, can you fill out this example so I can understand?
HTTP/1.1 209 Subscription
cache-control: no-cache
connection: keep-alive
content-type: application/automerge
Transfer-Encoding: chunked
version: "a:0,b:0"
content-length: xxx
version: "???" <-- What value does this contain?
<<document at {a:1, b:0}>>
----(later)----
content-length: xxx
version: "b:0"
parents: "a:0", "b:0" <-- How should the client interpret this? Its never seen "a:0" or "b:0"
content-range: json=.messages[1:1] <-- What are these indexes relative to?
<<patch content>> <-- Is this content not transformed by the server? Even when doing OT?
A resource needs to be consistent in how it represents versions.
Each version is a unique string, that uniquely identifies a point
in time in the Time DAG. When you reference a version in the
Parents: field, that version needs to have been defined somewhere
in a Version: field, so you need to use the same representation of
versions throughout the resource.
--You received this message because you are subscribed to the Google Groups "Braid" group.To unsubscribe from this group and stop receiving emails from it, send an email to braid-http+...@googlegroups.com.To view this discussion on the web visit https://groups.google.com/d/msgid/braid-http/e6ba81c5-0e28-ec5a-9b3e-3b41b78868bd%40gmail.com.
I don't see merges as special like that (just follow the semantics of the underlying system to update the document and update the version). Mike doesn't see document versions as distinct from operation versions - but specialcases merges, because you have to carve that out if you do that.
- A "document version" always uniquely identifies a state that a document is in. The same document version implies identical document contents.
- There's some freedom in how an operation is applied. Eg the {z:1} operation could move document {a:0,z:0} to {a:0,z:1} or it could move {a:0,b:0,c:1000,z:0} to {a:0,b:0,c:1000,z:1}

- {c:4} is a "Lamport Timestamp"- {a:1, b:2, c:4} is a "Vector Clock"
Reference: https://www.cs.rutgers.edu/~pxk/417/notes/logical-clocks.html
The typical advantage of a Vector Clock is that you can compare any two versions' partial order just by checking each entry in the clock. You don't need to look at the DAG! The clocks themselves carry enough info to imply a time DAG.
This means that the information in a Vector Clock is redundant with the "Parents:" field. You could recompute the parents for every version in time if you know that every version is a vector clock.
However, the Vector Clock itself takes up space proportional to O(num_peers). This kinda sucks, because a Lamport Timestamp only takes up O(1) space. And you can encode the same information using Lamport Timestamps by just including the preceding versions in the Parents: field. Then you get the whole DAG.
So if you know for a fact that you have a small number of peers, then you might find it handy to use Vector Clocks, and have a unified representation to compare any two versions. However, if you are interested in scaling your network to a large number of editing peers, then it behooves you to use just Lamport Timestamps for each version ID, and encode the DAG directly using the Parents: field.
Maybe, can you fill out this example so I can understand?HTTP/1.1 209 Subscriptioncache-control: no-cacheconnection: keep-alivecontent-type: application/automergeTransfer-Encoding: chunkedversion: "a:0,b:0"content-length: xxxversion: "???" <-- What value does this contain?<<document at {a:1, b:0}>>----(later)----content-length: xxxversion: "b:0"parents: "a:0", "b:0" <-- How should the client interpret this? Its never seen "a:0" or "b:0"content-range: json=.messages[1:1] <-- What are these indexes relative to?<<patch content>> <-- Is this content not transformed by the server? Even when doing OT?I'm not sure what you are trying to do, there. I see three ways of representing versions here:
- "a:0,b:0"
- "{a:0, b:0}"
- "a:0"
A resource needs to be consistent in how it represents versions. Each version is a unique string, that uniquely identifies a point in time in the Time DAG. When you reference a version in the Parents: field, that version needs to have been defined somewhere in a Version: field, so you need to use the same representation of versions throughout the resource.
- "a:0,b:0"
- "{a:0, b:0}"
--
You received this message because you are subscribed to the Google Groups "Braid" group.
To unsubscribe from this group and stop receiving emails from it, send an email to braid-http+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/braid-http/41509506-c332-492a-ba46-6d43ff3e5b47%40www.fastmail.com.
But maybe better terms would be something like:"Operation identifier" - unique identifier for a single operation"Version" / "document version" - Identifier that uniquely names the state a document is in. (Which is a set of operation identifiers in your synchronizer)
...
Example usage:"The up to date header names a document version""Every braid patch has a unique operation identifier" (Or maybe it should be patch identifier?)
2.3. PUT a new version as a patch Not only are patches smaller, and thus more efficient; they also provide useful information for merging two simultaneous edits, for instance in collaborative editing. One can send an update in a patch by setting the "Patches" header to a number, and then set the Message body to a sequence of that many patches, separated by blank lines: Request: PUT /chat Version: "g09ur8z74r" | Version Parents: "ej4lhb9z78" | Content-Type: application/json | Merge-Type: sync9 | Patches: 2 | | Content-Length: 62 | | Patch Content-Range: json .messages[1:1] | | | | [{text: "Yo!", | | author: {type: "link", value: "/user/yobot"}] | | | Content-Length: 40 | | Patch Content-Range: json .latest_change | | | | {"type": "date", "value": 1573952202370} | | Response: HTTP/1.1 200 OK Patches: OK
This patch ID isn't in the spec yet, but we could add it if it helps. However, we'd need to define if it has any particular properties. Is it unique? Or can you use a single patch in multiple different versions? If you do, does that mean that the contents of the patch have to be the same in those versions, or does the patch transform?PUT /chat Version: "g09ur8z74r" | Version Patches: 1 ||Patch: "eh872" | Content-Length: 62 | | Patch Content-Range: json .messages[1:1] | | | | [{text: "Yo!", | | author: {type: "link", value: "/user/yobot"}] | |
The background is great but {c:4} in this example isn't a lamport timestamp. Its an operation identifier.
This distinction shows up in many other pieces of work:- Yjs makes a distinction between a delta (with its (agent,seq) pair) and a "full version" (used for snapshoting - which notably contains the RLE deleted range set).- Interval tree clocks refer to an "event". The merger of many events is not a list of events - ITC's contribution is in having a much more compact form than you could get by simply naming the contributing parents. Ref: https://gsd.di.uminho.pt/members/cbm/ps/itc2008.pdf- Dotted version vectors (DVV) extend vector clocks in a similar way...
Maybe, can you fill out this example so I can understand?HTTP/1.1 209 Subscriptioncache-control: no-cacheconnection: keep-alivecontent-type: application/automergeTransfer-Encoding: chunkedversion: "a:0,b:0"content-length: xxxversion: "???" <-- What value does this contain?<<document at {a:1, b:0}>>----(later)----content-length: xxxversion: "b:0"parents: "a:0", "b:0" <-- How should the client interpret this? Its never seen "a:0" or "b:0"content-range: json=.messages[1:1] <-- What are these indexes relative to?<<patch content>> <-- Is this content not transformed by the server? Even when doing OT?I'm not sure what you are trying to do, there. I see three ways of representing versions here:
- "a:0,b:0"
- "{a:0, b:0}"
- "a:0"
A resource needs to be consistent in how it represents versions. Each version is a unique string, that uniquely identifies a point in time in the Time DAG. When you reference a version in the Parents: field, that version needs to have been defined somewhere in a Version: field, so you need to use the same representation of versions throughout the resource.
- "a:0,b:0"
- "{a:0, b:0}"
Huh? How are these different? Are you using different syntax to imply different semantics, or just being a pedant?
The question I was asking is, if the client subscribes at (start) - ie {a:1}, then receives operation operation {b:1}, what data does the client receive? In ShareDB/ShareJS it would be:1. document snapshot at {a:1}. The version would be named "1".2: operation {b:0} *transformed by {a:1}*. This transformed operation would be renamed "2" by the server. (The client only needs to consider the causal history of itself and its server.)In a CRDT (yjs or automerge):1. document snapshot with complete document history at {a:0} - note this is much more data than just the document! This includes identifiers for every separable item in the document.2: operation {b:0}. The client has enough context to be able to merge and update:- Its document state- Its internal mapping between document contents and IDs- Its version (Automerge's version would now be {a:1,b:0}. Yjs would be ({a:1,b:0}, deletes)What I was asking was - in your mind what does braid do? Do you know? Does it do just one thing, or does it support either?And I think your answer was it supports either method?
Hi— first can you confirm whether the mapping I proposed is technically accurate? I want to make sure that I'm understanding your model correctly.Once we can confirm that our models map together, then it makes sense to move onto agreeing on terminology for the different parts.
In other words, we could simplify the mapping to this:
- "Operation Version" => a version ID
- "Document Version" => a set of version IDs
I think that captures the difference between your model and the current Braid spec.
Seph Gentle:But maybe better terms would be something like:"Operation identifier" - unique identifier for a single operation"Version" / "document version" - Identifier that uniquely names the state a document is in. (Which is a set of operation identifiers in your synchronizer)...Example usage:"The up to date header names a document version""Every braid patch has a unique operation identifier" (Or maybe it should be patch identifier?)Yes, I would call this a Patch ID rather than an Operation ID.
HTTP calls these things patches (like in the PATCH method), and the Braid specs and docs also call them patches rather than operations. See e.g. section 2.3 of Braid-HTTP:
...
To include the Patch ID, I'd suggest adding a Patch: header:PUT /chat Version: "g09ur8z74r" | Version Patches: 1 ||Patch: "eh872" | Content-Length: 62 | | Patch Content-Range: json .messages[1:1] | | | | [{text: "Yo!", | | author: {type: "link", value: "/user/yobot"}] | |This patch ID isn't in the spec yet, but we could add it if it helps. However, we'd need to define if it has any particular properties. Is it unique? Or can you use a single patch in multiple different versions? If you do, does that mean that the contents of the patch have to be the same in those versions, or does the patch transform?I also want to point out that if the Patch ID is unique, and a patch thus cannot be included in multiple versions, then you don't actually need a Patch ID in the protocol to get full consistency — you can re-derive a unique ID as e.g. "Patch 4 of Version 'g09ur8'" or "Offset 7 of Version 'g09ur8'" in every place where you'd need an ID. You'd do this using the wrapper technique where a wrapper translates internal representations into and out of "network representations", which use different unique IDs than the implementations themselves use. This translation can be done in O(1) time to write, and O(log(n)) time to read, for typical sequence CRDTs.But you don't need to do that. You can include a Patch ID if you want. :)
Seph:The background is great but {c:4} in this example isn't a lamport timestamp. Its an operation identifier.In the name of finding consensus agreement rather than disagreement, let me take a page out of the book of "yes and", and say:Actually both are true. It is a Lamport Timestamp. It is also a Patch ID. Specifically, your patches are "identified" by their "Lamport Timestamp".
This distinction shows up in many other pieces of work:- Yjs makes a distinction between a delta (with its (agent,seq) pair) and a "full version" (used for snapshoting - which notably contains the RLE deleted range set).- Interval tree clocks refer to an "event". The merger of many events is not a list of events - ITC's contribution is in having a much more compact form than you could get by simply naming the contributing parents. Ref: https://gsd.di.uminho.pt/members/cbm/ps/itc2008.pdf- Dotted version vectors (DVV) extend vector clocks in a similar way...Yes, you can distinguish a Patch ID from a Version ID. And you are certainly allowed to do so in the Braid-HTTP Spec.However, the fact that existing systems distinguish Patch IDs from Version IDs is not an argument that you need to distinguish them in order to have full interoperating compatibility. Greg and I prove this by building Sync9, a CRDT that maintains Patch IDs internally, but does not expose them over the network— all communication is done in terms of Version IDs and ranges only.
As for your question about the 2 different models:
...
Yes, Braid supports both of these methods. Do you need any explanation of how? It seems it might help you to see an example of the network messages.To articulate a little bit:(1) In the OT case, each client will be receiving different sequences of patches, with different version histories, in their subscription responses.(2) In the CRDT case, all clients will see the same patches, and piece together identical version histories.
I also want to draw out a distinction that you might not be aware of. In (2) the CRDT case, the "initial snapshot" could be expressed in two different forms:
- As an opaque blob of CRDT data (as you seem to suggest)
- As the first N messages expressing the N versions and patches that recreate the blob.
The former will work, but the latter is more semantically meaningful, and requires peers to understand and agree upon one fewer file format. They have to learn how to read and write the patches anyway. Why force them to also read and write the same internal data structure blob?The latter is also what the "all caught up" header is for: https://github.com/braid-work/braid-spec/pull/79/files