Merge types and patch types

71 views
Skip to first unread message

Seph Gentle

unread,
Jan 20, 2021, 9:28:26 PM1/20/21
to braid...@googlegroups.com
Hi everyone!

I had a chat with Mike this morning trying to figure out how merge types and patch types should work. We have some disagreement about the semantics. We want to involve the community and move away from closed-door conversations talking this stuff through, so I want to lay out the disagreement as best I can here. (I'm leaning on Mike to represent the strongest version of his perspective.)

As written right now there's only one format for expressing patches. Example from the spec[1]:

Content-Length: 53
Content-Range: json=.messages[1:1]

[{"text": "Yo!",
"author": {"link": "/user/yobot"}]

This expresses the change:

- Replace document.messages[1:1] (an empty set)
- With the content in the list [{text:..., author:...}]

The spec lets you specify custom algorithms for describing how concurrent patches are *merged*. But the format of the patch itself is hardcoded. I appreciate the intent of this design - it keeps clients simple. I want that too. But I think its too simple - there is software I've already worked on (sharedb, lever.co, and google wave) that leaned on features that are missing from this format.

Some concerns:

- A hardcoded format means we can't use any of the other established patch formats (json-patch, ot-json, yjs, automerge, etc). Or any of the tooling to process those patches, or pass them into react/svelte/etc. Instead. If a user wants to use yjs with braid, we'll need some (very complex) bridging code to convert patches to and from this braid patch format.

- The patch type isn't isomorphic to any of those other forms. There are some operations that I want to express that (I think) its currently not possible to express using braid's format. If I have an application built on yjs and some clients use braid and some don't, there are patches that I can't express through braid.

For example, The format seems to only let you specify replace operations (and insert/delete). Other useful operations include counter increments, rich text annotations ("bold/italicise this region of text"), and moves (move a.b to c.d). I've also seen split operations ("split this line at this position into two separate lines in my array of lines"). ShareDB also allows users to invent their own patches if they want custom semantics for their data. (Eg for 3d modelling transformations)

Mike made the point that each document could have its own merge type, but my impression is that he's imagining each document has a single merge type. I'm not sure how this works for eg a counter that supports both increment and reset operations. And the other piece of that puzzle: How do you express "Add 5" or "split at this position" or "bold this region" using the proposed braid format?

- The Content-Range header format ("Content-Range: json=.messages[1:1]") is incompatible with CRDTs like yjs and automerge. Automerge specifies list/string positions by naming the ID (agent id, seq) of the preceeding item. Yjs specifies positions by naming the ID of the predecessor and successor.

- Yjs and automerge also use custom formats to describe the *content* that was inserted. (Eg a patch might insert a run of characters. Each character needs its own ID and predecessor. The ID and precessors can be stored very compactly if the characters are inserted in order). I'm not sure how the braid format would encode this information in JSON / text in a compatible way. Both of those CRDTs have settled on their own custom run-length-encoded binary columnar formats because without that, automerge needs 83mb to express the edits in a ~200kb document[2]).

Mike mentioned we could add a separate "content-range-type" or something for the content-range semantics, but that seems like the worst of all worlds. Braid clients will need deep understanding of the CRDT implementations to interpret any changes. We need to invent yet another way to efficiently encode text operations in a CRDT, and we need complex code to convert back and forth between braid changes and automerge/yjs changes.

And all that, with no benefits that I can see over just sending yjs / automerge patches directly using their own formats.

I'm not sure if I've made it clear, but my preferred approach here (as I mentioned in another thread) would be to add another header for naming a Patch-Type, with values like "yjs", "automerge", "ot-json", "json-patch", etc. To me this is like Content-Type in HTTP. I'm not sure if Merge-Type should be implied by patch-type, or be included independently.

I expect over time the community / application developers might converge on an implementation (like automerge). But I don't think we're ready to pick a winner in that competition yet. I also don't think the proposed patch format is expressive enough to encompass all the use cases people have for braid. We couldn't have used this format in google wave, and it would be a great pity if the google wave of the future needs to avoid using braid because braid makes their patches inexpressible.

-Seph

[1] https://github.com/braid-work/braid-spec/blob/bf179d8c84ea48b5cb1ea4faf23c6ea9bfc28416/draft-toomim-httpbis-braid-http-03.txt

[2] B4 in this table: https://github.com/dmonad/crdt-benchmarks/tree/d7f4d774a302f13f26cded6e614d44e0b5e496c9#results

Michael Toomim

unread,
Feb 3, 2021, 4:17:38 AM2/3/21
to Seph Gentle, braid...@googlegroups.com
Seph:

As written right now there's only one format for expressing patches. Example from the spec[1]:

        Content-Length: 53
        Content-Range: json=.messages[1:1]

        [{"text": "Yo!",
          "author": {"link": "/user/yobot"}]

This expresses the change:

- Replace document.messages[1:1] (an empty set)
- With the content in the list [{text:..., author:...}]

The spec lets you specify custom algorithms for describing how concurrent patches are *merged*. But the format of the patch itself is hardcoded.

This is incorrect.  You can use any patch format you'd like.  There's an example using JSON Patch [RFC6902] at the bottom of section 2.3:

      Request:

         PUT /chat
         Version: "up12vyc5ib"                                 | Version
         Parents: "2bcbi84nsp"                                 |
         Content-Type: application/json                        |
         Merge-Type: sync9                                     |
         Patches: 1                                            |
                                                               |
         Content-Length: 326                                   | | Patch
         Content-Type: application/json-patch+json             | |
                                                               | |
         [                                                     | |
           { "op": "test", "path": "/a/b/c", "value": "foo" }, | |
           { "op": "remove", "path": "/a/b/c" },               | |
           { "op": "add", "path": "/a/b/c", "value": [] },     | |
           { "op": "replace", "path": "/a/b/c", "value": 42 }, | |
           { "op": "move", "from": "/a/b", "path": "/a/d },    | |
           { "op": "copy", "from": "/a/d/d", "path": "/a/d/e" }| |
         ]                                                     | |

      Response:

         HTTP/1.1 200 OK
         Patches: OK
In general you can define any patch format you'd like, in arbitrary english, with a spec. See [RFC6902] for an example. You would probably like to make one for sharedb as well.

Patch formats are not actually a Braid thing. They are an HTTP thing, that we are just re-using for Braid.

You probably also want to check out spec for the PATCH method [RFC5789]. In particular, section 3.1 defines the Accept-Patch header, which you can use with an OPTIONS request to determine which types of patches a server supports.

Does this resolve your concerns?

It appears that you think Braid only supports Range Patches, but in fact those are only one type of patch. Furthermore, you might be interested to learn that Range Patches can, in fact, support most of the fancy features you want (clones, moves, bolding text, splitting a sequence into two regions) if we just add a simple feature called the portal. We haven't specified this yet because it hasn't been used necessary, but I'd be very happy to provide detail if you'd like.

[RFC5780] PATCH Method for HTTP https://tools.ietf.org/html/rfc5789#section-3.1

Seph Gentle

unread,
Feb 10, 2021, 4:12:22 AM2/10/21
to Michael Toomim, braid...@googlegroups.com
Hey! Sorry it took a few days to respond properly. This took some time to prepare. Note I'm not attempting to be balanced here. I'm leaning on you Mike (any anyone else who disagrees) to make the case for the spec staying as-is. (Since you'll be able to do that better than me!)

It makes sense to talk through an example.

Lets say we're sending this automerge document through braid (thanks Greg!). This operation inserts "hi" after the user's previous inserted text. Note, this is the expanded JSON form of this operation. Automerge has a compact run-length-encoded form which is approximately 50x more efficient for encoding operations. (Described here: https://github.com/automerge/automerge/blob/performance/BINARY_FORMAT.md ).

(I'm taking a few liberties with the protocol here to hopefully make it easier to read.)

automerge_change = {
    "actor": "abc2",
    "seq": 2,
    "startOp": 13,
    "time": 1612489353,
    "deps": ["100@face", "2@beef"],
    "ops": [{
            "obj": "7@abc1",
            "elemId": "12@abc2",
            "action": "set",
            "insert": true,
            "value": "h",
            "pred": []
    }, {
            "obj": "7@abc1",
            "elemId": "13@abc2",
            "action": "set",
            "insert": true,
            "value": "i",
            "pred": []
    }]
}

There's a few ways to express this using braid using the current spec. The extremes are:

1. Lean heavily on braid. Here each patch is broken into a separate merge object. Each single character typed turns into 167 bytes sent over the wire (before gzip). (Gzip will not save us - this will still be pretty big gzipped.)

The startOp field is split out explicitly into each patch object. We also depend on a new content-range header which expresses the insert location using automerge semantics:

         PUT /chat
         Version: "abc2@2"
         Parents: ["100@face", "2@beef"]
         Patches: 2

         Content-Length: xxx
         Content-Type: application/json+automerge
         Content-Range: automerge="7...@abc1.12@abc2"

         { "id": "13@abc2", "action": "set", "insert": true, "value": "h", "pred": [] }

         Content-Length: xxx
         Content-Type: application/json+automerge
         Content-Range: automerge="7...@abc1.13@abc2"

         { "id": "14@abc2", "action": "set", "insert": true, "value": "i", "pred": [] }


2. Lean heavily on automerge's compact form. Let the automerge library handle encoding and decoding of the patch contents, handle applying the patch and interpret the location of the changes:

         PUT /doc
         Version: "abc2@2"
         Parents: ["100@face", "2@beef"]
         Patches: 1

         Content-Length: xxx
         Content-Type: application/automerge

    <<Automerge's compact binary encoding of operations>>


There's some other ways to represent automerge in between these two extremes.

---

My claim is that the second format is better for several reasons (below). And if we standardize on this second form then we can compact it further to this, which on the write path is already (mostly) part of HTTP:

         PATCH /doc
         Version: "abc2@2"
         Parents: ["100@face", "2@beef"]
         Content-Length: xxx
         Content-Type: application/automerge

    <<Automerge's compact binary encoding of operations>>

When reading we currently have 3 levels of HTTP going on (Actual HTTP, header containing "Patches: X", and individual patches). This format could reduce that to "just" 2 levels, like this:

    HTTP/1.1 209 Subscription
    cache-control: no-cache
    connection: keep-alive
    content-type: application/automerge
    Transfer-Encoding: chunked
    version: ["100@face", "2@beef"]

    content-length: xxx
    version: ["100@face", "2@beef"]

    <<automerge encoded document>>
----(later)----
    content-length: xxx
    version: "abc2@2"

    <<automerge encoded patch>>


Advantages:

- Automerge and Yjs already have compact, optimized binary representations for patches. If we break patches out into a series of operations, we are forced to discard these optimized formats. And that would be a mistake because we need them. Automerge's optimized format is *50x* smaller. Gzip simply doesn't help enough - and even if it did, v8 struggles to parse the hundreds of megabytes of data objects created in a few hour editing session.

- Automerge, YJS and json-ot already each implement their own encoding formats for expressing patch locations inline. These formats are mutually incompatible anyway, and I claim making them all fully compatible would be a lot of effort with very little gain.
  - Yes, we probably can convert json-ot moves to use your unspecified portal format but why bother? That sounds like a lot of effort given it already works well, and works correctly. Lets push that into a separate RFC if its important to you.

- The spec is simpler this way:
  - No more Patches: X in the read or write path.
  - Unless I'm missing something, I don't think we need PUT to send patches to the server
  - No more triply-nested HTTP in subscriptions. We only need 2 levels of HTTP encoding. (Outer and inner)
  - The content-range header can be removed from the braid spec. It can live in the sync9 spec or something, which is the only implementation I expect will actually use it.

Disadvantages:

- The current spec provides a standard way to group many operations into a single transaction
  - But all the patch formats we want to support already know how to do that (except sync9?)

- With the current spec, a consumer of a stream of patches arguably doesn't need to understand some aspects of the patch format
  - Is that true though? You would need a full automerge implementation & ID search tree to be able to interpret automerge content-range headers. Thats 95% of automerge's runtime complexity, so there's no saving.

- I don't know if this will work with the idea of having generic merge-types.

I'm happy to take a stab at rewriting the spec to look like this; but I expect it will be a big change and I don't want to do that work until we reach consensus here.

Thoughts?

-Seph

Duane Johnson

unread,
Feb 10, 2021, 11:44:50 AM2/10/21
to Seph Gentle, Michael Toomim, Braid
For the sake of clarity, in your last example, did you mean "parents: ["100@face", "2@beef"]" instead of "version: ["100@face", "2@beef"]"?


    HTTP/1.1 209 Subscription
    cache-control: no-cache
    connection: keep-alive
    content-type: application/automerge
    Transfer-Encoding: chunked
    version: ["100@face", "2@beef"]   <---- here

    content-length: xxx
    version: ["100@face", "2@beef"]   <---- and here

    <<automerge encoded document>>
----(later)----
    content-length: xxx
    version: "abc2@2"                 <---- and maybe here?

    <<automerge encoded patch>>

--
You received this message because you are subscribed to the Google Groups "Braid" group.
To unsubscribe from this group and stop receiving emails from it, send an email to braid-http+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/braid-http/79b3a497-43c9-477e-a39e-d62bfd7f2106%40www.fastmail.com.

Seph Gentle

unread,
Feb 10, 2021, 2:34:34 PM2/10/21
to Duane Johnson, Michael Toomim, Braid
Nope.

It’s not how Greg presented it the other day but I’m 95% sure we can use the (agent, seq) frontier pair together as a version instead of hashes.

-Seph

Michael Toomim

unread,
Feb 10, 2021, 2:58:33 PM2/10/21
to Seph Gentle, Duane Johnson, Braid

I think "100@face" is an (agent, seq) pair, and thus ["100@face", "2@beef"] is a pair of pairs. A pair of pairs is a pair of versions, not a version. Thus I think Duane is wondering if it was intentional to write a pair of versions in a field named "version". Usually pairs of versions appear in fields labeled "Parents".

Seph Gentle

unread,
Feb 10, 2021, 3:59:24 PM2/10/21
to Michael Toomim, Duane Johnson, Braid
It was intentional - but being sidetracked by this was not. “100@face” names an operation/patch. The document version should be a name somehow based on the set of all operations that have ever been applied to the document. So we want a document version that somehow includes all the operations:

100@face
99@face
98@face
....
20@abcd
19@abcd
...

But obviously just listing them all is unweildy and redundant. We can make it smaller in a few different ways:

- You could just hash all the ids in the set (or use a merkle dag), like git and bitcoin do.

Or use a version vector - [100@face, 20@abcd]

Or name the frontier:

- 100@face implies all the earlier operations @face
- Operations can explicitly name their parents. If we know 90@face directly depends on 20@abcd then we don’t need to list 20@face or below either. So we can just name the operation 100@face in this case.

Intuitively if I “pick up” 100@face and it’s connected to 99@face, and that’s connected to 98@face and so on, do I end up picking up every version ever applied to the document? If yes, naming that single operation is sufficient as a version for the whole document.

So maybe just naming 100@face is sufficient. But a single operation won't always be sufficient. Lets say you send me operation 21@abcd with parents {20@abcd, 99@face}. Operations 21@abcd and 100@face were created concurrently - neither operation is a parent of the other. There's a document state which contains both 21@abcd and 100@face, but I can't call it "21@abcd" because that implies it doesn't contain 100@face. And I can't call it 100@face because that implies it doesn't contain 21@abcd. Hence versioning the document based on the set of operations {"21@abcd", "100@face"}.

Using this approach the rule is this: You version a document based on the set of operations at the frontier of the DAG. This is the set of operations with no children in the operation graph.

Git doesn't need this because it creates merge commits, but merge commits don't work well in a realtime collaborative editor.

I wrote more about this approach in these rough notes for the database I'm working on:

This is also a good read. Riak has a small, known set of servers generating versions and they have a slightly different approach: https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/index.html

-Seph

Duane Johnson

unread,
Feb 10, 2021, 4:15:48 PM2/10/21
to Seph Gentle, Michael Toomim, Braid
Wow, thanks for clarifying. Sorry for the side-track, but I'm glad I asked because it wasn't at all apparent to me how much complexity was behind those pairs.

Back to our regularly scheduled program.

Michael Toomim

unread,
Feb 10, 2021, 4:20:51 PM2/10/21
to Seph Gentle, Duane Johnson, Braid

I think I can see the confusion. There are multiple ways to refer to a version when using (agentid, seq) pairs:

(1) With the (agentid, seq) that created the version
(2) With a version vector of all the latest [(agent, seq) ...] known when the version was created

It sounds like you were using version vectors (2) in this example, and Duane and I were expecting (1). Thus the confusion.

I think we can get back to your proposal now.

Michael Toomim

unread,
Feb 10, 2021, 4:31:28 PM2/10/21
to Seph Gentle, braid...@googlegroups.com
Seph:
         PUT /chat
         Version: "abc2@2"
         Parents: ["100@face", "2@beef"]
         Patches: 2

Ah, except that in this example, I believe you're using a version vector for the "Parents" header as well.

In the Braid spec, the parents header is a list of Version IDs. But since you are referring to Version IDs as version vectors elsewhere, a literal interpretation of the spec would imply that the parents header should be a list of version vectors.

So it sounds like there are a number of ways in which this proposal differs from the current Braid spec. I think we'll need to write them down, and then (hopefully) piece them apart into independent discussions so that we can look at them one by one.

Seph Gentle

unread,
Feb 10, 2021, 4:37:29 PM2/10/21
to Michael Toomim, braid...@googlegroups.com
The versioning system I'm using here is independent of my proposal. I'd love to discuss versions - but lets do it in another thread or we'll miss the forest for the trees.

Michael Toomim

unread,
Feb 10, 2021, 5:16:14 PM2/10/21
to Seph Gentle, braid...@googlegroups.com
Seph Gentle wrote:
The versioning system I'm using here is independent of my proposal. I'd love to discuss versions - but lets do it in another thread or we'll miss the forest for the trees.

I'm all for discussing versions separately. In that case, it would help for the versions in this example to meet the current Braid spec, so that all the differences in the examples are things that you want us to focus on here.

Here are a couple of differences that have been tripping me up so far:

(1) The value of "Parents:" is not supposed to be surrounded by square brackets:

         Parents: ["100@face", "2@beef"]

It's just a list of strings, separated by commas, using the HTTPBis "Structured Headers" format:

   Parents are also specified with a header in a PUT request or GET
   response:

           Parents: "ajtva12kid", "cmdpvkpll2"

   The Parents header is a List of Strings, in the syntax of HTTP's
   [STRUCTURED-HEADERS]. 

(2) A version, likewise, is not supposed to be surrounded by square brackets:

    version: ["100@face", "2@beef"]

It's supposed to just be a string in the "Structured Headers" format:

   Versions are specified as a string in the [STRUCTURED-HEADERS]
   format.  Each version string must be unique, to differentiate any two
   points in time.  ...<snip>...

           Version: "dkn7ov2vwg"
These differences were leading me to think that there was either an unintended bug in the example, or that you were proposing a new versioning system along with the change in patches.

Seph Gentle

unread,
Feb 10, 2021, 8:45:22 PM2/10/21
to Michael Toomim, braid...@googlegroups.com
On Wed, Feb 10, 2021, at 10:16 PM, Michael Toomim wrote:
(1) The value of "Parents:" is not supposed to be surrounded by square brackets:

         Parents: ["100@face", "2@beef"]

This line should be:

   Parents: "100@face", "2@beef"

(2) A version, likewise, is not supposed to be surrounded by square brackets:

    version: ["100@face", "2@beef"]

These differences were leading me to think that there was either an unintended bug in the example, or that you were proposing a new versioning system along with the change in patches.

This was intended as pseudocode (pseudohtml?).

---

You've raised some interesting problems here which are worth addressing (And I've changed the title to hopefully bump this into a separate thread).

- Does the version field in a patch specify the "operation version" or "resultant version" after the client has applied a change? These can be different - eg, with vector clocks the document version might be {a:1, b:2, c:3}. My operation version is {c:4}. Should the patch say {a:1, b:2, c:4} or just {c:4} ?

If we're sending operation versions (which was my understanding from our conversation earlier on discord):

- How does a client compare {c:4} with the "up to date version" field named at the start of the subscription stream, in the spec? Surely that up to date version must say {a:1, b:2, c:4} - but in that case, how will a client know that they're up to date? (They can't directly compare {c:4} with {a:1,b:2,c:4} if those strings are supposed to be opaque)

- Actually how does the client even recieve a snapshot at {a:1,b:2,c:3} ? What does the version header in the first message in the subscription stream say? I'd love to see a worked example of this!

- This might be resolved in the content-ranges draft, but lets say I have versions like this:

 {a:0,b:0}
  /    \
{a:1}  {b:1}

I subscribe. I get a copy of the document at version {a:1,b:0}. Then the server sends me the {b:1} operation with an insert. Is the insert location in the {b:1} patch specified with respect to {a:0,b:0} (as it was when the operation was created) or with respect to {a:1,b:0} (ie, transformed to meet the client's expected world)?

If its with respect to {a:0,b:0}, how does the client interpret the contents and content-range locations, since presumably they're missing an explicit copy of the {b:1} patch?

Maybe, can you fill out this example so I can understand?

    HTTP/1.1 209 Subscription
    cache-control: no-cache
    connection: keep-alive
    content-type: application/automerge
    Transfer-Encoding: chunked
    version: "a:0,b:0"

    content-length: xxx
    version: "???" <-- What value does this contain?

    <<document at {a:1, b:0}>>
----(later)----
    content-length: xxx
    version: "b:0"
    parents: "a:0", "b:0" <-- How should the client interpret this? Its never seen "a:0" or "b:0"
    content-range: json=.messages[1:1] <-- What are these indexes relative to?

    <<patch content>> <-- Is this content not transformed by the server? Even when doing OT?

Thanks!
-Seph

Seph Gentle

unread,
Feb 10, 2021, 8:47:51 PM2/10/21
to braid...@googlegroups.com

    content-length: xxx
    version: "b:0"

Oops thats a typo - should have read version: "v:1".

Seph Gentle

unread,
Feb 10, 2021, 8:58:15 PM2/10/21
to braid...@googlegroups.com
On Thu, Feb 11, 2021, at 1:47 AM, Seph Gentle wrote:

    content-length: xxx
    version: "b:0"

Oops thats a typo - should have read version: "v:1".

I mean "b:1". Sorry for the spam - I don't want to get more replies just focussed on typos rather than content.


    parents: "a:0", "b:0" <-- How should the client interpret this? Its never seen "a:0" or "b:0"
    content-range: json=.messages[1:1] <-- What are these indexes relative to?

    <<patch content>> <-- Is this content not transformed by the server? Even when doing OT?

Thanks!
-Seph


--
You received this message because you are subscribed to the Google Groups "Braid" group.
To unsubscribe from this group and stop receiving emails from it, send an email to braid-http+...@googlegroups.com.

Michael Toomim

unread,
Feb 12, 2021, 5:07:30 PM2/12/21
to Seph Gentle, Braid
Seph, this opinion about Merge Types and Patches is aggregating a number topics into a single narrative. I'd suggest breaking it apart into the smallest possible units, so that we can discuss each one independently. Then we can make more progress together as a group.

It is much easier to solve small independent issues one-by-one. This way you can enumerate, and consider one-by-one, all options for issue, without getting lost in the combinatorial explosion of considering all options x all issues. And this is especially important for a large group, because breaking up issues into smaller issues allows each member of the group to choose different issues to work on, without having to pay attention everything that happens in the entire group.

In this case, I see two primary issues being raised:

  1. Patch Formats: General vs. Custom?
  2. A new syntax for encoding Versions & Patches

In fact, the email is confusing to me— it starts with a critique of the Range-Patch spec, but then goes on to propose a change to the Braid-HTTP spec. Those are different specs. They talk about different things. Let's discuss them as separate topics.

Now here are some brief reactions to each topic:

(1) If you don't find the Range-Patch spec valuable, then don't use it. The IETF does not, and cannot, force anyone to use a spec. It is only a venue for people who already want interoperability to come together to find consensus. If you don't want interoperable patches, then you can ignore the Range-Patch spec.

(2) You propose some syntactic changes in the protocol, but I'm not clear on what they are. You talk about collapsing 3 levels into two, but I'm not sure which level you're removing. It would help to break apart each syntactic change and show before-and-after examples. You are showing a new syntax for some examples without showing the old syntax. You're including Transfer-Encoding: chunked, but I'm not sure if that's an intrinsic part of your proposal, or just a random thing. You talk about removing Patches: X, but it's not clear what you mean.

On the other hand, it's possible that you are trying to argue that "Braid should not have Range Patches, every patch should be a Custom Patch format." Is this what you are trying to say? That's not cool. Range Patches are optional and sit in their own separate file. They aren't causing you any harm. And furthermore, Range Patches are awesome. We can generalize every synchronizer's patch format into a single simple model! That's so fascinating! And it's a model already built into HTTP— the Range Request —and but just hasn't been defined yet for PUTs! This is a great discovery in Computer Science!

Michael Toomim

unread,
Feb 12, 2021, 5:25:37 PM2/12/21
to Seph Gentle, Braid
Now for some specifics:

Seph:

The startOp field is split out explicitly into each patch object. We also depend on a new content-range header which expresses the insert location using automerge semantics:

         PUT /chat
         Version: "abc2@2"
         Parents: ["100@face", "2@beef"]
         Patches: 2

         Content-Length: xxx
         Content-Type: application/json+automerge
         Content-Range: automerge="7...@abc1.12@abc2"

         { "id": "13@abc2", "action": "set", "insert": true, "value": "h", "pred": [] }

         Content-Length: xxx
         Content-Type: application/json+automerge
         Content-Range: automerge="7...@abc1.13@abc2"

         { "id": "14@abc2", "action": "set", "insert": true, "value": "i", "pred": [] }

Woah there, this is super verbose. You shouldn't include the underlying Automerge data in each patch body. Instead of {id: ...., value: "h"}, that should just say "h"! You also shouldn't specify the Content-Type with each patch. That's a property of the entire resource, not the patch.

So let me also fix the parents for you, and your example should look like this:

         PUT /chat
         Version: "abc2@2"
         Parents: "100@face", "2@beef"
         Patches: 2

         Content-Range: automerge="7...@abc1.12@abc2"
         Content-Length: 1

         h

         Content-Range: automerge="7...@abc1.13@abc2"
         Content-Length: 1

         i

- Automerge and Yjs already have compact, optimized binary representations for patches. If we break patches out into a series of operations, we are forced to discard these optimized formats. And that would be a mistake because we need them. Automerge's optimized format is *50x* smaller. Gzip simply doesn't help enough - and even if it did, v8 struggles to parse the hundreds of megabytes of data objects created in a few hour editing session.

Those same compaction mechanisms can work over Braid as well. And after we use H2 header compression, I think the network messages will be comparable to Yjs and Automerge's defaults. Maybe Braid messages will be even smaller. We'll see.

  - Yes, we probably can convert json-ot moves to use your unspecified portal format but why bother? That sounds like a lot of effort given it already works well, and works correctly. Lets push that into a separate RFC if its important to you.

It's already a separate RFC: draft-toomim-httpbis-range-patch-00. I'm disappointed that you don't realize this. Please read the specs. I'm getting the impression that they already do what you want.

Second, RFC is the wrong term. This is an "Internet Draft," or ID. The term RFC only applies to drafts that are further along in the IETF process.

- The spec is simpler this way:
  - No more Patches: X in the read or write path.

You already don't need Patches: X for custom patch formats.

  - Unless I'm missing something, I don't think we need PUT to send patches to the server

You don't need PUT. You can use PATCH. But maybe you mean something else.

  - No more triply-nested HTTP in subscriptions. We only need 2 levels of HTTP encoding. (Outer and inner)

You're not being clear on what level you're removing. I'm guessing that it's the patches. This is already the case with the spec if you use a custom patch format. So I don't see what you are proposing that's different here.

  - The content-range header can be removed from the braid spec. It can live in the sync9 spec or something, which is the only implementation I expect will actually use it.

This header is not defined in Braid. It's defined in HTTP.

The Range-Patch spec (which is a separate spec) defines how we this existing concept can interoperate with Braid synchronizers. Any implementor who wants to use Range Requests with Braid can use this spec, and you are free to ignore it if you do not desire interoperability.

This spec is useful for much more than sync9. You can use it to synchronize a CDN with dynamic content. Let's say a CDN is hosting your javascript files. When you edit them, you can just send a Content-Range: bytes=500-501 update with the new character. It's a very general way to send diffs in HTTP resources.

Seph Gentle

unread,
Feb 14, 2021, 7:17:24 PM2/14/21
to braid...@googlegroups.com
On Fri, Feb 12, 2021, at 10:07 PM, Michael Toomim wrote:
Seph, this opinion about Merge Types and Patches is aggregating a number topics into a single narrative. I'd suggest breaking it apart into the smallest possible units, so that we can discuss each one independently. Then we can make more progress together as a group.

It is much easier to solve small independent issues one-by-one. This way you can enumerate, and consider one-by-one, all options for issue, without getting lost in the combinatorial explosion of considering all options x all issues. And this is especially important for a large group, because breaking up issues into smaller issues allows each member of the group to choose different issues to work on, without having to pay attention everything that happens in the entire group.

These aren't small independent issues. "Do versions work like X or Y? Should they?" is a question & discussion, not an issue. "The spec is ambiguous about whether versions work like X or Y" is a github issue - but that framing is different and the difference matters). It sounds like you're asking to talk about smaller picture issues, but I think we have a big picture disagreement:


I see two visions for braid which in my head are in conflict:

1. Braid as an HTTP based transport format for {yjs, automerge, sharedb, sync9, etc}
2. Braid as synchronizer, which is trying to be backwards compatible with {yjs, automerge, sharedb, sync9?}

Insofar as braid is (1), our goal is:

- Provide common terminology
- Provide a standard way to *package* yjs/automerge/etc semantics into HTTP messages
- Provide some common libraries and tools to speak those protocols over http
- Provide extension points for synchronizers to customize the protocol. (Eg version identifier semantics, merge semantics, patch semantics)
- Make the HTTP protocol as simple as possible (but no simpler). I don't want to bloat each patch with headers that aren't relevant to some of the syncronizers we wrap. (Eg `Patches: 1` - just embed the patch!)

Insofar as braid is (2), our goal is:

- Articulate a synchronizer
- Demonstrate that it supports a superset of the operations of those other systems can support
- Implement the synchronizer + protocol
- Support the development of compatibility shims from automerge/sharedb/etc into the braid unified syncronizer


I've been pushing for (1) for awhile because I haven't heard you fully articulate (2) (I shoulder some responsibility for not being as good a listener here as I'd like to have been). But now hearing the vision for (2) be articulated, I'm worried we're way too early to write an RFC for it - we don't have working code!


In this case, I see two primary issues being raised:

  1. Patch Formats: General vs. Custom?
  2. A new syntax for encoding Versions & Patches

I like this summary!

*You* raised (2) when I wasn't specific about the syntax for versions in my examples and Duane asked about it. We've moved that discussion into a separate thread.

Now I'm raising another issue above:

3. Is braid a container format, a syncronizer, or trying to be both?

And meta: I just deleted a lot of replies to some stuff you said, because I want to be intentional about our focus. The decision around patch syntax is an expression of this point of contention.

On the other hand, it's possible that you are trying to argue that "Braid should not have Range Patches, every patch should be a Custom Patch format." Is this what you are trying to say?

Yes you got it.

That's not cool. Range Patches are optional and sit in their own separate file. They aren't causing you any harm.

They might, if our systems aren't compatible as a result - which matters.

And furthermore, Range Patches are awesome. We can generalize every synchronizer's patch format into a single simple model! That's so fascinating!

Yeah having a unified syncronizer is an awesome goal, if we can achieve it!

> We can generalize every synchronizer's patch format into a single simple model!

So my question is: Can we? That sounds to me like a research hypothesis, not a grounded engineering situation. I don't think we should build IETF specs around a research hypothesis. Maybe we should talk to the IRTF?

If you see it more as an engineering reality - something we can design today, then I have a lot of questions:

- Where is the mapping defined? Or is it tucked into braid.news? Is there an academic paper? I want something like the INTERNALS.md from automerge.

  - Where is the description of move operations (portals)? Where is the description of embedded operations (supported by json-ot)? How about for a counter object that can be replaced or incremented? How about custom application-specific edits (supported by sharedb)?

  - Do versions need to be translated / transformed? Are those transformations isomorphic? (I think the answer for yjs is 'no'...? And if thats true, why doesn't that matter to your hypothesis? I'm still confused about this even after our phone call yesterday.) How about dotted version vectors or interval tree clocks, or other systems which don't express parents as a list?

  - You've said a couple times in passing that you'd like your content-range: JSON format to be used for everything. Why use offsets instead of identifiers, given CRDTs are converging on identifiers?

- Where are the 2-way mappings between "every synchronizer's patch format" and your model? Greg's talk last Monday was the start of this work for automerge, not the end of it. In his example he ended up stuffing a lot of automerge's fields into extra special headers, which seemed incomplete. In my example above I put those extra fields into the HTTP body, and you said that is wrong. Where do they go? How does it map into your synchronizer?

- Is your model correct? Many academic papers in this space have "proven" their system was correct, only to be embarassed to discover correctness bugs when the system was finally implemented. I've been humbled time and again by fuzzers to know we probably aren't smarter than those unfortunate researchers. So I want to see a fuzzer showing correctness, like the fuzzer a few years ago for sharedb and sync9. Where is the equivalent for automerge and yjs?

(And I know correctness doesn't matter to you as much as it does for me, but we can't replace existing concurrency systems with a system that can't achieve the same correctness guarantees.)

I've been working from the assumption that this is infeasable / not yet ready for a spec. I'd love to be wrong about that - it would indeed have big impacts on a common protocol. Are we there yet?

And it's a model already built into HTTP— the Range Request —and but just hasn't been defined yet for PUTs! This is a great discovery in Computer Science!

For now its a great idea. Its not time to celebrate before we have working code doing that does 2 way mapping.

-Seph
Reply all
Reply to author
Forward
0 new messages