Thoughts on protocol improvement - vector clocks

Steven Jewel

Feb 15, 2014, 12:00:59 AM
to clearsk...@googlegroups.com
Hello everyone,

Here is an alternate proposal to the replication log sent earlier this
week. It improves upon ideas already discussed.

Having a vector clock per file gives us causality violation tracking
using much less disk space than the replication log. What still needs
improving over the current system is how a peer learns which files are
new. Previously we discussed using a locally tracked last-changed
revision number on each file; I'll call this a transaction ID in the
rest of this email.

The problem with tracking the last-known transaction ID for each client
is that a single change will be replicated n*n times, where n is the
number of clients present in the network. We can improve on that by
instead making the last-changed field contain both a peer ID and a
transaction ID. On first connection we will send the last known
transaction ID for all known writers to our peer. Our peer can then
scan its database and identify any new records since that point.


Manifest Exchange
-----------------

For those just tuning in, a peer ID is a 128-bit number, represented as
hex. I'll use "A", "B", and "C" to represent these hex numbers, for
clarity. The transaction ID is a monotonically incrementing number
that's unique to each peer, incremented each time that peer makes a
change to the tree. The transaction IDs don't stay in sync. Let's
pretend we have hosts A, B, and C. An individual file record will look
something like this:

{
  path: "bar.txt",
  size: 143,
  sha256: "af12bc34...",
  last_changed_by: "A",
  last_changed_tx_id: 34,
  vector_clock: {
    A: 2,
    C: 1
  }
}

When B connects to A, it will look at its own database to see what the
maximum transaction ID for each known writer is, using a query like this:

SELECT last_changed_by, max(last_changed_tx_id) FROM files
GROUP BY last_changed_by

Obviously this can be tracked in memory instead of doing a full table
scan each time, since it has an overhead of 20 bytes per known writer.

B will then send the following to A:

{
  type: "get_updates",
  since: {
    A: 30,
    B: 5,
    C: 12
  }
}

A will then find all records with a query something like this:

SELECT * FROM files
WHERE (last_changed_by = 'A' AND last_changed_tx_id > 30)
   OR (last_changed_by = 'B' AND last_changed_tx_id > 5)
   OR (last_changed_by = 'C' AND last_changed_tx_id > 12)

Tracking these numbers is low impact, with a fixed overhead per file (20
bytes).

A can then respond with all new records.
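
Sketched against SQLite in Python (the table and column names match the
examples above; message framing is omitted and the helper names are
illustrative):

import sqlite3

def build_since_map(db):
    # One (writer, max tx-id) pair per known writer; this becomes
    # the "since" field of the get_updates message.
    rows = db.execute(
        "SELECT last_changed_by, max(last_changed_tx_id) "
        "FROM files GROUP BY last_changed_by")
    return {peer: tx_id for peer, tx_id in rows}

def answer_get_updates(db, since):
    # Return every record written after the requester's high-water
    # mark for each writer it mentioned.
    clauses, params = [], []
    for peer, tx_id in since.items():
        clauses.append("(last_changed_by = ? AND last_changed_tx_id > ?)")
        params.extend([peer, tx_id])
    query = "SELECT * FROM files WHERE " + " OR ".join(clauses)
    return db.execute(query, params).fetchall()

B sends the result of build_since_map() as the "since" field, and A
feeds it to answer_get_updates().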


Conflicts
---------

When the vector clocks detect a conflict, it is handled the same way
as previously proposed. Said another way, the method of handling
conflicts is independent of the method for detecting them.


File presence
-------------

Just because a file is mentioned in the tree doesn't mean that our
connected peers have a copy of it. We need a separate way to
communicate this with our peers. With the replication log this is
simple to do since for any given point in the history an identical tree
can be produced, meaning that a simple bitmask is sufficient.

For vector clocks a different method is necessary. Thanks to the
person who emailed me off-list and suggested that we use a bloom filter.

A bloom filter is a way to express which files are present with a fixed
message size. They are easy to calculate and if we use a counting bloom
filter it can be kept in memory and updated as files are added and
removed. (What is exchanged over the wire would be a compact,
non-counting derivative of the in-memory counting filter.)

Bloom filters can sometimes give false positives, but never false
negatives. If we get a false positive, we'll send a "get" message to
our peer, and the peer can respond that it doesn't have the file.

The input to the bloom filter would be the checksums of each file in the
tree. Since they are already hashes, we don't need to do any further
hashing, simplifying the bloom filter implementation.
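
A rough sketch of such a filter (Python; the parameters M and K are
illustrative, and a real implementation would size them from the
expected file count and target false-positive rate):

M = 1 << 20   # number of counters (illustrative)
K = 4         # probes per file (illustrative)

def probe_positions(sha256_hex):
    # The checksum is already a uniform hash, so instead of hashing
    # again we carve K probe positions out of its bytes.
    digest = bytes.fromhex(sha256_hex)
    return [int.from_bytes(digest[4*i:4*i + 4], 'big') % M
            for i in range(K)]

class CountingBloom:
    def __init__(self):
        self.counts = [0] * M

    def add(self, sha256_hex):
        for i in probe_positions(sha256_hex):
            self.counts[i] += 1

    def remove(self, sha256_hex):
        for i in probe_positions(sha256_hex):
            self.counts[i] -= 1

    def might_have(self, sha256_hex):
        # May report a false positive, never a false negative.
        return all(self.counts[i] for i in probe_positions(sha256_hex))

    def wire_form(self):
        # The compact, non-counting filter sent to peers: one bit
        # per counter.
        bits = bytearray(M // 8)
        for i, count in enumerate(self.counts):
            if count:
                bits[i // 8] |= 1 << (i % 8)
        return bytes(bits)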


Analysis
--------

Both this proposal and the replication log reasonably handle my
concern for mobile devices -- which is that they should be able to
periodically check in and see if anything is new in a share without a
lengthy or costly manifest exchange.

Steven

Steven Jewel

Feb 15, 2014, 12:00:24 PM
to clearsk...@googlegroups.com
On 02/14/2014 10:00 PM, Steven Jewel wrote:
> Analysis
> --------
>
> Both this and the replication log proposals reasonably handle my
> concerns for mobile devices -- which is that they should be able to
> periodically check in and see if anything is new in a share without
> having a lengthly or costly manifest exchange.

For completeness, here is the full complexity analysis. Skip to
'Conclusion' if you're not interested in such things. Definitions:

n = number of current files
d = number of deleted files
p = average number of read/write peers involved with each file
e = total number of peers ever participating
c = number of new changes
t = total number of changes

Storage requirements
--------------------

replog: O(t)
vclock: O((n+d)*p)

t is going to be bigger than n * p, since it contains all the
information in n * p, and additionally contains information that is no
longer relevant.

vclock is the winner here, although when doing file exchange it
shouldn't matter because the files themselves will take a lot more disk
space than the archived replog.


Hot data
--------

replog: O(n+c)
vclock: O((n+d)*p)

Hot data is the data necessary for normal operation. Most of the replog
is archived and rarely accessed.

replog wins here, as it doesn't need to store information for deleted
files, and also doesn't have a vector clock stored for each file.


Manifest exchange
-----------------

replog: O(c)
vclock: O(c+e)

These are comparable in both cases, since 'e' is not likely to be large.

I think vclock wins here because the algorithm for discovering which
changes are in 'c' is very simple.


Causality violation checking
----------------------------

replog: O(c)
vclock: O(c*p)

Once again these are comparable, and 'c' is almost always going to be
small, unless a peer connects very rarely. 'p' is likely to be small as
well.

I think vclock wins since the semantics of checking for causality
violations is well understood.


File presence exchange
----------------------

replog: bitmask O(n)
vclock: bloom filter O(n+d)

A compressed bitmask is going to perform very well in both the empty
and full cases, but defining the semantics precisely, so that the tree
can be represented deterministically, is something I haven't fully
explored, and it may turn out to have unsolvable issues.

On the other hand, the bloom filter can be used in either case and can
be easily tuned to perform well in a variety of situations. I think the
bloom filter wins out here.


Conclusion
----------

Sorry for so many lengthy posts this week. I was undecided about the
best path to take and wanted to explore the options publicly.

My personal opinion is that vector clocks, as specified, are the way
to go. They work very well for the mobile case, which was my original
motivation for exploring this.

I will resume my work in the protocol_cleanup branch with vector clocks,
fully specifying the behavior of the bloom filter, etc. I feel like
this is a major improvement over the old "utime" system as far as the
mobile case goes.

Speak up if you feel I'm making a mistake, as I sometimes miss the obvious.

Steven

[1]: Not all writers actually write to all files, so the per-file vclock
size will vary considerably.

Pedro Larroy

Feb 15, 2014, 7:12:48 PM
to clearsk...@googlegroups.com

Hi Steven

Thanks for this proposal. I think it will work; I prefer it to the replog, starting with its storage complexity. I would rather scale with the number of peers than with the number of changes. I will implement it that way in the C++ client unless we find a flaw.

I don't get where the n*n number comes from though.

Should we update the protocol also with the new message format?

Pedro.


Steven Jewel

Feb 15, 2014, 9:55:02 PM
to clearsk...@googlegroups.com
On 02/15/2014 05:12 PM, Pedro Larroy wrote:
> I don't get where the n*n number comes from though.

There are multiple ways we've talked about doing the manifest exchange.

Method #1: Send the entire manifest each connection (current spec).

Method #2: I called this the "database revision number" in emails
earlier this week. Keep a private tx-id on each record, and a public
tx-id that is shared. Every time a record is updated, regardless of
where the record originally came from, this private field is updated.
Then, on connection with a peer, you ask that peer what it has new since
the last time you talked to this peer.

Method #3: The most recent proposal for manifest exchange, which is
where each record has a public "last_changed_by" and "last_changed_tx_id".

The comment about n*n was in reference to method #2. A single change
made will eventually result in n*n messages, where n is the number of
writers in the share.


> Should we update the protocol also with the new message format?

Yes, I will do that as soon as I can. In the meanwhile we can consider
what's in the emails as a working draft, as I believe any changes will
be minor and cosmetic.

Steven

Pedro Larroy

Feb 16, 2014, 5:40:40 AM
to clearsk...@googlegroups.com, clear...@stevenjewel.com
Still, it's not clear where the n*n figure comes from. If all clients are connected at a point in time, the peer that updated the file sends n update messages to all connected clients and that's it, or did I miss something?

I know what a bloom filter is, but it's not clear to me where exactly you propose to use it; maybe you can clarify this. Anyway, it's just an optimization.

About the tx_id, which we previously called "revision": as I understand it, each peer has a global revision number which is increased by one for every change, and this same revision is stored in the file's entry in the manifest. These revision numbers are local to each peer, correct? Or has something else changed?

Pedro.

Steven Jewel

Feb 16, 2014, 10:50:57 AM
to Pedro Larroy, clearsk...@googlegroups.com
On 02/16/2014 03:40 AM, Pedro Larroy wrote:
> Still, not clear where the n*n figure is coming. If all clients are
> connected at a point in time, the peer that updated the file sends n
> update messages to all connected clients and that's it, or did I miss
> something?

The problem existed because after that update was sent to each peer,
that peer would update its database. Since the topology of the network
isn't known, that peer would happily replicate the change to each of its
peers. For example with peers A, B, C, D.

* A is connected to B and C.

* B is connected to A and C.

* C is connected to A, B, and D.

* D is only connected to C.

Here is how the sequence of events would happen.

1. The change happens on A.

2. A replicates the change to B and C.

3a. B replicates the change to C (but not back to A).

3b. Simultaneously, C replicates the change to B and D.


In a more fully-connected example, such as having ten peers which are
all connected to each other, the change would be replicated ten times
from A, and then each of the other peers would replicate it again.

Now, let's look at an example of a share with 100 peers, where each is
connected to ten others at random. Over time, peers are disconnected
from one and connected to another. This is where n*n behavior matters.
Quoting myself from previous email:

>> Method #2: I called this the "database revision number" in emails
>> earlier this week. Keep a private tx-id on each record, and a public
>> tx-id that is shared. Every time a record is updated, regardless of
>> where the record originally came from, this private field is updated.
>> Then, on connection with a peer, you ask that peer what it has new
>> since the last time you talked to this peer.
>>
>> Method #3: The most recent proposal for manifest exchange, which is
>> where each record has a public "last_changed_by" and
>> "last_changed_tx_id".

When looking at a share with lots of writers, method #2 has O(n*n)
behavior. Method #3 is something closer to O(n), although it will still
be O(n*n) in a small, fully-connected system [1].


> I know what is a bloom filter, but it's not clear to me where exactly
> you propose it to be used, maybe you can clarify this. Anyway it's just
> an optimization.

The bloom filter is used to tell other peers which files are actually
present on a particular peer. Just because something is present in the
manifest doesn't mean that the file contents have been downloaded to
the peer yet. You're right that the bloom filter is an optimization,
but I think it's a necessary one to have in the core protocol. If we
don't have bloom filters, we'd need to send something like this upon
first connection:

{
  type: "files_present",
  sha256s: [
    "633de81f4d9651bd9d333287f3340ff786970174b5e86dd3c9df8f9c736c61af",
    "b700a2f42728a69a2ac8107a1ee0401d7267a8ee8a36385d40c5bb2d5d933c4b",
    "bed0558c70f6759dc3d79d9a533d2f2b71ea9ee9c652de645bc82a847c856b90"
  ]
}


> About the tx_id, we called before "revision", as I understand each file
> has the global revision number which is increased by one for every
> change and then this same revision is stored in the file entry in the
> manifest, these revision numbers are local to each peer, correct? Or
> something else changed?

This is where things have changed. Before, the revision number was
private; now it's public. In addition, a public "last_changed_by"
field was added that contains the peer ID of the last peer to write to
that file. The revision numbers still increment on each peer
independently, but instead of incrementing with all writes, they only
increment with writes that originate locally. My previous email, with
the example SQL queries, explains this in more detail.

Changing this behavior is what reduces n*n to something much smaller,
for the case of having lots of writers.
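
A sketch of the rule, reusing the record shape from my earlier email
(Python, names illustrative):

MY_ID = "B"        # this peer's ID, abbreviated as in the examples
my_tx_id = 0       # monotonic counter, persisted across restarts

def record_local_change(record):
    # A write originating here: bump our own counter and take
    # authorship, so peers can later ask "what's new from B since N".
    global my_tx_id
    my_tx_id += 1
    record["last_changed_by"] = MY_ID
    record["last_changed_tx_id"] = my_tx_id
    clock = record["vector_clock"]
    clock[MY_ID] = clock.get(MY_ID, 0) + 1

def apply_remote_update(record, incoming):
    # A write originating elsewhere: store it verbatim.  (A real
    # implementation would first compare vector clocks to detect
    # conflicts.)  Keeping the original author's last_changed_*
    # values is what prevents every hop from looking like a fresh
    # change -- the n*n behavior.
    record.update(incoming)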

Steven

[1]: I think having O(n*n) behavior in the case of only having a few
writers is probably acceptable.

Pedro Larroy

Feb 17, 2014, 3:48:01 AM
to clearsk...@googlegroups.com, Pedro Larroy, clear...@stevenjewel.com
Thanks for the clarification; I wasn't aware that peers have to replicate updates. I guess this could be configurable in case one expects a fully connected network; then the number of messages per update is n - 1.
Wouldn't it be better to commit changes to the share only when the files are downloaded? Then you can guarantee that the manifest only contains committed files, and there's no need for a bloom filter.

Steven Jewel

Feb 17, 2014, 12:05:07 PM
to clearsk...@googlegroups.com, Pedro Larroy
On 02/17/2014 01:48 AM, Pedro Larroy wrote:
> Thanks for the clarification, I wasn't aware that peers have to
> replicate updates. I guess this can be configurable in case one expects
> a fully connected network, then the number of messages on update is n - 1.

I think it's going to be OK to always replicate. The complexity of
replication, using the new method, depends on the degree of connectivity.

t = total number of peers
n = average number of other peers connected per peer

On a large network, such as with 100 peers, let's say that the software
limits itself to 10 connections (chosen randomly). That means the
number of messages that are sent is only going to be 2*t or close to it.

On a small network, such as with only 10 peers, where all are fully
connected, then it will be t*t messages, but we're only talking about
100 messages, so that also seems acceptable.

Once again, since the size of these update messages is much smaller than
the binary files that will have to be transferred, I don't think it'll be
too big of a concern.

I'd be worried about having a configuration option for this unless the
default is to replicate, since in the real world connectivity is messy.

> Wouldn't it be better to commit changes to the share only when files are
> downloaded? Then you can guarantee that the manifest only contains
> commited files. Then there's no need for bloom filter.

I've considered this, but it has some downsides. It forces us to use
method #2. It also makes subtree copy, partial copy, and streaming more
complex. Even excluding those cases, it means we either have to keep a
local copy of which files are present on each remote peer, or we have to
let the manifest exchange become more repetitive.

Let's say we have a share with peers A and B which are sporadically
connected. We'll be using method #2, so let's say the revision number
is 1 on all of them.

100 files are added on A. A is now at 100.

When B connects to A, it asks for everything since A-1. A sends the
whole list.

B downloads some of the files in random order, and is then disconnected.
It can't update the last-known ID for A yet, since it hasn't gotten
the whole set. B is now at 10.

When B connects to A later on, it asks for everything since A-1. A
sends the whole list again. B also asks A for everything since B-1, and
B sends its list. A is able to update its last-known ID for B to 10.

This is still an improvement on method #1, which would be repeating the
entire manifest on each connection.

One big advantage is that it won't be possible for files to be stuck at
orphaned versions. (Orphans are created when a file is added on a peer
but then that peer is taken offline before the file is copied to other
peers.)

There is one further complication, and that is that you also need file
presence information for read-only peers. Read-only peers don't write
to the tree, but it's safe to download the file contents from them (due
to the SHA256 checksum). This is a case where the bloom filter would
work really well.

I will continue to explore options as I update the spec in the
protocol_cleanup branch. I'd like to keep the core as simple as
possible, so maybe the core can use method #1, and method #3 can be
moved to an optional extension. Likewise with bloom filters.

Steven

Pedro Larroy

Feb 17, 2014, 4:57:15 PM
to clearsk...@googlegroups.com, Pedro Larroy, clear...@stevenjewel.com
I'm more inclined towards method #2 with manifest with transactions.

Pedro.

Pedro Larroy

Feb 18, 2014, 5:22:17 AM
to clearsk...@googlegroups.com, Pedro Larroy, clear...@stevenjewel.com
Hi Steven

I saw your latest changes on github. I was thinking yesterday: why don't we make the share ID and the PSKs all the same length, namely 32 bytes?
Have you already decided to use SRP? Maybe you can share something on this.

I would also like us to agree on going with method #2 with atomic changes to the manifest, i.e. only files that are present should go in the manifest, so the scanning logic is more self-contained. I have included these fields in the DB and I will follow your proposal to get the latest changes per peer. I think it's a good approach.

Regards.

Pedro.

Steven Jewel

Feb 18, 2014, 11:58:06 AM
to Pedro Larroy, clearsk...@googlegroups.com
On 02/18/2014 03:22 AM, Pedro Larroy wrote:
> I saw your latest changes in github, I was thinking yesterday, why don't
> we make the share id and the PSKs all the same length? namely 32 bytes?
> Have you already decided to use SRP? maybe you can share something on this.

Yes, we'll definitely switch to SRP. The keys will just be called keys
instead of PSKs.

I'm also moving the handshake to happen after the TLS session is
established, for security reasons. That means we need to pack enough
information into the SRP "username" to be able to identify both the
share and the access level. I've decided to do this by replacing the
share ID with an ID for each key.

I meant to change the lengths of the keys and IDs to all be the same.
Should be fixed now.

> It would also like to have an agreement to go with method #2 with atomic
> changes to the manifest, ie only files present should go in the
> manifest, so scanning logic is more self contained, I have included
> these fields in the DB and I will follow your proposal to get latest
> changes per peer. I think it's a good approach.

I'm uncomfortable with method #2 because it doesn't handle partial
copies or read-only peers. Specifically, read-only peers need a way to
communicate to each other and to read-write peers which files they
currently have downloaded.

Situations where there are a few writers and hundreds of readers seem
like they will be common enough that they should be handled efficiently
by the core protocol.

A bloom filter handles the read-only peer situation very effectively.


In any case, I'm not sure we need to come to a complete agreement yet.
I'd like to fully write out the spec using method #3 + bloom filters to
see if I can simplify it. I don't see any harm if you write the
software using method #2 in the meanwhile. Method #2 and method #3 are
very similar, and switching from one to the other in the spec or the
software shouldn't be a hard change.

Since you're the one doing the work, feel free to do it the way you feel
is best. I will change the spec to conform with the implementation or
submit pull requests to the implementation to have it conform to the
spec, as most appropriate.

Steven

Pedro Larroy

Feb 18, 2014, 12:17:12 PM
to Steven Jewel, clearsk...@googlegroups.com

I think I meant #3. I'm now a little confused, as we are mixing quite a few things.

In the end it would not be possible without your spec. It's quite complex for a single person to do both the spec and the implementation alone.

If you have some time on the weekend maybe we can make a call to talk about the different approaches. Otherwise I will implement the 'last changed by' idea. There's still lots of work to do, including SSL and everything. But if we get it to work I think it will be a nice product.

Pedro.

Steven Jewel

Feb 18, 2014, 12:40:33 PM
to clearsk...@googlegroups.com
On 02/18/2014 10:17 AM, Pedro Larroy wrote:
> I think I meant #3. I'm now a little confused as we are mixing quite
> some things.

Ah, OK. I probably should have come up with unambiguous names for each
method.

> In the end it would not be possible without your spec. It's quite
> complex for a single person to do both spec and implementation alone.

I am glad that you suggested vector clocks long ago, because that is
what started me down the path of improving things. I think once we're
through this rewrite we'll have a protocol that's both simpler and more
efficient.

> If you have some time on the weekend maybe we can make a call to talk
> about the different approaches. Otherwise I will implement with the
> 'last changed by' idea.

I should have some time Saturday. We can schedule it better off-list.

The last-changed-by approach is the one I think is best. I didn't
realize that you were proposing to do last-changed-by in combination
with only updating the local database once the file is present locally,
which I think is where the confusion came from. Sorry about that.

It sounds like we've both got the same vision of how it will work, or at
least close enough that the differences will be easy to resolve. I will
keep working on the spec.

> Still lots of work to do including SSL and everything. But if we get
> it to work I think it will be a nice product.

I agree. I think once we have a released binary we can promote it
heavily in the relevant circles and get a lot of adoption.

Steven

zulla...@gmail.com

Mar 5, 2014, 4:33:05 AM
to clearsk...@googlegroups.com, clear...@stevenjewel.com
Hi,

I'd like to point you to

https://github.com/ricardobcl/Dotted-Version-Vectors

which claims to use a bit less space than traditional vector clocks (although I haven't checked yet if it applies to your scenario).

Steven Jewel

Mar 5, 2014, 12:08:41 PM
to clearsk...@googlegroups.com
I've actually written the spec using Interval Tree Clocks. This can be
found in the protocol_cleanup branch, in the database.md file.

I may have made some errors in how I'm suggesting we use them, so I'd
welcome corrections from anyone with more experience.

Steven

Pedro Larroy

Mar 5, 2014, 6:37:06 PM
to clearsk...@googlegroups.com, clear...@stevenjewel.com
Hi Steven

I had a look. I committed some changes on my fork, refining the format of the message and specifying that there can be other encodings apart from JSON.

I'm a bit afraid that we are maybe moving a bit too fast. I still don't understand the interval tree clocks; that's fine. But if we now want to generalize to a generic key/value store before we've been able to finish the C++ client... aren't we overreaching?

There are also things about IM, etc. I'm not saying they are all good ideas; actually, something like this is already implemented in retroshare. Don't we want to start simple? I would like to have a basic client working, and there's still a lot of work to do: a working implementation, SSL, etc...

I would prefer the spec not to move too fast; it's impossible to implement something if the spec keeps changing so much at the fundamental level.

I hope I'm able to communicate these concerns. On the other hand, I think the work you're doing on the spec is great; maybe we should fork a protocol v1 draft and a more generalized kv-store v2, for example. For sure we could reuse a lot of components from the C++ client for a v2, even for an eventually consistent kv store. But I think the goal here is to synchronize files ATM.

Pedro.

Steven Jewel

Mar 6, 2014, 12:02:14 AM
to clearsk...@googlegroups.com
On 03/05/2014 04:37 PM, Pedro Larroy wrote:
> I'm a bit afraid that we are maybe moving a bit too fast. I still don't
> understand the interval tree clocks, that's fine. But if now we want to
> generalize to a generic key / value store before we were able to finish
> the C++ client... aren't we overreaching?

> There's also things about IM etc. I'm not saying they are all good
> ideas, actually something like this is implemented already in
> retroshare. Don't we want to start simple? I would like to have a basic
> client working and there's a lot of work to do still, get a working
> implementation, SSL, etc...
>
> I would prefer the spec not to move too fast, like this is impossible to
> implement something if the spec keeps changing at the fundamental level
> so much.

My goal with creating the separate "database" and "directory" extensions
was just to create a better separation of concerns. I'm hoping that
this makes the protocol easier to understand, since each piece can be
tackled separately.

I don't have any intention of implementing an IM client yet.

The total word count went from 6333 for core to 7688 for
core+database+directory.

> I hope I'm able to communicate these concerns. On the other hand I think
> it's great the work you do with the spec, maybe we should fork protocol
> v1 draft and a more generalized kv store v2 for example. For sure we
> could reutilize a lot of components from the C++ client for a v2, even
> for a eventually consistent kv store. But I think the goal here is to
> synchronize files ATM.

Thanks for speaking up. The things that we'd definitely want to
implement from the "protocol_cleanup" branch are:

* the switch to TLS-SRP mode
* associated access key changes
* associated handshake changes
* the new tracker protocol
* the UUIDs instead of paths as a unique identifier [1]
* the last_update_by and last_updated_clock [2]
* the rename from "feature" to "extension" and the associated section
about reintegration.

I'd be OK with not implementing ITCs until later, using utime instead in
the interim.

I believe at that point all that would be left is minor cosmetic
differences between the two specs. It seems like a big change, but it's
mostly just changes to the "type" field or to some attributes.

Steven

[1]: This handles renames much better, as well as detecting conflicts
with newly-created files that have the same path.

[2]: This was called last_updated_tx_id in previous emails on the
mailing list.

Steven Jewel

Mar 6, 2014, 12:38:09 AM
to clearsk...@googlegroups.com
On 03/05/2014 10:02 PM, Steven Jewel wrote:
> I'd be OK with not implementing ITCs until later, using utime instead in
> the interim.

Actually, nevermind. We need ITCs (or vector clocks) or we can't detect
file conflicts. With utime all that is possible is "newest file wins".

Steven

Steven Jewel

Mar 6, 2014, 12:57:47 PM
to clearsk...@googlegroups.com
On 03/05/2014 04:37 PM, Pedro Larroy wrote:
> I hope I'm able to communicate these concerns. On the other hand I think
> it's great the work you do with the spec, maybe we should fork protocol
> v1 draft and a more generalized kv store v2 for example. For sure we
> could reutilize a lot of components from the C++ client for a v2, even
> for a eventually consistent kv store. But I think the goal here is to
> synchronize files ATM.

Another thought here is that the protocol spec isn't sacrosanct;
until there are actually multiple implementations being made, it's
unquestionably in draft status, and so I'm willing to change it as
necessary.

Another way it would get out of draft status would be once the
implementation starts making non-alpha releases.

Let's work towards getting file sync working, punting on any part of the
spec that adds complexity (like the bloom filters, ITCs, and file
conflicts) and then come back and fix it later.

I think a good way to look at the new spec is that I renamed "manifests"
to "databases". The actual messages exchanged are very similar either way.

Steven


Pedro Larroy

Mar 6, 2014, 2:18:59 PM
to clearsk...@googlegroups.com
I agree, I don't think last write wins is acceptable for us. I wouldn't want that as a user unless explicitly configured.

Will try to find time to read the paper about ITC.

Pedro.





--
Pedro Larroy Tovar   |    http://pedro.larroy.com/

Pedro Larroy

Mar 16, 2014, 9:05:37 AM
to clearsk...@googlegroups.com, clear...@stevenjewel.com
Hi

Can we clarify the initial key negotiation? (What used to be starttls.) To me it's very unclear in core.md.

Are we sending the private keys in plaintext? As I understood it, initially we were going to use Diffie-Hellman to establish the connection without sending the keys in plain text.

If we send the access code in plain text initially, aren't we opening the door for an eavesdropper to hijack the code and gain access to the share?

Please we have to clarify these things.

I will go on implementing file synchronization without the crypto layer for this reason.

Pedro.

Pedro Larroy

Mar 16, 2014, 12:50:11 PM
to clearsk...@googlegroups.com, clear...@stevenjewel.com
I'm thinking about how to implement the "get_updates" functionality and I see some problems. (Refresher: https://github.com/larroy/clearskies/blob/master/protocol/core.md#updates)



We have client A ("CA") and client B ("CB").

If a file in CA changes after CB requested "updates_since(vclock v)", we can't just send the updated file with the new vector clock, since this will break the next request for "updates_since". If we were to send the updated file "F" with the new vector clock, we could miss another file "F*" that has a vector clock v < V(F*) < V(F), since the client only got the bulk of updates prior to v.

The conclusion is that we can't send "F" with its updated vector clock anymore.

I thought about possible solutions, and it quickly gets complicated. For example, we could send a message that aborts the current request for updates so there can be a new one. This has the problem that there can be an "avalanche effect" where the client can never catch up with any version, because updates are happening too fast.

Another solution which seems simpler to implement is:

1. CB asks CA "next_update_since(vclock v)", which causes CA to check its database; if an updated file is found, it will open the file (which might have changed since the last scan).

Now CA can open the file, update its meta information, and send it. There are more open questions. The file can be open in the client at the same time and keep changing while we are trying to send it. We can either check whether the file is open and go on to the next one if that's the case, or copy it to a tempdir, if it's not open by any other process, so we can send CB a stable version of this file.


CB calls next_update_since repeatedly until there are no more updates.

Any better ideas on how to attack this?




Pedro.

Steven Jewel

Mar 17, 2014, 12:50:13 PM
to clearsk...@googlegroups.com
On 03/16/2014 07:05 AM, Pedro Larroy wrote:
>
> Can we clarify the initial keys negotiation? (What used to be starttls).
> To me it's very unclear in core.md
>
> Are we sending the private keys on plaintext? As I understood initially
> we were going to use Diffie Hellman to establish the initial connection
> without sending the keys in plain text.

We don't need DHE mode anymore, since SRP mode has perfect forward
secrecy built-in.

I imagine the confusion comes partially because you're familiar with the
old way, which took a completely different approach. The SRP mode is
much simpler and I believe it is more robust.

An access code is made of two parts internally. One is a public ID and
the other is a secret.

When we open a connection, we immediately start TLS. We use the SRP
"username" field to identify which key we want to use, by putting the
public portion of the access code in that field. We then use the secret
portion as the SRP "password".

Once the connection is established successfully, we do a key exchange so
that the clients don't need to use the usually time-limited access code
to connect in the future. The key exchange includes a new, separate
public ID and a new password.
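
The split might look something like this sketch; the 32-byte length and
base32 encoding here are assumptions for illustration, not something
the spec mandates:

import base64

def split_access_code(code):
    # Hypothetical layout: a base32-encoded 32-byte code, of which
    # the first half is the public ID (the SRP "username") and the
    # second half is the secret (the SRP "password").
    raw = base64.b32decode(code)
    public_id, secret = raw[:16], raw[16:]
    return public_id.hex(), secret.hex()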

>
> If we send the access code in plain text initially, aren't we opening
> the door for an eavesdropper to hijack the code and gain access to the
> share?

The full access code is never sent in plain text.

Steven

Steven Jewel

Mar 17, 2014, 2:24:19 PM
to clearsk...@googlegroups.com
On 03/16/2014 10:50 AM, Pedro Larroy wrote:
> I'm thinking on how to implement the "get_updates" functionality and I
> see some problems.
> (refresher: https://github.com/larroy/clearskies/blob/master/protocol/core.md#updates)
>
>
>
> We have client a "CA" and client b "CB"
>
> If a file in CA changes after CB requested "updates_since(vclock v)" we
> can't just send the updated file with the new vector clock, since this
> will break the next request for "updates_since". If were to send the
> update file "F" with the new vector clock, we could miss another file
> "F*" that has a vector clock v < V(F*) < V(F) since the client got a
> bulk of updates previous to v.
>
> The conclusion is that we can't send "F" with its updated vector clock
> anymore.
>
> I thought about possible solutions, and it quickly gets complicated, for
> example if we send a message to abort the current request for updates,
> so there can be a new one. This has the problem that there can be an
> "avalanche effect" where the client can't catch up with any version,
> since updates are happening too fast.

This is why the protocol has two separate mechanisms. The vector clocks
are used for causality violation detection only, and aren't used to
determine which database values are missing.

The "database.get_updates" message works with the "last_update_by" and
"last_updated_clock" fields and ignores the vclock value. Confusingly,
which "last_updated_clock" is called a clock, it's a logical clock and
doesn't have

When a client writes a change that originated on another node, it
updates the associated vector clock but it must *not* change the
"last_updated_by" or "last_updated_clock" values. That stops the
avalanche effect.

What further confuses things is that the "database.get_updates" message
has what appears on the surface to be a vector clock for the "since"
attribute, but it is not a vector clock.


>
> Another solution which seems simpler to implement is:
>
> 1. CB asks CA "next_update_since(vclock v)", which cause CA to check in
> the database, if an updated file is found, it will open the file (which
> might now have changed since last scan).
>
> Now CA can open the file, update the meta information about it, and send
> it. There are more open questions. The file can be opened in the client
> at the same time and keep changing while we are trying to send it. We
> can either check if the file is open and go to the next if that's the
> case, or copy it to a tempdir if it's not open by any other proces so we
> can send to CB a stable version of this file.
>

I believe that the protocol will work, but obviously I'm just running it
in my head and might be missing something.

If it will help, I should be able to implement the new protocol in the
proof-of-concept implementation pretty quickly, which might clarify
things and help shake out any issues.

It should be possible to implement the protocol, skipping the vector
clocks, and have working sync. (The vector clocks would still need to
be added later, but only to detect conflicts.)

Steven

Pedro Larroy

Mar 17, 2014, 7:43:38 PM
to clearsk...@googlegroups.com
I didn't get what you are saying. I thought from your proposal that this uses the vclock to get the latest changes.

SELECT * FROM files
WHERE last_changed_by = 'A' AND last_changed_rev > 30
   OR last_changed_by = 'B' AND last_changed_rev > 5
   OR last_changed_by = 'C' AND last_changed_rev > 12






Steven Jewel

Mar 18, 2014, 9:54:50 AM
to clearsk...@googlegroups.com
On 03/17/2014 05:43 PM, Pedro Larroy wrote:
> I didn't get what you are saying. I thought from your proposal that this
> is using the vclock to get the latest changes.
>
> SELECT * FROM files
> WHERE last_changed_by = 'A' AND last_changed_rev > 30
> OR last_changed_by = 'B' AND last_changed_rev > 5
> OR last_changed_by = 'C' AND last_changed_rev > 12

The numbers don't come from a vclock. They are gathered by looking
for the maximum value of last_changed_rev for each peer:

SELECT last_changed_by, max(last_changed_rev) FROM files
GROUP BY last_changed_by

Steven

Pedro Larroy

Mar 18, 2014, 10:43:37 AM
to clearsk...@googlegroups.com
OK, I understand. Then this works as long as the manifest is not updated while it's being sent to the peer. I'm currently working on that: I make a copy of the files table in a temporary table so it can be chunked and sent to the peer. I call it a "FrozenManifest".

Pedro.



Steven Jewel

Mar 18, 2014, 10:54:40 AM
to clearsk...@googlegroups.com
On 03/18/2014 08:43 AM, Pedro Larroy wrote:
> Ok I understand. Then this works as long as the manifest is not updated
> while it's being sent to the peer. I'm currently working on that. I make
> a copy of the files table in a temporary table so it can be chunked and
> sent to the peer. I called it "FrozenManifest"

Won't SQLite give a consistent view of the table as long as you read the
result set as you need it?

I forgot to mention, but you'd also want to sort the outgoing records by
last_changed_rev [1]. That way the records can be written to the
database as they are received, without any chance of records being lost.

Steven

Pedro Larroy

Mar 18, 2014, 12:23:09 PM
to clearsk...@googlegroups.com
I don't think so. If you have a cursor open and you change data in the meantime, you get changed results. There's not much documentation about this online, but I made some experiments and that's the case.

Can you elaborate on records being lost?





Steven Jewel

Mar 18, 2014, 2:04:57 PM
to clearsk...@googlegroups.com
On 03/18/2014 10:23 AM, Pedro Larroy wrote:
> I don't think so. If you have a cursor open and you change data in the
> meanwhile, you get changed results. There's not much documentation about
> this online but I made some experiments and that's the case.

It looks like if you'd like isolation, you have to open a separate
connection to the database. Obviously whatever approach you'd like to
use is fine.

http://sqlite.org/isolation.html

> Can you elaborate on records being lost?

Say peer A writes to three different records. They would get values of
1, 2, 3 in the "last_changed_rev" field, and "A" in "last_changed_by"
field. It's *essential* that there is no way for someone to receive
them out of order.

Let's say that the records are sent to peer B in reverse order, 3, 2, 1.
However, before 1 crosses the wire, the connection is broken or the
computer running peer B is rebooted.

The next time they connect, B will ask A for all records for A newer
than record 3. Since 1 was never written, it won't ever be synced.

This problem is solved by always sorting by last_changed_rev when
sending a series of changes. As long as this is always done, there is
no opportunity [1] for records to be lost, even when many peers are
involved.
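
A sketch of the sending side (Python against SQLite as before;
conn.send() stands in for whatever the real transport is):

def send_updates_from(db, conn, author, since_rev):
    # Ascending order is what makes a broken connection safe: the
    # receiver writes each record as it arrives, so its stored
    # maximum for this author is always a gapless high-water mark.
    rows = db.execute(
        "SELECT * FROM files "
        "WHERE last_changed_by = ? AND last_changed_rev > ? "
        "ORDER BY last_changed_rev",
        (author, since_rev))
    for row in rows:
        conn.send(row)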

Steven

[1]: It now occurs to me that a malicious read-only peer could
selectively omit records to cause inconsistencies, since the updates are
signed individually. I will add a FIXME into the spec for this and will
come up with a way to avoid it.

Pedro Larroy

Mar 18, 2014, 3:04:22 PM
to clearsk...@googlegroups.com
It has some problems: if you open another connection, your queries might fail when the db is locked, as far as I understand. No good.


Good point about the sorting!

Pedro.



Pedro Larroy

Mar 18, 2014, 3:57:49 PM
to clearsk...@googlegroups.com
I think I found a hole in your approach.

Let's say we have clients A, B, C.

Let's call the "since" argument of get_updates the "update vector".

A file was changed by C:1, got replicated.

C disconnects.
A changes, A:2, replicated to B. State is A: A2, C: C1, B: A2.

B disconnects.

C connects.
Gets A2, changes to C5. There is no conflict.

State is A:C5, B:A2, C:C5


B connects, requests
SELECT * FROM files
    WHERE last_changed_by = 'A' AND last_changed_rev > 2

updates since {A:2}

And there's nothing.


Steven Jewel

Mar 18, 2014, 4:35:08 PM
to clearsk...@googlegroups.com
On 03/18/2014 01:57 PM, Pedro Larroy wrote:
> I think I found a hole in your approach.
>
> Let's say we have clients A, B ,C
>
> Let's call the argument to get_updates since, the "update vector"
>
> A file Was changed by C:1, got replicated.
>
> C disconnects.
> A changes, A:2, replicated to B. State is A: A2 C:C1 B:A2
>
> B disconnects.
>
> C connects.
> Gets A2, changes to C5. There is no conflict.
>
> State is A:C5, B:A2, C:C5


That's good notation to help talk about potential issues with this
update mechanism.


>
>
> B connects, requests
> SELECT * FROM files
> WHERE last_changed_by = 'A' AND last_changed_rev > 2
>
> updates since {A:2}
>
> And there's nothing.

Good point. You'd need to additionally include any updates from
unmentioned peer IDs. If the message was updates_since({A:2,B:1}), the
associated query would be:

SELECT * FROM files
WHERE last_changed_by = 'A' AND last_changed_rev > 2
   OR last_changed_by = 'B' AND last_changed_rev > 1
   OR last_changed_by NOT IN ('A', 'B')
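
Generating that query from an arbitrary "since" map might look like
this (Python sketch, parameterized so peer IDs aren't interpolated
into the SQL):

def updates_since_query(since):
    # since = {"A": 2, "B": 1}.  Writers absent from the map fall
    # through to the trailing NOT IN, so changes authored by peers
    # the requester has never heard of (C above) are still returned.
    clauses, params = [], []
    for peer, rev in since.items():
        clauses.append("(last_changed_by = ? AND last_changed_rev > ?)")
        params.extend([peer, rev])
    placeholders = ", ".join("?" * len(since))
    clauses.append("last_changed_by NOT IN (%s)" % placeholders)
    params.extend(since.keys())
    return "SELECT * FROM files WHERE " + " OR ".join(clauses), params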

Steven

Pedro Larroy

Mar 18, 2014, 4:39:44 PM
to clearsk...@googlegroups.com
This brings its own set of additional problems. We need to store discovered peers, and transmit this information. There might be peers that can't reach other peers and even have no knowledge about them.

Things are getting complex!



Steven Jewel

Mar 18, 2014, 4:43:12 PM
to clearsk...@googlegroups.com
On 03/18/2014 02:39 PM, Pedro Larroy wrote:
> This brings its own set of additional problems. We need to store
> discovered peers, and transmit this information. There might be peers
> that can't reach other peers and even have no knowledge about them.
>
> Things are getting complex!
>

I think you've misread. There is no need to keep a full peer list. It
doesn't require any more information than is already in the
"updates_since" message.

Note that C isn't in the updates_since message, yet its records will be
selected:

> updates_since({A:2,B:1}):

Pedro Larroy

Mar 18, 2014, 4:45:13 PM
to clearsk...@googlegroups.com
You are right. 

Thanks.





Pedro Larroy

Mar 28, 2014, 8:23:34 AM
to clearsk...@googlegroups.com
I have been developing this idea more, and I extended the share to support what we discussed.


I will soon program some tests to synchronize different shares and test our ideas.

I was thinking that although it's not strictly needed to answer the get_updates request, it might be very useful for debugging and diagnostics if the share knows the last revision of every other peer it has synchronized with, so I will put them in another table.

I was trying to come up with a counterexample where the NOT IN clause is needed for the request:

SELECT * FROM files
WHERE last_changed_by = 'A' AND last_changed_rev > 2
   OR last_changed_by = 'B' AND last_changed_rev > 1
   OR last_changed_by NOT IN ('A', 'B')

But I didn't find any, so it should work without that.

One question that got me thinking: when you open a "FrozenManifest" to start chunking it and sending it to the client, if a file changed in between, we should handle this situation. For example, rsync will warn "File vanished".

The file might also change after the client got the manifest and it's doing gets for individual files. 

Do we have any suggestions on how best to handle these cases? I was thinking that in this case we should send some kind of "file changed" message, either with the new file metadata or something else. The case where the file changes after it has been opened is hairier, but we can do something about it, such as copying the file to temporary storage first so the client gets the previous version...

Any comments?


Pedro.




Steven Jewel

Mar 28, 2014, 11:04:42 AM
to clearsk...@googlegroups.com
On 03/28/2014 06:23 AM, Pedro Larroy wrote:
> One question that got me thinking is that when you open a "Frozen
> Manifest" to start chunking and send it to the client, if the file
> changed in between we should handle this situation. For example rsync
> will warn "File vanished".
>
> The file might also change after the client got the manifest and it's
> doing gets for individual files.
>
> Do we have any suggestions on how to handle these cases best? I was
> thinking that in this case we should send some kind of "File changed"
> message, either with new file metadata or something else, the case where
> the file is changed after it has been opened is more hairy, but we can
> do something about it, such as copy the file first to temporary store so
> the client has the previous version....
>
> Any comments?

This is why we sync the file metadata (i.e. the database) separately
from the file content. If we discover the file has changed, we'll just
add another "update" message to the tail of the outgoing queue, and
handle it just like we would a file change that happens when there are
no outgoing changes.

In testing the proof of concept, I found that having a small debounce
time of a second eliminated almost all cases of this happening. In
other words, if a file is being continuously but slowly written,
because it is being wget'd from another host, we delay adding it to the
database and propagating it until it stops being written.
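
The debounce itself can be tiny; a sketch matching that one-second
quiet period (Python, names illustrative):

import time

DEBOUNCE = 1.0   # seconds of quiet before a change is committed
pending = {}     # path -> time of the last observed write

def on_file_event(path):
    pending[path] = time.monotonic()

def flush_pending(commit):
    # Called periodically: only files untouched for DEBOUNCE seconds
    # get added to the database and propagated to peers.
    now = time.monotonic()
    for path, last in list(pending.items()):
        if now - last >= DEBOUNCE:
            del pending[path]
            commit(path)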

This is also why I switched to using the SHA256 to request file contents
instead of the file path. That way, if a peer requests an old copy of
the file, we can respond that it no longer exists.

It's always going to be possible that a file is changed at the exact
moment it is being sent. I think this is going to be rare enough not
to try to handle it specially. It will just be rejected on the
remote end and the new SHA256 requested, so it doesn't do any harm.
(Note that the rsync extension makes this case bandwidth efficient,
since the file can be re-requested.)

Steven

Steven Jewel

Mar 28, 2014, 11:10:16 AM
to clearsk...@googlegroups.com
On 03/28/2014 06:23 AM, Pedro Larroy wrote:
> I will soon program some tests to synchronize different shares and test
> our ideas.

You might find the integration test I made for clearskies-ruby to be a
good starting point. It is a bash script that launches two copies of
the daemon in a tmux session, makes changes to their repositories, and
then checks whether they are synced properly. It was very handy for
eliminating update loops, where the scanner thought a remote change
being written locally was in fact a new local change.

It also has code to combine the logging from both peers into a single
log file. I found this was unbelievably helpful to track down timing
errors and race conditions [1].

https://github.com/jewel/clearskies-ruby/blob/master/test/read-write-integration

Steven

[1]: Even though clearskies-ruby only allows a single thread to run at a
time via a global lock, you end up with similar issues to threading when
there are two daemons communicating.

Pedro Larroy

Mar 28, 2014, 11:27:03 AM
to clearsk...@googlegroups.com
It's OK as long as you don't have huge files; otherwise it's a waste of resources to transfer them just for them to be thrown away because the hash doesn't match. I think we should do something smarter in these cases.

I will follow your switch to using the hash as the request token as you suggest.


Pedro.





Steven Jewel

Mar 28, 2014, 11:41:04 AM
to clearsk...@googlegroups.com
On 03/28/2014 09:27 AM, Pedro Larroy wrote:
> It's ok as long as you don't have huge files, then it's a waste of
> resources to transfer them just to be thrown away because the hash
> doesn't match. I think we should do something smarter in these case.

I think implementing the rsync extension is a good way to handle this
case, since it benefits many other situations as well. The librsync
library makes this pretty straightforward to do.

https://github.com/jewel/clearskies/blob/master/protocol/rsync.md

Note that "path" should probably be replaced by the database "uuid", so
that a file that is renamed and then changed can still be rsync'd.

Steven

Pedro Larroy

Mar 28, 2014, 3:25:08 PM
to clearsk...@googlegroups.com

Thanks, I will have a look at your tests.
