Wanted to point out something I saw on Reddit.


Chris Holland

Jan 14, 2016, 3:20:42 PM
to GoshawkDB
https://www.reddit.com/r/programming/comments/40uohr/tapir_a_new_opensource_highperformance/

Something similar; from glancing at it, they seem to compare themselves to Spanner more. Not sure if they've got a global mediator.

Matthew Sackman

Jan 15, 2016, 3:18:13 AM
to GoshawkDB
Thank you for the link. I'd spotted the discussion on HN, but I'd not
gone looking for it on Reddit.

My reply:

"I've not read the paper yet on TAPIR (job for today), but I've watched
the presentation at https://www.youtube.com/watch?v=yE3eMxYJDiE.

"There are differences between the two but there are some important key
similarities too. Basically, we've both had the realisation that there's
no need to impose a total global ordering on transactions. In both cases
that means a reduction in the number of network hops necessary versus
anything that has come before. Both GoshawkDB and TAPIR have been
developed independently - I had no idea they were working on this - so
the fact we've both made the same realisation is great validation.

"There are then some differences too: TAPIR uses 2PC and I need to
carefully read through the paper to figure out how they get around the
typical problems with 2PC, whereas GoshawkDB uses Paxos Synod in place
of 2PC. The use of Paxos Synod in GoshawkDB means resynchronisation is
achieved by "learners" in Paxos whereas TAPIR has a separate
resynchronisation protocol. Also, TAPIR uses loosely synchronized clocks
which are added to the transaction by the client in order to achieve
ordering. GoshawkDB uses Vector Clocks which are added during the voting
process to model dependencies between transactions and achieve
ordering."

In addition:

In the presentation Irene doesn't mention isolation levels, so I'll check
in the paper whether they actually go for strong serializability. One
issue with their use of 2PC, which is the same as in Spanner, is that
the client is the transaction coordinator. This is a neat idea - it
saves another network hop, but it means every client has to connect to
every server. There may be a cost here and it could mean you have to
restart services if your cluster size changes. There are implications
for running connections through load balancers too. Again, hopefully
there's more detail in the paper.

Anyway, all in all, it's great that this team at the University of Washington has had
a lot of the same ideas that I've had with GoshawkDB!

Matthew

Chris Holland

Jan 23, 2016, 1:57:31 PM
to GoshawkDB, mat...@goshawkdb.io
I watched the video. She seems to indicate the "client" is a proxy server for the user. So the connection issue would at least be relegated to your cluster. There were a couple of things from the video I'd question more. In the IR discussion they seem to say that if there are multiple answers from servers, the "client" (proxy client) re-submits with one of the values, which would be bad in the consistency sense from the original request. It's possible this language was just too high-level for what they were trying to convey; I'd expect that to abort back to the user. They also said something about timestamps, which would imply some kind of sync across the servers; they said there was a fallback/resubmit path if the timestamp was out of whack. Using the timestamps for anything meaningful across machines implies you have another problem of keeping the servers in sync. Maybe they mean the timestamps just have to be sane as a reference from the previous timestamp on the same node.

Matthew Sackman

Jan 24, 2016, 2:05:54 PM
to GoshawkDB
Hi Chris,

On Sat, Jan 23, 2016 at 10:57:31AM -0800, Chris Holland wrote:
> I watched the video. She seems to indicate the "client" is a proxy server
> for the user.

Yes, basically the same as Spanner: the assumption is the client is
essentially the webserver. So it's fairly safe to assume it has broad
connectivity and doesn't have massive sudden spikes of growth either.

> So the connection issue would at least be relegated to your
> cluster. There were a couple of things from the video I'd question more. In
> the IR discussion they seem to say that if there are multiple answers from
> servers, the "client" (proxy client) re-submits with one of the values,
> which would be bad in the consistency sense from the original request. It's
> possible this language was just too high-level for what they were trying to
> convey; I'd expect that to abort back to the user.

Nah, that's fine. I've read the paper too now and had some email contact
with Irene. GoshawkDB is the same: different nodes which have copies of
the same object can return different answers - for example if the
"question" is "I read x at version 3, is that ok?", one node with x could
say "yup, fine", whilst another could say "no, you're out of date, x is
now at version 4; here is version 4". This is inherent in a system where
you don't have a global total order. At some point you have to combine
all these answers and create one "final" answer. In this case, what
GoshawkDB would do is get back to the client and say "you need to
restart your client transaction but now with x at version 4". So that
reruns, and then the new outcome gets submitted again. GoshawkDB does
this combination in the acceptors, whereas TAPIR pushes it out to the
clients themselves (figure 9 in the paper). So there's much more logic
in the client, but there are also some performance benefits to doing
that.
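
To make that concrete, here's a rough sketch in Go of combining the
per-node answers. All the names are hypothetical - this isn't
GoshawkDB's actual API, just the shape of the idea:

    package main

    import "fmt"

    // Answer is one node's response to "I read x at version 3, is that ok?".
    type Answer struct {
        OK      bool
        Version int // the node's current version of x, if we're out of date
    }

    // combine folds the per-node answers into one "final" answer: if any
    // node reports a newer version, the client txn must rerun against it.
    func combine(readVersion int, answers []Answer) (ok bool, restartAt int) {
        ok, restartAt = true, readVersion
        for _, a := range answers {
            if !a.OK {
                ok = false
                if a.Version > restartAt {
                    restartAt = a.Version
                }
            }
        }
        return ok, restartAt
    }

    func main() {
        answers := []Answer{{OK: true, Version: 3}, {OK: false, Version: 4}}
        ok, v := combine(3, answers)
        fmt.Println(ok, v) // false 4: rerun the client txn with x at version 4
    }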

This is not really much different from any database: even with Postgres
you have the possibility of "concurrent modification exceptions" to
which the only sane response is to rerun the transaction. This is why
all transaction functions should always be side-effect free until after
the transaction has committed. In a lot of modern databases, the db
itself will automatically rerun the transaction function, whereas in
older designs that's often not done and it's up to the user to deal with
that sort of thing (and most likely also implement random binary
exponential backoff).
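
For example, such a retry loop might look like this in Go - a sketch
only; runTxn and ErrConcurrentModification are hypothetical stand-ins
for whatever the client library actually provides:

    package main

    import (
        "errors"
        "math/rand"
        "time"
    )

    // ErrConcurrentModification stands in for whatever error the db
    // reports on a transaction conflict.
    var ErrConcurrentModification = errors.New("concurrent modification")

    // runTxn reruns a side-effect-free transaction function until it
    // commits, with random binary exponential backoff between attempts.
    func runTxn(fn func() error) error {
        for attempt := 0; ; attempt++ {
            err := fn()
            if !errors.Is(err, ErrConcurrentModification) {
                return err // committed, or failed for a non-retryable reason
            }
            shift := attempt
            if shift > 16 {
                shift = 16 // cap the backoff window
            }
            // Sleep a random duration in [0, 2^shift) milliseconds.
            window := int64(time.Millisecond << uint(shift))
            time.Sleep(time.Duration(rand.Int63n(window)))
        }
    }

    func main() {
        calls := 0
        _ = runTxn(func() error {
            calls++
            if calls < 3 {
                return ErrConcurrentModification // simulate two conflicts
            }
            return nil // "commits" on the third attempt
        })
    }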

> They also said something
> about timestamps, which would imply some kind of sync across the servers;
> they said there was a fallback/resubmit path if the timestamp was out of
> whack. Using the timestamps for anything meaningful across machines implies
> you have another problem of keeping the servers in sync. Maybe they mean
> the timestamps just have to be sane as a reference from the previous
> timestamp on the same node.

So they use timestamps where GoshawkDB uses vector clocks. The GoshawkDB
approach is more complex, but strictly speaking would allow more
transactions to commit in more circumstances than with TAPIR. How much
that actually matters in practice I've no idea just yet. I find it very
odd that the benchmarks they did for TAPIR were against
non-transactional stores, and that they implemented a k-v store for
TAPIR - just seems odd to me to apparently not want to take advantage of
the transactional support.

I believe the basic idea is that a transaction can only be "prepared" by
an object if the txn clock is greater than the committed clock of the
object (this is figure 8 in the paper). This is almost exactly the same
logic as in GoshawkDB; it's just that in GoshawkDB, rather than there
being one clock for the whole transaction, there's one clock for each
object mentioned within the transaction, which just allows a slightly
more fine-grained modelling of dependencies between transactions. Also
in GoshawkDB, the clocks come from the object-copies instead of the
client, and they're entirely logical clocks and have nothing to do with
actual time.
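
As a sketch of how I read that prepare rule (hypothetical names, and a
simplification of both systems):

    package main

    import "fmt"

    // TAPIR-style: one timestamp for the whole txn; an object copy only
    // prepares the txn if that timestamp is beyond its committed clock.
    func prepareTAPIR(txnTimestamp, objCommitted int64) bool {
        return txnTimestamp > objCommitted
    }

    // GoshawkDB-style (roughly): one logical clock per object in the txn,
    // supplied by the object-copies rather than the client, so dependencies
    // are modelled per object instead of once for the whole txn.
    func prepareGoshawkDB(txnClocks, committed map[string]uint64) bool {
        for obj, c := range committed {
            if txnClocks[obj] <= c {
                return false // this object has moved on; can't prepare
            }
        }
        return true
    }

    func main() {
        fmt.Println(prepareTAPIR(12368, 12367)) // true
        fmt.Println(prepareGoshawkDB(
            map[string]uint64{"x": 5, "y": 9},
            map[string]uint64{"x": 4, "y": 9},
        )) // false: y hasn't advanced past its committed clock
    }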

In TAPIR, if the client clocks are badly out then you'll suffer
starvation: only the client with the biggest clock will be able to get
transactions committed, though I think they do also build in some
ability to resync: if you get your txn rejected because your clock is
too small, the rejection will contain the current max clock and
so you can bump yours forwards to at least that (which in some ways has
a fair amount in common with phase 1 of Paxos, which in itself is pretty
interesting).

Hopefully these ramblings make some sort of sense!

Matthew

Chris Holland

Jan 30, 2016, 12:22:29 AM
to GoshawkDB, mat...@goshawkdb.io
Thanks for the compare and contrast. Helps visualize things better.

As for the different versions of things, it seems like you're saying what I was suggesting would need to happen: if during the transaction an object is discovered to be too "old", the transaction is restarted. As long as "reads" are also considered part of the transaction, then everything is good. If reads aren't in the transaction then you end up getting write skew, like in MySQL or Postgres if you don't specify FOR SHARE/UPDATE or take an advisory lock; this is a side effect of snapshot isolation. It sounds like this issue doesn't come up in GoshawkDB; I'm not sure if TAPIR considers reads in their transactions as well.

The vector clock is more elegant in my opinion. It captures only what is really relevant to the data: its exact clock/version. The quorum vote for mutation should keep the object's clock stable. I suppose it's more complicated in that you have to validate all clocks for all objects in the transaction, although given fast replication the happy path should have everyone up to date as fast as the network will allow.

So if I understand your TAPIR description correctly, if a "client" has a faster clock and is writing a lot, no other "client" can write the same object, because the object's stamp will be ahead of other clients. Although it also sounds like this clock is somewhat relative to the cluster as a whole, because slower clients can somewhat sync to the max clock of all the objects they're trying to operate on. But again, that furious faster client can keep updating past the max. Granted, this would only happen in a pretty extreme situation.

Matthew Sackman

Jan 30, 2016, 3:01:45 AM
to GoshawkDB
On Fri, Jan 29, 2016 at 09:22:29PM -0800, Chris Holland wrote:
> As for the different versions of things, it seems like you're saying what I
> was suggesting would need to happen: if during the transaction an object is
> discovered to be too "old", the transaction is restarted. As long as "reads"
> are also considered part of the transaction, then everything is good. If
> reads aren't in the transaction then you end up getting write skew, like in
> MySQL or Postgres if you don't specify FOR SHARE/UPDATE or take an advisory
> lock; this is a side effect of snapshot isolation. It sounds like this issue
> doesn't come up in GoshawkDB; I'm not sure if TAPIR considers reads in their
> transactions as well.

Well quite. Having tried to write papers and get them published myself
several years ago, I know just how frustrating it is to have a page
limit. That said, it is a bit disappointing they don't really mention
the transaction stuff at all anywhere - for example, it could well be
that they're doing exactly the same as GoshawkDB and doing all the work
in the client, it's just not clear. However, they do explicitly claim
strong serializability (at least in the paper) in a few places, so
whatever they're doing, I'm sure they must be including reads too in
their transaction set.

> The vector clock is more elegant in my opinion. It captures only what is
> really relevant to the data: its exact clock/version. The quorum vote for
> mutation should keep the object's clock stable. I suppose it's more
> complicated in that you have to validate all clocks for all objects in the
> transaction, although given fast replication the happy path should have
> everyone up to date as fast as the network will allow.

Right, pretty much. The vector clock stuff sounds simple, but actually
the logic around how and when to manipulate it is anything but!
Eventually, reasoning about causality through numbers starts to make
some sort of sense, but it took me a good long while to figure it out -
there actually doesn't seem to be much literature on this sort of thing.
The downside to vector clocks is there's a greater serialization cost
because you're sending lots of numbers rather than just one. As soon as
your vector clock gets even modestly wide, that destroys performance.
By further reasoning about causality, I figured out a safe way to allow
the vector clocks to shrink back down, so this issue never occurs.
Again, there is basically no literature on this at all (though it sort
of ties in with some causality research that's going on around CRDTs).
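
For anyone unfamiliar, the core vector clock operations are tiny, even
if the reasoning around them isn't. A generic sketch in Go - not
GoshawkDB's implementation:

    package main

    import "fmt"

    // VC maps an id to a logical counter. The serialization cost grows
    // with the number of entries - the "width" problem mentioned above.
    type VC map[string]uint64

    // Merge takes the element-wise maximum of two clocks.
    func (a VC) Merge(b VC) VC {
        out := VC{}
        for k, v := range a {
            out[k] = v
        }
        for k, v := range b {
            if v > out[k] {
                out[k] = v
            }
        }
        return out
    }

    // HappenedBefore reports whether a is strictly dominated by b: every
    // entry of a is <= the matching entry of b, and at least one is <.
    func (a VC) HappenedBefore(b VC) bool {
        strict := false
        for k, v := range a {
            if v > b[k] {
                return false // incomparable: the two are concurrent
            }
            if v < b[k] {
                strict = true
            }
        }
        for k, v := range b {
            if _, ok := a[k]; !ok && v > 0 {
                strict = true // b has advanced on an id a hasn't seen
            }
        }
        return strict
    }

    func main() {
        a := VC{"x": 1, "y": 2}
        b := VC{"x": 1, "y": 3}
        fmt.Println(a.HappenedBefore(b)) // true
        fmt.Println(a.Merge(b))          // map[x:1 y:3]
    }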

All of which is to say that yes, the vector clock approach of GoshawkDB
should allow fewer retries, and is in some ways more elegant, but I
don't see any problem with the TAPIR approach at all - KISS, YAGNI,
STTCPW etc etc.

> So if I understand your TAPIR description correctly, if a "client" has a
> faster clock and is writing a lot, no other "client" can write the same
> object, because the object's stamp will be ahead of other clients.

Yes, I believe that is correct with all systems based on clocks. I
believe this is why for Spanner they have to invest in highly accurate
GPS clocks to minimise this issue.

> Although
> it also sounds like this clock is somewhat relative to the cluster as a
> whole, because slower clients can somewhat sync to the max clock of all the
> objects they're trying to operate on. But again, that furious faster client
> can keep updating past the max. Granted, this would only happen in a pretty
> extreme situation.

I think the way to look at it is that the clock is just a rough
synchronisation service. You could do it all without one, in which case
the submission would look a bit like this:

1. my next txn id is 3. Send this txn with id 3 to all the nodes
necessary.
2. at least one node has rejected this due to 3 being too small. They
say it should be at least 12367.
3. resubmit same txn but now with id 12368.

The problem with this is that nearly every single txn will require
multiple submissions, and so multiple round trips, and so will have very slow
throughput. Setting the next txn id to the current clock time will
essentially ensure that most of the time you will be able to get away
without extra resubmissions because all the clocks will increase at the
same rate.
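
A sketch of that loop in Go, with the clock seeding the initial id
(hypothetical names again; submit stands in for the round trip to the
nodes):

    package main

    import "time"

    // Rejection carries the minimum acceptable txn id reported by the
    // rejecting node(s).
    type Rejection struct{ MinID int64 }

    // submitTxn seeds the txn id from the local clock so that, with
    // roughly synchronised clocks, the first attempt almost always
    // succeeds; on rejection it bumps past the reported max and resubmits.
    func submitTxn(submit func(id int64) (*Rejection, error)) error {
        id := time.Now().UnixNano() // ns clock keeps ids unique per txn
        for {
            rej, err := submit(id)
            if err != nil {
                return err
            }
            if rej == nil {
                return nil // accepted
            }
            id = rej.MinID + 1 // the resync path - shades of Paxos phase 1
        }
    }

    func main() {
        first := true
        _ = submitTxn(func(id int64) (*Rejection, error) {
            if first {
                first = false
                return &Rejection{MinID: id + 12367}, nil // "too small"
            }
            return nil, nil // accepted on resubmission
        })
    }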

Of course the downside is that every txn must have a unique id so if
you're submitting faster than your clock changes you have to do more
work. However if your clock is measuring nanoseconds then you probably
don't have to worry too much!

Matthew
