Why http for raft? Heartbeat interval a lower bound for latency?


susanggg...@gmail.com

Feb 21, 2017, 11:16:34 PM
to etcd-dev
Hello,
Why is coreos using HTTP for its raft transport? Why not use something more efficient and specific to this use case, like gRPC?

Also, I found this train of thought on another mailing list at my university:

"etcd/raft leader only appends to its own log without fanning out to followers. It’s the heartbeat that is used to replicate the newly added entries. The heartbeat is sent out at certain configurable interval. This means that the latency would not possibly be lower than heartbeat interval."

However, looking at the code, it does seem that append entries moves entries forward to followers. Can someone confirm that the above train of thought is no longer accurate? Was it ever accurate if it is no longer accurate?

Thanks,

Susan

anthony...@coreos.com

Feb 22, 2017, 12:41:19 PM
to etcd-dev
Hi,


> Why is coreos using HTTP for its raft transport? Why not use something more efficient and specific to this use case, like gRPC?

rafthttp opens an http connection to each peer and uses the tcp socket to stream messages indefinitely. The bulk of the payload is already protobuf-encoded; I don't think it's clear that gRPC would offer a stunning performance advantage. Historically speaking, etcd's rafthttp code predates go-grpc.


> This means that the latency would not possibly be lower than heartbeat interval.


This is easily falsifiable with the current etcd codebase; it can be refuted with cmd/tools/benchmark.


> Can someone confirm that the above train of thought is no longer accurate? Was it ever accurate if it is no longer accurate?

go-raft would coalesce appends into heartbeats for simplicity, but etcd/raft never worked like that. Even with a heartbeat coalescing policy, latencies are still lower than the heartbeat interval, since an entry can be attached to the next heartbeat shortly before it is due to fire (on average the latency would be half the interval).

susanggg...@gmail.com

Feb 26, 2017, 12:51:55 AM
to etcd-dev
I've used the setup from the etcd raftexample, which uses the etcd/raft WAL, the HTTP transport, and in-memory storage. For the state machine I use a trivial no-op rather than the kv storage. I'm trying to test the latency characteristics of a simple client that continuously sends requests to the leader. The 50th percentile latency is 1ms, but the 99th percentile can be as high as 20ms. The test setup is two nodes in the same datacenter with ~0.2ms latency between them, and a third node in another datacenter about 70ms away from the other two. The client is in the datacenter with the two nodes, one of which is a stable leader. Any thoughts on why the 99th percentile could be so high? I've set the ticker to 100ms with a heartbeat of 1 tick and an election timeout of 10 ticks, so the leader is stable. Are there any other recommendations for performance tuning, or other insights you can offer? What's the best way to identify the performance issue? Could it be in the WAL write?

Thanks,

Susan

susanggg...@gmail.com

Feb 26, 2017, 2:54:03 PM
to etcd-dev
I simplified my setup by moving all the nodes to the same datacenter and instrumented the different raft operations. I found that the call to wal.Save has a 99th percentile of 7ms, and presumably two wal.Save calls are necessary for every operation, which partially explains why I'm observing a 99th percentile of 17ms. None of the other operations exhibits that much latency. I know the raft example is not meant to be performant, but I'm curious whether you have any examples or ideas for speeding up the persistent write here.
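
One way to check whether the disk itself accounts for that tail is to time synced writes directly, outside of raft. The sketch below is a stand-alone probe and does not use etcd's wal package; the function name `timeSyncedWrites`, the record size, and the use of the default temp directory are assumptions. The temp directory may live on a different device (or on tmpfs) than the WAL, so for a meaningful number the file should be created on the same disk as the WAL.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// timeSyncedWrites appends n records of `size` bytes to a temporary file,
// calling Sync (fsync) after each one, and returns how long each
// write-plus-sync took.
func timeSyncedWrites(n, size int) ([]time.Duration, error) {
	f, err := os.CreateTemp("", "fsync-probe-*")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	record := make([]byte, size)
	latencies := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		if _, err := f.Write(record); err != nil {
			return nil, err
		}
		if err := f.Sync(); err != nil {
			return nil, err
		}
		latencies = append(latencies, time.Since(start))
	}
	return latencies, nil
}

func main() {
	latencies, err := timeSyncedWrites(200, 256)
	if err != nil {
		panic(err)
	}
	var worst time.Duration
	for _, d := range latencies {
		if d > worst {
			worst = d
		}
	}
	fmt.Println("samples:", len(latencies), "worst:", worst)
}
```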

Susan

susanggg...@gmail.com

Feb 26, 2017, 3:17:09 PM
to etcd-dev
I guess this is one optimization I can add (parallel send/disk write at the leader). What is the 99th for etcdserver mutations if they all require at least one synchronous disk write? Besides this optimization, does etcdserver do anything more to optimize this path compared to raft example?

Susan

Xiang Li

Feb 26, 2017, 3:24:23 PM
to susanggg...@gmail.com, etcd-dev
There is no other optimization you can do if you want the safety. To commit an entry, you MUST persist it to disk. Read the Raft paper for more details.
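
The commit rule can be sketched as follows: an index is committed once a majority of nodes have durably persisted it. The function name `commitIndex` is hypothetical, and the sketch is simplified in that it ignores Raft's additional restriction on entries from earlier terms.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index that a majority of nodes report
// as durably persisted. persisted[i] is the last index node i has synced to
// disk (the leader's own value is included).
func commitIndex(persisted []uint64) uint64 {
	if len(persisted) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), persisted...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] }) // descending
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1] // at least `quorum` nodes have reached this index
}

func main() {
	// Leader and one follower have synced index 7; the far node is at 5.
	fmt.Println(commitIndex([]uint64{7, 7, 5})) // prints 7
	// Only the leader has synced index 7, so it cannot be committed yet.
	fmt.Println(commitIndex([]uint64{7, 4, 5})) // prints 5
}
```

In a three-node cluster this means the commit waits for the faster of the two followers, plus that follower's disk sync, so a distant third node does not have to be on the critical path.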

If you want to reduce latency, you can switch to a faster disk (an SSD will make the 99th percentile lower and more stable). Or you can use a RAID controller with write-back (WB) caching enabled if you can afford the operational overhead.

For etcd, we mainly care about finding the best tradeoff between throughput, latency and safety, not pure latency. 

--
You received this message because you are subscribed to the Google Groups "etcd-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to etcd-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

susanggg...@gmail.com

Feb 28, 2017, 10:58:41 PM
to etcd-dev, susanggg...@gmail.com
Thanks Xiang, I really appreciate your help!

I have one final question about etcd/raft, specifically about the sample implementation.

There are two structs being used:


My understanding is that when an AppendEntry request is made the flow is:

leader: 
1. wal.Save() - this includes an fsync to disk
2. raftStorage.Save()

follower:
3. wal.Save() - this includes an fsync to disk
4. raftStorage.Save()

quorum:
5. Commit()

That is, the AppendEntry is persisted to disk via fsync by a quorum of nodes before the commit is done. Is this correct? What is raftStorage for? Is it just a cache of the WAL? Why is it called raftStorage? I find the name confusing because it's in-memory, and I can't imagine we would want to store this data in memory without also persisting it to disk via fsync. Is this sample implementation in fact durable/safe? And does your etcdserver attempt an optimization where steps 1 and 3 happen in parallel?

Thank you!

Susan

RaftLearner

Feb 28, 2017, 11:22:34 PM
to etcd-dev, susanggg...@gmail.com
I have a similar question about the purpose of "raftStorage". https://godoc.org/github.com/coreos/etcd/raft says:

"Second, all persisted log entries must be made available via an implementation of the Storage interface. The provided MemoryStorage type can be used for this (if you repopulate its state upon a restart), or you can supply your own disk-backed implementation."

1) What does "all persisted log entries must be made available via an implementation of the Storage interface" imply? Does the Storage interface serve as a log store, so that if other nodes ask for log entries, they can be retrieved from there? Or does it also serve some other purpose, like reconstructing a snapshot?

2) More importantly, is it recommended that the Storage interface be implemented on top of disk in production? That would mean a single append-entry incurs two sequential disk seeks on the critical write path (one for wal.Save, another for writing to the disk-backed Storage), which is a big concern because of the extra disk seek latency. My guess is based on lines 428-433 in https://github.com/coreos/etcd/blob/master/contrib/raftexample/raft.go