Multi-raft support in TiKV?

jz...@uber.com

Apr 6, 2017, 9:15:32 PM

to TiDB user group

Hello,

I am curious about how Raft is integrated with TiKV. Do you support multi-raft, which (1) multiplexes all regions/ranges on a node into the same local storage, and (2) coalesces heartbeats so that they are sent per store instead of per range/region?

If so, it would be great if you could point me to a reference or the code for that. Thanks!

tl

Apr 6, 2017, 11:17:35 PM

to TiDB user group

Hi

We support multi-raft.

For 1, we use a hash map from the raft region ID to the raft node, and route each raft message to the corresponding raft group by its region ID.
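To make the routing idea concrete, here is a minimal sketch in Python. It is not TiKV's code: `Store`, `Peer`, `RaftMessage`, and `dispatch` are hypothetical names chosen for illustration, and the peer only records messages instead of running a raft state machine.

```python
from dataclasses import dataclass, field


@dataclass
class RaftMessage:
    region_id: int  # which raft group the message belongs to
    payload: str    # stand-in for the real raft message body


@dataclass
class Peer:
    """Hypothetical stand-in for one raft node (one replica of a region)."""
    region_id: int
    inbox: list = field(default_factory=list)

    def step(self, msg):
        # A real peer would feed the message into its raft state machine.
        self.inbox.append(msg.payload)


class Store:
    """One node hosting many raft groups, keyed by region ID."""

    def __init__(self):
        self.peers = {}  # region_id -> Peer

    def add_region(self, region_id):
        self.peers[region_id] = Peer(region_id)

    def dispatch(self, msg):
        # Look up the target group; report failure if the region is unknown.
        peer = self.peers.get(msg.region_id)
        if peer is None:
            return False
        peer.step(msg)
        return True
```

All groups share one incoming message stream, and the region ID on each message is the only thing needed to find the right group.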

For 2, we don't support coalescing heartbeats now. The raft heartbeat interval is 3s and the election timeout is 10s, so the heartbeat load is not heavy even when we have many regions (10000+) in one store. However, this may make a raft election take too long (10s or more). We will improve this later.
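A rough back-of-the-envelope estimate of that heartbeat load, as a Python sketch. The region count and interval come from the post; the replica count (3, so 2 followers per region) and the worst case where this store leads every region it hosts are assumptions made only for the estimate.

```python
def heartbeats_per_second(leader_regions, followers_per_region, interval_s):
    """Heartbeat messages a store sends per second without coalescing."""
    return leader_regions * followers_per_region / interval_s


# Figures from the post: 10000 regions, 3s heartbeat interval.
# Assumptions: 3 replicas per region (so 2 followers each), and the
# worst case where this store leads every region it hosts.
rate = heartbeats_per_second(10_000, 2, 3.0)
print(round(rate))  # 6667
```

Under these assumptions the store sends a few thousand small messages per second, and fewer if leaders are spread across stores.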

On Friday, April 7, 2017 at 9:15:32 AM UTC+8, jz...@uber.com wrote:

Dongxu Huang

Apr 6, 2017, 11:26:54 PM

to TiDB user group

What's more, for 2, since the transport layer is pipelined, performance is not a big problem. Coalescing heartbeats adds a lot of complexity, and I don't think it's worth the effort.
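For readers unfamiliar with the term, coalescing means batching per-region heartbeats by destination store. A minimal Python sketch of just the batching step follows. `coalesce_heartbeats` is a hypothetical helper, not something TiKV provides, and it leaves out per-group terms, commit indexes, and response handling, which is presumably where the real complexity sits.

```python
from collections import defaultdict


def coalesce_heartbeats(pending):
    """Group per-region heartbeats into one batch per destination store.

    pending: iterable of (dest_store_id, region_id) pairs, one pair per
    follower that is due a heartbeat.
    Returns {dest_store_id: sorted list of region_ids}.
    """
    batches = defaultdict(list)
    for dest_store, region_id in pending:
        batches[dest_store].append(region_id)
    return {store: sorted(regions) for store, regions in batches.items()}


# Four per-region heartbeats become two per-store messages.
pending = [(2, 10), (3, 10), (2, 11), (3, 12)]
print(coalesce_heartbeats(pending))  # {2: [10, 11], 3: [10, 12]}
```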

On Friday, April 7, 2017 at 11:17:35 AM UTC+8, tl wrote:

jz...@uber.com

May 26, 2017, 2:25:02 AM

to TiDB user group

On Thursday, April 6, 2017 at 8:17:35 PM UTC-7, tl wrote:

> Hi
>
> We support multi-raft.
>
> For 1, we use a hash map from the raft region ID to the raft node, and route each raft message to the corresponding raft group by its region ID.
> For 2, we don't support coalescing heartbeats now. The raft heartbeat interval is 3s and the election timeout is 10s, so the heartbeat load is not heavy even when we have many regions (10000+) in one store. However, this may make a raft election take too long (10s or more). We will improve this later.

Thanks for the reply. With a heartbeat interval of 3s, will that increase the chance of a stale follower? I mean, the leader may mark many log entries as committed, but may not propagate the commitIndex to the follower until the next heartbeat.

tl

May 26, 2017, 2:35:35 AM

to TiDB user group

Hi jzhan

> the leader may mark many log entries as committed, but may not propagate the commitIndex to the follower until the next heartbeat

The leader will send the committed index in the next AppendEntries message. If we have a heavy write flow, this may not be a problem.
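A simplified Python sketch of how a follower can pick up the committed index from an append message. It is illustrative only: the `Follower` class is hypothetical, and it skips the previous-index/term consistency checks a real raft implementation performs.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    log: list = field(default_factory=list)  # entry with index i is log[i - 1]
    commit_index: int = 0

    def on_append(self, entries, leader_commit):
        """Handle an append message carrying new entries plus the leader's
        committed index (simplified: no prev-index/term checks)."""
        self.log.extend(entries)
        # Never move backwards, and never commit past what is held locally.
        self.commit_index = max(
            self.commit_index, min(leader_commit, len(self.log))
        )


f = Follower()
f.on_append(["a"], leader_commit=0)  # "a" replicated, not yet committed
f.on_append(["b"], leader_commit=1)  # the next write piggybacks commit of "a"
print(f.commit_index)  # 1
```

With a steady write flow, each new append carries the commit index for the entries before it, so the follower does not have to wait for a heartbeat.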

On Friday, May 26, 2017 at 2:25:02 PM UTC+8, jz...@uber.com wrote:

Shawn Li

May 26, 2017, 2:47:59 AM

to TiDB user group

One way to reduce the number of heartbeats is to use quiescing raft.

jz...@uber.com

May 26, 2017, 3:00:57 AM

to TiDB user group

On Thursday, May 25, 2017 at 11:35:35 PM UTC-7, tl wrote:

> Hi jzhan
>
> > the leader may mark many log entries as committed, but may not propagate the commitIndex to the follower until the next heartbeat
>
> The leader will send the committed index in the next AppendEntries message. If we have a heavy write flow, this may not be a problem.

Yeah, I just realized that after posting the question. So I'll rephrase it: for a cold group that has very few writes, could it happen that a write to the leader is not readable on the follower until the next heartbeat? It's definitely a rare case, but I wonder if that's going to cause an issue.

tl

May 26, 2017, 7:27:50 PM

to TiDB user group

> for a cold group that has very few writes, could it happen that a write to the leader is not readable on the follower until the next heartbeat

This may be a problem if you want to support follower lease reads. But if you only let the leader handle reads and writes, it is not a problem, because the leader can apply the raft log as soon as possible.

If the leader is down, the new leader will apply the log immediately after it becomes the leader.
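To put a number on the cold-group case, here is a small Python sketch of how long a follower could lag. It takes the premise of this thread as given: the follower only learns the new commit index from the next heartbeat. It also assumes heartbeats fire at exact multiples of the interval with no network delay and no other writes in between. The function names are hypothetical.

```python
import math


def next_heartbeat_after(commit_time_s, interval_s=3.0):
    """Time of the first heartbeat strictly after the leader commits."""
    return (math.floor(commit_time_s / interval_s) + 1) * interval_s


def follower_staleness(commit_time_s, interval_s=3.0):
    """How long a follower read could miss a write the leader already serves."""
    return next_heartbeat_after(commit_time_s, interval_s) - commit_time_s


# Leader commits at t=0.5s; the follower catches up at the t=3.0s heartbeat.
print(follower_staleness(0.5))  # 2.5
```

So with a 3s interval the window is at most 3s under these assumptions, and it only matters for reads served by followers.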

On Friday, May 26, 2017 at 3:00:57 PM UTC+8, jz...@uber.com wrote:
