Testing and validating a raft implementation.


kjnilsson

Oct 4, 2013, 4:53:41 AM
to raft...@googlegroups.com
Ok should I start this off then?

How have those of you who have implemented raft gone about testing and validating that the implementation is sound?

We have implemented raft for use with a distributed hosting application we are writing. Naturally we unit test individual functions but automated testing of the behaviour of the whole cluster is of course more difficult. Currently we fall back on a fair bit of manual/semi-manual testing in addition to continuously deploying the hosting application and putting it under test.

I'd be interested in hearing what testing strategies others have used.

Cheers
Karl

xian...@coreos.com

Oct 4, 2013, 10:58:09 AM
to raft...@googlegroups.com


On Friday, October 4, 2013 1:53:41 AM UTC-7, kjnilsson wrote:
> Ok should I start this off then?
>
> How have those of you who have implemented raft gone about testing and validating that the implementation is sound?

   For go-raft, we mostly rely on unit tests, randomized integration tests, and case-by-case integration tests, and we audit the code by reading through it.

> We have implemented raft for use with a distributed hosting application we are writing. Naturally we unit test individual functions but automated testing of the behaviour of the whole cluster is of course more difficult. Currently we fall back on a fair bit of manual/semi-manual testing in addition to continuously deploying the hosting application and putting it under test.
 
   I think you can test your raft implementation as a whole by:

   1. Simulating a network partition and then healing it.
   2. Repeatedly killing the leader and bringing it back.
   3. Repeatedly killing a follower and bringing it back.
   4. Repeatedly killing a random number of nodes and bringing them back.
   5. Setting the heartbeat interval and election timeout very small to help expose race conditions.
   6. Stressing the whole cluster with commands while repeating the steps above.
   7. Testing your raft implementation with your application.
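
A minimal Python sketch of the kill-and-restart items in the list above. It only generates a reproducible schedule of kill/restart events and leaves the actual process control to your own harness. The name `fault_schedule` and the event tuple format are made up for illustration and are not part of go-raft; targeting the leader specifically would additionally need some way to ask the cluster who the current leader is.

```python
import random

def fault_schedule(node_ids, steps, seed):
    """Return a list of (step, action, targets) events for a chaos run."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    down = set()               # nodes that are currently stopped
    events = []
    for step in range(steps):
        up = sorted(set(node_ids) - down)
        if down and (not up or rng.random() < 0.5):
            # Bring back a random, non-empty subset of the stopped nodes.
            targets = rng.sample(sorted(down), rng.randint(1, len(down)))
            down -= set(targets)
            events.append((step, "restart", targets))
        else:
            # Kill a random, non-empty subset of the running nodes.
            targets = rng.sample(up, rng.randint(1, len(up)))
            down |= set(targets)
            events.append((step, "kill", targets))
    return events

if __name__ == "__main__":
    for event in fault_schedule(["n1", "n2", "n3", "n4", "n5"], 10, seed=1):
        print(event)
```

Because the generator is seeded, a schedule that exposes a bug can be replayed exactly by reusing the same seed.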

Ben Johnson

Oct 4, 2013, 11:14:40 AM
to xian...@coreos.com, raft...@googlegroups.com
Karl-

Xiang gave a good rundown of some of the internal unit and integration testing on go-raft. If you look at "runTestHttpServers()" you can see how we set up a multi-node cluster in a single process for some basic testing:


That file just tests the transporter so it's not the best example but hopefully it gives you an idea of the basic test harness.
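
As a rough, language-neutral illustration of the single-process idea (this is not go-raft's harness, and every name here is hypothetical), nodes can exchange messages through an in-memory router instead of real sockets. That also makes partitions cheap to simulate:

```python
from collections import deque

class InMemoryNetwork:
    """Routes messages between nodes that all live in one process."""

    def __init__(self):
        self.inboxes = {}     # node id -> queue of (sender, message)
        self.blocked = set()  # unordered node pairs that cannot communicate

    def register(self, node_id):
        self.inboxes[node_id] = deque()

    def partition(self, group_a, group_b):
        # Block traffic in both directions between the two groups.
        for a in group_a:
            for b in group_b:
                self.blocked.add(frozenset((a, b)))

    def heal(self):
        self.blocked.clear()

    def send(self, src, dst, msg):
        """Deliver msg to dst's inbox; return False if the link is cut."""
        if frozenset((src, dst)) in self.blocked:
            return False
        self.inboxes[dst].append((src, msg))
        return True

if __name__ == "__main__":
    net = InMemoryNetwork()
    for name in ("a", "b", "c"):
        net.register(name)
    net.partition({"a"}, {"b", "c"})
    print(net.send("a", "b", "AppendEntries"))  # link is cut
    net.heal()
    print(net.send("a", "b", "AppendEntries"))  # link is restored
```

Each node under test would poll its own inbox; the test then drives `partition` and `heal` between assertions.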

Testing against a real-world application across multiple physical nodes is really useful as well. It gives you a much more representative picture of how your library is used and surfaces cases you wouldn't think of while building unit and integration tests.


Ben Johnson




Diego Ongaro

Oct 4, 2013, 4:47:11 PM
to kjnilsson, xian...@coreos.com, raft...@googlegroups.com, Ben Johnson
> On Friday, October 4, 2013 1:53:41 AM UTC-7, kjnilsson wrote:
>>
>> How have those of you who have implemented raft gone about testing and
>> validating that the implementation is sound?

Xiang and Ben's responses both contain good ideas, and I don't think
there's any one way to do this. I've also mostly done unit testing on
LogCabin for now, but here's one easy idea that should get you a lot
of mileage: randomly drop messages either at the sender or at the
recipient. For example, if you had each server randomly drop around
10% of the messages it received, my guess is the "network" would
sometimes function well enough to make progress and confirm things are
still working, and it'd test a wide variety of edge cases. And there are
a few easy ways to expand this to cover more edge cases: vary the drop
percentage between servers, add random message delays, random server
restarts, etc.
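
A minimal Python sketch of the drop-and-delay idea, assuming your transport can be wrapped behind a single delivery callback. `LossyLink`, its parameters, and the tick-based clock are hypothetical names for illustration, not LogCabin's API.

```python
import random

class LossyLink:
    """Wraps a delivery callback, randomly dropping and delaying messages."""

    def __init__(self, deliver, drop_rate=0.10, max_delay_ticks=0, seed=0):
        self.deliver = deliver              # called with each surviving message
        self.drop_rate = drop_rate          # probability a message is dropped
        self.max_delay_ticks = max_delay_ticks
        self.rng = random.Random(seed)      # seeded for reproducible runs
        self.pending = []                   # (due_tick, message) pairs
        self.now = 0
        self.dropped = 0

    def send(self, msg):
        if self.rng.random() < self.drop_rate:
            self.dropped += 1
            return
        delay = self.rng.randint(0, self.max_delay_ticks)
        self.pending.append((self.now + delay, msg))

    def tick(self):
        """Deliver every message that is due, then advance the clock."""
        ready = [m for due, m in self.pending if due <= self.now]
        self.pending = [(due, m) for due, m in self.pending if due > self.now]
        self.now += 1
        for m in ready:
            self.deliver(m)

if __name__ == "__main__":
    received = []
    link = LossyLink(received.append, drop_rate=0.10, max_delay_ticks=3, seed=1)
    for i in range(100):
        link.send(i)
    for _ in range(5):
        link.tick()
    print(len(received), "delivered,", link.dropped, "dropped")
```

Giving each server its own `LossyLink` with a different `drop_rate` is one way to vary the drop percentage between servers; random delays also reorder messages, which tends to exercise stale-RPC handling.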

-Diego

Peter Bourgon

Oct 4, 2013, 4:56:15 PM
to Diego Ongaro, kjnilsson, xian...@coreos.com, raft...@googlegroups.com, Ben Johnson
I exposed a lot of interesting edge cases by dropping the minimum
election timeout by an order of magnitude (250ms -> 25ms) and running my
multiple-server scenario tests on a single virtual machine, in my case
on Linode. If the host is overprovisioned, and most are, you miss a lot
of deadlines.
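
A small Python sketch of how such a knob might look, assuming the implementation draws each election timeout at random between a configurable minimum and a multiple of it. The names `election_timeout`, `spread`, and the two constants are illustrative assumptions, not taken from any particular implementation.

```python
import random

PRODUCTION_MIN_S = 0.250               # 250 ms minimum, as in the message above
STRESS_MIN_S = PRODUCTION_MIN_S / 10   # 25 ms for stress runs

def election_timeout(min_timeout_s, rng, spread=2.0):
    """Pick a random timeout between min_timeout_s and min_timeout_s * spread.

    The upper bound of 2x the minimum is an assumption for this sketch.
    """
    return rng.uniform(min_timeout_s, min_timeout_s * spread)

if __name__ == "__main__":
    rng = random.Random(0)
    samples = [election_timeout(STRESS_MIN_S, rng) for _ in range(5)]
    print([round(s * 1000, 1) for s in samples], "ms")
```

Making the minimum a test-time parameter lets the same scenario tests run with both the production and the shortened value.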