Hi Kyle,
TL;DR: Using a different server-client topology reveals more bugs; a simplified linearizability check makes it possible to test long histories; and the recent MongoDB analysis may have missed a bug.
I was exploring different ways of checking linearizability, and it looks like I found a violation of linearizability in MongoDB 3.6.4 (I haven't tested 3.4.0, but there is a possibility that it's also affected). My setup differs from the usual Jepsen tests in two ways:
- Each node hosts an application exposing an HTTP key/value interface, so Jepsen works over HTTP instead of hitting MongoDB directly. As a result, different MongoDB clients may have different views during a network partition, which may reveal more bugs.
- The configuration doesn't use gen/stagger, so workers issue operations without delays.
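To make the first point concrete, here is a minimal sketch of the per-node key/value application. Everything in it is an assumption for illustration: the real app would forward to MongoDB through a driver, while here a plain dict stands in for the collection so the sketch is self-contained and runnable.

```python
# Sketch of a per-node HTTP key/value shim (hypothetical; a dict stands
# in for the MongoDB collection the real application would talk to).
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen, Request

store = {}  # stand-in for the backing MongoDB collection

class KVHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip("/")
        body = json.dumps({"value": store.get(key)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def do_PUT(self):
        key = self.path.lstrip("/")
        length = int(self.headers["Content-Length"])
        store[key] = json.loads(self.rfile.read(length))["value"]
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port in a background thread.
server = HTTPServer(("127.0.0.1", 0), KVHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Jepsen-side client: plain HTTP instead of a MongoDB driver.
req = Request(f"http://127.0.0.1:{port}/x",
              data=json.dumps({"value": 73}).encode(), method="PUT")
urlopen(req)
val = json.loads(urlopen(f"http://127.0.0.1:{port}/x").read())["value"]
print(val)
```

Because Jepsen only ever speaks HTTP here, each node's application holds its own MongoDB client, so a partition between nodes naturally produces clients with diverging views.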
Also, I used a different checker; the existing linearizability checker is very demanding, and validation of long histories often ends with OOM exceptions, which is expected since the problem is NP-complete. But if we add extra restrictions, it's possible to reduce the complexity to O(x ln x) or even O(x), where x is the length of the history.
In my case, I had a single writer per key, writing values that form an increasing sequence of numbers. Obviously, in a linearizable system every read must then return a value greater than or equal to the previous read. So instead of linearizability I checked for monotonicity: whenever the latter is violated, the former is broken too.
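The monotonicity check described above can be sketched in a few lines. This is a minimal single-key O(x) version under my stated assumptions (one writer per key, strictly increasing writes); the dict fields mirror the Jepsen log lines below, and the sample values are illustrative.

```python
# O(x) monotonicity check over a history of completed reads (sketch;
# assumes a single key with one writer producing increasing values).
def check_monotonic(history):
    """Return the first pair of :ok reads whose values go backwards, or None."""
    last = None
    for op in history:
        if op["type"] == "ok" and op["f"] == "read" and op["value"] is not None:
            if last is not None and op["value"] < last["value"]:
                return last, op  # monotonicity, hence linearizability, violated
            last = op
    return None

# Illustrative fragment shaped like the log excerpt below.
history = [
    {"type": "ok", "f": "read", "value": 73, "process": 295},
    {"type": "ok", "f": "read", "value": 65, "process": 307},
]
bad = check_monotonic(history)
print(bad)  # the offending pair of reads, or None for a clean history
```

A single pass with one comparison per read is what brings the cost down to O(x); sorting the history by completion time first is what would give the O(x ln x) variant.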
And indeed, I was able to observe a violation (a read of 65 after 73 had already been read):
{:type :invoke, :f :read, :value nil, :host "node3", :process 295, :time 33119025100}
...
{:type :ok, :f :read, :value 73, :host "node3", :process 295, :time 33372398800}
...
{:type :invoke, :f :read, :value nil, :host "node3", :process 307, :time 34403679100}
...
{:type :ok, :f :read, :value 65, :host "node3", :process 307, :time 35101938700}
I hope you'll find it interesting :)
Cheers,
Denis