Need Help Reproducing Results for MongoDB


Malhar Thakkar

Mar 22, 2018, 10:34:48 PM
to Jepsen Talk
Hello everyone,

I wish to study the shortcomings of MongoDB 2.4.3, 2.6.7, and 3.4.0-rc3 by running the Jepsen tests myself, but I can't find a way to run them. Neither the documentation nor the blog posts explain how to reproduce the results.

Could anyone please guide me on the same?


Thank you.


Regards,
Malhar Thakkar

Kyle Kingsbury

Mar 23, 2018, 3:54:23 AM
to ta...@jepsen.io
At the risk of issuing the canonical mailing list response, may I suggest consulting the file in the MongoDB test repo mysteriously labeled "README"? The first section is called "Examples", and features examples of how to run the test suite. Might be worth starting there.

--
You received this message because you are subscribed to the Google Groups "Jepsen Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to talk+uns...@jepsen.io.
To post to this group, send email to ta...@jepsen.io.
To view this discussion on the web visit https://groups.google.com/a/jepsen.io/d/msgid/talk/2b185e96-38bc-4744-912d-ad93a7c4eb7a%40jepsen.io.

Malhar Thakkar

Mar 23, 2018, 2:21:43 PM
to ta...@jepsen.io
Hey Kyle,

Thank you for your response. I'll look into the jepsen-io/mongodb repository.


Regards,
Malhar

Malhar Thakkar

Mar 27, 2018, 12:06:27 PM
to ta...@jepsen.io
I followed the steps on this URL to reproduce results for MongoDB 2.6.7. I initially hit an error caused by the dependency on jepsen 0.0.3-SNAPSHOT, which I changed to 0.0.3 as suggested in one of the pull requests.

However, after doing this and running lein test, I encountered the following error message.
Tried to use insecure HTTP repository without TLS.
This is almost certainly a mistake; however in rare cases where it's
intentional please see `lein help faq` for details.

Any idea why that might be happening and how I can rectify it?

I'm using a Google Cloud Platform instance to run jepsen and I've made sure that I allow both HTTP and HTTPS traffic.

Thank you.


Regards,
Malhar

Malhar Thakkar

Mar 27, 2018, 12:06:27 PM
to ta...@jepsen.io
My apologies for the missing URL in my previous email.

The steps I followed are on this link.

Kyle Kingsbury

Mar 31, 2018, 5:25:14 AM
to ta...@jepsen.io
I haven't seen that before, but given the error message, I imagine the answer might be in lein's FAQ.

--Kyle

Malhar Thakkar

Apr 1, 2018, 6:37:15 PM
to ta...@jepsen.io
I had a look at lein's FAQ and found the following answer (see the attached screenshot). I tried it, but it didn't work.


Also, I have a few queries regarding MongoDB 3.4.0-rc3.
  • The scenario mentioned in the blog-post that exposes the problem in MongoDB's v0 replication (isolating the primary node) is definitely possible, but isn't the probability of that happening low?
  • Also, in order to try and see for myself that MongoDB's 3.4.0-rc3 indeed loses acknowledged writes, I wanted to know the parameters that were set in order to find the vulnerability. Currently, I am running the test in the following way.
    • lein run test -t set -p 0 --time-limit 700 --key-time-limit 300
    • So far, I've run about 100 tests with the above setting only to find that the test passed every time. Am I missing something? Or is 100 too small a number to find the problem with the replication protocol?
  • Statistics about approximately how many times the aforementioned test was run (and with which parameters) to expose the vulnerability would be helpful.

Thank you.


Regards,
Malhar

Screen Shot 2018-04-01 at 18.21.04.png

Malhar Thakkar

Apr 1, 2018, 8:22:49 PM
to ta...@jepsen.io
Moreover, has anyone come across an issue due to the system clock getting changed?

For instance, running the test for MongoDB-3.4.0-rc3 eventually results in the following error.
ERROR [2255-01-29 04:21:14,925] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -c "cd /; apt-get update" returned non-zero exit status 100 on n1. STDOUT:
Hit http://security.debian.org jessie/updates InRelease


STDERR:
E: Release file for http://security.debian.org/dists/jessie/updates/InRelease is expired (invalid since 86489d 6h 59min 49s). Updates for this repository will not be applied.

Notice the timestamp on that error message. The year is 2255 (which I think is due to the nemesis changing the system clocks). Should I do something like resetting the clock after every run of the test? Or is there a cleaner solution?


Thank you.


Regards,
Malhar

Kyle Kingsbury

Apr 4, 2018, 10:57:25 AM
to ta...@jepsen.io
On 04/01/2018 05:37 PM, Malhar Thakkar wrote:
> I had a look at lein's FAQ to find the following answer (find screenshot
> attached). I tried doing it, but it didn't work.
I read the FAQ, poked around in lein, googled for the error message, found this
thread on the first page: https://github.com/technomancy/leiningen/issues/2392,
added the suggested aether snippet to project.clj, re-ran `lein test`, and found
an issue with a transitive dependency: high-scale-lib, which is a part of the
knossos deps, and at the time used an HTTP repository.

Back then, lein was fine with this, but newer versions of lein refuse (sensibly)
to use http. I suggest either re-running with a contemporary version of lein, or
grabbing the new https repository from the current `knossos` project and
rebuilding the *old* version of knossos with that repo instead of the http one.
You can locally install your changes using `lein install`.

> Also, I have a few queries regarding MongoDB 3.4.0-rc3.
>
> * The scenario mentioned in the blog-post that exposes the problem in
> MongoDB's v0 replication (isolating the primary node) is definitely
> possible, but isn't the probability of that happening low?

That's one particular example that suffices to break v0, but not the only one.
"Low" is sort of a tricky question, and it's difficult to answer empirically
since setups vary so much. It's gonna depend on your workload, hardware,
topology, client distribution, etc etc.

> * Also, in order to try and see for myself that MongoDB's 3.4.0-rc3 indeed
> loses acknowledged writes, I wanted to know the parameters that were set in
> order to find the vulnerability. Currently, I am running the test in the
> following way.
> o *lein run test -t set -p 0 --time-limit 700 --key-time-limit 300*
> o So far, I've run about 100 tests with the above setting only to find
> that the test passed every time. Am I missing something? Or is 100 too
> small a number to find the problem with the replication protocol?

I'd say one to five tests ought to do it, but it's been a bit and I don't
remember all the details. I was firing up an AWS cluster to confirm for you, but
my ISP actually lost its route to AWS so... that's gonna have to be another
time. Gotta do some real work today too. ;-)

> * Statistics about approximately how many times the aforementioned test was
> run (and with which parameters) to expose the vulnerability would be helpful
It's going to depend on your environment: node speed, network performance, etc
are going to affect concurrency intervals and throughput, which can have a
significant effect on test reproducibility. I go through a fair bit of tuning to
try and create repeatable tests, but it's a big ball of nondeterministic
concurrent state. YMMV.

--Kyle
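
The HTTP-repository problem described above comes from a transitive dependency, so it will not necessarily show up in your own project.clj. As a rough sketch (not something from the thread), one way to hunt for the offending declaration is to search the POM files in the local Maven cache for plain `http://` URLs. The function name `find_http_repo_poms` is made up for illustration, and matches may include harmless project-homepage URLs as well as repository declarations.

```shell
#!/bin/sh
# find_http_repo_poms DIR: print the paths of all *.pom files under DIR
# that contain a <url> element pointing at a plain http:// address.
# Hypothetical helper; it does not distinguish repository URLs from
# other <url> elements, so treat the output as a list of candidates.
find_http_repo_poms() {
  find "$1" -name '*.pom' -exec grep -l '<url>http://' {} +
}

# Typical use, assuming the default local Maven cache location:
#   find_http_repo_poms "$HOME/.m2/repository"
```

Any POM it lists that belongs to the old knossos version or its dependencies is a candidate for the rebuild-with-https approach suggested above.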

Kyle Kingsbury

Apr 4, 2018, 11:42:47 AM
to ta...@jepsen.io
> * Also, in order to try and see for myself that MongoDB's 3.4.0-rc3 indeed
> loses acknowledged writes, I wanted to know the parameters that were set in
> order to find the vulnerability. Currently, I am running the test in the
> following way.
> o *lein run test -t set -p 0 --time-limit 700 --key-time-limit 300*
> o So far, I've run about 100 tests with the above setting only to find
> that the test passed every time. Am I missing something? Or is 100 too
> small a number to find the problem with the replication protocol?

Ah, AWS is back for me now. On a fresh cluster, and SHA
6cbb2291aad2468f872c1dad8731cfe42168164d,

lein run test -t set -p 0 --time-limit 300 --username admin --nodes-file ~/nodes --test-count 10

hit a lost-elements case in about five runs--about half an hour.

--Kyle
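
For anyone who wants to repeat a run until something breaks without relying on `--test-count`, here is a minimal sketch of a wrapper loop. It assumes the test command exits non-zero when a run fails its analysis; that is an assumption about the CLI at this revision, not something confirmed in the thread. The name `run_until_failure` is made up for illustration.

```shell
#!/bin/sh
# run_until_failure MAX CMD...: run CMD up to MAX times and stop at the
# first non-zero exit status. Hypothetical helper, not part of Jepsen.
run_until_failure() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    if ! "$@"; then
      echo "run $i failed"
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $max runs passed"
  return 0
}

# Example, using the options quoted above minus --test-count:
#   run_until_failure 10 lein run test -t set -p 0 --time-limit 300 \
#     --username admin --nodes-file ~/nodes
```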

Malhar Thakkar

Apr 4, 2018, 11:45:14 AM
to ta...@jepsen.io
Oh, I see. Thank you so much for letting me know. I'll try to do that and hopefully, I'll be able to obtain lost acknowledged writes. 




Kyle Kingsbury

Apr 4, 2018, 11:47:55 AM
to ta...@jepsen.io
On 04/01/2018 07:22 PM, Malhar Thakkar wrote:
> Moreover, has anyone come across an issue due to the system clock getting changed?
>
> For instance, running the test for MongoDB-3.4.0-rc3 eventually results in the
> following error.
> *ERROR [2255-01-29 04:21:14,925] main - jepsen.cli Oh jeez, I'm sorry, Jepsen
> broke. Here's why:*
> *java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u
> root bash -c "cd /; apt-get update" returned non-zero exit status 100 on n1.
> STDOUT:*
> *Hit http://security.debian.org jessie/updates InRelease*
> *
> *
> *
> *
> *STDERR:*
> *E: Release file for http://security.debian.org/dists/jessie/updates/InRelease
> is expired (invalid since 86489d 6h 59min 49s). Updates for this repository will
> not be applied.*
> *
> *
> Notice the timestamp on that error message. The year is 2255 (which I think is
> due to the nemesis changing the system clocks). Should I do something like
> resetting the clock after every run of the test? Or is there a cleaner solution?

No, I've never seen that before--Jepsen doesn't ever touch the control node's
clock, at least, as far as I'm aware. Something deeper must be going on.

--Kyle

Kyle Kingsbury

Apr 4, 2018, 11:55:23 AM
to ta...@jepsen.io
On 04/04/2018 10:45 AM, Malhar Thakkar wrote:
> Oh, I see. Thank you so much for letting me know. I'll try to do that and
> hopefully, I'll be able to obtain lost acknowledged writes.

Well, uh, those are basically the same CLI options you said you were using, so
if hundreds of those runs didn't find anything, this probably won't either.
That, plus the weird clock issue you ran into, suggests there's something
different about your *environment*. You haven't really said much about how
you've set up your Jepsen cluster, and I don't have any experience with Google
Cloud, so... can't really tell you what to do there. Investigate!

If you want to run with the same setup I use, you might try the 5+1 node cluster
setup here:

https://aws.amazon.com/marketplace/pp/B01LZ7Y7U0?qid=1486758124485&sr=0-1&ref_=srh_res_product_title

It's relatively easy to get going, and that's how I run most of my tests.

--Kyle

Malhar Thakkar

Apr 4, 2018, 11:59:35 AM
to ta...@jepsen.io
On Wed, Apr 4, 2018 at 11:55 AM, Kyle Kingsbury <ap...@jepsen.io> wrote:
> On 04/04/2018 10:45 AM, Malhar Thakkar wrote:
>> Oh, I see. Thank you so much for letting me know. I'll try to do that and
>> hopefully, I'll be able to obtain lost acknowledged writes.
>
> Well, uh, those are basically the same CLI options you said you were using, so
> if hundreds of those runs didn't find anything, this probably won't either.
> That, plus the weird clock issue you ran into, suggests there's something
> different about your *environment*. You haven't really said much about how
> you've set up your Jepsen cluster,

I'm using docker to set up my Jepsen cluster.

> and I don't have any experience with Google Cloud, so... can't really tell you
> what to do there. Investigate!
>
> If you want to run with the same setup I use, you might try the 5+1 node
> cluster setup here:
>
> https://aws.amazon.com/marketplace/pp/B01LZ7Y7U0?qid=1486758124485&sr=0-1&ref_=srh_res_product_title
>
> It's relatively easy to get going, and that's how I run most of my tests.

Thank you.


Kyle Kingsbury

Apr 4, 2018, 12:03:51 PM
to ta...@jepsen.io
On 04/04/2018 10:59 AM, Malhar Thakkar wrote:
> I'm using docker to set up my Jepsen cluster.

Oohhhhhhhh! This makes sense. Docker uses containers, and you can't set the
clock in containers. That's why your Mongo tests aren't showing any interesting
behavior. :)

--Kyle
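
As a rough way to check this on your own nodes: on Linux, setting the system clock requires the CAP_SYS_TIME capability (capability number 25), and Docker's default capability set does not include it. The sketch below tests that bit in the effective capability mask that the kernel reports in /proc/self/status. The function name `has_cap_sys_time` is made up for illustration, and the mask `00000000a80425fb` in the comment is only an example of what a default, unprivileged container typically reports.

```shell
#!/bin/sh
# CAP_SYS_TIME is capability number 25 in <linux/capability.h>.
CAP_SYS_TIME_BIT=25

# has_cap_sys_time MASK: succeed if the CAP_SYS_TIME bit is set in MASK,
# a hex capability mask without the 0x prefix (e.g. 00000000a80425fb).
# Hypothetical helper for illustration.
has_cap_sys_time() {
  [ $(( (0x$1 >> $CAP_SYS_TIME_BIT) & 1 )) -eq 1 ]
}

# Report on the current process, if /proc is available (Linux only).
if [ -r /proc/self/status ]; then
  mask=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
  if has_cap_sys_time "$mask"; then
    echo "CAP_SYS_TIME present: this process may set the clock"
  else
    echo "CAP_SYS_TIME absent: clock changes will be refused"
  fi
fi
```

Run it as root inside a database node's container. Note that granting the capability (for example with `docker run --cap-add SYS_TIME`) may not be enough for clock tests: as far as I know, containers share the host kernel's wall clock, so a change made in one container would apply to every container on that host rather than skewing a single node.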