Cluster performing worse than single node in same load test scenario

Aditya Chandak

Jul 20, 2022, 12:24:35 PM
to Tinode General
Hi, I’d like to report an observation regarding Tinode cluster performance, hoping to understand whether I am doing something wrong, whether this is expected behaviour, or whether it is possibly an issue.

I ran a load test using Gatling to test group chat performance.

I ran the same scenario on a single node and on a 3 node cluster:
  • In both scenarios there was a load balancer in front of the node(s)
  • Each Tinode node had a 2.5 CPU limit on an AWS EC2 c5.xlarge machine (4 VCPUs, 8GB RAM)
  • MongoDB as database adapter
  • A new group was created before each test

I noticed that the performance on the single node was much better than the cluster’s performance. In particular, the pub-topic-ctrl step took a long time and produced timeout errors as well. I can imagine there is some overhead to Tinode operating in cluster mode, but surely it shouldn’t be this large? Please help me understand whether or not this is expected behaviour.

Find the summary of Gatling reports for both scenarios here.

Gatling scenario details -
  • 512 sessions
  • 60 second ramp up time
  • Single group
  • 3 messages
  • 5 second message sending interval

tinode.MultiTopicLoadtestV3
```
// MUC with waiting ramp up time before and after sending messages (with getSubs)
class MultiTopicLoadtestV3 extends TinodeBase {
  // Input file can be set with the "accounts" java option.
  // E.g. JAVA_OPTS="-Daccounts=/tmp/z.csv" gatling.sh -sf . -rsf . -rd "na" -s tinode.Loadtest
  val feeder = csv(System.getProperty("accounts", "users.csv")).random
  var allTopics = Array("grpE3bYaq3q24U")

  val scn = scenario("WebSocket")
    .exec(ws("Connect WS").connect("/v0/channels?apikey=AQEAAAABAAD_rAp4DJh05a1HAwFT3A6K"))
    .exec(session => session.set("id", "tn-" + session.userId))
    .pause(1)
    .exec(hello)
    .pause(1)
    .feed(feeder)
    .doIfOrElse({session =>
      val uname = session("username").as[String]
      var token = session("token").asOption[String]
      if (token == None) {
        token = tokenCache.get(uname)
      }
      token == None
    }) { loginBasic } { loginToken }
    .exitHereIfFailed
    .pause(1)
    .exec(subMe)
    .exitHereIfFailed
    .exec(getSubs)
    .exitHereIfFailed
    .doIf({session =>
      session.attributes.contains("subs")
    }) {
      exec(session => session.set("sub", allTopics({session.userId % allTopics.length}.toInt)))
      .pause(1)
      .exec(subTopic)
      .exitHereIfFailed
      .pause(60)
      .exec(publish)
      .pause(60)
      .exec(leaveTopic)
      .pause(0, 3)
    }
    .exec(ws("close-ws").close)

  setUp(scn.inject(rampUsers(numSessions) during (rampPeriod.seconds))).protocols(httpProtocol)
}
```

Let me know if I can provide any more information. The config is pretty much the same as the default one, with the following changes:
  • max_subscriber_count: 4096
  • mongodb related config
  • cluster related config, i.e. an array with the name and address of the 3 nodes (sketched below)
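
For reference, here is a minimal sketch of what that cluster section looks like, modeled on the sample tinode.conf (the node names and addresses below are placeholders, not my actual values):

```
"cluster_config": {
  "self": "",
  "nodes": [
    {"name": "one",   "addr": "tinode-1.internal:12001"},
    {"name": "two",   "addr": "tinode-2.internal:12001"},
    {"name": "three", "addr": "tinode-3.internal:12001"}
  ]
}
```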

Any help will be appreciated.

Aforge Forgess

Jul 20, 2022, 3:14:54 PM
to tin...@googlegroups.com
How many accounts did you use in your experiments? The stock users.csv?


Aditya Chandak

Jul 20, 2022, 3:42:23 PM
to tin...@googlegroups.com
No, I generated around 10k new users before the test. 


Aditya Chandak

Jul 20, 2022, 3:43:51 PM
to Tinode General
Generated and registered successfully using a custom script*.

Aforge Forgess

Jul 20, 2022, 4:08:42 PM
to tin...@googlegroups.com
I see. So 4096 group members (out of 10'000 accounts) and 512 sessions representing these group members, correct?

Aditya Chandak

Jul 20, 2022, 4:16:27 PM
to Tinode General
Only 512 users were subscribed to (members of) the group.
The max group members number of 4096 is just an upper limit.
If you refer to the Gatling scenario class, the 512 sessions are picked from users.csv (10k accounts) and they subscribe to the given group topic (i.e. join the group) in the subTopic step; they are not members of the group beforehand.
A new group is created for each test run.
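
For context, the subTopic step just sends Tinode's {sub} packet over the websocket, along the lines of (the id and topic values here are illustrative):

```
{"sub": {"id": "tn-42", "topic": "grpE3bYaq3q24U"}}
```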

Aforge Forgess

Jul 20, 2022, 4:19:25 PM
to tin...@googlegroups.com
Thanks for explaining. Let me try to reproduce and I'll get back to you.

Aditya Chandak

Jul 20, 2022, 4:26:49 PM
to tin...@googlegroups.com
Yes, sure! Let me know if you need any more information.

Thanks!

Aforge Forgess

Jul 21, 2022, 11:03:52 AM
to tin...@googlegroups.com
Looking at the information I have right now, my hypothesis is that you may be maxing out networking (in which case you may need machines with bigger NICs).
Can you help me check this?

Cluster setups have a ~2x networking overhead on each node for processing requests:
* on topic proxy nodes: forwarding requests to the topic host node + receiving responses from the topic host node
* on the topic host node: receiving requests from the proxies + forwarding responses back to the proxies.

Note that the topic host node will show higher network and CPU utilization because it has to actually process requests arriving from all topic proxy nodes.
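
As a back-of-envelope illustration (assuming the load balancer spreads the 512 sessions evenly across the 3 nodes): each {pub} to the group fans out into ~512 {data} copies. The topic lives on one node, so the ~2/3 of sessions attached to the other two nodes each cost an extra internal hop per copy (host -> proxy), plus the reverse hop for every forwarded request (proxy -> host). That is roughly where the ~2x traffic comes from.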

c5.xlarge instances allow up to 10 Gbps (~1250 MB/s).
I wonder what you are seeing in your EC2 network (in + out) monitoring charts for each of the instances.

Also, check the output from "$ sudo ethtool -S eth0".
Look for "bw_in_allowance_exceeded" and "bw_out_allowance_exceeded" lines.
If you see non-zero values, you are dropping packets.
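
For example (these counters are reported by the ENA driver; your interface name may differ, and the values below are illustrative):

```
$ sudo ethtool -S eth0 | grep allowance_exceeded
     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 0
     pps_allowance_exceeded: 0
```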

Also, can you attach your tinode log files after you run your 3-node loadtest, please?
And just in case, please take a look at your CPU utilization charts - you shouldn't be maxing out CPU on either the Tinode or the DB machines.

Aditya Chandak

Jul 22, 2022, 5:14:29 AM
to tin...@googlegroups.com

Hi, thanks for the prompt response.

This was one of the first suspicions I had as well, but we are not reaching anywhere close to the available networking limits.

I had the single node and the cluster deployed on an EKS cluster with Prometheus storing the metrics. On the cluster networking dashboard, I don’t see any packets dropped, and the total bandwidth is as follows:

Single node load test
  • Max receive bandwidth: 17.67 MB/s
  • Max transmit bandwidth: 19.72 MB/s

3 node cluster load test
  • Max receive bandwidth: 29.88 MB/s
  • Max transmit bandwidth: 31.25 MB/s

NOTE: These numbers roughly match the ~2x networking overhead that you mentioned.

Running the `ethtool` command on the EC2 instances didn’t show any non-zero values for bw_in_allowance_exceeded or bw_out_allowance_exceeded, so it checks out with the Prometheus stats.

The DB machine is a c5.xlarge EC2 instance with a docker container running on it. The following are numbers from the EC2 metrics page:
  • Peak average CPUUtilization: 6.8%
  • Peak max CPUUtilization: 9.43%

Tinode max single node CPU usage: 26%
Tinode max 3 node cluster CPU usage: 19%, 15%, 17%

Attaching a log file from all 3 nodes during the load test.

--
Thanks and Regards,
Aditya Chandak
3-node-cluster.log

Aforge Forgess

Jul 22, 2022, 6:54:54 AM
to tin...@googlegroups.com
Looking at the logs, the group's broadcast queue appears full. See lines like:
E2022/07/19 19:36:15 hub: topic's broadcast queue is full grpd4nN5Al4tHw
E2022/07/19 19:36:16 s.publish: sub.broadcast channel full, topic  grpd4nN5Al4tHw e5rkwyN3aJ0
Basically, it means the topic is unable to process requests quickly enough and thus the server ends up dropping the requests.
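
To picture what is happening: each topic consumes from a bounded Go channel, and senders drop messages instead of blocking when that buffer is full. A minimal sketch of the pattern (illustrative only, not Tinode's actual code):

```
// Illustrative sketch of the drop-when-full pattern behind these errors.
package main

import "log"

type message struct{ topic string }

// send attempts a non-blocking write into the topic's bounded broadcast channel.
func send(broadcast chan<- *message, msg *message) bool {
	select {
	case broadcast <- msg: // the topic goroutine is keeping up
		return true
	default: // buffer full: the consumer is blocked or too slow, so drop
		log.Printf("hub: topic's broadcast queue is full %s", msg.topic)
		return false
	}
}

func main() {
	broadcast := make(chan *message, 2) // tiny buffer to force a drop
	for i := 0; i < 3; i++ {
		send(broadcast, &message{topic: "grpd4nN5Al4tHw"})
	}
}
```
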
The first typical candidate for it would be your database. Can you check if you are maxing it out (cpu-wise, network-wise, or in any other way) or if there are any errors in your db logs?

Another thing to check is the goroutine stacks at the time the server started dropping these requests, to see what blocks further processing
(you can get them by sending SIGABRT to the tinode servers).
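
For example (assuming the binary is named tinode; adjust for your deployment):

```
# Go dumps all goroutine stacks to stderr on SIGABRT (note: the process exits).
$ kill -ABRT $(pidof tinode)
# Or, if the server runs in a container:
$ docker kill --signal=ABRT <container>
```
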
Can you get these and post them here please?

Aditya Chandak

Jul 25, 2022, 5:09:04 AM
to tin...@googlegroups.com
I understand that the errors occur due to the broadcast queue and channel being full (which can be seen as the 500 errors in the Gatling results). Another type of issue we noticed was the timeout errors in Gatling (the timeout was set to 60s). That means the server didn't reply at all or took too long to reply. Can that be due to the goroutines being 'stuck'?

> The first typical candidate for it would be your database. Can you check if you are maxing it out (cpu-wise, network-wise, or in any other way) or if there are any errors in your db logs?
CPU- and network-wise, we are not stressing the DB machine at all. I have included the MongoDB logs in the attached zip file.

Also attaching the tinode logs after sending SIGABRT to the tinode servers during and after the load tests.

tinode-logs.zip

Aditya Chandak

Jul 25, 2022, 5:10:04 AM
to tin...@googlegroups.com
Attaching MongoDB logs in a separate message due to the attachment size limit.

--
Thanks and Regards,
Aditya Chandak
mongodb-logs.zip

Aforge Forgess

Jul 27, 2022, 10:14:28 AM
to tin...@googlegroups.com
So I am still unable to reproduce your error.
I have run the loadtests with both mysql and mongodb.
My 3-node cluster reports very reasonable response times for the pub-topic-ctrl operation under this load (512 sessions publishing 3 messages at a 5-second interval to a group):

* mongodb (pub-topic-ctrl response times, ms):
Min: 1, 50th pct: 3, 75th pct: 4, 95th pct: 9, 99th pct: 15, Max: 51, Mean: 4, Std Dev: 4

* mysql (pub-topic-ctrl response times, ms):
Min: 5, 50th pct: 7, 75th pct: 9, 95th pct: 19, 99th pct: 56, Max: 91, Mean: 10, Std Dev: 9

Both mysql and mongo were out-of-the-box, very basic single-node setups with literally no tuning.

Note, however, that I ran both the Tinode and the database servers on separate EC2 instances *without* the docker container.
So I recommend you look into your docker configuration.
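
One concrete thing worth checking (an assumption on my part, given the 2.5-CPU limit you mentioned): whether the container is being CFS-throttled by that limit. From inside the container:

```
$ cat /sys/fs/cgroup/cpu/cpu.stat   # cgroup v1 path; under cgroup v2, cpu.stat in the unified hierarchy
nr_periods 1200
nr_throttled 340
throttled_time 21000000000
```

A non-zero nr_throttled (the values above are illustrative) means the limit is actively stalling the process, which would show up as exactly this kind of latency.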

Aditya Chandak

Jul 27, 2022, 10:28:18 AM
to tin...@googlegroups.com
Hi, thanks for getting back.

Can you share the following:
  1. gatling load test class code
  2. gatling result HTML files
  3. EC2 instance specifications
  4. Was each tinode node in the cluster running on a separate EC2 instance?

Aforge Forgess

Jul 27, 2022, 11:31:55 AM
to tin...@googlegroups.com
Sure.
The Gatling test class is essentially your original MultiTopicLoadtestV3 without the getSubs step (it's irrelevant in this scenario):

```
class MultiTopicLoadtestV3 extends TinodeBase {
  // Input file can be set with the "accounts" java option.
  // E.g. JAVA_OPTS="-Daccounts=/tmp/z.csv" gatling.sh -sf . -rsf . -rd "na" -s tinode.Loadtest
  val feeder = csv(System.getProperty("accounts", "/tmp/users.csv")).random
  var allTopics = Array("grp884VPjBjKqc")

  val scn = scenario("WebSocket")
    .exec(ws("Connect WS").connect("/v0/channels?apikey=AQEAAAABAAD_rAp4DJh05a1HAwFT3A6K"))
    .exec(session => session.set("id", "tn-" + session.userId))
    .pause(1)
    .exec(hello)
    .pause(1)
    .feed(feeder)
    .doIfOrElse({session =>
      val uname = session("username").as[String]
      var token = session("token").asOption[String]
      if (token == None) {
        token = tokenCache.get(uname)
      }
      token == None
    }) { loginBasic } { loginToken }
    .exitHereIfFailed
    .pause(1)
    .exec(subMe)
    .exitHereIfFailed
    //.exec(getSubs)
    //.exitHereIfFailed
    .doIf({session =>
      session.attributes.contains("username")
    }) {
      exec(session => session.set("sub", allTopics({session.userId % allTopics.length}.toInt)))
      .pause(1)
      .exec(subTopic)
      .exitHereIfFailed
      .pause(60)
      .exec(publish)
      .pause(60)
      .exec(leaveTopic)
      .pause(0, 3)
    }
    .exec(ws("close-ws").close)

  setUp(scn.inject(rampUsers(numSessions) during (rampPeriod.seconds))).protocols(httpProtocol)
}

```

I ran 4 loadtesters simultaneously as

$ JAVA_OPTS="-Daccounts=./users.csv -Dnum_sessions=128 -Dramp=60 -Dpublish_count=3 -Dpublish_interval=5" gatling.sh -sf . -rsf . -rd "na" -s tinode.MultiTopicLoadtestV3

All servers and loadtesters were run on separate EC2 t3.micro instances
(database + 3 tinode servers + 4 loadtesters = 8 instances).

Result files are packed in mongo.tar.gz.

Gene

Jul 27, 2022, 12:52:06 PM
to Tinode General
It's pretty clear by now that the issue is not with Tinode but rather with your setup. Given that you work for a for-profit company, please consider asking your company to pay us for further consulting.

Thank you.

Aditya Chandak

Aug 19, 2022, 3:29:00 AM
to Tinode General
Thank you for taking the time to reproduce the scenario @aforge. I set up the exact same infrastructure as you described and didn't get any errors either.

As suggested, the issue was in the previous setup. I am still trying to figure out what the issue could be. I'll update this thread if I find anything meaningful.

The only inference I am able to make is that running the Tinode binary directly performed better than running it in a docker container.

