NATS in cluster mode and queuing


nut

Apr 4, 2017, 4:33:01 AM
to nats
Hi,
I have a question regarding nats in cluster mode and clients in queue mode.

I have the following setup: 3 servers in full mesh mode.

E.g. I start server1 with:
./gnatsd -DV -m 8222 -cluster nats://0.0.0.0:6222 -routes nats://<ip-server2>:6222,nats://<ip-server3>:6222
and server2 with:
./gnatsd -DV -m 8222 -cluster nats://0.0.0.0:6222 -routes nats://<ip-server1>:6222,nats://<ip-server3>:6222
and server3 with:
./gnatsd -DV -m 8222 -cluster nats://0.0.0.0:6222 -routes nats://<ip-server1>:6222,nats://<ip-server2>:6222

Then I start three clients (ruby nats-pure) in queue mode.
Each client connects to a different server.
For example:
client1 --> Connected to nats://<ip-server3>:4222
client2 --> Connected to nats://<ip-server2>:4222
client3 --> Connected to nats://<ip-server1>:4222

Next I publish 30 messages with another client.

All servers are still up and running and the clients receive the following message counts:

client1 --> 15
client2 --> 10
client3 --> 5

All is fine: 30 messages published, 30 messages received.
Now I shut down server3 and publish 30 messages again.

client1 --> does not reconnect, it receives 0 messages
client2 --> gets 10 messages
client3 --> gets 10 messages

So 10 messages are missing. I had assumed that no message would be lost.

- Is it a problem with my test code? (included below)
- Is it a problem with the ruby client?
- Is queue mode a problem when the servers run in cluster mode?
- Is streaming mode necessary for this?

>snip client-code
require 'rubygems'
require 'nats/io/client'

nats = NATS::IO::Client.new

nats.on_error do |e|
  puts "Error: #{e}"
end

nats.on_reconnect do
  puts "Reconnected to server at #{nats.connected_server}"
end

nats.on_disconnect do
  puts "Disconnected!"
end

nats.on_close do
  puts "Connection to NATS closed"
end

rcv_msgs = 0
cluster_opts = {
  servers: ["nats://192.168.0.123:4222",
            "nats://192.168.0.124:4222",
            "nats://192.168.0.125:4222"
           ],
  dont_randomize_servers: false,
  reconnect_time_wait: 0.5,
  max_reconnect_attempts: 1
}
nats.connect(cluster_opts)
puts "Connected to #{nats.connected_server}"

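# Join the 'donkey-queue' queue group so each message goes to only one member.
nats.subscribe('donkey', queue: 'donkey-queue') do |msg|
  rcv_msgs += 1
  puts "Received '#{msg}' #{rcv_msgs}"
end

# Keep the process alive so the subscription keeps receiving messages.
loop do
  sleep 0.1
end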

>snip

Thomas






Waldemar Quevedo

Apr 4, 2017, 2:12:47 PM
to nat...@googlegroups.com
Hi Thomas,

If the messages were published during the `reconnect_time_wait: 0.5` (seconds) wait,
while the client was not connected, that client would not receive them. For that type
of usage you could take a look at the ruby-nats-streaming client (https://github.com/nats-io/ruby-nats-streaming).

As for the queue subscribe usage, it is correct: once the server detected the disconnection
of the first client, the other 2 subscribers in the group should have received the messages (e.g. [15,15])...
I can't reproduce the failure mode locally with the ruby client, so I wonder if we could open a GH issue
in the pure-ruby-nats repo with the logs from the nats servers to investigate further?

Also btw, in case you don't want the client to give up reconnecting you could set `max_reconnect_attempts: -1`,
as with `max_reconnect_attempts: 1` it will try to reconnect to the servers only once after disconnecting.
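
As a minimal sketch of that suggestion, here are the cluster options from the test script with only `max_reconnect_attempts` changed. The addresses are the ones from the original post, and the `connect` call is left as a comment so the snippet runs without the gem installed.

```ruby
# Cluster options from the test script, changed so the client keeps trying
# to reconnect instead of giving up after a single attempt.
cluster_opts = {
  servers: [
    "nats://192.168.0.123:4222",
    "nats://192.168.0.124:4222",
    "nats://192.168.0.125:4222"
  ],
  dont_randomize_servers: false,
  reconnect_time_wait: 0.5,    # seconds to wait between reconnect attempts
  max_reconnect_attempts: -1   # -1: never give up reconnecting
}

# With the nats-pure gem installed, pass the hash to connect as in the
# original script:
#   require 'nats/io/client'
#   nats = NATS::IO::Client.new
#   nats.connect(cluster_opts)
puts "max_reconnect_attempts = #{cluster_opts[:max_reconnect_attempts]}"
```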

Thanks,

- Wally



--
You received this message because you are subscribed to the Google Groups "nats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to natsio+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nut

Apr 5, 2017, 7:44:25 AM
to nats
Hi Wally,

I tried the java-client today with the following scenario (queuing mode):
The cluster of 3 servers is running.
I have two clients, the first one is connected to server1 and the second to server3.
With a third client I publish 10 messages.
The client connected to server1 gets 7 messages, the other one gets 3. Perfect.

Now I shut down server3.

In the terminal I see no reconnect attempt by the client which was connected to server3.

I publish 10 messages again. I see that only 6 messages are received by the first client (which is still connected to server1).

The other 4 messages are missing...

Now I boot server3 again but do not start gnatsd on it.

However, once server3 is running again I see these messages in the second client's terminal:

Got disconnected from nats://<ip-of-server1>:4222!
Got reconnected to nats://<ip-of-server1>:4222!

This output is wrong because the client was connected to server3 (not server1).

pause....

Now a second experiment: like the first one, but this time I do not shut down server3; I only kill the gnatsd process on server3. The machine itself keeps running.

And in this case I see a reconnect of the client!
It says:
Got disconnected from nats://<server1-ip>:4222!
Got reconnected to nats://<server1-ip>:4222!
The client had actually been connected to server3; the wrong URL in the output comes from how my program prints it.
((
  The java-client source -->
    nc.setDisconnectedCallback(event -> {
      System.out.printf("Got disconnected from %s!\n", event.getConnection().getConnectedUrl());
    });
    nc.setReconnectedCallback(event -> {
      System.out.printf("Got reconnected to %s!\n", event.getConnection().getConnectedUrl());
    });
))


The good news: no message is lost. When I publish 10 messages, the two clients together receive all 10 messages.

The ruby client behaves the same way. The client reconnects when the gnatsd process is killed but not when I shut down the whole server.

Is this the expected behavior?
Because in our OpenStack cloud it is easy to shut down/restart a server via the web interface, I simply chose this way to imitate a server failure.
I think this could happen in real life too.

Thomas

R.I.Pienaar

Apr 5, 2017, 9:01:25 AM
to nat...@googlegroups.com
When it shuts down this way, is it like a power-off or a clean shutdown?

If it is a power-off, the client will take a while to realise the server is dead;
that's normal.

nut

Apr 6, 2017, 4:00:54 AM
to nats
Hi,

The server runs in an OpenStack cloud. I use the web frontend and there the "Shut Off instance" menu entry.

This is from the documentation:
"Shut off will prompt the operating system to shutdown gracefully, power off the virtual machine, but preserve the compute resources (CPU and RAM) allocated on a compute node. Instances in this state are still charged as if they were running. When re-started it will continue to be hosted on the same compute node."

>If power off the client will take a while to realize the server is dead, thats normal

What is typically "a while"? I think the client never reconnects when the server goes down this way.

Thomas

R.I.Pienaar

Apr 6, 2017, 4:33:48 AM
to nat...@googlegroups.com



If it was indeed a graceful shutdown it should be quicker: the TCP
session will be shut down via normal TCP mechanisms and the client will
be disconnected, triggering a reconnect.

Should the machine just vanish, i.e. as if you yanked out the power
cable, then the situation is different. With no effort on the protocol
side, detection can take hours, influenced by things like your OS TCP
keepalive settings.

NATS however does make an effort in its protocol and clients: it has
client-to-server PING and server-to-client PING, and the client uses these
to verify the TCP session. It will wait for a few such failed pings before
closing the connection and reconnecting.

For the ruby client settings like :reconnect_time_wait, :ping_interval,
:max_outstanding_pings and :connect_timeout will all contribute to how
long a client might be missing for.
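
To make the ping mechanism concrete, here is a deliberately simplified model of it. It is not the client's actual code: the class name `PingMonitor` and its methods are made up for illustration, and it only counts unanswered pings the way the description above suggests.

```ruby
# Simplified model of ping-based liveness checking (hypothetical class,
# not taken from the nats-pure source).
class PingMonitor
  attr_reader :outstanding

  def initialize(max_outstanding_pings:)
    @max_outstanding_pings = max_outstanding_pings
    @outstanding = 0
  end

  # Called once per ping interval. Returns :stale when too many pings are
  # still unanswered; otherwise records a new ping and returns :ok.
  def tick
    return :stale if @outstanding >= @max_outstanding_pings
    @outstanding += 1
    :ok
  end

  # Called when a PONG arrives. This model simply clears the counter.
  def pong
    @outstanding = 0
  end
end

monitor = PingMonitor.new(max_outstanding_pings: 2)
results = 3.times.map { monitor.tick } # the server never answers
p results # => [:ok, :ok, :stale]
```

In this model a silently dead server is only noticed on the third interval, which is why a long ping interval translates into a long time before the reconnect starts.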

nut

Apr 6, 2017, 7:57:03 AM
to nats
You are right. It was NOT a graceful shutdown. I looked in the syslog.
When I shut down the server gracefully, e.g. with "shutdown -h now", the clients reconnect. Good news.

However, how can the clients protect themselves in the case of a yanked-out power cable?
I think we can live with some seconds for a reconnect but not hours.

Do the ruby client settings such as :ping_interval apply only in the case of a graceful shutdown, or are they of any help in the case of a power cut too?

Thomas

R.I.Pienaar

Apr 6, 2017, 8:19:26 AM
to nat...@googlegroups.com
In the case of a graceful shutdown the TCP protocol informs the client
immediately, no problems.

In cases of power-yank-like outages, which would include things like
routers/NAT gateways between you and the NATS daemon going down,
firewalls killing your packets etc., the pings will detect your outage.

So basically it comes down to:

- how frequently do you want to ping the server? (:ping_interval)
- how many pings should go unanswered before the machine is considered dead? (:max_outstanding_pings)
- how long should it wait before attempting a reconnect? (:reconnect_time_wait)
- if it tries to reconnect to another dead server, how long should it keep trying? (:connect_timeout)
- how many times should it try to reconnect at most? (:max_reconnect_attempts)

Together these exist to handle unexpected failure scenarios both in
established connections and while establishing a new connection.

The right values for these vary greatly by your situation. Do you have
1000s of clients? Don't be too aggressive or you'll trigger your kernel's SYN
DDoS protection. Do you have few clients but really care about a short
delay? Make them more aggressive, etc.

You can calculate from these how long an outage is likely to last for
your scenario in a power-out/router-down/NAT/firewall stale-session etc.
scenario.
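
A rough sketch of such a calculation follows. The formula is an assumption made for illustration (detection takes one ping interval more than the number of allowed unanswered pings, then each dead server tried costs a connect timeout plus a wait); it is not taken from the client source, and the method name and sample values are hypothetical.

```ruby
# Rough worst-case estimate, in seconds, of how long a client may stay
# disconnected after an ungraceful server failure. The formula is an
# approximation for illustration only.
def estimated_outage_seconds(ping_interval:, max_outstanding_pings:,
                             reconnect_time_wait:, connect_timeout:,
                             dead_servers_tried: 0)
  # Time until enough pings have gone unanswered to declare the server dead.
  detection = ping_interval * (max_outstanding_pings + 1)
  # Time spent on further dead servers before reaching a live one.
  retries = dead_servers_tried * (connect_timeout + reconnect_time_wait)
  detection + reconnect_time_wait + retries
end

relaxed = estimated_outage_seconds(
  ping_interval: 120, max_outstanding_pings: 2,
  reconnect_time_wait: 2, connect_timeout: 2
)
aggressive = estimated_outage_seconds(
  ping_interval: 5, max_outstanding_pings: 2,
  reconnect_time_wait: 0.5, connect_timeout: 2, dead_servers_tried: 1
)
puts relaxed    # => 362
puts aggressive # => 18.0
```

The ping interval dominates the result, so it is the first setting to look at when minutes of silence are not acceptable.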

--
R.I.Pienaar / www.devco.net / @ripienaar

nut

Apr 6, 2017, 8:58:30 AM
to nats
Thank you for the answer. I played around with the options and decreased ping_interval to 5 seconds (the default in ruby nats-pure was 120).
And indeed: the client reconnected, which is good.
BUT: I still observe a difference in queue mode between a graceful and a non-graceful shutdown.
In both cases the clients now reconnect, but only in the graceful scenario do the connected clients receive ALL published messages.
An example:
I publish 10 messages and we have 3 clients in queue mode. Together they receive all of the messages.
Now I gracefully shut down one server. The client that was connected to the now-dead server reconnects.
I publish 10 messages again. The 3 clients again receive all 10 messages. Great.

But in the same scenario I have a problem when the shutdown was not graceful. After the reconnect the clients do not receive all 10 messages but
only 8, say, or another time 7.
What could be the reason?




R.I.Pienaar

Apr 6, 2017, 9:03:00 AM
to nat...@googlegroups.com
I am not sure how the internals of NATS work wrt queueing, but I can
imagine what happens is that, like client->server, the server<->server
connections also take time to determine that the other server is dead,
since they use the same mechanisms.

While it thinks the now-dead server is still alive, it will send stuff
there for its clients and potentially things get lost since NATS has no
persistence.

This is totally just guesswork, one of the devs should confirm.

Ivan Kozlovic

Apr 6, 2017, 9:39:23 AM
to nats
R.I. Pienaar is spot on. What happens between server and client also happens between servers.

When the server receives a message and decides to send it to the other server for its local queue subscribers, it does not know that the server is dead. Using the ping/max out in the server configuration will also help reduce the time it takes to discover that the machine is gone. With a higher message rate, the socket buffer will fill up and the server will close the connection to the other server after a write deadline of 2 seconds.

But the bottom line is this: you may be making the wrong assumption that no message is ever going to be lost.

If you simply publish a message, you have no guarantee that it will be received by the server, or routed properly, or received by a subscriber, or processed by the application. You can get that guarantee, at the risk of duplicate processing, if your publisher sends a request and waits for a reply from the subscriber. Only then would you have the guarantee that a given message was processed. You could get a duplicate because the reply from the subscriber could be lost, or the application could crash after processing the message but before sending the reply, etc... The requestor would get a timeout and so would have to decide what to do (resend, report as possibly lost, etc.).
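
The request-and-wait pattern can be sketched as a small retry helper. The helper name `deliver_with_retry` is hypothetical, and the block stands in for whatever request call your client offers; the assumption here is that a missing reply surfaces as `Timeout::Error`, so a real client's timeout exception would need to be mapped accordingly.

```ruby
require 'timeout'

# Calls the block (which should send a request and wait for the reply) up to
# max_attempts times. A Timeout::Error means no reply arrived in time, so the
# message may or may not have been processed; resending risks a duplicate.
def deliver_with_retry(max_attempts: 3)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Timeout::Error
    retry if attempt < max_attempts
    raise # give up and let the caller report the message as possibly lost
  end
end

# Stand-in for a real request: the first two attempts time out and the
# third one is answered.
reply = deliver_with_retry(max_attempts: 3) do |attempt|
  raise Timeout::Error, "no reply" if attempt < 3
  "ack-#{attempt}"
end
puts reply # => ack-3
```

Because a resend can follow a request that was in fact processed, the subscriber side should tolerate duplicates, for example by tracking message IDs it has already handled.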

nut

Apr 6, 2017, 10:31:13 AM
to nats
Ok, thank you for the explanation. We do not have the requirement that not a single message may be lost.
I would like to experiment with the "ping/max out in the server configuration" but I cannot find the relevant documentation.
I searched in the README.md and here: http://nats.io/documentation/server/gnatsd-config .

Another question regarding documentation. gnatsd has this cool HTTP monitoring interface -->
http://nats.io/documentation/server/gnatsd-monitoring/
Is there a place where I can find the meaning of all these attributes, like

/varz
"ping_interval": 120000000000,  #Bla bla

Thomas

R.I.Pienaar

Apr 6, 2017, 10:46:14 AM
to nat...@googlegroups.com
yes would also love to see better docs for these, I read the code to
figure most of them out :(

nut

Apr 7, 2017, 2:59:26 AM
to nats
Ok. But maybe a complete config file with some inline comments for the non-intuitive settings would be a good and achievable starting point.
I would be interested ;-)

Ivan Kozlovic

Apr 7, 2017, 11:01:46 AM
to nats
For the server configuration of ping interval and max pings, you would use "ping_interval" and "ping_max" (https://github.com/nats-io/gnatsd/pull/372/files), expressed in seconds.
The value that you see in /varz is expressed as a Go duration, whose default unit is nanoseconds.
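
As a small worked example of that unit difference, the value quoted earlier in the thread converts as follows. The JSON string is an illustrative fragment containing only that one field, not a full monitoring response.

```ruby
require 'json'

NANOSECONDS_PER_SECOND = 1_000_000_000

# Illustrative fragment with only the field quoted in this thread.
varz = JSON.parse('{"ping_interval": 120000000000}')

# Monitoring output is in nanoseconds; the config file expects seconds.
ping_interval_seconds = varz["ping_interval"] / NANOSECONDS_PER_SECOND
puts "ping_interval = #{ping_interval_seconds}s" # => ping_interval = 120s
```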

We all agree that our docs require a lot of attention. We are working on it.
Thanks for the feedback!

Ivan.