Messages silently dropped when load is high


MichaelC

Nov 14, 2011, 2:35:27 PM
to nginxpushstream
Hi

Firstly, I'd just like to say that this is a great tool and from some
evaluations I've done versus other notification servers, it would seem
to be lightning fast. However, I've run into an issue and I wonder if
you've seen this before.

I run nginx on one server and have 2 other servers for simulating
publishers and subscribers. The publisher machine opens up a single
TCP connection and writes a configurable number of pipelined POSTs per
second at varying degrees of burstiness. Each POST contains a message
payload of 140 bytes and is responded to correctly by nginx according
to the protocol, e.g.:
HTTP/1.1 200 OK
Server: nginx/1.0.8
Date: Mon, 14 Nov 2011 19:00:17 GMT
Content-Type: application/json
Content-Length: 100
Connection: keep-alive

{"channel": "my_channel_0", "published_messages": "1",
"stored_messages": "0", "subscribers": "0"}

The third server runs a process which opens a configurable number of
connections to the nginx machine. I'm using the module in its default
HTTP streaming mode.

When I open 10,000 connections, it seems to behave quite nicely.
Sending half a million messages, I am able to get a throughput of
around 9,000 messages per second. At this rate, "top" shows the nginx
process as high as 90% CPU. If I push it harder, I start to receive
SIGIO in the main nginx log and the writer/poster is throttled down,
meaning a lower throughput, but all messages appear to get through to
the clients on the other machine.
However, when I perform the same tests but with 50,000 connections I
see a similar pattern of throughput up to about 6,000 or 7,000
messages/second. As before, when I push faster I get the same SIGIO in
the log, but the difference is that not all the messages get through
to clients!

Obviously, my first suspicion was my testing code, so I replicated the
issue using just one channel and netcat as the "client", and I get
similar behaviour.

What does the push stream module do when it receives new messages at a
higher rate than it is able to forward them on to clients? If it
silently drops them, that would be a big barrier to us using it, which
would be a real shame because it's so much faster than the competition.

I'm using version 0.3.1 built against nginx 1.0.8 on 32-bit Ubuntu
11.04 with 4 GB of RAM and a 5 or 6 year old dual core CPU. I'm not
storing messages (push_stream_store_messages off), I'm using a
shared_memory_size of 800m, and I'm happy to share my config file if
it helps.
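
The relevant parts look roughly like this (simplified, and from memory,
so the exact directive names may not match the 0.3.x README exactly):

http {
    push_stream_shared_memory_size  800m;

    server {
        listen 80;

        # publisher endpoint used by the posting machine
        location /pub {
            push_stream_publisher;
            push_stream_channels_path   $arg_id;
            push_stream_store_messages  off;
        }

        # subscriber endpoint, default HTTP streaming mode
        location ~ /sub/(.*) {
            push_stream_subscriber;
            push_stream_channels_path   $1;
        }
    }
}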

Thanks

Michael.

Wandenberg Peixoto

Nov 14, 2011, 3:16:13 PM
to nginxpu...@googlegroups.com
Hi Michael,

thanks for testing the module and sharing your numbers.

About your problem, I have a suspicion. When you do not store messages on the server (push_stream_store_messages off), the message objects are kept in shared memory only for the time set by push_stream_shared_memory_cleanup_objects_ttl, whose default is 30 seconds.
Depending on the number of workers you have configured, which I imagine is not very high on a dual core, the time needed to deliver a message to all 50,000 subscribers under a heavy publishing load will probably be longer than 30 seconds, so messages will be discarded before they have been delivered to all subscribers.

Try repeating your tests with this time set to a higher value, for example push_stream_shared_memory_cleanup_objects_ttl 5m. It may also be necessary to allocate more shared memory.
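
For example, in the http block of your nginx.conf (next to the shared memory size you already set):

push_stream_shared_memory_cleanup_objects_ttl  5m;    # default is 30s
push_stream_shared_memory_size                 800m;  # increase if messages now live longer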

When you do this, let me know the results.

Thanks,
Wandenberg

MichaelC

Nov 15, 2011, 12:25:15 PM
to nginxpushstream
Many thanks for your explanation. Your suspicion was correct. I was
using the default 30 sec for that parameter. I tried upping it to 5m
and I was able to receive messages more reliably with 50,000 clients
connected. Sometimes, however, the rate at which messages were sent
from nginx slowed right down, e.g. I could get 9,000/sec for a
sustained minute or so, and then when the poster stopped posting, the
rate of messages would slow almost (but not quite) to a stop until all
messages were successfully sent.

In any case, the main question I have is this: when the hardware
resources can't cope with the message flow and messages need to be
discarded, shouldn't the push stream module report an error to the
publisher? Surely a 200 response is not reasonable in that case? I'm
not trying to pick small faults with the module - as I said previously,
I think it's great. It's just that if publishers can't reliably send
messages to clients, I think its use (certainly for us) is limited.

Wandenberg Peixoto

Nov 15, 2011, 1:24:51 PM
to nginxpu...@googlegroups.com
Hi Michael,

in fact, sending and receiving messages are two distinct operations.

When a publisher sends a message, it receives a 200 response once all requirements are met and the message has been queued for delivery to subscribers. It is not synchronous with delivery.
As I said before, your messages are being queued for delivery without any problem, keeping up with your publishing flow. But each message spends less time in the queue than is needed to deliver it to all 50,000 subscribers on your servers, so it is discarded. That is why increasing the time to 5 minutes makes things better.
I don't see that as a problem; the module is configurable to meet your application's requirements.

When system resources are running out, the publisher will receive the proper error response; you can see that by setting a small shared memory size, for example.

An alternative to increasing the cleanup objects ttl is to store messages on the server for some time, maybe the same 5 minutes.
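
Something like this, for example (I am quoting from memory, so please check the exact directive names against the README for your 0.3.x version):

location /pub {
    push_stream_publisher;
    push_stream_channels_path   $arg_id;
    push_stream_store_messages  on;
    # also set the directive that controls how long stored messages are
    # kept (its name changed between releases) to 5m
}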

If you need help configuring the server, describe your application and I will be pleased to help you.

Regards,
Wanden

MichaelC

Nov 15, 2011, 2:35:27 PM
to nginxpushstream
Hi Wanden

I believe I understand your reply. Essentially, you are saying that if
the POST message is syntactically correct etc. and is accepted onto
the message queue, it is given a 200 OK. I also understand that you
wouldn't want to make the service synchronous and block the publisher
until the message is received by the client. That makes sense.

However, if nginx+pushstream becomes so overloaded that it starts
to drop messages, then I would think it desirable for it not to accept
any new messages whilst it is in that state. Wasn't the HTTP 503 code
designed for this?
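
Something like this is what I had in mind (purely hypothetical, not
what the module does today), returned to the publisher instead of a
200 while messages are being dropped:

HTTP/1.1 503 Service Unavailable
Retry-After: 5
Content-Length: 0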

If I can't convince you of this, maybe you could give me a couple of
pointers on what to look for in the code? Perhaps it could be based
upon a threshold of messages in the queue? I'm comfortable with C/C++
on Linux, but obviously I'd rather not learn a new code base if I
don't have to.... 8-)

The application is for a scenario of (potentially) low millions of
connected devices being able to push messages back to the cloud
(potentially) nearly simultaneously.

I'm happy to share the source of my homegrown test tool (also written
using epoll!) that simulates large numbers of connected clients, if
that's any use.


By the way, I _have_ seen proper errors for out-of-memory etc.

Michael.

Wandenberg Peixoto

Nov 15, 2011, 3:12:34 PM
to nginxpu...@googlegroups.com
Hi Michael,

the current code doesn't have any mechanism to detect when it is overloaded, and any threshold based only on the number of messages in the queue or the number of subscribers would not be reliable.
But your use case gives me an idea for improving the code.
Give me a week (or two) to do that. I will try to add some kind of reference counting before marking a message as disposable.
For now, leave that ttl at 5 minutes or more and do your tests; you will not be disappointed :)

I think the only limit on publishing should be the shared memory space.
The first version of this module was designed for a high number of subscribers; it was later improved to support a large number of publishers, and now I have to deal with both at the same time.

What I mean is, if I keep each message in shared memory for the time necessary to deliver it to all subscribers and only then mark it as disposable, the memory may reach its limit, and the publisher will receive the proper error response without any arbitrary threshold.

About your test tool, I would be grateful if you could send it to me; it will help me test the module.

Thanks again for your tests.

Wanden