Channels, Websockets and 'Backpressure'


hank...@gmail.com

Dec 1, 2016, 2:26:35 PM
to Django users

Can someone help me understand the concept of websocket “backpressure” in a Django Channels project? What is it? How do I diagnose it? At what level of the stack does it occur? How do I cure it? The docs are a little hazy on this.


I wired up a quick Channels project for my mid-sized website. Before deploying the project, I load-tested it with thor and started scaling up. When I reached two Daphne processes and four worker processes, it seemed like I had enough power behind the project to handle the load on my site. It was able to handle 2000 simultaneous websocket connections without errors, according to thor. That should have been more than enough.


I deployed, and everything went fine for a while. After a bit, though, the websockets got slow and the server started to drop connections. Eventually the whole project stalled out. I looked through the Daphne logs and found a flurry of lines like this:


2016-12-01 14:35:14,513 WARNING WebSocket force closed for websocket.send!QbxCqPhvyxVt due to receive backpressure


I restarted all the server and worker processes to no effect. I was able to put the project back online by manually deleting all the “asgi:*” keys in Redis. But then, after a while, the backpressure built up and everything crashed again.


The problem, I suppose, has something to do with the high frequency of messages being passed over the websocket in this particular project. A click triggers a message in each direction, and people were encouraged to click rapidly. So I probably have to throttle this, or else launch more workers and/or servers.


But I'd like to know what, specifically, triggers these “backpressure” disconnections, and where I might look to monitor them /before/ errors start to occur. In one of the Redis queues, I suppose? If so, which one(s) – inbound or outbound? I suppose my idea, here, is that I might be able to automatically scale up if the queues start to fill up.


Thank you in advance. Fun project!

Andrew Godwin

Dec 1, 2016, 3:34:22 PM
to django...@googlegroups.com
"Backpressure" is designed exactly for what you describe, which is when clients are making requests of the server faster than you can handle them. Each channel has a maximum capacity of messages (100 by default), beyond which trying to add a new one results in an error.

Webservers, when they see this, return an error to the client to try and resolve the overload situation. If they didn't, then the server would clog up trying to buffer all the pending requests. It's like returning a 503 error on a webpage when a server is overloaded.

To solve the situation, just provision more workers so the channel is drained as fast as messages are put onto it.

If you want to monitor the size of channels to anticipate this stuff, there's a plan for an API in ASGI that would let you do that but it's not in place yet. You may look at the length of the Redis lists directly in the meantime if you wish (there's one list per channel).
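In the meantime, here is a minimal monitoring sketch using redis-py. It assumes the default "asgi:" key prefix that asgi_redis uses (the same prefix you saw when you deleted the keys by hand); the connection details and the warning threshold are placeholders you would adjust for your deployment:

    import redis

    # Connect to the same Redis instance the channel layer uses
    # (host/port/db here are placeholders).
    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    for key in r.scan_iter(match="asgi:*"):
        # Only the per-channel queues are Redis lists; other "asgi:" keys
        # (message bodies, group memberships) are different types, so skip them.
        if r.type(key) == b"list":
            backlog = r.llen(key)
            if backlog > 50:  # arbitrary threshold, tune it to your capacity setting
                print("channel backlog:", key.decode(), backlog)

Run from cron or your monitoring system, something like this would let you see a channel backing up (and trigger scaling) before the capacity limit is actually hit.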

Andrew




Hank Sims

Dec 1, 2016, 3:48:02 PM
to Django users
Thanks, Andrew. A few follow-up questions:

1. How would one go about increasing the default maximum queue size? I saw some reference to this when I was researching the problem yesterday, but I couldn't find the setting that would change it. 

2. Shouldn't there be a way to resolve the backpressure by draining the queue before allowing new messages to be written to it? It seems like cutting the connection between client and server would exacerbate the problem rather than remedy it. In my particular case, it wouldn't be that big a deal if a block of messages were skipped. But closing the socket when the maximum queue size is reached seems to cause a cascade of problems.

Thanks for your response, and even more thanks for your work on this project.

Andrew Godwin

Dec 1, 2016, 3:52:15 PM
to django...@googlegroups.com
On Thu, Dec 1, 2016 at 12:48 PM, Hank Sims <hank...@gmail.com> wrote:
> Thanks, Andrew. A few follow-up questions:

> 1. How would one go about increasing the default maximum queue size? I saw some reference to this when I was researching the problem yesterday, but I couldn't find the setting that would change it.

You set it in the channel layer configuration in Django, like this: https://github.com/django/asgi_redis/#usage
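For the record, here is roughly what that looks like in settings.py. This is a sketch based on the asgi_redis options documented at the link above; the routing path and the numbers are placeholders, not recommendations:

    # settings.py
    CHANNEL_LAYERS = {
        "default": {
            "BACKEND": "asgi_redis.RedisChannelLayer",
            "ROUTING": "myproject.routing.channel_routing",  # placeholder path
            "CONFIG": {
                "hosts": [("localhost", 6379)],
                # Default per-channel capacity (the 100 mentioned above):
                "capacity": 100,
                # Per-channel overrides, e.g. deeper queues for reply channels
                # like the websocket.send!... one in your log line:
                "channel_capacity": {
                    "websocket.send!*": 200,
                    "websocket.receive": 500,
                },
            },
        },
    }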
 

> 2. Shouldn't there be a way to resolve the backpressure by draining the queue before allowing new messages to be written to it? It seems like cutting the connection between client and server would exacerbate the problem rather than remedy it. In my particular case, it wouldn't be that big a deal if a block of messages were skipped. But closing the socket when the maximum queue size is reached seems to cause a cascade of problems.

How would you propose this worked? The only alternative to closing the socket is to buffer the messages in memory and retry sending them, at which point you might have the case where the client thinks it has a working connection but nothing has actually been delivered for 30 seconds. Hard failure is preferable in distributed systems in my experience; trying to solve the problem with soft failure and retry just makes problems even more difficult to detect and debug.
 
Andrew

Hank Sims

Dec 1, 2016, 4:04:17 PM
to django...@googlegroups.com
> You set it in the channel layer configuration in Django, like this: https://github.com/django/asgi_redis/#usage

Ah, thank you. Sorry I missed that. 
 
> How would you propose this worked? The only alternative to closing the socket is to buffer the messages in memory and retry sending them, at which point you might have the case where the client thinks it has a working connection but nothing has actually been delivered for 30 seconds. Hard failure is preferable in distributed systems in my experience; trying to solve the problem with soft failure and retry just makes problems even more difficult to detect and debug.

I guess the "hard failure" I would prefer in this case -- though maybe not all cases -- is simply discarding new outbound messages when their queue is full. Or else some sort of mechanism from within my consumers.py that would allow me to forgo writing to a channel if its queue is full.

Andrew Godwin

Dec 1, 2016, 11:06:24 PM
to django...@googlegroups.com
You already get this - trying to send to an outbound channel when it is full will raise the ChannelFull exception. What you're seeing is the inbound channel filling up, and the ASGI spec says that websocket protocol servers should drop the connection if they can't send an incoming message from a socket.
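To make that concrete, here is a minimal sketch of the outbound case, assuming a Channels function-based consumer (the consumer name and payload are hypothetical; the exception class is the one the channel layer exposes per the ASGI spec):

    # consumers.py
    def ws_message(message):
        # ChannelFull is exposed as an attribute of the channel layer itself.
        ChannelFull = message.channel_layer.ChannelFull
        try:
            # Push a reply onto this client's outbound channel.
            message.reply_channel.send({"text": "ack"})
        except ChannelFull:
            # The outbound queue for this client is at capacity; skip the
            # message instead of buffering it, as discussed above.
            pass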

Andrew

Luís Antonio De Marchi

Apr 27, 2017, 9:22:12 AM
to Django users
First, I need to ask for your patience with my English; a translator is helping me.

We are creating a project that we forecast will need to handle millions of connections per second. We've never worked with websockets before; I had heard that Crossbar.io was better, but I've been playing with Django Channels for some time and I love it.

We are about 60% of the way through the project with Django Channels, and I discovered a stress-testing tool called "tsung". With that test I was able to trigger the error message mentioned above.

I also heard that Django Channels with Docker is fully scalable, but how do you actually scale it?

Sorry for my ignorance; I'm very worried that the project will be a failure at launch. It may never reach the expected numbers, but the system will appear on a television network, and there is a chance it will actually see that load (at least for a few minutes).

Andrew Godwin

Apr 27, 2017, 1:16:50 PM
to django...@googlegroups.com
Hi Luis,

If you are getting ChannelFull exceptions under load, it means the channels are not being drained fast enough and you need more worker processes.

If you are using Docker to run your workers, run multiple copies of the same worker container; all of the workers will connect to the same Redis server and together they will drain the channels faster.
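As a concrete illustration (the service name and the count below are hypothetical, not something from this thread): if your compose file defines a "worker" service whose command is the Channels worker process, scaling is just a matter of starting more copies of it against the same Redis host:

    # command each worker container runs
    python manage.py runworker

    # start eight copies of that service
    docker-compose scale worker=8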

Andrew
