**Reproduction**
* Create a view that returns a large response body (see the sketch after these steps). I haven't tried to find a minimum size at which it reproduces, but I was testing with a ~50 KB `JsonResponse`. The bigger the better for ensuring reproduction.
* Serve the project with `runserver`
* Request the view with `Connection: close` - this is the important bit
* The server will close the connection before the response is fully
received by the client. Using Wireshark I can see that the server sends a
FIN packet before all of the data has been sent.
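For concreteness, a sketch of a view and client along these lines (the names `big_payload`, the URL, and the port are mine, purely illustrative):
{{{#!python
# views.py - hypothetical view; any sufficiently large response body works.
from django.http import JsonResponse

def big_payload(request):
    # Roughly 50 KB of JSON; bigger makes the truncation easier to observe.
    return JsonResponse({'items': ['x' * 100 for _ in range(500)]})

# client.py - stdlib client forcing a non-persistent connection.
import http.client

conn = http.client.HTTPConnection('localhost', 8000)
conn.request('GET', '/big-payload/', headers={'Connection': 'close'})
response = conn.getresponse()
body = response.read()  # with the bug, this fails or returns a short body
print(len(body))
}}}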
I believe those to be generic reproduction steps. In my actual testing and
debugging the request was a POST, but I don't think that factors in at
all. I was also running Django from inside the `python:3.9` Docker image.
**Analysis**
I believe what is happening here is that the socket is being shut down before the OS has drained the send buffer, and that shutdown causes the OS to discard the remaining contents of the send buffer and send a FIN packet rather than continue transmitting them.
If you trace through the code to see where the response is sent, you end
up (with CPython 3.9.2 at least)
[https://github.com/python/cpython/blob/v3.9.2/Lib/socketserver.py#L798-L801
here] in the `socketserver` module, where it is sent using
`socket.sendall`:
{{{#!python
def write(self, b):
    self._sock.sendall(b)
    with memoryview(b) as view:
        return view.nbytes
}}}
`sendall` returns once the data has been copied into the kernel's send buffer; it hasn't necessarily been transmitted to the client yet.
In Django's code the socket is then shut down
[https://github.com/django/django/blob/3.1.7/django/core/servers/basehttp.py#L176
here]:
{{{#!python
def handle(self):
    self.close_connection = True
    self.handle_one_request()
    while not self.close_connection:
        self.handle_one_request()
    try:
        self.connection.shutdown(socket.SHUT_WR)
    except (AttributeError, OSError):
        pass
}}}
However, that's not the only spot where shutdown or close is called on the socket. There are several places in the Python standard library (CPython implementation), such as
[https://github.com/python/cpython/blob/v3.9.2/Lib/socketserver.py#L784-L785 here] and
[https://github.com/python/cpython/blob/v3.9.2/Lib/socketserver.py#L501-L509 here]. I mention this to point out that it's not just Django's code shutting down the socket, and removing that shutdown doesn't change the overall outcome. I'm not sure why there's an explicit shutdown in Django's code, but it's not the source of the trouble - the same shutdown occurs very shortly afterwards in the Python standard library code.
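For reference, one of the places linked above is `TCPServer.shutdown_request()`, which in CPython 3.9 looks like this:
{{{#!python
def shutdown_request(self, request):
    """Called to shutdown and close an individual request."""
    try:
        # explicitly shutdown.  socket.close() merely releases
        # the socket and waits for GC to perform the actual close.
        request.shutdown(socket.SHUT_WR)
    except OSError:
        pass  # some platforms may raise ENOTCONN here
    self.close_request(request)
}}}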
I tried setting `SO_LINGER` on the socket (including on the listening socket, before a connection is accepted), since in theory it is supposed to fix this sort of situation by making `close` or `shutdown` block until all data is sent. I couldn't get that behavior: even with `SO_LINGER` set, a `shutdown` or `close` immediately caused a FIN on the TCP connection, with the unsent data discarded. `SO_LINGER` also wouldn't fully solve the problem and ensure that the client received all the data, since it doesn't wait for confirmation (the client closing the connection), and the client could theoretically need retransmissions of packets.
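For reference, the `SO_LINGER` attempt looked roughly like this (a sketch of what I tried, not part of the proposed patch):
{{{#!python
import socket
import struct

def enable_linger(sock: socket.socket, timeout_seconds: int = 30) -> None:
    # struct linger { int l_onoff; int l_linger; }: onoff=1 enables
    # lingering, so close()/shutdown() should block until the send
    # buffer drains or the timeout expires. In my testing it didn't.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack('ii', 1, timeout_seconds))
}}}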
**Proposed Fix**
It seems that the best solution to this issue is to wait for the client to
close the connection. This ensures that the response was fully received by
the client.
This is actually more or less the existing behavior for a client using `Connection: keep-alive`, which is why the issue only shows up when `Connection: close` is used. The former is the default behavior of any browser, while HTTP clients in scripts or servers (like Node.js) often default to the latter rather than to persistent connections.
In the request handling loop (shown above) in `django.core.servers.basehttp.WSGIRequestHandler`, a `keep-alive` connection causes the server to immediately try to read another request from the socket; that read blocks, which keeps the socket from being shut down on the server side until the client closes it. For an HTTP client using `keep-alive` but only sending a single request before closing the connection, that means the server handles the single request and then ends up waiting for the client to close the connection.
I believe the issue would probably also show up with `keep-alive` if multiple requests were sent, the final request included `Connection: close`, and that final request had a large response payload. I haven't tested that, but the way the code reads, that would lead to the same situation.
Here's a proposed patch to unify the behavior for both cases, and always
do a blocking read on the socket before shutting it down on the server:
{{{#!diff
--- django/core/servers/basehttp.py	2021-02-21 17:31:25.000000000 -0800
+++ django/core/servers/basehttp.py	2021-02-21 17:32:37.000000000 -0800
@@ -7,6 +7,7 @@
 been reviewed for security issues. DON'T USE IT FOR PRODUCTION USE!
 """
+from http import HTTPStatus
 import logging
 import socket
 import socketserver
@@ -172,19 +173,33 @@
         self.handle_one_request()
         while not self.close_connection:
             self.handle_one_request()
+
+        # Wait for the connection to be closed by the client to ensure
+        # that all data was received. Shutting down the connection seems
+        # to flush any data in the send buffer and immediately ends the
+        # connection, which risks having large responses cut off.
+        self.rfile.peek()
+
         try:
             self.connection.shutdown(socket.SHUT_WR)
         except (AttributeError, OSError):
             pass
 
     def handle_one_request(self):
-        """Copy of WSGIRequestHandler.handle() but with different ServerHandler"""
+        """
+        Copy of WSGIRequestHandler.handle() but with different ServerHandler,
+        and re-aligned with BaseHTTPRequestHandler.handle()
+        """
         self.raw_requestline = self.rfile.readline(65537)
         if len(self.raw_requestline) > 65536:
             self.requestline = ''
             self.request_version = ''
             self.command = ''
-            self.send_error(414)
+            self.send_error(HTTPStatus.REQUEST_URI_TOO_LONG)
+            return
+
+        if not self.raw_requestline:
+            self.close_connection = True
             return
 
         if not self.parse_request():  # An error code has been sent, just exit
@@ -196,6 +211,8 @@
         handler.request_handler = self  # backpointer for logging & connection closing
         handler.run(self.server.get_app())
+        self.wfile.flush()  # actually send the response if not already done.
+
 
 def run(addr, port, wsgi_handler, ipv6=False, threading=False, server_cls=WSGIServer):
     server_address = (addr, port)
}}}
The fix itself is the use of `self.rfile.peek()` before shutting down the socket. The earlier-linked parts of the Python standard library that shut down or close the connection all run after `handle()`, so as long as the blocking read occurs in that method it will happen before the connection could be closed. For a connection initially opened with `Connection: close`, or multiple requests on a persistent connection with the final one specifying `Connection: close`, this line will block until the client closes the connection. For a persistent connection that never specifies `Connection: close`, this should be a no-op, since it is only reached once the `close_connection` loop has exited due to the client closing the connection. So, in every case the server would wait for the client to close the connection.
I decided to use `peek()` instead of `read()` since it makes it clearer that the data isn't important, and no arbitrary (and meaningless) size for the read needs to be specified. Any read should work - `read()`, `readline()`, etc. - there just needs to be a blocking call trying to read from the socket.
The other part of the patch is not needed for the fix, but I thought it would be good to update that code. Unfortunately `WSGIRequestHandler` in the Python standard library has diverged from `BaseHTTPRequestHandler`, which has gained some code, including treating a blank `raw_requestline` as a `close_connection` condition. While `parse_request()` also does this, relying on that behavior is probably a bit fragile, so explicitly handling that case seems like a worthwhile change. `self.wfile.flush()` and the use of `HTTPStatus.REQUEST_URI_TOO_LONG` are from `BaseHTTPRequestHandler` and also seemed reasonable to pull in at the same time. I don't know whether the flush does anything meaningful for this case, but I don't think it hurts to have.
----
If anyone wants to champion fixing this issue, it would be much
appreciated, and the above patch is free to use.
--
Ticket URL: <https://code.djangoproject.com/ticket/32472>
* cc: Florian Apolloner (added)
* component: Core (Management commands) => HTTP handling
* stage: Unreviewed => Accepted
Comment:
Thanks for the detailed report. I haven't reproduced it yet, but I'm going to tentatively accept the ticket for future investigation. Unfortunately, the proposed patch breaks some tests in `tests/servers/`.
This may be a regression in 934acf1126995f6e6ccba5947ec8f7561633c27f.
--
Ticket URL: <https://code.djangoproject.com/ticket/32472#comment:1>
Comment (by Florian Apolloner):
I wonder if we can reproduce this purely via the Python stdlib; this might well be a Python stdlib bug, no? Either way, impressive analysis so far.
--
Ticket URL: <https://code.djangoproject.com/ticket/32472#comment:2>
* cc: Chris Jerdonek (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/32472#comment:3>
* Attachment "wip_patch.diff" added.
WIP Patch With Chunked Encoding
* Attachment "test_server.py" added.
Python stdlib-Only Reproduction Server
* Attachment "test_client.js" added.
NodeJS Test Client
Comment (by David Sanders):
> This can be a regression in 934acf1126995f6e6ccba5947ec8f7561633c27f.
@Mariusz, I did review that change while investigating, but came to the
conclusion that the behavior would have been there before that change as
well, since the main issue seemed to be a race between sending the data
and closing the socket.
@Florian, yeah, I was actually able to reproduce it with a little script using just the Python standard library. Since the issue was due to a race between sending and closing the socket, and that's the standard behavior of the Python standard library, the issue was in some ways inherited from it.
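The attached test_server.py is the actual script; the idea was something like this minimal sketch (not the attachment itself):
{{{#!python
# Minimal stdlib-only server: wsgiref goes through the same
# sendall()-then-shutdown() sequence in socketserver as runserver does,
# so a Connection: close request for a large body exercises the race.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    body = b'x' * (1024 * 1024)  # large fixed-length body
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]

with make_server('', 8000, app) as httpd:
    httpd.serve_forever()
}}}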
Unfortunately, I can no longer reproduce it, and I don't think anyone else will have much luck reproducing it either.
I had to go deeper down the rabbit hole, and now I believe the cause of the premature connection closure actually lay between the client and the server. I was running Django in a VS Code dev container (see
https://code.visualstudio.com/docs/remote/containers) using the
`python:3.9` Docker image as the base. I'd forgotten that VS Code has some
"magic" when using dev containers, where it will forward a port from the
host VS Code is running on to a port inside the dev container. I don't
know exactly how they have that set up (they haven't open-sourced the code
for dev containers), but looking at what's running inside the dev
container, I'm imagining there's probably a bucket brigade which goes
something like: bind the port on the host VS Code is running on, pipe
traffic to a script running inside the dev container, pipe traffic from
that script inside the dev container to the target port inside the dev
container.
I'm guessing there's a race condition or bug in their scripts for that
bucket brigade where the connection with Django closing causes the
observed behavior where the connection is closed to the client without
finishing sending the response data. Since there's likely multiple legs
involved in that forwarding magic, and the source isn't available, who
knows exactly where that was.
I somewhat confirmed this theory by skipping the forwarding magic and
having my test client script go straight to the Django port in the
container, and that didn't seem to have the premature closure.
However, at some point I decided (or thought it was a good idea) to recreate the dev container, and now it won't reproduce reliably. I did see it for a moment while the container was busy installing extensions, so I'm guessing the race condition or bug may be exacerbated by load on the container. The container I'd been using before, when it was reliably reproducing, had been up several days. Note to self: add a tool to the debugging toolbox - try a full system reboot before investing time chasing an issue.
So, in effect, there was a buggy network link (since software forwarding was involved) between the client and Django. The proposed fix of waiting for the client to close the connection inadvertently worked around that buggy link's behavior.
That certainly explains the seemingly odd behavior, and why `SO_LINGER` didn't do anything to help the situation.
----
I think the proposed patch in the original comment helps add robustness against network glitches, since it waits for the client to close the connection, ensuring that the response was received, but it has trade-offs which ultimately might make it not worthwhile. There could also be HTTP clients which don't close the connection because they expect the server to do so, in which case it could do more harm than good.
@Mariusz's comment about test failures caused me to spend some time looking at that before I realized the VS Code mischief, and doing so pointed out that waiting for the client to close the connection can't work when the response has no `Content-Length` header (i.e., a `StreamingHttpResponse`). I made a quick and dirty implementation of chunked transfer encoding for the dev server to handle this case, and it did work in my testing, allowing the HTTP client to know all data was received and close the connection itself. However, HTTP/1.0 clients wouldn't be able to use that, so for a streaming response to an HTTP/1.0 client the existing behavior of the server closing the connection would need to be maintained.
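The framing itself is tiny; a sketch of the core of chunked encoding (illustrative helper names, not the attached patch):
{{{#!python
def write_chunk(wfile, data: bytes) -> None:
    # Each chunk is "<hex length>\r\n<data>\r\n".
    wfile.write(f'{len(data):X}\r\n'.encode('ascii'))
    wfile.write(data)
    wfile.write(b'\r\n')

def finish_chunks(wfile) -> None:
    # A zero-length chunk ("0\r\n\r\n") terminates the body, letting the
    # client know the full response was received.
    wfile.write(b'0\r\n\r\n')
}}}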
I'm going to attach some in-progress stuff for posterity's sake, and in case anyone has any interest. The implementation of chunked encoding for HTTP/1.1 can be done pretty minimally and might be worth pursuing in a different ticket. It's always good to have the dev server act more like production, and I think most production web servers would be using chunked encoding for a response like that.
----
Apologies for chasing ghosts with this issue; it's definitely a case of a bit too much "magic" in modern development tools.
--
Ticket URL: <https://code.djangoproject.com/ticket/32472#comment:4>