1. When using daemon mode, any request content is not sent across to
the daemon process unless the WSGI application actually attempts to
read it. This avoids one scenario for temporary socket deadlock
issues. Namely, the case where request content is greater than UNIX
socket buffer size (8KB) but the WSGI application doesn't consume any
of the content and generates a response anyway, the content of which
is also greater than the UNIX socket buffer size. Well behaved
applications would not normally hit this problem, but spam bots which
try and do POSTs against arbitrary URLs had a habit of triggering it.
This doesn't generally affect the server unless the spam bot is
sending lots and lots of requests so as to tie up all available threads
until the timeout which breaks the deadlock expires.
2. As a result of 1 above, it is now also possible for a WSGI
application running in daemon mode to safely generate a 413 error
response, namely that the request content was too large. This can be
done without having to consume the request content to avoid the
deadlock problem and the content is not unnecessarily sent to the
daemon process either. The way in which this is all implemented means
that 100-continue from the actual client should now work end to end
right through to the daemon process.
3. All HEAD requests are now silently changed into a GET request. This
is necessary because WSGI applications tend to assume that they are
the sole generator of all HTTP response headers. When using Apache
this is not the case and Apache output filters can add additional
response headers. If a WSGI application does not bother to generate the
actual response content for a HEAD request, Apache output filters such
as mod_deflate would not generate the same response headers as they
would for a GET request. This could cause problems for any client or
intermediate caching system.
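As an illustration of point 2, here is a minimal sketch of an application that rejects an oversized request with a 413 based only on the declared Content-Length, without ever touching wsgi.input. The MAX_BODY limit and the response text are assumptions made for the example, and it is written for Python 3 (byte strings), not taken from mod_wsgi itself.

```python
MAX_BODY = 1024 * 1024  # hypothetical limit of 1 MB, chosen for illustration


def application(environ, start_response):
    # CONTENT_LENGTH may be missing or empty; treat both as zero.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0

    if length > MAX_BODY:
        # Reject without reading any of environ["wsgi.input"].
        body = b"Request entity too large\n"
        start_response("413 Request Entity Too Large",
                       [("Content-Type", "text/plain"),
                        ("Content-Length", str(len(body)))])
        return [body]

    # Normal path: consume exactly the declared amount of content.
    data = environ["wsgi.input"].read(length)
    body = ("Received %d bytes\n" % len(data)).encode("ascii")
    start_response("200 OK",
                   [("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body)))])
    return [body]
```

Whether skipping the read is safe depends on the gateway; the point above is that daemon mode no longer forces the content across the socket in this case.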
Change 1 above should be sufficient to now allow fixing of the
remainder of the socket deadlock triggers to be deferred until
mod_wsgi 3.0. The two scenarios left which can trigger the issue are,
where application partially consumes request content such that more
than UNIX socket buffer size remains and then generates a response
greater than UNIX socket buffer size, or where a WSGI application is
trying to stream response content at the same time as reading the
request content.
Am still looking at one more change, which is what to do if the
wsgi.input read() method is called without an argument. The WSGI
specification actually dictates that an optional argument shouldn't be
allowed and so what happens is undefined. Many WSGI implementations
still implement it and take it to mean return all request content. I
originally did do that in very early snapshots of mod_wsgi, but
frankly didn't see that of much use given one had the content length
anyway, plus there were various internal Apache issues around doing
that. So, implemented it instead to return next available chunk of
data based on what Apache input filter system had available at the
time. This meant it would only block if no data was available at the
time. It also meant it wasn't necessary to be gluing back together
blocks of data from what Apache was holding them as. So, behaviour
which was perhaps better suited to how Apache worked but not
necessarily what people would expect.
Anyway, optional argument isn't supposed to be supported, so looking
at whether to simply disallow argument being optional. What makes it
all stupid is that readline() by WSGI specification is not meant to
take an argument, but no one does that as stuff like cgi module
expects to be able to pass an argument and so not accepting it would
cause lots of stuff to fail. So, this area of WSGI specification isn't
exactly well defined and what I will do at this point I am not sure.
It isn't just a simple matter of returning remaining content based on
Content-Length header as mutating Apache input filters can screw with
that and more or less data may actually be available. It may thus be
better to simply not allow optional argument as required by
specification. Any other suggestions most welcome.
BTW, these few changes will probably make their way back into a
mod_wsgi 1.4 along with the daemon mode speed improvements and other
changes to address bugs or WSGI compliance.
Graham
--
-------------------------------------------------------------------------------
Carl J. Nobile (Software Engineer)
carl....@gmail.com
-------------------------------------------------------------------------------
> If HEAD requests are being changed to GET requests
> then the client will get an entity which is not what
> is specified in RFC2616 for HEAD requests only the
> headers should be the same as GET requests. Is this
> what is happening now? I am locally trapping all
> requests in my WSGI code so that I can conform to
> a RESTful interface, will the environment be left
> as it or will I only see GETs for HEADs in there?
Please see
http://groups.google.com/group/python-web-sig/browse_thread/thread/24cb9b4ea8aa44f2
(the whole thread). I provided a test case that shows why
this change is necessary:
    def application(env, start_response):
        start_response("200 OK",
                       [("Content-Length", "10000")])
        if env["REQUEST_METHOD"] == "HEAD":
            return []
        else:
            return ["a"*10000]
Before this change was made, the above application gave the wrong
results when an output filter like mod_deflate was in use. After this
change, the application gives the correct results.
After this change, your application will never see a request with
environ['REQUEST_METHOD'] == 'HEAD' when running under mod_wsgi. Apache
requires modules like mod_wsgi to give it a response body for HEAD
requests. Apache throws the response body away after all the output
filters have processed it, so that it isn't sent to the user. As a
result, your application will be more likely to be conformant to RFC
2616, not less. Unfortunately, the performance of HEAD requests may
decrease in order to ensure that correctness.
If you are using CGI/FastCGI/SCGI/ajp-wsgi behind Apache to deploy your
WSGI applications then you should write some middleware like this
(untested):
    class Apache_HEAD_Hack:
        def __init__(self, application):
            self.application = application

        def __call__(self, environ, start_response):
            # Present HEAD as GET so the application generates a body.
            if environ['REQUEST_METHOD'] == 'HEAD':
                environ['REQUEST_METHOD'] = 'GET'
            return self.application(environ, start_response)
- Brian
Could you post the test case you used where you saw the problem ~8K?
When I tested this, it seemed like the buffer size was over 100K on
CentOS 5.1, not 8K. If the boundary case really was as low as 8K then
this was a more urgent problem than I thought.
I updated to the Subversion trunk and the deadlock_1.wsgi test case I
just posted earlier now works.
> 2. As a result of 1 above, it is now also possible for a WSGI
> application running in daemon mode to safely generate a 413
> error response, namely that the request content was too
> large. This can be done without having to consume the request
> content to avoid the deadlock problem and the content is not
> unnecessarily sent to the daemon process either. The way in
> which this is all implemented means that 100-continue from
> the actual client should now work end to end right through to
> the daemon process.
Again, if you have the test case for this, I would like to see it. My
413 test case already worked perfectly as of the version available on
Saturday.
> Change 1 above should be sufficient to now allow fixing of
> the remainder of the socket deadlock triggers to be deferred
> until mod_wsgi 3.0. The two scenarios left which can trigger
> the issue are, where application partially consumes request
> content such that more than UNIX socket buffer size remains
> and then generates a response greater than UNIX socket buffer
> size, or where a WSGI application is trying to stream
> response content at the same time as reading the request content.
Both of these cases can be grouped together as "the application
generates any output larger than the socket buffer size after consuming
some, but not all, of the request entity." The workaround is to always
read the entire request body any time you read any of it. For a WSGI
application to work portably under other WSGI gateways, (especially as a
CGI under Apache), it needs to always consume the request entity before
generating any output. The easiest way to ensure that is (1) use the
Apache LimitRequestBody directive to prevent useless large request
bodies on requests that should not have a request body at all (GET,
HEAD, DELETE), (2) use the Apache Limit directive to disable methods
that your application doesn't need at all (e.g. TRACE, OPTIONS, WebDAV
methods, etc.), and (3) create some middleware that throws away
any data remaining in wsgi.input before yielding the first non-empty
string.
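A rough sketch of what the middleware in (3) might look like follows. The class names DrainInput and CountingInput are made up for illustration, it trusts CONTENT_LENGTH to know how much content remains, and it is written for Python 3 byte strings. It does not cover the legacy write() callable or readlines(), so treat it as a starting point, not a tested solution.

```python
class CountingInput:
    """Wraps wsgi.input and tracks how much declared content is unread."""

    def __init__(self, stream, length):
        self.stream = stream
        self.remaining = max(length, 0)

    def _clamp(self, size):
        # Never ask the underlying stream for more than is declared.
        if size is None or size < 0 or size > self.remaining:
            return self.remaining
        return size

    def read(self, size=-1):
        size = self._clamp(size)
        data = self.stream.read(size) if size else b""
        self.remaining -= len(data)
        return data

    def readline(self, size=-1):
        size = self._clamp(size)
        data = self.stream.readline(size) if size else b""
        self.remaining -= len(data)
        return data

    def drain(self, chunk_size=8192):
        # Discard whatever the application left unread.
        while self.remaining > 0:
            data = self.stream.read(min(chunk_size, self.remaining))
            if not data:
                break
            self.remaining -= len(data)


class DrainInput:
    """Throws away unread request content before the first non-empty
    block of the response is passed on to the gateway."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        wrapped = CountingInput(environ["wsgi.input"], length)
        environ["wsgi.input"] = wrapped

        result = self.application(environ, start_response)
        try:
            for block in result:
                if block:
                    wrapped.drain()  # cheap no-op once nothing remains
                yield block
        finally:
            if hasattr(result, "close"):
                result.close()
```

Because __call__ is a generator, the wrapped application is not invoked until the gateway starts iterating over the response.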
> Am still looking at one more change which is what to do about
> if wsgi.input read() method is called without an argument.
>
> Anyway, optional argument isn't supposed to be supported, so
> looking at whether to simply disallow argument being
> optional. What makes it all stupid is that readline() by WSGI
> specification is not meant to take an argument, but no one
> does that as stuff like cgi module expects to be able to pass
> an argument and so not accepting it would cause lots of stuff
> to fail. So, this area of WSGI specification isn't exactly
> well defined and what I will do at this point I am not sure.
> It isn't just a simple matter of returning remaining content
> based on Content-Length header as mutating Apache input
> filters can screw with that and more or less data may
> actually be available. It may thus be better to simply not
> allow optional argument as required by specification. Any
> other suggestions most welcome.
For the same reason that every WSGI container implements readline() with
the argument, read() and read(-1) should return X bytes of input, where
X = min(<Content-Length>,<actual request size>) - <bytes already read>.
If that is difficult to do, raise an exception. Returning anything other
than the full remaining input is likely to result in applications that
silently corrupt data; if you raise an exception with a clear error
message, then developers will know about the mistake right away and can
correct it.
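To make the suggested semantics concrete, here is a small sketch of a wrapper where read() and read(-1) return everything up to Content-Length that has not been read yet, even if the underlying stream hands data back in smaller chunks. LimitedInput is a hypothetical name, and as noted earlier in the thread, Apache input filters may make the real amount of data differ from Content-Length, so this is only an approximation of the idea (Python 3, byte strings).

```python
class LimitedInput:
    """read() with no argument returns all remaining declared content."""

    def __init__(self, stream, content_length):
        self.stream = stream
        self.remaining = max(content_length, 0)

    def read(self, size=-1):
        # No argument, a negative size, or an oversized request all
        # mean "whatever is left according to Content-Length".
        if size is None or size < 0 or size > self.remaining:
            size = self.remaining
        chunks = []
        while size > 0:
            data = self.stream.read(size)
            if not data:
                break  # stream ended before Content-Length was reached
            chunks.append(data)
            size -= len(data)
            self.remaining -= len(data)
        return b"".join(chunks)
```

The loop is what glues short reads back together, so the caller gets min(Content-Length, actual size) minus the bytes already read, as described above.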
While the non-blocking read() is actually a great optimization, I think
that is something that should be added to the WSGI spec. or added as an
extension, using a different method name on environ["wsgi.input"].
All of these issues should be added to the documentation and to the
release notes.
Thanks for making these changes. The work you've been doing has really
made a big improvement on how my application operates.
- Brian
Apache will chop off the response body before sending the response to
the client; the client will never see a response body. I've already
tested it out this morning. As far as HTTP protocol correctness goes,
this change substantially decreases the likelihood of the entity headers
for HEAD and GET requests being different.
The requirement for mod_wsgi to give Apache the response body is
documented in several places, including here:
http://svn.apache.org/repos/asf/httpd/httpd/trunk/STATUS: "All handlers
should always send content down even if r->header_only is set. If not,
it means that the HEAD requests don't generate the same headers as a GET
which is wrong." But, mod_wsgi giving the response body to Apache
doesn't imply that Apache is going to send the response body to the
client.
> I understand that the headers weren't being assembled correctly,
> by Apache, but I don't think mod_wsgi should be telling the
> downstream WSGI handlers something that is not true either.
I agree with you. However, this is a result of how WSGI is designed. If you
read the Web-SIG thread, you can see that it was generally agreed that,
if a gateway or middleware set headers based on the content of the
response body, then they need to set REQUEST_METHOD to 'GET' before
forwarding the request on to the application.
Having said all of that, it *may* be possible for mod_wsgi to detect
whether or not the output filters for the request reserve the right to
change the entity headers. For example, if all of the output filters for
the request are of type AP_FTYPE_PROTOCOL, AP_FTYPE_TRANSCODE,
AP_FTYPE_CONNECTION, and AP_FTYPE_NETWORK, then mod_wsgi may be able
to avoid changing environ["REQUEST_METHOD"] from "HEAD" to "GET" in
those cases. However, this is going against the requirement that
"handlers should always send content". Maybe it should be taken up on
the Apache module developers' list. If this optimization is sound, then
it would noticeably affect the performance of my application in some
cases.
You said you were writing a REST application. So am I. This change was
requested by me specifically so that REST applications will work more
correctly under Apache+mod_wsgi. But, even with this change, I highly
recommend that you disable all Apache output filters for your REST
application whenever possible, *especially* mod_deflate. When
mod_deflate and similar output filters are active, most WSGI gateways
will not respond to HEAD requests correctly. Also, until Apache 2.2.3
(subversion trunk), mod_deflate has generated a bad ETag for responses.
Apache's conditional request processing (If-Match, If-None-Match)
doesn't work correctly when mod_deflate or other content-altering
filters are involved. As I'm sure you are aware, these are all major
issues in a REST application, as we rely on ETags and HEAD much more
than web browsers traditionally have. In fact, I know that some (all?)
web browsers avoid issuing HEAD requests for any reason, specifically
due to these issues.
Do all your compression with middleware that uses the Python zlib module
instead. But, note that if/when you use middleware to do compression,
that middleware still has to set REQUEST_METHOD to GET for HEAD
requests, in order to work correctly. There is no way of getting around
it, if you want the middleware to work correctly for any WSGI
application.
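For what it is worth, a bare-bones sketch of such middleware is below. GzipMiddleware is a hypothetical name. It buffers the whole response, uses only the standard zlib module, and does a naive Accept-Encoding check. It ignores things a real implementation must handle (ETags, already-encoded responses, non-200 statuses, streaming), so it only shows the HEAD-to-GET part of the argument (Python 3, byte strings).

```python
import zlib


class GzipMiddleware:
    """Hypothetical buffering gzip middleware (illustration only)."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        # Remember the real method, then always ask the application for
        # a full GET response so the entity headers can be computed.
        is_head = environ["REQUEST_METHOD"] == "HEAD"
        if is_head:
            environ["REQUEST_METHOD"] = "GET"

        captured = {}
        parts = []

        def capture(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = list(headers)
            return parts.append  # stands in for the legacy write() callable

        result = self.application(environ, capture)
        try:
            for block in result:
                parts.append(block)
        finally:
            if hasattr(result, "close"):
                result.close()
        body = b"".join(parts)

        # Drop any Content-Length the application set; recomputed below.
        headers = [(name, value) for name, value in captured["headers"]
                   if name.lower() != "content-length"]

        # Naive check: ignores q-values such as "gzip;q=0".
        if "gzip" in environ.get("HTTP_ACCEPT_ENCODING", ""):
            # wbits of MAX_WBITS | 16 selects the gzip container format.
            compressor = zlib.compressobj(6, zlib.DEFLATED,
                                          zlib.MAX_WBITS | 16)
            body = compressor.compress(body) + compressor.flush()
            headers.append(("Content-Encoding", "gzip"))
            headers.append(("Vary", "Accept-Encoding"))

        headers.append(("Content-Length", str(len(body))))
        start_response(captured["status"], headers)

        # Same headers for HEAD and GET; only the body differs.
        return [] if is_head else [body]
```

The Content-Length of the compressed body can only be known by compressing it, which is why the application has to be asked for the GET response even when the client sent HEAD.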
Cheers,
Brian
> I must be missing something here. How will apache chop off the response
> body if the method has been changed to a GET, it won't know that it
> once was a HEAD?
mod_wsgi tells the WSGI application that the request is a GET, but
Apache, mod_wsgi and all other Apache modules still know that it is a
HEAD request.
>> Having said all of that, it *may* be possible for mod_wsgi to detect
>> whether or not the output filters for the request reserve the right to
>> change the entity headers. For example, if all of the output filters for
>> the request are of type AP_FTYPE_PROTOCOL, AP_FTYPE_TRANSCODE,
>> AP_FTYPE_CONNECTION, and AP_FTYPE_NETWORK, then mod_wsgi may be able
>> to avoid changing environ["REQUEST_METHOD"] from "HEAD" to "GET" in
>> those cases. However, this is going against the requirement that
>> "handlers should always send content". Maybe it should be taken up on
>> the Apache module developers' list. If this optimization is sound, then
>> it would noticeably affect the performance of my application in some
>> cases.
> If it is known that apache needs a request body even in the case
> of a response to HEAD request could not the developer be sure to
> send a body with this response and not have mod_wsgi munging the
> environment? I still don't like this, but I could live with it.
Because Apache is not the only web server in the world. If you use
CherryPy's web server or Paste's web server, without any middleware
involved, then your application will work as-is without being lied to.
If you changed your application to send the response body even when
environ["REQUEST_METHOD"] == "HEAD", then your application would stop
working in standalone servers like CherryPy and Paste. You can't have it
both ways.
>> Do all your compression with middleware that uses the Python zlib module
>> instead. But, note that if/when you use middleware to do compression,
>> that middleware still has to set REQUEST_METHOD to GET for HEAD
>> requests, in order to work correctly. There is no way of getting around
>> it, if you want the middleware to work correctly for any WSGI
>> application.
> I guess this explains why flup provides its own compression
> middleware. Humm, food for thought.
Flup removed the GZip middleware before version 1.0 was released.
Make sure that you test whatever middleware you end up using very
thoroughly. Almost all of the GZip filters I have seen are worse than
mod_deflate.
- Brian
Code in Apache is:
    terminate_header(b2);
    ap_pass_brigade(f->next, b2);
    if (r->header_only) {
        apr_brigade_destroy(b);
        ctx->headers_sent = 1;
        return OK;
    }
    r->sent_bodyct = 1; /* Whatever follows is real body stuff... */
So, Apache has remembered that it is a HEAD request in r->header_only.
When processing the response and it sees it was a HEAD it simply
destroys the response content bucket chain, only passing the headers
and not the content.
BTW, thanks to Brian for explaining all this, saved me some time in
responding. :-)
Graham
Unfortunately there isn't a way of knowing if mod_wsgi has changed it
from HEAD to GET. The same would apply if a WSGI middleware had done
the same thing.
Technically one could set some extra variable in the environment to
record the fact it had changed, but this would be outside of the WSGI
specification and so non standard and non portable.
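Purely as an illustration of that idea, middleware doing the rewrite could record the original method under its own key. The key name below is invented for the example and is not something mod_wsgi or the WSGI specification defines, so an application relying on it would not be portable.

```python
# Hypothetical, non-standard key; not defined by WSGI or mod_wsgi.
ORIGINAL_METHOD_KEY = "example.original_request_method"


class HeadToGet:
    """Rewrites HEAD to GET but remembers that it did so."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        if environ["REQUEST_METHOD"] == "HEAD":
            environ[ORIGINAL_METHOD_KEY] = "HEAD"
            environ["REQUEST_METHOD"] = "GET"
        return self.application(environ, start_response)
```

An application that cares could then check environ.get(ORIGINAL_METHOD_KEY), falling back to REQUEST_METHOD when the key is absent.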
Graham
I have, though, disabled the feature altogether for the time being
while I investigate other problems with it. Thus the optimisation for 2
is also no longer being done.
The problem is that it may only be possible to avoid sending data when
it is a non 2xx response, with possibly other responses being off limits
as well. Will say more when I work it out and decide whether I just need
to revert all the changes.
If using trunk, let me know if there are still problems, as I have not
entirely put the code back how it was but have just tried to fudge
things for the moment to make it work like before. :-)
Graham
Please consider applying this patch. If there are no output filters
defined for the location (besides protocol level filters) then it is
safe to pass HEAD to the application.
To get the HEAD request, you would either need to ensure that no filters
were enabled for the location, or you need to explicitly disable them,
such as by using "FilterChain !" or RemoveOutputFilter.
- Brian
--- mod_wsgi.c (revision 802)
+++ mod_wsgi.c (working copy)
@@ -6180,6 +6180,7 @@
     const char *path_info = NULL;
     conn_rec *c = r->connection;
+    ap_filter_t *filters;
     /* Grab request configuration. */
@@ -6205,8 +6206,12 @@
      * content is generated.
      */
-    if (!strcmp(r->method, "HEAD"))
-        apr_table_setn(r->subprocess_env, "REQUEST_METHOD", "GET");
+    if (!strcmp(r->method, "HEAD")) {
+        if (!r->header_only ||
+            (r->output_filters &&
+             r->output_filters->frec->ftype < AP_FTYPE_PROTOCOL))
+            apr_table_setn(r->subprocess_env, "REQUEST_METHOD", "GET");
+    }
     /* Determine whether connection uses HTTPS protocol. */