Support for utf8 in headers

Antony Chazapis

Aug 11, 2011, 5:36:20 AM

to modwsgi

Hello.

I'm using apache2/mod_wsgi to drive a django project that aims to
implement/extend the OpenStack Object Storage API. In OpenStack Object
Storage they use arbitrary X-Object-Meta-<key>=<value> headers to
assign metadata to objects.

While Django's built-in development server allows UTF-8 characters in
the headers, I have found that when I post UTF-8 to the apache/mod_wsgi
installation, I receive an underscore ('_') in place of every
non-ASCII character. I traced this to the wsgi_http2env() function,
which converts every character that is not a letter or a digit to '_'.

For example, when posting 'X-Object-Meta-ασδφ=a', I get
'HTTP_X_OBJECT_META_________=a'.
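
The conversion described above can be approximated in a few lines of Python. This is only a sketch of the behaviour reported in this post (ASCII letters and digits are kept and uppercased, every other byte becomes '_', and the result is prefixed with HTTP_); it is not the mod_wsgi C source, and the name http2env_sketch is made up for illustration.

```python
import string

# Byte values of ASCII letters and digits, i.e. what a C-locale isalnum() accepts.
_ALNUM_BYTES = set((string.ascii_letters + string.digits).encode("ascii"))


def http2env_sketch(header_name: str) -> str:
    """Approximate the header-name-to-environ-key conversion described above.

    Hypothetical helper: it works on the UTF-8 bytes of the name, so each
    non-ASCII character turns into one '_' per encoded byte.
    """
    converted = []
    for byte in header_name.encode("utf-8"):
        if byte in _ALNUM_BYTES:
            converted.append(chr(byte).upper())
        else:
            converted.append("_")
    return "HTTP_" + "".join(converted)


if __name__ == "__main__":
    # Raw UTF-8 in the name: every byte of the Greek letters is replaced.
    print(http2env_sketch("X-Object-Meta-ασδφ"))
    # Percent-encoding does not help: '%' is not alphanumeric either.
    print(http2env_sketch("X-Object-Meta-%CE%B1"))
```

Under these assumptions both the raw and the percent-encoded key lose information, which is consistent with the underscores reported above.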

Is wsgi_http2env() really the source of this? If yes, why does
mod_wsgi keep only letters and numbers?

This is really a problem, as I can not even use url encoding - '%'s
are converted to '_' as well.

Thanks,

Antony

Graham Dumpleton

Aug 11, 2011, 5:49:31 AM

to mod...@googlegroups.com

The HTTP RFC requires header names to be ASCII, so code that generates
header names containing raw UTF-8 is violating the specification.

FWIW, wsgi_http2env is more or less an exact copy of a similar routine
in Apache itself, used by its mod_cgi modules when generating variable
names for CGI. WSGI basically adheres to CGI for that encoding
convention.

Graham

Antony Chazapis

Aug 14, 2011, 7:37:20 AM

to modwsgi

Thanks for the reply Graham.

I can easily change the code generating UTF-8 in header names, but the
question here is - with what?

I have found this:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q6
which suggests using URL-encoding to send arbitrary characters in
headers.

The real problem here is that the isalnum() function in wsgi_http2env
will not even allow '%'s to pass through - or '/'s, or '?'s (used in
other UTF-8 encoding schemes). I can patch my mod_wsgi installation to
overcome this, but I asked here in case someone else had the same
problem and found a better solution.

So, is anybody here aware of an "official" or "standard" guideline on
how to send UTF-8 in headers?
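
One workaround that survives an alphanumeric-only conversion is to encode the key with characters that isalnum() accepts, for example hexadecimal. The sketch below is an assumption-laden illustration, not a standard and not part of the OpenStack API: the "Hex-" marker and all function names are hypothetical, and to_environ_name only imitates the conversion described in this thread.

```python
import string

_ALNUM = set(string.ascii_letters + string.digits)

HEADER_PREFIX = "X-Object-Meta-Hex-"          # hypothetical marker for hex-encoded keys
ENVIRON_PREFIX = "HTTP_X_OBJECT_META_HEX_"    # what the prefix looks like after conversion


def encode_meta_key(key: str) -> str:
    """Client side: build an ASCII-only header name from a Unicode key."""
    return HEADER_PREFIX + key.encode("utf-8").hex()


def to_environ_name(header_name: str) -> str:
    """Imitate the conversion described in this thread (ASCII input only)."""
    body = "".join(ch.upper() if ch in _ALNUM else "_" for ch in header_name)
    return "HTTP_" + body


def decode_meta_key(environ_name: str) -> str:
    """Server side: recover the Unicode key from the WSGI environ name."""
    if not environ_name.startswith(ENVIRON_PREFIX):
        raise ValueError("not a hex-encoded metadata key")
    # bytes.fromhex accepts upper-case digits, so the uppercasing is harmless.
    return bytes.fromhex(environ_name[len(ENVIRON_PREFIX):]).decode("utf-8")


if __name__ == "__main__":
    header = encode_meta_key("ασδφ")
    print(header)
    print(decode_meta_key(to_environ_name(header)))
```

Hex digits are case-insensitive, so the uppercasing step does not damage the key, and the original case of the key is preserved because it is encoded in the bytes rather than in the header name's letters.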

Antony

Graham Dumpleton

Aug 14, 2011, 7:46:33 PM

to mod...@googlegroups.com

What one can use for an HTTP field name is dictated by:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2

When you step through the various standards, you end up with:

The field-name must be composed of printable ASCII characters
(i.e., characters that have values between 33. and 126.,
decimal, except colon).

That is only for the header name though for HTTP.

Anyway, a header name definitely can't contain arbitrary bytes, so it
can't carry the byte-string encoding of a Unicode string.
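
The rule quoted above (printable ASCII, values 33 to 126, except colon) is easy to check mechanically. A minimal sketch, assuming only that quoted rule; the function name is made up for illustration:

```python
def is_valid_field_name(name: str) -> bool:
    """Check a header name against the rule quoted above.

    Hypothetical helper: every character must have a value between
    33 and 126 inclusive and must not be a colon. Empty names are rejected.
    """
    if not name:
        return False
    return all(33 <= ord(ch) <= 126 and ch != ":" for ch in name)


if __name__ == "__main__":
    for candidate in ("X-Object-Meta-Key", "X-Object-Meta-ασδφ", "X-Object-Meta-%CE%B1"):
        print(candidate, is_valid_field_name(candidate))
```

Note that a percent-encoded name passes this check, while the raw UTF-8 name does not.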

For WSGI, that header name gets converted to a CGI meta variable name
as defined in:

http://www.ietf.org/rfc/rfc3875

where the variable name is a token. Working back through the grammar
for token you get:

    alpha    = lowalpha | hialpha
    lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
               "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
               "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
               "y" | "z"
    hialpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
               "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
               "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
               "Y" | "Z"

    digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
               "8" | "9"
    alphanum = alpha | digit
    OCTET    = <any 8-bit byte>
    CHAR     = alpha | digit | separator | "!" | "#" | "$" |
               "%" | "&" | "'" | "*" | "+" | "-" | "." | "`" |
               "^" | "_" | "{" | "|" | "}" | "~" | CTL
    CTL      = <any control character>

    token    = 1*<any CHAR except CTLs or separators>

So, technically, the code borrowed from Apache could well be too
restrictive, as the grammar would appear on first read to allow '%'.
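
To make the gap concrete, here is a small sketch comparing the token grammar above with an alphanumeric-only check. The character set is transcribed from the CHAR rule quoted above; the braces are left out as an assumption, because the separator rule is not quoted in this thread, and the helper names are made up for illustration.

```python
import string

# Punctuation listed explicitly in the CHAR rule, minus "{" and "}"
# (assumed here to count as separators, so excluded from token).
TOKEN_PUNCTUATION = "!#$%&'*+-.`^_|~"
TOKEN_CHARS = set(string.ascii_letters + string.digits + TOKEN_PUNCTUATION)


def allowed_by_token(ch: str) -> bool:
    """True if the character may appear in a token under the assumptions above."""
    return ch in TOKEN_CHARS


def kept_by_isalnum(ch: str) -> bool:
    """True if an ASCII-only alphanumeric check would keep the character."""
    return ch.isascii() and ch.isalnum()


if __name__ == "__main__":
    for ch in "%/?-a":
        print(repr(ch), "token:", allowed_by_token(ch), "isalnum:", kept_by_isalnum(ch))
```

Under these assumptions '%' is valid in a token but is still replaced by the alphanumeric check, whereas '/' and '?' are separators and fall outside the token grammar either way.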

Would have to do some more investigation as to why Apache does it that
way. Since for CGI it becomes a process environment variable, maybe
there is some restriction because of cross platform compatibility.

As far as what is accepted practice, I have never ever seen anyone
using anything for header names that wasn't alphanumeric and dash.

Graham
