Q: Double-encoded environment variables?


Albrecht Dreß

Aug 17, 2022, 4:25:46 PM
to mod...@googlegroups.com
Hi all,

I ran into a strange issue regarding the encoding of the environment being passed to the Python3 script via

def application(env, startResponse): […]

as it appears to be utf-8 encoded /twice/. For testing, I added

SetEnv X-Test ä

to the Apache config. When I call a “traditional” Python3 CGI script, the value of os.environ['X-Test'] contains the UTF-8 encoded value as expected (i.e. the bytes 0xc3 0xa4).

However, in the script called via mod_wsgi

WSGIScriptAlias /test /path/to/my/script.py

the value env['X-Test'] contains the bytes 0xc3 0x83 0xc2 0xa4, i.e. the UTF-8 encoded value has apparently been interpreted as ISO-8859-1 (or similar) and then encoded to UTF-8 again.
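
The four bytes reported above can be reproduced in plain Python. The following is a minimal sketch (the variable and function names are illustrative and not taken from the original scripts) showing that reading the UTF-8 bytes of ä as ISO-8859-1 and encoding the result as UTF-8 again gives exactly this sequence:

```python
# Reproduce the "double-encoded" bytes seen in the WSGI script.
original = "\u00e4"                            # the character ä
utf8_bytes = original.encode("utf-8")          # two bytes

# Misinterpret the UTF-8 bytes as ISO-8859-1, one character per byte.
misread = utf8_bytes.decode("iso-8859-1")      # two characters

# Encoding that two-character string as UTF-8 again yields four bytes.
double_encoded = misread.encode("utf-8")


def hex_bytes(data: bytes) -> str:
    """Format bytes the way they are quoted in this thread."""
    return " ".join(f"0x{b:02x}" for b in data)


print(hex_bytes(utf8_bytes))       # 0xc3 0xa4
print(hex_bytes(double_encoded))   # 0xc3 0x83 0xc2 0xa4
```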

My system:
* Debian Bullseye 64-Bit
* apache2 v. 2.4.54-1~deb11u1 (standard Debian package)
* libapache2-mod-wsgi-py3 v. 4.7.1-3 (standard Debian package)
* python3 v. 3.9.2-3 (standard Debian package)

I am pretty sure I just missed a crucial configuration setting, but I could not find which one… Thus, any help to solve this issue would be highly appreciated!

Thanks in advance,
Albrecht.

Graham Dumpleton

Aug 17, 2022, 4:29:24 PM
to mod...@googlegroups.com
Can you provide a simple WSGI hello world with some Python code showing how you are checking this, and what mechanism you are using to display the value, if any?

If you are looking at the Apache error logs to deduce this, it can be confusing, because Apache does its own encoding on values it gets, so what you see in the logs isn't necessarily what the value actually is.
> --
> You received this message because you are subscribed to the Google Groups "modwsgi" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to modwsgi+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/modwsgi/2HM57NL5.5HNUHOW5.K3BMCLQD%40PTUKMMCY.UOWZU43L.FHHJJQFX.

albrech...@arcor.de

Aug 18, 2022, 2:50:02 AM
to modwsgi
Hi, Thanks a lot for your fast response!

The attached archive contains
  • the Apache Virtualhost configuration,
  • a sample Python3 WSGI script, and
  • a sample Python3 CGI script.
Calling the two scripts using curl produces the following output:

$ curl https://testserver/test
WSGI: value of 'X_TEST': 0xc3 0x83 0xc2 0xa4
$ curl https://testserver/cgi
CGI: value of 'X_TEST': 0xc3 0xa4

Thanks, Albrecht.
wsgi-env-issue.zip
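
The attachment itself is not reproduced here. As a rough sketch only, a WSGI script producing output in the format shown above might look like the following; the key name X_TEST is taken from the curl output, while the helper `hex_bytes` and everything else are assumptions, not the contents of the archive:

```python
def hex_bytes(text):
    """Return the UTF-8 bytes of text as '0xNN 0xNN ...'."""
    return " ".join(f"0x{b:02x}" for b in text.encode("utf-8"))


def application(environ, start_response):
    # Per-request variable set via SetEnv; empty string if it is missing.
    value = environ.get("X_TEST", "")
    body = f"WSGI: value of 'X_TEST': {hex_bytes(value)}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```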

Graham Dumpleton

Aug 18, 2022, 3:15:32 AM
to mod...@googlegroups.com
Can you use the recommended daemon mode of mod_wsgi and, when doing so, set the lang/locale to what you use?

See the lang/locale options in:


You might see a difference in behaviour because a standard Linux Apache distribution doesn't set UTF-8 as the lang/locale, which affects what happens in a WSGI application. For CGI, it may be seeing the correct locale as a side effect of sub process/interpreter initialisation.

There is no easy way to set the correct lang/locale except by using daemon mode and setting those options. The only way to set it properly for embedded mode is to set it for the whole Apache instance.

As a quick alternative to test, try using mod_wsgi-express instead, as it will use daemon mode by default and set a UTF-8 lang/locale for you.
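
To see which lang/locale a WSGI process actually ends up with, a small diagnostic application can report it. This is a sketch using only the Python standard library; the function name `locale_report` and the output format are assumptions chosen for illustration:

```python
import locale
import os
import sys


def locale_report():
    """Collect locale-related settings of the current process."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred encoding": locale.getpreferredencoding(False),
        "filesystem encoding": sys.getfilesystemencoding(),
    }


def application(environ, start_response):
    # One "key: value" line per setting, returned as plain text.
    lines = [f"{key}: {value}" for key, value in locale_report().items()]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body]
```

Requesting this application once in embedded mode and once in daemon mode should show whether the two configurations differ.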


Graham


albrech...@arcor.de

Aug 18, 2022, 4:09:10 AM
to modwsgi
Hi Graham!

Graham Dumpleton wrote on Thursday, August 18, 2022 at 09:15:32 UTC+2:
Can you use the recommended daemon mode of mod_wsgi and when doing so set the lang/locale to what you use.

See the lang/locale options in:


Unfortunately, this doesn't change the behaviour…

I changed the WSGI settings in the Apache VirtualHost config to

        WSGIDaemonProcess       testserver_proc lang=en_US.UTF-8 locale=en_US.UTF-8 threads=4
        WSGIProcessGroup        testserver_proc
        WSGIApplicationGroup    testserver_app
        WSGIScriptAlias         /test   /usr/lib/test/test.py process-group=testserver_proc

…which is actually picked up, according to the Apache log:

src/server/mod_wsgi.c(10229): mod_wsgi (pid=1353502): Setting lang to en_US.UTF-8 for daemon process group testserver_proc.
src/server/mod_wsgi.c(10242): mod_wsgi (pid=1353502): Setting locale to en_US.UTF-8 for daemon process group testserver_proc.
mod_wsgi (pid=1353502): Initializing Python.
mod_wsgi (pid=1353504): Initializing Python.
mod_wsgi (pid=1353503): Initializing Python.
mod_wsgi (pid=1353503): Attach interpreter ''.
mod_wsgi (pid=1353504): Attach interpreter ''.
mod_wsgi (pid=1353502): Attach interpreter ''.
src/server/mod_wsgi.c(9115): mod_wsgi (pid=1353502): Started thread 0 in daemon process 'testserver_proc'.
src/server/mod_wsgi.c(9115): mod_wsgi (pid=1353502): Started thread 1 in daemon process 'testserver_proc'.
src/server/mod_wsgi.c(9115): mod_wsgi (pid=1353502): Started thread 2 in daemon process 'testserver_proc'.
src/server/mod_wsgi.c(9115): mod_wsgi (pid=1353502): Started thread 3 in daemon process 'testserver_proc'.
mod_wsgi (pid=1353502): Create interpreter 'testserver_app'.
[remote 172.16.96.65:47970] mod_wsgi (pid=1353502, process='testserver_proc', application='testserver_app'): Loading Python script file '/usr/lib/test/test.py'.

…but the script output reported by curl still is

WSGI: value of 'X_TEST': 0xc3 0x83 0xc2 0xa4

Any idea?

Thanks, Albrecht.

Graham Dumpleton

Aug 18, 2022, 4:58:23 AM
to mod...@googlegroups.com
Thinking about this some more, I realise that in one case you are talking about environment variables and in the other about per-request WSGI variables, so the behaviour is probably correct as required by the WSGI specification. At least to the extent that the WSGI specification can be applied, since these per-request variables are Apache specific.

Anyway, the issue goes back to Python 2 and the updates to WSGI for Python 3.

In Python 2, various request-related things in the WSGI environ dictionary were passed as normal strings (which are actually byte strings, not unicode strings; the two were different types in Python 2). Since the WSGI server can't know what encoding the WSGI application wants to use, it takes whatever it gets and passes the raw byte stream through. If a WSGI application wanted to interpret that as UTF-8, it was up to the WSGI application to decode the byte string and convert it to a unicode string.

In Python 3, the same issue still existed in that the WSGI server would not know what encoding a WSGI application wanted applied. At the same time though, the default string in Python 3 was a unicode capable string.

Now it wasn't practical in Python 3 to pass variables through as byte strings, as the range of operations you could do on byte strings was very limited. Thus the rule for WSGI under Python 3 was that the WSGI server was required to take the underlying byte stream and decode it as ISO-8859-1 (Latin-1) into the unicode-capable default string. It was then up to the WSGI application to convert that to another string with the correct encoding. Since it was a unicode string at that point, the application would need to do:

    value.encode('ISO-8859-1').decode('UTF-8')

So this little dance was necessary. It was a bit ugly, but that is just how it was defined.
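
Wrapped in a helper, the conversion described above might look like the following sketch. The function name `wsgi_to_utf8` and the choice of `errors="replace"` are assumptions for illustration; they are not part of mod_wsgi or the WSGI specification:

```python
def wsgi_to_utf8(value, errors="replace"):
    """Recover the intended text from a Latin-1 decoded WSGI string.

    The WSGI server decoded the raw bytes as ISO-8859-1, so encoding
    with ISO-8859-1 restores those bytes exactly. They can then be
    decoded with the encoding the application expects (UTF-8 here).
    """
    return value.encode("iso-8859-1").decode("utf-8", errors)


def application(environ, start_response):
    # X_TEST is the per-request variable discussed in this thread.
    x_test = wsgi_to_utf8(environ.get("X_TEST", ""))
    body = f"X_TEST = {x_test}\n".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body]
```

With `errors="replace"`, bytes that are not valid UTF-8 become U+FFFD instead of raising an exception; an application that prefers to fail loudly could pass `errors="strict"`.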

So apply that and see if you get what you expect.

Graham


albrech...@arcor.de

Aug 18, 2022, 5:25:10 AM
to modwsgi
Hi Graham:

Graham Dumpleton wrote on Thursday, August 18, 2022 at 10:58:23 UTC+2:
[snip] 
Now it wasn't practical in Python 3 to pass through variables as byte strings as the range of operations you could do on byte strings was very limited. Thus the rule for WSGI under Python 3 was that the WSGI server was required to take the underlying byte stream and convert it to the unicode capable default string as ISO-8859-1 (Latin-1). It was then up to the WSGI application to convert that to another string with the correct encoding. Since it was a unicode string at that point, to do that it would need to do.

Ah! That fully explains the effect I observe! As Apache passes the environment as UTF-8, actually every value in it is double-encoded, right?

    value.encode('ISO-8859-1').decode('UTF-8')

Not sure if it makes any difference in practice, but IMHO

    value.encode('raw_unicode_escape').decode('utf-8')

might be more appropriate. At least it works fine with mixed input from different code pages (I added Greek, Cyrillic and hiragana characters to the Latin-1 one).

Thanks again for your help!

Best, Albrecht.

Graham Dumpleton

Aug 18, 2022, 5:37:37 AM
to mod...@googlegroups.com
I don't know what "raw_unicode_escape" is. In mod_wsgi, as per the WSGI spec at the time, the conversion uses Latin-1 (ISO-8859-1). If "raw_unicode_escape" is an alias for that, then fine, but "raw_unicode_escape" didn't exist back then.

Seems it isn't quite the same.

raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.

I would stick with plain Latin-1.
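
The difference between the two codecs can be checked directly. The sketch below (variable names are illustrative) suggests why both appeared to work in the test with mixed input: a string handed over by the WSGI server only contains code points below 256, where the two codecs produce the same bytes. They diverge only for higher code points:

```python
# A string as a WSGI server hands it over: UTF-8 bytes decoded as Latin-1.
# The input is Greek alpha followed by a-umlaut.
wsgi_value = "\u03b1\u00e4".encode("utf-8").decode("iso-8859-1")

# All code points are below 256, so both codecs restore the same bytes.
via_latin1 = wsgi_value.encode("iso-8859-1")
via_raw = wsgi_value.encode("raw_unicode_escape")
print(via_latin1 == via_raw)                           # True
print(via_raw.decode("utf-8") == "\u03b1\u00e4")       # True

# For a code point above 255, the codecs differ.
already_decoded = "\u03b1"                # alpha as a real Unicode character
print(already_decoded.encode("raw_unicode_escape"))    # b'\\u03b1'
try:
    already_decoded.encode("iso-8859-1")
except UnicodeEncodeError:
    print("ISO-8859-1 raises an error instead of escaping the character")
```

With plain Latin-1, a value that was unexpectedly decoded already fails visibly, whereas raw_unicode_escape would silently turn it into a literal escape sequence.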
