mod_wsgi script encodings (and other issues with mod_wsgi script loading)


Lucas Thode

Mar 21, 2024, 9:31:03 PM
to modwsgi
What determines which encoding mod_wsgi uses when it reads WSGI script files: Apache's configured locale (en_US.UTF-8 in my case), or something else? I ask because mod_wsgi appears to read WSGI script files by hand instead of going through importlib or runpy. As a result it can't handle a zipapp, or even a script that uses a PEP 263 magic comment to declare its encoding. The latter makes it impossible to "wrap" a zipapp with a loader shim unless something else gives.

Minimized example: it works when you run it with `python3 breaks.py`, but fails with the errors below when you load it with `mod_wsgi-express start-server breaks.py`, using a mod_wsgi-express installed into a venv with pip install. Note that you have to save breaks.py as latin1/iso-8859-1 for the failure to occur.

$ cat breaks.py
# coding: latin1
import sys
from wsgiref.simple_server import make_server

def application(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    message = 'It works!\n'
    version = 'Python v' + sys.version.split()[0] + '\n'
    response = '\n'.join([message, version])
    return [response.encode()]

def main():
    with make_server('', 8100, application) as httpd:
        httpd.serve_forever()

blow_up_unicode = 'â(¡' # \xe2\x28\xa1

if __name__ == '__main__':
    main()

Errors it generates when run under mod_wsgi-express:
[Thu Mar 21 20:12:56.942439 2024] [wsgi:error] [pid 3288289:tid 140356515776384] mod_wsgi (pid=3288289): Failed to exec Python script file '/tmp/mod_wsgi-localhost:8000:1000/handler.wsgi'.
[Thu Mar 21 20:12:56.942486 2024] [wsgi:error] [pid 3288289:tid 140356515776384] mod_wsgi (pid=3288289): Exception occurred processing WSGI script '/tmp/mod_wsgi-localhost:8000:1000/handler.wsgi'.
[Thu Mar 21 20:12:56.943223 2024] [wsgi:error] [pid 3288289:tid 140356515776384] Traceback (most recent call last):
[Thu Mar 21 20:12:56.943329 2024] [wsgi:error] [pid 3288289:tid 140356515776384]   File "/tmp/mod_wsgi-localhost:8000:1000/handler.wsgi", line 90, in <module>
[Thu Mar 21 20:12:56.943335 2024] [wsgi:error] [pid 3288289:tid 140356515776384]     handler = mod_wsgi.server.ApplicationHandler(entry_point,
[Thu Mar 21 20:12:56.943337 2024] [wsgi:error] [pid 3288289:tid 140356515776384]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Thu Mar 21 20:12:56.943345 2024] [wsgi:error] [pid 3288289:tid 140356515776384]   File "/home/lucas/wsgizip/lib/python3.11/site-packages/mod_wsgi/server/__init__.py", line 1475, in __init__
[Thu Mar 21 20:12:56.943348 2024] [wsgi:error] [pid 3288289:tid 140356515776384]     code = compile(fp.read(), entry_point, 'exec',
[Thu Mar 21 20:12:56.943350 2024] [wsgi:error] [pid 3288289:tid 140356515776384]                    ^^^^^^^^^
[Thu Mar 21 20:12:56.943356 2024] [wsgi:error] [pid 3288289:tid 140356515776384]   File "<frozen codecs>", line 322, in decode
[Thu Mar 21 20:12:56.943371 2024] [wsgi:error] [pid 3288289:tid 140356515776384] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 458: invalid continuation byte
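
The traceback suggests the script is read through a text-mode file object, so the bytes are decoded with the process default (UTF-8 here) before compile() ever sees the coding comment. The following is a minimal sketch of that failure mode, not mod_wsgi's actual code: the file name and the helpers load_as_text, load_as_bytes and demo are made up for illustration. It also shows that handing raw bytes to compile() lets the compiler apply the PEP 263 declaration itself.

```python
import os
import tempfile

# Same shape as breaks.py: a PEP 263 cookie plus bytes that are valid
# Latin-1 but not valid UTF-8 (\xe2\x28\xa1).
SOURCE = b"# coding: latin1\nblow_up_unicode = '\xe2\x28\xa1'\n"


def load_as_text(path):
    """Decode as UTF-8 first, then compile (the pattern that fails)."""
    with open(path, encoding="utf-8") as fp:
        return compile(fp.read(), path, "exec")


def load_as_bytes(path):
    """Pass raw bytes to compile() so it can honour the coding comment."""
    with open(path, "rb") as fp:
        return compile(fp.read(), path, "exec")


def demo():
    """Return (did text mode fail?, value decoded via the bytes route)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "breaks_demo.py")  # hypothetical file name
        with open(path, "wb") as fp:
            fp.write(SOURCE)

        try:
            load_as_text(path)
            text_mode_failed = False
        except UnicodeDecodeError:
            text_mode_failed = True

        namespace = {}
        exec(load_as_bytes(path), namespace)
        return text_mode_failed, namespace["blow_up_unicode"]
```

Calling demo() exercises both routes against the same temporary file, so the only variable is whether decoding happens before or inside compile().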

Graham Dumpleton

Mar 21, 2024, 10:00:35 PM
to mod...@googlegroups.com
Depends a little bit on whether you are using embedded mode or daemon mode of mod_wsgi, or whether using mod_wsgi-express.

The Python embedded in Apache, when not using mod_wsgi-express, should by default inherit the system default locale. From memory, this is often the C or POSIX locale rather than any variant of UTF-8, because Linux distros don't necessarily do sane things, although this may have changed.

What Apache calculates for the language/locale of specific HTTP requests, based on Apache's own rules, makes no difference.

If you are using daemon mode of mod_wsgi you can use the lang/locale option to the WSGIDaemonProcess directive to explicitly set it for those processes.


I can't remember if there is an easy way of overriding it for embedded mode, besides setting it in systemd or in whatever other startup files start Apache. I don't think so, which means it is governed by what the Apache process inherits from the system. You can possibly use Python functions to change it after the process has started, but that may be too late for anything that has already been imported.

If you are using mod_wsgi-express, it tries to set things itself to a sane value if not set by the --locale command line option.

Bit of a description about it in:


  The behaviour of the --locale option to mod_wsgi-express has changed. Previously if this option was not defined, then both of the locales en_US.UTF-8 and C.UTF-8 have at times been hardwired as the default locale. These locales are though not always present. As a consequence, a new algorithm is now used.

  If the --locale option is supplied, the argument will be used as the locale. If no argument is supplied, the default locale for the executing mod_wsgi-express process will be used. If that however is C or POSIX, then an attempt will be made to use either the en_US.UTF-8 or C.UTF-8 locales and if that is not possible only then fallback to the default locale of the mod_wsgi-express process.

  In other words, unless you override the default language locale, an attempt is made to use an English language locale with UTF-8 encoding.
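
One way to read the quoted description is as the small decision procedure below. This is only a sketch of the rules as stated in the text, not the code mod_wsgi-express actually runs; the names choose_locale, can_set and FALLBACKS are hypothetical.

```python
import locale

# Fallback candidates named in the description, tried in order.
FALLBACKS = ("en_US.UTF-8", "C.UTF-8")


def choose_locale(option, default, is_available):
    """Pick a locale name following the rules described above.

    option: value given to --locale, or None if it was not supplied.
    default: locale name inherited by the process (for example "C").
    is_available: callable reporting whether a locale name can be used.
    """
    if option:
        return option
    if default not in ("C", "POSIX"):
        return default
    for candidate in FALLBACKS:
        if is_available(candidate):
            return candidate
    # Neither UTF-8 fallback exists, so stay with the inherited default.
    return default


def can_set(name):
    """Probe a locale by trying to set it, then restore the previous one."""
    saved = locale.setlocale(locale.LC_ALL)
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_ALL, saved)
    return True
```

For example, choose_locale(None, "C", can_set) would return the first UTF-8 fallback the host actually provides, while an explicit option always wins.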


So the wisest thing to do if you have a special requirement is to set the --locale option.

If you force mod_wsgi-express into embedded mode though, it possibly just inherits whatever the parent shell is using. I can't remember whether mod_wsgi-express also sets the locale in the parent process so that it is inherited by the child processes.

As to the initial WSGI script file, it is not a module import and so any special language encoding definition in a magic header of the file is ignored and it should just use whatever the Python lang/locale is set to.

If you need such a thing to be honoured then don't put your real code in the WSGI script file and instead hold your project code in a distinct Python package structure and import modules from it in the WSGI script file.
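
A rough sketch of that layout follows, assuming a hypothetical module named myproject_app that is saved as Latin-1 with a coding comment. Because the module goes through the normal import machinery, the PEP 263 declaration is applied. In a real WSGI script file the body of import_application would reduce to a sys.path tweak plus `from myproject_app import application`; the temporary directory and the demo helper exist only to make the sketch self-contained.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical project module saved as Latin-1 with a PEP 263 cookie.
MODULE_BYTES = (
    b"# coding: latin1\n"
    b"GREETING = '\xe2(\xa1'\n"
    b"def application(environ, start_response):\n"
    b"    start_response('200 OK', [('Content-Type', 'text/plain')])\n"
    b"    return [GREETING.encode('utf-8')]\n"
)


def import_application(project_dir, module_name="myproject_app"):
    """What a thin WSGI script would do: extend sys.path, then import."""
    sys.path.insert(0, project_dir)
    try:
        module = importlib.import_module(module_name)
    finally:
        sys.path.remove(project_dir)
    return module.application


def demo():
    """Write the module to a temp dir, import it, and call the app once."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "myproject_app.py"), "wb") as fp:
            fp.write(MODULE_BYTES)
        importlib.invalidate_caches()
        application = import_application(tmp)
        statuses = []
        body = application({}, lambda status, headers: statuses.append(status))
        return statuses, body
```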

Not sure if this answers your question or not. My memory is very murky about some of this stuff, especially what happens in embedded mode.

Graham


Lucas Thode

Mar 21, 2024, 10:36:07 PM
to modwsgi
The problem with your statement that "As to the initial WSGI script file, it is not a module import and so any special language encoding definition in a magic header of the file is ignored and it should just use whatever the Python lang/locale is set to." is that the CPython interpreter itself does not ignore magic headers and zipapp functionality when passed a script file on the command line.  Indeed, `python3 myapp.pyz`, where myapp.pyz is a valid Python zipapp, will run the Python code in the zipapp's __main__.py.  Should I file a bug against mod_wsgi regarding its lack of support for what is normal Python functionality in every other context (not just imported modules)?
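
For illustration, the behaviour of `python3 myapp.pyz` can also be reached from Python code through the standard library's runpy.run_path(), which accepts a zip archive and lets the caller choose the module name. The sketch below is an assumption-laden demo, not a proposal for how mod_wsgi should load scripts: the archive contents, the run name and the build_and_run helper are all made up.

```python
import os
import runpy
import tempfile
import zipapp

# Contents of a hypothetical __main__.py inside the zipapp.
MAIN_SOURCE = (
    "def application(environ, start_response):\n"
    "    start_response('200 OK', [('Content-Type', 'text/plain')])\n"
    "    return [b'It works!\\n']\n"
    "RUN_NAME = __name__\n"
)


def build_and_run(run_name="wsgi_zipapp_demo"):
    """Build a throwaway zipapp and execute it under a chosen module name."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.mkdir(src)
        with open(os.path.join(src, "__main__.py"), "w", encoding="utf-8") as fp:
            fp.write(MAIN_SOURCE)
        archive = os.path.join(tmp, "myapp.pyz")
        zipapp.create_archive(src, archive)
        # run_path() executes the archive's __main__.py; run_name controls
        # what __name__ is inside the executed code. It returns the
        # resulting module globals as a dict.
        return runpy.run_path(archive, run_name=run_name)
```

The returned dictionary contains the `application` callable, so a caller could pick the WSGI entry point out of it without the code ever running as `__main__`.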

Graham Dumpleton

Mar 22, 2024, 12:09:00 AM
to mod...@googlegroups.com
A WSGI script file is not a __main__ module which is how a script file given as an argument to command line Python is treated. From memory the code related to importing that script file as the __main__ module and all the special treatment related to it is very convoluted and not something that can be reused by anything else. I even recollect it being implemented in an internal C function of CPython that can't be called by anything else.

That was back in Python 1.X/2.X days though. The question thus is whether Python 3.X has refactored the code for processing the script file as the __main__ module out into a separate pure Python code module. If that has been done, and one can dictate a different name for the module besides __main__ (which cannot be used in mod_wsgi since one could technically have multiple loaded WSGI script files in the same interpreter context which need to be named differently), then maybe it could be reused. This also depends though on whether module reload for embedded mode of mod_wsgi can still be handled.

So there are quite a lot of technical problems that would need to be solved first.

If you have the time and can at least identify for me where in the CPython code (or the Python stdlib) the importing of the __main__ Python script file is handled with the behaviour you need, that would give me a head start in working out whether it is practical. A starting point may be the runpy module, but after all the rewrites of the Python module system over time I am not sure where it is even handled these days.

So it may be possible now that mod_wsgi only supports Python 3. It definitely wouldn't have been possible when it was supporting both Python 2 and 3 though, as I am pretty sure that how it was done in Python 2 meant the code wasn't reusable (or that could have been Python 1.X).

Lucas Thode

Mar 23, 2024, 10:51:38 AM
to modwsgi
The handling you're after was indeed refactored extensively around Python 2.3 or so with the introduction of the PEP 302 import hooks, so you're right that the way mod_wsgi does it is a legacy of very early Python (1.x most likely). Modern Pythons expose a module called "importlib" (a minimal version in 2.7, the full module from 3.1 onward) that contains the various support functions that implement the import statement, and therefore, to the best of my understanding, how all Python code is loaded from disk by CPython, even main scripts.
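
As a starting point, here is a minimal sketch of loading an arbitrary script file through importlib under a caller-chosen module name, which also applies a PEP 263 coding comment. The module name, the file name and the load_script and demo helpers are hypothetical, and the sketch does not address the module reloading that mod_wsgi needs for embedded mode.

```python
import importlib.machinery
import importlib.util
import os
import sys
import tempfile

# A stand-in WSGI script saved as Latin-1 with a PEP 263 cookie.
SCRIPT_BYTES = (
    b"# coding: latin1\n"
    b"MARKER = '\xe2(\xa1'\n"
    b"def application(environ, start_response):\n"
    b"    start_response('200 OK', [('Content-Type', 'text/plain')])\n"
    b"    return [b'It works!\\n']\n"
)


def load_script(path, module_name):
    """Load an arbitrary file as a module under a caller-chosen name."""
    # An explicit loader is passed because '.wsgi' is not a recognised
    # source suffix, so suffix-based loader selection would not apply.
    loader = importlib.machinery.SourceFileLoader(module_name, path)
    spec = importlib.util.spec_from_file_location(module_name, path, loader=loader)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        sys.modules.pop(module_name, None)
        raise
    return module


def demo():
    """Write the script to a temp dir and load it under a unique name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "handler.wsgi")
        with open(path, "wb") as fp:
            fp.write(SCRIPT_BYTES)
        module = load_script(path, "_wsgi_script_demo_1")
        try:
            return module.__name__, module.MARKER
        finally:
            sys.modules.pop("_wsgi_script_demo_1", None)
```

Since each script gets its own name, several WSGI script files could in principle coexist in one interpreter without any of them claiming `__main__`.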