[Django] #36777: Exception raised when accessing files with UTF-8 characters in filename on debian/Apache


Django

6:09 AM
to django-...@googlegroups.com
#36777: Exception raised when accessing files with UTF-8 characters in filename on
debian/Apache
-----------------------+-----------------------------------------
Reporter: Caram | Type: Bug
Status: new | Component: Uncategorized
Version: 6.0 | Severity: Normal
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-----------------------+-----------------------------------------
= Unicode Filename Handling Issues in Django under Apache/WSGI

== Environment
* **Django Version**: 5.2/6.0
* **Python Version**: 3.12
* **Web Server**: Apache 2.4.65 with mod_wsgi 5.0.0
* **OS**: Debian Linux
* **Database**: MySQL with utf8mb3_general_ci collation

== Problem Description

Files with Unicode characters in their filenames (e.g., `Note
d'information Gestion des récupérations.pdf`) fail under Apache/WSGI in
two ways:

1. **File size displays as "0 bytes"** when using `{{
attachment.file.size|filesizeformat }}`
2. **File downloads return HTTP 404 errors**

Both issues work correctly under Django's `runserver` but fail in
production under Apache/WSGI.

== Root Cause Analysis

=== 1. ASCII Encoding Default
Apache/WSGI defaults to ASCII encoding for standard streams, unlike
`runserver` which uses UTF-8.
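
One way to see which encodings the WSGI process actually picked up is to log them from inside the application. The following is a minimal sketch using only the standard library; `describe_encodings` is a hypothetical helper name and is not part of Django or mod_wsgi.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python derived from the process environment."""
    return {
        # Used by os.fsencode()/os.fsdecode() and for str paths given to open().
        "filesystem": sys.getfilesystemencoding(),
        # Derived from the locale (LANG/LC_ALL) of the running process.
        "preferred": locale.getpreferredencoding(False),
        # Default for str.encode()/bytes.decode(); 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

If `filesystem` reports an ASCII encoding under Apache but `utf-8` under `runserver`, the difference most likely comes from the locale of the process rather than from Django.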

=== 2. FileField.size Property Failure
The `FileField.size` property attempts to access file metadata using the
default ASCII codec, which fails for non-ASCII characters in paths:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in
position 88: ordinal not in range(128)
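
The error above can be reproduced outside Django by encoding a non-ASCII path with the ASCII codec. This is only an illustration of the failure mode; the path and the helper name `ascii_encode_error` are made up for the example.

```python
def ascii_encode_error(path):
    """Return the UnicodeEncodeError raised by ASCII-encoding path, or None."""
    try:
        path.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# Hypothetical media path, used only for illustration.
error = ascii_encode_error("/srv/media/récupérations.pdf")
if error is not None:
    # ascii() keeps the output printable even when stdout itself is ASCII-only.
    print("first non-ASCII character at index", error.start,
          ascii(error.object[error.start]))
```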

=== 3. UTF-8 Mojibake
File paths from the database (stored as UTF-8) get incorrectly interpreted
as Latin-1 by Apache/WSGI. For example:
* **Actual filename**: `récupérations.pdf`
* **In database**: UTF-8 bytes `\xc3\xa9` (correct encoding of "é")
* **Received by Django**: String `r\xc3\xa9cup\xc3\xa9rations` (UTF-8
bytes misinterpreted as Latin-1 characters)
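
The transformation described in the list above can be shown in a few lines. This sketch only demonstrates the encoding round trip; it makes no claim about where in the stack the misinterpretation happens.

```python
original = "récupérations.pdf"

# The bytes as stored: "é" becomes the two bytes 0xC3 0xA9.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns every byte into one character.
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong decode step recovers the original string.
repaired = garbled.encode("latin-1").decode("utf-8")

print(ascii(garbled))
print(repaired == original)
```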

=== 4. Filesystem Operations
`os.path.exists()`, `os.path.getsize()`, and `open()` fail when Python
tries to encode strings using the default ASCII codec.

== Workaround Overview

The workaround requires three components:

=== 1. Custom `filesize` Template Filter
Replace `{{ attachment.file.size|filesizeformat }}` with a custom filter
that:
* Fixes UTF-8 mojibake by re-encoding:
`path.encode('latin-1').decode('utf-8')`
* Uses explicit UTF-8 byte paths: `path.encode('utf-8')`
* Performs filesystem operations with byte strings to bypass ASCII codec

**Usage**:

{{{
{{ attachment.file.path|filesize|filesizeformat }}
}}}
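
A filter along these lines could look like the sketch below. It is not the attached `tags.py`; the function names are illustrative, and the mojibake repair is a heuristic that leaves a string untouched whenever the Latin-1/UTF-8 round trip fails. In a Django project the function would be registered on a `django.template.Library()` instance with `@register.filter`; that part is omitted so the example runs with the standard library alone.

```python
import os


def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake (or not repairable this way): keep the original string.
        return text


def filesize(path):
    """Return the size in bytes of the file at path, or 0 if it is unreadable."""
    # Byte paths are passed to the OS as-is, so no implicit codec is involved.
    path_bytes = fix_mojibake(str(path)).encode("utf-8")
    try:
        return os.path.getsize(path_bytes)
    except OSError:
        return 0
```

Returning 0 on failure mirrors the "0 bytes" symptom, so logging the exception instead may be preferable while debugging.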

=== 2. Custom File Serving View

Replace `django.views.static.serve` with a Unicode-aware version
(`serve_unicode`) that:
- Fixes UTF-8 mojibake in incoming URL paths
- Converts paths to UTF-8 bytes before filesystem operations
- Opens files using byte paths: open(fullpath_bytes, 'rb')
- Maintains security checks for path traversal
- Handles HTTP caching headers properly

**URL Configuration**:

{{{
re_path(
    r'^%s(?P<path>.*)$' % re.escape(settings.MEDIA_URL.lstrip('/')),
    serve_unicode,
    {'document_root': settings.MEDIA_ROOT},
)
}}}
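
The path handling that such a view needs (mojibake repair, byte paths, and a traversal check) might be sketched as follows. This is not the attached `views.py`; `resolve_media_path` is a hypothetical helper, and it assumes the URL resolver has already percent-decoded the `path` argument.

```python
import os


def resolve_media_path(document_root, relative_path):
    """Return the absolute byte path of a file under document_root, or None."""
    try:
        # Undo Latin-1 mojibake if present; correct strings fail the round trip.
        relative_path = relative_path.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass
    root = os.path.realpath(document_root.encode("utf-8"))
    candidate = os.path.realpath(
        os.path.join(root, relative_path.lstrip("/").encode("utf-8"))
    )
    # Reject anything that resolves outside the document root ("../" tricks).
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate if os.path.isfile(candidate) else None
```

A view built on this would open the returned path with `open(path_bytes, 'rb')` and return a 404 response when the helper yields `None`.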

=== 3. URL Encoding Filter (Optional)

Add a `urlencode_path` filter to properly encode URLs for `href` attributes:
- Decodes existing encoding to avoid double-encoding
- Re-encodes with proper UTF-8 percent-encoding
- Handles special characters (apostrophes, spaces, accented characters)

**Usage**:

{{{
<a href="{{ attachment.file.url|urlencode_path }}?filename={{ attachment.friendly_name|urlencode }}">
}}}
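
The decode-then-re-encode step can be expressed with `urllib.parse`. The sketch below is a plain function with an illustrative name; registering it as a template filter is left out.

```python
from urllib.parse import quote, unquote


def urlencode_path(url):
    """Percent-encode a URL path as UTF-8 without double-encoding it."""
    # unquote() first, so input that is already encoded is not encoded twice;
    # "/" stays literal so the path structure is preserved.
    return quote(unquote(url), safe="/")


print(urlencode_path("/media/Note d'information.pdf"))
```

One caveat of this approach: a filename containing a literal `%` followed by two hex digits would be decoded by `unquote()` and change meaning.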

== Key Techniques

=== 1. Mojibake Fix

Convert UTF-8 bytes incorrectly decoded as Latin-1 back to proper UTF-8:

{{{
path = path.encode('latin-1').decode('utf-8')
}}}

=== 2. Byte Paths for Filesystem Operations

Always use byte strings for filesystem access:

{{{
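import os

# Encode explicitly so no implicit (possibly ASCII) codec is involved.
path_bytes = path.encode('utf-8')
if os.path.exists(path_bytes):
    size = os.path.getsize(path_bytes)
    with open(path_bytes, 'rb') as f:
        ...  # read or stream the file contents here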
}}}

=== 3. Explicit UTF-8 Encoding

Never rely on the default encoding: `os.fsencode()` uses the filesystem
encoding, which can be ASCII when the Apache/WSGI process runs without a
UTF-8 locale. Always specify UTF-8 explicitly: `path.encode('utf-8')`

== Testing Checklist

Test with filenames containing:
- Accented characters: café.pdf
- Apostrophes: Note d'information.pdf
- Multiple Unicode characters: récupérations.pdf
- Spaces and apostrophes: Note d'information Gestion des récupérations.pdf
- Non-Latin scripts: 文档.pdf
- Mixed characters: rapport_année_2024.pdf
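
The checklist can be turned into a quick self-check. The sketch below writes each filename into a temporary directory using explicit UTF-8 byte paths and verifies that the size reads back correctly; it exercises only the filesystem layer, not Apache or Django, and the helper names are illustrative.

```python
import os
import tempfile

FILENAMES = [
    "café.pdf",
    "Note d'information.pdf",
    "récupérations.pdf",
    "Note d'information Gestion des récupérations.pdf",
    "文档.pdf",
    "rapport_année_2024.pdf",
]


def check_round_trip(directory, name, payload=b"test"):
    """Write payload under name via a byte path and confirm its size."""
    path_bytes = os.path.join(directory, name).encode("utf-8")
    with open(path_bytes, "wb") as fh:
        fh.write(payload)
    return os.path.exists(path_bytes) and os.path.getsize(path_bytes) == len(payload)


def run_checks():
    """Return a mapping of filename to whether the round trip succeeded."""
    with tempfile.TemporaryDirectory() as directory:
        return {name: check_round_trip(directory, name) for name in FILENAMES}


if __name__ == "__main__":
    for name, ok in run_checks().items():
        print(ascii(name), "ok" if ok else "FAILED")
```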

== Related Issues

This addresses the common Apache/WSGI Unicode problem where:
- UnicodeEncodeError: 'ascii' codec can't encode character
- File operations work in development (runserver) but fail in production
(Apache/WSGI)
- Database stores UTF-8 correctly but Apache/WSGI mangles the encoding
--
Ticket URL: <https://code.djangoproject.com/ticket/36777>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

6:09 AM
to django-...@googlegroups.com
Changes (by Caram):

* Attachment "views.py" added.

Django

6:11 AM
to django-...@googlegroups.com
Changes (by Caram):

* Attachment "tags.py" added.

Django

9:02 AM
to django-...@googlegroups.com
Changes (by Simon Charette):

* resolution: => invalid
* status: new => closed

Comment:

Please refer to the
[https://docs.djangoproject.com/en/6.0/howto/deployment/wsgi/modwsgi/ How
to use Django with Apache and mod_wsgi documentation] on the subject and
avoid using an LLM to generate overly verbose reports.

> Fixing `UnicodeEncodeError` for file uploads
>
> If you get a `UnicodeEncodeError` when uploading or writing files with
> file names or content that contains non-ASCII characters, **make sure
> Apache is configured to support UTF-8 encoding**

Your mention of

> Apache/WSGI defaults to ASCII encoding for standard streams, unlike
> `runserver` which uses UTF-8.

clearly points out that you didn't refer to the documentation on the subject.
--
Ticket URL: <https://code.djangoproject.com/ticket/36777#comment:1>

Django

11:18 AM
to django-...@googlegroups.com
Changes (by Caram):

* resolution: invalid =>
* status: closed => new

Comment:

Thanks Simon. Could you please have a closer look before you close the
ticket? It's a real issue: it crashed my production server after moving
to Django 6.0, and I spent 3 hours yesterday debugging and fixing it, so I
think it deserves a little more consideration.

I realise the ticket description may be suboptimal, and I'm sorry if it
caused any irritation. But closing tickets, maybe a little hastily, does
not quite send the kind of positive message that we would like to send the
community when they are reporting or fixing bugs, and I feel that the
board would agree with me.

Again, please have a closer look and let me know if you need any
additional information. I have quite a detailed trace of the debugging
work I performed yesterday.
--
Ticket URL: <https://code.djangoproject.com/ticket/36777#comment:2>

Django

1:52 PM
to django-...@googlegroups.com
Changes (by Jacob Walls):

* resolution: => invalid
* status: new => closed

Comment:

I appreciate you're feeling frustrated, but please don't reopen tickets
without bringing new information to light. Same-day triage isn't hasty:
firsthand experience in an area can lead to more efficient triage. We're
not stubborn; we change our minds when new information compels it.
--
Ticket URL: <https://code.djangoproject.com/ticket/36777#comment:3>

Django

4:18 PM
to django-...@googlegroups.com
Comment (by Simon Charette):

Some extra notes you can find if you `git blame` the origin of the
documentation linked above:

- #17686 (25b912abbe31fa440e702b5273c18cf74e2d6e0b) which I started
watching 14 years ago when I ran into a similar problem
- Some [https://docs.djangoproject.com/en/6.0/ref/unicode/#files extra
documentation on diagnosing misconfigurations that lead to mishandling of
Unicode file names in file uploads]
- A similar report on Apache2, Ubuntu, and `mod_wsgi` on the forum,
[https://forum.djangoproject.com/t/unicodeencodeerror-ubuntu-apache2-admin-page/29351/2
resolved by following the documentation]
--
Ticket URL: <https://code.djangoproject.com/ticket/36777#comment:4>