The idea behind errors='replace' is that any bytes that cannot be
properly decoded as unicode will be replaced with an acceptable
unicode char (U+FFFD, the replacement character, which often renders
as '?'). So, if you want to avoid crashing your program on bad input,
this is the only acceptable approach and, I assume, why the current
approach was taken way back when Django was converted to be all
unicode all the time (internally).
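
For anyone following along, here is a minimal plain-Python sketch of the difference between the two error handlers. It does not touch Django at all, and the sample filename bytes are made up for illustration:

```python
# Hypothetical upload filename: 0xE9 is "é" in Latin-1, but it is not a
# valid UTF-8 sequence when followed by an ASCII byte such as ".".
raw = b"caf\xe9.jpg"

# errors='strict' (the default) raises instead of guessing.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' never raises; the undecodable byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
```

After this runs, `strict_failed` is true and `replaced` holds the filename with U+FFFD where the "é" was meant to be, so the original byte is no longer recoverable from the decoded string.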
However, you bring up an interesting edge case. Apparently the
replacement char is causing an invalid SQL statement to be generated -
which breaks things in ugly ways. Probably not what we want.
I see a couple of approaches to this:
1) This could fall under the 'we need better error messages' umbrella
and it should be easier to determine what the bad SQL statement was.
2) One could argue that errors='replace' is mangling user input and
really an error should be returned to the user, allowing them to fix
the error in the form and resubmit.
I'm not sure which approach is the way to go here. However, forcing
users to deal with encodings is generally a bad idea. Besides, you
never can trust a browser to give you what it says it is giving you.
In other words, the user may not be able to get the browser to send
the correct encoding anyway. For those reasons I'm leaning toward #1.
Of course, that raises the question: should Django be doing a better job
escaping the data used to build the SQL statement? I guess we won't
know unless we get the bad SQL statement. Which takes us back to #1.
--
----
\X/ /-\ `/ |_ /-\ |\|
Waylan Limberg
This reminds me of Postel's Law, "Be conservative in what you do, be
liberal in what you accept from others." I see the reason behind your
desire to respond to this unicode error at the point of encoding, but
the issue is that there is no reasonable way for anyone to handle the
error condition. I mean, do you spit back a validation error, "Your
browser encoded the image filename in an inconsistent way. Go jump in
a ditch"? If the browser gave you a filename in a wonky encoding, a
best-effort approximation of the intended filename is the best anyone
can do, unless you want to send a reminder to upgrade to the latest
Chrome version.
So the real bug here is not, "Why can't I catch and fix Unicode
problems when decoding the filename?" because you almost certainly
don't want to do that. The question is why force_encoding(...,
errors='replace') is giving you a string that PostgreSQL can't handle.
Best,
Alex Ogier
-Mike
I consider silently replacing characters in user data to avoid an exception while decoding to be a silent data loss / corruption issue.
Depending on the actual data being submitted and the web site in question, this may be a non-issue, it may be inconvenient but acceptable, it may be critical, and it may cause errors down the line (like the `DatabaseError` exception).
However, neither users nor developers currently have any choice in the matter, and most won't even know it is happening.
If simply switching to `errors='strict'` is not acceptable, what do people think about adding a `dirty` or `raw` property to `request.FILES` and `request.POST` when a decoding error occurs, and storing the original byte strings in there while keeping the forcibly decoded strings in `request.FILES` and `request.POST` as they are now?
This would give developers a chance to deal with this in a way that suits them. They could use middleware to re-raise the silenced exception, or they could try to decode with common alternative encodings before falling back to forcibly decoding (or raising an exception), or they could return a form validation error in their view, or they could still use the forcibly decoded data but warn users that their data was altered.
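
To make the "try common alternative encodings before falling back" option concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: `Decoded`, `decode_field`, and the choice of `cp1252` as a fallback are hypothetical names and defaults, not anything Django provides.

```python
from typing import NamedTuple, Optional, Sequence


class Decoded(NamedTuple):
    text: str                 # the decoded string handed to the application
    raw: Optional[bytes]      # original bytes, kept only if strict decoding failed
    encoding: Optional[str]   # encoding that worked, or None if we had to replace


def decode_field(raw: bytes, declared: str = "utf-8",
                 fallbacks: Sequence[str] = ("cp1252",)) -> Decoded:
    """Decode strictly, then try fallbacks, then forcibly decode as a last resort."""
    try:
        return Decoded(raw.decode(declared), None, declared)
    except UnicodeDecodeError:
        pass
    for enc in fallbacks:
        try:
            # Keep the raw bytes so the caller can tell a guess was made.
            return Decoded(raw.decode(enc), raw, enc)
        except UnicodeDecodeError:
            continue
    # Same result as today's behaviour, but the original bytes are preserved.
    return Decoded(raw.decode(declared, errors="replace"), raw, None)
```

A caller could treat a non-`None` `raw` as the signal to warn the user, return a validation error, or log the original bytes, which is the kind of choice the `dirty` / `raw` property is meant to allow.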
If we were to go down this route, I would still prefer to see Django ship with such a middleware that re-raises the `UnicodeDecodeError` exception and have it enabled by default, simply because this issue involves silent changes to user supplied data. If it causes an error for anyone, and it shouldn't normally as this appears to be an edge case, the issue will be easily diagnosed and the developer can then choose if they want to silently replace data, or attempt alternative decoding, or display an error to users.
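
A sketch of what such a middleware might look like, assuming the proposed `dirty` attribute existed as a mapping of field name to original bytes on `request.POST` and `request.FILES`. That attribute is hypothetical, as is the class name, and the callable `get_response` style is just one way to write middleware. The class deliberately imports nothing from Django so the idea can be read on its own.

```python
class StrictDecodingMiddleware:
    """Hypothetical middleware: turn silently replaced input back into an error."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Fall back to UTF-8 when the request does not name an encoding.
        charset = getattr(request, "encoding", None) or "utf-8"
        for attr in ("POST", "FILES"):
            data = getattr(request, attr, None)
            # `dirty` is the proposed (not existing) mapping of
            # field name -> original undecodable bytes.
            for raw in getattr(data, "dirty", {}).values():
                # Strict decoding raises UnicodeDecodeError for the bad bytes.
                raw.decode(charset)
        return self.get_response(request)
```

A project that prefers the current silent replacement would simply leave this out of its middleware list; one that wants a softer response could catch the exception and return a validation error instead.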
The Django docs for file uploads say:
> The content-type header uploaded with the file (e.g. text/plain or application/pdf). Like any data supplied by the user, you shouldn't trust that the uploaded file is actually this type. You'll still need to validate that the file contains the content that the content-type header claims -- "trust but verify."
I think Django should follow the "trust but verify" principle when decoding all POST data. It's true that we can't accurately detect what character encoding is being used for supplied data, and this is precisely why we shouldn't forcibly decode POST data. We can at least inform developers (if not users) when the specified character encoding is proven to be wrong, and allow them to choose how to handle it.
Cheers.
Tai.