>>> invalid_text = '\ud802\udf12'
>>> print(invalid_text) # we'd expect this to fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> import django.utils.encoding
>>> django.VERSION
(2, 2, 0, 'alpha', 1)
>>> valid_text = django.utils.encoding.force_text(invalid_text)
>>> print(valid_text) # we'd expect this to succeed?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> valid_text
'\ud802\udf12'
}}}
Perhaps this is a flaw in my expectations? I'd expect `force_text()`'s
output to always be valid text -- even though Python allows me to create
_non-text_ `str` objects. (In this case, I'd expect maybe `\ufffd\ufffd`
-- Unicode replacement characters.)
Unicode primer: `\ud802` is a "lone surrogate" in this context. A lone
surrogate is a valid Unicode _code point_ but it does not represent
_text_. (Lone surrogates can crop up if someone decodes valid UCS-2 as
UTF-16.) I don't think any caller of `force_text()` expects it to ever
return a non-textual Unicode string.
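To make the distinction concrete, here is a minimal sketch in plain Python 3 (no Django involved) of how a `str` holding surrogate code points behaves. The helper name `has_surrogate` is hypothetical, not part of Python or Django.

```python
def has_surrogate(text):
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any('\ud800' <= ch <= '\udfff' for ch in text)

# In a Python 3 str literal, these are two separate surrogate code points;
# they are not combined into one supplementary character.
invalid_text = '\ud802\udf12'
assert len(invalid_text) == 2
assert has_surrogate(invalid_text)
assert not has_surrogate('caf\u00e9')

# Strict UTF-8 encoding refuses surrogates, which is why print() failed above.
try:
    invalid_text.encode('utf-8')
except UnicodeEncodeError as err:
    print(err.reason)  # surrogates not allowed

# The built-in 'replace' error handler substitutes '?' when encoding,
# not U+FFFD, so it is lossy in a different way than the report suggests.
replaced = invalid_text.encode('utf-8', 'replace')
assert replaced == b'??'
```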
--
Ticket URL: <https://code.djangoproject.com/ticket/30481>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
Comment (by Claude Paroz):
I don't think that fixing invalid Unicode input is in the contract of
`force_str`/`force_text`.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:1>
Comment (by Adam Hooper):
That's fair; then perhaps there should be some documentation to that
effect?
Where I'm coming from: Postgres gave me an error when I tried to INSERT a
string that was passed to my handler via JSON. It turns out Python's
`json.loads()` can produce lone surrogates (because JSON can contain them
-- https://bugs.python.org/issue17906); but Postgres TEXT (or JSON or
JSONB) fields only store well-formed Unicode text. "I mustn't be the only
person with this problem," I figured. I found `force_text()`. It looks
like exactly the utility I need -- especially since it's littered all over
the `django.db` package.
Then I needed to learn that it wasn't.
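As a rough illustration of the JSON behaviour described above, the following sketch uses only the standard library. The helper `is_well_formed` is a hypothetical name introduced here, and the payload is made up for the example.

```python
import json

def is_well_formed(text):
    """Return True if text can be encoded as strict UTF-8 (no surrogates)."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True

# A JSON document may carry an escaped lone surrogate; json.loads() accepts it
# and hands back a str containing that surrogate code point.
payload = '{"name": "\\ud802"}'
name = json.loads(payload)["name"]
assert name == '\ud802'
assert not is_well_formed(name)

# A properly paired escape is combined into a single supplementary character,
# which is well-formed text.
paired = json.loads('"\\ud802\\udf12"')
assert len(paired) == 1
assert is_well_formed(paired)
```

A check like this could run before the value reaches the database layer, so the failure surfaces as a validation error rather than a Postgres error.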
I ended up writing my own utility to replace surrogates. For anyone
reading:
{{{
import re

Surrogates = re.compile(r'[\ud800-\udfff]')

def force_valid_text(text):
    return Surrogates.sub('\ufffd', text)
}}}
(I had to add `\u0000` to the regex, too, because Postgres doesn't allow
that, either. But I feel that's a Postgres-specific issue, whereas the
utility of `force_text()` is more general.)
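For reference, the Postgres-specific variant mentioned in the parenthetical might look like the sketch below. The names `_UNSTORABLE` and `force_postgres_text` are hypothetical; the exact regex the reporter used is not shown in the ticket, so this is an assumption about its shape.

```python
import re

# Hypothetical pattern: NUL (U+0000) plus every surrogate code point.
_UNSTORABLE = re.compile(r'[\x00\ud800-\udfff]')

def force_postgres_text(text):
    """Replace NUL and surrogate code points with U+FFFD."""
    return _UNSTORABLE.sub('\ufffd', text)

cleaned = force_postgres_text('a\x00b\ud802c')
assert cleaned == 'a\ufffdb\ufffdc'

# The cleaned string now encodes as strict UTF-8 without raising.
cleaned.encode('utf-8')
```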
In the end, I wrote my own `force_text()` utility. It would have saved me
some effort if the documentation had told me that `force_text()` wasn't
what I want when preparing arbitrary input text for a database text field.
I'd be happy to compose a few sentences to clarify this in the docs. Where
does this documentation belong? I was startled when I learned Django can
allow invalid-text `str` as input in perfectly ordinary usage; but it
turns out it ''must'' because JSON allows them.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:2>
* component: Utilities => Documentation
* type: Uncategorized => Cleanup/optimization
* stage: Unreviewed => Accepted
Comment:
Replying to [comment:2 Adam Hooper]:
> I'd be happy to compose a few sentences to clarify this in the docs.
> Where does this documentation belong?
Hey Adam. Since you're happy to compose the patch, let's Accept this to
see what you come up with. (I'm a bit _meh_ to be honest: this looks like
more trouble than it's worth to explain but...)
The place for it would be the
[https://docs.djangoproject.com/en/2.2/ref/utils/#django.utils.encoding.force_text
`force_text()` docs].
Thanks!
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:3>
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:4>
* status: new => closed
* resolution: => wontfix
Comment:
Django 2.2 has reached the end of mainstream support and `force_text()` is
deprecated in Django 3.0, so this ticket is not valid anymore.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:5>
* status: closed => new
* resolution: wontfix =>
Comment:
I think this issue still stands; `force_text` was just an alias for
`force_str`.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:6>
Comment (by Baptiste Mispelon):
I think the original issue came up because of the confusing usage of the
word "text" in `force_text()`.
Django used "text" in opposition to "bytes" but the reporter understood
"text" in the context of Unicode which has a slightly different meaning.
The original report said:
> [...] Python allows me to create _non-text_ str objects
So I think the renaming of `force_text` to `force_str` fixed this issue by
removing the association with the concept of "text".
As things are now, `force_str` has the same limitations as Python's `str`
when it comes to Unicode issues like lone surrogates, so I don't believe we
need to document them.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:7>
* status: new => closed
* resolution: => wontfix
Comment:
Thanks Baptiste, convincing conclusion :-)
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:8>
Comment (by Simon Charette):
Makes sense to me.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:9>
Comment (by Adam Hooper):
As the original reporter, I agree: calling it `force_str()` makes clear
what it does.
I still perceive Django to lack functionality. I originally filed this bug
report because I assumed the Django framework supported JSON-encoded
requests. This led me to assume `force_text()` was a solution.
But Django docs don't mention JSON-encoded requests. So I think it's
consistent to close this bug and declare, "Django doesn't support JSON-
encoded requests, unless you invest serious effort."
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:10>
Comment (by Baptiste Mispelon):
Replying to [comment:10 Adam Hooper]:
> As the original reporter, I agree: calling it `force_str()` makes clear
> what it does.
>
> I still perceive Django to lack functionality. I originally filed this
> bug report because I assumed the Django framework supported JSON-encoded
> requests. This led me to assume `force_text()` was a solution.
>
> But Django docs don't mention JSON-encoded requests. So I think it's
> consistent to close this bug and declare, "Django doesn't support
> JSON-encoded requests, unless you invest serious effort."
Personally, I'm not familiar with JSON-encoded requests or what would be
required for Django to support them, but if that's a feature you're
interested in, you could try starting a discussion on the
DevelopersMailingList.
--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:11>