[Django] #30481: force_text() allows lone surrogates


Django

May 15, 2019, 12:46:04 PM
to django-...@googlegroups.com
#30481: force_text() allows lone surrogates
-------------------------------------+-------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Uncategorized | Status: new
Component: Utilities | Version: 2.2
Severity: Normal | Keywords: force_text unicode
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
{{{
$ python3
Python 3.7.3 (default, Mar 27 2019, 13:36:35)
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> invalid_text = '\ud802\udf12'
>>> print(invalid_text) # we'd expect this to fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

>>> import django.utils.encoding
>>> django.VERSION
(2, 2, 0, 'alpha', 1)

>>> valid_text = django.utils.encoding.force_text(invalid_text)
>>> print(valid_text) # we'd expect this to succeed?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

>>> valid_text
'\ud802\udf12'
}}}

Perhaps this is a flaw in my expectations? I'd expect `force_text()`'s
output to always be a valid text -- even though Python allows me to create
_non-text_ `str` objects. (In this case, I'd expect maybe `\ufffd\ufffd`
-- Unicode replacement characters.)

Unicode primer: `\ud802` is a "lone surrogate" in this context. A lone
surrogate is a valid Unicode _code point_ but it does not represent
_text_. (Lone surrogates can crop up if someone decodes valid UCS-2 as
UTF-16.) I don't think any caller of `force_text()` expects it to ever
return a non-textual Unicode string.
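
For readers who want to check a string before handing it to a stricter consumer, here is a minimal sketch in plain Python (no Django involved). The helper names `has_surrogates` and `is_encodable_utf8` are made up for illustration. The sketch relies only on two facts: code points U+D800 through U+DFFF are surrogates, and the strict UTF-8 codec refuses to encode them.

```python
def has_surrogates(text):
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def is_encodable_utf8(text):
    """Return True if the strict UTF-8 codec can encode text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


invalid_text = "\ud802\udf12"
print(has_surrogates(invalid_text))     # the string holds surrogate code points
print(is_encodable_utf8(invalid_text))  # so strict encoding is refused
```

Note that `str.encode(..., errors="replace")` substitutes `?` when encoding, not U+FFFD, so it may not be the replacement a caller has in mind.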

--
Ticket URL: <https://code.djangoproject.com/ticket/30481>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

May 16, 2019, 2:51:08 AM
to django-...@googlegroups.com
#30481: force_text() allows lone surrogates
------------------------------------+--------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Uncategorized | Status: new
Component: Utilities | Version: 2.2
Severity: Normal | Resolution:
Keywords: force_text unicode | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
------------------------------------+--------------------------------------

Comment (by Claude Paroz):

I don't think that fixing invalid Unicode input is in the contract of
`force_str`/`force_text`.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:1>

Django

May 16, 2019, 9:41:25 AM
to django-...@googlegroups.com
#30481: force_text() allows lone surrogates
------------------------------------+--------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Uncategorized | Status: new
Component: Utilities | Version: 2.2
Severity: Normal | Resolution:
Keywords: force_text unicode | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
------------------------------------+--------------------------------------

Comment (by Adam Hooper):

That's fair; then perhaps there should be some documentation to that
effect?

Where I'm coming from: Postgres gave me an error when I tried to INSERT a
string that was passed to my handler via JSON. It turns out Python's
`json.loads()` can produce lone surrogates (because JSON can contain them
-- https://bugs.python.org/issue17906); but Postgres TEXT (or JSON or
JSONB) fields only store well-formed Unicode text. "I mustn't be the only
person with this problem," I figured. I found `force_text()`. It looks
like exactly the utility I need -- especially since it's littered all over
the `django.db` package.

Then I needed to learn that it wasn't.
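
A short sketch of how this can happen, using only the standard library. The payload and the `name` key are invented for illustration. The point is that `json.loads()` accepts a `\ud802` escape with no low surrogate after it and returns a `str` that the strict UTF-8 codec then rejects.

```python
import json

# A JSON document whose string value is a single \ud802 escape,
# with no low surrogate following it.
payload = '{"name": "\\ud802"}'

data = json.loads(payload)
name = data["name"]

# The decoded value is a one-character str holding a lone surrogate.
print(len(name), hex(ord(name)))

# Strict UTF-8 encoding rejects it.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```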

I ended up writing my own utility to replace surrogates. For anyone
reading:

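{{{
import re

# Matches any surrogate code point (U+D800 through U+DFFF).
Surrogates = re.compile(r'[\ud800-\udfff]')

def force_valid_text(text):
    # Replace each surrogate with U+FFFD REPLACEMENT CHARACTER.
    return Surrogates.sub('\ufffd', text)
}}}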

(I had to add `\u0000` to the regex, too, because Postgres doesn't allow
that, either. But I feel that's a Postgres-specific issue, whereas the
utility of `force_text()` is more general.)
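
A sketch of what that Postgres-oriented variant might look like. The name `force_postgres_text` is hypothetical, and replacing NUL with U+FFFD (rather than dropping it) is an assumption; the comment above only says that `\u0000` was added to the regex.

```python
import re

# Surrogates (U+D800..U+DFFF) plus NUL (U+0000).
# Assumption: NUL gets the same replacement character as surrogates.
_UNSAFE = re.compile(r"[\x00\ud800-\udfff]")


def force_postgres_text(text):
    """Replace NUL and surrogate code points with U+FFFD."""
    return _UNSAFE.sub("\ufffd", text)


cleaned = force_postgres_text("a\x00b\ud802c")
# The cleaned string can now be encoded with the strict UTF-8 codec.
encoded = cleaned.encode("utf-8")
```

Dropping the offending characters instead (passing `""` to `sub`) would also work; which is preferable depends on whether the caller wants the damage to stay visible in the stored value.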

In the end, I wrote my own `force_text()` utility. It would have saved me
some effort if the documentation had told me that `force_text()` wasn't
what I wanted when preparing arbitrary input text for a database text field.

I'd be happy to compose a few sentences to clarify this in the docs. Where
does this documentation belong? I was startled when I learned Django can
allow invalid-text `str` as input in perfectly ordinary usage; but it
turns out it ''must'' because JSON allows them.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:2>

Django

May 16, 2019, 10:42:19 AM
to django-...@googlegroups.com
#30481: force_text() allows lone surrogates
--------------------------------------+------------------------------------

Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: new
Component: Documentation | Version: 2.2
Severity: Normal | Resolution:
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------
Changes (by Carlton Gibson):

* component: Utilities => Documentation
* type: Uncategorized => Cleanup/optimization
* stage: Unreviewed => Accepted


Comment:

Replying to [comment:2 Adam Hooper]:

> I'd be happy to compose a few sentences to clarify this in the docs.
> Where does this documentation belong?

Hey Adam. Since you're happy to compose the patch, let's Accept this to
see what you come up with. (I'm a bit _meh_ to be honest: this looks like
more trouble than it's worth to explain but...)

The place for it would be the
[https://docs.djangoproject.com/en/2.2/ref/utils/#django.utils.encoding.force_text
`force_text()` docs].

Thanks!

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:3>

Django

May 16, 2019, 10:43:14 AM
to django-...@googlegroups.com
#30481: Document that force_text() allows lone surrogates.

--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: new
Component: Documentation | Version: 2.2
Severity: Normal | Resolution:
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:4>

Django

Dec 5, 2019, 4:15:45 AM
to django-...@googlegroups.com
#30481: Document that force_text() allows lone surrogates.
--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: closed
Component: Documentation | Version: 2.2
Severity: Normal | Resolution: wontfix
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------
Changes (by felixxm):

* status: new => closed
* resolution: => wontfix


Comment:

Django 2.2 has reached the end of mainstream support and `force_text()` is
deprecated in Django 3.0, so this ticket is not valid anymore.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:5>

Django

Dec 5, 2019, 10:22:46 AM
to django-...@googlegroups.com
#30481: Document that force_str() allows lone surrogates.

--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: new

Component: Documentation | Version: 2.2
Severity: Normal | Resolution:
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------
Changes (by Simon Charette):

* status: closed => new
* resolution: wontfix =>


Comment:

I think this issue still stands, `force_text` was just an alias for
`force_str`.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:6>

Django

Dec 6, 2019, 4:22:05 AM
to django-...@googlegroups.com
#30481: Document that force_str() allows lone surrogates.
--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: new
Component: Documentation | Version: 2.2
Severity: Normal | Resolution:
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------

Comment (by Baptiste Mispelon):

I think the original issue came up because of the confusing usage of the
word "text" in `force_text()`.
Django used "text" in opposition to "bytes", but the reporter understood
"text" in the Unicode sense, which has a slightly different meaning.

The original report said:
[...] Python allows me to create _non-text_ str objects

So I think the renaming of `force_text` to `force_str` fixed this issue by
removing the association with the concept of "text".
As things are now, `force_str` has the same limitations as Python's `str`
when it comes to Unicode issues like lone surrogates, so I don't believe we
need to document them.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:7>

Django

Dec 6, 2019, 9:13:44 AM
to django-...@googlegroups.com
#30481: Document that force_str() allows lone surrogates.
--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: closed
Component: Documentation | Version: 2.2
Severity: Normal | Resolution: wontfix
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------
Changes (by Claude Paroz):

* status: new => closed
* resolution: => wontfix


Comment:

Thanks Baptiste, convincing conclusion :-)

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:8>

Django

Dec 6, 2019, 10:44:35 AM
to django-...@googlegroups.com
#30481: Document that force_str() allows lone surrogates.
--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: closed
Component: Documentation | Version: 2.2
Severity: Normal | Resolution: wontfix
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------

Comment (by Simon Charette):

Makes sense to me.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:9>

Django

Dec 6, 2019, 11:54:55 AM
to django-...@googlegroups.com
#30481: Document that force_str() allows lone surrogates.
--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: closed
Component: Documentation | Version: 2.2
Severity: Normal | Resolution: wontfix
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------

Comment (by Adam Hooper):

As the original reporter, I agree: calling it `force_str()` makes clear
what it does.

I still perceive Django to lack functionality. I originally filed this bug
report because I assumed the Django framework supported JSON-encoded
requests, which led me to assume `force_text()` was a solution.

But Django docs don't mention JSON-encoded requests. So I think it's
consistent to close this bug and declare, "Django doesn't support JSON-
encoded requests, unless you invest serious effort."

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:10>

Django

Dec 6, 2019, 3:04:35 PM
to django-...@googlegroups.com
#30481: Document that force_str() allows lone surrogates.
--------------------------------------+------------------------------------
Reporter: Adam Hooper | Owner: nobody
Type: Cleanup/optimization | Status: closed
Component: Documentation | Version: 2.2
Severity: Normal | Resolution: wontfix
Keywords: force_text unicode | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
--------------------------------------+------------------------------------

Comment (by Baptiste Mispelon):

Replying to [comment:10 Adam Hooper]:


> As the original reporter, I agree: calling it `force_str()` makes clear
> what it does.
>
> I still perceive Django to lack functionality. I originally filed this
> bug report because I assumed the Django framework supported JSON-encoded
> requests, which led me to assume `force_text()` was a solution.
>
> But Django docs don't mention JSON-encoded requests. So I think it's
> consistent to close this bug and declare, "Django doesn't support
> JSON-encoded requests, unless you invest serious effort."

Personally I'm not familiar with JSON-encoded requests or what would be
required for Django to support them but if that's a feature you're
interested in, you could try starting a discussion on the
DevelopersMailingList.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481#comment:11>
