Can we remove FILE_CHARSET?


Carlton Gibson

Oct 3, 2018, 3:46:37 AM
to Django developers (Contributions to Django itself)
> FILE_CHARSET (default:'utf-8') 
> The character encoding used to decode any files read from disk. This includes template files and initial SQL data files.

Is there anywhere where this isn't UTF-8? (Or can't be decreed to be so?)

Jon has a suggestion to remove it:

Ticket: https://code.djangoproject.com/ticket/29817


> You preach to a convert! However it's not about not being able to encode in UTF-8, but about the common file encoding on some platforms, especially Windows. I haven't used Windows for a long time now, so I can't say if UTF-8 is a common encoding nowadays or if it needs special handling (say, changing a program preference) in most Windows text editors.

Do you know about this? Can I ask for your input here? 

Thanks! 

Vasili Korol

Oct 3, 2018, 4:04:13 AM
to Django developers (Contributions to Django itself)
Some Russian companies still store their old data (in databases and/or files) in KOI8-R. I'm not sure how many of them may be using Django, but in 2014-2015 I personally worked for a company that maintained a huge database of articles stored in KOI8-R. I assume that KOI8-U may similarly be used in Ukraine. This is just how it turned out historically. The Windows encoding CP1251 is found less often, but even in the mid-2000s it was still competing with KOI8, so there may be some old databases in this encoding somewhere, too.
I would suggest keeping this setting for now.

Claude Paroz

Oct 3, 2018, 8:03:29 AM
to Django developers (Contributions to Django itself)
We are not talking about general data encodings here. FILE_CHARSET is used to read Django text files from disk (template files, static files (CSS, JS), or translation catalogs). So the question is mainly about encoding usage in text editors.

Claude

Jon Dufresne

Oct 3, 2018, 10:01:23 AM
to django-d...@googlegroups.com
I'm the one that proposed this setting be removed.

The setting is used in the following areas:

> ./django/template/backends/django.py:23:        options.setdefault('file_charset', settings.FILE_CHARSET)

I suppose this is its main use case. The Django template engine defaults to loading files from disk using the encoding specified by FILE_CHARSET. If a project needs to load templates using a different encoding, it can continue to do so by specifying an OPTION in the TEMPLATES setting:

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'OPTIONS': {
            'file_charset': 'latin1',
        },
    },
]


> ./django/core/management/commands/makemessages.py:106:        encoding = settings.FILE_CHARSET if self.command.settings_available else 'utf-8'

The makemessages management command loads files to preprocess using the encoding specified by FILE_CHARSET.


> ./django/contrib/staticfiles/storage.py:287:                    content = original_file.read().decode(settings.FILE_CHARSET)

The HashedFilesMixin loads files to preprocess using the encoding specified by FILE_CHARSET.


> ./django/template/backends/dummy.py:31:                with open(template_file, encoding=settings.FILE_CHARSET) as fp:

The dummy template backend loads files using the encoding specified by FILE_CHARSET. This dummy backend is used for internal testing purposes only and is not a documented or public API. So I think it could safely be modified without affecting projects or users.


That's it!

I think this setting has the same issue that was identified with DEFAULT_CONTENT_TYPE. That is, if a project sets FILE_CHARSET to a different value, interactions with third-party apps may be problematic. Third-party apps likely encode their templates and static files using UTF-8, so the use cases above may not work properly.
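
To make the third-party concern concrete, here is a minimal sketch in plain Python. It is not Django's code: decode_static_file is a hypothetical helper that only mimics decoding file bytes with a configurable charset, and the CSS snippet is made up.

```python
def decode_static_file(content: bytes, file_charset: str) -> str:
    """Hypothetical helper: decode raw file bytes with a configurable charset."""
    return content.decode(file_charset)


# Assume a third-party app ships this file encoded as UTF-8.
original = 'a::after { content: "é"; }'
third_party_css = original.encode("utf-8")

# Decoding with the charset the file was written in round-trips cleanly.
matching = decode_static_file(third_party_css, "utf-8")

# latin1 maps every byte to some character, so the mismatch is silent:
# the two UTF-8 bytes of "é" come back as two unrelated characters.
garbled = decode_static_file(third_party_css, "latin1")

# A stricter codec fails loudly instead of producing garbled text.
try:
    decode_static_file(third_party_css, "ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Depending on the codec, the mismatch either corrupts non-ASCII characters quietly or raises an error, and neither outcome is something the third-party app can guard against.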

Projects using a different encoding will still have a deprecation period to see the Django warnings, adjust the setting, and re-encode files. The removal won't be immediate. If such projects re-encode files to UTF-8 early, the projects will be both backwards and forwards compatible with current and future Django versions.
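
Re-encoding could be as simple as the following sketch. The function name reencode_to_utf8, the file name, and the KOI8-R source encoding are assumptions for illustration; a real project would need to know the actual encoding of each file.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite a text file in place as UTF-8 (hypothetical helper).

    Works on bytes so that line endings are left untouched. Raises
    UnicodeDecodeError if the file is not valid in source_encoding.
    """
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


# Demo on a throwaway file standing in for a legacy template.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "index.html"
    template.write_bytes("Привет".encode("koi8-r"))
    reencode_to_utf8(template, "koi8-r")
    result = template.read_bytes()
```

Once the files are UTF-8, they load the same way under the default FILE_CHARSET today and after the setting is gone.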

FWIW, I was unable to find examples of a changed FILE_CHARSET by searching GitHub.

Using a different value for FILE_CHARSET is currently untested internally (although I believe it works as designed).

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/ba98d239-479f-4b21-b899-8c9b39b921a3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Carlton Gibson

Oct 3, 2018, 10:55:45 AM
to Django developers (Contributions to Django itself)
Thanks for the input everyone. 

So Jon, are you basically saying that Vasili's concern shouldn't come up? (That the whole "SQL data files" bit is misleading...?)

Adam Johnson

Oct 3, 2018, 11:28:38 AM
to django-d...@googlegroups.com
Jon's logic seems right to me. I find the lack of tests disturbing, and I wouldn't be surprised if there were other places where Django loads files from disk without using FILE_CHARSET, even though a user of that setting would expect it to be used.


--
Adam

Jon Dufresne

Oct 3, 2018, 1:29:33 PM
to django-d...@googlegroups.com
> So Jon, are you basically saying that Vasili's concern shouldn't come up?

Yeah, I think it shouldn't come up. But I'm not sure I fully understand
Vasili's concern. Maybe if it were more specific with more details, I could
better understand it.

Django's documentation states:

https://docs.djangoproject.com/en/dev/ref/unicode/#creating-the-database

> Make sure your database is configured to be able to store arbitrary string
> data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you
> use a more restrictive encoding – for example, latin1 (iso8859-1) – you won’t
> be able to store certain characters in the database, and information will be
> lost.
>
> ...
>
> All of Django’s database backends automatically convert strings into the
> appropriate encoding for talking to the database. They also automatically
> convert strings retrieved from the database into strings. You don’t even need
> to tell Django what encoding your database uses: that is handled
> transparently.

So, if these non-UTF-8 articles are stored in the database, this doesn't
involve FILE_CHARSET. Are the articles stored as text or binary data? If text,
this violates existing Django documentation & assumptions. The database is
expected to be configured for UTF-8. If binary data, then the project's code
will be responsible for decoding it to a text string.

If, on the other hand, these articles are stored as files, how are they being
loaded? If they are being loaded through a Django code path, which one such
that FILE_CHARSET is involved? Or, are these articles loaded by project code
such that the encoding can be specified?

So, IIUC, it doesn't seem like FILE_CHARSET should be involved for this use
case.
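
For example, project code that loads such articles itself can name the encoding explicitly, without any Django setting being involved. This is only a sketch under assumptions: load_article_file and decode_article_blob are hypothetical helpers, and KOI8-R stands in for whatever legacy encoding the data actually uses.

```python
import tempfile
from pathlib import Path


def load_article_file(path: Path, encoding: str = "koi8-r") -> str:
    """Read an article stored as a file, naming its encoding explicitly."""
    with open(path, encoding=encoding) as fp:
        return fp.read()


def decode_article_blob(blob: bytes, encoding: str = "koi8-r") -> str:
    """Decode an article fetched as binary data (e.g. from a binary column)."""
    return blob.decode(encoding)


# Demo with a throwaway file and an in-memory blob.
legacy_bytes = "Статья".encode("koi8-r")
with tempfile.TemporaryDirectory() as tmp:
    article_path = Path(tmp) / "article.txt"
    article_path.write_bytes(legacy_bytes)
    from_file = load_article_file(article_path)

from_blob = decode_article_blob(legacy_bytes)
```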


> That the whole "SQL data files" bit is misleading...?

I was unable to find any code with an interaction between FILE_CHARSET & "SQL
data files". If it exists, do you have a link? I think this text may be
outdated or obsolete. Maybe that sentence should be rephrased to "template
files, static files, and translation catalogs".



Carlton Gibson

Oct 3, 2018, 3:14:45 PM
to Django developers (Contributions to Django itself)
Thanks for the follow-up Jon. 

I'll let Vasili follow-up on his use-case if possible/relevant. 

TBH I'm not at all sure about the SQL data files bit, which is in part why I asked here. 
(Encoding issues!) 

> Maybe that sentence should be rephrased to "template
> files, static files, and translation catalogs".

OK, so IF it's just this, then I'm on Windows doing development in UTF-8 no problem (and can't really envisage doing much different as it stands) but: 

* Is that always available these days? (I'd guess yes.)
* Is it something we want to impose? Not sure. Are there people doing otherwise? (No idea.)

(If we can drop a setting, that'd be 💃🏼)

Vasili Korol

Oct 4, 2018, 4:19:13 AM
to Django developers (Contributions to Django itself)
I guess my statement doesn't apply if FILE_CHARSET only affects Django text files, so disregard it. My point was that non-UTF data is still actively used despite the fact that "the whole world moved to Unicode".

Jon Dufresne

Oct 5, 2018, 3:25:28 PM
to django-d...@googlegroups.com
> Is that always available these days? (I'd guess yes.)

I too would guess yes. I believe any reasonably modern text editor will support
UTF-8 and even likely default to saving in that encoding. I know mine does.


> Is it something we want to impose? Not sure. Are there people doing
> otherwise? (No idea.)

For templates, it wouldn't be imposed. Users can still override the template
engine's encoding with the 'file_charset' option.

For static files, without imposing it, we're back to the third-party app
concern. Just like DEFAULT_CHARSET, it would be difficult to change FILE_CHARSET
_and_ integrate third party apps. The third party apps have likely encoded
their static files using UTF-8, so setting FILE_CHARSET to some other value
will break.

Cheers,
Jon

