triage: Problem for the up-loading of non-ASCII character file name.

tsuyuki makoto

unread,

Feb 7, 2007, 12:10:36 AM2/7/07

to django-d...@googlegroups.com

I'd like to ask for comments on #3119: problems on FileField/ImageField
with multi-byte filenames. Since this problem is caused by two reasons,
let me describe them step by step.

Multibyte characters in a filename are lost in get_valid_filaname().
--------------------------------------------------------------------

As in django.db.models.fields, FileField and its subtype calls
django.utils.text.get_valid_filename() to remove all "filename-unsafe"
characters from given filename. The resulting filename consists
of alphabets, numbers, hyphens and underscores. However, the behaviour
raises undesirable effect for those country using multibyte filenames.
For example, if original filename consists all of multibyte characters
and '.txt' extension (such as 'ファイル.txt'), the resulting filename
becomes '.txt' (no filename body but only extension).

Underscore-suffix uniquification easily collapses
-------------------------------------------------

Things get worse if we have a lot of such files: since FileField
suffixes underscores after filename until the filename become unique,
if we have files of ['壱号文書.doc', '弐号文書.doc', '参号文書.doc', ...],
then filename records will become ['.doc', '_.doc', '__.doc', ...].
When the number of underscores reaches to maxlength of filename field
(100 or so), then FileField will begin to raise errors because length
of the filename exceeds limit.

Proposed solution: punicode conversion before call
django.util.text.get_valid_filename.
----------------------------------------------------------------------------
Add STORE_FILENAME_AS_PUNYCODE to global_settings as False by default.
Encodes the given string in punycode except the extension if
STORE_FILENAME_AS_PUNYCODE is True.
Then generate a clean file name in get_valid_filename and return it.

Michael Radziej

unread,

Feb 7, 2007, 2:37:32 AM2/7/07

to django-d...@googlegroups.com

Hi Tsuyuki!

tsuyuki makoto:

> Proposed solution: punicode conversion before call
> django.util.text.get_valid_filename.

Why punycode? I'd think that most filesystems these days support UTF-8
(though, with different normalization, which *is* a problem).

* Wouldn't it be better to support any arbitrary
settings.FILE_SYSTEM_ENCODING?
* What encoding does python use if you pass unicode to open()?

Kind regards,

Michael

--
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100

http://www.noris.de - The IT-Outsourcing Company

Gábor Farkas

unread,

Feb 7, 2007, 3:09:23 AM2/7/07

to django-d...@googlegroups.com

Michael Radziej wrote:
> Hi Tsuyuki!
>
> tsuyuki makoto:
>> Proposed solution: punicode conversion before call
>> django.util.text.get_valid_filename.
>
> Why punycode? I'd think that most filesystems these days support UTF-8
> (though, with different normalization, which *is* a problem).
>
> * Wouldn't it be better to support any arbitrary
> settings.FILE_SYSTEM_ENCODING?

> * What encoding does python use if you pass unicode to open()?
>

for os.listdir it uses sys.getfilesystemencoding(), so i assume it does
the same for open().

so usually it does the correct thing.

so using unicode filenames are probably fine, but then we are again back
at the to-unicodify-or-not-to-unicodify-django question :)

gabor

Michael Radziej

unread,

Feb 7, 2007, 3:16:51 AM2/7/07

to django-d...@googlegroups.com

Gábor Farkas:

> Michael Radziej wrote:
>> * What encoding does python use if you pass unicode to open()?
> for os.listdir it uses sys.getfilesystemencoding(), so i assume it does
> the same for open().
>
> so usually it does the correct thing.
>
> so using unicode filenames are probably fine, but then we are again back
> at the to-unicodify-or-not-to-unicodify-django question :)

No, I'd propose simply to use the file system's encoding for files
within the file system, that's all, if there's no compelling reason
to use punycode. There's no connection to "unicode everywhere".

Gábor Farkas

unread,

Feb 7, 2007, 5:33:24 AM2/7/07

to django-d...@googlegroups.com

Michael Radziej wrote:
> Gábor Farkas:
>> Michael Radziej wrote:
>>> * What encoding does python use if you pass unicode to open()?
>> for os.listdir it uses sys.getfilesystemencoding(), so i assume it does
>> the same for open().
>>
>> so usually it does the correct thing.
>>
>> so using unicode filenames are probably fine, but then we are again back
>> at the to-unicodify-or-not-to-unicodify-django question :)
>
> No, I'd propose simply to use the file system's encoding for files
> within the file system, that's all, if there's no compelling reason
> to use punycode. There's no connection to "unicode everywhere".
>

well, maybe not a "direct" connection.

but, if i understand correctly, you propose it to behave like:

=============================
filename1 =
request.POST.somehow_get_the_filename_i_do_not_want_to_look_it_up_right_now()

# let's pray that the user's html templates
# are encoded using settings.DEFAULT_CHARSET

filename2 = filename1.decode(settings.DEFAULT_CHARSET)

filename3 = filename2.encode(sys.getfilesystemencoding())

f = open(filename3)
=============================

which for me seems like a hack..., because:

1. we do not mandate yet that GET/POST data is in settings.DEFAULT_CHARSET

2. generally playing these encode/decode games is not nice :)

but i agree, that doing this would probably solve a lot of cases.

gabor

Michael Radziej

unread,

Feb 7, 2007, 6:00:08 AM2/7/07

to django-d...@googlegroups.com

Gábor Farkas:

> but, if i understand correctly, you propose it to behave like:
>
>
> =============================
> filename1 =
> request.POST.somehow_get_the_filename_i_do_not_want_to_look_it_up_right_now()
>
> # let's pray that the user's html templates
> # are encoded using settings.DEFAULT_CHARSET
>
> filename2 = filename1.decode(settings.DEFAULT_CHARSET)
>
> filename3 = filename2.encode(sys.getfilesystemencoding())
>
> f = open(filename3)
> =============================
>
> which for me seems like a hack..., because:
>
> 1. we do not mandate yet that GET/POST data is in settings.DEFAULT_CHARSET

We don't mandate any particular encoding in
settings.DEFAULT_CHARSET? Take a look at ticket #951 and the recent
discussion about it here. Generic views assume it, and the models
assume that every bytestring they get is in UTF-8, so we already
have conflicting assumptions (when DEFAULT_CHARSET is not UTF-8).
The mysql backends fails if the default connection encoding as set
up with the database server is different from UTF-8.

This will probably all be fixed to use DEFAULT_CHARSET or even
unicode instead of UTF-8, but does anybody currently successfully
use a different DEFAULT_CHARSET with the present Django.

> 2. generally playing these encode/decode games is not nice :)

You don't get around it. Sigh.

I particularly dislike what python is doing here.

You shouldn't be able to call unicode(bytestring) or str(unicode)
without giving an encoding, same for comparing unicode and
bytestrings. Just go through the open tickets in newforms to see
why. If you would have to fill in an encoding, it would be obvious
what to do. But as it is, it's much too easy to forget, and then you
get code that only works for ASCII, and since the test data also
only contains ASCII data (quite naturally, since using English for
test cases is just normal), the test suite doesn't find these bugs.

Gábor Farkas

unread,

Feb 7, 2007, 6:48:44 AM2/7/07

to django-d...@googlegroups.com

Michael Radziej wrote:
> Gábor Farkas:
>>

>> 1. we do not mandate yet that GET/POST data is in settings.DEFAULT_CHARSET
>
> We don't mandate any particular encoding in
> settings.DEFAULT_CHARSET? Take a look at ticket #951 and the recent
> discussion about it here.

i'm sorry, are you sure that it's
http://code.djangoproject.com/ticket/951 ?

> Generic views assume it,

where?

> and the models
> assume that every bytestring they get is in UTF-8, so we already
> have conflicting assumptions (when DEFAULT_CHARSET is not UTF-8).
> The mysql backends fails if the default connection encoding as set
> up with the database server is different from UTF-8.

might be. i use utf-8 with postgres, so i haven't had any problem with
these. but are you sure that the models assume utf-8? don't they simply
take any bytestrings they get?

>
> This will probably all be fixed to use DEFAULT_CHARSET or even
> unicode instead of UTF-8, but does anybody currently successfully
> use a different DEFAULT_CHARSET with the present Django.
>
>> 2. generally playing these encode/decode games is not nice :)
>
> You don't get around it. Sigh.

(except by switching django to unicode completely :)

>
> I particularly dislike what python is doing here.
>
> You shouldn't be able to call unicode(bytestring) or str(unicode)
> without giving an encoding, same for comparing unicode and
> bytestrings. Just go through the open tickets in newforms to see
> why. If you would have to fill in an encoding, it would be obvious
> what to do. But as it is, it's much too easy to forget, and then you
> get code that only works for ASCII, and since the test data also
> only contains ASCII data (quite naturally, since using English for
> test cases is just normal), the test suite doesn't find these bugs.

+1

btw. the same problem exists in java too ("String" vs "byte[]").

gabor

Ivan Sagalaev

unread,

Feb 7, 2007, 10:38:51 AM2/7/07

to django-d...@googlegroups.com

Michael Radziej wrote:
> No, I'd propose simply to use the file system's encoding for files
> within the file system, that's all,

This will fail in some cases. The problem is that file system itself
doesn't check (nor enforce) any encoding in file names. And files may
come to server from different places encoded into whatever encoding
their source is in. I just recently was hit by this with my music
exchange service where we accept files from zip archives that can be
made on Windows or on Unixes with different locales and also can be
uploaded through FTP that works in "C" locale. Anyway you end up with
just raw *bytes* in filenames and you can't rely on their encoding.

I solved the problem by using Postgres' BYTEA field type that stores
strings as bytes and doesn't check their encoding unlike VARCHAR. Thus
you can safely save it and select it back.

If someone needs it here's quick-and-dirty ByteAField that deals with
proper escaping on input:

class ByteAField(models.CharField):
def get_db_prep_save(self, value):
return value and ''.join(['\\%03o' % ord(c) for c in value])

def get_internal_type(self):
return 'CharField'

Yasushi Masuda

unread,

Feb 11, 2007, 12:08:42 AM2/11/07

to django-d...@googlegroups.com

Makoto, Michael, Gabor, Ivan:

In the discussion, I'm getting to think that the definition/policy of
the "valid filename (for storage)" may vary for circumstances or fields
where individual developer concerns, thus it should be hard to provide
the ultimate-flawless way for nomalizing filename. Even more, altering
get_valid_filename() may break a lot of existing code -- that would
be unacceptable for most of developers who has already written apps
with FileField.

Instead, I'd like to propose making get_valid_filename() as default
behaviour of ``filename_normalization``, and adding a
``filename_nomalizer`` parameter in FileField's constructor, like this::

{{{

Index: db/models/fields/__init__.py

===================================================================
--- db/models/fields/__init__.py (revision 4447)
+++ db/models/fields/__init__.py (working copy)
@@ -580,8 +580,10 @@
return forms.EmailField(**defaults)

class FileField(Field):
- def __init__(self, verbose_name=None, name=None, upload_to='', **kwargs):
+ from django.utils.text import get_valid_filename
+ def __init__(self, verbose_name=None, name=None, upload_to='', filename_normalizer=get_valid_filename, **kwargs):
self.upload_to = upload_to
+ self.filename_normalizer = filename_normalizer
Field.__init__(self, verbose_name, name, **kwargs)

def get_manipulator_fields(self, opts, manipulator, change, name_prefix='', rel=False, follow=True):
@@ -656,8 +658,7 @@
return os.path.normpath(datetime.datetime.now().strftime(self.upload_to))

def get_filename(self, filename):
- from django.utils.text import get_valid_filename
- f = os.path.join(self.get_directory_name(), get_valid_filename(os.path.basename(filename)))
+ f = os.path.join(self.get_directory_name(), self.filename_normalizer(os.path.basename(filename)))
return os.path.normpath(f)

class FilePathField(Field):

}}}

By this change, you may specify your own filename nomalization code that
fits your paticular requirements. The change does not affect any existing
code, because we provide present get_valid_filename() as default.

I'll post the same patch (including documentation adds) to #3119.

--
Yasushi Masuda
http://ymasuda.jp/

tsuyuki makoto

unread,

Feb 11, 2007, 3:35:59 AM2/11/07

to django-d...@googlegroups.com

07/02/11 Yasushi Masuda<whos...@gmail.com> wrote:
> Instead, I'd like to propose making get_valid_filename() as default
> behaviour of ``filename_normalization``, and adding a
> ``filename_nomalizer`` parameter in FileField's constructor, like this::

I completely agree with Yasushi.

The point is to avoid patching django's core model code every time.
Proposal of Yasushi is the most realistic, I think.
It's not breaking backward compatibility and it's loose coupling from
base models.
And it doesn't need to discuss which encoding is the best for that purpose;-)

I drop my proposal and patch.

Reply all

Reply to author

Forward