Django - Best option to check file types, even fake or without extensions


Paul

Sep 22, 2017, 6:23:32 AM
to Django users

I'm trying to validate the MIME types of uploaded files against a predefined list of valid MIME types.


I need to check the file in the buffer before saving, even if the extension is faked or missing.



1. Python's own mimetypes package seems to "guess" based only on the extension.


2. python-magic looks OK, but it has OS dependencies because it uses the UNIX libmagic library.

 I had a lot of trouble with it on 64-bit Windows, and even after I fixed the dependency errors, other issues appeared and it couldn't identify files.

 This is also an issue because it is hard to install OS-related files on preconfigured hosting.


3. I found the filetype package, but it only checks "magic numbers" for a limited set of file types, and it identifies docx and similar files as zip files (which is what they are as an archive technology),

 but I need to identify them as what they really are.
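
To make points 1 and 3 concrete, here is a small sketch using only the standard library. It shows that mimetypes never looks at file content, and that a docx-style package begins with the generic ZIP signature. The helper make_fake_docx is a hypothetical stand-in that builds a tiny archive in memory, not a real Word document.

```python
import io
import mimetypes
import zipfile

# mimetypes works from the name alone; no file content is read.
print(mimetypes.guess_type("report.pdf"))  # guess derived from ".pdf" only
print(mimetypes.guess_type("report"))      # no extension, nothing to guess from

ZIP_MAGIC = b"PK\x03\x04"  # signature of a ZIP local file header


def make_fake_docx() -> bytes:
    """Build a tiny in-memory ZIP laid out like a docx (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


# A magic-number check can only report "zip" for this payload.
print(make_fake_docx().startswith(ZIP_MAGIC))
```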


What other OS-independent solutions exist that can check a file whose extension is faked or missing? (pdf, doc, docx, csv, xls, xlsx, ods, odt, odm)










Melvyn Sopacua

Sep 22, 2017, 7:37:59 AM
to django...@googlegroups.com
Sorry to bust your bubble, but docx files really are zip files, with a
predetermined set of files in them. Microsoft even tried to patent the
idea, which I believe was originally coined by Sun's StarOffice. Most
office suites have since adopted the practice, so in order to inspect
them you'll first have to extract at least one file from the archive
and determine the type and version of the document from it.
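
A minimal sketch of that inspection, using only the standard library. It assumes the usual package layouts: OOXML files carry a [Content_Types].xml entry plus a type-specific main part, and OpenDocument files carry a "mimetype" entry that names the format. The function name sniff_zip_office and the short labels it returns are made up for this example, and the marker tables are deliberately incomplete.

```python
import io
import zipfile

# Assumed marker entries for OOXML packages (not an exhaustive list).
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
}

# OpenDocument packages declare their type in a "mimetype" entry.
ODF_MIMETYPES = {
    "application/vnd.oasis.opendocument.text": "odt",
    "application/vnd.oasis.opendocument.spreadsheet": "ods",
    "application/vnd.oasis.opendocument.text-master": "odm",
}


def sniff_zip_office(data: bytes):
    """Return a short type label, or None if the data is not a recognised package."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
            if "mimetype" in names:
                declared = zf.read("mimetype").decode("ascii", "replace").strip()
                return ODF_MIMETYPES.get(declared)
            if "[Content_Types].xml" in names:
                for marker, label in OOXML_MARKERS.items():
                    if marker in names:
                        return label
    except zipfile.BadZipFile:
        # Not a ZIP archive at all.
        return None
    return None
```

This only confirms that the archive is laid out like the claimed format. As noted in the rest of this reply, it says nothing about whether the content is safe.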

The question is what your motivations are. From a security point of
view this is wasted CPU cycles. A valid office document can still
have perfectly valid malicious code in it. If you want to protect your
users, simply feed it to a malware scanner and be done with it.

For all other cases, this is Python. Use duck typing: if it looks like
an image, open it with PIL. Fail? Ditch.



--
Melvyn Sopacua

James Schneider

Sep 22, 2017, 8:42:33 AM
to django...@googlegroups.com


On Sep 21, 2017 11:23 PM, "Paul" <sevenr...@gmail.com> wrote:

> I'm trying to validate the MIME types of uploaded files against a predefined list of valid MIME types.
>
> I need to check the file in the buffer before saving, even if the extension is faked or missing.

You're better off specifying what you do want rather than trying to filter out what you don't. 

> What other OS-independent solutions exist that can check a file whose extension is faked or missing? (pdf, doc, docx, csv, xls, xlsx, ods, odt, odm)

Devise minimal tests for each type of file that you may expect. For example, a CSV shouldn't contain raw binary data, and should be readable by the csv Python lib. A PDF file should be readable by a PDF lib, etc.
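
As a rough illustration of such minimal tests, here is a sketch with the standard library only. The function names, the UTF-8 assumption, the row limit, and the "at least two columns of consistent width" rule are all choices made for this example. The PDF check only looks at the header and trailer markers, so it is a cheap pre-filter and not a substitute for opening the file with a real PDF library.

```python
import csv
import io


def looks_like_csv(data: bytes, max_rows: int = 50) -> bool:
    """Cheap CSV test: text only, parseable by csv, consistent column count."""
    if b"\x00" in data:
        return False  # raw binary data has no place in a CSV
    try:
        text = data.decode("utf-8")  # assumption: uploads are UTF-8
    except UnicodeDecodeError:
        return False
    rows = []
    try:
        for i, row in enumerate(csv.reader(io.StringIO(text, newline=""))):
            if i >= max_rows:
                break
            rows.append(row)
    except csv.Error:
        return False
    widths = {len(r) for r in rows if r}
    # Require one consistent width of at least two columns.
    return len(widths) == 1 and widths.pop() > 1


def looks_like_pdf(data: bytes) -> bool:
    """Cheap PDF pre-check: header marker at the start, EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```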

Use simple logic to filter out likely bad files. For example, it would be rare for an MS Excel file to be missing its extension, so any tests you devise to check for Excel should be skipped if the file has no extension (and therefore, the file can never be flagged as type MS Excel). Using other libs like mimetypes can also quickly narrow down the tests you'd need to run. All of the pertinent tests should pass with high confidence before associating a MIME type. File size caps also can be useful.
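
One way this filtering could be wired together, as a sketch only. The 5 MB cap, the CHECKS table, and the name validate_upload are assumptions for illustration, and the per-type checks are bare signature tests that you would replace with stronger ones.

```python
import os

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed size cap, tune to your needs


def _is_pdf(data: bytes) -> bool:
    return data.startswith(b"%PDF-")


def _is_zip(data: bytes) -> bool:
    return data.startswith(b"PK\x03\x04")


def _is_ole2(data: bytes) -> bool:
    # Signature of the OLE2 container used by legacy .xls and .doc files.
    return data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1")


# Extension -> (MIME type to assign, test that must pass). Whitelist only.
CHECKS = {
    ".pdf": ("application/pdf", _is_pdf),
    ".xls": ("application/vnd.ms-excel", _is_ole2),
    ".xlsx": (
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        _is_zip,
    ),
}


def validate_upload(filename: str, data: bytes):
    """Return the MIME type if every pertinent test passes, else None."""
    if len(data) > MAX_UPLOAD_BYTES:
        return None
    ext = os.path.splitext(filename)[1].lower()
    entry = CHECKS.get(ext)
    if entry is None:
        return None  # no extension or unknown extension: skip all tests
    mime, check = entry
    try:
        return mime if check(data) else None
    except Exception:
        # Any parser failure means the file is rejected.
        return None
```

With this shape, a file with no extension can never be flagged as Excel, because no Excel test is ever run for it.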

Duck typing is your friend here, and heavy exception handling will be needed.

To reiterate what Melvyn mentioned, you should probably only do this if the file type validation is absolutely necessary. If files are shared among users, virus scanning and interception may be advised. 


-James