[Django] #12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF

0 views
Skip to first unread message

Django

unread,
Jan 11, 2010, 4:35:14 AM1/11/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
----------------------------+-----------------------------------------------
Reporter: jfenwick | Owner: nobody
Status: new | Milestone:
Component: Core framework | Version: 1.1
Keywords: | Stage: Unreviewed
Has_patch: 0 |
----------------------------+-----------------------------------------------
According to RFC 2616, Section 3.7.1:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
"HTTP applications MUST accept CRLF, bare CR, and bare LF as being
representative of a line break in text media received via HTTP."

The Parser object in multipartparser can only parse canonical CRLF because
of this line:
http://code.djangoproject.com/browser/django/trunk/django/http/multipartparser.py#L553

As a result, any data coming through the WSGI gateway that does not
conform to canonical CRLF but is still considered valid by RFC 2616 be
processed incorrectly.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578>
Django <http://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Jan 11, 2010, 7:44:24 AM1/11/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: new | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Changes (by leosoto):

* keywords: => jython
* needs_better_patch: => 0
* component: Core framework => HTTP handling
* needs_tests: => 0
* needs_docs: => 0

Comment:

Just want to point out that this problem is actually causing serious
problems to people deploying Django apps on Jython on Windows platforms.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:1>

Django

unread,
Jan 11, 2010, 7:45:07 AM1/11/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: new | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Changes (by leosoto):

* cc: leosoto (added)

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:2>

Django

unread,
Jan 11, 2010, 9:21:18 AM1/11/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: new | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Comment (by kmtracey):

More details would be helpful here. From a brief look at the code
referenced, I'm not at all sure that changing the one line identified
would fix any problem. (Perhaps that wasn't intended to be implied by
identifying one line, but that is how I read it). Further in that routine
(line 579, for example), CRLF is used for splitting lines. So I suspect
more than a single-line change would be needed to fix this.

I'm also not entirely sure the code in Django here is wrong, per the RFC.
The name of this routine is `parse_boundary_stream`, implying to me it is
responsible only for splitting a multipart message based on boundary
markers. The very next section of the RFC cited
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.2) says, in
reference to multipart types: "The message body is itself a protocol
element and MUST therefore use only CRLF to represent line breaks between
body-parts." Thus it sounds perfectly legit to me for Django to be
requiring CRLF in this specific case.

So, I'd really like to get a better understanding of the exact form of the
data that is causing problems here.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:3>

Django

unread,
Jan 11, 2010, 10:12:11 AM1/11/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: closed | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: invalid | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Changes (by leosoto):

* status: new => closed
* resolution: => invalid

Comment:

You are right Karen, Django is doing the right thing and something else is
to blame. Marking the ticket as invalid.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:4>

Django

unread,
Jan 11, 2010, 10:56:15 AM1/11/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: closed | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: invalid | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Comment (by kmtracey):

Note even if Django is correct per the RFC, if there are non-compliant
clients in the wild that Django can't deal with because it is strictly
enforcing the RFC then we might want to look at relaxing the Django
implementation here. Particularly if there are other server side
implementations that these clients can work properly with. Adhering to
standards is great, but if it gets in the way of interoperating with real
clients that don't have issues with other servers then I don't know we'd
want to stick with a stance of "we're correct per the RFC". Hence the
request for more information. It'd be good to know what these clients are
(Barney's home-grown client used by exactly 8 machines on the internet or
something a little more popular?) and some information about servers that
they currently don't have trouble with.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:5>

Django

unread,
Jan 12, 2010, 10:29:33 PM1/12/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: closed | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: invalid | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Comment (by leosoto):

The problem occurs on deployments of Django 1.1 on Jython 2.5.1 with
Tomcat as the application server on Windows systems.

The full thread on the observed problem is on
<http://groups.google.com/group/django-jython-
dev/browse_thread/thread/c1d5d718f8d322b1>. It may be a Jython bug, though
but we haven't be able to confirm it yet.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:6>

Django

unread,
Jan 12, 2010, 11:43:04 PM1/12/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: closed | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: invalid | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Comment (by kmtracey):

Ah, so it's not some client that is sending LF-only in the POST data,
since I gather the clients in use here are standard browsers, which will
be sending CRLFs. Interesting it only happens on Windows, where CRLF is
the "normal" line ending. It sounds like some code that is passing along
the post data is translating CRLFs to just LFs. Besides causing this
problem that behavior has the potential for corrupting binary files
uploaded to a Django server. All the code touching this data before
feeding it to the Django code needs to be treating it as binary, not text,
and not doing any type of line normalization.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:7>

Django

unread,
Jan 28, 2010, 5:00:00 PM1/28/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: closed | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: invalid | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Comment (by jfenwick):

I'm not sure what the real problem is, so rather than open a new ticket I
will ask some questions here in the hope that someone can answer them.

The issue is that in on Windows, in Django-Jython on Tomcat, multipart
data is not parsed correctly by the LazyStream in multipartparser.py.

As far as I can tell, this happens because at
http://code.djangoproject.com/browser/django/trunk/django/http/multipartparser.py#L553
because the character CRLFCRLF is not found.

I did some experiments. I POSTed some data using the same app on four
different Django platforms.
Here are the platforms I tested on, and the associated data that was
output as hex:

Django on Python on runserver on Windows - multipartpythonwindows.hex
Django on Jython on runserver on Windows -
multipartdjangojythonwindows.hex
Django on Jython on Tomcat on Windows - multiparttomcatjythonwindows.hex
Django on Python on runserver on OS X - multipartpythonosx.hex (note: I
used a different Django app, but I believe the result would have been the
same)

The data was dumped using the code in multipartparser.diff

Note: I ran the files I generated through hexdump -C file to generate the
hex files from the data files I created.

According to
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.2 the message
body of the multipart should use CRLF as a line break between body-parts.
If you look at multipartpythonwindows.hex you will see that what actually
happens is all CRs are replaced with CRCR. This means that when the cited
line in multipartparser.py is looking for CRLFCRLF, it instead finds
CRCRLFCRCRLF. I would have thought this would fail, but it does not! It
works correctly. This is the normal operating procedure of Django on
Python, as far as I can tell.

In multipartdjangojythonwindows.hex you will see that the pattern of CRs
being replaced by CRCR still occurs. This means that Django on Jython
running on runserver works the same as Django on Python running on
runserver, and as a result, still works.

Now we go to multiparttomcatjythonwindows.hex. In this file, the CRLF
comes as you would expect it in RFC 2616. In this case, chunk.find fails,
which is the root of the problem.

Finally, compare the CR characters of multipartpythonosx.hex. You will see
it does not duplicate CR the same way multiparttomcatjythonwindows does.
And yet it works.

These are my questions:

1. Where is that extra CR coming from and why is it required? Why does
this not result in failure?
2. Could there be something wrong with my method of data collection as
specified in multipartparser.diff?
3. kmtracy previously said "All the code touching this data before feeding
it to the Django code needs to be treating it as binary, not text, and not
doing any type of line normalization." Is there a way I can check whether
the data is binary or text in Python to verify this is the case?

I'm sorry if this is not the correct avenue to be asking these questions.
If there is a better one, please point me in that direction.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:8>

Django

unread,
Jan 28, 2010, 6:41:12 PM1/28/10
to djang...@holovaty.com, django-...@googlegroups.com
#12578: multipartparser.Parser does not accept non-canonical bare CR and bare LF
------------------------------------+---------------------------------------
Reporter: jfenwick | Owner: nobody
Status: closed | Milestone:
Component: HTTP handling | Version: 1.1
Resolution: invalid | Keywords: jython
Stage: Unreviewed | Has_patch: 0
Needs_docs: 0 | Needs_tests: 0
Needs_better_patch: 0 |
------------------------------------+---------------------------------------
Comment (by kmtracey):

To answer your first question, the extra CRs in the debug data are coming
from the way it was collected:

This code:

{{{
#!python
f = open('c:/chunk.txt', 'a')
f.write(chunk)
f.close()
}}}

on Windows, will transform any LFs in `chunk` to CRLF. So where chunk
originally had CRLF, what gets written to the file is CRCRLF. What was
actually fed to the Django parsing code (`chunk`) has the correct CRLF so
you don't see a POST/parse failure.

The problem with the debug data collection code is it does not open the
file in binary mode. See:
http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-
files. To record exactly what is in chunk, the file needs to be opened in
binary mode.

You can see this behavior in action in a Python shell:

{{{
D:\tmp>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> txtfile = file.open('xyz.txt','a')
>>> txtfile = open('xyz.txt','a')
>>> txtfile.write('This is a line terminated by CRLF already\r\n')
>>> txtfile.close()
>>> binfile = open('xyz.txt','rb')
>>> data = binfile.read()
>>> data
'This is a line terminated by CRLF already\r\r\n'
>>>
}}}

Note I believe it is the reverse of this problem that is causing the
failure with Jython/Windows. Some code somewhere is reading the data as a
text file, instead of a binary file, and the existing CRLFs are being
turned into just plain LFs:

{{{
>>> binfile = open('abc.txt','ab')
>>> binfile.write('This is a line terminated by CRLF already\r\n')
>>> binfile.close()
>>> txtfile_wrong = open('abc.txt','r')
>>> txtfile_wrong.read()
'This is a line terminated by CRLF already\n'
>>>
}}}

I don't know where this is happening because I've got only the haziest
idea of all the pieces involved in running Django under Jython/tomcat. It
doesn't necessarily have to be happening in Python code either -- if I
remember right Java file I/O can do similar things. But my Java memories
are pretty dim since I haven't had to use it in a while.

For the case that looks like it should work but is failing -- bare LFs in
the `chunk` data have been turned into CRLFs in the debug data due to
writing the file as text. If you write the file as binary, then I expect
you'll see the `chunk` data contains just LFs, which is why the parsing is
failing.

--
Ticket URL: <http://code.djangoproject.com/ticket/12578#comment:9>
Reply all
Reply to author
Forward
0 new messages