Bug in jinja2.lexer

256 views
Skip to first unread message

Tomas Gavenciak

unread,
Feb 26, 2011, 8:01:15 PM2/26/11
to pocoo-libs
Dearest developers,


I found the following strange behaviour in jinja2:

jinja2.Template('\n').render() => ''
jinja2.Template('\n\n').render() => '\n'

This is because of line 552 in lexer.py:

source = '\n'.join(unicode(source).splitlines()) (*)

I believe that this is a bug, since the API docs at
http://jinja.pocoo.org/docs/templates/#whitespace-control
explicitly state that "whitespace is not further modified".


While this is just a nuisance for bigger templates for HTML, it
matters when formatting short snippets, concatenating them later.
(Also, one feels like a fool when some of the newlines just disappear ;-)

As the newlines (as defined by newline_re) are already replaced by
newline_sequence in Lexer._normalize_newlines(), I would propose to
drop the line (*) altogether. Doing so fixes the behavior (so
jinja2.Template('\n').render() => '\n').

I think that the Whitespace control section would be better with an
explanation what happens to the newlines, such as:
[Every occurrence of sequence "\n", "\r" or "\r\n" is replaced by
`Environment.newline_sequence`.]


Just as an open suggestion -- disabling this newline-replacing
behaviour would permit using binary-like templates. This could be done
i.e. by allowing newline_sequence=None and slightly modifying
_normalize_newlines() to pass in that case. As far as I know, there is
currently no way to preserve both '\r' and '\n' while all the other
characters are preserved.

I would gladly provide patches for all proposed changes.


Keep up the great work on the best templating language!

Cheers,
Tomas

Armin Ronacher

unread,
Feb 26, 2011, 8:51:15 PM2/26/11
to pocoo...@googlegroups.com
Hi,

On 2/27/11 2:01 AM, Tomas Gavenciak wrote:
> This is because of line 552 in lexer.py:
>
> source = '\n'.join(unicode(source).splitlines()) (*)
>
> I believe that this is a bug, since the API docs at
> http://jinja.pocoo.org/docs/templates/#whitespace-control
> explicitly state that "whitespace is not further modified".

You are correct in stating that the documentation does not match what is
happening in Jinja2. However at the same time that behavior is what
many people are currently expecting Jinja2 to do in testing conditions.

For most applications it does not matter either wait, it mostly affects
applications that generate HTML for output that might end up in console
sessions. If you look at code that currently uses Jinja2 you will
notice that these applications are appending an additional trailing
newline in order to force the final newline.

The reason why it was decided that way is that it does not matter either
way for most web applications and that most console applications are
using the print statement for output which appends an trailing newline.

However I would consider a flag that switches that behavior. The
following modes would make sense:

1. current behavior. Strip trailing newline
2. force final newline
3. use newline as specified in the source template.

Dropping normalization of newlines altogether however is a bad idea in
my mind as it causes a lot of troubles when using templates created by
designers on a windows system on OS X/Linux or the other way round.
That was the main motivation for the current behavior.


Regards,
Armin

Tomas Gavenciak

unread,
Feb 27, 2011, 7:02:48 AM2/27/11
to pocoo...@googlegroups.com
Hi Armin,

Thank you for the prompt reaction!

> Dropping normalization of newlines altogether however is a bad idea in my
> mind as it causes a lot of troubles when using templates created by
> designers on a windows system on OS X/Linux or the other way round. That was
> the main motivation for the current behavior.

I think you misunderstand me -- the normalization is currently done
twice, which seems unnecessary:

First, different kinds of newlines ('\n', '\r' and '\r\n' in Python)
are replaced by '\n' by
source = '\n'.join(unicode(source).splitlines())

And then again, in _normalize_newlines(), '\n' '\r' and '\r\n' (as
given by newline_re) are replaced with the set newline_sequence.

Dropping the first operation does not change the behaviour except for
preserving the possible final newline (and that is easily added, see
below). Also it probably speeds up and clarifies the parsing a little.

> The reason why it was decided that way is that it does not matter either way
> for most web applications and that most console applications are using the
> print statement for output which appends an trailing newline.
>
> However I would consider a flag that switches that behavior.  The following
> modes would make sense:
>
> 1. current behavior.  Strip trailing newline
> 2. force final newline
> 3. use newline as specified in the source template.

I guess that dealing with the last newline is not an issue that would
deserve a new flag -- if you ALWAYS strip it and state that in the
docs, that seems like a good solution to me (and is the current
behaviour). Even if you drop the splitlines() line, it can be easily
done in _normalize_newlines(). If somebody (like me ;-) wants
more/less newlines, it is easy to just append them. The current docs
just do not state the current behaviour (nor do they explain how is
newline_sequence used for) and confuse (me) with stating that
whitespace is not touched.


What I would REALLY appreciate would be an option not to touch
whitespace (or other characters) at all. Then it would be easier to
use jinja for non-HTML templates where the newlines and special
characters matter (I frequently use jinja for generating program
code). The flag could be implemented i.e. by allowing newline_sequence
to be None (or '' os some other value) and checking that in
_normalize_newlines() (the variable newline_sequence is not used
anywhere else).

I would gladly provide patches for both the doc, lexer and (possibly)
the preserve-whitespace option.

Best,
Tomas

Armin Ronacher

unread,
Feb 28, 2011, 4:37:08 AM2/28/11
to pocoo...@googlegroups.com
Hi,

On 2/27/11 1:02 PM, Tomas Gavenciak wrote:
> I think you misunderstand me -- the normalization is currently done
> twice, which seems unnecessary:
>
> First, different kinds of newlines ('\n', '\r' and '\r\n' in Python)
> are replaced by '\n' by
> source = '\n'.join(unicode(source).splitlines())

That's done because the lexer is using lookbehinds and negative
lookbehinds in regular expressions which in Python are fixed width. As
such the newlines have to be normalized to a fixed length form (and for
simplicities sake I chose to normalize to a unix newline). That said,
the function also currently drops the trailing newline and the
information if such a newline was there or not.

> And then again, in _normalize_newlines(), '\n' '\r' and '\r\n' (as
> given by newline_re) are replaced with the set newline_sequence.

That goes from \n to the target format then. For instance some people
set it to \r\n (probably the only alternative that makes sense these
days) for HTTP and windows environments.

> Dropping the first operation does not change the behaviour except for
> preserving the possible final newline (and that is easily added, see
> below). Also it probably speeds up and clarifies the parsing a little.

It does cause troubles if the source format is windows newlines as
lookbehinds break. I don't have the code in front of me right now but
that was the original reason I went with the newline normalization.

> I guess that dealing with the last newline is not an issue that would
> deserve a new flag -- if you ALWAYS strip it and state that in the
> docs, that seems like a good solution to me (and is the current
> behaviour). Even if you drop the splitlines() line, it can be easily
> done in _normalize_newlines(). If somebody (like me ;-) wants
> more/less newlines, it is easy to just append them. The current docs
> just do not state the current behaviour (nor do they explain how is
> newline_sequence used for) and confuse (me) with stating that
> whitespace is not touched.

I will update the documentation for sure and consider adding a flag that
controls the trailing one.

> What I would REALLY appreciate would be an option not to touch
> whitespace (or other characters) at all. Then it would be easier to
> use jinja for non-HTML templates where the newlines and special
> characters matter (I frequently use jinja for generating program
> code). The flag could be implemented i.e. by allowing newline_sequence
> to be None (or '' os some other value) and checking that in
> _normalize_newlines() (the variable newline_sequence is not used
> anywhere else).

Jinja2 currently only normalizes newline whitespace. The rest is kept
unchanged. On top of that it supports a wide range of unicode supported
whitespace characters as token separator inside the Jinja2 blocks.
Someone was joking last time that this makes it impossible to use Jinja2
to generate "whitespace" (the programming language) sourcecode, but so
far that was the only use case where the normalization of newlines
caused problems. If you have some more use cases I will consider
changing the lexer to operate on arbitrary newlines instead of
normalizing them upfront.


Regards,
Armin

Tomas Gavenciak

unread,
Feb 28, 2011, 2:27:02 PM2/28/11
to pocoo...@googlegroups.com
Hi Armin,

>> First, different kinds of newlines ('\n', '\r' and '\r\n' in Python)
>> are replaced by '\n' by
>> source = '\n'.join(unicode(source).splitlines())
>
> That's done because the lexer is using lookbehinds and negative lookbehinds
> in regular expressions which in Python are fixed width.  As such the
> newlines have to be normalized to a fixed length form (and for simplicities
> sake I chose to normalize to a unix newline).  That said, the function also
> currently drops the trailing newline and the information if such a newline
> was there or not.

Aha, that is the reason behind that! Now I understand that the change
would not be just as trivial as I thought ... Thanks for the
explanation!

> I will update the documentation for sure and consider adding a flag that
> controls the trailing one.

Great. I would not think the option for just that to be necessary, but
it might be useful for someone.

> Jinja2 currently only normalizes newline whitespace.  The rest is kept
> unchanged.  On top of that it supports a wide range of unicode supported
> whitespace characters as token separator inside the Jinja2 blocks. Someone
> was joking last time that this makes it impossible to use Jinja2 to generate
> "whitespace" (the programming language) sourcecode, but so far that was the
> only use case where the normalization of newlines caused problems.  If you
> have some more use cases I will consider changing the lexer to operate on
> arbitrary newlines instead of normalizing them upfront.

You are of course right -- the newline replacement hardly ever matters.
My two reasons for digging into this were the idea of generating
binary-like data (for which Jinja is not exactly suited anyway) and
that I like my favourite tools to work in a well-defined and
straightforward way ;-)

Thanks for being patient with me.

Best regards and good luck,
Tomas

W. Trevor King

unread,
Jan 11, 2013, 8:51:15 AM1/11/13
to pocoo...@googlegroups.com
It's been a while since this discussion, but…


On Monday, February 28, 2011 4:37:08 AM UTC-5, Armin Ronacher wrote:

On 2/27/11 1:02 PM, Tomas Gavenciak wrote:
> I guess that dealing with the last newline is not an issue that would
> deserve a new flag -- if you ALWAYS strip it and state that in the
> docs, that seems like a good solution to me (and is the current
> behaviour). Even if you drop the splitlines() line, it can be easily
> done in _normalize_newlines(). If somebody (like me ;-) wants
> more/less newlines, it is easy to just append them. The current docs
> just do not state the current behaviour (nor do they explain how is
> newline_sequence used for) and confuse (me) with stating that
> whitespace is not touched.
I will update the documentation for sure and consider adding a flag that
controls the trailing one.

I just submitted a pull request adding this flag [1].  It took a while for the doc updates to land upstream [2], so I thought I'd bump things here as well.  There's quite a PR/issue backlog on GitHub.  Is development still ongoing?

[1]: https://github.com/mitsuhiko/jinja2/pull/170
[2]: https://github.com/mitsuhiko/jinja2/pull/137

Reply all
Reply to author
Forward
0 new messages