Issues with encodings

QuickeneR

Oct 31, 2012, 9:58:13 AM
to blogofil...@googlegroups.com
Hi,
I am new to static site compilers, and am currently trying to start with blogofile. I am using 0.8b1 on Windows with Russian locale, and starting with 'raw' blogofile, without the blog plugin.
While working on a simple site, I encountered a number of unicode errors. With python 2.7 the error messages were rather meaningless (e.g. mako.exceptions.CompileException: Unicode decode operation of encoding 'ascii' failed at line: 0 char: 0) - Russian characters certainly cannot be decoded from (low) ASCII. So I switched to python 3.3.
Here the errors were better - they specified the file that could not be decoded, and the encoding it was supposed to be in. However, I noticed a strange thing - the assumed encodings for index.html.mako and _templates/site.mako were different. If I save index.html.mako in utf-8, I get
------------
...
File "e:\python33\lib\encodings\cp1251.py", line 23, in decode  
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 7570: character maps to <undefined>
-----------
Clearly, blogofile assumes that this file should be in cp1251.
If, on the other hand, I save _templates/site.mako in cp1251, I get
-----------
...
File "e:\python33\lib\site-packages\mako-0.7.2-py3.3.egg\mako\lexer.py", line 206, in decode_raw_stream  
0, 0, filename)
CompileException: Unicode decode operation of encoding 'utf-8' failed in file '_templates/site.mako' at line: 0 char: 
-----------
So this file is assumed to be in utf-8.
A quick glance at Mako sources showed that Mako is indeed assuming utf-8 unless told otherwise, while blogofile does not seem to do that and relies instead on open() which assumes the encoding given by locale.getpreferredencoding(). This leads to different (required) encodings for different files, which is undesirable, to put it mildly.
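
The mismatch described above can be reproduced without blogofile or Mako. The following is a minimal sketch, assuming Python 3; the helper name try_decode and the sample word are made up for illustration. It decodes the same UTF-8 bytes once as utf-8 and once as cp1251, the two encodings named in the error messages.

```python
import locale

def try_decode(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# "Имя" saved as UTF-8; the letter "И" becomes the bytes 0xD0 0x98.
utf8_bytes = "Имя".encode("utf-8")

as_utf8 = try_decode(utf8_bytes, "utf-8")      # the encoding Mako assumes by default
as_cp1251 = try_decode(utf8_bytes, "cp1251")   # the encoding a cp1251 locale implies

# The encoding a bare open(path) falls back to on this machine.
print("locale default:", locale.getpreferredencoding(False))
print("valid as utf-8:", as_utf8 is not None)
print("valid as cp1251:", as_cp1251 is not None)
```

Byte 0x98 has no character assigned in Python's cp1251 codec, which is why a UTF-8 file containing a capital "И" trips the charmap error quoted above, while the same file is fine for a reader that assumes utf-8.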

What would be the best way to fix this?
1) the 'feature-complete' way - one implements some method for specifying the encoding of each and every file
2) assume the locale-based encoding
3) assume utf-8

1) it seems, will need much more work for either guessing the encoding or having settings in _config.py
2) is probably fine, unless we remember that sites and blogs are inherently global. No one really cares what locale the site's author prefers, and it really sucks when you can't pass a site to a friend or colleague because he has a different locale and the site won't build due to UnicodeDecodeError
3) is a clever thing if you cannot simply pass the bytes along and HAVE to assume an encoding. Of course, it might not be the same encoding as used by text documents native to the local platform (on Russian Windows we have 866, 1251, and two-byte Unicode in addition to utf-8) but really, it is 2012. Shouldn't we be done with guessing at character encodings?

So, here is a simple fix I propose: change all instances of open(...) to open(..., encoding='utf-8'). It is a quick hack (and it probably won't work on python 2) but it solves the problem for me.
What do you think?
template.patch
config.patch
main.patch
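
For reference, the proposed change can be sketched as a small helper. This is not the content of the attached patches; read_text and the temporary demo file are hypothetical. It uses io.open, which accepts an encoding argument on Python 2.6+ as well as on Python 3, so it is one possible way around the Python 2 concern mentioned above.

```python
import io
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default.

    io.open accepts encoding= on Python 2.6+ and on Python 3, where it is the
    same function as the built-in open().
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()

# Round-trip a small UTF-8 file to show the helper in use.
fd, demo_path = tempfile.mkstemp(suffix=".mako")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"Имя: ${name}\n")
demo_text = read_text(demo_path)
os.remove(demo_path)
```

Whether a fixed utf-8 default is the right policy for blogofile is exactly what the rest of the thread debates; the sketch only shows that the explicit form behaves the same regardless of the machine's locale.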

Michael Bayer

Oct 31, 2012, 12:03:05 PM
to blogofil...@googlegroups.com


On Oct 31, 2012, at 9:58 AM, QuickeneR wrote:

> Hi,
> I am new to static site compilers, and am currently trying to start with blogofile. I am using 0.8b1 on Windows with Russian locale, and starting with 'raw' blogofile, without the blog plugin.
> While working on a simple site, I encountered a number of unicode errors. With python 2.7 the error messages were rather meaningless (e.g. mako.exceptions.CompileException: Unicode decode operation of encoding 'ascii' failed at line: 0 char: 0) - Russian characters certainly cannot be decoded from (low) ASCII. So I switched to python 3.3.
> Here the errors were better - they specified the file that could not be decoded, and the encoding it was supposed to be in. However, I noticed a strange thing - the assumed encodings for index.html.mako and _templates/site.mako were different. If I save index.html.mako in utf-8, I get
> ------------
> ...
> File "e:\python33\lib\encodings\cp1251.py", line 23, in decode
> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 7570: character maps to <undefined>
> -----------
> Clearly, blogofile assumes that this file should be in cp1251
> If, on the other hand, I save _templates/site.mako in cp1251, I get
> -----------
> ...
> File "e:\python33\lib\site-packages\mako-0.7.2-py3.3.egg\mako\lexer.py", line 206, in decode_raw_stream
> 0, 0, filename)
> CompileException: Unicode decode operation of encoding 'utf-8' failed in file '_templates/site.mako' at line: 0 char:
> -----------
> So this file is assumed to be in utf-8.
> A quick glance at Mako sources showed that Mako is indeed assuming utf-8 unless told otherwise, while blogofile does not seem to do that and relies instead on open() which assumes the encoding given by locale.getpreferredencoding(). This leads to different (required) encodings for different files, which is undesirable, to put it mildly.

The documented system of establishing encoding for your templates is by using the "# coding:" header, or across the board by setting it as a TemplateLookup attribute:

http://docs.makotemplates.org/en/latest/unicode.html#specifying-the-encoding-of-a-template-file

Assuming you don't want to have to enter the "# coding" prefix in each template, you'd want to use input_encoding. In Blogofile, I'm not 100% sure of the best way to set TemplateLookup's input encoding - there seem to be some hooks where we can set up a TemplateLookup of our own, or access the singleton MakoTemplate.template_lookup.input_encoding, but I'm not sure where that can best be set up reliably - blogofile.template.MakoTemplate seems to have a funny way of setting up its lookup.

> 1) the 'feature-complete' way - one implements some method for specifying the encoding of each and every file

this is implemented, via the "# coding" header

> 2) assume the locale-based encoding

I can't find the blog posts on this at the moment but this is considered by many to be a bad idea

> 3) assume utf-8

the default.

> 1) it seems, will need much more work for either guessing the encoding or having settings in _config.py

guessing is out. _config.py setting should be very easy, at least as far as Mako is concerned.

> 2) is probably fine, unless we remember that sites and blogs are inherently global. No one really cares what locale the site's author prefers, and it really sucks when you can't pass a site to a friend or colleague because he has a different locale and the site won't build due to UnicodeDecodeError

yup

> 3) is a clever thing if you cannot simply pass the bytes along and HAVE to assume an encoding.

if no encoding has been specified any other way, then we have to have a default, that's correct

> Of course, it might not be the same encoding as used by text documents native to the local platform (on Russian Windows we have 866, 1251, and two-byte Unicode in addition to utf-8) but really, it is 2012. Shouldn't we be done with guessing at character encodings?

I don't see us guessing anywhere here....

>
> So, here is a simple fix I propose: change all instances of open(...) to open(..., encoding='utf-8').

yikes, that's so out of left field. All the tools in use here support configurable input encodings as well as per-file input encodings using standard techniques, just see pep 263: http://www.python.org/dev/peps/pep-0263/

> It is a quick hack (and it probably won't work on python 2) but it solves the problem for me.
> What do you think?

a broken hack like that is clearly not an option.
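
The per-file declaration from PEP 263 mentioned above can be observed with the standard library alone. The sketch below assumes Python 3 and uses tokenize.detect_encoding, which reads the coding comment of Python source; the helper name declared_encoding is made up. Mako parses its own "##" magic comment internally, so this illustrates the convention, not Mako's code.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding that a PEP 263 comment declares for Python source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# The same two-line source, once with a coding comment and once without.
with_header = "# -*- coding: cp1251 -*-\nname = 'Имя'\n".encode("cp1251")
without_header = b"name = 'x'\n"

print(declared_encoding(with_header))
print(declared_encoding(without_header))
```

With no coding comment the function reports the Python 3 source default, utf-8, which matches the "we have to have a default" point made in the reply.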


QuickeneR

Nov 1, 2012, 7:29:18 AM
to blogofil...@googlegroups.com
On Wednesday, October 31, 2012, 22:02:48 UTC+6, Michael Bayer wrote:

>> A quick glance at Mako sources showed that Mako is indeed assuming utf-8 unless told otherwise, while blogofile does not seem to do that and relies instead on open() which assumes the encoding given by locale.getpreferredencoding(). This leads to different (required) encodings for different files, which is undesirable, to put it mildly.

> The documented system of establishing encoding for your templates is by using the "# coding:" header, or across the board by setting it as a TemplateLookup attribute:
>
> http://docs.makotemplates.org/en/latest/unicode.html#specifying-the-encoding-of-a-template-file
>
> Assuming you don't want to have to enter the "# coding" prefix in each template, you'd want to use input_encoding. In Blogofile, I'm not 100% sure of the best way to set TemplateLookup's input encoding - there seem to be some hooks where we can set up a TemplateLookup of our own, or access the singleton MakoTemplate.template_lookup.input_encoding, but I'm not sure where that can best be set up reliably - blogofile.template.MakoTemplate seems to have a funny way of setting up its lookup.

That would be sort of ok, if it worked. But it does not. Adding ## -*- coding: utf-8 -*- to the top of index.html.mako does not affect the error. Also, you are quoting the Mako docs, but, as I wrote earlier, blogofile and Mako disagree about which encoding to choose. Mako works ok (assumes utf-8) even without this line, and blogofile does not seem to care about it. I did not post the complete trace in my first mail, so here it is:
----------------
ERROR:blogofile:Fatal build error occured, calling bf.config.build_exception()
Traceback (most recent call last):
File "E:\Python33\Scripts\blogofile-script.py", line 9, in <module>
load_entry_point('blogofile==0.8b1', 'console_scripts', 'blogofile')()
File "e:\python33\lib\site-packages\blogofile-0.8b1-py3.3.egg\blogofile\main.py", line 58, in main
args.func(args)
File "e:\python33\lib\site-packages\blogofile-0.8b1-py3.3.egg\blogofile\main.py", line 388, in do_build
writer.write_site()
File "e:\python33\lib\site-packages\blogofile-0.8b1-py3.3.egg\blogofile\writer.py", line 50, in write_site
self.__write_files()
File "e:\python33\lib\site-packages\blogofile-0.8b1-py3.3.egg\blogofile\writer.py", line 127, in __write_files
util.path_join(root, html_path))
File "e:\python33\lib\site-packages\blogofile-0.8b1-py3.3.egg\blogofile\template.py", line 386, in materialize_template
template = template_engine(template_name, caller=caller, lookup=lookup)
File "e:\python33\lib\site-packages\blogofile-0.8b1-py3.3.egg\blogofile\template.py", line 114, in __init__
t_file.read(),
File "e:\python33\lib\encodings\cp1251.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 398: character maps to <undefined>  
----------------------
No mention of Mako in the trace. Do you think blogofile's file processing routines are built to respect PEP 0263? From the text at http://www.python.org/dev/peps/pep-0263/ , it looks like it is meant for python source files, not arbitrary text files. However, files from _templates/ that index.html.mako pulls in via <%inherit file="_templates/site.mako" /> do get affected by specifying an incorrect encoding.

>> Of course, it might not be the same encoding as used by text documents native to the local platform (on Russian Windows we have 866, 1251, and two-byte Unicode in addition to utf-8) but really, it is 2012. Shouldn't we be done with guessing at character encodings?

> I don't see us guessing anywhere here....

All of these are valid encodings for the said systems. It is just a matter of popularity which encodings are used more and which less. Choosing one simply because it is returned by locale.getpreferredencoding() is no better than guessing.
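
The point that several encodings are equally "valid" for the same bytes can be made concrete. This is a minimal sketch, assuming Python 3; decodes_without_error and the sample word are hypothetical. It takes text saved as cp1251 and lists which of three candidate encodings accept those bytes without raising an error.

```python
def decodes_without_error(data, encodings):
    """Return the encodings from `encodings` that accept `data` without error."""
    accepted = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted

text = "Имя"
raw = text.encode("cp1251")          # b'\xc8\xec\xff'

plausible = decodes_without_error(raw, ["utf-8", "cp1251", "cp866"])
misread = raw.decode("cp866")        # decodes fine, but to different characters
```

Here both cp1251 and cp866 decode the bytes without complaint, yet only one of them returns the original text, so a successful decode alone does not show that the chosen encoding was the right one.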

Michael Bayer

Nov 1, 2012, 10:01:08 AM
to blogofil...@googlegroups.com
Well, the bug here in Blogofile is that they aren't using Mako's routines to open a template file. We open it with mode='rb' so there's no need for it to attempt a decode. Right where they say Template(t_file.read()), they should be saying Template(filename=t_file) so that Mako handles the details.


> From the text here http://www.python.org/dev/peps/pep-0263/ , it looks like it is meant for python source files, not arbitrary text files.

A Mako template file is a form of Python source file. You can't just send any random text to Mako.

Just to confirm, the Mako engine standalone works as expected, correct?
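
The idea of opening the file in binary mode and leaving the decode to the component that knows the declared encoding can be sketched with the standard library. This is an illustration only, assuming Python 3: read_template, CODING_RE and the temporary files are hypothetical, and the regular expression is a simplified stand-in, not Mako's actual parsing code.

```python
import os
import re
import tempfile

# Hypothetical pattern for a first-line "## -*- coding: NAME -*-" comment.
CODING_RE = re.compile(br"^\s*#+.*?coding[:=]\s*([-\w.]+)")

def read_template(path, default="utf-8"):
    """Read raw bytes, then decode with the declared (or default) encoding."""
    with open(path, "rb") as f:          # binary mode: no locale-based decode
        raw = f.read()
    first_line = raw.split(b"\n", 1)[0]
    match = CODING_RE.match(first_line)
    encoding = match.group(1).decode("ascii") if match else default
    return raw.decode(encoding)

def _write_temp(data):
    """Write bytes to a temporary .mako file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".mako")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

declared = _write_temp("## -*- coding: cp1251 -*-\nИмя\n".encode("cp1251"))
undeclared = _write_temp("Имя\n".encode("utf-8"))
declared_text = read_template(declared)
undeclared_text = read_template(undeclared)
os.remove(declared)
os.remove(undeclared)
```

Because the file is never opened in text mode, locale.getpreferredencoding() plays no part in the result; the only inputs are the bytes on disk, the optional coding comment, and the stated default.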


QuickeneR

Nov 2, 2012, 3:41:01 AM
to blogofil...@googlegroups.com
On Thursday, November 1, 2012, 20:00:52 UTC+6, Michael Bayer wrote:

> Well the bug here in Blogofile is that they aren't using Mako's routines to open a template file. We open it with mode='rb' so there's no need for it to attempt a decode. right where they say Template(t_file.read()), they should be saying Template(filename=t_file) so that Mako handles the details.

I applied this advice (patch attached), and it does seem to solve the problem, provided you add the PEP 263 comment to all the top-level templates (templates pulled in via <%inherit> do not need the comment on python 3.3).
But what about the other instances of open(..., 'r')? On closer inspection, config.py and main.py do not need patching, but in template.py it is called from:
class JinjaTemplateLoader(jinja2.FileSystemLoader): def get_source(self, environment, template):
class JinjaTemplate(Template): def render(self, path=None):
class FilterTemplate(Template): def render(self, path=None): (two times)
def get_base_template_src():
Aren't all of these going to fail if exposed to non-Latin characters? Jinja might be able to accept a filename and handle the decoding on its own, but the latter two need to be fixed in blogofile.

Another question. Is python 2.7 still a supported platform? Because this patch does not seem to fix the problem there.
------------------
ERROR:blogofile.template:Error rendering template: .\contacts.html.mako

Traceback (most recent call last):
File "e:\python27\lib\site-packages\blogofile-0.8b1-py2.7.egg\blogofile\template.py", line 156, in render
rendered = self.mako_template.render(**self)
File "e:\python27\lib\site-packages\mako-0.7.2-py2.7.egg\mako\template.py", line 412, in render
return runtime._render(self, self.callable_, args, data)
File "e:\python27\lib\site-packages\mako-0.7.2-py2.7.egg\mako\runtime.py", line 766, in _render
**_kwargs_for_callable(callable_, data))
File "e:\python27\lib\site-packages\mako-0.7.2-py2.7.egg\mako\runtime.py", line 798, in _render_context
_exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
File "e:\python27\lib\site-packages\mako-0.7.2-py2.7.egg\mako\runtime.py", line 824, in _exec_template
callable_(context, *args, **kwargs)
File "_templates/site.mako", line 3, in render_body
<%include file="page-header.mako" /> 13:31
File "e:\python27\lib\site-packages\mako-0.7.2-py2.7.egg\mako\runtime.py", line 693, in _include_file
callable_(ctx, **_kwargs_for_include(callable_, context._data, **kwargs))
File "_templates/page-header.mako", line 36, in render_body
<li><a href="${address}">${display_name}</a></li>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
------------------
Both contacts.html.mako and _templates/page-header.mako are in utf-8 and have the appropriate pep comment.

> A Mako template file is a form of Python source file. You can't just send any random text to Mako.
> Just to confirm, the Mako engine standalone works as expected, correct?

Yes, the following snippet displays correctly, provided that the encoding in the PEP 263 comment matches the actual encoding of the file. Tested with utf-8, cp1251 and cp866:
------------------
## -*- coding: utf-8 -*-
from mako.template import Template
print(Template("В лесу родилась елочка ${data}!").render(data="world"))
----------------
template-new.patch

Michael Bayer

Nov 2, 2012, 10:40:44 AM
to blogofil...@googlegroups.com

On Nov 2, 2012, at 3:41 AM, QuickeneR wrote:

> callable_(context, *args, **kwargs)
> File "_templates/site.mako", line 3, in render_body
> <%include file="page-header.mako" /> 13:31
> File "e:\python27\lib\site-packages\mako-0.7.2-py2.7.egg\mako\runtime.py", line 693, in _include_file
> callable_(ctx, **_kwargs_for_include(callable_, context._data, **kwargs))
> File "_templates/page-header.mako", line 36, in render_body
> <li><a href="${address}">${display_name}</a></li>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

for this one I'd want to see that the $address and $display_name values are Python unicode literals, and not bytestrings. Again if these are being generated from within Blogofile, then Blogofile may need to ensure that all internal values are Python unicode objects.

Python 2.7 remains the primary platform for most Python libraries right now.
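
One way to guarantee that every value handed to a template is a text (unicode) object is to normalize the context before rendering. The sketch below is an assumption about how that could look, not blogofile code: to_text and text_context are hypothetical helpers, and the sample values are made up. Byte 0xd0 from the error above is the lead byte of many UTF-8 encoded Cyrillic letters, which is consistent with a UTF-8 byte string reaching the template.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) string, decoding bytes if needed."""
    if isinstance(value, bytes):     # on Python 2, bytes is the plain str type
        return value.decode(encoding)
    return value

def text_context(**values):
    """Build a render context in which every byte string has been decoded."""
    return {name: to_text(value) for name, value in values.items()}

# address / display_name mirror the names used in page-header.mako above;
# the values themselves are invented for this example.
context = text_context(
    address="/contacts/",
    display_name=b"\xd0\x98\xd0\xbc\xd1\x8f",   # UTF-8 bytes for "Имя"
)
```

The resulting dictionary could then be passed to a template's render call, so that no implicit ASCII decode is attempted while the output is assembled.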

Doug Latornell

Nov 5, 2012, 9:34:19 PM
to blogofil...@googlegroups.com
On Thursday, November 1, 2012 10:00:52 AM UTC-4, Michael Bayer wrote:

> Well the bug here in Blogofile is that they aren't using Mako's routines to open a template file. We open it with mode='rb' so there's no need for it to attempt a decode. right where they say Template(t_file.read()), they should be saying Template(filename=t_file) so that Mako handles the details.


Another user has reported what looks like the same issue in https://github.com/EnigmaCurry/blogofile/issues/135 with a nice minimal looking reproduction recipe. Only problem is that it won't reproduce for me.

I'm trying to get my head around the blogofile template abstraction layer so that I can fix this, but it's hard without a failing test case to work against.

Michael Bayer

Nov 6, 2012, 12:18:27 AM
to blogofil...@googlegroups.com
one way to simulate a failure is just to make all the open() statements against template files in Blogofile fail - since if it was relying upon the template engine's file opening facilities, things should work again.  

why does blogofile need to open template files ?
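
The suggestion of making the open() calls fail can be tried in a test by shadowing open() inside one module only, so that the template engine's own file handling is left alone. The sketch below assumes Python 3 and uses unittest.mock; the module site_code is a stand-in built on the spot, and refuse_open and direct_open_disabled are hypothetical names. In a real test the target would be the module that does the opening, for example blogofile.template (an assumption based on the traceback earlier in the thread).

```python
import types
from unittest import mock

# Stand-in for a module that opens template files itself (hypothetical body).
site_code = types.ModuleType("site_code")
exec(
    "def load(path):\n"
    "    with open(path) as f:\n"
    "        return f.read()\n",
    site_code.__dict__,
)

def refuse_open(path, *args, **kwargs):
    """Replacement for open() that always fails."""
    raise IOError("direct open() of %r is disabled for this test" % (path,))

def direct_open_disabled(module):
    """Shadow the built-in open() inside `module` only, for a with-block."""
    return mock.patch.object(module, "open", refuse_open, create=True)

with direct_open_disabled(site_code):
    try:
        site_code.load("index.html.mako")
        outcome = "opened"
    except IOError:
        outcome = "blocked"
```

Any code path in the patched module that still reads a template directly then fails loudly, which gives a failing test case even on a machine whose locale happens to be utf-8.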


Doug Latornell

Nov 7, 2012, 7:28:30 PM
to blogofil...@googlegroups.com, QuickeneR

Péter Zsoldos

Nov 7, 2012, 7:57:40 PM
to blogofil...@googlegroups.com
nice, thanks Doug!