On Oct 31, 2012, at 9:58 AM, QuickeneR wrote:
> Hi,
> I am new to static site compilers, and am currently trying to start with blogofile. I am using 0.8b1 on Windows with Russian locale, and starting with 'raw' blogofile, without the blog plugin.
> While working on a simple site, I encountered a number of unicode errors. With Python 2.7 the error messages were rather meaningless (e.g. mako.exceptions.CompileException: Unicode decode operation of encoding 'ascii' failed at line: 0 char: 0 ) - Russian characters certainly cannot be decoded as (low) ASCII. So I switched to Python 3.3.
> Here the errors were better - they specified the file that could not be decoded, and the encoding it was supposed to be in. However, I noticed a strange thing - the assumed encodings for index.html.mako and _templates/site.mako were different. If I save index.html.mako in utf-8, I get
> ------------
> ...
> File "e:\python33\lib\encodings\cp1251.py", line 23, in decode
> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 7570: character maps to <undefined>
> -----------
> Clearly, blogofile assumes that this file should be in cp1251.
> If, on the other hand, I save _templates/site.mako in cp1251, I get
> -----------
> ...
> File "e:\python33\lib\site-packages\mako-0.7.2-py3.3.egg\mako\lexer.py", line 206, in decode_raw_stream
> 0, 0, filename)
> CompileException: Unicode decode operation of encoding 'utf-8' failed in file '_templates/site.mako' at line: 0 char:
> -----------
> So this file is assumed to be in utf-8.
> A quick glance at the Mako sources showed that Mako indeed assumes utf-8 unless told otherwise, while blogofile does not seem to do that and relies instead on open(), which assumes the encoding given by locale.getpreferredencoding(). This leads to different required encodings for different files, which is undesirable, to put it mildly.
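The locale-dependent behavior described above can be shown with a few lines of stdlib Python (the file name is a throwaway temp file, used purely for illustration):

```python
import locale
import os
import tempfile

# Text-mode open() without an explicit encoding uses
# locale.getpreferredencoding(), so the "correct" encoding of a
# source file varies from machine to machine.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write("привет".encode("utf-8"))  # Russian text, saved as utf-8

print(locale.getpreferredencoding(False))  # what a bare open() would assume

with open(path, encoding="utf-8") as f:  # explicit encoding is portable
    text = f.read()
os.remove(path)
print(text)
```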
The documented way of establishing the encoding of your templates is the "# coding:" header, or, across the board, setting it as a TemplateLookup attribute:
http://docs.makotemplates.org/en/latest/unicode.html#specifying-the-encoding-of-a-template-file
Assuming you don't want to add a "# coding" prefix to each template, you'd want to use input_encoding. In Blogofile I'm not 100% sure of the best place to set TemplateLookup's input encoding: there seem to be some hooks where we can set up a TemplateLookup of our own, or we could assign to the singleton MakoTemplate.template_lookup.input_encoding, but I'm not sure where that can be done reliably; blogofile.template.MakoTemplate sets up its lookup in a funny way.
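One possible shape for that, sketched with Mako directly per its unicode docs (the directory name and wiring here are illustrative, not Blogofile's actual code):

```python
from mako.lookup import TemplateLookup

# Sketch only: force utf-8 for every template via the lookup,
# instead of relying on locale.getpreferredencoding().
lookup = TemplateLookup(
    directories=["_templates"],  # illustrative path
    input_encoding="utf-8",      # decode all template files as utf-8
    output_encoding="utf-8",     # render() returns utf-8 bytes
)
```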
> 1) the 'feature-complete' way - one implements some method for specifying the encoding of each and every file
this is implemented via the "# coding" header.
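In a template, that per-file declaration looks like the following (Mako uses the `##` comment form for this, per its unicode docs; the markup line is just an example):

```
## -*- coding: utf-8 -*-
<h1>привет</h1>
```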
> 2) assume the locale-based encoding
I can't find the blog posts on this at the moment, but this is widely considered a bad idea.
> 3) assume utf-8
the default.
> 1) it seems, will need much more work for either guessing the encoding or having settings in _config.py
Guessing is out. A _config.py setting should be very easy, at least as far as Mako is concerned.
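For instance, something like this in _config.py, which would then be fed into TemplateLookup(input_encoding=...). The setting name is hypothetical; Blogofile does not define it today:

```python
# Hypothetical _config.py entry: the name "template_encoding" is
# invented here for illustration; Blogofile has no such setting yet.
template_encoding = "utf-8"
```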
> 2) is probably fine, unless we remember that sites and blogs are inherently global. No one really cares what locale the site's author prefers, and it really sucks when you can't pass a site to a friend or colleague because he has a different locale and the site won't build due to UnicodeDecodeError
yup
> 3) is a clever thing if you cannot simply pass the bytes along and HAVE to assume an encoding.
if no encoding has been specified any other way, then we have to have a default, that's correct
> Of course, it might not be the same encoding as used by text documents native to the local platform (on Russian Windows we have 866, 1251, and two-byte Unicode in addition to utf-8) but really, it is 2012. Shouldn't we be done with guessing at character encodings?
I don't see us guessing anywhere here....
>
> So, here is a simple fix I propose: change all instances of open(..) to open(..., encoding='utf-8').
yikes, that's so out of left field. All the tools in use here support configurable input encodings as well as per-file input encodings using standard techniques, just see pep 263:
http://www.python.org/dev/peps/pep-0263/
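A minimal example of the PEP 263 convention being referred to, as it appears in a Python source file (Mako honors the same per-file convention in templates):

```python
# -*- coding: utf-8 -*-
# The magic comment above is the PEP 263 encoding declaration; the
# Python parser and most editors honor it when decoding this file.
greeting = "привет"  # decodes correctly because the encoding is declared
print(greeting)
```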
> It is a quick hack (and it probably won't work on python 2) but it solves the problem for me.
> What do you think?
a broken hack like that is clearly not an option.