How to handle unicode posts and titles


Paul Whipp

Apr 30, 2014, 5:38:47 AM
to mezzani...@googlegroups.com
I may be joining the translation discussion shortly; I have a site that is using Russian, French, English and Indonesian.

I'm importing pages from WordPress (pages, not blog entries) and I get the dreaded "UnicodeDecodeError: 'ascii' codec can't decode byte..." error when I save the RichTextPage object created from the WordPress page. It is raised in the Mezzanine code that joins up the titles and in the code that gets the 'description_from_content'.

USE_I18N is True in settings.

Obviously I don't want to lose the Cyrillic characters, and I need to get these posts imported. I've tried various options and the best one so far seems to be kitchen's to_unicode and to_bytes, e.g.:


from kitchen.text.converters import to_bytes, to_unicode
...

    def import_page(self, page, pages):
        # Decode the raw WordPress title once, at the import boundary.
        title = to_unicode(page['post_title'])
        # Encode back to bytes only for the console message.
        self.vprint("BEGIN Importing page '{0}'".format(to_bytes(title)), 1)
        mezz_page = self.get_or_create(RichTextPage, title=title)
        if page['post_parent'] > 0:  # there is a parent
            mezz_page.parent = self.get_mezz_page(page['post_parent'], pages)
        mezz_page.created = page['post_modified']
        mezz_page.updated = page['post_modified']
        # Decode the body as well so Mezzanine only ever sees unicode.
        mezz_page.content = to_unicode(page['post_content'])
        mezz_page.save()

The parent handling is a work in progress, but this works for the content and title: it retains the Cyrillic characters correctly. However, it seems unwieldy. Is this approach a good one or should I be doing something else?
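
For comparison, here is a minimal sketch of the same decode-at-the-boundary idea without kitchen. It assumes the WordPress rows arrive as UTF-8 byte strings, which is not confirmed here. `ensure_text` and the sample `page` dict are hypothetical names for illustration, not part of kitchen or Mezzanine.

```python
# -*- coding: utf-8 -*-


def ensure_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it only if it is a byte string.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up row, shaped like the `page` dict used by the importer.
page = {"post_title": u"Привет".encode("utf-8"), "post_content": u"Текст"}

title = ensure_text(page["post_title"])      # bytes in, text out
content = ensure_text(page["post_content"])  # already text, returned unchanged
```

The point of the design is that decoding happens exactly once, where the data enters the program, so nothing downstream has to guess whether it holds bytes or text.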

Ken Bolton

Apr 30, 2014, 8:36:20 AM
to mezzanine-users
Hi Paul,

In my experience, the UnicodeDecodeError only happens if you have not set up your locale correctly. The highlighted section of the fabfile at https://github.com/stephenmcd/mezzanine/blob/master/mezzanine/project_template/fabfile.py#L346-L350 remedies this problem every time.
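
One rough way to check whether the locale is involved is to ask Python which encodings it picked up from the environment. This is a diagnostic sketch, not code from the fabfile, and `encoding_report` is a hypothetical name. An ASCII-family value (for example `ANSI_X3.4-1968` on Linux) usually suggests the locale is unset or misconfigured.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derives from the environment."""
    return {
        "preferred": locale.getpreferredencoding(),
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        # Some stream replacements have no `encoding` attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in sorted(encoding_report().items()):
    print("{0}: {1}".format(name, value))
```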

hth,
ken


--
You received this message because you are subscribed to the Google Groups "Mezzanine Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mezzanine-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Paul Whipp

Apr 30, 2014, 3:52:46 PM
to mezzani...@googlegroups.com
Thanks Ken,

I don't think the locale is relevant for this. My locale is set to en_AU.UTF-8, which matches the locale used by the Postgres database. The WordPress data is imported from MySQL, and I didn't change any settings there, so I suspect that side is Latin-1.
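
If the stored bytes are UTF-8 but get read as Latin-1 (only a suspicion at this point), the mismatch would look roughly like this sketch. The sample string is made up.

```python
# -*- coding: utf-8 -*-
original = u"Привет"

# The bytes as they would be stored if the data is UTF-8.
raw = original.encode("utf-8")

# Reading those bytes as Latin-1 never raises, because every byte maps to
# some character, but the result is mojibake rather than Cyrillic.
misread = raw.decode("latin-1")

# The damage is reversible as long as no bytes were dropped on the way.
repaired = misread.encode("latin-1").decode("utf-8")
```

This is why a wrong charset on the source connection can go unnoticed until the text is displayed or mixed with correctly decoded strings.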

Cheers,
Paul





Alex

May 4, 2014, 11:12:34 AM
to mezzani...@googlegroups.com
I'm experiencing the same issue when importing blog posts from WordPress:

  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 197, in string_literal
    return db.string_literal(obj)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Using a debugger, I found that most objects arrive there as type 'str'. But as soon as obj arrives as 'future.builtins.backports.newstr.newstr', string_literal() crashes.
This happens for BlogPost.description, which is auto-generated in mezzanine/core/models.py line 165:
        # Fall back to the title if description couldn't be determined.
        if not description:
            description = str(self)

'str' here is not the built-in Python 2 string type but the one imported from future.builtins, as specified in mezzanine/core/models.py line 2.

Replacing it with
            description = unicode(self)
fixed the issue for me.
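
The error itself can be reproduced in isolation. This sketch is not the MySQLdb or Mezzanine code path, only an illustration of why the ascii codec rejects Cyrillic text while an explicit UTF-8 encode succeeds. The sample string is made up.

```python
# -*- coding: utf-8 -*-
description = u"Привет, мир"

# Encoding non-ASCII text with the ascii codec fails. On Python 2 the same
# codec is used whenever unicode is implicitly converted to a byte string.
try:
    description.encode("ascii")
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# Encoding explicitly with UTF-8 succeeds and round-trips cleanly.
encoded = description.encode("utf-8")
roundtrip = encoded.decode("utf-8")
```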

Paul Whipp

May 4, 2014, 3:42:40 PM
to mezzani...@googlegroups.com
Thanks Alex,

I think this is a rarely encountered bug in Mezzanine and coercing the description to unicode is probably fine as a fix because of the way Python is headed with strings: I did not have the problem when I imported into paulwhippconsulting.com which uses Python 3 rather than Python 2.

