Unicode cleanup: g.u and the new g.ue

Edward K. Ream

Dec 20, 2009, 4:22:41 PM
to leo-editor
I plan to simplify g.u and introduce a new g.ue method. This follows
Ville's suggestion.

In the new scheme, g.u calls str or unicode with *no* encoding; g.ue
calls str or unicode with a *mandatory* encoding. Experience shows
that it is not possible to dispense with g.ue entirely: there are
places where the encoding is essential.

Within qtGui.py, g.u should always be used, never g.ue. This is
because Qt passes (wrapped) unicode strings to Leo. The call to g.u
merely converts to a plain Python unicode object.

There are just a few places in Leo's core where g.ue must be used.
Explicitly marking those places is important, imo.

Edward

P.S. Here are the definitions of g.u and g.ue. They are not likely
to change::

    def u(s,encoding='utf-8'):
        '''Convert the string s to unicode if necessary.'''
        if g.isUnicode(s):
            return s
        elif g.isPython3:
            return str(s)
        else:
            return unicode(s)

    def ue(s,encoding):
        '''Convert the string s to unicode if necessary.'''
        if g.isUnicode(s):
            return s
        elif g.isPython3:
            return str(s,encoding)
        else:
            return unicode(s,encoding)
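
[Editor's sketch] To illustrate the split between the two helpers, here is a minimal, self-contained Python 3 version. The names `is_unicode`, `u`, and `ue` are stand-ins chosen for this example (assumptions, not Leo's actual code), and only the Python 3 branch is shown.

```python
def is_unicode(s):
    """Stand-in for g.isUnicode: in Python 3, str is the unicode type."""
    return isinstance(s, str)

def u(s):
    """No encoding: return text unchanged, otherwise fall back to str(s)."""
    return s if is_unicode(s) else str(s)

def ue(s, encoding):
    """Mandatory encoding: decode bytes into text using the given encoding."""
    return s if is_unicode(s) else str(s, encoding)

raw = 'Привет'.encode('utf-8')  # bytes, as they might arrive from a file
text = ue(raw, 'utf-8')         # decoding bytes requires an encoding
same = u(text)                  # already text, so it comes back unchanged
```

The point of the sketch is only that `ue` is the one place where an encoding has to be supplied; `u` never needs one.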

Ville M. Vainio

Dec 21, 2009, 2:43:07 AM
to leo-e...@googlegroups.com
On Sun, Dec 20, 2009 at 11:22 PM, Edward K. Ream <edre...@gmail.com> wrote:

> def u(s,encoding='utf-8'):
>    '''Convert the string s to unicode if necessary.'''
>    if g.isUnicode(s):
>        return s

Isn't this check redundant? unicode(s) => s, if s is unicode.

I still think a much faster way would be to check for python3 only once:

    if g.python3:
        g.u = str
    else:
        g.u = unicode

--
Ville M. Vainio
http://tinyurl.com/vainio

Edward K. Ream

Dec 21, 2009, 8:40:15 AM
to leo-e...@googlegroups.com
On Mon, Dec 21, 2009 at 2:43 AM, Ville M. Vainio <viva...@gmail.com> wrote:
> On Sun, Dec 20, 2009 at 11:22 PM, Edward K. Ream <edre...@gmail.com>
>
>> def u(s,encoding='utf-8'):
>>    '''Convert the string s to unicode if necessary.'''
>>    if g.isUnicode(s):
>>        return s
>
> Isn't this check redundant? unicode(s) => s, if s is unicode.

I don't think so. Iirc, str(s) fails if s is already unicode.

When I awoke this morning (before reading your comment), I realized
that it may be possible to make the following definition work:

    if g.isPython3:
        def u(s):
            return s
    else:
        def u(s):
            return unicode(s)

This highlights a fundamental asymmetry between Python 2.x and Python
3.x. In Python 3.x, "almost everything" will *already* be a unicode
string. In such cases, calling str(s) would be wrong/confusing. In
all other cases, we should use g.ue: an encoding is *required*.
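
[Editor's sketch] The asymmetry can be seen in a few lines of Python 3. This is an illustration under the assumption that the only non-text input is `bytes`; the variable names are made up for the example.

```python
text = 'naïve'               # Python 3 string literals are already unicode
data = text.encode('utf-8')  # bytes: the case where an encoding is required

def u(s):
    """Python 3 form of the proposed g.u: there is nothing to convert."""
    return s

unchanged = u(text)            # the very same object comes back
decoded = str(data, 'utf-8')   # what a g.ue-style call does with bytes
misleading = str(data)         # no encoding: a repr-style string, not the text
```

Calling `str` on bytes without an encoding does not raise; it quietly produces text beginning with `b'`, which is one reason to keep the "encoding required" cases explicit.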

I'll play with this today. If it can be made to work (we won't know
for sure until it gets tested in the trunk), it will be an important
clarification of the code.

Edward

Ville M. Vainio

Dec 21, 2009, 8:57:28 AM
to leo-e...@googlegroups.com
On Mon, Dec 21, 2009 at 3:40 PM, Edward K. Ream <edre...@gmail.com> wrote:

> I don't think so.  Iirc, str(s) fails if s is already unicode.

Well, for python2 it works anyway.

unicode(unicode(1))

returns u"1".

I would be surprised if they screwed this simple case up with python3.
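
[Editor's sketch] The Python 3 counterpart of the check above is easy to run. This is just a quick sanity test, not part of Leo.

```python
once = str(1)
twice = str(once)       # str() applied to something that is already a str

accented = 'é'
again = str(accented)   # non-ASCII text is handled the same way
```

In Python 3 both calls return a string equal to their argument, so applying `str` to text is harmless there.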

> When I awoke this morning (before reading your comment), I realized
> that it may be possible to make the following definition work:
>
> if g.isPython3:
>    def u(s):
>        return s
> else:
>    def u(s):
>        return unicode(s)

Yes, this should work as well.

Edward K. Ream

Dec 21, 2009, 9:13:21 AM
to leo-e...@googlegroups.com
On Mon, Dec 21, 2009 at 8:57 AM, Ville M. Vainio <viva...@gmail.com> wrote:

>> if g.isPython3:
>>    def u(s):
>>        return s
>> else:
>>    def u(s):
>>        return unicode(s)
>
> Yes, this should work as well.

It does! Furthermore, the equivalent definitions work for g.ue.

I created a local branch to make these changes, worried that perhaps
some other changes would be necessary, but apparently not. All unit
tests still pass "everywhere" with no other changes at all.

So my plan is to merge the local branch into the trunk and commit.
There is an 'if 1' switch that selects the new code (temporarily) so
if there are problems with Russian encodings zpcspm can revert just by
changing this switch. That should be safe enough.
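
[Editor's sketch] A temporary switch of this kind might look roughly like the following. The layout, and the names `is_python3` and `text_type`, are assumptions for illustration; the post does not show the actual code in the trunk.

```python
import sys

is_python3 = sys.version_info[0] >= 3
# `unicode` is only looked up on Python 2, so this line is safe on Python 3.
text_type = str if is_python3 else unicode  # noqa: F821

if 1:  # New code. Change to 0 to fall back to the old definition.
    if is_python3:
        def u(s):
            return s
    else:
        def u(s):
            return text_type(s)
else:  # Old code, kept temporarily so reverting is a one-character edit.
    def u(s):
        return s if isinstance(s, text_type) else text_type(s)
```

Because the choice is made once when the module loads, reverting does not touch any call sites.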

Everyone agree?

Edward

Edward K. Ream

Dec 21, 2009, 9:19:15 AM
to leo-e...@googlegroups.com
On Mon, Dec 21, 2009 at 9:13 AM, Edward K. Ream <edre...@gmail.com> wrote:

> Everyone agree?

Well, I couldn't wait. Here is the commit log for rev 2532 of the trunk:

QQQ
An important simplification to g.u and g.ue:

- It highlights an essential asymmetry between Python 2.x and Python 3.x.

- It removes confusion about what actually is happening.
In particular, it proves conclusively that g.u(s) does nothing in Python 3.x.

- It highlights, more clearly than ever before, just where encodings
are, and aren't, needed.

This is *not* an efficiency measure, though of course it helps slightly.

This can be considered a slightly risky commit, despite all unit tests passing.

Please report any problems, no matter how large or small, immediately.
QQQ

Edward

Edward K. Ream

Dec 21, 2009, 9:37:32 AM
to leo-e...@googlegroups.com
On Mon, Dec 21, 2009 at 9:19 AM, Edward K. Ream <edre...@gmail.com> wrote:

> Well, I couldn't wait.  Here is the commit log for rev 2532 of the trunk:

The fact that g.u is a do-nothing for Python 3.x means we can perform
some *safe* simplifications of the code. In particular, here is a
really ugly method:

    def outputStringWithLineEndings (self,s):
        at = self
        if g.isPython3:
            s = g.ue(s,at.encoding)
            s = s.replace(g.u('\n'),g.u(at.output_newline))
        else:
            s = s.replace('\n',at.output_newline)
        self.os(s)

It can be simplified to:

    def outputStringWithLineEndings (self,s):
        at = self
        if g.isPython3:
            s = g.ue(s,at.encoding)
        s = s.replace('\n',at.output_newline)
        self.os(s)
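
[Editor's sketch] For readers who want to try the simplified method outside Leo, here is a runnable Python 3 approximation. The class `AtFileSketch`, its constructor defaults, and the list-based `os` method are hypothetical stand-ins, and the `isinstance(s, bytes)` test is swapped in for the `g.isPython3` / `g.ue` pair.

```python
class AtFileSketch:
    """Hypothetical stand-in for the writer class; not Leo's real class."""

    def __init__(self, encoding='utf-8', output_newline='\r\n'):
        self.encoding = encoding
        self.output_newline = output_newline
        self.chunks = []

    def os(self, s):
        """Collect output in a list instead of writing to a file."""
        self.chunks.append(s)

    def outputStringWithLineEndings(self, s):
        # Decode only when s is still bytes; text passes through untouched.
        if isinstance(s, bytes):
            s = s.decode(self.encoding)
        # A single replace call now serves both text and decoded bytes.
        s = s.replace('\n', self.output_newline)
        self.os(s)

at = AtFileSketch()
at.outputStringWithLineEndings('a\nb\n')
at.outputStringWithLineEndings(b'c\n')
result = ''.join(at.chunks)
```

After decoding, the newline replacement no longer needs separate branches, which mirrors the simplification described above.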

Edward

Ville M. Vainio

Dec 21, 2009, 9:42:04 AM
to leo-e...@googlegroups.com
On Mon, Dec 21, 2009 at 4:19 PM, Edward K. Ream <edre...@gmail.com> wrote:

> - It highlights, more clearly than ever before, just where encodings
> are, and aren't, needed.

This has been my pet peeve for a long, long time - Leo was using
encoding "just in case" around the code. I'm glad that this "logical
leak" is being fixed, as it tends to hide logical errors elsewhere.

Edward K. Ream

Dec 21, 2009, 9:50:12 AM
to leo-editor
On Dec 21, 9:42 am, "Ville M. Vainio" <vivai...@gmail.com> wrote:

> > - It highlights, more clearly than ever before, just where encodings
> > are, and aren't, needed.
>
> This has been my pet peeve for a long, long time - Leo was using
> encoding "just in case" around the code. I'm glad that this "logical
> leak" is being fixed, as it tends to hide logical errors elsewhere.

I agree completely. This is the reason I'm excited about these
changes.

Edward

zpcspm

Dec 21, 2009, 2:16:48 PM
to leo-editor
On Dec 21, 4:13 pm, "Edward K. Ream" <edream...@gmail.com> wrote:
> So my plan is to merge the local branch into the trunk and commit.
> There is an 'if 1' switch that selects the new code (temporarily) so
> if there are problems with Russian encodings zpcspm can revert just by
> changing this switch.  That should be safe enough.

Just synced the trunk to r2541 and tested it. Everything is ok.
Feel free to make the mentioned change permanent if you wish.

Edward K. Ream

Dec 21, 2009, 2:46:31 PM
to leo-e...@googlegroups.com

Excellent. Thanks for the confirmation. This is a big day for Leo and Unicode.

Edward

Edward K. Ream

Dec 21, 2009, 2:53:12 PM
to leo-editor

On Dec 21, 2:46 pm, "Edward K. Ream" <edream...@gmail.com> wrote:

> > Just synced the trunk to r2541 and tested it. Everything is ok.
> > Feel free to make the mentioned change permanent if you wish.

Done in the trunk at rev 2542. All unit tests pass.

Edward
