1) Truncate long unicode (UTF-8) strings based on their length in
BYTES. For example, u'\u4000\u4001\u4002 abc' has a length of 7 but
takes up 13 bytes. Since u'\u4000' takes up 3 bytes, I want
truncate(u'\u4000\u4001\u4002 abc', 3) == u'\u4000', as compared to
u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.
2) I don't want to accidentally chop any unicode characters in half.
If the byte truncate length would normally cut a unicode character in
2, then I just want to drop the whole character, not leave an orphaned
byte. So truncate(u'\u4000\u4001\u4002 abc', 4) == u'\u4000', as
opposed to getting a UnicodeDecodeError when the truncated bytes are
decoded.
I'm using Python2.6, so I have access to things like bytearray. Are
there any built-in ways to do something like this already? Or do I
just have to iterate over the unicode string?
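For concreteness, a doctest-style sketch of the behaviour I'm after
(truncate being the hypothetical function I'm asking for):

>>> truncate(u'\u4000\u4001\u4002 abc', 3)
u'\u4000'
>>> truncate(u'\u4000\u4001\u4002 abc', 4)
u'\u4000'
>>> truncate(u'\u4000\u4001\u4002 abc', 7)
u'\u4000\u4001'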
-- Andrew
How about
>>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
u'\xe4\xf6'
>>> print _
äö
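
Or, wrapped up as a function (a minimal sketch of the same idea;
truncate_utf8 is just a made-up name):

def truncate_utf8(ustr, maxbytes):
    # Encode, cut at the byte budget, and let "ignore" silently
    # drop any partial multi-byte sequence left at the end.
    return ustr.encode("utf8")[:maxbytes].decode("utf8", "ignore")

>>> truncate_utf8(u"\xe4\xf6\xfc", 5)
u'\xe4\xf6'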
Peter
> I need to ...
>
> 1) Truncate long unicode (UTF-8) strings based on their length in BYTES.
Out of curiosity, why do you need to do this?
> For example, u'\u4000\u4001\u4002 abc' has a length of 7 but takes up 13
> bytes.
No, that's wrong. The number of bytes depends on the encoding; it's
not a property of the unicode string itself.
>>> s = u'\u4000\u4001\u4002 abc'
>>> len(s) # characters
7
>>> len(s.encode('utf-8')) # bytes
13
>>> len(s.encode('utf-16')) # bytes
16
>>> len(s.encode('U32')) # bytes
32
> Since u'\u4000' takes up 3 bytes
But it doesn't. The *encoded* unicode character *may* take up three
bytes, or four, or possibly more, depending on what encoding you use.
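For example:

>>> len(u'\u4000'.encode('utf-8'))
3
>>> len(u'\u4000'.encode('utf-16'))  # includes a BOM
4
>>> len(u'\u4000'.encode('utf-32'))  # includes a BOM
8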
--
Steven
> I need to ...
> 1) Truncate long unicode (UTF-8) strings based on their length in
> BYTES.
> 2) I don't want to accidentally chop any unicode characters in half.
> If the byte truncate length would normally cut a unicode character in
> 2, then I just want to drop the whole character, not leave an orphaned
> byte.
> I'm using Python2.6, so I have access to things like bytearray.
Using bytearray saves you from using ord()
but runs the risk of accidental mutation.
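For example, in Python 2.6:

>>> s = '\xe4\x80\x80'      # some utf8 bytes in a str
>>> ord(s[0])               # str indexing yields a 1-char str
228
>>> bytearray(s)[0]         # bytearray indexing yields an int
228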
> Are
> there any built-in ways to do something like this already? Or do I
> just have to iterate over the unicode string?
Converting each character to utf8 and checking the
total number of bytes so far?
Ooooh, sloooowwwwww!
The whole concept of "truncating unicode"
(you mean "truncating utf8") seems
rather unpythonic to me.
Another alternative is to iterate backwards
over the utf8 string looking for a
character-starting byte. It leads to a candidate
for Unpythonic Code of the Year:
def utf8trunc(u8s, maxlen):
    """Truncate the UTF-8 byte string u8s to at most maxlen bytes,
    dropping rather than splitting any character at the boundary."""
    assert maxlen >= 1
    alen = len(u8s)
    if alen <= maxlen:
        return u8s
    # Scan backwards from the cut point for an initial byte,
    # i.e. anything that is not a 10xxxxxx continuation byte.
    pos = maxlen - 1
    while pos >= 0:
        val = ord(u8s[pos])
        if val & 0xC0 != 0x80:
            # found an initial byte
            break
        pos -= 1
    else:
        # no initial byte found
        raise ValueError("malformed UTF-8 [1]")
    if maxlen - pos > 4:
        # initial byte too far back; no UTF-8 char exceeds 4 bytes
        raise ValueError("malformed UTF-8 [2]")
    if val & 0x80:
        # multi-byte character: length is encoded in the top bits
        charlen = (2, 2, 3, 4)[(val >> 4) & 3]
    else:
        charlen = 1
    nextpos = pos + charlen
    assert nextpos >= maxlen
    if nextpos == maxlen:
        # character ends exactly at the cut point; keep it
        return u8s[:nextpos]
    # character straddles the cut point; drop it
    return u8s[:pos]

if __name__ == "__main__":
    tests = [u"", u"\u0000", u"\u007f", u"\u0080",
             u"\u07ff", u"\u0800", u"\uffff"]
    for testx in tests:
        test = u"abcde" + testx + u"pqrst"
        u8 = test.encode('utf8')
        print repr(test), repr(u8), len(u8)
        for mlen in range(4, 8 + len(testx.encode('utf8'))):
            u8t = utf8trunc(u8, mlen)
            print " ", mlen, len(u8t), repr(u8t)
Tested to the extent shown. Doesn't pretend to check
for all cases of UTF-8
malformation, just easy ones :-)
Cheers,
John
> Andrew Fong <FongAndrew <at> gmail.com> writes:
> > Are
> > there any built-in ways to do something like this already? Or do I
> > just have to iterate over the unicode string?
>
> Converting each character to utf8 and checking the
> total number of bytes so far?
> Ooooh, sloooowwwwww!
>
Somewhat faster:
u8len = 0
for u in unicode_string:
    if u <= u'\u007f':
        u8len += 1
    elif u <= u'\u07ff':
        u8len += 2
    elif u <= u'\uffff':
        u8len += 3
    else:
        u8len += 4
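
That per-character count can drive the truncation directly. A sketch
only (truncate_unicode is a name I've just invented; note that on a
narrow Python build, characters above U+FFFF arrive as surrogate
pairs and would be counted as 3+3 bytes rather than 4):

def truncate_unicode(ustr, maxbytes):
    # Accumulate the utf8 byte length character by character and
    # stop just before the byte budget would be exceeded.
    u8len = 0
    for i, u in enumerate(ustr):
        if u <= u'\u007f':
            clen = 1
        elif u <= u'\u07ff':
            clen = 2
        elif u <= u'\uffff':
            clen = 3
        else:
            clen = 4
        if u8len + clen > maxbytes:
            return ustr[:i]
        u8len += clen
    return ustr

>>> truncate_unicode(u'\u4000\u4001\u4002 abc', 4)
u'\u4000'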
Cheers,
John
> John Machin <sjmachin <at> lexicon.net> writes:
>
>> Andrew Fong <FongAndrew <at> gmail.com> writes:
>
>> > Are
>> > there any built-in ways to do something like this already? Or do I
>> > just have to iterate over the unicode string?
>>
>> Converting each character to utf8 and checking the total number of
>> bytes so far?
>> Ooooh, sloooowwwwww!
>>
>>
> Somewhat faster:
What's wrong with Peter Otten's solution?
>>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
u'\xe4\xf6'
At most, you should have one error, at the very end. If you ignore it,
you get the longest prefix of the string that takes up at most 5
*bytes* when encoded as UTF-8.
(If you encode using a different codec, you will likely get a different
number of bytes.)
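Applied to the OP's own example:

>>> s = u'\u4000\u4001\u4002 abc'
>>> s.encode('utf-8')[:4].decode('utf-8', 'ignore')
u'\u4000'

which is exactly the truncate(..., 4) == u'\u4000' behaviour he asked
for.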
--
Steven
>
> On Fri, 29 May 2009 04:09:53 +0000, John Machin wrote:
>
> > John Machin <sjmachin <at> lexicon.net> writes:
> >
> >> Andrew Fong <FongAndrew <at> gmail.com> writes:
> >
> >> > Are
> >> > there any built-in ways to do something like this already? Or do I
> >> > just have to iterate over the unicode string?
> >>
> >> Converting each character to utf8 and checking the total number of
> >> bytes so far?
> >> Ooooh, sloooowwwwww!
> >>
> >>
> > Somewhat faster:
>
> What's wrong with Peter Otten's solution?
>
> >>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
Given the minimal info supplied by the OP, nothing. However, if the OP
were to respond to your "why" question and give some more info (how
long is long, what percentage of the average string is thrown away,
does he already have the utf8 version, what is he going to do with the
unicode version after it's been truncated: convert it back to utf8??),
it may turn out that a unicode forwards search or a utf8 backwards
search is preferable.
If Pyrex/Cython is an option and runtime is a major consideration,
then one of those is likely to be preferable.
Cheers,
John