
Least-lossy string.encode to us-ascii?

Tim Chase

Sep 13, 2012, 5:26:07 PM
to Python
I've got a bunch of text in Portuguese and to transmit them, need to
have them in us-ascii (7-bit). I'd like to keep as much information
as possible, just stripping accents, cedillas, tildes, etc. So
"serviço móvil" becomes "servico movil". Is there anything stock
that I've missed? I can do mystring.encode('us-ascii', 'replace')
but that doesn't keep as much information as I'd hope.

-tkc



Vlastimil Brom

Sep 13, 2012, 5:44:18 PM
to Python
2012/9/13 Tim Chase <pytho...@tim.thechases.com>:
Hi,
would something like the following be enough for your needs?
Unfortunately, I can't check it reliably with regard to Portuguese.

>>> import unicodedata
>>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
u'servico movil'
>>>

There is also "Unidecode", but I haven't used it myself so far...
http://pypi.python.org/pypi/Unidecode/

hth,
vbr
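
A minimal sketch of the same normalize-then-encode idea wrapped in a function. It assumes Python 3 (the session above is Python 2), and the name strip_accents is only illustrative, not something from the stdlib:

```python
import unicodedata


def strip_accents(text):
    """Return an ASCII-only copy of text with combining accents removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # "ignore" silently drops everything that is not ASCII, which includes
    # the combining marks but also any character with no ASCII base letter.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("serviço móvil"))
```

Characters that do not decompose into an ASCII base letter are dropped without warning, so this is only lossless-looking for text made of plain letters plus combinable accents.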

Christian Heimes

Sep 13, 2012, 6:00:45 PM
to pytho...@python.org
Am 13.09.2012 23:26, schrieb Tim Chase:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit). I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc. So
> "serviço móvil" becomes "servico movil". Is there anything stock
> that I've missed? I can do mystring.encode('us-ascii', 'replace')
> but that doesn't keep as much information as I'd hope.

The unidecode [1] package contains a large mapping of unicode chars to
ASCII. It even supports cool stuff like Chinese to ASCII:

>>> import unidecode
>>> print u"\u5317\u4EB0"
北亰
>>> print unidecode.unidecode(u"\u5317\u4EB0")
Bei Jing

icu4c and pyicu [2] may contain more methods for conversion but they
require binary extensions. By the way, ICU can do a lot of cool things, too:

>>> import icu
>>> rbf = icu.RuleBasedNumberFormat(icu.URBNFRuleSetTag.SPELLOUT,
...                                 icu.Locale.getUS())
>>> rbf.format(23)
u'twenty-three'
>>> rbf.format(100000)
u'one hundred thousand'

Regards,
Christian

[1] http://pypi.python.org/pypi/Unidecode/0.04.9
[2] http://pypi.python.org/pypi/PyICU/1.4


Tim Chase

Sep 13, 2012, 6:06:25 PM
to Vlastimil Brom, Python
On 09/13/12 16:44, Vlastimil Brom wrote:
> >>> import unicodedata
> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
> u'servico movil'

Works well for all the test-cases I threw at it. Thanks!

-tkc


Ethan Furman

Sep 13, 2012, 6:29:40 PM
to Python
[sorry for the direct reply, Tim]

Tim Chase wrote:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit). I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc. So
> "serviço móvil" becomes "servico movil". Is there anything stock
> that I've missed? I can do mystring.encode('us-ascii', 'replace')
> but that doesn't keep as much information as I'd hope.

I haven't yet used it myself, but I've heard good things about
http://pypi.python.org/pypi/Unidecode/

~Ethan~

Terry Reedy

Sep 13, 2012, 7:36:56 PM
to pytho...@python.org
On 9/13/2012 5:26 PM, Tim Chase wrote:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit). I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc.

'keep as much information as possible' would mean an effectively
lossless transliteration, which you could do with a dict.
{<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
never occur in normal text of the sort you are transmitting), ...}


--
Terry Jan Reedy
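
A rough sketch of what such a reversible dict could look like, assuming Python 3. The table entries, the "{...}" marker convention, and the function names are all hypothetical; the scheme only round-trips if the marker sequences never occur in the real text:

```python
# Hypothetical table: each accented letter maps to its base letter plus an
# ASCII marker that is assumed never to appear in the text being sent.
TO_ASCII = {
    "ç": "c{,}",
    "ó": "o{'}",
    "ã": "a{~}",
}
FROM_ASCII = {marker: char for char, marker in TO_ASCII.items()}


def encode_lossless(text):
    """Replace mapped characters with their markers and return ASCII bytes."""
    for char, marker in TO_ASCII.items():
        text = text.replace(char, marker)
    # Raises UnicodeEncodeError if a non-ASCII character is missing from
    # the table, instead of silently losing it.
    return text.encode("ascii")


def decode_lossless(data):
    """Reverse encode_lossless on the receiving side."""
    text = data.decode("ascii")
    for marker, char in FROM_ASCII.items():
        text = text.replace(marker, char)
    return text


wire = encode_lossless("serviço móvil")
print(wire)
print(decode_lossless(wire))
```

The receiver can still read the base letters if it ignores the markers, and can rebuild the original if it knows the table.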

Tim Chase

Sep 13, 2012, 7:54:22 PM
to Terry Reedy, pytho...@python.org
Vlastimil's solution kept the characters but stripped them of their
accents/tildes/cedillas/etc, doing just what I wanted, all using the
stdlib. Hard to do better than that :-)

-tkc



Mark Tolonen

Sep 13, 2012, 10:09:38 PM
to Terry Reedy, pytho...@python.org
How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.

>>> s=u"serviço móvil".encode('utf-7')
>>> print s
servi+AOc-o m+APM-vil
>>> print s.decode('utf-7')
serviço móvil

-Mark
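
The same round trip as a small sketch, assuming Python 3, where encode() returns bytes and decode() is called on the bytes object rather than on a str:

```python
text = "serviço móvil"

# Encoding to UTF-7 yields bytes whose values all fit in 7 bits.
wire = text.encode("utf-7")
print(wire)

# The receiver has to know the payload is UTF-7 and decode it explicitly.
restored = wire.decode("utf-7")
print(restored == text)
```

This only helps when the receiving side actually decodes UTF-7; a reader that treats the bytes as plain ASCII sees the "+...-" escape sequences.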

Tim Chase

Sep 13, 2012, 10:34:52 PM
to Mark Tolonen, pytho...@python.org
On 09/13/12 21:09, Mark Tolonen wrote:
> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>> Vlastimil's solution kept the characters but stripped them of their
>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>> stdlib. Hard to do better than that :-)
>
> How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.
>
> >>> s=u"serviço móvil".encode('utf-7')
> >>> print s
> servi+AOc-o m+APM-vil
> >>> print s.decode('utf-7')
> serviço móvil

Nice if I control both ends of the pipe. Unfortunately, I only
control what goes in, and I want it to be as un-screw-uppable as
possible when it comes out the other end (may be web, CSV files,
PDFs, FTP'ed file dumps, spreadsheets, word-processing documents,
etc), and us-ascii is the lowest-common-denominator of
unscrewuppableness while requiring nothing of the other end. :-)

-tkc




Steven D'Aprano

Sep 13, 2012, 11:49:00 PM
On Thu, 13 Sep 2012 16:26:07 -0500, Tim Chase wrote:

> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit).

That could mean two things:

1) "The receiver is incapable of dealing with Unicode in 2012, which is
frankly appalling, but what can I do about it?"

2) "The transport mechanism I use to transmit the data is only capable of
dealing with 7-bit ASCII strings, which is sad but pretty much standard."

In the case of 1), I suggest you look at the Unicode Hammer, a.k.a. "The
Stupid American":

http://code.activestate.com/recipes/251871

and especially the very many useful comments.


In the case of 2), just binhex or uuencode your data for transport.



--
Steven
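
For case 2), a minimal Python 3 sketch of wrapping the text in a 7-bit-safe envelope. Note that this swaps in base64 rather than binhex or uuencode, on the assumption that a recent Python 3 is in use, where the binhex and uu modules are no longer in the standard library; the idea is the same:

```python
import base64

text = "serviço móvil"

# Sender: encode the text as UTF-8, then wrap the bytes in base64 so the
# payload contains only ASCII letters, digits, "+", "/" and "=".
wire = base64.b64encode(text.encode("utf-8"))
print(wire)

# Receiver: undo both steps in reverse order.
restored = base64.b64decode(wire).decode("utf-8")
print(restored == text)
```

As with UTF-7, nothing is lost, but the receiver must know to unwrap the payload.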

Steven D'Aprano

Sep 14, 2012, 12:05:32 AM
On Thu, 13 Sep 2012 21:34:52 -0500, Tim Chase wrote:

> On 09/13/12 21:09, Mark Tolonen wrote:
>> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>>> Vlastimil's solution kept the characters but stripped them of their
>>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>>> stdlib. Hard to do better than that :-)
>>
>> How about using UTF-7 for transmission and decode on the other end?
>> This keeps the transmission all 7-bit, and no loss.
>>
>> >>> s=u"serviço móvil".encode('utf-7')
>> >>> print s
>> servi+AOc-o m+APM-vil
>> >>> print s.decode('utf-7')
>> serviço móvil
>
> Nice if I control both ends of the pipe. Unfortunately, I only control
> what goes in, and I want it to be as un-screw-uppable as possible when
> it comes out the other end (may be web, CSV files, PDFs, FTP'ed file
> dumps, spreadsheets, word-processing documents, etc), and us-ascii is
> the lowest-common-denominator of unscrewuppableness while requiring
> nothing of the other end. :-)

Wrong. It requires support for US-ASCII. What if the other end is an IBM
mainframe using EBCDIC?

Frankly, I am appalled that you are intentionally perpetuating the
ignorance of US-ASCII-only applications, not because you have no choice
about inter-operating with some ancient, brain-dead application, but
because you artificially choose to follow an obsolete *and incorrect*
standard.

It is *incorrect* because you can change the meaning of text by stripping
accents and deleting characters. Consequences can include murder and suicide:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail

At least tell me that "ASCII only" is merely an *option* for your
application, not the only choice, and that it defaults to UTF-8 which is
the right standard to use for text.



--
Steven

Vlastimil Brom

Sep 14, 2012, 3:38:33 AM
to Python
2012/9/14 Tim Chase <pytho...@tim.thechases.com>:
> On 09/13/12 16:44, Vlastimil Brom wrote:
>> >>> import unicodedata
>> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
>> u'servico movil'
>
> Works well for all the test-cases I threw at it. Thanks!
>
> -tkc
>
>

Hi,
I am glad it works, but I agree with the other comments that it
would be preferable to keep the original accented text throughout the
whole processing, if at all possible.
The above works by decomposing the accented characters into "basic"
characters and the bare accents (combining diacritics) using
normalize(), and then stripping anything outside ASCII in encode("...",
"ignore").
This works for "combinable" accents, and most of the Portuguese
characters outside of ASCII appear to fall into this category, but
there are others as well.
E.g. according to
http://tlt.its.psu.edu/suggestions/international/bylanguage/portuguese.html
there are at least ºª«»€, which would be lost completely in such a conversion.
ª (dec.: 170) (hex.: 0xaa) # FEMININE ORDINAL INDICATOR
º (dec.: 186) (hex.: 0xba) # MASCULINE ORDINAL INDICATOR

You can preprocess such cases as appropriate before doing the
conversion, e.g. just:

>>> u"ºª«»€".replace(u"º", u"o").replace(u"ª", u"a").replace(u"«", u'"').replace(u"»", u'"').replace(u"€", u"EUR")
u'oa""EUR'
>>>
or use a more elegant function with replacement lists (possibly
handling other cases as well).

regards,
vbr
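
One way such a function could look, as a sketch assuming Python 3. The table REPLACEMENTS and the name to_ascii are illustrative only, and the table covers just the five characters listed above:

```python
import unicodedata

# Characters that NFD does not split into an ASCII base letter plus a
# combining mark, mapped by hand to an ASCII stand-in.
REPLACEMENTS = str.maketrans({
    "º": "o",
    "ª": "a",
    "«": '"',
    "»": '"',
    "€": "EUR",
})


def to_ascii(text):
    """Apply the manual replacements, then strip combining accents."""
    text = text.translate(REPLACEMENTS)
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("ºª«»€"))
print(to_ascii("1º serviço móvil"))
```

Doing the table lookup first matters: anything still non-ASCII after normalization is discarded by the "ignore" error handler.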

wxjm...@gmail.com

Sep 14, 2012, 12:15:25 PM
to Python
Interesting case. It's where the encoding of characters
meets character usage, scripts, typography, and linguistic
features.

I can't discuss the Portuguese case, but in French
and in German one way to achieve the task is to
convert the text to uppercase. That preserves a correct
text.

>>> s = 'Lætitia cœur éléphant français LUŸ Stoß Erklärung stören'
>>> libfrancais.SpecMajuscules(s)
'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'

>>> r = 'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'
>>> r.encode('ascii', 'strict').decode('ascii', 'strict') == r
True

PS Avoid Py3.3 :-)

jmf
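
libfrancais.SpecMajuscules is not shown in the post, so the following is only a guess at the idea, sketched with the Python 3 stdlib. The name ascii_upper and the table UPPER_MAP are hypothetical, and the umlaut entries follow the German convention, which would not suit every French word:

```python
import unicodedata

# Ligatures and German umlauts that are conventionally written as two
# letters in uppercase ASCII text.
UPPER_MAP = str.maketrans({
    "Æ": "AE",
    "Œ": "OE",
    "Ä": "AE",
    "Ö": "OE",
    "Ü": "UE",
})


def ascii_upper(text):
    """Uppercase text, expand ligatures/umlauts, then strip other accents."""
    # str.upper() already turns "ß" into "SS".
    text = text.upper().translate(UPPER_MAP)
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


s = "Lætitia cœur éléphant français LUŸ Stoß Erklärung stören"
print(ascii_upper(s))
```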

Terry Reedy

Sep 14, 2012, 4:43:14 PM
to pytho...@python.org
On 9/14/2012 12:15 PM, wxjm...@gmail.com wrote:

> PS Avoid Py3.3 :-)

pps Start using 3.3 as soon as possible. It has Python's first fully
portable non-buggy Unicode implementation. The second release candidate
is already out.

--
Terry Jan Reedy

Terry Reedy

Sep 14, 2012, 4:57:29 PM
to pytho...@python.org
On 9/13/2012 10:09 PM, Mark Tolonen wrote:
> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>> On 09/13/12 18:36, Terry Reedy wrote:

>>> 'keep as much information as possible' would mean an effectively
>>> lossless transliteration, which you could do with a dict.
>>> {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would

>> Vlastimil's solution kept the characters but stripped them of their
>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>> stdlib. Hard to do better than that :-)

You mean, hard to do better than what you think you want, as opposed to
what you said you wanted in both the subject line and the text line I
quoted. What you need depends on why you need ascii only text and what
the recipient will do with the ascii only text. Print it on an
ascii-only printer? Or something similar? If so, a lossy encoding may be
sufficient, but why not let the recipient decide to toss info?

> How about using UTF-7 for transmission and decode on the other end?
> This keeps the transmission all 7-bit, and no loss.
>
> >>> s=u"serviço móvil".encode('utf-7')
> >>> print s
> servi+AOc-o m+APM-vil
> >>> print s.decode('utf-7')
> serviço móvil

Nice. I was barely aware of that option and had forgotten it. This and
similar suggestions to use existing methods are much better than my
hackish approach.

--
Terry Jan Reedy


wxjm...@gmail.com

Sep 15, 2012, 4:58:12 AM
to pytho...@python.org
- I will drop Python.
- No complaints.
- (OT, luckily one of the two Unicode TeX engines is called LuaTeX.)

jmf
