
encoding problems (é and è)


bussiere bussiere

Mar 23, 2006, 6:07:31 AM
to pytho...@python.org
Hi, I'm making a program for formatting strings.
I've added:

#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('Ç', 'C')
str = str.replace('é', 'E')
str = str.replace('É', 'E')
str = str.replace('è', 'E')
str = str.replace('È', 'E')
str = str.replace('ê', 'E')


doesn't work: it gives me " and , instead of replacing é with E.


If someone has an idea, that would be great.

Regards,
Bussiere
PS: I've added the whole script below:


__________________________________________________________________________


#!/usr/bin/python
# -*- coding: utf-8 -*-
import fileinput, glob, string, sys, os, re

fichA = raw_input("Entrez le nom du fichier d'entree : ")    # input file name
print ("\n")
fichC = raw_input("Entrez le nom du fichier de sortie : ")   # output file name
print ("\n")
normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue-> AV) (O/N) ou A pour tout normaliser \n")
normalisation1 = normalisation1.upper()

if normalisation1 != "A":
    print ("\n")
    normalisation2 = raw_input("Normaliser les civilités (ex : Docteur-> DR) (O/N) \n")
    normalisation2 = normalisation2.upper()
    print ("\n")
    normalisation3 = raw_input("Normaliser les Adresses 2 (ex : Place-> PL) (O/N) \n")
    normalisation3 = normalisation3.upper()
    normalisation4 = raw_input("Normaliser les caracteres / et - (ex : / -> ) (O/N) \n")
    normalisation4 = normalisation4.upper()

if normalisation1 == "A":    # "A" means normalise everything
    normalisation1 = "O"
    normalisation2 = "O"
    normalisation3 = "O"
    normalisation4 = "O"

fiA = open(fichA, "r")
fiC = open(fichC, "w")


compteur = 0

while 1:

    ligneA = fiA.readline()

    if ligneA == "":
        break

    if ligneA != "":
        str = ligneA
        str = str.replace('a', 'A')
        str = str.replace('b', 'B')
        str = str.replace('c', 'C')
        str = str.replace('d', 'D')
        str = str.replace('e', 'E')
        str = str.replace('f', 'F')
        str = str.replace('g', 'G')
        str = str.replace('h', 'H')
        str = str.replace('i', 'I')
        str = str.replace('j', 'J')
        str = str.replace('k', 'K')
        str = str.replace('l', 'L')
        str = str.replace('m', 'M')
        str = str.replace('n', 'N')
        str = str.replace('o', 'O')
        str = str.replace('p', 'P')
        str = str.replace('q', 'Q')
        str = str.replace('r', 'R')
        str = str.replace('s', 'S')
        str = str.replace('t', 'T')
        str = str.replace('u', 'U')
        str = str.replace('v', 'V')
        str = str.replace('w', 'W')
        str = str.replace('x', 'X')
        str = str.replace('y', 'Y')
        str = str.replace('z', 'Z')

        str = str.replace('ç', 'C')
        str = str.replace('Ç', 'C')
        str = str.replace('é', 'E')
        str = str.replace('É', 'E')
        str = str.replace('è', 'E')
        str = str.replace('È', 'E')
        str = str.replace('ê', 'E')
        str = str.replace('Ê', 'E')
        str = str.replace('ë', 'E')
        str = str.replace('Ë', 'E')
        str = str.replace('ä', 'A')
        str = str.replace('Ä', 'A')
        str = str.replace('à', 'A')
        str = str.replace('À', 'A')
        str = str.replace('Á', 'A')
        str = str.replace('Â', 'A')
        str = str.replace('Ã', 'A')
        str = str.replace('â', 'A')
        str = str.replace('ï', 'I')
        str = str.replace('Ï', 'I')
        str = str.replace('î', 'I')
        str = str.replace('Î', 'I')
        str = str.replace('ô', 'O')
        str = str.replace('Ô', 'O')
        str = str.replace('ö', 'O')
        str = str.replace('Ö', 'O')
        str = str.replace('Ú', 'U')
        str = str.replace('  ', ' ')
        str = str.replace('  ', ' ')
        str = str.replace('  ', ' ')

        if normalisation1 == "O":
            str = str.replace('AVENUE', 'AV')
            str = str.replace('BOULEVARD', 'BD')
            str = str.replace('FAUBOURG', 'FBG')
            str = str.replace('GENERAL', 'GAL')
            str = str.replace('COMMANDANT', 'CMDT')
            str = str.replace('MARECHAL', 'MAL')
            str = str.replace('PRESIDENT', 'PRDT')
            str = str.replace('SAINT', 'ST')
            str = str.replace('SAINTE', 'STE')
            str = str.replace('LOTISSEMENT', 'LOT')
            str = str.replace('RESIDENCE', 'RES')
            str = str.replace('IMMEUBLE', 'IMM')
            str = str.replace('IMEUBLE', 'IMM')
            str = str.replace('BATIMENT', 'BAT')

        if normalisation2 == "O":
            str = str.replace('MONSIEUR', 'M')
            str = str.replace('MR', 'M')
            str = str.replace('MADAME', 'MME')
            str = str.replace('MADEMOISELLE', 'MLLE')
            str = str.replace('DOCTEUR', 'DR')
            str = str.replace('PROFESSEUR', 'PR')
            str = str.replace('MONSEIGNEUR', 'MGR')
            str = str.replace('M ME', 'MME')

        if normalisation3 == "O":
            str = str.replace('PLACE', 'PL')
            str = str.replace('IMPASSE', 'IMP')
            str = str.replace('ESPLANADE', 'ESP')
            str = str.replace('ROND POINT', 'RPT')
            str = str.replace('ROUTE', 'RTE')
            str = str.replace('PASSAGE', 'PAS')
            str = str.replace('SQUARE', 'SQ')
            str = str.replace('ALLEE', 'ALL')
            str = str.replace('ESCALIER', 'ESC')
            str = str.replace('ETAGE', 'ETG')
            str = str.replace('PORTE', 'PTE')
            str = str.replace('APPARTEMENT', 'APT')
            str = str.replace('APARTEMENT', 'APT')
            str = str.replace('AVENUE', 'AV')
            str = str.replace('BOULEVARD', 'BD')
            str = str.replace('ZONE D ACTIVITE', 'ZA')
            str = str.replace('ZONE D ACTIVITEE', 'ZA')
            str = str.replace('ZONE D AMENAGEMENT CONCERTE', 'ZAC')
            str = str.replace('ZONE D AMENAGEMENT CONCERTEE', 'ZAC')
            str = str.replace('ZONE INDUSTRELLE', 'ZI')
            str = str.replace('CENTRE COMMERCIAL', 'CCAL')
            str = str.replace('CENTRE', 'CTRE')
            str = str.replace('C.CIAL', 'CCAL')
            str = str.replace('CTRE CIAL', 'CCAL')
            str = str.replace('CTRE CCAL', 'CCAL')
            str = str.replace('GALERIE', 'GAL')
            str = str.replace('MARTYR', 'M')
            str = str.replace('ANCIENS', 'AC')
            str = str.replace('ANCIEN', 'AC')
            str = str.replace('REVEREND PERE', 'R P')

        if normalisation4 == "O":
            str = str.replace(';\"', ' ')
            str = str.replace('\"', ' ')
            str = str.replace('\'', ' ')
            str = str.replace('-', ' ')
            str = str.replace(',', ' ')
            str = str.replace('\\', ' ')
            str = str.replace('\/', ' ')
            str = str.replace('&', ' ')
            str = str.replace('%', ' ')
            str = str.replace('*', ' ')
            str = str.replace('  ', ' ')
            str = str.replace('.', ' ')
            str = str.replace('_', ' ')
            str = str.replace('  ', ' ')
            str = str.replace('  ', ' ')
            str = str.replace('?', ' ')
            str = str.replace('%', ' ')
            str = str.replace('|', ' ')

        str = str.replace('  ', ' ')
        str = str.replace('  ', ' ')
        str = str.replace('  ', ' ')
        fiC.write(str)
        compteur += 1
        print compteur, "\n"

print "FINIT"
fiA.close()
fiC.close()

Christoph Zwerschke

Mar 23, 2006, 10:49:56 AM
to
bussiere bussiere wrote:
> Hi, I'm making a program for formatting strings.
> I've added:
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
>
> at the beginning of my script, but
>
> str = str.replace('Ç', 'C')
> ...

> doesn't work: it gives me " and , instead of replacing é with E.

Are you sure your script and your input file actually *are* encoded in
utf-8? If it does not work as expected, they are probably latin-1, just
like your posting. Try changing the coding to latin-1. Does it work now?
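
(Concretely, the declaration at the top of the script would then read

# -*- coding: latin-1 -*-

and the declared coding has to match the encoding the file is actually
saved in.)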

-- Christoph

Larry Bates

Mar 23, 2006, 10:55:55 AM
to
Seems to work fine for me.

>>> x="éÇ"
>>> x=x.replace('é','E')
>>> x
'E\xc7'
>>> x=x.replace('Ç','C')
>>> x
'E\xc7'
>>> x=x.replace('Ç','C')
>>> x
'EC'

You should also be able to use .upper() method to
uppercase everything in the string in a single statement:

tstr=ligneA.upper()

Note: you should never use 'str' as a variable as
it will mask the built-in str function.
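
A quick interactive illustration of the shadowing problem (not from the
original post):

>>> str = "ABC"
>>> str(123)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'str' object is not callable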

-Larry Bates

John Machin

Mar 23, 2006, 4:14:00 PM
to
On 23/03/2006 10:07 PM, bussiere bussiere wrote:
> Hi, I'm making a program for formatting strings.
> I've added:
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
>
> at the beginning of my script, but
>
> str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
> str = str.replace('È', 'E')
> str = str.replace('ê', 'E')
>
>
> doesn't work: it gives me " and , instead of replacing é with E.
>
>
> If someone has an idea, that would be great.

Hi, I've added some comments below ... I hope they help.
Cheers,
John

>
> Regards,
> Bussiere
> PS: I've added the whole script below:
> __________________________________________________________________________

[snip]


>
> if ligneA != "":
>     str = ligneA
>     str = str.replace('a', 'A')

[snip]


>     str = str.replace('z', 'Z')
>
>     str = str.replace('ç', 'C')
>     str = str.replace('Ç', 'C')
>     str = str.replace('é', 'E')
>     str = str.replace('É', 'E')
>     str = str.replace('è', 'E')

[snip]


>     str = str.replace('Ú','U')

You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.
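
For example, something along these lines -- a sketch that assumes the
data is byte strings in a one-byte encoding such as latin-1, with only a
few illustrative accent entries:

import string

table = list(string.maketrans('', ''))   # identity map for all 256 bytes
for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase):
    table[ord(lo)] = up                  # upshift plain ASCII
for accented, plain in [('\xe9', 'E'), ('\xe8', 'E'), ('\xea', 'E'),
                        ('\xe7', 'C'), ('\xc7', 'C'), ('\xe0', 'A')]:
    table[ord(accented)] = plain         # latin-1 accented -> plain letter
table = ''.join(table)

ligneA = ligneA.translate(table)         # one pass instead of ~50 replaces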

>     str = str.replace('  ', ' ')
>     str = str.replace('  ', ' ')
>     str = str.replace('  ', ' ')

The standard Python idiom for normalising whitespace is
strg = ' '.join(strg.split())

>>> strg = ' ALLO BUSSIERE\tCA VA? '
>>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
>>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
>>>

[snip]


> if normalisation2 == "O":
>     str = str.replace('MONSIEUR', 'M')
>     str = str.replace('MR', 'M')

You need to be very careful with this approach. You are changing EVERY
occurrence of "MR" in the string, not just where it is a whole "word"
meaning "Monsieur".
Constructed example of what can go wrong:
>>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
>>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'
>>>

A real, non-constructed history lesson: A certain database indicated
duplicate records by having the annotation "DUP" in the surname field
e.g. "SMITH DUP". Fortunately it was detected in testing that the
so-called clean-up was causing DUPLESSIS to become PLESSIS and DUPRAT to
become RAT!

Two points here: (1) Split up your strings into "words" or "tokens".
Using strg.split() is a start but you may need something more
sophisticated e.g. "-" as an additional token separator. (2) Instead of
writing out all those lines of code, consider putting those
substitutions in a dictionary:

title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    # etc
}
Next level of improvement is to read that stuff from a file.
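
A hypothetical sketch of applying that dictionary token by token:

tokens = strg.split()
strg = ' '.join([title_substitution.get(tok, tok) for tok in tokens])

That way 'MR' is replaced only when it stands alone as a word, and names
like 'IMRE' pass through untouched.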
[snip]


>
> if normalisation4 == "O":
>     str = str.replace(';\"', ' ')
>     str = str.replace('\"', ' ')
>     str = str.replace('\'', ' ')
>     str = str.replace('-', ' ')
>     str = str.replace(',', ' ')
>     str = str.replace('\\', ' ')
>     str = str.replace('\/', ' ')
>     str = str.replace('&', ' ')

[snip]
Again, consider the string translate() method.
Also, consider that some of those characters may have some meaning that
you perhaps shouldn't blow away e.g. compare 'SMITH & WESSON' with
'SMITH ET WESSON' :-)

Peter Otten

Mar 23, 2006, 4:36:02 PM
to
John Machin wrote:

> You can replace ALL of this upshifting and accent removal in one blow by
> using the string translate() method with a suitable table.

Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.

Peter

John Machin

Mar 23, 2006, 5:33:19 PM
to

I'm sorry, I forgot that there were people who are unaware that
variable-length gizmos like UTF-8 and various legacy CJK encodings are
for storage & transmission, and are better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted.

:-)
Unicode? I'm just a benighted Anglo from the a**-end of the globe; who
am I to be preaching Unicode to a European?
(-:

Jean-Paul Calderone

Mar 23, 2006, 10:19:15 PM
to pytho...@python.org
On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjma...@lexicon.net> wrote:
>On 24/03/2006 8:36 AM, Peter Otten wrote:
>> John Machin wrote:
>>
>>>You can replace ALL of this upshifting and accent removal in one blow by
>>>using the string translate() method with a suitable table.
>>
>> Only if you convert to unicode first or if your data maintains 1 byte == 1
>> character, in particular it is not UTF-8.
>>
>
>I'm sorry, I forgot that there were people who are unaware that
>variable-length gizmos like UTF-8 and various legacy CJK encodings are
>for storage & transmission, and are better changed to a
>one-character-per-storage-unit representation before *ANY* data
>processing is attempted.

Unfortunately, unicode only appears to solve this problem in a sane manner. Most people conveniently forget (or never learn in the first place) about combining sequences and denormalized forms. Consider u'e\u0301', u'U\u0301', or u'C\u0327'. These difficulties can be mitigated to some degree via normalization (see unicodedata.normalize), but this step is often forgotten and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work.
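
For example (a quick interactive check):

>>> import unicodedata
>>> u'e\u0301' == u'\xe9'    # same text, different code points
False
>>> unicodedata.normalize('NFC', u'e\u0301') == u'\xe9'
True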

>
>:-)
>Unicode? I'm just a benighted Anglo from the a**-end of the globe; who
>am I to be preaching Unicode to a European?
>(-:

Heh ;P Same here. And I don't really claim to understand all this stuff, I just know enough to know it's really hard to do anything correctly. ;)

Jean-Paul

John Machin

Mar 23, 2006, 11:43:23 PM
to
On 24/03/2006 2:19 PM, Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjma...@lexicon.net>
> wrote:
>
>> On 24/03/2006 8:36 AM, Peter Otten wrote:
>>
>>> John Machin wrote:
>>>
>>>> You can replace ALL of this upshifting and accent removal in one
>>>> blow by using the string translate() method with a suitable table.
>>>
>>>
>>> Only if you convert to unicode first or if your data maintains
>>> 1 byte == 1 character, in particular it is not UTF-8.
>>>
>>
>> I'm sorry, I forgot that there were people who are unaware that
>> variable-length gizmos like UTF-8 and various legacy CJK encodings are
>> for storage & transmission, and are better changed to a
>> one-character-per-storage-unit representation before *ANY* data
>> processing is attempted.
>
>
> Unfortunately, unicode only appears to solve this problem in a sane
> manner. Most people conveniently forget (or never learn in the first
> place) about combining sequences and denormalized forms. Consider
> u'e\u0301', u'U\u0301', or u'C\u0327'.

Yes, and many people don't even bother to look at their data. If they
did, and found combining forms, then they would treat them as I said as
"variable-length gizmos" which are "better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted."

In any case, as the OP is upshifting and stripping accents [presumably
as elementary preparation for some sort of fuzzy matching], all that is
needed is to throw away the combining accents (0301, 0327, etc).
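
A sketch of that idea, assuming the data has already been decoded to
unicode:

import unicodedata

def strip_accents(u):
    # decompose, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', u)
    return u''.join([c for c in decomposed if not unicodedata.combining(c)])

print strip_accents(u'\xe9l\xe8ve').upper()   # -> ELEVE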

> These difficulties can be
> mitigated to some degree via normalization (see unicodedata.normalize),
> but this step is often forgotten

It's not a matter of forget or not. People should bother to examine
their data and see what characters are in use; then they would know
whether they had a problem or not.

> and, for things like u'\u0565\u0582'
> (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work.

Sorry, I don't understand.
0565 is stand-alone ECH
0582 is stand-alone YIWN
0587 is the ligature.
What doesn't work? At first guess, in the absence of an Armenian
informant, for pre-matching normalisation, I'd replace 0587 by the two
constituents -- just like 00DF would be expanded to "ss" (before
upshifting and before not caring too much about differences caused by
doubled letters).

Duncan Booth

Mar 24, 2006, 4:11:35 AM
to
Peter Otten wrote:

There's a nice little codec from Skip Montanaro for removing accents from
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E
>>>

So Bussiere could replace a large chunk of his code with:

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(

Peter Otten

Mar 24, 2006, 6:16:51 AM
to
Duncan Booth wrote:

> There's a nice little codec from Skip Montaro for removing accents from
> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
>>>> import latscii
>>>> import htmlentitydefs
>>>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
>>>>
>
> So Bussiere could replace a large chunk of his code with:
>
> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on consecutive
> accented characters.
>
> :(

You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write

>>> u"Élève ééé".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the handler,
too:

>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.

Peter

--- latscii_old.py      2006-03-24 11:45:22.580588520 +0100
+++ latscii.py  2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@
 
 ### Encoding Map
 
-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map
 
 
 ### From Martin Blais
@@ -166,9 +166,9 @@
 ##     ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)


John Machin

Mar 24, 2006, 6:40:44 AM
to
On 24/03/2006 8:11 PM, Duncan Booth wrote:
> Peter Otten wrote:
>
>
>>>You can replace ALL of this upshifting and accent removal in one blow
>>>by using the string translate() method with a suitable table.
>>
>>Only if you convert to unicode first or if your data maintains
>>1 byte == 1 character, in particular it is not UTF-8.
>>
>
>
> There's a nice little codec from Skip Montaro for removing accents from

For the benefit of those who may read only this far, it is NOT nice.

> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
>
>>>>import latscii
>>>>import htmlentitydefs
>>>>print u'\u00c9'.encode('ascii','replacelatscii')
>
> E
>
>
> So Bussiere could replace a large chunk of his code with:

Could, but definitely shouldn't.

>
> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on consecutive
> accented characters.
>
> :(
>

Some of the transformations are a little unfortunate :-(
0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
The Icelandic thorn letters become P & p (based on physical appearance),
when they should become Th and th.
The German letter Eszett (00DF) becomes B (appearance) when it should be ss.
Creating alphabetics out of punctuation is scarcely something that
bussiere should be interested in:
0x00a2: ord('c'), # ¢
0x00a4: ord('o'), # ¤
0x00a5: ord('Y'), # ¥
0x00a7: ord('S'), # §
0x00a9: ord('c'), # ©
0x00ae: ord('R'), # ®
0x00b6: ord('P'), # ¶

Peter Otten

Mar 24, 2006, 7:44:42 AM
to
John Machin wrote:

> 0x00d0: ord('D'), # Ð
> 0x00f0: ord('o'), # ð


> Icelandic capital eth becomes D, OK; but the small letter becomes o!!!

I see information flow from Iceland is a bit better than from Armenia :-)

> Some of the transformations are a little unfortunate :-(

The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach.

Lastly, even if all went well, turning a list of French addresses into an
ascii-uppercase graveyard would be a sad thing to do...

Peter

Walter Dörwald

Mar 24, 2006, 10:50:18 AM
to duncan...@suttoncourtenay.org.uk, pytho...@python.org
Duncan Booth wrote:

> [...]


> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on consecutive
> accented characters.
>
> :(

Replace the error handler with this (untested) and it should work with
consecutive accented characters:

def latscii_error(uerr):
    v = []
    for c in uerr.object[uerr.start:uerr.end]:
        key = ord(c)
        try:
            v.append(unichr(decoding_map[key]))
        except KeyError:
            v.append(u"?")
    return (u"".join(v), uerr.end)

codecs.register_error('replacelatscii', latscii_error)

Bye,
Walter Dörwald

John Machin

Mar 24, 2006, 1:49:19 PM
to
On 24/03/2006 11:44 PM, Peter Otten wrote:
> John Machin wrote:
>
>
>>0x00d0: ord('D'), # Ð
>>0x00f0: ord('o'), # ð
>>Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
>
>
> I see information flow from Iceland is a bit better than from Armenia :-)

No information flow needed. Capital letter BLAH -> D and small letter
BLAH -> o should trigger one's palpable nonsense detector for *any* BLAH.

>
>
>>Some of the transformations are a little unfortunate :-(
>
>
> The OP, as you pointed out in your first post in this thread, has more
> pressing problems with his normalization approach.
>
> Lastly, even if all went well, turning a list of French addresses into an
> ascii-uppercase graveyard would be a sad thing to do...

Oh indeed. Not only sad, but incredibly stupid. I fervently hope and
trust that such a normalisation is intended only for fuzzy matching
purposes. I can't imagine that anyone would contemplate writing the
output to storage for any reason other than logging or for regression
testing. Update it back to the database? Do you know anyone who would do
that??

Fredrik Lundh

Mar 24, 2006, 2:37:03 PM
to pytho...@python.org
John Machin wrote:

> Some of the transformations are a little unfortunate :-(

here's a slightly silly way to map a unicode string to its "unaccented"
version:

###

import unicodedata, sys

CHAR_REPLACEMENT = {
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
}

class unaccented_map(dict):

    def mapchar(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        ch = unichr(key)
        try:
            ch = unichr(int(unicodedata.decomposition(ch).split()[0], 16))
        except (IndexError, ValueError):
            ch = CHAR_REPLACEMENT.get(key, ch)
        # uncomment the following line if you want to remove remaining
        # non-ascii characters
        # if ch >= u"\x80": return None
        self[key] = ch
        return ch

    if sys.version >= "2.5":
        __missing__ = mapchar
    else:
        __getitem__ = mapchar

assert isinstance(mystring, unicode)

print mystring.translate(unaccented_map())

###

if the source string is not unicode, you can use something like

s = mystring.decode("iso-8859-1")
s = s.translate(unaccented_map())
s = s.encode("ascii", "ignore")

(this works well for characters in the latin-1 range, at least. no
guarantees for other character ranges)

</F>

"Martin v. Löwis"

Mar 24, 2006, 5:52:39 PM
to John Machin
John Machin wrote:
>> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
>> YIWN), it does not even work.
>
> Sorry, I don't understand.
> 0565 is stand-alone ECH
> 0582 is stand-alone YIWN
> 0587 is the ligature.
> What doesn't work? At first guess, in the absence of an Armenian
> informant, for pre-matching normalisation, I'd replace 0587 by the two
> constituents -- just like 00DF would be expanded to "ss" (before
> upshifting and before not caring too much about differences caused by
> doubled letters).

Looking at the UnicodeData helps here:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

So U+0587 is a compatibility character for U+0565,U+0582. Not sure
what the rationale for *this* compatibility character is, but in many
cases, they are in Unicode only for compatibility with some existing
encoding - if they had gone through the proper Unification, they should
not have been introduced as separate characters.

In many cases, ligature characters exist for typographical reasons;
other examples are

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;

In these cases, it is the font designers which want to have code points
for these characters: the glyphs of the ligature cannot be automatically
derived from the glyphs of the individual characters. I can only guess
that the issue with that Armenian ligature is similar.

Notice that the issue of U+00DF is entirely different: it is a character
on its own, not a ligature. That a common transliteration for this
character exists is again a different story.

Now, as to what might not work: While compatibility decomposition
(NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
supported. This is intentional, of course: there is no "canonical"
compatibility character for every decomposed code point.
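
This is easy to verify with the unicodedata module:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0587')
u'\u0565\u0582'
>>> unicodedata.normalize('NFC', u'\u0565\u0582')   # no recomposition
u'\u0565\u0582'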

Regards,
Martin

Serge Orlov

Mar 24, 2006, 11:24:20 PM
to
Martin v. Löwis wrote:
> John Machin wrote:
> >> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
> >> YIWN), it does not even work.
> >
> > Sorry, I don't understand.
> > 0565 is stand-alone ECH
> > 0582 is stand-alone YIWN
> > 0587 is the ligature.
> > What doesn't work? At first guess, in the absence of an Armenian
> > informant, for pre-matching normalisation, I'd replace 0587 by the two
> > constituents -- just like 00DF would be expanded to "ss" (before
> > upshifting and before not caring too much about differences caused by
> > doubled letters).
>
> Looking at the UnicodeData helps here:
>
> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
> 0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;
>
> So U+0587 is a compatibility character for U+0565,U+0582. Not sure
> what the rationale for *this* compatibility character is, but in many
> cases, they are in Unicode only for compatibility with some existing
> encoding - if they had gone through the proper Unification, they should
> not have been introduced as separate characters.

The problem is that U+0587 is a ligature in Western Armenian dialect
(hy locale) and a character in Eastern Armenian dialect (hy_AM locale).
It is strange that the code point is marked as a compatibility char; it
is either a mistake or a political decision. It used to be a ligature
before the orthographic reform of the 1930s by the communist government
in Armenia; then it became a character, but after the end of the Soviet
Union (1991) they started to think about going back to the old
orthography. Though that hasn't happened, and it's not clear if it ever
will. So U+0587 is a character. By the way, this char/ligature is
present on both Western and Eastern Armenian keyboard layouts:
http://www.datacal.com/products/armenian-western-layout.htm
It is between the 9 and ( keys. In Eastern Armenian this character is
used in the words և ("and" in English), արև ("sun" in English) and
hundreds of others. Needless to say how many documents exist with this
character.

>
> In many cases, ligature characters exist for typographical reasons;
> other examples are
>
> FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
> FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
> FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
> FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
> FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;
>
> In these cases, it is the font designers which want to have code points
> for these characters: the glyphs of the ligature cannot be automatically
> derived from the glyphs of the individual characters. I can only guess
> that the issue with that Armenian ligature is similar.
>
> Notice that the issue of U+00DF is entirely different: it is a character
> on its own, not a ligature. That a common transliteration for this
> character exists is again a different story.
>
> Now, as to what might not work: While compatibility decomposition
> (NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
> supported. This is intentional, of course: there is no "canonical"
> compatibility character for every decomposed code point.

Seems like NFKD will damage Eastern Armenian text (there are millions
of such documents). The result will be readable but the text will look
strange to the person who wrote the text.

Serge.

Serge Orlov

Mar 25, 2006, 12:24:05 AM
to
Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjma...@lexicon.net> wrote:
> >On 24/03/2006 8:36 AM, Peter Otten wrote:
> >> John Machin wrote:
> >>
> >>>You can replace ALL of this upshifting and accent removal in one blow by
> >>>using the string translate() method with a suitable table.
> >>
> >> Only if you convert to unicode first or if your data maintains 1 byte == 1
> >> character, in particular it is not UTF-8.
> >>
> >
> >I'm sorry, I forgot that there were people who are unaware that
> >variable-length gizmos like UTF-8 and various legacy CJK encodings are
> >for storage & transmission, and are better changed to a
> >one-character-per-storage-unit representation before *ANY* data
> >processing is attempted.
>
> Unfortunately, unicode only appears to solve this problem in a sane manner.

What problem do you mean? Loose matching is solved by unicode in a sane
manner; it is described in the Unicode collation algorithm.

Serge.

"Martin v. Löwis"

Mar 25, 2006, 8:31:04 AM
to Serge Orlov
Serge Orlov wrote:
> The problem is that U+0587 is a ligature in Western Armenian dialect
> (hy locale) and a character in Eastern Armenian dialect (hy_AM locale).
> It is strange that the code point is marked as a compatibility char;
> it is either a mistake or a political decision. It used to be a
> ligature before the orthographic reform of the 1930s by the communist
> government in Armenia; then it became a character, but after the end
> of the Soviet Union (1991) they started to think about going back to
> the old orthography. Though that hasn't happened, and it's not clear
> if it ever will. So U+0587 is a character.

Thanks for the explanation. Without any knowledge, I would suspect
a combination of mistake and political decision. The Unicode consortium
(and ISO) always uses native language experts to come up with character
definitions, although the process is today likely more elaborate and
precise than in the early days. Likely, the Unicode consortium found
somebody speaking the Western Armenian dialect (given that many of these
speakers live in North America today); the decision might have been
a mixture of lack of knowledge, ignorance, and perhaps even political
bias.

Regards,
Martin
