
How to convert 'ö' to 'oe' or 'o' (or other similar things) in a string?


Peng Yu

Sep 17, 2016, 12:12:45 PM
Hi, I want to convert strings in which the characters with accents
should be converted to the ones without accents. Here is my current
code.

========================

$ cat main.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

set -v
./main.py Förstemann
./main.py Frédér8ic@

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
import unicodedata
print unicodedata.normalize('NFKD',
    sys.argv[1].decode('utf-8')).encode('ascii', 'ignore')

========================
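
For readers on Python 3, the same NFKD approach might look like the sketch below. The function name strip_marks is made up for illustration. Note that this only removes marks that Unicode can decompose; letters without a decomposition (ø or ß, for instance) are silently dropped by the 'ignore' error handler.

```python
import sys
import unicodedata


def strip_marks(text):
    """Decompose characters, then drop everything that is not ASCII."""
    # NFKD splits 'ö' into 'o' followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors='ignore' discards the combining marks
    # (and any other non-ASCII character that has no ASCII base).
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    # Python 3 already hands us str objects, so no .decode('utf-8') is needed.
    for arg in sys.argv[1:]:
        print(strip_marks(arg))
```

Called with Förstemann and Frédér8ic@ this prints Forstemann and Freder8ic@.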

The complication is that some characters have more than one way of
conversions. E.g., 'ö' can be converted to either 'oe' or 'o'. I want
to get all the possible conversions of a string. Does anybody know a
good way to do so? Thanks.
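
One possible way to enumerate every candidate spelling, sketched in Python 3: give each character a tuple of alternatives and take the Cartesian product. The ALTERNATIVES table and the name ascii_variants are assumptions made up for this illustration, not part of any library, and the table is deliberately tiny.

```python
import unicodedata
from itertools import product

# Hypothetical table: each character maps to its acceptable ASCII spellings.
ALTERNATIVES = {
    "ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue"),
    "Ä": ("A", "Ae"), "Ö": ("O", "Oe"), "Ü": ("U", "Ue"),
    "é": ("e",), "ß": ("ss",),
}


def ascii_variants(text):
    """Return every spelling obtained by choosing one alternative per character."""
    # Normalise to NFC so that 'o' + combining diaeresis matches the 'ö' key.
    text = unicodedata.normalize("NFC", text)
    # Characters missing from the table are kept unchanged.
    choices = [ALTERNATIVES.get(ch, (ch,)) for ch in text]
    return ["".join(combo) for combo in product(*choices)]


print(ascii_variants("Förstemann"))  # ['Forstemann', 'Foerstemann']
print(ascii_variants("Schöön"))      # ['Schoon', 'Schooen', 'Schoeon', 'Schoeoen']
```

The number of results doubles with every character that has two alternatives, so long strings with many such characters get expensive quickly.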

--
Regards,
Peng

Kouli

Sep 17, 2016, 1:06:55 PM

Martin Schöön

Sep 17, 2016, 4:20:25 PM
On 2016-09-17, Kouli <d...@kou.li> wrote:
> Hello, try the Unidecode module - https://pypi.python.org/pypi/Unidecode.
>
> Kouli
>
> On Sat, Sep 17, 2016 at 6:12 PM, Peng Yu <peng...@gmail.com> wrote:
>> Hi, I want to convert strings in which the characters with accents
>> should be converted to the ones without accents. Here is my current
>> code.

Side note from Sweden. Å, ä and ö are not accented characters in our
language. They are characters of their own. If you want to look up
someone called Öhman in the phone directory you go to the Ö section
not the O section.

Related anecdote from Phoenix AZ. By now you have noticed my family
name: Schöön. On airline tickets and boarding passes in the U.S. it
gets spelled Schoeoen. This landed me in a special security check
once. After a while the youngish lady performing the search started
to look like 1+1 did not sum up to 2. She looked at my passport
and boarding pass again and asked why I was there with her. I pointed
at another young lady, the one that 'scrutinized' passports and
boarding passes while chatting with a friend and said "She told me
to go over here." "Wait here, I have to talk to my supervisor."

A few minutes passed (I was alone in the enhanced security check
throughout...) and then "I am sorry but you should not have had
to go through here. There was a mistake."

So there you are: If you are a middle aged, caucasian guy with
a passport from northern Europe and no 'funny' letters in your name
you are not a threat in the eyes of TSA in Arizona.

Sorry for wasting the bandwidth.

/Martin

Thomas 'PointedEars' Lahn

Sep 17, 2016, 5:19:30 PM
Peng Yu wrote:

> Hi, I want to convert strings in which the characters with accents
> should be converted to the ones without accents.

Why?

> […]
> ./main.py Förstemann

AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
language, but characters of their own (umlauts).

In particular, I know for certain that they are not accented in Germanic
languages. Swedish has been mentioned; I can add my native language,
German, to that list.

> ========================
>
> The complication is that some characters have more than one way of
> conversions. E.g., 'ö' can be converted to either 'oe' or 'o'. I want
> to get all the possible conversions of a string.

Don’t.

> Does anybody know a good way to do so?

No. Because it is a bad idea.

> Thanks.

You’re welcome.

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

Marko Rauhamaa

Sep 17, 2016, 6:10:45 PM
Martin Schöön <martin...@gmail.com>:
> Related anecdote from Phoenix AZ. By now you have noticed my family
> name: Schöön. On airline tickets and boarding passes in the U.S. it
> gets spelled Schoeoen.

Do Swedes do that German thing, too? If you have to write Finnish
without ä and ö, you simply leave out the dots. (On the other hand, if
you need š or ž and don't have them, you replace them with sh and zh.)


Marko

Steve D'Aprano

Sep 17, 2016, 11:28:53 PM
On Sun, 18 Sep 2016 07:19 am, Thomas 'PointedEars' Lahn wrote:

> AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
> language, but characters of their own (umlauts).


Are you saying that English is not a natural language?





--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Peng Yu

Sep 17, 2016, 11:31:09 PM
On Sat, Sep 17, 2016 at 3:20 PM, Martin Schöön <martin...@gmail.com> wrote:
> Den 2016-09-17 skrev Kouli <d...@kou.li>:
>> Hello, try the Unidecode module - https://pypi.python.org/pypi/Unidecode.

I don't find a way to make it print oe for ö. Could anybody please
advise what is the correct way to do it?

==> main.py <==
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from unidecode import unidecode
print unidecode(sys.argv[1].decode('utf-8'))


==> main.sh <==
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

./main.py Schöön

$ ./main.sh
Schoon



--
Regards,
Peng

Thorsten Kampe

Sep 18, 2016, 2:00:13 AM
* Martin Schöön (17 Sep 2016 20:20:12 GMT)
>
> Den 2016-09-17 skrev Kouli <d...@kou.li>:
> > Hello, try the Unidecode module - https://pypi.python.org/pypi/Unidecode.
> >
> > Kouli
> >
> > On Sat, Sep 17, 2016 at 6:12 PM, Peng Yu <peng...@gmail.com> wrote:
> >> Hi, I want to convert strings in which the characters with accents
> >> should be converted to the ones without accents. Here is my current
> >> code.
>
> Side note from Sweden. Å, ä and ö are not accented characters in our
> language. They are characters of their own.

I think he meant diacritics.

Thorsten

Steven D'Aprano

Sep 18, 2016, 2:31:17 AM
On Sunday 18 September 2016 13:30, Peng Yu wrote:

> On Sat, Sep 17, 2016 at 3:20 PM, Martin Schöön <martin...@gmail.com>
> wrote:
>> Den 2016-09-17 skrev Kouli <d...@kou.li>:
>>> Hello, try the Unidecode module - https://pypi.python.org/pypi/Unidecode.
>
> I don't find a way to make it print oe for ö. Could anybody please
> advise what is the correct way to do it?

In general, there is no One Correct Way to translate accented characters into
ASCII. It depends on the language, and the word.

For instance, in English ö will usually be translated into just o with no
accent. We usually write coöperate and zoölogy as cooperate and zoology, or
sometimes with a hyphen co-operate, but never cooeperate or zooelogy. But if
the word is derived from German, or words that *look* like they might be
German, we do sometimes use oe: Roentgen rays (an old term for x-rays) after
Wilhelm Röntgen, for instance.

But in other languages the rules will be different. How, for example, should
one translate an Estonian word containing ö into Turkish, but using ASCII
letters only? I have no idea. But in both languages, and unlike German, ö is
*not* considered an o-with-an-accent, but a distinct letter of the alphabet.

https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29


As far as Python goes, if all you want to do is replace ö with oe and Ö with OE
(or perhaps you should use Œ and œ?) then you can use str.replace or
string.translate:

mystring.replace("Ö", "OE").replace("ö", "oe")
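
A translation table avoids chaining one replace call per character. The following is a minimal Python 3 sketch; the table contents and the name transliterate_german are assumptions chosen for illustration (German-style digraphs, with "Oe" rather than "OE" for capitals), not a complete or authoritative mapping.

```python
# Hypothetical German-style table; str.maketrans accepts a dict whose keys
# are single characters and whose values are replacement strings.
GERMAN_STYLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text):
    """Replace umlauts and ß with their conventional ASCII digraphs."""
    return text.translate(GERMAN_STYLE)


print(transliterate_german("Schöön"))   # Schoeoen
print(transliterate_german("Röntgen"))  # Roentgen
```

Characters that are not in the table pass through untouched, so a second step is still needed if the result must be pure ASCII.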




--
Steven
git gets easier once you get the basic idea that branches are homeomorphic
endofunctors mapping submanifolds of a Hilbert space.

Steven D'Aprano

Sep 18, 2016, 2:45:48 AM
It doesn't matter whether you call them "accent" like most people do, or
"diacritics" as linguists do. Either way, in some languages they are an
integral part of the letter, like the horizontal stroke in English t or the
vertical bar in English p and b, and in some languages they are modifiers,
where there are rules that tell you how to write them without the modifier.

In English, i is a letter with a dot diacritic, sometimes called the "tittle".
But unlike Turkish, we don't have a dotless i, ı, and to add to the confusion
when we capitalise i we get I with no dot instead of İ. Dropping the dot, or
adding one when you shouldn't, can *literally* get you killed:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail

http://www.theinquirer.net/inquirer/news/1017243/cellphone-localisation-glitch


As far as I know, no natural language has a dotted capital J, but there is a
dotless ȷ although I'm not sure what language it is from. (Possibly just used
in mathematics?)

Terry Reedy

Sep 18, 2016, 3:52:28 AM
On 9/18/2016 2:45 AM, Steven D'Aprano wrote:

> It doesn't matter whether you call them "accent" like most people do, or
> "diacritics" as linguists do.

I am a native born American and I have never before heard or seen
non-accent diacritic marks called 'accents'. Accents indicate stress.
Other diacritics indicate other pronunciation changes. It is
counterproductive to confuse the two groups. Spanish, for instance, has
vowel accents that change which syllable gets stressed. A tilda is not
an accent; rather, it softens the pronunciation of 'n' to 'ny', as in
'canyon'.

Terry Jan Reedy

Christian Gollwitzer

Sep 18, 2016, 5:55:15 AM
On 17.09.16 at 23:19, Thomas 'PointedEars' Lahn wrote:
> Peng Yu wrote:
>
>> Hi, I want to convert strings in which the characters with accents
>> should be converted to the ones without accents.
>
> Why?
>
>> […]
>> ./main.py Förstemann
>
> AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
> language, but characters of their own (umlauts).
>
> In particular, I know for certain that they are not accented in Germanic
> languages. Swedish has been mentioned; I can add my native language,
> German, to that list.

In German, they are letters, but they collate as either ae, oe, ue
(rarely) or a, o, u (modern style). A dictionary or phone book does not
have an "ö" section in Germany. Example from a German-Latin dictionary,
printed in 1958

Laster -> vitium
Lästerer -> homo maledicus
lasterhaft -> vitiosus

If "ä" would sort as a single letter, "Lästerer" would be the last
entry. If it would sort as "ae", it would be the first entry. Therefore,
in this example it sorts as "a", with "ä" > "a" to resolve a tie only.
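
The three orderings discussed above can be reproduced with three different sort keys. This is only a sketch in Python 3: the key function names are invented for illustration, and mapping ä to "{" is a crude trick (that character simply has a higher code point than "z"), not a real collation algorithm.

```python
import unicodedata

WORDS = ["lasterhaft", "Lästerer", "Laster"]


def key_base_letter(word):
    """Sort 'ä' like 'a' by removing combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_ae_expansion(word):
    """Sort 'ä' like 'ae', 'ö' like 'oe', 'ü' like 'ue'."""
    folded = unicodedata.normalize("NFC", word).casefold()
    return folded.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")


# '{', '|' and '}' come directly after 'z' in code point order.
AFTER_Z = str.maketrans({"ä": "{", "ö": "|", "ü": "}"})


def key_separate_letter(word):
    """Sort umlauts as letters of their own, placed after 'z'."""
    return unicodedata.normalize("NFC", word).casefold().translate(AFTER_Z)


print(sorted(WORDS, key=key_base_letter))      # ['Laster', 'Lästerer', 'lasterhaft']
print(sorted(WORDS, key=key_ae_expansion))     # ['Lästerer', 'Laster', 'lasterhaft']
print(sorted(WORDS, key=key_separate_letter))  # ['Laster', 'lasterhaft', 'Lästerer']
```

Only the first key reproduces the order printed in the 1958 dictionary.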


Christian

Martin Schöön

Sep 18, 2016, 6:51:22 AM
This is a problem of the past I think. When the problem arises -- well
I am not aware of a standard Swedish way.

American typographer Robert Bringhurst is not happy with the computer
guys who thought half as many glyphs as Gutenberg worked with was
enough back in the days when 7-bit ASCII was conceived. See "The
Elements of Typographic Style".

/Martin

Thorsten Kampe

Sep 18, 2016, 3:35:44 PM
* Terry Reedy (Sun, 18 Sep 2016 03:51:40 -0400)
>
> On 9/18/2016 2:45 AM, Steven D'Aprano wrote:
>
> > It doesn't matter whether you call them "accent" like most people do, or
> > "diacritics" as linguists do.
>
> I am a native born American and I have never before heard or seen
> non-accent diacritic marks called 'accents'. Accents indicate stress.
> Other diacritics indicate other pronunciation changes. It is
> counterproductive to confuse the two groups. Spanish, for instance, has
> vowel accents that change which syllable gets stressed. A tilda is not
> an accent; rather, it softens the pronunciation of 'n' to 'ny', as in
> 'canyon'.

Had to be said. Nothing to add.

Thorsten

Marko Rauhamaa

Sep 18, 2016, 3:58:16 PM
Thorsten Kampe <thor...@thorstenkampe.de>:
<URL: http://www.merriam-webster.com/dictionary/accent>

5 a : a mark (as ´, `, ˆ) used in writing or printing to indicate a
specific sound value, stress, or pitch, to distinguish words
otherwise identically spelled, or to indicate that an ordinarily
mute vowel should be pronounced
b : an accented letter


Marko

not1xor1

Sep 19, 2016, 12:12:53 AM
On 18/09/2016 08:45, Steven D'Aprano wrote:
> integral part of the letter, like the horizontal stroke in English t or the
> vertical bar in English p and b, and in some languages they are modifiers,

well... that is the Latin alphabet

English has no T, P or B (or any other character) but is just
(mis)using the Latin alphabet (which is just a few centuries older
than the English language itself) :-)

--
bye
!(!1|1)

Steven D'Aprano

Sep 19, 2016, 1:08:53 AM
On Sunday 18 September 2016 17:51, Terry Reedy wrote:

> On 9/18/2016 2:45 AM, Steven D'Aprano wrote:
>
>> It doesn't matter whether you call them "accent" like most people do, or
>> "diacritics" as linguists do.
>
> I am a native born American and I have never before heard or seen
> non-accent diacritic marks called 'accents'. Accents indicate stress.
> Other diacritics indicate other pronunciation changes. It is
> counterproductive to confuse the two groups. Spanish, for instance, has
> vowel accents that change which syllable gets stressed.

Then you're better educated than most people I've met. Most folks I know call
any of those "funny dots and squiggles" on letters "accents".


> A tilda is not
> an accent; rather, it softens the pronunciation of 'n' to 'ny', as in
> 'canyon'.

Hmmm. I'm not a Spanish speaker, but to me, 'canyon' is pronounced can-yen and
the n is pronounced no differently from the n in 'can', 'man', 'men', 'pan',
'panel', 'moon', 'nut', etc.

(P.S. it's tilde. Tilda is short for Matilda, as in Tilda Swinton the actor.)

But what do I know? My missus says I have a tin-ear, and I'm no linguist. But I
can read Wikipedia:

https://en.wikipedia.org/wiki/Diacritic

and it makes it clear that diacritics including accents can have many different
effects on pronunciation, including none at all.

E.g. French là ("there") versus la ("the") are both pronounced /la/. In English
the diaeresis found in naïve, Noël, Zoë, coöperate etc. is used to show that
the marked vowel is pronounced separately from the preceding vowel (e.g.
co-operate rather than coop-erate), and accents are used to indicate that a
vowel which normally isn't pronounced at all should be, as in saké or Moist von
Lipwig's wingèd hat[1].


Make of that what you will.




[1] A running gag from "Going Postal", one of the Discworld series.

Thomas 'PointedEars' Lahn

Sep 24, 2016, 7:08:37 PM
Christian Gollwitzer wrote:

> Am 17.09.16 um 23:19 schrieb Thomas 'PointedEars' Lahn:
>> Peng Yu wrote:
>>> Hi, I want to convert strings in which the characters with accents
>>> should be converted to the ones without accents.
>> […]
>>> […]
>>> ./main.py Förstemann
>>
>> AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
>> language, but characters of their own (umlauts).
>>
>> In particular, I know for certain that they are not accented in Germanic
>> languages. Swedish has been mentioned; I can add my native language,
>> German, to that list.
>
> In German, they are letters,

If you read more carefully, my point was: In German, umlauts are not
"accented characters".

> but they collate as either ae, oe, ue
> (rarely) or a, o, u (modern style).

Correct, but irrelevant. The OP did not say anything about sorting.

Fallacy: Red herring.

> A dictionary or phone book does not have an "ö" section in Germany.

I would not be so sure about that. Besides, German is spoken/written in
more countries than just Germany.

> Example from a German-Latin dictionary,
> printed in 1958
>
> Laster -> vitium
> Lästerer -> homo maledicus
> lasterhaft -> vitiosus
>
> If "ä" would sort as a single letter, "Lästerer" would be the last
> entry.

Or the first one.

Fallacy: Leaping to a conclusion (hasty generalization).

> If it would sort as "ae", it would be the first entry. Therefore,
> in this example it sorts as "a", with "ä" > "a" to resolve a tie only.

Fallacy: Proof by example (inappropriate generalization).

There are several ways in which umlauts can be sorted. The one used by your
dictionary is only one of them. For example, „Lästerer“ would come last if
the strings were sorted ascending, ignoring case, according to the Unicode
code points of the characters in NFC, as “a” is U+0061 LATIN SMALL LETTER A
but “ä” is U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
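
A quick way to check the code point ordering is the Python 3 sketch below. Two caveats are baked in as assumptions: the words are case-folded first (otherwise capital "L" sorts before lowercase "l"), and the helper name code_point_key is made up. Note that the single code point U+00E4 is the precomposed (NFC) form; NFD represents the same letter as U+0061 followed by U+0308, yet the word still ends up last.

```python
import unicodedata

# \u00e4 is the precomposed 'ä'; spelled as an escape to make the form explicit.
WORDS = ["L\u00e4sterer", "lasterhaft", "Laster"]


def code_point_key(word, form):
    """Key for a plain code point comparison in the given normalization form."""
    return unicodedata.normalize(form, word).casefold()


# How 'ä' is represented in each form.
print([hex(ord(ch)) for ch in unicodedata.normalize("NFC", "\u00e4")])  # ['0xe4']
print([hex(ord(ch)) for ch in unicodedata.normalize("NFD", "\u00e4")])  # ['0x61', '0x308']

nfc_order = sorted(WORDS, key=lambda w: code_point_key(w, "NFC"))
nfd_order = sorted(WORDS, key=lambda w: code_point_key(w, "NFD"))

# In both forms the word containing the umlaut sorts last.
print(nfc_order)
print(nfd_order)
```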

And as you have mentioned phone books, in all German-speaking phone books
I have come across so far, “ä” does sort like “ae”, “ö” like “oe”, and “ü”
like “ue” (this is specified in DIN 5007 as “variant 1”).

(That does not mean, however, that it is a good idea to *convert* those
letters this way. And there is no good reason to; all modern operating
systems, filesystems and name schemes support Unicode.)

But Wikipedia tells me that the “phone-book sort order” [sic] is
intentionally different from those of dictionaries, and differs between
German-speaking countries. For example, it says that in Austrian phone
books, „~ä…“ follows „~az…“, not „~ad…“.

<https://de.wikipedia.org/wiki/Alphabetische_Sortierung#Einsortierungsregeln_f.C3.BCr_weitere_Buchstaben>

wxjm...@gmail.com

Sep 25, 2016, 4:16:26 AM
Collation is one of the most interesting examples to
illustrate Python and its disastrous Unicode
implementation.

wxjm...@gmail.com

Sep 25, 2016, 4:40:13 AM
[Addendum]

As a European guy, I recommend using the character
set used in the "official" font families used in Germany:
BundesSerif / BundesSans.

See
https://styleguide.bundesregierung.de/Webs/SG/DE/PrintMedien/Basiselemente/Schriften/schriften_node.html?__site=SG

(Poor Guido...)

Martin Schöön

Sep 25, 2016, 6:16:49 AM
On 2016-09-25, wxjm...@gmail.com <wxjm...@gmail.com> wrote:
>
> As an European guy, I recommend to use the characters
> set used in the "official" font families used in Germany:
> BundesSerif / BundesSans.
>
> See
> https://styleguide.bundesregierung.de/Webs/SG/DE/PrintMedien/Basiselemente/Schriften/schriften_node.html?__site=SG
>
HTTP Status 404

Die Seite konnte leider nicht gefunden werden.

:-(

Manually poking around I still arrive at:
https://styleguide.bundesregierung.de/Webs/SG/DE/PrintMedien/Basiselemente/Schriften/schriften_node.html?__site=SG

Strange.

/Martin

Christian Gollwitzer

Sep 25, 2016, 6:47:22 AM
On 25.09.16 at 01:08, Thomas 'PointedEars' Lahn wrote:
> Christian Gollwitzer wrote:
>
>> Am 17.09.16 um 23:19 schrieb Thomas 'PointedEars' Lahn:
>>> Peng Yu wrote:
>>>> Hi, I want to convert strings in which the characters with accents
>>>> should be converted to the ones without accents.
>>> […]
>>>> […]
>>>> ./main.py Förstemann
>>>
>>> AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
>>> language, but characters of their own (umlauts).
>>>
>>> In particular, I know for certain that they are not accented in Germanic
>>> languages. Swedish has been mentioned; I can add my native language,
>>> German, to that list.
>>
>> In German, they are letters,
>
> If you read more carefully, my point was: In German, umlauts are not
> "accented characters".
>
>> but they collate as either ae, oe, ue
>> (rarely) or a, o, u (modern style).
>
> Correct, but irrelevant. The OP did not say anything about sorting.
>
> Fallacy: Red herring.
>

Fallacy: Thinking that I disagree with you.

Christian

Steve D'Aprano

Sep 25, 2016, 7:06:20 AM
On Sun, 25 Sep 2016 09:08 am, Thomas 'PointedEars' Lahn wrote:

> Christian Gollwitzer wrote:
>
>> Am 17.09.16 um 23:19 schrieb Thomas 'PointedEars' Lahn:
>>> Peng Yu wrote:
>>>> Hi, I want to convert strings in which the characters with accents
>>>> should be converted to the ones without accents.
>>> […]
>>>> […]
>>>> ./main.py Förstemann
>>>
>>> AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
>>> language, but characters of their own (umlauts).
>>>
>>> In particular, I know for certain that they are not accented in Germanic
>>> languages. Swedish has been mentioned; I can add my native language,
>>> German, to that list.
>>
>> In German, they are letters,
>
> If you read more carefully, my point was: In German, umlauts are not
> "accented characters".

The umlauts themselves are not. But the combination of vowel-plus-umlaut is
surely an "accented character", is it not? If not, what do you call it in
German?

My understanding is that both officially and popularly, native German
speakers consider that the alphabet has 26 letters (same as English), and
that "accented characters" including the vowels which take umlauts are not
distinct letters of the alphabet but mere variations of the standard
vowels.

That's to be contrasted to (say) Swedish, where ä and ö are *not* "a and o
with an accent/diacritic/umlaut/diaeresis/trema" but distinct letters of
the alphabet in their own right. That's different from ü (the "German Y")
in Swedish, which is only used for loan words and names of German origin,
and *is* considered to be a variant of u.

I use the term "accented character" here in the ignorant, non-linguist,
English-speaker sense of any letter of the alphabet with "funny dots and
squiggles" on it. To people who know what they are talking about, there is
a difference between an accent, umlaut, trema, diaeresis and other
diacritics, but for the purposes of my question, I'm not too worried about
the technical difference between these modifiers, only whether or not they
are considered a modifier on a standard letter or not.



[...]
> And as you have mentioned phone books, in all German-speaking phone books
> I have come across so far, “ä” does sort like “ae”, “ö” like “oe”, and “ü”
> like “ue” (this is specified in DIN 5007 as “variant 1”).
>
> (That does not mean, however, that it is a good idea to *convert* those
> letters this way. And there is no good reason to; all modern operating
> systems, filesystems and name schemes support Unicode.)

Alas, if we only needed to deal with modern operating systems, file systems
and naming schemes, life would be much easier. But sadly we also have to
deal with *old* operating systems, file systems and naming schemes; as well
as ASCII-only or other non-Unicode applications, plus keyboards that give
the user no obvious or easy way to add "accents" (diacritics etc.) to base
letters. See, for example:

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

As the author says:

"One of my clients gets address data from Europe, but most of their systems
cannot handle Latin-1 characters. With all due respect to the umlaut,
scharfes s, cedilla, and all the other fine accented characters of Europe,
all I needed to do was to prepare addresses for a shipping system."


Post offices and freight companies are used to dealing with misspelled
addresses. They can usually cope with a few missing accents.

wxjm...@gmail.com

Sep 25, 2016, 11:08:33 AM
Just for information.
The link I gave is working for me (google group / Firefox).

Your link is also working.
