grapheme cluster library

Rustom Mody

Oct 21, 2017, 12:11:13 AM

Is there a recommended library for manipulating grapheme clusters?

In particular, in devanagari
क + ि = कि
in (pseudo)unicode names
KA-letter + I-sign = KI-composite-letter

I would like to be able to handle KI as a letter rather than two code-points.
Can of course write an automaton to do the grouping, but I'm guessing that it's already
available some place…
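For readers following along, here is a minimal standard-library sketch of what "two code points" means in the question. It only inspects the string and does no grouping; the variable name `ki` is just for illustration.

```python
import unicodedata

ki = "\u0915\u093f"  # कि: KA followed by the dependent vowel sign I

# A Python str is a sequence of code points, so len() counts two here,
# even though a reader sees a single written unit.
print(len(ki))
for ch in ki:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

The second code point has category Mc (a spacing combining mark), which is the kind of information any grouping automaton would have to look at.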

Chris Angelico

Oct 21, 2017, 2:21:57 AM

On Sat, Oct 21, 2017 at 3:25 PM, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
> Rustom Mody <rusto...@gmail.com> writes:
>>Is there a recommended library for manipulating grapheme clusters?
>

> The Python Library has a module "unicodedata", with functions like:
>
> |unicodedata.normalize( form, unistr )
> |
> |Returns the normal form »form« for the Unicode string »unistr«.
> |Valid values for »form« are »NFC«, »NFKC«, »NFD«, and »NFKD«.
>
> . I don't know whether the transformation you are looking for
> is one of those.

No, that's at a lower level than grapheme clusters.

Rustom, have you looked on PyPI? There are a couple of hits, including
one simply called "grapheme".

ChrisA
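A small sketch of why normalization does not answer the question, using only `unicodedata` from the standard library. NFC can merge a base letter and a combining mark only where Unicode defines a precomposed code point; no such code point exists for कि, so it stays at two code points. The variable names are illustrative.

```python
import unicodedata

u_umlaut = "U\u0308"   # U + COMBINING DIAERESIS: two code points
ki = "\u0915\u093f"    # कि: KA + VOWEL SIGN I: two code points

# NFC composes U + diaeresis into the precomposed U+00DC (Ü).
composed = unicodedata.normalize("NFC", u_umlaut)
print(len(u_umlaut), len(composed))   # 2 1

# There is no precomposed KI, so NFC leaves the sequence untouched.
ki_nfc = unicodedata.normalize("NFC", ki)
print(len(ki_nfc), ki_nfc == ki)      # 2 True
```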

Rustom Mody

Oct 21, 2017, 6:08:50 AM

On Saturday, October 21, 2017 at 11:51:57 AM UTC+5:30, Chris Angelico wrote:

> On Sat, Oct 21, 2017 at 3:25 PM, Stefan Ram wrote:
> > Rustom Mody writes:
> >>Is there a recommended library for manipulating grapheme clusters?
> >
> > The Python Library has a module "unicodedata", with functions like:
> >
> > |unicodedata.normalize( form, unistr )
> > |
> > |Returns the normal form »form« for the Unicode string »unistr«.
> > |Valid values for »form« are »NFC«, »NFKC«, »NFD«, and »NFKD«.
> >
> > . I don't know whether the transformation you are looking for
> > is one of those.
>
> No, that's at a lower level than grapheme clusters.
>
> Rustom, have you looked on PyPI? There are a couple of hits, including
> one simply called "grapheme".

There is this one line solution using regex (or 2 char solution!)
Not perfect but a good start

>>> from regex import findall
>>> veda="""ॐ पूर्णमदः पूर्णमिदं पूर्णात्पुर्णमुदच्यते
पूर्णस्य पूर्णमादाय पूर्णमेवावशिष्यते ॥
ॐ शान्तिः शान्तिः शान्तिः ॥"""

>>> findall(r'\X', veda)
['ॐ', ' ', 'पू', 'र्', 'ण', 'म', 'दः', ' ', 'पू', 'र्', 'ण', 'मि', 'दं', ' ', 'पू', 'र्', 'णा', 'त्', 'पु', 'र्', 'ण', 'मु', 'द', 'च्', 'य', 'ते', '\n', 'पू', 'र्', 'ण', 'स्', 'य', ' ', 'पू', 'र्', 'ण', 'मा', 'दा', 'य', ' ', 'पू', 'र्', 'ण', 'मे', 'वा', 'व', 'शि', 'ष्', 'य', 'ते', ' ', '॥', '\n', 'ॐ', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', '॥']
>>>

Compare

>>> [x for x in veda]
['ॐ', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'द', 'ः', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ि', 'द', 'ं', ' ', 'प', 'ू', 'र', '्', 'ण', 'ा', 'त', '्', 'प', 'ु', 'र', '्', 'ण', 'म', 'ु', 'द', 'च', '्', 'य', 'त', 'े', '\n', 'प', 'ू', 'र', '्', 'ण', 'स', '्', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ा', 'द', 'ा', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'े', 'व', 'ा', 'व', 'श', 'ि', 'ष', '्', 'य', 'त', 'े', ' ', '॥', '\n', 'ॐ', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', '॥']

What is not working are the vowel-less consonant joins (conjuncts):
i.e. ... 'र्', 'ण' ...
[elements 3 and 4 of the findall result, counting from 0]
should be the single cluster 'र्ण'

But it's good enough for me for now, I think

PS Stefan, I don't see your responses unless someone quotes them. Thanks anyway for the inputs
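One possible way to close the gap described above is a post-processing pass over the list that `findall(r'\X', ...)` returns: glue any cluster that ends in a virama onto the following cluster when that one starts with a letter. This is a sketch under assumptions, not part of the regex module; `merge_conjuncts` is a hypothetical helper, and it ignores ZWJ/ZWNJ and scripts other than Devanagari.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (the vowel-killer mark)

def merge_conjuncts(clusters):
    """Join virama-final clusters with the next letter-initial cluster.

    `clusters` is assumed to be a list of non-empty strings, such as the
    result of regex.findall(r'\\X', text).
    """
    merged = []
    for cluster in clusters:
        joins_previous = (
            merged
            and merged[-1].endswith(VIRAMA)
            and unicodedata.category(cluster[0]).startswith("L")
        )
        if joins_previous:
            merged[-1] += cluster   # chains work too: स् + त् + र
        else:
            merged.append(cluster)
    return merged

# The first word of the verse, already split the way \X splits it.
print(merge_conjuncts(["\u092a\u0942", "\u0930\u094d", "\u0923"]))
```

A virama followed by a space or punctuation is left alone, since the next cluster does not start with a letter.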

MRAB

Oct 21, 2017, 11:52:24 AM

You can use the regex module to split a string into graphemes:

regex.findall(r'\X', string)
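Where the third-party regex module cannot be installed, a rough standard-library approximation is to attach every combining mark (general category starting with "M") to the character before it. This is only a sketch: it is not the full Unicode segmentation algorithm behind `\X`, and `rough_clusters` is a hypothetical name, but for plain Devanagari text like the verse in this thread it should produce a similar grouping.

```python
import unicodedata

def rough_clusters(text):
    """Group each combining mark (Mn, Mc, Me) with the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # vowel signs, virama, anusvara, visarga
        else:
            clusters.append(ch)     # letters, spaces, punctuation start anew
    return clusters

print(rough_clusters("\u092a\u0942\u0930\u094d\u0923"))  # the word पूर्ण
```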

Rustom Mody

Oct 21, 2017, 1:18:32 PM

Thanks MRAB
Yes as I said I discovered r'\X'
Ultimately my code was (effectively) one line!

print("".join(map[x] for x in findall(r'\X', l)))

with map being a few 100 elements of a dictionary such as
map = {
...
'ॐ': "OM",
...
}

$ cat purnam-deva

ॐ पूर्णमदः पूर्णमिदं पूर्णात्पुर्णमुदच्यते
पूर्णस्य पूर्णमादाय पूर्णमेवावशिष्यते ॥
ॐ शान्तिः शान्तिः शान्तिः ॥

$ ./devanagari2roman.py purnam-deva
OM pUraNamadaH pUraNamidaM pUraNAtpuraNamudachyate
pUraNasya pUraNamAdAya pUraNamavAvashiShyate ..
OM shAntiH shAntiH shAntiH ..

Basically, an inversion of the itrans input method
https://en.wikipedia.org/wiki/ITRANS
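The lookup step in the one-liner above can be sketched as follows. The table entries here are a hypothetical excerpt inferred from the sample output (the real dictionary is not shown in the post), the name `TABLE` is used instead of `map` to avoid shadowing the builtin, and the input is given as an already split list of clusters so the sketch does not depend on the regex module.

```python
# Hypothetical excerpt; the real table in the post has a few hundred entries.
TABLE = {
    "\u0950": "OM",                # ॐ
    "\u0936\u093e": "shA",         # शा
    "\u0928\u094d": "n",           # न्
    "\u0924\u093f\u0903": "tiH",   # तिः
    "\u0965": "..",                # ॥
}

def transliterate(clusters, table=TABLE):
    # Unknown clusters (spaces, newlines, unmapped text) pass through unchanged.
    return "".join(table.get(c, c) for c in clusters)

line = ["\u0950", " ", "\u0936\u093e", "\u0928\u094d", "\u0924\u093f\u0903", " ", "\u0965"]
print(transliterate(line))  # OM shAntiH ..
```

Using `table.get(c, c)` rather than `table[c]` is a design choice for the sketch: a missing entry shows up as untransliterated text in the output instead of raising KeyError.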

Steven D'Aprano

Oct 21, 2017, 8:14:00 PM

On Fri, 20 Oct 2017 21:11:02 -0700, Rustom Mody wrote:

> Is there a recommended library for manipulating grapheme clusters?

Back in July, I asked for anyone interested in grapheme clusters to
consider checking out this issue on the bug tracker:

http://bugs.python.org/issue30717

My post received at least 170 replies from at least 16 unique people
(including you, Rustom). As far as I can see, only two of those people
actually registered on the tracker to follow that ticket.

From time to time, there are repeated complaints that the Python standard
library doesn't handle graphemes. Are those complaints mostly hot air, or
is there actually community interest in having the stdlib deal with this?

If there is community interest, the best ways to register that interest
are, in order (best to worst):

- step up and provide some code;

- make a concrete proposal (not just "support graphemes") on the
Python-Ideas mailing list;

- register a feature request on the tracker;

- complain about the lack of such support here;

- do nothing.

--
Steven D'Aprano

wxjm...@gmail.com

Oct 22, 2017, 4:12:38 AM

A good start would be to have a correct Unicode implementation.

Lawrence D’Oliveiro

Oct 22, 2017, 10:36:03 PM

On Saturday, October 21, 2017 at 5:11:13 PM UTC+13, Rustom Mody wrote:
> Is there a recommended library for manipulating grapheme clusters?

Is this <http://anoopkunchukuttan.github.io/indic_nlp_library/> any good?

Bear in mind that the logical representation of the text is as code points, graphemes would have more to do with rendering.

Rustom Mody

Oct 23, 2017, 2:47:13 AM

On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro wrote:
> On Saturday, October 21, 2017 at 5:11:13 PM UTC+13, Rustom Mody wrote:
> > Is there a recommended library for manipulating grapheme clusters?
>
> Is this <http://anoopkunchukuttan.github.io/indic_nlp_library/> any good?

Thanks looks promising.
Dunno how much it lives up to the claims
[For now the one-liner from regex's findall has sufficed:
findall(r'\X', «text»)]

[Thanks MRAB for the library]

> Bear in mind that the logical representation of the text is as code points, graphemes would have more to do with rendering.

Heh! Speak of Euro/Anglo-centrism!

In a sane world graphemes would be called letters
And unicode codepoints would be called something else — letterlets??
To be fair to the Unicode consortium, they strive hard to call them codepoints
But in an anglo-centric world, the conflation of codepoint to letter is inevitable I guess.
To hear how a non Roman-centric view of the world would sound:
A 'w' is a poorly double-struck 'u'
A 't' is a crossed 'l'
Reasonable?

The lead of https://en.wikipedia.org/wiki/%C3%9C has

| Ü, or ü, is a character…classified as a separate letter in several
| extended Latin alphabets
| (including Azeri, Estonian, Hungarian and Turkish), but as the letter U
| with an umlaut/diaeresis in others such as Catalan, French, Galician,
| German, Occitan and Spanish.

Steve D'Aprano

Oct 23, 2017, 3:45:35 AM

On Mon, 23 Oct 2017 05:47 pm, Rustom Mody wrote:

> On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro
> wrote:

[...]

>> Bear in mind that the logical representation of the text is as code points,
>> graphemes would have more to do with rendering.
>
> Heh! Speak of Euro/Anglo-centrism!

I think that Lawrence may be thinking of glyphs. Glyphs are the display forms
that are rendered. Graphemes are the smallest units of written language.

> In a sane world graphemes would be called letters

Graphemes *aren't* letters.

For starters, not all written languages have an alphabet. No alphabet, no
letters. Even in languages with an alphabet, not all graphemes are letters.

Graphemes include:

- logograms (symbols which represent a morpheme, an entire word, or
a phrase), e.g. Chinese characters, ampersand &, the ™ trademark
or ® registered trademark symbols;

- syllabic characters such as Japanese kana or Cherokee;

- letters of alphabets;

- letters with added diacritics;

- punctuation marks;

- mathematical symbols;

- typographical symbols;

- word separators;

and more. Many linguists also include digraphs (pairs of letters) like the
English "th", "sh", "qu", or "gh" as graphemes.

https://www.thoughtco.com/what-is-a-grapheme-1690916

https://en.wikipedia.org/wiki/Grapheme

> And unicode codepoints would be called something else — letterlets??
> To be fair to the Unicode consortium, they strive hard to call them
> codepoints But in an anglo-centric world, the conflation of codepoint to
> letter is inevitable I guess. To hear how a non Roman-centric view of the
> world would sound: A 'w' is a poorly double-struck 'u'
> A 't' is a crossed 'l'
> Reasonable?

No, T is not a crossed L -- they are unrelated letters and the visual
similarity is a coincidence. They are no more connected than E is just an F
with an extra line.

But you are more right than you knew regarding W: it *literally was* a
doubled-up V (sometimes written U) once upon a time.

For a long time W did not appear in the Latin alphabet, even after people used
it in written text. It was considered a digraph VV then a ligature and
finally, only gradually, a proper letter. As late as the 16th century the
German grammarian Valentin Ickelshamer complained that hardly anyone,
including school masters, knew what to do with W or what it was called.

https://en.wikipedia.org/wiki/W#History

> The lead of https://en.wikipedia.org/wiki/%C3%9C has
>
> | Ü, or ü, is a character…classified as a separate letter in several
> | extended Latin alphabets
> | (including Azeri, Estonian, Hungarian and Turkish), but as the letter U
> | with an umlaut/diaeresis in others such as Catalan, French, Galician,
> | German, Occitan and Spanish.

Indeed: sometimes the same grapheme is considered a letter in one language and
a letter-plus-diacritic in another.

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Rustom Mody

Oct 23, 2017, 10:25:40 AM

Um… Ok So I am using the wrong word? Your first link says:
| For example, the word 'ghost' contains five letters and four graphemes
| ('gh,' 'o,' 's,' and 't')

Whereas regex's findall gives:

>>> findall(r'\X', "ghost")
['g', 'h', 'o', 's', 't']
>>> findall(r'\X', "church")
['c', 'h', 'u', 'r', 'c', 'h']

Thomas Jollans

Oct 23, 2017, 5:51:06 PM

The definition of a "grapheme" in the Unicode standard does not
necessarily line up with the linguistic definition of a grapheme for any
particular language.

Even if we assumed that there was a universally agreed definition of the
term for every written language (for English there certainly isn't),
you'd need dictionaries, plus information on which language you're dealing
with, to pull this trick off.

As an example to illustrate why you'd need dictionaries:

In Dutch, "ij" (the "long IJ", as opposed to the "greek Y") is generally
considered a single letter, or at the very least a single grapheme.
There is a unicode codepoint for it (ĳ), but it isn't widely used.

So "vrij" (free) has three graphemes (v r ĳ) and three or four letters.
However, in "bijectie" (bijection), "i" and "j" are two separate
graphemes, so this word has eight letters and seven or eight graphemes.
("ie" may or may not be one single grapheme...)

-- Thomas

PS: This may not be obvious to you at first unless you're Dutch.
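To make the "you'd need dictionaries" point concrete, here is a toy sketch of language-aware splitting. Everything in it is hypothetical: the digraph list, the exception lexicon, and the function name are invented for illustration, and a real tool would need a full per-language word list rather than one hard-coded entry.

```python
# Hypothetical data: a real tool would need a full lexicon per language.
DIGRAPHS = {"nl": ("ij",)}
SPLIT_EXCEPTIONS = {"nl": {"bijectie"}}  # words where "i" and "j" stay separate

def linguistic_graphemes(word, lang):
    """Split `word` into units, treating known digraphs as single graphemes."""
    if word in SPLIT_EXCEPTIONS.get(lang, ()):
        digraphs = ()
    else:
        digraphs = DIGRAPHS.get(lang, ())
    units, i = [], 0
    while i < len(word):
        # Take a digraph if one starts at position i, else a single character.
        match = next((d for d in digraphs if word.startswith(d, i)), None)
        unit = match or word[i]
        units.append(unit)
        i += len(unit)
    return units

print(linguistic_graphemes("vrij", "nl"))      # ['v', 'r', 'ij']
print(linguistic_graphemes("bijectie", "nl"))  # eight single letters
```

Without the language tag and the exception list, the same code cannot tell the two words apart, which is the difficulty described above.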

Lawrence D’Oliveiro

Oct 23, 2017, 8:24:20 PM

On Monday, October 23, 2017 at 7:47:13 PM UTC+13, Rustom Mody wrote:
>
> On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro wrote:
>>
>> Bear in mind that the logical representation of the text is as code
>> points, graphemes would have more to do with rendering.
>
> Heh! Speak of Euro/Anglo-centrism!

You dare say that about somebody who did this <https://default-cube.deviantart.com/gallery/59919308/HarfPy-Examples> ;)

(By the way, I’d like to add Indic examples to that.)

wxjm...@gmail.com

Oct 24, 2017, 3:53:47 AM

Le lundi 23 octobre 2017 08:47:13 UTC+2, Rustom Mody a écrit :
>
> Heh! Speak of Euro/Anglo-centrism!
>

The situation is even more critical than you may think.

Your beloved language is not even working correctly
with the "characters" of the Western European
Windows-1252 charset.