Should editing and selection operate on grapheme clusters?

L. David Baron

Apr 3, 2006, 9:37:04 PM

to dev-...@lists.mozilla.org

Should editing operations (caret movement, backspace, delete, and
selection) operate on characters or grapheme clusters? In our code they
currently operate on characters, but I'd think that we'd probably want
them to operate on grapheme clusters instead.

Part of the reason I think this is that I think we should expose the
difference between composed and decomposed Unicode normalizations to the
user as little as possible. But even in cases where Unicode doesn't
have composed characters, I'd think that it would make more sense for
editing operations to operate on grapheme clusters.
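To make the composed/decomposed point concrete, here is a minimal Python sketch (an illustration only, not Mozilla code; the helper name `backspace_code_point` is hypothetical). It shows how a backspace that removes one code point gives the user different results for two strings that look identical on screen:

```python
import unicodedata

# The same visible word in two normalization forms.
composed = "caf\u00e9"                               # 'é' as one code point, U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301

def backspace_code_point(text):
    """Delete the last code point, as character-based editing does."""
    return text[:-1]

# Composed form: the whole 'é' disappears.
after_composed = backspace_code_point(composed)
# Decomposed form: only the accent disappears, leaving a bare 'e'.
after_decomposed = backspace_code_point(decomposed)

print(len(composed), len(decomposed))
print(after_composed, after_decomposed)
```

With cluster-based editing, both strings would lose the whole accented letter, so the user would never see which normalization form the text happened to be in.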

I was thinking of filing bugs on this, but wanted to ask here first to
see if others agree.

-David

--
L. David Baron <URL: http://dbaron.org/ >
Technical Lead, Layout & CSS, Mozilla Corporation

Jean-Marc Desperrier

Apr 4, 2006, 1:02:18 PM

L. David Baron wrote:
> Should editing operations (caret movement, backspace, delete, and
> selection) operate on characters or grapheme clusters? In our code they
> currently operate on characters, but I'd think that we'd probably want
> them to operate on grapheme clusters instead.

Hmm. Does "character" here mean a Unicode code point? :-)

And by grapheme clusters, to be precise, do you mean the default
grapheme clusters that Unicode Standard Annex #29 defines?
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

> Part of the reason I think this is that I think we should expose the
> difference between composed and decomposed Unicode normalizations to the
> user as little as possible. But even in cases where Unicode doesn't
> have composed characters, I'd think that it would make more sense for
> editing operations to operate on grapheme clusters.

IMHO editing really should handle combining character sequences, and
ideally all of Unicode's default grapheme clusters.

That would already be a great enhancement, I'd say good enough to stop
there.
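
As a rough illustration of the first step (combining character sequences only), here is a Python sketch. It is a deliberate simplification and not the UAX #29 algorithm: it attaches every character of general category M to the preceding character, and ignores CR LF, Hangul jamo and the other default rules. The function names are hypothetical.

```python
import unicodedata

def is_combining_mark(ch):
    """True for general categories Mn, Mc and Me (combining marks)."""
    return unicodedata.category(ch).startswith("M")

def combining_sequences(text):
    """Split text into base characters with their trailing marks.

    Simplification: a mark with no preceding base starts its own
    sequence, and no other UAX #29 rule is applied.
    """
    clusters = []
    for ch in text:
        if clusters and is_combining_mark(ch):
            clusters[-1] += ch      # extend the current sequence
        else:
            clusters.append(ch)     # start a new sequence
    return clusters

def delete_last_cluster(text):
    """Remove the last combining character sequence from text."""
    return "".join(combining_sequences(text)[:-1])

print(combining_sequences("e\u0301a"))
print(delete_last_cluster("cafe\u0301"))
```

Caret movement and selection could step over the same boundaries; a real implementation would follow the full boundary rules in the annex rather than this shortcut.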

But then there are tailored grapheme clusters, and I don't know if they
are even really desirable. Will Spaniards or Slovaks expect a single
press of the "delete" key to delete "ch" in one go? I don't think so.
I don't know enough to say whether tailored grapheme clusters are more
needed for Indic, Thai or Tibetan. The Tibetan "U+0F04, U+0F05" character
sequence is quite anecdotal, but that may not be true of all the other
cases the default algorithm doesn't handle.

The second annoying point is whether it's better to determine grapheme
clusters inside Mozilla or through an underlying i18n API.

For example, Uniscribe has ScriptXtoCP/ScriptCPtoX to do it.
Using that would probably bring more OS-level consistency (especially
since what it does is a bit strange at times, but might be the right
thing to do, or at least what the user expects from previous experience).

See details here:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_97mv.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_6coo.asp
http://blogs.msdn.com/michkap/archive/2005/12/30/508157.aspx#513129

Samphan Raruenrom

Apr 4, 2006, 11:28:42 PM

Agreed. That's what CTL (complex text layout) users want to see.
However, for Thai, Delete should delete a cluster,
but Backspace should delete a code point.
Moreover, mouse hit testing (for click/select)
and Find should operate on cluster boundaries too.
All of this is specified in Unicode (except Backspace:
whether it should delete a code point
or a cluster may depend on locale).
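
The asymmetry can be sketched in Python as follows. This is only an illustration with hypothetical function names, and it approximates a cluster as a base character plus its following combining marks, which is simpler than the real Unicode rules:

```python
import unicodedata

def cluster_end(text, pos):
    """Index just past the base character at pos and its trailing marks."""
    end = pos + 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return end

def delete_forward(text, caret):
    """Delete key: remove the whole cluster after the caret."""
    if caret >= len(text):
        return text, caret
    return text[:caret] + text[cluster_end(text, caret):], caret

def backspace(text, caret):
    """Backspace key: remove a single code point before the caret."""
    if caret <= 0:
        return text, caret
    return text[:caret - 1] + text[caret:], caret - 1

# Two Thai clusters, each a consonant + vowel sign + tone mark.
word = "\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"

print(delete_forward(word, 0))       # first cluster removed as a unit
print(backspace(word, len(word)))    # only the final tone mark removed
```

Backspace here lets the user correct a mistyped tone mark or vowel sign without retyping the consonant, while Delete never leaves marks stranded without their base.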

L. David Baron wrote:
> Should editing operations (caret movement, backspace, delete, and
> selection) operate on characters or grapheme clusters? In our code they
> currently operate on characters, but I'd think that we'd probably want
> them to operate on grapheme clusters instead.
>
> Part of the reason I think this is that I think we should expose the
> difference between composed and decomposed Unicode normalizations to the
> user as little as possible. But even in cases where Unicode doesn't
> have composed characters, I'd think that it would make more sense for
> editing operations to operate on grapheme clusters.
>
> I was thinking of filing bugs on this, but wanted to ask here first to
> see if others agree.
>
> -David
>

--
_/|\_ Samphan Raruenrom. http://www.osdev.co.th/