Part of the reason I think this is that I think we should expose the
difference between composed and decomposed Unicode normalizations to the
user as little as possible. But even in cases where Unicode doesn't
have composed characters, I'd think that it would make more sense for
editing operations to operate on grapheme clusters.
I was thinking of filing bugs on this, but wanted to ask here first to
see if others agree.
-David
--
L. David Baron <URL: http://dbaron.org/ >
Technical Lead, Layout & CSS, Mozilla Corporation
Hmm. Does "character" mean a Unicode code point? :-)
And by grapheme clusters, to be precise, do you mean the default
grapheme clusters that Unicode Standard Annex #29 defines?
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> Part of the reason I think this is that I think we should expose the
> difference between composed and decomposed Unicode normalizations to the
> user as little as possible. But even in cases where Unicode doesn't
> have composed characters, I'd think that it would make more sense for
> editing operations to operate on grapheme clusters.
IMHO editing really should handle combining character sequences, and
ideally all of Unicode's default grapheme clusters.
That would already be a great enhancement; I'd say it's good enough to
stop there.
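
To make the difference concrete, here is a minimal sketch in Python of
backspace acting on a code point versus on a combining character
sequence. The function names are made up for illustration, and the
General Category "M" test is only a rough stand-in for the UAX #29
rules (no Hangul, ZWJ or regional-indicator handling), so treat it as a
sketch under those assumptions rather than how any editor does it.

```python
import unicodedata


def backspace_code_point(text: str) -> str:
    """Delete only the last code point (per-"character" behaviour)."""
    return text[:-1]


def backspace_combining_sequence(text: str) -> str:
    """Delete the last base character together with its combining marks.

    Simplification (an assumption of this sketch): a combining mark is
    any code point whose General Category starts with "M".
    """
    end = len(text)
    # Walk back over trailing combining marks.
    while end > 1 and unicodedata.category(text[end - 1]).startswith("M"):
        end -= 1
    # Remove the base character (or a lone leading mark) as well.
    return text[:max(end - 1, 0)]


decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"      # precomposed U+00E9

# Per-code-point backspace exposes the normalization form to the user:
print(backspace_code_point(decomposed) == "cafe")   # accent removed only
print(backspace_code_point(composed) == "caf")      # whole letter removed

# Per-sequence backspace behaves the same for both forms:
print(backspace_combining_sequence(decomposed) == "caf")
print(backspace_combining_sequence(composed) == "caf")
```

With the second function the user cannot tell whether the text was
stored composed or decomposed, which is the property David asks for.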
But then there are tailored grapheme clusters, and I don't know if they
are even really desirable. Will Spaniards or Slovaks expect one press of
the "delete" key to delete "ch" in one go? I don't think so.
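
For what it's worth, a tailored rule of that kind would look roughly
like the Python sketch below. The table, the locale keys and the
function name are all hypothetical, invented only to show the
behaviour in question; nothing here comes from CLDR or from any
existing implementation.

```python
# Hypothetical tailoring table: locale -> digraphs treated as one unit.
# The entries are illustrative only.
TAILORED_DIGRAPHS = {
    "sk": ("ch",),
}


def backspace_tailored(text: str, locale: str) -> str:
    """Delete a whole tailored digraph if the text ends with one."""
    for digraph in TAILORED_DIGRAPHS.get(locale, ()):
        # Compare the tail case-insensitively so "Ch" and "CH" match too.
        if text[-len(digraph):].lower() == digraph:
            return text[:-len(digraph)]
    # No tailoring applies: fall back to deleting one code point.
    return text[:-1]


print(backspace_tailored("orech", "sk") == "ore")    # "ch" goes in one go
print(backspace_tailored("orech", "en") == "orec")   # only "h" is deleted
```

Whether the first result is what a Slovak user actually expects is
exactly the open question.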
I don't know enough to say whether tailored grapheme clusters are more
necessary for Indic, Thai or Tibetan. The Tibetan "U+0F04, U+0F05"
character sequence is a rather marginal case, but the other cases that
the default algorithm doesn't handle may not all be.
The second annoying point is whether it's better to determine grapheme
clusters inside Mozilla or through an underlying i18n API.
For example, Uniscribe has ScriptXtoCP/ScriptCPtoX to do it.
Using that would probably bring more OS-level consistency (especially
since what it does is a bit strange at times, but might be the right
thing to do, or at least what the user expects from past experience).
See details here:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_97mv.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/uniscrib_6coo.asp
http://blogs.msdn.com/michkap/archive/2005/12/30/508157.aspx#513129
L. David Baron wrote:
> Should editing operations (caret movement, backspace, delete, and
> selection) operate on characters or grapheme clusters? In our code they
> currently operate on characters, but I'd think that we'd probably want
> them to operate on grapheme clusters instead.
>
> Part of the reason I think this is that I think we should expose the
> difference between composed and decomposed Unicode normalizations to the
> user as little as possible. But even in cases where Unicode doesn't
> have composed characters, I'd think that it would make more sense for
> editing operations to operate on grapheme clusters.
>
> I was thinking of filing bugs on this, but wanted to ask here first to
> see if others agree.
>
> -David
>
--
_/|\_ Samphan Raruenrom. http://www.osdev.co.th/