Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Rich Felker

Oct 14, 2006, 12:22:31 AM
Working on uuterm[1], I've run into a problem with the characters
0D4A-0D4C and possibly others like them, with regard to wcwidth(3)
behavior. These characters are combining marks that attach on both
sides of a cluster, and have canonical equivalence to the two separate
pieces from which they are built, but yet Markus' wcwidth
implementation and GNU libc assign them a width of 1. It appears very
obvious to me that there's no hope of rendering both of these parts
using only 1 character cell on a character cell device, and even if it
were possible, it also seems horribly wrong for canonically equivalent
strings to have different widths.

I propose amending the wcwidth definitions to assign these characters
(and any like them) a width of 2. Furthermore, I would suggest that
any characters with canonical decompositions be assigned a width that
is the sum of the widths of the decomposition into NFD. This would
avoid similar unfortunate situations in the future that might not yet
have been found. It may also be desirable to do this for compatibility
decompositions (like "dz", etc.); however I suspect it's unlikely that
anyone would use such characters in non-legacy data anyway.
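
The proposed rule can be sketched in Python. This is only an illustration under stated assumptions, not the glibc or Markus' implementation: `base_width` is a hypothetical, heavily simplified stand-in for a real wcwidth table, and `nfd_width` and `string_width` are assumed names for "sum the widths over the NFD decomposition".

```python
import unicodedata

def base_width(ch):
    """Hypothetical, simplified per-code-point width (not a real wcwidth table)."""
    cp = ord(ch)
    # Nonspacing and enclosing marks occupy no cell of their own.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # Assumption: Hangul medial vowels and final consonants count as
    # zero width, so a decomposed Hangul syllable stays at width 2.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def nfd_width(ch):
    """Width of one code point as the sum of widths of its NFD decomposition."""
    return sum(base_width(c) for c in unicodedata.normalize("NFD", ch))

def string_width(s):
    """Width of a string, summed code point by code point."""
    return sum(nfd_width(ch) for ch in s)
```

With this sketch, U+0D4A gets width 2 (its decomposition U+0D46 U+0D3E consists of two spacing marks), and the strings U+0D30 U+0D4A and U+0D30 U+0D46 U+0D3E both measure 3 columns, so canonically equivalent strings agree.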

BTW I don't think there's any harm here in breaking compatibility with
existing practice, since obviously no one is using the results of
wcwidth on these characters or they would already have run into this
problem.

Rich


[1] http://svn.mplayerhq.hu/uuterm/


--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/


Bruno Haible

Oct 16, 2006, 12:13:58 PM
Hello Rich,

> These characters are combining marks that attach on both
> sides of a cluster, and have canonical equivalence to the two separate
> pieces from which they are built, but yet Markus' wcwidth
> implementation and GNU libc assign them a width of 1. It appears very
> obvious to me that there's no hope of rendering both of these parts
> using only 1 character cell on a character cell device, and even if it
> were possible, it also seems horribly wrong for canonically equivalent
> strings to have different widths.

What rendering do other terminal emulators produce for these characters,
especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
a patch to glibc based on the data of just 1 terminal emulator.

Bruno

Ben Wiley Sittler

Oct 16, 2006, 8:38:45 PM
just tried this in a few terminals, here are the results:

GNOME Terminal 2.16.1:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 3
NOTE: displays very differently in each case

Konsole 1.6.5:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 4
NOTE: displays very differently in each case

mlterm 2.9.3:
U+0D30 U+0D4A displayed with width 2
U+0D30 U+0D46 U+0D3E displayed with width 2
NOTE: displays identically in each case
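
For anyone wanting to repeat this kind of test, here is a minimal sketch. The helper names (`same_canonical_form`, `test_lines`) and the ruler layout are assumptions for illustration and are not taken from the tests above. It builds the two sequences, checks that they are canonically equivalent, and prints each followed by a `|` so the column where the cluster ends can be read off against a ruler line.

```python
import unicodedata

PRECOMPOSED = "\u0d30\u0d4a"       # RA + VOWEL SIGN O
DECOMPOSED = "\u0d30\u0d46\u0d3e"  # RA + VOWEL SIGN E + VOWEL SIGN AA

def same_canonical_form(a, b):
    """True if the two strings are canonically equivalent (equal after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def test_lines():
    """Lines to print in the terminal under test; '|' marks the end of the cluster."""
    ruler = "1234567890"
    return [ruler, PRECOMPOSED + "|", DECOMPOSED + "|"]

if __name__ == "__main__":
    assert same_canonical_form(PRECOMPOSED, DECOMPOSED)
    for line in test_lines():
        print(line)
```

If the `|` lands in a different column on the second and third lines, the terminal is giving canonically equivalent strings different widths.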

Rich Felker

Oct 16, 2006, 9:40:00 PM
Sorry, I originally replied off-list to Bruno because the list mail was
slow coming through and I thought he was just mailing me in private.

On Mon, Oct 16, 2006 at 05:38:45PM -0700, Ben Wiley Sittler wrote:
> just tried this in a few terminals, here are the results:
>
> GNOME Terminal 2.16.1:
> U+0D30 U+0D4A displayed with width 3
> U+0D30 U+0D46 U+0D3E displayed with width 3
> NOTE: displays very differently in each case
>
> Konsole 1.6.5:
> U+0D30 U+0D4A displayed with width 3
> U+0D30 U+0D46 U+0D3E displayed with width 4
> NOTE: displays very differently in each case
>
> mlterm 2.9.3:
> U+0D30 U+0D4A displayed with width 2
> U+0D30 U+0D46 U+0D3E displayed with width 2
> NOTE: displays identically in each case

As we can see, _none_ of these agrees with the current wcwidth
implementation. In fact I'm pretty sure they all ignore wcwidth and
use their own (possibly font-specific) interpretation of width, which
fundamentally prevents the terminal from being used for anything with
columns or cursor positioning.

If they don't even agree with the current wcwidth, and the current
wcwidth cannot reasonably be used for Indic scripts, I see no good
reason why wcwidth tables shouldn't be fixed to at least match values
that _could_ be used for reasonable rendering...

> >What rendering do other terminal emulators produce for these characters,
> >especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
> >a patch to glibc based on the data of just 1 terminal emulator.

As I commented in private to Bruno, Apple's Terminal.app even has
broken cursor positioning behavior for CJK and nonspacing characters,
so I think it's hopeless to try to use it for Indic scripts...

Rich

Rich Felker

Oct 29, 2006, 4:55:30 PM
In addition to the issues I raised before about consistency of width
under canonical equivalence, I've found additional problems in the
width definitions which are not technical issues like before, but just
feasibility-of-presentation issues. Specifically, several Indic
scripts including Kannada and Malayalam have several characters which
require 6 or 7 vertical strokes for their standard presentation
glyphs, and numerous characters that require 4 or 5. Moreover, the
standard glyph shapes for these characters are roughly twice as wide
(sometimes more than twice) as they are tall.

This puts their horizontal complexity on par with most ideographic
characters, and makes it impossible to render them legibly in a single
character cell without huge font size. The possible courses of action
are:

1. Leave them with wcwidth of 1 anyway and assume everyone will use
huge font sizes or else put up with completely illegible glyphs.

2. Assign a global wcwidth of 2 to the affected scripts.

3. Perform "a careful analysis not only of each Unicode character,
but also of each presentation form", as Markus suggested in his
wcwidth.c comments, assigning width of 1/2[/3??] on a per-character
basis.

IMO course 1 is ridiculous. The only argument for it is compatibility,
but obviously no one has ever tried using wcwidth with these scripts
since it just plain doesn't work.

Course 3 is difficult but might give the most visually pleasing
results. On the other hand, it may tend to lock one into a particular
style of presentation forms. If preferred glyph forms change due to
"reforms" or just stylistic preferences, users could be left with a
mess. Part of the analysis for #3 would have to include making sure
that the width assignments could remain reasonable under such
variations, as opposed to being font-specific, but this is probably
not infeasible as long as the number of "width>1" characters is kept
to a minimum.

Finally there's course 2. In a way it's sort of a cop-out, taking the
easy approach of "fixed width", but that's what character cell widths
have done ever since "i" and "m" received the same width of 1 column.
It's font-independent and ensures that text in a single script can
align well in columns regardless of which characters are used.
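
A rough sketch of what course 2 could look like follows. It assumes the "affected scripts" are identified by Unicode block ranges. The `WIDE_BLOCKS` table and the `script_width` name are hypothetical, and only Kannada and Malayalam are listed because they are the two scripts named above.

```python
import unicodedata

# Assumed set of affected scripts, given as Unicode block ranges.
WIDE_BLOCKS = [
    (0x0C80, 0x0CFF),  # Kannada
    (0x0D00, 0x0D7F),  # Malayalam
]

def script_width(ch):
    """Hypothetical per-script width: 2 for spacing characters in WIDE_BLOCKS."""
    cp = ord(ch)
    # Nonspacing and enclosing marks stay at width 0 in every script.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # Every other code point in an affected block gets two cells.
    if any(lo <= cp <= hi for lo, hi in WIDE_BLOCKS):
        return 2
    return 1
```

This sketch treats spacing combining marks like letters, which is an open choice rather than something settled here. On its own it also does nothing for consistency under canonical equivalence, so it would still need to be combined with a decomposition-based rule.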

I can prepare example bitmaps if anyone is interested in seeing what
the choices might look like, and probably will do this soon anyway.
Again, my goal is revising the wcwidth data (which Markus labelled as
incomplete in the original version) to account for scripts for which
it is not currently being used and for which it does not currently
provide reasonable results. But it's useless for me to just say what I
think it should be. There should be some sort of sane process here, by
which we arrive at a de facto standard which glibc and other
implementations can adopt.

Rich

rajeev joseph sebastian

Oct 30, 2006, 7:17:54 AM
Hello Rich Felker,

It is impossible to fit Malayalam "glyphs" into a given width class, if you want even barely aesthetic text. This is because a given sequence of Unicode characters may map into somewhat different conjunct styles depending on the font: either proper top to bottom (subjoining), or left to right (adjoining) or something in between as well :)

Regards,
Rajeev J Sebastian

PS: Sorry for the top post; Yahoo forces me to do this.

Rich Felker

Oct 30, 2006, 12:02:04 PM
On Mon, Oct 30, 2006 at 04:17:54AM -0800, rajeev joseph sebastian wrote:
> Hello Rich Felker,
>
> It is impossible to fit Malayalam "glyphs" into a given width class,
> if you want even barely aesthetic text. This is because a given
> sequence of Unicode characters may map into somewhat different
> conjunct styles depending on the font: either proper top to bottom
> (subjoining), or left to right (adjoining) or something in between
> as well :)

Yes, I'm aware of the aesthetic considerations, but given the choice
between seeing nothing at all and seeing something with excessive spacing
(still correctly subjoining, but with extra width/spacing to make up
for the second character not using horizontal space), wouldn't the
latter be preferable? I don't claim it will be pretty but I believe
one can put together something which at least avoids being hideously
ugly. I also don't mean to insult your script by presenting it in an
ugly way (even having "i" and "m" the same width is ugly although much
less severely so), but a terminal and the apps that can be run on it
are quite useful IMO and it seems a shame for many people to be unable
to use them on account of language.

BTW the situation for Kannada seems much less severe... do you know
enough about the script to confirm this?

Thanks for the comments.

Rich


P.S. There's also the possibility of treating syllable clusters as the
fundamental unit of display and requiring a context-sensitive function
rather than wcwidth to measure width; however from my experience
getting application maintainers just to fix their handling of
nonspacing characters is difficult enough without asking them to add
script-specific processing. Also the curses library (which is a bad
library anyway but many apps use it) doesn't support this model. :(
IMO the best long-term solution is to support both, with a terminal
escape to switch the terminal between "dumb" wcwidth-based spacing for
compatibility with apps that are not specifically Indic-script aware,
and "smart" context-sensitive spacing.
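
To make the two models concrete, here is a deliberately naive sketch. Everything in it is an assumption for illustration: the cluster rule (a base character plus any following combining marks) ignores virama-joined conjuncts, the 2-cell slot for clusters containing a spacing mark is an arbitrary number, and the `smart` flag merely stands in for whatever terminal escape would select the mode.

```python
import unicodedata

def clusters(text):
    """Naively group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch   # Mn, Mc, Me attach to the previous cluster
        else:
            out.append(ch)  # anything else starts a new cluster
    return out

def cluster_width(cluster, smart):
    """Width of one cluster in 'dumb' (per code point) or 'smart' mode."""
    cats = [unicodedata.category(c) for c in cluster]
    if smart:
        # Hypothetical rule: a cluster with a spacing mark gets one
        # fixed 2-cell slot; any other cluster gets a single cell.
        return 2 if "Mc" in cats else 1
    # Dumb mode: 0 for nonspacing/enclosing marks, 1 for everything else.
    return sum(0 if cat in ("Mn", "Me") else 1 for cat in cats)

def line_width(text, smart=False):
    """Total width of a string under the selected mode."""
    return sum(cluster_width(c, smart) for c in clusters(text))
```

In dumb mode the strings U+0D30 U+0D4A and U+0D30 U+0D46 U+0D3E measure 2 and 3 columns; in smart mode each is a single cluster and both measure 2.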
