Bidi considered harmful? :)


Rich Felker
Aug 31, 2006, 11:33:06 PM

I read an old thread on the XFree86 i18n list started by Markus Kuhn
suggesting (rather strongly) that bidi should not be supported at the
terminal level, as well as accusations (from other sources) by the
author of Yudit that the UAX#9 bidi algorithm results in serious
security issues due to the irreversibility of the transformation and
that it inevitably butchers mathematical formulae.

I've also considered examples on my own, such as a program (not
necessarily terminal-aware, just text output) that prints lines of the
form "%s %d %d %s" without any special treatment (such as putting
explicit embedding marks around the %s fields) for bidi text, or a
terminal-based program that draws interface elements over top of
existing RTL text, resulting in nonsense.

In all cases, my personal opinion has been not just that UAX#9 is
broken, but that there's no way to implement any sort of implicit bidi
in a terminal emulator or in the display of text/plain data without
every single program having to go _far_ out of its way to ensure that
it won't give incorrect output when the input contains RTL characters,
which simply isn't going to happen, especially since it would
interfere with use in non-RTL scenarios. Other people may have
different opinions but I have not seen any viable solutions.

At the same time, I'm also very dissatisfied with the lack of proper
support for RTL scripts/languages in most applications and especially
at the terminal level, especially since Arabic is in such widespread
use and has great political importance in world affairs these days. I
do not accept that the solution is just to print characters in the
wrong visual order.

.eerga ll'uoy tcepxe I ylbatrofmoc ecnetnes siht daer nac uoy sselnU

I experimented with the idea of mirroring glyphs to improve
readability, and was fairly surprised by how little it helped my
perception. Reading English text that had been graphically mirrored
remained almost as difficult as reading the above line, with the b/d
and p/q pairs causing significant pause in comprehension.

So then, reading UAX#9 again, I stumbled across the only section
that's not completely stupid (IMO of course):

5.4 Vertical Text

In the case of vertical line orientation, the bidirectional
algorithm is still used to determine the levels of the text.
However, these levels are not used to reorder the text, since the
characters are usually ordered uniformly from top to bottom.
Instead, the levels are used to determine the rotation of the
text. Sometimes vertical lines follow a vertical baseline in which
each character is oriented as normal (with no rotation), with
characters ordered from top to bottom whether they are Hebrew,
numbers, or Latin. When setting text using the Arabic script in
vertical lines, it is more common to employ a horizontal baseline
that is rotated by 90° counterclockwise so that the characters are
ordered from top to bottom. Latin text and numbers may be rotated
90° clockwise so that the characters are also ordered from top to
bottom.

What this provides is a suggested formatting that makes RTL and LTR
scripts both readable in a single-directional context, a vertical one.
Combined with the recent Mongolian script discussion on this list, I
believe this offers an alternate presentation form for documents that
mix LTR and RTL text without using bidi.

I'm not suggesting that everyone should switch to vertically oriented
terminals or text-file presentation, although Mongolian users might
like such a setup, and it could certainly be offered as one
presentation option that's fair to RTL and LTR users alike by making
both kinds of scripts quite readable.

The key idea to take from the Mongolian discussion and from UAX#9 5.4
is that, by having glyphs for LTR and RTL scripts rotated 180°
relative to one another, both can appear legible in a common
directionality. Thus, perhaps LTR users could present legible
RTL-script text by rotating all glyphs 180° and displaying them in LTR
order, and likewise RTL users could use a dominant RTL direction with
LTR glyphs rotated 180° [1]. Like with Mongolian, directionality could
become a localized user preference, rather than a property of the
script.
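
In terminal terms, the drawing loop would be nothing more than the
following rough sketch; is_rtl_script() and the draw_* calls are
placeholders for whatever classification and rasterization routines
the real renderer provides:

#include <stddef.h>
#include <wchar.h>

extern int  is_rtl_script(wchar_t c);          /* placeholder classifier  */
extern void draw_glyph(wchar_t c);             /* placeholder rasterizers */
extern void draw_glyph_rotated_180(wchar_t c);
extern void advance_cell(void);                /* always moves rightward  */

/* Strictly unidirectional rendering: cells always advance LTR, and
   glyphs of RTL-script characters are drawn rotated 180 degrees so
   they remain legible to an RTL reader. */
void draw_line(const wchar_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (is_rtl_script(s[i]))
            draw_glyph_rotated_180(s[i]);
        else
            draw_glyph(s[i]);
        advance_cell();    /* never switches direction mid-line */
    }
}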

Does this actually work?

I repeated my experiment with English text reading, rotating the
graphic representation by 180° rather than mirroring it left-right. I
was pleased to find that I could read it almost as easily (though
not quite as fast) as ordinary LTR English text. Surprisingly, p/d and b/q
confusion did not arise, perhaps due to the obvious visual distinction
between the ascent/descent space of the glyphs.

I do not claim these tests are scientific, since the only subject
participating was myself. :) But they are suggestive of an alternative
possible presentation form for mixed LTR/RTL scripts without utilizing
bidirectionality. I consider bidirectionality harmful because:

- It is inherently slow for one's eyes to jump back and forth
switching directions while reading a single paragraph.
- It quickly becomes impossible to read quotations with multiple
levels of directional embedding. Forget UAX#9's 61 levels; 3 levels
are already undecipherable without slow and meticulous work.
- Implicit directionality is impossible to resolve without interfering
with sane people's expectations under string operations. In
particular the UAX#9 insanity involves _semantic_ interpretations of
text contents based on presupposed cultural conventions (like
whether a comma is a thousands separator or a list separator), which
are simply not valid assumptions you can make at such a low level.
- Visual order does not uniquely convey the logical order.

This is not to say that bidirectional formatting doesn't have its
place, or that, used correctly without multiple embedding levels,
with well-set block quotes, etc., it can't be legible. I also do not
preclude use of advanced ECMA-48 features for explicit bidi at the
terminal level. But I'd like to propose unidirectional formatting with
adjusted glyph orientation as a more logical (and perhaps more easily
readable) alternative to be used in terminal emulators and perhaps
also other contexts where accurate representation of the logical order
is required or where multiple levels of quoting are in use.

The most important thing to realize is that this proposal is not to
reject traditional ways of writing RTL scripts. The proposal is to
reject the (very stupid IMO) idea of mixing LTR and RTL
directionalities in a single paragraph context, except in the case
where higher-level formatting (which is inherently not available in a
plain text file or text printed to stdout) can control it.


Rich

[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/


George W Gerrity
Sep 1, 2006, 2:32:40 AM

On 2006-09-01, at 13:33, Rich Felker wrote:

> I read an old thread on the XFree86 i18n list started by Markus Kuhn
> suggesting (rather strongly) that bidi should not be supported at the
> terminal level, as well as accusations (from other sources) by the
> author of Yudit that the UAX#9 bidi algorithm results in serious
> security issues due to the irreversibility of the transformation and
> that it inevitably butchers mathematical formulae.
>
> I've also considered examples on my own, such as a program (not
> necessarily terminal-aware, just text output) that prints lines of the
> form "%s %d %d %s" without any special treatment (such as putting
> explicit embedding marks around the %s fields) for bidi text, or a
> terminal-based program that draws interface elements over top of
> existing RTL text, resulting in nonsense.
>
> In all cases, my personal opinion has been not just that UAX#9 is
> broken, but that there's no way to implement any sort of implicit bidi
> in a terminal emulator or in the display of text/plain data without
> every single program having to go _far_ out of its way to ensure that
> it won't give incorrect output when the input contains RTL characters,
> which simply isn't going to happen, especially since it would
> interfere with use in non-RTL scenarios. Other people may have
> different opinions but I have not seen any viable solutions.

I did try to tell you that doing a terminal emulation properly would
be complex. I don't know if the algorithm is broken: I doubt it. But
it is difficult getting it to work properly and it essentially
requires internal tables for every glyph describing its direction and
orientation.

No one using Arabic script would accept reading it top to bottom: it
is simply never done (to the best of my knowledge), and so any
terminal emulator claiming to work with any script had better be able
to render the text correctly, including mixing RTL and LTR.

George
------

Rich Felker
Sep 1, 2006, 9:41:44 AM

On Fri, Sep 01, 2006 at 04:32:40PM +1000, George W Gerrity wrote:
> I did try to tell you that doing a terminal emulation properly would
> be complex. I don't know if the algorithm is broken: I doubt it. But
> it is difficult getting it to work properly and it essentially
> requires internal tables for every glyph describing its direction and
> orientation.

If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken. Even if it can be handled at the terminal
emulator level with special escapes and whatnot (and I believe it can,
albeit in very ugly ways) it simply cannot be handled in a plain text
file, for reasons like:

columna COLUMNB 1234 5678 columnc
columna COLUMNB 1234 5678 COLUMNC

Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havoc that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).
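
(To be concrete about what that "special treatment" would even look
like, here is a sketch of a program defending its columns with
explicit marks -- LRM, U+200E, written out as UTF-8. It works for
display-only output, but it changes the bytes on the wire, which is
exactly what a machine-parsed file cannot tolerate.)

#include <stdio.h>

#define LRM "\xe2\x80\x8e"  /* U+200E LEFT-TO-RIGHT MARK in UTF-8 */

/* Sketch: bracketing the variable fields with LRM keeps the two
   numbers anchored to the LTR base direction even when string1 ends
   in, or string4 begins with, RTL text. The marks alter the byte
   stream, so this is only tolerable for human-only output. */
void print_row(const char *string1, int num2, int num3, const char *string4)
{
    printf("%s" LRM " %d %d " LRM "%s\n", string1, num2, num3, string4);
}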

One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs, a
paragraph break character, the implicit(?) deprecation of the newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.

This does not mean that storing text in 'visual order' is acceptable
either; that's just disgusting and makes correct ligatures/shaping
impossible. It just means that you cannot create a bidirectional
presentation from a text file without higher level markup. Instead you
can use a vertical presentation or either LTR or RTL presentation with
the opposite-directionality glyphs rotated 180°.

My observations were that this sort of presentation is much easier to
edit and quite possibly easier to read than a format where your eyes
have to switch scanning directions.

I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.

> > [...]


> >[1] There is a small problem that even without LTR scripts mixed in,
> >most RTL scripts are "bidirectional" due to numbers being written LTR.
> >However supporting reversed display of individual numbers (or even
> >individual words) is a trivial problem compared to full bidi text flow
> >and can be done without compromising reversibility and without complex
> >algorithms that cause misinterpretation of adjacent text.
>
> No one using Arabic script would accept reading it top to bottom: it
> is simply never done (to the best of my knowledge), and so any
> terminal emulator claiming to work with any script had better be able
> to render the text correctly, including mixing RTL and LTR.

You misread the above. Of course no one using LTR scripts would want
to read top-to-bottom either. The intent is that users of RTL scripts
could use an _entirely_ RTL terminal with the LTR characters' glyphs
rotated 180° while LTR users could use an _entirely_ LTR terminal with
RTL glyphs rotated 180°. The exception noted in the footnote is that
RTL scripts actually require "bidi" for numbers, but as I noted, this
is trivial compared to full bidi and suffers from none of its
fundamental problems.

The vertical orientation thing is mostly of interest to Mongolian
users and perhaps some East Asian users, but it could also be
interesting to (a very few) users of both LTR and RTL scripts who use
both frequently and who want a more equal treatment of both,
especially if they find reading upside-down difficult.

Rich


P.S. Do you have any good screenshots with RTL or LTR embedded text?
If so I can prepare some modified images to show what I mean and you
can see what you think of readability.

Mark Leisher
Sep 1, 2006, 11:36:44 AM

Rich Felker wrote:
>
> If that were the problem it would be trivial. The problems are much
> more fundamental. The key examples you should look at are things like:
> printf("%s %d %d %s\n", string1, number2, number3, string4); where the
> output is intended to be columnar. Everything is fine until someone
> puts in data where string1 ends in RTL text and string4 begins with
> RTL text, in which case the numbers switch places. This kind of
> instability is not just awkward; it shows that implicit bidi is
> fundamentally broken.

I can say with certainty, born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing,"
that there are ambiguities that simply can't be avoided. Like your
example.

Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.

The Unicode bidi reordering algorithm is not fundamentally broken; it
simply provides a result that is correct in many, but not all, cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.

>
> Implicit bidi requires interpreting a flow of plain text as
> sentence/paragraph content which is simply not a reasonable
> assumption. Consider also what would happen if your text file is two
> preformatted 32-character-wide paragraph columns side-by-side. Now
> imagine the kind of havoc that could result if this sort of insanity
> took place in the presentation of configuration files with critical
> security settings, for instance where the strings are usernames (which
> MUST be able to contain any letter character from any language) and
> the numbers are permission levels. And certainly you can't just throw
> explicit direction markers into a config file like that because they'd
> alter the semantics (which should be purely byte-oriented; there's no
> reason any program not displaying text should include code to process
> the contents).
>

So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.

> One of the unacceptable things that the Unicode consortium has done
> (as opposed to ISO 10646 which, after their initial debacle, has been
> quite reasonable and conservative in what they specify) is to presume
> they can redefine what a text file is. This has included BOMs,
> paragraph break character, implicit(?) deprecation of newline
> character as a line/paragraph break, etc. Notice that all of these
> redefinitions have been universally rejected by *NIX users because
> they are incompatible with the *NIX notion of a text file. My view is
> that implicit bidi is equally incompatible with text files and should
> be rejected for the same reasons.
>

You left out the part where Unicode says that none of these things is
strictly required. The *NIX community didn't reject anything. They
didn't need to. You also seem unaware of how much effort was made by
ISO, the Unicode Consortium, and all the national standards bodies to
avoid breaking a lot of existing practice.

I highly recommend participating in any standards development process
managed by any national or international standards body. You will find
an obsession with avoiding any break with existing practice.

>
> I'm not unwilling to support implicit bidi if somebody else wants to
> code it, but the output WILL BE WRONG in many cases and thus will be
> off by default. The data needed to do it correctly is simply not
> there.

Why is it someone else's responsibility to code it? You are the one
who finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.

Human languages and the scripts used to represent them are messy. There
are no neat solutions. Get used to it.

Good day and good luck.
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, MSC 3CRL
Las Cruces, NM 88003

We find comfort among those who agree with us, growth among those
who don't.  -- Frank A. Clark

Rich Felker
Sep 1, 2006, 2:08:03 PM

On Fri, Sep 01, 2006 at 09:36:44AM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >If that were the problem it would be trivial. The problems are much
> >more fundamental. The key examples you should look at are things like:
> >printf("%s %d %d %s\n", string1, number2, number3, string4); where the
> >output is intended to be columnar. Everything is fine until someone
> >puts in data where string1 ends in RTL text and string4 begins with
> >RTL text, in which case the numbers switch places. This kind of
> >instability is not just awkward; it shows that implicit bidi is
> >fundamentally broken.
>
> I can say with certainty born of 10+ years of trying to implement an
> implicit bidi reordering routine that "just does the right thing," there
> are ambiguities that simply can't be avoided. Like your example.
>
> Are one or both numbers associated with the RTL text or the LTR text?
> Simple question, multiple answers. Some answers are simple, some are not.

Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that the
syntax of a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.

> The Unicode bidi reordering algorithm is not fundamentally broken, it
> simply provides a result that is correct in many, but not all cases. If
> you can defy 30 years of experience in implicit bidi reordering
> implementations and come up with one that does the correct thing all the
> time, you could be a very rich man.

Why is implicit so important? A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.

In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
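
Roughly, for the numbers case, I have in mind something like this
sketch. It computes screen order (left to right) for one
whitespace-delimited word in an RTL presentation; iswdigit() is a
stand-in for a proper digit-class test, and embedded LTR words could
be handled by the same run logic:

#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

static void rev(wchar_t *s, size_t a, size_t b)  /* reverse s[a..b) */
{
    while (b - a > 1) {
        wchar_t t = s[a];
        s[a++] = s[--b];
        s[b] = t;
    }
}

/* Convert one word from logical order to screen order for an RTL
   presentation: reverse the whole word so it flows RTL, then
   re-reverse each maximal digit run so numbers still read LTR.
   Purely word-local, hence trivially reversible. */
void word_to_screen_order(wchar_t *w, size_t n)
{
    rev(w, 0, n);
    for (size_t i = 0; i < n; ) {
        if (iswdigit((wint_t)w[i])) {
            size_t j = i;
            while (j < n && iswdigit((wint_t)w[j]))
                j++;
            rev(w, i, j);
            i = j;
        } else {
            i++;
        }
    }
}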

> So you have a choice, adapt your config file reader to ignore a few
> characters or come up with an algorithm that displays plain text
> correctly all the time.

What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)

> >One of the unacceptable things that the Unicode consortium has done
> >(as opposed to ISO 10646 which, after their initial debacle, has been
> >quite reasonable and conservative in what they specify) is to presume
> >they can redefine what a text file is. This has included BOMs,
> >paragraph break character, implicit(?) deprecation of newline
> >character as a line/paragraph break, etc. Notice that all of these
> >redefinitions have been universally rejected by *NIX users because
> >they are incompatible with the *NIX notion of a text file. My view is
> >that implicit bidi is equally incompatible with text files and should
> >be rejected for the same reasons.
> >
>
> You left out the part where Unicode says that none of these things is
> strictly required.

This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.

> The *NIX community didn't reject anything. They
> didn't need to. You also seem unaware of how much effort was made by
> ISO, the Unicode Consortium, and all the national standards bodies to
> avoid breaking a lot of existing practice.

I'm aware that, unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2, which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.

On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...

Unicode is far from ideal, but it's what we're stuck with, I agree.
However, UAX#9 is inconsistent with the definition of a text file and
with good programming practice, and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!

> >I'm not unwilling to support implicit bidi if somebody else wants to
> >code it, but the output WILL BE WRONG in many cases and thus will be
> >off by default. The data needed to do it correctly is simply not
> >there.
>
> Why is it someone else's responsibility to code it? You are the one that
> finds decades of experience unacceptable. Stop whining and fix it.
> That's what I did. I'm still working on it 13 years later, but I'm not
> complaining any more.

You did not fix it because it cannot be fixed, any more than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a csv list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.

The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
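
(By "special treatment" I mean something like the following sketch:
reversing by cluster rather than by code point, so combining marks
stay attached to their bases. is_combining() is a placeholder -- a
real classifier would test the Unicode Mn/Me/Mc categories -- and
Arabic contextual shaping would additionally have to be applied
before the swap.)

#include <stddef.h>
#include <wchar.h>

extern int is_combining(wchar_t c);  /* placeholder classifier */

/* Reverse a line for display, cluster by cluster: each base
   character keeps its trailing combining marks in logical order,
   but the clusters themselves are emitted in reverse order. */
void reverse_clusters(const wchar_t *in, size_t n, wchar_t *out)
{
    size_t i = 0;
    while (i < n) {
        size_t j = i + 1;                   /* find end of cluster  */
        while (j < n && is_combining(in[j]))
            j++;
        for (size_t k = i; k < j; k++)      /* place cluster at the */
            out[n - j + (k - i)] = in[k];   /* mirrored position    */
        i = j;
    }
}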

> Human languages and the scripts used to represent them are messy.
> There are no neat solutions. Get used to it.

Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy. UAX#9 requires
imposing language semantics onto characters which is blatantly wrong
and which is the source of the mess. This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.

Rich

Mark Leisher
Sep 1, 2006, 5:46:44 PM

Rich Felker wrote:
>> I can say with certainty born of 10+ years of trying to implement an
>> implicit bidi reordering routine that "just does the right thing," there
>> are ambiguities that simply can't be avoided. Like your example.
>>
>> Are one or both numbers associated with the RTL text or the LTR text?
>> Simple question, multiple answers. Some answers are simple, some are not.
>
> Exactly. Unicode bidi algorithm assumes that anyone putting bidi
> characters in a text stream will give them special consideration and
> manually resolve these issues with explicit embedding. That is, it
> comes from the word processor mentality of the designers of Unicode.
> They never stop to think that maybe an automated process that doesn't
> know about character semantics could be writing strings, or that
> syntax in a particular text file (like passwd, csv files, tsv files,
> etc.) could preclude such treatment.
>

Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice neat solution. Unicode
went farther than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.

>> The Unicode bidi reordering algorithm is not fundamentally broken, it
>> simply provides a result that is correct in many, but not all cases. If
>> you can defy 30 years of experience in implicit bidi reordering
>> implementations and come up with one that does the correct thing all the
>> time, you could be a very rich man.
>
> Why is implicit so important?

Why does plain text still exist?

> A bidi algorithm with minimal/no
> implicit behavior works fine as long as you are not mixing
> languages/scripts, and when mixing scripts it makes sense to use
> explicit embedding -- especially since the cases of mixed scripts that
> MUST work without formatting controls are files that are meant to be
> machine-interpreted as opposed to pretty-printed for human
> consumption.

I'm not quite sure what point you are trying to make here. Do away with
plain text?

>
> In particular, an algorithm that only applies reordering within single
> 'words' would give the desired effects for writing numbers in an RTL
> context and for writing single LTR words in a RTL context or single
> RTL words in a LTR context. Anything more than that (with unlimited
> long range reordering behavior) would then require explicit embedding.
>

You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.

>> So you have a choice, adapt your config file reader to ignore a few
>> characters or come up with an algorithm that displays plain text
>> correctly all the time.
>
> What should happen when editing source code? Should x = FOO(BAR);
> have the argument on the left while x = FOO(bar); has it on the right?
> Should source code require all RTL identifiers to be wrapped in
> embedding codes? (They're illegal in ISO C and any language taking
> identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
> characters are legal like all other characters used to write non-dead
> languages.)

This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.

>> You left out the part where Unicode says that none of these things is
>> strictly required.
>
> This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
> consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
> 1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
> files as it is written because text files do not define paragraphs.
>

How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.

BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7 which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see offhand.

>
> I'm aware that unlike many other standardization processes, the
> Unicode Consortium was very inconsistent in its application of this
> rule. Many people consider Han unification to break existing practice.
> UCS-2 which they initially tried to push onto people, as well as
> UTF-1, heavily broke existing practice as well. The semantics of SHY
> break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
> broke existing practice on Windows by causing MS's implementation of
> wchar_t to violate C/POSIX by not representing a complete character
> anymore. Etc. etc. etc.

Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.

UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language. The wchar_t type was in the works before UCS-2
came along. And in most implementations it could hold a UCS-2 character.
I don't recall UTF-1 being around long enough to have much of an impact.
Consider how quickly it was discarded in favor of UTF-8. And I certainly
don't recall UTF-1 being forced on anyone.

>
> On the other hand they had no problem with filling the beginning of
> the BMP with useless legacy characters for the sake of compatibility,
> thus forcing South[east] Asian scripts which use many characters in
> each word into the 3-byte range of UTF-8...
>

Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
you even considered why they ended up in that range?

> Unicode is far from ideal, but it's what we're stuck with, I agree.
> However UAX#9 is inconsistent with the definition of a text file and
> with good programming practice and thus alternate ways to present RTL
> text acceptably (such as an entirely RTL display for RTL users) are
> needed. I've read rants from some of the Arabeyes folks that they're
> so disappointed with UAX#9 that they'd rather go the awful route of
> storing text backwards!!
>

So are you implying that good programming practice requires lines to
be ended with a newline and paragraphs to be separated by two newlines?
What about the 25-year convention of CRLF on DOS/Win? What about the
20-year practice of using CR on Mac? Should we denounce them as
heretics to be excommunicated and unilaterally dictate to all that
newline is the only answer, just like you seem to think the Unicode
Consortium did?

Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill these companies or
force them to go Unicode at some level.

>> Why is it someone else's responsibility to code it? You are the one that
>> finds decades of experience unacceptable. Stop whining and fix it.
>> That's what I did. I'm still working on it 13 years later, but I'm not
>> complaining any more.
>
> You did not fix it because it cannot be fixed, any more than you can
> tell me whether 1,200 means the number 1200 (printed in an ugly legacy
> form) or a csv list of 1 and 200. Nor can I fix it. I'm well aware
> that any implicit bidi at the terminal level WILL display blatantly
> wrong and misleading information in numerous real world cases, and
> that text will jump around the terminal in an unpredictable and
> illogical fashion under cursor control and deletion, replacement, or
> insertion over existing text. As such, I deem such a feature a waste
> of time to implement. It will mess up more stuff than it 'fixes'.
>

So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.

> The alternatives are either to display characters in the wrong order
> (siht ekil) or to unify the flow of text to one direction without
> altering the visual representation (which rotation accomplishes).
> Naturally fullscreen programs can draw bidi text in its natural
> directionality either by swapping character order (but some special
> treatment of combining marks and Arabic shaping must be done by the
> application first in order for this to work, and it will render
> copy-and-paste from the terminal mostly useless) or by using ECMA-48
> bidi controls (but don't expect anyone to use these until curses is
> adapted for it and until screen supports it).

Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.

I have a lot of experience with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
All I will say about them is that Unicode is a lot easier to deal with. Have
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.

Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.

>
> Languages are messy. Scripts are not, except for a very few bidi
> scripts. Even Indic scripts are relatively easy.

Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character. Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.

This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.

Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.

> UAX#9 requires
> imposing language semantics onto characters which is blatantly wrong
> and which is the source of the mess.

If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.

> This insistence on making simple
> things into messes is why no real developers want to support i18n and
> why only the crap like GNOME and KDE and OOO support it decently. I
> believe that both 'sides' are wrong and that universal i18n on unix is
> possible but only after you accept that unix lives by the New Jersey
> approach and not the MIT approach.

I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.

<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>


--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, MSC 3CRL
Las Cruces, NM 88003

We find comfort among those who agree with us, growth among those
who don't.  -- Frank A. Clark


Rich Felker
Sep 1, 2006, 8:01:58 PM

On Fri, Sep 01, 2006 at 03:46:44PM -0600, Mark Leisher wrote:
> Did it ever occur to you that it wasn't the "word processing mentality"
> of the Unicode designers that led to ambiguities surviving in plain
> text? It is simply the fact that there is no nice neat solution. Unicode
> went farther than just about anyone else in solving the general case of
> reordering plain bidi text for display without explicit directional codes.

It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct for
sentences written in human languages, which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.

> Why does plain text still exist?

Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.

Or I could just ask: should we write C code in MS Word .doc format?

> >A bidi algorithm with minimal/no
> >implicit behavior works fine as long as you are not mixing
> >languages/scripts, and when mixing scripts it makes sense to use
> >explicit embedding -- especially since the cases of mixed scripts that
> >MUST work without formatting controls are files that are meant to be
> >machine-interpreted as opposed to pretty-printed for human
> >consumption.
>
> I'm not quite sure what point you are trying to make here. Do away with
> plain text?

No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.

> >In particular, an algorithm that only applies reordering within single
> >'words' would give the desired effects for writing numbers in an RTL
> >context and for writing single LTR words in a RTL context or single
> >RTL words in a LTR context. Anything more than that (with unlimited
> >long range reordering behavior) would then require explicit embedding.
>
> You are aware that numeric expressions can be written differently in
> Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
> differ (i.e. 1/2 in Latin and Hebrew would be presented as 2/1 in
> Arabic). This also affects other characters often used with numbers such
> as percent and dollar sign. So even within strictly RTL scripts,
> different reordering is required depending on which script is being
> used. But if you know a priori which script is in use, reordering is
> trivial.

This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.

> This is the choice of each programming language designer: either allow
> directional override codes in the source or ban them. Those that ban
> them obviously assume that knowledge of the language's syntax is
> sufficient to allow an editor to present the source code text reasonably
> well.

It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.

> >>You left out the part where Unicode says that none of these things is
> >>strictly required.
> >
> >This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
> >consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
> >1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
> >files as it is written because text files do not define paragraphs.
>
> How is a line ending with newline in a text file not a paragraph? A
> poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
> Unicode Standard says paragraph separators are required for the
> reordering algorithm. There is no reason why a line can't be viewed as a
> paragraph. And it even works reasonably well most of the time.

Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first character of a new line happens to be an
LTR character (some embedded English text?) in a (semantic) paragraph
that's Arabic or Hebrew.

As you acknowledge below, a line is not necessarily an
unlimited-length object, and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.

> BTW, what part of ISO/IEC 9899 are you referring to? All I see is
> §7.19.2.7 which says something about lines being limited to 254
> characters and a terminating newline character. No definitions of lines
> or paragraphs that I see off hand.

I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..

> Han unification did indeed break existing practice, but I think you will
> find that the IRG (group of representatives from all Han-using
> countries) feels that in the long run, it was the best thing to do.

I agree it was the best thing to do. I just pointed it out as being
contrary to your claim that they made every effort not to break
existing practice.

> UCS-2 didn't so much break existing practice as come along at one of the
> most confusing periods of internationalization retrofitting of the C
> libraries and language.

I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.

> Those "useless" legacy characters avoided breaking many existing
> applications, most of which were not written for Southeast Asia. Some
> scripts had to end up in the 3-byte range of UTF-8. Are you in a
> position to determine who should and should not be in that range? Have

IMO the answer is common sense. Languages that have a low information
density per character (lots of letters/marks per word, especially
Indic) should be in the 2-byte range, and those with high information
density (especially ideographic) should be in the 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.

Of course it's pointless to discuss this since we can't change it now
anyway.

> you even considered why they ended up in that range?

Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still, the legacy
characters could have been put at the end with the CJK compat forms and
preshaped Arabic forms, etc. or even outside the BMP.

> Like others who didn't like the Unicode bidi reordering approach, the
> Arabeyes people were welcome to continue doing things the way they
> wanted. Interoperability problems often either kill these companies or
> force them to go Unicode at some level.

Thankfully there's not too much room for interoperability problems
with the data itself as long as you stick to logical order, especially
since the need for more than a single embedding level is rare. Unless
you're arguing for visual order, the question is entirely a display
matter: whether bidi display is compatible with other requirements.

> So you do understand. If it isn't fixable, what point is there in
> complaining about it? Find a better way.

That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.

> >Naturally fullscreen programs can draw bidi text in its natural
> >directionality either by swapping character order (but some special
> >treatment of combining marks and Arabic shaping must be done by the
> >application first in order for this to work, and it will render
> >copy-and-paste from the terminal mostly useless) or by using ECMA-48
> >bidi controls (but don't expect anyone to use these until curses is
> >adapted for it and until screen supports it).
>
> Hmm. Sounds just like a bidi reordering algorithm I heard about. You
> know. The one the Unicode Consortium is touting.

Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with dir attributes and whatnot, or apps with a
text 'gui' consisting of separated interface elements.

> I have a lot of experience

Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.

> with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.

ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.

I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)

> All I will say about them is Unicode is a lot easier to deal with. Have

Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.

> a look at the old kterm code if you want to see how complicated things
> can get. And that was one of the cleaner implementations I've seen over
> the years.

Does it implement the ECMA-48 version of bidi? Or random unspecified bidi
like mlterm? Or..?

> >Languages are messy. Scripts are not, except for a very few bidi
> >scripts. Even Indic scripts are relatively easy.
>
> Hah! I often hear the same sentiment from people who don't know the
> difference between a glyph and a character.

I think we've established that I know the difference..

> Yes, it is true that Indic,
> and even Khmer and Burmese scripts are relatively easy. All you need to
> do is create the right set of glyphs.

Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.

> This frequently gives you multiple glyph codes for each abstract
> character. To do anything with the text, a mapping between glyph and
> abstract character is necessary for every program that uses that text.

No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.

The same applies to gui apps too if they're using a nice widget kit.
Unfortunately all the existing widget kits are horribly bloated and
very painful to work with for someone not coming from a MS Windows
mentality (i.e. if you want to actually have control over the flow of
execution of your program..).

> Again, I would encourage you to try it yourself. Talk is cheap,
> experience teaches.

That's what I'm working on, but sometimes discussing the issues at the
same time helps.

> >UAX#9 requires
> >imposing language semantics onto characters which is blatantly wrong
> >and which is the source of the mess.
>
> If 30 years of experience has led to blatantly wrong semantics, then
> quit whining about it and fix it! The Unicode Consortium isn't deaf,
> dumb, or stupid. They have been known to honor actual evidence of
> incorrect behavior and change things when necessary. But they aren't
> going to change things just because you find it inconveniently complicated.

They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing though is adding a
support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
application not implementing at least the implicit bidi algorithm must not
interpret RTL characters visually at all.

> >This insistence on making simple
> >things into messes is why no real developers want to support i18n and
> >why only the crap like GNOME and KDE and OOO support it decently. I
> >believe that both 'sides' are wrong and that universal i18n on unix is
> >possible but only after you accept that unix lives by the New Jersey
> >approach and not the MIT approach.
>
> I have been complaining about the general trend to over-complicate and
> over-standardize software for years. These days the "art" of programming
> only exists in the output of a rare handful of programmers. Don't worry
> about it. Software will collapse under its own weight in time. You just
> have to be patient and wait until that happens and be ready with all
> your simpler solutions.

Well, in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and to putting up with
slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.

> <sarcasm>
> But you better hurry up with those simpler solutions, the increasing
> creep of unnecessary complexity into software is happening fast. The
> crash is coming! It will probably arrive with /The Singularity/.
> </sarcasm>

Keep an eye on busybox. It's quickly gaining in features while
shrinking in size, and while currently the i18n support is rather poor
the developers are open to adding good support as long as it's an
option at compile time. Along with my project I've been documenting the
quality, portability, i18n/m17n support, bloat, etc. of lots of other
software too and I'll eventually be making the results available
publicly.

Rich


> ------------------------------------------------------------------------
> Mark Leisher
> Computing Research Lab
> New Mexico State University
> Box 30001, MSC 3CRL
> Las Cruces, NM 88003
>
> We find comfort among those who agree with us, growth among those
> who don't.  -- Frank A. Clark

The Clark quote somehow seems appropriate to the topic at hand.

Alexandros Diamantidis
Sep 2, 2006, 7:15:50 AM

* Rich Felker [2006-09-01 09:41]:

> The vertical orientation thing is mostly of interest to Mongolian
> users and perhaps some East Asian users, but it could also be

Note that Mongolian is mostly written with the Cyrillic alphabet today.
From what I've seen in movies, articles, etc. (I've never been to
Mongolia myself), the traditional vertical script is still used on
signs on public buildings, on monuments, and in similar cultural
contexts, but not to write longer texts.

--
Alexandros Diamantidis * ad...@hellug.gr

Mark Leisher
Sep 4, 2006, 10:19:02 PM

Rich Felker wrote:
>
> It went farther because it imposed language-specific semantics in
> places where they do not belong. These semantics are correct with
> sentences written in human languages which would not have been hard to
> explicitly mark up, especially with a word processor doing it for you.
> On the other hand they're horribly wrong in computer languages
> (meaning any text file meant to be computer-read and -interpreted, not
> just programming languages) where explicit markup is highly
> undesirable or even illegal.
>

The Unicode Consortium is quite correctly more concerned with human
languages than programming languages. I think you are arguing yourself
into a dead end. Programming languages are ephemeral and some might
argue they are in fact slowly converging with human languages.

>> Why does plain text still exist?
>
> Read Eric Raymond's "The Art of Unix Programming". He answers the
> question quite well.
>

You missed the point completely. Support of implicit bidirectionality
exists precisely because plain text exists. And it isn't going away any
time soon.

> Or I could just ask: should we write C code in MS Word .doc format?

No reason to. Programming editors work well as they are and will
continue to work well after being adapted for Unicode.

>> I'm not quite sure what point you are trying to make here. Do away with
>> plain text?
>
> No, rather that handling of bidi scripts in plain text should be
> biased towards computer languages rather than human languages. This is
> both because plain text files are declining in use for human language
> texts and increasing in use for computer language texts, and because
> the display issues in human language texts can be solved with explicit
> embedding markers (which an editor or word processor could even
> auto-insert for you) while the same marks are unwelcome in computer
> languages.
>

You don't appear to have any experience writing lexical scanners for
programming languages. If you did, you would know how utterly trivial it
is to ignore embedded bidi codes that an editor might introduce.

Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless. You would probably find it
instructive to look at some lexical scanners.
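
For concreteness, a minimal sketch in C of such skipping (an
illustration, not taken from any real scanner; it hardcodes just the
seven explicit bidi format characters instead of consulting the Unicode
character database):

    #include <stdint.h>

    /* Decode one UTF-8 sequence and advance *s. Assumes valid,
       NUL-terminated input for brevity. */
    static uint32_t decode_utf8(const unsigned char **s)
    {
        uint32_t c = *(*s)++;
        int extra;
        if (c < 0x80) return c;
        else if (c < 0xE0) { c &= 0x1F; extra = 1; }
        else if (c < 0xF0) { c &= 0x0F; extra = 2; }
        else               { c &= 0x07; extra = 3; }
        while (extra--)
            c = (c << 6) | (*(*s)++ & 0x3F);
        return c;
    }

    /* LRM, RLM, LRE, RLE, PDF, LRO, RLO -- the controls a "smart"
       editor might insert around RTL words. */
    static int is_bidi_format(uint32_t c)
    {
        return c == 0x200E || c == 0x200F ||
               (c >= 0x202A && c <= 0x202E);
    }

    /* Next codepoint as seen by the scanner: bidi format controls
       are skipped as if they were never there. (Inside string
       literals one would NOT do this, of course.) */
    static uint32_t scan_cp(const unsigned char **s)
    {
        uint32_t c;
        do c = decode_utf8(s); while (is_bidi_format(c));
        return c;
    }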

>>> In particular, an algorithm that only applies reordering within single
>>> 'words' would give the desired effects for writing numbers in an RTL
>>> context and for writing single LTR words in a RTL context or single
>>> RTL words in a LTR context. Anything more than that (with unlimited
>>> long range reordering behavior) would then require explicit embedding.
>> You are aware that numeric expressions can be written differently in
>> Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
>> differ (i.e. 1/2 in Latin and Hebrew would be presented as 2/1 in
>> Arabic). This also affects other characters often used with numbers such
>> as percent and dollar sign. So even within strictly RTL scripts,
>> different reordering is required depending on which script is being
>> used. But if you know a priori which script is in use, reordering is
>> trivial.
>
> This is part of the "considered harmful" of bidi. :)
> I'm not familiar with all this stuff, but as a mathematician I'm
> curious how mathematicians working in these languages write. BTW
> mathematical notation is an interesting example where traditional
> storage order is visual and not logical.
>

Considered harmful? This is standard practice in these languages and has
been for a long time. You can't seriously expect readers of RTL
languages to just throw away everything they've learned since childhood
and learn to read their mathematical expressions backwards? Or simply
require that their scripts never appear in a plain text file? That is
ignorant at best and arrogant at worst.

>> This is the choice of each programming language designer: either allow
>> directional override codes in the source or ban them. Those that ban
>> them obviously assume that knowledge of the language's syntax is
>> sufficient to allow an editor to present the source code text reasonably
>> well.
>
> It's simply not acceptable to need an editor that's aware of language
> syntax in order to present the code for viewing and editing. You could
> work around the problem by inserting dummy comments to prevent the
> bidi algo from taking effect but that's really ugly and essentially
> makes RTL scripts unusable in programming if the editor applies
> Unicode bidi algo to the display.
>

You really need to start looking at code and stop pontificating from a
poorly understood position. Just about every programming editor out
there is already aware of programming language syntax. Many different
programming languages in most cases.

>>>> You left out the part where Unicode says that none of these things is
>>>> strictly required.
>>> This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
>>> consist of LINES. If you don't believe me, read ISO/IEC 9899 or IEEE
>>> 1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
>>> files as it is written because text files do not define paragraphs.
>> How is a line ending with newline in a text file not a paragraph? A
>> poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
>> Unicode Standard says paragraph separators are required for the
>> reordering algorithm. There is no reason why a line can't be viewed as a
>> paragraph. And it even works reasonably well most of the time.
>
> Actually it does not work with embedding. If a (semantic) paragraph
> has been split into multiple lines, the bidi embedding levels will be
> broken and cannot be processed by the UAX#9 algorithm without trying
> to reconstruct an idea of what the whole "paragraph" meant. Also, a
> problem occurs if the first character of a new line happens to be an
> LTR character (some embedded English text?) in a (semantic) paragraph
> that's Arabic or Hebrew.
>

This is trivially obvious. Why do you think I said "poorly formatted
paragraph"? The obvious implication is that every once in a while,
reordering errors will happen because the algorithm is being applied to
a single line of a paragraph.

> As you acknowledge below, a line is not necessarily an
> unlimited-length object and in email it should not be longer than 80
> characters (or preferably 72 or so to allow for quoting). So you can't
> necessarily just take the MS Notepad approach of omitting newlines and
> treating lines as paragraphs, although this may be appropriate in some
> uses of text files.

So instead of a substantive argument why a line can't be viewed as a
paragraph, you simply imply that it just can't be done. Weak.

>
> I'm talking about the definition of a text file as a sequence of
> lines, which might (on stupid legacy implementations) even be
> fixed-width fields. It's under the stdio stuff about the difference
> between text and binary mode. I could look it up but I don't feel like
> digging thru the pdf file right now..
>

That section doesn't provide definitions of line or paragraph.

>> Han unification did indeed break existing practice, but I think you will
>> find that the IRG (group of representatives from all Han-using
>> countries) feels that in the long run, it was the best thing to do.
>
> I agree it was best to do too. I just pointed it out as being contrary
> to your claim that they made every effort not to break existing
> practice.
>

For a mathematician, you are quite good at ignoring inconvenient logic.
The phrase "every effort to avoid breaking existing practice" does not
logically imply that no existing practice was broken. Weak.

>
> I suppose this is true and I don't know the history and internal
> politics well enough to know who was responsible for what. However,
> unlike doublebyte/multibyte charsets which were becoming prevalent at
> the time, UCS-2 data does not form valid C strings. A quick glance at
> some historical remarks from unicode.org and Rob Pike suggests that
> UTF-8 was invented well before any serious deployment of Unicode, i.e.
> that the push for UCS-2 was deliberately aimed at breaking things,
> though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
> than the consortium as a whole.
>

You can ask any of the Unicode people from those companies and will get
the same answer. Something had to be done and UCS-2 was the answer at
the time. Conspiracy theories do not substantive argument make.

>> Those "useless" legacy characters avoided breaking many existing
>> applications, most of which were not written for Southeast Asia. Some
>> scripts had to end up in the 3-byte range of UTF-8. Are you in a
>> position to determine who should and should not be in that range? Have
>
> IMO the answer is common sense. Languages that have a low information
> per character density (lots of letters/marks per word, especially
> Indic) should be in 2-byte range and those with high information
> density (especially ideographic) should be in 3-byte range. If it
> weren't for so many legacy Latin blocks near the beginning of the
> character set, most or all scripts for low-density languages could
> have fit in the 2-byte range.
>

So you simply assume that nobody bothered to look into things like
information density et al during the formation of the Unicode
Standard? You don't appear to be aware of the social and political
ramifications involved in making decisions like that. It doesn't matter
if it makes sense from a mathematical point of view, nations and people
are involved.

>> you even considered why they ended up in that range?
>
> Probably at the time of allocation, UTF-8 was not even around yet. I
> haven't studied that portion of Unicode history. Still the legacy
> characters could have been put at the end with CJK compat forms and
> preshaped Arabic forms, etc. or even outside the BMP.

Scripts were placed when information about their encodings became
available to the Unicode Consortium. It's that simple. No big conspiracy
to give SEA scripts short shrift.

>
>> So you do understand. If it isn't fixable, what point is there in
>> complaining about it? Find a better way.
>
> That's what I'm trying to do... Maybe some Hebrew or Arabic users who
> dislike the whole bidi mess (the one Israeli user I'm in contact with
> hates bidi and thinks it's backwards...not a good sample size but
> interesting nonetheless) will agree and try my ideas for a
> unidirectional presentation and like them. Or maybe they'll think it's
> ugly and look for other solutions.
>

Sure. Lots of people don't like the situation, but nobody has come up
with anything better. There is a very good reason for that.

>
> Applications can draw their own bidi text with higher level formatting
> information, of course. I'm thinking of a terminal-mode browser that
> has the bidi text in HTML with <dir> tags and whatnot, or apps with a
> text 'gui' consisting of separated interface elements.
>

Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
Have you checked out how it handles Unicode?

>> I have a lot of experience
>
> Could you tell me some of what you've worked on and what conclusions
> you reached? I'm not familiar with your work.

Well, you can refer to the kterm code for some of my work with ISO/IEC
2022, and I may be able to dig up an ancient version of Motif (ca. 1993)
that I adapted to use ISO/IEC 6429 and ISO/IEC 2022. Shortly after that
first Motif debacle, I attempted unsuccessfully to get a variant of
cxterm working with a combination of the two standards.

The conclusion was simple. The code quickly got too complicated to
debug. All kinds of little boundary (buffer/screen) effects kept
cropping up thanks to multi-byte escape sequences.

>
> ISO 2022 is an abomination, certainly not an acceptable way to store
> text due to its stateful nature, and although it works for
> _displaying_ text, it's ugly even for that.
>
> I've read ECMA-48 bidi stuff several times and still can't make any
> sense of it, so I agree it's disgusting too. It does seem powerful but
> powerful is often a bad thing. :)
>

Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
escape sequences.

>> All I will say about them is Unicode is a lot easier to deal with. Have
>
> Easier to deal with because it solves an easier problem. UAX#9 tells
> you what to do when you have explicit paragraph division and unbounded
> search capability forwards and backwards. Neither of these exists in a
> character cell device environment, and (depending on your view of what
> constitutes a proper text file) possibly not in a text file either. My
> view of a text file (maybe not very popular these days?) is that it's
> a more-restricted version of a character cell terminal (no cursor
> positioning allowed) but with unlimited height.

Having implemented UAX #9 and a couple of other approaches that produce
the same or similar results, I don't see any problem using it to render
text files. If your text file has one paragraph per line, then you will
see occasional glitches in mixed LTR & RTL text.
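
In code, the convention is just one loop; a sketch assuming a
hypothetical bidi_reorder_paragraph() that stands in for a real UAX#9
implementation (not any real library's API):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for a real UAX#9 implementation that
       rewrites one logical-order paragraph into visual order. */
    void bidi_reorder_paragraph(char *text);

    /* The "line == paragraph" convention: run the algorithm on each
       newline-terminated line independently, accepting glitches when
       a semantic paragraph was hard-wrapped across several lines. */
    static void render_text_file(FILE *f)
    {
        char line[4096];
        while (fgets(line, sizeof line, f)) {
            line[strcspn(line, "\n")] = 0;  /* drop the terminator  */
            bidi_reorder_paragraph(line);   /* each line = paragraph */
            puts(line);
        }
    }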

>
>> a look at the old kterm code if you want to see how complicated things
>> can get. And that was one of the cleaner implementations I've seen over
>> the years.
>
> Does it implement ECMA-48 version of bidi? Or random unspecified bidi
> like mlterm? Or..?

kterm had ISO/IEC 2022 support. Very few people attempted to use ISO/IEC
6429 because they didn't understand it very well and they knew how
complicated ISO/IEC 2022 was all by itself.

>> Hah! I often hear the same sentiment from people who don't know the
>> difference between a glyph and a character.
>
> I think we've established that I know the difference..
>
>> Yes, it is true that Indic,
>> and even Khmer and Burmese scripts are relatively easy. All you need to
>> do is create the right set of glyphs.
>
> Exactly. That's a lot of work...for the font designer. Almost no work
> for the application author or for the machine at runtime.
>
>> This frequently gives you multiple glyph codes for each abstract
>> character. To do anything with the text, a mapping between glyph and
>> abstract character is necessary for every program that uses that text.
>
> No, it's necessary only for the terminal. The programs using the text
> need not have any idea what language/script it comes from. This is the
> whole beauty of using such apps.
>

I suspect you missed my point. Using glyph codes as an encoding gets
complicated fast. You can ask anyone who has tried to do any serious NLP
work with pre-Unicode Indic text. We are still having to write analysers
and converters to figure out the correct abstract characters and their
order for many scripts. I can provide a mapping table for one Burmese
encoding that shows how hideously complicated it can get to map a glyph
encoding to the underlying linear abstract characters necessary to do any
kind of linguistic analysis.

> They generally don't change things in incompatible ways, certainly not
> in ways that would require retrofitting existing data with proper
> embedding codes. What they might consider doing though is adding a
> support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
> application not implementing at least implicit bidi algorithm must not
> interpret RTL characters visually at all.

Well, they don't want a program that simply reverses RTL segments
claiming conformance with UAX #9. Is it better to see it backward than
to see it wrong? You can ask native users of RTL scripts about that. And
ask more than one.

>
> Well in many cases my "simple solutions" are too simple for people
> who've gotten used to bloated featuresets and gotten used to putting
> up with slowness, bugs, and insecurity. But we'll see. My whole family
> of i18n-related projects started out with a desire to switch to UTF-8
> everywhere and to have Latin, Tibetan, and Japanese support at the
> console level without increased bloat, performance penalties, and huge
> dependency trees. From there I first wrote a super-small UTF-8-only C
> library and then turned towards the terminal emulator issue, which in
> turn led to the font format issue, etc. etc. :) Maybe after a whole
> year passes I'll have roughly what I wanted.
>

I don't recall having seen your "simple solutions" so I can't dismiss
them off-hand as not being complicated enough yet. Like I said a couple
emails ago, sometimes it doesn't matter if you have a better answer, but
if it really is simple, accurate, and on the Internet, you can count on
it supplanting the bloat eventually.

BTW, now that the holiday has passed, I probably won't have time to
reply at similar length. But it's been fun.
--
---------------------------------------------------------------------------
Mark Leisher
Computing Research Lab Nowadays, the common wisdom is to
New Mexico State University celebrate diversity - as long as you
Box 30001, MSC 3CRL don't point out that people are
Las Cruces, NM 88003 different. -- Colin Quinn

Behdad Esfahbod

unread,
Sep 4, 2006, 10:57:08 PM9/4/06
to
On Mon, 2006-09-04 at 22:19 -0400, Mark Leisher wrote:
> Though I haven't checked myself, I wouldn't be surprised if Perl,
> Python, PHP, and a host of other programming languages weren't already
> doing this, making your concerns pointless. You would probably find it
> instructive to look at some lexical scanners.

To add a sidenote to this otherwise pointless conversation, the
ECMAScript (aka JavaScript) standard actually ignores all format
characters (gen-cat=Cf) in the source code. This has caused a problem
for Persian computing, as U+200C ZERO WIDTH NON-JOINER is Cf and used in
Persian text. Brendan Eich is working on changing the standard to not
ignore formatting characters in string literals (and probably regexps
too.)
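
To illustrate the category problem, a sketch (mine; only a handful of Cf
codepoints are hardcoded, the full set comes from the UCD) -- any
blanket "drop all Cf" filter necessarily takes ZWNJ with it:

    #include <stdint.h>

    /* A few format (Cf) codepoints, for illustration only. */
    static int is_cf(uint32_t c)
    {
        return c == 0x00AD ||                  /* SOFT HYPHEN        */
               (c >= 0x200B && c <= 0x200F) || /* ZWSP, ZWNJ(U+200C!),
                                                  ZWJ, LRM, RLM      */
               (c >= 0x202A && c <= 0x202E) || /* LRE..RLO           */
               c == 0xFEFF;                    /* BOM / ZWNBSP       */
    }

    /* A blanket filter in the spirit of the old ECMAScript rule:
       drops Cf everywhere, string literals included -- so Persian
       words lose the ZWNJ that keeps their letters from joining. */
    static void strip_cf(uint32_t *dst, const uint32_t *src)
    {
        while (*src) {
            if (!is_cf(*src))
                *dst++ = *src;
            src++;
        }
        *dst = 0;
    }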

--
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"

David Starner

unread,
Sep 5, 2006, 12:44:26 AM9/5/06
to
On 9/1/06, Rich Felker <dal...@aerifal.cx> wrote:
> IMO the answer is common sense. Languages that have a low information
> per character density (lots of letters/marks per word, especially
> Indic) should be in 2-byte range and those with high information
> density (especially ideographic) should be in 3-byte range. If it
> weren't for so many legacy Latin blocks near the beginning of the
> character set, most or all scripts for low-density languages could
> have fit in the 2-byte range.

Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size. Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed. The
only application I've seen where text data size was really crucial was
text messaging. Hence, common sense tells _me_ that we should put
scripts used by heavily text-messaging cultures in the 2-byte range;
that is, Latin, Hiragana and Katakana.

Rich Felker

unread,
Sep 5, 2006, 1:13:35 AM9/5/06
to
On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >It went farther because it imposed language-specific semantics in
> >places where they do not belong. These semantics are correct with
> >sentences written in human languages which would not have been hard to
> >explicitly mark up, especially with a word processor doing it for you.
> >On the other hand they're horribly wrong in computer languages
> >(meaning any text file meant to be computer-read and -interpreted, not
> >just programming languages) where explicit markup is highly
> >undesirable or even illegal.
>
> The Unicode Consortium is quite correctly more concerned with human
> languages than programming languages. I think you are arguing yourself
> into a dead end. Programming languages are ephemeral and some might
> argue they are in fact slowly converging with human languages.

Arrg, C is not going away anytime soon. C is THE LANGUAGE as far as
POSIX is concerned. The reason I said "arrg" is that I feel like this
gap between the core values of the "i18n bloatware crowd" and the
"hardcore lowlevel efficient software crowd" is what keeps good i18n
out of the best software. When you talk about programming languages
converging with human languages, somehow all I can think of is Perl...
yuck! Larry Wall's been great about pushing Unicode and UTF-8, but
Perl itself is a horrible mess. The implementation is hopelessly bad
and there's little hope of there ever being a reimplementation.

Anyway as I've said again and again, it's no problem for human
language text to have explicit embedding tagging. It doesn't need to
conform to syntax rules (oh yeah Perl code doesn't need to either ;)).
Fancy editors can even insert tags for you. On the other hand,
stuffing extra control characters into machine-read texts with
specific syntactical and semantic rules is not possible. You can't
even just strip these characters when processing because, depending on
the semantics of the file, they may either be controlling the display
of the file or literal embedding controls to be used when the strings
from the file are printed to their final destination.

> >Or I could just ask: should we write C code in MS Word .doc format?
>
> No reason to. Programming editors work well as they are and will
> continue to work well after being adapted for Unicode.

No, if they perform the algorithm in UAX#9 they will display garbled
unreadable code. Or does C somehow qualify as a "higher level
protocol" for formatting?

> You don't appear to have any experience writing lexical scanners for
> programming languages. If you did, you would know how utterly trivial it
> is to ignore embedded bidi codes an editor might introduce.

I'm quite aware that it's simple to code, but also illegal according
to the specs. Also you're ignoring the more troublesome issues...
Obviously you can't remove them inside strings. :) Issues with
comments too..

> Though I haven't checked myself, I wouldn't be surprised if Perl,
> Python, PHP, and a host of other programming languages weren't already
> doing this, making your concerns pointless.

I doubt it, but even if they do, these are toy languages with one
implementation and no specification (and in Perl's case, it's hopeless
to even try to write one). It's easy to hack
whatever you want and break compatibility with every new release of
the language when your implementation is the only one. It's much
harder when you're working with an international standard for a
language that's been around (and been rather stable!) for approaching
40 years and is intended to have multiple interoperable implementations.

> You can't seriously expect readers of RTL
> languages to just throw away everything they've learned since childhood
> and learn to read their mathematical expressions backwards? Or simply
> require that their scripts never appear in a plain text file? That is
> ignorant at best and arrogant at worst.

I've seen examples that show that UAX#9 just butchers mathematical
expressions in the absence of explicit bidi control.

> You really need to start looking at code and stop pontificating from a
> poorly understood position. Just about every programming editor out
> there is already aware of programming language syntax. Many different
> programming languages in most cases.

Cheap regex-based syntax highlighting is not the same thing at all. But
this is aside from the point, that it's fundamentally WRONG to need a
special tool that knows about the syntax of your computer language in
order to edit it. What if you've designed your own language to solve a
particular problem? Do you have to go and modify your editor to make
it display this text correctly for this language? NO! That's the whole
reason we have plain text. You can edit it without having to have a
special program!

> >As you acknowledge below, a line is not necessarily an
> >unlimited-length object and in email it should not be longer than 80
> >characters (or preferably 72 or so to allow for quoting). So you can't
> >necessarily just take the MS Notepad approach of omitting newlines and
> >treating lines as paragraphs, although this may be appropriate in some
> >uses of text files.
>
> So instead of a substantive argument why a line can't be viewed as a
> paragraph, you simply imply that it just can't be done. Weak.

No, I agree that it can be. I'm just saying that a line can't do all
the things you expect a paragraph to do. In particular, it can't be
arbitrarily long in every plain text context, although it can be in
some.

> >I'm talking about the definition of a text file as a sequence of
> >lines, which might (on stupid legacy implementations) even be
> >fixed-width fields. It's under the stdio stuff about the difference
> >between text and binary mode. I could look it up but I don't feel like
> >digging thru the pdf file right now..
>
> That section doesn't provide definitions of line or paragraph.

See 7.19.2 Streams.

> >I agree it was best to do too. I just pointed it out as being contrary
> >to your claim that they made every effort not to break existing
> >practice.
>
> For a mathematician, you are quite good at ignoring inconvenient logic.
> The phrase "every effort to avoid breaking existing practice" does not
> logically imply that no existing practice was broken. Weak.

Read the history. Han unification was one of the very first points of
Unicode, even though it was obvious that it would break much existing
practice. This seems to have been connected to the misguided goal of
trying to make everything into fixed-width 16bit characters. From what
I understand, early Unicode was making every effort _to break_
existing practice. Their motto was "...begin at 0 and add the next
character" which to me implies "throw out everything that already
exists and start from scratch." I've never seen the early drafts but I
wouldn't be surprised if the original characters 0-127 didn't even
match ASCII.

> >I suppose this is true and I don't know the history and internal
> >politics well enough to know who was responsible for what. However,
> >unlike doublebyte/multibyte charsets which were becoming prevalent at
> >the time, UCS-2 data does not form valid C strings. A quick glance at
> >some historical remarks from unicode.org and Rob Pike suggests that
> >UTF-8 was invented well before any serious deployment of Unicode, i.e.
> >that the push for UCS-2 was deliberately aimed at breaking things,
> >though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
> >than the consortium as a whole.
>
> You can ask any of the Unicode people from those companies and will get
> the same answer. Something had to be done and UCS-2 was the answer at
> the time. Conspiracy theories do not substantive argument make.

I've been researching what I can with the little information available
and it seems that the early Unicode architects developed a strong disgust
for variable-size characters from their experience with Shift_JIS
(which was extremely poorly designed) and other CJK encodings and
developed a dogma that fixed-width was the way to go. There are
numerous references to this sort of thinking in "10 Years of Unicode"
published under history on unicode.org.

> So you simply assume that nobody bothered to look into things like
> information density et al during the formation of the Unicode
> Standard? You don't appear to be aware of the social and political
> ramifications involved in making decisions like that. It doesn't matter
> if it makes sense from a mathematical point of view, nations and people
> are involved.

Latin text (which is mostly ASCII anyway) would go up in size by a few
percent while many languages would go down by 33%. Sounds like a fair
trade. I'm sure there are political ramifications, and of course the
answer is always: do what pleases the countries with the most
money/power rather than doing what serves the largest population and
the population that has the greatest scarcity of storage space...
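
The byte counts themselves are mechanical, since a codepoint's UTF-8
size is a pure function of its scalar value (RFC 3629); a sketch:

    #include <stdint.h>

    /* UTF-8 encoded size of a Unicode scalar value (RFC 3629). */
    static int utf8_len(uint32_t c)
    {
        if (c < 0x80)    return 1; /* ASCII                          */
        if (c < 0x800)   return 2; /* Greek, Cyrillic, Hebrew,
                                      Arabic, ...                    */
        if (c < 0x10000) return 3; /* rest of the BMP: Devanagari
                                      starts at U+0900, just 0x100
                                      past the 2-byte cutoff U+07FF  */
        return 4;                  /* supplementary planes           */
    }

Moving a script from the 3-byte range to the 2-byte range is exactly
the 33% figure above.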

> Scripts were placed when information about their encodings became
> available to the Unicode Consortium. It's that simple. No big conspiracy
> to give SEA scripts short shrift.

Honestly I think they just didn't care about UTF-8 at the time because
they still had delusions that people would switch to UCS-2 for
everything. Also I've been told that the arrangement was intended to
be "West to East"..

> >Applications can draw their own bidi text with higher level formatting
> >information, of course. I'm thinking of a terminal-mode browser that
> >has the bidi text in HTML with <dir> tags and whatnot, or apps with a
> >text 'gui' consisting of separated interface elements.
>
> Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
> Have you checked out how it handles Unicode?

The only app I've seriously checked out is mined simply because most
apps don't have support for bidi on the console (and many still don't
even know how to use wcwidth...! including emacs!! :( ).
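
For reference, here is all that correct column accounting takes -- a
minimal sketch using POSIX wcwidth() and mbrtowc() (assumes a UTF-8
locale; error handling kept terse):

    #define _XOPEN_SOURCE 700
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Terminal columns occupied by a multibyte string: wcwidth()
       gives 0 (combining), 1, or 2 (CJK wide) per character.
       Returns -1 on invalid or non-printable input. */
    static int str_columns(const char *s)
    {
        mbstate_t st = {0};
        wchar_t wc;
        size_t n;
        int cols = 0, w;

        while ((n = mbrtowc(&wc, s, MB_CUR_MAX, &st)) != 0) {
            if (n == (size_t)-1 || n == (size_t)-2)
                return -1;             /* bad or truncated sequence */
            if ((w = wcwidth(wc)) < 0)
                return -1;             /* non-printable */
            cols += w;
            s += n;
        }
        return cols;
    }

    int main(void)
    {
        setlocale(LC_CTYPE, "");       /* needs a UTF-8 locale */
        /* "abc" plus U+6F22 (wide): 3 + 2 = 5 columns */
        printf("%d\n", str_columns("abc\xE6\xBC\xA2"));
        return 0;
    }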

If lynx handles bidi specially I'd be interested in seeing what it
does. However this brings up another interesting question: what should
lynx -dump do? :) Naturally dumping in visual order is wrong, but
generating a text file that will look right when displayed according
to UAX#9 sounds quite difficult, especially when you take multiple
columns, etc. into account. Of course lynx is old broken crap that
doesn't even support tables so maybe it has it easier.. :) These days
I use ELinks, but it has very very poor i18n support. :(

> >I've read ECMA-48 bidi stuff several times and still can't make any
> >sense of it, so I agree it's disgusting too. It does seem powerful but
> >powerful is often a bad thing. :)
>
> Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
> escape sequences.

I'm confused about what you mean by multi-byte escape sequences. What I know
of as ISO 2022 is the charset-switching escapes used for legacy CJK
support and "vt100 linedrawing characters", but you seem to be talking
about something related to bidi. Does ISO 2022 have bidi controls as
well?

> >>All I will say about them is Unicode is a lot easier to deal with. Have
> >
> >Easier to deal with because it solves an easier problem. UAX#9 tells
> >you what to do when you have explicit paragraph division and unbounded
> >search capability forwards and backwards. Neither of these exists in a
> >character cell device environment, and (depending on your view of what
> >constitutes a proper text file) possibly not in a text file either. My
> >view of a text file (maybe not very popular these days?) is that it's
> >a more-restricted version of a character cell terminal (no cursor
> >positioning allowed) but with unlimited height.
>
> Having implemented UAX #9 and a couple of other approaches that produce
> the same or similar results, I don't see any problem using it to render
> text files. If your text file has one paragraph per line, then you will
> see occasional glitches in mixed LTR & RTL text.

Seek somewhere in the middle of the line and type a character of the
opposite directionality. Watch the whole line jump around and the
character you just typed end up in a different column from where your
cursor was placed.

This sort of thing will happen all the time in a terminal when the app
goes to draw interface elements, etc. over top of part of the text. If
it doesn't, i.e. if the terminal implements a sort of "hard implicit
bidi", then the display will just become hopelessly corrupted unless the
program has explicit bidi logic matching the terminal's.

> >>This frequently gives you multiple glyph codes for each abstract
> >>character. To do anything with the text, a mapping between glyph and
> >>abstract character is necessary for every program that uses that text.
> >
> >No, it's necessary only for the terminal. The programs using the text
> >need not have any idea what language/script it comes from. This is the
> >whole beauty of using such apps.
>
> I suspect you missed my point. Using glyph codes as an encoding gets
> complicated fast.

Yes but where did I say anything about glyph codes? In both Unicode
and ISCII text everything is character codes, not glyph codes. Sorry
but I don't understand what you were trying to say..

> Well, they don't want a program that simply reverses RTL segments
> claiming conformance with UAX #9. Is it better to see it backward than
> to see it wrong? You can ask native users of RTL scripts about that. And
> ask more than one.

It says more than that; it says that a program is forbidden from
interpreting the characters visually at all if it doesn't perform at
least the implicit part of UAX#9. From my reading, this means that
UAX#9 deems it worse to show the RTL characters in LTR order than not
to show them at all. It also precludes display strategies like the one
I proposed.

> >Well in many cases my "simple solutions" are too simple for people
> >who've gotten used to bloated featuresets and gotten used to putting
> >up with slowness, bugs, and insecurity. But we'll see. My whole family
> >of i18n-related projects started out with a desire to switch to UTF-8
> >everywhere and to have Latin, Tibetan, and Japanese support at the
> >console level without increased bloat, performance penalties, and huge
> >dependency trees. From there I first wrote a super-small UTF-8-only C
> >library and then turned towards the terminal emulator issue, which in
> >turn led to the font format issue, etc. etc. :) Maybe after a whole
> >year passes I'll have roughly what I wanted.
> >
>
> I don't recall having seen your "simple solutions" so I can't dismiss

http://svn.mplayerhq.hu/libc/trunk/
About 100kb of code and a few kb of data. E.g. iconv is 2kb; it's missing
support for CJK legacy encodings at present, and the final size should be
about 2.5-2.7kb.

Terminal emulator uuterm isn't checked in yet but it's looking like
the whole program with support for all scripts (except RTL scripts, if
you don't count non-UAX#9-conformant display as support) will come to
about 50kb of code statically linked. Plus about 1.5 meg for a complete
font.


On a separate note... maybe it would help if I express and clarify my
view on UAX#9:

I think it very much has its place and it's great when formatting
content that is known to be human-language text for display in the
traditional form expected by most readers. However, IMO what UAX#9
should be seen as is a specification of the correspondence between the
stored "logical order" text and the traditional print form, in a way
as a definition of "logical order" text. It's important to have this
kind of definition for legal purposes especially, so e.g. if someone
has signed a document containing particular bidi text, it's clear what
printed text ordering that binary text is meant to represent and thus
clear what was signed.

On the other hand, I find the whole idea of bidirectionality harmful.
Human language text has always involved ambiguity as far as
interpreting the meaning, but aside from bidi text, at least there is
an unambiguous way to display the characters so that their logical
order is clear to the reader, and this method does not require the
machine to interpret the human language at all.

With bidi thrown in, the presentation completely _fails_ to represent
the logical order of the text. In fact it's possible to
construct bidi text where the presentation order is completely
deceptive... this could, for example, be used for googlebombing or
evading spam filters by permuting the characters of your text to
include or avoid certain words or phrases. The author of Yudit also
identifies examples that have security implications.
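
To make that concrete, a two-line illustration of my own (not one of
the Yudit author's examples):

    #include <stdio.h>

    int main(void)
    {
        /* Logical (stored) order:  a b c RLO d e f
           UAX#9 presentation:      abcfed
           A filter searching for "def" matches the stored bytes,
           while a human reader sees only "fed". */
        const char *s = "abc\xE2\x80\xAE" "def"; /* U+202E = RLO */
        puts(s); /* what you see depends on the renderer -- which
                    is exactly the problem */
        return 0;
    }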

Along with the other reasons I have discussed regarding breaking text
file and character cell sanity, this is why, in my view, bidi is
"considered harmful". I don't expect RTL script users to switch to
LTR. What I do propose is a way for LTR users to view text containing
RTL characters without the need for bidi and without "ekil esnesnon
siht", as well as a way for RTL users to have an entirely-RTL
environment rather than a bidi one. The latter still requires some
more consideration regarding mathematical expressions and numerals. At
this point I have no idea whether such a thing would be of interest to
a significant number of RTL users but I suspect primarily-LTR users
with an occasional need for reading Arabic or Hebrew words or phrases
would like it. Both of these approaches have the side-effect of making
RTL scripts "just work" in any application without the need for
special bidi support at the application level or the terminal level.

> BTW, now that the holiday has passed, I probably won't have time to
> reply at similar length. But it's been fun.

Ah well, I tried to strip my reply down to the most
interesting/relevant parts in case you do have time for some replies,
but it looks like I've still left a lot in.

Thanks for discussing in any case.

Rich

Rich Felker

unread,
Sep 5, 2006, 1:28:29 AM9/5/06
to
On Mon, Sep 04, 2006 at 11:44:26PM -0500, David Starner wrote:
> On 9/1/06, Rich Felker <dal...@aerifal.cx> wrote:
> >IMO the answer is common sense. Languages that have a low information
> >per character density (lots of letters/marks per word, especially
> >Indic) should be in 2-byte range and those with high information
> >density (especially ideographic) should be in 3-byte range. If it
> >weren't for so many legacy Latin blocks near the beginning of the
> >character set, most or all scripts for low-density languages could
> >have fit in the 2-byte range.
>
> Once you compress the data with a decent compression scheme, you may
> as well store the data by writing out the full Unicode name (e.g.
> "LATIN CAPITAL LETTER OU"); the final result will be about the same
> size.

With some compression methods this is true, particularly bz2.

> Furthermore, you can fit a decent sized novel on a floppy
> uncompressed and a decent sized library on a DVD uncompressed.

Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.

> The
> only application I've seen where text data size was really crucial was
> text messaging. Hence, common sense tells _me_ that we should put
> scripts used by heavily text-messaging cultures in the 2-byte range;
> that is, Latin, Hiragana and Katakana.

ROTFL! :)

In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavily. It's true that in
html 75-90% of the size is usually tags (in ASCII) but that's due to
incompetence of the web designers and their inability to use CSS
correctly, not anything fundamental. If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.

Rich


(*) 6 because a large number of characters in Indic scripts will have
the virama (a combining character) attached to them to remove the
inherent vowel and attach them into clusters.
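
A concrete instance (standard Devanagari encoding, easy to verify): the
single syllable KSSA is KA + VIRAMA + SSA, three codepoints at three
bytes each:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+0915 KA + U+094D VIRAMA + U+0937 SSA: one visual
           syllable, nine bytes (vs. one byte for an ASCII letter). */
        const char *kssa = "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA4\xB7";
        printf("%zu\n", strlen(kssa)); /* prints 9 */
        return 0;
    }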

David Starner

unread,
Sep 5, 2006, 1:57:08 AM9/5/06
to
On 9/5/06, Rich Felker <dal...@aerifal.cx> wrote:
> On Mon, Sep 04, 2006 at 11:44:26PM -0500, David Starner wrote:
> > Once you compress the data with a decent compression scheme, you may
> > as well store the data by writing out the full Unicode name (e.g.
> > "LATIN CAPITAL LETTER OU"); the final result will be about the same
> > size.
>
> With some compression methods this is true, particularly bz2.
>
> > Furthermore, you can fit a decent sized novel on a floppy
> > uncompressed and a decent sized library on a DVD uncompressed.
>
> Yet somehow the firefox source code is still 36 megs (bz2), and god
> only knows how large OOO is. Imagine now if all the variable and
> function names were written in Hindi or Thai... It would be an
> interesting test to transliterate the Latin letters to Devanagari and
> see how much the compressed tarball size goes up.

The very point of the above test is that it would change the size
minimally. It shouldn't make much, if any, difference.

> In all seriousness, though, unless you're dealing with image, music,
> or movie files, text weighs in quite heavy in size.

As opposed to what? The vast majority of content is one of the four,
and what's left--say, Flash files--doesn't seem particularly small
compared to text.

> If you're making a website
> without fluff and with lots of information, text size will be the
> dominant factor in traffic. It's quite unfortunate that native
> language text is 3 to 6(*) times larger in countries where bandwidth
> is very expensive.

Welcome to HTTP 1.1. There's no reason not to compress the data while
you're sending it across the network, which will fix the vast majority
of this problem.

Rich Felker

unread,
Sep 5, 2006, 3:11:52 AM9/5/06
to
On Tue, Sep 05, 2006 at 12:57:08AM -0500, David Starner wrote:
> On 9/5/06, Rich Felker <dal...@aerifal.cx> wrote:
> >In all seriousness, though, unless you're dealing with image, music,
> >or movie files, text weighs in quite heavy in size.
>
> As opposed to what? The vast majority of content is one of the four,
> and what's left--say, Flash files--don't seem particularly small
> compared to text.

I wasn't thinking of a website but rather a complete computer system.
I have several gigabytes of email which is larger than even a very
bloated OS and several hundred thousand times bigger than a
non-bloated OS. Multiply this by a factor of 3 or more and it could
quite easily go from "feasible to store" to "infeasible to store".

> >If you're making a website
> >without fluff and with lots of information, text size will be the
> >dominant factor in traffic. It's quite unfortunate that native
> >language text is 3 to 6(*) times larger in countries where bandwidth
> >is very expensive.
>
> Welcome to HTTP 1.1. There's no reason not to compress the data while
> you're sending it across the network, which will fix the vast majority
> of this problem.

Here you have the issue of compression performance versus bandwidth,
especially relevant on a heavily loaded server (of course you can
precompress static texts). Also, gzip doesn't perform so well on UTF-8,
so bzip2 would be better, but it's also much more CPU-hungry and I doubt
any clients support it.

Anyway all of this discussion is in a sense pointless, since none of us
has the power to change any of these problems and since there's no real
solution even if we could. But sometimes you just have to bitch about
the stuff the Unicode folks messed up on..

Rich

Mark Leisher

unread,
Sep 5, 2006, 10:07:14 AM9/5/06
to
Rich Felker wrote:
> On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:

My last gasp on this conversation: I don't think you really understand
what you are talking about and won't until you get some hands-on
experience. Goodbye and good luck.
--
------------------------------------------------------------------------
Mark Leisher


Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark


Rich Felker

unread,
Sep 6, 2006, 12:11:49 AM9/6/06
to
On Tue, Sep 05, 2006 at 08:07:14AM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:
>
> My last gasp on this conversation: I don't think you really understand
> what you are talking about and won't until you get some hands-on
> experience.

I'm not sure how to take this but whatever it is, it sounds
condescending and impolite. Was that the intent? What makes you think
I lack hands-on experience? The fact that my code is "too small" and
going to stay that way? Or just that it's not yet checked in for you
to view?

I'm sorry if my long messages to this list have offended, but my
intent was to seek input and discussion. I don't think anything I said
was any more offensive than similar things which Markus and other
people respected in this community have said. If it's just that you
don't have time to deal with this thread anymore, no problem, I won't
take offense.

> Goodbye and good luck.

Thanks I suppose......

Rich
