As those familiar with the subject will know, rule L4 of the Unicode
Bidi Algorithm specifies that when certain characters appear in
right-to-left text they must be displayed by mirrored glyphs. For
example,
ABC (DEF)
will be displayed as
(FED) CBA
not
)FED( CBA
In most implentations of BiDi reordering, this becomes quite a
performance bottleneck.
The chief problem is that a significant overhead must be invested for
every character in an right-to-left text run, when only a small
proportion of the characters need anything done to them. If support is
given for all of the characters in the Unicode database a table of 202
character pairs is required, which even in a binary search will require
up to 8 passes to locate an individual character. My aim was to reduce
this overhead to the minimum possible while still complying to the
standard in all cases.
On examining the characters with the "mirrored" property, I noticed two
or three important characteristics:
The characters are confined to particular areas within the total range
of Unicode characters, specifically to five blocks: U00xx, U20xx, U22xx,
U23xx and U30xx.
The mirrored equivalent (if any, see below) is always in the same block,
and never very far away
Taking this into account, I wrote a routine as follows:
There are 5 tables of 256 bytes each, one for each block that contains
mirrored characters. Each byte of the table corresponds to one character
in the block, from Uxx00 to UxxFF, and contains that character XOR its
mirrored equivalent.
(Some space could be saved here by using overlapping tables, but not
very much - maybe 80 bytes out of 1280 - and I doubt if it is worth the
overhead in processing that would be required.)
The routine first tests whether the input character is from one of the
relevant blocks, and if not does nothing. Otherwise it uses the low byte
of the character as an index to the table for that block, and XORs the
character with the table entry.
In a test program, this routine was 14 or 15 times faster than a binary
search.
Note: There are some characters that are defined as "mirrored" in the
Unicode database, but have no mates. At least in Windows, it would be
possible (but fiddly and time-consuming) to add code to the rendering
engine to display these characters as a mirror-image. I have put
skeleton code inside an #ifdef, which could be extended to return some
kind of flag when such a character is encountered. These characters are
all mathematical symbols, e.g. U2201, COMPLEMENT; U2202, PARTIAL
DIFFERENTIAL etc. At least in Hebrew texts, mathematical formulae are
generally written from left to right using the Latin alphabet, so these
characters are never likely to appear with right-to-left directionality,
but I remember reading somewhere that in Arabic texts formulae are
written from right to left. I would very much like to hear feedback on
this issue, also on the practicality of displaying mirrored glyphs on
other platforms.
Note 2: IE5 uses a novel algorithm for symmetric swapping: mirrored
characters are always replaced by the next or previous mirrored
character. This works fine most of the time, certainly for the common
pairs like ( ), < >, { }, [ ], but produces curious results for some of
the mathematical symbols. For example, U2248, ALMOST EQUAL TO in
right-to-left text will be displayed as U2249, NOT ALMOST EQUAL TO.
Thanks for the interesting message. I'd like to see more of Lina's
design work completed before looking at any code contributions. Also,
how do you intend to deal with the performance issue with mirrored
glyphs? Will there be some flag that says whether or not a particular
piece of text contains any RTL characters, for example?
Thanks,
Erik
On Tue, 18 Apr 2000, Simon Montagu wrote:
> but I remember reading somewhere that in Arabic texts formulae are
> written from right to left.
I have seen Iraqi high school math books with right-to-left math and even
mirrored square root signs. Even the variables where SEEN, MEEM, etc
instead of x, y, ...
In answer to your question, we will need a flag showing the
directionality of a piece of text (exactly where the flag will
be is under discussion in the thread on the design), and this
flag (together with other, mostly platform dependent, questions)
will determine whether or not we need to reverse the text and
call the symmetric swapping routine. It is important to remember
that this flag must be set by the reordering engine after
analysis of the whole paragraph. It cannot be determined by
testing whether the piece of text contains RTL characters: for
example, a piece of text may be inside a <BDO> tag, or there
might be a frame containing only neutrals
Best regards,
Simon
In article <38FC91B9...@netscape.com>, er...@netscape.com
* Sent from RemarQ http://www.remarq.com The Internet's Discussion Network *
The fastest and easiest way to search and participate in Usenet - Free!
Yes, I think it should be part of mozilla/intl/unicharutil, and yes, it should be a
cpp file in its final form.
Simon
Frank Tang wrote:
> Where this code should live ?
> where is the nsBidiCategoryImp ?
> Should this be part of the mozilla/intl/unicharutil ?
> It should be a cpp file but not a c file , right ?
> Simon Montagu wrote:
>
> > Here is the code itself:
> >
> > ------------------------------------------------------------------------
> > Name: symmswap.c
> > symmswap.c Type: C Source File (application/x-unknown-content-type-cfile)
> > Encoding: base64