symmetric swapping

0 views
Skip to first unread message

Simon Montagu

unread,
Apr 18, 2000, 3:00:00 AM4/18/00
to
I attach a code fragment which deals with a small but important issue in
BiDi reordering: symmetric swapping.

As those familiar with the subject will know, rule L4 of the Unicode
Bidi Algorithm specifies that when certain characters appear in
right-to-left text they must be displayed by mirrored glyphs. For
example,

ABC (DEF)

will be displayed as

(FED) CBA

not

)FED( CBA

In most implentations of BiDi reordering, this becomes quite a
performance bottleneck.
The chief problem is that a significant overhead must be invested for
every character in an right-to-left text run, when only a small
proportion of the characters need anything done to them. If support is
given for all of the characters in the Unicode database a table of 202
character pairs is required, which even in a binary search will require
up to 8 passes to locate an individual character. My aim was to reduce
this overhead to the minimum possible while still complying to the
standard in all cases.

On examining the characters with the "mirrored" property, I noticed two
or three important characteristics:

The characters are confined to particular areas within the total range
of Unicode characters, specifically to five blocks: U00xx, U20xx, U22xx,
U23xx and U30xx.

The mirrored equivalent (if any, see below) is always in the same block,
and never very far away

Taking this into account, I wrote a routine as follows:

There are 5 tables of 256 bytes each, one for each block that contains
mirrored characters. Each byte of the table corresponds to one character
in the block, from Uxx00 to UxxFF, and contains that character XOR its
mirrored equivalent.

(Some space could be saved here by using overlapping tables, but not
very much - maybe 80 bytes out of 1280 - and I doubt if it is worth the
overhead in processing that would be required.)

The routine first tests whether the input character is from one of the
relevant blocks, and if not does nothing. Otherwise it uses the low byte
of the character as an index to the table for that block, and XORs the
character with the table entry.

In a test program, this routine was 14 or 15 times faster than a binary
search.

Note: There are some characters that are defined as "mirrored" in the
Unicode database, but have no mates. At least in Windows, it would be
possible (but fiddly and time-consuming) to add code to the rendering
engine to display these characters as a mirror-image. I have put
skeleton code inside an #ifdef, which could be extended to return some
kind of flag when such a character is encountered. These characters are
all mathematical symbols, e.g. U2201, COMPLEMENT; U2202, PARTIAL
DIFFERENTIAL etc. At least in Hebrew texts, mathematical formulae are
generally written from left to right using the Latin alphabet, so these
characters are never likely to appear with right-to-left directionality,
but I remember reading somewhere that in Arabic texts formulae are
written from right to left. I would very much like to hear feedback on
this issue, also on the practicality of displaying mirrored glyphs on
other platforms.

Note 2: IE5 uses a novel algorithm for symmetric swapping: mirrored
characters are always replaced by the next or previous mirrored
character. This works fine most of the time, certainly for the common
pairs like ( ), < >, { }, [ ], but produces curious results for some of
the mathematical symbols. For example, U2248, ALMOST EQUAL TO in
right-to-left text will be displayed as U2249, NOT ALMOST EQUAL TO.

Erik van der Poel

unread,
Apr 18, 2000, 3:00:00 AM4/18/00
to Simon Montagu
Hi Simon,

Thanks for the interesting message. I'd like to see more of Lina's
design work completed before looking at any code contributions. Also,
how do you intend to deal with the performance issue with mirrored
glyphs? Will there be some flag that says whether or not a particular
piece of text contains any RTL characters, for example?

Thanks,

Erik


Roozbeh Pournader

unread,
Apr 18, 2000, 3:00:00 AM4/18/00
to Simon Montagu

On Tue, 18 Apr 2000, Simon Montagu wrote:

> but I remember reading somewhere that in Arabic texts formulae are
> written from right to left.

I have seen Iraqi high school math books with right-to-left math and even
mirrored square root signs. Even the variables where SEEN, MEEM, etc
instead of x, y, ...


Simon Montagu

unread,
Apr 19, 2000, 3:00:00 AM4/19/00
to
This code really stands on its own independent of the design
work. Whatever design decisions we make, we will have to do
symmetric swapping *somewhere*.

In answer to your question, we will need a flag showing the
directionality of a piece of text (exactly where the flag will
be is under discussion in the thread on the design), and this
flag (together with other, mostly platform dependent, questions)
will determine whether or not we need to reverse the text and
call the symmetric swapping routine. It is important to remember
that this flag must be set by the reordering engine after
analysis of the whole paragraph. It cannot be determined by
testing whether the piece of text contains RTL characters: for
example, a piece of text may be inside a <BDO> tag, or there
might be a frame containing only neutrals

Best regards,

Simon

In article <38FC91B9...@netscape.com>, er...@netscape.com


* Sent from RemarQ http://www.remarq.com The Internet's Discussion Network *
The fastest and easiest way to search and participate in Usenet - Free!


Simon Montagu

unread,
Apr 27, 2000, 3:00:00 AM4/27/00
to
This is an addition to the code I posted a few weeks ago for identifying the Bidi
category of characters and the reordering engine. I didn't want to post the whole
thing again, firstly because there are constant small changes for the convenience of
the layout code, so it makes sense to post the two together, secondly because the
reordering code is based on the IBM ICU, which has its own license, and that causes
issues that need resolving.

Yes, I think it should be part of mozilla/intl/unicharutil, and yes, it should be a
cpp file in its final form.

Simon

Frank Tang wrote:

> Where this code should live ?
> where is the nsBidiCategoryImp ?
> Should this be part of the mozilla/intl/unicharutil ?
> It should be a cpp file but not a c file , right ?
> Simon Montagu wrote:
>
> > Here is the code itself:
> >
> > ------------------------------------------------------------------------
> > Name: symmswap.c
> > symmswap.c Type: C Source File (application/x-unknown-content-type-cfile)
> > Encoding: base64


Reply all
Reply to author
Forward
0 new messages