Knigaman wrote:
> So, my next question is, in the case of lam-alef, let's say I'm using
> a font that automatically "visually" replaces both glyphs with what
> appears to be a single glyph. I assume that underlyingly there are
> still 2 code points in the text. Does one code point hide itself in
> order to facilitate the correct glyph to appear?
There are several different mechanisms that have been used over the
years to handle the display of Arabic script on computers, including
mechanisms that performed character-level substitutions of
'presentation forms', including ligatures. The common mechanism today,
though, makes a distinction between character processing and glyph
processing, and glyph operations such as ligature substitution happen at
a level above text encoding.
Taking the lam+alif as an example, and presuming an OpenType format font
and layout model using ligatures:
These two characters are encoded in text using the Unicode character
codes U+0644 and U+0627. These character codes are stored in the
'backing string' of text and are not changed during layout and display.
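To make the backing string concrete, here is a minimal Python sketch (the variable name `backing` is just for illustration). It shows that the stored text is two code points, whatever ends up on screen:

```python
import unicodedata

# The backing string: lam followed by alif, stored as two separate code points.
backing = "\u0644\u0627"

for ch in backing:
    # Prints the code point and its Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(len(backing))  # 2
```

Nothing in the later layout steps writes back to this string.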
The application queries the cmap table of the font and finds entries
that map these two Unicode characters to glyph IDs in the font, e.g. GID
45 and GID 12 (these can be any number, since there is no externally
defined ordering for glyphs in a font). The GIDs are then passed to the
layout engine for display processing.
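The cmap step can be modelled roughly as a dictionary lookup. This is a toy sketch, not a real font API: `TOY_CMAP` and `map_to_glyphs` are made-up names, and the GIDs are the arbitrary ones from the example above.

```python
# Toy stand-in for a font's cmap table: Unicode code point -> glyph ID.
TOY_CMAP = {0x0644: 45, 0x0627: 12}


def map_to_glyphs(text, cmap, notdef=0):
    """Return one glyph ID per character; unmapped characters get GID 0 (.notdef)."""
    return [cmap.get(ord(ch), notdef) for ch in text]


backing = "\u0644\u0627"
glyphs = map_to_glyphs(backing, TOY_CMAP)
print(glyphs)        # [45, 12]
print(len(backing))  # 2, the text itself is untouched
```

The glyph list is a separate object from the text, which is the point of the character/glyph distinction.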
The layout engine recognises that these two Unicode characters are
Arabic letters, so applies standard Arabic text shaping to them. Text
shaping is based on what I referred to earlier as the letter joining
behaviours of the characters. Presuming this sequence of lam+alif
occurs in isolation, i.e. not preceded by other left-joining letters,
the layout engine identifies the lam as being in an initial
(left-joining) position and the alif as being in a final
(right-joining) position. The layout engine applies the <init> and
<fina> OpenType Layout features accordingly, and these map the GIDs 45
and 12 to new GIDs for the appropriately shaped glyphs, e.g. GIDs 46 and 13.
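The positional analysis can be sketched as below. This is a deliberate simplification: it only knows lam (dual-joining) and alif (right-joining), it ignores marks and other transparent characters, and the tables and function names are hypothetical stand-ins for the font's cmap and its <init>/<fina> lookups.

```python
# Simplified joining types: "D" = dual-joining, "R" = right-joining
# (a right-joining letter connects only to the letter before it).
JOINING_TYPE = {"\u0644": "D", "\u0627": "R"}  # lam, alif

# Toy cmap and toy single-substitution lookups; GIDs as in the example above.
TOY_CMAP = {"\u0644": 45, "\u0627": 12}
TOY_FEATURES = {"init": {45: 46}, "fina": {12: 13}}


def positional_features(text):
    """Assign isol/init/medi/fina to each letter from its neighbours' joining types."""
    features = []
    for i, ch in enumerate(text):
        this_type = JOINING_TYPE.get(ch)
        prev_type = JOINING_TYPE.get(text[i - 1]) if i > 0 else None
        next_type = JOINING_TYPE.get(text[i + 1]) if i + 1 < len(text) else None
        # A letter joins backwards only if the previous letter can join forwards.
        joins_prev = this_type in ("D", "R") and prev_type == "D"
        # A letter joins forwards only if it is dual-joining and the next accepts it.
        joins_next = this_type == "D" and next_type in ("D", "R")
        if joins_prev and joins_next:
            features.append("medi")
        elif joins_next:
            features.append("init")
        elif joins_prev:
            features.append("fina")
        else:
            features.append("isol")
    return features


def shape(text):
    """Map characters to nominal GIDs, then apply the positional substitutions."""
    glyphs = []
    for ch, feature in zip(text, positional_features(text)):
        gid = TOY_CMAP[ch]
        glyphs.append(TOY_FEATURES.get(feature, {}).get(gid, gid))
    return glyphs


print(positional_features("\u0644\u0627"))  # ['init', 'fina']
print(shape("\u0644\u0627"))                # [46, 13]
```

A real layout engine takes joining types from the Unicode data files rather than a two-entry table, but the decision it makes for each letter has this general shape.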
The layout engine now applies secondary shaping features to the new
glyph string, including the <rlig> 'Required Ligatures' feature. [Of
course, as I explained in my previous message, no ligatures per se are
actually required for Arabic script, only certain shape joining
behaviour, which may or may not be handled using actual ligatures.] In
this example, there is a ligature glyph in the font for lam+alif, and a
ligature lookup that maps the two GIDs 46+13 to a single GID, e.g. 78 or
whatever. This glyph is displayed.
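The ligature step, and the answer to the question about a "hidden" code point, can be illustrated with another toy sketch. `TOY_RLIG` and `apply_ligatures` are invented for illustration and only handle two-glyph ligatures; a real ligature lookup is more general.

```python
# Toy ligature lookup: a sequence of two GIDs -> one ligature GID.
TOY_RLIG = {(46, 13): 78}


def apply_ligatures(glyphs, ligatures):
    """Replace each matching pair of adjacent glyphs with its ligature glyph."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in ligatures:
            out.append(ligatures[pair])
            i += 2  # both component glyphs are consumed
        else:
            out.append(glyphs[i])
            i += 1
    return out


backing = "\u0644\u0627"   # the text: still lam + alif
shaped = [46, 13]          # glyphs after the positional features
displayed = apply_ligatures(shaped, TOY_RLIG)

print(displayed)     # [78]
print(len(backing))  # 2
```

So no code point hides itself: the text keeps both characters, and only the glyph list shrinks to one item.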
> As a linguist (even though I'm learning Persian primarily because I
> want to, not for any linguistic purpose) -- I care mostly about that
> the correct code points are present in the text. For instance, if some
> presentation form were present, it would radically complicate
> searching and analysis.
Yes. Presentation form codepoints should be avoided. They are not
necessary, and cause more problems than they ever solved.
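A quick Python illustration of the searching problem, using the lam-alif isolated presentation form U+FEFB as the example. It relies on the fact that Unicode gives this character a compatibility decomposition to lam + alif, so NFKC normalisation recovers the nominal letters:

```python
import unicodedata

presentation = "\uFEFB"    # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
nominal = "\u0644\u0627"   # lam + alif as ordinary letters

# A search for the ordinary letters does not find the presentation form.
print(nominal in presentation)   # False

# Compatibility normalisation maps the presentation form back to lam + alif.
normalized = unicodedata.normalize("NFKC", presentation)
print(normalized == nominal)     # True
```

NFKC changes many other compatibility characters too, so treat this as a way of cleaning up legacy data before analysis and not something to apply blindly to every text.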
> And I really would like to know the answer to one of my original
> questions, that is, which letters tend to morph into ligatures, that
> are not desirable in rendering Persian text?
My guess is that the 'Allah' word form ligature is something that one
wouldn't want to occur automatically in most languages that use the
Arabic script. In most of the Arabic OT fonts I have worked on, this is
treated as a discretionary ligature, one that needs to be actively
turned on by the user (presuming he or she is using an application that
enables this at all), and not something that happens automatically.
JH