On 4/20/18 10:03 AM,
supe...@casperkitty.com wrote:
> On Thursday, April 19, 2018 at 10:05:31 PM UTC-5, Richard Damon wrote:
>> Which sounds like you just want to use the Unicode direction override
>> code. Now we get to the original problem you proposed: the need to take
>> a logically ordered text string and convert it into this strange
>> format, which again raises the question WHY? (or maybe, why do you have
>> the expectation that modern standards should provide easy ways to hack
>> up data into ancient formats?).
>
> What is needed is an easy way of ending up with chunks that can be
> rendered in context-free fashion. There are many possible ways of
> representing text that would make it possible to subdivide it into
> such chunks. I wouldn't be especially attached to any particular
> way of representing such text, provided there was a standard form
> and it allowed for the extraction of chunks that can be rendered
> separately.
Unfortunately, the goal that you state seems to be a unicorn. The
accepted grammar of LTR/RTL rendering needs some context (if only
whether that chunk is in an overall LTR or RTL context, and the strength
of that).
>
>>> Otherwise, for type layout purposes, the conversion to fixed-direction text
>>> would be done *after* all line breaks are processed.
>>
>> Yes, the output of a layout program may well be a character string with
>> an LTR direction override code to say that everything following is LTR
>> (even if the characters would otherwise be RTL), and would include the
>> line breaks, not the strange mishmash you were first describing where
>> each word has a direction code.
>
> If the normalizing function can re-arrange the order of characters so as
> to allow uniform LTR display, that would be great. The mish-mash was to
> allow for the possibility that the designer of the normalizing spec would
> want to keep characters in order, in which case it would have to supply
> information to the application that would let the application do the
> rearranging. As I said, there are many ways of achieving the same goal,
> which is to allow for context-free rendering.
There IS an LTR override character, LRO (U+202D, Left-to-Right Override),
which indicates that the following text is LTR regardless of the
implicit direction implied by the characters. This override applies until
some other directional control nests a new embedded direction, or until
the override is terminated with a PDF (U+202C, Pop Directional
Formatting) character. IF you can assume that your text is within such
an override (i.e. assume the context), then you can do it, but that
isn't 'context-free', that is assumed context.
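Wrapping a chunk that way is trivial; here is a minimal Python sketch
(the sample string is just an arbitrary mix of Latin and Hebrew letters):

    LRO = "\u202D"  # U+202D Left-to-Right Override
    PDF = "\u202C"  # U+202C Pop Directional Formatting

    def force_ltr(chunk):
        # Every character in `chunk` is displayed left-to-right,
        # regardless of its implicit bidirectional class.
        return LRO + chunk + PDF

    print(force_ltr("abc \u05D0\u05D1\u05D2 def"))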
And eventually they realized that that goal couldn't be met 100%. A
Unicode string, if sliced at an arbitrary position, will generally not
appear as a different string, as the encoding of one code point can
never be found embedded within the encoding of another. By inspecting at
the byte level, a program can easily find the next or previous code
point boundary, so it is fairly easy to break a string at code point
boundaries.
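For UTF-8 in particular, finding such a boundary is just a matter of
skipping continuation bytes; a sketch, assuming well-formed UTF-8 input:

    def prev_boundary(data, i):
        # UTF-8 continuation bytes all match 10xxxxxx (0x80-0xBF),
        # so back up over them until we reach a lead byte.
        while i > 0 and (data[i] & 0xC0) == 0x80:
            i -= 1
        return i

    # prev_boundary(b"a\xc3\xa9b", 2) -> 1 (the lead byte of U+00E9)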
This unfortunately does not allow a string to be broken at an arbitrary
code point boundary and then have the two parts rendered independently.
First, there are combining code points which cause multiple code points
to be merged into a single glyph, so if a program wishes to break a
string without disturbing the glyph that will be generated, it needs at
least to know which code points are combining marks. This happened
because, when they got into trying to fully define things, they found
far too many possible glyphs, so they needed to include combining codes.
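Checking for this is straightforward; a minimal sketch using Python's
standard unicodedata module:

    import unicodedata

    def is_combining(ch):
        # General categories Mn, Mc, and Me are the combining marks.
        return unicodedata.category(ch).startswith("M")

    print(is_combining("\u0301"))  # True:  COMBINING ACUTE ACCENT
    print(is_combining("e"))       # False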
There are also issues bigger than glyphs that deal with typography (like
the bidirectionality issues, as well as the Han unification issues) that
require some state encoding in the string, making arbitrary breaking of
a string no longer a fully achievable goal. That was given up before
Unicode outgrew its 16-bit target (and perhaps trying to keep it as a
16-bit target was one reason for some of this).
There is also the issue that many programs add state to text by
including typographic markup such as font selection, bold, italics,
etc., which is beyond Unicode (although it allows an application to
define character codes for this sort of thing) and which may also need
processing to handle the division of a string.
>
>> Thus your original string, assuming that string is in an LTR default
>> region, would be rendered as (if I am doing it right):
>>
>> testing one two three EVIF RUOF
>> XIS seven eight nine.
>>
>> Yes, in Unicode you need to define with direction marks your general
>> text direction and the extents of alternate-direction quotes etc. Thus
>> your 'import' program that takes in properly formed Unicode text doesn't
>> need to 'guess', as the stream will define all that information.
>
> If text is fully marked up with direction changes, there would be no need
> for a rendering engine to try to infer direction based upon character
> ranges. If text doesn't contain markup, then a rendering engine would
> be guessing about how to display things.
Text only needs to be marked with direction changes not implied by the
characters. The goal of the directionality implied by characters was to
make it so that much text could be properly rendered without the need
for directionality codes to be embedded, and in general it does a
fairly good job of it (mixing Arabic numerals into RTL text is one of
its weaknesses).
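You can inspect those implicit per-character classes directly; a small
sketch with Python's unicodedata:

    import unicodedata

    # 'L' = strong LTR, 'R' = strong RTL, 'EN' = European number (weak),
    # 'WS' = whitespace (neutral).
    for ch in ("A", "\u05D0", "1", " "):
        print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))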
>
>> Yes, for your OCR example, a first step probably generates the text
>> with a forced LTR (or RTL) direction control, and maybe then does a pass
>> to improve the representation (actually, good OCR likely does this at
>> the beginning, as using context to help with character recognition can
>> help greatly with accuracy for normal text, and that processing would at
>> least identify the direction of words).
>
> For text which is set ragged-left or ragged-right, the choice of which
> margin was straight would usually indicate the primary text direction,
> but an OCR program would be guessing. Having an encoding which can
> indicate "the words appeared in this left-to-right order" would allow
> the information in the document to be reported according to what actually
> appears without the program having to guess.
>
As I said, the LRO character is exactly what is wanted here.
>>> When using a Unicode bidirectional text display, replacing cats with Hebrew
>>> would lay out the line as:
>>>
>>> "1 dog, 2 3 ,STAC zebras, and four unicorns".
>>>
>>> Perhaps the post didn't show up that way for you when I used Unicode
>>> Hebrew characters? The 2 gets processed as LTR because the preceding
>>> text was LTR, while the 3 gets joined with the preceding RTL text and
>>> thus appears between the "2" and the word "STAC".
>>>
>>> How many readers would look at the above and figure out that there were
>>> two STAC and three zebras?
>>
>> The issue is that you didn't enter the data right: Arabic numerals
>> (sort of like many punctuation characters) are only weakly LTR, so they
>> become RTL in the presence of RTL characters. You need to be aware of
>> such things at data entry, and sometimes you need to provide overrides
>> to correct things. When mixing languages you need to follow the rules
>> to get what you want, even more so when they are human languages.
>
> If a text-entry program noticed that one was pasting an "RTL text" object
> into an LTR document and simply bracketed it with "embedded RTL" tags,
> then copying and pasting the Hebrew word "STAC" over "cats" would yield
> sensible behavior without the rendering engine having to know anything
> about the directionality of individual characters. Requiring that
> someone who doesn't normally work with bidirectional text but simply
> wants to embed one Hebrew word in a document must learn all about the
> complexities of bidirectional text seems rather less helpful.
>
The problem is that there are logically a couple of different, quite
reasonable options for what might be wanted, hence the somewhat
complicated rules, and sometimes it doesn't work the way you want.
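That automatic bracketing is easy enough to express; a minimal sketch
using the RLE embedding control (the Hebrew string here is just a
placeholder fragment):

    RLE = "\u202B"  # U+202B Right-to-Left Embedding
    PDF = "\u202C"  # U+202C Pop Directional Formatting

    def embed_rtl(fragment):
        # Bracket a pasted RTL fragment so it reads as one RTL run
        # inside an otherwise LTR paragraph.
        return RLE + fragment + PDF

    line = "1 dog, 2 " + embed_rtl("\u05D7\u05EA\u05D5\u05DC\u05D9\u05DD") + ", 3 zebras"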
>>> In the absence of mixed-direction scripts, it would be fairly
>>> straightforward. The problem is that once code figures out where the
>>> line breaks are, there's no nice way to then figure out what order to
>>> display things in.
>>>
>> The 'render string at this coordinate' function should do that (provided
>> you maintain the direction nesting context down to that string), at
>> least if it is properly Unicode aware.
>
> Two problems with that:
>
> 1. The proper rendering order for things on a line may depend upon the
> content of preceding or succeeding lines.
>
> 2. Things like full justification, geometric distortion, etc. may
> require rendering a line of text as a sequence of smaller portions.
>
> Thus the need to subdivide text into portions that can be displayed in
> context-free fashion.
>
As I said, you need to maintain the context, so you need to keep a stack
of the nested directional control characters, removing those that have
been popped off; that handles item 1.
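A minimal sketch of that bookkeeping, with the drawing call left as a
comment since it depends on the toolkit (hypothetical):

    OPENERS = "\u202A\u202B\u202D\u202E"  # LRE, RLE, LRO, RLO
    PDF = "\u202C"                        # Pop Directional Formatting

    def open_controls(text):
        # Return the directional controls still in effect at the end
        # of `text`, in the order they were opened.
        stack = []
        for ch in text:
            if ch in OPENERS:
                stack.append(ch)
            elif ch == PDF and stack:
                stack.pop()
        return "".join(stack)

    s = "abc \u202Ddef ghi"   # an LRO opened but never popped
    head, tail = s[:8], s[8:]
    # render(head)
    # render(open_controls(head) + tail)  # re-open the context for the tail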
Item 2 says you don't have a good enough primitive, and thus yes, you do
need to understand the details. If the render-string-at-this-location
call was designed for justified text, it would also take a width
parameter for how much space to use to render the text. If not, then you
need to understand how to process the text yourself, and if you need to
deal with bidirectional text, that is more complicated. Actually, even
without the RTL issue, if you want to do the 'best' job, you need code
to handle the degenerate case of a single word on the line (because the
next word is long), deciding how (and whether) to stretch that word to
justify it, and possibly also do this for a line with very few words and
a lot of space to add.
Of course, if you want a simple version that comes closer to justifying
the text without needing to understand the RTL details, you could just
start substituting other space characters for the ordinary ones (there
are a number of spaces of various widths defined in the range U+2000
upward and elsewhere) to roughly even out the margin, measuring the
string as you go.
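A crude sketch of that idea; the thresholds are made up, and `measure`
stands in for whatever string-width call the toolkit provides
(hypothetical):

    EN_SPACE = "\u2002"  # U+2002 EN SPACE (wider than a normal space)
    EM_SPACE = "\u2003"  # U+2003 EM SPACE (wider still)

    def widen_spaces(line, shortfall):
        # `shortfall` is how far the line falls short of the target
        # width, as a fraction of that width (made-up thresholds).
        if shortfall > 0.10:
            return line.replace(" ", EM_SPACE)
        if shortfall > 0.03:
            return line.replace(" ", EN_SPACE)
        return line

    # widened = widen_spaces(line, (target - measure(line)) / target)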