
pit-8: a new text encoding standard for fonts and text


Jean-marc Lienher

Apr 6, 2018, 9:22:36 AM
Hi,

I would like the opinion of non-US and non-European people about an
International standard for text encoding that would be better suited for
fonts than Unicode and that would better represent the languages of the
World.

Is there a need for this now or in the future?

See:
https://github.com/public-domain/pit-8

Please forward this message to people who could be interested.

Thanks in advance,
Jean-Marc Lienher

Noob

Apr 6, 2018, 3:01:28 PM
On 06/04/2018 15:22, Jean-marc Lienher wrote:

> I would like the opinion of non-US and non-European people about an
> International standard for text encoding that would be better suited for
> fonts than Unicode and that would better represent the languages of the
> World.

I have a question regarding the encoding.

code point byte1 byte2 byte3 byte4 byte5 byte6
P+0000 0xxxxxxx
P+0080 110xxxxx 10xxxxxx
P+0800 1110xxxx 10xxxxxx 10xxxxxx
P+10000 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
P+200000 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
P+4000000 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Considering the leading "10" for "intermediate" bytes.

Is that to allow easy "resynchronization" in case the leading byte
is missing?

Regards.

Ben Bacarisse

Apr 6, 2018, 4:15:22 PM
Yes. And it lets you find the start of a character even if you have
only a pointer into the middle.

Note that this is the same encoding as UTF-8 (though UTF-8 is not
decreed to stop at 4 bytes). The "pit-8" part appears to be about what
the code points represent.

--
Ben.

Jean-marc Lienher

Apr 6, 2018, 4:17:04 PM
Noob wrote:
> Considering the leading "10" for "intermediate" bytes.
>
> Is that to allow easy "resynchronization" in case the leading byte
> is missing?

Yes, it could be used for that. But it is useful when you want to scan
the string from end to beginning too.

But there is nothing new in that; these are the same properties as in
UTF-8, which is used all over the web these days.

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


Rick C. Hodgin

Apr 6, 2018, 4:34:40 PM
On Friday, April 6, 2018 at 9:22:36 AM UTC-4, Jean-Marc Lienher wrote:
> See:
> https://github.com/public-domain/pit-8

I think a 32-bit standard form is desirable, with a bias applied for
blocks, allowing for 8-bit or 16-bit versions within the same 32-bit
range.

4.2+ billion symbols is enough for any planet's writing systems. And
in most cases, 256 symbols is enough for any language's writing system.

Assign blocks with large buffers and designate the top portion of the
block to be variable and application-specific if need be.

But using a standard, flat, 32-bit system, with a bias for each language,
seems to me a better solution. That way American English can use
its symbols, England's English can use theirs, and so on, and each one
can be assigned a block of 100K symbols. That would allow almost 43K
separate language systems.

Each language is assigned a bias. The first one has a bias of 0, and
the second a bias of 100,000, the third a bias of 200,000, etc. Each
system can use their 8-bit forms for the most common letters, or 16-bit
forms for probably all of their letters, and only dip into 32-bit forms
if they truly need more than 65,536 symbols.

I think all of our systems should be aggregated into a universal form,
and if we ever exceed 32-bit needs, then create a group 2 that uses
the next 32-bit form, and then a group 3 if we reach that many ... but
I think ~43K separate language forms, each with 100,000 symbols should
be more than sufficient. Probably 10,000 symbols would be enough for
most, and you could have 430K separate languages represented.

I also like the ability to use 8-bit forms for every language using
the language's bias. And 16-bit forms if extended characters are
needed (most of the time they are not).

--
Rick C. Hodgin

Jean-marc Lienher

Apr 6, 2018, 6:03:48 PM
Rick C. Hodgin wrote:
> On Friday, April 6, 2018 at 9:22:36 AM UTC-4, Jean-Marc Lienher wrote:
>> See:
>> https://github.com/public-domain/pit-8
>
> I think a 32-bit standard form is desirable, with a bias applied for
> blocks, allowing for 8-bit or 16-bit versions within the same 32-bit
> range.

This is an interesting idea.

What we could use, for example, is P+0088 followed by another code point
to add an offset to all the text.

So if you want to use UTF-8 you just add
P+0088 P+7F000000 and then your unmodified UTF-8 text...

For GB18030:
P+0088 P+0100 and then the FSS-UTF-encoded code points of the text (not
GB18030-encoded code points, because we don't want to have different
multi-byte-to-32-bit algorithms).

(Yes, I need to modify the mapping to include the ASCII part of GB18030
on the github page.)

This could be used at document level with minimal software modification.

But when you use cut-and-paste in the middle of a text, this is
problematic: you lose the offset information. Software needs to be
modified to deal with that.

Or maybe this offset trick should only be used on low-resource computers
(micro-controllers...); on modern, powerful computers the document should
be translated to regular pit-8 when it is loaded.



GOTHIER Nathan

Apr 6, 2018, 7:05:29 PM
Approximately a billion illiterate Indians don't care about computer
character encodings... but thanks for your contribution in polluting the
group with garbage.

--
GOTHIER Nathan

Lynn McGuire

Apr 6, 2018, 7:56:08 PM
UTF-8 has survived the battle to date. I see nothing needed now.

Lynn

Jean-marc Lienher

Apr 6, 2018, 8:19:18 PM
GOTHIER Nathan wrote:
> Approximately a billion of illiterate indians don't care about computer
> encoding characters... but thanks for your contribution in polluting the
> group with garbage.

The same to you. If you don't have a more constructive comment, why are
you polluting this newsgroup with such an insignificant reply?



GOTHIER Nathan

Apr 6, 2018, 11:48:55 PM
On Sat, 7 Apr 2018 02:19:12 +0200
Jean-marc Lienher <jean-mar...@bluewin.ch> wrote:

> The same to you. If you don't have a more constructive comment, why
> are you polluting this newsgroup with such an insignificant reply?

I thought you were not enough of an idiot to persist in polluting this
group with dumb questions like this one, but I was wrong... since you've
already spammed Stack Overflow to publicly declare the end of your
stillborn project.

https://stackoverflow.com/questions/49658431/alernative-format-to-unicode-encoding-pit-8

Why don't you celebrate the funeral of your dead project on your
website?

https://lienher.org/jean-marc/

There's no doubt it deserves a better place than this group. :-)

--
GOTHIER Nathan

Noob

Apr 9, 2018, 2:46:03 PM
On 06/04/2018 22:16, Jean-marc Lienher wrote:

> Noob wrote:
>
>> Considering the leading "10" for "intermediate" bytes.
>>
>> Is that to allow easy "resynchronization" in case the leading byte
>> is missing?
>
> Yes it could be used for that. But it is useful when you want to scan
> the string from end to beginning too.

Well, since the first byte encodes the length, "synchronization bits"
on the intermediate bytes are not required, as long as we always have
the leading byte (it is never missing, and we always keep track of it).

> But there is nothing new in that, it is the same properties as in UTF-8
> which is used all over the web these days.
>
> https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

IMO, Unicode is a shameless clusterf*ck, especially the
"emoji" shenanigans, trying to devolve modern languages
back into Egyptian hieroglyphs.

Regards.

Keith Thompson

Apr 9, 2018, 3:10:50 PM
Noob <ro...@127.0.0.1> writes:
[...]
> IMO, Unicode is a shameless clusterf*ck, especially the
> "emoji" shenanigans, trying to devolve modern languages
> back into Egyptian hieroglyphs.

Unicode is not responsible for emojis. They exist, and Unicode
merely provides a consistent way to represent them.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Noob

Apr 9, 2018, 4:13:19 PM
On 09/04/2018 21:10, Keith Thompson wrote:
> Noob <ro...@127.0.0.1> writes:
> [...]
>> IMO, Unicode is a shameless clusterf*ck, especially the
>> "emoji" shenanigans, trying to devolve modern languages
>> back into Egyptian hieroglyphs.
>
> Unicode is not responsible for emojis. They exist, and Unicode
> merely provides a consistent way to represent them.

Emojis exist. Hieroglyphs exist.

No one is forcing the Unicode Consortium to add ever more
stupid glyphs.

https://en.wikipedia.org/wiki/Emoji#Unicode_blocks

supe...@casperkitty.com

Apr 9, 2018, 4:19:13 PM
On Monday, April 9, 2018 at 2:10:50 PM UTC-5, Keith Thompson wrote:
> Noob <ro...@127.0.0.1> writes:
> [...]
> > IMO, Unicode is a shameless clusterf*ck, especially the
> > "emoji" shenanigans, trying to devolve modern languages
> > back into Egyptian hieroglyphs.
>
> Unicode is not responsible for emojis. They exist, and Unicode
> merely provides a consistent way to represent them.

Many things exist typographically which Unicode makes no effort to represent,
and many of the newer emojis weren't considered characters in any font until
the Unicode Consortium decided to add them.

Meanwhile, from what I can tell, there is no standard concept of an API that
can take a Unicode string and somehow indicate the order in which the
characters should be placed when rendered. Without such a function, the
amount of complexity needed for a text-formatting engine to properly display
things like Hebrew may easily exceed the amount of complexity needed for
everything else, combined.

Keith Thompson

Apr 9, 2018, 4:45:49 PM
This is not about C.

Richard Damon

Apr 9, 2018, 10:31:58 PM
On 4/9/18 2:45 PM, Noob wrote:
> On 06/04/2018 22:16, Jean-marc Lienher wrote:
>
>> Noob wrote:
>>
>>> Considering the leading "10" for "intermediate" bytes.
>>>
>>> Is that to allow easy "resynchronization" in case the leading byte
>>> is missing?
>>
>> Yes it could be used for that. But it is useful when you want to scan
>> the string from end to beginning too.
>
> Well, since the first byte encodes the length, "synchronization bits"
> on the intermediate bytes are not required, as long as we always have
> the leading byte (it is never missing, and we always keep track of one)
>

The fact that subsequent bytes are distinct from the values allowed in
the first byte allows one to start at an arbitrary place in a text stream
and find the character boundaries, in particular the beginning of the
character you are in the middle of. If the subsequent bytes did not have
this property, patterns would exist that could be arbitrarily confusing to
synchronize, effectively requiring you to always start at the very
beginning of the string to figure out the character boundaries.

This also gives us the useful property that the bit pattern for a given
code point will NEVER be embedded inside the encoding of some other
character.

Richard Damon

Apr 9, 2018, 10:40:50 PM
On 4/9/18 4:18 PM, supe...@casperkitty.com wrote:
>
> Meanwhile, from what I can tell, there is no standard concept of an API that
> can take a Unicode string and somehow indicate the order in which the
> characters should be placed when rendered. Without such a function, the
> amount of complexity needed for a text-formatting engine to properly display
> things like Hebrew may easily exceed the amount of complexity needed for
> everything else, combined.
>

Unicode actually carefully defines the rules for text direction, and
defines a number of special characters to control the interaction of
left-to-right and right-to-left languages.

You do need a table of the Bidi-Classes (Bidirectionality) for all the
characters, so there is a space cost for this, but that will tend to be
much less than the space needed for the font needed to display the
characters.

Malcolm McLean

Apr 10, 2018, 4:50:11 AM
On Monday, April 9, 2018 at 8:10:50 PM UTC+1, Keith Thompson wrote:
> Noob <ro...@127.0.0.1> writes:
> [...]
> > IMO, Unicode is a shameless clusterf*ck, especially the
> > "emoji" shenanigans, trying to devolve modern languages
> > back into Egyptian hieroglyphs.
>
> Unicode is not responsible for emojis. They exist, and Unicode
> merely provides a consistent way to represent them.
>

By standardising them you are encouraging them :-(


But it seems to be what the consumer wants. Computers are used mainly
to shoot off inane messages at each other and that's a 1000 billion
pound business.

Malcolm McLean

Apr 10, 2018, 4:50:11 AM
On Monday, April 9, 2018 at 8:10:50 PM UTC+1, Keith Thompson wrote:
> Noob <ro...@127.0.0.1> writes:
> [...]
> > IMO, Unicode is a shameless clusterf*ck, especially the
> > "emoji" shenanigans, trying to devolve modern languages
> > back into Egyptian hieroglyphs.
>
> Unicode is not responsible for emojis. They exist, and Unicode
> merely provides a consistent way to represent them.
>

Kenny McCormack

Apr 10, 2018, 6:46:15 AM
In article <e195c1f1-540f-4528...@googlegroups.com>,
So nice, you posted it twice.

Anyway, just out of curiosity, when you say "1000 billion", do you mean
10^9 or 10^12?

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/ThePublicGood

Malcolm McLean

Apr 10, 2018, 7:06:47 AM
On Tuesday, April 10, 2018 at 11:46:15 AM UTC+1, Kenny McCormack wrote:
> In article <e195c1f1-540f-4528...@googlegroups.com>,
> Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> >On Monday, April 9, 2018 at 8:10:50 PM UTC+1, Keith Thompson wrote:
> >> Noob <ro...@127.0.0.1> writes:
> >> [...]
> >> > IMO, Unicode is a shameless clusterf*ck, especially the
> >> > "emoji" shenanigans, trying to devolve modern languages
> >> > back into Egyptian hieroglyphs.
> >>
> >> Unicode is not responsible for emojis. They exist, and Unicode
> >> merely provides a consistent way to represent them.
> >>
> >
> >By standardising them you are encouraging them :-(
> >
> >
> >But it seems to be what the consumer wants. Computers are used mainly
> >to shoot off inane messages at each other and that's a 1000 billion
> >pound business.
>
> So nice, you posted it twice.
>
> Anyway, just out of curiosity, when you say "1000 billion", do you mean
> 10^9 or 10^12?
>
10^9. That's a rough guess of the size of the mobile phone / social media
industry, depending how you count (e.g. if a phone is supposedly for
business use but in fact is used mainly for Facebook and Twitter, you
ought to say that 90% of the value is used for firing off emoticons
and emojis).

Noob

Apr 10, 2018, 8:15:01 AM
On 10/04/2018 12:46, Kenny McCormack wrote:

> Anyway, just out of curiosity, when you say "1000 billion",
> do you mean 10^9 or 10^12?

https://en.wikipedia.org/wiki/Long_and_short_scales#Comparison

AFAIU, "billion" is either 10^9 (giga) or 10^12 (tera)

Therefore "1000 billion" is either 10^12 or 10^15 ^_^

Regards.

supe...@casperkitty.com

Apr 10, 2018, 10:30:01 AM
The problem is that every *application* that wants to do any kind of text
formatting will need to have its own copy of the appropriate logic because
there is no standard convention for getting the characters in a string in
the order necessary for rendering in a particular direction.

For that matter, how would the size of code and tables necessary to simply
determine what characters are and are not allowed within C identifiers
compare with the amount of code that was necessary to implement a minimal
C89 compiler? The whole C89 compiler would be somewhat more, but the
support for Unicode identifiers is hardly cheap.

Wouter Verhelst

Apr 10, 2018, 10:58:21 AM
On 10-04-18 16:29, supe...@casperkitty.com wrote:
> The problem is that every *application* that wants to do any kind of text
> formatting will need to have its own copy of the appropriate logic because
> there is no standard convention for getting the characters in a string in
> the order necessary for rendering in a particular direction.

That's not strictly true. Text formatting is a very complicated subject;
not just because of the properties set on glyphs and characters in the
unicode standard, but also because it has to look good, readable, and
follow the rules of the language in question. Rendering such things
yourself is just a waste of time.

Luckily, this is a solved problem, and it's why you deal with either
libraries to help you do things, or premade widgets in various user
interface frameworks and toolkits. You just hand a string to the
library or toolkit, and it renders it for you. The logic is there, and
your application just needs to deal with strings.

Of course, if you're writing such a toolkit the above does not apply,
but then you brought it upon yourself (and it wouldn't really qualify as
"application" anymore)

Malcolm McLean

Apr 10, 2018, 11:40:11 AM
Text is hard, yes. You can easily create acceptable text by using a
fixed-width raster font (I put two into my binary image processing
library). But compositing high-quality text is very difficult.

supe...@casperkitty.com

Apr 10, 2018, 11:49:11 AM
On Tuesday, April 10, 2018 at 9:58:21 AM UTC-5, Wouter Verhelst wrote:
> On 10-04-18 16:29, supe...@casperkitty.com wrote:
> > The problem is that every *application* that wants to do any kind of text
> > formatting will need to have its own copy of the appropriate logic because
> > there is no standard convention for getting the characters in a string in
> > the order necessary for rendering in a particular direction.
>
> That's not strictly true. Text formatting is a very complicated subject;
> not just because of the properties set on glyphs and characters in the
> unicode standard, but also because it has to look good, readable, and
> follow the rules of the language in question. Rendering such things
> yourself is just a waste of time.

Unless you want to do something unusual, in which case there's no
alternative but to do it yourself. But a small amount of code would be
able to accomplish a lot if text can be subdivided into segments that
can be individually rendered to paths or bitmaps.

> Luckily, this is a solved problem, and it's why you deal with either
> libraries to help you do things, or premade widgets in various user
> interface frameworks and toolkits. You just handle a string to the
> library or toolkit, and it renders it for you. The logic is there, and
> your application just needs to deal with strings.

That works great if what you need precisely matches what the library gives
you. Not so great otherwise. If one wants to e.g. output full-justified
text with custom-varying margins, a simple approach should be to start with
a document that is in "semantic" order, and then iterate:

1. Add words and measure the width of text until it exceeds the length of
a line, then undo the last addition.

2. Convert the line to a sequence of text extents in left-to-right order.

3. Measure the total length of each text extent, subtract that from the
length of the line, and--if there is more than one text extent--
divide the extra space by the number of text extents minus one.

4. Render each text extent, adding the appropriate amount of space between
them.

If the language/API can supply #2, everything else is easy. But there's no
standard concept of "normalize text direction to __".

BTW, having a recognizable form of text which is normalized to such one
direction may also be useful when processing legacy or OCR-parsed documents
which contain things in a fixed order, or when trying to display things on
e.g. a character-mapped terminal which needs to show things as they arrive.

> Of course, if you're writing such a toolkit the above does not apply,
> but then you brought it upon yourself (and it wouldn't really qualify as
> "application" anymore)

What about the case where an application needs to do something a little
different from what the built-in API supplies?

Wouter Verhelst

Apr 10, 2018, 12:41:35 PM
On 10-04-18 17:48, supe...@casperkitty.com wrote:
> On Tuesday, April 10, 2018 at 9:58:21 AM UTC-5, Wouter Verhelst wrote:
>> On 10-04-18 16:29, supe...@casperkitty.com wrote:
>>> The problem is that every *application* that wants to do any kind of text
>>> formatting will need to have its own copy of the appropriate logic because
>>> there is no standard convention for getting the characters in a string in
>>> the order necessary for rendering in a particular direction.
>>
>> That's not strictly true. Text formatting is a very complicated subject;
>> not just because of the properties set on glyphs and characters in the
>> unicode standard, but also because it has to look good, readable, and
>> follow the rules of the language in question. Rendering such things
>> yourself is just a waste of time.
>
> Unless you want to do something unusual, in which case there's no
> alternative but to do it yourself,

Sure.

But you said "every application". I said "that's not strictly true".

When it's unusual, there will always be things you need to do manually.
But my point was, it's not something *every* application will need to do
-- that's what libraries are for.

Obviously there will be exceptions, yes. This may be one of them. That
doesn't negate what I said, though.

[...]

supe...@casperkitty.com

Apr 10, 2018, 2:09:34 PM
My intended emphasis was "every application that wants to *do* text
formatting" [as opposed to applications that want to display formatted
text in a rectangular area without formatting it themselves].

Further, if one is trying to have a lightweight application, whether in
Javascript in a web page or in C on a mid-size micro, bringing in a huge
library to accomplish something that should be simple would go against
that philosophy.

Richard Damon

Apr 10, 2018, 10:06:03 PM
Except that, as I pointed out, there IS a very detailed standard convention
for doing this. No, it isn't a routine in the C standard library, but
since the C language spec doesn't require that the execution environment
particularly support Unicode, that isn't too surprising.

I am sure there are libraries available to handle this sort of thing
too, but you seem to want to reject them because they might not meet your
exact requirements, which then makes things boil down to the tautology
that a program that wants to do X (in its own customized manner) needs
to know how to do X.

For storing Unicode properties for characters, the simplistic method of
just a huge bit table would require just over a megabit per unique
property you want to store. (This would be a simple extension of
the method used by some libraries to handle information about 8-bit
characters, which often use a 257-entry table to include -1 as a
possible value.)

Because there is a reasonable amount of pattern in this sort of data, it
can be a lot more compact to build something like a span tree to see if
the character has or doesn't have a given property.
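A minimal sketch of that span-based approach in Python. The ranges below are illustrative placeholders rather than real Unicode property data, and `SPANS`/`has_property` are hypothetical names:

```python
from bisect import bisect_right

# Sorted, non-overlapping (start, end) codepoint ranges that "have" the
# property.  Illustrative placeholders only -- not real Unicode data.
SPANS = [(0x0590, 0x05FF), (0x0600, 0x06FF), (0xFB1D, 0xFDFF)]
STARTS = [start for start, _ in SPANS]

def has_property(cp):
    # Binary-search for the last range starting at or before cp,
    # then check whether cp falls inside it.
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and SPANS[i][0] <= cp <= SPANS[i][1]
```

A real table would be generated from the Unicode Character Database; the point is that a few thousand sorted ranges can replace a megabit-per-property flat table.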

Robert Wessel

Apr 11, 2018, 3:20:24 AM
There are reference implementations for C, C++ and Java:

https://unicode.org/Public/PROGRAMS/

Wouter Verhelst

Apr 11, 2018, 4:53:14 AM
Okay. I see what you mean now, and I can grant you that it is
complicated to do so in a generic manner.

But I think that in all fairness, this is only because having a
universal application which supports *all* scripts on the planet is
inherently a very complicated thing:

- Some scripts encode word boundaries in spaces, whereas others don't
mark word boundaries at all (making it difficult to reflow text).
At least in Japanese, I know that there is no rule you can apply to
specific characters to know whether or not they would be immediately
preceding or following a word boundary, so it's not something that can
be encoded in a glyph database such as Unicode.
- Some scripts (e.g., Devanagari) *require* ligatures for proper
writing, whereas for others it's just a matter of making things look nice.
- Some languages allow you to split words at certain locations
(hyphenation), following rules that cannot be captured in the script
(e.g., the rules for hyphenation are subtly different in English vs
Dutch, but the two languages use the exact same alphabet, modulo some
diacritics that Dutch uses but English doesn't).
- ... etc etc

So I think that if you want to write an application that does something
out of the ordinary for *every* language on the planet, you're in for a
lot of work anyway, and then the fact that Unicode doesn't have all the
information doesn't really help you. Absent the "every language on the
planet" requirement though, if you want to do this for "just" a single
language (or set of related languages), it shouldn't be harder than it
is to do so with, say, just ASCII and English.

Also, while it might be true (I haven't checked, so can neither confirm
nor deny) that most of the libraries out there today for formatting
text only handle rectangular areas, there is no particular reason why
that should be so, or why it should be impossible to write a library
that allows more than just rectangular areas and one out
of a limited number of options for justification.

> Further, if one is trying to have a lightweight application, whether in
> Javascript in a web page or in C on a mid-size micro, bringing in a huge
> library to accomplish something that should be simple would go against
> that philosophy.

True. But I don't think there is a way around that, if you want to
support "all languages", for any value of "all". The complexity is going
to be in the library or it is going to be in the application, but either
way it is going to be there and you are going to have to pull it in.

supe...@casperkitty.com

Apr 11, 2018, 10:59:16 AM
On Tuesday, April 10, 2018 at 9:06:03 PM UTC-5, Richard Damon wrote:
> Except that, as I pointed out, there IS a very detailed standard convention
> for doing this. No, it isn't a routine in the C standard library, but
> since the C language spec doesn't require that the execution environment
> particularly support Unicode, that isn't too surprising.

The Unicode consortium defines normalized forms with combined and uncombined
diacritics. Is there anything comparable for RTL/LTR text, such that a
standard-conforming "convert this text to LTR-normalized form" routine will
allow a text layout function to simply output things in order?

If there is, great--I'd love to know more about what terminology is used
to describe it. If not, its absence is what I'm complaining about.
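For comparison, the combined/uncombined normalization mentioned above is directly scriptable; a minimal illustration using Python's standard `unicodedata` module (this covers only the diacritic case, not the RTL question being asked):

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# NFC folds the pair into the single precomposed character U+00E9,
# and NFD undoes it.
assert composed == "\u00e9"
assert unicodedata.normalize("NFD", composed) == decomposed
```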

> I am sure there are libraries available to handle this sort of thing
> too, but you seem to want to reject that because it might not meet your
> exact requirements, which then makes things boil down to the tautalogy
> that a program that wants to do X (in its own customized manner), needs
> to know how to do X.

I have some text, and a routine to display a rectangular chunk of it. I
need to display the overall text as a collection of rectangular chunks.
That doesn't seem particularly specialized. It would be easy if the text
could be subdivided into chunks that can be individually measured and
rendered, but there's no normalized form to do that.

> For storing Unicode properties for characters, the simplistic method of
> just a huge bit table, would require just over a megabit per unique
> property you want to store. (This would be just a simple extension of
> the method used by some libraries to handle information about 8 bit
> characters which often use a 257 entry table (to include -1 as a
> possible value).

> Because there is a reasonable amount of pattern in this sort of data, it
> can be a lot more compact to build something like a span tree to see if
> the character has or doesn't have a given property.

Unicode should have decided whether it was supposed to be a high-level
text description language, or a low-level description of individual
characters, and defined things on that basis, or if there should be
separate concepts for a high-level and low-level string, with valid
high-level strings being required to contain validly nested hierarchical
structures. While the original design intention was that concatenating
two UTF-8 strings should just "work", and strings could be clipped at
easily-identifiable points, that would require a concept of a complete
high-level string, which Unicode doesn't define. If it had implemented
grapheme clusters using a "start of cluster" code followed by a sequence
of six-bit (UTF-8) or twelve-bit (UTF-16) extension characters, followed
by an "end of cluster" code, that would have made it possible to identify
places where a string can be split cleanly. As it is, things like flags
are implemented in a way that may require searching back arbitrarily far
in the text to determine, e.g., whether a repeating sequence of
"flag G" "flag B" characters should show up as England or Bulgaria.
Silliness.
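The flag complaint can be made concrete: regional-indicator symbols pair up purely by position, with nothing in the stream marking where one pair ends. A small Python sketch (`flag` is a hypothetical helper, not a standard function):

```python
RI_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code):
    # Map a two-letter code to its regional-indicator codepoint pair.
    return "".join(chr(RI_A + ord(c) - ord("A")) for c in country_code)

s = flag("GB") + flag("BG")  # four regional indicators in a row
# No boundary marker exists between the pairs; a renderer that starts
# pairing from the wrong offset would group B+B as a different "flag",
# so finding the correct grouping can require scanning back to the
# start of the whole regional-indicator run.
assert len(s) == 4
assert s[:2] == flag("GB")
```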

anti...@math.uni.wroc.pl

Apr 11, 2018, 3:30:59 PM
Jean-marc Lienher <jean-mar...@bluewin.ch> wrote:
> Hi,
>
> I would like the opinion of non-US and non-European people about an
> International standard for text encoding that would be better suited for
> fonts than Unicode and that would better represent the languages of the
> World.
>
> Is there a need for this now or in the future ?
>
> See:
> https://github.com/public-domain/pit-8
>
> Please forward this message to people who could be interested.

Looks misguided to me. Unicode won against an earlier ISO proposal
which would have partitioned the codespace into blocks and populated
each block according to national standards. Unicode is a mess,
but the ISO version, and what you propose, would be a _much_ worse mess.

IMO the main fault of Unicode is insisting on a single codepoint
per glyph. They finally gave up and allowed composite characters,
but too late. AFAICS composite characters allow representation
of a large character set without excessive overhead: an extreme example
is Korean, where a large character set can be reduced to short
sequences of jamos. But IIUC there are literally millions of
composite characters in Indian scripts; trying to assign
a codepoint to each of them is a losing battle (and a few revisions
ago Unicode admitted defeat).

Another remark: AFAICS a character set with a million unstructured
characters is impossible for humans to handle and inconvenient for
computers. Such a character set becomes manageable thanks to
structure -- treating characters as composite is a reasonably
general way to impose structure.

One more remark: advocates of oriental scripts speak of
"cultural imperialism" and losing history. But we should
note that Europe did a lot of work to simplify its writing
system, and Europe discontinued various old conventions.
So I think that it is reasonable to ask the rest of the world
to do some work to make their writing systems more
efficient, and to accept the fact that computer systems designed
to exploit the efficiency of European writing systems will
not work as well for oriental ones.

--
Waldek Hebisch

supe...@casperkitty.com

Apr 11, 2018, 4:00:30 PM
On Wednesday, April 11, 2018 at 2:30:59 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> IMO the main fault of Unicode is insisting on single codepoint
> per glyph. They finally gave up allowing composite characters,
> but too late. AFAICS composite charactes allow representation
> of large character set without excessive overheads: extreme example
> is Korean, where large charactes set can be reduced to short
> seqences of jamos. But IIUC there are literally milions of
> composite characters in Indian scripts, trying to assign
> a codepoint to each of them is loosing battle (and few revisions
> ago Unicode admited defeat).

That's a fundamentally broken approach, as the Unicode consortium finally
realized. The HTML approach for identifying entities is better in many
ways, save for the use of a commonplace printable character as the start
marker. While some people think it's "unfair" that users of the Latin
alphabet can store text more efficiently, a huge amount of the text that
is processed by computers, even in Japan, is within ASCII range 0-127
because it's designed to be processed by computers.

Concepts like code-page switching are useful for improving the storage
efficiency of strings which are "at rest" and will not be sliced and diced
unless or until they are converted into some other form. Having some
standard normalized forms, and a description of how conversion routines
should behave, would make it practical to allow different representations
to be used for different purposes, each in places where it was most suitable.

Richard Damon

Apr 12, 2018, 11:16:50 PM
On 4/11/18 10:59 AM, supe...@casperkitty.com wrote:
> On Tuesday, April 10, 2018 at 9:06:03 PM UTC-5, Richard Damon wrote:
>> Except that, as I pointed out, there IS a very detailed standard convention
>> for doing this. No, it isn't a routine in the C standard library, but
>> since the C language spec doesn't require that the execution environment
>> particularly support Unicode, that isn't too surprising.
>
> The Unicode consortium defines normalized forms with combined and uncombined
> diacritics. Is there anything comparable for RTL/LTR text, such that a
> standard-conforming "convert this text to LTR-normalized form" routine will
> allow a text layout function to simply output things in order?
>
> If there is, great--I'd love to know more about what terminology is used
> to describe it. If not, its absence is what I'm complaining about.

Maybe you mean http://www.unicode.org/reports/tr9/
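For reference, TR9 starts from a per-character bidirectional class (L, R, AL, EN, ...), which can be inspected through Python's standard `unicodedata` module:

```python
import unicodedata

# The Bidi_Class property is what drives the TR9 algorithm.
assert unicodedata.bidirectional("a") == "L"       # strong left-to-right
assert unicodedata.bidirectional("\u05d0") == "R"  # Hebrew alef: strong right-to-left
assert unicodedata.bidirectional("1") == "EN"      # European number
assert unicodedata.bidirectional(",") == "CS"      # common separator: direction-neutral
```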

supe...@casperkitty.com

Apr 16, 2018, 12:00:20 PM
On Thursday, April 12, 2018 at 10:16:50 PM UTC-5, Richard Damon wrote:
> > The Unicode consortium defines normalized forms with combined and uncombined
> > diacritics. Is there anything comparable for RTL/LTR text, such that a
> > standard-conforming "convert this text to LTR-normalized form" routine will
> > allow a text layout function to simply output things in order?
> >
> > If there is, great--I'd love to know more about what terminology is used
> > to describe it. If not, its absence is what I'm complaining about.
>
> Maybe you mean http://www.unicode.org/reports/tr9/

I searched for "normalized" and "form", and didn't see any reference to such
a thing.

What I'm after would be something where code which is given a function to
convert a Unicode string to an XXX-normalized form would then be able to
lay it out without having to know all the detailed rules associated with
RTL and LTR scripts beyond recognizing explicit direction markers. Was
such a definition given but using other words?

Richard Damon

Apr 16, 2018, 9:39:07 PM
I think the issue is that such a form isn't possible, as LTR/RTL
processing isn't that simple from a language standpoint: the 'shape' of
the output affects the results.

For instance, given that lower-case letters represent left-to-right and
upper-case letters right-to-left, the string (in logical order) of

abc def ghi ABC DEF GHI

when output to a limited-length line might show up as

abc def ghi FED CBA
IHG

while on a longer line it would be

abc def ghi IHG FED CBA

Then you get that some characters (mostly punctuation) adopt the
direction of the text around them, and some direction reversals 'nest':
if you have a left-to-right dominant paragraph with a right-to-left
section embedded in it, that section can have a left-to-right section in
it that flows with the right-to-left.

I suppose if you really wanted to, you could do a processing step to
insert a code into the string to indicate the bidirectionality
class of the following group of characters, and maybe process some
of the conditionality of the rules, so that the final render doesn't
need to do that sort of lookup. But it still needs more than just "this
is LTR and this is RTL", because that is what the linguists
have decided is the proper way to process a mixed-direction string to
handle all the cases they could think of.

The document gives a very detailed description of how to do the layout.
They didn't invent an intermediate form to express the string in,
perhaps because final Unicode rendering still needs enough smarts (due
to combining characters) that it is really intended that the final
render just be given Unicode with the defined markup hints that were
added to allow some explicit controls.
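A toy version of the reordering step sketched in the example above: treat upper case as RTL, and after line breaking, reverse each maximal run of "RTL" words within the display line. This deliberately ignores punctuation mirroring, nesting, and everything else TR9 specifies; `display_order` is a hypothetical name:

```python
import re

def display_order(line):
    # Toy model: lower-case words are LTR, upper-case words are RTL.
    # Reverse each maximal run of upper-case words, spaces included,
    # so the run reads right-to-left in the rendered line.
    return re.sub(r"[A-Z]+(?: [A-Z]+)*", lambda m: m.group(0)[::-1], line)

# Unbroken line: the whole RTL run flips.
assert display_order("abc def ghi ABC DEF GHI") == "abc def ghi IHG FED CBA"
# After wrapping, each display line is reordered independently,
# which is why the reordering cannot precede line breaking.
assert display_order("abc def ghi ABC DEF") == "abc def ghi FED CBA"
assert display_order("GHI") == "IHG"
```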

supe...@casperkitty.com

Apr 17, 2018, 10:44:44 AM
On Monday, April 16, 2018 at 8:39:07 PM UTC-5, Richard Damon wrote:
> > What I'm after would be something where code which is given a function to
> > convert a Unicode string to an XXX-normalized form would then be able to
> > lay it out without having to know all the detailed rules associated with
> > RTL and LTR scripts beyond recognizing explicit direction markers. Was
> > such a definition given but using other words?
> >
>
> I think the issue is that such a form isn't possible, as LTR/RTL
> processing isn't that simple from a language basis, but the 'shape' of
> the output affects the results.
>
> For instance, given lower case letters represent left to right, and
> upper case letter Right to Left, the string (in logical order) of
>
> abc def ghi ABC DEF GHI
>
> when outputted to a limited length line might show up as
>
> abc def ghi FED CBA
> IHG
>
> while on a longer line it would be
>
> abc def ghi IHG FED CBA

Indeed. Given a function to normalize a string with explicit "temporary"
direction markers, one would take the original string [shown here LTR in
semantic order]

(ltr)abc def ghi (rtl)FED CBA IHG

Then determine how many words will fit on a line. If the answer is 5, then
one would take the string

(ltr)abc def ghi (rtl)FED CBA

and convert everything to LTR, yielding

(ltr)abc def ghi (rtl-reversed)ABC DEF

and display those words, in order, each prefixed by the appropriate
direction indication.

> Then you get that some characters (mostly punctuation) adopt the
> direction of the stuff around them, and some direction reversals 'nest'
> so if you have a left-to-right dominate paragraph, with a right-to-left
> section embedded in it, that section can have a left-to-right section in
> it that flows with the right-to-left.

Because of the way mirrored punctuation works, it's necessary to distinguish
between LTR text in forward order and RTL text in reverse order, and likewise
RTL forward and LTR reverse.

Note that many legacy documents containing RTL scripts represent the
characters in reverse order (among other things, programs that could use a
Hebrew font were much more common than those which could set bidirectional
text, so people using such programs would typically type characters in the
sequence opposite their semantic order, being careful to avoid allowing any
line breaks in multi-word bits of Hebrew embedded within English text).

> I suppose if you really wanted to, you could do a processing step to
> insert a code into the string to indicate what the bidirectionality
> class of the following group of characters are, and maybe process some
> of the conditionality of the rules, so that the final render doesn't
> need to do that sort of lookup, but it still needs more than just this
> is LTR and this is RTL, and this is because that is what the linguists
> have decided is the proper way to process a mixed direction string to
> handle all the cases they could think of.

Finding word breaks with text in semantic order, then normalizing everything
to either LTR or RTL order, computing intra-line spacing, and then showing
everything, might not handle all cases perfectly but it would handle a lot
of cases, easily.

Further, from a UI perspective I think it would be far better to have a
text-entry application generate direction markers while text is being typed
than have the characters in the text imply direction changes, since rules
which make sense for some kinds of text would yield nonsensical results when
applied to others, and a special-purpose text-entry application can apply
rules which are suitable to its purpose.

> The document gave a very detail description of how to do the layout.
> They didn't invents some intermediate form to express the string in,
> perhaps because final unicode rendering still needs enough smarts (due
> to combining characters) that it is really intended that the final
> render just be given Unicode with the defined markup hints that were
> added to allow some explicit controls.

Doing a perfect rendering job may require a lot of smarts, but getting 90%
of the way there while doing something "unusual" with text shouldn't require
implementing all of the directionality rules from scratch.

Richard Damon

Apr 17, 2018, 9:20:46 PM
On 4/17/18 10:44 AM, supe...@casperkitty.com wrote:
> On Monday, April 16, 2018 at 8:39:07 PM UTC-5, Richard Damon wrote:
>>> What I'm after would be something where code which is given a function to
>>> convert a Unicode string to an XXX-normalized form would then be able to
>>> lay it out without having to know all the detailed rules associated with
>>> RTL and LTR scripts beyond recognizing explicit direction markers. Was
>>> such a definition given but using other words?
>>>
>>
>> I think the issue is that such a form isn't possible, as LTR/RTL
>> processing isn't that simple from a language basis, but the 'shape' of
>> the output affects the results.
>>
>> For instance, given lower case letters represent left to right, and
>> upper case letter Right to Left, the string (in logical order) of
>>
>> abc def ghi ABC DEF GHI
>>
>> when outputted to a limited length line might show up as
>>
>> abc def ghi FED CBA
>> IHG
>>
>> while on a longer line it would be
>>
>> abc def ghi IHG FED CBA
>
> Indeed. Given a function to normalize a string with explicit "temporary"
> direction markers, one would take the original string [shown here LTR in
> semantic order]
>
> (ltr)abc def ghi (rtl)FED CBA IHG

I have no idea what you did here. My original message was intended to
represent the order the characters would be put in memory, which would
match the order someone would say them if they were going to
phonetically spell them. There is NO way that a transform should be able
to scramble such an order, at least not until new lines were inserted.

>
> Then determine how many words will fit on a line. If the answer is 5, then
> one would take the string
>
> (ltr)abc def ghi (rtl)FED CBA
>
> and convert everything to LTR, yielding
>
> (ltr)abc def ghi (rtl-reversed)ABC DEF
>
> and display those words, in order, each prefixed by the appropriate
> direction indication.
>
>> Then you get that some characters (mostly punctuation) adopt the
>> direction of the stuff around them, and some direction reversals 'nest'
>> so if you have a left-to-right dominate paragraph, with a right-to-left
>> section embedded in it, that section can have a left-to-right section in
>> it that flows with the right-to-left.
>
> Because of the way mirrored punctuation works, it's necessary to distinguish
> between LTR text in forward order and RTL text in reverse order, and likewise
> RTL forward and LTR reverse.
>
> Note that many legacy documents containing RTL scripts represent the
> characters in reverse order (among other things, programs that could use a
> Hebrew font were much more common than those which could set bidirectional
> text, so people using such programs would typically type characters in the
> sequence opposite their semantic order, being careful to avoid allowing any
> line breaks in multi-word bits of Hebrew embedded within English text

And I believe that if you want to do this sort of thing in Unicode you
would need to use an explicit direction override, which you would encase
the RTL data in to force it to be displayed in LTR order.
>
>> I suppose if you really wanted to, you could do a processing step to
>> insert a code into the string to indicate what the bidirectionality
>> class of the following group of characters are, and maybe process some
>> of the conditionality of the rules, so that the final render doesn't
>> need to do that sort of lookup, but it still needs more than just this
>> is LTR and this is RTL, and this is because that is what the linguists
>> have decided is the proper way to process a mixed direction string to
>> handle all the cases they could think of.
>
> Finding word breaks with text in semantic order, then normalizing everything
> to either LTR or RTL order, computing intra-line spacing, and then showing
> everything, might not handle all cases perfectly but it would handle a lot
> of cases, easily.
>
> Further, from a UI perspective I think it would be far better to have a
> text-entry application generate direction markers while text is being typed
> than have the characters in the text imply direction changes, since rules
> which make sense for some kinds of text would yield nonsensical results when
> applied to others, and a special-purpose text-entry application can apply
> rules which are suitable to its purpose).

Well, Unicode was designed to minimize the number of explicit direction
marks that would need to be inserted into text, as the letters
themselves have defined direction rules designed to make most things
work out. For example, typing my sample string above would go through
steps like those below (now shown in display order)

a
ab
abc
...
abc def ghi
abc def ghi A
abc def ghi BA
abc def ghi CBA
...
abc def ghi IHG FED CBA

Note also, for this the text stream needs no direction markers embedded
in it, as all the directionality is built into the characters.

>
>> The document gave a very detail description of how to do the layout.
>> They didn't invents some intermediate form to express the string in,
>> perhaps because final unicode rendering still needs enough smarts (due
>> to combining characters) that it is really intended that the final
>> render just be given Unicode with the defined markup hints that were
>> added to allow some explicit controls.
>
> Doing a perfect rendering job may require a lot of smarts, but getting 90%
> of the way there while doing something "unusual" with text shouldn't require
> implementing all of the directionality rules from scratch.
>

Since you started by assuming you couldn't use a supplied library
function because you wanted to be able to do it your own way, maybe it
does, since you threw out the option of using something provided.

supe...@casperkitty.com

Apr 18, 2018, 12:55:17 PM
On Tuesday, April 17, 2018 at 8:20:46 PM UTC-5, Richard Damon wrote:
> On 4/17/18 10:44 AM, supe...@casperkitty.com wrote:
> > Indeed. Given a function to normalize a string with explicit "temporary"
> > direction markers, one would take the original string [shown here LTR in
> > semantic order]
> >
> > (ltr)abc def ghi (rtl)FED CBA IHG
>
> I have no idea what you did here. My original message was intended to
> represent the order the characters would be put in memory, which would
> match the order someone would say them if they were going to
> phonetically spell them. There is NO way that a transform should be able
> to scramble such an order, at least not until new lines were inserted.

There should be three ways of representing a piece of text: with the
characters in semantic order, with the characters in left-to-right order,
or with the characters in right-to-left order. While it would be possible
to have functions that convert to the latter formats irreversibly, I think
it is probably better to define a process that would embed markings that
could be used to restore text to its previous order.

> > Then determine how many words will fit on a line. If the answer is 5, then
> > one would take the string
> >
> > (ltr)abc def ghi (rtl)FED CBA
> >
> > and convert everything to LTR, yielding
> >
> > (ltr)abc def ghi (rtl-reversed)ABC DEF
> >
> > and display those words, in order, each prefixed by the appropriate
> > direction indication.

> And I believe that if you want to do this sort of thing in Unicode you
> would need to use an explicit direction override, which you would encase
> the RTL data in to force it to be displayed in LTR order.

Such an override would be the "appropriate direction indication" with which
each word would be prefixed.

> > Finding word breaks with text in semantic order, then normalizing everything
> > to either LTR or RTL order, computing intra-line spacing, and then showing
> > everything, might not handle all cases perfectly but it would handle a lot
> > of cases, easily.

> Well, Unicode was designed to minimize the number of explicit direction
> marks that would need to be inserted into text, as the letters
> themselves have defined direction rules designed to make most things
> works out. For example, typing my sample string above would go through
> steps like below (now shown in display order)

> a
> ab
> abc
> ...
> abc def ghi
> abc def ghi A
> abc def ghi BA
> abc def ghi CBA
> ...
> abc def ghi IHG FED CBA
>
> Note also, for this the text stream needs no direction markers embedded
> in it, as all the directionalality is built into the characters.

How is that better than having the text editing UI automatically insert
marks to embed different-direction objects as appropriate, according to
its intended purpose?

In most cases where a text contains both LTR and RTL scripts, things of
one particular direction should be treated as nested within the other.
For example, if I have a list "1 dog, 2 cats, 3 zebras, and 4 unicorns",
and I replace the word "cats" with a Hebrew word "חתול", how likely is it
that I would want the list to appear as "1 dog, 2 חתול, 3 zebras, and 4
unicorns"? Perhaps people used to reading mixed-up-direction text would
recognize that there are two חתול and three zebras, but I don't think
most people would be able to figure that out.

> >> The document gave a very detail description of how to do the layout.
> >> They didn't invents some intermediate form to express the string in,
> >> perhaps because final unicode rendering still needs enough smarts (due
> >> to combining characters) that it is really intended that the final
> >> render just be given Unicode with the defined markup hints that were
> >> added to allow some explicit controls.
> >
> > Doing a perfect rendering job may require a lot of smarts, but getting 90%
> > of the way there while doing something "unusual" with text shouldn't require
> > implementing all of the directionality rules from scratch.
>
> Since you started by assuming you couldn't use a supplied library
> function because you wanted to be able to do it your own way, maybe it
> does, since you threw out the option to use something provided.

Many platforms include "render this string at this coordinate", and
"measure this string", and they also include "format this text into this
box", but they don't have anything between those two levels of operation.

Richard Damon

unread,
Apr 18, 2018, 9:15:01 PM4/18/18
to
On 4/18/18 12:54 PM, supe...@casperkitty.com wrote:
> On Tuesday, April 17, 2018 at 8:20:46 PM UTC-5, Richard Damon wrote:
>> On 4/17/18 10:44 AM, supe...@casperkitty.com wrote:
>>> Indeed. Given a function to normalize a string with explicit "temporary"
>>> direction markers, one would take the original string [shown here LTR in
>>> semantic order]
>>>
>>> (ltr)abc def ghi (rtl)FED CBA IHG
>>
>> I have no idea what you did here. My original message was intended to
>> represent the order the characters would be put in memory, which would
>> match the order someone would say them if they were going to
>> phonetically spell them. There is NO way that a transform should be able
>> to scramble such an order, at least not until new lines were inserted.
>
> There should be three ways of representing a piece of text: with the
> characters in semantic order, with the characters in left-to-right order,
> or with the characters in right-to-left order. While it would be possible
> to have functions that convert to the latter formats irreversibly, I think
> it is probably better to define a process that would embed markings that
> could be used to restore text to its previous order.

I don't understand why, in part because of a significant HOW problem:
the conversion of embedded text to the 'wrong' direction depends on
how the line breaks are going to fall.
>
>>> Then determine how many words will fit on a line. If the answer is 5, then
>>> one would take the string
>>>
>>> (ltr)abc def ghi (rtl)FED CBA
>>>
>>> and convert everything to LTR, yielding
>>>
>>> (ltr)abc def ghi (rtl-reversed)ABC DEF
>>>
>>> and display those words, in order, each prefixed by the appropriate
>>> direction indication.
>
>> And I believe that if you want to do this sort of thing in Unicode you
>> would need to use an explicit direction override, which you would encase
>> the RTL data in to force it to be displayed in LTR order.
>
> Such an override would be the "appropriate direction indication" with which
> each word would be prefixed.
>

Define 'word'; that is unfortunately not a well-defined concept, and there
are several levels of locations where text can be broken for line breaks.

>>> Finding word breaks with text in semantic order, then normalizing everything
>>> to either LTR or RTL order, computing intra-line spacing, and then showing
>>> everything, might not handle all cases perfectly but it would handle a lot
>>> of cases, easily.
>
>> Well, Unicode was designed to minimize the number of explicit direction
>> marks that would need to be inserted into text, as the letters
>> themselves have defined direction rules designed to make most things
>> work out. For example, typing my sample string above would go through
>> steps like below (now shown in display order)
>
>> a
>> ab
>> abc
>> ...
>> abc def ghi
>> abc def ghi A
>> abc def ghi BA
>> abc def ghi CBA
>> ...
>> abc def ghi IHG FED CBA
>>
>> Note also, for this the text stream needs no direction markers embedded
>> in it, as all the directionality is built into the characters.
>
> How is that better than having the text editing UI automatically insert
> marks to embed different-direction objects as appropriate, according to
> its intended purpose?

Because the text in the buffer is in logical order, not some sort of
display order, which is likely what is wanted, and the display engine
knows what it needs to do to convert that to a display. Yes, the system
could insert explicit direction marks into the string, but that might
require multiple changes in the stored string for every character put
into the buffer, and then both the input routine and the output routine
need full knowledge of the directionality rules.
>
> In most cases where a text contains both LTR and RTL scripts, things of
> one particular direction should be treated as nested within the other.
> For example, if I have a list "1 dog, 2 cats, 3 zebras, and 4 unicorns",
> and I replace the word "cats" with a Hebrew word "חתול", how likely is it
> that I would want the list to appear as "1 dog, 2 חתול, 3 zebras, and 4
> unicorns"? Perhaps people used to reading mixed-up-direction text would
> recognize that there are two חתול and three zebras, but I don't think
> most people would be able to figure that out.

Actually, the Unicode rules can handle this. And this is one problem
with your single LTR and RTL codes: you actually need nesting
direction marks. My understanding of the Unicode rules is that without
adding embedding codes, and replacing cats by CATS (using upper case to
mark intrinsically RTL letters), the rendering should be:

I have a list: 1 dog, 2 STAC, 3 zebras, and 4 unicorns.
and in fact if all the words were so replaced you would get

I have a list: 1 GOD, 2 STAC, 3 SARBEZ, and 4 SNROCINU.

as the numbers' preferred LTR orientation resets between each of the
words. On the other hand, if you add the embedded RTL code around the
list, you would get

I have a list: SNROCINU 4 and ,SARBEZ 3 ,STAC 2 ,GOD 1
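
The directional behavior described above comes from per-character classes; a minimal sketch in Python using the standard unicodedata module shows the Bidi_Class values that drive it (the sample characters are my own):

```python
import unicodedata

# Every code point carries a bidirectional class: strong classes such as
# 'L' (Left-to-Right) and 'R' (Right-to-Left) set direction on their own,
# while weak classes such as 'EN' (European Number) defer to context,
# which is why digits get pulled around by neighboring RTL text.
classes = {ch: unicodedata.bidirectional(ch) for ch in ["a", "\u05d0", "5", ","]}
```

Here 'a' is strong 'L', the Hebrew alef U+05D0 is strong 'R', and a digit is only the weak class 'EN', so no explicit marks are needed for plain runs of letters.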


>
>>>> The document gave a very detailed description of how to do the layout.
>>>> They didn't invent some intermediate form to express the string in,
>>>> perhaps because final unicode rendering still needs enough smarts (due
>>>> to combining characters) that it is really intended that the final
>>>> render just be given Unicode with the defined markup hints that were
>>>> added to allow some explicit controls.
>>>
>>> Doing a perfect rendering job may require a lot of smarts, but getting 90%
>>> of the way there while doing something "unusual" with text shouldn't require
>>> implementing all of the directionality rules from scratch.
>>
>> Since you started by assuming you couldn't use a supplied library
>> function because you wanted to be able to do it your own way, maybe it
>> does, since you threw out the option to use something provided.
>
> Many platforms include "render this string at this coordinate", and
> "measure this string", and they also include "format this text into this
> box", but they don't have anything between those two levels of operation.
>

But with measure this string, and render this string at this coordinate,
it is possible (if slightly inefficient) to perform your operation. You
might need to 'guess' at a break point, and then try longer or shorter
to find the right break point, but it is doable.
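
That guess-and-adjust idea can be sketched with a greedy search; the measure function here is a hypothetical stand-in for a platform's "measure this string" call, stubbed as a character count as if every glyph were one unit wide:

```python
# Greedy line breaking built only on a "measure this string" primitive.
def measure(s: str) -> int:
    # Hypothetical stub: real code would call the platform's text metrics.
    return len(s)

def break_line(words: list[str], max_width: int) -> tuple[str, list[str]]:
    """Return the longest prefix of `words` that fits, plus the remainder."""
    line = ""
    for i, word in enumerate(words):
        candidate = word if not line else line + " " + word
        if measure(candidate) > max_width:
            return line, words[i:]
        line = candidate
    return line, []

line, rest = break_line("testing one two three four five".split(), 20)
```

A real implementation would bisect rather than scan when measuring is expensive, but the shape of the search is the same.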

supe...@casperkitty.com

unread,
Apr 19, 2018, 4:31:24 PM4/19/18
to
On Wednesday, April 18, 2018 at 8:15:01 PM UTC-5, Richard Damon wrote:
> On 4/18/18 12:54 PM, supe...@casperkitty.com wrote:
> > There should be three ways of representing a piece of text: with the
> > characters in semantic order, with the characters in left-to-right order,
> > or with the characters in right-to-left order. While it would be possible
> > to have functions that convert to the latter formats irreversibly, I think
> > it is probably better to define a process that would embed markings that
> > could be used to restore text to its previous order.
>
> I don't understand why, in part because of a significant HOW problem:
> the conversion of embedded text to the 'wrong' direction depends on
> how the line breaks are going to fall.

If a document is produced by OCR, or is imported from a legacy format which
did not handle bidirectional scripts and required that any Hebrew text be
typed in reverse order with manually-inserted line breaks, it should be
possible to represent "a digit five to the left of a Hebrew alef" without
implying any judgment about which character is "first".

Otherwise, for type layout purposes, the conversion to fixed-direction text
would be done *after* all line breaks are processed.

> > Such an override would be the "appropriate direction indication" with which
> > each word would be prefixed.
>
> Define 'word'; that is unfortunately not a well-defined concept, and there
> are several levels of locations where text can be broken for line breaks.

Line breaks would be resolved first, with text in "semantic order". Word
subdivisions would be used, after handling line breaks, for things like
full justification.

> > How is that better than having the text editing UI automatically insert
> > marks to embed different-direction objects as appropriate, according to
> > its intended purpose?
>
> Because the text in the buffer is in the logical order, not some sort of
> display order, which is likely what is wanted, and the display engine
> know what it needs to convert that to a display. Yes, the system could
> insert explicit direction marks into the string, but that might require
> multiple changes in the stored string for every character put into the
> buffer, and now both the input routine and the output routine needs full
> knowledge of the directionality rules.

Using uppercase for RTL, if one has some text:

testing one two three FOUR FIVE SIX seven eight nine

and there should be a line break after "FIVE", should the text
appear as

ROUF testing one two three
seven eight nine XIS EVIF

or as

testing one two three ROUF
XIS EVIF seven eight nine

or as something else? The former would be appropriate if the text contains
two LTR quotations within an RTL paragraph; the latter would be appropriate
if it contains an RTL quotation within an LTR paragraph.

Conversely, if an OCR program were to identify the characters in the
sequence shown by the second example, by what means should it know whether
they represent:

testing one two three FOUR
FIVE SIX seven eight nine

or

FOUR testing one two three
seven eight nine FIVE SIX

or something else? If an import program guesses that the text was a
couple of LTR quotations bracketing a piece of RTL text, it might produce
the erroneous second guess above, but if it kept track of the order of
the original characters it would be possible to select the text, tell the
program that it was an RTL quotation within an LTR paragraph, and have it
rearrange things correctly.

> > In most cases where a text contains both LTR and RTL scripts, things of
> > one particular direction should be treated as nested within the other.
> > For example, if I have a list "1 dog, 2 cats, 3 zebras, and 4 unicorns",
> > and I replace the word "cats" with a Hebrew word "חתול", how likely is it
> > that I would want the list to appear as "1 dog, 2 חתול, 3 zebras, and 4
> > unicorns"? Perhaps people used to reading mixed-up-direction text would
> > recognize that there are two חתול and three zebras, but I don't think
> > most people would be able to figure that out.
>
> Actually, the Unicode rules can handle this. And this is one problem
> with your single LTR and RTL code, but you actually need nesting
> direction marks. My understanding of unicode rules would be that without
> adding embedding codes, and replacing cats by CATS (using upper case to
> mark intrinsically RTL letters), the rendering should be:
>
> I have a list: 1 dog, 2 STAC, 3 zebras, and 4 unicorns.
> and in fact if all the words were so replaced you would get
>
> I have a list: 1 GOD, 2 STAC, 3 SARBEZ, and 4 SNROCINU.
>
> as the numbers prefered LTR orientation resets between each of the
> words. On the other hand, if you add the embedded RTL code around the
> list, you would get
>
> I have a list: SNROCINU 4 and ,SARBEZ 3 ,STAC 2 ,GOD 1

When using a Unicode bidirection text display, replacing cats with Hebrew
would lay out the line as:

"1 dog, 2 3 ,STAC zebras, and 4 unicorns".

Perhaps the post didn't show up that way for you when I used Unicode
Hebrew characters? The 2 gets processed as LTR because the preceding
text was LTR, while the 3 gets joined with the preceding RTL text and
thus appears between the "2" and the word "STAC".

How many readers would look at the above and figure out that there were
two STAC and three zebras?

> > Many platforms include "render this string at this coordinate", and
> > "measure this string", and they also include "format this text into this
> > box", but they don't have anything between those two levels of operation.
>
> But with measure this string, and render this string at this coordinate,
> it is possible (if slightly inefficient) to perform your operation. You
> might need to 'guess' at a break point, and then try longer or shorter
> to find the right break point, but it is doable.

In the absence of mixed-direction scripts, it would be fairly
straightforward. The problem is that once code figures out where the
line breaks are, there's no nice way to then figure out what order to
display things in.

Richard Damon

unread,
Apr 19, 2018, 11:05:31 PM4/19/18
to
On 4/19/18 4:31 PM, supe...@casperkitty.com wrote:
> On Wednesday, April 18, 2018 at 8:15:01 PM UTC-5, Richard Damon wrote:
>> On 4/18/18 12:54 PM, supe...@casperkitty.com wrote:
>>> There should be three ways of representing a piece of text: with the
>>> characters in semantic order, with the characters in left-to-right order,
>>> or with the characters in right-to-left order. While it would be possible
>>> to have functions that convert to the latter formats irreversibly, I think
>>> it is probably better to define a process that would embed markings that
>>> could be used to restore text to its previous order.
>>
>> I don't understand why, in part because of a significant HOW problem:
>> the conversion of embedded text to the 'wrong' direction depends on
>> how the line breaks are going to fall.
>
> If a document is produced by OCR, or is imported from a legacy format which
> did not handle bidirectional scripts and required that any Hebrew text be
> typed in reverse order with manually-inserted line breaks, it should be
> possible to represent "a digit five to the left of a Hebrew alef" without
> implying any judgment about which character is "first".

Which sounds like you just want to use the Unicode direction override
code. Now we get to the original problem you proposed, the need to take
a currently logically ordered text string and convert it into this
strange format, and again the question is WHY? (Or maybe: why do you
have the expectation that modern standards should provide easy ways to
hack data up into ancient formats?)

>
> Otherwise, for type layout purposes, the conversion to fixed-direction text
> would be done *after* all line breaks are processed.

Yes, the output of a layout program may well be a character string with
an LTR direction override code to say that everything following is LTR
(even if the character would otherwise be RTL), and it would include the
line breaks, not the strange mishmash you first described where each
word has a direction code.

You obviously didn't read the page I pointed to, as it defines this sort
of thing. And one key point is that you are not using the right terms:
LTR and RTL are NOT simply 'shift codes' that just change the direction
of the following text. They are actually a number of different codes
with differing strengths of control, generally paired as begin and end
markers and nested, with a default for the overall document, possibly
overridden for smaller pieces.

Thus your original string, assuming that string is in an LTR default
region, would be rendered as (if I am doing it right):

testing one two three EVIF RUOF
XIS seven eight nine.

Yes, in Unicode you need to define, with direction marks, your general
text direction and the extents of alternate-direction quotes etc. Thus
your 'import' program that takes in properly formed Unicode text doesn't
need to 'guess', as the stream will define all that information.

Yes, for your OCR example, as a first step, probably generates the text
with a forced LTR (or RTL) direction control, and maybe then do a pass
to improve the representation (actually GOOD OCR, likely does this at
the beginning as using context to help with character recognition can
help greatly with accuracy for normal text, as that processing would at
least identify direction of words.)
The issue is that you didn't enter the data right. The issue is that
Arabic numbers (sort of like many punctuation characters) are just
weakly LTR so become RTL in the presence of RTL characters, so you need
to be aware of such things at data entry and sometimes you need to
provide overrides to correct things. When mixing languages you need to
follow the rules to get what you want, even more so when they are human
languages.
>
>>> Many platforms include "render this string at this coordinate", and
>>> "measure this string", and they also include "format this text into this
>>> box", but they don't have anything between those two levels of operation.
>>
>> But with measure this string, and render this string at this coordinate,
>> it is possible (if slightly inefficient) to perform your operation. You
>> might need to 'guess' at a break point, and then try longer or shorter
>> to find the right break point, but it is doable.
>
> In the absence of mixed-direction scripts, it would be fairly
> straightforward. The problem is that once code figures out where the
> line breaks are, there's no nice way to then figure out what order to
> display things in.
>
The Render String at this coordinate function should do that (provided
you maintain the direction nesting context down to that string), at
least if it is properly Unicode aware.


supe...@casperkitty.com

unread,
Apr 20, 2018, 10:03:31 AM4/20/18
to
On Thursday, April 19, 2018 at 10:05:31 PM UTC-5, Richard Damon wrote:
> Which sounds like you just want to use the Unicode direction override
> code. Now we get to the original problem you proposed, the need to take
> a current logical ordered text string and convert it into this strange
> format, which again the question is WHY? (or maybe why do you have the
> expectation that modern standards should provide easy ways to hack up
> data into ancient formats).

What is needed is an easy way of ending up with chunks that can be
rendered in context-free fashion. There are many possible ways of
representing text that would make it possible to subdivide it into
such chunks. I wouldn't be especially attached to any particular
way of representing such text, provided there was a standard form
and it allowed for the extraction of chunks that can be rendered
separately.

> > Otherwise, for type layout purposes, the conversion to fixed-direction text
> > would be done *after* all line breaks are processed.
>
> Yes, the output of a layout program may well be a character string with
> a LTR direction override code to say that everything following is LTR
> (even if the character would be RTL), and would include the line breaks,
> not the strange mishmash you were first describing where each word has a
> direction code.

If the normalizing function can re-arrange the order of characters so as
to allow uniform LTR display, that would be great. The mish-mash was to
allow for the possibility that the designer of the normalizing spec would
want to keep characters in order, in which case it would have to supply
information to the application that would let the application do the
rearranging. As I said, there are many ways of achieving the same goal,
which is to allow for context-free rendering.

One of the stated design goals of Unicode encodings was to allow strings
to be sliced and diced while remaining semantically valid. While a
format that could accommodate that with bidirectional text would be
excessively bulky, having a normalized form which could accommodate that,
at the expense of extra bulk, would help Unicode better achieve that
stated design goal.

> Thus you original string, assuming that string is in a LTR default
> region would be rendered as (if I am doing it right):
>
> testing one two three EVIF RUOF
> XIS seven eight nine.
>
> Yes, in Unicode you need to define, with direction marks, your general text
> direction and the extents of alternate-direction quotes etc. Thus your
> 'import' program that takes in properly formed Unicode text doesn't need
> to 'guess', as the stream will define all that information.

If text is fully marked up with direction changes, there would be no need
to have a rendering engine try to infer direction based upon character
ranges. If text isn't marked up, then a rendering engine would be
guessing about how to display things.

> Yes, for your OCR example, as a first step, probably generates the text
> with a forced LTR (or RTL) direction control, and maybe then do a pass
> to improve the representation (actually GOOD OCR, likely does this at
> the beginning as using context to help with character recognition can
> help greatly with accuracy for normal text, as that processing would at
> least identify direction of words.)

For text which is set ragged-left or ragged-right, the choice of which
margin was straight would usually indicate the primary text direction,
but an OCR program would be guessing. Having an encoding which can
indicate "the words appeared in this left-to-right order" would allow
the information in the document to be reported according to what actually
appears without the program having to guess.

> > When using a Unicode bidirection text display, replacing cats with Hebrew
> > would lay out the line as:
> >
> > "1 dog, 2 3 ,STAC zebras, and 4 unicorns".
> >
> > Perhaps the post didn't show up that way for you when I used Unicode
> > Hebrew characters? The 2 gets processed as LTR because the preceding
> > text was LTR, while the 3 gets joined with the preceding RTL text and
> > thus appears between the "2" and the word "STAC".
> >
> > How many readers would look at the above and figure out that there were
> > two STAC and three zebras?
>
> The issue is that you didn't enter the data right. The issue is that
> Arabic numbers (sort of like many punctuation characters) are just
> weakly LTR so become RTL in the presence of RTL characters, so you need
> to be aware of such things at data entry and sometimes you need to
> provide overrides to correct things. When mixing languages you need to
> follow the rules to get what you want, even more so when they are human
> languages.

If a text-entry program noticed that one was pasting an "RTL text" object
into an LTR document and simply bracketed it with "embedded RTL" tags,
then copying and pasting the Hebrew word "STAC" over "cats" would yield
sensible behavior without the rendering engine having to know anything
about the directionality of individual characters. Requiring that
someone who doesn't normally work with bidirectional text but simply
wants to embed one Hebrew word in a document must learn all about the
complexities of bidirectional text seems rather less helpful.
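
A sketch of that paste behavior, using the real Unicode embedding marks RLE (U+202B) and PDF (U+202C); the `paste_rtl` helper is hypothetical:

```python
RLE = "\u202b"  # Right-to-Left Embedding: start a nested RTL run
PDF = "\u202c"  # Pop Directional Formatting: end the embedding

def paste_rtl(document: str, old: str, rtl_word: str) -> str:
    """Hypothetical paste operation: replace `old` with `rtl_word`
    bracketed in embedding marks, so a bidi-aware renderer treats the
    pasted text as a nested RTL object."""
    return document.replace(old, RLE + rtl_word + PDF)

text = paste_rtl("1 dog, 2 cats, 3 zebras", "cats", "\u05d7\u05ea\u05d5\u05dc")
```

The editor inserts the marks once, at paste time, and neither it nor the renderer needs to reason about the directionality of the individual pasted characters.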

> > In the absence of mixed-direction scripts, it would be fairly
> > straightforward. The problem is that once code figures out where the
> > line breaks are, there's no nice way to then figure out what order to
> > display things in.
> >
> The Render String at this coordinate function should do that (provided
> you maintain the direction nesting context down to that string), at
> least if it is properly Unicode aware.

Two problems with that:

1. The proper rendering order for things on a line may depend upon the
content of preceding or succeeding lines.

2. Things like full justification, geometric distortion, etc. may
require rendering a line of text as a sequence of smaller portions.

Thus the need to subdivide text into portions that can be displayed in
context-free portions.

Richard Damon

unread,
Apr 20, 2018, 11:30:09 AM4/20/18
to
On 4/20/18 10:03 AM, supe...@casperkitty.com wrote:
> On Thursday, April 19, 2018 at 10:05:31 PM UTC-5, Richard Damon wrote:
>> Which sounds like you just want to use the Unicode direction override
>> code. Now we get to the original problem you proposed, the need to take
>> a current logical ordered text string and convert it into this strange
>> format, which again the question is WHY? (or maybe why do you have the
>> expectation that modern standards should provide easy ways to hack up
>> data into ancient formats).
>
> What is needed is an easy way of ending up with chunks that can be
> rendered in context-free fashion. There are many possible ways of
> representing text that would make it possible to subdivide it into
> such chunks. I'm wouldn't be especially attached to any particular
> way of representing such text, provided there was a standard form
> and it allowed for the extraction of chunks that be rendered
> separately.

Unfortunately, the goal that you state seems to be a unicorn. The
accepted grammar of LTR/RTL rendering needs some context (if only
whether that chunk is in an overall LTR or RTL context, and the strength
of that).
>
>>> Otherwise, for type layout purposes, the conversion to fixed-direction text
>>> would be done *after* all line breaks are processed.
>>
>> Yes, the output of a layout program may well be a character string with
>> a LTR direction override code to say that everything following is LTR
>> (even if the character would be RTL), and would include the line breaks,
>> not the strange mishmash you were first describing where each word has a
>> direction code.
>
> If the normalizing function can re-arrange the order of characters so as
> to allow uniform LTR display, that would be great. The mish-mash was to
> allow for the possibility that the designer of the normalizing spec would
> want to keep characters in order, in which case it would have to supply
> information to the application that would let the application do the
> rearranging. As I said, there are many ways of achieving the same goal
> which is to allow for context-free rendering.'

There IS an LTR override character, LRO (U+202D, Left-to-Right Override),
which indicates that the following text is LTR regardless of the
implicit direction implied by each character. This override applies until
some other direction command nests a new embedded direction, or until
the override is terminated with a PDF (U+202C, Pop Directional
Formatting) character. If you can assume that your text is within such
an override (i.e. assume the context), then you can do it, but that
isn't 'context-free'; that is assumed context.
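
As a concrete illustration, the override pair can simply be written into a string; this sketch forces three intrinsically RTL Hebrew letters to display in memory order:

```python
LRO = "\u202d"  # Left-to-Right Override: following text renders LTR
PDF = "\u202c"  # Pop Directional Formatting: terminates the override

# Bracket a run of Hebrew letters so a renderer shows them in the exact
# left-to-right order they occupy in memory, as an OCR import might want.
hebrew = "\u05d0\u05d1\u05d2"  # alef, bet, gimel
forced = LRO + hebrew + PDF
```
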

And eventually they realized that that couldn't be met 100%. Unicode, if
sliced at an arbitrary position, will generally not appear as a different
string, as no code point's encoding can be found embedded within the
encoding of another. By inspecting at the byte level, a program can
easily find the next or previous code point division, so it is fairly
easy to break a string at code point boundaries.
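
That byte-level scan is easy to write because every UTF-8 continuation byte carries the 10xxxxxx prefix; a small sketch:

```python
def prev_boundary(buf: bytes, i: int) -> int:
    """Back up from byte offset i to the start of the code point that
    contains it: every UTF-8 continuation byte matches 10xxxxxx
    (i.e. byte & 0xC0 == 0x80), so skip backwards past those."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "abc\u00e9\u20ac".encode("utf-8")  # 'é' takes 2 bytes, '€' takes 3
```

This self-synchronizing property is the same one asked about earlier in the thread regarding the leading "10" on intermediate bytes.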

This unfortunately does not allow a string to be broken at an arbitrary
code point break and then have the two parts be rendered independently:
first, there are combining codes which cause multiple code points to
be merged into a single glyph, so if a program wishes to break a string
without disturbing the glyph that will be generated, it needs to at
least know which code points are combining codes. This happened because,
when they tried to fully enumerate things, they found far too many
possible glyphs, so they needed to include combining codes.

There are also issues bigger than glyphs that deal with typography
issues (like the bidirectionality issues, as well as the HAN unification
issues) that require some state encoding in the string, making arbitrary
breaking of string no longer a totally possible goal. That was given up
before Unicode outgrew its 16 bit target (and perhaps trying to keep it
as a 16 bit target was one reason for some of this).

There is also the issue that many programs add state to text by
including typography like font selection, bold, italics, etc., which is
beyond Unicode (though it allows an application to define character
codes for this sort of thing) and which also may need processing to
handle the division of a string.

>
>> Thus your original string, assuming that string is in an LTR default
>> region, would be rendered as (if I am doing it right):
>>
>> testing one two three EVIF RUOF
>> XIS seven eight nine.
>>
>> Yes, in Unicode you need to define with direction marks you general text
>> direction and the extents of alternate direction quotes etc. Thus your
>> 'import' program that takes in properly formed Unicode text doesn't need
>> to 'guess', as the stream will define all that information.
>
> If text is fully marked up with direction changes, there would be no need
> to have a rendering engine try to infer direction based upon character
> ranges. If text doesn't contain marked up, then a rendering engine would
> be guessing about how to display things.

Text only needs to be marked with direction changes not implied by the
characters. The goal of the directionality implied in characters was to
make it so that much text could be properly rendered without the need
for the directionality codes to be embedded, and in general it does a
fairly good job at it (mixing Arabic numerals in RTL texts is one
weakness of it).
>
>> Yes, for your OCR example, as a first step, probably generates the text
>> with a forced LTR (or RTL) direction control, and maybe then do a pass
>> to improve the representation (actually GOOD OCR, likely does this at
>> the beginning as using context to help with character recognition can
>> help greatly with accuracy for normal text, as that processing would at
>> least identify direction of words.)
>
> For text which is set ragged-left or ragged-right, the choice of which
> margin was straight would usually indicate the primary text direction,
> but an OCR program would be guessing. Having an encoding which can
> indicate "the words appeared in this left-to-right order" would allow
> the information in the document to be reported according to what actually
> appears without the program having to guess.
>

As I said, the LRO character is exactly what is wanted here.

>>> When using a Unicode bidirection text display, replacing cats with Hebrew
>>> would lay out the line as:
>>>
>>> "1 dog, 2 3 ,STAC zebras, and four unicorns".
>>>
>>> Perhaps the post didn't show up that way for you when I used Unicode
>>> Hebrew characters? The 2 gets processed as LTR because the preceding
>>> text was LTR, while the 3 gets joined with the preceding RTL text and
>>> thus appears between the "2" and the word "STAC".
>>>
>>> How many readers would look at the above and figure out that there were
>>> two STAC and three zebras?
>>
>> The issue is that you didn't enter the data right. The issue is that
>> Arabic numbers (sort of like many punctuation characters) are just
>> weakly LTR so become RTL in the presence of RTL characters, so you need
>> to be aware of such things at data entry and sometimes you need to
>> provide overrides to correct things. When mixing languages you need to
>> follow the rules to get what you want, even more so when they are human
>> languages.
>
> If a text-entry program noticed that one was pasting an "RTL text" object
> into an LTR document and simply bracketed it with "embedded RTL" tags,
> then copying and pasting the Hebrew word "STAC" over "cats" would yield
> sensible behavior without the rendering engine having to know anything
> about the directionality of individual characters. Requiring that
> someone who doesn't normally work with bidirectional text but simply
> wants to embed one Hebrew word in a document must learn all about the
> complexities of bidirectional text seems rather less helpful.
>

The problem is that there are logically a couple of different very
reasonable options for what is wanted, and thus somewhat complicated
rules, and sometimes it doesn't work the way you want.

>>> In the absence of mixed-direction scripts, it would be fairly
>>> straightforward. The problem is that once code figures out where the
>>> line breaks are, there's no nice way to then figure out what order to
>>> display things in.
>>>
>> The Render String at this coordinate function should do that (provided
>> you maintain the direction nesting context down to that string), at
>> least if it is properly Unicode aware.
>
> Two problems with that:
>
> 1. The proper rendering order for things on a line may depend upon the
> content of preceding or succeeding lines.
>
> 2. Things like full justification, geometric distortion, etc. may
> require rendering a line of text as a sequence of smaller portions.
>
> Thus the need to subdivide text into portions that can be displayed in
> context-free portions.
>

As I said, you need to maintain the context: keep the list of
directionality control character nesting, removing those that have
been popped off. That handles item 1.

Item 2 says you don't have a good enough primitive, and thus yes, you do
need to understand the details. If the render string at this location
was designed for justified text, it would also take a width parameter
for how much space to use to render the text. If not, then you need to
understand how to process the text, and if you need to deal with
bidirectional text, it is more complicated. Actually, even without the
RTL issue, if you want to do the 'best' job, you need code to handle the
degenerate case of a single word on the line (because the next word is
long) and how to (and if) you want to stretch the word to justify it,
and possibly also do this for a line with very few words and a lot of
space to add.

Of course, if you want a simple version that comes closer to justifying
the text without needing to understand RTL details, you could just
substitute space characters of different widths (there are a number
defined in the range U+200x and elsewhere) into the text to roughly
even out the margin, measuring the string as you go.

supe...@casperkitty.com

unread,
Apr 20, 2018, 1:28:51 PM
to
On Friday, April 20, 2018 at 10:30:09 AM UTC-5, Richard Damon wrote:
> On 4/20/18 10:03 AM, supe...@casperkitty.com wrote:
> > What is needed is an easy way of ending up with chunks that can be
> > rendered in context-free fashion. There are many possible ways of
> > representing text that would make it possible to subdivide it into
> > such chunks. I wouldn't be especially attached to any particular
> > way of representing such text, provided there was a standard form
> > and it allowed for the extraction of chunks that can be rendered
> > separately.
>
> Unfortunately, the goal that you state seems to be a unicorn. The
> accepted grammar of LTR/RTL rendering needs some context (if only
> whether that chunk is in an overall LTR or RTL context, and the strength
> of that).

What kinds of information are needed depends upon what one will do with
the text. Sensible text editing would require understanding the nesting
of LTR and RTL sections, but having a set of markers which do not imply
semantic structure but are intended purely for rendering and get stripped
and regenerated when normalizing a string or portion thereof, would be
sufficient for rendering.

> >>> Otherwise, for type layout purposes, the conversion to fixed-direction text
> >>> would be done *after* all line breaks are processed.
> >>
> >> Yes, the output of a layout program may well be a character string with
> >> a LTR direction override code to say that everything following is LTR
> >> (even if the character would be RTL), and would include the line breaks,
> >> not the strange mishmash you were first describing where each word has a
> >> direction code.
> >
> > If the normalizing function can re-arrange the order of characters so as
> > to allow uniform LTR display, that would be great. The mish-mash was to
> > allow for the possibility that the designer of the normalizing spec would
> > want to keep characters in order, in which case it would have to supply
> > information to the application that would let the application do the
> > rearranging. As I said, there are many ways of achieving the same goal
> > which is to allow for context-free rendering.
>
> There IS a LTR Override character LRO (U+202D, Left-to-Right Override)
> which indicates that the following text is LTR regardless of the
> implicit direction implied by the character. This override applies until
> some other direction command nests a new embedded direction, or until
> that override is terminated with a PDF (U+202C, Pop Directional
> Formatting) character. If you can assume that your text is within such
> an override (i.e. assume the context), then you can do it, but that
> isn't 'context-free'; that is assumed context.

That would suggest that it would be possible to specify a one-way conversion
function that would take a text string that represents a single line, and
yield a text string that is wrapped in an LTR override, with text rearranged
to match the original. From what I can tell, though, actually implementing
such a thing would be rather difficult and complicated.

> > One of the stated design goals of Unicode encodings was to allow strings
> > to be sliced and diced while remaining semantically valid. While a
> > format that could accommodate that with bidirectional text would be
> > excessively bulky, having a normalized form which could accommodate that,
> > at the expense of extra bulk, would help Unicode better achieve that
> > stated design goal.
>
> And eventually they realized that that couldn't be met 100%. Unicode, if
> sliced at an arbitrary position will generally not appear as a different
> string, as no code point can be found embedded within the code point of
> another. By inspecting at the byte level, a program can easily find the
> next or previous code point division, so it is fairly easy to break a
> string at code point values.
>
> This unfortunately does not allow a string to be broken at an arbitrary
> code point break and then have the two parts be rendered independently,
> as first there are combining codes which cause multiple code points to
> be merged into a single glyph, so if a program wishes to break a string
> and not disturb the glyph that will be generated, it needs to at least
> know what code points are combining codes. This happened because when
> they got into trying to fully define things they found way too many
> possible glyphs so they needed to include combining codes.

If the design had used "start composite character" and "end composite
character" codes [which would be an excellent use for some of the otherwise
unused byte values in UTF-8], that would have made a lot of things much
easier.

> There are also issues bigger than glyphs that deal with typography
> issues (like the bidirectionality issues, as well as the HAN unification
> issues) that require some state encoding in the string, making arbitrary
> breaking of string no longer a totally possible goal. That was given up
> before Unicode outgrew its 16 bit target (and perhaps trying to keep it
> as a 16 bit target was one reason for some of this).
>
> There also is the issue that many programs add state to text by including
> typography like font selection, bold, italics, etc., which is beyond
> Unicode (but it allows an application to define character codes for this
> sort of thing) which also may need processing to handle the division of
> a string.

There really should be some standard way of recognizing at least the more
common typographical adjustments, especially given how widely they're used.

> > If text is fully marked up with direction changes, there would be no need
> > to have a rendering engine try to infer direction based upon character
> > ranges. If text doesn't contain markup, then a rendering engine would
> > be guessing about how to display things.
>
> Text only needs to be marked with direction changes not implied by the
> characters. The goal of the directionality implied in characters was to
> make it so that much text could be properly rendered without the need
> for the directionality codes to be embedded, and in general it does a
> fairly good job at it (mixing Arabic numerals in RTL texts is one
> weakness of it).

Can a program be really useful for editing bidirectional texts without the
ability to explicitly control direction? If such controls are needed, what
disadvantage is there to applying them in all cases involving mixed-
direction text?

> The problem is that there are logically a couple of different very
> reasonable options for what is wanted, and thus somewhat complicated
> rules, and sometimes it doesn't work the way you want.

So specify what is *actually there*, without regard for what it means.

To use an analogy, sometimes one will know that an event is at 7:00pm, but
not know the time zone, and other times one will know that the event is
at 14:00UTC but not know the time zone. If one records the former time
as 7:00pm local (unknown) and then discovers the time zone, one can know
what the accurate UTC time was. If one records the latter time as 14:00UTC
(local time unknown) and discovers the time zone, one can then determine
what the local time was. If one guesses at the time zone and simply
records either the local time or the UTC time, it may be impossible to know
what actual time is represented.

> Item 2 says you don't have a good enough primitive, and thus yes, you do
> need to understand the details. If the render string at this location
> was designed for justified text, it would also take a width parameter
> for how much space to use to render the text. If not, then you need to
> understand how to process the text, and if you need to deal with
> bidirectional text, it is more complicated. Actually, even without the
> RTL issue, if you want to do the 'best' job, you need code to handle the
> degenerate case of a single word on the line (because the next word is
> long) and how to (and if) you want to stretch the word to justify it,
> and possibly also do this for a line with very few words and a lot of
> space to add.
>
> Of course, if you want a simple version that comes closer to justifying
> the text without needing to understand RTL details, you could just
> substitute space characters of different widths (there are a number
> defined in the range U+200x and elsewhere) into the text to roughly
> even out the margin, measuring the string as you go.

The goal isn't necessarily to have absolutely brilliant text formatting, but
rather to have something whose wheels won't fall off when given mixed-
direction text. I suppose adding a LTR direction override and stripping
out anything that could override it might be the best way to achieve that,
but it seems like it should be possible to do better without an outrageous
amount of difficulty.

supe...@casperkitty.com

unread,
Apr 20, 2018, 3:20:25 PM
to
On Friday, April 20, 2018 at 12:28:51 PM UTC-5, supe...@casperkitty.com wrote:
> The goal isn't necessarily to have absolutely brilliant text formatting, but
> rather to have something whose wheels won't fall off when given mixed-
> direction text. I suppose adding a LTR direction override and stripping
> out anything that could override it might be the best way to achieve that,
> but it seems like it should be possible to do better without an outrageous
> amount of difficulty.

BTW, returning to the C language, one thing I remember wishing for back in
the day would have been for compilers to be capable of ignoring certain
byte sequences in the source text outside of string literals, without them
being regarded as whitespace or token separators. That would seem like it
might still be a useful thing to allow for things like LTR/RTL markers
which might be needed to make source code display sensibly.

Richard Damon

unread,
Apr 20, 2018, 4:36:45 PM
to
supercat, I'm going to stop responding here for a couple of reasons.

This really isn't comp.lang.c related.

From everything I read in your responses, you have no real
understanding of how the Unicode bidirectional system works, and you
don't really care, so it feels like I am trying to explain this to my
dog, who just wants me to play with him.

I also get the feeling you keep moving the goalposts; remember, your
original question was:

On 4/9/18 4:18 PM, supe...@casperkitty.com wrote:
> Meanwhile, from what I can tell, there is no standard concept of an
> API that can take a Unicode string and somehow indicate the order in
> which the characters should be placed when rendered. Without such a
> function, the amount of complexity needed for a text-formatting engine
> to properly display things like Hebrew may easily exceed the amount of
> complexity needed for everything else, combined.


