RfD: XCHAR wordset
Bruce McFarling  
Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Mon, 23 Jul 2007 06:23:26 -0700
Local: Mon, Jul 23 2007 9:23 am
Subject: Re: RfD: XCHAR wordset
On Jul 23, 7:38 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Hmm, thinking a little longer about it, all the string words use chars
> for their granularity, so the xchars words should use chars, too.  Not
> that it makes a difference in practice.

If XCHAR means eXtended CHAR, and an XCHAR in memory is always a
multiple (sometimes variable multiple) number of CHARs, then a char
size is feasible. It would seem that translating code that works with
UTF-8 so that it works with 16-bit chars and UTF-16 would not be
made any worse by having char granularity.

For the Atlantic economic space, utf-8 is fine ... for much of the
Atlantic economic space, Latin-1 is fine ... and since the main
difference that comes into play between utf-8 and the utf-16's is
storage space inflation, the ability to do on the fly translation
between I/O and the Forth implementation's internal encoding gets most
of the way there in the East Asian Economic Space. IOW, flexibility in
file source-encoding can mostly cover for a little rigidity in the
internal character set in use, for the situations where it is most
critical, which is where there is a need to access existing databases
and their character encoding is entrenched by its own set of existing
practices.

With on the fly translation and internal processing in utf-8, the
buffers in memory would be sized in bytes=chars, and how many xchars
you end up with would have to be determined on an ongoing basis ...
whether the original source is utf-8 or not.
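
A minimal sketch of that ongoing count, assuming 8-bit chars and
well-formed UTF-8. The word name utf8-length is made up for
illustration and is not part of the RfD:

```forth
\ Count the xchars in a UTF-8 buffer of u chars (assumes 8-bit chars).
\ Every byte that is not a continuation byte (binary 10xxxxxx)
\ starts a new xchar, so counting those bytes counts the xchars.
: utf8-length ( c-addr u -- n )
  0 rot rot                   \ keep the running count under the string
  over + swap ?do             \ loop over each byte address
    i c@ $C0 and $80 <> if 1+ then
  loop ;
```

The buffer size in chars is known up front; the xchar count only falls
out after a pass like this over the data.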


Bruce McFarling  
Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Mon, 23 Jul 2007 06:42:35 -0700
Local: Mon, Jul 23 2007 9:42 am
Subject: Re: RfD: XCHAR wordset
On Jul 16, 5:46 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> I think that no word for changing the internal encoding should be
> standardized.  Or if you standardize it, it should fail if the new
> internal encoding is not an extension of the old one (i.e.,
> ASCII->Latin-1 ok, ASCII->UTF-8 ok, but Latin-1->UTF-8 fails); since
> this is a one-way street, GET-ENCODING makes little sense.

And I think that this kind of setting should never be made a one-way
street, but I am happy with not having any "set internal encoding" at
all.

... IOW, "even where everything is fluid, we build boats so that we
have a place to stand".

> Otherwise a standard program could contain strings in different,
> incompatible encodings, some of them in system-controlled strings
> (e.g., word names), controlled by a global state variable.  This would
> be worse than STATE and BASE.  No need to introduce another such
> mistake.
> The primary method should work through OPEN-FILE and CREATE-FILE
> (e.g., by specifying the encoding in the fam).  But yes, a word like
> SET-FILE-ENCODING is useful when the program learns about the encoding
> later (e.g., when the encoding is specified at the start of the file).

As I've just noted, SET-FILE-ENCODING is the real standardization hook
for not being boxed in by the implementation defined encoding ... as
long as we can talk to file and other i/o in the encoding it uses,
then it makes much less difference what encoding we use internally.

For proponents of utf-8 uber alles, SET-FILE-ENCODING is important for
reading a file that uses an ASCII header that contains code-page
information ... with essentially all relevant code pages lying within
the UCS-2 character set, a 512 byte table (256 entries of 16 bits each)
gives the information required to translate any 8-bit code page to
utf-8 on the fly.

And of course, for reading well-formed files in one of the utf-16s,
it's critical, since you would open it in your default utf-16 encoding,
and if the first character reads as FFFE (a byte-swapped FEFF mark),
would reset the file encoding to the other utf-16 encoding.

GET-FILE-ENCODING / SET-FILE-ENCODING, obviously, have none of the
"trap door" implication of GET-INTERNAL-ENCODING / SET-INTERNAL-
ENCODING.


Bernd Paysan  
Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Mon, 23 Jul 2007 16:59:15 +0200
Local: Mon, Jul 23 2007 10:59 am
Subject: Re: RfD: XCHAR wordset

Bruce McFarling wrote:
> And of course, for reading well-formed files in one of the utf-16s,
> it's critical, since you would open it in your default utf-16 encoding,
> and if the first character reads as FFFE (a byte-swapped FEFF mark),
> would reset the file encoding to the other utf-16 encoding.

Actually, you can use the UTF-16 start mark of a file to jump out of UTF-8
encoding, as well, since both FF FE and FE FF are illegal UTF-8 characters.
So it is possible to open a text file and autodetect the three different
widespread UTF encodings with the first two bytes.
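
A rough sketch of that two-byte check. The word name and the three
result codes are invented for this example; it assumes the caller has
already read the first two bytes of the file, and it ignores the
separate three-byte UTF-8 mark (EF BB BF):

```forth
\ Result codes, made up for this sketch.
0 constant enc-utf8           \ no UTF-16 mark seen, treat as UTF-8
1 constant enc-utf16be        \ bytes FE FF
2 constant enc-utf16le        \ bytes FF FE

\ Classify the first two bytes of a text file, in file order.
: bom>encoding ( b1 b2 -- enc )
  swap 8 lshift or            \ combine as b1*256 + b2
  dup $FEFF = if drop enc-utf16be exit then
  dup $FFFE = if drop enc-utf16le exit then
  drop enc-utf8 ;
```

Anything that is not one of the two marks falls through to UTF-8, which
also covers plain ASCII.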

For converting other encodings than latin-1 to Unicode, there's not much
hope to make that easy (latin-1 is the first code-page, so conversion is
straightforward). koi8-r and the cyrillic code page are quite different
(isn't there a collating sequence in Russian? The koi8-r page looks like
there is, the same as the latin ABC/greek alpha-beta-gamma one, with the
extra letters appended behind, but Unicode seemed to ignore that). gb2312,
big5, and the CJK Unicode code pages are very different, as well. Note that
there *is* a collating sequence for Chinese characters, as well (otherwise,
using a dictionary would be hell - with the collating sequence, it's only
heck ;-).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/


Bruce McFarling  
Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Mon, 23 Jul 2007 08:52:15 -0700
Local: Mon, Jul 23 2007 11:52 am
Subject: Re: RfD: XCHAR wordset
On Jul 23, 10:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:

> For converting other encodings than latin-1 to Unicode, there's not much
> hope to make that easy

In what sense? Use the character as an index into a table of UCS-2
characters for that code page, convert that character to utf-8, and
you are done. A "code page base" value, with 0 turning off code page
to utf-8 conversion, would be sufficient for triggering translation in
that direction.
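
A sketch of that table-driven direction, under some loud assumptions:
all word names here are made up, the table uses one cell per entry
rather than packed 16-bit entries (standard Forth has no portable
16-bit fetch), chars are 8 bits, and only code points below $10000 are
handled. The table is filled with the identity mapping and one entry
is patched (code $80 to U+20AC, as in Windows-1252) purely for
illustration:

```forth
\ Store xc (below $10000) as UTF-8 at c-addr; return the next free address.
: utf8!+ ( xc c-addr -- c-addr' )
  over $80 u< if  tuck c! char+ exit  then          \ 1 byte
  over $800 u< if                                   \ 2 bytes
    over 6 rshift $C0 or over c! char+
    swap $3F and $80 or over c! char+ exit
  then
  over 12 rshift $E0 or over c! char+               \ 3 bytes
  over 6 rshift $3F and $80 or over c! char+
  swap $3F and $80 or over c! char+ ;

create cp-table  256 cells allot   \ one code point per 8-bit code

: init-table ( -- )  256 0 do  i i cells cp-table + !  loop ;
init-table
$20AC $80 cells cp-table + !       \ example entry only

\ Look up an 8-bit code in the table.
: cp>xc ( char -- xc )  cells cp-table + @ ;

\ Translate one code-page character and append it as UTF-8.
: cp>utf8 ( char c-addr -- c-addr' )  swap cp>xc swap utf8!+ ;
```

Swapping in a different code page then means swapping the table, not
the code.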

It's converting Unicode to other code page encodings that is more
cumbersome. I have no idea whether a search in a table, a b-tree, or
some hash based approach is most efficient in general ... and of
course, there may be different answers for different priorities on
space and speed.


Bruce McFarling  
Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Mon, 23 Jul 2007 09:00:54 -0700
Local: Mon, Jul 23 2007 12:00 pm
Subject: Re: RfD: XCHAR wordset
On Jul 17, 5:13 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:

> There are some serious issues with UTF-16 and UTF-32, e.g. endianness.
> Basically, a file or a string can start with a "silent" endianness-switching
> character. This requires a global state of endianness, which is awful when
> you are dealing with more than one string at the same time (and then also
> makes it mandatory for every string to start with the endian marker).

If this is externalized to SET-FILE-ENCODING and GET-FILE-ENCODING,
then there is no need for a mutable global endianness state ... each
file can have its own endianness state. Indeed, even a utf-16 system
could have a fixed endianness and cope with files in the other
endianness in this way.
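
To make the "no global state" point concrete, here is a small sketch in
which the byte order is just a flag passed in by whoever owns the file.
The word name is invented, and nothing here implements
SET-FILE-ENCODING itself:

```forth
\ Combine two bytes, in file order, into one 16-bit code unit.
\ big-endian? is a per-file flag supplied by the caller, not a global.
: bytes>unit ( b1 b2 big-endian? -- u )
  if swap then                \ now ( low high )
  8 lshift or ;
```

Two files of opposite byte order can be read side by side, since each
call carries its own flag.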

Under this, the most natural way to expose the implementation encoding
available would be as an environment query.


Bruce McFarling  
Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Mon, 23 Jul 2007 09:19:03 -0700
Local: Mon, Jul 23 2007 12:19 pm
Subject: Re: RfD: XCHAR wordset
On Jul 20, 7:33 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:

> The three different character sets certainly are not future best practice.
> It's not a problem to have a different I/O encoding (which is handled by
> translation from internal to IO), but otherwise, a single encoding should
> be used.

This would seem to contradict the remark upthread:

> >> Small embedded systems with their own graphic IO will quite likely
> >> use a single 8 bit character set; if you do telnet to a small embedded
> >> system, even UTF-8 is fine (the embedded system doesn't need to
> >> know).

Best practice with no dead weight of an established code base and no
external network, and best practice for a specific situation, can
easily be two different things. DCS, OCS, and ACS is a very useful
framework for analyzing what best practice is in a particular
situation, even when the result in many cases is DCS=OCS=ACS.

A small embedded device that has its own 8-bit character set, and
which can report to a larger system what that character set is, is one
scenario that can involve OCS=ACS for the small embedded device, but
OCS<>ACS and DCS<>ACS for the larger system on the other end of the
wire.

And porting code from one system that uses one OCS to another system
that uses another OCS is a scenario that can readily involve DCS<>OCS
on one end, the other, or both.


Bernd Paysan  
Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Mon, 23 Jul 2007 18:10:22 +0200
Local: Mon, Jul 23 2007 12:10 pm
Subject: Re: RfD: XCHAR wordset

Bruce McFarling wrote:
> On Jul 23, 10:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>> For converting other encodings than latin-1 to Unicode, there's not much
>> hope to make that easy

> In what sense? Use the character as an index into a table of UCS-2
> characters for that code page, convert that character to utf-8, and
> you are done.

In the sense of a memory-saving way to do it. Yes, you can always have a
full table (or in case of gb2312 perhaps a compressed table), and use that.

> It's converting Unicode to other code page encodings that is more
> cumbersome. I have no idea whether a search in a table, a b-tree, or
> some hash based approach is most efficient in general ... and of
> course, there may be different answers for different priorities on
> space and speed.

Indeed. It might be possible that different code pages have
different "optimal" approaches. Converting UTF-8 to ISO-Latin-1 doesn't
need a table, it's simple:

: utf8>latin1 ( xc -- xc )  dup $FF > IF  drop [char] ?  THEN ;

Cyrillic might need a small table for the cyrillic page itself. Chinese
needs a large table (and then, a table is quite ok, because it's
sufficiently densely populated).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/


Bernd Paysan  
Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Mon, 23 Jul 2007 18:19:42 +0200
Local: Mon, Jul 23 2007 12:19 pm
Subject: Re: RfD: XCHAR wordset

Bruce McFarling wrote:
> Under this, the most natural way to expose the implementation encoding
> available would be as an environment query.

Yes, I thought about that, as well. I've modified the proposal with the
discussion results as we go on, and removing SET-ENCODING and GET-ENCODING
is part of it. XCHAR-ENCODING is now an environment query, which returns a
string like "UTF-8". There must be some other standard that I can refer to
for unambiguous names (e.g. MIME or HTTP RFCs). I suggest starting from
here, and use the preferred MIME name (if there is one, otherwise the name
itself) as unique identifier:

http://www.iana.org/assignments/character-sets
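
A sketch of how a program might use such a query, assuming it returns
the name as an address/length pair as described above. The word
.xchar-encoding and the fallback text are made up for this example:

```forth
\ Print the encoding name the system reports, if it reports one.
: .xchar-encoding ( -- )
  s" XCHAR-ENCODING" environment? if
    type                      \ query known: ( c-addr u ) is the name
  else
    ." (encoding not reported)"
  then ;
```

A system that does not know the query simply returns false from
ENVIRONMENT?, so the program can fall back to its own default.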

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/


Bruce McFarling  
Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Mon, 23 Jul 2007 11:21:43 -0700
Local: Mon, Jul 23 2007 2:21 pm
Subject: Re: RfD: XCHAR wordset
On Jul 23, 12:19 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:

> Yes, I thought about that, as well. I've modified the proposal with the
> discussion results as we go on, and removing SET-ENCODING and GET-ENCODING
> is part of it. XCHAR-ENCODING is now an environment query, which returns a
> string like "UTF-8". There must be some other standard that I can refer to
> for unambiguous names (e.g. MIME or HTTP RFCs). I suggest starting from
> here, and use the preferred MIME name (if there is one, otherwise the name
> itself) as unique identifier:

MIME is good. MIME encodings do not distinguish big endian and little
endian in standards that are defined in 16-bit or 31-bit integer
spaces, like UCS-2, utf-16 or utf-32. This is not a real issue for the
internal character encoding ... in the possibly rare case, such as an
8-bit CHAR Forth with UTF-16 XCHARS, it could be specified explicitly
that the internal encoding has the same endianness as the
implementation itself, if only to forestall pointless semantic
quibbling.

However, for SET-FILE-ENCODING, a one-to-one correspondence between
MIME encoding names and encoding tokens would entail some provision
for endianness ... the simplest would be to specify little endian where
not specified in the MIME standard, and provide a word to convert an
encoding token to big endian (or its exact mirror image).


John Doty  
Newsgroups: comp.lang.forth
From: John Doty <j...@whispertel.LoseTheH.net>
Date: Mon, 23 Jul 2007 12:56:54 -0600
Local: Mon, Jul 23 2007 2:56 pm
Subject: Re: RfD: XCHAR wordset

Bernd Paysan wrote:
> For converting other encodings than latin-1 to Unicode, there's not much
> hope to make that easy (latin-1 is the first code-page, so conversion is
> straightforward). koi8-r and the cyrillic code page are quite different
> (isn't there a collating sequence in Russian? The koi8-r page looks like
> there is, the same as the latin ABC/greek alpha-beta-gamma one, with the
> extra letters appended behind, but Unicode seemed to ignore that). gb2312,
> big5, and the CJK Unicode code pages are very different, as well. Note that
> there *is* a collating sequence for Chinese characters, as well (otherwise,
> using a dictionary would be hell - with the collating sequence, it's only
> heck ;-).

Is there really a collating sequence for Chinese characters? Japanese
certainly doesn't have one for the Chinese characters (kanji) it uses.
Every dictionary maker uses its own order.

To look up an unfamiliar word in my paper dictionaries, the procedure is:

1. Decide where the word starts. This is nontrivial, since there are no
spaces between words and there is no clear distinction between "word"
and "common phrase". Also, common paper dictionaries aren't good for
technical terms. So it's easy to get lost.

2. Look up the first kanji in a kanji dictionary. This needs a whole set
of skills including counting strokes, recognizing which radical it will
be indexed under, distinguishing between similar radicals, and
recognizing changes in style. And sometimes you'll encounter one that
isn't in your dictionary.

3. Guess which pronunciation the kanji has in this word. In Japanese,
pronunciation of kanji shifts with context.

4. Look up the word phonetically in a Japanese-English dictionary. At
least phonetic dictionary order is *almost* standardized.

5. Interpret and iterate as needed.

Not really hell: a lot more fun and useful than a crossword puzzle. But
it does eat up valuable time.

I recall the first time I encountered the word "読み込み". A compound of
gerund forms of two common verbs, but a standard dictionary is not very
helpful ("読み" might mean "insight"). Took me awhile to understand it
means "operand fetch".

Fortunately, these days computers help a lot here: they can index kanji
in multiple ways and match words multiple ways. Digital dictionaries are
much better about listing technical terms than any paper dictionary I've
found. I can now sit in the back of the lecture hall with my laptop and
look up unfamiliar words from the slides during a talk.

But for LSE64 I duck all these encoding issues by using mbtowc() and
friends. That keeps LSE64 consistent with other software in my world,
although basically I stick with ASCII anyway. And I have much better
ways to spend my time than inventing another approach here.

To see how this paints over a very serious mess, check out:

http://examples.oreilly.com/cjkvinfo/doc/cjk.inf

It doesn't seem to me from reading this that there is any common
standard collating sequence for Chinese characters. Various ones, some
partially correlated, but still different in detail. Even tables can't
really work all the time: the distinction between character identity and
style is blurry.

--
John Doty, Noqsi Aerospace, Ltd.
http://www.noqsi.com/
--
Specialization is for robots.


Bernd Paysan  
Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Mon, 23 Jul 2007 22:55:02 +0200
Local: Mon, Jul 23 2007 4:55 pm
Subject: Re: RfD: XCHAR wordset

John Doty wrote:
> Is there really a collating sequence for Chinese characters?

Yes, actually, there are several (but for all practical purposes,
dictionaries use a single one for simplified Chinese, and all the others
are for traditional Chinese only). The main problem is how many glyphs you
include in your collating sequence, and debates about how to write a
particular glyph (see the revisions of the GB tables, where some glyphs
were moved around for using a simplified radical).

> Japanese
> certainly doesn't have one for the Chinese characters (kanji) it uses.
> Every dictionary maker uses its own order.

That's mostly past in China; maybe you find other sorting orders in Taiwan.

There's a second sorting order, that's by pinyin. You always need both
sorting orders in a dictionary, since you either read a glyph (then you go
through the radical/stroke order), or you hear it, then you go through
pinyin. Dictionaries often use pinyin as their primary sorting order, and
the glyph order as secondary, with an indirection (table driven).

> To look up an unfamiliar word in my paper dictionaries, the procedure is:

> 1. Decide where the word starts. This is nontrivial, since there are no
> spaces between words and there is no clear distinction between "word"
> and "common phrase". Also, common paper dictionaries aren't good for
> technical terms. So it's easy to get lost.

This one is much easier in Chinese, because of its completely different
grammar. People sometimes have problems where sentences start (that's why
they use punctuation marks now), but words are dead easy.

> 2. Look up the first kanji in a kanji dictionary. This needs a whole set
> of skills including counting strokes, recognizing which radical it will
> be indexed under, distinguishing between similar radicals, and
> recognizing changes in style.

Sounds remarkably similar to the Chinese system, apart from the "each
dictionary maker uses his own order". You need to be sufficiently skilled
in the art of calligraphy to know how a glyph is written.

> And sometimes you'll encounter one that
> isn't in your dictionary.

Yes, that happens. Despite all the efforts to standardize the Chinese
written language over the past 2200 years, the number of glyphs around still
seems to be unbounded. Usually, when you find such a glyph, asking some
native speaker also won't help - they don't know more than the dictionary.

> 3. Guess which pronunciation the kanji has in this word. In Japanese,
> pronunciation of kanji shifts with context.

Fortunately, pronunciation only shifts with dialect in Chinese.

> 4. Look up the word phonetically in a Japanese-English dictionary. At
> least phonetic dictionary order is *almost* standardized.

> 5. Interpret and iterate as needed.

> Not really hell: a lot more fun and useful than a crossword puzzle. But
> it does eat up valuable time.

Indeed.

> I recall the first time I encountered the word "読み込み". A compound of
> gerund forms of two common verbs, but a standard dictionary is not very
> helpful ("読み" might mean "insight"). Took me awhile to understand it
> means "operand fetch".

From my Chinese knowledge, it seems to be easier to decipher. The first sign
("du") means "read" (and it's not part of my dictionary, since the usual
way to say "read" is "see book" (kan shu), so I found it with gucharmap),
and all the rest is Japanese. The usual way to decipher Japanese from
Chinese is to discard all the Japanese stuff, and guess from the ambiguous
meaning the remaining words have.

> To see how this paints over a very serious mess, check out:

> http://examples.oreilly.com/cjkvinfo/doc/cjk.inf

> It doesn't seem to me from reading this that there is any common
> standard collating sequence for Chinese characters. Various ones, some
> partially correlated, but still different in detail. Even tables can't
> really work all the time: the distinction between character identity and
> style is blurry.

Yes, it's not easy. One particular problem is that the glyph space is so
large, and if you combine them all, you end up with a lot more glyphs than
you expect. Unicode is a very typical example: Here's a set of codepages
with lots of CJK glyphs. And here's another one, discontiguous with the
previous, containing glyphs we forgot last time. And oops, we made a
mistake, this glyph really should be written like that, and that means it
ends up somewhere completely different in the sorting order ;-).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/