On Jul 23, 7:38 am, an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Hmm, thinking a little longer about it, all the string words use chars
> for their granularity, so the xchars words should use chars, too. Not
> that it makes a difference in practice.
If XCHAR means eXtended CHAR, and an XCHAR in memory always occupies a whole number of CHARs (sometimes a variable number), then char granularity is feasible. It would seem that translating code that works with UTF-8 so that it works with 16-bit chars and UTF-16 would not be made any worse by having char granularity.
For the Atlantic economic space, utf-8 is fine ... for much of the Atlantic economic space, Latin-1 is fine ... and since the main difference that comes into play between utf-8 and the utf-16s is storage space inflation, the ability to do on-the-fly translation between I/O and the Forth implementation gets most of the way there in the East Asian economic space. IOW, flexibility in file source-encoding can mostly cover for a little rigidity in the internal character set in use, for the situations where it is most critical, which is where there is a need to access existing databases and their character encoding is entrenched by its own set of existing practices.
With on the fly translation and internal processing in utf-8, the buffers in memory would be sized in bytes=chars, and how many xchars you end up with would have to be determined on an ongoing basis ... whether the original source is utf-8 or not.
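As a minimal sketch of that ongoing count, assuming 8-bit chars (one char per address unit) and well-formed utf-8 in the buffer: every xchar contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the number of xchars. The name utf8-count is made up for illustration and is not part of the xchars proposal.

```forth
\ Sketch only: assumes 1 CHARS = 1 address unit and well-formed UTF-8.
\ utf8-count is a hypothetical name, not a word from the proposal.
: utf8-count ( c-addr u -- n )
  OVER + SWAP            \ ( end start )
  0 ROT ROT              \ ( 0 end start )  running count below the loop limits
  ?DO
    I C@ $C0 AND $80 <>  \ true unless this byte is a continuation byte
    IF 1+ THEN
  LOOP ;
```

The buffer size stays in bytes=chars; the xchar count is derived only when it is actually needed.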
On Jul 16, 5:46 am, an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> I think that no word for changing the internal encoding should be
> standardized. Or if you standardize it, it should fail if the new
> internal encoding is not an extension of the old one (i.e.,
> ASCII->Latin-1 ok, ASCII->UTF-8 ok, but Latin-1->UTF-8 fails); since
> this is a one-way street, GET-ENCODING makes little sense.
And I think that this kind of setting should never be made a one-way street, but I am happy with not having any "set internal encoding" at all.
... IOW, "even where everything is fluid, we build boats so that we have a place to stand".
> Otherwise a standard program could contain strings in different,
> incompatible encodings, some of them in system-controlled strings
> (e.g., word names), controlled by a global state variable. This would
> be worse than STATE and BASE. No need to introduce another such
> mistake.
>
> The primary method should work through OPEN-FILE and CREATE-FILE
> (e.g., by specifying the encoding in the fam). But yes, a word like
> SET-FILE-ENCODING is useful when the program learns about the encoding
> later (e.g., when the encoding is specified at the start of the file).
As I've just noted, SET-FILE-ENCODING is the real standardization hook for not being boxed in by the implementation defined encoding ... as long as we can talk to file and other i/o in the encoding it uses, then it makes much less difference what encoding we use internally.
For proponents of utf-8 uber alles, SET-FILE-ENCODING is important for reading a file that uses an ASCII header that contains code-page information ... with essentially all relevant code pages lying within the 16-bit UCS-2 character set, a 512-byte table (256 entries of 16 bits each) gives the information required to translate any code page to utf-8 on the fly.
And of course, for reading well-formed files in one of the utf-16s, it's critical, since you would open it in your default utf-16 encoding, and if the first character is FEFF, would reset the file encoding to the other utf-16 encoding.
GET-FILE-ENCODING / SET-FILE-ENCODING, obviously, have none of the "trap door" implication of GET-INTERNAL-ENCODING / SET-INTERNAL-ENCODING.
Bruce McFarling wrote:
> And of course, for reading well-formed files in one of the utf-16s,
> it's critical, since you would open it in your default utf-16 encoding,
> and if the first character is FEFF, would reset the file encoding to
> the other utf-16 encoding.
Actually, you can use the UTF-16 start mark of a file to jump out of UTF-8 encoding, as well, since neither FF FE nor FE FF can occur in valid UTF-8 (the bytes FE and FF never appear in it). So it is possible to open a text file and autodetect the three different widespread UTF encodings with the first two bytes.
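A sketch of that two-byte check, assuming 8-bit chars and that the caller has already read the start of the file into a buffer. The ENC- constants and the word detect-utf are hypothetical names; the thread does not define encoding tokens. A file without a UTF-16 mark is simply treated as utf-8 here.

```forth
\ Hypothetical encoding tokens, for illustration only.
0 CONSTANT ENC-UTF-8
1 CONSTANT ENC-UTF-16LE
2 CONSTANT ENC-UTF-16BE

\ c-addr u describes the first bytes read from the file.
: detect-utf ( c-addr u -- enc )
  2 < IF DROP ENC-UTF-8 EXIT THEN       \ too short to hold a mark
  DUP C@ SWAP CHAR+ C@                  \ ( b0 b1 )
  OVER $FF = OVER $FE = AND             \ FF FE: little-endian mark
  IF 2DROP ENC-UTF-16LE EXIT THEN
  SWAP $FE = SWAP $FF = AND             \ FE FF: big-endian mark
  IF ENC-UTF-16BE EXIT THEN
  ENC-UTF-8 ;
```

The result could then be handed to something like SET-FILE-ENCODING before any further reading is done.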
For converting other encodings than latin-1 to Unicode, there's not much hope to make that easy (latin-1 is the first code-page, so conversion is straightforward). koi8-r and the cyrillic code page are quite different (isn't there a collating sequence in Russian? The koi8-r page looks like there is, the same as the latin ABC/greek alpha-beta-gamma one, with the extra letters appended behind, but Unicode seemed to ignore that). gb2312, big5, and the CJK Unicode code pages are very different, as well. Note that there *is* a collating sequence for Chinese characters, as well (otherwise, using a dictionary would be hell - with the collating sequence, it's only heck ;-).
On Jul 23, 10:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> For converting other encodings than latin-1 to Unicode, there's not much
> hope to make that easy
In what sense? Use the character as an index into a table of UCS-2 characters for that code page, convert that character to utf-8, and you are done. A "code page base" value, with 0 turning off code page to utf-8 conversion, would be sufficient for triggering translation in that direction.
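A rough sketch of that lookup-then-encode step, assuming 8-bit chars. For simplicity the table uses one cell per entry rather than packed 16-bit values, and it is filled with the identity mapping, which is the correct table for Latin-1; a real code page would load its own 256 values. All names here (cp-table, cp>ucs, ucs2>utf8, cp>utf8) are made up for illustration.

```forth
\ Hypothetical table: code page byte -> 16-bit code point, one cell per entry.
CREATE cp-table 256 CELLS ALLOT
: fill-latin1 ( -- ) 256 0 DO I I CELLS cp-table + ! LOOP ;
fill-latin1

: cp>ucs ( c -- u ) CELLS cp-table + @ ;

\ Store code point u (0..$FFFF) as UTF-8 at c-addr; return the next free address.
: ucs2>utf8 ( u c-addr -- c-addr' )
  OVER $80 U< IF TUCK C! CHAR+ EXIT THEN          \ 1 byte: 0xxxxxxx
  OVER $800 U< IF                                 \ 2 bytes: 110xxxxx 10xxxxxx
    OVER 6 RSHIFT $C0 OR OVER C! CHAR+
    SWAP $3F AND $80 OR OVER C! CHAR+ EXIT
  THEN
  OVER 12 RSHIFT $E0 OR OVER C! CHAR+             \ 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  OVER 6 RSHIFT $3F AND $80 OR OVER C! CHAR+
  SWAP $3F AND $80 OR OVER C! CHAR+ ;

\ Translate one code page byte c and append its UTF-8 form at c-addr.
: cp>utf8 ( c c-addr -- c-addr' ) SWAP cp>ucs SWAP ucs2>utf8 ;
```

Surrogate values are not rejected here; a table built from a real code page would not contain them.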
It's converting Unicode to other code page encodings that is more cumbersome. I have no idea whether a search in a table, a b-tree, or some hash-based approach is most efficient in general ... and of course, there may be different answers for different priorities on space and speed.
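The simplest of those options, a linear search of the same kind of forward table, might look like the sketch below. It makes no claim about being the most efficient choice; it just shows the baseline that a b-tree or hash would be competing against. The names page-table and ucs>cp are hypothetical, and the table is filled with the identity (Latin-1) mapping as a stand-in for a real code page.

```forth
\ Hypothetical forward table: code page byte -> code point, one cell per entry.
CREATE page-table 256 CELLS ALLOT
: fill-identity ( -- ) 256 0 DO I I CELLS page-table + ! LOOP ;
fill-identity

\ Linear search: return the code page byte for code point u,
\ or the character ? if u is not in the page.
: ucs>cp ( u -- c )
  256 0 DO
    DUP I CELLS page-table + @ =
    IF DROP I UNLOOP EXIT THEN
  LOOP
  DROP [CHAR] ? ;
```

This costs up to 256 comparisons per character but needs no storage beyond the forward table.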
On Jul 17, 5:13 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> There are some serious issues with UTF-16 and UTF-32, e.g. endianness.
> Basically, a file or a string can start with a "silent" endianness-switching
> character. This requires a global state of endianness, which is awful when
> you are dealing with more than one string at the same time (and then also
> makes it mandatory for every string to start with the endian marker).
If this is externalized to SET-FILE-ENCODING and GET-FILE-ENCODING, then there is no need for a mutable global endianness state ... each file can have its own endianness state. Indeed, even a utf-16 system could have a fixed endianness and cope with files in the other endianness in this way.
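To illustrate, here is a sketch of fetching one 16-bit code unit where the byte order is a parameter that travels with the file rather than a global variable. It assumes 8-bit chars; the word utf16@ and the idea of keeping the flag in a per-file record are illustrative assumptions, not something the proposal specifies.

```forth
\ Fetch one 16-bit code unit from two bytes at c-addr.
\ big-endian? is a per-file flag supplied by the caller, not global state.
: utf16@ ( c-addr big-endian? -- u )
  >R DUP C@ SWAP CHAR+ C@     \ ( b0 b1 )
  R> IF SWAP THEN             \ ( lo hi )
  8 LSHIFT OR ;
```

Two files with different byte orders can then be read side by side, each call passing the flag that belongs to its own file.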
Under this, the most natural way to expose the implementation encoding available would be as an environment query.
On Jul 20, 7:33 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> The three different character sets certainly are not future best practice.
> It's not a problem to have a different I/O encoding (which is handled by
> translation from internal to IO), but otherwise, a single encoding should
> be used.
This would seem to contradict the remark upthread:
> >> Small embedded systems with their own graphic IO will quite likely
> >> use a single 8 bit character set; if you do telnet to a small embedded
> >> system, even UTF-8 is fine (the embedded system doesn't need to
> >> know).
Best practice with no dead weight of an established code base and no external network, and best practice for a specific situation, can easily be two different things. DCS, OCS, and ACS is a very useful framework for analyzing what best practice is in a particular situation, even when the result in many cases is DCS=OCS=ACS.
A small embedded device that has its own 8-bit character set, and can report to a larger system what that character set is, is one scenario that can involve OCS=ACS for the small embedded device, but OCS<>ACS and DCS<>ACS for the larger system on the other end of the wire.
And porting code from one system that uses one OCS to another system that uses another OCS is a scenario that can readily involve DCS<>OCS on one end, the other, or both.
Bruce McFarling wrote:
> On Jul 23, 10:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>> For converting other encodings than latin-1 to Unicode, there's not much
>> hope to make that easy
> In what sense? Use the character as an index into a table of UCS-2
> characters for that code page, convert that character to utf-8, and
> you are done.
In the sense of a memory-saving way to do it. Yes, you can always have a full table (or in case of gb2312 perhaps a compressed table), and use that.
> It's converting Unicode to other code page encodings that is more
> cumbersome. I have no idea whether a search in a table, a b-tree, or
> some hash-based approach is most efficient in general ... and of
> course, there may be different answers for different priorities on
> space and speed.
Indeed. Different code pages may well have different "optimal" approaches. Converting UTF-8 to ISO-Latin-1 doesn't need a table; it's simple:
: utf8>latin1 ( xc -- xc ) dup $FF > IF drop [char] ? THEN ;
Cyrillic might need a small table for the cyrillic page itself. Chinese needs a large table (and then, a table is quite ok, because it's sufficiently densely populated).
Bruce McFarling wrote:
> Under this, the most natural way to expose the implementation encoding
> available would be as an environment query.
Yes, I thought about that, as well. I've modified the proposal with the discussion results as we go on, and removing SET-ENCODING and GET-ENCODING is part of it. XCHAR-ENCODING is now an environment query, which returns a string like "UTF-8". There must be some other standard that I can refer to for unambiguous names (e.g. MIME or HTTP RFCs). I suggest starting from here, and using the preferred MIME name (if there is one, otherwise the name itself) as unique identifier:
On Jul 23, 12:19 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Yes, I thought about that, as well. I've modified the proposal with the
> discussion results as we go on, and removing SET-ENCODING and GET-ENCODING
> is part of it. XCHAR-ENCODING is now an environment query, which returns a
> string like "UTF-8". There must be some other standard where I can refer to
> for unambiguous names (e.g. MIME or HTTP RFCs). I'll suggest to start from
> here, and use the preferred MIME name (if there is one, otherwise the name
> itself) as unique identifier:
MIME is good. MIME encodings do not distinguish big endian and little endian in standards that are defined in 16-bit or 31-bit integer spaces, like UCS-2, utf-16 or utf-32. This is not a real issue for the internal character encoding ... in the possibly rare case, such as an 8-bit CHAR Forth with UTF-16 XCHARS, it could be specified explicitly that the internal encoding has the same endianness as the implementation itself, if only to forestall pointless semantic quibbling.
However, for SET-FILE-ENCODING, a one-to-one correspondence between MIME encoding names and encoding tokens would entail some provision for endianness ... the simplest would be to specify little endian where not specified in the MIME standard, and provide a word to convert an encoding token to big endian (or its exact mirror image).
Bernd Paysan wrote:
> For converting other encodings than latin-1 to Unicode, there's not much
> hope to make that easy (latin-1 is the first code-page, so conversion is
> straight forward). koi8-r and the cyrillic code page are quite different
> (isn't there a collating sequence in Russian? The koi8-r page looks like
> there is, the same as the latin ABC/greek alpha-beta-gamma one, with the
> extra letters appended behind, but Unicode seemed to ignore that). gb2312,
> big5, and the CJK Unicode code pages are very different, as well. Note that
> there *is* a collating sequence for Chinese characters, as well (otherwise,
> using a dictionary would be hell - with the collating sequence, it's only
> heck ;-).
Is there really a collating sequence for Chinese characters? Japanese certainly doesn't have one for the Chinese characters (kanji) it uses. Every dictionary maker uses its own order.
To look up an unfamiliar word in my paper dictionaries, the procedure is:
1. Decide where the word starts. This is nontrivial, since there are no spaces between words and there is no clear distinction between "word" and "common phrase". Also, common paper dictionaries aren't good for technical terms. So it's easy to get lost.
2. Look up the first kanji in a kanji dictionary. This needs a whole set of skills including counting strokes, recognizing which radical it will be indexed under, distinguishing between similar radicals, and recognizing changes in style. And sometimes you'll encounter one that isn't in your dictionary.
3. Guess which pronunciation the kanji has in this word. In Japanese, pronunciation of kanji shifts with context.
4. Look up the word phonetically in a Japanese-English dictionary. At least phonetic dictionary order is *almost* standardized.
5. Interpret and iterate as needed.
Not really hell: a lot more fun and useful than a crossword puzzle. But it does eat up valuable time.
I recall the first time I encountered the word "読み込み". A compound of gerund forms of two common verbs, but a standard dictionary is not very helpful ("読み" might mean "insight"). Took me a while to understand it means "operand fetch".
Fortunately, these days computers help a lot here: they can index kanji in multiple ways and match words multiple ways. Digital dictionaries are much better about listing technical terms than any paper dictionary I've found. I can now sit in the back of the lecture hall with my laptop and look up unfamiliar words from the slides during a talk.
But for LSE64 I duck all these encoding issues by using mbtowc() and friends. That keeps LSE64 consistent with other software in my world, although basically I stick with ASCII anyway. And I have much better ways to spend my time than inventing another approach here.
To see how this paints over a very serious mess, check out:
It doesn't seem to me from reading this that there is any common standard collating sequence for Chinese characters. Various ones, some partially correlated, but still different in detail. Even tables can't really work all the time: the distinction between character identity and style is blurry.
-- John Doty, Noqsi Aerospace, Ltd. http://www.noqsi.com/ -- Specialization is for robots.
John Doty wrote:
> Is there really a collating sequence for Chinese characters?
Yes, actually, there are several (but for all practical purposes, dictionaries use a single one for simplified Chinese, and all the others are for traditional Chinese only). The main problem is how many glyphs you include in your collating sequence, and debates about how to write a particular glyph (see the revisions of the GB tables, where some glyphs were moved around for using a simplified radical).
> Japanese
> certainly doesn't have one for the Chinese characters (kanji) it uses.
> Every dictionary maker uses its own order.
That's mostly in the past in China; you may still find other sorting orders in Taiwan.
There's a second sorting order, that's by pinyin. You always need both sorting orders in a dictionary, since you either read a glyph (then you go through the radical/stroke order), or you hear it, then you go through pinyin. Dictionaries often use pinyin as their primary sorting order, and the glyph order as secondary, with an indirection (table driven).
> To look up an unfamiliar word in my paper dictionaries, the procedure is:
> 1. Decide where the word starts. This is nontrivial, since there are no
> spaces between words and there is no clear distinction between "word"
> and "common phrase". Also, common paper dictionaries aren't good for
> technical terms. So it's easy to get lost.
This one is much easier in Chinese, because of its completely different grammar. People sometimes have problems telling where sentences start (that's why they use punctuation marks now), but words are dead easy.
> 2. Look up the first kanji in a kanji dictionary. This needs a whole set
> of skills including counting strokes, recognizing which radical it will
> be indexed under, distinguishing between similar radicals, and
> recognizing changes in style.
Sounds remarkably similar to the Chinese system, apart from the "each dictionary maker uses his own order". You need to be sufficiently skilled in the art of calligraphy to know how a glyph is written.
> And sometimes you'll encounter one that
> isn't in your dictionary.
Yes, that happens. Despite all the efforts to standardize the Chinese written language over the past 2200 years, the number of glyphs around still seems to be unbounded. Usually, when you find such a glyph, asking some native speaker also won't help - they don't know more than the dictionary.
> 3. Guess which pronunciation the kanji has in this word. In Japanese,
> pronunciation of kanji shifts with context.
Fortunately, pronunciation only shifts with dialect in Chinese.
> 4. Look up the word phonetically in a Japanese-English dictionary. At
> least phonetic dictionary order is *almost* standardized.
> 5. Interpret and iterate as needed.
> Not really hell: a lot more fun and useful than a crossword puzzle. But
> it does eat up valuable time.
Indeed.
> I recall the first time I encountered the word "読み込み". A compound of
> gerund forms of two common verbs, but a standard dictionary is not very
> helpful ("読み" might mean "insight"). Took me awhile to understand it
> means "operand fetch".
From my Chinese knowledge, it seems to be easier to decipher. The first sign ("du") means "read" (and it's not part of my dictionary, since the usual way to say "read" is "see book" (kan shu), so I found it with gucharmap), and all the rest is Japanese. The usual way to decipher Japanese from Chinese is to discard all the Japanese stuff, and guess from the ambiguous meaning the remaining words have.
> To see how this paints over a very serious mess, check out:
> It doesn't seem to me from reading this that there is any common
> standard collating sequence for Chinese characters. Various ones, some
> partially correlated, but still different in detail. Even tables can't
> really work all the time: the distinction between character identity and
> style is blurry.
Yes, it's not easy. One particular problem is that the glyph space is so large, and if you combine them all, you end up with a lot more glyphs than you expect. Unicode is a very typical example: Here's a set of codepages with lots of CJK glyphs. And here's another one, discontiguous with the previous, containing glyphs we forgot last time. And oops, we made a mistake, this glyph really should be written like that, and that means it ends up somewhere completely different in the sorting order ;-).