Message-ID: <p6kj03-ni8.ln1@vimes.paysan.nom>
From: Bernd Paysan <bernd.pay...@gmx.de>
Subject: RfD: XCHAR wordset (for UTF-8 and alike)
Newsgroups: comp.lang.forth
Date: Mon, 26 Sep 2005 00:16:25 +0200

Problem:

ASCII is only appropriate for the English language. Most western languages,
however, fit somewhat into the Forth frame, since a byte is sufficient to
encode the few special characters in each (though not always with the same
encoding; Latin-1 is the most widely used). For other languages, different
character sets have to be used, several of them variable-width. The most
prominent representative is UTF-8. Let's call these extended characters
XCHARs. Since ANS Forth specifies ASCII encoding, only ASCII-compatible
encodings may be used.

Proposal:

Datatypes:

xc      is an extended char on the stack. It occupies one cell, and its
        values are a subset of the unsigned cell range. Note: UTF-8 cannot
        encode more than 31 bits; on 16-bit systems, only the 16-bit subset
        (UCS-2) of the character set can be used.
xc_addr is the address of an XCHAR in memory. Alignment requirements are
        the same as for c_addr. The memory representation of an XCHAR differs
        from the stack representation, and depends on the encoding used. An
        XCHAR may use a variable number of address units in memory.
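
To illustrate the difference between the stack value and the memory
representation, here is a small sketch in Python (used only as a modelling
language, it is not part of the proposal). The helper name memory_bytes is
made up for this example, and it relies on Python's built-in codecs, so it
only covers code points up to U+10FFFF.

```python
def memory_bytes(xc, encoding):
    """Return the bytes an XCHAR with stack value xc occupies in memory.

    Hypothetical helper for illustration only; it delegates to Python's
    codecs instead of implementing the encoding by hand.
    """
    return chr(xc).encode(encoding)


xc = 0xE4                               # a-umlaut, one cell on the stack
latin1 = memory_bytes(xc, "latin-1")    # one address unit:    E4
utf8 = memory_bytes(xc, "utf-8")        # two address units:   C3 A4
euro = memory_bytes(0x20AC, "utf-8")    # three address units: E2 82 AC
```

The same xc thus has different sizes in memory depending on the encoding,
while it always occupies exactly one cell on the stack.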

Common encodings:

Input and files are commonly encoded as either ISO Latin-1 or UTF-8. The
encoding depends on settings of the computer system, such as the LANG
environment variable on Unix. You can use the system consistently only if
you don't change the encoding, or if you use only the ASCII subset.
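
As a rough sketch of how a system might look at LANG, the following Python
fragment extracts the codeset part of a POSIX locale name. The function name
codeset_from_lang is an assumption made for this example; how a Forth system
maps the result to its XCHAR behaviour is not specified by this proposal.

```python
def codeset_from_lang(lang):
    """Extract the codeset of a POSIX locale name, or None if absent.

    POSIX locale names have the form
    language[_territory][.codeset][@modifier].
    """
    name = lang.split("@", 1)[0]      # drop an optional @modifier
    if "." not in name:
        return None                   # no codeset given, e.g. "C" or "de_DE"
    return name.split(".", 1)[1]
```

In practice the argument would come from the environment, for example
os.environ.get("LANG", "").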

Words:

XC-SIZE ( xc -- u )
Computes the memory size of the XCHAR xc in address units.
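
For UTF-8, the size follows directly from the magnitude of xc. The sketch
below models this in Python, assuming the original 31-bit UTF-8 scheme with
sequences of up to six bytes; the names xc_size and _LIMITS are chosen for
this example only.

```python
# Smallest value that no longer fits in 1, 2, ... 6 bytes of UTF-8.
_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def xc_size(xc):
    """Model of XC-SIZE for UTF-8: bytes needed to store xc in memory."""
    if xc < 0:
        raise ValueError("xc must be unsigned")
    for size, limit in enumerate(_LIMITS, start=1):
        if xc < limit:
            return size
    raise ValueError("xc does not fit into 31 bits")
```

For a fixed-width encoding such as Latin-1, the same word would simply
return 1 for every valid xc.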

XC@+ ( xc_addr1 -- xc_addr2 xc )
Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.
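
A possible model of XC@+ for UTF-8, written in Python with memory
represented as a bytes object and addresses as indices into it. This is a
sketch under those assumptions, not the Gforth or bigFORTH code; the name
xc_fetch_plus and the error handling are made up for illustration.

```python
def xc_fetch_plus(mem, addr):
    """Model of XC@+ for UTF-8: return (addr2, xc)."""
    lead = mem[addr]
    addr += 1
    if lead < 0x80:                      # ASCII: a single address unit
        return addr, lead
    # The number of leading 1 bits in the lead byte is the sequence length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6:
        raise ValueError("malformed lead byte")
    xc = lead & (mask - 1)               # payload bits of the lead byte
    for _ in range(length - 1):
        cont = mem[addr]
        if (cont & 0xC0) != 0x80:        # continuation bytes are 10xxxxxx
            raise ValueError("malformed continuation byte")
        xc = (xc << 6) | (cont & 0x3F)
        addr += 1
    return addr, xc
```

Calling it repeatedly with the returned address walks through a string one
XCHAR at a time, much like COUNT does for single-byte characters.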

XC!+ ( xc xc_addr1 -- xc_addr2 )
Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.
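
The storing direction can be sketched the same way. The Python model below
assumes UTF-8 with up to six bytes per character and uses a bytearray as
memory; xc_store_plus is a name chosen for this example, not an existing
implementation.

```python
def xc_store_plus(xc, mem, addr):
    """Model of XC!+ for UTF-8: store xc at addr, return addr2."""
    if xc < 0x80:                        # ASCII: stored as is
        mem[addr] = xc
        return addr + 1
    # Two bytes carry 11 bits; every further byte adds 5 more bits.
    length = 2
    limit = 0x800
    while xc >= limit:
        length += 1
        limit <<= 5
    if length > 6:
        raise ValueError("xc does not fit into 31 bits")
    # Fill in the continuation bytes from the end, 6 bits at a time.
    for i in range(addr + length - 1, addr, -1):
        mem[i] = 0x80 | (xc & 0x3F)
        xc >>= 6
    # The lead byte starts with `length` one bits followed by a zero bit.
    mem[addr] = ((0xFF << (8 - length)) & 0xFF) | xc
    return addr + length
```

The caller has to make sure the buffer has room for the character, for
example by checking XC-SIZE first.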

XCHAR+ ( xc_addr1 -- xc_addr2 )
Adds the size of the XCHAR stored at xc_addr1 to this address, giving
xc_addr2.

XCHAR- ( xc_addr1 -- xc_addr2 )
Steps backward from xc_addr1 to the start of the preceding XCHAR, i.e. to the
address xc_addr2 for which the size of the XCHAR stored there, added to
xc_addr2, gives xc_addr1. Note: XCHAR- isn't guaranteed to work for every
possible encoding.
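
In UTF-8 both directions are cheap, because every continuation byte has the
form 10xxxxxx and can therefore be told apart from the start of a character.
Encodings without such a marker are the reason for the caveat on XCHAR-. A
Python sketch, with memory modelled as a bytes object and the function names
invented for this example:

```python
def xchar_plus(mem, addr):
    """Model of XCHAR+ for UTF-8: address just past the XCHAR at addr."""
    lead = mem[addr]
    size = 1
    if lead >= 0xC0:                     # lead byte of a multi-byte sequence
        mask = 0x40
        while lead & mask:               # each further 1 bit adds one byte
            size += 1
            mask >>= 1
    return addr + size


def xchar_minus(mem, addr):
    """Model of XCHAR- for UTF-8: start of the XCHAR that ends at addr."""
    addr -= 1
    while addr > 0 and (mem[addr] & 0xC0) == 0x80:
        addr -= 1                        # skip continuation bytes backward
    return addr
```

The lower bound check in xchar_minus is a convenience of the model; a Forth
word has no knowledge of where the buffer starts.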

X-SIZE ( xc_addr u -- n )
n is the number of monospace ASCII characters that take the same space to
display as the XCHAR string starting at xc_addr and occupying u address units.
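
One way to approximate this for UTF-8 text is to use the Unicode character
database. The Python sketch below counts wide and fullwidth East Asian
characters as two columns and combining marks as zero; this is only an
approximation of what a real terminal does, and x_size here is a model made
up for illustration.

```python
import unicodedata


def x_size(mem, addr, u):
    """Approximate model of X-SIZE for a UTF-8 string of u bytes at addr."""
    text = bytes(mem[addr:addr + u]).decode("utf-8")
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                     # combining marks take no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                   # wide and fullwidth characters
        else:
            width += 1
    return width
```

Control characters are counted as one column here, which a real
implementation would probably treat differently.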

XKEY ( -- xc )
Reads an XCHAR from the terminal.

XEMIT ( xc -- )
Prints an XCHAR on the terminal.
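
With a variable-width encoding, XKEY may have to consume several bytes from
the terminal before it can return, and XEMIT may send several. A Python
sketch with the terminal modelled as binary streams follows; the names are
invented for this example, and the model is limited to UTF-8 as accepted by
Python's codec (up to four bytes per character).

```python
import io


def xkey(term_in):
    """Model of XKEY: read one UTF-8 encoded XCHAR from a byte stream."""
    lead = term_in.read(1)
    if not lead:
        raise EOFError("no input available")
    first = lead[0]
    if first < 0x80:
        extra = 0                        # ASCII
    elif first >= 0xF0:
        extra = 3
    elif first >= 0xE0:
        extra = 2
    else:
        extra = 1
    # Malformed input makes decode() raise UnicodeDecodeError.
    return ord((lead + term_in.read(extra)).decode("utf-8"))


def xemit(xc, term_out):
    """Model of XEMIT: write xc to a byte stream as UTF-8."""
    term_out.write(chr(xc).encode("utf-8"))
```

For testing, io.BytesIO objects can stand in for the terminal on both sides.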

The following words behave differently when the XCHAR extension is present:

CHAR ( "<spaces>name" -- xc )
Skip leading space delimiters.  Parse name delimited by a space.  Put the
value of its first XCHAR onto the stack.

[CHAR]
        Interpretation: Interpretation semantics for this word are undefined.
        Compilation:    ( "<spaces>name" -- )
Skip leading space delimiters.  Parse name delimited by a space.  Append the
run-time semantics given below to the current definition.
        Run-time:       ( -- xc )
Place xc, the value of the first XCHAR of name, on the stack.
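
The parsing part shared by CHAR and [CHAR] can be modelled as follows. This
Python sketch assumes a UTF-8 input buffer; since the space byte never occurs
inside a multi-byte UTF-8 sequence, the name can still be delimited bytewise.
The function name char_word and the returned position are assumptions of the
model.

```python
def char_word(source, pos=0):
    """Model of CHAR: return (xc, new_pos) for the next name in source."""
    while pos < len(source) and source[pos] == 0x20:
        pos += 1                         # skip leading space delimiters
    end = source.find(b" ", pos)         # name is delimited by a space
    if end < 0:
        end = len(source)
    name = source[pos:end].decode("utf-8")
    if not name:
        raise ValueError("no name found")
    return ord(name[0]), end             # value of the first XCHAR of name
```

Without the extension, CHAR would return only the first byte of name, which
for a multi-byte character is just its lead byte.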

Reference implementation:

Unfortunately, both the Gforth and the bigFORTH implementations have several
system-specific parts.

Experience:

Built into Gforth (development version) and recent versions of bigFORTH.
Open issues are file reading and writing (convert on the fly, or leave the
data as it is?).

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/