Wide character implementation
Messages 1 - 25 of 160

Thomas Bushnell, BSG  
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: tb+use...@becket.net (Thomas Bushnell, BSG)
Date: 18 Mar 2002 21:08:15 -0800
Local: Tues, Mar 19 2002 12:08 am
Subject: Wide character implementation

If one uses tagged pointers, then it's easy to implement fixnums as
ASCII characters efficiently.

But suppose one wants to have the character datatype be 32-bit Unicode
characters?  Or worse yet, 35-bit Unicode characters?

At the same time, most characters in the system will of course not be
wide.  What are the sane implementation strategies for this?


 
Frode Vatvedt Fjeld
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Frode Vatvedt Fjeld <fro...@acm.org>
Date: Tue, 19 Mar 2002 10:08:59 +0100
Local: Tues, Mar 19 2002 4:08 am
Subject: Re: Wide character implementation
tb+use...@becket.net (Thomas Bushnell, BSG) writes:

> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.

Hm.. perhaps you mean it's easy to implement characters as immediate
values?

> But suppose one wants to have the character datatype be 32-bit
> Unicode characters?  Or worse yet, 35-bit Unicode characters?

> At the same time, most characters in the system will of course not
> be wide.  What are the sane implementation strategies for this?

I suppose you would assign "most characters in the system" to a sub-type
of the wide characters, and implement that sub-type as immediates.

--
Frode Vatvedt Fjeld


 
Pierpaolo BERNARDI
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: "Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com>
Date: Tue, 19 Mar 2002 10:22:05 GMT
Local: Tues, Mar 19 2002 5:22 am
Subject: Re: Wide character implementation

"Thomas Bushnell, BSG" <tb+use...@becket.net> wrote in message
news:87wuw92lhc.fsf@becket.becket.net...

> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.

> But suppose one wants to have the character datatype be 32-bit Unicode
> characters?  Or worse yet, 35-bit Unicode characters?

21 bits are enough for Unicode.

P.


 
Erik Naggum
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Erik Naggum <e...@naggum.net>
Date: Tue, 19 Mar 2002 10:53:48 GMT
Local: Tues, Mar 19 2002 5:53 am
Subject: Re: Wide character implementation
* Thomas Bushnell, BSG
| If one uses tagged pointers, then it's easy to implement fixnums as
| ASCII characters efficiently.

  Huh?  No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters?  Or worse yet, 35-bit Unicode characters?

  Unicode is a 31-bit character set.  The base multilingual plane is 16
  bits wide, and then there is the possibility of 20 bits encoded in two
  16-bit values, each carrying a value from 0 to 1023, effectively (+ (expt
  2 20) (- (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding
  scheme, but one does not have to understand the lo- and hi-word codes that
  make up the 20-bit character space.  In effect, you need 16 bits.
  Therefore, you could represent characters with the following bit pattern,
  with b for bits and c for code.  Fonts are a mistake, so they are removed.

000000ccccccccccccccccccccc00110

  This is useful when the fixnum type tag is either 000 for even fixnums
  or 100 for odd fixnums, effectively 00 for fixnums.  This makes
  char-code and code-char a single shift operation.  Of course, char-bits
  and char-font are not supported in this scheme, but if you _really_ have
  to, the upper 4 bits may be used for char-bits.
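
A minimal Python sketch of one reading of the layout above: 21 code bits
sitting over a five-bit tag 00110, with fixnums carrying the two-bit tag 00.
The function and constant names are made up for illustration, and treating
00110 as the whole character tag is an assumption drawn from the bit
pattern.  Note that in this sketch code-char needs an OR after the shift to
restore the tag bits.

```python
# Sketch of a 32-bit immediate character word: 000000 + 21 code bits + 00110.
# All names here are hypothetical; the tag split is an assumption.

CHAR_TAG = 0b00110          # low five bits of a character word
CHAR_SHIFT = 5              # the code bits start above the tag
FIXNUM_SHIFT = 2            # fixnums carry the two-bit tag 00
CODE_MASK = (1 << 21) - 1   # 21 code bits


def make_char(code: int) -> int:
    """Build the tagged character word for a code point."""
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code does not fit in 21 bits")
    return (code << CHAR_SHIFT) | CHAR_TAG


def char_code(char_word: int) -> int:
    """char-code: one right shift turns a character word into a fixnum word.

    Shifting by 3 drops the '110' part of the tag and leaves '00',
    which is the fixnum tag.
    """
    return char_word >> (CHAR_SHIFT - FIXNUM_SHIFT)


def code_char(fixnum_word: int) -> int:
    """code-char: shift left by 3, then set the remaining tag bits."""
    return (fixnum_word << (CHAR_SHIFT - FIXNUM_SHIFT)) | CHAR_TAG


def fixnum_value(fixnum_word: int) -> int:
    """Untag a fixnum word to get the plain integer."""
    return fixnum_word >> FIXNUM_SHIFT


word = make_char(ord("A"))
print(format(word, "032b"))            # the tagged character word
print(fixnum_value(char_code(word)))   # the code point, via a fixnum word
```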

| At the same time, most characters in the system will of course not be
| wide.  What are the sane implementation strategies for this?

  I would (again) recommend actually reading the specification.  The
  character type can handle everything, but base-char could handle the
  8-bit things that reasonable people use.  The normal string type has
  character elements while base-string has base-char elements.  It would
  seem fairly reasonable to implement a *read-default-string-type* that
  would take string or base-string as value if you choose to implement both
  string types.

///
--
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.


 
Janis Dzerins
Newsgroups: comp.lang.lisp, comp.lang.scheme
Followup-To: comp.lang.lisp
From: Janis Dzerins <jo...@latnet.lv>
Date: 19 Mar 2002 13:31:52 +0200
Local: Tues, Mar 19 2002 6:31 am
Subject: Re: Wide character implementation

"Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com> writes:
> "Thomas Bushnell, BSG" <tb+use...@becket.net> wrote in message
> news:87wuw92lhc.fsf@becket.becket.net...

> > If one uses tagged pointers, then it's easy to implement fixnums as
> > ASCII characters efficiently.

> > But suppose one wants to have the character datatype be 32-bit Unicode
> > characters?  Or worse yet, 35-bit Unicode characters?

> 21 bits are enough for Unicode.

What "Unicode"?

--
Janis Dzerins

  Eat shit -- billions of flies can't be wrong.


 
Pierpaolo BERNARDI
Newsgroups: comp.lang.lisp
From: "Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com>
Date: Tue, 19 Mar 2002 14:51:38 GMT
Local: Tues, Mar 19 2002 9:51 am
Subject: Re: Wide character implementation

"Janis Dzerins" <jo...@latnet.lv> wrote in message
news:87d6y0ztcn.fsf@asaka.latnet.lv...

> "Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com> writes:

> > "Thomas Bushnell, BSG" <tb+use...@becket.net> wrote in message
> > news:87wuw92lhc.fsf@becket.becket.net...

> > > If one uses tagged pointers, then it's easy to implement fixnums as
> > > ASCII characters efficiently.

> > > But suppose one wants to have the character datatype be 32-bit Unicode
> > > characters?  Or worse yet, 35-bit Unicode characters?

> > 21 bits are enough for Unicode.

> What "Unicode"?

The character encoding standard defined by the Unicode Consortium Inc.,
Are there other Unicodes?

P.


 
Sander Vesik
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Sander Vesik <san...@haldjas.folklore.ee>
Date: Tue, 19 Mar 2002 16:22:30 +0000 (UTC)
Local: Tues, Mar 19 2002 11:22 am
Subject: Re: Wide character implementation
In comp.lang.scheme Thomas Bushnell, BSG <tb+use...@becket.net> wrote:

> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.

> But suppose one wants to have the character datatype be 32-bit Unicode
> characters?  Or worse yet, 35-bit Unicode characters?

They use either UTF-8 or UTF-16 - you cannot rely on whatever size
you pick to be suitably long forever; Unicode is sort of inherently
variable-length (characters even have two possible representations
in many cases, ä and similar 8-)

> At the same time, most characters in the system will of course not be
> wide.  What are the sane implementation strategies for this?

Implement them as variable-length strings using, say, UTF-8. Also, saying that
most characters will not be wide may well be a wrong assumption 8-)

--
        Sander

+++ Out of cheese error +++


 
Sander Vesik
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Sander Vesik <san...@haldjas.folklore.ee>
Date: Tue, 19 Mar 2002 16:27:04 +0000 (UTC)
Local: Tues, Mar 19 2002 11:27 am
Subject: Re: Wide character implementation
In comp.lang.scheme Erik Naggum <e...@naggum.net> wrote:

I don't think this is true any more; as of Unicode 3.1, AFAIK, 16 bits is
no longer enough.

[snip - this doesn't sound like scheme]

--
        Sander

+++ Out of cheese error +++


 
Ben Goetter
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Ben Goetter <goet...@mazama.net.xyz>
Date: 19 Mar 2002 16:46:41 GMT
Local: Tues, Mar 19 2002 11:46 am
Subject: Re: Wide character implementation
Quoth Pierpaolo BERNARDI:

> "Thomas Bushnell, BSG" <tb+use...@becket.net> wrote
> > But suppose one wants to have the character datatype be 32-bit Unicode
> > characters?  Or worse yet, 35-bit Unicode characters?

> 21 bits are enough for Unicode.

And ISO 10646, per working group resolution.

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2175.htm
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2225.doc


 
lin8080
Newsgroups: comp.lang.lisp
From: lin8080 <lin8...@freenet.de>
Date: Tue, 19 Mar 2002 19:45:18 +0100
Local: Tues, Mar 19 2002 1:45 pm
Subject: Re: Wide character implementation
Janis Dzerins wrote:

> "Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com> writes:
> > "Thomas Bushnell, BSG" <tb+use...@becket.net> wrote in message
> > news:87wuw92lhc.fsf@becket.becket.net...
> > 21 bits are enough for Unicode.

> What "Unicode"?

Try:

http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html

http://www.cl.cam.ac.uk/~mgk25/unicode.html

stefan


 
Thomas Bushnell, BSG
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: tb+use...@becket.net (Thomas Bushnell, BSG)
Date: 19 Mar 2002 14:33:34 -0800
Local: Tues, Mar 19 2002 5:33 pm
Subject: Re: Wide character implementation

"Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com> writes:
> 21 bits are enough for Unicode.

Um, Unicode version 3.1.1 has the following as the largest character:

E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;

Now the Unicode space isn't sparse, but I don't think compressing the
space is the most efficient strategy.


 
Erik Naggum
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.net>
Date: Tue, 19 Mar 2002 23:15:12 GMT
Local: Tues, Mar 19 2002 6:15 pm
Subject: Re: Wide character implementation
* Janis Dzerins <jo...@latnet.lv>
| What "Unicode"?

  unicode.org

///
--
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.


 
Erik Naggum
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Erik Naggum <e...@naggum.net>
Date: Tue, 19 Mar 2002 23:18:22 GMT
Local: Tues, Mar 19 2002 6:18 pm
Subject: Re: Wide character implementation
* Sander Vesik <san...@haldjas.folklore.ee>
| I don't think this is true any more; as of Unicode 3.1, AFAIK, 16 bits is
| no longer enough.

  Please pay attention and actually make an effort to read what you respond
  to, will you?  You should also be able to count the number of c bits and
  arrive at a number greater than 16 if you do not get lost on the way.

  Sheesh, some people.

///
--
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.


 
Erik Naggum
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Erik Naggum <e...@naggum.net>
Date: Tue, 19 Mar 2002 23:22:39 GMT
Local: Tues, Mar 19 2002 6:22 pm
Subject: Re: Wide character implementation
* Sander Vesik <san...@haldjas.folklore.ee>
| They use either UTF-8 or UTF-16 - you cannot rely on whatever size
| you pick to be suitably long forever; Unicode is sort of inherently
| variable-length (characters even have two possible representations
| in many cases, ä and similar 8-)

  Variable-length characters?  What the hell are you talking about?  UTF-8
  is a variable-length _encoding_ of characters that most certainly are
  intended to require a fixed number of bits.  That is, unless you think
  the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
  the symbol ± takes up 8.  Then you have variable-length characters.  Few
  people consider this a meaningful way of talking about variable length.
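
The distinction can be checked directly.  The short Python sketch below
(helper names are hypothetical) prints, for a few characters, how many
significant bits the code point has and how many octets its UTF-8 encoding
takes.  The first number is the "6, 7, 8 bits" reading mocked above; the
second is a property of the encoding, not of the character.

```python
# Code points versus their UTF-8 encodings.
# The two helper names are made up for this illustration.

def significant_bits(ch: str) -> int:
    """Number of bits needed to write the code point without leading zeros."""
    return ord(ch).bit_length()


def utf8_octets(ch: str) -> int:
    """Number of octets the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Digit 3, letter A, plus-minus sign, euro sign, and U+E007F (CANCEL TAG).
for ch in ("3", "A", "\u00b1", "\u20ac", "\U000E007F"):
    print(f"U+{ord(ch):04X}: {significant_bits(ch)} significant bits, "
          f"{utf8_octets(ch)} UTF-8 octet(s)")
```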

| Implement them as variable-length strings using, say, UTF-8. Also, saying
| that most characters will not be wide may well be a wrong assumption 8-)

  Real programming languages work with real character objects, not just
  UTF-8-encoded strings in memory.

  Acquire clue, _then_ post, OK?

///
--
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.


 
Tim Moore
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: tmo...@sea-tmoore-l.dotcast.com (Tim Moore)
Date: 19 Mar 2002 23:32:19 GMT
Local: Tues, Mar 19 2002 6:32 pm
Subject: Re: Wide character implementation
On 19 Mar 2002 14:33:34 -0800, Thomas Bushnell, BSG <tb+use...@becket.net>
wrote:
>"Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com> writes:

>> 21 bits are enough for Unicode.

>Um, Unicode version 3.1.1 has the following as the largest character:

>E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;

>Now the Unicode space isn't sparse, but I don't think compressing the
>space is the most efficient strategy.

Um, what's your point? E007f fits in 20 bits.  If you're thinking
that's all that's needed, there are private use areas (E000..F8FF,
F0000..FFFFD, and 100000..10FFFD) that need to be encoded too.  So 21
bits looks right.
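
The arithmetic is easy to verify.  A small Python check follows; the
constant and function names are illustrative only, and the values are the
ones quoted in this thread.

```python
# Bit widths of the code points mentioned in the thread.

def bits_needed(code: int) -> int:
    """Smallest number of bits that can hold the given code point."""
    return max(1, code.bit_length())


HIGHEST_IN_3_1_1 = 0xE007F   # CANCEL TAG, quoted above
LAST_PRIVATE_USE = 0x10FFFD  # end of the last private use range

# The count given earlier in the thread:
# (+ (expt 2 20) (- (expt 2 16) 1024 1024))
CODES_WITH_SURROGATES = 2 ** 20 + (2 ** 16 - 1024 - 1024)

print(bits_needed(HIGHEST_IN_3_1_1))
print(bits_needed(LAST_PRIVATE_USE))
print(CODES_WITH_SURROGATES)
```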

Tim


 
Thomas Bushnell, BSG
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: tb+use...@becket.net (Thomas Bushnell, BSG)
Date: 19 Mar 2002 15:46:51 -0800
Local: Tues, Mar 19 2002 6:46 pm
Subject: Re: Wide character implementation

tmo...@sea-tmoore-l.dotcast.com (Tim Moore) writes:
> Um, what's your point? E007f fits in 20 bits.  If you're thinking
> that's all that's needed, there are private use areas (E000..F8FF,
> F0000..FFFFD, and 100000..10FFFD) that need to be encoded too.  So 21
> bits looks right.

Oh what an embarrassing brain fart, yes that's quite right.  I don't
know what I was counting, but my head was clearly on backwards.

 
David Rush
Newsgroups: comp.lang.lisp
From: David Rush <k...@bellsouth.net>
Date: 20 Mar 2002 08:42:52 +0000
Local: Wed, Mar 20 2002 3:42 am
Subject: Re: Wide character implementation

Erik Naggum <e...@naggum.net> writes:
> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF-8 or UTF-16 - you cannot rely on whatever size
> | you pick to be suitably long forever; Unicode is sort of inherently
> | variable-length (characters even have two possible representations
> | in many cases, ä and similar 8-)

>   Variable-length characters?  What the hell are you talking about?  UTF-8
>   is a variable-length _encoding_ of characters that most certainly are
>   intended to require a fixed number of bits.  That is, unless you think
>   the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
>   the symbol ± takes up 8.  Then you have variable-length characters.  Few
>   people consider this a meaningful way of talking about variable length.

Erik, this is beneath you. Surely you know that Octet != Character.

>   Acquire clue, _then_ post, OK?

In context, rather pathetic, this seems...

david rush
--
The important thing is victory, not persistence.
        -- the Silicon Valley Tarot


 
Pekka P. Pirinen
Newsgroups: comp.lang.lisp
From: Pekka.P.Piri...@globalgraphics.com (Pekka P. Pirinen)
Date: 20 Mar 2002 16:20:00 +0000
Subject: Re: Wide character implementation
[comp.lang.lisp only]

Erik Naggum <e...@naggum.net> writes:
> * Thomas Bushnell, BSG
> | At the same time, most characters in the system will of course not be
> | wide.  What are the sane implementation strategies for this?

>   [...] The normal string type has character elements while
>   base-string has base-char elements.  It would seem fairly
>   reasonable to implement a *read-default-string-type* that would
>   take string or base-string as value if you choose to implement
>   both string types.

Yes, that's basically it.  

In actual fact, Liquid and Lispworks have
*DEFAULT-CHARACTER-ELEMENT-TYPE* for various functions taking an
:ELEMENT-TYPE argument, and other similar needs.  See
<http://www.xanalys.com/software_tools/reference/lwl42/LWRM-U/html/lwr...>.
Although the doc doesn't say it (there's a lot of unpublished doc on
fat characters), LW:*DEFAULT-CHARACTER-ELEMENT-TYPE* also controls
what kind of strings the reader constructs from the "" syntax.
However, if characters of larger types are seen by the string reader,
a string that can hold these characters is constructed without
complaint.
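
As a rough illustration of that reader behaviour (not of LispWorks
internals), here is a Python sketch that picks a narrow or wide element
type for a string literal.  The function name, the `default_wide` flag
standing in for the default element type variable, and the assumption that
base-char is an 8-bit type are all hypothetical.

```python
# Sketch: a string reader that widens its result only when it has to.
from array import array

BASE_CHAR_LIMIT = 256  # assumption: base-char holds codes 0..255


def read_string_literal(text: str, default_wide: bool = False) -> array:
    """Return the literal's codes in a narrow or wide fixed-width array.

    A narrow array ('B', one octet per element) is used unless the default
    asks for wide strings or some character does not fit; in that case a
    wide array ('L', at least four octets per element) is built instead.
    """
    codes = [ord(ch) for ch in text]
    needs_wide = any(code >= BASE_CHAR_LIMIT for code in codes)
    typecode = "L" if (default_wide or needs_wide) else "B"
    return array(typecode, codes)


print(read_string_literal("plain").typecode)
print(read_string_literal("\u03bb-calculus").typecode)
```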

(This also avoids any confusion from STRING being a supertype of
BASE-STRING.)

Note that it is the programmer's responsibility to choose and declare
suitable character and string types, if they want to write a program
that works efficiently with both BASE-CHAR and larger character sets.
The implementation cannot possibly know enough to make the right
choices.  It can only offer a selection of types and interfaces to
control the types for each language feature.
--
Pekka P. Pirinen, Global Graphics Software Limited
In cyberspace, everybody can hear you scream.  - Gary Lewandowski


 
Ray Dillinger
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Ray Dillinger <b...@sonic.net>
Date: Wed, 20 Mar 2002 22:29:16 GMT
Local: Wed, Mar 20 2002 5:29 pm
Subject: Re: Wide character implementation

"Thomas Bushnell, BSG" wrote:

> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.

> But suppose one wants to have the character datatype be 32-bit Unicode
> characters?  Or worse yet, 35-bit Unicode characters?

> At the same time, most characters in the system will of course not be
> wide.  What are the sane implementation strategies for this?

I'd have a fixed-width internal representation -- probably 32 bits,
although that's overkill by about a byte and a half, probably
identical to some mapping of the Unicode character set -- and then
use i/o functions that were character-set aware and could translate
to and from various character sets and representations.

I wouldn't want to muck about internally with a format that had
characters of various different widths: too much pain to implement,
too many chances to introduce bugs, not enough space savings.
Besides, when people read whole files as strings, do you really
want to run through the whole string counting multi-byte characters
and single-byte characters to find the value of an expression like

(string-ref FOO charcount)  ;; lookups in a 32 million character string!

where charcount is large?  I don't.  Constant width means O(1) lookup
time.
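
To make the cost difference concrete, here is a Python sketch with
hypothetical names.  Indexing a list of code points is a single lookup,
while finding character number n in UTF-8 data means scanning lead bytes
from the start.

```python
# Fixed-width lookup versus scanning a variable-width encoding.

def fixed_width_char_at(codes: list, index: int) -> str:
    """O(1): every element is one character's code point."""
    return chr(codes[index])


def utf8_char_at(buf: bytes, index: int) -> str:
    """O(n): walk the octets, counting lead bytes until `index` is reached."""
    if index < 0:
        raise IndexError("negative index")
    seen = -1
    start = None
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new character
            seen += 1
            if seen == index:
                start = pos              # the wanted character begins here
            elif seen == index + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return buf[start:].decode("utf-8")   # the wanted character was the last one


text = "a\u00f1\u20acz"
codes = [ord(ch) for ch in text]
encoded = text.encode("utf-8")
print(fixed_width_char_at(codes, 2) == utf8_char_at(encoded, 2))
```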

If space is limited, or if you're doing very serious performance
tuning, you might want to have two separate constant-width internal
character representations, one for short characters (ASCII or 16-bit)
and one for long (full Unicode).  But if so, you're going to have to
take into account the extra space that will be used by the
additional executable code in your character and string comparisons
and manipulation functions, and deal with the increased complexity
there. That would introduce some mild insanity and chances for a few
bugs, but imo it's not as bad as variable-width characters.

What is sane, however, depends deeply on what environment you expect
to be in.  You have to ask yourself whether the scheme you're writing
will be used with data in multiple character sets.  

For example, will users want to read strings in ebcdic and write
them in unicode?  How about the multiple incompatible versions of
ebcdic?  Do you have to support them, or can we let them die now?
Will your implementation want to read and produce both UTF-8 and
UTF-16 output?  Will you have to handle miscellaneous ISO character
sets that have different characters mapped to the same character
codes above 127?  Or obsolete ascii where the character code we
use as backslash used to mean 1/8?  How about five-bit Baudot
coding?  :-)

Get character i/o functions that do translation, and then the
lookups and references and compares and everything just work for
free with simple code, and all you have to do to support a new
character set is to provide a new mapping that the i/o functions
can use.
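
A small Python sketch of that division of labour, using Python's built-in
codecs as the "mappings".  The two function names are hypothetical; the
codec names (cp037 for one EBCDIC variant, latin-1, iso8859_5, utf-8,
utf-16-le) are standard Python codecs.

```python
# Translate at the i/o boundary; keep one fixed-width form inside.
from typing import List


def read_text(raw: bytes, charset: str) -> List[int]:
    """Decode external octets into the internal form: a list of code points."""
    return [ord(ch) for ch in raw.decode(charset)]


def write_text(codes: List[int], charset: str) -> bytes:
    """Encode the internal code points for some external character set."""
    return "".join(chr(code) for code in codes).encode(charset)


# Read EBCDIC, write UTF-8 and UTF-16 from the same internal data.
ebcdic = "Hello".encode("cp037")
internal = read_text(ebcdic, "cp037")
print(write_text(internal, "utf-8"))
print(len(write_text(internal, "utf-16-le")))

# The same octet above 127 means different characters in different ISO sets.
print(read_text(b"\xe4", "latin-1"), read_text(b"\xe4", "iso8859_5"))
```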


 
Andy Heninger
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: "Andy Heninger" <an...@jtcsv.com>
Date: Thu, 21 Mar 2002 06:53:06 GMT
Local: Thurs, Mar 21 2002 1:53 am
Subject: Re: Wide character implementation
"Ray Dillinger" <b...@sonic.net> wrote

> Get character i/o functions that do translation, and then the
> lookups and references and compares and everything just work for
> free with simple code, and all you have to do to support a new
> character set is to provide a new mapping that the i/o functions
> can use.

If you want to provide full up international support, the code for string
manipulation becomes anything but simple, no matter what your string
representation.  Think string compares that respect the cultural conventions
of different countries and languages (collation), for example.  And if
you're thinking Unicode, this is the direction you're headed.

See IBM's open source Unicode library for a good example of what's
involved -
http://oss.software.ibm.com/icu
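
A tiny Python illustration of the gap: comparing by code point puts an
accented capital after "z".  The `crude_key` function below is a made-up
stand-in that strips combining marks and case; it is not real collation,
which needs per-language rules of the kind a library such as ICU provides.

```python
# Code point order is not dictionary order.
import unicodedata


def crude_key(word: str) -> str:
    """Illustrative only: drop combining marks and case before comparing."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["zebra", "\u00c4pfel", "apple"]

by_code_point = sorted(words)
by_crude_key = sorted(words, key=crude_key)

print(by_code_point)
print(by_crude_key)
```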

   -- Andy Heninger
      henin...@us.ibm.com


 
Erik Naggum
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.net>
Date: Thu, 21 Mar 2002 10:14:25 GMT
Local: Thurs, Mar 21 2002 5:14 am
Subject: Re: Wide character implementation
* Pekka P. Pirinen
| Note that it is the programmer's responsibility to choose and declare
| suitable character and string types, if they want to write a program
| that works efficiently with both BASE-CHAR and larger character sets.

  If they want that, they should always use the types string and character.
  Only if the programmer knows that he creates base-string and works with
  base-char objects, should he so declare them.  Since string is carefully
  worded to be a collection of types, an implementation that declares
  strings exclusively will work for all subtypes of string.

///
--
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.


 
Erik Naggum
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.net>
Date: Thu, 21 Mar 2002 10:15:47 GMT
Local: Thurs, Mar 21 2002 5:15 am
Subject: Re: Wide character implementation
* David Rush <k...@bellsouth.net>
| Erik, this is beneath you. Surely you know that Octet != Character.

  If you think this is about octets, you are retarded and proud of it.

| >   Acquire clue, _then_ post, OK?
|
| In context, rather pathetic, this seems...

  Learn of what you speak, _then_ become a snotty asshole, OK?

///
--
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.


 
Ray Dillinger
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Ray Dillinger <b...@sonic.net>
Date: Thu, 21 Mar 2002 16:21:57 GMT
Local: Thurs, Mar 21 2002 11:21 am
Subject: Re: Wide character implementation

Andy Heninger wrote:

> "Ray Dillinger" <b...@sonic.net> wrote

> If you want to provide full up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation.  Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example.  And if
> you're thinking Unicode, this is the direction you're headed.

I dunno. As implementor I want to make it *possible* to
implement all the complications.  I want to take the major
barriers out of the way and deal with encodings intelligently.  
I'm willing to leave presentation and non-default collation
to the authors of language packages.  Let someone who knows
and cares implement that as a library; I want to provide the
foundation stones so that she can, and provide default
semantics on anonymous characters (which, to me, includes
anything outside of the latin, european, extended latin,
and math planes) that are logical, consistent, and overridable.

Should the REPL rearrange itself to go top-char-to-bottom,
right-column-to-left, with prompts appearing at the top,
if someone has named their variables and defined their
symbols with kanji characters instead of latin? It's an
interesting thought.  Should program code go in boustrophedon
(alternating left-to-right in rows from top down) if someone
has named stuff using hieroglyphics? Um, maybe....  But is
the scheme system really where that kind of support is
needed, or would it just confuse people? And what's the
indentation convention for boustrophedon?

Maybe that last byte-and-a-half should be used for left-right
and up-down and spacing properties and the scheme system itself
ought to do all that stuff.  But it's not so important I'm
going to implement it before, say, read-write invariance on
procedure objects.

                        Bear


 
Duane Rettig
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Duane Rettig <du...@franz.com>
Date: Thu, 21 Mar 2002 18:00:01 GMT
Local: Thurs, Mar 21 2002 1:00 pm
Subject: Re: Wide character implementation

"Andy Heninger" <an...@jtcsv.com> writes:
> "Ray Dillinger" <b...@sonic.net> wrote
> > Get character i/o functions that do translation, and then the
> > lookups and references and compares and everything just work for
> > free with simple code, and all you have to do to support a new
> > character set is to provide a new mapping that the i/o functions
> > can use.

Even before our current version of Allegro CL (6.1), we were
supporting external-formats to exactly that extent, and it has
been extendible (for the most part).  See

http://www.franz.com/support/documentation/6.0/doc/iacl.htm#locales-1

> If you want to provide full up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation.  Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example.  And if
> you're thinking Unicode, this is the direction you're headed.

> See IBM's open source Unicode library for a good example of what's
> involved -
> http://oss.software.ibm.com/icu

We incorporate a large amount of IBM's work (and other work, as well)
in our current localization support.  See

http://www.franz.com/support/documentation/6.1/doc/iacl.htm#localizat...

Note that we have chosen not to support LC_CTYPE and LC_MESSAGES at this time.
Also, LC_COLLATE is not supported for 6.1, but Unicode Collation Element
Tables (UCETs) will be supported for 6.2.

--
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   du...@Franz.COM (internet)


 
Sander Vesik
Newsgroups: comp.lang.lisp, comp.lang.scheme
From: Sander Vesik <san...@haldjas.folklore.ee>
Date: Fri, 22 Mar 2002 21:13:12 +0000 (UTC)
Local: Fri, Mar 22 2002 4:13 pm
Subject: Re: Wide character implementation
In comp.lang.scheme Erik Naggum <e...@naggum.net> wrote:

> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF-8 or UTF-16 - you cannot rely on whatever size
> | you pick to be suitably long forever; Unicode is sort of inherently
> | variable-length (characters even have two possible representations
> | in many cases, ä and similar 8-)

>  Variable-length characters?  What the hell are you talking about?  UTF-8
>  is a variable-length _encoding_ of characters that most certainly are
>  intended to require a fixed number of bits.  That is, unless you think
>  the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
>  the symbol ± takes up 8.  Then you have variable-length characters.  Few
>  people consider this a meaningful way of talking about variable length.

Wake up, smell the coffee and learn about 'combiners'. And then *think*
just a little bit, including about things like collation, sort order
and similar.
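
For readers unfamiliar with combining characters, a short Python sketch
using the standard unicodedata module shows the two representations of ä
mentioned earlier in the thread.  The variable names are illustrative.

```python
# One user-visible character, two code point sequences.
import unicodedata

precomposed = "\u00e4"     # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"      # 'a' followed by COMBINING DIAERESIS

print(len(precomposed), len(combining))   # one code point versus two
print(precomposed == combining)           # the raw sequences differ

# Normalization maps between the two forms.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, decomposed == combining)
```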

> ///

--
        Sander

+++ Out of cheese error +++


 