* Thomas Bushnell, BSG
| If one uses tagged pointers, then it's easy to implement fixnums as
| ASCII characters efficiently.
Huh? No sense this makes.
| But suppose one wants to have the character datatype be 32-bit Unicode | characters? Or worse yet, 35-bit Unicode characters?
Unicode is a 31-bit character set. The base multilingual plane is 16 bits wide, and then there is the possibility of 20 bits encoded in two 16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (- (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme, but one does not have to understand the lo- and hi-word codes that make up the 20-bit character space. In effect, you need 21 bits. Therefore, you could represent characters with the following bit pattern, with b for bits and c for code. Fonts are a mistake, so the font field is removed.
000000ccccccccccccccccccccc00110
This is useful when the fixnum type tag is either 000 for even fixnums or 100 for odd fixnums, effectively 00 for fixnums. This makes char-code and code-char a single shift operation. Of course, char-bits and char-font are not supported in this scheme, but if you _really_ have to, the upper 4 bits may be used for char-bits.
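As a rough illustration of the layout above, here is a Python sketch of a 32-bit tagged word. It assumes the 21-bit code field sits directly above the five-bit tag 00110 and that a fixnum is its value shifted left past a two-bit 00 tag. The functions make_char, char_code and code_char are hypothetical Python stand-ins, not the Common Lisp operators, and OR-ing the tag back in inside code_char is this sketch's own choice rather than something the post spells out.

```python
CHAR_TAG = 0b00110    # low five bits of a character word, taken from the pattern above
CHAR_SHIFT = 5        # the code field starts at bit 5
FIXNUM_SHIFT = 2      # assumption: fixnums carry a 00 tag in the low two bits
CODE_BITS = 21        # number of c bits in the pattern


def make_char(code: int) -> int:
    """Build a tagged character word from a raw (untagged) code."""
    if not 0 <= code < (1 << CODE_BITS):
        raise ValueError("code does not fit in 21 bits")
    return (code << CHAR_SHIFT) | CHAR_TAG


def char_code(char_word: int) -> int:
    """Tagged character -> tagged fixnum.

    Shifting right by 3 drops the low tag bits 110 and leaves the
    remaining 00 in the low two bits, which is the assumed fixnum tag.
    """
    return char_word >> (CHAR_SHIFT - FIXNUM_SHIFT)


def code_char(fixnum_word: int) -> int:
    """Tagged fixnum -> tagged character: shift left by 3, then restore the tag."""
    return (fixnum_word << (CHAR_SHIFT - FIXNUM_SHIFT)) | CHAR_TAG


a = make_char(ord("A"))
print(format(a, "032b"))             # the tagged character word
print(format(char_code(a), "032b"))  # the same code as a tagged fixnum
```

Because the conversion is only a shift, no untagging or retagging of the fixnum is needed in char_code, which is the attraction of picking a character tag that ends in 00 after the shift.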
| At the same time, most characters in the system will of course not be | wide. What are the sane implementation strategies for this?
I would (again) recommend actually reading the specification. The character type can handle everything, but base-char could handle the 8-bit things that reasonable people use. The normal string type has character elements while base-string has base-char elements. It would seem fairly reasonable to implement a *read-default-string-type* that would take string or base-string as value if you choose to implement both string types.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
"Pierpaolo BERNARDI" <pierpaolo_berna...@hotmail.com> writes: > "Thomas Bushnell, BSG" <tb+use...@becket.net> ha scritto nel messaggio > news:87wuw92lhc.fsf@becket.becket.net...
> > If one uses tagged pointers, then it's easy to implement fixnums as
> > ASCII characters efficiently.
> > But suppose one wants to have the character datatype be 32-bit Unicode > > characters? Or worse yet, 35-bit Unicode characters?
In comp.lang.scheme Thomas Bushnell, BSG <tb+use...@becket.net> wrote:
> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.
> But suppose one wants to have the character datatype be 32-bit Unicode > characters? Or worse yet, 35-bit Unicode characters?
They use either UTF8 or UTF16 - you cannot rely on whatever size you pick to be suitably long forever, Unicode is sort of inherently variable-length (characters even have two possible representations in many cases, ä and similar 8-)
> At the same time, most characters in the system will of course not be > wide. What are the sane implementation strategies for this?
Implement them as variable-length strings using say UTF-8. Also, saying that most characters will not be wide may well be a wrong assumption 8-)
> * Thomas Bushnell, BSG
> | If one uses tagged pointers, then it's easy to implement fixnums as
> | ASCII characters efficiently.
> Huh? No sense this makes.
> | But suppose one wants to have the character datatype be 32-bit Unicode > | characters? Or worse yet, 35-bit Unicode characters?
> Unicode is a 31-bit character set. The base multilingual plane is 16
> bits wide, and then there is the possibility of 20 bits encoded in two
> 16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (-
> (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
> but one does not have to understand the lo- and hi-word codes that make
> up the 20-bit character space. In effect, you need 21 bits. Therefore,
> you could represent characters with the following bit pattern, with b for
> bits and c for code. Fonts are a mistake, so the font field is removed.
> 000000ccccccccccccccccccccc00110
I don't think this is true any more as of Unicode 3.1; afaik, 16 bits is no longer enough.
> "Thomas Bushnell, BSG" <tb+use...@becket.net> ha scritto > > But suppose one wants to have the character datatype be 32-bit Unicode > > characters? Or worse yet, 35-bit Unicode characters?
* Sander Vesik <san...@haldjas.folklore.ee>
| I don't think this is true any more as of Unicode 3.1; afaik, 16 bits is
| no longer enough.
Please pay attention and actually make an effort to read what you respond to, will you? You should also be able to count the number of c bits and arrive at a number greater than 16 if you do not get lost on the way.
Sheesh, some people.
* Sander Vesik <san...@haldjas.folklore.ee>
| They use either UTF8 or UTF16 - you cannot rely on whatever size
| you pick to be suitably long forever, Unicode is sort of inherently
| variable-length (characters even have two possible representations
| in many cases, ä and similar 8-)
Variable-length characters? What the hell are you talking about? UTF-8 is a variable-length _encoding_ of characters that most certainly are intended to require a fixed number of bits. That is, unless you think the digit 3 takes up only 6 bits while the letter A takes up 7 bits and the symbol ± takes up 8. Then you have variable-length characters. Few people consider this a meaningful way of talking about variable length.
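The distinction between a character's code and its encoded length can be checked directly. The Python sketch below is only an illustration (the helper name describe is made up for it): it reports how many significant bits each code has and how many octets UTF-8 spends on it.

```python
def describe(ch: str) -> tuple:
    """Return (significant bits in the code, octets used by UTF-8) for one character."""
    code = ord(ch)
    return code.bit_length(), len(ch.encode("utf-8"))


for ch in "3A±":
    bits, octets = describe(ch)
    print(f"U+{ord(ch):04X} {ch!r}: {bits} significant bits, {octets} UTF-8 octet(s)")
```

The digit 3, the letter A and ± need 6, 7 and 8 significant bits respectively, yet nobody calls the codes themselves variable-length; only the UTF-8 byte sequences vary (one octet for the first two, two octets for ±).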
| Implement them as variable-length strings using say UTF-8. Also, saying
| that most characters will not be wide may well be a wrong assumption 8-)
Real programming languages work with real character objects, not just UTF-8-encoded strings in memory.
Acquire clue, _then_ post, OK?
>Um, Unicode version 3.1.1 has the following as the largest character:
>E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;
>Now the Unicode space isn't sparse, but I don't think compressing the >space is the most efficient strategy.
Um, what's your point? E007f fits in 20 bits. If you're thinking that's all that's needed, there are private use areas (E000..F8FF, F0000..FFFFD, and 100000..10FFFD) that need to be encoded too. So 21 bits looks right.
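The bit counts in this exchange are easy to verify. The following Python sketch redoes the arithmetic: the number of codes reachable through the surrogate scheme, the width of the code points mentioned, and, as an extra illustration, the standard UTF-16 split of a supplementary code point into a high and a low surrogate. The function name to_surrogates is made up for this sketch.

```python
BMP = 2 ** 16                # codes in the base multilingual plane
SURROGATES = 1024 + 1024     # BMP codes reserved as high and low surrogates
SUPPLEMENTARY = 2 ** 20      # codes reachable through a surrogate pair

total_codes = SUPPLEMENTARY + (BMP - SURROGATES)
print(total_codes)                   # 1112064, matching the Lisp expression in the thread

print((0xE007F).bit_length())        # 20: CANCEL TAG fits in 20 bits
print((0x10FFFD).bit_length())       # 21: the last private-use code needs 21


def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000            # 20 bits of payload
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print([hex(unit) for unit in to_surrogates(0x10FFFD)])
```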
tmo...@sea-tmoore-l.dotcast.com (Tim Moore) writes: > Um, what's your point? E007f fits in 20 bits. If you're thinking > that's all that's needed, there are private use areas (E000..F8FF, > F0000..FFFFD, and 100000..10FFFD) that need to be encoded too. So 21 > bits looks right.
Oh what an embarrassing brain fart, yes that's quite right. I don't know what I was counting, but my head was clearly on backwards.
Erik Naggum <e...@naggum.net> writes:
> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF8 or UTF16 - you cannot rely on whatever size
> | you pick to be suitably long forever, Unicode is sort of inherently
> | variable-length (characters even have two possible representations
> | in many cases, ä and similar 8-)
> Variable-length characters? What the hell are you talking about? UTF-8
> is a variable-length _encoding_ of characters that most certainly are
> intended to require a fixed number of bits. That is, unless you think
> the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
> the symbol ± takes up 8. Then you have variable-length characters. Few
> people consider this a meaningful way of talking about variable length.
Erik, this is beneath you. Surely you know that Octet != Character.
> Acquire clue, _then_ post, OK?
In context, rather pathetic, this seems...
david rush -- The important thing is victory, not persistence. -- the Silicon Valley Tarot
Erik Naggum <e...@naggum.net> writes: > * Thomas Bushnell, BSG > | At the same time, most characters in the system will of course not be > | wide. What are the sane implementation strategies for this?
> [...] The normal string type has character elements while > base-string has base-char elements. It would seem fairly > reasonable to implement a *read-default-string-type* that would > take string or base-string as value if you choose to implement > both string types.
Yes, that's basically it.
In actual fact, Liquid and Lispworks have *DEFAULT-CHARACTER-ELEMENT-TYPE* for various functions taking an :ELEMENT-TYPE argument, and other similar needs. See <http://www.xanalys.com/software_tools/reference/lwl42/LWRM-U/html/lwr...>. Although the doc doesn't say it (there's a lot of unpublished doc on fat characters), LW:*DEFAULT-CHARACTER-ELEMENT-TYPE* also controls what kind of strings the reader constructs from the "" syntax. However, if characters of larger types are seen by the string reader, a string that can hold these characters is constructed without complaint.
(This also avoids any confusion from STRING being a supertype of BASE-STRING.)
Note that it is the programmer's responsibility to choose and declare suitable character and string types, if they want to write a program that works efficiently with both BASE-CHAR and larger character sets. The implementation cannot possibly know enough to make the right choices. It can only offer a selection of types and interfaces to control the types for each language feature.
--
Pekka P. Pirinen, Global Graphics Software Limited
In cyberspace, everybody can hear you scream. - Gary Lewandowski
> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.
> But suppose one wants to have the character datatype be 32-bit Unicode > characters? Or worse yet, 35-bit Unicode characters?
> At the same time, most characters in the system will of course not be > wide. What are the sane implementation strategies for this?
I'd have a fixed-width internal representation -- probably 32 bits, although that's overkill by about a byte and a half, probably identical to some mapping of the Unicode character set -- and then use I/O functions that were character-set aware and could translate to and from various character sets and representations.
I wouldn't want to muck about internally with a format that had characters of various different widths: too much pain to implement, too many chances to introduce bugs, not enough space savings. Besides, when people read whole files as strings, do you really want to run through the whole string counting multi-byte characters and single-byte characters to find the value of an expression like
(string-ref FOO charcount) ;; lookups in a 32 million character string!
where charcount is large? I don't. Constant width means O(1) lookup time.
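To make the cost difference concrete, here is a small Python sketch that indexes the same text stored two ways: as UTF-8 bytes, where reaching character n means walking over every earlier character, and as fixed-width 32-bit units, where the position is plain arithmetic. The function names utf8_ref and utf32_ref are hypothetical, and the sketch assumes well-formed input and a non-negative index.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length in octets of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_ref(data: bytes, index: int) -> str:
    """O(n): skip over `index` characters one at a time, then decode the next one."""
    pos = 0
    for _ in range(index):
        pos += utf8_sequence_length(data[pos])
    return data[pos:pos + utf8_sequence_length(data[pos])].decode("utf-8")


def utf32_ref(data: bytes, index: int) -> str:
    """O(1): every character occupies exactly four octets."""
    chunk = data[4 * index:4 * index + 4]
    if len(chunk) != 4:
        raise IndexError("string index out of range")
    return chunk.decode("utf-32-le")


text = "a±漢😀z"
print(utf8_ref(text.encode("utf-8"), 3))
print(utf32_ref(text.encode("utf-32-le"), 3))
```

Both calls return the same character; the difference is that utf8_ref has to touch every preceding character, which is what hurts for a lookup deep inside a 32-million-character string.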
If space is limited, or if you're doing very serious performance tuning, you might want to have two separate constant-width internal character representations, one for short characters (ASCII or 16-bit) and one for long (full Unicode). But if so, you're going to have to take into account the extra space that will be used by the additional executable code in your character and string comparison and manipulation functions, and deal with the increased complexity there. That would introduce some mild insanity and chances for a few bugs, but imo it's not as bad as variable-width characters.
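One way to picture the two-representation idea is the Python sketch below, which stores a string in 16-bit units when every character fits and falls back to wider units otherwise. Either way each string is constant-width, so indexing stays O(1). The names pack and char_at are hypothetical, and using the array module's "H" and "L" typecodes as the short and long forms is simply a convenient assumption for the sketch.

```python
from array import array


def pack(text: str) -> array:
    """Store text in the narrowest of two constant-width forms that can hold it."""
    widest = max(map(ord, text), default=0)
    # "H" holds at least 16 bits per element; "L" holds at least 32.
    typecode = "H" if widest <= 0xFFFF else "L"
    return array(typecode, map(ord, text))


def char_at(packed: array, index: int) -> str:
    """Constant-time lookup regardless of which width was chosen."""
    return chr(packed[index])


short = pack("naïve")
long_ = pack("a😀")
print(short.typecode, short.itemsize, char_at(short, 2))
print(long_.typecode, long_.itemsize, char_at(long_, 1))
```

The extra complexity mentioned above shows up as soon as two strings of different widths have to be compared or concatenated: every such operation needs a case for each pairing.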
What is sane, however, depends deeply on what environment you expect to be in. You have to ask yourself whether the scheme you're writing will be used with data in multiple character sets.
For example, will users want to read strings in EBCDIC and write them in Unicode? How about the multiple incompatible versions of EBCDIC? Do you have to support them, or can we let them die now? Will your implementation want to read and produce both UTF-8 and UTF-16 output? Will you have to handle miscellaneous ISO character sets that have different characters mapped to the same character codes above 127? Or obsolete ASCII where the character code we use as backslash used to mean 1/8? How about five-bit Baudot coding? :-)
Get character i/o functions that do translation, and then the lookups and references and compares and everything just work for free with simple code, and all you have to do to support a new character set is to provide a new mapping that the i/o functions can use.
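A minimal sketch of that design in Python: all translation happens at the I/O boundary, and the program only ever sees a list of fixed-width codes. The functions read_text and write_text are hypothetical names, and the sketch leans on Python's built-in codec tables ("cp037" is one of several EBCDIC variants it ships) in place of mappings you would supply yourself.

```python
def read_text(raw: bytes, external_format: str) -> list:
    """Translate bytes in some external character set into internal codes."""
    return [ord(ch) for ch in raw.decode(external_format)]


def write_text(codes: list, external_format: str) -> bytes:
    """Translate internal codes back out into the requested external format."""
    return "".join(map(chr, codes)).encode(external_format)


# Bytes as they might arrive from an EBCDIC system.
ebcdic_bytes = "Hello".encode("cp037")

codes = read_text(ebcdic_bytes, "cp037")
print(codes)                        # the internal codes, independent of the source format
print(write_text(codes, "utf-8"))
print(write_text(codes, "utf-16-le"))
```

Supporting another character set then means registering one more mapping; nothing between read_text and write_text has to change.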
> Get character i/o functions that do translation, and then the > lookups and references and compares and everything just work for > free with simple code, and all you have to do to support a new > character set is to provide a new mapping that the i/o functions > can use.
If you want to provide full-up international support, the code for string manipulation becomes anything but simple, no matter what your string representation. Think string compares that respect the cultural conventions of different countries and languages (collation), for example. And if you're thinking Unicode, this is the direction you're headed.
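A small Python sketch shows why comparing codes is not collation. Sorting by code point puts "Äpfel" after "zebra"; a crude key that strips accents and folds case gives an order most readers would expect. The key function naive_key is a made-up illustration only: real collation (locale-specific tailoring, the Unicode collation tables) is considerably more involved and is not what this code implements.

```python
import unicodedata


def naive_key(s: str) -> str:
    """Crude sort key: decompose, drop combining marks, fold case.

    This is an illustrative stand-in, not a real collation algorithm.
    """
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


words = ["zebra", "Äpfel", "apple"]
by_code_point = sorted(words)
by_naive_key = sorted(words, key=naive_key)
print(by_code_point)
print(by_naive_key)
```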
* Pekka P. Pirinen | Note that it is the programmer's responsibility to choose and declare | suitable character and string types, if they want to write a program | that works efficiently with both BASE-CHAR and larger character sets.
If they want that, they should always use the types string and character. Only if the programmer knows that he creates base-string objects and works with base-char objects should he so declare them. Since string is carefully worded to be a collection of types, an implementation that declares strings exclusively will work for all subtypes of string.
> If you want to provide full-up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation. Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example. And if
> you're thinking Unicode, this is the direction you're headed.
I dunno. As an implementor, I want to make it *possible* to implement all the complications. I want to take the major barriers out of the way and deal with encodings intelligently. I'm willing to leave presentation and non-default collation to the authors of language packages. Let someone who knows and cares implement that as a library; I want to provide the foundation stones so that she can, and provide default semantics on anonymous characters (which, to me, includes anything outside of the Latin, European, extended Latin, and math planes) that are logical, consistent, and overridable.
Should the REPL rearrange itself to go top-char-to-bottom, right-column-to-left, with prompts appearing at the top, if someone has named their variables and defined their symbols with kanji characters instead of Latin? It's an interesting thought. Should program code go in boustrophedon (alternating left-to-right and right-to-left in rows from the top down) if someone has named stuff using hieroglyphics? Um, maybe.... But is the scheme system really where that kind of support is needed, or would it just confuse people? And what's the indentation convention for boustrophedon?
Maybe that last byte-and-a-half should be used for left-right and up-down and spacing properties and the scheme system itself ought to do all that stuff. But it's not so important I'm going to implement it before, say, read-write invariance on procedure objects.
"Andy Heninger" <an...@jtcsv.com> writes: > "Ray Dillinger" <b...@sonic.net> wrote > > Get character i/o functions that do translation, and then the > > lookups and references and compares and everything just work for > > free with simple code, and all you have to do to support a new > > character set is to provide a new mapping that the i/o functions > > can use.
Even before our current version of Allegro CL (6.1), we were supporting external-formats to exactly that extent, and it has been extendible (for the most part). See
> If you want to provide full-up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation. Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example. And if
> you're thinking Unicode, this is the direction you're headed.
Note that we have chosen not to support LC_CTYPE and LC_MESSAGES at this time. Also, LC_COLLATE is not supported for 6.1, but Unicode Collation Element Tables (UCETs) will be supported for 6.2.
--
Duane Rettig Franz Inc. http://www.franz.com/ (www)
1995 University Ave Suite 275 Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)
In comp.lang.scheme Erik Naggum <e...@naggum.net> wrote:
> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF8 or UTF16 - you cannot rely on whatever size
> | you pick to be suitably long forever, Unicode is sort of inherently
> | variable-length (characters even have two possible representations
> | in many cases, ä and similar 8-)
> Variable-length characters? What the hell are you talking about? UTF-8
> is a variable-length _encoding_ of characters that most certainly are
> intended to require a fixed number of bits. That is, unless you think
> the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
> the symbol ± takes up 8. Then you have variable-length characters. Few
> people consider this a meaningful way of talking about variable length.
Wake up, smell the coffee and learn about 'combiners'. And then *think* just a little bit, including about things like collation, sort order and similar.
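For readers unfamiliar with combining characters, the ä example from earlier in the thread can be reproduced with Python's standard unicodedata module. The sketch below only illustrates that one user-visible character may be a single code or a base letter plus a combining mark, and that normalization maps between the two forms; it takes no side on which internal representation is preferable.

```python
import unicodedata

precomposed = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"     # 'a' followed by COMBINING DIAERESIS

print(len(precomposed), len(combining))   # 1 2: same visible character, different code counts
print(precomposed == combining)           # False: a plain code-by-code compare sees two strings

# Normalizing to a common form makes the two spellings compare equal.
print(unicodedata.normalize("NFC", combining) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == combining)
```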