I've managed to avoid worrying about characters and strings and all the related horrors so far, but I've finally been forced into having to care about them.
The particular thing I don't understand is what type a literal string has. It looks at first sight as if it should be something capable of holding any CHARACTER, but I'm not really sure if that's right. It looks to me as if it might be possible to read things such that it's OK to return something that can only hold a subtype of CHARACTER in some cases.
I'm actually more concerned with the flip side of this -- if almost all the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?) but sometimes I get some ginormous multibyte unicode thing or something, because I have to deal with some C code which is blithely assuming that unsigned chars are just small integers and strings are arrays of small integers and so on in the usual C way, and I'm not sure that I can trust my strings to be the same as its strings.
I realise that people who care about character issues are probably laughing at me at this point, but my main aim is to keep everything as simple as I can, and especially I don't want to have to keep copying my strings into arrays of small integers (which I was doing at one point, but it's too hairy).
The practical question I guess is -- are there any implementations which do currently have really big characters in strings? Genera seems to, but that's of limited interest. CLISP seems to have internationalisation stuff in it, and I know there's an international Allegro, so those might have horrors in them.
Thanks for any advice.
--tim `7 bit ASCII was good enough for my father and it's good enough for me' Bradshaw.
In article <ey3hfe73nm4....@cley.com>, Tim Bradshaw <t...@cley.com> wrote:
>I realise that people who care about character issues are probably
>laughing at me at this point, but my main aim is to keep everything as
>simple as I can, and especially I don't want to have to keep copying
>my strings into arrays of small integers (which I was doing at one
>point, but it's too hairy).
You can call ARRAY-ELEMENT-TYPE on the string to find out if it contains anything weird. If it's compatible with your foreign function's API, then you don't need to copy it.
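A minimal sketch of that check, assuming the foreign side is happy with any string whose element type is a subtype of BASE-CHAR. STRING-ELEMENT-REPORT is a hypothetical helper name, not part of any implementation's FFI.

```lisp
;; Hypothetical helper: report a string's element type and whether that
;; type is known to be a subtype of BASE-CHAR.
(defun string-element-report (string)
  (let ((element-type (array-element-type string)))
    ;; SUBTYPEP returns two values; the second says whether the first
    ;; is a definite answer, so keep both.
    (multiple-value-bind (base-p certain-p)
        (subtypep element-type 'base-char)
      (list :element-type element-type
            :base-char-p base-p
            :certain-p certain-p))))

(string-element-report (make-string 3 :element-type 'base-char))
```

What counts as "compatible" still depends on how wide BASE-CHAR is in the implementation at hand, which the element type alone does not tell you.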
-- Barry Margolin, bar...@bbnplanet.com GTE Internetworking, Powered by BBN, Burlington, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
* Tim Bradshaw <t...@cley.com>
| The particular thing I don't understand is what type a literal string
| has. It looks at first sight as if it should be something capable of
| holding any CHARACTER, but I'm not really sure if that's right. It looks
| to me as if it might be possible to read things such that it's OK to return
| something that can only hold a subtype of CHARACTER in some cases.
strings _always_ contain a subtype of character. e.g., an implementation that supports bits will have to discard them from strings. the only array type that can contain all character objects has element-type t.
| I'm actually more concerned with the flip side of this -- if almost all
| the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
| but sometimes I get some ginormous multibyte unicode thing or something,
| because I have to deal with some C code which is
| blithely assuming that unsigned chars are just small integers and strings
| are arrays of small integers and so on in the usual C way, and I'm not
| sure that I can trust my strings to be the same as its strings.
this is not a string issue, it's an FFI issue. if you tell your FFI that you want to ship a string to a C function, it should do the conversion for you if it needs to be performed. if you can't trust your FFI to do the necessary conversions, you need a better FFI.
| I realise that people who care about character issues are probably
| laughing at me at this point, but my main aim is to keep everything as
| simple as I can, and especially I don't want to have to keep copying my
| strings into arrays of small integers (which I was doing at one point,
| but it's too hairy).
if you worry about these things, your life is already _way_ more complex than it needs to be. a string is a string. each element of the string is a character. stop worrying beyond this point. C and Common Lisp agree on this fundamental belief, believe it or not. your _quality_ Common Lisp implementation will ensure that whatever invariants are maintained in _each_ environment.
| The practical question I guess is -- are there any implementations which
| do currently have really big characters in strings?
yes, and not only that -- it's vitally important that strings take up no more space than they need. a system that doesn't support both base-string (of base-char) and string (of extended-char) when it attempts to support Unicode will fail in the market -- Europe and the U.S. simply can't tolerate the huge growth in memory consumption from wantonly using twice as much as you need. Unicode even comes with a very intelligent compression technique because people realize that it's a waste of space to use 16 bits and more for characters in a given character set group.
| I know there's an international Allegro, so those might have horrors in
| them.
sure, but in the same vein, it might also have responsible, intelligent people behind it, not neurotics who fail to realize that customers have requirements that _must_ be resolved. Allegro CL's international version deals very well with conversion between the native system strings and its internal strings. I know -- not only do I run the International version in a test environment that needs wide characters _internally_, the test environment can't handle Unicode or anything else wide at all, and it's never been a problem.
incidentally, I don't see this as any different from whether you have a simple-base-string, a simple-string, a base-string, or a string. if you _have_ to worry, you should be the vendor or implementor of strings, not the user. if you are the user and worry, you either have a problem that you need to take up with your friendly programmer-savvy shrink, or you call your vendor and ask for support. I don't see this as any different from whether an array has a fill-pointer or not, either. if you hand it to your friendly FFI and you worry about the length of the array with or without fill-pointer, you're simply worrying too much, or you have a bug that needs to be fixed.
"might have horrors"! what's next? monster strings under your bed?
Erik is basically right that you shouldn't have to worry, unless you're specifically writing localized applications. A string will hold a character, and the FFI will convert if it can. The details of how things should work with multiple string types have not been worked out in the standard, so if you do want more control, it's non-portable.
Erik Naggum <e...@naggum.no> writes:
> Tim Bradshaw writes:
> | The particular thing I don't understand is what type a literal string
> | has. It looks at first sight as if it should be something capable of
> | holding any CHARACTER, but I'm not really sure if that's right. It looks
> | to me as if it might be possible to read things such that it's OK to return
> | something that can only hold a subtype of CHARACTER in some cases.

> strings _always_ contain a subtype of character. e.g., an implementation
> that supports bits will have to discard them from strings. the only
> array type that can contain all character objects has element-type t.
If only it were so! Unfortunately, the standard says characters with bits are of type CHARACTER and STRING = (VECTOR CHARACTER). Harlequin didn't have the guts to stop supporting them (even though there's a separate internal representation for keystroke events, now). I guess Franz did?
However, it's rarely necessary to create strings out of them, and it's easy to configure LispWorks so that never happens. Basically, there's a variable called *DEFAULT-CHARACTER-ELEMENT-TYPE* that is the default character type for all string constructors. That includes the reader's "-syntax that Tim Bradshaw was worrying about. The reader will actually silently construct wider strings if it sees a character that is not in *D-C-E-T*, it's just the default. (Note that if you're reading from a stream, you have to consider the external format on the stream first.)
> | The practical question I guess is -- are there any implementations which
> | do currently have really big characters in strings?

Allegro and LispWorks, at least. Both will use thin strings where possible (but in slightly different ways).
-- Pekka P. Pirinen
A feature is a bug with seniority. - David Aldred <david_aldred.demon.co.uk>
* Erik Naggum wrote:
> strings _always_ contain a subtype of character. e.g., an implementation
> that supports bits will have to discard them from strings. the only
> array type that can contain all character objects has element-type
> t.
I don't think this is right -- rather I agree that they contain CHARACTERs, but it looks like `bits' -- which I think now are `implementation-defined attributes' -- can end up in strings, or at least it is implementation-defined whether they do or not (2.4.5 says this I think).
> this is not a string issue, it's an FFI issue. if you tell your FFI that
> you want to ship a string to a C function, it should do the conversion
> for you if it needs to be performed. if you can't trust your FFI to do
> the necessary conversions, you need a better FFI.
Unfortunately my FFI is READ-SEQUENCE and WRITE-SEQUENCE, and at the far end of this is something which is defined in terms of treating characters as fixed-size (8 bit) small integers. And I can't change it because it's big important open source software and lots of people have it, and it's written in C so it's too hard to change anyway... So I need to be sure that nothing I can do is going to start spitting unicode or something at it.
At one point I did this by converting my strings to arrays of (UNSIGNED-BYTE 8)s on I/O, but that was stressful to do for various reasons.
In *practice* this has not turned out to be a problem but it's not clear what I need to check to make sure it is not. I suspect that checking that CHAR-CODE is always suitably small would be a good start.
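The CHAR-CODE check could be sketched like this. OCTET-CLEAN-P is a hypothetical name, and the limit of 256 is an assumption about what the far end accepts.

```lisp
;; Hypothetical predicate: true when every character's code fits in one
;; octet. CHAR-CODE values are implementation-defined, so this says
;; nothing about which encoding those octets represent on the far side.
(defun octet-clean-p (string)
  (every (lambda (ch) (< (char-code ch) 256)) string))

(octet-clean-p "plain text")
```

This walks the whole string, so it is the slow-but-certain check; anything faster has to rely on the string's type.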
> if you worry about these things, your life is already _way_ more complex
> than it needs to be. a string is a string. each element of the string
> is a character.
Well, the whole problem is that at the far end that's not true. Each element (they've decided!) is an *8-bit* character...
> yes, and not only that -- it's vitally important that strings take up no
> more space than they need. a system that doesn't support both
> base-string (of base-char) and string (of extended-char) when it attempts
> to support Unicode will fail in the market -- Europe and the U.S. simply
> can't tolerate the huge growth in memory consumption from wantonly using
> twice as much as you need. Unicode even comes with a very intelligent
> compression technique because people realize that it's a waste of space
> to use 16 bits and more for characters in a given character set group.
For what it's worth I think this is wrong (but I could be wrong of course, and anyway it's not worth arguing over). People *happily* tolerate doublings of memory & disk consumption if it suits them -- look at windows 3.x to 95, or sunos 5.5 to 5.7, or any successive pair of xemacs versions ... And they're *right* to do that because Moore's law works really well. Using compressed representations makes things more complex -- if strings are arrays, then aref &c need to have hairy special cases, and everything else gets more complex, and that complexity never goes away, which doubled-storage costs do in about a year.
So I think that in a few years compressed representations will look like the various memory-remapping tricks that DOS did, or the similar things people now do with 32 bit machines to deal with really big databases (and push, incredibly, as `the right thing', I guess because they worship intel and intel are not doing too well with their 64bit offering). The only place it will matter is network transmission of data, and I don't see why normal compression techniques shouldn't deal with that.
So my story is if you want characters twice as big, just have big characters and use more memory and disk -- it's cheap enough now that it's dwarfed by labour costs and in a year it will be half the price.
On the other hand, people really like complex fiddly solutions to things (look at C++!), so that would argue that complex character compression techniques are here to stay.
Anyway, like I said it's not worth arguing over. Time will tell.
* I wrote:
> For what it's worth I think this is wrong (but I could be wrong of
> course, and anyway it's not worth arguing over).
Incidentally I should make this clearer, as it looks like I'm arguing against fat strings. Supporting several kinds of strings is *obviously* sensible, I quibble about the compressing stuff being worth it.
* Erik Naggum
| strings _always_ contain a subtype of character. e.g., an implementation
| that supports bits will have to discard them from strings. the only
| array type that can contain all character objects has element-type t.

* Tim Bradshaw
| I don't think this is right -- rather I agree that they contain
| CHARACTERs, but it looks like `bits' -- which I think now are
| `implementation-defined attributes' -- can end up in strings, or at least
| it is implementation-defined whether they do or not (2.4.5 says this I
| think).
trivially, "strings _always_ contain a subtype of character" must be true as character is a subtype of character, but I did mean in the sense that strings _don't_ contain full character objects, despite the relegation of fonts and bits to "implementation-defined attributes". that the type string-char was removed from the language but the attributes were sort of retained is perhaps confusing, but it is quite unambiguous as to intent.
so must "the only array type that can contain all character objects has element-type t" be true, since a string is allowed to contain a subtype of type character. (16.1.2 is pertinent in this regard.) it may come as a surprise to people, but if you store a random character object into a string, you're not guaranteed that what you get back is eql to what you put into it.
furthermore, there is no print syntax for implementation-defined attributes in strings, and no implementation is allowed to add any. it is perhaps not obvious, but the retention of attributes is restricted by _both_ the string type's capabilities and the stream type's capabilities.
you can quibble with the standard all you like -- you aren't going to see any implementation-defined attributes in string literals. if you compare with CLtL1 and its explicit support for string-char which didn't support them at all, you must realize that in order to _have_ any support for implementation-defined attributes, you have to _add_ it above and beyond what strings did in CLtL1. this is an extremely unlikely addition to an implementation just after bits and fonts were removed from the language and relegated to "implementation-defined attributes".
I think the rest of your paranoid conspiratorial delusions about what "horrors" might afflict Common Lisp implementations are equally lacking in merit. like, nothing is going to start spitting Unicode at you, Tim. not until and unless you ask for it. it's called "responsible vendors".
| The only place it will matter is network transmission of data, and I
| don't see why normal compression techniques shouldn't deal with that.
then read the technical report and decrease your ignorance. sheesh.
* Tim Bradshaw <t...@cley.com>
| Incidentally I should make this clearer, as it looks like I'm arguing
| against fat strings. Supporting several kinds of strings is *obviously*
| sensible, I quibble about the compressing stuff being worth it.
compressing strings for in-memory representation of _arrays_ is nuts. nobody has proposed it, and nobody ever will. again, read the Unicode technical report and decrease both your fear and your ignorance.
In article <3162223661729...@naggum.no>, Erik Naggum <e...@naggum.no> wrote:
> so must "the only array type that can contain all character objects has
> element-type t" be true, since a string is allowed to contain a subtype
> of type character. (16.1.2 is pertinent in this regard.) it may come as
> a surprise to people, but if you store a random character object into a
> string, you're not guaranteed that what you get back is eql to what you
> put into it.
Isn't (array character (*)) able to contain all character objects?
-- Barry Margolin, bar...@bbnplanet.com
* Erik Naggum wrote:
> I think the rest of your paranoid conspiratorial delusions about what
> "horrors" might afflict Common Lisp implementations are equally lacking
> in merit. like, nothing is going to start spitting Unicode at you, Tim.
> not until and unless you ask for it. it's called "responsible
> vendors".
If my code gets a string (from wherever, the user if you like) which has bigger-than-8-bit characters in it, then tries to send it down the wire, then what will happen? I don't see this as a vendor issue, but perhaps I'm wrong.
Meantime I'm going to put in some optional checks to make sure that all my character codes are small enough.
* Barry Margolin <bar...@bbnplanet.com>
| Isn't (array character (*)) able to contain all character objects?
no. specialized vectors whose elements are of type character (strings) are allowed to store only values of a subtype of type character. this is so consistently repeated in the standard and so unique to strings that I'm frankly amazed that anyone who has worked on the standard is having such a hard time accepting it. it was obviously intended to let strings be as efficient as the old string-char concept allowed, while not denying implementations the ability to retain bits and fonts if they so chose.
an implementation that stores characters in strings as if they have null implementation-defined attributes regardless of their actual attributes is actually fully conforming to the standard. the result is that you can't expect any attributes to survive string storage. the consequences are _undefined_ if you attempt to store a character with attributes in a string that can't handle it.
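One way to probe what a given implementation does is a round trip through a one-element string. This is only a sketch: SURVIVES-STRING-STORAGE-P is a hypothetical name, and for characters with non-null implementation-defined attributes the store itself may fall under the undefined consequences described above.

```lisp
;; Hypothetical probe: store CH in a fresh one-element string and report
;; whether the element read back is EQL to the original character.
(defun survives-string-storage-p (ch)
  (let ((s (make-string 1 :element-type 'character)))
    (setf (char s 0) ch)
    (eql ch (char s 0))))

(survives-string-storage-p #\a)
```

A simple character such as #\a has to come back unchanged; the interesting inputs are whatever attribute-carrying characters the implementation can construct.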
the removal of the type string-char is the key to understanding this.
Erik Naggum <e...@naggum.no> writes:
> * Barry Margolin <bar...@bbnplanet.com>
> | Isn't (array character (*)) able to contain all character objects?

> no. specialized vectors whose elements are of type character (strings)
> are allowed to store only values of a subtype of type character. this is
> so consistently repeated in the standard and so unique to strings that
> I'm frankly amazed that anyone who has worked on the standard is having
> such a hard time accepting it.
Who replaced #:Erik with a bad imitation? This one's got all the belligerence, but not the insight we've come to expect.
You've read a different standard than I, since many places actually say "of type CHARACTER or a subtype" -- superfluously, since the glossary entry for "subtype" says "Every type is a subtype of itself." When I was designing the "fat character" support for LispWorks, I looked for a get-out clause, and it's not there.
> the consequences are _undefined_ if you attempt to store a
> character with attributes in a string that can't handle it.
This is true. It's also true of all the other specialized arrays, although different language ("must be") is used to specify that.
> the removal of the type string-char is the key to understanding this.
I suspect it was removed because it was realized that there would have to be many types of STRING (at least 8-bit and 16-bit), and hence there wasn't a single subtype of CHARACTER that would be associated with strings. Whatever the reason, we can only go by what the standard says.
I think it was a good choice, and LW specifically didn't retain the type, to force programmers to consider what the code actually meant by it (and to allow them to DEFTYPE it to the right thing). Nevertheless, there should be a standard name for the type of simple characters, i.e., with null implementation-defined attributes. LispWorks and Liquid use LW:SIMPLE-CHAR for this.
-- Pekka P. Pirinen, Harlequin Limited
The Risks of Electronic Communication http://www.best.com/~thvv/emailbad.html
* Pekka P. Pirinen
| Who replaced #:Erik with a bad imitation?
geez...
| You've read a different standard than I, since many places actually say
| "of type CHARACTER or a subtype" -- superfluously, since the glossary
| entry for "subtype" says "Every type is a subtype of itself."
sigh. this is so incredibly silly it isn't worth responding to.
| I suspect it was removed because it was realized that there would have to
| be many types of STRING (at least 8-bit and 16-bit), and hence there
| wasn't a single subtype of CHARACTER that would be associated with
| strings. Whatever the reason, we can only go by what the standard says.
the STRING type is a union type, and there are no other union types in Common Lisp. this should give you a pretty powerful hint, if you can get away from your "bad imitation" attitude problem and actually listen, but I guess that is not very likely at this time.
Tim Bradshaw <t...@cley.com> writes:
> * Erik Naggum wrote:
> > this is not a string issue, it's an FFI issue. if you tell your FFI that
> > you want to ship a string to a C function, it should do the conversion
> > for you if it needs to be performed. if you can't trust your FFI to do
> > the necessary conversions, you need a better FFI.

> Unfortunately my FFI is READ-SEQUENCE and WRITE-SEQUENCE, and at the
> far end of this is something which is defined in terms of treating
> characters as fixed-size (8 bit) small integers.
You still need a better FFI: WRITE-SEQUENCE is just as much a foreign interface as any. In theory, you specify the representation on the other side by the external format of the stream. If the system doesn't have an external format that can do this, then you're reduced to hacking it.
> In *practice* this has not turned out to be a problem but it's not
> clear what I need to check to make sure it is not. I suspect that
> checking that CHAR-CODE is always suitably small would be a good
> start.
In practice, most of us can pretend there's no encoding except ASCII. If you expect non-ASCII characters on the Lisp side, you need to know what the encoding is on the other side, otherwise it might come out wrong.
It might be enough to check the type of your strings (and perhaps the external format of the stream), instead of every character.
-- Pekka P. Pirinen, Harlequin Limited
Technology isn't just putting in the fastest processor and most RAM -- that's packaging. - Steve Wozniak
In article <3162232362158...@naggum.no>, Erik Naggum <e...@naggum.no> wrote:
>* Barry Margolin <bar...@bbnplanet.com>
>| Isn't (array character (*)) able to contain all character objects?

> no. specialized vectors whose elements are of type character (strings)
> are allowed to store only values of a subtype of type character.
You seem to be answering a different question than I asked. I didn't say "Aren't all strings of type (array character (*))?".
I realize that there are string types that are not (array character (*)), because a string can be of any array type where the element type is a subtype of character. But if you want a type that can hold any character, you can create it with:
(make-string length :element-type 'character)
In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is the default.
-- Barry Margolin, bar...@bbnplanet.com
> You still need a better FFI: WRITE-SEQUENCE is just as much a foreign
> interface as any.
Yes, in fact it's worse than most, because I can't rely on the vendor/implementor to address the issues for me!
> In theory, you specify the representation on the other side by the
> external format of the stream. If the system doesn't have an
> external format that can do this, then you're reduced to hacking it.
Right. And I'm happy to do this -- what I was asking was how I can ensure
> In practice, most of us can pretend there's no encoding except ASCII.
> If you expect non-ASCII characters on the Lisp side, you need to know
> what the encoding is on the other side, otherwise it might come out
> wrong.
Yes. And the problem is that since my stuff is a low-level facility which others (I hope) will build on, I don't really know what they will throw at me. And I don't want to check every character of the strings as this causes severe reduction in maximum performance (though I haven't spent a lot of time checking that the checker compiles really well yet, and in practice it will almost always be throttled elsewhere).
> It might be enough to check the type of your strings (and perhaps the
> external format of the stream), instead of every character.
My hope is that BASE-STRING is good enough, but I'm not sure (I don't see that a BASE-STRING could not have more than 8-bit characters, if an implementation chose to have only one string type for instance (can it?)). Checking the external format of the stream is also obviously needed but if it's :DEFAULT does that tell me anything, and if it's not I have to special case anyway.
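One way to settle the "could BASE-CHAR be wider than 8 bits here?" question is to measure it once at load time and then fall back on a cheap type test. This is a sketch, not a portable guarantee: MAX-BASE-CHAR-CODE, *BASE-CHARS-FIT-IN-OCTET-P* and CHEAP-OCTET-SAFE-P are hypothetical names, and the 256 limit is an assumption about the far end.

```lisp
;; Highest CHAR-CODE belonging to a BASE-CHAR in this implementation.
;; Scans downward from CHAR-CODE-LIMIT, so it is meant to run once.
(defun max-base-char-code ()
  (loop for code from (1- char-code-limit) downto 0
        for ch = (code-char code)   ; may be NIL for unassigned codes
        when (typep ch 'base-char)
          return code))

;; Computed once: can a BASE-CHAR ever have a code above 255 here?
(defvar *base-chars-fit-in-octet-p* (< (max-base-char-code) 256))

;; Cheap test with no per-character scan. A false result only means
;; "not proven safe"; the string may still contain nothing but small codes.
(defun cheap-octet-safe-p (string)
  (and *base-chars-fit-in-octet-p*
       (typep string 'base-string)))
```

Strings that fail the cheap test would still need the per-character CHAR-CODE check, so this only helps when most strings really are BASE-STRINGs.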
Obviously at some level I have to just have implementation-dependent checks because I don't think it says anywhere that characters are at n bits or any of that kind of grut (which is fine). Or I could just not care and pretend everything is 8-bit which will work for a while I guess.
Is there a useful, fast, check that (write-sequence x y) will write (length x) bytes on y if all is well for LispWorks / Liquid (I don't have a license for these, unfortunately)?
* Tim Bradshaw <t...@cley.com>
| Is there a useful, fast, check that (write-sequence x y) will write
| (length x) bytes on y if all is well for LispWorks / Liquid ...?
yes. make the buffer and the stream have type (unsigned-byte 8), and avoid the character abstraction which you obviously can't trust, anyway.
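A sketch of that approach, assuming character codes on the Lisp side can be sent as-is when they fit in one octet (which holds for ASCII text on ASCII-based implementations, but is an assumption in general). Both function names are hypothetical.

```lisp
;; Hypothetical helper: copy STRING into a fresh (UNSIGNED-BYTE 8) vector,
;; signalling an error rather than truncating any code above 255.
(defun string-to-octets-strict (string)
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8))))
    (dotimes (i (length string) octets)
      (let ((code (char-code (char string i))))
        (when (> code 255)
          (error "Character ~S at index ~D does not fit in one octet."
                 (char string i) i))
        (setf (aref octets i) code)))))

;; Hypothetical helper: write STRING to PATHNAME through a binary stream,
;; so exactly (LENGTH STRING) octets go out.
(defun write-string-as-octets (string pathname)
  (with-open-file (out pathname
                       :direction :output
                       :element-type '(unsigned-byte 8)
                       :if-exists :supersede)
    (write-sequence (string-to-octets-strict string) out)))
```

The cost is one copy per string, which is the trade-off being weighed in this thread.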
* Barry Margolin <bar...@bbnplanet.com>
| But if you want a type that can hold any character, you can create it with:
|
| (make-string length :element-type 'character)
no, and that's the crux of the matter. this used to be different from
(make-string length :element-type 'string-char)
in precisely the capacity that you wish is still true, but it isn't. when the type string-char was removed, character assumed its role in specialized arrays, and you could not store bits and fonts in strings any more than you could with string-char. to do that, you need arrays with element-type t.
but I'm glad we've reached the point where you assert a positive, because your claim is what I've been trying to tell you guys DOES NOT HOLD. my claim is: there is nothing in the standard that _requires_ that there be a specialized array with elements that are subtypes of character (i.e., a member of the union type "string") that can hold _all_ character objects.
can you show me where the _standard_ supports your claim?
| In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
| the default.
sure. however, I'm trying to penetrate the armor-plated belief that the resulting string is REQUIRED to retain non-null implementation-defined attributes if stored into it. no such requirement exists: a conforming implementation is completely free to provide a single string type that is able to hold only simple characters. you may think this is a mistake in the standard, but it's exactly what it says, after the type string-char was removed.
methinks you're stuck in CLtL1 days, Barry, and so is this bad imitation jerk from Harlequin, but that's much less surprising.
> | Is there a useful, fast, check that (write-sequence x y) will write
> | (length x) bytes on y if all is well for LispWorks / Liquid ...?
> yes. make the buffer and the stream have type (unsigned-byte 8), and
> avoid the character abstraction which you obviously can't trust, anyway.
Which is precisely what I want to avoid unfortunately, as it means that either this code or the code that calls it has to deal with the issue of copying strings to and from arrays of (UNSIGNED-BYTE 8)s, which simply brings back the same problem somewhere else.
(My first implementation did exactly this in fact)
In article <3162302923332...@naggum.no>, Erik Naggum <e...@naggum.no> wrote:
>* Barry Margolin <bar...@bbnplanet.com>
>| But if you want a type that can hold any character, you can create it with:
>|
>| (make-string length :element-type 'character)
> no, and that's the crux of the matter. this used to be different from
>(make-string length :element-type 'string-char)
> in precisely the capacity that you wish is still true, but it isn't.
> when the type string-char was removed, character assumed its role in
> specialized arrays, and you could not store bits and fonts in strings any
> more than you could with string-char. to do that, you need arrays with
> element-type t.
I'm still not following you. Are you saying that characters with implementation-defined attributes (e.g. bits or fonts) might not satisfy (typep c 'character)? I suppose that's possible. The standard allows implementations to provide implementation-defined attributes, but doesn't require them; an implementor could instead provide their own type CHAR-WITH-BITS that's disjoint from CHARACTER rather than a subtype of it. I'm not sure why they would do this, but nothing in the standard prohibits it.
On the other hand, something like READ-CHAR would not be permitted to return a CHAR-WITH-BITS -- it has to return a CHARACTER. So I'm not sure how a program that thinks it's working with characters and strings would encounter such an object unexpectedly.
-- Barry Margolin, bar...@bbnplanet.com
Erik Naggum wrote:
> * Barry Margolin <bar...@bbnplanet.com>
> | But if you want a type that can hold any character, you can create it with:
> |
> | (make-string length :element-type 'character)
> no, and that's the crux of the matter. this used to be different from
> (make-string length :element-type 'string-char)
> in precisely the capacity that you wish is still true, but it isn't.
> when the type string-char was removed, character assumed its role in
> specialized arrays, and you could not store bits and fonts in strings any
> more than you could with string-char. to do that, you need arrays with
> element-type t.

> but I'm glad we've reached the point where you assert a positive, because
> your claim is what I've been trying to tell you guys DOES NOT HOLD. my
> claim is: there is nothing in the standard that _requires_ that there be
> a specialized array with elements that are subtypes of character (i.e., a
> member of the union type "string") that can hold _all_ character objects.
> can you show me where the _standard_ supports your claim?
I'm not Barry, but I think I can. Provided I'm allowed to use the HyperSpec (which I have) rather than the Standard itself (which I don't).
1. MAKE-STRING is defined to return "a string ... of the most specialized type that can accommodate elements of the given type".
2. The default "given type" is CHARACTER.
3. Therefore, MAKE-STRING with the default ELEMENT-TYPE returns a string "that can accommodate elements of the type CHARACTER".
Unfortunately, there's no definition of "accommodate" in the HyperSpec. However, compare the following passages:
From MAKE-STRING:
| The element-type names the type of the elements of the
| string; a string is constructed of the most specialized
| type that can accommodate elements of the given type.
From MAKE-ARRAY:
| Creates and returns an array constructed of the most
| specialized type that can accommodate elements of type
| given by element-type.
It seems to me that the only reasonable definition of "can accommodate elements of type FOO" in this context is "can have arbitrary things of type FOO as elements". If so, then
4. MAKE-STRING with the default ELEMENT-TYPE returns a string capable of having arbitrary things of type CHARACTER as elements.
Now,
5. A "string" is defined as "a specialized vector ... whose elements are of type CHARACTER or a subtype of type CHARACTER".
6. A "specialized" array is defined to be one whose actual array element type is a proper subtype of T.
Hence,
7. MAKE-STRING with the default ELEMENT-TYPE returns a vector whose actual array element type is a proper subtype of T, whose elements are of type CHARACTER or a subtype thereof, and which is capable of holding arbitrary things of type CHARACTER as elements.
And therefore
8. There is such a thing as a specialized array with elements of type CHARACTER or some subtype thereof, which is capable of holding arbitrary things of type CHARACTER as elements.
Which is what you said the standard doesn't say. (From #7 we can also deduce that this thing has actual array element type a proper subtype of T, so it's not equivalent to (array t (*)).)
I can see only one hole in this. It's sort of possible that "can accommodate elements of type FOO" in the definition of MAKE-STRING doesn't mean what I said it does, even though the exact same language in the definition of MAKE-ARRAY does mean that. I don't find this plausible.
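For anyone who wants to see what a particular implementation actually does here, a minimal sketch follows. It only reports the implementation's answers; it proves nothing about what the standard requires. The function names DEFAULT-STRING-ELEMENT-TYPE and CHARACTER-STRING-ACCOMMODATES-P are made up for this illustration; everything they call is standard CL.

```lisp
;; Hypothetical helpers (names invented for this sketch) that ask the
;; implementation what MAKE-STRING and array upgrading actually give you.

(defun default-string-element-type ()
  "Return the actual array element type of a string made by MAKE-STRING
with the default :ELEMENT-TYPE (which is CHARACTER)."
  (array-element-type (make-string 1)))

(defun character-string-accommodates-p (char)
  "True if CHAR is of the actual array element type of a string made
with :ELEMENT-TYPE 'CHARACTER."
  (typep char (array-element-type
               (make-string 1 :element-type 'character))))

;; Print the implementation's answers; the output differs between Lisps.
(format t "default element type: ~S~%" (default-string-element-type))
(format t "upgraded CHARACTER:   ~S~%" (upgraded-array-element-type 'character))
(format t "upgraded BASE-CHAR:   ~S~%" (upgraded-array-element-type 'base-char))
```

If the reading in step 4 is right, CHARACTER should be a subtype of whatever the first call returns, which is easy to check with SUBTYPEP.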
I remark also the following, from 16.1.1 ("Implications of strings being arrays"):
| Since all strings are arrays, all rules which apply
| generally to arrays also apply to strings. See
| Section 15.1 (Array Concepts).
..
| and strings are also subject to the rules of element
| type upgrading that apply to arrays.
I'd have thought that if strings were special in the kind of way you're saying they are, there would be some admission of the fact here. There isn't.
*
Elsewhere in the thread, you said
| an implementation that stores characters in strings
| as if they have null implementation-defined attributes
| regardless of their actual attributes is actually
| fully conforming to the standard.
I have been unable to find anything in the HyperSpec that justifies this. Some places I've looked:
- 15.1.1 "Array elements" (in 15.1 "Array concepts")
I thought perhaps this might say something like "In some cases, storing an object in an array will actually store another object that need not be EQ to the original object". Nope.
- The definitions of CHAR and AREF
Again, looking for any sign that an implementation is allowed to store something non-EQ to what it's given with (setf (aref ...) ...) or (setf (char ...) ...). Again, no. The definition of CHAR just says that it and SCHAR "access the element of STRING specified by INDEX".
- 13.1.3 "Character attributes"
Perhaps this might say "Storing a character in a string may lose its implementation-defined attributes". Nope. It says that the way in which two characters with the same code differ is "implementation-defined", but I don't see any licence anywhere for this to mean they get confused when stored in an array.
- The definition of MAKE-STRING
I've discussed this already.
- The glossary entries for "string", "attribute", "element", and various others.
Also discussed above.
- The whole of chapter 13 (Characters) and 16 (Strings).
No sign here, unless I've missed something.
- The definitions of types CHARACTER, BASE-CHAR, STANDARD-CHAR, EXTENDED-CHAR.
Still no sign.
- The CHARACTER-PROPOSAL (which isn't, in any case, part of the standard).
I thought this might give some sign of the phenomenon you describe. Not that I can see.
Perhaps I'm missing something. It wouldn't be the first time. But I just can't find any sign at all that what you claim is true, and I can see rather a lot of things that suggest it isn't.
The nearest I can find is this, from 16.1.2 ("Subtypes of STRING"):
| However, the consequences are undefined if a character
| is inserted into a string for which the element type of
| the string does not include that character.
But that doesn't give any reason to believe that the result of (MAKE-STRING n :ELEMENT-TYPE 'CHARACTER) doesn't have an element type that includes all characters. And, as I've said above, there's good reason to believe that it does.
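Since 16.1.2 leaves the bad case undefined, code that is nervous about it can check before storing. This is only a sketch: CHECKED-SET-CHAR is a name invented here, and it checks the character's type against the string's actual array element type, nothing more. It cannot detect an implementation that silently drops attributes from a character that does pass the type test.

```lisp
;; Hypothetical helper (not part of CL): refuse to store a character the
;; string's element type excludes, rather than relying on behaviour that
;; 16.1.2 leaves undefined.
(defun checked-set-char (string index char)
  "Store CHAR at INDEX of STRING, signalling a TYPE-ERROR if the actual
array element type of STRING does not include CHAR."
  (let ((element-type (array-element-type string)))
    (unless (typep char element-type)
      (error 'type-error :datum char :expected-type element-type))
    (setf (char string index) char)))

;; Usage: storing an ordinary character into a default string.
(let ((s (make-string 2 :initial-element #\a)))
  (checked-set-char s 0 #\b)
  s)
```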
> | In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
> | the default.
> sure. however, I'm trying to penetrate the armor-plated belief that the
> resulting string is REQUIRED to retain non-null implementation-defined
> attributes if stored into it. no such requirement exists: a conforming
> implementation is completely free to provide a single string type that is
> able to hold only simple characters. you may think this is a mistake in
> the standard, but it's exactly what it says, after the type string-char
> was removed.
Where?
-- Gareth McCaughan Gareth.McCaug...@pobox.com sig under construction
> > sure. however, I'm trying to penetrate the armor-plated belief that the
> > resulting string is REQUIRED to retain non-null implementation-defined
> > attributes if stored into it. no such requirement exists: a conforming
> > implementation is completely free to provide a single string type that is
> > able to hold only simple characters. you may think this is a mistake in
> > the standard, but it's exactly what it says, after the type string-char
> > was removed.
> Where?
The part from "a conforming implementation..." on is directly supported by 13.1.3:
| A character for which each implementation-defined attribute has the
| null value for that attribute is called a simple character. If the
| implementation has no implementation-defined attributes, then all
| characters are simple characters.
/Jon
-- Jon Anthony Synquiry Technologies, Ltd. Belmont, MA 02478, 617.484.3383 "Nightmares - Ha! The way my life's been going lately, Who'd notice?" -- Londo Mollari
* Barry Margolin <bar...@bbnplanet.com>
| I'm still not following you. Are you saying that characters with
| implementation-defined attributes (e.g. bits or fonts) might not satisfy
| (typep c 'character)?
no. I'm saying that even as this _is_ the case, the standard does not require a string to be able to hold and return such a character intact.
* Tim Bradshaw <t...@cley.com>
| Which is precisely what I want to avoid unfortunately, as it means that
| either this code or the code that calls it has to deal with the issue of
| copying strings to and from arrays of (UNSIGNED-BYTE 8)s, which simply
| brings back the same problem somewhere else.
in this case, I'd talk to my vendor or dig deep in the implementation to find a way to transmogrify an (unsigned-byte 8) vector to a character vector by smashing the type codes instead of copying the data. (this is just like change-class for vectors.) barring bivalent streams that can accept either kind of vector (coming soon to an implementation near you), having to deal with annoyingly stupid or particular external requirements means it's OK to be less than nice at the interface level, provided it's done safely.
* Erik Naggum wrote:
> in this case, I'd talk to my vendor or dig deep in the implementation to
> find a way to transmogrify an (unsigned-byte 8) vector to a character
> vector by smashing the type codes instead of copying the data. (this is
> just like change-class for vectors.)
This doesn't work (unless I've misunderstood you) because I can't use it for the string->unsigned-byte-array case, because the strings might have big characters in them. Actually, it probably *would* work in that I could arrange to get a twice-as-big array if the string had 16-bit characters in (or 4x as big if ...), but I think the other end would expect UTF-8 or something in that case (or, more likely, just throw up its hands in horror at the thought that characters are not 8 bits wide; it's a pretty braindead design).
It looks to me like the outcome of all this is that there isn't a portable CL way of ensuring what I need to be true is true, and that I need to ask vendors for per-implementation answers, and meantime punt on the issue until my code is more stable. Which are fine answers from my point of view, in case anyone thinks I'm making the standard `lisp won't let me do x' complaint.
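What can be done portably in the meantime is to check, at the boundary, that a string holds nothing the C side would choke on. The sketch below assumes that the implementation's CHAR-CODE values below 256 correspond to the octets the C code expects; the standard does not promise that, so treat it as an assumption to confirm per implementation. OCTET-STRING-P and STRING-TO-OCTETS are names invented here, and the second one is the copying fallback, not a way around copying.

```lisp
;; Hypothetical helpers (names invented for this sketch).
;; ASSUMPTION: char-codes below 256 map directly onto the octets the C
;; code expects. The standard does not guarantee this.

(defun octet-string-p (string)
  "True if every character of STRING has a CHAR-CODE below 256."
  (every (lambda (c) (< (char-code c) 256)) string))

(defun string-to-octets (string)
  "Copy STRING into a fresh (UNSIGNED-BYTE 8) vector of char-codes,
signalling an error if any character does not fit in 8 bits."
  (unless (octet-string-p string)
    (error "String has a character whose code does not fit in 8 bits: ~S"
           string))
  (map '(vector (unsigned-byte 8)) #'char-code string))

;; Usage: a cheap sanity check before handing a string to foreign code.
(octet-string-p "plain old ASCII")
```

On an implementation where CHAR-CODE-LIMIT is 256 the check is trivially true for every string, which is itself worth knowing.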
> barring bivalent streams that can accept either kind of vector
> (coming soon to an implementation near you), having to deal with
> annoyingly stupid or particular external requirements means it's
> OK to be less than nice at the interface level, provided it's done
> safely.