The particular thing I don't understand is what type a literal string
has. It looks at first sight as if it should be something capable of
holding any CHARACTER, but I'm not really sure if that's right. It
looks to me as if it might be possible to read things such that it's OK
to return something that can only hold a subtype of CHARACTER in some
cases.
I'm actually more concerned with the flip side of this -- if almost all
the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
but sometimes I get some ginormous multibyte unicode thing or
something, because I have to deal with some C code
which is blithely assuming that unsigned chars are just small integers
and strings are arrays of small integers and so on in the usual C way,
and I'm not sure that I can trust my strings to be the same as its
strings.
I realise that people who care about character issues are probably
laughing at me at this point, but my main aim is to keep everything as
simple as I can, and especially I don't want to have to keep copying
my strings into arrays of small integers (which I was doing at one
point, but it's too hairy).
The practical question I guess is -- are there any implementations
which do currently have really big characters in strings? Genera
seems to, but that's of limited interest. CLISP seems to have
internationalisation stuff in it, and I know there's an international
Allegro, so those might have horrors in them.
Thanks for any advice.
--tim `7 bit ASCII was good enough for my father and it's good enough
for me' Bradshaw.
You can call ARRAY-ELEMENT-TYPE on the string to find out if it contains
anything weird. If it's compatible with your foreign function's API,
then you don't need to copy it.
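For instance, a check along these lines (just a sketch -- treating
BASE-CHAR as the `safe' type is an assumption about your particular
foreign API, not something the standard promises):

    ;; Sketch: decide whether a string needs copying before handing it
    ;; to C code that only understands 8-bit characters.
    (defun needs-copying-p (string)
      (not (subtypep (array-element-type string) 'base-char)))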
--
Barry Margolin, bar...@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
strings _always_ contain a subtype of character. e.g., an implementation
that supports bits will have to discard them from strings. the only
array type that can contain all character objects has element-type t.
| I'm actually more concerned with the flip side of this -- if almost all
| the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
| but sometimes I get some ginormous multibyte unicode thing or something,
| because I have to deal with some C code which is
| blithely assuming that unsigned chars are just small integers and strings
| are arrays of small integers and so on in the usual C way, and I'm not
| sure that I can trust my strings to be the same as its strings.
this is not a string issue, it's an FFI issue. if you tell your FFI that
you want to ship a string to a C function, it should do the conversion
for you if it needs to be performed. if you can't trust your FFI to do
the necessary conversions, you need a better FFI.
| I realise that people who care about character issues are probably
| laughing at me at this point, but my main aim is to keep everything as
| simple as I can, and especially I don't want to have to keep copying my
| strings into arrays of small integers (which I was doing at one point,
| but it's too hairy).
if you worry about these things, your life is already _way_ more complex
than it needs to be. a string is a string. each element of the string
is a character. stop worrying beyond this point. C and Common Lisp
agree on this fundamental belief, believe it or not. your _quality_
Common Lisp implementation will ensure that the relevant invariants are
maintained in _each_ environment.
| The practical question I guess is -- are there any implementations which
| do currently have really big characters in strings?
yes, and not only that -- it's vitally important that strings take up no
more space than they need. a system that doesn't support both
base-string (of base-char) and string (of extended-char) when it attempts
to support Unicode will fail in the market -- Europe and the U.S. simply
can't tolerate the huge growth in memory consumption from wantonly using
twice as much as you need. Unicode even comes with a very intelligent
compression technique because people realize that it's a waste of space
to use 16 bits and more for characters in a given character set group.
| I know there's an international Allegro, so those might have horrors in
| them.
sure, but in the same vein, it might also have responsible, intelligent
people behind it, not neurotics who fail to realize that customers have
requirements that _must_ be resolved. Allegro CL's international version
deals very well with conversion between the native system strings and its
internal strings. I know -- not only do I run the International version
in a test environment that needs wide characters _internally_, the test
environment can't handle Unicode or anything else wide at all, and it's
never been a problem.
incidentally, I don't see this as any different from whether you have a
simple-base-string, a simple-string, a base-string, or a string. if you
_have_ to worry, you should be the vendor or implementor of strings, not
the user. if you are the user and worry, you either have a problem that
you need to take up with your friendly programmer-savvy shrink, or you
call your vendor and ask for support. I don't see this as any different
from whether an array has a fill-pointer or not, either. if you hand it
to your friendly FFI and you worry about the length of the array with or
without fill-pointer, you're simply worrying too much, or you have a bug
that needs to be fixed.
"might have horrors"! what's next? monster strings under your bed?
#:Erik
Erik Naggum <er...@naggum.no> writes:
> Tim Bradshaw writes:
> | The particular thing I don't understand is what type a literal string
> | has. It looks at first sight as if it should be something capable of
> | holding any CHARACTER, but I'm not really sure if that's right. It looks
> | to me as if it might be possible to read things such that it's OK to return
> | something that can only hold a subtype of CHARACTER in some cases.
>
> strings _always_ contain a subtype of character. e.g., an implementation
> that supports bits will have to discard them from strings. the only
> array type that can contain all character objects has element-type t.
If only it were so! Unfortunately, the standard says characters with
bits are of type CHARACTER and STRING = (VECTOR CHARACTER). Harlequin
didn't have the guts to stop supporting them (even though there's a
separate internal representation for keystroke events, now). I guess
Franz did?
However, it's rarely necessary to create strings out of them, and it's
easy to configure LispWorks so that never happens. Basically, there's
a variable called *DEFAULT-CHARACTER-ELEMENT-TYPE* that is the default
character type for all string constructors. That includes the
reader's "-syntax that Tim Bradshaw was worrying about. The reader
will actually silently construct wider strings if it sees a character
that is not in *D-C-E-T*, it's just the default. (Note that if you're
reading from a stream, you have to consider the external format on the
stream first.)
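For example, something like this (a sketch from memory -- check the
documentation for the exact package the variable lives in):

    ;; Make string constructors (including the reader) default to thin
    ;; strings.  Sketch only; package prefix omitted, and the form has
    ;; not been verified against a LispWorks image.
    (setf *default-character-element-type* 'base-char)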
> | The practical question I guess is -- are there any implementations which
> | do currently have really big characters in strings?
Allegro and LispWorks, at least. Both will use thin strings where possible
(but in slightly different ways).
--
Pekka P. Pirinen
A feature is a bug with seniority. - David Aldred <david_aldred.demon.co.uk>
I don't think this is right -- rather I agree that they contain
CHARACTERs, but it looks like `bits' -- which I think now are
`implementation-defined attributes' -- can end up in strings, or at
least it is implementation-defined whether they do or not (2.4.5 says
this I think).
> this is not a string issue, it's an FFI issue. if you tell your FFI that
> you want to ship a string to a C function, it should do the conversion
> for you if it needs to be performed. if you can't trust your FFI to do
> the necessary conversions, you need a better FFI.
Unfortunately my FFI is READ-SEQUENCE and WRITE-SEQUENCE, and at the
far end of this is something which is defined in terms of treating
characters as fixed-size (8 bit) small integers. And I can't change
it because it's big important open source software and lots of people
have it, and it's written in C so it's too hard to change anyway... So
I need to be sure that nothing I can do is going to start spitting
unicode or something at it.
At one point I did this by converting my strings to arrays of
(UNSIGNED-BYTE 8)s on I/O, but that was stressful to do for various
reasons.
In *practice* this has not turned out to be a problem but it's not
clear what I need to check to make sure it is not. I suspect that
checking that CHAR-CODE is always suitably small would be a good
start.
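Something like this sketch is what I have in mind (the 8-bit limit is
the assumption the C end imposes, not anything CL promises):

    ;; Optional check: complain if any character in the string has a
    ;; code that won't fit in an unsigned 8-bit byte.
    (defun check-8-bit-string (string)
      (loop for char across string
            unless (< (char-code char) 256)
              do (error "Character ~S won't fit in 8 bits." char))
      string)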
> if you worry about these things, your life is already _way_ more complex
> than it needs to be. a string is a string. each element of the string
> is a character.
Well, the whole problem is that at the far end that's not true. Each
element (they've decided!) is an *8-bit* character...
> yes, and not only that -- it's vitally important that strings take up no
> more space than they need. a system that doesn't support both
> base-string (of base-char) and string (of extended-char) when it attempts
> to support Unicode will fail in the market -- Europe and the U.S. simply
> can't tolerate the huge growth in memory consumption from wantonly using
> twice as much as you need. Unicode even comes with a very intelligent
> compression technique because people realize that it's a waste of space
> to use 16 bits and more for characters in a given character set group.
For what it's worth I think this is wrong (but I could be wrong of
course, and anyway it's not worth arguing over). People *happily*
tolerate doublings of memory & disk consumption if it suits them --
look at windows 3.x to 95, or sunos 5.5 to 5.7, or any successive pair
of xemacs versions ... And they're *right* to do that because Moore's
law works really well. Using compressed representations makes things
more complex -- if strings are arrays, then aref &c need to have hairy
special cases, and everything else gets more complex, and that
complexity never goes away, which doubled-storage costs do in about a
year.
So I think that in a few years compressed representations will look
like the various memory-remapping tricks that DOS did, or the similar
things people now do with 32 bit machines to deal with really big
databases (and push, incredibly, as `the right thing', I guess because
they worship intel and intel are not doing too well with their 64bit
offering). The only place it will matter is network transmission of
data, and I don't see why normal compression techniques shouldn't deal
with that.
So my story is if you want characters twice as big, just have big
characters and use more memory and disk -- it's cheap enough now that
it's dwarfed by labour costs and in a year it will be half the price.
On the other hand, people really like complex fiddly solutions to
things (look at C++!), so that would argue that complex character
compression techniques are here to stay.
Anyway, like I said it's not worth arguing over. Time will tell.
--tim
> For what it's worth I think this is wrong (but I could be wrong of
> course, and anyway it's not worth arguing over).
Incidentally I should make this clearer, as it looks like I'm arguing
against fat strings. Supporting several kinds of strings is
*obviously* sensible, I quibble about the compressing stuff being
worth it.
--tim
* Tim Bradshaw
| I don't think this is right -- rather I agree that they contain
| CHARACTERs, but it looks like `bits' -- which I think now are
| `implementation-defined attributes' -- can end up in strings, or at least
| it is implementation-defined whether they do or not (2.4.5 says this I
| think).
trivially, "strings _always_ contain a subtype of character" must be true
as character is a subtype of character, but I did mean in the sense that
strings _don't_ contain full character objects, despite the relegation of
fonts and bits to "implementation-defined attributes". that the type
string-char was removed from the language but the attributes were sort of
retained is perhaps confusing, but it is quite unambiguous as to intent.
so must "the only array type that can contain all character objects has
element-type t" be true, since a string is allowed to contain a subtype
of type character. (16.1.2 is pertinent in this regard.) it may come as
a surprise to people, but if you store a random character object into a
string, you're not guaranteed that what you get back is eql to what you
put into it.
furthermore, there is no print syntax for implementation-defined
attributes in strings, and no implementation is allowed to add any. it
is perhaps not obvious, but the retention of attributes is restricted by
_both_ the string type's capabilities and the stream type's capabilities.
you can quibble with the standard all you like -- you aren't going to see
any implementation-defined attributes in string literals. if you compare
with CLtL1 and its explicit support for string-char which didn't support
them at all, you must realize that in order to _have_ any support for
implementation-defined attributes, you have to _add_ it above and beyond
what strings did in CLtL1. this is an extremely unlikely addition to an
implementation just after bits and fonts were removed from the language
and relegated to "implementation-defined attributes".
I think the rest of your paranoid conspiratorial delusions about what
"horrors" might afflict Common Lisp implementations are equally lacking
in merit. like, nothing is going to start spitting Unicode at you, Tim.
not until and unless you ask for it. it's called "responsible vendors".
| The only place it will matter is network transmission of data, and I
| don't see why normal compression techniques shouldn't deal with that.
then read the technical report and decrease your ignorance. sheesh.
#:Erik, who's actually quite disappointed, now.
compressing strings for in-memory representation of _arrays_ is nuts.
nobody has proposed it, and nobody ever will. again, read the Unicode
technical report and decrease both your fear and your ignorance.
#:Erik
Isn't (array character (*)) able to contain all character objects?
> I think the rest of your paranoid conspiratorial delusions about what
> "horrors" might afflict Common Lisp implementations are equally lacking
> in merit. like, nothing is going to start spitting Unicode at you, Tim.
> not until and unless you ask for it. it's called "responsible
> vendors".
If my code gets a string (from wherever, the user if you like) which
has bigger-than-8-bit characters in it, then tries to send it down the
wire, then what will happen? I don't see this as a vendor issue, but
perhaps I'm wrong.
Meantime I'm going to put in some optional checks to make sure that
all my character codes are small enough.
--tim
no. specialized vectors whose elements are of type character (strings)
are allowed to store only values of a subtype of type character. this is
so consistently repeated in the standard and so unique to strings that
I'm frankly amazed that anyone who has worked on the standard is having
such a hard time accepting it. it was obviously intended to let strings
be as efficient as the old string-char concept allowed, while not denying
implementations the ability to retain bits and fonts if they so chose.
an implementation that stores characters in strings as if they have null
implementation-defined attributes regardless of their actual attributes
is actually fully conforming to the standard. the result is that you
can't expect any attributes to survive string storage. the consequences
are _undefined_ if you attempt to store a character with attributes in a
string that can't handle it.
the removal of the type string-char is the key to understanding this.
#:Erik
Who replaced #:Erik with a bad imitation? This one's got all the
belligerence, but not the insight we've come to expect.
You've read a different standard than I, since many places actually
say "of type CHARACTER or a subtype" -- superfluously, since the
glossary entry for "subtype" says "Every type is a subtype of itself."
When I was designing the "fat character" support for LispWorks, I
looked for a get-out clause, and it's not there.
> the consequences are _undefined_ if you attempt to store a
> character with attributes in a string that can't handle it.
This is true. It's also true of all the other specialized arrays,
although different language ("must be") is used to specify that.
> the removal of the type string-char is the key to understanding this.
I suspect it was removed because it was realized that there would have
to be many types of STRING (at least 8-byte and 16-byte), and hence
there wasn't a single subtype of CHARACTER that would be associated
with strings. Whatever the reason, we can only go by what the
standard says.
I think it was a good choice, and LW specifically didn't retain the
type, to force programmers to consider what the code actually meant by
it (and to allow them to DEFTYPE it to the right thing).
Nevertheless, there should be a standard name for the type of simple
characters, i.e., with null implementation-defined attributes.
LispWorks and Liquid use LW:SIMPLE-CHAR for this.
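Portable code can wrap it up along these lines (a sketch; the
non-LispWorks branch simply assumes the implementation has no
implementation-defined attributes to lose):

    ;; Sketch of a portable stand-in for the missing standard type name.
    (deftype simple-char ()
      #+lispworks 'lw:simple-char
      ;; Assumption: other implementations have no implementation-defined
      ;; attributes, so plain CHARACTER is close enough.
      #-lispworks 'character)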
--
Pekka P. Pirinen, Harlequin Limited
The Risks of Electronic Communication
http://www.best.com/~thvv/emailbad.html
geez...
| You've read a different standard than I, since many places actually say
| "of type CHARACTER or a subtype" -- superfluously, since the glossary
| entry for "subtype" says "Every type is a subtype of itself."
sigh. this is so incredibly silly it isn't worth responding to.
| I suspect it was removed because it was realized that there would have to
| be many types of STRING (at least 8-byte and 16-byte), and hence there
| wasn't a single subtype of CHARACTER that would be associated with
| strings. Whatever the reason, we can only go by what the standard says.
the STRING type is a union type, and there are no other union types in
Common Lisp. this should give you a pretty powerful hint, if you can get
away from your "bad imitation" attitude problem and actually listen, but
I guess that is not very likely at this time.
#:Erik
You still need a better FFI: WRITE-SEQUENCE is just as much a foreign
interface as any. In theory, you specify the representation on the
other side by the external format of the stream. If the system
doesn't have an external format that can do this, then you're reduced
to hacking it.
> In *practice* this has not turned out to be a problem but it's not
> clear what I need to check to make sure it is not. I suspect that
> checking that CHAR-CODE is always suitably small would be a good
> start.
In practice, most of us can pretend there's no encoding except ASCII.
If you expect non-ASCII characters on the Lisp side, you need to know
what the encoding is on the other side, otherwise it might come out
wrong.
It might be enough to check the type of your strings (and perhaps the
external format of the stream), instead of every character.
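E.g., a heuristic along these lines (a sketch; exactly what BASE-STRING
and a given external format guarantee is implementation-dependent, so
the list of `safe' formats is an assumption):

    ;; Heuristic sketch: pass the string straight through only if it is
    ;; a thin string and the stream's external format is one we believe
    ;; to be a one-byte-per-character encoding.
    (defun probably-safe-to-write-p (string stream)
      (and (typep string 'base-string)
           (member (stream-external-format stream)
                   '(:default :ascii :latin-1))))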
--
Pekka P. Pirinen, Harlequin Limited
Technology isn't just putting in the fastest processor and most RAM -- that's
packaging. - Steve Wozniak
You seem to be answering a different question than I asked. I didn't say
"Aren't all strings of type (array character (*))?".
I realize that there are string types that are not (array character (*)),
because a string can be of any array type where the element type is a
subtype of character. But if you want a type that can hold any character,
you can create it with:
(make-string length :element-type 'character)
In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
the default.
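For example (what the element type prints as is implementation-dependent):

    (array-element-type (make-string 5))
    ;; => CHARACTER       (typically; could be a type-equivalent specifier)
    (array-element-type (make-string 5 :element-type 'base-char))
    ;; => BASE-CHAR       (in an implementation with a thinner string type)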
> You still need a better FFI: WRITE-SEQUENCE is just as much a foreign
> interface as any.
Yes, in fact it's worse than most, because I can't rely on the
vendor/implementor to address the issues for me!
> In theory, you specify the representation on the other side by the
> external format of the stream. If the system doesn't have an
> external format that can do this, then you're reduced to hacking it.
Right. And I'm happy to do this -- what I was asking was how I can
ensure that this is the case.
> In practice, most of us can pretend there's no encoding except ASCII.
> If you expect non-ASCII characters on the Lisp side, you need to know
> what the encoding is on the other side, otherwise it might come out
> wrong.
Yes. And the problem is that since my stuff is a low-level facility
which others (I hope) will build on, I don't really know what they
will throw at me. And I don't want to check every character of the
strings as this causes severe reduction in maximum performance (though
I haven't spent a lot of time checking that the checker compiles
really well yet, and in practice it will almost always be throttled
elsewhere).
> It might be enough to check the type of your strings (and perhaps the
> external format of the stream), instead of every character.
My hope is that BASE-STRING is good enough, but I'm not sure (I don't
see that a BASE-STRING could not have more than 8-bit characters, if
an implementation chose to have only one string type for instance (can
it?)). Checking the external format of the stream is also obviously
needed but if it's :DEFAULT does that tell me anything, and if it's
not I have to special case anyway.
Obviously at some level I have to just have implementation-dependent
checks because I don't think it says anywhere that characters are at
most n bits or any of that kind of grut (which is fine). Or I could just not
care and pretend everything is 8-bit which will work for a while I
guess.
Is there a useful, fast check that (write-sequence x y) will
write (length x) bytes on y if all is well for LispWorks / Liquid (I
don't have a license for these, unfortunately)?
Thanks
--tim
yes. make the buffer and the stream have type (unsigned-byte 8), and
avoid the character abstraction which you obviously can't trust, anyway.
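i.e., roughly this sketch (the file is only a stand-in for whatever
byte stream you actually talk to, and the explicit char-code conversion
assumes all the codes fit in eight bits):

    ;; open the connection as bytes and convert explicitly at the edge.
    (with-open-file (out "/tmp/wire"          ; hypothetical destination
                         :direction :output
                         :element-type '(unsigned-byte 8)
                         :if-exists :supersede)
      (let ((buffer (map '(vector (unsigned-byte 8)) #'char-code "hello")))
        (write-sequence buffer out)))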
#:Erik
no, and that's the crux of the matter. this used to be different from
(make-string length :element-type 'string-char)
in precisely the capacity that you wish is still true, but it isn't.
when the type string-char was removed, character assumed its role in
specialized arrays, and you could not store bits and fonts in strings any
more than you could with string-char. to do that, you need arrays with
element-type t.
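e.g., trivially (nothing implementation-specific assumed here):

    ;; a general vector stores the character object itself, attributes
    ;; and all, because nothing gets upgraded away.
    (let ((v (make-array 10 :element-type t)))
      (setf (aref v 0) #\A)
      (aref v 0))                        ; => #\A, eql to what was stored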
but I'm glad we've reached the point where you assert a positive, because
your claim is what I've been trying to tell you guys DOES NOT HOLD. my
claim is: there is nothing in the standard that _requires_ that there be
a specialized array with elements that are subtypes of character (i.e., a
member of the union type "string") that can hold _all_ character objects.
can you show me where the _standard_ supports your claim?
| In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
| the default.
sure. however, I'm trying to penetrate the armor-plated belief that the
resulting string is REQUIRED to retain non-null implementation-defined
attributes if stored into it. no such requirement exists: a conforming
implementation is completely free to provide a single string type that is
able to hold only simple characters. you may think this is a mistake in
the standard, but it's exactly what it says, after the type string-char
was removed.
methinks you're stuck in CLtL1 days, Barry, and so is this bad imitation
jerk from Harlequin, but that's much less surprising.
#:Erik
Which is precisely what I want to avoid unfortunately, as it means
that either this code or the code that calls it has to deal with the
issue of copying strings to and from arrays of (UNSIGNED-BYTE 8)s,
which simply brings back the same problem somewhere else.
(My first implementation did exactly this in fact)
--tim
I'm still not following you. Are you saying that characters with
implementation-defined attributes (e.g. bits or fonts) might not satisfy
(typep c 'character)? I suppose that's possible. The standard allows
implementations to provide implementation-defined attributes, but doesn't
require them; an implementor could instead provide their own type
CHAR-WITH-BITS that's disjoint from CHARACTER rather than a subtype of it.
I'm not sure why they would do this, but nothing in the standard prohibits
it.
On the other hand, something like READ-CHAR would not be permitted to
return a CHAR-WITH-BITS -- it has to return a CHARACTER. So I'm not sure
how a program that thinks it's working with characters and strings would
encounter such an object unexpectedly.
> * Barry Margolin <bar...@bbnplanet.com>
> | But if you want a type that can hold any character, you can create it with:
> |
> | (make-string length :element-type 'character)
>
> no, and that's the crux of the matter. this used to be different from
>
> (make-string length :element-type 'string-char)
>
> in precisely the capacity that you wish is still true, but it isn't.
> when the type string-char was removed, character assumed its role in
> specialized arrays, and you could not store bits and fonts in strings any
> more than you could with string-char. to do that, you need arrays with
> element-type t.
>
> but I'm glad we've reached the point where you assert a positive, because
> your claim is what I've been trying to tell you guys DOES NOT HOLD. my
> claim is: there is nothing in the standard that _requires_ that there be
> a specialized array with elements that are subtypes of character (i.e., a
> member of the union type "string") that can hold _all_ character objects.
>
> can you show me where the _standard_ supports your claim?
I'm not Barry, but I think I can. Provided I'm allowed to
use the HyperSpec (which I have) rather than the Standard
itself (which I don't).
1. MAKE-STRING is defined to return "a string ... of the most
specialized type that can accommodate elements of the given
type".
2. The default "given type" is CHARACTER.
3. Therefore, MAKE-STRING with the default ELEMENT-TYPE
returns a string "that can accommodate elements of the
type CHARACTER".
Unfortunately, there's no definition of "accommodate" in the
HyperSpec. However, compare the following passages:
From MAKE-STRING:
| The element-type names the type of the elements of the
| string; a string is constructed of the most specialized
| type that can accommodate elements of the given type.
From MAKE-ARRAY:
| Creates and returns an array constructed of the most
| specialized type that can accommodate elements of type
| given by element-type.
It seems to me that the only reasonable definition of "can
accommodate elements of type FOO" in this context is "can
have arbitrary things of type FOO as elements". If so, then
4. MAKE-STRING with the default ELEMENT-TYPE returns a string
capable of having arbitrary things of type CHARACTER as
elements.
Now,
5. A "string" is defined as "a specialized vector ... whose
elements are of type CHARACTER or a subtype of type CHARACTER".
6. A "specialized" array is defined to be one whose actual array
element type is a proper subtype of T.
Hence,
7. MAKE-STRING with the default ELEMENT-TYPE returns a vector
whose actual array element type is a proper subtype of T,
whose elements are of type CHARACTER or a subtype thereof,
and which is capable of holding arbitrary things of type
CHARACTER as elements.
And therefore
8. There is such a thing as a specialized array with elements
of type CHARACTER or some subtype thereof, which is capable
of holding arbitrary things of type CHARACTER as elements.
Which is what you said the standard doesn't say. (From #7
we can also deduce that this thing has actual array element
type a proper subtype of T, so it's not equivalent to
(array t (*)).)
I can see only one hole in this. It's sort of possible that
"can accommodate elements of type FOO" in the definition of
MAKE-STRING doesn't mean what I said it does, even though
the exact same language in the definition of MAKE-ARRAY does
mean that. I don't find this plausible.
I remark also the following, from 16.1.1 ("Implications
of strings being arrays"):
| Since all strings are arrays, all rules which apply
| generally to arrays also apply to strings. See
| Section 15.1 (Array Concepts).
..
| and strings are also subject to the rules of element
| type upgrading that apply to arrays.
I'd have thought that if strings were special in the kind
of way you're saying they are, there would be some admission
of the fact here. There isn't.
*
Elsewhere in the thread, you said
| an implementation that stores characters in strings
| as if they have null implementation-defined attributes
| regardless of their actual attributes is actually
| fully conforming to the standard.
I have been unable to find anything in the HyperSpec that
justifies this. Some places I've looked:
- 15.1.1 "Array elements" (in 15.1 "Array concepts")
I thought perhaps this might say something like
"In some cases, storing an object in an array will
actually store another object that need not be EQ
to the original object". Nope.
- The definitions of CHAR and AREF
Again, looking for any sign that an implementation
is allowed to store something non-EQ to what it's
given with (setf (aref ...) ...) or (setf (char ...) ...).
Again, no. The definition of CHAR just says that it
and SCHAR "access the element of STRING specified by INDEX".
- 13.1.3 "Character attributes"
Perhaps this might say "Storing a character in a string
may lose its implementation-defined attributes". Nope.
It says that the way in which two characters with the
same code differ is "implementation-defined", but I don't
see any licence anywhere for this to mean they get confused
when stored in an array.
- The definition of MAKE-STRING
I've discussed this already.
- The glossary entries for "string", "attribute", "element",
and various others.
Also discussed above.
- The whole of chapter 13 (Characters) and 16 (Strings).
No sign here, unless I've missed something.
- The definitions of types CHARACTER, BASE-CHAR, STANDARD-CHAR,
EXTENDED-CHAR.
Still no sign.
- The CHARACTER-PROPOSAL (which isn't, in any case, part of
the standard).
I thought this might give some sign of the phenomenon
you describe. Not that I can see.
Perhaps I'm missing something. It wouldn't be the first time.
But I just can't find any sign at all that what you claim is
true, and I can see rather a lot of things that suggest it isn't.
The nearest I can find is this, from 16.1.2 ("Subtypes of STRING"):
| However, the consequences are undefined if a character
| is inserted into a string for which the element type of
| the string does not include that character.
But that doesn't give any reason to believe that the result
of (MAKE-STRING n :ELEMENT-TYPE 'CHARACTER) doesn't have an
element type that includes all characters. And, as I've said
above, there's good reason to believe that it does.
> | In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
> | the default.
>
> sure. however, I'm trying to penetrate the armor-plated belief that the
> resulting string is REQUIRED to retain non-null implementation-defined
> attributes if stored into it. no such requirement exists: a conforming
> implementation is completely free to provide a single string type that is
> able to hold only simple characters. you may think this is a mistake in
> the standard, but it's exactly what it says, after the type string-char
> was removed.
Where?
--
Gareth McCaughan Gareth.M...@pobox.com
sig under construction
The part from "a conforming implementation..." on is directly supported
by 13.1.3:
| A character for which each implementation-defined attribute has the
| null value for that attribute is called a simple character. If the
| implementation has no implementation-defined attributes, then all
| characters are simple characters.
/Jon
--
Jon Anthony
Synquiry Technologies, Ltd. Belmont, MA 02478, 617.484.3383
"Nightmares - Ha! The way my life's been going lately,
Who'd notice?" -- Londo Mollari
no. I'm saying that even as this _is_ the case, the standard does not
require a string to be able to hold and return such a character intact.
#:Erik
in this case, I'd talk to my vendor or dig deep in the implementation to
find a way to transmogrify an (unsigned-byte 8) vector to a character
vector by smashing the type codes instead of copying the data. (this is
just like change-class for vectors.) barring bivalent streams that can
accept either kind of vector (coming soon to an implementation near you),
having to deal with annoyingly stupid or particular external requirements
means it's OK to be less than nice at the interface level, provided it's
done safely.
#:Erik
> in this case, I'd talk to my vendor or dig deep in the implementation to
> find a way to transmogrify an (unsigned-byte 8) vector to a character
> vector by smashing the type codes instead of copying the data. (this is
> just like change-class for vectors.)
This doesn't work (unless I've misunderstood you) because I can't use
it for the string->unsigned-byte-array case, because the strings might
have big characters in them. Actually, it probably *would* work in
that I could arrange to get a twice-as-big array if the string had
16-bit characters in (or 4x as big if ...), but I think the other end
would expect UTF-8 or something in that case (or, more likely, just
throw up its hands in horror at the thought that characters are not 8
bits wide, it's a pretty braindead design).
It looks to me like the outcome of all this is that there isn't a
portable CL way of ensuring what I need to be true is true, and that I
need to ask vendors for per-implementation answers, and meantime punt
on the issue until my code is more stable. Which are fine answers
from my point of view, in case anyone thinks I'm making the standard
`lisp won't let me do x' complaint.
> barring bivalent streams that can accept either kind of vector
> (coming soon to an implementation near you), having to deal with
> annoyingly stupid or particular external requirements means it's
> OK to be less than nice at the interface level, provided it's done
> safely.
Yes, I agree with this.
--tim
sigh. so read (unsigned-byte 8), smash the type code so it's a string of
non-big characters, and do _whatever_ you need to do with the string,
then smash the type code and write (unsigned-byte 8) to whatever.
| It looks to me like the outcome of all this is that there isn't a
| portable CL way of ensuring what I need to be true is true, and that I
| need to ask vendors for per-implementation answers, and meantime punt on
| the issue until my code is more stable. Which are fine answers from my
| point of view, in case anyone thinks I'm making the standard `lisp won't
| let me do x' complaint.
portable languages are for portable problems. conversely, non-portable
problems may require non-portable solutions. I don't have a problem with
that, but many seem to have.
#:Erik
> Gareth McCaughan wrote:
>>
>> Erik Naggum wrote:
>>
>>> sure. however, I'm trying to penetrate the armor-plated belief that the
>>> resulting string is REQUIRED to retain non-null implementation-defined
>>> attributes if stored into it. no such requirement exists: a conforming
>>> implementation is completely free to provide a single string type that is
>>> able to hold only simple characters. you may think this is a mistake in
>>> the standard, but it's exactly what it says, after the type string-char
>>> was removed.
>>
>> Where?
>
> The part from "a conforming implementation..." on is directly supported
> by
> 13.1.3:
>
> | A character for which each implementation-defined attribute has the
> | null value for that attribute is called a simple character. If the
> | implementation has no implementation-defined attributes, then all
> | characters are simple characters.
Well, yes, but it's not actually relevant to the point
Erik's making.
The paragraph you quote implies that *if* an implementation
has no implementation-defined attributes, *then* that
implementation is free to make all its strings hold
only simple characters. In other words, if all characters
are simple then you can have strings that can only contain
simple characters. Surprise, surprise. :-)
What Erik's saying is that there needn't be any string
type that can hold arbitrary character objects. This claim
isn't supported by the paragraph you quoted, so far as
I can see.
note that this all hinges on the definition of STRING, not CHARACTER.
we all agree that character objects may have implementation-defined
attributes. the crux of the matter is whether strings are _required_ to
support these implementation-defined attributes for characters stored in
them, or is _permitted_ only to hold simple characters, i.e., characters
that have null or no implementation-defined attributes. sadly, nothing
you bring up affects this crucial argument.
there are two compelling reasons why implementation-defined attributes
are _not_ required to be retained in strings: (1) there is special
mention of which implementation-defined attributes are discarded when
reading a string literal from an input stream (which apparently may
support reading them, but nothing is indicated as to how this happens),
and (2) historically, strings did not retain bits and fonts, so if they
were to be supported by an implementation that conformed to CLtL1, they
would have to be _added_ to strings, while bits and fonts were explicitly
_removed_ from the language.
| 1. MAKE-STRING is defined to return "a string ... of the most
| specialized type that can accommodate elements of the given
| type".
|
| 2. The default "given type" is CHARACTER.
|
| 3. Therefore, MAKE-STRING with the default ELEMENT-TYPE
| returns a string "that can accommodate elements of the
| type CHARACTER".
the question boils down to whether the character concept as defined in
isolation is the same as the character concept as defined as part of a
string. if they are, your logic is impeccable. if they aren't the same,
your argument is entirely moot. I'm arguing that the crucial clue to
understand that there is a difference is indicated by the unique "union
type" of strings and the phrase "or a subtype of character" which is not
used of any other specialized array in the same way it is for strings --
no other types permit _only_ a subtype.
I'm arguing that an implementation is permitted to have a weaker
character concept in strings than in isolation, i.e., that strings may
_only_ hold a subtype of character, that implementation-defined
attributes are defined only to exist (i.e., be non-null) in isolated
character objects, and not in characters as stored in strings.
| Now,
|
| 5. A "string" is defined as "a specialized vector ... whose
| elements are of type CHARACTER or a subtype of type CHARACTER".
_please_ note that no other specialized vector type is permitted the
leeway that "or a subtype of" implies here. for some bizarre reason, the
bad imitation jerk from Harlequin thought that he could delete "of type
CHARAACTER" since every type is a subtype of itself. however, the key is
that this wording effectively allows a proper subtype of character to be
represented in strings. a similar wording does not exist _elsewhere_ in
the standard, signifying increased importance by this differentiation.
| 8. There is such a thing as a specialized array with elements
| of type CHARACTER or some subtype thereof, which is capable
| of holding arbitrary things of type CHARACTER as elements.
this is a contradiction in terms, so I'm glad you conclude this, as it
shows that carrying "or a subtype thereof" with you means precisely that
the standard does not require a _single_ string type to be able to hold
_all_ character values. that is why string is a union type, unlike all
other types in the language.
| I'd have thought that if strings were special in the kind of way you're
| saying they are, there would be some admission of the fact here. There
| isn't.
there is particular mention of "or a subtype of character" all over the
place when strings are mentioned. that's the fact you're looking for.
however, if you are willing to use contradictions in terms as evidence of
something and you're willing to ignore facts on purpose, there is not
much that logic and argumentation alone can do to correct the situation.
| I have been unable to find anything in the HyperSpec that justifies this.
again, look for "or a subtype of character" in the definition of STRING.
#:Erik
> sigh. so read (unsigned-byte 8), smash the type code so it's a string of
> non-big characters, and do _whatever_ you need to do with the string,
> then smash the type code and write (unsigned-byte 8) to whatever.
Unless I've misunderstood you this still won't work. The strings I
need to write will come from lisp code, not down the wire -- I have no
control at all over what the user code puts in them. If they have big
characters in then I assume that smashing the type code will, if it
works at all, double / quadruple the length and result in me writing
stuff to the other end that it won't be able to cope with. What I'm
after, if it is possible, is a way of asking lisp if this string
contains (or might contain) big characters, without iterating over the
string checking.
--tim
> we all agree that character objects may have implementation-defined
> attributes. the crux of the matter is whether strings are _required_ to
> support these implementation-defined attributes for characters stored in
> them, or is _permitted_ only to hold simple characters, i.e., characters
> that have null or no implementation-defined attributes. sadly, nothing
> you bring up affects this crucial argument.
What about the complete absence of any statement anywhere
in the standard (so far as I can tell) that it's legal for
storing characters in a string to throw away their attributes?
> there are two compelling reasons why implementation-defined attributes
> are _not_ required to be retained in strings: (1) there is special
> mention of which implementation-defined attributes are discarded when
> reading a string literal from an input stream (which apparently may
> support reading them, but nothing is indicated as to how this happens),
> and (2) historically, strings did not retain bits and fonts, so if they
> were to be supported by an implementation that conformed to CLtL1, they
> would have to be _added_ to strings, while bits and fonts were explicitly
> _removed_ from the language.
I don't see why #1 is relevant. #2 is interesting, but the
language is defined by what the standard says, not by what
it used to say.
> | 1. MAKE-STRING is defined to return "a string ... of the most
> | specialized type that can accommodate elements of the given
> | type".
> |
> | 2. The default "given type" is CHARACTER.
> |
> | 3. Therefore, MAKE-STRING with the default ELEMENT-TYPE
> | returns a string "that can accommodate elements of the
> | type CHARACTER".
>
> the question boils down to whether the character concept as defined in
> isolation is the same as the character concept as defined as part of a
> string. if they are, your logic is impeccable. if they aren't the same,
> your argument is entirely moot. I'm arguing that the crucial clue to
> understand that there is a difference is indicated by the unique "union
> type" of strings and the phrase "or a subtype of character" which is not
> used of any other specialized array in the same way it is for strings --
> no other types permit _only_ a subtype.
The point here is simply that there can be several different
kinds of string. The standard says that there may be string
types that only permit a subtype of CHARACTER; it doesn't say
that there need be no string type that permits CHARACTER itself.
> I'm arguing that an implementation is permitted to have a weaker
> character concept in strings than in isolation, i.e., that strings may
> _only_ hold a subtype of character, that implementation-defined
> attributes are defined only to exist (i.e., be non-null) in isolated
> character objects, and not in characters as stored in strings.
I understand that you're arguing this. I still don't see anything
in the HyperSpec that supports it.
> | Now,
> |
> | 5. A "string" is defined as "a specialized vector ... whose
> | elements are of type CHARACTER or a subtype of type CHARACTER".
>
> _please_ note that no other specialized vector type is permitted the
> leeway that "or a subtype of" implies here.
I understand this. But I don't think STRING is a specialized vector
type, even though it's a type all of whose instances are specialized
vectors :-). The type STRING is a union of types; each of those
types is a specialized vector type. *This* is the reason for the
stuff about "or a subtype of": some strings might not be able to
hold arbitrary characters. But I still maintain that the argument
I've given demonstrates that some strings *are* able to hold
arbitrary characters.
> for some bizarre reason, the
> bad imitation jerk from Harlequin thought that he could delete "of type
> CHARAACTER" since every type is a subtype of itself. however, the key is
> that this wording effectively allows a proper subtype of character to be
> represented in strings. a similar wording does not exist _elsewhere_ in
> the standard, signifying increased importance by this differentiation.
I agree that the wording allows a proper subtype of CHARACTER
to be represented in strings. What it doesn't allow is for
all strings to allow only proper subtypes.
Here's a good way to focus what we disagree about. In the
definition of the class STRING, we find
| A string is a specialized vector whose elements are
| of type CHARACTER or a subtype of type CHARACTER.
| When used as a type specifier for object creation,
| STRING means (VECTOR CHARACTER).
You're arguing, roughly, that "a subtype of type CHARACTER"
might mean "the implementors can choose a subtype of CHARACTER
which is all that strings can contain". I think it means
"there may be strings that only permit a subtype of CHARACTER",
and that there's other evidence that shows that there must
also be strings that permit arbitrary CHARACTERs.
I've just noticed a neater proof than the one I gave before,
which isn't open to your objection that "character" might
sometimes not mean the same as CHARACTER.
1. [System class STRING] "When used as a type specifier
for object creation, STRING means (VECTOR CHARACTER)."
and: "This denotes the union of all types (ARRAY c (SIZE))
for all subtypes c of CHARACTER; that is, the set of
strings of size SIZE."
2. [System class VECTOR] "for all types x, (VECTOR x) is
the same as (ARRAY x (*))."
3. [Function MAKE-ARRAY] "The NEW-ARRAY can actually store
any objects of the type which results from upgrading
ELEMENT-TYPE".
4. [Function UPGRADED-ELEMENT-TYPE] "The TYPESPEC is
a subtype of (and possibly type equivalent to) the
UPGRADED-TYPESPEC." and: "If TYPESPEC is CHARACTER,
the result is type equivalent to CHARACTER."
5. [Glossary] "type equivalent adj. (of two types X and Y)
having the same elements; that is, X is a subtype of Y
and Y is a subtype of X."
6. [Function MAKE-ARRAY] "Creates and returns an array
constructed of the most specialized type that can
accommodate elements of type given by ELEMENT-TYPE."
7. [Function MAKE-STRING] "a string is constructed of the
most specialized type that can accommodate elements
of the given type."
So, consider the result S of
(make-array 10 :element-type 'character) .
We know (3) that it can store any objects of the type which
results from upgrading CHARACTER. We know (4) that that type
is type-equivalent to CHARACTER. We therefore know that S
can store any object of type CHARACTER.
We also know that (6) S is an array constructed of the most
specialized type that can accommodate elements of type
CHARACTER. So (1,2,7) is the result S' of
(make-string 10 :element-type 'character) .
Therefore S and S' are arrays of the same type.
We've just shown that S
- can hold arbitrary characters
- is of the same type as S'
and of course S' is a string. Therefore S is also a string.
Therefore there is at least one string (namely S) that
can hold arbitrary characters.
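The same chain can be checked at a listener (assuming a conforming
implementation; the exact printed type specifiers may vary):

    (upgraded-array-element-type 'character)
    ;; => CHARACTER (or a type-equivalent specifier)

    (let ((s (make-array 3 :element-type 'character)))
      (setf (char s 0) #\a)
      (values (stringp s) (char s 0)))
    ;; => T, #\a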
> | 8. There is such a thing as a specialized array with elements
> | of type CHARACTER or some subtype thereof, which is capable
> | of holding arbitrary things of type CHARACTER as elements.
>
> this is a contradiction in terms, so I'm glad you conclude this, as it
> shows that carrying "or a subtype thereof" with you means precisely that
> the standard does not require a _single_ string type to be able to hold
> _all_ character values. that is why string is a union type, unlike all
> other types in the language.
It doesn't require *every* string type to be able to hold all
character values. It does, however, require *some* string type
to be able to hold all character values.
The reason why STRING is a union type is that implementors
might want to have (say) an "efficient" string type that uses
only one byte per character, for storing "easy" strings. Having
this as well as a type that can store arbitrary characters,
and having them both be subtypes of STRING, requires that
STRING be a union type.
Oh, and what I wrote isn't a contradiction in terms. It would be
if "or some subtype thereof" meant "or some particular subtype
thereof that might be imposed once-for-all by the implementation",
but I don't think it means that.
> | I'd have thought that if strings were special in the kind of way you're
> | saying they are, there would be some admission of the fact here. There
> | isn't.
>
> there is particular mention of "or a subtype of character" all over the
> place when strings are mentioned. that's the fact you're looking for.
I don't believe the fact means what you think it does.
> however, if you are willing to use contradictions in terms as evidence of
> something and you're willing to ignore facts on purpose, there is not
> much that logic and argumentation alone can do to correct the situation.
I've explained why I don't think what I wrote is a contradiction
in terms.
I assure you that I am not ignoring anything on purpose.
what of it? in case you don't realize the full ramification of the
equally complete absence of any mechanism to use, query, or set these
implementation-defined attributes of characters, the express intent of
the removal of bits and fonts was to remove character attributes from
the language. they are no longer there as part of the official standard,
and any implementation has to document what it does to them as part of
the set of implementation-defined features. OBVIOUSLY, the _standard_ is
not the right document to prescribe the consequences of such features!
an implementation, consequently, may or may not want to store attributes
in strings, and it is free to do or not to do so, and the standard cannot
prescribe this behavior.
conversely, if implementation-defined attributes were to be retained,
shouldn't there be an explicit statement that they were to be retained,
which would require an implementation to abide by certain rules in the
implementation-defined areas? that sounds _much_ more plausible to me
than saying "implementation-defined" and then defining it in the standard.
when talking about what an implementation is allowed to do on its own
accord, omitting specifics means it's free to do whatever it pleases. in
any requirement that is covered by conformance clauses, an omission is
treated very differently: it means you can't do it. we are not talking
about _standard_ attributes of characters (that's the code, and that's
the only attribute _required_ to be in _standard_ strings), but about
implementation-defined attributes.
| I don't see why #1 is relevant. #2 is interesting, but the language is
| defined by what the standard says, not by what it used to say.
it says "implementation-defined attributes" and it says "subtype of
character", which is all I need to go by. you seem to want the standard
to prescribe implementation-defined behavior. this is an obvious no-go.
it is quite the disingenuous twist to attempt to rephrase what I said as
"what the standard used to say", but I'm getting used to a lot of weird
stuff from your side already, so I'll just point out to you that I'm
referring to how it came to be what it is, not what it used to say. if
you can't see the difference, I can't help you understand, but if you do
see the difference, you will understand that no standard or other
document written by and intended for human beings can ever be perfect in
the way you seem to expect. expecting standards to be free of errors or
of the need of interpretation by humans is just mind-bogglingly stupid,
so I'm blithely assuming that you don't hold that view, but instead don't
see that you are nonetheless flirting with it.
| The point here is simply that there can be several different kinds of
| string. The standard says that there may be string types that only
| permit a subtype of CHARACTER; it doesn't say that there need be no
| string type that permits CHARACTER itself.
sigh. the point I'm trying to make is that it doesn't _require_ there to
be one particular string type which can hold characters with all the
implementation-defined attributes.
| (make-array 10 :element-type 'character) [S]
| (make-string 10 :element-type 'character) [S']
|
| Therefore S and S' are arrays of the same type.
sorry, this is a mere tautology that brings nothing to the argument.
| Therefore there is at least one string (namely S) that can hold arbitrary
| characters.
but you are not showing that it can hold arbitrary characters. _nothing_
in what you dig up actually argues that implementation-defined attributes
have standardized semantics. an implementation is, by virtue of its very
own definition of the semantics, able to define a character in isolation
as having some implementation-defined attributes and strings to contain
characters without such implementation-defined attributes. this is the
result of the removal of the type string-char and the subsequent merging
of the semantics of character and string-char.
| It doesn't require *every* string type to be able to hold all character
| values. It does, however, require *some* string type to be able to hold
| all character values.
where do you find support for this? nowhere does the standard say that a
string must retain implementation-defined attributes of characters. it
does say that the code attribute is the only standard attribute, and it
is obvious that that attribute must be retained wherever. it is not at
all obvious that implementation-defined attributes must survive all kinds
of operations.
you've been exceedingly specific in finding ways to defend your position,
but nowhere do you find actual evidence of a requirement that there exist
a string type that would not reject at least some character objects. I'm
sorry, but the premise that some string type _must_ be able to hold _all_
characters, including all the implementation-defined attributes that
strings never were intended to hold to begin with, is no more than
unsupported wishful thinking, but if you hold this premise as axiomatic,
you won't see that it is unsupported. if you discard it as an axiom and
then try to find support for it, you find that you can't -- the language
definition is sufficiently slippery that these implementation-defined
attributes don't have any standard-prescribed semantics for them at all,
including giving the implementation leeway to define their behavior,
which means: not _requiring_ anything particular about them, which means:
not _requiring_ strings to retain them, since that would be a particular
requirement about an implementation-defined property of the language.
| The reason why STRING is a union type is that implementors might want to
| have (say) an "efficient" string type that uses only one byte per
| character, for storing "easy" strings. Having this as well as a type
| that can store arbitrary characters, and having them both be subtypes of
| STRING, requires that STRING be a union type.
now, this is the interesting part. _which_ string would that be? as far
as I understand your argument, you're allowing an implementation to have
an implementation-defined standard type to hold simple characters (there
is only one _standard_ attribute -- the code), while it is _required_ to
support a wider _non-standard_ implementation-defined type? this is
another contradiction in terms. either the same requirement is standard
or it is implementation-defined -- it can't be both at the same time.
I quote from the character proposal that led to the changes we're
discussing, _not_ to imply that what isn't in the standard is more of a
requirement on the implementation than the standard, but to identify the
intent and spirit of the change. as with any legally binding document,
if you can't figure it out by reading the actual document, you go hunting
for the meaning in the preparatory works. luckily, we have access to the
preparatory works with the HyperSpec. it should shed light on the
wording in the standard, if necessary. in this case, it is necessary.
Remove all discussion of attributes from the language specification. Add
the following discussion:
``Earlier versions of Common LISP incorporated FONT and BITS as
attributes of character objects. These and other supported
attributes are considered implementation-defined attributes and
if supported by an implementation effect the action of selected
functions.''
what we have is a standard that didn't come out and say "you can't retain
bits and fonts from CLtL1 in characters", but _allowed_ an implementation
to retain them, in whatever way they wanted. since the standard removed
these features, it must be interpreted relative to that (bloody) obvious
intent if a wording might be interpreted by some that the change would
require providing _additional_ support for the removed features -- such
an interpretation _must_ be discarded, even if it is possible to argue
for it in an interpretative vacuum, which never exists in any document
written by and for human beings regardless of some people's desires.
(such a vacuum cannot even exist in mathematics -- which reading a
standard is not an exercise in, anyway -- any document must always be
read in a context that supplies and retains its intention, otherwise
_human_ communication breaks down completely.)
#:Erik