strings and characters
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/15
Subject: strings and characters
I've managed to avoid worrying about characters and strings and all
the related horrors so far, but I've finally been forced into having
to care.

The particular thing I don't understand is what type a literal string
has.  It looks at first sight as if it should be something capable of
holding any CHARACTER, but I'm not really sure if that's right.  It
looks to me as if it might be possible to read things such that it's OK
to return something that can only hold a subtype of CHARACTER in some
cases.  

I'm actually more concerned with the flip side of this -- if almost all
the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
but sometimes I get some ginormous multibyte unicode thing or
something, because I have to deal with some C code
which is blithely assuming that unsigned chars are just small integers
and strings are arrays of small integers and so on in the usual C way,
and I'm not sure that I can trust my strings to be the same as its
strings.

I realise that people who care about character issues are probably
laughing at me at this point, but my main aim is to keep everything as
simple as I can, and especially I don't want to have to keep copying
my strings into arrays of small integers (which I was doing at one
point, but it's too hairy).

The practical question I guess is -- are there any implementations
which do currently have really big characters in strings?  Genera
seems to, but that's of limited interest.  CLISP seems to have
internationalisation stuff in it, and I know there's an international
Allegro, so those might have horrors in them.

Thanks for any advice.

--tim `7 bit ASCII was good enough for my father and it's good enough
       for me' Bradshaw.


 
Newsgroups: comp.lang.lisp
From: Barry Margolin <bar...@bbnplanet.com>
Date: 2000/03/15
Subject: Re: strings and characters
In article <ey3hfe73nm4....@cley.com>, Tim Bradshaw  <t...@cley.com> wrote:

>I realise that people who care about character issues are probably
>laughing at me at this point, but my main aim is to keep everything as
>simple as I can, and especially I don't want to have to keep copying
>my strings into arrays of small integers (which I was doing at one
>point, but it's too hairy).

You can call ARRAY-ELEMENT-TYPE on the string to find out if it contains
anything weird.  If it's compatible with your foreign function's API,
then you don't need to copy it.
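
As a rough sketch of that check (the helper name is hypothetical, and
whether BASE-CHAR actually fits in 8 bits is implementation-dependent,
so this is only a first filter, not a guarantee):

```lisp
;; Hypothetical helper, not from the thread: report whether a string's
;; storage is restricted to BASE-CHAR.
(defun narrow-string-p (string)
  (check-type string string)
  ;; SUBTYPEP returns two values; keep only the first, which is true
  ;; only when the relationship is known to hold.
  (values (subtypep (array-element-type string) 'base-char)))

;; Example call on a string explicitly created with a narrow element type.
(narrow-string-p (make-string 3 :element-type 'base-char
                                :initial-element #\a))
```

A string for which this returns false may still contain only small
characters; it just means the storage type alone does not tell you so.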

--
Barry Margolin, bar...@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.


 
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/16
Subject: Re: strings and characters
* Tim Bradshaw <t...@cley.com>
| The particular thing I don't understand is what type a literal string
| has.  It looks at first sight as if it should be something capable of
| holding any CHARACTER, but I'm not really sure if that's right.  It looks
| to me as if it might be possible read things such that it's OK to return
| something that can only hold a subtype of CHARACTER in some cases.

  strings _always_ contain a subtype of character.  e.g., an implementation
  that supports bits will have to discard them from strings.  the only
  array type that can contain all character objects has element-type t.

| I'm actually more concerned with the flip side of this -- if almost all
| the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
| but sometimes I get some ginormous multibyte unicode thing or something,
| because I need to be able I have to deal with some C code which is
| blithely assuming that unsigned chars are just small integers and strings
| are arrays of small integers and so on in the usual C way, and I'm not
| sure that I can trust my strings to be the same as its strings.

  this is not a string issue, it's an FFI issue.  if you tell your FFI that
  you want to ship a string to a C function, it should do the conversion
  for you if it needs to be performed.  if you can't trust your FFI to do
  the necessary conversions, you need a better FFI.

| I realise that people who care about character issues are probably
| laughing at me at this point, but my main aim is to keep everything as
| simple as I can, and especially I don't want to have to keep copying my
| strings into arrays of small integers (which I was doing at one point,
| but it's too hairy).

  if you worry about these things, your life is already _way_ more complex
  than it needs to be.  a string is a string.  each element of the string
  is a character.  stop worrying beyond this point.  C and Common Lisp
  agree on this fundamental belief, believe it or not.  your _quality_
  Common Lisp implementation will ensure that whatever invariants are
  required are maintained in _each_ environment.

| The practical question I guess is -- are there any implementations which
| do currently have really big characters in strings?

  yes, and not only that -- it's vitally important that strings take up no
  more space than they need.  a system that doesn't support both
  base-string (of base-char) and string (of extended-char) when it attempts
  to support Unicode will fail in the market -- Europe and the U.S. simply
  can't tolerate the huge growth in memory consumption from wantonly using
  twice as much as you need.  Unicode even comes with a very intelligent
  compression technique because people realize that it's a waste of space
  to use 16 bits and more for characters in a given character set group.

| I know there's an international Allegro, so those might have horrors in
| them.

  sure, but in the same vein, it might also have responsible, intelligent
  people behind it, not neurotics who fail to realize that customers have
  requirements that _must_ be resolved.  Allegro CL's international version
  deals very well with conversion between the native system strings and its
  internal strings.  I know -- not only do I run the International version
  in a test environment that needs wide characters _internally_, the test
  environment can't handle Unicode or anything else wide at all, and it's
  never been a problem.

  incidentally, I don't see this as any different from whether you have a
  simple-base-string, a simple-string, a base-string, or a string.  if you
  _have_ to worry, you should be the vendor or implementor of strings, not
  the user.  if you are the user and worry, you either have a problem that
  you need to take up with your friendly programmer-savvy shrink, or you
  call your vendor and ask for support.  I don't see this as any different
  from whether an array has a fill-pointer or not, either.  if you hand it
  to your friendly FFI and you worry about the length of the array with or
  without fill-pointer, you're simply worrying too much, or you have a bug
  that needs to be fixed.

  "might have horrors"!  what's next?  monster strings under your bed?

#:Erik


 
Newsgroups: comp.lang.lisp
From: pe...@harlequin.co.uk (Pekka P. Pirinen)
Date: 2000/03/16
Subject: Re: strings and characters
Erik is basically right that you shouldn't have to worry, unless
you're specifically writing localized applications.  A string will
hold a character, and the FFI will convert if it can.  The details of
how things should work with multiple string types have not been worked
out in the standard, so if you do want more control, it's
non-portable.

Erik Naggum <e...@naggum.no> writes:
> Tim Bradshaw writes:
> | The particular thing I don't understand is what type a literal string
> | has.  It looks at first sight as if it should be something capable of
> | holding any CHARACTER, but I'm not really sure if that's right.  It looks
> | to me as if it might be possible read things such that it's OK to return
> | something that can only hold a subtype of CHARACTER in some cases.

>   strings _always_ contain a subtype of character.  e.g., an implementation
>   that supports bits will have to discard them from strings.  the only
>   array type that can contain all character objects has element-type t.

If only it were so!  Unfortunately, the standard says characters with
bits are of type CHARACTER and STRING = (VECTOR CHARACTER).  Harlequin
didn't have the guts to stop supporting them (even though there's a
separate internal representation for keystroke events, now).  I guess
Franz did?

However, it's rarely necessary to create strings out of them, and it's
easy to configure LispWorks so that never happens.  Basically, there's
a variable called *DEFAULT-CHARACTER-ELEMENT-TYPE* that is the default
character type for all string constructors.  That includes the
reader's "-syntax that Tim Bradshaw was worrying about.  The reader
will actually silently construct wider strings if it sees a character
that is not in *D-C-E-T*, it's just the default.  (Note that if you're
reading from a stream, you have to consider the external format on the
stream first.)

> | The practical question I guess is -- are there any implementations which
> | do currently have really big characters in strings?

Allegro and LispWorks, at least.  Both will use thin strings where possible
(but in slightly different ways).
--
Pekka P. Pirinen
A feature is a bug with seniority.  - David Aldred <david_aldred.demon.co.uk>

 
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/16
Subject: Re: strings and characters

* Erik Naggum wrote:
>   strings _always_ contain a subtype of character.  e.g., an implementation
>   that supports bits will have to discard them from strings.  the only
>   array type that can contain all character objects has element-type
>   t.

I don't think this is right -- rather I agree that they contain
CHARACTERs, but it looks like `bits' -- which I think now are
`implementation-defined attributes' -- can end up in strings, or at
least it is implementation-defined whether they do or not (2.4.5 says
this I think).  

>   this is not a string issue, it's an FFI issue.  if you tell your FFI that
>   you want to ship a string to a C function, it should do the conversion
>   for you if it needs to be performed.  if you can't trust your FFI to do
>   the necessary conversions, you need a better FFI.

Unfortunately my FFI is READ-SEQUENCE and WRITE-SEQUENCE, and at the
far end of this is something which is defined in terms of treating
characters as fixed-size (8 bit) small integers.  And I can't change
it because it's big important open source software and lots of people
have it, and it's written in C so it's too hard to change anyway... So
I need to be sure that nothing I can do is going to start spitting
unicode or something at it.

At one point I did this by converting my strings to arrays of
(UNSIGNED-BYTE 8)s on I/O, but that was stressful to do for various
reasons.
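
For reference, the copying approach described above might look
something like the sketch below.  The function name is made up, and it
assumes that the implementation's character codes coincide with the
8-bit encoding the other end expects (true for ASCII text in current
implementations, but not something the standard promises):

```lisp
;; Hypothetical helper: copy a string into a fresh vector of octets,
;; signalling an error if any character code does not fit in 8 bits.
(defun string-to-octets-naive (string)
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8))))
    (dotimes (i (length string) octets)
      (let ((code (char-code (char string i))))
        (when (> code 255)
          (error "Character ~S (code ~D) does not fit in 8 bits."
                 (char string i) code))
        (setf (aref octets i) code)))))
```

The result can be handed to WRITE-SEQUENCE on a stream opened with
:ELEMENT-TYPE '(UNSIGNED-BYTE 8), at the cost of one copy per string.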

In *practice* this has not turned out to be a problem but it's not
clear what I need to check to make sure it is not.  I suspect that
checking that CHAR-CODE is always suitably small would be a good
start.
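
A minimal sketch of such a check, assuming `suitably small' means
codes below 256; the function name and the LIMIT parameter are
illustrative only.  Note that a small CHAR-CODE says nothing about
which bytes the stream's external format will actually emit:

```lisp
;; Hypothetical helper: true if every character of STRING has a
;; CHAR-CODE strictly below LIMIT (256 by default, i.e. fits in 8 bits).
(defun octet-safe-string-p (string &optional (limit 256))
  (every (lambda (ch) (< (char-code ch) limit)) string))
```

This walks the whole string, so it is linear in the string length.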

>   if you worry about these things, your life is already _way_ more complex
>   than it needs to be.  a string is a string.  each element of the string
>   is a character.  

Well, the whole problem is that at the far end that's not true.  Each
element (they've decided!) is an *8-bit* character...

>   yes, and not only that -- it's vitally important that strings take up no
>   more space than they need.  a system that doesn't support both
>   base-string (of base-char) and string (of extended-char) when it attempts
>   to support Unicode will fail in the market -- Europe and the U.S. simply
>   can't tolerate the huge growth in memory consumption from wantonly using
>   twice as much as you need.  Unicode even comes with a very intelligent
>   compression technique because people realize that it's a waste of space
>   to use 16 bits and more for characters in a given character set group.

For what it's worth I think this is wrong (but I could be wrong of
course, and anyway it's not worth arguing over).  People *happily*
tolerate doublings of memory & disk consumption if it suits them --
look at windows 3.x to 95, or sunos 5.5 to 5.7, or any successive pair
of xemacs versions ... And they're *right* to do that because Moore's
law works really well.  Using compressed representations makes things
more complex -- if strings are arrays, then aref &c need to have hairy
special cases, and everything else gets more complex, and that
complexity never goes away, which doubled-storage costs do in about a
year.

So I think that in a few years compressed representations will look
like the various memory-remapping tricks that DOS did, or the similar
things people now do with 32 bit machines to deal with really big
databases (and push, incredibly, as `the right thing', I guess because
they worship intel and intel are not doing too well with their 64bit
offering).  The only place it will matter is network transmission of
data, and I don't see why normal compression techniques shouldn't deal
with that.

So my story is if you want characters twice as big, just have big
characters and use more memory and disk -- it's cheap enough now that
it's dwarfed by labour costs and in a year it will be half the price.

On the other hand, people really like complex fiddly solutions to
things (look at C++!), so that would argue that complex character
compression techniques are here to stay.

Anyway, like I said it's not worth arguing over.  Time will tell.

--tim


 
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/16
Subject: Re: strings and characters

* I wrote:
> For what it's worth I think this is wrong (but I could be wrong of
> course, and anyway it's not worth arguing over).

Incidentally I should make this clearer, as it looks like I'm arguing
against fat strings. Supporting several kinds of strings is
*obviously* sensible, I quibble about the compressing stuff being
worth it.

--tim


 
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/16
Subject: Re: strings and characters
* Erik Naggum
| strings _always_ contain a subtype of character.  e.g., an implementation
| that supports bits will have to discard them from strings.  the only
| array type that can contain all character objects has element-type t.

* Tim Bradshaw
| I don't think this is right -- rather I agree that they contain
| CHARACTERs, but it looks like `bits' -- which I think now are
| `implementation-defined attributes' -- can end up in strings, or at least
| it is implementation-defined whether they do or not (2.4.5 says this I
| think).

  trivially, "strings _always_ contain a subtype of character" must be true
  as character is a subtype of character, but I did mean in the sense that
  strings _don't_ contain full character objects, despite the relegation of
  fonts and bits to "implementation-defined attributes".  that the type
  string-char was removed from the language but the attributes were sort of
  retained is perhaps confusing, but it is quite unambiguous as to intent.

  so must "the only array type that can contain all character objects has
  element-type t" be true, since a string is allowed to contain a subtype
  of type character.  (16.1.2 is pertinent in this regard.)  it may come as
  a surprise to people, but if you store a random character object into a
  string, you're not guaranteed that what you get back is eql to what you
  put into it.

  furthermore, there is no print syntax for implementation-defined
  attributes in strings, and no implementation is allowed to add any.  it
  is perhaps not obvious, but the retention of attributes is restricted by
  _both_ the string type's capabilities and the stream type's capabilities.

  you can quibble with the standard all you like -- you aren't going to see
  any implementation-defined attributes in string literals.  if you compare
  with CLtL1 and its explicit support for string-char which didn't support
  them at all, you must realize that in order to _have_ any support for
  implementation-defined attributes, you have to _add_ it above and beyond
  what strings did in CLtL1.  this is an extremely unlikely addition to an
  implementation just after bits and fonts were removed from the language
  and relegated to "implementation-defined attributes".

  I think the rest of your paranoid conspiratorial delusions about what
  "horrors" might afflict Common Lisp implementations are equally lacking
  in merit.  like, nothing is going to start spitting Unicode at you, Tim.
  not until and unless you ask for it.  it's called "responsible vendors".

| The only place it will matter is network transmission of data, and I
| don't see why normal compression techniques shouldn't deal with that.

  then read the technical report and decrease your ignorance.  sheesh.

#:Erik, who's actually quite disappointed, now.


 
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/16
Subject: Re: strings and characters
* Tim Bradshaw <t...@cley.com>
| Incidentally I should make this clearer, as it looks like I'm arguing
| against fat strings.  Supporting several kinds of strings is *obviously*
| sensible, I quibble about the compressing stuff being worth it.

  compressing strings for in-memory representation of _arrays_ is nuts.
  nobody has proposed it, and nobody ever will.  again, read the Unicode
  technical report and decrease both your fear and your ignorance.

#:Erik


 
Newsgroups: comp.lang.lisp
From: Barry Margolin <bar...@bbnplanet.com>
Date: 2000/03/16
Subject: Re: strings and characters
In article <3162223661729...@naggum.no>, Erik Naggum  <e...@naggum.no> wrote:

>  so must "the only array type that can contain all character objects has
>  element-type t" be true, since a string is allowed to contain a subtype
>  of type character.  (16.1.2 is pertinent in this regard.)  it may come as
>  a surprise to people, but if you store a random character object into a
>  string, you're not guaranteed that what you get back is eql to what you
>  put into it.

Isn't (array character (*)) able to contain all character objects?

--
Barry Margolin, bar...@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.


 
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/16
Subject: Re: strings and characters

* Erik Naggum wrote:
>   I think the rest of your paranoid conspiratorial delusions about what
>   "horrors" might afflict Common Lisp implementations are equally lacking
>   in merit.  like, nothing is going to start spitting Unicode at you, Tim.
>   not until and unless you ask for it.  it's called "responsible
>   vendors".

If my code gets a string (from wherever, the user if you like) which
has bigger-than-8-bit characters in it, then tries to send it down the
wire, then what will happen?  I don't see this as a vendor issue, but
perhaps I'm wrong.

Meantime I'm going to put in some optional checks to make sure that
all my character codes are small enough.

--tim


 
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/16
Subject: Re: strings and characters
* Barry Margolin <bar...@bbnplanet.com>
| Isn't (array character (*)) able to contain all character objects?

  no.  specialized vectors whose elements are of type character (strings)
  are allowed to store only values of a subtype of type character.  this is
  so consistently repeated in the standard and so unique to strings that
  I'm frankly amazed that anyone who has worked on the standard is having
  such a hard time accepting it.  it was obviously intended to let strings
  be as efficient as the old string-char concept allowed, while not denying
  implementations the ability to retain bits and fonts if they so chose.

  an implementation that stores characters in strings as if they have null
  implementation-defined attributes regardless of their actual attributes
  is actually fully conforming to the standard.  the result is that you
  can't expect any attributes to survive string storage.  the consequences
  are _undefined_ if you attempt to store a character with attributes in a
  string that can't handle it.

  the removal of the type string-char is the key to understanding this.

#:Erik


 
Newsgroups: comp.lang.lisp
From: pe...@harlequin.co.uk (Pekka P. Pirinen)
Date: 2000/03/17
Subject: Re: strings and characters

Erik Naggum <e...@naggum.no> writes:
> * Barry Margolin <bar...@bbnplanet.com>
> | Isn't (array character (*)) able to contain all character objects?

>   no.  specialized vectors whose elements are of type character (strings)
>   are allowed to store only values of a subtype of type character.  this is
>   so consistently repeated in the standard and so unique to strings that
>   I'm frankly amazed that anyone who has worked on the standard is having
>   such a hard time accepting it.

Who replaced #:Erik with a bad imitation?  This one's got all the
belligerence, but not the insight we've come to expect.

You've read a different standard than I, since many places actually
say "of type CHARACTER or a subtype" -- superfluously, since the
glossary entry for "subtype" says "Every type is a subtype of itself."
When I was designing the "fat character" support for LispWorks, I
looked for a get-out clause, and it's not there.

>   the consequences are _undefined_ if you attempt to store a
>   character with attributes in a string that can't handle it.

This is true.  It's also true of all the other specialized arrays,
although different language ("must be") is used to specify that.

>   the removal of the type string-char is the key to understanding this.

I suspect it was removed because it was realized that there would have
to be many types of STRING (at least 8-bit and 16-bit), and hence
there wasn't a single subtype of CHARACTER that would be associated
with strings.  Whatever the reason, we can only go by what the
standard says.

I think it was a good choice, and LW specifically didn't retain the
type, to force programmers to consider what the code actually meant by
it (and to allow them to DEFTYPE it to the right thing).
Nevertheless, there should be a standard name for the type of simple
characters, i.e., with null implementation-defined attributes.
LispWorks and Liquid use LW:SIMPLE-CHAR for this.
--
Pekka P. Pirinen, Harlequin Limited
The Risks of Electronic Communication
http://www.best.com/~thvv/emailbad.html


 
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/17
Subject: Re: strings and characters
* Pekka P. Pirinen
| Who replaced #:Erik with a bad imitation?

  geez...

| You've read a different standard than I, since many places actually say
| "of type CHARACTER or a subtype" -- superfluously, since the glossary
| entry for "subtype" says "Every type is a subtype of itself."

  sigh.  this is so incredibly silly it isn't worth responding to.

| I suspect it was removed because it was realized that there would have to
| be many types of STRING (at least 8-bit and 16-bit), and hence there
| wasn't a single subtype of CHARACTER that would be associated with
| strings.  Whatever the reason, we can only go by what the standard says.

  the STRING type is a union type, and there are no other union types in
  Common Lisp.  this should give you a pretty powerful hint, if you can get
  away from your "bad imitation" attitude problem and actually listen, but
  I guess that is not very likely at this time.

#:Erik


 
Newsgroups: comp.lang.lisp
From: pe...@harlequin.co.uk (Pekka P. Pirinen)
Date: 2000/03/17
Subject: Re: strings and characters

Tim Bradshaw <t...@cley.com> writes:
> * Erik Naggum wrote:
> >   this is not a string issue, it's an FFI issue.  if you tell your FFI that
> >   you want to ship a string to a C function, it should do the conversion
> >   for you if it needs to be performed.  if you can't trust your FFI to do
> >   the necessary conversions, you need a better FFI.

> Unfortunately my FFI is READ-SEQUENCE and WRITE-SEQUENCE, and at the
> far end of this is something which is defined in terms of treating
> characters as fixed-size (8 bit) small integers.

You still need a better FFI: WRITE-SEQUENCE is just as much a foreign
interface as any.  In theory, you specify the representation on the
other side by the external format of the stream.  If the system
doesn't have an external format that can do this, then you're reduced
to hacking it.

> In *practice* this has not turned out to be a problem but it's not
> clear what I need to check to make sure it is not.  I suspect that
> checking that CHAR-CODE is always suitably small would be a good
> start.

In practice, most of us can pretend there's no encoding except ASCII.
If you expect non-ASCII characters on the Lisp side, you need to know
what the encoding is on the other side, otherwise it might come out
wrong.

It might be enough to check the type of your strings (and perhaps the
external format of the stream), instead of every character.
--
Pekka P. Pirinen, Harlequin Limited
Technology isn't just putting in the fastest processor and most RAM -- that's
packaging.   - Steve Wozniak


 
Newsgroups: comp.lang.lisp
From: Barry Margolin <bar...@bbnplanet.com>
Date: 2000/03/17
Subject: Re: strings and characters
In article <3162232362158...@naggum.no>, Erik Naggum  <e...@naggum.no> wrote:

>* Barry Margolin <bar...@bbnplanet.com>
>| Isn't (array character (*)) able to contain all character objects?

>  no.  specialized vectors whose elements are of type character (strings)
>  are allowed to store only values of a subtype of type character.  

You seem to be answering a different question than I asked.  I didn't say
"Aren't all strings of type (array character (*))?".

I realize that there are string types that are not (array character (*)),
because a string can be of any array type where the element type is a
subtype of character.  But if you want a type that can hold any character,
you can create it with:

(make-string length :element-type 'character)

In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
the default.
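
To see what a given implementation actually does with these element
types, something like the following sketch can be evaluated at the
REPL.  The function name is hypothetical, and the element types it
reports are implementation-dependent:

```lisp
;; Hypothetical helper: return the element types the implementation
;; reports for a default string, an explicit CHARACTER string, and a
;; BASE-CHAR string, in that order.
(defun string-element-types ()
  (let ((default  (make-string 2 :initial-element #\a))
        (explicit (make-string 2 :element-type 'character
                                 :initial-element #\a))
        (narrow   (make-string 2 :element-type 'base-char
                                 :initial-element #\a)))
    (list (array-element-type default)
          (array-element-type explicit)
          (array-element-type narrow))))
```

The first two entries should agree, since CHARACTER is the default
element type; the third shows whether the implementation keeps a
separate narrow string representation.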

--
Barry Margolin, bar...@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.


 
Tim Bradshaw  
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/17
Subject: Re: strings and characters
* Pekka P Pirinen wrote:

> You still need a better FFI: WRITE-SEQUENCE is just as much a foreign
> interface as any.  

Yes, in fact it's worse than most, because I can't rely on the
vendor/implementor to address the issues for me!

> In theory, you specify the representation on the other side by the
> external format of the stream.  If the system doesn't have an
> external format that can do this, then you're reduced to hacking it.

Right.  And I'm happy to do this -- what I was asking was how I can
ensure

> In practice, most of us can pretend there's no encoding except ASCII.
> If you expect non-ASCII characters on the Lisp side, you need to know
> what the encoding is on the other side, otherwise it might come out
> wrong.

Yes.  And the problem is that since my stuff is a low-level facility
which others (I hope) will build on, I don't really know what they
will throw at me.  And I don't want to check every character of the
strings as this causes severe reduction in maximum performance (though
I haven't spent a lot of time checking that the checker compiles
really well yet, and in practice it will almost always be throttled
elsewhere).

> It might be enough to check the type of your strings (and perhaps the
> external format of the stream), instead of every character.

My hope is that BASE-STRING is good enough, but I'm not sure (I don't
see that a BASE-STRING could not have more than 8-bit characters, if
an implementation chose to have only one string type for instance (can
it?)).  Checking the external format of the stream is also obviously
needed but if it's :DEFAULT does that tell me anything, and if it's
not I have to special case anyway.
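
The two kinds of check being weighed here might look roughly like the sketch below. The function names are invented for illustration, and the "code below 256 means one octet" rule is an assumption about the implementation and the external format, not something the standard promises.

```lisp
;; Hypothetical helpers; the names are made up for this sketch.

;; Cheap: look only at the string's type and the stream's external format.
;; STREAM is expected to be a file stream. Being a BASE-STRING does not by
;; itself promise that the characters are 8 bits wide.
(defun cheap-check (string stream)
  (values (typep string 'base-string)
          (stream-external-format stream)))

;; Expensive: look at every character. Treating a CHAR-CODE below 256 as
;; "fits in one octet" is an implementation-dependent assumption.
(defun every-char-fits-octet-p (string)
  (every (lambda (c) (< (char-code c) 256)) string))
```

The cheap check costs the same however long the string is, which is the attraction; the per-character check is the one that actually inspects the data.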

Obviously at some level I have to just have implementation-dependent
checks because I don't think it says anywhere that characters are at n
bits or any of that kind of grut (which is fine).  Or I could just not
care and pretend everything is 8-bit which will work for a while I
guess.

Is there a useful, fast, check that (write-sequence x y) will
write (length x) bytes on y if all is well for LispWorks / Liquid (I
don't have a license for these, unfortunately)?

Thanks

--tim


 
Erik Naggum  
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/17
Subject: Re: strings and characters
* Tim Bradshaw <t...@cley.com>
| Is there a useful, fast, check that (write-sequence x y) will write
| (length x) bytes on y if all is well for LispWorks / Liquid ...?

  yes.  make the buffer and the stream have type (unsigned-byte 8), and
  avoid the character abstraction which you obviously can't trust, anyway.
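
  A minimal sketch of that approach, with invented function names and the
  assumption that CHAR-CODE values below 256 are the octets the other side
  expects (true of ASCII-based implementations, but not guaranteed):

```lisp
;; Hypothetical helper: copy STRING into a fresh (unsigned-byte 8) vector,
;; signalling an error for any character whose code does not fit in an octet.
(defun string-to-octets-strict (string)
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8))))
    (dotimes (i (length string) octets)
      (let ((code (char-code (char string i))))
        (when (> code 255)
          (error "Character ~S does not fit in one octet." (char string i)))
        (setf (aref octets i) code)))))

;; Hypothetical helper: write OCTETS to PATHNAME through a binary stream,
;; so exactly (length octets) elements are written. Returns that count.
(defun write-octets (octets pathname)
  (with-open-file (out pathname
                       :direction :output
                       :element-type '(unsigned-byte 8)
                       :if-exists :supersede)
    (write-sequence octets out))
  (length octets))
```

  The copy in STRING-TO-OCTETS-STRICT is the cost of this route: every
  string crosses the boundary through a second buffer.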

#:Erik


 
Erik Naggum  
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/17
Subject: Re: strings and characters
* Barry Margolin <bar...@bbnplanet.com>
| But if you want a type that can hold any character, you can create it with:
|
| (make-string length :element-type 'character)

  no, and that's the crux of the matter.  this used to be different from

(make-string length :element-type 'string-char)

  in precisely the capacity that you wish is still true, but it isn't.
  when the type string-char was removed, character assumed its role in
  specialized arrays, and you could not store bits and fonts in strings any
  more than you could with string-char.  to do that, you need arrays with
  element-type t.

  but I'm glad we've reached the point where you assert a positive, because
  your claim is what I've been trying to tell you guys DOES NOT HOLD.  my
  claim is: there is nothing in the standard that _requires_ that there be
  a specialized array with elements that are subtypes of character (i.e., a
  member of the union type "string") that can hold _all_ character objects.

  can you show me where the _standard_ supports your claim?

| In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
| the default.

  sure.  however, I'm trying to penetrate the armor-plated belief that the
  resulting string is REQUIRED to retain non-null implementation-defined
  attributes if stored into it.  no such requirement exists: a conforming
  implementation is completely free to provide a single string type that is
  able to hold only simple characters.  you may think this is a mistake in
  the standard, but it's exactly what it says, after the type string-char
  was removed.

  methinks you're stuck in CLtL1 days, Barry, and so is this bad imitation
  jerk from Harlequin, but that's much less surprising.

#:Erik


 
Tim Bradshaw  
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/17
Subject: Re: strings and characters
* Erik Naggum wrote:

> * Tim Bradshaw <t...@cley.com>
> | Is there a useful, fast, check that (write-sequence x y) will write
> | (length x) bytes on y if all is well for LispWorks / Liquid ...?
>   yes.  make the buffer and the stream have type (unsigned-byte 8), and
>   avoid the character abstraction which you obviously can't trust, anyway.

Which is precisely what I want to avoid unfortunately, as it means
that either this code or the code that calls it has to deal with the
issue of copying strings to and from arrays of (UNSIGNED-BYTE 8)s,
which simply brings back the same problem somewhere else.

(My first implementation did exactly this in fact)

--tim


 
Barry Margolin  
Newsgroups: comp.lang.lisp
From: Barry Margolin <bar...@bbnplanet.com>
Date: 2000/03/17
Subject: Re: strings and characters
In article <3162302923332...@naggum.no>, Erik Naggum  <e...@naggum.no> wrote:

>* Barry Margolin <bar...@bbnplanet.com>
>| But if you want a type that can hold any character, you can create it with:
>|
>| (make-string length :element-type 'character)

>  no, and that's the crux of the matter.  this used to be different from

>(make-string length :element-type 'string-char)

>  in precisely the capacity that you wish is still true, but it isn't.
>  when the type string-char was removed, character assumed its role in
>  specialized arrays, and you could not store bits and fonts in strings any
>  more than you could with string-char.  to do that, you need arrays with
>  element-type t.

I'm still not following you.  Are you saying that characters with
implementation-defined attributes (e.g. bits or fonts) might not satisfy
(typep c 'character)?  I suppose that's possible.  The standard allows
implementations to provide implementation-defined attributes, but doesn't
require them; an implementor could instead provide their own type
CHAR-WITH-BITS that's disjoint from CHARACTER rather than a subtype of it.
I'm not sure why they would do this, but nothing in the standard prohibits
it.

On the other hand, something like READ-CHAR would not be permitted to
return a CHAR-WITH-BITS -- it has to return a CHARACTER.  So I'm not sure
how a program that thinks it's working with characters and strings would
encounter such an object unexpectedly.

--
Barry Margolin, bar...@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.


 
Gareth McCaughan  
Newsgroups: comp.lang.lisp
From: Gareth McCaughan <Gareth.McCaug...@pobox.com>
Date: 2000/03/17
Subject: Re: strings and characters
Erik Naggum wrote:

>   can you show me where the _standard_ supports your claim?

I'm not Barry, but I think I can. Provided I'm allowed to
use the HyperSpec (which I have) rather than the Standard
itself (which I don't).

1. MAKE-STRING is defined to return "a string ... of the most
   specialized type that can accommodate elements of the given
   type".

2. The default "given type" is CHARACTER.

3. Therefore, MAKE-STRING with the default ELEMENT-TYPE
   returns a string "that can accommodate elements of the
   type CHARACTER".

Unfortunately, there's no definition of "accommodate" in the
HyperSpec. However, compare the following passages:

From MAKE-STRING:
  | The element-type names the type of the elements of the
  | string; a string is constructed of the most specialized
  | type that can accommodate elements of the given type.

From MAKE-ARRAY:
  | Creates and returns an array constructed of the most
  | specialized type that can accommodate elements of type
  | given by element-type.

It seems to me that the only reasonable definition of "can
accommodate elements of type FOO" in this context is "can
have arbitrary things of type FOO as elements". If so, then

4. MAKE-STRING with the default ELEMENT-TYPE returns a string
   capable of having arbitrary things of type CHARACTER as
   elements.

Now,

5. A "string" is defined as "a specialized vector ... whose
   elements are of type CHARACTER or a subtype of type CHARACTER".

6. A "specialized" array is defined to be one whose actual array
   element type is a proper subtype of T.

Hence,

7. MAKE-STRING with the default ELEMENT-TYPE returns a vector
   whose actual array element type is a proper subtype of T,
   whose elements are of type CHARACTER or a subtype thereof,
   and which is capable of holding arbitrary things of type
   CHARACTER as elements.

And therefore

8. There is such a thing as a specialized array with elements
   of type CHARACTER or some subtype thereof, which is capable
   of holding arbitrary things of type CHARACTER as elements.

Which is what you said the standard doesn't say. (From #7
we can also deduce that this thing has actual array element
type a proper subtype of T, so it's not equivalent to
(array t (*)).)

I can see only one hole in this. It's sort of possible that
"can accommodate elements of type FOO" in the definition of
MAKE-STRING doesn't mean what I said it does, even though
the exact same language in the definition of MAKE-ARRAY does
mean that. I don't find this plausible.

I remark also the following, from 16.1.1 ("Implications
of strings being arrays"):

  | Since all strings are arrays, all rules which apply
  | generally to arrays also apply to strings. See
  | Section 15.1 (Array Concepts).
  | ...
  | and strings are also subject to the rules of element
  | type upgrading that apply to arrays.

I'd have thought that if strings were special in the kind
of way you're saying they are, there would be some admission
of the fact here. There isn't.

                            *

Elsewhere in the thread, you said

  | an implementation that stores characters in strings
  | as if they have null implementation-defined attributes
  | regardless of their actual attributes is actually
  | fully conforming to the standard.

I have been unable to find anything in the HyperSpec that
justifies this. Some places I've looked:

  - 15.1.1 "Array elements" (in 15.1 "Array concepts")

    I thought perhaps this might say something like
    "In some cases, storing an object in an array will
    actually store another object that need not be EQ
    to the original object". Nope.

  - The definitions of CHAR and AREF

    Again, looking for any sign that an implementation
    is allowed to store something non-EQ to what it's
    given with (setf (aref ...) ...) or (setf (char ...) ...).
    Again, no. The definition of CHAR just says that it
    and SCHAR "access the element of STRING specified by INDEX".

  - 13.1.3 "Character attributes"

    Perhaps this might say "Storing a character in a string
    may lose its implementation-defined attributes". Nope.
    It says that the way in which two characters with the
    same code differ is "implementation-defined", but I don't
    see any licence anywhere for this to mean they get confused
    when stored in an array.

  - The definition of MAKE-STRING

    I've discussed this already.

  - The glossary entries for "string", "attribute", "element",
    and various others.

    Also discussed above.

  - The whole of chapter 13 (Characters) and 16 (Strings).

    No sign here, unless I've missed something.

  - The definitions of types CHARACTER, BASE-CHAR, STANDARD-CHAR,
    EXTENDED-CHAR.

    Still no sign.

  - The CHARACTER-PROPOSAL (which isn't, in any case, part of
    the standard).

    I thought this might give some sign of the phenomenon
    you describe. Not that I can see.

Perhaps I'm missing something. It wouldn't be the first time.
But I just can't find any sign at all that what you claim is
true, and I can see rather a lot of things that suggest it isn't.

The nearest I can find is this, from 16.1.2 ("Subtypes of STRING"):

  | However, the consequences are undefined if a character
  | is inserted into a string for which the element type of
  | the string does not include that character.

But that doesn't give any reason to believe that the result
of (MAKE-STRING n :ELEMENT-TYPE 'CHARACTER) doesn't have an
element type that includes all characters. And, as I've said
above, there's good reason to believe that it does.
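
On any one implementation the question can at least be probed, though a probe cannot settle what the standard requires. A small sketch, with an invented function name:

```lisp
;; Hypothetical probe: store CHAR in a string made with element type
;; CHARACTER and see whether the same character comes back. EQL is used
;; because characters are not guaranteed to be EQ to themselves.
(defun round-trips-p (char)
  (let ((s (make-string 1 :element-type 'character)))
    (setf (char s 0) char)
    (eql (char s 0) char)))
```

The interesting inputs are characters with non-null implementation-defined attributes, if the implementation has any; for simple characters such as #\a there is nothing to lose.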

> | In fact, you don't even need the :ELEMENT-TYPE option, because CHARACTER is
> | the default.

>   sure.  however, I'm trying to penetrate the armor-plated belief that the
>   resulting string is REQUIRED to retain non-null implementation-defined
>   attributes if stored into it.  no such requirement exists: a conforming
>   implementation is completely free to provide a single string type that is
>   able to hold only simple characters.  you may think this is a mistake in
>   the standard, but it's exactly what it says, after the type string-char
>   was removed.

Where?

--
Gareth McCaughan  Gareth.McCaug...@pobox.com
sig under construction


 
Jon S Anthony  
Newsgroups: comp.lang.lisp
From: Jon S Anthony <j...@synquiry.com>
Date: 2000/03/17
Subject: Re: strings and characters

Gareth McCaughan wrote:

> Erik Naggum wrote:

> >   sure.  however, I'm trying to penetrate the armor-plated belief that the
> >   resulting string is REQUIRED to retain non-null implementation-defined
> >   attributes if stored into it.  no such requirement exists: a conforming
> >   implementation is completely free to provide a single string type that is
> >   able to hold only simple characters.  you may think this is a mistake in
> >   the standard, but it's exactly what it says, after the type string-char
> >   was removed.

> Where?

The part from "a conforming implementation..." on is directly supported
by 13.1.3:

| A character for which each implementation-defined attribute has the
| null value for that attribute is called a simple character. If the
| implementation has no implementation-defined attributes, then all
| characters are simple characters.

/Jon

--
Jon Anthony
Synquiry Technologies, Ltd. Belmont, MA 02478, 617.484.3383
"Nightmares - Ha!  The way my life's been going lately,
 Who'd notice?"  -- Londo Mollari


 
Erik Naggum  
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/18
Subject: Re: strings and characters
* Barry Margolin <bar...@bbnplanet.com>
| I'm still not following you.  Are you saying that characters with
| implementation-defined attributes (e.g. bits or fonts) might not satisfy
| (typep c 'character)?

  no.  I'm saying that even as this _is_ the case, the standard does not
  require a string to be able to hold and return such a character intact.

#:Erik


 
Erik Naggum  
Newsgroups: comp.lang.lisp
From: Erik Naggum <e...@naggum.no>
Date: 2000/03/18
Subject: Re: strings and characters
* Tim Bradshaw <t...@cley.com>
| Which is precisely what I want to avoid unfortunately, as it means that
| either this code or the code that calls it has to deal with the issue of
| copying strings to and from arrays of (UNSIGNED-BYTE 8)s, which simply
| brings back the same problem somewhere else.

  in this case, I'd talk to my vendor or dig deep in the implementation to
  find a way to transmogrify an (unsigned-byte 8) vector to a character
  vector by smashing the type codes instead of copying the data.  (this is
  just like change-class for vectors.)  barring bivalent streams that can
  accept either kind of vector (coming soon to an implementation near you),
  having to deal with annoyingly stupid or particular external requirements
  means it's OK to be less than nice at the interface level, provided it's
  done safely.

#:Erik


 
Tim Bradshaw  
Newsgroups: comp.lang.lisp
From: Tim Bradshaw <t...@cley.com>
Date: 2000/03/18
Subject: Re: strings and characters

* Erik Naggum wrote:
>   in this case, I'd talk to my vendor or dig deep in the implementation to
>   find a way to transmogrify an (unsigned-byte 8) vector to a character
>   vector by smashing the type codes instead of copying the data.  (this is
>   just like change-class for vectors.)  

This doesn't work (unless I've misunderstood you) because I can't use
it for the string->unsigned-byte-array case, because the strings might
have big characters in them.  Actually, it probably *would* work in
that I could arrange to get a twice-as-big array if the string had
16-bit characters in (or 4x as big if ...), but I think the other end
would expect UTF-8 or something in that case (or, more likely, just
throw up its hands in horror at the thought that characters are not 8
bits wide, it's a pretty braindead design).

It looks to me like the outcome of all this is that there isn't a
portable CL way of ensuring what I need to be true is true, and that I
need to ask vendors for per-implementation answers, and meantime punt
on the issue until my code is more stable.  Which are fine answers
from my point of view, in case anyone thinks I'm making the standard
`lisp won't let me do x' complaint.

>   barring bivalent streams that can accept either kind of vector
>   (coming soon to an implementation near you), having to deal with
>   annoyingly stupid or particular external requirements means it's
>   OK to be less than nice at the interface level, provided it's done
>   safely.

Yes, I agree with this.

--tim


 