Storing strings

James Harris

unread,

Oct 24, 2022, 6:31:14 AM10/24/22

to

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

Options 1 and 2 have the advantage that they can be referred to simply
by address. Option 3 needs an additional place in which to store the
(first, past) control block.

Option 1 has the advantage that it's easy for a program to process (by
either pointer or index).

Options 1 and 3 have the advantage that one can refer to the tail of the
string (anything past the first character) without creating a copy,
although option 3 would need a new control block to be created. Option 2
would require a new string to be created.

In fact, option 3 has the advantage that it allows any continuous
substring - head, mid, or tail - to be referred to without making a copy
of the required part of the string.

Options 2 and 3 make it fast to find the length. They also allow any
value (i.e. including zero) to be part of the string.

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

If that's not complicated enough, the above essentially considers
strings whose contents could be read-only or read-write but their
lengths don't change. If the lengths can change then there are
additional issues of storage management. Eek! ;)

Recommendations welcome!

--
James Harris

Dmitry A. Kazakov

unread,

Oct 24, 2022, 8:07:40 AM10/24/22

to

On 2022-10-24 12:31, James Harris wrote:
> Do you guys have any thoughts on the best ways for strings of characters
> to be stored?
>
> 1. There's the C way, of course, of reserving one value (zero) and using
> it as a terminator.
>
> 2. There's the 'length prefix' option of putting the length of the
> string in a machine word before the characters.
>
> 3. There's the 'double pointer' way of pointing at, say, first and past
> (where 'past' means first plus length such that the second pointer
> points one position beyond the last character).
>
> Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are implemented.
In the later case the compiler knows the strings bounds (first and last
indices and thus the length). In the former case the compiler passes a
"string dope" along with the naked body. The dope contains the bounds.

This has an effect on pointers. E.g. if you want slices and efficient
raw strings you must distinguish pointers to definite (constrained) vs.
indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature you
would use a "fat pointer" to carry bounds with it.

This is similar to atomic, volatile objects and pointers to. The
mechanics is same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Bart

unread,

Oct 24, 2022, 10:28:47 AM10/24/22

to

For lower level strings, I'd highly recommend using zero-terminated
strings, or using them as the basis, or at least having it as an option.

This is not the 'C way', as I'd long used this outside of C and Unix
(eg. in DEC assembly, and in my own stuff for at least a decode before I
first dealt with C.

I still use them, and among many advantages such as pure simplicity,
allow you to directly make use of innumerable APIs that specify such
strings.

They can be used in contexts such as the compact string fields of
structs, since the only overhead is allowing space for that terminator **.

The next step up, in lower level code, is to use a slice. This is a
(pointer, length) descriptor. Here no terminator is necessary, and
allows strings to also contain embedded zeros (so can contain any binary
data).

String slices can point into another string (allowing sharing), or into
another slice, or into a regular zero-terminated string.

However to call an API function expecting a zero-terminated string
('stringz` as I sometimes call it), the pointer is not enough: you need
to ensure there's a zero following those <length> characters!

Within my dynamic scripting language, I have a full-on counted string
type, with reference counting to manage sharing and allow automatic
memory management. But with the same headache when calling low-level FFI
functions that expect C-like strings.

But that language at least will cope with it.

(** The scripting language can define structs with fixed types including
fixed-width string fields. Those are defined in two ways:

stringz*8 A
stringc*8 B

Both A and B occupy an 8-byte field. But A can store a maximum string of
7 characters, with B it can be 8 characters.

Yet B also includes the count so no scanning is needed to determine the
string length. The scheme however only works on fields of 2 to 256
characters.)

James Harris

unread,

Oct 26, 2022, 3:43:13 PM10/26/22

to

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
> On 2022-10-24 12:31, James Harris wrote:
>> Do you guys have any thoughts on the best ways for strings of
>> characters to be stored?
>>
>> 1. There's the C way, of course, of reserving one value (zero) and
>> using it as a terminator.
>>
>> 2. There's the 'length prefix' option of putting the length of the
>> string in a machine word before the characters.
>>
>> 3. There's the 'double pointer' way of pointing at, say, first and
>> past (where 'past' means first plus length such that the second
>> pointer points one position beyond the last character).
>>
>> Any others?
>
> 4. String body only. The constraints are known outside.
>
> This is the way string slices and fixed length strings are implemented.
> In the later case the compiler knows the strings bounds (first and last
> indices and thus the length). In the former case the compiler passes a
> "string dope" along with the naked body. The dope contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear, case
3 would be represented by, in addition to the bytes of the string,

struct
first: pointer to first byte of string
past: pointer to byte after last byte of string
.... other fields ....
end struct

The string length would be past - first. The bytes of the string would
be those pointed at (which I presume is what you are calling the naked
body).

>
> This has an effect on pointers. E.g. if you want slices and efficient
> raw strings you must distinguish pointers to definite (constrained) vs.
> indefinite (unconstrained) objects of same type.
>
> E.g. in Ada you cannot take an indefinite string pointer to a fixed
> length string because there is no bounds. If you wanted that feature you
> would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct, above?

>
> This is similar to atomic, volatile objects and pointers to. The
> mechanics is same. You cannot take a general-purpose pointer to an
> atomic object, because the client code would not know that it should
> take care upon dereferencing.
>

I am not sure what that means. I guess the point you are making is that
there are levels of classification which don't affect the data type but
they do affect how it can be accessed - with the language needing to
prevent a reference weakening the storage model. For example, a
read-write reference to a substring should be prevented from being used
to access part of a string which is supposed to be read-only.

--
James Harris

James Harris

unread,

Oct 26, 2022, 4:34:01 PM10/26/22

to

On 24/10/2022 15:28, Bart wrote:
> On 24/10/2022 11:31, James Harris wrote:

>> Do you guys have any thoughts on the best ways for strings of
>> characters to be stored?

..

> For lower level strings, I'd highly recommend using zero-terminated
> strings, or using them as the basis, or at least having it as an option.

They certainly seem easiest to work with although they do have
limitations such as:

* cannot include a character with the encoding of zero (as you say)
* must be scanned to determine length
* awkward to add to or delete from the end of as they don't carry any
data about whether the memory immediately following is available or not

>
> This is not the 'C way', as I'd long used this outside of C and Unix
> (eg. in DEC assembly, and in my own stuff for at least a decode before I
> first dealt with C.

True, though C popularised the scheme. Besides, on the PDP one way of
storing strings was apparently as

(length, address)

That's according to the Commercial Instruction Set (CIS) part of

https://en.wikipedia.org/wiki/PDP-11_architecture

>
> I still use them, and among many advantages such as pure simplicity,
> allow you to directly make use of innumerable APIs that specify such
> strings.
>
> They can be used in contexts such as the compact string fields of
> structs, since the only overhead is allowing space for that terminator **.
>

OK.

>
> The next step up, in lower level code, is to use a slice. This is a
> (pointer, length) descriptor. Here no terminator is necessary, and
> allows strings to also contain embedded zeros (so can contain any binary
> data).
>
> String slices can point into another string (allowing sharing), or into
> another slice, or into a regular zero-terminated string.

That's more universal and therefore perhaps the best to implement if
only one scheme is to be available. Have to say, though, I guess it
would be hard to manage the memory for. Instead of just (first, length)
or (first, past) perhaps one would need something like

struct
first: pointer to first element
past: pointer just past last element
count: number of slices pointing to this slice/string
base: the parent string or memory
flags: various
end struct

The base field would refer to the string object we were a slice of or,
if we were not a slice but the base string, the memory area in which the
string was stored.

The flags would indicate whether the string/slice could have its
contents changed and whether it could have its length changed, whether
the contents could be moved in memory, etc.

>
> However to call an API function expecting a zero-terminated string
> ('stringz` as I sometimes call it), the pointer is not enough: you need
> to ensure there's a zero following those <length> characters!
>
>
> Within my dynamic scripting language, I have a full-on counted string
> type, with reference counting to manage sharing and allow automatic
> memory management.

What fields did you use to manage such stuff? Am I on the right lines
with the ideas above?

> But with the same headache when calling low-level FFI
> functions that expect C-like strings.

Just a thought: ensure there is always at least one more byte of memory
than the string requires and put a zero byte at the end of the string
before calling any function which expects a C-like string. (User
responsibility to ensure there are no zero bytes embedded in the string.)

Perhaps one reason is that some predefined data structures include a
fixed-length field in which a string can sit but which has no room for
another byte such as a terminating zero. But for them the string could
be copied out.

Having a string defined by a (first, past) pair would perhaps allow
fixed-length fields to be handled as easily as mutable strings.

--
James Harris

James Harris

unread,

Oct 26, 2022, 4:52:08 PM10/26/22

to

On 24/10/2022 12:44, Stefan Ram wrote:

> James Harris <james.h...@gmail.com> writes:
>> So: Which of those should a compiler support? Should it support more
>> than one form? If so, should the language allow the programmer to
>> specify which form to use on any particular string?
>

> I think the idea of C is to leave it up to the programmer.
> The C string literals and functions are just some kind of
> suggestion, and they help to provide basic services, such
> as printing some text to the terminal. But otherwise, the
> programmer is free to implement his own string type(s) or
> use string libraries.
>
> The choice depends on the expected type of use. For example,
> some ways to store strings are known as "ropes" (Hans J Boehm,
> 1994), others are known as "gap buffers". A text editor
> might simultaneously use ropes for its text buffers and
> C strings for filenames.

Thanks for the references.

>
> The crucial thing for allowing programmers to implement
> their own string type is that the languages is fast enough
> to do this with little overhead compared to an implementation
> of strings in the langugage itself. Implementing custom
> string representations in slow languages might not feasible.

That's fair. I would like, however, to have an inbuilt string type that
is easy to work with so that there's a pre-made standard and programmers
don't have to come up with their own or to spend time working out what a
previous programmer had created.

I should have suggested a string or slice interface. Here's a first
attempt at the operations a string would be expected to be hit with.

These are part of the mechanics of string handling, relating to the
structure of the string rather than to its contents, so I've not
included anything in this list which looks at the content of the string.
There are loads of content-based operations such as string comparisons,
case conversion, whitespace trimming, etc, which could be built on top
of the basic handling.

Potential operations on string structures:
* allocate a new string
* create a slice (view) of an existing string
* index into a string
* increase the size of a string
* reduce the size of a string
* return the length of the string
* append/delete characters from the end
* insert/delete characters at the beginning
* take a slice of a string
* concatenate strings (including copying)
* pass to and from functions

The idea of slices is that they would appear the be strings but could be
created to refer to the same string elements without allocating new
storage for the sliced data.

These are just some ideas on what might be required. To do this
comprehensively seems rather complicated! :(

--
James Harris

Dmitry A. Kazakov

unread,

Oct 27, 2022, 3:28:58 AM10/27/22

to

That is the structure of a string dope, not the string itself, unless
you have the body in other fields, but then why would you need pointers?

To clarify terms. String representation must include the string body if
we are talking about values of strings. The things like pointers and
vectorized dopes are references to a string, not strings. You can pass a
string by a reference, sure. But the string value is somewhere else.
What you pass is not a string it is a substitute.

>> This has an effect on pointers. E.g. if you want slices and efficient
>> raw strings you must distinguish pointers to definite (constrained)
>> vs. indefinite (unconstrained) objects of same type.
>>
>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>> length string because there is no bounds. If you wanted that feature
>> you would use a "fat pointer" to carry bounds with it.
>
> Any reason you'd recommend against storing bounds as in the struct, above?

Start with interoperability of strings and slices of. The crucial
requirements would be:

A slice can be passed to a subprogram expecting a string without
copying.

Consider efficiency and low-level close to hardware stuff:

Aggregation of strings with known bounds does not require storing them.

E.g. you can have arrays of fixed length strings (like an image buffer).
If a member of a structure is a fixed length string, no bounds are
stored. A pointer to a fixed length string is a plain pointer etc.

>> This is similar to atomic, volatile objects and pointers to. The
>> mechanics is same. You cannot take a general-purpose pointer to an
>> atomic object, because the client code would not know that it should
>> take care upon dereferencing.
>
> I am not sure what that means. I guess the point you are making is that
> there are levels of classification which don't affect the data type but
> they do affect how it can be accessed - with the language needing to
> prevent a reference weakening the storage model. For example, a
> read-write reference to a substring should be prevented from being used
> to access part of a string which is supposed to be read-only.

Yes, it is a type constraint. There are all sorts of constraints one
could put on a type in order to produce a constrained subtype.
Constraining limits operations, e.g. immutability removes mutators. It
also directs certain implementations like using locking instructions or
dropping known bounds.

Bart

unread,

Oct 27, 2022, 7:14:28 AM10/27/22

to

On 26/10/2022 21:33, James Harris wrote:
> On 24/10/2022 15:28, Bart wrote:
>> On 24/10/2022 11:31, James Harris wrote:
>
>>> Do you guys have any thoughts on the best ways for strings of
>>> characters to be stored?
>
> ..
>
>> For lower level strings, I'd highly recommend using zero-terminated
>> strings, or using them as the basis, or at least having it as an option.
>
> They certainly seem easiest to work with although they do have
> limitations such as:
>
> * cannot include a character with the encoding of zero (as you say)
> * must be scanned to determine length

I use such strings in my 1Mlps compilers. Note that in that application:

* You don't need to get string lengths that often
* The vast majority of strings are short
* String append (that you mention below) is mainly when speed is not
critical (eg. for diagnostics)

> * awkward to add to or delete from the end of as they don't carry any
> data about whether the memory immediately following is available or not

>> The next step up, in lower level code, is to use a slice. This is a
>> (pointer, length) descriptor. Here no terminator is necessary, and
>> allows strings to also contain embedded zeros (so can contain any
>> binary data).
>>
>> String slices can point into another string (allowing sharing), or
>> into another slice, or into a regular zero-terminated string.
>
> That's more universal and therefore perhaps the best to implement if
> only one scheme is to be available.

Most strings are fixed-length once created; strings that can grow are
rare. You don't need a 'capacity' field for example (like C++'s Vector
type).

But managing memory can still be an issue because you don't know if a
particular slice owns its memory, or points to a string literal, or
points into a shared string, or points to external memory.

So a simple slice suits a lower-level language where you do this manual
(it would be a welcome addition to C for example).

My main language is just like this.

Have to say, though, I guess it
> would be hard to manage the memory for. Instead of just (first, length)
> or (first, past) perhaps one would need something like
>
> struct
>     first: pointer to first element
>     past: pointer just past last element
>     count: number of slices pointing to this slice/string
>     base: the parent string or memory
>     flags: various
> end struct
>
> The base field would refer to the string object we were a slice of or,
> if we were not a slice but the base string, the memory area in which the
> string was stored.
>
> The flags would indicate whether the string/slice could have its
> contents changed and whether it could have its length changed, whether
> the contents could be moved in memory, etc.
>
>>
>> However to call an API function expecting a zero-terminated string
>> ('stringz` as I sometimes call it), the pointer is not enough: you
>> need to ensure there's a zero following those <length> characters!

>>
>> Within my dynamic scripting language, I have a full-on counted string
>> type, with reference counting to manage sharing and allow automatic
>> memory management.
>
> What fields did you use to manage such stuff? Am I on the right lines
> with the ideas above?

The structure I use is not lightweight because it is for interpreted
code. The following object descriptor is a 32-byte record, used for all
objects. I've shown only the fields used by string objects:

record objrec =
u32 refcount
byte mutable # 1 for mutable strings
byte objtype
u16 dummy

ichar strptr # (ref char)
u64 length
union
u64 alloc64
object objptr2 # (ref objptr)
end
end

The string data itself is separate, pointed to by 'strptr'. This is nil
when the length is zero (it doesn't point to ""). It is not
zero-terminated (unless an external slice happens to be).

Most strings are mutable, then .alloc64 gives the capacity of the
allocation.

An important field is objtype; its values are:

Normal Regular string (uses alloc64)
Slice Slice into another (uses objptr2)
Extslice Strings lie outside the object scheme

For slices, while .strptr refers to the string data in question,
.objptr2 refs to the owner object of that string, which has its own
refcount.

External strings are those that belong to external code (eg. from an FFI
function), or those occuring inside a packed struct field for example.

So .objtype is used when sharing or freeing string data.

As I said, this is for interpreted code which can afford to do this
fiddly checking at runtime, which is not done inline either.

For static languages using inline code, it might need to be more
streamlined.

Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
and .length fields) correspond to a raw Slice as used in my lower level
language.

>> But with the same headache when calling low-level FFI functions that
>> expect C-like strings.
>
> Just a thought: ensure there is always at least one more byte of memory
> than the string requires and put a zero byte at the end of the string
> before calling any function which expects a C-like string. (User
> responsibility to ensure there are no zero bytes embedded in the string.)

I think I tried that once. In general it doesn't work, as you might have
a slice into another string; you can't inject a zero byte into the
middle of that other string!

James Harris

unread,

Oct 29, 2022, 7:23:12 AM10/29/22

to

Curious use of terms. I presume that by "dope" you mean a dope vector
which can also be called a control block or a descriptor.

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster than
the others and does not run into problems when the elements are other
than single bytes.

To explain, for common operations,

memsize() = past - first
length() = memsize() >> alignbits
forward iteration proceeds while address < past
backward iteration proceeds while address >= first

The only stipulation is that the body must not be allocated at the very
top or bottom of the addressable range.

Using (first, past) should be as simple as that. By contrast, the
similar (first, last) runs into a slight problem when elements are wider
than single bytes: should the last pointer point to the start or the end
of the last item?

The others, which involve memsize or length, make it slightly slower to
judge the limits of iteration in the general case, requiring a
calculation to see if a pointer is outside the limits of the string
being referred to.

>
> To clarify terms. String representation must include the string body if
> we are talking about values of strings. The things like pointers and
> vectorized dopes are references to a string, not strings. You can pass a
> string by a reference, sure. But the string value is somewhere else.
> What you pass is not a string it is a substitute.

That depends, surely, on how "a string" is defined. If strings are
defined as descriptors starting with the fields first and past then the
bodies of such strings can be elsewhere. (There would be other fields of
a string descriptor to assist with memory management and probably some
flags, though I am open to suggestions as to what those fields should be.)

>
>>> This has an effect on pointers. E.g. if you want slices and efficient
>>> raw strings you must distinguish pointers to definite (constrained)
>>> vs. indefinite (unconstrained) objects of same type.
>>>
>>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>>> length string because there is no bounds. If you wanted that feature
>>> you would use a "fat pointer" to carry bounds with it.
>>
>> Any reason you'd recommend against storing bounds as in the struct,
>> above?
>
> Start with interoperability of strings and slices of. The crucial
> requirements would be:
>
> A slice can be passed to a subprogram expecting a string without
> copying.

Indeed, that's a major benefit of slices, IMO, being able to pass
something which looks and acts like a string but which doesn't need the
elements of the string to be copied.

That said, a slice would probably have a length which the callee can
determine but which the callee cannot change. I presume that's what
you'd call a constraint.

If a callee wanted to be able to change the length of a string then it
would have to be passed a real string, not a slice.

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be altered
by the callee. (Could be a real string or a slice.)

>
> Consider efficiency and low-level close to hardware stuff:
>
> Aggregation of strings with known bounds does not require storing them.
>
> E.g. you can have arrays of fixed length strings (like an image buffer).
> If a member of a structure is a fixed length string, no bounds are
> stored. A pointer to a fixed length string is a plain pointer etc.

You mean the string bounds could be known at compile time, say, rather
than at run time. Good point. Any suggestions on how that should be
implemented?

>
>>> This is similar to atomic, volatile objects and pointers to. The
>>> mechanics is same. You cannot take a general-purpose pointer to an
>>> atomic object, because the client code would not know that it should
>>> take care upon dereferencing.
>>
>> I am not sure what that means. I guess the point you are making is
>> that there are levels of classification which don't affect the data
>> type but they do affect how it can be accessed - with the language
>> needing to prevent a reference weakening the storage model. For
>> example, a read-write reference to a substring should be prevented
>> from being used to access part of a string which is supposed to be
>> read-only.
>
> Yes, it is a type constraint. There are all sorts of constraints one
> could put on a type in order to produce a constrained subtype.
> Constraining limits operations, e.g. immutability removes mutators. It
> also directs certain implementations like using locking instructions or
> dropping known bounds.
>

Was with you all the way until you mentioned dropping known bounds. What
does that mean? How can it be legitimate to drop any bounds?

--
James Harris

Bart

unread,

Oct 29, 2022, 8:24:49 AM10/29/22

to

4. Extensible string. This is not quite the same as your (1) which
requires only a mutable string.

You can mutate a string (alter individual characters) without needing to
know the overall length or its allocated capacity.

(You might further split that into mutable/non-mutable extensible
strings. Usually if growing a string by appending to it, you don't want
to also alter existing parts of the string.)

(You probably need to consider Unicode strings too, especially if
represented as UTF8, as the meaning of 'length' needs pinning down.)

James Harris

unread,

Oct 29, 2022, 9:10:51 AM10/29/22

to

On 27/10/2022 09:58, Stefan Ram wrote:
> James Harris <james.h...@gmail.com> writes:
>> Potential operations on string structures:
>> * allocate a new string
>> * create a slice (view) of an existing string
>> * index into a string
>

> Many of such operations are provided by the standard library
> of C++. You could have a look at its implementation. One might
> even think of kinda "backporting" it to C. Or use C++.
>
> Suggested Video: "The strange details of std::string at
> Facebook" - Nicholas Ormrod (2016)

Thanks. I had a look at that video - and a number of others.

--
James Harris

James Harris

unread,

Oct 29, 2022, 10:01:36 AM10/29/22

to

On 27/10/2022 12:14, Bart wrote:
> On 26/10/2022 21:33, James Harris wrote:
>> On 24/10/2022 15:28, Bart wrote:
>>> On 24/10/2022 11:31, James Harris wrote:
>>
>>>> Do you guys have any thoughts on the best ways for strings of
>>>> characters to be stored?

..

>>> String slices can point into another string (allowing sharing), or
>>> into another slice, or into a regular zero-terminated string.
>>
>> That's more universal and therefore perhaps the best to implement if
>> only one scheme is to be available.
>
> Most strings are fixed-length once created; strings that can grow are
> rare. You don't need a 'capacity' field for example (like C++'s Vector
> type).

Having watched some videos on string storage recently I now think I know
what you mean by the capacity field - basically that a string descriptor
would consist of these fields:

start
length
capacity

so that the string could be extended at the end (up to the capacity).
That may be a bit restrictive. A programmer might want to remove or add
characters at the beginning rather than just at the end, even though
such would be done less often.

So what do you think of having a string descriptor more like

first
past
memfirst
mempast

where memfirst and mempast would define the allocated space in which the
string body would sit.

Or perhaps the descriptor should be unified with other references to
memory - as I understand is true of your example, below.

>
> But managing memory can still be an issue because you don't know if a
> particular slice owns its memory, or points to a string literal, or
> points into a shared string, or points to external memory.

Yes, some flags would be needed. They could be stored in the low-order
bits of memfirst and mempast given that:

a) allocations (hence, memfirst) could be aligned
b) sizes of allocations (hence, mempast) could be rounded up to a
suitable power of 2
c) memfirst and mempast would need to be used far less often than first
and past so there would be no great problem with the cost of masking out
the low-order bits to get the addresses.

..

>>> Within my dynamic scripting language, I have a full-on counted string
>>> type, with reference counting to manage sharing and allow automatic
>>> memory management.
>>
>> What fields did you use to manage such stuff? Am I on the right lines
>> with the ideas above?
>
> The structure I use is not lightweight because it is for interpreted
> code. The following object descriptor is a 32-byte record, used for all
> objects. I've shown only the fields used by string objects:
>
>     record objrec =
>         u32         refcount
>         byte        mutable      # 1 for mutable strings
>         byte        objtype
>         u16         dummy
>
>         ichar       strptr       # (ref char)
>         u64         length
>         union
>             u64     alloc64
>             object objptr2      # (ref objptr)
>         end
>     end

That looks very sensible. I have considered having 'sentient references'
which would have a common format for anything which refers to memory
(especially referents of dynamic size) and would include the address of
a vtable for the specific type of sentient reference. The vtable would
hold the addresses of methods which could be applied to the reference
rather than to the referent. IOW the referent and the reference would
each have a type.

ATM I think I'd need to work through a lot more use cases before I would
be ready to settle on the details of that so for now I may just go with
the idea of a string descriptor.

>
> The string data itself is separate, pointed to by 'strptr'. This is nil
> when the length is zero (it doesn't point to ""). It is not
> zero-terminated (unless an external slice happens to be).
>
> Most strings are mutable, then .alloc64 gives the capacity of the
> allocation.
>
> An important field is objtype; its values are:
>
>     Normal            Regular string (uses alloc64)
>     Slice             Slice into another (uses objptr2)
>     Extslice          Strings lie outside the object scheme

OK. I may use something like that or, possibly, some flags.

..

> Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
> and .length fields) correspond to a raw Slice as used in my lower level
> language.

Good point. I'd need slices to have the same format as strings and for
both to have flags. As there's no space for flags in the (first, past)
pair I'd need to add a flags word, making the structure

first
past
misc
memfirst
mempast

where misc would store various pieces of information, not just flag
bits. Slices would have only the first three fields. Strings would have
all five. Flags would indicate whether this was a string or a slice.

For me it's too early to optimise but it's worth noting that even for
64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
cache line so short string bodies could be stored in the same line,
again with flags indicating that that was so.

--
James Harris

James Harris

unread,

Oct 29, 2022, 10:16:28 AM10/29/22

to

You mean a string which can be made longer but the existing contents
could not be changed? I cannot think of a use case for that.

>
> You can mutate a string (alter individual characters) without needing to
> know the overall length or its allocated capacity.

Wouldn't you need to know how long the string was so that a callee could
make sure it was trying to modify characters within the string rather
than memory locations outside it?

>
> (You might further split that into mutable/non-mutable extensible
> strings. Usually if growing a string by appending to it, you don't want
> to also alter existing parts of the string.)

Mutable and extensible are good descriptions though as above I don't yet
see the value in allowing a string to be extensible but its existing
contents to be immutable.

A slice would be inextensible but could be mutable or immutable, AISI.

>
> (You probably need to consider Unicode strings too, especially if
> represented as UTF8, as the meaning of 'length' needs pinning down.)

I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
can be stored in them, including zero. It also means there's no way to
reserve a value for EOF so that condition has to be handled a different
way from what C programmers are used to where EOF is a value which is
outside the range permitted for chars. Challenges a plenty!

--
James Harris

Bart

unread,

Oct 29, 2022, 11:30:20 AM10/29/22

to

On 29/10/2022 15:16, James Harris wrote:
> On 29/10/2022 13:24, Bart wrote:
>> On 29/10/2022 12:23, James Harris wrote:
>
>
>>> I guess there would be these kinds of string argument:
>>>
>>> 1. Read-write string. Anything could be done to the string by the
>>> callee. (Would have to be a real string.)
>>>
>>> 2. Read-write fixed-length string. The string's contents could be
>>> altered but it could not be made longer or shorter. (Could be a real
>>> string or a slice.)
>>>
>>> 3. Read-only string. Neither its length nor it contents could be
>>> altered by the callee. (Could be a real string or a slice.)
>>
>> 4. Extensible string. This is not quite the same as your (1) which
>> requires only a mutable string.
>
> You mean a string which can be made longer but the existing contents
> could not be changed? I cannot think of a use case for that.

That's a pattern I used all the time to incrementally build strings, for
example to generate C or ASM source files from a language app.

Or it can be as simple as this:

errormess +:= " on line "+tostr(linenumber)

Once extended, the existing parts of the string are never modified.

Perhaps you can give an example of where mutating the characters of a
string, extensible or otherwise, comes in useful.

(My strings generally are mutable, but it's not a feature I use a great
deal.

For applications like text editors, I use a list of strings, one per
line. And editing within each line create a new string for each edit.
Efficiency here is not critical, and the needs are diverse, like
deleting within the string, or insertion. It's just easier to construct
a new one.)

>
>>
>> You can mutate a string (alter individual characters) without needing
>> to know the overall length or its allocated capacity.
>
> Wouldn't you need to know how long the string was so that a callee could
> make sure it was trying to modify characters within the string rather
> than memory locations outside it?

My point is that, given only the string pointer and an index or offset
into it, that's all that's needed to modify it. If slices at least are
used, then the callee could do bounds checking /if it wanted/.

(My dynamic language does do runtime checking of bounds but, once an
application has been developed, it is very, very rare that I have a
bounds error come up. In a working, debugged program, it should not be
necessary.)

>>
>> (You might further split that into mutable/non-mutable extensible
>> strings. Usually if growing a string by appending to it, you don't
>> want to also alter existing parts of the string.)
>
> Mutable and extensible are good descriptions though as above I don't yet
> see the value in allowing a string to be extensible but its existing
> contents to be immutable.
>
> A slice would be inextensible but could be mutable or immutable, AISI.
>
>>
>> (You probably need to consider Unicode strings too, especially if
>> represented as UTF8, as the meaning of 'length' needs pinning down.)
>
> I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
> can be stored in them, including zero. It also means there's no way to
> reserve a value for EOF so that condition has to be handled a different
> way from what C programmers are used to where EOF is a value which is
> outside the range permitted for chars. Challenges a plenty!

But you're not using all 2**32 bit patterns? It could reserve -1 or all
1s for EOF just like C does. Because EOF would generally be used for
character-at-a-time streaming, which is typically 8-bit anyway.

Or have you developed a binary file system which works with 32-bit-wide
'bytes'?

James Harris

unread,

Oct 29, 2022, 1:42:54 PM10/29/22

to

On 29/10/2022 16:30, Bart wrote:
> On 29/10/2022 15:16, James Harris wrote:
>> On 29/10/2022 13:24, Bart wrote:
>>> On 29/10/2022 12:23, James Harris wrote:
>>
>>
>>>> I guess there would be these kinds of string argument:
>>>>
>>>> 1. Read-write string. Anything could be done to the string by the
>>>> callee. (Would have to be a real string.)
>>>>
>>>> 2. Read-write fixed-length string. The string's contents could be
>>>> altered but it could not be made longer or shorter. (Could be a real
>>>> string or a slice.)
>>>>
>>>> 3. Read-only string. Neither its length nor it contents could be
>>>> altered by the callee. (Could be a real string or a slice.)
>>>
>>> 4. Extensible string. This is not quite the same as your (1) which
>>> requires only a mutable string.
>>
>> You mean a string which can be made longer but the existing contents
>> could not be changed? I cannot think of a use case for that.
>
> That's a pattern I used all the time to incrementally build strings, for
> example to generate C or ASM source files from a language app.
>
> Or it can be as simple as this:
>
> errormess +:= " on line "+tostr(linenumber)
>
> Once extended, the existing parts of the string are never modified.

Good examples. The 'extend' permission seems a bit specific although I
accept that the uses you mention are common. I suppose it adds to the
security of the language to be able to designate a string as
extensible/inextensible separately from designating whether its existing
contents can be changed or not.

How would it be used? Thinking about functions which take a string as
input, most strings would be purely inputs. They would therefore be both
read-only and inextensible within the called function. Such arguments
could be strings or slices.

Further, functions which /return/ a string would create the string and
return it whole.

It is only functions which /modify/ a string, i.e. take it as an inout
parameter, where it would matter whether the string was read/write or
extensible. For an inout string what should be the defaults? If we say
an inout string defaults to immutable and inextensible then that would
lead to the following ways to specify a string, s, as a parameter:

f: function(s: inout string char)
f: function(s: inout string char rw)
f: function(s: inout string char ext rw)
f: function(s: inout string char ext)

Note the "ext" and "rw" attributes. The idea is that they would specify
how the string could be modified in the function. Adding rw would allow
the string's existing contents to be taken as read-write rather than
read-only. Adding ext would allow the string to be extended.

That's effectively me thinking out loud and trying out some ideas. How
does it look to you?

What about other permissions such as prepend, split, insert, delete,
etc? Perhaps it's too specific to have too many qualifiers although I
can see value in using such info to help match caller and callee. For
example, given the above one could say that as long as the callee
doesn't specify the string as ext then it could be either a string or a
slice. That is appealing from a security perspective.

That said, can a compiler ensure that a string is not used in a way
which breaks the contract indicated by its keywords? You raise some big
issues!

>
> Perhaps you can give an example of where mutating the characters of a
> string, extensible or otherwise, comes in useful.

I intend a string to be simply an array whose length can be changed. The
idea being that a program could have a string of integers, a string of
floats etc just as easily as having a string of characters. As such,
anything which changes the content of an array should also work on
strings. For example, one might want to sort an array in place. As a
string of characters one might want to convert lower case to upper case,
etc.

>
> (My strings generally are mutable, but it's not a feature I use a great
> deal.
>
> For applications like text editors, I use a list of strings, one per
> line. And editing within each line create a new string for each edit.
> Efficiency here is not critical, and the needs are diverse, like
> deleting within the string, or insertion. It's just easier to construct
> a new one.)

OK.

..

>>>
>>> (You might further split that into mutable/non-mutable extensible
>>> strings. Usually if growing a string by appending to it, you don't
>>> want to also alter existing parts of the string.)
>>
>> Mutable and extensible are good descriptions though as above I don't
>> yet see the value in allowing a string to be extensible but its
>> existing contents to be immutable.
>>
>> A slice would be inextensible but could be mutable or immutable, AISI.
>>
>>>
>>> (You probably need to consider Unicode strings too, especially if
>>> represented as UTF8, as the meaning of 'length' needs pinning down.)
>>
>> I haven't mentioned it but ATM my chars are 32-bit and any 32-bit
>> value can be stored in them, including zero. It also means there's no
>> way to reserve a value for EOF so that condition has to be handled a
>> different way from what C programmers are used to where EOF is a value
>> which is outside the range permitted for chars. Challenges a plenty!
>
> But you're not using all 2**32 bit patterns? It could reserve -1 or all
> 1s for EOF just like C does. Because EOF would generally be used for
> character-at-a-time streaming, which is typically 8-bit anyway.

As above, the language is meant to treat strings as arrays. So AISI it
should not ascribe any particular meaning to their contents.

There are other ways. For example, my plan for EOF is twofold:

1. to have it as an attribute of a file object

2. to have an attempt to read at EOF throw a weak exception which would
be a catchable way to end an iteration.

>
> Or have you developed a binary file system which works with 32-bit-wide
> 'bytes'?

No, my system is nothing like that advanced. At present all bytes
(octets) I read from disk are zero extended to 32 bits. And all chars I
write to disk have their top 24 zero bits chopped off. Though please
don't think that's by design. It's only a temporary measure while I get
the compiler up and running properly. (The compiler and the compilable
language are, at present, rather limited.) In the long term IO streams
should be via typed channels where chars of octets (or some other size)
could be handled natively.

--
James Harris

Dmitry A. Kazakov

unread,

Oct 29, 2022, 4:17:26 PM10/29/22

to

Yes. The problem with (first, next) is that next could be inexpressible.
Most difficulties arise with strings/arrays over enumerations and
modular types. (first, last) has no such problem.

Both have issues with empty strings, e.g. with a multitude of
representations of. Compare with +/-0 problem for non-2-complement integers.

> Using (first, past) should be as simple as that. By contrast, the
> similar (first, last) runs into a slight problem when elements are wider
> than single bytes: should the last pointer point to the start or the end
> of the last item?

Just do not use multibyte representations at all. E.g. UTF-8 string is
represented by an array of *octets*. It has a view of an array of code
points, but that is not the physical representation, only a view.

>> To clarify terms. String representation must include the string body
>> if we are talking about values of strings. The things like pointers
>> and vectorized dopes are references to a string, not strings. You can
>> pass a string by a reference, sure. But the string value is somewhere
>> else. What you pass is not a string it is a substitute.
>
> That depends, surely, on how "a string" is defined.

def String is a sequence of characters.

There is not other definitions. Little depends on that because there is
no requirement to represent string this way. You are completely free to
choose any suitable representation.

[...]

Skipped description of a possible representation.

>>>> This has an effect on pointers. E.g. if you want slices and
>>>> efficient raw strings you must distinguish pointers to definite
>>>> (constrained) vs. indefinite (unconstrained) objects of same type.
>>>>
>>>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>>>> length string because there is no bounds. If you wanted that feature
>>>> you would use a "fat pointer" to carry bounds with it.
>>>
>>> Any reason you'd recommend against storing bounds as in the struct,
>>> above?
>>
>> Start with interoperability of strings and slices of. The crucial
>> requirements would be:
>>
>> A slice can be passed to a subprogram expecting a string without
>> copying.
>
> Indeed, that's a major benefit of slices, IMO, being able to pass
> something which looks and acts like a string but which doesn't need the
> elements of the string to be copied.
>
> That said, a slice would probably have a length which the callee can
> determine but which the callee cannot change. I presume that's what
> you'd call a constraint.

It could be a constraint for fixed length slices.

> If a callee wanted to be able to change the length of a string then it
> would have to be passed a real string, not a slice.

A callee might pass a variable length slice, which, for example, can be
enlarged or shortened. Many languages with dynamically allocated strings
have this. You need to find some balance between flexibility of
pool-allocated strings and efficiency of fixed length ones. If the
language has a developed type system you can have both transparently
interchangeable for the programmer. Note this is same discussion as with
numbers. Programmers want all of them with an ability to pass one for
another.

> I guess there would be these kinds of string argument:
>
> 1. Read-write string. Anything could be done to the string by the
> callee. (Would have to be a real string.)
>
> 2. Read-write fixed-length string. The string's contents could be
> altered but it could not be made longer or shorter. (Could be a real
> string or a slice.)
>
> 3. Read-only string. Neither its length nor it contents could be altered
> by the callee. (Could be a real string or a slice.)

Think of it in terms of constraints. Immutability is a constraint. Fixed
length is a constraint. Bounded length is a constraint. Non-sliding
lower bound is a constraint. Non-sliding upper bound is a constraint.

This should cover all spectrum. You can express all cases in terms of
constraints.

>> Consider efficiency and low-level close to hardware stuff:
>>
>> Aggregation of strings with known bounds does not require storing
>> them.
>>
>> E.g. you can have arrays of fixed length strings (like an image
>> buffer). If a member of a structure is a fixed length string, no
>> bounds are stored. A pointer to a fixed length string is a plain
>> pointer etc.
>
> You mean the string bounds could be known at compile time, say, rather
> than at run time. Good point. Any suggestions on how that should be
> implemented?

As I said, you just have the string body and nothing else in the
representation. Compare it to numbers. You can have indefinite length
integers, but for many reasons programmers stick to constrained variants
like -2**15..2*15-1.

>>>> This is similar to atomic, volatile objects and pointers to. The
>>>> mechanics is same. You cannot take a general-purpose pointer to an
>>>> atomic object, because the client code would not know that it should
>>>> take care upon dereferencing.
>>>
>>> I am not sure what that means. I guess the point you are making is
>>> that there are levels of classification which don't affect the data
>>> type but they do affect how it can be accessed - with the language
>>> needing to prevent a reference weakening the storage model. For
>>> example, a read-write reference to a substring should be prevented
>>> from being used to access part of a string which is supposed to be
>>> read-only.
>>
>> Yes, it is a type constraint. There are all sorts of constraints one
>> could put on a type in order to produce a constrained subtype.
>> Constraining limits operations, e.g. immutability removes mutators. It
>> also directs certain implementations like using locking instructions
>> or dropping known bounds.
>
> Was with you all the way until you mentioned dropping known bounds. What
> does that mean? How can it be legitimate to drop any bounds?

If you can deduce bounds why would you keep them? Again, consider 16-bit
integer. Do you keep -2**15 and 2*15-1 anywhere? You do not. The same
should happen to fixed length or fixed bounds strings.

Bart

unread,

Oct 29, 2022, 5:02:50 PM10/29/22

to

On 29/10/2022 18:42, James Harris wrote:

> On 29/10/2022 16:30, Bart wrote:

> Further, functions which /return/ a string would create the string and
> return it whole.

Not necessarily. My dynamic language can return a string which is a
slice into another. (Slices are not exposed in this language; they are
in the static one, where slices are distinct types.)

Example:

func trim(s) =
if s.len=2 then return "" fi
return s[2..$-1]
end

This trims the first and last character of string. But here it returns a
slice into the original string. If I wanted a fresh copy, I'd have to
use copy() inside the function, or copy() (or a special kind of
assignment) outside it.

> It is only functions which /modify/ a string, i.e. take it as an inout
> parameter, where it would matter whether the string was read/write or
> extensible. For an inout string what should be the defaults? If we say
> an inout string defaults to immutable and inextensible then that would
> lead to the following ways to specify a string, s, as a parameter:
>
>
> f: function(s: inout string char)
> f: function(s: inout string char rw)
> f: function(s: inout string char ext rw)
> f: function(s: inout string char ext)
>
> Note the "ext" and "rw" attributes. The idea is that they would specify
> how the string could be modified in the function. Adding rw would allow
> the string's existing contents to be taken as read-write rather than
> read-only. Adding ext would allow the string to be extended.
>
> That's effectively me thinking out loud and trying out some ideas. How
> does it look to you?

My preference is to keep it simple:

(1) String parameters are immutable

(2) String parameters are mutable (this is changing existing
content but also allow extension, plus deletion etc - the
works)

(3) String parameter are assignable. This means that, in addition to
(2), assigning to the parameter also replaces the caller's version

Python allows only (2), when working with Lists, and only (1) when
working with Strings (Strings are immutable, Lists are mutable)

(3) Requires full reference parameters so won't work in Python.

My scripting language allows (2) and (3) on both lists and strings. (1)
is only possible by a flag within the object that renders it immutable
(for example, passing a literal "ABC").

When I mentioned having extensibility as a different capability than
mutation, it is because this could be done via different string types.

There is in-place modification which changes the length of the object,
and modification where the length is not changed; I think these could be
useful, distinct attributes.

Changing the length requires a reference to the /original/ descriptor
where all the info is stored (heap pointer, length, capacity).

But changing the contents without affecting the size either only needs
the heap pointer, or can be done with a /copy/ of the descriptor; it
doesn't not need the original.

(My first implementation of a string type, on a 16- then 32-bit machine,
used a 16-byte descriptor passed by value. The string was mutable, but
it was not possible to extend it without a proper reference.)

> What about other permissions such as prepend, split, insert, delete,
> etc?

These all count as in-place modifications (except split), but as I said
above, it might be useful to treat length-modifying ones differently.

It's not clear how 'split' works, but there are anyway all sorts of
string ops that are not 'in-place'; they simply create new strings.
Presumably 'split' creates 2 or more new strings.

> I intend a string to be simply an array whose length can be changed.

I treat a string as one composite object normally treated as a single
value (like a record). I treat an array or list a collection of distinct
objects. But this is a minor point (it affects hows [] indexing works).

Bart

unread,

Oct 29, 2022, 5:25:40 PM10/29/22

to

On 29/10/2022 15:01, James Harris wrote:
> On 27/10/2022 12:14, Bart wrote:

>> Most strings are fixed-length once created; strings that can grow are
>> rare. You don't need a 'capacity' field for example (like C++'s Vector
>> type).
>
> Having watched some videos on string storage recently I now think I know
> what you mean by the capacity field - basically that a string descriptor
> would consist of these fields:
>
> start
> length
> capacity
>
> so that the string could be extended at the end (up to the capacity).
> That may be a bit restrictive. A programmer might want to remove or add
> characters at the beginning rather than just at the end, even though
> such would be done less often.

Doing a prepend is not a problem. What's critical is whether the new
length is still within the current allocation. (Prepend requires
shifting of the old string so is less efficient anyway.)

If a new allocation is needed, you may be copying data for both prepend
and append.

With delete however, you may need to think about whether to /reduce/ the
allocation size.

> So what do you think of having a string descriptor more like
>
> first
> past
> memfirst
> mempast
>
> where memfirst and mempast would define the allocated space in which the
> string body would sit.

What's the difference between 'first' and 'memfirst'? Would you have a
string that doesn't start at the beginning of its allocated block?

>> An important field is objtype; its values are:
>>
>>      Normal            Regular string (uses alloc64)
>>      Slice             Slice into another (uses objptr2)
>>      Extslice          Strings lie outside the object scheme
>
> OK. I may use something like that or, possibly, some flags.
>
> ..
>
>> Note that if you take those 32 bytes, then the middle 16 bytes
>> (.strptr and .length fields) correspond to a raw Slice as used in my
>> lower level language.
>
> Good point. I'd need slices to have the same format as strings and for
> both to have flags. As there's no space for flags in the (first, past)
> pair I'd need to add a flags word, making the structure
>
> first
> past
> misc
> memfirst
> mempast
>
> where misc would store various pieces of information, not just flag
> bits. Slices would have only the first three fields. Strings would have
> all five. Flags would indicate whether this was a string or a slice.
>
> For me it's too early to optimise but it's worth noting that even for
> 64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
> cache line so short string bodies could be stored in the same line,
> again with flags indicating that that was so.

I think that if your string implementation requires a 24 or 40-byte
descriptor, then thinking about cache-line optimisation /is/ premature!

I considered such a descriptor too heavyweight for my static language.
(I did incorporate such a string type once, intended for uses where
performance didn't matter: sorting out UI, printing error messages and
diagnostics, that sort of thing.)

In the end I decided it did't really fit. But then I have two languages.

I guess yours likely sits between my two.

James Harris

unread,

Oct 30, 2022, 7:24:35 AM10/30/22

to

On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
> On 2022-10-29 13:23, James Harris wrote:

..

>> As for this specific case, the same information can be conveyed in
>> different ways: (start, length), (start, memsize), (first, last),
>> (first, past). I chose the latter as it should be slightly faster than
>> the others and does not run into problems when the elements are other
>> than single bytes.
>
> Yes. The problem with (first, next) is that next could be inexpressible.
> Most difficulties arise with strings/arrays over enumerations and
> modular types. (first, last) has no such problem.
>
> Both have issues with empty strings, e.g. with a multitude of
> representations of. Compare with +/-0 problem for non-2-complement
> integers.

That sounds interesting. Do you see multiple representations of the
empty string in the following? Monospacing required. Here's how the
string "abcd" would be stored

!_a_!_b_!_c_!_d_!

^ ^
! !
first past

* so first would point at the first element of the string
* and past would point one cell beyond the last element of the string.

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first in
all cases.

..

>> That said, a slice would probably have a length which the callee can
>> determine but which the callee cannot change. I presume that's what
>> you'd call a constraint.
>
> It could be a constraint for fixed length slices.
>
>> If a callee wanted to be able to change the length of a string then it
>> would have to be passed a real string, not a slice.
>
> A callee might pass a variable length slice, which, for example, can be
> enlarged or shortened. Many languages with dynamically allocated strings
> have this.

What is your definition of a slice? Is it /part/ of an underlying string
or is it a /copy/ of part of a string? For example, if

string S = "abcde"
slice T = S[1..3] ;"bcd"

then changes to T would do what to S?

If slice is a view of an underlying string (which is what I had in mind)
then I don't get how you could meaningfully enlarge or shorten it.

..

>> I guess there would be these kinds of string argument:
>>
>> 1. Read-write string. Anything could be done to the string by the
>> callee. (Would have to be a real string.)
>>
>> 2. Read-write fixed-length string. The string's contents could be
>> altered but it could not be made longer or shorter. (Could be a real
>> string or a slice.)
>>
>> 3. Read-only string. Neither its length nor it contents could be
>> altered by the callee. (Could be a real string or a slice.)
>
> Think of it in terms of constraints. Immutability is a constraint. Fixed
> length is a constraint. Bounded length is a constraint. Non-sliding
> lower bound is a constraint. Non-sliding upper bound is a constraint.
>
> This should cover all spectrum. You can express all cases in terms of
> constraints.

I presume such constraints would be specified when objects are declared.
As a programmer how would you want to specify such constraints? Would
each have a reserved word, for example?

--
James Harris

Dmitry A. Kazakov

unread,

Oct 30, 2022, 10:20:04 AM10/30/22

to

...
(0..0)
(1..1)
(2..2)
...
(n..n)
...

With pointers it becomes even worse as some of them might point to
invalid addresses.

>>> That said, a slice would probably have a length which the callee can
>>> determine but which the callee cannot change. I presume that's what
>>> you'd call a constraint.
>>
>> It could be a constraint for fixed length slices.
>>
>>> If a callee wanted to be able to change the length of a string then
>>> it would have to be passed a real string, not a slice.
>>
>> A callee might pass a variable length slice, which, for example, can
>> be enlarged or shortened. Many languages with dynamically allocated
>> strings have this.
>
> What is your definition of a slice? Is it /part/ of an underlying string
> or is it a /copy/ of part of a string? For example, if
>
> string S = "abcde"
> slice T = S[1..3] ;"bcd"
>
> then changes to T would do what to S?

No idea. It depends. Is slice in your example an independent object?

But considering this:

declare
S : String := "abcde";
begin
S (1..3) := "x"; -- Illegal in Ada

But should it be legal, then the result would be

"xde"

Many implementations make this illegal because it would require either
bounded or dynamically allocated unbounded string.

You can consider make it legal for these, but then you would have
different semantics of slices for different strings. And this would
contradict the design principle of having all strings interchangeable
regardless the implementation method.

There are contradictions in requirements you as the language designer
has to resolve this or that way.

> If slice is a view of an underlying string (which is what I had in mind)
> then I don't get how you could meaningfully enlarge or shorten it.

It is only your limited understanding of view as immutable and fixed
length. E.g. if you view a house in infrared why should not you be able
to open its door? Infrared googles would not limit you. Infrared photo
of a house would! (:-))

>>> I guess there would be these kinds of string argument:
>>>
>>> 1. Read-write string. Anything could be done to the string by the
>>> callee. (Would have to be a real string.)
>>>
>>> 2. Read-write fixed-length string. The string's contents could be
>>> altered but it could not be made longer or shorter. (Could be a real
>>> string or a slice.)
>>>
>>> 3. Read-only string. Neither its length nor it contents could be
>>> altered by the callee. (Could be a real string or a slice.)
>>
>> Think of it in terms of constraints. Immutability is a constraint.
>> Fixed length is a constraint. Bounded length is a constraint.
>> Non-sliding lower bound is a constraint. Non-sliding upper bound is a
>> constraint.
>>
>> This should cover all spectrum. You can express all cases in terms of
>> constraints.
>
> I presume such constraints would be specified when objects are declared.

Objects and/or subtypes. Depending on the language preferences. Note
also that you can have constrained views of the same object. E.g. you
have a mutable variable passed down as in-argument. That would be an
immutable view of the same object.

> As a programmer how would you want to specify such constraints? Would
> each have a reserved word, for example?

In some cases constraints might be implied. But usually language have
lots of [sub]type modifiers like

in, in out, out, constant
atomic, volatile, shared
aliased (can get pointers to)
external, static
public, private, protected (visibility constraints)
range, length, bounds
parameter AKA discriminant (general purpose constraint)
specific type AKA static/dynamic up/downcast (view as another type)
class-wide (view as a class of types rooted in this one)
...
measurement unit

Bart

unread,

Oct 30, 2022, 10:51:54 AM10/30/22

to

I don't know what these numbers mean. The main problem with 'first' and
'past' is that with an empty string, 'first' doesn't point anywhere, and
'past' ends up pointing to that same place, wherever that is.

I don't like it because that address is meaningless. Except possibly
when refering to an empty slice of an actual string.

>> string S = "abcde"
>> slice T = S[1..3] ;"bcd"
>>
>> then changes to T would do what to S?

Let's try it:

s ::= "abcde" # ::= is needed to make s (and t) mutable
t := s[1..3]
t[2]:="?"

println s # a?cde
println t # a?c

The language doesn't allow an empty slice, say s[1..0], although it
ought to be well-behaved (I think it just expects j>=i in s[i..j].)

> No idea. It depends. Is slice in your example an independent object?
>
> But considering this:
>
> declare
> S : String := "abcde";
> begin
> S (1..3) := "x"; -- Illegal in Ada
>
> But should it be legal, then the result would be
>
> "xde"
>
> Many implementations make this illegal because it would require either
> bounded or dynamically allocated unbounded string.

The language gets to say how this works. In mine it would have to be
like this:

s ::= "abcde" # ::= creates a mutable copy
s[1..3] := "xyz" # Can only insert string of matching length

s ends up as "xyzde"

luserdroog

unread,

Oct 30, 2022, 12:21:46 PM10/30/22

to

I think an exhaustive list of options would be very large if you're not
pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
whole byte for metadata attached to each character.

...

Charles Lindsey

unread,

Oct 30, 2022, 1:01:24 PM10/30/22

to

On 29/10/2022 18:42, James Harris wrote:

> As above, the language is meant to treat strings as arrays. So AISI it should
> not ascribe any particular meaning to their contents.

In that case you should have a look at Algol68, where string are arrays of
characters. Every array has an associated descriptor containing its bounds etc.
But more importantly a REF to an array is best implemented by including the
descriptor in the REF (being a strongly typed language, there is no necessity
for REFs to assorted other types to have the same length - they are not
necessarily just address values). This makes it easy to construct slices
(immutable) and REFs to slices (so the slice is mutable), thus providing many of
the features discussed in this thread. However there is no provision for
extending (or shortening) an array, other than to create a new space to copy it
into; one could provide library routines with smart features to avoid actual
copying in some cases, and with a friendly interface which did not expose the
messiness inside.

--
Charles H. Lindsey ---------At my New Home, still doing my own thing------
Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk
Email: c...@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

James Harris

unread,

Oct 30, 2022, 1:33:21 PM10/30/22

to

On 29/10/2022 22:25, Bart wrote:
> On 29/10/2022 15:01, James Harris wrote:
>> On 27/10/2022 12:14, Bart wrote:
>
>>> Most strings are fixed-length once created; strings that can grow are
>>> rare. You don't need a 'capacity' field for example (like C++'s
>>> Vector type).
>>
>> Having watched some videos on string storage recently I now think I
>> know what you mean by the capacity field - basically that a string
>> descriptor would consist of these fields:
>>
>>    start
>>    length
>>    capacity
>>
>> so that the string could be extended at the end (up to the capacity).
>> That may be a bit restrictive. A programmer might want to remove or
>> add characters at the beginning rather than just at the end, even
>> though such would be done less often.
>
> Doing a prepend is not a problem. What's critical is whether the new
> length is still within the current allocation. (Prepend requires
> shifting of the old string so is less efficient anyway.)

Well, see below.

>
> If a new allocation is needed, you may be copying data for both prepend
> and append.

Yes.

>
> With delete however, you may need to think about whether to /reduce/ the
> allocation size.

Agreed.

>
>> So what do you think of having a string descriptor more like
>>
>>    first
>>    past
>>    memfirst
>>    mempast
>>
>> where memfirst and mempast would define the allocated space in which
>> the string body would sit.
>
> What's the difference between 'first' and 'memfirst'?

memfirst would point to the start of the block in which the string body
existed. (first would point at the same address or a later address.)

> Would you have a
> string that doesn't start at the beginning of its allocated block?

Yes, that would be useful in some cases. For example, if deleting the
first part of a string one wouldn't want to be forced to copy the rest
of it down.

And there are cases where strings may be built by prepending. A classic
example is construction of a network frame. Each layer adds a header
which, naturally, has to go on the front.

--
James Harris

James Harris

unread,

Oct 30, 2022, 1:46:18 PM10/30/22

to

On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
> On 2022-10-30 12:24, James Harris wrote:

..

>> I don't see where you see a multitude of representations of the null
>> string. AISI the empty string would simply have past equal to first in
>> all cases.
>
>    ...
>    (0..0)
>    (1..1)
>    (2..2)
>    ...
>    (n..n)
>    ...
>
> With pointers it becomes even worse as some of them might point to
> invalid addresses.

In the general case strings would live at arbitrary addresses so no
meaning could be inferred from any address. In all cases

past - first

would define the length of the string. If the length was zero then it
would be an empty string.

That said, if the string could be extended then both past and first
would have to point to allocated memory into which the extension could
take place.

..

>> What is your definition of a slice? Is it /part/ of an underlying
>> string or is it a /copy/ of part of a string? For example, if
>>
>> string S = "abcde"
>> slice T = S[1..3] ;"bcd"
>>
>> then changes to T would do what to S?
>
> No idea. It depends. Is slice in your example an independent object?

A slice, at the moment, at least, would be a view of part of a string.
Extending the earlier example,

!_a_!_b_!_c_!_d_!

^ ^
! !

s_first s_past

The string in the example would be "abcd" and the slice, delimited by
s_first and s_past, would be the "bc" in the middle of the string. Note
that the contents of the slice would not include the element at s_past.

The slice would appear to be a string with constraints. By default its
contents could be updated in place but it could not be made longer or
shorter.

A callee which wanted only to read a string (a common case) or to update
a string in place should not have to care whether it was passed a string
or a slice. For such a case, strings and slices would be interchangeable.

>
> But considering this:
>
> declare
> S : String := "abcde";
> begin
> S (1..3) := "x"; -- Illegal in Ada

In Ada would the following be legal?

S (1..3) := "xxx"; --replacement same size as what it is replacing

I'd be happy with that.

>
> But should it be legal, then the result would be
>
> "xde"
>
> Many implementations make this illegal because it would require either
> bounded or dynamically allocated unbounded string.
>
> You can consider make it legal for these, but then you would have
> different semantics of slices for different strings. And this would
> contradict the design principle of having all strings interchangeable
> regardless the implementation method.

I don't mind there being differences along the lines of 'constraints'
where a less-constrained object can be passed to a callee which expects
an object with such constraints or imposes more constraints, but not one
which needs fewer constraints.

..

>> I presume such constraints would be specified when objects are declared.
>
> Objects and/or subtypes. Depending on the language preferences. Note
> also that you can have constrained views of the same object. E.g. you
> have a mutable variable passed down as in-argument. That would be an
> immutable view of the same object.

Yes, and an immutable object could not be passed to a callee which
wanted a mutable object.

>
>> As a programmer how would you want to specify such constraints? Would
>> each have a reserved word, for example?
>
> In some cases constraints might be implied. But usually language have
> lots of [sub]type modifiers like
>
>    in, in out, out, constant
>    atomic, volatile, shared
>    aliased (can get pointers to)
>    external, static
>    public, private, protected (visibility constraints)
>    range, length, bounds
>    parameter AKA discriminant (general purpose constraint)
>    specific type AKA static/dynamic up/downcast (view as another type)
>    class-wide (view as a class of types rooted in this one)
>    ...
>    measurement unit

So you wouldn't have a keyword to indicate a constraint such as
"Non-sliding lower bound" which you mentioned before but IIUC you might
have some qualification of the 'bounds' keyword as in

bounds(^..)

to indicate an unchangeable lower bound (with ^ meaning the start of the
string)?

--
James Harris

James Harris

unread,

Oct 30, 2022, 1:52:43 PM10/30/22

to

On 29/10/2022 22:02, Bart wrote:
> On 29/10/2022 18:42, James Harris wrote:
>> On 29/10/2022 16:30, Bart wrote:

>> Further, functions which /return/ a string would create the string and
>> return it whole.
>
> Not necessarily. My dynamic language can return a string which is a
> slice into another. (Slices are not exposed in this language; they are
> in the static one, where slices are distinct types.)
>
> Example:
>
>     func trim(s) =
>         if s.len=2 then return "" fi
>         return s[2..$-1]
>     end
>
> This trims the first and last character of string. But here it returns a
> slice into the original string. If I wanted a fresh copy, I'd have to
> use copy() inside the function, or copy() (or a special kind of
> assignment) outside it.

That's a challenging example. In a sense it returns either of two
different types: the caller could be handed a string or a slice.

--
James Harris

James Harris

unread,

Oct 30, 2022, 2:13:44 PM10/30/22

to

On 30/10/2022 16:21, luserdroog wrote:
> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>> Do you guys have any thoughts on the best ways for strings of characters
>> to be stored?
>>
>> 1. There's the C way, of course, of reserving one value (zero) and using
>> it as a terminator.
>>
>> 2. There's the 'length prefix' option of putting the length of the
>> string in a machine word before the characters.
>>
>> 3. There's the 'double pointer' way of pointing at, say, first and past
>> (where 'past' means first plus length such that the second pointer
>> points one position beyond the last character).
>>
>> Any others?

..

> I think an exhaustive list of options would be very large if you're not
> pre-judging and filtering as you're adding options.
>
> 4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

>
> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
> can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

>
> 6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
> whole byte for metadata attached to each character.

That's definitely thinking outside the box. I can see it working if the
user (the programmer) wanted a string of 24-bit values but it could be
awkward in other cases such as if he wanted a string of 32-bit or 8-bit
values. I don't think I mentioned it but I'd like the programmer to be
able to choose what the elements of the string would be.

--
James Harris

Dmitry A. Kazakov

unread,

Oct 30, 2022, 2:28:50 PM10/30/22

to

On 2022-10-30 18:46, James Harris wrote:
> On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
>> On 2022-10-30 12:24, James Harris wrote:
>
> ..
>
>>> I don't see where you see a multitude of representations of the null
>>> string. AISI the empty string would simply have past equal to first
>>> in all cases.
>>
>>     ...
>>     (0..0)
>>     (1..1)
>>     (2..2)
>>     ...
>>     (n..n)
>>     ...
>>
>> With pointers it becomes even worse as some of them might point to
>> invalid addresses.
>
> In the general case strings would live at arbitrary addresses so no
> meaning could be inferred from any address.

As I said, that is a problem which will preclude some implementations
and make others inefficient.

And in general using pointers is wasting space as strings are much shorter.

> In all cases
>
> past - first
>
> would define the length of the string.

Not really. Usually negative length is considered equivalent to zero,
e.g. when iterating substrings. Other choices may consider iteration in
reverse, when bounds are inverted.

> If the length was zero then it would be an empty string.

Ergo, a special case to treat in many operations.

>> But considering this:
>>
>> declare
>> S : String := "abcde";
>> begin
>> S (1..3) := "x"; -- Illegal in Ada
>
> In Ada would the following be legal?

Yes, in Ada slice length is constrained as the string length is.

> S (1..3) := "xxx"; --replacement same size as what it is replacing
>
> I'd be happy with that.

It is still not fully defined. You need to consider the issue of sliding
bounds. E.g.

S (2..4) (2) := 'x'; -- Assign a character

Now with sliding:

S (2..4) (2) := 'x' gives "abxde", x is second in the slice

without sliding

S (2..4) (2) := 'x' gives "axcde", x is at 2 in the original string

In Ada the right side slides, the left does not. Sliding the right side
allows doing logical things like:

S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5

>> But should it be legal, then the result would be
>>
>> "xde"
>>
>> Many implementations make this illegal because it would require either
>> bounded or dynamically allocated unbounded string.
>>
>> You can consider make it legal for these, but then you would have
>> different semantics of slices for different strings. And this would
>> contradict the design principle of having all strings interchangeable
>> regardless the implementation method.
>
> I don't mind there being differences along the lines of 'constraints'
> where a less-constrained object can be passed to a callee which expects
> an object with such constraints or imposes more constraints, but not one
> which needs fewer constraints.

No, the problem is with semantics. E.g. let in a subprogram you do

S (1..100) := ""; -- Remove the first 100 characters

S is a formal parameter. Then, depending on the actual parameter's
subtype it may succeed (for a bounded length string) or fail (for a
fixed length string). Such things are big no-no in language design,
because they become a nightmare for programmers to track.

>>> I presume such constraints would be specified when objects are declared.
>>
>> Objects and/or subtypes. Depending on the language preferences. Note
>> also that you can have constrained views of the same object. E.g. you
>> have a mutable variable passed down as in-argument. That would be an
>> immutable view of the same object.
>
> Yes, and an immutable object could not be passed to a callee which
> wanted a mutable object.

Yes, lifting a constraint is not possible in most cases. However,
dynamic cast is a counterexample. You can move the view up the
inheritance tree. But this is frowned upon since it enforces certain
implementations.

>>> As a programmer how would you want to specify such constraints? Would
>>> each have a reserved word, for example?
>>
>> In some cases constraints might be implied. But usually language have
>> lots of [sub]type modifiers like
>>
>>     in, in out, out, constant
>>     atomic, volatile, shared
>>     aliased (can get pointers to)
>>     external, static
>>     public, private, protected (visibility constraints)
>>     range, length, bounds
>>     parameter AKA discriminant (general purpose constraint)
>>     specific type AKA static/dynamic up/downcast (view as another type)
>>     class-wide (view as a class of types rooted in this one)
>>     ...
>>     measurement unit
>
> So you wouldn't have a keyword to indicate a constraint such as
> "Non-sliding lower bound" which you mentioned before but IIUC you might
> have some qualification of the 'bounds' keyword as in
>
> bounds(^..)
>
> to indicate an unchangeable lower bound (with ^ meaning the start of the
> string)?

I am not sure if sliding constraint might be usable. It is a different
issue to constraining bounds because it involves operations like
assignment. And it is not clear how to implement such a constraint
effectively. Most constraints are either static (compile time), or
simple to represent, like bounds or type tags. Sliding might be
implemented as a flag, but then you will have to check it all the time.
Maybe not worth having it as a choice. And it is unclear what is the
unconstrained state, sliding or non-sliding? (:-))

Bart

unread,

Oct 30, 2022, 6:37:42 PM10/30/22

to

In this language, it only has a String type, not a Slice. Slicing is an
operation you apply on strings to yield another String object.

(Internally, it has to distinguish between owned strings and slices into
strings owned by other objects, but as I said that aspect is not exposed.)

luserdroog

unread,

Oct 30, 2022, 9:03:28 PM10/30/22

to

On Sunday, October 30, 2022 at 1:13:44 PM UTC-5, James Harris wrote:
> On 30/10/2022 16:21, luserdroog wrote:
> > On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
> >> Do you guys have any thoughts on the best ways for strings of characters
> >> to be stored?
> >>
> >> 1. There's the C way, of course, of reserving one value (zero) and using
> >> it as a terminator.
> >>
> >> 2. There's the 'length prefix' option of putting the length of the
> >> string in a machine word before the characters.
> >>
> >> 3. There's the 'double pointer' way of pointing at, say, first and past
> >> (where 'past' means first plus length such that the second pointer
> >> points one position beyond the last character).
> >>
> >> Any others?
> ..
> > I think an exhaustive list of options would be very large if you're not
> > pre-judging and filtering as you're adding options.
> >
> > 4) [List|Array|Tuple|Iterator] of character objects
> You mean where the characters are stored individually (one per node)?

Yep. Fat characters, or whatever other scaffolding helps for the operations
you want to support.

> > 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
> > can be used to format the data to squeeze it into 7 bits.
> Interesting idea. It's certainly one I hadn't thought of.

It's used in the Classical Forth Dictionary header for the name field which
can be variable length. Often it's followed by a length byte and you store
the pointer to the length byte, using

(length - *length)

to actually get the pointer to the start.

> > 6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
> > whole byte for metadata attached to each character.
> That's definitely thinking outside the box. I can see it working if the
> user (the programmer) wanted a string of 24-bit values but it could be
> awkward in other cases such as if he wanted a string of 32-bit or 8-bit
> values. I don't think I mentioned it but I'd like the programmer to be
> able to choose what the elements of the string would be.

This is what I was using in my unfinished APL-like language. The 8 bits
of tag meant it was easy for a node to be a character or an integer (25 bit)
or a nested subarray or whatever ... mpfp number... symbol table.

So I didn't really need a string type *per se* because there's an array type
whose data elements are these 32bits of encoded whatevs. A string is
just a 1D array, or maybe an array of arrays or a 2D array padded out with
spaces on each line. You'd read it in or receive it initially as a 1D array probably
from a file. Oh, an element could also be a file.

David Brown

unread,

Oct 31, 2022, 12:58:11 PM10/31/22

to

On 30/10/2022 19:13, James Harris wrote:
> On 30/10/2022 16:21, luserdroog wrote:
>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>>> Do you guys have any thoughts on the best ways for strings of characters
>>> to be stored?
>>>
>>> 1. There's the C way, of course, of reserving one value (zero) and using
>>> it as a terminator.
>>>
>>> 2. There's the 'length prefix' option of putting the length of the
>>> string in a machine word before the characters.
>>>
>>> 3. There's the 'double pointer' way of pointing at, say, first and past
>>> (where 'past' means first plus length such that the second pointer
>>> points one position beyond the last character).
>>>
>>> Any others?
>
> ..
>
>> I think an exhaustive list of options would be very large if you're not
>> pre-judging and filtering as you're adding options.
>>
>> 4) [List|Array|Tuple|Iterator] of character objects
>
> You mean where the characters are stored individually (one per node)?
>
>>
>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>> can be used to format the data to squeeze it into 7 bits.
>
> Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient, as
well as being inconsistent with everything else.

>
>>
>> 6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
>> whole byte for metadata attached to each character.
>
> That's definitely thinking outside the box. I can see it working if the
> user (the programmer) wanted a string of 24-bit values but it could be
> awkward in other cases such as if he wanted a string of 32-bit or 8-bit
> values. I don't think I mentioned it but I'd like the programmer to be
> able to choose what the elements of the string would be.
>

UCS4 is 31 bit, not 24 bit. Perhaps luserdroog is mixing it up with
UTF-32, which can be covered by 21 bits. (The extra 10 bits in UCS4
have never been anything but 0, but if you want to refer to a long
out-dated and obsolete standard, it should still be done so accurately.)

Do not make a new string or character storage system based around
anything obsolete - that includes UTF-7 and UCS4. Making lots of extra
work for yourself to support something that everyone else rejected as
complicated, unnecessary and unused decades ago, is just silly.

There are only two character encodings that make any kind of sense in a
modern language (i.e., anything from this century). UTF-8 and
/possibly/ UTF-32 for internal usage, if you find it more convenient for
the operations you want.

If you are using UTF-32 only for internal usage within the language, and
not for export (external function calls, file IO, etc.), then the high
byte is always unused - and therefore free for metadata if that's what
you want. I'm not convinced it is a good idea, but it's certainly possible.

For any kind of interaction with anything else, UTF-8 is the standard.
There are a few other encodings that haven't died off completely yet,
but they are all on the way out.

I would also recommend treating characters and character strings as
something very different from raw bytes and binary blobs. Users want to
do very different things with them, and many of the useful operations
are completely different. Some languages have made the mistake of
conflating the two concepts - it's difficult to fix once that design
flaw is set into a language.

Bart

unread,

Oct 31, 2022, 1:31:29 PM10/31/22

to

It's a perfectly fine idea - for the 1970s.

(Now the 8th bit is better put to use to represent UTF8.)

David Brown

unread,

Oct 31, 2022, 3:34:18 PM10/31/22

to

Indeed.

UTF-7 was invented in attempt to make an encoding for Unicode that would
work for email, since some email servers did not handle 8-bit characters
correctly, long ago in the dark ages. It was never formally a Unicode
encoding, and almost never used in practice.

Using 7-bit characters and the eighth bit for a terminator would be
pretty inefficient - 12.5% wasted space to get a single bit of useful
information per string. Not a good trade-off!

James Harris

unread,

Nov 3, 2022, 11:48:22 AM11/3/22

to

On 31/10/2022 16:58, David Brown wrote:

> On 30/10/2022 19:13, James Harris wrote:
>> On 30/10/2022 16:21, luserdroog wrote:
>>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

>>>> Do you guys have any thoughts on the best ways for strings of
>>>> characters
>>>> to be stored?

..

>>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>>> can be used to format the data to squeeze it into 7 bits.
>>
>> Interesting idea. It's certainly one I hadn't thought of.
>
> Nor should you - that is a crazy idea. It is massively inefficient, as
> well as being inconsistent with everything else.

The model I have chosen (at least, for now) is to have a string indexed
logically from zero (so indices do not need to be stored) and, for
implementation, delimited by two pointers.

The one downside I am aware of is that it will, at times, require
creation and destruction of a small descriptor. I'll have to see how the
approach works out in practice.

..

> I would also recommend treating characters and character strings as
> something very different from raw bytes and binary blobs. Users want to
> do very different things with them, and many of the useful operations
> are completely different. Some languages have made the mistake of
> conflating the two concepts - it's difficult to fix once that design
> flaw is set into a language.

That sounds interesting but I cannot tell what you have in mind.

One could consider strings as having two categories of operation: those
which involve only the memory used by strings such as allocation,
concatenation, insertion, deletion, etc; and those which care about the
contents of a string such as capitalisation, comparison, whitespace
recognition, parsing, etc. Why could the mechanics not apply to raw
bytes and blobs?

--
James Harris

James Harris

unread,

Nov 3, 2022, 11:57:11 AM11/3/22

to

On 30/10/2022 18:28, Dmitry A. Kazakov wrote:
> On 2022-10-30 18:46, James Harris wrote:

..

>> In Ada would the following be legal?
>
> Yes, in Ada slice length is constrained as the string length is.
>
>> S (1..3) := "xxx"; --replacement same size as what it is replacing
>>
>> I'd be happy with that.
>
> It is still not fully defined. You need to consider the issue of sliding
> bounds. E.g.
>
> S (2..4) (2) := 'x'; -- Assign a character
>
> Now with sliding:
>
> S (2..4) (2) := 'x' gives "abxde", x is second in the slice
>
> without sliding
>
> S (2..4) (2) := 'x' gives "axcde", x is at 2 in the original string

AISI the elements of strings and slices would always be accessed by
offset. That appears to be the 'sliding' model you mention.

>
> In Ada the right side slides, the left does not. Sliding the right side
> allows doing logical things like:
>
> S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5

I don't see any sliding there, only that chars 5 to 9 are copied over
chars 1 to 5 of the same string.

..

>
> I am not sure if sliding constraint might be usable. It is a different
> issue to constraining bounds because it involves operations like
> assignment. And it is not clear how to implement such a constraint
> effectively. Most constraints are either static (compile time), or
> simple to represent, like bounds or type tags. Sliding might be
> implemented as a flag, but then you will have to check it all the time.
> Maybe not worth having it as a choice. And it is unclear what is the
> unconstrained state, sliding or non-sliding? (:-))
>

Always sliding (if I understand the term)!

--
James Harris

James Harris

unread,

Nov 3, 2022, 12:10:40 PM11/3/22

to

As most uses of slices would be read-only but some would be read-write,
and there are various potential ways to implement a slice, it might be
sensible for me to do something like:

1. Have the default slice as read-only and the simplest to construct. If
s is a string or another slice then a slice of part of that string could
be constructed as

s(1..4)

2. Use keywords to effect other, more specialised, operations. For example,

s.slice_cow(1..4) ;a copy-on-write slice
s.slice_rw(1..4) ;a slice via which the base string can be changed
s.copy(1..4) ;copy the designated chars to a new string

--
James Harris

David Brown

unread,

Nov 4, 2022, 9:58:44 AM11/4/22

to

On 03/11/2022 16:48, James Harris wrote:
> On 31/10/2022 16:58, David Brown wrote:
>> On 30/10/2022 19:13, James Harris wrote:
>>> On 30/10/2022 16:21, luserdroog wrote:
>>>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>
>>>>> Do you guys have any thoughts on the best ways for strings of
>>>>> characters
>>>>> to be stored?
>
> ..
>
>>>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>>>> can be used to format the data to squeeze it into 7 bits.
>>>
>>> Interesting idea. It's certainly one I hadn't thought of.
>>
>> Nor should you - that is a crazy idea. It is massively inefficient,
>> as well as being inconsistent with everything else.
>
> The model I have chosen (at least, for now) is to have a string indexed
> logically from zero (so indices do not need to be stored) and, for
> implementation, delimited by two pointers.
>
> The one downside I am aware of is that it will, at times, require
> creation and destruction of a small descriptor. I'll have to see how the
> approach works out in practice.

There is no universally ideal way to store a string - every method is
inefficient for some uses and operations. You just have to find
something that is good enough for typical uses in your language, add
support for connecting to external code (such as exporting C-style
strings), and make it possible for users to implement other types when
it makes more sense for the user code.

>
> ..
>
>> I would also recommend treating characters and character strings as
>> something very different from raw bytes and binary blobs. Users want
>> to do very different things with them, and many of the useful
>> operations are completely different. Some languages have made the
>> mistake of conflating the two concepts - it's difficult to fix once
>> that design flaw is set into a language.
>
> That sounds interesting but I cannot tell what you have in mind.
>

I mean you should consider a "string" to be a way of holding a sequence
of "character" units which can hold a code unit of UTF-8 (since any
other choice of character encoding is madness).

This should be different from a "byte", which would be an 8-bit unit of
raw memory (ignore the existence of machines that can't address 8-bit
memory units directly - they will never use your language). Arrays of
this type should always be contiguous, and used for raw memory access.
(You may also want larger types - memory16, memory32, memory64, etc., if
that is convenient for efficient usage.)

So when you read a block of data from a file, or send a block to a
network socket, it is an array of bytes - not a string. (You can have
high-level abstractions for a "text file" wrapper that can read and
write strings, but that's not fundamental.) If you have an equivalent
of C's "memcpy" function it should use bytes, not any kind of character
type. If you have something like C's "type-based aliasing rules", then
it is bytes that should have the special exception, not characters.

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or conversion
operations on them.

The concept of "signed char" and "unsigned char" in C is a serious
design flaw. A type designed to hold letters should not have a sign,
and should not be used to hold arbitrary raw, low-level data.

You might also consider not having a character type at all. Python 3
has no character types - "a" is a string, not a character.

> One could consider strings as having two categories of operation: those
> which involve only the memory used by strings such as allocation,
> concatenation, insertion, deletion, etc; and those which care about the
> contents of a string such as capitalisation, comparison, whitespace
> recognition, parsing, etc. Why could the mechanics not apply to raw
> bytes and blobs?
>

Operations on strings are those that are relevant to strings. It makes
no sense to capitalise a raw binary blob. It makes no sense to have
methods for linking or chaining sets of blobs - these are direct handles
into memory, and chaining, allocation, etc., are higher level operations.

Raw binary buffers require nothing more than an address and a size to
describe them - anything more, and it is too high level. (Again,
there's nothing wrong with providing higher level features and
interfaces, but they have to build on the fundamental ones.)

Bart

unread,

Nov 4, 2022, 1:51:02 PM11/4/22

to

On 04/11/2022 13:58, David Brown wrote:

> Neither "byte" nor "character" should have any kind of arithmetic
> operators - they are not integers. But you will need cast or conversion
> operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

> The concept of "signed char" and "unsigned char" in C is a serious
> design flaw. A type designed to hold letters should not have a sign,
> and should not be used to hold arbitrary raw, low-level data.

Signed and unsigned chars are not so bad; presumably C intended these to
do the job of a 'byte' type for small integers. So it was just a poor
choice of name. (After all there is no separate type in C for bytes
holding character data.)

What's bad is that third kind: a 'plain char' type, which is
incompatible with both signed and unsigned char, even though it
necessarily needs to be one of the other on a specific platform. It
occurs in no other language, and causes problems within FFI APIs.

James Harris

unread,

Nov 4, 2022, 5:35:48 PM11/4/22

to

On 04/11/2022 17:50, Bart wrote:
> On 04/11/2022 13:58, David Brown wrote:
>
>> Neither "byte" nor "character" should have any kind of arithmetic
>> operators - they are not integers. But you will need cast or
>> conversion operations on them.
>
> Bytes are small integers, typically of u8 type.
>
> I can't see why arithmetic can't be done with them, unless you want a
> purer kind of language where arithmetic is only allowed on signed
> numbers, and bitwise ops only on unsigned numbers, which is usually
> going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to
characters (even though some languages permit such operations). For
example, neither

'a' * 5

nor even

'R' + 1

have any meaning over the set of characters. Prohibiting arithmetic on
them could be dome but would make classifying and manipulating
characters difficult unless one had a comprehensive set of library
functions such as

is_digit(char)
is_alphanum(locale, char)
is_lower(locale, char)
upper(locale, char)

and many more.

--
James Harris

Bart

unread,

Nov 4, 2022, 6:28:01 PM11/4/22

to

On 04/11/2022 21:35, James Harris wrote:
> On 04/11/2022 17:50, Bart wrote:
>> On 04/11/2022 13:58, David Brown wrote:
>>
>>> Neither "byte" nor "character" should have any kind of arithmetic
>>> operators - they are not integers. But you will need cast or
>>> conversion operations on them.
>>
>> Bytes are small integers, typically of u8 type.
>>
>> I can't see why arithmetic can't be done with them, unless you want a
>> purer kind of language where arithmetic is only allowed on signed
>> numbers, and bitwise ops only on unsigned numbers, which is usually
>> going to be a pain for all concerned.
>
> I think what David means is that arithmetic operations don't apply to
> characters

I was picking on the 'byte' type; it seems extraordinary that you
shouldn't be allowed to do arithmetic with them. If you can initialise a
byte value with a number like this:

byte a = 123

then it's a number!

> (even though some languages permit such operations). For
> example, neither
>
> 'a' * 5
>
> nor even
>
> 'R' + 1
>
> have any meaning over the set of characters.

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet? Why should code like this be made illegal:

a := a * 10 + (c - '0')

Then I realised I shouldn't be telling the programmer what they can and
can't do with characters, as there might be some perfectly valid
use-case that I simply hadn't thought of.

Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
some kind on encryption algorithm.

So now they are treated like integers, other than printing an array of
char or pointer to char assumes they are strings.

> Prohibiting arithmetic on
> them could be dome but would make classifying and manipulating
> characters difficult unless one had a comprehensive set of library
> functions such as
>
> is_digit(char)
> is_alphanum(locale, char)
> is_lower(locale, char)
> upper(locale, char)
>
> and many more.

As I said, you and I don't know all the possibilites. Of course there
would need to be conversions between char and int, but this can become a
nuisance.

James Harris

unread,

Nov 5, 2022, 6:36:35 AM11/5/22

to

That's because 'R' + 1 may not be the next character in all alphabets.
Defining 'next' is more than difficult. It depends on intended collation
order which varies in different parts of the world and can even change
over time as authorities choose different collation orders. Some
plausible meanings of 'R' + 1:

'S' (as in ASCII)
'r' (what the user may want as sort order)
's' (what the user may want as sort order)
a non-character (as in EBCDIC)

Perhaps a pseudo-call would be better such as

char_plus(collation, 'R', 1)

where 'collation' would be used to determine what was the specified
number of characters away from 'R'.

> Why should code like this be made illegal:
>
> a := a * 10 + (c - '0')

I'm not saying it should. I am in two minds about what's best. The
alternative is something like

a := a * 10 + digit_value(c)

>
> Then I realised I shouldn't be telling the programmer what they can and
> can't do with characters, as there might be some perfectly valid
> use-case that I simply hadn't thought of.

Agreed. Although the point of prohibiting arithmetic on characters is to
make multilingual programming easier, not harder.

..

>> Prohibiting arithmetic on them could be dome but would make
>> classifying and manipulating characters difficult unless one had a
>> comprehensive set of library functions such as
>>
>>    is_digit(char)
>>    is_alphanum(locale, char)
>>    is_lower(locale, char)
>>    upper(locale, char)
>>
>> and many more.
>
> As I said, you and I don't know all the possibilites.

Yes, the challenge of making multilingual programming easier is
providing a comprehensive and convenient set of conversions.

> Of course there
> would need to be conversions between char and int, but this can become a
> nuisance.

Aside from converting digits (and any other characters used as digits in
a higher number base) is there's any meaning to converting chars to/from
ints?

--
James Harris

James Harris

unread,

Nov 5, 2022, 7:14:03 AM11/5/22

to

On 04/11/2022 13:58, David Brown wrote:
> On 03/11/2022 16:48, James Harris wrote:
>> On 31/10/2022 16:58, David Brown wrote:

..

>>> I would also recommend treating characters and character strings as
>>> something very different from raw bytes and binary blobs. Users want
>>> to do very different things with them, and many of the useful
>>> operations are completely different. Some languages have made the
>>> mistake of conflating the two concepts - it's difficult to fix once
>>> that design flaw is set into a language.
>>
>> That sounds interesting but I cannot tell what you have in mind.
>>
>
> I mean you should consider a "string" to be a way of holding a sequence
> of "character" units which can hold a code unit of UTF-8 (since any
> other choice of character encoding is madness).

We've discussed before that (IMO) Unicode is useful for physical
printing to paper or electronic rendering such as to PDF but that it's a
nightmare for programmers and users when it is used for any kind of
input so I won't go over that again except to say that AISI Unicode
should be handled by library functions rather than a language.

What I do have in mind is strings of 'containers' where a string might
be declared as of type

string of char8 -- meaning a string of char8 containers
string of char32 -- meaning a string of char32 containers

What goes in each 8-bit or 32-bit 'container' would be another matter.

That agnostic ideal is somewhat in tension with the desire to include
string literals in a program text. For that, as I've mentioned before,
my preference is to have the program text and any literals within it
written in ASCII and American English; supplementary files would express
the string literals in other languages.

For example,

print "Hello world"

would be accompanied by a file for French which included

"Hello world" --> "Bonjour le monde"

Naturally, multilingual programming is much more complex than that
simple example but it shows the basic idea. The compiler would be able
to check that language files had everything required for a given piece
of source code.

..

> Neither "byte" nor "character" should have any kind of arithmetic
> operators - they are not integers. But you will need cast or conversion
> operations on them.

The char8 and char32 containers could omit support for arithmetic **if**
enough support routines were provided. But, as Bart says, it's difficult
to anticipate all such support routines that programmers might need.

>
> The concept of "signed char" and "unsigned char" in C is a serious
> design flaw. A type designed to hold letters should not have a sign,
> and should not be used to hold arbitrary raw, low-level data.

OK.

> You might also consider not having a character type at all. Python 3
> has no character types - "a" is a string, not a character.

I can see that as being possible. I keep coming across examples of where
a 1-character string would do just as well as a character. ATM, though,
I have a separate character type.

..

> Raw binary buffers require nothing more than an address and a size to
> describe them - anything more, and it is too high level. (Again,
> there's nothing wrong with providing higher level features and
> interfaces, but they have to build on the fundamental ones.)

Maybe that's where I am going with this. What you describe does sound
rather like my idea of using chars as containers and putting char and
string handling in libraries.

Incidentally, AISI I could have strings of any data type, not just
chars. For example,

string of int16
string of struct {x: float, y: float}
string of function(int, bool) -> uint

I'll have to see how it works out in practice but the idea is to
separate the concept of a string (basically, storage layout) from the
concept of whatever type of element the string is made from.

--
James Harris

Bart

unread,

Nov 5, 2022, 9:38:19 AM11/5/22

to

On 05/11/2022 10:36, James Harris wrote:

> On 04/11/2022 22:28, Bart wrote:

>> I actually had such a restriction for a while: char*5 wasn't allowed,
>> but char+1 was. After all why on earth shouldn't you want the next
>> character in that alphabet?
>
> That's because 'R' + 1 may not be the next character in all alphabets.
> Defining 'next' is more than difficult. It depends on intended collation
> order which varies in different parts of the world and can even change
> over time as authorities choose different collation orders. Some
> plausible meanings of 'R' + 1:
>
> 'S' (as in ASCII)

You said elsewhere that you want to use ASCII within programs. Which is
it happens, corresponds to the first 128 points of Unicode. Here:

char c
for c in 'A'..'Z' do
print c
od

this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
+1 to 'c'; it doesn't care about collating order!

(This also illustrates the difference between `byte` and `char`; using a
byte type, the output would be '656667...90'.)

> 'r' (what the user may want as sort order)
> 's' (what the user may want as sort order)
> a non-character (as in EBCDIC)

My feeling is that it is these diverse requirements that require
user-supplied functions.

My 'char' is still a thinly veiled numeric type, so ordinary integer
arithmatic can be used. Otherwise even something like this becomes
impossible:

['A'..'Z']int histogram

midpoint := (histogram.upb - histogram.lwb)/2

++histogram[midpoint+1]

This requires certain properties of array indicates, like being able to
do arithmetic, as well as being consecutive ordinal values.

> Perhaps a pseudo-call would be better such as
>
> char_plus(collation, 'R', 1)
>
> where 'collation' would be used to determine what was the specified
> number of characters away from 'R'.

Sure, as I said, you can provide any interpretation you like. But if you
do C+1, you expect to get the code of the next character (or next
codepoint if venturing outside ASCII).

>
> Aside from converting digits (and any other characters used as digits in
> a higher number base) is there's any meaning to converting chars to/from
> ints?

My static language makes byte and char slightly different types. (Types
involving char may get printed differently.)

That meant that `ref byte` and `ref char` were incompatible, which
rapidly turned into a nightmare: I might have a readfile() routine that
returned a `ref byte` type, a pointer to a block of memory.

But I wanted to interpret that block as `ref char` - a string. So this
meant loads of casts to either `ref byte` or `ref char` to get things to
work, but it got too much (a bit like 'const poisoning' in C where it
just propagates everywhere). That was clearly the wrong approach.

In the end I relaxed the type rules so that `ref byte` and `ref char`
are compatible, and everything is now SO much simpler.

Bart

unread,

Nov 5, 2022, 9:46:34 AM11/5/22

to

Is it? This pretty much all I did when I used to write internationalised
applications. Although that was only done for French, German and Dutch.

But that print example would be written like this:

print /"Hello World"

The "/" was a translation operator, so only certain strings were
translated. This also made it easy to scan source code to build a list
of messages, used to maintain the dictionary as entries were added,
deleted or modified.

The scheme did need some hints sometimes, written like this, to get
around ambiguities:

print /"Green!colour"
print /"Green!fresh"

The hint was usually filtered out.

But this is little to do with how strings are represented. Even in
English, messages may include characters like "£" (pound sign) which is
not part of ASCII.

So a way to represent Unicode within literals is still needed (didn't we
discuss this a couple of years ago?).

James Harris

unread,

Nov 5, 2022, 12:07:32 PM11/5/22

to

On 05/11/2022 13:38, Bart wrote:
> On 05/11/2022 10:36, James Harris wrote:
>> On 04/11/2022 22:28, Bart wrote:
>
>>> I actually had such a restriction for a while: char*5 wasn't allowed,
>>> but char+1 was. After all why on earth shouldn't you want the next
>>> character in that alphabet?
>>
>> That's because 'R' + 1 may not be the next character in all alphabets.
>> Defining 'next' is more than difficult. It depends on intended
>> collation order which varies in different parts of the world and can
>> even change over time as authorities choose different collation
>> orders. Some plausible meanings of 'R' + 1:
>>
>>    'S' (as in ASCII)
>
> You said elsewhere that you want to use ASCII within programs. Which is
> it happens, corresponds to the first 128 points of Unicode. Here:
>
>     char c
>     for c in 'A'..'Z' do
>         print c
>     od
>
> this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
> +1 to 'c'; it doesn't care about collating order!

That piece of code is fine for an English-speaking user but to a
Spaniard the alphabet has a character missing, and Greeks wouldn't agree
with it at all.

Where L is the locale why not allow something like

for c in L.alpha_first..L.alpha_last
print c
od

?

That should work in English or Spanish or Greek etc, shouldn't it?

>
> (This also illustrates the difference between `byte` and `char`; using a
> byte type, the output would be '656667...90'.)
>
>
>
>>    'r' (what the user may want as sort order)
>>    's' (what the user may want as sort order)
>>    a non-character (as in EBCDIC)
>
> My feeling is that it is these diverse requirements that require
> user-supplied functions.

Functions, yes, (or what appear to be functions) though surely they
should be part of a library that comes with the language.

>
> My 'char' is still a thinly veiled numeric type, so ordinary integer
> arithmatic can be used. Otherwise even something like this becomes
> impossible:
>
>     ['A'..'Z']int histogram
>
>     midpoint := (histogram.upb - histogram.lwb)/2
>
>     ++histogram[midpoint+1]
>
> This requires certain properties of array indicates, like being able to
> do arithmetic, as well as being consecutive ordinal values.

Why not

[L.alpha_first..L.alpha_last] int histogram

?

As for the calculations what about using L.ord and L.chr to convert
between chars and integers?

>
>> Perhaps a pseudo-call would be better such as
>>
>> char_plus(collation, 'R', 1)
>>
>> where 'collation' would be used to determine what was the specified
>> number of characters away from 'R'.
>
> Sure, as I said, you can provide any interpretation you like. But if you
> do C+1, you expect to get the code of the next character (or next
> codepoint if venturing outside ASCII).

If you use codepoints then you might not get the next character in
sequence - as in the case of 'R' in ebcdic (you'd get a non-printing
character) or 'N' in Spanish (you'd get 'O' rather than the N with a hat
that a Spaniard would expect).

If the programmer wants "the next character in the alphabet" then
shouldn't the programming language or a standard library help him get
that irrespective of the human language the program is meant to be
processing?

>
>
>>
>> Aside from converting digits (and any other characters used as digits
>> in a higher number base) is there's any meaning to converting chars
>> to/from ints?
>
>
> My static language makes byte and char slightly different types. (Types
> involving char may get printed differently.)
>
> That meant that `ref byte` and `ref char` were incompatible, which
> rapidly turned into a nightmare: I might have a readfile() routine that
> returned a `ref byte` type, a pointer to a block of memory.
>
> But I wanted to interpret that block as `ref char` - a string. So this
> meant loads of casts to either `ref byte` or `ref char` to get things to
> work, but it got too much (a bit like 'const poisoning' in C where it
> just propagates everywhere). That was clearly the wrong approach.

C's const propagation sounds like Java with its horrible, and sticky,
exception propagation.

>
> In the end I relaxed the type rules so that `ref byte` and `ref char`
> are compatible, and everything is now SO much simpler.

Would there have been any value in defining a layout for the untyped
area of bytes (or parts thereof)? That's where I think I am headed.

--
James Harris

James Harris

unread,

Nov 5, 2022, 12:18:51 PM11/5/22

to

On 05/11/2022 13:46, Bart wrote:
> On 05/11/2022 11:14, James Harris wrote:

..

>> For example,
>>
>> print "Hello world"
>>
>> would be accompanied by a file for French which included
>>
>> "Hello world" --> "Bonjour le monde"
>>
>> Naturally, multilingual programming is much more complex than that
>> simple example but it shows the basic idea. The compiler would be able
>> to check that language files had everything required for a given piece
>> of source code.
>
> Is it? This pretty much all I did when I used to write internationalised
> applications. Although that was only done for French, German and Dutch.

I imagine multilingual programming would be very difficult and that it's
something a language should help with but it's not something I have had
to do as yet.

>
> But that print example would be written like this:
>
>     print /"Hello World"
>
> The "/" was a translation operator, so only certain strings were
> translated. This also made it easy to scan source code to build a list
> of messages, used to maintain the dictionary as entries were added,
> deleted or modified.
>
> The scheme did need some hints sometimes, written like this, to get
> around ambiguities:
>
>     print /"Green!colour"
>     print /"Green!fresh"

That's cool. I have something similar which is trailing identifiers but
I had them down as specific rather than hints. For example,

print "Green" :GreenColor
print "Green" :GreenFresh

Then the accompanying language files could distinguish between the strings.

>
> The hint was usually filtered out.
>
> But this is little to do with how strings are represented. Even in
> English, messages may include characters like "£" (pound sign) which is
> not part of ASCII.

I have two ways to deal with that. If the program is to use a pound sign
in England but a Dollar sign in America, say, then the source would have
a dollar sign and there'd be a file to translate for use in England.

On the other hand, if the program was to use a pound sign in all cases
then I'd do as below.

>
> So a way to represent Unicode within literals is still needed (didn't we
> discuss this a couple of years ago?).
>

Yes. You may remember my preference was for a named character something like

pound_currency_string = "\PoundSterling/"

--
James Harris

David Brown

unread,

Nov 5, 2022, 12:42:01 PM11/5/22

to

On 04/11/2022 18:50, Bart wrote:
> On 04/11/2022 13:58, David Brown wrote:
>
>> Neither "byte" nor "character" should have any kind of arithmetic
>> operators - they are not integers. But you will need cast or
>> conversion operations on them.
>
> Bytes are small integers, typically of u8 type.

That's true in some languages. In other languages they are character
types. (In C, they are both.)

My suggestion is that they should not be considered integers or
characters, but a low-level "raw data" type. A "byte" should represent
a byte of memory, or an octet in a network packet, or a byte of a file.
It doesn't make sense to do any arithmetic on it because it might be
part of some completely different data type.

A given "byte" might be part of the storage of a floating point number,
or the storage of an address or pointer, or anything else.

I've said this before - the strength of a programming language is mainly
determined by what you /cannot/ do, rather than what you /can/ do.
Design a language to make it as hard as possible to get things wrong,
while still being easy to do the things you want it to do. Keeping
integer types, character types, strings, and raw data independent can
help with that.

>
> I can't see why arithmetic can't be done with them, unless you want a
> purer kind of language where arithmetic is only allowed on signed
> numbers, and bitwise ops only on unsigned numbers, which is usually
> going to be a pain for all concerned.
>

Whether you have signed and unsigned integer types, what sizes you have,
whether you have an abstract "integer" type and explicitly ranged
subtypes (as in Ada) is another matter.

>> The concept of "signed char" and "unsigned char" in C is a serious
>> design flaw. A type designed to hold letters should not have a sign,
>> and should not be used to hold arbitrary raw, low-level data.
>
> Signed and unsigned chars are not so bad;

Yes, they are bad. A "character" is a letter or other visual symbol.
Can you explain the difference between "a positive letter X" and "a
negative letter X" ? Of course you can't - it is utter nonsense. The
same goes for adding 6 to the Greek letter µ, or multiplying Ð by Å.
Even operations that might appear sensible in code, such as adding 1 to
a char, don't actually make sense - it is the operation of taking the
next letter in the alphabet that makes sense.

> presumably C intended these to
> do the job of a 'byte' type for small integers. So it was just a poor
> choice of name. (After all there is no separate type in C for bytes
> holding character data.)
>
> What's bad is that third kind: a 'plain char' type, which is
> incompatible with both signed and unsigned char, even though it
> necessarily needs to be one of the other on a specific platform. It
> occurs in no other language, and causes problems within FFI APIs.
>

Certainly having three different "char" types in C is bad. But the only
sensible choice is to have nothing but a plain "char" and use /integer/
types for numeric data. (Call them "u8" and "i8" if you prefer, rather
than the C names "uint8_t" and "int8_t".)

(These days I would consider not having any kind of character type at
all, unless it was a language targeting small embedded systems that need
maximal efficiency - by the time you have something that can hold any
UTF-8 character, you might as well just call it a "string".)

David Brown

unread,

Nov 5, 2022, 12:51:03 PM11/5/22

to

Yes.

You will, of course, need some kind of explicit conversion between
strings/characters and integers or bytes. But if you want operations
like "is_digit" or "upper" for more than plain ASCII, you need a
comprehensive library. (As a fine example, the uppercase of "i" in
English is the letter "I", while in Turkish it is the letter "İ".)

There are plenty of internationalisation libraries available - you
should be able to find something suitable (perhaps
<https://icu.unicode.org>) and make wrappers and interfaces to your new
language.

David Brown

unread,

Nov 5, 2022, 1:01:18 PM11/5/22

to

On 04/11/2022 23:28, Bart wrote:
> On 04/11/2022 21:35, James Harris wrote:
>> On 04/11/2022 17:50, Bart wrote:
>>> On 04/11/2022 13:58, David Brown wrote:
>>>
>>>> Neither "byte" nor "character" should have any kind of arithmetic
>>>> operators - they are not integers. But you will need cast or
>>>> conversion operations on them.
>>>
>>> Bytes are small integers, typically of u8 type.
>>>
>>> I can't see why arithmetic can't be done with them, unless you want a
>>> purer kind of language where arithmetic is only allowed on signed
>>> numbers, and bitwise ops only on unsigned numbers, which is usually
>>> going to be a pain for all concerned.
>>
>> I think what David means is that arithmetic operations don't apply to
>> characters
>
> I was picking on the 'byte' type; it seems extraordinary that you
> shouldn't be allowed to do arithmetic with them. If you can initialise a
> byte value with a number like this:
>
> byte a = 123
>
> then it's a number!
>

If I were making the language, then you could not do such an
initialisation. A "byte" is raw data, not a number.

>> (even though some languages permit such operations). For example, neither
>>
>>    'a' * 5
>>
>> nor even
>>
>>    'R' + 1
>>
>> have any meaning over the set of characters.
>
> I actually had such a restriction for a while: char*5 wasn't allowed,
> but char+1 was. After all why on earth shouldn't you want the next
> character in that alphabet? Why should code like this be made illegal:
>
>     a := a * 10 + (c - '0')

Why not :

char a; // Use whatever syntax you prefer
int i; // and whatever type names you prefer

a = digit(i);

The function "digit" might be defined :

char digit(int i) {
return char(i + ord('0'));
}

You want to find the next letter after "x"? "char(ord(x) + 1)". Or
perhaps, like Pascal and Ada, "succ(x)".

A language has to let you do what you need to do - but you should be
required to write it /clearly/. It should not let you mix apples and
oranges without you saying exactly how you think apples and oranges
should be mixed in the given situation.

>
> Then I realised I shouldn't be telling the programmer what they can and
> can't do with characters, as there might be some perfectly valid
> use-case that I simply hadn't thought of.
>
> Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
> some kind on encryption algorithm.

You've hit the nail on the head. What does it mean to write "'a' * 5" ?
To some people it means one thing, to others it means something
different, and to most people in most circumstances it is meaningless.
Meaningless code should be a compile-time error. And you let the
programmer write /explicit/ code to say what he/she intends in other cases.

It is only if something is being used so often that explicit code is a
pain to read and write, that you should do anything implicitly here. So
if you are writing a language designed primarily for string processing,
you might consider having "'a' * 5" defined to be "aaaaa". Otherwise,
let the programmer write "repeat('a', 5)" or "'a'.repeat(5)", or
whatever suits the style of the language.

David Brown

unread,

Nov 5, 2022, 1:08:38 PM11/5/22

to

Yes. Of course, it would not work so well in Chinese (where there is no
concept of alphabet), and would be a challenge for Arabic (where the
shape of letters varies enormously depending on where they come in
words). But it is definitely a step in the right direction. And makes
the code independent of the character encoding.

>
>>
>> (This also illustrates the difference between `byte` and `char`; using
>> a byte type, the output would be '656667...90'.)
>>
>>
>>
>>>    'r' (what the user may want as sort order)
>>>    's' (what the user may want as sort order)
>>>    a non-character (as in EBCDIC)
>>
>> My feeling is that it is these diverse requirements that require
>> user-supplied functions.
>
> Functions, yes, (or what appear to be functions) though surely they
> should be part of a library that comes with the language.
>

How much of a library you provide depends on the goals of the language.
You don't have to provide /everything/ !

Getting "const" right is something to think long and hard about. When
do you mean "constant", when do you mean "read-only", when do you mean
"I promise this data will never change", "I will assume this data will
never change", "I promise /I/ won't change this data via this
reference", "This data will be unchanged logically but may change in
underlying representation, such as using a cache of some sort", etc. ?

Constness is a hugely powerful concept, and something you definitely
want in a language. Modern language design fashion is to making things
constant be default and require explicit indication that they can
change. Some programming languages (pure functional programming
languages, for example) have /only/ constant data - there is no such
thing as variables.

James Harris

unread,

Nov 6, 2022, 4:17:27 AM11/6/22

to

On 05/11/2022 17:08, David Brown wrote:
> On 05/11/2022 17:07, James Harris wrote:
>> On 05/11/2022 13:38, Bart wrote:

..

>>> My feeling is that it is these diverse requirements that require
>>> user-supplied functions.
>>
>> Functions, yes, (or what appear to be functions) though surely they
>> should be part of a library that comes with the language.
>>
>
> How much of a library you provide depends on the goals of the language.
> You don't have to provide /everything/ !

That's good to hear. :)

While I am trying to make it easy to invoke functions written by other
people ISTM that it's also right for the language to have associated
with it a load of standard provisions - i18n support, various data
structures, display support, maths libraries, etc for one simple reason:
code maintenance; it's easier to maintain software which uses library
calls one already knows than to have to learn yet another set of i18n
calls, for example.

..

>> C's const propagation sounds like Java with its horrible, and sticky,
>> exception propagation.
>>
>
> Getting "const" right is something to think long and hard about. When
> do you mean "constant", when do you mean "read-only", when do you mean
> "I promise this data will never change", "I will assume this data will
> never change", "I promise /I/ won't change this data via this
> reference", "This data will be unchanged logically but may change in
> underlying representation, such as using a cache of some sort", etc. ?

That sounds really interesting and I'd like to get in to it but this is
not the thread. If you wanted to start a new thread on the topic I would
reply. Suffice to say here that I don't use "const" but do have "ro" and
"rw" as usable in various contexts which effect many of the things you
mention but I don't know if I have covered everything a programmer might
need.

>
> Constness is a hugely powerful concept, and something you definitely
> want in a language. Modern language design fashion is to making things
> constant be default and require explicit indication that they can
> change. Some programming languages (pure functional programming
> languages, for example) have /only/ constant data - there is no such
> thing as variables.

I haven't gone that far but, for example, I have globals as, by default,
read only and while parameters are read-write a function would have to
keep the originals around if there's a chance they would be needed.

As I say, though, such things need a thread of their own so I'll resist
the urge to say more.

--
James Harris

James Harris

unread,

Nov 6, 2022, 4:28:29 AM11/6/22

to

On 05/11/2022 17:01, David Brown wrote:
> On 04/11/2022 23:28, Bart wrote:

..

>> I actually had such a restriction for a while: char*5 wasn't allowed,
>> but char+1 was. After all why on earth shouldn't you want the next
>> character in that alphabet? Why should code like this be made illegal:
>>
>>      a := a * 10 + (c - '0')
>
> Why not :
>
>     char a;        // Use whatever syntax you prefer
>     int i;        // and whatever type names you prefer
>
>     a = digit(i);
>
> The function "digit" might be defined :
>
>     char digit(int i) {
>         return char(i + ord('0'));
>     }

Wouldn't char and ord need a locale?

That may be the wrong term but by locale I mean a bundled set of rules
(including, in this case, what the digits are and how many there are of
them) which apply to the language and region the program is executing for.

Maybe it is the right term. I see on Wikipedia: "In computing, a locale
is a set of parameters that defines the user's language, region and any
special variant preferences that the user wants to see in their user
interface."

https://en.wikipedia.org/wiki/Locale_(computer_software)

>
> You want to find the next letter after "x"? "char(ord(x) + 1)". Or
> perhaps, like Pascal and Ada, "succ(x)".

pred and succ are great, and I was thinking to start a thread on how
they might be used for different data types. But I have to point out
that they are not enough on their own. If a user wanted the character
ten away from the current one then he wouldn't want to code ten succ
operations.

--
James Harris

David Brown

unread,

Nov 6, 2022, 6:10:09 AM11/6/22

to

On 06/11/2022 10:17, James Harris wrote:
> On 05/11/2022 17:08, David Brown wrote:
>> On 05/11/2022 17:07, James Harris wrote:
>>> On 05/11/2022 13:38, Bart wrote:
>
> ..
>
>>>> My feeling is that it is these diverse requirements that require
>>>> user-supplied functions.
>>>
>>> Functions, yes, (or what appear to be functions) though surely they
>>> should be part of a library that comes with the language.
>>>
>>
>> How much of a library you provide depends on the goals of the
>> language. You don't have to provide /everything/ !
>
> That's good to hear. :)
>
> While I am trying to make it easy to invoke functions written by other
> people ISTM that it's also right for the language to have associated
> with it a load of standard provisions - i18n support, various data
> structures, display support, maths libraries, etc for one simple reason:
> code maintenance; it's easier to maintain software which uses library
> calls one already knows than to have to learn yet another set of i18n
> calls, for example.

Sure. But pick one i18n library, write the FFI wrapper, and call that
your standard library. Then users don't have to deal with third-party
code and libraries, and you don't have to learn the intricacies of how
to support multiple languages properly (you don't have enough lifetimes
to learn enough to write it yourself). Everyone wins!

Fair enough. It's a big topic, and deserves its own thread. All I will
do here is encourage you to think hard about it, learn about it, and
test ideas early on - if you try to add "const" to a language later, it
will inevitably be a mess, complicated and inconsistent.

David Brown

unread,

Nov 6, 2022, 6:23:33 AM11/6/22

to

On 06/11/2022 10:28, James Harris wrote:
> On 05/11/2022 17:01, David Brown wrote:
>> On 04/11/2022 23:28, Bart wrote:
>
> ..
>
>>> I actually had such a restriction for a while: char*5 wasn't allowed,
>>> but char+1 was. After all why on earth shouldn't you want the next
>>> character in that alphabet? Why should code like this be made illegal:
>>>
>>>      a := a * 10 + (c - '0')
>>
>> Why not :
>>
>>      char a;        // Use whatever syntax you prefer
>>      int i;        // and whatever type names you prefer
>>
>>      a = digit(i);
>>
>> The function "digit" might be defined :
>>
>>      char digit(int i) {
>>          return char(i + ord('0'));
>>      }
>
> Wouldn't char and ord need a locale?

That would depend on what you are trying to do. If you wanted a real
multi-lingual "digit" function, then yes - and you'd return different
Unicode characters for different languages. But it is also important to
have some way of getting to the underlying representation of the
characters. At a minimum, you'll need that to implement the library of
functions for dealing with locales. (Maybe you want to distinguish
between "low-level" or "library implementation" code that is allowed to
do such things, and "user" code that is not - just as some languages
have "safe" and "unsafe" code modes.)

>
> That may be the wrong term but by locale I mean a bundled set of rules
> (including, in this case, what the digits are and how many there are of
> them) which apply to the language and region the program is executing for.
>
> Maybe it is the right term. I see on Wikipedia: "In computing, a locale
> is a set of parameters that defines the user's language, region and any
> special variant preferences that the user wants to see in their user
> interface."
>
> https://en.wikipedia.org/wiki/Locale_(computer_software)
>
>>
>> You want to find the next letter after "x"? "char(ord(x) + 1)". Or
>> perhaps, like Pascal and Ada, "succ(x)".
>
> pred and succ are great, and I was thinking to start a thread on how
> they might be used for different data types. But I have to point out
> that they are not enough on their own. If a user wanted the character
> ten away from the current one then he wouldn't want to code ten succ
> operations.
>

You could give "pred" and "succ" an optional step argument.

You will also have to think about what happens if the result doesn't
make sense - if you step beyond the range for the type, do you throw an
error of some sort? Should some kinds of types have wrapping succ/pred
operations?

James Harris

unread,

Nov 13, 2022, 3:49:24 PM11/13/22

to

On 06/11/2022 11:10, David Brown wrote:
> On 06/11/2022 10:17, James Harris wrote:
>> On 05/11/2022 17:08, David Brown wrote:

...

>>> How much of a library you provide depends on the goals of the
>>> language. You don't have to provide /everything/ !
>>
>> That's good to hear. :)
>>
>> While I am trying to make it easy to invoke functions written by other
>> people ISTM that it's also right for the language to have associated
>> with it a load of standard provisions - i18n support, various data
>> structures, display support, maths libraries, etc for one simple
>> reason: code maintenance; it's easier to maintain software which uses
>> library calls one already knows than to have to learn yet another set
>> of i18n calls, for example.
>
> Sure. But pick one i18n library, write the FFI wrapper, and call that
> your standard library. Then users don't have to deal with third-party
> code and libraries, and you don't have to learn the intricacies of how
> to support multiple languages properly (you don't have enough lifetimes
> to learn enough to write it yourself). Everyone wins!

Yes, internationalisation is an enormous area. I don't have anything
like the requisite knowledge of the world's languages and character sets
to do the work. What I /can/ do, AISI, is to establish principles and
restrictions intended to make multilingual programming more natural. For
example,

* To define the standard form of strings to have unadorned characters
stored as separate codes from diacritics (which I gather may be called
combining characters).

* Diacritic codes would all have to be stored in a certain order
relative to each other.

* String encodings would put diacritics before the character to which
they apply. (Though am not sure what to do about accents which apply to
whole words or groups of characters.)

* The language would not support ordering of characters based on their
internal codes. All ordering would require a locale to indicate which
should come first.

* String ordering would require creation of a 'comparison string'
created according to the rules of a selected locale. The internal codes
of the comparison string /would/ be comparable for orde5ring.

* There would be a standard API which all i18n libraries would have to
support.

etc

Whether that kind of approach is valid or not, I don't know. It's just
my best guess at what may be required.

BTW, emoticons are a complete nightmare. There could be any number of
them, they may render differently on different devices and no ordering
of them makes sense.

--
James Harris