immutable text alternative

55 views
Skip to first unread message

Per Bothner

unread,
Jul 7, 2016, 3:46:53 PM7/7/16
to scheme-re...@googlegroups.com
Question #11 "What immutable text library should R7RS-large provide?" of the Red Edition Ballot
is missing the alternative I was arguing for last month on the srfi-135 list:

http://srfi-email.schemers.org/srfi-135/msg/3849905
http://srfi-email.schemers.org/srfi-135/msg/3850401

Basically, the idea is to unify the concepts of SRFI-135 "text"
and R7RS "immutable string". We would formalize the distinction between
immutable and immutable strings, including adding a procedure immutable-string?
(though preferably under some other name). The SRFI-135 type "textual" is renamed
to the R7RS type "string", and the SRFI-135 type "text" (preferably under some
other name) becomes a subtype of "string".

string
- immutable strings / texts (string-set! throws exception; guaranteed O(1) string-ref)
- mutable strings (string-set! allowed - need not be O(1), nor does string-ref))

We could use the name "text" for "immutable strings" but I would prefer a name
like "istring" or "fstring" that makes it clear it is a sub-type of "string".

The procedures of SRFI-135 would be classified into two categories:

(1) Ones that returns a text; or a list/vector of text (i.e. text-split); plus text? :
These should retain the text- prefix (or some alternative new prefix,
such as istring- or fstring-).

(2) All the other ones should be renamed string-xxx. This assumes no conflict
with existing (or proposed) string- procedures; if there are conflicts they
should be considered on a case-by-case basis.

We might consider changing these traditional string procedures to return
immutable strings: string, string-upcase, string-downcase, string-foldcase, substring,
string-append, list->string, and string-map. (Though of course there has to be a
library with backward-compatible versions that return mutable strings.)

I'd also include SRFI-118's string-append! and string-replace! for mutable strings.

Again, I dislike this balloting process throwing a huge kitchen sink of proposed
additions without chance for reflection or nuance or discussion. Note that SRFI-135
is not yet final - and my proposal was dismissed as not suitable for SRFI-135 as
a standalone portable library. However, in the context of an updated Scheme
standard it is *not* out of place.

I can try to throw together a more detailed "Strings for WG2" proposal in the form
of a SRFI if there is interest.
--
--Per Bothner
p...@bothner.com http://per.bothner.com/

John Cowan

unread,
Jul 7, 2016, 5:37:41 PM7/7/16
to scheme-re...@googlegroups.com
Per Bothner scripsit:

> Question #11 "What immutable text library should R7RS-large provide?" of
> the Red Edition Ballot is missing the alternative I was arguing for
> last month on the srfi-135 list:

If you want to write up a SRFI and try to get people to vote for it, be
my guest.

--
John Cowan http://www.ccil.org/~cowan co...@ccil.org
People go through the bother of Christmas because Christmas helps them
to understand why they go through the bother of living out their lives
the rest of the year. For one brief instant, we see human society as it
should and could be, a world in which business has become the exchanging
of presents and in which nothing is important except the happiness and
well-being of the ultimate consumer. --Northrop Frye (1948)

Per Bothner

unread,
Jul 9, 2016, 2:56:26 PM7/9/16
to scheme-re...@googlegroups.com
I've sketched up my proposal for how strings should be in Scheme large:
http://per.bothner.com/tmp/wg2-strings.html

I can turn this into a more polished proposal and an actual SRFI if there is interest.
However, I'm not going to waste my time unless get some positive feedback:
I.e. that this is the right direction for the future of Scheme.

I think SRFI-135 *in concept* is a reasonable direction for the future of
Scheme, but not using SRFI-135's new names and new types. I think having
all of "mutable strings", "immutable stings", "texts", and "textuals" in
the same language would make for a terrible and unteachable mess. Please,
let's not do that.

Sudarshan S Chawathe

unread,
Jul 9, 2016, 4:44:38 PM7/9/16
to scheme-re...@googlegroups.com
> From: Per Bothner <p...@bothner.com>
> Date: Sat, 9 Jul 2016 11:56:18 -0700
>
> I can turn this into a more polished proposal and an actual SRFI if
> there is interest. However, I'm not going to waste my time unless get
> some positive feedback: I.e. that this is the right direction for the
> future of Scheme.

For what it's worth, I did read this proposal and find the
string/istring-based names conceptually a bit simpler (and thus nicer)
than the text/textual versions.

In any case, I hope that something like SRFI 135 makes it into
R7RS-large, regardless of the names selected, because I think an easy
way for a programmer to get O(1) access to string-like objects is a very
worthwhile feature.

Regards,

-chaw

Arthur A. Gleckler

unread,
Jul 9, 2016, 6:08:47 PM7/9/16
to scheme-re...@googlegroups.com
On Sat, Jul 9, 2016 at 11:56 AM, Per Bothner <p...@bothner.com> wrote:
I've sketched up my proposal for how strings should be in Scheme large:
http://per.bothner.com/tmp/wg2-strings.html

I can turn this into a more polished proposal and an actual SRFI if there is interest.
However, I'm not going to waste my time unless get some positive feedback:
I.e. that this is the right direction for the future of Scheme.

I'm happy to help you get any SRFI published quickly, if you decide to.

— SRFI Editor

Per Bothner

unread,
Jul 10, 2016, 1:13:39 PM7/10/16
to scheme-re...@googlegroups.com


On 07/09/2016 03:08 PM, Arthur A. Gleckler wrote:
Ok - we might as well go for it. I've done some polishing and expanding,
and uploaded the result to: http://per.bothner.com/tmp/wg2-strings.html

Jim Rees

unread,
Jul 10, 2016, 7:09:26 PM7/10/16
to scheme-reports-wg2
I am in favor of more discussion along these lines, and would support an effort that simplifies the union of existing string API and (SRFI 135), and
also unifies existing string "literals" with immutable strings/texts.

I like the features of (srfi 135), but roughly half of those API functions (those which do not return texts) should just be string-* equivalents updated to handle this new flavor of immutable string.

I also like stringcursors - because text-ref is slow on these fancy new immutable texts represented with UTF8 under the hood.   I optimized textual-fold heavily and use it for iterating across single texts, but in the multi-text argument case (textual-map, textual-for-each, and something users might want to write), it's not always practical to avoid using textual-ref.    stringcursors address that issue -- I have them working on my texts as of today.     That being said, I won't vote against your proposal if it doesn't include stringcursors.

I like the idea of promoting immutable strings by making old identifiers mean new things -- like string-upcase returning immutable strings (with all the sharing and complexity promises) -- but I fear this will be too bold for a majority of the voters here (that's not an argument against it, just an prediction that a proposal including changes in the semantics of a large set of core identifiers, even if exported by distinct libraries, will likely be voted down).


Per Bothner

unread,
Jul 10, 2016, 7:45:39 PM7/10/16
to scheme-re...@googlegroups.com


On 07/10/2016 04:09 PM, Jim Rees wrote:

> I also like stringcursors - because text-ref is slow on these fancy new immutable texts represented with UTF8 under the hood.

I believe a goal of SRFI-135 is to remove the need for string-cursors. You can make string-ref
O(1) fairly easily. For example (if we don't try to optimize sharing) using Java:

class SchemeString implements CharSequence {
String str;
int cplength; // number of codepoints

/* Index of every 16-th character */
int[] offsets;

/** To implement CharSequence */
public char charAt(int i) { return str.charAt(i); }
public String toString() { return str; }
public int length() { return str.length(); }

/** used for string-ref */
public int index(int i) {
int pos = offsets[i>>4];
return str.codePoinAt(pos + str.offsetByCodePoints(pos, i&15));
}
/* used for string-length */
public int cplength() { return cplength; }
}

offsetByCodePoints does do linear searching, but its loop is limited to 15.

Also, there is a simple and useful optimization:
if (offsets[i>>4] + 16 == offsets[(i>>4) + 1])
then we know all the characters in that range are in the basic plane,
so the code optimizes to:
return (int) str.charAt(pos + (i & 15));

Plus of of course all sequential options such as string-for-each
can be easily optimized.

From my point of view, as an implementer of Kawa on the JVM, there are
two downsides:
(1) Some space overhead and extra indirection compared to using String
directly (perhaps with string-cursor).
(2) The Java/Kawa mapping becomes more complex (though using CharSequence helps).

Jim Rees

unread,
Jul 10, 2016, 9:59:44 PM7/10/16
to scheme-reports-wg2
On Sunday, July 10, 2016 at 7:45:39 PM UTC-4, Per Bothner wrote:

On 07/10/2016 04:09 PM, Jim Rees wrote:

> I also like stringcursors - because text-ref is slow on these fancy new immutable texts represented with UTF8 under the hood.

I believe a goal of SRFI-135 is to remove the need for string-cursors.  You can make string-ref
O(1) fairly easily.  For example (if we don't try to optimize sharing) using Java:

Yes, I did the implementation right got "O(1)" but it's still a very slow O(1) compared to using stringcursors.   I know I can trade-off the size of the constant for space (by reduce the chunk size smaller), but the benefits of UTF-8 at all come into question then.

As for the optimization for chunks which are all single-byte-per-character - this does help things by around 5% for incremental scans over large all-ASCII texts (my chunk size is 32 characters at the moment).    There might be more value to using a different Text subclass for all-ASCII texts -- since the complexity in dispatching for the ref() method is there already.

Reply all
Reply to author
Forward
0 new messages