string apis revisited

32 views
Skip to first unread message

Alex Shinn

unread,
Jan 21, 2019, 10:42:18 AM1/21/19
to scheme-re...@googlegroups.com
There was a request to summarize the discussion regarding the different string API options in the tangerine edition, so I'll do my best here.

Problem: Historically, Scheme strings were treated as vectors of characters.  However, in the Unicode world, the only basic approach to string representations which preserves O(1) random access is UCS4, which is rarely used due to the space overhead.  Most languages and APIs thus use either UTF16 or UTF8, and almost all I/O uses UTF8.  Consequently, many Scheme implementations do not guarantee O(1) random access on strings, and R7RS respects this and does not enforce any time requirements.  Nonetheless, some existing code assumes constant time, and more importantly we need to be able to write code which can be expected to run efficiently in a variety of implementations.

I think everyone gets this - there's a time-space tradeoff.  What's unclear is what the proposed solutions do and exactly the trade-offs become.

The approaches to solving this generally involve either optimizations to ensure both reasonable time and space performance in strings, or alternate APIs using cursors instead of integer offsets (i.e. treating strings like lists instead of vectors).  Note due to the difficulties of changing internal representations, it's not practical to guarantee constant-time index operations on the string type itself, so proposals revolve around a new type either way: a new string type, or a cursor type.  Either of these could be disjoint types, optionally or not.

Note there seems to be near unanimous agreement that mutation is to be discouraged, so this is not a crucial factor.  There also seems to be interest in a higher level above strings allowing size-changing mutations, though disagreement on whether this should be purely functional or in-place, and the features available (up to and including Emacs-style marks), so we won't address this for now.

For completeness we can also consider a few variant approaches that have not been proposed.

1. Require constant-time string-ref.
Proposed in: R6RS
Pros:
- backwards compatible with R6RS and older Scheme code
Cons:
- no portable implementation - incompatible with many existing Schemes
- requires up to 4x the space of UTF8
- requires more encoding conversions for FFI and I/O

2. Create a new string type with required constant-time access
Proposed in: SRFI 135
Pros:
- provides a familiar though incompatible API to R6RS and older Scheme
- allows a portable implementation
Cons:
- reference implementation is slow for non-native Schemes
- native implementations still incur time and space overhead
- new type is not a string, requiring conversions at library (and FFI) boundaries
- no portable literal syntax

3. Strings as lists (string-car and string-cdr relying on efficient substring/shared)
Proposed in: nothing (though SRFI 13 has optional shared substrings)
Pros:
- simple functional API
- allows space efficient storage
Cons:
- not portable - efficient implementations need GC support
- even simple iteration generates garbage

4. Strings with cursors instead of indexes
Proposed in: SRFI 130, (chibi string)
Pros:
- provides a familiar API to indexes using the native string type
- allows space efficient storage and zero time overhead with a near portable implementation
Cons:
- integration with existing code may require conversions between indexes and cursors in places
- SRFI 130 made the poor design choice of polymorphism on non-disjoint types

The only alternative in the ballot is 4 (SRFI 130).  SRFI 13 predates awareness of the issues, and SRFI 152, being a stripped down version of SRFI 13, ignores the issues with no performance guarantees, thus not addressing the problem to begin with.  SRFI 140 (not on the ballot) is something of a cross between 1 and 2, creating a new type but requiring string literals to be of that type.

It's unclear what approach requires more migration efforts.  New string types will require more (and more expensive) conversions in general, since many string operations never even need to explicitly mention indexes/cursors.  However, certain specialized algorithms may perform index arithmetic which requires non-trivial translation to cursors.  For new code either should be easy enough, with the balance tipping in favor of cursors preserving string literal support.

Cursors have the best performance, but what is the real cost of the alternatives?  No one is arguing strongly for UCS4 with up to 4x times the storage, which means the solutions are more complex involving both time and space overhead.  In implementations without native support the time overhead alone can be prohibitive - Chibi is packaged with both (srfi 130) and (srfi 135), but iterating over a string in the latter is about 100x slower.  I think with native support this should only be a small fraction increase in both time and space, but as an implementor I would be disinclined to pay even that given that it still doesn't provide backwards compatibility.

I'll try to provide some more detailed benchmarks later.

--
Alex

John Cowan

unread,
Jan 21, 2019, 4:24:49 PM1/21/19
to scheme-re...@googlegroups.com
On Mon, Jan 21, 2019 at 10:42 AM Alex Shinn <alex...@gmail.com> wrote:

There was a request to summarize the discussion regarding the different string API options in the tangerine edition, so I'll do my best here.

Thanks for writing this.
 
Problem: Historically, Scheme strings were treated as vectors of characters.  However, in the Unicode world, the only basic approach to string representations which preserves O(1) random access is UCS4, which is rarely used due to the space overhead.

Actually that turns out not to be the case.  At least the following implementations use UTF-32 representation in strings: Gambit, Guile (unless the repertoire is Latin-1 only),  Foment, Ikarus/Vicare, Larceny, Ypsilon, Mosh, and Scheme48.  Cyclone, Chibi, and STklos use UTF-8 internally; the JVM, CLR, and JS implementations use UTF-16 (or just UCS2, with no access to the Astral Planes); most of the remaining Schemes have 8-bit implementations, with or without overlay libraries like Chicken's.  MIT uses 1-byte, 2-byte, or 3-byte representations for immutable strings, but is limited to 1-byte mutable ones.  UTF-32 is also typical in Common Lisp.

1. Require constant-time string-ref.
Proposed in: R6RS

R6RS only says "should", but all the R6RS implementations that aren't JVM/CLR/JS provide it.
 
- requires up to 4x the space of UTF8

Actually 3x is enough, as in MIT and one of the Larceny proposals ("record1").
 
3. Strings as lists (string-car and string-cdr relying on efficient substring/shared)
Proposed in: nothing (though SRFI 13 has optional shared substrings)

Haskell, though there are alternative string-ish types now, all subsumed under
a type class.
 
4. Strings with cursors instead of indexes
Proposed in: SRFI 130, (chibi string)
- SRFI 130 made the poor design choice of polymorphism on non-disjoint types

A revision of SRFI 130 that didn't do indexes might be a Good Thing.
 
The only alternative in the ballot is 4 (SRFI 130).  SRFI 13 predates awareness of the issues, and SRFI 152, being a stripped down version of SRFI 13, ignores the issues with no performance guarantees, thus not addressing the problem to begin with. 

It's not intended to.  It's intended to be used collaboratively with SRFI 135 for use cases where mutation is essential and/or performance doesn't matter much.

SRFI 140 (not on the ballot) is something of a cross between 1 and 2, creating a new type but requiring string literals to be of that type.

And requiring length mutation.
 
Cursors have the best performance,

Cursors are obviously more expensive on UTF-32 systems, since they have to be allocated and garbage collected.
 
 No one is arguing strongly for UCS4 with up to 4x times the storage

No, but a non-negligible number of implementers are voting with their feet.

-- 
John Cowan          http://vrici.lojban.org/~cowan        co...@ccil.org
Yakka foob mog.  Grug pubbawup zink wattoom gazork.  Chumble spuzz.
    --Calvin, giving Newton's First Law "in his own words"


Per Bothner

unread,
Jan 21, 2019, 4:56:06 PM1/21/19
to scheme-re...@googlegroups.com, John Cowan
On 1/21/19 1:24 PM, John Cowan wrote:
> SRFI 140 (not on the ballot) is something of a cross between 1 and 2, creating a new type but requiring string literals to be of that type.
>
> And requiring length mutation.

I don't think this should be a deciding factor in terms of the SRFI 140 design.
The major concepts and goals of SRFI are independent of length mutation:

* Distinct immutable and mutable string (sub-) types.
* O(1) string-ref for immutable strings.
* Re-use traditional string procedure names.
* Most standard string procedures return immutable strings.

The tradeoff in terms of updating existing programs is adding copy-string
in a few places (if existing code requires mutable strings) vs wholesale
replacing string-xxx functions with something else.

Those concepts are the core for SRFI 140 - and orthogonal to
whether string length can be changed in mutable strings.
However, if you're going to support mutable strings,
then allowing length mutation makes them much more useful,
and it's only a small addition to the API. For most implementation
strategies there should no extra complexity or overhead supporting length mutation.
--
--Per Bothner
p...@bothner.com http://per.bothner.com/

Alex Shinn

unread,
Jan 21, 2019, 5:52:30 PM1/21/19
to scheme-re...@googlegroups.com
On Tue, Jan 22, 2019 at 5:24 AM John Cowan <co...@ccil.org> wrote:

On Mon, Jan 21, 2019 at 10:42 AM Alex Shinn <alex...@gmail.com> wrote:

Problem: Historically, Scheme strings were treated as vectors of characters.  However, in the Unicode world, the only basic approach to string representations which preserves O(1) random access is UCS4, which is rarely used due to the space overhead.

Actually that turns out not to be the case.

I actually meant outside of the Scheme world.  Within the Scheme world you forgot to add Gauche to the UTF8 list.
 
Cursors have the best performance,

Cursors are obviously more expensive on UTF-32 systems, since they have to be allocated and garbage collected.

Non-disjoint cursors (as in Chicken's utf8 egg) have zero overhead.  Disjoint cursors _can_ have zero overhead as in Chibi with less implementation effort than changing the string representation.

--
Alex

Alex Shinn

unread,
Jan 26, 2019, 3:57:01 AM1/26/19
to scheme-re...@googlegroups.com
On Tue, Jan 22, 2019 at 6:52 AM Alex Shinn <alex...@gmail.com> wrote:

Non-disjoint cursors (as in Chicken's utf8 egg) have zero overhead.  Disjoint cursors _can_ have zero overhead as in Chibi with less implementation effort than changing the string representation.

I realize that my benchmark of (srfi 135) being 100x slower than string cursors was due to the string->text procedure being quadratic in utf8 implementations (calling string-ref on each index).  If (srfi 135) were built on top of (srfi 130), the string->text conversion would be faster.  In this sense, (srfi 130) is a lower-level building block for string libraries.

I've added a compile-time option (disabled by default) to include an index->cursor lookup table per string in chibi (default one per 64 chars, or 12.5% space overhead).
I ran some simple benchmarks 1. iterating over a string and 2. iteratively appending strings:

iteration:
(do ((j 0 (+ j 1))) ((>= j i)) (string-ref s j))
(do ((j (string-cursor-start s) (string-cursor-next s j))) ((string-cursor>=? j end)) (string-cursor-ref s j))
(do ((j 0 (+ j 1))) ((>= j i)) (text-ref s j))

impl\len                    1000   10000  100000  1000000
chicken string-ref (ascii)     0       0       1       13
chicken string-ref (utf8)      7     386   37920        x
chibi string-ref (utf8)        1      97    9622        x
chibi string-ref (lookup)      0       2      19      216
chibi cursor-ref (srfi 130)    0       4      18      150
chicken cursor-ref (utf8)      0       1      10      892
chibi text-ref (srfi 135)      2      27     211     2006

appending:
(do ((j 0 (+ j 1)) (s "" (string-append s (make-string (/ i 10) #\x)))) ((>= j 10)))
(do ((j 0 (+ j 1)) (s (text) (textual-append s (string->text (make-string (/ i 10) #\x))))) ((>= j 10)))

impl\len                      1000    10000  100000
chicken string-append (ascii)    0        0       0
chicken string-append (utf8)     0        0       0
chibi string-append (utf8)       0        0       0
chibi string-append (lookup)     0        1       9
chibi textual-append (srfi 135)  6       51    1264


In implementations in which string-ref is fast, then string-cursors are just defined as indexes and have zero performance overhead.
SRFI 135 will have some overhead in any implementation, particularly in non-native ones.
Add to this SRFI 135 strings are incompatible with native strings, and I'm confused as to why it is getting votes.

I believe there's room for two textual types in the Scheme language.  One should be immutable and emphasize efficiency (both time and space).  One should be similar to Emacs buffers, allowing efficient insertion and deletion at any point.  Note this implies indexes are a poor idiom, since they are soon invalidated - something akin to marks are called for.  Considering both the efficiency requirements and preserving similar programming idioms to the buffers, I think a cursor-based API for the immutable strings is the best approach.

--
Alex

John Cowan

unread,
Jan 26, 2019, 9:16:26 AM1/26/19
to scheme-re...@googlegroups.com
On Sat, Jan 26, 2019 at 3:57 AM Alex Shinn <alex...@gmail.com> wrote:

I realize that my benchmark of (srfi 135) being 100x slower than string cursors was due to the string->text procedure being quadratic in utf8 implementations (calling string-ref on each index).  If (srfi 135) were built on top of (srfi 130),

Adding another kernel to SRFI 135 shouldn't be that hard, if you wanted to.  I note, however, that Chibi uses kernel8; the recommended kernel for interpreters is kernel0.  Did you try that one?
 
I've added a compile-time option (disabled by default) to include an index->cursor lookup table per string in chibi (default one per 64 chars, or 12.5% space overhead).

Well, 12.5% worst case; if the content is in Japanese, only half that. 

Add to this SRFI 135 strings are incompatible with native strings, and I'm confused as to why it is getting votes.

It isn't.  SRFI 135 was voted in as part of the Red Edition, so it's already in the large language.  The issue here is "What should we provide to allow users to manipulate the Scheme strings (fixed-length and mutable) that every implementation has had since R2RS?"  Right now there is only the very limited R7RS-small API for them, which is plainly insufficient.  Nor is it possible to avoid strings completely: filenames are strings, and there are string ports but no text ports (nor can they be implemented portably).

I believe there's room for two textual types in the Scheme language.  One should be immutable and emphasize efficiency (both time and space).  One should be similar to Emacs buffers, allowing efficient insertion and deletion at any point.

I basically agree with this analysis.  If the SRFI 135 sample implementation is slow on a particular system, it should be adapted or reimplemented. As for Emacs buffers, I have a sketch for them at <https://bitbucket.org/cowan/r7rs-wg1-infra/src/default/BuffersCowan.md> which needs to be filled out.

  Note this implies indexes are a poor idiom, since they are soon invalidated - something akin to marks are called for.

Marks are sui generis, because some of them must move with insertions, others must not.  In any case, I think buffers are best implemented on top of either u32vectors (for speed) or ad hoc u24vectors (for space).  Of course, a native implementation of u24vectors achieves both.  UTF-8 for buffers would be painful.

Alex Shinn

unread,
Jan 27, 2019, 10:36:47 AM1/27/19
to scheme-re...@googlegroups.com
On Sat, Jan 26, 2019 at 10:16 PM John Cowan <co...@ccil.org> wrote:

On Sat, Jan 26, 2019 at 3:57 AM Alex Shinn <alex...@gmail.com> wrote:

I realize that my benchmark of (srfi 135) being 100x slower than string cursors was due to the string->text procedure being quadratic in utf8 implementations (calling string-ref on each index).  If (srfi 135) were built on top of (srfi 130),

Adding another kernel to SRFI 135 shouldn't be that hard, if you wanted to.  I note, however, that Chibi uses kernel8; the recommended kernel for interpreters is kernel0.  Did you try that one?

kernel0 is just a thin wrapper around the builtin strings, which means it will have performance strictly worse than the "chibi string-ref (utf8)" entries from my benchmarks.  I chose kernel8 to have the most similar representation to the native strings.
 
Add to this SRFI 135 strings are incompatible with native strings, and I'm confused as to why it is getting votes.

It isn't.  SRFI 135 was voted in as part of the Red Edition, so it's already in the large language.  The issue here is "What should we provide to allow users to manipulate the Scheme strings (fixed-length and mutable) that every implementation has had since R2RS?"

The ballot says somewhat confusingly "meant to be used with texts," but the issues are still the same, which I will clarify below.

  Note this implies indexes are a poor idiom, since they are soon invalidated - something akin to marks are called for.

Marks are sui generis, because some of them must move with insertions, others must not.  In any case, I think buffers are best implemented on top of either u32vectors (for speed) or ad hoc u24vectors (for space).  Of course, a native implementation of u24vectors achieves both.  UTF-8 for buffers would be painful.

Not at all - I've implemented utf8 buffers several times over, for Gauche, Chicken and Chibi.  The space efficiency and low I/O cost probably makes it the best option in fact.

I feel that R7RS large is attempting to marginalize any implementation which does not implement O(1) string-ref, despite R7RS small explicitly not making this a requirement.

SRFI 135 does not provide a portable and efficient solution for the utf8/16 impls as claimed, nor is it trivial to fix.

SRFI 130 does not provide this either, but the fix is a cond-expand import of a handful of renamed cursor primitives already available in the affected impls.

SRFI 135 adds overhead to the utf8/16 impls, SRFI 130 adds overhead to nothing.

SRFI 152 puts its head in the sand and pretends there are no problems with strings, encouraging an API which drastically penalizes the utf8/18 impls.

Why should we care about these rogue impls, apart from the WG1 vote?

Because they are actually more efficient at some things.  It's hard to do a holistic eval of string performance since it will vary so much per application, but it's obvious to me that utf8 will be all around better in some cases, utf16 in others, and ucs4 in yet others.  I can give some simple practical examples where utf8 clearly shines though.

Consider "wc -l" (attached as wc.scm) on an old 60MB mail archive, taking the median of 3 runs in chibi (with and without the index->cursor table) and larceny:

# no index->cursor lookup
$ time chibi-scheme wc.scm < ~/chicken-users.txt
1174650
real    0m1.636s

# with index->cursor lookup
$ time chibi-scheme wc.scm < ~/chicken-users.txt
1174650
real    0m1.789s

$ time larceny -utf8 wc.scm < ~/chicken-users.txt
1174650
real    0m2.145s

Wait, how is Chibi's unoptimized vm 24% faster than Larceny?  Because it's just I/O, and Chibi has no translation cost for utf8 I/O.

Maybe too simple, what if we do some actual work?  Let's try grepping for references to Scheme:

# no index->cursor lookup
$ time chibi-scheme fgrep.scm < ~/chicken-users.txt
2978
real    0m1.688s

# with index->cursor lookup
$ time chibi-scheme fgrep.scm < ~/chicken-users.txt
2978
real    0m1.799s

$ time larceny -utf8 fgrep.scm < ~/chicken-users.txt
2978
real    0m20.584s


Now Chibi is more than 10x faster than Larceny!  I'm not actually sure why, but string-contains in Chibi is just strstr which is quite fast.
As you use more non-builtin procedures Larceny will overtake Chibi again, but a native compiled impl with utf8 like Chicken should still outperform.

Note also compared to wc(1) and fgrep(1) these are both quite slow, which is why although the index->cursor table isn't too expensive I'm hesitant to consider it negligible.

So although Chibi was not designed for it, it's fast startup time and choice of string representations makes it particularly suited to scripting.
I just want to allow for this - I want a wide variety of impls to choose from - and I don't want certain impls penalized harshly by assumptions made in 3rd-party code.

I'm open to discussion though, and have not completely written out switching the index->cursor lookup to be enabled by default.
I'm also willing to write a revised string-cursor library without the problems of SRFI 130.
But I want to know why SRFI 135 is popular?  Since everything is reduced to multiple choice these days, you can just check below:

[   ] 1. I don't think the overhead of SRFI 135 for utf8/16 impls is large (please provide counter-benchmarks)
[   ] 2. I don't care about the overhead for utf8/16 impls (at least you're honest)
[   ] 3. I think the overhead for string-cursors is larger (please provide counter-benchmarks)
[   ] 4. I don't care about efficiency and think cursor migration is more work than string type migration (please provide examples)
[   ] 5. I don't care about efficiency and think cursors are confusing
[   ] 6. I've never used cursors and am vaguely worried about them, SRFI 135 looks superficially more familiar

--
Alex

wc.scm
fgrep.scm
Reply all
Reply to author
Forward
0 new messages