There was a request to summarize the discussion regarding the different string API options in the tangerine edition, so I'll do my best here.
Problem: Historically, Scheme strings were treated as vectors of characters. However, in the Unicode world, the only basic approach to string representations which preserves O(1) random access is UCS4, which is rarely used due to the space overhead. Most languages and APIs thus use either UTF16 or UTF8, and almost all I/O uses UTF8. Consequently, many Scheme implementations do not guarantee O(1) random access on strings, and R7RS respects this and does not enforce any time requirements. Nonetheless, some existing code assumes constant time, and more importantly we need to be able to write code which can be expected to run efficiently in a variety of implementations.
I think everyone gets this - there's a time-space tradeoff. What's unclear is what the proposed solutions do and exactly the trade-offs become.
The approaches to solving this generally involve either optimizations to ensure both reasonable time and space performance in strings, or alternate APIs using cursors instead of integer offsets (i.e. treating strings like lists instead of vectors). Note due to the difficulties of changing internal representations, it's not practical to guarantee constant-time index operations on the string type itself, so proposals revolve around a new type either way: a new string type, or a cursor type. Either of these could be disjoint types, optionally or not.
Note there seems to be near unanimous agreement that mutation is to be discouraged, so this is not a crucial factor. There also seems to be interest in a higher level above strings allowing size-changing mutations, though disagreement on whether this should be purely functional or in-place, and the features available (up to and including Emacs-style marks), so we won't address this for now.
For completeness we can also consider a few variant approaches that have not been proposed.
1. Require constant-time string-ref.
Proposed in: R6RS
Pros:
- backwards compatible with R6RS and older Scheme code
Cons:
- no portable implementation - incompatible with many existing Schemes
- requires up to 4x the space of UTF8
- requires more encoding conversions for FFI and I/O
2. Create a new string type with required constant-time access
Proposed in: SRFI 135
Pros:
- provides a familiar though incompatible API to R6RS and older Scheme
- allows a portable implementation
Cons:
- reference implementation is slow for non-native Schemes
- native implementations still incur time and space overhead
- new type is not a string, requiring conversions at library (and FFI) boundaries
- no portable literal syntax
3. Strings as lists (string-car and string-cdr relying on efficient substring/shared)
Proposed in: nothing (though SRFI 13 has optional shared substrings)
Pros:
- simple functional API
- allows space efficient storage
Cons:
- not portable - efficient implementations need GC support
- even simple iteration generates garbage
4. Strings with cursors instead of indexes
Proposed in: SRFI 130, (chibi string)
Pros:
- provides a familiar API to indexes using the native string type
- allows space efficient storage and zero time overhead with a near portable implementation
Cons:
- integration with existing code may require conversions between indexes and cursors in places
- SRFI 130 made the poor design choice of polymorphism on non-disjoint types
The only alternative in the ballot is 4 (SRFI 130). SRFI 13 predates awareness of the issues, and SRFI 152, being a stripped down version of SRFI 13, ignores the issues with no performance guarantees, thus not addressing the problem to begin with. SRFI 140 (not on the ballot) is something of a cross between 1 and 2, creating a new type but requiring string literals to be of that type.
It's unclear what approach requires more migration efforts. New string types will require more (and more expensive) conversions in general, since many string operations never even need to explicitly mention indexes/cursors. However, certain specialized algorithms may perform index arithmetic which requires non-trivial translation to cursors. For new code either should be easy enough, with the balance tipping in favor of cursors preserving string literal support.
Cursors have the best performance, but what is the real cost of the alternatives? No one is arguing strongly for UCS4 with up to 4x times the storage, which means the solutions are more complex involving both time and space overhead. In implementations without native support the time overhead alone can be prohibitive - Chibi is packaged with both (srfi 130) and (srfi 135), but iterating over a string in the latter is about 100x slower. I think with native support this should only be a small fraction increase in both time and space, but as an implementor I would be disinclined to pay even that given that it still doesn't provide backwards compatibility.
I'll try to provide some more detailed benchmarks later.
--
Alex