SRFI 14 (character sets) needs replacement in R7RS-large


John Cowan

Sep 19, 2019, 7:19:12 PM
to srfi...@srfi.schemers.org, srf...@srfi.schemers.org, scheme-re...@googlegroups.com
Although SRFI 14 is currently part of R7RS-large (it was voted in as part of the Red Edition), both the SRFI itself and its implementation need replacement.  The procedures should be aligned with SRFI 113 by adding analogues of the set-search!, set>?, set<=?, set>=?, set-remove, and set-partition procedures.  This should be extremely straightforward.  As long as backward compatibility is maintained, changing (scheme char-set) to a different SRFI library is permissible.
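
To illustrate why the additions should be straightforward, here is a minimal sketch that builds a few of them on procedures SRFI 14 already provides (char-set-filter, char-set<=, char-set=).  The names char-set-remove, char-set-partition, char-set>=?, char-set<? and char-set>? are assumptions modelled on the SRFI 113 names; they are not part of SRFI 14, and unlike the SRFI 113 comparison predicates these versions take exactly two arguments.

```scheme
(import (scheme base) (srfi 14))

;; Hypothetical names modelled on SRFI 113; none of them exist in SRFI 14.

;; Keep only the characters that do NOT satisfy pred.
(define (char-set-remove pred cs)
  (char-set-filter (lambda (c) (not (pred c))) cs))

;; Return two values: the characters satisfying pred, then the rest.
(define (char-set-partition pred cs)
  (values (char-set-filter pred cs)
          (char-set-remove pred cs)))

;; Superset test, expressed through the existing subset test.
(define (char-set>=? cs1 cs2)
  (char-set<= cs2 cs1))

;; Proper subset: a subset that is not equal.
(define (char-set<? cs1 cs2)
  (and (char-set<= cs1 cs2)
       (not (char-set= cs1 cs2))))

;; Proper superset.
(define (char-set>? cs1 cs2)
  (char-set<? cs2 cs1))
```

For example, (char-set-partition char-numeric? (string->char-set "a1b2")), with char-numeric? imported from (scheme char), would return one set holding the digits and another holding the letters.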

In addition, the standard character sets are defined correctly for ASCII and Latin-1 characters, but the Unicode definitions are based on Java 1.0.  Both Java and Unicode itself have replaced them with new definitions since then, and these are what should be incorporated into the replacement SRFI.  This is technically a break in backward compatibility, but since SRFI 14 is incompatible with every non-Scheme system in use today, I rule as R7RS-large chair that the incompatibility is de minimis.

The present implementation is for Latin-1 only, and the internal representation of a character set is a (mutable) Scheme string of length 256, such that if character n is present in the set, the nth character of the string is #\x1, and otherwise #\x0.  This is both wasteful and inadequate, but it was the best that could be done portably at the time.
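
The representation described above can be sketched roughly as follows.  This is an illustration only, not the actual SRFI 14 source: the l1- names are made up for the example, and it assumes the implementation permits the characters with code points 0 and 1 inside strings.  It shows both problems at once: a whole character slot is spent to record one bit, and anything above code point 255 cannot be stored at all.

```scheme
(import (scheme base))

;; Illustrative names (prefixed l1-); not those of the real implementation.

;; An empty set: 256 slots, each holding the character with code point 0.
(define (make-l1-set)
  (make-string 256 (integer->char 0)))

;; Mark character c as present.  Signals an error if c is outside
;; Latin-1, because the string has only 256 slots.
(define (l1-set-adjoin! s c)
  (string-set! s (char->integer c) (integer->char 1)))

;; Membership test; characters above code point 255 are never members.
(define (l1-set-contains? s c)
  (let ((n (char->integer c)))
    (and (< n 256)
         (char=? (string-ref s n) (integer->char 1)))))
```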

Fortunately, the Chibi implementation of SRFI 14 handles full Unicode and is built on top of the (chibi iset) library (also available on Chicken), which contains a minimal bitvector library that is based on bytevectors.  It is quite portable and would be suitable for the new SRFI.
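
For comparison, a minimal bitvector over an R7RS bytevector might look like the sketch below.  The procedure names are hypothetical and are not the (chibi iset) API; the code uses only (scheme base) arithmetic, since R7RS-small has no bitwise operators, which is part of what keeps this approach portable.

```scheme
(import (scheme base))

;; Hypothetical names; (chibi iset) has its own internal procedures.

;; Allocate enough bytes for n bits, all initially clear.
(define (make-bits n)
  (make-bytevector (quotient (+ n 7) 8) 0))

;; Bit i lives in byte (quotient i 8) at position (remainder i 8).
(define (bit-set? bv i)
  (odd? (quotient (bytevector-u8-ref bv (quotient i 8))
                  (expt 2 (remainder i 8)))))

;; Set bit i.  Adding the power of two is safe because the bit is
;; known to be clear, so the byte cannot exceed 255.
(define (bit-set! bv i)
  (unless (bit-set? bv i)
    (let ((k (quotient i 8)))
      (bytevector-u8-set! bv k
                          (+ (bytevector-u8-ref bv k)
                             (expt 2 (remainder i 8)))))))
```

With this layout, (make-bits 128) allocates 16 bytes, so all of ASCII fits in a single small bytevector.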

Latin-1 is quickly becoming obsolete online, but ASCII is still very important.  The Chibi implementation uses a tree of bitvectors whose lengths are between 128 and 512 bits each (16 to 64 bytes), so it will be as efficient (modulo a small constant factor) in space and time as a purpose-built ASCII-only implementation.   Therefore, I recommend that everything set-like be removed from SRFI 175.

Because of the implementation dependency, I'll wait until (chibi iset) becomes an integer-set SRFI, which won't be difficult: it just needs some procedures to be renamed and to be augmented with iset<, iset>, iset-disjoint?, iset-count, and iset-remove! procedures, which should be straightforward as well.