1. A character is a byte. This works for ASCII and latin-1. It is
what the server currently uses.
2. A character is a wider word. For example, WCHAR on Windows is
16-bits, and some programs do not implemente that a character may
require two of these.
3. A character is a code-point. UTF-8 requires one to four bytes and
encodes the entire code point. This is the 'recommended' approach
by W3C.
4. There is a many-to-many mapping between characters and code
points. For example, the ligature 'fi' is two characters, but
requires only one code point. Likewise, many 'combining' code
points serve to modify the previous base code point to form a
single character. Likewise, sequences of code-points can be
compressed using singletons to save space or expanded out to
determine equivalence. Appending two string may re-order code
points at the interface between the two strings. Upper and lower
casing strings may add or remove code-points.
At this time, the fourth approach is out of reach. It's too
complicated.
I have considered several forms for implementing efficient buffers
with color and UTF-8:
Method 1 (Brute Force Arrays) -- The brute force method is an array of
ColorState (2 bytes per character) and an array of wide characters
(4 bytes per character). The advantages are that random and sequential
access only requires an integer array index. Visual width (given that
we are not supporting the 4th model above) is the length of the
string.
The disadvantages are that this requires 6 bytes per character or 48KB
per LBUF-length string. Since the cost of manipulating these strings
is related primarily to how much memory is touched, we expect this
approach to be a factor 5 slower than the current implementation.
Another disadvantage is that to import or export from this form
(to/from the network/dumps) requires encode/decoding to UTF-8. For
example, to delete a single character would require much encoding and
re-coding of the rest of the string.
Method 2 (keep text as UTF-8) -- UTF-8 is variable length so by
keeping the text in UTF-8 form, we lose the ability to randomly access
text. However, UTF-8 can be parsed forwards and backwards, so it is
possible to construct and update 'cursors' which are only slightly
more complicated that array indexes. The first random probe might use
existing cursors to bracket it's search, but in the worst-case, a
sequence search from the nearest end is required. Fortunately, random
access is not performed that often.
The ColorState for a character would be stored in the position of the
first UTF-8 character. So, potentially only 1 in 4 entries of the
ColorState are used. Fortunately, the larger UTF-8 characters do not
occur frequently.
Still, most text (even ASCII) uses 'normal' color.
Method 3 (keep text as UTF-8 and compress ColorState) -- One method of
compressing the ColorState array is to using the leftover 3 bits as a
count of how many characters the ColorState applies, so we can support
runs of eight characters. Since UTF-8 already requires sequential
access to character positions, our position 'cursors' can also handle
the complication of tracking which ColorState applies.
Another way of encoding the same information is to add LBUF_OFFSET
field for each ColorState which tracks at which position the ColorState
comes into force.
The disadvantage of both of these is that potentially, user data may
change the color of each character.
Method 4 (Encode ColorState as UTF-8 characters) -- It requires 13
bits to express a ColorState. One problem to overcome is that while
UTF-8 is parsable in both directions, ColorState is not. So solve
this, we could encode 13*2 or 26 bits of ColorState. One state would
describe the ColorState going forward. The other would describe the
ColorState going backwards. Only the transitions are encoded so if the
text contains no color, it contains no ColorState.
Alternatively, this mega transition could be collapsed in separate
ColorState attributes, so that for example, there would be 32
code points assigned to expressing the foreground color transition.
The advantage of this is that we get down to a single array again with
the same truncation behavior for information which extends beyond the
maximum size.
> • Any UTF-8 implementation will cause the visual width to be
> different from the buffer size, so visual width will need to be
> maintained and updated separately.
I'll also point out that the visual width of a Unicode string might not
be terribly meaningful, given the possibility of right-to-left
characters, accent characters that may be applied to the previous
character, etc.
--Robby