Future Buffer Management Requirements

Stephen Dennis

unread,

Jan 22, 2007, 3:53:43 PM1/22/07

to tinym...@googlegroups.com

For TinyMUX 2.7 and later, there are a series of competing goals related to buffer management. The current mux_string implementation punts on most of these issues. That it is being used in a few functions is enough for now. Future demands include:

Any UTF-8 implementation will cause the visual width to be different from the buffer size, so visual width will need to be maintained and updated separately.
Also, with UTF-8, the index correspondence between character buffer and color buffer is lost. Perhaps only the color state of the first byte of a UTF-8 sequence is used although, this seems wasteful.
Most buffers will not have color. Perhaps 'normal' color can be special-cased. Perhaps a NULL pointer or a frequently-used reference to a common 'normal' color state array?
Lazy-conversions between dbref, integer, floating-point, string, and lists of the same could be added so that instead of passing lbufs or mux_strings around, we pass mux_variants instead, mux_string being one case of a mux_variant.
The ability to temporarily extend the maximum size of an intermediate result to something beyond the compile-time LBUF_SIZE limit. wizbuf(...something big...)
Reference-counting helps global registers a lot, and it would be nice to unify that with the rest of buffer handling.
Global registers benefit from allocating only as much as space as needed. If that is a benefit generally, it would be nice to unify that with the rest of buffer handling.

Unification, by itself, it not a compelling goal, so if two more-specialized mechanisms are needed, so be it.

Brazil

brazi...@gmail.com

unread,

Jan 24, 2007, 12:27:31 PM1/24/07

to tinymux-dev

>From studying Unicode these past few days, there appear to be three
models of increasing sophistication.

1. A character is a byte. This works for ASCII and latin-1. It is
what the server currently uses.

2. A character is a wider word. For example, WCHAR on Windows is
16-bits, and some programs do not implemente that a character may
require two of these.

3. A character is a code-point. UTF-8 requires one to four bytes and
encodes the entire code point. This is the 'recommended' approach
by W3C.

4. There is a many-to-many mapping between characters and code
points. For example, the ligature 'fi' is two characters, but
requires only one code point. Likewise, many 'combining' code
points serve to modify the previous base code point to form a
single character. Likewise, sequences of code-points can be
compressed using singletons to save space or expanded out to
determine equivalence. Appending two string may re-order code
points at the interface between the two strings. Upper and lower
casing strings may add or remove code-points.

At this time, the fourth approach is out of reach. It's too
complicated.

I have considered several forms for implementing efficient buffers
with color and UTF-8:

Method 1 (Brute Force Arrays) -- The brute force method is an array of
ColorState (2 bytes per character) and an array of wide characters
(4 bytes per character). The advantages are that random and sequential
access only requires an integer array index. Visual width (given that
we are not supporting the 4th model above) is the length of the
string.

The disadvantages are that this requires 6 bytes per character or 48KB
per LBUF-length string. Since the cost of manipulating these strings
is related primarily to how much memory is touched, we expect this
approach to be a factor 5 slower than the current implementation.

Another disadvantage is that to import or export from this form
(to/from the network/dumps) requires encode/decoding to UTF-8. For
example, to delete a single character would require much encoding and
re-coding of the rest of the string.

Method 2 (keep text as UTF-8) -- UTF-8 is variable length so by
keeping the text in UTF-8 form, we lose the ability to randomly access
text. However, UTF-8 can be parsed forwards and backwards, so it is
possible to construct and update 'cursors' which are only slightly
more complicated that array indexes. The first random probe might use
existing cursors to bracket it's search, but in the worst-case, a
sequence search from the nearest end is required. Fortunately, random
access is not performed that often.

The ColorState for a character would be stored in the position of the
first UTF-8 character. So, potentially only 1 in 4 entries of the
ColorState are used. Fortunately, the larger UTF-8 characters do not
occur frequently.

Still, most text (even ASCII) uses 'normal' color.

Method 3 (keep text as UTF-8 and compress ColorState) -- One method of
compressing the ColorState array is to using the leftover 3 bits as a
count of how many characters the ColorState applies, so we can support
runs of eight characters. Since UTF-8 already requires sequential
access to character positions, our position 'cursors' can also handle
the complication of tracking which ColorState applies.

Another way of encoding the same information is to add LBUF_OFFSET
field for each ColorState which tracks at which position the ColorState
comes into force.

The disadvantage of both of these is that potentially, user data may
change the color of each character.

Method 4 (Encode ColorState as UTF-8 characters) -- It requires 13
bits to express a ColorState. One problem to overcome is that while
UTF-8 is parsable in both directions, ColorState is not. So solve
this, we could encode 13*2 or 26 bits of ColorState. One state would
describe the ColorState going forward. The other would describe the
ColorState going backwards. Only the transitions are encoded so if the
text contains no color, it contains no ColorState.

Alternatively, this mega transition could be collapsed in separate
ColorState attributes, so that for example, there would be 32
code points assigned to expressing the foreground color transition.

The advantage of this is that we get down to a single array again with
the same truncation behavior for information which extends beyond the
maximum size.

Robby Griffin

unread,

Jan 24, 2007, 12:38:02 PM1/24/07

to tinym...@googlegroups.com

On Jan 22, 2007, at 14:53, Stephen Dennis wrote:

> • Any UTF-8 implementation will cause the visual width to be

> different from the buffer size, so visual width will need to be
> maintained and updated separately.

I'll also point out that the visual width of a Unicode string might not
be terribly meaningful, given the possibility of right-to-left
characters, accent characters that may be applied to the previous
character, etc.

--Robby

Stephen Dennis

unread,

Jan 24, 2007, 2:34:43 PM1/24/07

to tinym...@googlegroups.com

That was the distinction between the 3rd and 4th approaches in my previous post or what the following link from W3C describe as the 'character string' concept and 'grapheme cluster' concept:

http://www.w3.org/TR/2005/REC-charmod-20050215/

If we implement according to the 'character string' concept, and a player uses these composition code points, the columns in @mail, WHO, etc. will not line up properly. It may be possible to compress some of the sequences to singletons, but there aren't enough singletons to cover everything that is possible with composition code points. In fact, the server cannot easily determine how the client will render a sequence of code points at all, and formating-oriented functions like center(), wrap(), columns(), table() as well as @mail, WHO, @list allocations can only work properly by delegating this formating to the client with tags.

Brazil

Reply all

Reply to author

Forward