Poking at making Unicode tables smaller.

5 views
Skip to first unread message

Brazil

unread,
Mar 24, 2008, 12:07:12 PM3/24/08
to tinymux-dev
Most of the tables are small:

// 150 states, 99 columns, 15106 bytes
// 6 states, 14 columns, 340 bytes
// 6 states, 14 columns, 340 bytes
// 8 states, 23 columns, 440 bytes
// 6 states, 14 columns, 340 bytes
// 8 states, 12 columns, 352 bytes
// 3 states, 6 columns, 274 bytes
// 6 states, 21 columns, 382 bytes
// 74 states, 191 columns, 28524 bytes
// 70 states, 191 columns, 13626 bytes
// 40 states, 84 columns, 3616 bytes
// 43 states, 89 columns, 4083 bytes
// 43 states, 89 columns, 4083 bytes
// 5 states, 13 columns, 321 bytes

The largest one (28,524 bytes) is used to down-convert Unicode to
Latin-1. It's twice as big as the Unicode to ASCII table because there
are more than 256 states (including the accepting states). Otherwise,
it would be about 13KB as well.

There are ways to compress this down. In fact, initial estimates are
that the unicode-to-latin1 translation table can be reduced to 4,726
bytes or perhaps even smaller while still maintaining a respectable
look-up time.

The idea is to use two phrase types to compress each row in the state
transition table:

RUN := <count> <value>
COPY := <count> <value(1)> <value(2)> ... <value(count)>

Since the state transition matrix is a 2-dimensional array, it is
necessary to multiple by the row size. But, when using these phrases,
the multiple is not necessary. Instead, the count in the RUN and COPY
phrases are subtracted from the state until the correct phrase is
found and the value within that phrase is retrieved.

To keep ASCII fast, the entire first row could be intentionally
encoded as a COPY phrase even though it may contain repeated values.
Or, perhaps limited to contain only one or two phrases whatever they
may be.

To further compress, phrases can then be blobbed together into a
single-dimensional array which attempts to overlap phrases as much as
possible. A separate single-dimensional array would map a row number
to a starting position within the blob.

If a single Unicode-to-Latin-X table takes only 4KB to encode, there
may be no reason to not support most or all of the Latin-X encodings
at the networking interface.

Here's the patch to estimate the table compression:

diff --git a/utf/smutil.cpp b/utf/smutil.cpp
index ee2a83b..e5a1477 100644
--- a/utf/smutil.cpp
+++ b/utf/smutil.cpp
@@ -708,6 +708,7 @@ void StateMachine::ValidateStatePointer(State
*pState, int iLine)

void StateMachine::OutputTables(char *UpperPrefix, char *LowerPrefix)
{
+ int SizeOfCompressedMachine = 256;
int SizeOfState;
int SizeOfMachine;
MinimumMachineSize(&SizeOfState, &SizeOfMachine);
@@ -775,6 +776,9 @@ void StateMachine::OutputTables(char *UpperPrefix,
char *LowerPrefix)
State *pi = m_stt[i];

printf(" {");
+ int kLastValue = 0;
+ int kRunCount = 0;
+ int kCopyCount = 0;
int j;
for (j = 0; j < 256; j++)
{
@@ -809,7 +813,63 @@ void StateMachine::OutputTables(char
*UpperPrefix, char *LowerPrefix)
printf(",");
}
printf(" %3d", k);
+
+ if (0 == j)
+ {
+ kCopyCount = 1;
+ kRunCount = 0;
+ kLastValue = k;
+ }
+ else if (kLastValue == k)
+ {
+ if (0 < kRunCount)
+ {
+ kRunCount++;
+ }
+ else if (1 == kCopyCount)
+ {
+ kCopyCount = 0;
+ kRunCount = 2;
+ }
+ else // if (1 < kCopyCount)
+ {
+ SizeOfCompressedMachine += SizeOfState *
((kCopyCount-1)+1);
+ kCopyCount = 0;
+ kRunCount = 2;
+ kLastValue = k;
+ }
+ }
+ else // if (kLastValue != k)
+ {
+ if (0 < kCopyCount)
+ {
+ kCopyCount++;
+ kLastValue = k;
+ }
+ else if (1 == kRunCount)
+ {
+ kCopyCount = 2;
+ kLastValue = k;
+ }
+ else // if (1 < kRunCount)
+ {
+ SizeOfCompressedMachine += 2 * SizeOfState;
+ kCopyCount = 1;
+ kRunCount = 0;
+ kLastValue = k;
+ }
+ }
}
+
+ if (0 < kRunCount)
+ {
+ SizeOfCompressedMachine += 2 * SizeOfState;
+ }
+ else if (0 < kCopyCount)
+ {
+ SizeOfCompressedMachine += SizeOfState * (kCopyCount+1);
+ }
+
printf("}");
if (i < m_nStates - 1)
{
@@ -818,6 +878,7 @@ void StateMachine::OutputTables(char *UpperPrefix,
char *LowerPrefix)
printf("\n");
}
printf("};\n");
+ printf("// SizeOfCompressedMachine is %d bytes\n",
SizeOfCompressedMachine);
}

void StateMachine::TestString(UTF8 *pStart, UTF8 *pEnd, int
AcceptingState)


Brazil
Reply all
Reply to author
Forward
0 new messages