The random number seeds generated from the specification "Merry
Christmas 2" below provide a source of random numbers for the
following procedures.
1) This procedure is a variant of the simplistic "shuffle" of message
input byte order which was popular long ago. Generate 256 arrays of
256 bytes each, in which each array contains a different shuffle for
an input data stream block of 256 bytes. The index of the array used
for a particular input block is the least significant byte generated
by a master random number generator, which is the controller for
usage of the 256 arrays. Alternatively there could be multiple levels
of tables, each of which indexes the one below it in the hierarchy
and is controlled by the table above it. To decode a block into 256
un-shuffled output bytes, simply determine the index used to select
the shuffle array. The index of the un-shuffled target byte selects
an entry in the shuffle array, and the contents of the shuffle array
at that index give the position of the byte to be pulled from the
shuffled block of 256 bytes.
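The shuffle and its inverse can be sketched as follows. This is a
minimal sketch: Python's random.Random stands in for the seeded
generators the specification describes, which is an assumption.

```python
import random

BLOCK = 256

def make_shuffle_tables(seed, count=256):
    # 256 tables, each a different shuffle of the positions 0..255.
    rng = random.Random(seed)
    tables = []
    for _ in range(count):
        perm = list(range(BLOCK))
        rng.shuffle(perm)
        tables.append(perm)
    return tables

def shuffle_block(plain, perm):
    # Encode so that decode can read plain[i] back as shuffled[perm[i]].
    out = [0] * BLOCK
    for i in range(BLOCK):
        out[perm[i]] = plain[i]
    return bytes(out)

def unshuffle_block(shuffled, perm):
    # The table entry at the target index names the byte to pull
    # from the shuffled block, as the text specifies.
    return bytes(shuffled[perm[i]] for i in range(BLOCK))
```

In use, the least significant byte of a master generator would pick
which of the 256 tables applies to each successive 256-byte block.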
2) This procedure is a variant of the ages old simplistic character
substitution technique. Generate 256 arrays of 256 bytes each, in
which each array provides a mapping of input to output byte value. The
value of the nth byte of the source data stream is used as the index
into the nth substitution array, and the value to be substituted is
the contents of that array at that index. To decode, generate from
the 256 arrays a second set of 256 arrays which invert the
substitution process. For example, if at index n in a particular
substitution array the contents are m, then in the derived decoding
array at index m the contents are n. The weakness of the old
substitution method is that a frequency analysis of the output allows
guesses as to which input letter was coded. The technique above
decouples that correspondence and makes such guesses worthless.
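The positional substitution and its derived inverse can be sketched
as below; random.Random again stands in for the specified seeded
generator, which is an assumption of the sketch.

```python
import random

def make_sub_tables(seed, count=256):
    # Each table is a bijective mapping of input to output byte value.
    rng = random.Random(seed)
    tables = []
    for _ in range(count):
        t = list(range(256))
        rng.shuffle(t)
        tables.append(t)
    return tables

def invert_tables(tables):
    # If a substitution table has contents m at index n, the derived
    # decoding table has contents n at index m.
    inverses = []
    for t in tables:
        inv = [0] * 256
        for n, m in enumerate(t):
            inv[m] = n
        inverses.append(inv)
    return inverses

def substitute(data, tables):
    # The nth byte indexes the nth table (cycling past 256 tables).
    return bytes(tables[n % len(tables)][b] for n, b in enumerate(data))
```

Decoding is the same positional pass run with the inverse tables.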
The specification "Merry Christmas" below is intended as a first step
in the data compression and encoding process for documents. A weakness
of that specification is that it does not handle coding for such data
constructs as audio and video. The two methods given above open up the
power of recursive usages of encryption to bury the source data so
deep that nobody except the owners will ever be able to exhume it.
Composing this document required two hours of work - another trivial
example of what I have conceived over the years.
Don't clap, throw money!
Send checks to 3395 Harrell Road, Arlington TN 38002-4261.
Signed By :
Lonnie Courtney Clay aka Laughing Crazy Coot aka ZORRO
Chic Logo Guy
----------------------------------------------------------------------------------------
See Google Group "Lonnie Courtney Clay" thread "Merry Christmas 2"
I hereby declare that this specification is not copyrighted or
patented. As a product of my imagination, I offer it free of charge as
a "public domain" entity, to be redistributed freely. So copy this
post to your websites anytime you please. Develop front end processes
for password usage from it if you so desire, because there are no
license fees involved.
This is a specification for converting an ASCII password string into
a larger byte string which can be used to initialize an array of
random number seeds. A neural net is the basis for the concepts given.
A "node" is the data processing and state vector component of the
system. A "nexus" is the derived set of links for each node. For each
node there are up to 26 active links in its nexus. The network
occupies a three dimensional cube with sides of length six. In the two
dimensional example given, node 01 is the network input node and node
18 is the network output node. Node 01 links to nodes 02, 03, and 04.
Node 05 links to nodes 02, 03, 04, 06, 07, 10, 11, and 12 etc.
++++++++++++++++++ 01 ++++++++++++++++++
++++++++++++++ 04 = 02 = 03 ++++++++++++++
++++++++++ 09 = 06 = 05 = 07 = 08 ++++++++++
++++++++++ 14 = 11 = 10 = 12 = 13 ++++++++++
++++++++++++++ 17 = 15 = 16 ++++++++++++++
++++++++++++++++++ 18 ++++++++++++++++++
The password string is a series of byte values which are used to
initialize the net. Each byte is converted to a set of coordinates
within the net for placement of the next node. The "x" coordinate
(value 0-5) is the remainder after division of the byte value by six.
The "y" coordinate (value 0-5) is the remainder after dividing by six
the quotient of the byte value divided by six. The "z" coordinate is
the quotient of the byte value divided by thirty six (presumably
taken modulo six, since a byte value can exceed 215). Once all nodes
have been selected from the seeding password string, the nodes are
linked as an initial nexus network to their nearest neighbors in the
x, y, and z coordinates. The 64 bit state vector of each used node is
initialized (going from 0,0,0 towards 5,5,5) to eight bytes of the
password string, beginning at the first byte and wrapping around
(when necessary) from the last back to the first. So as an example,
if the password string is 11
characters, the first node found gets the bytes 0-7 in its state
vector, the fifth node found gets bytes 4-10 then 0... Once all
initialization nodes have the state vector initialized, the password
string is submitted to the net eight bytes at a time at the node
closest to 0,0,0 (wrapping around when necessary). Each 64 bit input
to a node is xored with the node's state vector to produce a new
state vector, and the node's output value is the bitwise "not" of the
final result after all inputs are processed. Inputs come from the
closest neighbors in x,y,z coordinates which have a lower coordinate
value, while outputs are clocked to the closest neighbors in x,y,z
coordinates which have a higher coordinate value. So information
propagates from 0,0,0 towards 5,5,5.
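The coordinate mapping and the node update rule can be sketched as
follows. Reducing the "z" quotient modulo six is an assumption, since
the spec leaves out-of-range byte values open.

```python
def byte_to_coords(b):
    # x = remainder mod 6; y = next base-6 digit; z = quotient by 36.
    # Byte values run 0-255 but the cube side is 6, so z is taken
    # modulo 6 here -- an assumption, not stated in the spec.
    x = b % 6
    y = (b // 6) % 6
    z = (b // 36) % 6
    return (x, y, z)

def node_step(state, inputs):
    # XOR each 64-bit input into the state vector, then output the
    # bitwise NOT of the final result, masked to 64 bits.
    for value in inputs:
        state ^= value
    output = ~state & 0xFFFFFFFFFFFFFFFF
    return state, output
```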
The neural net nexus is rewired to have new nodes based upon the final
output node's value. As the net evolves, the input and output nodes
are revised to pick the closest nodes to 0,0,0 and 5,5,5. For each
clock cycle the eight byte output node result is converted byte by
byte into eight new nodes' x,y,z coordinates as specified above (/6
then /36). If the net already contains a node at the specified
coordinates, no action is taken. Otherwise the new node gets its state
vector initialized to the current value of the output node shifted
circularly right by a distance depending upon which byte is currently
being processed. The nexus of the new node is connected up to nearest
neighbors and those neighbors are revised to access the new node in
addition to the existing links. When TBD cycles pass without a new
node being added, this phase of filling in the network ends. Then the
latest output node is connected to the input, and each clock cycle of
the net produces a 64 bit number to be used as a random number seed.
Alternatively, a second (larger) neural net can be built from the
outputs of the current net, loosening the dependency upon the
original password string, to form a new neural net for use in
generating random number seeds.
Composing this document required four hours of work - another trivial
example of what I have conceived over the years.
Don't clap, throw money!
Send checks to 3395 Harrell Road, Arlington TN 38002-4261.
Signed By :
Lonnie Courtney Clay aka Laughing Crazy Coot aka ZORRO
Chic Logo Guy
---------------------------------------------------------------------------
See Google Group "Lonnie Courtney Clay" thread "Merry Christmas".
***************************************************************************
I hereby declare this specification to be public domain, free of all
copyright and patent claims.
**************************************************************************
Components of this selection :
1) Library of "words".
2) Acronym packer.
3) Acronym unpacker with disambiguation.
4) Archive processor.
1) Library of "words"
The library of "words" contains an entry for every "word" likely to
be encountered in the data, beginning with the 256 possible values of
a byte. I estimate that a million "words" will be about right for a
common library. The root "word" for real words is found as in
standard dictionaries, and every variant possibility has a separate
"word" entry - for example run, runs, running and so forth. A "word"
may be a composite of two real words separated by a space for very
common combinations such as "if the", "for a" etc.
a) Lookup of the dictionary can be performed using a "learned"
dynamic memory image of frequently encountered "words", with disk
access for rarely used "words". The memory image is initialized from
file(s) at the start of each message session. As a user accumulates
message traffic, the memory image can optionally be saved to disk
after each message session. I leave the details as a TBD. From 1987
until 1989, I developed a tool called Splash - "Selective parenthesis
locater and string handler" - which did analysis (of C language
function and macro calls found in all files on a disk) to create a
memory structure which was very fast for "word" lookup. That memory
structure was used to generate a cross reference (of all the function
calls found) sorted alphabetically by function name, sub-sorted by
the file the reference was found in, sub-sorted by the line of the
file on which the reference occurred. Rather than give details of
Splash here, I suggest that my dossier should be checked to see if
the government already has a copy of the free-ware disks which I
mailed out. If not, then I provided all disks to a person named Fred
Phillips in 1997, who claimed to be a college student trying to start
an Amiga free-ware diskette publishing company. He in turn provided
the disks to others. I also provided disks to Fred Fish of Amiga fame
at various times, so surely somebody has them. After 10 more years I
doubt that the disk copies which I have of the originals are
readable....
b) Each entry in the dictionary contains :
.1 The byte string of the "word", whose length is specified by a four
byte prefix, and the first byte character to be used for the encoded
"word". This provides for the dictionary to include normal and first
character capitalized versions of a real word. It also provides for
resolution of disambiguation problems by decoupling the first
character of the byte string from the first byte of the encoded
"word". For encoding lookup purposes a dynamic data structure needs
to be generated which gives fast recognition for byte strings whose
first byte does not match the first byte character in the encoded
word. As an example, suppose no match is found for a source byte
sequence in the dictionary. Then a check is made using a table of 256
chain header addresses whose entry is selected by the first byte of
the sequence. The attached chain of exception "words" points to
dictionary entries which are exceptions, with quick recognition made
possible by providing in each entry the second through fifth bytes of
the sequence of the "word".
.2 The longitudinal redundancy check of the "word" - a byte value.
.3 The shifted longitudinal redundancy check of the "word" - a byte
value. The circular shifting of bits used for a particular dictionary
can vary from standard (such as one bit shift left circular) to
produce custom dictionaries. This is the first of many options
available to enhance security.
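The two check bytes of .2 and .3 can be sketched as below. Where the
circular shift sits in the shifted LRC is one plausible reading (the
accumulator is rotated before each XOR); the spec only fixes that the
shift is customizable per dictionary.

```python
def lrc(data):
    # Longitudinal redundancy check: XOR of all bytes of the "word".
    acc = 0
    for b in data:
        acc ^= b
    return acc

def shifted_lrc(data, shift=1):
    # Assumed variant: rotate the accumulator left circularly by
    # `shift` bits before XORing in each byte. Changing `shift`
    # yields the custom-dictionary variations the text mentions.
    acc = 0
    for b in data:
        acc = ((acc << shift) | (acc >> (8 - shift))) & 0xFF
        acc ^= b
    return acc
```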
.4 A byte containing eight different (CRC-16) parity bits (for the
two bytes of .2 and .3) calculated from a cafeteria of cyclic
redundancy check techniques. Which bit of the byte is used for which
technique must be agreed upon for users to share a common dictionary.
This byte provides not only disambiguation but also the option to
perform a high reliability integrity check on the library. First the
two LRC bytes are verified, then they are used to verify the byte
string of the "word". There may be hundreds of millions of computers
using this algorithm, so it is desirable to have one common
dictionary for low security data. However there should be a different
version for secure business (and even departmental) traffic to avoid
vulnerability to hacking of the entire user network at once.
Furthermore, the choice of which byte (of .2 and .3) comes first in
each 16 bit composite can be different for each of the eight bit
slots - big endian versus little endian - which gives an enormous
number of variations possible beyond the 256 possible values of a
byte. According to a Wikipedia article, there are at least 26 CRC
algorithms. 26!/18! is quite a big number, making it rather difficult
to guess what techniques are being used on the 256 possible endian
variations (for the values of .2 and .3). So as I said, security can
be quite rigorous if a user wants to construct a custom library.
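The 26!/18! count cited above (an ordered pick of 8 of the 26 CRC
algorithms, one per bit slot) can be checked directly:

```python
import math

# 26!/18! = 26 * 25 * ... * 19, the number of ways to assign eight
# distinct CRC techniques, in order, to the eight bit slots.
count = math.factorial(26) // math.factorial(18)
assert count == math.perm(26, 8)
print(count)  # 62990928000
```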
Notes :
If a dictionary creation utility is provided then anybody could
create custom dictionaries used for private files or just a few
users. As an example, a variant of a standard dictionary could be
generated which changes the SLRC shifting, the CRC technique
combination, and the endian combination to produce a new set of codes
for bytes 3-4 of each code specification for a standard library
"word" set. The utility could also crunch the archive specific
dictionary created by LCC-zip (specified below in part 4) as a seed
of definitions rather than using the standard library of "words". The
first high quality product to market could be a real GOLD MINE!!!
To improve execution time for decoding, a dynamic data structure is
used which begins with a 64k entry table giving subsidiary chain
header addresses. The table is indexed by the first two bytes of the
four byte specification. The subsidiary chain attached to a table
entry gives the third bytes in use and each third byte entry is
accompanied by a subsidiary chain header for fourth byte lookup. The
fourth byte chain members point to the data for the relevant library
"word".
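The decode lookup structure above can be sketched with Python dicts
standing in for the subsidiary chains; the class and method names are
illustrative, not from the spec.

```python
class SpecIndex:
    # 64k root table indexed by the first two bytes of a four byte
    # specification; each slot chains third-byte entries, each of
    # which chains fourth-byte entries pointing at the library "word".
    def __init__(self):
        self.root = [None] * 65536  # 64k chain-header slots

    def insert(self, spec, word):
        head = (spec[0] << 8) | spec[1]
        if self.root[head] is None:
            self.root[head] = {}
        third = self.root[head].setdefault(spec[2], {})
        third[spec[3]] = word

    def lookup(self, spec):
        head = (spec[0] << 8) | spec[1]
        chains = self.root[head]
        if chains is None:
            return None
        return chains.get(spec[2], {}).get(spec[3])
```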
Discussion :
a) A "word" is uniquely identified in a particular dictionary by the
dictionary first byte character for the encoded "word" and the 16
million possible combinations provided by the three disambiguation
bytes (.2, .3, .4).
b) For alphabetical first characters (real words) a further
disambiguation is available by the choice of upper or lower case for
the first letter, giving 8 million combinations. The upper case
should be reserved for conflicts not sufficiently disambiguated by a)
above. If a capitalized word is encountered, then it is first checked
to see if the dictionary has a matching entry. If not, then the
capitalized word is represented by a double four byte specification.
The first specification says to capitalize the word indicated by the
second specification. This is a tradeoff between coded file bloat and
dictionary bloat.
c) Single byte "words" such as non-alphabetical special characters
and numerals require four bytes to specify them, bloating the coded
data by a factor of four. But the average "word" length for real
words (including a space between words) usually exceeds four,
resulting in compression of the total data stream. For common
multiple byte special character combinations (such as a period
followed by newline) the dictionary can include the combination as a
defined "word".
d) Standard data features such as logos, headers, footers, and
warning labels can be included in the library using a special first
byte "word" and three byte disambiguation to result in high
compression for pre-defined standards. In this case the byte string
entry in the dictionary may contain anything at all, such as HTML
coded images to give a single example. This is a TBD for dictionary
designers. See f) below.
e) During decode a space after a real word is implied when the real
word is immediately followed by another real word. Real words are
recognized by beginning with an alphanumeric character as the first
byte of the text string for the "word". If a "word" in the library
already includes a trailing space for some reason, then the automatic
space generation at decode is suppressed.
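The implied-space rule of e) can be sketched as a decode-side join;
the token representation here is an assumption for illustration.

```python
def join_words(words):
    # Insert a space between consecutive real words (those beginning
    # with an alphanumeric character), unless the preceding "word"
    # already carries a trailing space.
    out = []
    prev_real = False
    for w in words:
        is_real = w[:1].isalnum()
        if prev_real and is_real and not (out and out[-1].endswith(" ")):
            out.append(" ")
        out.append(w)
        prev_real = is_real
    return "".join(out)
```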
f) A first byte code specification value of zero (null) is used to
denote such text formatting functions as capitalization of the next
word in the acronym, and other text formatting codes, with 16 million
combinations available by the three associated disambiguation bytes,
which are NOT generated as in b) .2, .3, .4 above but are instead
defined by the dictionary designers to meet all anticipated
requirements. I have no intention of re-inventing HTML here, so what
text formatting codes should be included (in what combinations) I
leave as a TBD for designers.
.1 One of the requirements may be the regular presence of numeric
strings in source data so often that the bloat must be avoided. In
such a case the first byte of the specification code of null denotes
a special function, the first disambiguation byte denotes a numerical
byte string, and the remaining two bytes specify the length in bytes
(up to 65535) of the numerical byte string. This can result in
compression rather than bloat, because such packing techniques as
binary coded decimal can be used to replace the ASCII text of the
source data.
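The numeric-string packing of .1 can be sketched as below. The
disambiguation code byte 0x01 and the odd-length 0xF pad nibble are
assumptions; the spec leaves both open.

```python
def pack_bcd(digits):
    # Pack an ASCII digit string two digits per byte (BCD), padding
    # an odd-length string with a trailing 0xF nibble (an assumption).
    nibbles = [int(c) for c in digits]
    if len(nibbles) % 2:
        nibbles.append(0xF)
    out = bytearray()
    for i in range(0, len(nibbles), 2):
        out.append((nibbles[i] << 4) | nibbles[i + 1])
    return bytes(out)

def encode_numeric(digits):
    # Null first byte, a hypothetical disambiguation code 0x01 for
    # "numeric string", then a two byte length (up to 65535).
    assert len(digits) <= 65535
    return bytes([0, 0x01]) + len(digits).to_bytes(2, "big") + pack_bcd(digits)
```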
.2 Rather than provide a data interceptor with nice fat strings of
numerical data in unencrypted form, the following option can be used.
Have a key disk for the user which decodes BCD coded strings
according to a rotating mapping of the 16 possible hexadecimal codes
to each of the ten decimal digits. As each hexadecimal half byte is
processed, the mapping changes to the next according to the key disk
lookup table, which is indexed by a pseudo random number generator
whose seed is provided by a message specific encryption key-code seed
for the PRNG. I leave the details as a TBD, including how to inform
the recipient of the relevant encryption key-code.
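Since the details are declared TBD, the following is only one way the
rotating mapping could work: random.Random stands in for the key-disk
PRNG, and the table construction (16 shuffled codes aliasing the ten
digits) is illustrative rather than specified.

```python
import random

def make_tables(key_seed, count=16):
    # Each table maps the 16 hex nibble codes onto the 10 decimal
    # digits; six codes double as aliases, so every digit has at
    # least one nibble representation.
    rng = random.Random(key_seed)
    tables = []
    for _ in range(count):
        codes = list(range(16))
        rng.shuffle(codes)
        tables.append({c: i % 10 for i, c in enumerate(codes)})
    return tables

def encode_digits(digits, tables, msg_seed):
    # The PRNG picks which table governs each nibble in turn.
    rng = random.Random(msg_seed)
    nibbles = []
    for d in digits:
        table = tables[rng.randrange(len(tables))]
        aliases = [c for c, v in table.items() if v == d]
        nibbles.append(aliases[0])
    return nibbles

def decode_digits(nibbles, tables, msg_seed):
    # Replaying the same PRNG sequence recovers the table choices.
    rng = random.Random(msg_seed)
    return [tables[rng.randrange(len(tables))][n] for n in nibbles]
```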
.3 For byte strings of mixed alphabetical and numeric characters
which do not form any of the standard "word" entries in the library,
a dumpster catch-all solution is to use a null/zero, followed by a
first disambiguation code byte denoting a mixed string, followed by
two bytes giving the string length. The disambiguation code byte
specifies for the particular library whether the byte string is given
in the clear or processed first by some scrambling technique. The
scrambling technique can be (shared) encryption key controlled, or
something as simple as a time stamp defined rotating substitution
table for the common library. I leave the details as a TBD.
---------------------------------------------------------------------------
2 Acronym packer
a) Each processed data stream is prefixed by a 64+ bit time stamp in
the clear which uniquely specifies the behavior of the acronym packer
for the particular processed data stream for the particular released
version of packer and library of "words". For messages between secure
users the time stamp should be processed using an agreed upon 64+ bit
encryption key to result in a modified time stamp which is used
instead. I leave the details of encryption key generation (presumably
from an encryption key string) such as whether it should be generated
from the key string depending upon the time stamp in the clear (and
how the time stamp in the clear is combined with the resulting
encryption key) up to "experts". I simply declare "LET THERE BE a 64+
bit time stamp", and behold, it SHALL exist....
b) The acronym packer processes the input stream of "word" four byte
specifications using the time stamp to scramble the order of the
output data stream to be truly FUBAR. When a specification is
encountered for a variant "null" byte stream as in f .1,.2,.3 then
that specification cuts short the preceding block of "word" four byte
specifications and is handled by itself in the input processed data
stream specification sequential position without further scrambling.
Blocking then resumes until another "null" byte stream or end of
message is encountered. The first byte of each four byte
specification ALWAYS comes first. A time stamp and input data stream
sequence controlled rotating reordering of the other three bytes of
each four byte specification can be used as a first scramble. I leave
how the rotation is controlled as a TBD, but it should be non-trivial
- see c) for an example. An overview processor for the data can be
used to chop it into blocks between null variants in a deterministic
manner.
c) For each block (of four byte specifications), the block length in
number of specifications and how the block members are scrambled is a
library designer TBD, time stamp controlled to maximize confusion.
For example, the time stamp can be used to generate a table of
varying block lengths which rotates in a manner similar to the order
scrambling tables described below. The time stamp can be used to
generate (input data stream specification) order scrambling tables
for each block length, which are nested to any desired depth
depending upon data stream length. As an example, blocks may be 1, 2,
4, 8, 16, 32, 64, 128, or 256 (four byte specifications) in length.
The lowest level table would be a simple order scramble giving the
index in the original stream (for the implied table index) of the
processed stream. Each entry of the table for the second level up
would specify a direction of rotation and number of entries to rotate
the first level table after each block is processed. The second level
table has 256 entries and is completely used once, with wraparound
from one end to the other when necessary. The third level table, and
on up the chain for a while, operates on the table below it to
specify a direction of rotation and the start index for pulling table
entries from the table below it. At the top is a table of 256 PRNG
seeds which is derived from the current time stamp (processed by a
TBD method), which might be as simple as a (library designer or key-
disk specified) direction and count of bits to circularly rotate the
current time stamp to generate each entry. Alternatively it might be
a bit scramble of the preceding time stamp entry in the table. I am
getting bored and am in pain, so as I said above call it TBD. Using
the table of seeds (for each seed as it is indexed), the lowest level
table is regenerated by using a scratchpad diminishing linked list of
entry numbers not yet pulled and taking the latest random number
modulo the number of entries remaining. Techniques such as this
should make scrambling of multiple gigabyte messages follow no
apparent rhyme or reason and should be unbreakable. If not, then I
will come up with something more elegant when I am not ill....
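The lowest-level order table and its seed-driven regeneration can be
sketched as follows; random.Random stands in for the PRNG the seed
table would feed, which is an assumption.

```python
import random

def order_table(seed, block_len):
    # Regenerate the lowest level table: draw from a diminishing list
    # of entry numbers not yet pulled, taking the latest random number
    # modulo the number of entries remaining.
    rng = random.Random(seed)
    remaining = list(range(block_len))
    table = []
    while remaining:
        pick = rng.randrange(2**32) % len(remaining)
        table.append(remaining.pop(pick))
    return table  # table[i] = index in the original stream

def scramble(block, table):
    # Processed-stream position i holds original entry table[i].
    return [block[table[i]] for i in range(len(table))]

def unscramble(block, table):
    out = [None] * len(table)
    for i, src in enumerate(table):
        out[src] = block[i]
    return out
```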
---------------------------------------------------------------------------
3 Acronym unpacker with disambiguation
Do I really need to explain this? Maybe later...
---------------------------------------------------------------------------
I was going to hold this in reserve, but I have decided to fire off
all of my guns at once, so here goes nothing.
4 Archive processor
I am going to re-invent zip, which is a thoroughly mature
application. I think that I understand how it works as applied to
processing of archives. If not, then I will revise this later. I am
giving this specification because I want zipped files to use the
library of "words" described in item 1) above to result in coded zip
files. I call my version LCC-zip. For a particular run of LCC-zip,
there is one pass through all input files to develop the LCC-zip
dictionary. That pass is followed by a second pass through all input
files to generate packed files giving the LCC-zip dictionary and the
compressed data stream utilizing the run specific LCC-zip dictionary.
Output file formats I leave as an exercise for the student.
Optionally that pass is followed by usage of the "library of words"
described in item 1) above to replace the clear text definitions in
the LCC-zip dictionary by library defined four byte specifications,
so that the source is not only compressed but also securely coded. If
a LCC-zip dictionary entry is a composite of multiple library "words"
then a sequence of four byte specifications from the library is given
rather than just four bytes. For standard features {as specified in
1) discussion item d above}, the body of the zipped results is
searched for first byte codes of 0xc0 (specifying not tied to the
dictionary), and when found the byte string is compared against the
library for a match. Comparison can go very quickly because the
lengths of the byte string and the library object must be equal. If a
match is found, then the entire 5+n byte sequence is replaced by a
five byte sequence. The first byte is 0xff and the next four bytes
are provided by the library. This could result in enormous
compressions for well constructed dictionaries.
For unzip you simply replace the library codes in the LCC-zip
dictionary by the clear text definitions from the library before
processing the zipped files. As the zipped files are processed they
are searched for first bytes of 0xff. When 0xff is found the library
is used to restore the original byte stream. It gives a lot of pain
to snoopy people at very low cost.....
a) LCC-zip creates a custom dictionary for the specific set of packed
files which uses 2 to 5 bytes to specify a (processed) byte string
code for each common source byte sequence found in the archive.
.1 The string code first byte most significant two bits give the
number of additional bytes (1-4) required for disambiguation. The
most common 16k source byte sequences require only two bytes. Three
bytes give the most common 4 million, four bytes for 1 billion, and
finally five bytes for 256 billion, which should be rare indeed.
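The 2-bit length prefix of .1 can be sketched as below (two bytes
carry 6+8 = 14 payload bits, hence 16k codes; three bytes 22 bits,
hence 4 million, and so on). The sketch ignores collisions with the
reserved first byte codes such as 0xc0 and 0xff discussed elsewhere.

```python
def encode_code(index):
    # Top two bits of the first byte give the count of additional
    # bytes: 0b00 -> one extra byte, up to 0b11 -> four extra bytes.
    for extra in range(1, 5):
        bits = 6 + 8 * extra  # 6 payload bits in the first byte
        if index < (1 << bits):
            first = ((extra - 1) << 6) | (index >> (8 * extra))
            rest = (index & ((1 << (8 * extra)) - 1)).to_bytes(extra, "big")
            return bytes([first]) + rest
    raise ValueError("index too large for a five byte code")

def decode_code(data):
    extra = (data[0] >> 6) + 1
    index = (data[0] & 0x3F) << (8 * extra)
    index |= int.from_bytes(data[1:1 + extra], "big")
    return index, 1 + extra  # value and total code length
```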
.2 In the case of unique source byte sequences such as are found in
the HTML objects for graphics and sound, there is no dictionary
entry. Instead, the first byte code of 0xc0 is always followed by a
four byte count of bytes in the sequence, then by the unprocessed
source byte sequence. The code 0xc1 could be used with a four byte
compressed string byte count to convert long ASCII numeric strings to
BCD packed format. The LCC-zip which is developed needs to have the
ability to decode at least HTML formatted files in addition to text
files. Whether it supports smart processing of other file formats is
up to the code developers. I am NOT going to write this thing, since
doing so is an exercise for journeymen rather than masters such as I.
b) In the first pass all files are scanned to create a dynamic memory
structure, similar to my Splash, which has a 64k address root table
indexed by the first two bytes of the source data sequence in the
current window of scan. Each entry in the root table has an
associated four byte counter which is pegged at 2**32-1 to avoid
overflow. The maximum width of the current window of scan (up to 255
bytes) is decided by the user at run time and determines how much
execution time is consumed in the first pass. I recommend 48 bytes
maximum width as a default. If the LCC-zip has smart analysis of file
components, then for dictionary construction just process windows of
scan confined to text data streams.
The current window of scan begins with one of five cases :
.1 If the entire current window of scan is filled with ASCII numeric
characters or space/blank characters, then increment the start index
for the next window of scan by the maximum width of the window of
scan and end processing for the current window of scan.
.2 An alphanumeric character following a non alphanumeric character :
In this case the current window of scan processing is truncated to
the first character of the last block of non alphanumeric characters
in the current window of scan, provided the current window of scan
contains any non alphanumeric characters. Notice that this allows for
multiple real word combinations to be processed, but does not include
real word fragments chopped off by the current window of scan's end.
It also usually (for text) terminates the current window of scan with
either a space/blank or punctuation. After processing of the current
window of scan, the start index for the next window of scan is set to
the first non alphanumeric character in the current window of scan.
If all of the characters are alphanumeric, then the start index is
just incremented by one.
.3 An alphanumeric character which follows an alphanumeric character :
In this case the current window of scan processing is truncated to
the first non alphanumeric character found in the current window of
scan. If all of the characters in the current window of scan are
alphanumeric, then after processing the start index used for the next
window of scan is just incremented by one. Otherwise the start index
for the next window of scan is set to point to the first non
alphanumeric character in the current window of scan. Notice that
this provides for processing to move past the tag end of long
alphanumeric sequences as soon as a non alphanumeric character is
encountered.
.4 A non alphanumeric character other than space/blank :
In this case the current window of scan processing is truncated to
the character preceding the first character found in the current
window of scan which is alphanumeric. After processing of the current
window of scan, the start index used for the next window of scan is
just incremented by one. Notice that for the very common case of ". "
this results in the next window of scan starting with a space/blank.
A punctuation character following a real word sequence is included in
the processing of the real word sequence itself.
.5 A first character of space/blank :
In this case the processing of the current window of scan is
truncated to the character preceding the first non space/blank
character in the current window of scan. If the result is a single
space/blank then the current window of scan processing ends after
incrementing the index for the next window of scan by one. In the
case of multiple space/blank strings the index for the next window of
scan is set to point to the first non space/blank in the current
window of scan. If the entire current window of scan is space/blanks
then the index used for the next window of scan is just incremented
by one. Since the root table is indexed by the first two bytes of the
current window of scan, such strings as ". " are already covered by
the preceding window of scan processing anyway. Note that for the
typical case of two words separated by a space/blank, the space/blank
is already processed as a terminating character for the preceding
word sequence.
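The five start-of-window cases can be sketched as a dispatch; the
returned labels are illustrative, and the classifier only names the
case rather than performing the truncation each case goes on to
describe.

```python
def classify_window(window, prev_char):
    # Case .1 is checked first: an all numeric/blank window skips
    # dictionary processing entirely.
    if all(c.isdigit() or c == " " for c in window):
        return "case .1: all numeric/blank - skip whole window"
    first = window[0]
    if first.isalnum():
        if prev_char.isalnum():
            return "case .3: alnum after alnum"
        return "case .2: alnum after non-alnum"
    if first != " ":
        return "case .4: non-alnum, not blank"
    return "case .5: leading blank"
```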
Discussion : It will usually occur that a real word is processed with
the current window of scan including the real word followed by a
space/blank or punctuation. So there will be a data entry in the
dynamic memory structure which gives the real word and whose
subsidiary data elements are sub-strings continuing the word with a
single character such as " " or "," or ".". In such a case the
dictionary will generate an entry which includes the punctuation only
if it is commonly occurring.
c) The window of scan processing has two input components, a string
length and a byte string sequence which starts at the current window
of scan processing index. If the string length is one, then the
sequence has already been processed by the preceding window of scan,
so no further processing is required. Otherwise the first two bytes
of the sequence are used to index the root table and increment its
associated counter, provided that an increment does not result in
overflow. If the string length is two then processing is completed.
Otherwise check the address in the root table entry : it is either
still null or it points to a dynamically allocated data element whose
structure type is one of three possibilities and which contains the
following data :
.1 A byte value identifying the type of structure for the data
element, which is selected from the following :
.1.1 - Zero for type zero "header" data elements initially attached
to any of the 64k slots of the root table when the first 3+ length
byte string sequence references the header table entry. Each header
specifies a sub-string whose byte sequence (including the initial
third position byte value) down to the current byte string sequence
end point is initially found to be a unique sub-string sequence
appended to the preceding two byte sequence which determines the
header's parent slot entry. If an alternative variation for the sub-
string sequence is found, then the header is revised to specify a
shorter sub-string ending just before the fork point byte and to
point to a subsidiary element chain head (first of two new type one
data elements) which begins a new linked list giving sub-strings
beginning at the fork point byte onwards. If the fork point is at the
first character of the header's sub-string (byte three of the total
sequence), then the header becomes obsolete and the header's parent
entry is revised to point to a new type one subsidiary data element
chain head, and the subsidiary data element points back to the
obsolete header's parent entry. In either case the chain head sub-
string (beginning at the fork point byte) is inherited from the
header and its counter is initialized from the counter of the header.
For the second type one entry in the new linked list the counter is
initialized to one and the sub-string is the newly discovered
alternative sequence beginning at the fork point byte and continuing
to the end of the current byte string sequence. If a header already
has a subsidiary element connected and a variant is discovered for
its sub-string, then the subsidiary for the header's new subsidiary
linked list header is initialized to point to the old header
subsidiary element. The new linked list header element has a sub-
string inherited from the header which begins at the fork point byte
and includes the tag end of the header's old sub-string. The new
variant element's subsidiary is not yet established, so its address
is initialized to
null. The data element of the new variant has a sub-
string from the
fork point byte to the end of the current byte string
sequence The
subsidiary address of the header is revised to point to
the new chain
head and the chain head's parent link points back to
the header. In
all cases of headers the byte string sequence first
two bytes provide
the root table entry containing the address of the
header and the
referenced header points back to the entry in the root
table.
.1.2 - One for a linked list of type one data elements each of which
gives a next link address for the list chain. Each member after the
head has a parent address pointing back to the previous chain link.
The chain head parent address points back to the parent which
specifies the prefix for the chain members' sub-string sequences
which begin at the fork point byte index. Each member of the linked
list specifies a sub-string beginning at the chain fork point byte
index and may give the address of a subsidiary data element which
specifies how the byte sequence continues after the sub-string of the
type one data element. The subsidiary data element pointed to can be
either a type one chain head, or a type two table element. When the
linked list length exceeds TBD different (index of fork point byte
values) type one data elements, then the linked list becomes obsolete
and is replaced by a type two data element. The chain's affected type
one elements all undergo revision or removal to take into account the
new prefixing byte value determined by the entry in the type two data
element. If the sub-string of an affected type one data element is of
length one, then it becomes obsolete and its subsidiary element
address is placed into the table entry of the new type two data
element specified by the single character sub-string value of the now
obsolete type one data element. Otherwise its sub-string start
pointer is advanced one byte and it becomes the head of a single
element new linked list chain attached as a subsidiary to the new
type
two data element at the entry of its old sub-string first byte
character value. In either case when a new type two entry is created,
the frequency counter is initialized for the table entry to the value
from the matching type one data element. If a variant of the type one
element's sub-string is found and the variation occurs after the fork
point byte index, then a new subsidiary chain is formed whose head is
pointed to by the affected type one element's subsidiary address. The
head entry frequency counter is initialized from the parent and the
new variant counter is initialized to one. Each type one entry
specifies a continuation sub-string which is prefixed by all of the
parent entries up the structure to form the byte sequence of the full
string by appending the type one entry's sub-string and (if a
subsidiary link is active) specifying the variations of the byte
sequence which are suffixes to the type one element's sub-string.
.1.3 - Two for a type two data element which gives a table of 256 data
element subsidiary addresses and 256 associated four byte counters
protected from overflow of course. This is the most memory intensive
of the dynamically allocated data elements. On the other hand it
speeds up execution time significantly when compared to searching
linked lists. Trade offs are a basic fact of life....
.2 A byte value giving the character count n in the data element's
byte sub-string sequence.
.3 A byte giving the first character in the byte sub-string sequence
for the current data element, except in the case of type two data
elements.
.4 A byte value - PAD
.5 A pointer to the parent which is accessing the structure.
.6 One of the following blocks of information :
.6.0 - for type zero data elements,
.6.0.1 A four byte frequency counter which is incremented each time
that the current structure is found to match the byte string sequence,
protected from overflow of course.
.6.0.2 A pointer to a dynamically allocated memory image sub-string
start character.
.6.0.3 A pointer to the address of a subsidiary data element.
.6.1 - for type one data elements,
.6.1.1 A four byte frequency counter which is incremented each time
that the current structure is found to match the byte string sequence,
protected from overflow of course.
.6.1.2 A pointer to a dynamically allocated memory image sub-string
start character.
.6.1.3 A pointer to the address of a subsidiary data element.
.6.1.4 A pointer giving the next link address for connecting list
members.
.6.2 - for type two data elements,
.6.2.1 A table of 256 pointers to the address of the subsidiary data
elements.
.6.2.2 A table of 256 four byte frequency counters.
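The three structure types above can be written down as records. Here is
a minimal Python sketch; the field names are mine for illustration (the
spec describes pointer layouts, not names), and object references stand
in for the raw addresses:

```python
from dataclasses import dataclass, field
from typing import List, Optional

COUNTER_MAX = 0xFFFFFFFF  # four byte counter ceiling, "protected from overflow"

def bump(counter: int) -> int:
    """Increment a counter only when the result still fits in four bytes."""
    return counter + 1 if counter < COUNTER_MAX else counter

@dataclass
class TypeZeroHeader:                     # .1.1 "header" element
    count: int = 1                        # .6.0.1 frequency counter
    sub_string: bytes = b""               # .6.0.2 sub-string from byte three on
    subsidiary: Optional[object] = None   # .6.0.3 subsidiary element
    parent: Optional[object] = None       # .5 parent (root table entry)

@dataclass
class TypeOneLink:                        # .1.2 linked list member
    count: int = 1                        # .6.1.1 frequency counter
    sub_string: bytes = b""               # .6.1.2 begins at the fork point byte
    subsidiary: Optional[object] = None   # .6.1.3 subsidiary element
    next_link: Optional[object] = None    # .6.1.4 next chain member
    parent: Optional[object] = None       # .5 previous link or parent

@dataclass
class TypeTwoTable:                       # .1.3 256-way table element
    subsidiaries: List[Optional[object]] = field(
        default_factory=lambda: [None] * 256)   # .6.2.1
    counts: List[int] = field(
        default_factory=lambda: [0] * 256)      # .6.2.2
```

The sketch makes the trade-off visible: a TypeTwoTable costs 512 fields
up front, while a TypeOneLink chain grows one small record at a time.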
d) The window of scan processing begins a walk through the dynamic
memory data elements with access of the entry of the root table
indicated by the first two bytes of the current byte string sequence.
.1 If the table pointer is still null, then a new type zero header
element is created with initial count of one. The current byte string
sequence is copied from the current file's memory image onto a dynamic
memory stack so that it remains valid after the next file is loaded.
The item .6.0.2 start sub-string pointer is set for the initial sub-
string of the current byte string sequence beginning at the third
character of the sequence. The subsidiary link is set to null and the
parent link points back to the selected entry of the root table. The
sub-string character count is initialized to be two less than the
total length of the current byte string sequence. The first character
is copied from the current byte string sequence third character
position.
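Step d).1 can be sketched like so. The 64k root table, the overflow
protected counter, and copying the sequence so it survives the next
file load are from the text above; the dict layout and field names are
illustrative assumptions:

```python
ROOT_SLOTS = 256 * 256   # root table indexed by the first two bytes

def process_sequence(root_table, root_counts, seq: bytes):
    """Entry walk for one byte string sequence (length >= 2): bump the
    slot counter unless it would overflow, and if the slot is still
    null attach a new type zero header whose sub-string begins at the
    third character of the sequence (a sketch, not the full walk)."""
    idx = seq[0] * 256 + seq[1]
    if root_counts[idx] < 0xFFFFFFFF:     # protected from overflow
        root_counts[idx] += 1
    if len(seq) > 2 and root_table[idx] is None:
        root_table[idx] = {
            "type": 0,
            "count": 1,
            "sub": bytes(seq[2:]),        # copied off the file memory image
            "first_char": seq[2],         # item .3 above
            "length": len(seq) - 2,       # two less than the total length
            "subsidiary": None,
            "parent": idx,                # back link to the root table entry
        }

root = [None] * ROOT_SLOTS
counts = [0] * ROOT_SLOTS
process_sequence(root, counts, b"cats ")
```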
---------------------------------------------------------------------------
After getting up from a nap and seeing my father for his birthday
party, I sat down again to continue this document. I knew that I had
taken twice as long for part 4 Archive processor to be typed up so far
as for all of parts 1-3 combined. Now I notice that part 4 is half the
size of the total document, with no clear end in sight. So I checked
what I had typed and concluded that I was suffering from diarrhea of
the mouth. So I am going to stop the micro-management explaining and
just give some pointers for the presumably master developers who may
be reading this document. The dynamic memory allocated data structures
are sufficient tutorial for experts to deduce how to write the
program.....
A) Dynamic memory allocation is essential for this application. For
Splash I created a master scratchpad memory manager which got the
contiguous memory chunks whenever I ran out, including an oversight
function, called each time anybody wanted to put something on the
scratchpad, which was the only caller of the master scratchpad memory
manager. The number of bytes requested was checked against the
remaining current contiguous block free space, and appropriate action
taken such as returning an address in the current block or making the
call for a new block before returning an address at the start of the
new block.
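A minimal sketch of such an oversight function sitting on top of a
bump allocator. The block size and the (block, offset) address form
are my assumptions, not Splash internals:

```python
class Scratchpad:
    """Bump allocator in the spirit of the Splash scratchpad manager:
    hand out addresses from a contiguous block, fetching a new block
    only when the current one cannot satisfy a request."""

    def __init__(self, block_size: int = 1 << 20):
        self.block_size = block_size
        self.blocks = []                 # the contiguous memory chunks
        self.offset = block_size         # force a first block on first use

    def alloc(self, nbytes: int):
        """Check the request against remaining free space and return a
        (block number, offset) pair as the 'address'."""
        if nbytes > self.block_size:
            raise ValueError("request larger than a block")
        if self.offset + nbytes > self.block_size:
            self.blocks.append(bytearray(self.block_size))  # new chunk
            self.offset = 0
        addr = (len(self.blocks) - 1, self.offset)
        self.offset += nbytes
        return addr
```

Note the trade-off: the tail of a block is abandoned when a request
does not fit, which buys simplicity at the cost of some slack space.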
B) In the development of Splash I tried out garbage collection of the
dynamic memory allocated data which was made obsolete. On a two
megabyte machine, I determined that it took as much memory for the
code to handle GC as what the GC code recovered. However on a gigabyte
plus machine GC is worthwhile and should be considered a fundamental
part of the plan.
C) After all of the files are scanned, the entire dynamic memory
structure should be stepped through to determine which data elements
have counters of n=3+ (TBD) or more for a byte count of the byte
sequence string (from the root table to the end of the sub-string of
the current data element) of four or more. Have a four byte master
counter which is incremented by 9+ (byte sequence byte count) for each
data element meeting the test. (In the case of byte counts under four,
all of the combinations with counts of m=5+ (TBD) are allocated to the
first four million codes. So there is little to manage in the way of
sophistication for short strings.) If the byte master counter for the
passing data elements found would exceed the free memory available
during the next step D) below, then increment the required minimum
counter value and re-scan the entire dynamic memory structure until
memory is sufficient.
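The raise-the-threshold-and-rescan loop of step C) can be sketched as
below. Here each candidate element is reduced to a (count, byte length)
pair, which is my simplification; the 9 + length charge per passing
element is from the text:

```python
def size_dictionary(elements, free_memory, min_count=3, min_len=4):
    """Raise the minimum counter value until the projected dictionary
    fits in free memory. Returns (final minimum count, projected total).
    `elements` is an iterable of (counter value, byte sequence length)
    pairs standing in for a walk of the dynamic memory structure."""
    while True:
        total = sum(9 + length
                    for count, length in elements
                    if count >= min_count and length >= min_len)
        if total <= free_memory:
            return min_count, total
        min_count += 1      # require a higher count and re-scan
```

A real implementation would walk the structure itself on each pass
rather than a flat list, but the control flow is the same.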
D) Dynamically allocate a dictionary table of size 256 which is
indexed by sub-string byte count and which is a list of chain head
entries for dynamically allocated chains whose members give a next
link address and a pointer to a null terminated text string. Also
dynamically allocate an associated table indexed by byte count which
contains (four byte) counters to be incremented each time a member is
added to the head of chain for the entry. Once again step through the
entire dynamic memory structure and this time when the byte count
requirement is met, add the string with null terminator to a dynamic
scratchpad. Add a new member to the head of the chain for the
appropriate dictionary table entry (given by the byte count) and set
the text string pointer of the new member to the start of the new
string in the dynamic scratchpad.
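Step D) amounts to bucketing the passing strings by length into 256
chains with head insertion. A sketch, using a Python list as the
dynamic scratchpad and dicts as chain members (names are mine):

```python
def build_length_index(strings):
    """Group candidate byte strings into 256 chains indexed by byte
    count, with a per-length member counter. Head insertion keeps each
    chain in reverse discovery order."""
    chains = [None] * 256        # chain head entry per byte count
    counters = [0] * 256         # four byte counters, one per entry
    pad = []                     # stands in for the dynamic scratchpad
    for s in strings:
        n = len(s)
        pad.append(s + b"\x00")  # null terminated copy on the scratchpad
        # new member becomes the head of the chain for this byte count
        chains[n] = {"text": len(pad) - 1, "next": chains[n]}
        counters[n] += 1
    return chains, counters, pad
```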
E) All of this work in memory is vulnerable to loss if the power
glitches. So it is about time to create a dictionary file. The file is
divided into sections of increasing byte count with each section
giving a list of null terminated strings. Each section has a header
giving byte count followed by the list. Then a footer such as sixteen
0xff or some other recognizable code is used to terminate the section.
Start the file list off by writing to file the two byte sequences
implied by the entry number in the root table (which meet the minimum
count requirement), following each two byte code by a null terminator.
The next section is three byte codes. Step through the root table and
process each attached subsidiary data entry to find all three byte
codes (which meet the minimum count requirement). As each code is
found write it to file with a null terminator.
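The file layout of step E) serialized directly; I assume a one byte
section header holding the byte count, which the text does not pin
down:

```python
SECTION_FOOTER = b"\xff" * 16   # sixteen 0xff as the recognizable code

def write_dictionary(sections):
    """Serialize the dictionary file: for each byte count in increasing
    order, a one byte header giving the count, the null terminated
    strings, then the footer. `sections` maps byte count -> strings."""
    out = bytearray()
    for byte_count in sorted(sections):
        out.append(byte_count)            # section header
        for s in sections[byte_count]:
            out += s + b"\x00"            # null terminated string
        out += SECTION_FOOTER             # terminate the section
    return bytes(out)
```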
F) When I say "write it to file" I really mean to a file scratchpad of
reasonable size such as 32 megabytes. Managing the file scratchpad I
leave as an exercise for the student. So far the codes written result
in very little data compression. They are present to avoid the
unseemly situation of a short byte sequence which is not found in the
dictionary and therefore must be represented by a five byte 0xc0 code
followed by the byte sequence, which would result in bloat rather than
compression.
G) Step through the dictionary table starting with the entry for a
byte count of four. Use the associated chain member counter to
determine whether you will have an overflow problem in your file
scratchpad and manage accordingly. Beginning at the chain head, step
through the chain using the pointer to null terminated text string to
extract and write to file each of the text strings from the chain. As
each chain of a particular byte count is processed, write the file
scratchpad to disk.
H) When the disk file is complete, free up all of the dynamically
allocated memory because we are starting over!!! Reallocate the root
table of 64k addresses and this time you need a table of 64k two byte
codes rather than counters. There is just one type of data element in
the dynamically allocated memory structure now. It contains the
following :
.1 Parent link address for the current entry in the chain. For chain
headers this points to the parent at the next level up rather than
giving the chain previous link address.
.2 Next link address for the current level chain.
.3 Subsidiary link chain header address for the next level down.
.4 Pointer to a dynamically allocated null terminated text string
which gives the entire source byte string sequence for the current
element (beginning with the two byte sequence of the root table entry)
and which terminates at the character preceding the fork point byte in
the source byte string sequence implied by the subsidiary link chain
members.
.5 Pointer to the start of the sub-string for the current entry which
is prefixed by the parent element sequence (beginning at the root
table) and which terminates at the character preceding the fork point
byte in the source byte sequence implied by the subsidiary link chain
members. This pointer reuses the text string of .4 above and is
present as a speedy way to get the start byte of the sub-string so
that as the current window of processing steps through a chain (for
processing of a particular source byte string sequence) there is a
quick way to check whether the current data element sub-string matches
the current character sub-string of the source byte string sequence.
.6 A pointer to the dictionary code for the current entry's text
string (giving the entire source byte sequence) which is set the first
time that the entry is actually used to replace source file text by a
dictionary code. Although a parsimonious developer would use all of
the two byte codes before proceeding to three byte codes etc., I
recommend that instead of doing so the two byte codes should be
reserved for two and three byte sequences. The four million three byte
codes should be reserved for source byte sequences of byte count five
or less. The billion four byte codes should be reserved for sequences
of byte count 32 or less. Anything with a greater than 32 byte count
should be a five byte code.
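The reservation scheme in .6 can be written down directly as a lookup
from source string length to code width:

```python
def code_width(byte_count: int) -> int:
    """Dictionary code width in bytes for a source byte sequence of
    the given length: two byte codes for sequences of 2-3 bytes, three
    byte codes up to 5 bytes, four byte codes up to 32 bytes, five
    byte codes beyond that."""
    if byte_count <= 3:
        return 2
    if byte_count <= 5:
        return 3
    if byte_count <= 32:
        return 4
    return 5
```

The point of the scheme is that a code is always shorter than the
sequence it replaces once the byte count exceeds the code width.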
I) Read in the dictionary file, creating the dynamically allocated
memory structure entries by stepping through the structure beginning
at the root table.
.1 If the dictionary text byte count is two, then pull the next two
byte dictionary code and set the two byte code of the indexed entry in
the root table. Add the code and its associated null terminated byte
string to the memory dictionary. Go back to step I) until the last two
byte dictionary entry has been read in.
.2 If the root table subsidiary link is null, then hook up a new
element chain header address to the root table and set the new element
parent to the address of the root table entry. Go back to step I)
.3 If the dictionary code is a three byte sequence then replace the
chain header for the current entry in the root table by the new
element. So point the parent of the old header to the new element, the
next link address of the new element to the old header, the parent of
the new element to the root table entry, and the root table subsidiary
link to the new element. Go back to step I)
J) When the root table entry gives a subsidiary link and the byte
count of the next file dictionary string is greater than three, then
beginning with an index of 3 and pointing to the chain head entry
given by the root table entry for the first two bytes of the file
dictionary string :
.1 Step through the relevant chain of elements whose sub-string begins
at byte "index". Because the dictionary was formed to have a counter
value for parent elements at least as great as that of subsidiary
elements, there should always be an element in some parent chain which
has already been read in from the dictionary before a dictionary entry
with a text string byte count greater than the parent's.
.2 When the first character of the sub-string for the current element
matches the corresponding byte in the text string of the dictionary
entry, then check to see if a subsidiary link has already been
allocated.
.a If so then check whether the current element sub-string is a match
to the byte sequence of the dictionary entry being added. If so then
increment "index" and compare it to the byte count of the current file
dictionary string, checking whether they are equal :
.a1 If so then replace the subsidiary link chain header by a new
element for the dictionary entry being added. Point the parent of the
old subsidiary chain header to the new element, point the next link
address of the new element to the old subsidiary chain header, point
the parent of the new element to the current level element, and point
the subsidiary link of the current element to the new element. Then
pull a new file dictionary entry and go back to step J) above.
.a2 If not then point to the subsidiary element of the current element
and go back to step .1
.b If a subsidiary link has NOT already been allocated, then check
whether the current element sub-string is a match to the byte sequence
of the dictionary entry being added.
.b1 If so then create a new subsidiary chain header element. Point the
subsidiary link of the current element to the new element, point the
new element parent link back to the current element, and set the next
link address of the new element to NULL. Then pull a new file
dictionary entry and go back to step J) above.
.b2 If not then something has gone wrong ROTFLMAO.
.3 When the first character of the sub-string for the current element
does NOT match the corresponding byte in the text string of the
dictionary entry, then step forward using the next link address to the
next element in the chain.
.a If there WAS no next link then something has gone wrong ROTFLMAO
.b Otherwise go back to step .1
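Taken together, steps I) and J) amount to inserting each dictionary
string into a tree keyed by successive bytes, with shorter strings
guaranteed to arrive before their longer extensions. A much simplified
sketch using nested Python dicts in place of the linked chains and
parent/next pointers (every name here is illustrative):

```python
def insert_entry(root, text: bytes):
    """Insert one dictionary string (byte count >= 2): the first two
    bytes select the root table entry, then each later byte selects or
    creates a child element, and the final element is marked as a
    complete dictionary entry."""
    slot = root.setdefault(text[:2], {})   # root table entry, 2 byte key
    node = slot
    index = 2                              # walk starts at byte three
    while index < len(text):
        byte = text[index]
        children = node.setdefault("sub", {})   # subsidiary link level
        if byte not in children:
            children[byte] = {}            # new chain head element
        node = children[byte]
        index += 1
    node["code"] = text                    # mark a full dictionary entry
    return root
```

The real structure keeps explicit parent and next links so chains can
be spliced as in .a1 and .b1 above; the dict hides that bookkeeping.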
---------------------------------------------------------------------------
I notice that I have become slap happy with fatigue and I am no longer
being very helpful. So it is time for the fat lady to sing and shut
this circus down. Just a few more comments and I will post it.
K) When the dictionary has been reconstructed in memory from file,
begin Pass 2 scanning of all archive files again. Using knowledge of
file formats, process only text data with the dictionary. Use whatever
compression technique is appropriate for non text data such as HTML
objects for images and audio. For long numeric strings convert them to
BCD strings. Using the library of section 1 discussion item d)
recognize standard features and replace them by an identifying code
which prefixes the four byte specifications from the library. For the
text data use a smart algorithm to chop a text stream into blocks of
dictionary defined strings which are either contiguous or separated by
special character strings to be specially coded as 0xc0 - 4 byte count
- string. When each dictionary entry is first used pull an appropriate
unused code, set the code for the dictionary data element and
otherwise track what is actually used to compress files. When all
files have been processed save to file the dictionary which was
actually used with the text strings coded by the library of section 1
above.
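The Pass 2 chop of a text stream into dictionary strings and 0xc0
coded literal runs might look like this greedy sketch. The longest
match limit of 32 bytes follows the code width scheme above; the big
endian count and greedy (rather than "smart") matching are my
assumptions:

```python
def encode_stream(data: bytes, dictionary):
    """Chop `data` into dictionary defined strings (mapped to their
    codes) and leftover runs emitted as 0xc0, a four byte count, then
    the raw bytes. `dictionary` maps byte strings to code bytes."""
    out = bytearray()
    literal = bytearray()          # pending special character run
    i = 0
    while i < len(data):
        # try the longest candidate first, down to two bytes
        for end in range(min(len(data), i + 32), i + 1, -1):
            if data[i:end] in dictionary:
                if literal:        # flush the pending 0xc0 run
                    out += b"\xc0" + len(literal).to_bytes(4, "big") + literal
                    literal.clear()
                out += dictionary[data[i:end]]
                i = end
                break
        else:                      # no dictionary string starts here
            literal.append(data[i])
            i += 1
    if literal:
        out += b"\xc0" + len(literal).to_bytes(4, "big") + literal
    return bytes(out)
```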
L) When the archive is to be unpacked, use the library to decode the
dictionary and restore coded standard features and otherwise use
expert techniques to minimize resources consumed in the unpacking
process.
Well that was a whirlwind cleaning up the remaining discussion like a
white tornado!
---------------------------------------------------------------------------
Completed at 1:40 AM on July 23, 2007
It took less than two days to produce this document. Frankly speaking
I have no good estimate of how long it took to develop the concepts
embodied here. However just to give a notion, I spent over 1000 hours
developing the free-ware Splash, which is the basis for the part 4
specification. I really hope that others appreciate the value of this
document. But as the amateur entertainers say - DONT CLAP - THROW
MONEY!!! Send checks to 3395 Harrell Rd. Arlington TN, 38002-4261.
Laughing Crazy Coot/
Lonnie Courtney Clay