Okay, let me start by describing the model. Recall that the
model is built on top of a MySQL database. There is only one
table in the database, called nodes. Each entry in the table
represents a node in the HTM. All nodes share the same
template. There are currently about eight fields per node,
but only two of them matter at this point in the description.
The HTM model itself is pretty simple. Each node has a node id
and stands for a sequential pair of sub-nodes. The pair is
denoted by just two descendant links, pre_link and cur_link.
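To make that concrete, here is a minimal sketch in Python of what one row of the nodes table holds. The in-memory dict standing in for MySQL, and everything beyond node_id, pre_link, and cur_link, are my own illustrative stand-ins, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One row of the `nodes` table (the other ~six fields omitted)."""
    node_id: int                 # primary key
    pre_link: Optional[int]      # node id of the first sub-node in the pair
    cur_link: Optional[int]      # node id of the second sub-node in the pair

# In-memory stand-in for the MySQL `nodes` table, keyed by node_id.
nodes_table: dict[int, Node] = {}
```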
I build nodes up from the alphanumeric characters in a string
of text. As the text is read in, one character at a time, the
system considers the current char and the previous char as a
sequential pair. It first checks whether a node for that pair
of chars already exists. If not, it creates a new node whose
pre_link and cur_link point, respectively, to the special nodes
representing the two characters being considered. So, at the bottom
of every HTM hierarchy are special nodes that represent specific
characters of text.
Whenever a space char is encountered, a hierarchy is built up from the
level of chars to the next higher level of char pairs, then to the next
higher level of pairs of pairs, and so on, until there is a hierarchy
representing the entire word just encountered. Whenever a statement
terminator is encountered, such as a period, question mark, or
semicolon, the string of words in the statement is denoted by
a hierarchy built from higher-level nodes representing sequential pairs of
words, pairs of pairs of words, etc., up to the top node representing the
entire statement.
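Both cases (a space ending a word, a terminator ending a statement) use the same bottom-up pairing, so one reduction routine covers them. A self-contained sketch, assuming left-to-right pairing with an odd leftover carried up a level unpaired (the handling of odd counts is my assumption, not something stated above):

```python
from itertools import count

_next_id = count(1000)   # pair-node ids; leaf ids assumed to come from elsewhere
_pairs: dict = {}        # (pre, cur) -> node_id

def pair_node(pre: int, cur: int) -> int:
    """Find or create the node for the sequential pair (pre, cur)."""
    if (pre, cur) not in _pairs:
        _pairs[(pre, cur)] = next(_next_id)
    return _pairs[(pre, cur)]

def build_hierarchy(ids: list) -> int:
    """Reduce a sequence of node ids (the chars of a word, or the words
    of a statement) to a single top node by repeatedly pairing neighbors."""
    level = list(ids)
    while len(level) > 1:
        nxt = [pair_node(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd leftover: carried up unpaired
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Because pair_node is memoized, feeding the same sequence in twice yields the same top node id.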
It is really fun to watch this system start to learn by reading text.
Of course, at first, it has to build up every word and every common string
of words. But as it goes along, it starts encountering words it has
already seen before, and doesn't need to recreate them. After a while,
many common word strings have been learned, and are used over and
over again. So, the system eats through the first million nodes
pretty quickly, but then slows down as it encounters patterns of text it
has seen before. It is a very powerful system for compressing text.
Once it learns the most commonly used word combinations, it can
represent them with a single node id at the top of the corresponding
hierarchy.
Let me know if this is clear. I don't want to go on typing if I haven't
even gotten the basic model established.
--Mark Martin