Re: CLA for NLP

James Tauber

Jul 24, 2013, 6:39:39 PM
to nu...@lists.numenta.org, htm-...@googlegroups.com
So it turns out there *is* some interest. I've CC'd my old Google group to try to rekindle things.

James


On Wed, Jul 24, 2013 at 12:50 PM, James Tauber <jta...@jtauber.com> wrote:
Back when Jeff's book first came out and Numenta was founded, I started a mailing list htm-ling for talking about HTM as applied to linguistics.

The list never really had much conversation on it but I wondered if, with NuPIC and the CLA, there might be renewed interest or existing work on linguistic and NLP applications.

Anyone interested or know of any work?

Anyone at OSCON interested in talking about this more?

James
--
James Tauber
@jtauber on Twitter



--
James Tauber
@jtauber on Twitter

Rich Hammett

Jul 24, 2013, 7:07:54 PM
to htm-...@googlegroups.com, nu...@lists.numenta.org
I'm interested... and wait, I thought I was on both of those lists. I'll have to check; I haven't heard anything from the nupic list lately.

rich



ma...@frolix.com

Jul 25, 2013, 2:25:24 PM
to htm-...@googlegroups.com
I too remain interested in the linguistic aspects of HTM.

I've done some programming to translate text into an HTM
built on top of a MySQL database. Such an HTM holds all the
syntactical information about all the text it has seen, and the
information is in an easy to access format. My hope is that
there is some easy algorithm to take parts of various sentences
and combine them in a way that is syntactically correct. But
I got diverted to another project that has taken all my time.

I believe I have gained some insight into where the problems
are in an HTM representation of text. I'll get into the details if
anyone is interested.

--Mark Martin


Rich Hammett

Jul 25, 2013, 3:34:00 PM
to htm-...@googlegroups.com
I'm very interested!

rich

ma...@frolix.com

Jul 25, 2013, 6:27:03 PM
to htm-...@googlegroups.com
Okay, let me start by describing the model.  Recall that the
model is built on top of a MySQL database.  There is only one
table in the database, called nodes.  Each entry in the table
represents a node in the HTM.  All nodes share the same
template.  There are currently about eight fields for each
node, but only two of them matter at this point in the
description.

The HTM model is pretty simple.  Each node has a node id
and stands for a sequential pair of sub-nodes.  The pair is
denoted by just two descendant links, pre_link and cur_link.
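A minimal sketch of that one-table layout, with SQLite standing in for MySQL; the schema beyond node_id, pre_link, and cur_link is my assumption (Mark mentions roughly eight fields, omitted here):

```python
import sqlite3

# In-memory stand-in for the MySQL database described above.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE nodes (
        node_id  INTEGER PRIMARY KEY,   -- the node's id
        pre_link INTEGER,               -- first of the sequential pair
        cur_link INTEGER,               -- second of the pair
        UNIQUE (pre_link, cur_link)     -- one node per distinct pair
    )
""")

def get_or_create(pre, cur):
    """Return the node id for a pair, inserting a row only if it is new."""
    db.execute(
        "INSERT OR IGNORE INTO nodes (pre_link, cur_link) VALUES (?, ?)",
        (pre, cur))
    row = db.execute(
        "SELECT node_id FROM nodes WHERE pre_link = ? AND cur_link = ?",
        (pre, cur)).fetchone()
    return row[0]
```

The UNIQUE constraint plus INSERT OR IGNORE gives the lookup-or-create behavior in one round trip each.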

I build nodes up from the alphanumeric characters in a string
of text.  As the text is read in, one character at a time, the
system treats the current char and the previous char as a
sequential pair.  It first checks whether a node already exists
for that pair of chars.  If not, it creates a new node with its
pre_link and cur_link pointing respectively to special nodes that
represent the two characters under consideration.  So, at the
bottom of every HTM hierarchy are special nodes that represent
specific characters of text.
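That ingestion step might look something like this in Python, with dicts standing in for the database table; all names here are mine, not Mark's:

```python
leaf_nodes = {}   # char -> special leaf-node id (bottom of the hierarchy)
pair_nodes = {}   # (pre_id, cur_id) -> pair-node id
next_id = 0

def intern(table, key):
    """Create an id for key only if it is new; return the id either way."""
    global next_id
    if key not in table:
        table[key] = next_id
        next_id += 1
    return table[key]

def read_text(text):
    """Return pair-node ids for each (previous, current) character pair."""
    pairs = []
    prev = None
    for ch in text:
        cur = intern(leaf_nodes, ch)    # special character node
        if prev is not None:            # reuse an existing pair node if any
            pairs.append(intern(pair_nodes, (prev, cur)))
        prev = cur
    return pairs

read_text("aba")    # creates 2 leaf nodes and 2 pair nodes
```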

Whenever a space char is encountered, a hierarchy is built up from the
level of chars to the next higher level of char pairs, then to the next
higher level of pairs of pairs, and so on, until there is a hierarchy
representing the entire word just encountered.  Whenever a statement
terminator is encountered, such as a period, question mark, or
semicolon, the string of words in the statement is denoted by a
hierarchy built from higher-level nodes representing sequential pairs
of words, pairs of pairs of words, and so on, up to the top node
representing the entire statement.
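The two-stage build could be sketched as follows, assuming adjacent nodes are paired level by level and an odd leftover node carries up unchanged (that detail is my guess, not stated in the post):

```python
nodes = {}

def pair(pre, cur):
    """Look up or create the node for a sequential pair."""
    return nodes.setdefault((pre, cur), (pre, cur))

def collapse(level):
    """Pair up adjacent nodes level by level until one root remains."""
    while len(level) > 1:
        nxt = [pair(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd node carries up unchanged
        level = nxt
    return level[0]

def encode_statement(text):
    """Word hierarchies first, then a hierarchy over the word roots."""
    word_roots = [collapse(list(w)) for w in text.split()]
    return collapse(word_roots)

root = encode_statement("the cat sat")
```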

It is really fun to watch this system start to learn by reading text. 
Of course, at first, it has to build up every word and every common string
of words.  But as it goes along, it starts encountering words it has
already seen, and doesn't need to recreate them.  After a while,
many common word strings have been learned, and are used over and
over again.  So, the system eats through the first million nodes
pretty quickly, but then slows down as it encounters patterns of text it
has seen before.  It is a very powerful system for compressing text. 
Once it learns the most commonly used word combinations, it can
represent them with a single node id at the top of the corresponding
hierarchy.
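That reuse is easy to illustrate directly: re-encoding text whose pairs all exist adds no new nodes, so the store only grows on genuinely new patterns (again an in-memory stand-in for the database):

```python
index = {}   # (pre, cur) -> node, deduplicating every sequential pair

def node(pre, cur):
    return index.setdefault((pre, cur), (pre, cur))

def encode(seq):
    """Collapse a sequence into one root node via pairs of pairs."""
    level = list(seq)
    while len(level) > 1:
        nxt = [node(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

encode("hierarchy")
grown = len(index)
encode("hierarchy")   # every pair already exists: the store is unchanged
```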

Let me know if this is clear.  I don't want to go on typing if I haven't
even gotten the basic model established.

--Mark Martin