Okay, let me start by describing the model. Recall that the
model is built on top of a MySQL database. There is only one
table in the database, called nodes. Each entry in the table
represents a node in the HTM. All nodes share the same
template. There are currently about eight fields per node,
but only two of them matter at this point in the description.
The HTM model itself is pretty simple. Each node has a node id
and stands for a sequential pair of sub-nodes. The pair is
denoted by just two descendant links, pre_link and cur_link.
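To make that concrete, here is a minimal sketch in Python of what one row of the nodes table holds. The in-memory dict standing in for MySQL, and everything beyond node_id, pre_link, and cur_link, are my own illustrative stand-ins, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One row of the `nodes` table (the other ~six fields omitted)."""
    node_id: int                 # primary key
    pre_link: Optional[int]      # node id of the first sub-node in the pair
    cur_link: Optional[int]      # node id of the second sub-node in the pair

# In-memory stand-in for the MySQL `nodes` table, keyed by node_id.
nodes_table: dict[int, Node] = {}
```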
I build nodes up from the alphanumeric characters in a string
of text. As the text is read in, one character at a time, the
system considers the current char and the previous char as a
sequential pair. It first checks whether a node for that pair
of chars already exists. If not, it creates a new node whose
pre_link and cur_link point, respectively, to the special nodes
representing the two characters being considered. So, at the bottom
of every HTM hierarchy are special nodes that represent specific
characters of text.
Whenever a space char is encountered, a hierarchy is built up from the
level of chars to the next higher level of char pairs, then to the next
higher level of pairs of pairs, and so on, until there is a hierarchy
representing the entire word just encountered. Whenever a statement
terminator is encountered, such as a period, question mark, or
semicolon, the string of words in the statement is denoted by
a hierarchy built from higher-level nodes representing sequential pairs of
words, pairs of pairs of words, etc., up to the top node representing the
entire statement.
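Both cases (a space ending a word, a terminator ending a statement) use the same bottom-up pairing, so one reduction routine covers them. A self-contained sketch, assuming left-to-right pairing with an odd leftover carried up a level unpaired (the handling of odd counts is my assumption, not something stated above):

```python
from itertools import count

_next_id = count(1000)   # pair-node ids; leaf ids assumed to come from elsewhere
_pairs: dict = {}        # (pre, cur) -> node_id

def pair_node(pre: int, cur: int) -> int:
    """Find or create the node for the sequential pair (pre, cur)."""
    if (pre, cur) not in _pairs:
        _pairs[(pre, cur)] = next(_next_id)
    return _pairs[(pre, cur)]

def build_hierarchy(ids: list) -> int:
    """Reduce a sequence of node ids (the chars of a word, or the words
    of a statement) to a single top node by repeatedly pairing neighbors."""
    level = list(ids)
    while len(level) > 1:
        nxt = [pair_node(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd leftover: carried up unpaired
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Because pair_node is memoized, feeding the same sequence in twice yields the same top node id.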
It is really fun to watch this system start to learn by reading text.
Of course, at first, it has to build up every word and every common string
of words. But as it goes along, it starts encountering words it has
already seen before, and doesn't need to recreate them. After a while,
many common word strings have been learned, and are used over and
over again. So, the system eats through the first million nodes
pretty quickly, but then slows down as it encounters patterns of text it
has seen before. It is a very powerful system for compressing text.
Once it learns the most commonly used word combinations, it can
represent them with a single node id at the top of the corresponding
hierarchy.
Let me know if this is clear. I don't want to go on typing if I haven't
even gotten the basic model established.
--Mark Martin