Here's an interesting thought on how to do pattern mining in a corpus
of mathematical proofs, or a corpus of probabilistic-logic proofs
(like we'll gather from applying PLN). This is inspired by the
below email I just sent regarding how to combine "pattern mining on
parse trees" with NN learning for finding syntactic categories and
doing word sense disambiguation...
Suppose you have a corpus of proofs... then
1) You find surprising patterns in the corpus, using e.g. the OpenCog
Pattern Miner. Each such pattern is a series of inference-rule
applications, with a mix of concrete terms, categories of terms, or
variables being subjected to the rules...
2) You associate each node in an inference-tree (each
inference-rule-application) with a large sparse vector, which has an
entry for each possible pattern identified in step 1. (So a 1 in
entry i indicates that the i'th pattern in the dictionary exists in
the node corresponding to that vector)
3) Then you learn a big matrix that does compression of these sparse
vectors to dense vectors. (more on this below)
4) Then, you can associate a dense vector with each node in the
inference tree. You can then learn a neural net that tries to (in
"skip-gram" style) predict a node in an inference tree given its
surrounding context (where the surrounding context can be summarized
in various ways, including simple ways like the context-matrix I
suggested in step 4 of my linguistics algorithm below, and potentially
other ways...). You can also do this separately for successful and
unsuccessful inference chains...
5) You co-learn the matrix in step 3 with the NN in step 4
6) Using the predictor in step 4, you guide forward and/or backward
chaining inference (e.g. using the NN trained on successful inferences
to help choose which inference tree nodes are likely to be good)
7) After doing 6) for a while you extract new patterns, and you update
your pattern library in step 1), and repeat the whole thing
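Steps 2-5 above can be sketched concretely. This is a toy illustration only, not OpenCog code: the dictionary size, the embedding dimension, the hinge-style loss, and all function names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PATTERNS = 1000   # size of the pattern dictionary from step 1 (assumed)
DIM = 32            # dense embedding dimension (assumed)

def sparse_pattern_vector(pattern_ids):
    """Step 2: binary vector with a 1 in entry i iff mined pattern i
    occurs at this inference-tree node."""
    v = np.zeros(N_PATTERNS)
    v[list(pattern_ids)] = 1.0
    return v

# Step 3: the compression matrix, co-learned with the predictor (step 5).
W = rng.normal(scale=0.1, size=(DIM, N_PATTERNS))

def embed(sparse_vec):
    """Dense vector for an inference-tree node."""
    return W @ sparse_vec

def skipgram_score(context_vecs, target_vec):
    """Step 4, skip-gram style: how well does the averaged context
    predict the target node?"""
    ctx = np.mean([embed(v) for v in context_vecs], axis=0)
    return float(np.dot(ctx, embed(target_vec)))

def train_step(context_vecs, pos_vec, neg_vec, lr=0.05):
    """Step 5: one hinge-loss SGD step on W, pushing the node actually
    seen in this context above a randomly drawn negative node."""
    global W
    margin = (1.0 - skipgram_score(context_vecs, pos_vec)
                  + skipgram_score(context_vecs, neg_vec))
    if margin > 0:
        ctx = np.mean(context_vecs, axis=0)
        diff = pos_vec - neg_vec
        # gradient of ctx^T W^T W diff with respect to W
        W += lr * (np.outer(W @ ctx, diff) + np.outer(W @ diff, ctx))
```

For step 6, the trained `skipgram_score` (fit on successful inference chains) would be called during chaining to rank candidate next inference steps.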
...
Note that the learning in step 5) and the inference-step selection in
step 6) can both be framed using probabilistic programming...
This combines "deep math" type analysis with information-theoretic
pattern mining...
Of course one can also add
1') Apply probabilistic inference to extrapolate from the patterns
mined, to learn new conjectural patterns. Throw those into the mix
when going to step 2.
This last step 1' makes the whole thing additionally recursive,
because the probabilistic inference used there is guided based on
steps 1-7 ...
And so, to use my favorite Aussie expression, Bob's your uncle!
Singularity awakens!
...
I'll post a full implementation in a couple hours [urrggghh ... I wish...]
-- Ben
---------- Forwarded message ----------
From: Ben Goertzel <b...@goertzel.org>
Date: Thu, May 11, 2017 at 5:40 PM
Subject: Re: [Link Grammar] Re: Word similarity database report
To: link-grammar <link-g...@googlegroups.com>
Cc: opencog <ope...@googlegroups.com>, Ruiting Lian <rui...@hansonrobotics.com>

On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <linasv...@gmail.com> wrote:
> There are two hard parts to clustering. One is writing all the code to get
> the clusters working in the pipeline. I guess I'll have to do that. The
> other is dealing with words with multiple meanings: "I saw the man with the
> saw" and clustering really needs to distinguish saw the verb from saw the
> noun. Not yet clear about the details of this. i've a glimmer of the
> general idea,
I was thinking of addressing this with (fairly shallow) neural
networks ...
This paper
https://nlp.stanford.edu/pubs/HuangACL12.pdf
which I've pointed out before, does unsupervised construction of
word2vec type vectors for word senses (thus, doing sense
disambiguation sorta mixed up with the dimension-reduction process)
Now that algorithm takes sentences as inputs, not parse trees. But I
think you could modify the approach to apply to our context, in an
interesting way...
The following describes one way to do this. I'm sure there are others.
1) A first step would be to use the OpenCog pattern miner to mine the
surprising patterns from the set of parse trees produced by MST
parsing.
2) Then, one could associate with each word-instance W a set of
instance-pattern-vectors. Each instance vector is very sparse, and
contains an entry for each of the patterns (among the surprising
patterns found in step 1) that W is involved in. Given these
instance-pattern-vectors, one can also calculate word-pattern-vectors
or word-sense-pattern-vectors (via averaging the instance-vectors for
all instances of the word or word-sense)
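Step 2 in miniature (the pattern indices below are made up; in practice they would come from the pattern miner in step 1, and the dictionary size is an assumption):

```python
import numpy as np

N_PATTERNS = 500  # number of surprising parse-tree patterns from step 1

def instance_vector(pattern_ids):
    """Sparse instance-pattern-vector for one word-instance W:
    a 1 for each mined pattern that this instance participates in."""
    v = np.zeros(N_PATTERNS)
    v[list(pattern_ids)] = 1.0
    return v

def word_vector(instance_vectors):
    """Word- (or sense-) pattern-vector: the average of the
    instance-vectors over all instances of that word or sense."""
    return np.mean(instance_vectors, axis=0)

# two instances of "saw": a verb-ish and a noun-ish pattern set (invented)
saw_verb = instance_vector([10, 42])
saw_noun = instance_vector([10, 99])
saw = word_vector([saw_verb, saw_noun])
```

Patterns shared by every instance keep weight 1.0 in the averaged vector, while sense-specific patterns get fractional weight.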
3) Their algorithm involves an embedding matrix L that maps a one-hot
vector (with a 1 in position i representing the i'th word in the
dictionary) into a much smaller dense vector. I would suggest
instead having an embedding matrix L that maps the pattern-vectors
representing words or senses (constructed in step 2) into a much
smaller dense vector. This is word2vec-ish, but the data it's drawing
on is the set of patterns observed in a corpus of parse trees...
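A sketch of the proposed change to L: instead of consuming a one-hot word indicator, it consumes a pattern-vector. The randomly initialized matrix here just stands in for the learned one, and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATTERNS, DIM = 500, 50  # assumed sizes

# In Huang et al., L maps a one-hot word indicator to a dense vector;
# here it maps a (possibly fractional) pattern-vector instead.
L = rng.normal(scale=0.1, size=(DIM, N_PATTERNS))

def embed_pattern_vector(pattern_vec):
    """Dense embedding of a word or sense from its pattern-vector."""
    return L @ pattern_vec
```

Since the map is linear, embedding a word's averaged pattern-vector equals averaging the embeddings of its instance-vectors.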
4) Their algorithm involves, in the local score function, using a
sequence [x1, ..., xm], where xi is the embedding vector assigned to
word i in the sequence being looked at. Instead, we could use a
structure like the following, where w is the word being predicted and
S is the sentence containing w,
[ avg. embedding vector of words one link to the left of w in the
parse tree of S, avg. embedding vector of words one link to the right
of w in the parse tree of S, avg. embedding vector of words two links
to the left of w in the parse tree of S, avg. embedding vector of
words two links to the right of w in the parse tree of S]
This context-matrix is a way of capturing "the embedding vectors of
the words constituting the context of w in parsed sentence S" as a
linear vector... Stopping at "two links away" is arbitrary; probably
we want to go 4-5 links away (yielding a sequence of 8-10 averaged
embedding vectors); this would have to be experimented with...
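One way to assemble that context-matrix, assuming each word's signed link-distance from w has already been read off the parse tree (the `links_from_w` map is a hypothetical helper, and the dimension is an assumption):

```python
import numpy as np

DIM = 50        # embedding dimension (assumed)
MAX_LINKS = 2   # stop at "two links away"; the email suggests trying 4-5

def context_matrix(embeddings, links_from_w):
    """Context of word w in parsed sentence S: for each link-distance d,
    the average embedding of the words d links to the left and d links
    to the right of w. `links_from_w` maps each other word's index to
    its signed link-distance (negative = left of w, positive = right)."""
    rows = []
    for d in range(1, MAX_LINKS + 1):
        for sign in (-1, +1):  # row order: 1-left, 1-right, 2-left, 2-right
            vecs = [embeddings[i] for i, dist in links_from_w.items()
                    if dist == sign * d]
            rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(DIM))
    return np.stack(rows)  # shape (2 * MAX_LINKS, DIM)
```

Flattening the returned matrix gives the "linear vector" form of the context described above.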
...
Given these changes, one could apply the algorithm in the paper for
sense disambiguation and clustering...
Of course, there would also be a lot of other ways to mix up the same
ingredients mentioned in the above ... the two unique ingredients I
have introduced are
* creating dense vectors for words or senses from pattern-vectors
* creating context-matrices partly capturing the context of a
word-instance (or word or sense) based on a corpus of parse trees...
...and one could play with these in many different ways.
To put it more precisely, there are a lot of ways that one could iteratively
-- cluster word-instances based on their context-matrices (thus
generating word labels)
-- learn an embedding matrix (starting from pattern-vectors) that
enables accurate skip-gram prediction based on knowing the labels of
the words produced by the clustering done in the preceding step
Mimicking the algorithm from the above paper (with the changes I've
suggested) is one way to do this but there are lots of other ways one
could try...
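The clustering half of that iteration can be as simple as k-means over the flattened context-matrices, with each resulting cluster treated as a candidate word-sense label. A minimal dependency-free sketch (the deterministic initialization is just a convenience for the toy example):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Cluster word-instances (rows of X = flattened context-matrices);
    each cluster index becomes a candidate word-sense label."""
    # deterministic init: spread initial centers across the data by row-sum
    order = np.argsort(X.sum(axis=1))
    centers = X[order[np.linspace(0, len(X) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        # assign each instance to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1),
                           axis=1)
        # move each center to the mean of its assigned instances
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

The labels it produces would then feed the embedding-learning step, which in turn yields new context-matrices for the next round of clustering.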
-- Ben
--
Ben Goertzel, PhD
http://goertzel.org
"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin