Re: new developer


Linas Vepstas

May 14, 2018, 6:44:38 PM
to Alexey Potapov, Константин Тимофеев, Nil Geisweiller, opencog


On Mon, May 14, 2018 at 4:54 PM, Alexey Potapov <pot...@aideus.com> wrote:
Hi Linas,
this is quite an interesting discussion, and I believe we should involve Ben and others in it.

Ben has been involved  in this discussion for a decade; I think he knows the general outline, even as we argue violently about the details.  Anyway, this is why I say "we should discuss on the mailing list", a dictum that I have been violating.
 
Actually, I have some concerns regarding too strong a focus on formal logic and automatic theorem proving in OpenCog, since I'm not completely convinced of its fundamental role in AGI... ;)

Yes, I agree.  I am very much trying to pursue a probabilistic approach. The question is then "probabilities on what?" and I have an answer that I like, but I have trouble sharing that answer in a way that is understandable.  Let me try to sketch that, in this context.

You know probabilistic programming very very much better than I, so please correct my mis-steps.  In probabilistic programming, one essentially has lambda-expressions, and is assigning probabilities to them.   For example: what is the probability of taking this branch (this if-statement) or not? The probability of running this loop N times? the probability of calling this routine? The probability of making this list/array longer or shorter?  This might be done with either custom-built probabilistic programming languages, or with modified versions of C or Lisp.  At any rate, these can be abstractly understood as being lambda-expressions, with probabilities assigned to them.
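A minimal sketch of what "probabilities attached to lambda-expressions" looks like in practice; the `flip` primitive and the `model` below are invented for illustration, not any particular PPL:

```python
import random

def flip(p, rng):
    """A primitive probabilistic choice: True with probability p."""
    return rng.random() < p

def model(rng):
    # "What is the probability of taking this branch?" -- here, 0.7
    if flip(0.7, rng):
        # "The probability of running this loop N times?" -- geometric(0.5)
        n = 0
        while flip(0.5, rng):
            n += 1
        return ("branch-a", n)
    return ("branch-b", 0)

def estimate(event, trials=100_000, seed=0):
    """Monte-Carlo estimate of the probability of an event under the model."""
    rng = random.Random(seed)
    return sum(event(model(rng)) for _ in range(trials)) / trials

p_branch_a = estimate(lambda out: out[0] == "branch-a")  # close to 0.7
```

The point of a real PPL is to invert this: condition on observed outputs and infer the branch/loop probabilities, rather than just sampling forward as here.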

However, as I exhort below, lambdas are closely tied to cartesian products, and to classical logic, and we have plenty of evidence that suggests that this is inappropriate for natural, human intelligence. So what is the right thing? To keep a long story short, for me, it is this concept of "disjoined connector sets" (aka "jigsaw-puzzle pieces"), and I attach probabilities to those. More generally, more vaguely, the structures are "patterns", this is what the pattern miner looks for. There are various theoretical arguments that show that these are sufficient for general purposes.   For  example, theorem-proving is like assembling jigsaw-puzzle pieces, where the puzzle-pieces are rules of inference; this is why theorem-proving is like parsing.  However, general knowledge representation is also like this: solving Sherlock Holmes mysteries has long been said to be like "solving a jigsaw puzzle": well, except now, we have the mathematical formalism to make this statement precise. I think much of what neural nets and deep learning do also fits into this general framework; I want to write a paper on this, but have not had the time yet.
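To make the jigsaw-puzzle metaphor concrete, here is a toy connector-matching sketch. The three-word lexicon is in the style of link-grammar disjuncts, but the entries and the greedy matcher are illustrative inventions, not the real link-grammar algorithm:

```python
# Each "jigsaw piece" is a word with a disjunct: an ordered list of
# connectors, each a (type, direction) pair. Two connectors mate when
# their types match and their directions are opposite, like puzzle tabs.
PLUS, MINUS = "+", "-"

def mates(c1, c2):
    (t1, d1), (t2, d2) = c1, c2
    return t1 == t2 and {d1, d2} == {PLUS, MINUS}

# A toy lexicon (invented entries):
lexicon = {
    "John": [("S", PLUS)],                # offers a subject link to the right
    "saw":  [("S", MINUS), ("O", PLUS)],  # wants a subject on the left, an object on the right
    "Mary": [("O", MINUS)],
}

def try_link(sentence):
    """Greedy left-to-right mating of unmet connectors (a toy parser)."""
    pending = []  # stack of (word, connector) still seeking a mate
    links = []
    for word in sentence:
        for conn in lexicon[word]:
            if pending and mates(pending[-1][1], conn):
                other, _ = pending.pop()
                links.append((other, word, conn[0]))
            else:
                pending.append((word, conn))
    return links if not pending else None  # None: pieces left over, no parse

links = try_link(["John", "saw", "Mary"])
```

A parse succeeds exactly when every connector finds a mate, i.e. when the pieces assemble with no tabs left dangling.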

So, I am not really into formal logic, per-se; rather, I am trying to amass evidence why probabilistic patterns are the correct approach, while probabilistic lambdas are too stringent, too constrained, too tight to be useful.  One has to assign probabilities to the patterns, the arrangements, and NOT to the individual components of the pattern. (Also, I keep saying "probability" when I actually mean "mutual information" or maybe "surprisingness", as the case may be. Just as lambda is not quite right, I get the feeling that probability is not quite right, and that talking about mutual information and surprisingness is more correct, than talking about probability.  But this intuition remains difficult to explain.).
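For instance, the "surprisingness" of an observed pattern such as a word pair can be measured as pointwise mutual information over counts; a toy computation with made-up numbers:

```python
from collections import Counter
from math import log2

# Invented observation counts for (left-word-class, right-word-class) pairs:
pair_counts = Counter({
    ("adjective", "noun"): 40, ("noun", "verb"): 30,
    ("adjective", "verb"): 5,  ("noun", "noun"): 25,
})
total = sum(pair_counts.values())

def mutual_information(a, b):
    """Pointwise MI of the pair (a, b): log2( p(a,b) / (p(a,*) p(*,b)) )."""
    p_ab = pair_counts[(a, b)] / total
    p_a = sum(n for (x, _), n in pair_counts.items() if x == a) / total
    p_b = sum(n for (_, y), n in pair_counts.items() if y == b) / total
    return log2(p_ab / (p_a * p_b))

mi = mutual_information("adjective", "noun")  # positive: the pair co-occurs
                                              # more than chance predicts
```

Note that MI, unlike a bare probability, scores the *arrangement* (how much more often the parts occur together than independence predicts) rather than the parts themselves.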

-- Linas
 
Best regards,
Alexey.

2018-05-15 0:13 GMT+03:00 Linas Vepstas <linasv...@gmail.com>:
Alexey, Konstantin,

On Mon, May 14, 2018 at 2:41 PM, Константин Тимофеев <k.tim...@gmail.com> wrote:
Nil, Linas, thanks for the explanations. It seems like I have to learn some new things before I could understand clearly what is the issue behind this discussion. Hopefully it will not take too much time.

If it doesn't take much time, then you are a super-human genius working in the wrong field.  To understand the issue, you'd have to read a book or two on proof theory, along with a large rainbow of related topics. It's taken me ten years to understand what the problem is.  I can't magically transfer this knowledge, but I can tell you what keywords to search for and read up on.   Besides proof theory and model theory, anything that describes linear logic should make clear what the issue with cartesian products is (in proof theory, it corresponds to the rules of weakening and contraction, which are assumed by the lambda abstraction but are forbidden by tensor products.)  To understand why products in linguistics are tensor products and not cartesian products, look at the wikipedia article on "pregroup grammar".  The earliest reference that I know of that discusses this clearly is Solomon Marcus, "Algebraic Linguistics" (1967), which your compatriots in Novosibirsk sent me.  The categorial grammars there are discussed on pages 90-120, which you can match up to what the wikipedia pregroup grammar article states.

Last time I said that "theorem proving is like parsing", I said it to Ben's son Zar, who is studying theorem proving, and his response was "yeah, so what, everyone knows that".  If you already know that, good.  My point is that since we know that theorem proving is incompatible with lambda, we should stop using lambda so much. To me, this still remains a deep, important insight; I'm trying to use it to guide the design of atomese.  I can't imagine that you will understand in only a few months.  A few years, maybe. I'm just saying, now is the time to start reading about it.

-- Linas


On Mon, May 14, 2018 at 8:45 PM, Linas Vepstas <linasv...@gmail.com> wrote:


On Mon, May 14, 2018 at 12:54 AM, Nil Geisweiller <ngei...@googlemail.com> wrote:
Hi,

On 05/11/2018 11:36 PM, Linas Vepstas wrote:
light sprinkling of lambda calculus on top.  Nil keeps putting in too much lambda calculus, I can't seem to get him to stop.  It needs more prolog, I keep telling him.

But

I meant that to be funny. Like saying "it needs more pepper, no, it needs more salt".
 
the left hand side term of a prolog fact shares the same goodies I'm after when I use a LambdaLink. It exposes a list of variables, in a determined order, abstracting the body away.

But Alexey and Konstantin have not heard this argument before, so let me sketch it briefly.

* Yes, lambdas provide abstraction. They're useful. But ...
* lambda calc is old; it was originally developed in view of Hilbert's program of providing a foundational basis for classical logic.
* prolog is new, developed for knowledge representation and theorem proving, that is, non-classical logic.
* lambda calc is the internal language of cartesian closed categories: viz. anything with lists, pairs, cartesian products. lambda calc corresponds to classical logic.

* by contrast, theorem proving is known to NOT use classical logic; the logic of theorem proving is intuitionist logic, and more generally, the logic of Kripke frames. Insofar as the rule engine and PLN are kind-of-like theorem provers, they should be based on the principles given in books on proof theory, and NOT on classical logic!  People who work on proof theory have made a lot of advancements since the 1930s. We should know of, understand, and employ those concepts. Proof theory is a big deal. We should not ignore it.

* There are other things that classical logic fails to describe, besides proof theory: natural language and biochemistry. Both of these appear to be described by intuitionistic logic, or, more properly, by fragments of linear logic. The difference is that linear logic throws away the notion of pairs, lists, cartesian products and lambdas: lambdas cannot be used to describe language or biochemistry (In physics, the failure/incorrectness of lambda is "well-known", and is called the "no-cloning theorem". The fundamental reason for this is that tensor products violate the assumption of the existence of pairs, lists, cartesian products. Tensor products are incompatible with cartesian products.  Insofar as the lambda abstracts the cartesian product, it too must be discarded from the theory, as being incompatible. Any theory that has tensor products cannot have lambdas in it. They can't co-exist without contradiction.)
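The contraction rule that lambda abstraction assumes -- freely copying a bound variable -- is exactly what linear logic forbids. A toy illustration of the single-use discipline (the `Linear` wrapper is invented for illustration):

```python
class Linear:
    """Wraps a value that may be consumed exactly once, mimicking the
    linear-logic restriction: no contraction (copying a value) -- the
    'no-cloning' discipline."""
    def __init__(self, value):
        self._value = value
        self._used = False

    def consume(self):
        if self._used:
            raise RuntimeError("contraction: linear value used twice")
        self._used = True
        return self._value

# A classical lambda freely duplicates its argument:
square = lambda x: x * x        # uses x twice -- requires contraction

q = Linear(3)
ok = q.consume() + 1            # first use: fine
try:
    q.consume()                 # second use: forbidden
    cloned = True
except RuntimeError:
    cloned = False
```

A linearly-typed `square` simply cannot be written: its body needs two copies of a value of which only one exists.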

* Insofar as we work with language, learning and theorem proving, where there are "well-known" theorems that prove that lambda calc is incompatible, we should perhaps stop focusing on it so much. Lambda calc has its place -- lambda is great for abstraction, but not everything can be abstracted in that way.  Most notably (for me), the link-types of link-grammar cannot be abstracted by lambdas, because the disjuncts of link-grammar (alternately, the VP, NP of phrase-structure grammars) look like tensor products, and not like cartesian products.

For these reasons, I think we are over-using lambdas. They offer a deeply-incorrect, basically-broken view of the world we are interested in: proof theory, linguistics, knowledge, biochemistry. 
 

I know that FreeLink automatically digs up the variables, but it has other problems

1. The variable order depends on their appearances in the body. Even worse, this may depend on the variable names (see https://github.com/opencog/atomspace/issues/1677).

Yes, that is the definition of a "free variable".  Do you want to propose some kind of semi-free variable?

2. There's no way to tell FreeLink to escape some variable (unless a QuoteLink is used).

Yeah, quote links are awful. I think a very very high priority is to figure out how to do the things we need to do, without having to use quotes.  I'm not sure, but the need for quotes seems to be a side-effect of using lambdas or something like that.  I don't understand.  In all the work I do, with link-grammar, with language learning, quotes are never needed; the concept of lambda abstraction is replaced by connectors and disjuncts; and these never need to be quoted. Actually, I can't even imagine how one could "quote" a connector or disjunct, it doesn't make sense, conceptually.  I would have to think hard about that.

(Well, there is also no concept of "variables", either. The role of variables is taken up by connectors, which are kind-of-like typed free variables. Sort of. Again, this requires deep thought.)

Upshot: maybe if we used lambdas less, then we would not need quotes. Maybe they are partners in crime.


3. Variables cannot be typed.

If they were typed, they would not be free, right?  Maybe there is a way to have typed free variables, I'm not sure; it's certainly not obvious; but I can also see how it might be possible.  You'd have to design it carefully, to avoid contradictions.
 

For these 3 reasons I generally stay away from FreeLink, even though I'm sure it has good uses, if one is careful.

!? I don't recall ever suggesting that we should use FreeLink for anything other than what it is currently used for. It provides some basic utilities for locating variables in an expression, but that's it.  I don't understand why you even brought it up... ??
 
--linas

--
cassette tapes - analog TV - film cameras - you









Ben Goertzel

May 15, 2018, 12:44:00 AM
to opencog, Alexey Potapov, Константин Тимофеев, Nil Geisweiller
Ah, I have some new thoughts on the "theorem proving + AGI + parsing"
side, but will have to wait to type them in till I get a little time
at the computer, I'm traveling between meetings and conferences in
Europe just now...



--
Ben Goertzel, PhD
http://goertzel.org

"Only those who will risk going too far can possibly find out how far
they can go." - T.S. Eliot

Ben Goertzel

May 16, 2018, 7:32:29 PM
to opencog, Alexey Potapov, Константин Тимофеев, Nil Geisweiller
Alexey, Nil, Zar, Linas, others...

****
Ah, I have some new thoughts on the "theorem proving + AGI + parsing"
side, but will have to wait to type them in till I get a little time
at the computer, I'm traveling between meetings and conferences in
Europe just now...
****

And here we go...

GENERAL BLATHER

About the general issue of what role logic and theorem-proving have to
play in AGI, obviously there is room for different opinions on this….
My own view, as most of you know, is

1) There are going to be many viable routes to human-level and
superhuman AGI, not just one

2) A route with a large role for probabilistic-logic theorem-proving
is one viable route

3) Advantages of a route of type 2) would seem to be:

3a) A system that can make good use of current computing hardware
which is logic-based (whereas e.g. a neural net based architecture
would make better use of analog computing hardware)

3b) A system that should be good at scientific and engineering
thinking, and also good at meta-thinking about how to improve its own
code … i.e. a good candidate to figure out how to cure aging, create
nanotech and femtotech, launch the Singularity, etc.

3c) A system that should be coherent in maintaining its goals over
time, as compared to a system operating more similarly to mammals



Anyway that is my view which is the result of a lot of thought from
diverse directions, but of course none of us knows for sure how to
create advanced AGI yet so disagreement is understandable..

I note also that since, by Curry-Howard correspondence, programs and
proofs are equivalent, the statement that AGI can be founded centrally
on theorem-proving is equivalent to the statement that AGI can be
founded centrally on program-execution. The statement that AGI can
be founded centrally on probabilistic logic theorem proving, is
equivalent to the statement that AGI can be founded centrally on
probabilistic programming. The difference between a theorem proving
based AI and a program learning based AI is merely an “implementation
detail” ;-) …

Now more interestingly let me explore some specifics I have been
thinking about lately!

CONNECTOR THEOREM PROVING

First of all, Linas, if you haven’t seen it before you will enjoy the
diagrams in

http://www.scholarpedia.org/article/Connection_method

which are explained in more detail in

“A Vision for Automated Deduction Rooted in the Connection Method” (W. Bibel)

https://www.researchgate.net/publication/318226306_A_Vision_for_Automated_Deduction_Rooted_in_the_Connection_Method?enrichId=rgreq-e84163672cff86aca93bd93eac91d998-XXX&enrichSource=Y292ZXJQYWdlOzMxODIyNjMwNjtBUzo1NDg5MjU5MjY4OTU2MTZAMTUwNzg4NTU0NzA0NQ%3D%3D&el=1_x_3&_esc=publicationCoverPdf

and in even more detail in the book

Wolfgang Bibel: Automated Theorem Proving. Vieweg Verlag, Wiesbaden, 293 pp. (1982); 2nd edition, 289 pp., 1987.

which I found online in pdf via sci-hub …

This is connection-based theorem proving, in which a proof of a
theorem is constructed by drawing connections (links) between terms in
the theorem, in such a way that each link joins two instances that
involve the same predicate (but one in negated form and one in
non-negated form, where “negated vs. non-negated” is assessed based on
whether an instance would be negated in a DNF normalized version of
the theorem). A proof is a bunch of links that form a set of
complete paths thru the theorem (so every instance is along some path
leading to the final instance of some term in the theorem); and such
that there is some unification of all the terms linked, that plays
nicely with equating each linked pair of instances.
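The path-checking criterion can be illustrated with a purely propositional toy (a real connection-method prover also needs unification and clause copies, which this sketch omits):

```python
from itertools import product

def complementary(a, b):
    """Two literals connect when one is the negation of the other."""
    return a == "~" + b or b == "~" + a

def has_connection(path):
    return any(complementary(a, b)
               for i, a in enumerate(path) for b in path[i + 1:])

def valid(matrix):
    """Connection-method validity test for a propositional DNF matrix:
    the formula is valid iff EVERY path -- one literal chosen per
    clause -- contains a connection (a complementary literal pair)."""
    return all(has_connection(path) for path in product(*matrix))

# ((p & (p -> q)) -> q) in DNF matrix form: clauses {~p}, {p, ~q}, {q}
modus_ponens = valid([["~p"], ["p", "~q"], ["q"]])  # every path connects
not_valid = valid([["p"], ["q"]])                   # the path (p, q) has no connection
```

Real provers of course never enumerate all paths as done here; the chaining search (or a SAT encoding) explores them implicitly.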

A complication is that sometimes there need to be a couple links
emanating from a given instance, i.e. an instance in the theorem may
get used twice, three times, etc. (This counting of usages would
provide an obvious connection with linear logic, which may or may not
be useful…)

This general concept has been around since the 1980s, but recently has
been used within some highly effective theorem-provers, i.e. leanCop

http://www.leancop.de

and its descendants. These provers are interesting, among other
reasons, because they consist of extremely concise Prolog code, yet
they work nearly as well as the state-of-the-art theorem-provers that
are much more complex as code, and have been pretty heavily
optimized...

Leancop operates on theorems in disjunctive normal form. However, a
variant called nanocop

http://www.leancop.de/nanocop/

operates on non-normalized theorems, using a straightforward and
elegant variant of the connection-based method given here

http://www.jens-otten.de/papers/concalc_tab11.pdf

Also, randocop

http://ceur-ws.org/Vol-373/paper-08.pdf

is like leancop but with a more efficient search algorithm. Leancop
and nanocop basically use chaining based search involving two
operators, “extension” and “retraction.” XXX … What randocop does is consider many randomizations of the order of the clauses in the DNF
version of the theorem, and does the chaining search on each of these.
This ends up working faster. (This is not surprising, it’s in line
with many other sorts of results in the search space, showing that
doing a shallow search from a lot of randomly selected starting points
can often be better than doing a deep search from a single starting
point…)

SAT SOLVERS FOR CONNECTOR THEOREM PROVING

As a semi-aside here, one purely algorithmic thought I had when
reading all this is that the search+unification here can probably be
done way faster using an SAT solver, at least for all but the shortest
proofs. This is vaguely similar to how with the link parser it's
way faster for long sentences to use SAT for parsing than a standard
NL parsing algorithm...

Use of SAT for theorem-proving has been tried already in a slightly
different context, see

https://www.lsi.upc.edu/~oliveras/espai/smtSlides/lynch.pdf

There are also various other tricky ways to code e.g. the unification
part as an SAT problem,

https://pdfs.semanticscholar.org/59ad/1b79d7a99c9330a494787deb8d6d3020d376.pdf

(that paper's not about SAT; it's about unification using neural nets,
but the part where they code unification as constraint satisfaction
would apply to SAT as well as to NNs)

So bottom line is, there are lots of ways to play with the encoding
but one could use SAT or SMT here as a substitute or augmentation to
chaining.

To use SAT here you’d presumably have to set a bound on how many times
each instance can be re-used. Then you could increase the bound
incrementally, trying the SAT solver again for the constraints
generated with each new bound. Though the SAT solver would want to
work on the normalized version of the theorem, one could use
information from the grouping of instances in the non-normalized
version to guide the search inside the SAT solver.
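A sketch of that incremental-bound loop, with a brute-force stand-in for the SAT solver and a hypothetical `toy_encode` whose encoding only becomes satisfiable once the reuse bound reaches 3; everything here is invented for illustration:

```python
from itertools import product

def brute_force_sat(clauses, variables):
    """Toy stand-in for a real SAT solver: try every assignment.
    A clause is a list of (variable, negated) literals."""
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if all(any(a[v] != neg for v, neg in clause) for clause in clauses):
            return a
    return None

def prove_with_bound(encode, max_bound=5):
    """Iterative deepening over the instance-reuse bound: re-encode the
    proof search as SAT for bound k = 1, 2, ... until satisfiable."""
    for k in range(1, max_bound + 1):
        clauses, variables = encode(k)
        model = brute_force_sat(clauses, variables)
        if model is not None:
            return k, model
    return None

# Hypothetical encoder: unsatisfiable until the reuse bound reaches 3,
# standing in for "this proof needs some instance used 3 times".
def toy_encode(k):
    variables = ["x"]
    clauses = [[("x", False)]] if k >= 3 else [[("x", False)], [("x", True)]]
    return clauses, variables

bound, model = prove_with_bound(toy_encode)
```

With an incremental solver one would add the new clauses for each bound rather than re-encoding from scratch.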

CONNECTORS, PROOF SKETCHES AND AGI

Now why do I like this from an AGI perspective? Because it lets one
do higher-level, abstract probabilistic reasoning just by reasoning
about the links (and setting the unification aside).

To see how this might work, consider that a “proof sketch” could be written in connector form as: a set of connectors that is not complete, and skips some steps.

For instance, in a proof sketch one might have a link skipping from an
instance of P in clause 1 to an instance of P in clause 10, bypassing
instances of P in clauses 3 and 5.

Or, in a proof sketch, one might have an instance of P used for the
second time, but not for the first time (leaving it open when it will
be used for the first time). Or one might have an instance of P used
for the k’th time in one link, and used for the m’th time in another
link, with the constraint that k<m but no commitment made about the
values of k and m.

In general a proof sketch, then, is a set of proofs sharing some
common elements.

If one is doing probabilistic induction or abduction across a bunch of
connector proofs, one is going to learn a lot about which connections
and which combinations of connections occur in which kind of contexts
within which kinds of proofs. This learning will naturally lend
itself to various conjectural proof sketches for newly presented
theorems.

The process of filling in a proof sketch to get a proof can be
confronted just like the problem of proving a theorem de novo - by
chaining or by SAT/SMT or some combination thereof.

SYNTACTIC LINKS AND LOGICAL CONNECTORS

As a minor aside, given the loose analogy between link parsing and
connector proofs, it’s also interesting to look at how the links in a
sentence’s link parse relate to the connectors in that same sentence’s
logical interpretation.

The links in a sentence’s link parse become atomic relationships

Li(w_a, w_b)

where w_a and w_b are word-instances and Li is a link of type i. The
logical interpretation of a sentence then involves a bunch of
implications such as

L1(w_1, w_3) & L5(w_3,w_7) & L9(w_1,w_7) ==> P4(w_1,w_7) & P8(w_7, w_3) <p>

where the Pi are logical relationships and <p> is a probability value.
Sometimes there may be complex quantifier relations on the right hand
side of the implication.
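One way such probability-tagged implications might be represented concretely; the names `Rel` and `Implication` and the 0.9 value are invented, while the L/P atoms are transcribed from the example implication above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rel:
    name: str    # link type Li or logical predicate Pi
    args: tuple  # word-instances

@dataclass(frozen=True)
class Implication:
    antecedent: tuple  # conjunction of syntactic links
    consequent: tuple  # conjunction of logical relationships
    p: float           # the probability value <p>

# The example implication, transcribed:
rule = Implication(
    antecedent=(Rel("L1", ("w1", "w3")), Rel("L5", ("w3", "w7")),
                Rel("L9", ("w1", "w7"))),
    consequent=(Rel("P4", ("w1", "w7")), Rel("P8", ("w7", "w3"))),
    p=0.9,  # illustrative value for <p>
)

def fires(rule, links):
    """A rule applies to a parse when all its antecedent links are present."""
    return set(rule.antecedent) <= set(links)

parse = {Rel("L1", ("w1", "w3")), Rel("L5", ("w3", "w7")),
         Rel("L9", ("w1", "w7"))}
applies = fires(rule, parse)
```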

The overall semantic interpretation of a sentence then looks like an
implication of the form

(conjunction of all the syntactic links found in the sentence)
==>
(logical formula involving conjunctions and disjunctions and negations
of logical relationships P_i between logical terms corresponding to
the word-instances in the sentence)

where, in the logical formula on the right hand side, each term may be
tagged with a probability value.

The existence of a syntactic link between two word-instances in a
sentence, has the effect of causing the logical terms corresponding to
the word-instances to be grouped into (one or more of) the same
probability-tagged conjunctions of logical relationships in the
logical interpretation of the sentence.

If the sentence describes N different situations (each one with
different observation-instances providing groundings for the
word-instances) S1, S2, …, S_N , then a proof of the validity of the
logical interpretation of the sentence as a model of the situations,
is produced by drawing connectors from the observation-instances to
the logical terms…

Alexey Potapov

May 19, 2018, 2:00:38 PM
to Linas Vepstas, Константин Тимофеев, Nil Geisweiller, opencog
I am very much trying to pursue a probabilistic approach. The question is then "probabilities on what?"

Ultimately, probabilities over observational data.
 

You know probabilistic programming very very much better than I, so please correct my mis-steps.  In probabilistic programming, one essentially has lambda-expressions, and is assigning probabilities to them.   For example: what is the probability of taking this branch (this if-statement) or not? The probability of running this loop N times? the probability of calling this routine? The probability of making this list/array longer or shorter?  This might be done with either custom-built probabilistic programming languages, or with modified versions of C or Lisp.  At any rate, these can be abstractly understood as being lambda-expressions, with probabilities assigned to them.

Well... traditional probabilistic programming is logical probabilistic programming. It's definitely not about lambda-calculus. Stochastic generative grammar can also be considered a sort of PPL. It is also based on a formal definition of algorithms alternative to lambda-calculus. Indeed, PPLs based on functional languages are quite popular now and are closer to lambda-calculus, but this doesn't really matter too much. And I don't see much sense in associating e.g. Probabilistic C with lambda calculus rather than with Turing machines or other formalizations of algorithms.
What does matter is that we get the possibility to define arbitrary probabilistic generative models, including Turing-complete models. Apparently, we would not like to use pure lambda-calculus for this, since it is not too convenient, it doesn't provide an efficient inductive bias (if we use it as a reference machine, and not just as a programming language to implement other reference machines), etc.
Procedural/functional PPLs are much more convenient for writing down useful generative models than logic-based PPLs. They are very appropriate for describing (a part of) human/natural intelligence. In principle, one could solve the whole problem of AGI using them without any automatic deduction component (which they would learn implicitly), if it were not necessary to worry about computational resources.
Can one write a program in ProbLog which describes a probability distribution over sequences of fundamental matrices, sunspot time series, etc.? Maybe, but this is far from convenient.
Our knowledge is built from data. Deduction systems (probabilistic or not) lack this connection, while functional PPLs are well-suited for this.
So, the question is not what the basic probabilistic choices are defined over, but what they help to define the final probabilities over.


However, as I exhort below, lambdas are closely tied to cartesian products, and to classical logic, and we have plenty of evidence that suggests that this is inappropriate for natural, human intelligence. So what is the right thing? To keep a long story short, for me, it is this concept of "disjoined connector sets" (aka "jigsaw-puzzle pieces"), and I attach probabilities to those. More generally, more vaguely, the structures are "patterns", this is what the pattern miner looks for. There are various theoretical arguments that show that these are sufficient for general purposes.   For  example, theorem-proving is like assembling jigsaw-puzzle pieces, where the puzzle-pieces are rules of inference; this is why theorem-proving is like parsing.  However, general knowledge representation is also like this: solving Sherlock Holmes mysteries has long been said to be like "solving a jigsaw puzzle": well, except now, we have the mathematical formalism to make this statement precise.

I'm ok with this, although I think that we cannot resort to only one representation.
 
I think much of what neural nets and deep learning do also fits into this general framework; I want to write a paper on this, but have not had the time yet.

You can also map functional programming (with algebraic types, pattern matching, etc.) to neural networks. One of my students has written a nice diploma thesis on this topic.
So, it's cool, but this doesn't give us much per se...
I would prefer to interpret DNNs in terms of probabilistic programming (extended by metacomputations)...



So, I am not really into formal logic, per-se; rather, I am trying to amass evidence why probabilistic patterns are the correct approach, while probabilistic lambdas are too stringent, too constrained, too tight to be useful.  One has to assign probabilities to the patterns, the arrangements, and NOT to the individual components of the pattern. (Also, I keep saying "probability" when I actually mean "mutual information" or maybe "surprisingness", as the case may be. Just as lambda is not quite right, I get the feeling that probability is not quite right, and that talking about mutual information and surprisingness is more correct, than talking about probability.  But this intuition remains difficult to explain.).

Well, what PPLs do is exactly assign probabilities to patterns (or you can easily treat it in terms of information/Kolmogorov complexity). If your reference machine is Turing-complete, then it doesn't matter too much which machine exactly you use. My point is that no reference machine is a silver bullet. The problem is beyond the choice of the reference machine...

Alexey Potapov

May 19, 2018, 2:02:50 PM
to Ben Goertzel, opencog, Константин Тимофеев, Nil Geisweiller
The difference between a theorem proving
based AI and a program learning based AI is merely an “implementation
detail” ;-) …

Well, true, but the devil is in the implementation detail.

Ben Goertzel

May 20, 2018, 1:04:52 AM
to Alexey Potapov, opencog, Константин Тимофеев, Nil Geisweiller
Alexey,

***
Our knowledge is built from data. Deduction systems (probabilistic or
not) lack this connection, while functional PPLs are well-suited for
this.
***

I don't understand why you think this way...

The semantics of probabilistic logic systems can be naturally framed
in a fully observation-based way, which is what the original PLN book
is about...

It's true that a logic system, as part of its formulation, makes some
commitments about the initial logic rules, which are not initially
derived from the data but rather supplied by the system designer

OTOH a probabilistic programming system, as part of its formulation,
makes some commitments about the initial programming language
primitives, which are not initially derived from the data but rather
supplied by the system designer

And then there are well known mathematical mappings btw assumptions
about logic rules, and assumptions about programming language
primitives

So why do you think the latter are more suited for being built from data?

From my view it's intuitively sort of the opposite -- I have a very
detailed picture of how the semantics of PLN is built up from a
system's observations, whereas I don't have such a detailed picture of
how a functional PPL's semantics is built up from observations. OTOH
from a math rather than intuitive perspective I can see it's all the
same shit...

-- Ben

Alexey Potapov

May 20, 2018, 4:26:49 AM
to Ben Goertzel, opencog, Константин Тимофеев, Nil Geisweiller
Ben,



2018-05-20 8:04 GMT+03:00 Ben Goertzel <b...@goertzel.org>:
Alexey,

***
Our knowledge is built from data. Deduction systems (probabilistic or
not) lack this connection, while functional PPLs are well-suited for
this.
***

I don't understand why you think this way...

The semantics of probabilistic logic systems can be naturally framed
in a fully observation-based way, which is what the original PLN book
is about...

For me, observational data is sensory data. It doesn't contain concepts, predicates, etc. As far as I understand, PLN is designed to deal with higher-level data, e.g. textual. If we have an observation that a particular crow is black, then this is an observation for whose generalization logical languages/PLN can suit no worse, or even better, than (functional) PPLs. But there are no purely black crows. It's just an abstraction, which itself should somehow be generalized from raw data.
How can we calculate P(crow,black|image)? It's ~ P(crow,black)P(image|crow,black).
You can derive P(crow,black) (or rather P(crow,black|some other knowledge, e.g. cawing)) using PLN (or PPLs also, in fact, but maybe less elegantly/efficiently). But how will you calculate P(image|crow,black)? This probability is easily describable within functional generative models, but it's very cumbersome within logical languages.
Well... I'm sure you know all this stuff. So, maybe this is just a question about the difference in our attentional focus.
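The factorization can be put into numbers with a two-hypothesis toy (all values invented): the prior comes from the symbolic side, the likelihood from a generative model:

```python
# All numbers invented, purely to illustrate the factorization:
#   P(crow, black | image) ~ P(crow, black) * P(image | crow, black)
p_crow_black = 0.02        # prior, e.g. derived symbolically via PLN
p_img_given_cb = 0.5       # likelihood, from a functional generative model
p_img_given_other = 0.01   # likelihood under all competing hypotheses, pooled

# Normalizer P(image), then the posterior by Bayes' rule:
p_image = (p_crow_black * p_img_given_cb
           + (1.0 - p_crow_black) * p_img_given_other)
posterior = p_crow_black * p_img_given_cb / p_image
```

The division of labor is visible in the variables: the logic engine supplies `p_crow_black`, the generative model supplies the two likelihoods, and neither is convenient for the other's job.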

 

It's true that a logic system, as part of its formulation, makes some
commitments about the initial logic rules, which are not initially
derived from the data but rather supplied by the system designer

OTOH a probabilistic programming system, as part of its formulation,
makes some commitments about the initial programming language
primitives, which are not initially derived from the data but rather
supplied by the system designer

Exactly. So, what we are talking about is the difference in the available primitives in the two cases. This might seem like a merely practical, not fundamental, difference. However, this practical difference is so large that it is almost fundamental. Logic deals with truth values, not numbers. One can introduce Peano axioms and basic grounded predicates saying something like "it's true that the pixel with coordinates (x, y) has (r,g,b) color", and infer the truth value that we see a crow and it is black given this image, but it is much easier to just crunch numbers. And if you introduce imperative number crunching into your logical system, you lose the ability to logically reason about those particular numbers.
Non-logical PPLs don't deal with probabilities directly. They generate values of random variables. These values can be numbers or arbitrary data structures. These PPLs naturally inherit the number-crunching power of imperative languages. That's why I say they are better suited for learning from (raw) data. Of course, the flip side is that implementing reasoning with them is as cumbersome as data processing with logic.
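A minimal sketch of this "generate values of random variables" view (the model and all its numbers are invented; real PPLs use far smarter inference than rejection sampling):

```python
import random

# A generative program just samples values; conditioning can be done
# (naively) by rejection: run the program many times and keep only the
# runs that satisfy the observed condition.

def model():
    is_crow = random.random() < 0.2           # invented prior on crow-ness
    p_black = 0.95 if is_crow else 0.4        # crows are usually dark
    is_black = random.random() < p_black
    return is_crow, is_black

def infer(condition, n=100_000):
    kept = [c for c, b in (model() for _ in range(n)) if condition(c, b)]
    return sum(kept) / len(kept)              # fraction of crows among kept runs

random.seed(0)
estimate = infer(lambda c, b: b)              # ~ P(crow | black)
print(estimate)                               # exact answer is 0.19/0.51 ~ 0.373
```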

Well, actually my worries are very technical, and I will describe them in a new thread (hopefully) soon.

-- Alexey

Ben Goertzel

unread,
May 20, 2018, 5:54:59 AM5/20/18
to Alexey Potapov, opencog, Константин Тимофеев, Nil Geisweiller
> But how will you calculate P(image|crow,black)?

Well as you know, if you really want to, something like "the RGB value
of the pixel at coordinate (444,555) is within a distance .01 of
(.3,.7,.8)" can be represented as a logical atom ... so there is no
problem using logic to reason about perceptual data in a very raw way
if you want to

OTOH I don't really want to do it that way... instead, as you know, I
want to model visual data using deep NNs of the right sort, and then
feed info about the structured latent variables of these NNs and their
interrelationships into the logical reasoning engine.... This is
because it seems like NNs, rather than explicit logic or probabilistic
programming, are more efficient at processing large-scale raw video
data...

It occurs to me that -- while I don't have time for it this week, while
traveling around doing more business-y stuff -- it might be valuable
to just bite the theoretical bullet, and work out in more detail the
mapping between PLN probabilistic logic in particular (including its
indefinite-probability truth values, intensional inference, and so
forth) and probabilistic programming...

In principle this is just some specific fiddling in the direction of
Curry-Howard correspondence, but still, working it out in particular
might well teach us something. This is, I suppose, something that
you, me, Nil and Matt Ikle' could contribute to.... It's a pretty
interesting topic to me, but it might help us make progress beyond
throwing around generalities and expressions of differences in
individual taste?

-- Ben

Ben Goertzel

unread,
May 20, 2018, 6:05:23 AM5/20/18
to Alexey Potapov, opencog, Константин Тимофеев, Nil Geisweiller, Zarathustra Goertzel
Alexey -- e.g. if one stays in the world of finite discrete
distributions, one can construct probabilistic logics with
sampling-based semantics...

https://arxiv.org/pdf/1602.06420.pdf

To extend this to deal with PLN, basically one just needs to jump up
to second (and for quantifiers, third) order distributions a bit,
which should be "straightforward" ...

But cashing out this theory in some concrete examples would be interesting..
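One toy way to start cashing it out (this is our construction for illustration, not PLN's actual API; the "second_order_tv" name and the Laplace-style prior are invented): read an evidence-count truth value as a second-order distribution, i.e. a Beta distribution over the first-order probability itself.

```python
import random

# Sketch: a truth value backed by counts (n positive out of N observations)
# can be read as a second-order distribution over the first-order
# probability p -- here a Beta(prior + n, prior + N - n) posterior.

def second_order_tv(positive, total, prior=1.0):
    a = prior + positive
    b = prior + (total - positive)
    return lambda: random.betavariate(a, b)   # sampler over p itself

random.seed(0)
sampler = second_order_tv(positive=8, total=10)
samples = [sampler() for _ in range(50_000)]
mean = sum(samples) / len(samples)
print(mean)   # posterior mean is (1+8)/(2+10) = 0.75
```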

ben

Nil Geisweiller

unread,
May 21, 2018, 2:45:07 AM5/21/18
to Alexey Potapov, Linas Vepstas, Константин Тимофеев, Nil Geisweiller, opencog
On 05/19/2018 09:00 PM, Alexey Potapov wrote:
> Our knowledge is built from data. Deduction systems (probabilistic or
> not) lack this connection, while functional PPLs are well-suited for this.

A deduction system can be understood very broadly, and may encompass
inferences based on PPL models as well.

PLN definitely draws, at least in principle, the relationship between
deduction and data.

ATM in practice it's a bit lacking though; for instance, the link between
the TV

Implication <TV>
    P
    Q

obtained from instances of P and Q is forgotten after the inference.
This should be corrected. Meaning the inference rule

D ;; <- instances of P and Q
|-
Implication <TV>
    P
    Q

should be more something like

LinkBetweenDataAndImplication <TV>
    D ;; <- instances of P and Q
    Implication
        P
        Q
d ;; <- new instance pair of P and Q
|-
LinkBetweenDataAndImplication <TV_update>
    Cons
        d
        D
    Implication
        P
        Q

It would also provide an incremental way to calculate the TV as opposed
to batch processing every time.
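A minimal sketch of what such incremental TV calculation could look like (the class and field names are invented; real PLN truth values also carry confidence):

```python
# Keep the evidence counts alongside the implication, so that a new
# instance pair (p, q) updates the strength in O(1) rather than
# re-batch-processing all of D.

class ImplicationTV:
    def __init__(self):
        self.n_p = 0     # observed instances where P holds
        self.n_pq = 0    # observed instances where P and Q both hold

    def update(self, p, q):
        if p:
            self.n_p += 1
            if q:
                self.n_pq += 1

    @property
    def strength(self):
        # direct evaluation of the strength ~ P(Q|P)
        return self.n_pq / self.n_p if self.n_p else None

tv = ImplicationTV()
for p, q in [(True, True), (True, False), (True, True), (False, True)]:
    tv.update(p, q)
print(tv.strength)   # 2/3: Q held in 2 of the 3 instances where P held
```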

It's kinda scary, computation-wise, but it seems that to do this well,
most inference traces need to be recorded, not just conclusions. Yet
another meta-learning black hole...

Nil

Ben Goertzel

unread,
May 21, 2018, 6:49:01 AM5/21/18
to opencog, Alexey Potapov, Linas Vepstas, Константин Тимофеев, Nil Geisweiller
But Nil -- those record-keeping links can be put in an auxiliary
Atomspace, not necessarily the same Atomspace where the main thrust of
reasoning is proceeding...



Nil Geisweiller

unread,
May 21, 2018, 7:40:42 AM5/21/18
to Ben Goertzel, opencog, Alexey Potapov, Linas Vepstas, Константин Тимофеев, Nil Geisweiller
On 05/21/2018 01:48 PM, Ben Goertzel wrote:
> But Nil -- those record-keeping links can be put in an auxiliary
> Atomspace, not necessarily the same Atomspace where the main thrust of
> reasoning is proceeding...

Yes, but for rules like incremental direct calculation, and TV revision
in general, it seems traces do need to be taken into account as
reasoning is taking place. That doesn't mean they have to pollute the
main atomspace, but it does show that some form of attention allocation
/ meta-learning will be necessary. Well, that is true regardless of that
problem, it just adds more weight to the scale.

Nil

Ben Goertzel

unread,
May 21, 2018, 7:46:11 AM5/21/18
to Nil Geisweiller, opencog, Alexey Potapov, Linas Vepstas, Константин Тимофеев
Yeah, true...

Alexey Potapov

unread,
May 21, 2018, 4:19:11 PM5/21/18
to Nil Geisweiller, Linas Vepstas, Константин Тимофеев, opencog
Nil,
This might be OK when we are talking about low-dimensional tasks, but I don't think this is a good idea for real-world problems...
BTW, one of my colleagues (Vitaly Khudobahshov) has had some ideas that meta-computations should/can be carried out on low-dimensional instances of problems, with the derived specialized solvers then applied to higher-dimensional instances... So, maybe, it's OK to have LinkBetweenDataAndImplication ...

Linas Vepstas

unread,
May 21, 2018, 6:55:15 PM5/21/18
to Alexey Potapov, Anton Kolonin, Andres Suarez, Константин Тимофеев, Nil Geisweiller, opencog
On Sat, May 19, 2018 at 1:00 PM, Alexey Potapov <pot...@aideus.com> wrote:


Well... traditional probabilistic programming is logical probabilistic programming. It's definitely not about lambda calculus.

I don't know what to do with this statement. There is a famous result, the Church-Turing thesis, dating to the 1930s, which states that anything Turing-computable is equivalent to the lambda calculus. There have been many extensions, refinements, generalizations and clarifications of that result since then.

If you have a probabilistic programming language working on a modern-day digital computer, then it's lambda calculus. If you have a theoretical algebra working on infinite-precision topological spaces, that's something else. Quantum-computing machines are often understood as infinite-precision topological vector-space machines (where the space is complex-projective, and the operators are unitary).

Topological computing is .. interesting, but I never got the sense, from quick skims of the literature, that this is what was being explored.

> I think much of what neural nets and deep learning do also fits into this general framework; I want to write a paper on > this, but have not had the time yet.

> You can also map functional programming (with algebraic types, pattern matching, etc.)
> to neural networks. One of my students has written a nice diploma thesis on this topic.
> So, it's cool, but this doesn't give us much per se...

Well, one of the problems in the unsupervised natural-language-learning project is to factor large tensor products into approximately diagonal components. The factorization can be done slowly, by walking over all elements, comparing and sorting them. I claim that the factorization can also be done quickly, using NN algorithms, but discussions about this have always gotten stuck in various misunderstandings. Thus, having this explicitly written down is important.

In a very abstract, hand-wavey fashion: there is this general concept of "integrated information". The unsupervised natural-language-learning project is all about finding those parts which are least integrated, and performing explicit cuts there. What remains are the highly integrated parts, grouped up into classes: nouns, verbs, morphemes, syntactic relations, semantic similarity, etc. I guess you could say that it's "discrimination", but the field is not some 2D pixel field, but instead a certain abstract graph.

-- Linas

Nil Geisweiller

unread,
May 22, 2018, 12:04:38 AM5/22/18
to Alexey Potapov, Nil Geisweiller, Linas Vepstas, Константин Тимофеев, opencog
On 05/21/2018 11:19 PM, Alexey Potapov wrote:
> This might be ok when we are talking about small-dimensional tasks, but
> I don't think this is a good idea for real-world problems...

Yeah, could be. Or I suppose it could be some hybrid incremental/batch.

> BTW, one my colleague (Vitaly Khudobahshov) has had some ideas regarding
> that meta-computations should/can be carried out on low-dimensional
> instances of problems, and then derived specialized solvers to apply to
> higher-dimensional instances...
Sounds interesting. Without details being provided, that sounds to me
like meta-learning + schematization (turning the most frequent paths of
a generalized solver into a narrower, more efficient program, as you
mentioned during the last SingularityNET meeting). Would like to hear
more about it.

Nil

>

Alexey Potapov

unread,
May 22, 2018, 4:14:59 AM5/22/18
to Nil Geisweiller, Linas Vepstas, Константин Тимофеев, opencog
2018-05-22 7:04 GMT+03:00 Nil Geisweiller <ngei...@googlemail.com>:
Sounds interesting. Without details being provided that sounds to me like meta-learning + schematization (turning the most frequent paths of a generalized solver into a more narrow efficient program, as you did mention during the last SingularityNET meeting). Would like to hear more about it.


Unfortunately, only a few initial steps have been made in this direction, and he has stopped researching this topic, at least for now, so there are no (published) details... Initially, it was not a machine learning approach: special meta-computation techniques (based on operations on computation traces) were supposed to be used. However, I've heard he mentioned something about applying machine learning to learn specializers; this might be more similar to meta-learning + schematization. In any case, I mentioned this only as a relevant idea.

Linas Vepstas

unread,
May 22, 2018, 2:19:01 PM5/22/18
to opencog, Ben Goertzel, Константин Тимофеев, Nil Geisweiller


On Sun, May 20, 2018 at 3:26 AM, Alexey Potapov <pot...@aideus.com> wrote:


For me, observational data is sensory data. It doesn't contain concepts, predicates, etc. If we have an observation that a particular crow is black, ... But there are no purely black crows. It's just an abstraction, which itself should somehow be generalized from raw data.
How can we calculate P(crow,black|image)? 

Do not assume that a probability is what you actually want.  Let me give three examples.

In real life, when you see a crow, and it is dark, and you want to talk about it, you just say "black crow" as an identifier of the object in the scene. You don't pull out your photometer and measure its darkness at 87.68% and a bluish hue of 77%. Why? Because you don't need to do that to have a conversation about its presence, location, movement, etc. You only need to evaluate crow-ness and blackness sufficiently to distinguish it from all other elements of the scene, and then you can assign P==100% for most practical purposes.

In neural nets, the sigmoid function is a non-linear component used to push results toward the extremes. Whatever sum of weights or evidence you have as inputs feeding the neural net, you apply the non-linear sigmoid to try to sharpen everything closer to either 0% or 100% -- to discriminate, to increase contrast. This is kind of the "secret" of why neural nets work, and probabilities don't.

In "integrated information" theory, you work with a large complex network of things that are all inter-related, all interconnected.  The goal of applying the theory is to find those extensions of the net that are highly interlinked, interconnected, and then to draw an accurate boundary around them.   If and when you can perceive that boundary, you can give everything inside one name, and everything outside a different name.  The names assigned are unambiguous, unique, even if the actual boundary is perhaps uncertain, even if there is a gradation, a smooth-ish transition from the highly-interconnected thing, to the mostly disconnected parts.   The act of name-tagging is what gives a handle on being able to think about the object in symbolic terms.

-- Linas

Linas Vepstas

unread,
May 22, 2018, 4:56:30 PM5/22/18
to opencog, Anton Kolonin, Andres Suarez, Alexey Potapov, Константин Тимофеев, Nil Geisweiller
On Wed, May 16, 2018 at 6:32 PM, Ben Goertzel <b...@goertzel.org> wrote:
Alexey, Nil, Zar, Linas, others...


GENERAL BLATHER


2) A route with a large role for probabilistic-logic theorem-proving
is one viable route

I'm starting to wonder if probabilistic logic, in this narrow sense, is actually needed, or whether it's a distraction. From elsewhere in this email chain, we have the example "Look at that black crow!" The task here is to determine whether there is something that can pass for "black crow" in the visual scene; if so, go with that at 100% probability. We don't need a fractional assignment. Now, the thing being identified might not actually be a black crow, in which case we say that the visual subsystem was tricked by an optical illusion: it saw a crow where there wasn't one. The solution is not to have the optical subsystem report "I think it's a crow with 95% confidence", but rather to assume perfect accuracy until the conversation falls apart, e.g. "you're looking in the wrong place, look here not there", at which point a more sophisticated analysis is required, with a percentage assignment, viz: "I'm not sure, but I think I see it now."

Similar remarks about what the word "Look" might mean, in that sentence, and what one should do, if that is the word that (you think) you heard.

This is a very Medieval, Scholastic conception of probability, one which lies at the foundation of modern legal systems. Trial courts don't assign a number, 69.73% probability that the accused committed a murder. Rather, very complex networks of inter-related claims, proofs and evidence are presented, and one examines the consistency of that network, looking for logical flaws and self-contradictions, weeding those out as needed. At the end of the trial, there are two complex networks left: a proof of innocence, and a proof of guilt. Ideally, one of those networks has high self-consistency and consistency with the external world, and the other does not. The court of law proceeds not by computing logical probabilities, but by analyzing complex networks.

There are "probabilities" in there: the accused "probably" had enough time to drive from point A to point B, commit the crime, and return before dinner. So there are some "probabilities" and "likelihoods" in there, but they cannot be articulated over terribly long deduction chains. As you know, confidence rapidly decays over long deduction chains when individual steps have mediocre probability. But this is the stuff of detective novels and crime stories: if the accused did this and the accused did that, then maybe, just maybe, it was possible... this is where the drama and suspense come from.

What does this mean in practice, for AGI and software algorithms? It means we should focus on networks and network analysis. We need to construct networks from various bits of evidence, and then explicitly crawl over them, sometimes assigning crisp but competing parallel-world truth assignments, sometimes assigning confidence values (or mutual information, or "surprisingness") to identify weak links and strong links. Weak links in the sense of "integrated information". Weak links in the sense of identifying irrelevant information, irrelevant arguments, irrelevant deductions that can be pruned away.
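One toy way to cash out "weak links" (the events and counts are invented, and pointwise mutual information is only one of several possible link weights):

```python
import math

# Score each link between two events by pointwise mutual information
# (PMI) estimated from co-occurrence counts, then prune weak links.

def pmi(n_xy, n_x, n_y, n_total):
    # log2 of observed co-occurrence vs. what independence would predict
    return math.log2((n_xy * n_total) / (n_x * n_y))

N = 1000  # total observations
links = {
    # (event_a, event_b): (joint count, count of a, count of b)
    ("crow", "black"): (90, 100, 200),     # strongly associated
    ("crow", "tuesday"): (14, 100, 140),   # ~ independent
}

weights = {pair: pmi(nxy, nx, ny, N) for pair, (nxy, nx, ny) in links.items()}
strong = {pair for pair, w in weights.items() if w > 1.0}
print(strong)   # only the crow-black link survives the pruning
```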
 



First of all, Linas, if you haven’t seen it before you will enjoy the
diagrams in

http://www.scholarpedia.org/article/Connection_method


I have not seen that before, but yes, that is what I'm talking about. It is presented a bit awkwardly, and I can explain why: he's focusing on implication P->Q as (not-P or Q), and P,Q are binary-valued T/F, so of course his connectors and links are of the form (P, not-P). Some of this awkwardness can (maybe) be removed by switching to link-grammar style connectors, where we simply say that A+ connects to A-, without making assumptions that A is a binary T/F value, or that A is a probability, or something else.
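In toy form (our string encoding for illustration, not the actual link-grammar implementation), the connector discipline is just this:

```python
# A "+" connector links rightward to a "-" connector with the same label.
# Nothing is assumed about what the label A denotes -- it need not be a
# binary T/F value or a probability.

def matches(left_connector, right_connector):
    return (left_connector.endswith("+")
            and right_connector.endswith("-")
            and left_connector[:-1] == right_connector[:-1])

print(matches("A+", "A-"))   # True: labels agree, signs oppose
print(matches("A+", "B-"))   # False: label mismatch
print(matches("A-", "A+"))   # False: wrong orientation
```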

 

 link parser it's
way faster for long sentences to use SAT for parsing

FYI, with various tunings and fixes, we squeezed out factors of 1.5x here and 3x there, and today's LG parser is maybe an order of magnitude faster than it used to be. We've sponged that up by making the dictionary far more complex. SAT is faster, now, only for very long sentences.
 

chaining


Yes. The Bibel/Kreitz connection method is a great example of "parsing" instead of "chaining", and the suggestion/claim that this is faster and easier than chaining.

 


L1(w_1, w_3) & L5(w_3,w_7) & L9(w_1,w_7) ==> P4(w_1,w_7) & P8(w_7, w_3)   <p>

where the Pi are logical relationships and <p> is a probability value.

The point of my Medieval/Scholastic Probability thesis is that <p> does not have to be terribly accurate; rather, that it is more important to assign a parse ranking so that one can make judgments of the form:

(a)   L1(w_1, w_3) & L5(w_3,w_7) & L9(w_1,w_7) ==> P4(w_1,w_7) & P8(w_7, w_3)

is more likely than

(b)   L1(w_1, w_3) & L5(w_3,w_7) & L9(w_1,w_7) ==> P2(w_3,w_5) & P6(w_1, w_2)

One then works in two parallel universes: one universe (A) where (a) is 100% true, and a second universe (B) where (b) is 100% true (but is less likely). After extended network analysis on universe (A), we might discover logical inconsistencies in universe (A), which forces us to discard universe (A) and conclude that universe (B) (however unlikely) is true.

All of this in the face of the fact that the initial parse might have assigned <p>=95% to (a) and <p>=5% to (b). The fact that (a) had a very high probability simply does not matter, if universe (A) is inconsistent, flawed, self-contradictory.
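The two-universe procedure can be caricatured in a few lines (the consistency check and the facts are toy stand-ins, not a real inference engine):

```python
# Commit to each parse crisply, discard any "universe" that contains a
# contradiction, and only then fall back on prior likelihood among the
# survivors. A universe is just a set of literal strings here.

def consistent(universe):
    # inconsistent if it asserts both X and "not X"
    return not any(("not " + fact) in universe for fact in universe)

universes = [
    # (prior, facts asserted at 100% once we commit to this parse)
    (0.95, {"P4(w1,w7)", "P8(w7,w3)", "not P4(w1,w7)"}),  # parse (a): flawed
    (0.05, {"P2(w3,w5)", "P6(w1,w2)"}),                   # parse (b)
]

survivors = [(prior, u) for prior, u in universes if consistent(u)]
best_prior, best_universe = max(survivors, key=lambda pu: pu[0])
print(best_prior)   # 0.05: (b) wins despite its low prior, since (a) is inconsistent
```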

As Sherlock Holmes put it: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?"

Linas.

Alexey Potapov

unread,
May 23, 2018, 8:52:20 AM5/23/18
to Linas Vepstas, Anton Kolonin, Andres Suarez, Константин Тимофеев, Nil Geisweiller, opencog
2018-05-22 1:54 GMT+03:00 Linas Vepstas <linasv...@gmail.com>:
On Sat, May 19, 2018 at 1:00 PM, Alexey Potapov <pot...@aideus.com> wrote:
Well... traditional probabilistic programming is a logical probabilistic programming. It's definitely not about lambda-calculus.

I don't know what to do with this statement. There is a famous theorem, the church-turing theorem, dating to the 1930's, that states that anything turing-computable is equivalent to lambda calculus. There have been many extensions, refinements, generalizations and clarifications of that theorem, since then.

Hmm... I thought you were talking about differences between alternative formalizations of algorithms. Of course, they are all equivalent. But what did you mean, then, when you said you don't like lambda calculus?
 

If you have a probabilistic programming language working on a modern-day digital computer, then its lambda-calculus. If you have a theoretical algebra working on infinite-precision topological spaces, that's something else. The quantum-computing machines are often understood as infinite-precision topological vector-space machines  (where the space is complex-projective, and the operators are unitary).

Well... if you meant super-Turing computations, then that is a completely different story... It's fun, but I would rather not discuss it in this context.
 

Alexey Potapov

unread,
May 23, 2018, 2:01:39 PM5/23/18
to opencog, Ben Goertzel, Константин Тимофеев, Nil Geisweiller
Well, I cannot agree with you. I can object to the first two paragraphs at least. However, I'm not sure this will be productive, since I have a feeling that the source of the discrepancy between our views lies mostly in different definitions... But if you wish, we can continue...

Linas Vepstas

unread,
May 23, 2018, 9:53:04 PM5/23/18
to opencog, Ben Goertzel, Константин Тимофеев, Nil Geisweiller
I think we should continue, since we are nominally working together. One cannot work together effectively if there is miscommunication or misunderstanding. These kinds of problems don't melt away or evaporate.

If you think the problem is definitional, then what do you see as the differences of definition?

I tried to give plausible verbal arguments for why things should be a certain way; I don't think I'm being stupid, and I don't particularly see a flaw, per se, at this level. There are, of course, 1001 different but important details that need to be resolved to make things work.

The second point, about neural nets, is far, far more hand-wavey, so if you don't like that one, ignore it. The third point is perhaps much too vague, requiring much too much effort in giving precise definitions at this time. It was meant to evoke a certain vision. The first point is perhaps the most concrete, in terms of how one can actually proceed, with actual code that does actual things.

--linas

Alexey Potapov

unread,
May 24, 2018, 5:04:34 AM5/24/18
to opencog, Ben Goertzel, Константин Тимофеев, Nil Geisweiller
Linas,
OK, let's continue the discussion about probabilities.

2018-05-22 21:18 GMT+03:00 Linas Vepstas <linasv...@gmail.com>:
Do not assume that a probability is what you actually want.  Let me give three examples.

In real life, when you see a crow, and it is dark, and you want to talk about it, you just say "black crow" as an identifier of the object in the scene.  You don't pull out your photometer and measure it's darkness at 87.68% and a blueish hue of 77%. Why? Because you don't need to do that to have a conversation about it's presence, location, movement, etc. You only need to evaluate crow-ness and blackness sufficiently to distinguish it from all other elements of the scene, and then you can assign P==100% for most practical purposes.

I cannot agree. This is controversial at least. My vision system does perform photometry. Why? Because it is designed for reconstructing invariant physical properties (like albedo) of the surrounding world. Humans may not have direct conscious access to the specific values, but many visual 'illusions' show this. E.g. the 'chessboard illusion' shows exactly that we are not interested in raw brightnesses, but try to reconstruct the reflective properties of observable surfaces. Then, what is 'crow-ness'? If somebody tells me, "Hey, there is a white crow", then my vision system will try to detect and recognize it given the a priori information about its presence. Then, it should conclude whether it is really a crow or not. Your "crow-ness" is a rough estimate of P(crow|...). There are very many practical situations when I'm not sure I see a crow, because it is too distant, or is flying too fast, or it is a badly drawn picture, so I conclude that this might be a crow with higher or lower confidence. Even if I don't need detailed information at the level of consciousness to talk about this particular crow, this doesn't mean that my vision system doesn't try to assign probabilities to this object being a crow. Even if my vision system works in a discriminative way and tries to distinguish crows from other objects, this can be treated as assigning probabilities; these are just conditional, not joint, probabilities, so they cannot be used in any other direction of inference, but only to estimate whether this is a crow or not; but they are probabilities.

When I say 'probability', of course, I don't mean its precise objective value. At least, I mean a variational approximation of subjective posteriors, or a mean-field approximation, or a naive-Bayes approximation, or a fuzzy-logic approximation, or whatever other approximation. Of course, we don't try to calculate these probabilities precisely, because it is computationally too expensive. But approximate probabilities are still probabilities. People who say that these are not probabilities actually insist on a particular way of approximating probabilities. And they are completely wrong, because there is no universal way of approximating probabilities. Sometimes you can calculate them precisely. Sometimes you can use a variational approximation with DNNs. Sometimes you can use fuzzy logic. Sometimes you can even avoid calculating these probabilities or their approximations explicitly, as in model-free reinforcement learning, in which probabilities are 'summed out' inside Q-functions, but they are still there in the Bellman equation.

So, no, probabilities are what I want. I don't want precise values of these probabilities in most cases, though (I would like them, but I know this is infeasible). So, if you just meant that precise values of probabilities are not needed (in the sense that their calculation would be a waste of limited computational resources), then yes, I know.



In neural nets, the sigma-function is a non-linear component, used to boost results towards extremes. whatever sum of weights or evidence or whatever it is that you have, as inputs feeding the neural net, you apply the non-linear sigma, to try to sharpen everything closer to either 0% or 100% -- to discriminate. To increase contrast.  This is kind-of the "secret" as to why neural nets work, and probabilities don't.

The sigmoid function (which is actually not too popular now for hidden layers) does not itself boost results toward extremes. What does boost them is the cross-entropy loss. And the sigmoid actually makes the outputs of neurons probabilities (and the cross-entropy loss is the result of this probabilistic, or information-theoretic if you wish, interpretation). So, these are exactly probabilities, and they help to achieve this result which you are talking about.
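To make the point concrete (a bare-bones sketch, no framework assumed):

```python
import math

# The sigmoid maps a real-valued score into (0, 1), so the output can be
# read as a probability; it is the cross-entropy loss that then punishes
# confident-and-wrong outputs severely, driving training toward 0 or 1.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(p, label):
    # negative log-likelihood of the Bernoulli label under probability p
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

p = sigmoid(2.0)                  # ~0.88, a probability-like output
print(cross_entropy(p, 1))        # small: confident and right
print(cross_entropy(p, 0))        # large: confident and wrong
```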
 

In "integrated information" theory, you work with a large complex network of things that are all inter-related, all interconnected.  The goal of applying the theory is to find those extensions of the net that are highly interlinked, interconnected, and then to draw an accurate boundary around them.   If and when you can perceive that boundary, you can give everything inside one name, and everything outside a different name.  The names assigned are unambiguous, unique, even if the actual boundary is perhaps uncertain, even if there is a gradation, a smooth-ish transition from the highly-interconnected thing, to the mostly disconnected parts.   The act of name-tagging is what gives a handle on being able to think about the object in symbolic terms.

Sounds like yet another approximation to probabilities ;)

-- Alexey

Linas Vepstas

unread,
May 25, 2018, 12:05:38 AM5/25/18
to opencog, Ben Goertzel, Константин Тимофеев, Nil Geisweiller
On Thu, May 24, 2018 at 4:04 AM, Alexey Potapov <pot...@aideus.com> wrote:
Linas,
OK, let's continue the discussion about probabilities.

2018-05-22 21:18 GMT+03:00 Linas Vepstas <linasv...@gmail.com>:
Do not assume that a probability is what you actually want.  Let me give three examples.

In real life, when you see a crow, and it is dark, and you want to talk about it, you just say "black crow" as an identifier of the object in the scene.  You don't pull out your photometer and measure it's darkness at 87.68% and a blueish hue of 77%. Why? Because you don't need to do that to have a conversation about it's presence, location, movement, etc. You only need to evaluate crow-ness and blackness sufficiently to distinguish it from all other elements of the scene, and then you can assign P==100% for most practical purposes.

I cannot agree. This is controversial at least.

I was referring specifically to the interaction between language, perception and conscious attention-awareness. Most of your rebuttal is about perception only. I'll make some random unstructured commentary...

 
My vision system does perform photometry.

Well, yes, of course, at both low and high levels. Another interesting thing that it does is solve differential equations for motion prediction. Apparently, this is done in the cortical columns. One of the most interesting modern applications is the teaching of the "still eye" technique to athletes. The goal of still-eye is to minimize head movement and eye movement as much as possible, to obtain the longest possible duration of accurate motion-capture data. A tenth of a second of additional data can make a huge difference in motion prediction. The most dramatic example of this is the photos of football receivers catching the ball with their eyes closed: this is to halt visual input once the visual data is no longer accurate/usable (because they are being hit).

http://www.espn.com/espn/e60/news/story?id=4407415
http://www.science20.com/mark_changizi/wide_receivers_who_catch_their_eyes_closed_explained
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172061

Baseball players don't close their eyes, but are taught still-eye techniques; the difference between college players and league players has been measured: the league players watch the ball for an additional tenth of a second or so. You can try this at home: shake your head while someone throws something at you; it's very noticeable.

 
 If somebody tells me: "Hey, there is a white crow", then my vision system will try to detect

No, almost certainly not. You would probably spend a few moments thinking to yourself, "what the heck is a white crow?" and then decide to look for something that looks like a dove. Or a bird, but not a small bird, that is somehow (?) light-colored. Or something. Kind of confusing. Only after spotting something that might be appropriate would you do the visual pat-down: is it really a crow, or just a white dove, or something else? If it's not a crow, keep looking some more.

"Looking", here, is in the sense of performing (semi-)conscious decisions about the perceived object; your visual cortex never stopped working, but the judgment system has to review each nominated object for "whiteness" and "crowness".
 
 Your "crow-ness" is a rough estimation of P(crow|...).

Well, yes, in this kind of "what the heck are you talking about" situation, the optical subsystem has to nominate one or more candidates, while the judgmental subsystem then accepts or rejects each. This goes back to my Medieval Scholastic courtroom drama: several objects are accused of being white crows, but does the evidence actually support that conclusion?  There's shape, there's size, there's posture.  If there is insufficient evidence, you discard the claim.
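Just to make the "nominate, then judge" idea concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the candidate names, the evidence weights, the threshold); it is not any actual OpenCog code. The optical subsystem proposes candidates, and the judgment subsystem tallies the evidence pro and con for each, discarding those with insufficient support:

```python
# Toy sketch of the nominate-then-judge loop described above.
# All names and weights are illustrative, not real OpenCog APIs.

def judge(evidence_weights, threshold=1.0):
    """Accept a nominated candidate only if the summed evidence
    (pro weights positive, con weights negative) clears a threshold."""
    support = sum(evidence_weights.values())
    return support >= threshold

# The optical subsystem nominates candidates; the judgment subsystem reviews each.
nominees = [
    {"name": "dove",
     "evidence": {"whiteness": 0.9, "crow-shape": -0.8, "size": -0.3}},
    {"name": "white crow",
     "evidence": {"whiteness": 0.8, "crow-shape": 0.7, "size": 0.2}},
]

accepted = [n["name"] for n in nominees if judge(n["evidence"])]
print(accepted)  # the dove fails on shape and size; only the white crow survives
```

The point of the sketch is only the shape of the computation: no single probability is ever reported, just an accumulation of pro and con evidence followed by a firm verdict.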

By the time the perceived object is in conscious awareness, ready to interact with the language subsystem, it is effectively assigned P=100% probability with confidence=100%. (There are corner cases where this would not hold.) But for the large majority of everyday situations, this is the case -- like when someone says "please pass the salt".  You don't stop to think, "gee, I think he said salt with 78% probability", and "I think I see a salt shaker with 92% probability", and "therefore I will move my arm with 0.78*0.92 probability".   You just reflexively grab the object and perform the action.  Some 5% of the time, you will incorrectly pass the pepper shaker -- which provides indirect evidence that the judgment system made the wrong judgment 5% of the time -- due to bad lighting, poor language comprehension, being tired, being drunk, etc.
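A minimal sketch of that collapse, in Python. The numbers and function name are my own invention for illustration: the subsystems could, in principle, multiply probabilities, but the action layer is all-or-nothing:

```python
# Illustrative sketch: probabilistic subsystems, but a binary action.
# The 0.78 and 0.92 figures echo the example in the text; the threshold
# is an assumption made up for this sketch.

def pass_the_salt(p_heard_salt, p_see_shaker, threshold=0.5):
    """Multiply the subsystem probabilities, then collapse to one action."""
    joint = p_heard_salt * p_see_shaker   # 0.78 * 0.92 is about 0.72
    return "grab the shaker" if joint > threshold else "ask again"

print(pass_the_salt(0.78, 0.92))  # prints "grab the shaker"
```

Once the action is taken, the joint probability is forgotten: the grab is performed as if the conclusion were 100% certain.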
 

When I'm saying 'probability',

And probabilities are great, at certain subsystem levels.  But by the time you start having to perform conscious decision-making, you are no longer computing probabilities per se: you are evaluating a large network of inter-related evidence, to see if there is sufficient support to arrive at a particular conclusion.

You may entertain multiple competing hypotheses: maybe "white crow" is this object, or maybe it's that object. Maybe the speaker is making a funny joke, and I should be looking not for birds, but for old ladies dressed in white.    These are possibilities, and it's not so much that each possibility is assigned a numerical value, per se; rather, the entire network of supporting evidence, pro or con, is evaluated.  The network with the most comprehensive evidence is selected, and the verbal conversation then proceeds on the assumption of 100% correctness, until either the conversation halts, moves on, or points of disagreement are found.
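The hypothesis-selection step above can be sketched as a comparison of total evidential support rather than of single probability numbers. All the hypotheses and weights below are invented purely for illustration:

```python
# Toy sketch: pick the hypothesis whose evidence network gives the most
# net support, then proceed as if it were 100% correct until contradicted.
# Hypothesis names and weights are made up for this example.

hypotheses = {
    "a bird over there":       {"pro": [0.6, 0.4], "con": [0.1]},
    "a joke about old ladies": {"pro": [0.3],      "con": [0.5, 0.2]},
}

def net_support(h):
    """Sum the evidence for, minus the evidence against."""
    return sum(h["pro"]) - sum(h["con"])

best = max(hypotheses, key=lambda name: net_support(hypotheses[name]))
print(best)  # prints "a bird over there"
```

Note that the losing hypothesis is not assigned a residual probability; it is simply dropped, and only resurrected if the conversation later hits a point of disagreement.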

BTW, I am not claiming that this is how the human brain actually works (or am I? I want to have it both ways) - for that we need neuroscience. But I am claiming that this is a practical way to engineer a system that can have conversations about the world.  I am not claiming that this is AGI. I am claiming that, as a device that could actually be built within the next few years, it would be a better approximation to AGI than what we currently have.

Linas