Fwd: Perception and Action

Linas Vepstas

Apr 28, 2024, 10:35:25 PM

to link-grammar

By convention, Link Grammar is used only to parse natural language. However, it can be used to create linkages in a more general setting, having nothing to do with language. Below is a description of a non-mainstream application of LG to create connection states between different kinds of data processing subsystems.

It takes the form of some experimental computer science, in the following git repo:

https://github.com/opencog/sensory

The README has more:

Sensory Atomese
===============
This repo explores how perception and action within an external world
might work with the [AtomSpace](https://github.com/opencog/atomspace).

TL;DR: Explores philosophical approaches to perception & action via
actual working code using low-level AtomSpace sensory I/O Atoms.

The experimental lab for this is "perceiving" filesystem files,
"moving" through directories, and likewise for IRC chat streams.

Philosophical Overview
----------------------
The issue for any agent is being able to perceive the environment that
it is in, and then being able to interact with this environment.

For OpenCog, and, specifically, for OpenCog Atomese, all interaction,
knowledge, data and reasoning is represented with and performed by
Atoms (stored in a hypergraph database) and Values (transient data
flowing through a network defined by Atoms).

It is not hard to generate Atoms, flow some Values around, and perform
some action (control a robot, say some words). The hard part is to
(conceptually, philosophically) understand how an agent can represent
the external world (with Atoms and Values), and how it should go about
doing things. The agent needs to perceive the environment in some way
that results in an AtomSpace representation that is easy for agent
subsystems to work with. This perception must include the agent's
awareness of the different kinds of actions it might be able to perform
upon interacting with the external world. That is, before one can
attempt to model the external world, one must be able to model one's own
ability to perform action. This is where the boundary between the agent
and the external world lies: the boundary of the controllable.

Traditional human conceptions of senses include sight and hearing;
traditional ideas of action consist of moving (robot) limbs. See, for example,
the git repo [opencog/vision](https://github.com/opencog/vision)
for OpenCV-based vision Atomese. (Note: It is at version 0.0.2)

The task being tackled here is at once much simpler and much harder:
exploring the unix filesystem, and interacting via chat. This might
sound easy, trivially easy, even, if you're a software developer.
The hard part is this: how does an agent know what a "file" is?
What a "directory" is? Actions it can perform are to walk the directory
tree; but why? Is it somehow "fun" for the agent to walk directories
and look at files? What should it do next? Read the same file again,
or maybe try some other file? Will the agent notice that maybe some
file has changed? If it notices, what should it do? What does it mean,
personally, to the agent, that some file changed? Should it care? Should
it be curious?

The agent can create files. Does it notice that it has created them?
Does it recognize those files as works of its own making? Should it
read them, and admire the contents? Or perform some processing on them?
Or is this like eating excrement? What part of the "external world"
(the filesystem) is perceived to be truly external, and what part is
"part of the agent itself"? What does it mean to exist and operate in
a world like this? What's the fundamental nature of action and
perception?

When an agent "looks at" a file, or "looks at" the list of users on
a chat channel, is this an action, or a perception? Both, of course:
the agent must make a conscious decision to look (take an action) and
then, upon taking that action, sense the results (get the text in the
file or the chat text). After this, it must "perceive" the results:
figure out what they "mean".

These are the questions that seem to matter, for agent design. The code
in this git repo is some extremely low-level, crude Atomese interfaces
that try to expose these issues up into the AtomSpace.

Currently, two interfaces are being explored: a unix filesystem
interface, and an IRC chat interface. Hopefully, this is broad enough to
expose some of the design issues. Basically, chat is not like a
filesystem: there is a large variety of IRC commands, there are public
channels, there are private conversations. They are bi-directional.
The kind of sensory information coming from chat is just different than
the sensory information coming from files (even though, as a clever
software engineer, one could map chat I/O to a filesystem-style
interface.) The point here is not to be "clever", but to design
action-perception correctly. Trying to support very different kinds
of sensorimotor systems keeps us honest.

Typed Pipes and Data Processing Networks
----------------------------------------
In unix, there is the conception of a "pipe", having two endpoints. A
pair of unix processes can communicate "data" across a pipe, merely by
opening each endpoint, and reading/writing data to it. Internet sockets
are a special case of pipes, where the connected processes are running
on different computers somewhere on the internet.

Unix pipes are not typed: there is no a priori way of knowing what kind
of data might come flowing down them. Could be anything. For a pair of
processes to communicate, they must agree on the message set passing
through the pipe. The current solution to this is the IETF RFCs, which
are a rich collection of human-readable documents describing datastream
formats at the meta level. In a few rare cases, one can get a
machine-readable description of the data. An example of this is the DTD, the
[Document Type Definition](https://en.wikipedia.org/wiki/Document_type_definition),
which is used by web browsers to figure out what kind of HTML is being
delivered (although the DTD is meant to be general enough for "any" use.)
Other examples include the X.500 and LDAP schemas, as well as SNMP.

However, there is no generic way of asking a pipe "hey mister pipe, what
are you? What kind of data passes over you?" or "how do I communicate
with whatever is at the other end of this pipe?" Usually, these
questions are resolved by some sort of hand-shaking and negotiation
when two parties connect.

The experiment being done here, in this git repo, in this code-base, is
to assign a type to a pipe. This replaces the earliest stages of
protocol negotiation: if a system wishes to connect only to pipes of type
`FOO`, then it can know what is available a priori, by examining the
connection types attached to that pipe. If they are
`BAR+ or FOO+ or BLITZ+`, then we're good: the `or` is a disjunctive-or,
a menu choice of what is being served on that pipe. Upon opening that
pipe, some additional data descriptors might be served up, again in the
form of a menu choice. If the communicating processes wish to exchange
text data, they eventually find `TEXT-` and `TEXT+`, which are two
connectors stating "I'll send you text data" and "That's great, because
I can receive text data".
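
The menu idea can be sketched in a few lines of Python. This is only an
illustration, not code from this repo: the helper names `parse_menu` and
`can_connect` are made up, and the parsing handles nothing beyond a flat
`or` list of connectors.

```python
# Hypothetical helpers; not part of the sensory repo.
def parse_menu(description: str) -> set[tuple[str, str]]:
    """Split 'BAR+ or FOO+ or BLITZ+' into a set of (type, sign) pairs."""
    menu = set()
    for item in description.split(" or "):
        item = item.strip()
        menu.add((item[:-1], item[-1]))
    return menu


def can_connect(description: str, wanted_type: str, wanted_sign: str) -> bool:
    """A connector mates with one of the same type and the opposite sign."""
    opposite = "-" if wanted_sign == "+" else "+"
    return (wanted_type, opposite) in parse_menu(description)


pipe = "BAR+ or FOO+ or BLITZ+"
print(can_connect(pipe, "FOO", "-"))   # a FOO- client looks for FOO+ on the menu
print(can_connect(pipe, "TEXT", "-"))  # nothing of type TEXT is on this menu
```

The point of the sketch is that the check happens before any data flows:
the client only reads the description attached to the pipe.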

So far, so good. This is just plain-old ordinary computer science, so
far. The twist is that these data descriptors are being written as Link
Grammar (LG) connector types. Link Grammar is a language parser: given a
collection of "words", to which a collection of connectors are attached,
the parser can connect up the connectors to create "links". The linkages
are such that the endpoints always agree as to the type of the
connector.

The twist of using Link Grammar to create linkages changes the focus
from pair-wise, peer-to-peer connections, to a more global network
connection perspective. A linkage is possible, only if all of the
connectors are connected, only if they are connected in a way that
preserves the connector types (so that the two endpoints can actually
talk to one-another.)

This kind of capability is not needed for the Internet, or for
peer-to-peer networks, which is why you don't see this "in real life".
That's because humans and sysadmins and software developers are smart
enough to figure out how to connect what to what, and corporate
executives can say "make it so". However, machine agents and "bots" are
not this smart.

So the aim of this project is to create a sensory-motor system, which
self-describes using Link Grammar-style disjuncts. Each "external world"
(the unix filesystem, IRC chat, a webcam or microphone, etc.) exposes
a collection of connectors that describe the data coming from that
sensor (text, images ...) and a collection of connectors that describe
the actions that can be taken (move, open, ...) These connector-sets
are "affordances" to the external world: they describe how an agent can
work with the sensori-motor interface to "do things" in the external
world.

Autonomous Agents
-----------------
The sensori-motor system is just an interface. In between must lie a
bunch of data-processing nodes that take "inputs" and convert them to
"outputs". There are several ways to do this. One is to hand-code,
hard-code these connections, to create a "stimulus-response" (SRAI)
type system. For each input (stimulus) some processing happens,
and then some output is generated (response). A second way is to create
a dictionary of processing elements, each of which can take assorted
inputs or outputs, defined by connector types. Link Grammar can then be
used to obtain valid linkages between them. This approach resembles
electronics design automation (EDA): there is a dictionary of parts
(resistors, capacitors, coils, transistors ... op amps, filters, ...)
each able to take different kinds of connections. With guidance from the
(human) user, the EDA tool selects parts from the dictionary, and hooks
them up in valid circuits. Here, Link Grammar takes the role of the EDA
tool, generating only valid linkages. The (human) user still has to
select the "LG words" or "EDA parts", but LG/EDA does the rest,
generating a "netlist" (in the case of EDA) or a "linkage" (in the case
of LG).

What if there is no human to guide parts selection and circuit design?
You can't just give an EDA tool a BOM (Bill of Materials) and say
"design some random circuit out of these parts". Well, actually, you
can, if you use some sort of evolutionary programming system. Such
systems (e.g. [as-moses](https://github.com/opencog/as-moses)) are able
to generate random trees, and then select the best/fittest ones for some
given purpose. A collection of such trees is called a "random forest" or
"decision tree forest", and, until a few years ago, random forests were
competitive in the machine-learning industry, equaling the performance
seen in deep-learning neural nets (DLNN).

Deep learning now outperforms random forests. Can we (how can we) attach
a DLNN system to the sensori-motor system being described here? Should
we, or is this a bad idea? Let's review the situation.

* Yes, maybe hooking up DLNN to the sensory system here is a stupid
idea. Maybe it's just technically ill-founded, and there are easier
ways of doing this. But I don't know; that's why I'm doing these
experiments.

* Has anyone ever built a DLNN for electronic circuit design? That is,
taken a training corpus of a million different circuit designs
(netlists), and created a new system that will generate new
electronic circuits for you? I dunno. Maybe.

* Has anyone done this for software? Yes, GPT-4 (and I guess Microsoft's
GitHub Copilot) is capable of writing short sequences of valid software to
accomplish various tasks.

* How should one think about "training"? I like to think of LLMs as
high-resolution photo-realistic snapshots of human language. What you
"see" when you interact with GPT-2 are very detailed models of things
that humans have written, things in the training set. What you see
in GPT-4 are not just the surface text-strings, but a layer or two
deeper into the structure, resembling human reasoning. That is, GPT-2
captures base natural language syntax (as a start), plus entities and
entity relationships and entity attributes (one layer down, past
surface syntax.) GPT-4 goes down one more layer, adequately capturing
some types of human reasoning (e.g. analogical reasoning about
entities). No doubt, GPT-5 will do an even better job of emulating
the kinds of human reasoning seen in the training corpus. Is it
"just emulating" or is it "actually doing"? This is where the
industry experts debate, and I will ignore this debate.

* DLNN training is a force-feeding of the training corpus down the
gullet of the network. Given some wiring diagram for the DLNN,
carefully crafted by human beings to have some specific number of
attention heads of a certain width, located at some certain depth,
maybe in several places, the training corpus is forced through the
circuit, arriving at a weight matrix via gradient descent. Just like a
human engineer designs an electronic circuit, so a human engineer
designs the network to be trained (using TensorFlow, or whatever).

The proposal here is to "learn by walking about". A decade ago, the MIT
Robotics Lab (and others) demoed randomly-constructed virtual robots
that, starting from nothing, learned how to walk, run, climb, jump,
navigate obstacles. The training here is "learning by doing", rather
than "here's a training corpus of digitized humans/animals walking,
running, climbing, jumping". There's no corpus of moves to emulate;
there's no single-shot learning of dance-steps from youtube videos.
The robots stumble around in an environment, until they figure out
how things work, how to get stuff done.

The proposal here is to do "the same thing", but instead of doing it
in some 3D landscape (Gazebo, Stage/Player, Minecraft...) to instead
do it in a generic sensori-motor landscape.

Thus, the question becomes: "What is a generic sensori-motor landscape?"
and "how does a learning system interface to such a thing?" This git
repo is my best attempt to try to understand these two questions, and to
find answers to them. Apologies if the current state is underwhelming.

-- Linas

Patrick: Are they laughing at us?

Sponge Bob: No, Patrick, they are laughing next to us.

Linas Vepstas

May 1, 2024, 5:48:06 PM

to ope...@googlegroups.com, Amir P, link-grammar

Hi Ivan,

On Tue, Apr 30, 2024 at 4:26 AM Ivan V. <ivan....@gmail.com> wrote:

Actions it can perform are to walk the directory tree; but why? Is it somehow "fun" for the agent to walk directories and look at files? What should it do next?

Very interesting questions.

"Basal cognition". Type it in as a search term.

I eventually concluded that my demos are too complicated. So I fell back to a simpler case: a single source (of "sensory input" coming from the external environment) and a single sink (a "motor" that can accept data, and affect the external world in some way.) For this most basic example, I'm using two xterms: you can type into one (this is the "sensory input") while the second xterm displays the text that the agent generated. For the agent itself, I am trying to set up just a single pipe, from source to sink, and all it does is to copy the data. So it's a super-dumb agent: it passes input directly to output. You type into one terminal, and whatever you type in shows up in the other terminal.
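
As a rough picture of how dumb this agent is, here is a Python sketch of the copy step. It is not the Atomese implementation: `copy_agent` is a made-up name, and two plain lists stand in for the two xterms.

```python
from typing import Callable, Iterable


def copy_agent(source: Iterable[str], sink: Callable[[str], None]) -> int:
    """Pass every item from source to sink unchanged; return how many were copied."""
    count = 0
    for text in source:
        sink(text)
        count += 1
    return count


# Stand-ins for the two xterms: one list plays the typed-in text,
# the other collects what the "display" terminal would show.
typed = ["hello", "world"]
displayed: list[str] = []
copied = copy_agent(typed, displayed.append)
print(copied, displayed)
```

Nothing in this sketch says how the agent found the source and the sink in the first place; that is the part the rest of this message is about.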

The hard part is mapping capabilities to actions.

The xterm has four capabilities: describe, open, read, write. The "describe" capability is "god-given", it always exists, and so the "lookat" action `(cog-execute! (LookatLink ...))` will always return a description. For the xterm, the description is:

(OPEN- & TXT+) or (WRITE- & TXT-)

This uses Link Grammar notation for connectors and disjuncts. The OPEN- connector says that you can send the "open" command as a valid message to the xterm device. The (OPEN- & TXT+) disjunct says that, if you send the Open message, you will get back a text-source, a stream that can spew endless amounts of text.

The WRITE- connector says that you can send the "write" message. The (WRITE- & TXT-) says that if you send the "write" message, you must also provide a text source. You must provide a readable text-pipe from which text-data can be sucked out of, and written into the external environment.

Clearly, TXT- can be hooked up to TXT+ to form an LG link. However, the linkage is incomplete, because OPEN- and WRITE- remain unconnected. For that, an agent is needed: the agent is able to provide the connectors needed to obtain a full linkage, completing the diagram. By examination, the agent needs to be (OPEN+ & WRITE+). Thus, the link-grammar dictionary is:

agent: (OPEN+ & WRITE+);

xterm: (OPEN- & TXT+) or (WRITE- & TXT-);

The "sentence" to parse is

"xterm agent xterm"

and the parse is the LG diagram

  +-------------> TXT ------>+
  +<-- OPEN <--+--> WRITE -->+
  |            |             |
xterm        agent         xterm

The above diagram describes a system consisting of one sensory device (the xterm on the left) one motor device (the xterm on the right) and an agent that is capable of triggering the open and write commands, so as to make text flow, from the input device to the output device.

The above is a linkage, and it is the only possible linkage for the bag of two xterms and this particular kind of agent. If the link-grammar generator were set to run free on this bag-of-parts list, this is the only valid diagram that can result.
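
The uniqueness claim can be checked by brute force. The Python sketch below is not the Link Grammar parser and makes simplifying assumptions: it treats +/- purely as output/input, ignores word order and planarity, forbids a word from linking to itself, and requires every connector to be used exactly once. The names `pairings` and `linkages` are made up for the illustration.

```python
from itertools import product

# The dictionary from above: each word maps to its disjunct choices,
# each disjunct being a tuple of (type, sign) connectors.
DICTIONARY = {
    "agent": [(("OPEN", "+"), ("WRITE", "+"))],
    "xterm": [(("OPEN", "-"), ("TXT", "+")),
              (("WRITE", "-"), ("TXT", "-"))],
}


def pairings(plus, minus):
    """Yield every way to pair each '+' with a same-type '-' on another word."""
    if not plus:
        if not minus:          # nothing left dangling: a full linkage
            yield []
        return
    (word, ctype), rest = plus[0], plus[1:]
    for i, (other, otype) in enumerate(minus):
        if otype == ctype and other != word:
            for tail in pairings(rest, minus[:i] + minus[i + 1:]):
                yield [(word, other, ctype)] + tail


def linkages(words, dictionary=DICTIONARY):
    """Yield the sorted link list of every fully connected disjunct assignment."""
    for choice in product(*(dictionary[w] for w in words)):
        plus = [(i, t) for i, d in enumerate(choice) for t, s in d if s == "+"]
        minus = [(i, t) for i, d in enumerate(choice) for t, s in d if s == "-"]
        for links in pairings(plus, minus):
            yield sorted(links)


results = list(linkages(["xterm", "agent", "xterm"]))
for links in results:
    print(links)  # each link is (sending word index, receiving word index, type)
```

Under these assumptions the search finds two results that are mirror images of each other (either xterm can play the role of the source), which is the same single diagram when the parts are treated as an unordered bag.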

Using an electronics analogy, the parts list is the BOM (Bill of Material) and it's like you have a bag with resistors, capacitors, transistors, and you shake it around, and circuits fall out. The only valid circuits are those which have all wires fully connected (soldered together). The link-grammar linkage is the same thing as the electronics EDA netlist.

Of course, most random electronic circuits are going to be useless. If only there was some way of having a training corpus of electronic circuits, and some way of using deep learning to create a LECM (Large Electronic Circuit Model)... but alas, no such system exists.

In biology, we have this concept of "assembly theory": some way of building self-assembling systems. So you throw a bunch of lipids in a bag, and out pops a lipid bilayer. Throw a bunch of amino acids in a bag, and out pops a protein. Throw in some ribose sugars, out pops a DNA strand. Things connect up to other things, in a somewhat mysterious way, using affinities on each connector. The Bayesian priors for each connecting atom are a so-called "mixed state" of quantum conformations, assigning a likelihood ("prior") to each possibility. The actual hookup, where things actually connect, is a selection of one of these possible connectors from the set of disjuncts, weighted by the "cost". (It's called "cost" in link-grammar; it's called "enthalpy" in chemistry.) The "big idea" from Karl Friston is nothing more than a mixed state of Bayesian priors. But I digress.
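
One way to picture "a selection weighted by the cost" is sketched below in Python. The Boltzmann-style weighting exp(-cost) is my assumption for the illustration, not something taken from link-grammar or from this repo, and the costs themselves are made up.

```python
import math
import random


def disjunct_probabilities(costs: dict[str, float]) -> dict[str, float]:
    """Turn per-disjunct costs into probabilities; lower cost means more likely."""
    weights = {d: math.exp(-c) for d, c in costs.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


# Made-up costs for the two xterm disjuncts.
costs = {"(OPEN- & TXT+)": 0.0, "(WRITE- & TXT-)": 1.0}
probs = disjunct_probabilities(costs)

# Pick one disjunct at random, according to those probabilities.
choice = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, choice)
```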

Anyway, consider adding a chatbot to this mix. A chatbot is a processing device that can accept text as input, and generate text as output, so its disjunct is (TXT- & TXT+).

The corresponding LG dictionary is

ioagent: (OPEN+ & WRITE+);

chatbot: (TXT- & TXT+);

xterm: (OPEN- & TXT+) or (WRITE- & TXT-);

then the following circuit diagram can result:

  +------------> TXT --+----> TXT --->+
  +<-- OPEN <--+-------|---> WRITE -->+
  |            |       |              |
xterm       ioagent chatbot         xterm

This is perhaps the simplest example of hooking up one sensory device (source of data from the environment) and one motor (device capable of acting upon the environment) to some machine (the chatbot) that does some "data processing" in the middle.

Some subtle points: in general, for some sensory device, we don't know what kind of data it delivers or accepts. That's why TXT+ and TXT- are needed. Here, TXT is a type-theory type. It's a class-label stuck onto the kinds of messages that can flow about. Likewise, OPEN and WRITE are also types. Sensory devices will typically accept OPEN messages, and motors will typically accept WRITE, but this is not always obvious.

Whether or not any given device has "open" and "write" methods on it is "described" by the dictionary, which encodes the message types that can be sent & received. So "open" and "write" become messages.

The number of "arguments" and argument types for a given message is also specified by LG connectors. For example, "open" might need a URL as an argument. For an IRC chatbot, you need two arguments: the name of the IRC network and also the chat channel to join. In this case, the disjunct for IRC would be (OPEN- & NET- & CHAN-) where the NET type is a subtype of TXT and CHAN is also a subtype of TXT, and any agent attempting to open an IRC chat needs to provide text strings for these two. Those text strings must "come from somewhere", they don't just magically appear.
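
A small Python sketch of the subtype check, under the assumption (mine, not something the Atomese code states) that an output may feed an input whenever the output's type is the same as, or a subtype of, the input's type. The table `PARENT` and the function names are made up for the illustration.

```python
# Hypothetical subtype table: NET and CHAN are declared to be kinds of TXT.
PARENT = {"NET": "TXT", "CHAN": "TXT"}


def is_subtype(ctype: str, expected: str) -> bool:
    """True if ctype equals expected or descends from it in PARENT."""
    while ctype != expected:
        if ctype not in PARENT:
            return False
        ctype = PARENT[ctype]
    return True


def can_feed(output_type: str, input_type: str) -> bool:
    """Assumed rule: a '+' of type S can mate with a '-' of type T if S is a T."""
    return is_subtype(output_type, input_type)


print(can_feed("NET", "TXT"))   # a network name is still text
print(can_feed("TXT", "NET"))   # arbitrary text is not known to be a network name
print(can_feed("NET", "CHAN"))  # sibling subtypes do not mate
```

Under this rule, a plain xterm offering TXT+ could not be wired straight into NET- or CHAN-; something would have to relabel the text first. Whether that is the desired behavior is an open design question.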

They might, for example, come from an xterm, where a human user types them in. This is just like a web-browser, where the URL bar is like the xterm, it allows a human user to type in a URL, which then gets piped along. Just to be clear, the GUI on a web browser would also have a disjunct like (OPEN- & FLOAT+ & FLOAT+) which says that, after opening, the web browser promises to generate a click-stream of x,y window coords. There's also a CSS agent that's got a (FLOAT- & FLOAT- & URL+) that says it will accept a pair of window x,y coordinates, and return the URL that was actually clicked on by the user.

So far, I've been using heterosexual +/- marks on the connectors, which, in the above examples, stand for "input" and "output". Reality is more complicated, and so the Atomese connectors have a SexNode which can be + or - but can also be other things with different kinds of mating rules. The +/- rules are enough to implement classical lambda calculus and beta reduction, so it is "Turing complete", but lambda calculus is kind of hard to deal with, so having a SexNode to provide more complex mating styles seemed like the right thing to do. (If you recall details from the grammar-learning project from 5-10 years ago, these were called ConnectorDirectionNode in that code-base. This was too much to type in, so it's just SexNode now.)

Please note that ROS (the Robot Operating System) already implements maybe half or two-thirds of what I describe above. The LG links are unix pipes and/or TCP/IP sockets. The "dictionary" is actually a YAML file that describes the motor and/or sensor. Then there are additional YAML files that provide the netlist of what hooks up to what. The problem is, of course, that the YAML is not written in link-grammar format, nor is it in Atomese format, so you can't just slap around some Atomese and get a working ROS netlist out of it. And, if you have used ROS, then you know that writing ROS YAML is just about as hard as writing Atomese: they're both annoyingly difficult to tweak.

If you use GitHub Actions, some of the above might remind you of the GitHub Actions or CircleCI YAML files. This is not an accident: the CircleCI files describe a process network flow to perform unit tests, where the sensory device is a read of a git repo, the agents and actions are the compile & run of the unit tests, and the output is whether the unit tests passed or failed.

If you've ever screwed around with javascript node.js then you know `package.json` and `package-lock.json`. These are sensory-device description files (or more accurately, agent-description files), in that they describe what this particular node.js device provides: the inputs, the outputs. You then use `npm run make` to parse the system and perform all the hookups. You will also be familiar with the carpet burns you get from screwing the electron. Hooking together open connectors into a functioning network is difficult.

There's no "one ring to rule them all". I am not aware of any generic comp-sci theory for exploring the autogeneration of networks. Doing Atomese Sensory is sort of my personal best-guess for the generic theory/system of "making things hook together into a sensible autopoietic system". The "basal cognition" for agent-environment interaction.

Anyway, this is the current status of the language-learning effort. The demos work but are incomplete. Version 0.2 at https://github.com/opencog/sensory

The twist is that these data descriptors are being written as Link Grammar (LG) connector types.

I believe there is a correspondence between types and grammars. I worked out a typed framework (implementation is in process) which uses a kind of unrestricted grammar instead of traditional types.

Write it up. Send it out.

IvanV

On Mon, Apr 29, 2024 at 04:28 Linas Vepstas <linasv...@gmail.com> wrote:
For your disbelief and general entertainment: a new project exploring what perception and action is, and how this could be integrated with Atomese agents:

https://github.com/opencog/sensory

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA36nfE29duzPNwx0kW4dzhsb44Q%3DSJ8ezEudLb0YLUpRLg%40mail.gmail.com.
