Re: Audio-video unsupervised learning [was: Re: [opencog-dev] UnionLink, IntersectionLink, ComplementLink]

Linas Vepstas

Sep 13, 2021, 1:53:54 PM
to opencog, link-grammar
On Mon, Sep 13, 2021 at 6:49 AM Adrian Borucki <> wrote:
> On Sunday, 12 September 2021 at 18:55:23 UTC+2 linas wrote:
>> On Sun, Sep 12, 2021 at 8:29 AM Adrian Borucki <> wrote:
>> >
>> >> ----
>> >> As to divine intervention vs. bumbling around: I'm still working on
>> >> unsupervised learning, which I hope will someday be able to learn the
>> >> rules of (common-sense) inference. I think I know how to apply it to
>> >> audio and video data, and am looking for anyone who is willing to get
>> >> neck-deep in both code and theory. In particular, for audio and
>> >> video, I need someone who knows GPU audio/video processing libraries,
>> >> and is willing to learn how to wrap them in Atomese. For starters.
>> >
>> >
>> > I might have some time to help with this - I only did a bit of video / audio processing for ML but I have
>> > some familiarity with AtomSpace, so that part should be easier.
>> >
>> Wow! That would be awesome!
>> I thought some more about the initial steps. A large part of this
>> would be setting up video/audio filters to run on GPUs, with the goal
>> of being able to encode the filtering pipeline in Atomese -- so that
>> expressions like "apply this filter then that filter then combine this
>> and that" are stored as expressions in the AtomSpace.
>> The research program would then be to look for structural correlations
>> in the data. Generate some "random" filter sequences (building on
>> previously "known good" filter structures) and see if they have
>> "meaningful" correlations in them. Build up a vocabulary of "known
>> good" filter sequences.
>> One tricky part is finding something simple to start with. I imagined
>> the local webcam feed: it should be able to detect when I'm in front
>> of the keyboard, and when not, and rank that as an "interesting" fact.
> Sounds like something that would be processed with a library like OpenCV — it’s important to distinguish between
> video data loading and using GPU-accelerated operations. My experience with the latter is very small as this is something usually wrapped with some
> library like PyTorch or RAPIDS. Also there is a difference between running something on-line vs batch processing of a dataset — you mostly gain from GPU acceleration
> when working with the latter, unless it’s something computationally expensive that’s supposed to run in real time.
> First, we need to elucidate what actual “filters” are supposed to be used — when we have a list I can think about how the operations would be run.
> Second, if you don’t have an existing dataset that we can use then we have to build one, that is probably the most time and resource-consuming task here… probably should be done first actually.
> There are existing video datasets that might be useful, it’s worth looking into those.

Good. Before that, though, I think we need to share a general vision
of what the project "actually is", because that will determine
datasets, libraries, etc. I tried to write those down in a file -- but
it is missing important details, so let me try an alternate sketch.

So here's an anecdote from Sophia the Robot: she had this habit of
trying to talk through an audience clapping. Basically, she could not
hear, and didn't know to pause when the audience clapped. (Yes, almost
all her performances are scripted. Some small fraction are ad libbed.)
A manual operator in the audience would have to hit a pause button, to
keep her from rambling on. So I thought: "How can I build a clap
detector?" Well, it would have to be some kind of audio filter -- some
level of white noise (broad spectrum noise), but with that peculiar
clapping sound (so, not pure white noise, but dense shot noise.)
It would have to stay elevated above a threshold T for a time period S
of at least one second. It is useful to think of this as a wiring diagram: some
boxes connected with lines; each box might have some control
parameters: length, threshold, time, frequency.
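
A minimal sketch of the hand-wired version of such a detector, in Python. It covers only the "elevated above threshold T for at least S seconds" part, using plain frame energy; it makes no attempt to tell clapping apart from any other loud broadband noise. The function names, the frame length, and the synthetic test signal are all assumptions made for illustration.

```python
import math
import random

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames; return the RMS of each frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return frames

def detect_sustained(samples, rate, threshold, min_seconds, frame_len=400):
    """Return (start_sec, end_sec) spans where the frame RMS stays above
    `threshold` (assumed non-negative) for at least `min_seconds`."""
    energies = frame_rms(samples, frame_len)
    frame_sec = frame_len / rate
    spans = []
    run_start = None
    # The trailing 0.0 is a sentinel that closes a run reaching the end.
    for i, energy in enumerate(energies + [0.0]):
        if energy > threshold and run_start is None:
            run_start = i
        elif energy <= threshold and run_start is not None:
            if (i - run_start) * frame_sec >= min_seconds:
                spans.append((run_start * frame_sec, i * frame_sec))
            run_start = None
    return spans

# Synthetic signal at 8 kHz: 1 s quiet, 2 s loud noise, 1 s quiet.
rng = random.Random(0)
RATE = 8000
quiet = [rng.uniform(-0.01, 0.01) for _ in range(RATE)]
loud = [rng.uniform(-0.5, 0.5) for _ in range(2 * RATE)]
spans = detect_sustained(quiet + loud + quiet, RATE, threshold=0.1, min_seconds=1.0)
print(spans)
```

The two control parameters here, `threshold` and `min_seconds`, play the role of T and S; `frame_len` is a third knob of the kind each box in the wiring diagram would carry.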

So how do I build a clap detector? Well, download some suitable audio
library, get some sound samples, and start trying to wire up some
threshold detector *by hand*. Oooof. Yes, you can do it that way:
classical engineering. After that, you have a dozen different other
situations: booing. Laughing. Tense silence. Chairs scraping. And
after that, a few hundred more... it's impossible to hand-design a
filter set for every interesting case. So, instead: unleash automated
learning. That is, represent the boxes and wires as Nodes and Links
in the AtomSpace (the audio stream itself would be an
AudioStreamValue) and let some automated algo rearrange the wiring
diagram until it finds a good one.
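
To make "boxes and wires as data" concrete, here is a rough Python sketch. It is not Atomese and does not use the AtomSpace API; the box names (`source`, `rectify`, `smooth`, `threshold`), the parameter ranges, and the helper functions are hypothetical. It shows a wiring diagram stored as a nested expression, an evaluator that runs it over a list of samples, and a generator of random diagrams. How to score a diagram is deliberately left out.

```python
import random

def box(name, *inputs, **params):
    """One box of the wiring diagram: a name, control parameters, input wires."""
    return {"name": name, "params": params, "inputs": list(inputs)}

def moving_average(data, width):
    """Running mean over the last `width` samples (fewer at the start)."""
    out = []
    total = 0.0
    for i, x in enumerate(data):
        total += x
        if i >= width:
            total -= data[i - width]
        out.append(total / min(i + 1, width))
    return out

def evaluate(node, samples):
    """Run the diagram rooted at `node` on a list of floats."""
    name, params = node["name"], node["params"]
    if name == "source":
        return list(samples)
    data = evaluate(node["inputs"][0], samples)
    if name == "rectify":
        return [abs(x) for x in data]
    if name == "smooth":
        return moving_average(data, params["width"])
    if name == "threshold":
        return [1.0 if x > params["level"] else 0.0 for x in data]
    raise ValueError("unknown box: " + name)

def random_chain(rng, length):
    """Wire `length` randomly chosen boxes in series, starting at the source."""
    node = box("source")
    for _ in range(length):
        kind = rng.choice(["rectify", "smooth", "threshold"])
        if kind == "smooth":
            node = box(kind, node, width=rng.randint(2, 50))
        elif kind == "threshold":
            node = box(kind, node, level=rng.uniform(0.0, 1.0))
        else:
            node = box(kind, node)
    return node

# A hand-wired diagram: source -> rectify -> smooth(3) -> threshold(0.5).
detector = box("threshold",
               box("smooth", box("rectify", box("source")), width=3),
               level=0.5)
result = evaluate(detector, [0.0, 0.0, -1.0, 1.0, -1.0, 0.0, 0.0])
print(result)
```

Because the diagram is plain data, an automated search can generate, copy, and rewire it without touching the code that executes it, which is the point of storing it as an expression.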

But what is a "good wiring diagram"? Well, the current very
fashionable approach is to develop a curated labelled training set,
and train on that. "Curated" means "organized by humans" (Ooof-dah.
Humans in the loop again!) and "labelled" means each snippet has a
tag: "clapping" - "cheering" - "yelling". (Yuck. What kind of yelling?
Happy? Hostile? Asking for help? Are the labels even correct?) This
might be the way people train neural nets, but really, it's the wrong
approach for AGI. I don't want to do supervised training. (I mean, we
could do supervised training in the opencog framework, but I don't see
any value in that, right now.) So, let's do unsupervised training.

But how? Now for a conceptual leap. This leap is hard to explain in
terms of audio filters (it's rather abstract) so I want to switch to
vision, before getting back to audio. For vision, I claim there
exists something called a "shape grammar". I hinted at this in the
last email. A human face has a shape to it - a pair of eyes,
symmetrically arranged above a mouth, in good proportion, etc. This
shape has a "grammar" that looks like this:

left-eye: (connects-to-right-to-right-eye) and
    (connects-below-to-mouth) and (connects-above-to-forehead);
forehead: (connects-below-to-left-eye) and
    (connects-below-to-right-eye) and (connects-above-to-any-background);

Now, if you have some filter collection that is able to detect eyes,
mouths and foreheads, you can verify whether you have detected an
actual face by checking against the above grammar. If all of the
connectors are satisfied, then you have a "grammatically correct
description of a face". So, although your filter collection was
plucking eye-like and mouth-like features out of an image, the fact
that they could be arranged into a grammatically-correct arrangement
raises your confidence that you are seeing a face.
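
The check described above can be sketched in a few lines of Python. This is a simplification, not how Link Grammar itself matches connectors: each connector is treated as a required relation to a neighbor of a given type. The dictionary layout, the function name `unsatisfied`, the feature ids, and the use of a plain `background` type for "any-background" are assumptions for illustration. Only the two dictionary entries given above are encoded.

```python
# Each feature type lists the connectors it requires: (direction, target type).
GRAMMAR = {
    "left-eye": {("right", "right-eye"), ("below", "mouth"), ("above", "forehead")},
    "forehead": {("below", "left-eye"), ("below", "right-eye"), ("above", "background")},
}

def unsatisfied(features, relations, grammar=GRAMMAR):
    """features: dict mapping a detection id to its feature type.
    relations: set of (id_a, direction, id_b) spatial relations between detections.
    Returns the connectors (id, direction, target_type) that nothing satisfies."""
    missing = []
    for fid, ftype in features.items():
        for direction, target in sorted(grammar.get(ftype, ())):
            matched = any(
                a == fid and d == direction and features.get(b) == target
                for a, d, b in relations
            )
            if not matched:
                missing.append((fid, direction, target))
    return missing

# Hypothetical detections plucked out of one image by the filters.
features = {"e1": "left-eye", "e2": "right-eye", "m": "mouth",
            "f": "forehead", "bg": "background"}
relations = {
    ("e1", "right", "e2"), ("e1", "below", "m"), ("e1", "above", "f"),
    ("f", "below", "e1"), ("f", "below", "e2"), ("f", "above", "bg"),
}
print(unsatisfied(features, relations))          # empty list: all connectors satisfied
no_mouth = relations - {("e1", "below", "m")}
print(unsatisfied(features, no_mouth))           # the left eye's mouth connector dangles
```

An empty result corresponds to a "grammatically correct description of a face"; any dangling connector lowers the confidence that the detections form one.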

Those people familiar with Link Grammar will recognize the above as a
peculiar variant of a Link Grammar dictionary (and thus I am cc'ing
the link-grammar mailing list).

But where did the grammar come from? For that matter, where did the
eye and mouth filters come from? It certainly would be a mistake to
have an army of grad students writing shape grammars by hand. The
grammar has to be learned automatically, in an unsupervised fashion.
... and that is what the opencog/learn project is all about.

At this point, things become very highly abstract very quickly, and I
will cut this email short. Very roughly, though: one looks for
pair-wise correlations in data. Having found good pairs, one then
draws maximum spanning trees (or maximum planar graphs) with those
pairs, and extracts frequently-occurring vertex-types, and their
associated connectors. That gives you a raw grammar. Generalization
requires clustering specific instances of this into general forms. I'm
working on those algos now.
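
A very rough Python sketch of the first two steps, pair scoring and the maximum spanning tree. It assumes pointwise mutual information as the pair-wise correlation score, which is one possible choice; the feature names and both function names are made up for illustration. Extracting vertex types and connectors, and the clustering step, are not shown.

```python
import math
from collections import Counter
from itertools import combinations

def pair_mi(observations):
    """observations: list of sets of feature names seen together.
    Returns {(a, b): pointwise mutual information in bits} for co-occurring pairs."""
    n = len(observations)
    single = Counter()
    pair = Counter()
    for obs in observations:
        single.update(obs)
        pair.update(combinations(sorted(obs), 2))
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }

def maximum_spanning_tree(nodes, weights):
    """Kruskal's algorithm taking the heaviest edges first.
    Returns the chosen edges (a spanning forest if the graph is disconnected)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    tree = []
    for (a, b), _weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b))
    return tree

observations = [{"x", "y"}, {"x", "y"}, {"z"}, {"z"}]
mi = pair_mi(observations)
tree = maximum_spanning_tree(
    ["a", "b", "c"],
    {("a", "b"): 3.0, ("b", "c"): 2.0, ("a", "c"): 1.0},
)
print(mi, tree)
```

Each vertex in the resulting tree, together with the edges attached to it, is the raw material for one dictionary entry of the kind shown in the face example.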

The above can learn (should be able to learn) both a "shape grammar"
and also a "filter grammar" ("meaningful" combinations of processing
filters. Meaningful, in that they extract correlations in the data.)

So that is the general idea. Now, to get back to your question: what
sort of video (or audio) library? What sort of dataset? I dunno.
Beats me. Best to start small: find some incredibly simple problem,
and prove that the general idea works on that. Scale up from there.
You get to pick that problem, according to taste.

One idea was to build a "French flag detector": this should be "easy":
it's just three color bars, side by side. The grammar is very
simple. The training set might be a bunch of French flags. Now, if
the goal is to ONLY learn the shape grammar, then you have to hack up,
by hand, some ad hoc color and hue and contrast filters. If you want to
learn the filter grammar, then .. well, that's a lot harder for
vision, because almost all images are extremely information-rich. The
training corpus would have to be selected to be very simple: only
those flags in canonical position (not draped). Then, either one has
extremely simple backgrounds, or one has a very large corpus, as
otherwise, you risk training on something in the background, instead
of the flags.

For automated filter-grammars, perhaps audio is simpler? Because most
audio samples are not as information-rich as video/photos?

I dunno. This is where it becomes hard. Even before all the fancy
theory and what-not, finding a suitable toy problem that is solvable
without a hopeless amount of CPU processing and practical stumbling
blocks .. that's hard. Even worse is that state-of-the-art neural-net
systems have billions of CPU-hours behind them, computed with
well-written, well-debugged, highly optimized software, created by
armies of salaried PhD's working at the big tech companies. Any
results we get will look pathetic, compared to what those systems can

The reason I find it promising is this: All those neural net systems
do is supervised training. They don't actually "think", they don't
need to. They don't need to find relationships out of thin air. So I
think this is something brand new that we're doing that no one else
does. Another key difference is that we are working explicitly at the
symbolic level. By having a grammar, we have an explicit part-whole
relationship. This is something the neural-net guys cannot do. (Hinton,
I believe, has a paper on how one day in the distant future, neural
nets might be able to solve the part-whole relationship problem.) By
contrast, we've already solved it, more or less from day one.

We've also "solved" the "symbol grounding problem" -- from day one.
This is another problem that AI researchers have been wringing their
hands about, from the 1960s onwards. Our symbols are grounded, from
the start: our symbols are the filter sets, the grammatical dictionary
entries, and we "know what they mean" because they work with explicit

Another very old AI problem is the "frame problem", and I think that
we've got that one licked, too, although this is a far more tenuous
claim. The "frame problem" is one of selecting only those things that
are relevant to a particular reasoning problem, and ignoring all of
the rest. Well, hey: this is exactly what grammars do: they tell you
exactly what is relevant, and they ignore the rest. The grammars have
learned to ignore the background features that don't affect the
current situation. But whatever... This gets abstract and can lead to
an endless spill of words. I am much more interested in creating
software that actually works.

So .. that's it. What are the next steps? How can we do this?

-- Linas

Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.