Tutorial Notes for OpenCog NLP Tutorial on Tuesday 5 May CST

27 views

Skip to first unread message

Linas Vepstas

unread,

May 5, 2009, 2:13:41 PM5/5/09

to opencog-d...@googlegroups.com, opencog, OpenCog Summer of Code Announcements & Discussion List

Appended below are tutorial notes. Will attempt to get through
these, if possible.

--linas

> Announce: OpenCog NLP Tutorial on Tuesday 5 May CST
>
> Hi, Sorry for the very late announce; I forgot that this wasn't
> widely announced.
>
> I'm planning on giving an OpenCog natural language processing
> tutorial and/or question/answer session. I'll try to quickly cover
> everything from the link-grammar parser, what relex is and does,
> to how NLP data is represented and manipulated within OpenCog.
> I'll provide notes just before the tutorial.
>
> location: #opencog IRC channel on freenode.net
> date: 5May 2009 4:15 - 5:45PM CST
> That's 5:15-6:45 EST,
> 9:15-10:45 PM UTC (I hope that's not to late for Europe...)
> 9:15-10:45 AM next day in New Zealand
> Middle of the night for India :-(
>
> I have a hard-stop at the end but am on IRC a lot and can
> answer questions anytime.
>
> For your entertainment, there is now a toy chatbot that illustrates
> the full NLP processing pipeline (as it stands today) that can answer
> simple questions with one-word answers. You can play with it by
> logging onto the #opencog channel on freenode IRC. Address
> the bot by prefacing your statements with cog: or cogita: or
> cogita-bot: it will reply after a few seconds. (Alternately, you can
> talk to it in private, in which case the cog: in front is not required.)
>
> Please note: the bot is as dumb as a rock and it does **no reasoning
> whatsoever**. Everything it knows/remembers is based purely on
> linguistic pattern matching, and nothing more! Its easy to stump;
> keep your statements and questions simple.

OpenCog NLP Tutorial
--------------------
April 2009
Linas Vepstas <linasv...@gmail.com>

OpenCog NLP tutorial, provides short overview of important NLP components.

Outline
=======

-- Link Grammar
-- RelEx
-- OpenCog representation
-- "La Cogita" chatbot
-- Pattern matching
-- Semantic normalization (triples)
-- Common Sense Reasoning

Link Grammar is a parser
========================
http://www.abisource.com/projects/link-grammar/

linkparser> John threw the ball
Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (CORP=6.0467 UNUSED=0 DIS=0 AND=0 LEN=5)

+-----Os----+
+--Ss--+ +--Ds-+
| | | |
John.m threw.v the ball.n

-- Links are Ss,Os,Ds
-- Ss == Subject, singular
-- Os == Object, singular
-- Ds == Determiner, singular

-- There are about 100 link types, and many more subtypes.
-- Links are bidirectional; no head-word.
-- Words have "disjuncts" which fit like puzzle pieces:
e.g. threw: S- & O+ means that verb "threw" must have
subject on left, object on right.
-- Parser arranges puzzle pieces until all fit together.

RelEx is a dependency relation extractor
========================================
http://opencog.org/wiki/RelEx

-- Uses link-grammar input to obtain relations:
_subj (<<throw>>, <<John>>)
_obj (<<throw>>, <<ball>>)

-- First word of the relation is the "head word",
the second word is the "dependent" word:
e.g. "John threw the red ball"

_amod (<<ball>>, <<red>>)

-- Also performs feature tagging:
Features are part-of-speech, tense, noun-number, etc.

pos (ball, noun)
noun_number (ball, singular)
DEFINITE-FLAG (ball, T)
pos (throw, verb)
tense (throw, past)
person-FLAG (John, T)

-- RelEx "extracts" information from the parse-graph by means of
"pattern matching" on subgraphs:

e.g. "if link-type is Ss then word on left is singular"
e.g. "if link-type is Os then word on right is singular"
e.g. "if link-type is Ss then word on right is verb"
e.g. "if link-type is Os then word on right is noun"

A sequence of "rules" are applied:

"if (predicate) then implication"
where "predicate" is a graph pattern to be matched,
and "implication" is a set of nodes/edges to be added, deleted.

Result of applying pattern-match rules results in a transformation
of the graph.

-- Pattern matching (and all of RelEx) implemented in Java.
-- Pattern matching fairly closely tied to linguistics
-- Pattern matching is on graphs, not hypergraphs.

-- RelEx has other assorted other functions too ...
(framenet, pronoun resolution, entity identification, ...)

Parsed sentences as OpenCog hypergraphs
=======================================
-- Can be output directly from RelEx
-- Can be quickly generated from a "compact parse format":
Allows parsed texts to be saved, input to opencog later.

-- word instances are a special case of a word:
(ReferenceLink (stv 1.0 1.0)
(WordInstanceNode "John@df4398c5-7f03-45c9-bb30-85f715ba83c0")
(WordNode "John")
)

-- word instances belong to a parse:
(WordInstanceLink (stv 1.0 1.0)
(WordInstanceNode "John@df4398c5-7f03-45c9-bb30-85f715ba83c0")
(ParseNode "sentence@235033cb-a934-4a57-8b0f-0307705ed931_parse_0")
)

-- parses belong to a sentence; sentences belong to a document, etc.
-- Link Grammar links:

(EvaluationLink (stv 1.0 1.0)
(LinkGrammarRelationshipNode "Os")
(ListLink
(WordInstanceNode "threw@e69139f2-6322-4836-9d8c-73ce8d1cf881")
(WordInstanceNode "ball@f6aa0e0a-fc4b-40f6-b5b9-2b441393bda5")
)
)

-- Relex Relations:
; _obj (<<throw>>, <<ball>>)
(EvaluationLink (stv 1.0 1.0)
(DefinedLinguisticRelationshipNode "_obj")
(ListLink
(WordInstanceNode "threw@e69139f2-6322-4836-9d8c-73ce8d1cf881")
(WordInstanceNode "ball@f6aa0e0a-fc4b-40f6-b5b9-2b441393bda5")
)
)

-- word features:
; tense (throw, past)
(InheritanceLink (stv 1.0 1.0)
(WordInstanceNode "threw@e69139f2-6322-4836-9d8c-73ce8d1cf881")
(DefinedLinguisticConceptNode "past")
)

-- Clearly very verbose; lots of information about the input sentences.

"La Cogita" Chatbot
===================
bzr: opencog/nlp/chatbot/README

-- A quick-n-dirty hookup of IRC to link-grammar/relex to OpenCog
-- "remembers" what it was told.
-- It was "told" about 5K simple assertions from the MIT ConceptNet
project: e.g. "Baseball is a sport".
-- Can answer simple questions about what it was told, using hypergraph
pattern matching.
-- Single-word replies, since NL generation not hooked up yet.

-- NO REASONING WHATSOEVER. Chatbot is as dumb as a rock!

Pattern matching
================
bzr: opencog/query/README

-- Similar in idea to RelEx pattern matching, but this time its
1) full general, 2) implemented within OpenCog.

-- Given a hypergraph, containing VariableNodes, find a matching
hypegraph which "solves" or "grounds" the variables.

-- Example: "Who threw a ball?"
_subj (<<throw>>, <<_$qVar>>)
_obj (<<throw>>, <<ball>>)

is easily grounded by:
_subj (<<throw>>, <<John>>)
_obj (<<throw>>, <<ball>>)

Answer to question: John.

-- Example: "What did John throw?"

_subj (<<throw>>, <<John>>)
_obj (<<throw>>, <<_$qVar>>)

Answer to question: ball

-- Pattern matcher is "completely general", works for any hypergraph,
not just NLP.

-- Works vaguely like push-down automaton, maintains stack of partial
matches/groundings.

-- Final accept/reject of a potential match is determined by user callback,
and is thus configurable.

-- Solutions/groundings are reported via callback, too, so search can be
run to exhaustion, or terminated early.

-- Can test for "optional" clauses, and/or absence of clauses (to reject
matches that also contain certain subgraphs).

Semantic normalization aka Semantic Triples
===========================================
-- "triples" are very fashionable:
"Semantic Web", OWL, RDF, N3, SPARQL, ISO Topic Maps,
Semantic Nets, Upper Ontology, etc.

-- Although a triple can be "any" list of three items
e.g. (_obj, throw, ball)

A "semantic triple" captures a "semantic" relation:

e.g. capital_of(Spain, Madrid)

-- Often prepositional in nature ("kind_of", "inside_of", "next_to"...)
but can copular: "is-a", "has-a"

-- Can provide (partial) solution to normalization problem:
e.g. "The capital of Spain is Madrid"

_subj (<<be>>, <<capital>>)
_obj (<<be>>, <<Madrid>>)
of (<<capital>>, <<Spain>>)

FAILS to pattern match the question: "What is the capital of Spain?":

_subj (<<be>>, <<_$qVar>>)
_obj (<<be>>, <<capital>>)
of (<<capital>>, <<Spain>>)
COPULA-QUESTION-FLAG (capital, T)
QUERY-TYPE (_$qVar, what)

because subject, object are reversed.
The semantic triple provides a (partial) solution for this.

-- Implemented by means of pattern matching: e.g.

; Sentence: "The capital of Germany is Berlin"
; var0=capital, var1=Berlin var2=Germany
# IF %ListLink("# APPLY TRIPLE RULES", $sent)
^ %WordInstanceLink($var0,$sent) ; $var0 and $var1 must be
^ %WordInstanceLink($var1,$sent) ; in the same sentence
^ _subj(be,$var0) ^ _obj(be,$var1) ; reversed subj, obj
^ $prep($var0,$var2) ; preposition
^ %LemmaLink($var0,$word0) ; word of word instance
^ $phrase($word0, $prep) ; convert to phrase
THEN ^3_$phrase($var2, $var1)

Above is actual rule: it is converted to an OpenCog
ImplicationLink (quite verbose!) and run through pattern matcher.

-- Noteworthy: Makes use of "processing anchors" i.e.
the sentence $sent MUST be connected, via ListLink to

AnchorNode "# APPLY TRIPLE RULES"

i.e. this rule does *not* apply to all sentences ever read, but only
those sentences that are attached to this AnchorNode.
After processing, the anchor is released.

Common-sense Reasoning
======================
-- Not implemented, under construction, wide open for experimentation!

-- Starting point: Read in many sentences, e.g. from MIT ConceptNet,
which has 800K ++ "common-sense" assertions: "Ice cream is made from milk"
Parse them. Extract semantic triples. Remember them. Answer questions.

-- Use reasoning to answer: "Aristotle is a man. All men are mortal.
Is Aristotle mortal?"

-- Use reasoning for sense-disambiguation: "I heard a bark in the night":
Can one hear "tree bark"? No. So "bark" is probably not "tree bark".

-- Concept formation, refinement: Today: "What is an instrument?"
Answer: "cymbal ukulele scale drum chronometer saxophone"
Oh you meant "musical instrument", not "scientific instrument".

Musical instruments make a sounds, scientific instruments usually
do not: "I heard a chronometer in the night".

Important things this tutorial did not cover:
=============================================
-- Framenet-like markup in RelEx
-- Pronoun resolution in RelEx
-- Word-sense disambiguation in OpenCog.