Training data

Andrew Buck

Mar 30, 2017, 6:14:26 PM
to opencog
I would like to help contribute to the OpenCog project and have tried to do so numerous times in the past on various pieces.  However, in every case I got bogged down by the enormous complexity of both the codebase itself and the overall design and how all the pieces fit together.  I am sure I am not the only person who wants to help out but lacks the time or ability to really dig into the core pieces of the project.

I do, however, have the time to help create and curate training data and corpora for the various efforts being undertaken by this project, and I would like to start a discussion around this domain.  Basically: what kind of data is needed, what do we have so far, and how could what we have be extended?  One idea I had (and this is just a "brainstorming" kind of idea which may not be useful) would be to go through something like the link-parser word lists and put the words into categories.  For example, the "colors" category would contain all the words pertaining to color (red, green, etc.).  This category information could then be used to provide context for OpenCog when it is parsing sentences.  Obviously this is just a simple example, but it illustrates the kind of "grunt work" that people like me could easily take on, so that people who actually can code can focus their time on that.  Whenever they want to test something, they would have a nice library of clean, easily parsed data available to see how their theories work.
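As a rough illustration of the category idea, here is a minimal Python sketch that turns a hand-made category table into Atomese text.  The table contents, the function names, and the choice of InheritanceLink between ConceptNodes are all assumptions made for illustration; nothing here reads the actual link-parser dictionaries, and the developers may well prefer a different representation.

```python
# Hypothetical category table; the words and category names are illustrative only
# and are NOT taken from the link-parser word lists.
CATEGORIES = {
    "color": ["red", "green", "blue"],
    "animal": ["cat", "dog"],
}


def category_to_atomese(category, words):
    """Return one InheritanceLink per word, as Atomese (Scheme) text.

    Assumption: each word and each category is modelled as a ConceptNode.
    """
    return [
        '(InheritanceLink (ConceptNode "{}") (ConceptNode "{}"))'.format(word, category)
        for word in words
    ]


def build_all(categories):
    """Flatten the whole table into a list of Atomese lines, sorted by category."""
    lines = []
    for category in sorted(categories):
        lines.extend(category_to_atomese(category, categories[category]))
    return lines


if __name__ == "__main__":
    print("\n".join(build_all(CATEGORIES)))
```

Volunteers would then only need to maintain the table; the conversion step stays mechanical.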

Another possibility would be a large dataset consisting of simple sentences like "John threw the ball." paired with a small bit of Atomese representing the pieces of information that can be learned from each sentence.  In this example you could learn a couple of things: the obvious one is the action of the ball being thrown and who threw it; another is that John no longer has the ball; another is that John is likely human, since few other entities can throw a ball; and so on.  Basically you would have a little block of text, one or a few sentences, and then a bunch of Atomese to go along with it.  With a large library of such things, you could then use something like PLN or MOSES-type learning to try to map something like RelEx output into the Atomese in the training data.  Again, this is just a suggestion and may not be that useful, but it illustrates the kinds of things volunteers like me could work on.
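To make the sentence-plus-Atomese idea concrete, here is one possible record format, sketched in Python.  The TrainingExample class, the helper names, the JSON-lines layout, and the specific atoms chosen for the sentence are all hypothetical; they only show the shape such a dataset entry might take, not an agreed OpenCog format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TrainingExample:
    """One dataset entry: a short text plus the Atomese a human annotator wrote for it."""
    text: str
    atomese: list = field(default_factory=list)


def evaluation(predicate, *args):
    """Build an EvaluationLink as Atomese text (assumes all arguments are ConceptNodes)."""
    arg_text = " ".join('(ConceptNode "{}")'.format(a) for a in args)
    return '(EvaluationLink (PredicateNode "{}") (ListLink {}))'.format(predicate, arg_text)


# Illustrative annotation of the example sentence; the choice of atoms is an assumption.
example = TrainingExample(
    text="John threw the ball.",
    atomese=[
        evaluation("throw", "John", "ball"),
        '(InheritanceLink (ConceptNode "John") (ConceptNode "human"))',
    ],
)


def to_json_line(ex):
    """Serialize one example as a single JSON line, so a corpus is one entry per line."""
    return json.dumps(asdict(ex), sort_keys=True)


if __name__ == "__main__":
    print(to_json_line(example))
```

Keeping the text and its annotations in one record would make it easy to check entries by eye and to feed pairs to whatever learner is being tested.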

For any of these projects we would need some guidance and examples to get us started, but once the general format of what you would like has been worked out, we should be able to largely carry it forward on our own.  I think there are probably a lot of people in this community who would like to help out on these kinds of efforts; we just need to know where to start.

-AndrewBuck

Ben Goertzel

Mar 30, 2017, 10:38:23 PM
to opencog
Hi Andrew,

thanks for the email... I'll think about this a bit and discuss with
some others here in our Hong Kong lab...

One thought that occurs to me, though, is that creation of "training
data" is sorta beside the point for an AGI approach that is supposed
to be based on unsupervised and reinforcement learning.... I.e.
making training data is not really key to our enterprise as we're now
conceiving it...

We are doing some supervised learning on a parallel (English, Lojban)
corpus aimed at learning new semantic mapping rules. Depending on how
this pans out in our experiments over the next few months, we might
need to expand our English/Lojban parallel corpus.... OTOH that would
require knowledge of Lojban which is time-consuming to gain...

But the general question of how we can leverage volunteers who don't
want to deal with the complexity of OpenCog dev is worth more
thinking...

One sort of work that occurs to me, which could be done by folks
knowing a bit of programming but not super-hard-core, would be writing
scripts that transform various kinds of structured or semi-structured
data into Atomese... i.e. thus building up a library of Atomspaces
that could be used for various purposes.... To support this someone
would need to write a guide on proper use of the common OpenCog Atom
types, but that wouldn't be such a herculean task...
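
A minimal sketch of the kind of conversion script meant here, in Python. The
three-column CSV layout (subject, predicate, object), the sample row, and the
function names are assumptions for illustration only; which Atom types are
appropriate for a given data source is exactly what the guide mentioned above
would need to settle.

```python
import csv
import io


def scheme_string(value):
    """Quote a value as a Scheme string literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def row_to_atomese(subject, predicate, obj):
    """Render one (subject, predicate, object) triple as an EvaluationLink.

    Assumption: subject and object are ConceptNodes, predicate is a PredicateNode.
    """
    return "(EvaluationLink (PredicateNode {}) (ListLink (ConceptNode {}) (ConceptNode {})))".format(
        scheme_string(predicate), scheme_string(subject), scheme_string(obj)
    )


def csv_to_atomese(csv_text):
    """Convert CSV text with subject,predicate,object columns into Atomese lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_atomese(r["subject"], r["predicate"], r["object"]) for r in reader]


# Hypothetical sample input; real sources would have their own columns.
SAMPLE = "subject,predicate,object\nParis,capital_of,France\n"

if __name__ == "__main__":
    print("\n".join(csv_to_atomese(SAMPLE)))
```

The output is plain Scheme text, so it could be saved to a file and loaded into
an Atomspace separately from the conversion step.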

More later!
ben



--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin