Google makes Concept Corpora tagged by Wikipedia Articles

William Taysom

May 19, 2012, 12:19:08 AM

to general-in...@googlegroups.com

You guys may be interested in this <http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html>.

== quote ==

We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in [information retrieval and natural language processing].

Matt Mahoney

May 19, 2012, 8:55:55 PM

to general-in...@googlegroups.com, OpenCog

I'm not sure what to do with this. Its main purpose is context-free
word sense disambiguation. But disambiguation is not really a useful
intermediate goal in natural language processing unless it's for
something like a search engine. For most NLP tasks the real problem is
modeling or computing probabilities over long strings. Disambiguation
naturally falls out of the summation of constraints by the set of
nearby words. They reinforce only for the correct sense.
Disambiguation is not a problem you have to solve separately, although
people have thought so because they were attempting to solve the NLP
problem backward by parsing before semantic analysis. We know that
approach only works for formal languages, not for natural languages.
Children learn semantics before grammar. You can't parse a sentence
unless you know what it means.
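
A minimal sketch of how disambiguation can fall out of summing the constraints from nearby words. The sense labels and association weights below are made up for illustration; a real model would learn them from text rather than hard-code them.

```python
# Hypothetical association weights between each sense of "bank" and
# context words. The numbers are invented for illustration only.
ASSOCIATIONS = {
    "bank/finance": {"money": 2.0, "loan": 1.5, "deposit": 1.8},
    "bank/river": {"water": 2.0, "fishing": 1.5, "shore": 1.2},
}


def disambiguate(context_words, associations=ASSOCIATIONS):
    """Score each sense by summing its weights over the context words.

    Words that support a sense reinforce each other in the sum; words
    unrelated to a sense contribute nothing to it.
    """
    scores = {
        sense: sum(weights.get(word, 0.0) for word in context_words)
        for sense, weights in associations.items()
    }
    best_sense = max(scores, key=scores.get)
    return best_sense, scores


best, scores = disambiguate("he sat on the bank fishing by the water".split())
print(best, scores)
```

No separate disambiguation step is run here: the choice of sense is just a by-product of scoring the whole context.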

The database is not a complete language model. It doesn't alleviate
the need to have a language learning algorithm. To train the model you
still need lots of plain text, such as a Wikipedia dump or transcripts
of likely conversations in the problem domain.

The data is valuable for research because the only way to collect it
is to crawl the internet. It would probably improve an existing
language model. The problem for an application is that it is static
and would get out of date.

--
-- Matt Mahoney, mattma...@gmail.com

Linas Vepstas

May 20, 2012, 7:16:58 PM

to general-in...@googlegroups.com, OpenCog

Nice reply, Matt,

I'll add only a few minor remarks...

On 19 May 2012 19:55, Matt Mahoney <mattma...@gmail.com> wrote:

> On Sat, May 19, 2012 at 12:19 AM, William Taysom <wta...@gmail.com> wrote:
>> You guys may be interested in this <http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html>.
>>
>> == quote ==
>>
>> We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in [information retrieval and natural language processing].
>
> I'm not sure what to do with this.

Well, but it's still fun; I once created a similar collection for Ben (and specifically, for an English-language teaching website), and it was interesting to see how words are connected to other words. One or several of the OpenCog blog entries dealt with the network statistics on such a graph; Joel then created a visualization of some 20K or 100K or whatever of the most-connected word pairs, and Ben then demoed this at some conference. I think this was 2009.

FWIW, my data was not "context free", but rather the result of ranking word pairs within the context of the sentences that they occur in. That is, a pair of words was associated only if there was not some other, better association in the sentence. Also, word pairs need not be adjacent; there could have been intervening words that factored out in other ways.

The point is that, when you squint, this really does offer a first approximation to an association between words and concepts. The Google post doesn't really emphasize this, but when you crawl the graph, it does become clear: the word pair "Northern Ireland" really does almost surely refer to the political/geographical entity that you think it does, with a very, very high probability.
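
As a rough sketch of what "with a very, very high probability" means here: given counts of how often a text string points at each concept, the probability of a concept given the string is just a normalized count. The counts and concept names below are invented for illustration and are not taken from the released data.

```python
from collections import Counter

# Made-up counts of how often a text string was used to refer to each
# concept. Concept names and numbers are hypothetical.
LINK_COUNTS = {
    "Northern Ireland": Counter(
        {
            "Northern_Ireland": 980,
            "Northern_Ireland_national_football_team": 20,
        }
    ),
}


def concept_probabilities(string, link_counts=LINK_COUNTS):
    """Estimate P(concept | string) by normalizing the raw counts."""
    counts = link_counts.get(string, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # unseen string: no estimate available
    return {concept: n / total for concept, n in counts.items()}


probs = concept_probabilities("Northern Ireland")
print(probs)
```

This is context free in the sense Matt describes: the estimate ignores the rest of the sentence entirely.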

The key word here is "probability". So, yes, as you point out, "you can't parse a sentence until you know what it means", but the reverse is also true. If we work in the framework of Bayesian nets (or any similar network framework), then there's a back-n-forth: "Northern Ireland" provides a Bayesian prior: a very good place to start the parse is to assume "Northern Ireland" is a compound noun, and go from there. In the end, the parse may not allow this, but it's a good place to start.
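
A small sketch of that back-n-forth, using Bayes' rule for a single yes/no hypothesis ("this is a compound noun"). The prior and the likelihoods are made-up numbers, and the `posterior` helper is hypothetical; a real parser would get these quantities from its own model.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a binary hypothesis.

    prior: P(hypothesis) before looking at the rest of the sentence.
    likelihood_if_true: P(rest of parse | hypothesis is true).
    likelihood_if_false: P(rest of parse | hypothesis is false).
    """
    evidence = prior * likelihood_if_true + (1.0 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence


# Start the parse by assuming "Northern Ireland" is a compound noun.
prior_compound = 0.95

# Case 1: the rest of the sentence parses well under that assumption.
supported = posterior(prior_compound, 0.9, 0.2)

# Case 2: the rest of the sentence all but rules the assumption out.
contradicted = posterior(prior_compound, 0.01, 0.9)

print(supported, contradicted)
```

In the first case the belief is strengthened; in the second, strong enough evidence from the parse overrides even a high prior, which is the sense in which the prior is only a place to start.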

To re-iterate: such a graph is a first approximation to the connection between words and concepts. People have also talked about and taken the next step (something that I, too, would very much like to replicate): build the graph of verbs that are likely to connect concepts: so e.g. bicycles can be ridden because people ride bikes. Being ride-able is a property (attribute) of the concept bicycle. By knowing the network of such attributes, one can now parse all that much more accurately: again, as a Bayesian prior: if we believe that a bicycle is ride-able with probability 0.95, then I can infer that the sentence: "I put on my bicycle helmet and rode out there." almost surely means that I rode a bicycle. Whatever.

Anyway, if one thinks of language, and concepts, and understanding as network graphs, then the Google dataset is an interesting approximation. One still needs to have agents that can crawl over the network, add and remove nodes, strengthen and loosen connections, or qualify and intermediate between them, and also perform logical deduction, inference, etc. on them. The hard part is creating such agents, and getting them to work properly.

Ben Goertzel

May 20, 2012, 7:26:28 PM

to ope...@googlegroups.com, general-in...@googlegroups.com, a...@listbox.com

Certainly this is a major resource for information retrieval R&D and
system-building.

It's not particularly exciting for AGI, though. An AGI system could
make use of this resource, but it would be an extra effort-saving
resource for the AGI, rather than a must-have...

-- Ben G

--
Ben Goertzel, PhD
http://goertzel.org

"My humanity is a constant self-overcoming" -- Friedrich Nietzsche
