Regarding the Twitter interface

Enki A. Waterhaus II

Jul 26, 2011, 5:02:24 PM

to NELL: Never-Ending Language Learner

Judging by the types of misspellings I see from the assumption-ranking
mechanism, I get the impression that NELL learns from Twitter in
addition to posting some of its assumptions there. Is this accurate?
Alternatively, what kinds of sources does it learn from aside from the
2009 web scrape mentioned in the methodology section of the web page?

Bryan Kisiel

Jul 27, 2011, 6:14:20 PM

to NELL: Never-Ending Language Learner

Hi Enki,

NELL is not specifically hooked up to read Twitter feeds. However, the SEAL
sub-learner uses a process of issuing queries to search engines,
retrieving pages from links that look interesting, and trying to extract
facts from the content of those pages. So it's possible for NELL to read
things from Twitter if SEAL downloads somebody's Twitter page and finds
something good to extract. But it turns out that only a handful of the
things that NELL believes are supported by something it read from
Twitter.
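
As a rough illustration of the query, retrieve, extract loop described above, here is a minimal sketch. It is not SEAL's code: the function names, the fake search index, the fake pages, and the "line ends in River" extractor are all assumptions made up for the example, and nothing here touches the network.

```python
# Hypothetical stand-ins for a search engine and the web (not real data).
FAKE_INDEX = {
    "rivers in pennsylvania": [
        "http://example.org/rivers",
        "http://example.org/ads",
    ],
}
FAKE_PAGES = {
    "http://example.org/rivers": "Allegheny River\nOhio River\nsome other text",
    "http://example.org/ads": "Buy now River",
}


def search(query):
    """Return candidate links for a query (assumed interface)."""
    return FAKE_INDEX.get(query.lower(), [])


def looks_interesting(url):
    """Toy filter: skip links that look like advertising."""
    return "ads" not in url


def fetch(url):
    """Return page text; a real learner would download the page here."""
    return FAKE_PAGES.get(url, "")


def extract(page):
    """Toy extractor: keep lines that end in the word 'River'."""
    return [line for line in page.splitlines() if line.endswith(" River")]


def read_from_search(query):
    """Run query -> links -> pages -> candidate facts, keeping the source URL."""
    facts = []
    for url in search(query):
        if not looks_interesting(url):
            continue
        for fact in extract(fetch(url)):
            facts.append((fact, url))
    return facts


candidates = read_from_search("rivers in Pennsylvania")
```

Because each candidate keeps the URL it came from, a page on any site, Twitter included, can end up as the support for a belief without that site being targeted.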

At the moment, downloading pages from search results via SEAL is the only
significant source that NELL reads other than the 2009 web scrape.

There are other sub-learners that look at what has been read from these
two sources and look for patterns, or chains of inference, and things like
that. The CMC sub-learner looks for orthographical regularities, and has,
for instance, noticed that names of rivers frequently end in the word
"river", "creek", or "brook" (see the first entry of
http://rtw.ml.cmu.edu/rtw/kbbrowser/predmeta:river). It has also noticed
that person names tend to begin with capital letters. It's not always so
clever, though, and we've noticed that it often latches on to strange
spellings, so that may be responsible for some of what you've noticed.
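
To make "orthographical regularities" a bit more concrete, here is a small sketch of the kind of surface features involved. It is only an assumption about what such features might look like, not CMC's actual implementation; the feature names and the sample river names are invented for the example.

```python
from collections import Counter


def orthographic_features(phrase):
    """Return simple surface features of a noun phrase (hypothetical feature set)."""
    words = phrase.split()
    features = set()
    if not words:
        return features
    # The final word, lowercased, e.g. "river" for "Ohio River".
    features.add("last_word=" + words[-1].lower())
    # Whether every word starts with a capital letter.
    if all(word[0].isupper() for word in words):
        features.add("all_words_capitalized")
    return features


def feature_frequencies(phrases):
    """Fraction of phrases in one category that exhibit each feature."""
    counts = Counter()
    for phrase in phrases:
        counts.update(orthographic_features(phrase))
    return {feature: n / len(phrases) for feature, n in counts.items()}


rivers = ["Ohio River", "Mill Creek", "Stony Brook", "Allegheny River"]
frequencies = feature_frequencies(rivers)
# frequencies["last_word=river"] is 0.5 for this sample
```

A learner built on features like these has no notion of correct spelling, so a misspelled ending that shows up often enough in the category looks just as informative as a real one.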

Another thing that I have seen happen is that SEAL will retrieve a page
that is designed to be highly ranked by search engines by containing many
common misspellings of a popular word. Then it can get tricked into
thinking that it's seeing a list of different things that are all in the
same category. That's what happened with all the misspellings of
"pregnancy" for
http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:nondiseasecondition
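
One way such a list could be flagged is by checking whether most of its items are near-duplicates of a single word. The sketch below is a hypothetical heuristic, not something NELL is stated to do; the function name and both thresholds are assumptions, and it uses `difflib.SequenceMatcher` from the Python standard library for string similarity.

```python
from difflib import SequenceMatcher


def looks_like_misspelling_list(items, threshold=0.8, min_fraction=0.6):
    """Guess whether a list is mostly spelling variants of its first item.

    threshold and min_fraction are arbitrary illustrative values.
    """
    if len(items) < 2:
        return False
    seed = items[0].lower()
    similar = sum(
        1
        for item in items[1:]
        if SequenceMatcher(None, seed, item.lower()).ratio() >= threshold
    )
    return similar / (len(items) - 1) >= min_fraction


variants = ["pregnancy", "pregnency", "pregnancey", "pregancy"]
conditions = ["asthma", "insomnia", "obesity", "pregnancy"]
flag_variants = looks_like_misspelling_list(variants)      # True
flag_conditions = looks_like_misspelling_list(conditions)  # False
```

Comparing only against the first item keeps the sketch short; a more careful check would compare all pairs.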

bki...@cs.cmu.edu

John Ohno

Jul 28, 2011, 8:19:32 AM

to cmu...@googlegroups.com

Have you looked into using some of the other sources of 'good' data?
For instance, the Wikipedia full text or the Project Gutenberg texts?
The two combined are smaller than Project Lemur's web scrapes and
arguably have a greater signal-to-noise ratio than arbitrary web
searches. While I realize that the goal is to read the web (with all
the parsing of dirty data that implies), starting off with sources
that have comparatively few typos (and other non-coding
irregularities) seems like a reasonable choice.

--
John Ohno
http://firstchurchofspacejesus.blogspot.com/

Bryan Kisiel

Jul 28, 2011, 3:58:03 PM

to cmu...@googlegroups.com

Hi John,

I guess the real reason why we never gave much thought to pointing SEAL at
a higher-quality corpus is that it would take time and effort to get that
done. SEAL isn't written to operate off of a corpus sitting on disk, and
we'd also have to index that corpus for searching, and then we might
discover that we don't have enough computer power to run those searches as
fast as we'd like. NELL's biggest limitation is really that we don't have
enough person-hours to keep up with all the good ideas. But it would be
valuable to do as you suggest and give SEAL access to local corpora, and
it would also be valuable to give CPL access to the terabytes' worth of
web pages that SEAL has downloaded in the past year and a half. One of
these days (I hope)...

I think that when we look at NELL, we tend to see so many opportunities
for doing things better or for getting more out of the text that it reads
that we are not inclined to worry too much about the mistakes that it
currently makes.

bki...@cs.cmu.edu
