NELL is not specifically set up to read Twitter feeds. However, the SEAL
sub-learner works by issuing queries to search engines, retrieving pages
from links that look interesting, and trying to extract facts from the
content of those pages. So it's possible for NELL to read things from
Twitter if SEAL downloads somebody's Twitter page and finds something good
to extract. As it turns out, though, only a handful of NELL's beliefs are
supported by something it read from Twitter.
At the moment, pages downloaded from search results via SEAL are the only
significant source that NELL reads other than the 2009 web scrape.
There are other sub-learners that look at what has been read from these
two sources and look for patterns, or chains of inference, and things like
that. The CMC sub-learner looks for orthographical regularities, and has,
for instance, noticed that names of rivers frequently end in the word
"river", "creek", or "brook" (see the first entry of
http://rtw.ml.cmu.edu/rtw/kbbrowser/predmeta:river). It has also noticed
that person names tend to begin with capital letters. It's not always so
clever, though, and we've noticed that it often latches on to strange
spellings, so that may be responsible for some of what you've noticed.
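
To make the idea of "orthographical regularities" concrete, here is a toy sketch in Python. It is not CMC's actual code; the function names, the two features, and the 50% threshold are all made up for illustration. It just counts which spelling-based features are shared by most known instances of a category.

```python
from collections import Counter


def orthographic_features(noun_phrase):
    """Return simple spelling-based features for one noun phrase (hypothetical feature set)."""
    words = noun_phrase.split()
    features = set()
    if not words:
        return features
    # Feature 1: the final word, lowercased (e.g. "river", "creek").
    features.add("last_word=" + words[-1].lower())
    # Feature 2: every word starts with a capital letter.
    if all(w[0].isupper() for w in words):
        features.add("all_words_capitalized")
    return features


def common_features(examples, min_fraction=0.5):
    """Return features seen in at least min_fraction of the examples, with their counts."""
    counts = Counter()
    for phrase in examples:
        counts.update(orthographic_features(phrase))
    threshold = min_fraction * len(examples)
    return {feature: n for feature, n in counts.items() if n >= threshold}


rivers = ["Ohio River", "Allegheny River", "Mill Creek", "Snake River"]
print(common_features(rivers))
```

A counter like this has no idea whether a regularity is meaningful, so it would be just as happy to pick up a strange spelling that happens to recur among the examples it is given.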
Another thing that I have seen happen is that SEAL will retrieve a page
that is designed to rank highly in search engines by containing many
common misspellings of a popular word. Then it can get tricked into
thinking that it's seeing a list of different things that are all in the
same category. That's what happened with all the misspellings of
"pregnancy" for
http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:nondiseasecondition
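
Here is a rough Python sketch of how a list-based extractor can fall for a page like that. It assumes the extractor works by taking the text surrounding a known instance as a template and then pulling out everything else on the page that sits in the same context. The function, the four-character context window, and the sample page are invented for illustration and are not SEAL's real code.

```python
import re


def extract_with_wrapper(page, seed, context=4):
    """Toy wrapper-style extraction: reuse the text around `seed` as a template."""
    start = page.find(seed)
    if start < 0:
        return []
    end = start + len(seed)
    left = page[max(0, start - context):start]
    right = page[end:end + context]
    if not left or not right:
        return []
    # Anything appearing between the same left and right context is
    # treated as another member of the seed's category.
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, page)


# A made-up page stuffed with misspellings to attract search traffic.
page = ("<ul><li>pregnancy</li><li>pregnency</li>"
        "<li>pregnancey</li><li>pregnacy</li></ul>")
print(extract_with_wrapper(page, "pregnancy"))
# prints ['pregnancy', 'pregnency', 'pregnancey', 'pregnacy']
```

Every misspelling sits in the same list markup as the one correctly spelled seed, so the extractor has no structural reason to treat them as anything other than more members of the category.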
--
John Ohno
http://firstchurchofspacejesus.blogspot.com/
I guess the real reason why we never gave much thought to pointing SEAL at
a higher-quality corpus is that it would take time and effort to get that
done. SEAL isn't written to operate off of a corpus sitting on disk, and
we'd also have to index that corpus for searching, and then we might
discover that we don't have enough computer power to run those searches as
fast as we'd like. NELL's biggest limitation is really that we don't have
enough person-hours to keep up with all the good ideas. But it would be
valuable to do as you suggest and give SEAL access to local corpora, and
it would also be valuable to give CPL access to the terabytes' worth of
web pages that SEAL has downloaded in the past year and a half. One of
these days (I hope)...
I think that when we look at NELL, we tend to see so many opportunities
for doing things better or for getting more out of the text that it reads
that we are not inclined to worry too much about the mistakes that it
currently makes.