Judging by the types of misspellings I see from the assumption-ranking
mechanism, I get the impression that NELL learns from Twitter in
addition to posting some of its assumptions. Is this accurate?
Alternatively, what kinds of sources does it learn from aside from the
2009 web scrape mentioned in the methodology section of the web page?
NELL is not specifically hooked up to read Twitter feeds. However, the SEAL sub-learner issues queries to search engines, retrieves pages from links that look interesting, and tries to extract facts from the content of those pages. So it's possible for NELL to read things from Twitter if SEAL downloads somebody's Twitter page and finds something good to extract. As it turns out, though, only a handful of the things NELL believes are supported by something it read from Twitter.
At the moment, pages downloaded from search results via SEAL are the only significant source that NELL reads other than the 2009 web scrape.
There are other sub-learners that look at what has been read from these two sources and look for patterns, or chains of inference, and things like that. The CMC sub-learner looks for orthographical regularities, and has, for instance, noticed that names of rivers frequently end in the word "river", "creek", or "brook" (see the first entry of http://rtw.ml.cmu.edu/rtw/kbbrowser/predmeta:river). It has also noticed that person names tend to begin with capital letters. It's not always so clever, though, and we've noticed that it often latches on to strange spellings, so that may be responsible for some of what you've noticed.
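To make the river and capital-letter examples concrete, here is a minimal sketch of the kind of orthographic regularity described above. It is only an illustration and not NELL's actual CMC code: the function names and the sample names are made up, and the real sub-learner presumably uses many more features than these two.

```python
from collections import Counter

def last_word_counts(names):
    """Count how often each final word appears across a list of names."""
    counts = Counter()
    for name in names:
        words = name.lower().split()
        if words:
            counts[words[-1]] += 1
    return counts

def capitalized_fraction(names):
    """Fraction of names in which every word starts with an uppercase letter."""
    if not names:
        return 0.0
    hits = sum(
        all(word[0].isupper() for word in name.split())
        for name in names
        if name.split()
    )
    return hits / len(names)

# Hypothetical examples, not taken from NELL's knowledge base.
rivers = ["Ohio River", "Mill Creek", "Allegheny River", "Stony Brook", "Monongahela River"]
people = ["Ada Lovelace", "alan turing", "Grace Hopper"]

print(last_word_counts(rivers).most_common(3))
print(capitalized_fraction(people))
```

A learner that relies on surface features like these will treat an odd spelling that happens to recur as just another regularity, which is one way strange spellings could get picked up.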
Another thing that I have seen happen is that SEAL will retrieve a page that is designed to rank highly in search engines by containing many common misspellings of a popular word. Then it can get tricked into thinking that it's seeing a list of different things that are all in the same category. That's what happened with all the misspellings of "pregnancy" for http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:nondiseasecondition
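As a rough illustration of how a list of misspellings can fool a list-based extractor, here is a toy sketch. It is not SEAL's actual code: the page, the seed words, and the functions `learn_wrapper` and `extract` are all hypothetical. The idea is simply to find the text that surrounds every known seed on a page and then accept anything else that appears in the same surroundings.

```python
import re

def common_prefix(strings):
    """Longest string that every input starts with."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest

def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]

def learn_wrapper(page, seeds):
    """Return the (left, right) text shared around the first occurrence of each seed."""
    lefts, rights = [], []
    for seed in seeds:
        pos = page.find(seed)
        if pos < 0:
            return None  # a seed is missing, so this page teaches us nothing
        lefts.append(page[:pos])
        rights.append(page[pos + len(seed):])
    return common_suffix(lefts), common_prefix(rights)

def extract(page, wrapper):
    """Extract every string that sits between the learned left and right context."""
    if not wrapper or not wrapper[0] or not wrapper[1]:
        return []  # refuse to extract with an empty context
    left, right = wrapper
    # The lookahead keeps the right context unconsumed so that adjacent
    # list items, whose contexts overlap, are all matched.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return re.findall(pattern, page)

# Hypothetical keyword-stuffed page and hypothetical seeds.
page = ("<ul><li>pregnancy</li><li>pregnency</li><li>pregnacy</li>"
        "<li>pragnancy</li><li>menopause</li></ul>")
seeds = ["pregnancy", "menopause"]

wrapper = learn_wrapper(page, seeds)
print(wrapper)
print(extract(page, wrapper))
```

Because the misspellings share the same markup as the two correctly spelled seeds, the toy extractor returns them alongside the seeds as if they were distinct members of the category.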
Have you looked into using some of the other sources of 'good' data? For instance, the Wikipedia full text or the Project Gutenberg texts? The two combined are smaller than Project Lemur's web scrapes and arguably have a greater signal-to-noise ratio than arbitrary web searches. While I realize that the goal is to read the web (with all the parsing of dirty data that implies), starting off with sources that have comparatively few typos (and other non-coding irregularities) seems like a reasonable choice.
On Wed, Jul 27, 2011 at 6:14 PM, Bryan Kisiel <bkis...@cs.cmu.edu> wrote: > Hi Enki,
> On Tue, 26 Jul 2011, Enki A. Waterhaus II wrote:
I guess the real reason why we never gave much thought to pointing SEAL at a higher-quality corpus is that it would take time and effort to get that done. SEAL isn't written to operate off of a corpus sitting on disk, and we'd also have to index that corpus for searching, and then we might discover that we don't have enough computing power to run those searches as fast as we'd like. NELL's biggest limitation is really that we don't have enough person-hours to keep up with all the good ideas. But it would be valuable to do as you suggest and give SEAL access to local corpora, and it would also be valuable to give CPL access to the terabytes' worth of web pages that SEAL has downloaded in the past year and a half. One of these days (I hope)...
I think that when we look at NELL, we tend to see so many opportunities for doing things better or for getting more out of the text that it reads that we are not inclined to worry too much about the mistakes that it currently makes.
On Thu, 28 Jul 2011, John Ohno wrote: