Is there an easy way to get completed data as shown on KB Browser?

Xiaoxu Gao

Jan 17, 2017, 9:05:37 AM
to NELL: Never-Ending Language Learner
Hey,

I'm Xiaoxu Gao. I just sent the same question to the group but I couldn't find it after sending it. So, here is my question.

I'm a master's student at KTH, Stockholm, working on my thesis on knowledge graphs. I want to get all "concept:company" names and the relationship data of these companies as shown in the KB browser. However, after I downloaded the KB from the website, I couldn't find all of the companies in the file (NELL.08m.995.esv.csv).

So I would like to know: is there an efficient way to get the list of names in a concept, along with the metadata of each entity in that concept? (Is there a good NELL API?)

Thanks,
Xiaoxu

Bryan Kisiel

Jan 17, 2017, 10:13:06 AM
to NELL: Never-Ending Language Learner
Hi Xiaoxu,

A few sources of difference come to mind. Firstly, it seems you found a
sorting bug in the website. The most recent KB dump is from iteration
1035, not 995, but the website is sorting lexicographically and incorrectly
takes 995 to come last in order. The most up-to-date KB dump is
http://rtw.ml.cmu.edu/resources/results/08m/NELL.08m.1035.esv.csv.gz and
should more closely match what is shown on the website.
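
As a rough illustration of the sorting issue described above, here is a
minimal Python sketch. The helper name iteration_number is hypothetical,
and the file names are only examples patterned on the dump names in this
thread; this is not how the website is actually implemented.

```python
import re

def iteration_number(filename):
    """Pull the iteration number out of a name like NELL.08m.1035.esv.csv.gz."""
    match = re.search(r"NELL\.08m\.(\d+)\.", filename)
    if match is None:
        raise ValueError(f"no iteration number found in {filename!r}")
    return int(match.group(1))

# Example names only, patterned on the dumps mentioned in this thread.
dumps = ["NELL.08m.1035.esv.csv.gz", "NELL.08m.995.esv.csv.gz"]

# Plain string sorting compares character by character, so "9..." sorts
# after "1..." and iteration 995 wrongly ends up last.
lexicographic_latest = sorted(dumps)[-1]

# Comparing the iteration numbers as integers picks 1035.
numeric_latest = max(dumps, key=iteration_number)

print(lexicographic_latest)
print(numeric_latest)
```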

Another potential source of difference is that in these dumps an
entity is directly assigned only to the leafmost possible categories.
So, for instance, an entity believed to be recordLabel would not also
explicitly be recorded as a company because all recordLabels are companies
according to the ontology. So if you wanted a list of all companies from
the KB dump file, you would want to select all rows where
"concept:company" is one of the categories listed in the "Categories for
Entity" column -- looking only at those rows where an entity generalizes
directly to "concept:company" would exclude all entities believed to be
something more specific than just a company.
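
The row selection described above could be sketched in Python roughly as
follows. This assumes the dump is tab-separated with a header row, that
the entity name is in a column called "Entity", and that the "Categories
for Entity" column holds whitespace-separated category names; check these
against the actual file before relying on them. The function name and the
sample rows are made up for illustration.

```python
import csv
import io

def entities_in_category(rows, category="concept:company"):
    """Collect entities whose "Categories for Entity" field lists `category`.

    `rows` is any iterable of dicts keyed by column name, for example a
    csv.DictReader over the dump file.
    """
    found = set()
    for row in rows:
        # Assumption: categories are separated by whitespace.
        categories = (row.get("Categories for Entity") or "").split()
        if category in categories:
            found.add(row["Entity"])
    return found

# Made-up sample rows in the assumed layout (not real NELL data).
sample = io.StringIO(
    "Entity\tRelation\tValue\tCategories for Entity\n"
    "concept:recordlabel:example_records\tgeneralizations\t"
    "concept:recordlabel\tconcept:recordlabel concept:company\n"
    "concept:company:example_corp\tgeneralizations\t"
    "concept:company\tconcept:company\n"
    "concept:city:example_city\tgeneralizations\t"
    "concept:city\tconcept:city concept:location\n"
)

reader = csv.DictReader(sample, delimiter="\t")
companies = entities_in_category(reader)
print(sorted(companies))
```

For the real dump, the same function could be fed from
csv.DictReader(gzip.open(path, "rt", encoding="utf-8", newline=""),
delimiter="\t"), again assuming the layout above. Note that the record
label is picked up even though its direct generalization is not
"concept:company".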

(Side note: the website attempts to replicate some of this logic, although
it's highly imperfect due to some implementation details that can't easily
be fixed without rewriting the whole website -- so the website content
won't always match the dump exactly, but if anything the dump should be
more complete.)

The last source of differences that comes to mind is that the website
shows candidate knowledge (in gray) as well as knowledge that NELL
believes to be true (in red). If you are looking at the gray as well as
the red, then you will find a large number of companies that appear to be
missing in the KB dump, since that dump contains only believed knowledge.
Using the "every candidate belief in the KB" link will get you a KB dump
that includes candidate as well as believed knowledge. Here again the
website is sorting incorrectly; the most up to date candidate dump would
be http://rtw.ml.cmu.edu/resources/results/08m/NELL.08m.1030.cesv.csv.gz

If all of that is accounted for and you're still seeing significant
differences between the dump and the website, let me know and we can try
to dig deeper.

As for an API, there is the "Ask NELL" web service, but this will return
results significantly different from what is shown in the KB browser
because it is attempting to perform inference on the fly rather than just
showing what NELL somewhat arbitrarily happened to have learned on its
own. In general, this provides higher coverage, which might be more
useful for your purposes, but grabbing a KB dump is likely to be
dramatically faster and easier.

bki...@cs.cmu.edu