One of NELL's most important sources of negative training data is the
mutual exclusion implied by the hierarchy of predicates. For instance, if
you look at the metadata for chemical
(https://rtw.ml.cmu.edu/rtw/kbbrowser/predmeta:chemical) you'll see a
section "mutexPredicates" indicating that academic fields, automobile
engines, body parts, etc. all should be treated as examples of things that
are not chemicals.
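
As a rough illustration of the idea, here is a minimal Python sketch of
turning a mutual-exclusion list into negative examples. The dictionaries,
category spellings, and instances below are made up for illustration and
are not NELL's actual metadata format or code.

```python
# Hypothetical toy data: category names and instances are illustrative
# only, not NELL's real knowledge base or metadata format.
MUTEX = {
    "chemical": {"academicfield", "automobileengine", "bodypart"},
}

KNOWN_INSTANCES = {
    "chemical": {"benzene", "ethanol"},
    "academicfield": {"linguistics", "economics"},
    "automobileengine": {"v8"},
    "bodypart": {"elbow"},
}


def negative_examples(category, mutex, instances):
    """Collect instances of every category declared mutually exclusive with `category`."""
    negatives = set()
    for other in mutex.get(category, set()):
        negatives |= instances.get(other, set())
    # Do not return anything already believed to be a positive example.
    return negatives - instances.get(category, set())


chemical_negatives = negative_examples("chemical", MUTEX, KNOWN_INSTANCES)
```

Under these assumptions, everything believed to be an academic field,
automobile engine, or body part becomes a negative example for chemical.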
Now of course you've pointed out the sticky philosophical question of
what is and is not a chemical. There's a real tradeoff here: if we tried
to nail down definitions for all of our predicates and make them all 100%
logically sound and correct, we'd spend all of our time on that and none
of it building NELL. But if we don't spend enough time on the
definitions, NELL won't be able to learn well. That's why the descriptions
for many of the predicates are carefully worded -- chemical, for instance,
is defined to include molecular names but not mixtures of different kinds
of molecules. That's a pretty debatable definition, but at least it is
easy for a human to judge, it gets things mostly right, and it
demonstrates that the basic approach is workable for the common cases.
Ultimately, we're going to have to move toward a less black-and-white way
of doing things. For instance, we've defined "fruit" in a very literal
way so that tomatoes, nuts, and beans are all technically fruits, but we
would probably rather say that the technical definition is right
something like 95% of the time, and allow NELL to learn that some
technical fruits are generally regarded, or can reasonably be regarded,
as vegetables.
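
To make the "less black-and-white" idea concrete, here is one possible
way a soft exclusion could be scored, again as a Python sketch. The
function name, the 0.95 weight, and the combination rule are assumptions
made up for illustration; this is not how NELL works today.

```python
# Hypothetical sketch: the weights, scores, and combination rule are
# illustrative assumptions, not NELL's actual scoring.
def soft_mutex_score(evidence, conflicting_belief, mutex_strength):
    """Discount evidence for a category by a soft mutual-exclusion penalty.

    evidence:           support in [0, 1] for the candidate category
    conflicting_belief: belief in [0, 1] that the item is in a conflicting category
    mutex_strength:     how often the exclusion is assumed to hold, in [0, 1]
    """
    return evidence * (1.0 - mutex_strength * conflicting_belief)


# "tomato" is firmly believed to be a (technical) fruit, yet text evidence
# for "vegetable" is strong.
hard = soft_mutex_score(evidence=0.9, conflicting_belief=1.0, mutex_strength=1.0)
soft = soft_mutex_score(evidence=0.9, conflicting_belief=1.0, mutex_strength=0.95)
```

With a hard exclusion the vegetable reading is ruled out entirely, while
the softened rule leaves a small nonzero score that further evidence could
build on.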