crea-nlp: Software for compiling text into graphs, for automatically building knowledge bases


Mark Farrell

Aug 1, 2014, 1:39:15 PM8/1/14
to scalanlp...@googlegroups.com
Hi,
 
I've been working on software that implements a (somewhat) new method for constructing knowledge bases
automatically from text articles. I tokenize documents into sentences and use the Berkeley Parser to parse each sentence.
Then, I pattern match on the resulting trees to extract predicates, objects and subjects, forming simple logical propositions
that I represent as a graph. Specifically, I am trying to help out with some work at the Center for Research and Education on Aging
(CREA) down at Berkeley. Our goal is to build a software system that automates the construction of a knowledge base on aging
by searching for keywords on PubMed, downloading articles, filtering out spam (i.e. non-scientific) sentences in the articles,
compiling the text, storing the results in a graph database, and providing a web viewer for the knowledge base.
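
To give a rough idea of the kind of tree pattern matching involved, here is a simplified, hypothetical sketch (not the actual crea-nlp code) that pulls a (subject, predicate, object) triple out of a toy constituency tree:

// Simplified constituency tree: internal nodes carry a phrase label, leaves a POS tag and a word.
sealed trait Tree
case class Node(label: String, children: List[Tree]) extends Tree
case class Leaf(tag: String, word: String) extends Tree

// The words spanned by a subtree, joined back into a phrase.
def yieldOf(t: Tree): String = t match {
  case Leaf(_, w)        => w
  case Node(_, children) => children.map(yieldOf).mkString(" ")
}

// Handle only the simplest declarative clause: S -> NP VP, VP -> verb NP.
// Anything else returns None, mirroring the all-or-nothing strategy mentioned below.
def extractTriple(t: Tree): Option[(String, String, String)] = t match {
  case Node("S", List(subj @ Node("NP", _),
                      Node("VP", List(verb @ Leaf(tag, _), obj @ Node("NP", _)))))
      if tag.startsWith("VB") =>
    Some((yieldOf(subj), yieldOf(verb), yieldOf(obj)))
  case _ => None
}

// "Caloric restriction extends lifespan."
val sentence = Node("S", List(
  Node("NP", List(Leaf("JJ", "caloric"), Leaf("NN", "restriction"))),
  Node("VP", List(Leaf("VBZ", "extends"),
                  Node("NP", List(Leaf("NN", "lifespan")))))))

println(extractTriple(sentence)) // Some((caloric restriction,extends,lifespan))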
 
The project is hosted on GitHub, which also links to an online knowledge-base demo for you to view:
 
There are some current drawbacks: I don't yet support all of the phrase structures needed to parse many sentences, and
brute-force pattern matching over all possible phrase structures is a little slow. Parenthetical phrases have also been
giving me trouble, since they can appear almost anywhere in a sentence and introduce new clauses.
On the other hand, the software is parallelizable for performance and, in principle, the method seems to work well. I can
extract knowledge from roughly 40% of English sentences (I adopt an all-or-nothing strategy), and the graphs I produce
have high (90%+) modularity, so I can classify groups of nodes quite effectively. That doesn't by itself prove the method is valid or of high quality,
but it's an interesting result that, at least in my own mind, suggests I'm on the right track.
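
By modularity I mean the standard graph-clustering statistic Q for a node partition; a small self-contained sketch of the computation (not code from the project) looks like this:

// Newman modularity Q = sum over communities c of (e_cc - a_c^2), where e_cc is the
// fraction of edges falling inside community c and a_c is the fraction of edge
// endpoints attached to nodes in c. Q near 1 means dense, well-separated groups.
def modularity(edges: Seq[(String, String)], community: Map[String, Int]): Double = {
  val m = edges.size.toDouble
  community.values.toSet.toSeq.map { c =>
    val within = edges.count { case (u, v) => community(u) == c && community(v) == c } / m
    val ends = edges.map { case (u, v) =>
      (if (community(u) == c) 1 else 0) + (if (community(v) == c) 1 else 0)
    }.sum / (2 * m)
    within - ends * ends
  }.sum
}

// Two triangles joined by a single bridge edge give Q of about 0.36; scores above
// 0.9 require many well-separated clusters.
val edges = Seq(("a","b"), ("b","c"), ("a","c"), ("x","y"), ("y","z"), ("x","z"), ("c","x"))
val parts = Map("a" -> 0, "b" -> 0, "c" -> 0, "x" -> 1, "y" -> 1, "z" -> 1)
println(modularity(edges, parts)) // ~0.357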
 
I'm looking for support, contributions and advice from this community, and would benefit from people who work in natural language processing
inspecting my software. Specifically, I'm wondering whether, and how, I should enhance my software with the tools available
in the scalanlp ecosystem. Feel free to contact me if you are interested in this project.
 
 
 

David Hall

Aug 1, 2014, 2:30:47 PM8/1/14
to scalanlp...@googlegroups.com
Hi Mark,

Thanks for writing, and cool demo!

So, the Berkeley parser is not trained on biomedical texts, and having "in-domain" training data is super important for parsing accuracy. I'm about to throw together a parser that's trained on lots of data from a bunch of different domains--including biomedical--and I'll share that with you when it's ready (a few weeks, probably). That might help with your project. I think the Stanford parser's combined English model already does this.

A student in our lab (not on this list) is working on a joint coreference resolution, named entity recognition, and entity linkage (i.e. associating NPs with a canonical entry in e.g. Wikipedia) project, and that might prove useful. When it's ready, he'll release the system.

-- David


Gabriel Schubiner

Aug 6, 2014, 10:18:31 PM8/6/14
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Hi Mark, sounds like a really great project! Like David, I have a labmate who works on joint coref/NER/linking, and I thought I'd chime in with a reference in case it's useful to see how others are tackling this problem. I believe at least some of their code is in Scala, so it may (or may not) be useful.

https://homes.cs.washington.edu/~lzilles/papers/hzwz-emnlp13.pdf

Re: scalanlp integration, probably the most relevant component would be David's parsing frameworks, as he mentioned, but if you were looking to implement any ML/optimization algorithms, breeze is a great numerical library with first- and second-order generic optimization code, and Nak has some ML algorithms built in already, although Nak is not as mature a library as breeze w.r.t. API stability.
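
For instance, a minimal breeze optimization call looks roughly like this (the toy quadratic objective is just for illustration):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Minimize f(x) = ||x - 3||^2 with L-BFGS; calculate returns (value, gradient).
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = x - 3.0
    (diff dot diff, diff * 2.0)
  }
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 5)
val xMin  = lbfgs.minimize(f, DenseVector.zeros[Double](3))
println(xMin) // roughly DenseVector(3.0, 3.0, 3.0)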

Mark Farrell

Aug 16, 2014, 8:59:30 AM8/16/14
to scalanlp...@googlegroups.com


Hi David and Gabriel, 

Thanks for the information. 

David: any updates on the parser trained with biomedical texts?

Gabriel:

I haven't had a close look at the resources you posted yet.

However, I'm starting to look at using the category/concept datasets published by Carnegie Mellon's Read the Web project (NELL) to categorize
literal-nouns when using my software to relate literal-nouns by the actions they can perform on each other.
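
As a rough sketch of what I have in mind (the file name and two-column TSV format below are placeholders, not the actual Read the Web distribution format):

import scala.io.Source

// Hypothetical: build a noun -> category lookup table and use it to tag the
// literal-nouns that appear in extracted propositions.
def loadCategories(path: String): Map[String, String] =
  Source.fromFile(path).getLines()
    .map(_.split("\t", 2))
    .collect { case Array(noun, category) => noun.trim.toLowerCase -> category.trim }
    .toMap

val categories = loadCategories("nell-categories.tsv") // placeholder file name
println(categories.getOrElse("resveratrol", "unknown"))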

I'm playing with the idea of using Neo4j to store the entries of our knowledge base on aging; it seems like an appropriate technology
for the format/specifications described by the people at CREA who were building parts of the knowledge base by hand before I started
working on my automation project:

Thoughts?
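
To make that concrete, the extracted propositions could be written to Neo4j with Cypher along these lines (a hypothetical sketch, not the schema CREA actually uses; escaping and the choice of driver are glossed over):

// Turn an extracted (subject, predicate, object) proposition into a Cypher MERGE,
// creating the two concept nodes if needed and linking them by the action verb.
case class Proposition(subject: String, predicate: String, obj: String)

def toCypher(p: Proposition): String =
  s"""MERGE (s:Concept {name: "${p.subject}"})
     |MERGE (o:Concept {name: "${p.obj}"})
     |MERGE (s)-[:ACTION {verb: "${p.predicate}"}]->(o)""".stripMargin

println(toCypher(Proposition("caloric restriction", "extends", "lifespan")))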
 
Also, I find it somewhat striking/interesting that a paper you mentioned is titled "Joint Coreference Resolution
and Named-Entity Linking with Multi-pass Sieves".

I've been at Defence R&D Canada this summer, working on software tools to estimate static attributes of tracked unidentified vessels,
under the supervision of someone who has been researching a new method for global (combinatorial) vessel identification and who published a paper titled "Joint Identification of Multiple Tracked Targets".
Even more coincidentally, he originally titled that paper "Identity Resolution in Wide-Area Surveillance" before the editors of the Information Fusion journal suggested he change the name.

My observation is probably not that significant. I suppose, from the perspective of an undergraduate student at my level of knowledge, I'm just marvelling that my "hobby" project seems to share
similarities with what members of my section have been doing at work all along.

David Hall

Aug 18, 2014, 7:51:57 PM8/18/14
to scalanlp...@googlegroups.com
On Sat, Aug 16, 2014 at 5:59 AM, Mark Farrell <m4fa...@csclub.uwaterloo.ca> wrote:


Hi David and Gabriel, 

Thanks for the information. 

David: any updates on the parser trained with biomedical texts?

Getting close. I have some preliminary models trained but I'm not quite happy with the results. One more week hopefully.
 

Gabriel:

I haven't had a close look at the resources you posted yet.

However, I'm starting to look at using the category/concept datasets published by Carnegie Mellon's Read the Web project (NELL) to categorize
literal-nouns when using my software to relate literal-nouns by the actions they can perform on each other.

I'm playing with the idea of using Neo4j to store the entries of our knowledge base on aging; it seems like an appropriate technology
for the format/specifications described by the people at CREA who were building parts of the knowledge base by hand before I started
working on my automation project:

Thoughts?

Seems reasonable
 
 
Also, I find it somewhat striking/interesting that a paper you mentioned is titled "Joint Coreference Resolution
and Named-Entity Linking with Multi-pass Sieves".

I've been at Defence R&D Canada this summer, working on software tools to estimate static attributes of tracked unidentified vessels,
under the supervision of someone who has been researching a new method for global (combinatorial) vessel identification and who published a paper titled "Joint Identification of Multiple Tracked Targets".
Even more coincidentally, he originally titled that paper "Identity Resolution in Wide-Area Surveillance" before the editors of the Information Fusion journal suggested he change the name.

My observation is probably not that significant. I suppose, from the perspective of an undergraduate student at my level of knowledge, I'm just marvelling that my "hobby" project seems to share
similarities with what members of my section have been doing at work all along.


There are probably some commonalities, but I suspect it's more that there are certain buzzwords that are popular these days...

-- David