crea-nlp: Software for compiling text into graphs, for automatically building knowledge bases


Mark Farrell

Aug 1, 2014, 1:39:15 PM8/1/14
to scalanlp...@googlegroups.com
Hi,
 
I've been working on software that implements a (somewhat) new method for constructing knowledge bases
automatically from text articles. I tokenize documents into sentences and use the Berkeley Parser to parse each sentence.
Then, I pattern match on the resulting trees to extract predicates, objects and subjects, forming simple logical propositions
that I represent as a graph. Specifically, I am trying to help out with some work at the Center for Research and Education on Aging
(CREA) down at Berkeley. Our goal is to build a software system that automates the construction of a knowledge base on aging
by searching for keywords on PubMed, downloading articles, filtering out spam (i.e. non-scientific) sentences in the articles,
compiling the text, storing the results in a graph database, and providing a web viewer for the knowledge base.
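
To give a rough idea of the kind of tree pattern matching involved, here is a simplified, hypothetical sketch (not the actual crea-nlp code) that pulls a (subject, predicate, object) triple out of a toy constituency tree:

// Simplified constituency tree: internal nodes carry a phrase label, leaves a POS tag and a word.
sealed trait Tree
case class Node(label: String, children: List[Tree]) extends Tree
case class Leaf(tag: String, word: String) extends Tree

// The words spanned by a subtree, joined back into a phrase.
def yieldOf(t: Tree): String = t match {
  case Leaf(_, w)        => w
  case Node(_, children) => children.map(yieldOf).mkString(" ")
}

// Handle only the simplest declarative clause: S -> NP VP, VP -> verb NP.
// Anything else returns None, mirroring the all-or-nothing strategy mentioned below.
def extractTriple(t: Tree): Option[(String, String, String)] = t match {
  case Node("S", List(subj @ Node("NP", _),
                      Node("VP", List(verb @ Leaf(tag, _), obj @ Node("NP", _)))))
      if tag.startsWith("VB") =>
    Some((yieldOf(subj), yieldOf(verb), yieldOf(obj)))
  case _ => None
}

// "Caloric restriction extends lifespan."
val sentence = Node("S", List(
  Node("NP", List(Leaf("JJ", "caloric"), Leaf("NN", "restriction"))),
  Node("VP", List(Leaf("VBZ", "extends"),
                  Node("NP", List(Leaf("NN", "lifespan")))))))

println(extractTriple(sentence)) // Some((caloric restriction,extends,lifespan))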
 
The project is hosted on GitHub, which also links to an online knowledge-base demo for you to view:
 
There are some current drawbacks: I don't yet support all of the phrase structures needed to parse many sentences, and
brute-force pattern matching over all possible phrase structures is a little slow. Parenthetical phrases have also been
giving me trouble, since they can appear almost anywhere in a sentence and introduce new clauses.
On the other hand, the software is parallelizable for performance and, in principle, the method seems to work well. I can
extract knowledge from roughly 40% of English sentences (I adopt an all-or-nothing strategy), and the graphs I produce
have high (90%+) modularity, so I can classify groups of nodes quite effectively. That doesn't by itself prove the method is valid or of high quality,
but it's an interesting result that, at least in my own mind, suggests I'm on the right track.
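
By modularity I mean the standard graph-clustering statistic Q for a node partition; a small self-contained sketch of the computation (not code from the project) looks like this:

// Newman modularity Q = sum over communities c of (e_cc - a_c^2), where e_cc is the
// fraction of edges falling inside community c and a_c is the fraction of edge
// endpoints attached to nodes in c. Q near 1 means dense, well-separated groups.
def modularity(edges: Seq[(String, String)], community: Map[String, Int]): Double = {
  val m = edges.size.toDouble
  community.values.toSet.toSeq.map { c =>
    val within = edges.count { case (u, v) => community(u) == c && community(v) == c } / m
    val ends = edges.map { case (u, v) =>
      (if (community(u) == c) 1 else 0) + (if (community(v) == c) 1 else 0)
    }.sum / (2 * m)
    within - ends * ends
  }.sum
}

// Two triangles joined by a single bridge edge give Q of about 0.36; scores above
// 0.9 require many well-separated clusters.
val edges = Seq(("a","b"), ("b","c"), ("a","c"), ("x","y"), ("y","z"), ("x","z"), ("c","x"))
val parts = Map("a" -> 0, "b" -> 0, "c" -> 0, "x" -> 1, "y" -> 1, "z" -> 1)
println(modularity(edges, parts)) // ~0.357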
 
I'm looking for support, contributions and advice from this community, and would benefit from people who work in natural language processing
inspecting my software. Specifically, I'm wondering whether, and how, I should enhance my software with the tools available
in the scalanlp ecosystem. Feel free to contact me if you are interested in this project.
 
 
 

David Hall

Aug 1, 2014, 2:30:47 PM8/1/14
to scalanlp...@googlegroups.com
Hi Mark,

Thanks for writing, and cool demo!

So, the Berkeley parser is not trained on biomedical texts, and having "in-domain" training data is super important for parsing accuracy. I'm about to throw together a parser that's trained on lots of data from a bunch of different domains--including biomedical--and I'll share that with you when it's ready (a few weeks, probably). That might help with your project. I think the Stanford parser's combined English model already does this.

A student in our lab (not on this list) is working on a joint coreference resolution, named entity recognition, and entity linkage (i.e. associating NPs with a canonical entry in e.g. Wikipedia) project, and that might prove useful. When it's ready, he'll release the system.

-- David


Gabriel Schubiner

Aug 6, 2014, 10:18:31 PM8/6/14
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Hi Mark, sounds like a really great project! Like David, I have a labmate who works on joint coref/NER/linking, and I thought I'd chime in with a reference in case it's useful to see how others are tackling this problem. I believe at least some of their code is in Scala, so it may (or may not) be useful.

https://homes.cs.washington.edu/~lzilles/papers/hzwz-emnlp13.pdf

Re: scalanlp integration, probably the most relevant component would be David's parsing frameworks, as he mentioned, but if you were looking to implement any ML/optimization algorithms, breeze is a great numerical library with first- and second-order generic optimization code, and Nak has some ML algorithms built in already, although Nak is not as mature a library as breeze w.r.t. API stability.
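
For instance, a minimal breeze optimization call looks roughly like this (the toy quadratic objective is just for illustration):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Minimize f(x) = ||x - 3||^2 with L-BFGS; calculate returns (value, gradient).
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = x - 3.0
    (diff dot diff, diff * 2.0)
  }
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 5)
val xMin  = lbfgs.minimize(f, DenseVector.zeros[Double](3))
println(xMin) // roughly DenseVector(3.0, 3.0, 3.0)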

Mark Farrell

Aug 16, 2014, 8:59:30 AM8/16/14
to scalanlp...@googlegroups.com


Hi David and Gabriel, 

Thanks for the information. 

David: any updates on the parser trained with biomedical texts?

Gabriel:

I haven't had a close look at the resources you posted yet.

However, I'm starting to look at using the category/concept datasets published by Carnegie Mellon's Read the Web project (NELL) to categorize
literal-nouns when using my software to relate literal-nouns by the actions they can perform on each other.
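
As a rough sketch of what I have in mind (the file name and two-column TSV format below are placeholders, not the actual Read the Web distribution format):

import scala.io.Source

// Hypothetical: build a noun -> category lookup table and use it to tag the
// literal-nouns that appear in extracted propositions.
def loadCategories(path: String): Map[String, String] =
  Source.fromFile(path).getLines()
    .map(_.split("\t", 2))
    .collect { case Array(noun, category) => noun.trim.toLowerCase -> category.trim }
    .toMap

val categories = loadCategories("nell-categories.tsv") // placeholder file name
println(categories.getOrElse("resveratrol", "unknown"))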

I'm playing with the idea of using Neo4j to store the entries of our knowledge base on aging; it seems like an appropriate technology
for the format/specifications described by the people at CREA who were building parts of the knowledge base by hand before I started
working on my automation project:

Thoughts?
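
To make that concrete, the extracted propositions could be written to Neo4j with Cypher along these lines (a hypothetical sketch, not the schema CREA actually uses; escaping and the choice of driver are glossed over):

// Turn an extracted (subject, predicate, object) proposition into a Cypher MERGE,
// creating the two concept nodes if needed and linking them by the action verb.
case class Proposition(subject: String, predicate: String, obj: String)

def toCypher(p: Proposition): String =
  s"""MERGE (s:Concept {name: "${p.subject}"})
     |MERGE (o:Concept {name: "${p.obj}"})
     |MERGE (s)-[:ACTION {verb: "${p.predicate}"}]->(o)""".stripMargin

println(toCypher(Proposition("caloric restriction", "extends", "lifespan")))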
 
Also, I find it somewhat striking/interesting that a paper you mentioned is titled "Joint Coreference Resolution
and Named-Entity Linking with Multi-pass Sieves".

I've been at Defence R&D Canada this summer, working on software tools to estimate static attributes of tracked unidentified vessels,
under the supervision of someone who has been researching a new method for global (combinatorial) vessel identification and who published a paper titled "Joint Identification of Multiple Tracked Targets".
Even more coincidentally, he originally titled that paper "Identity Resolution in Wide-Area Surveillance" before the editors of the Information Fusion journal suggested he change the name.

My observation is probably not that significant. I suppose, from the perspective of an undergraduate student at my level of knowledge, I'm just marvelling that my "hobby" project seems to share
similarities with what members of my section have been doing at work all along.

David Hall

Aug 18, 2014, 7:51:57 PM8/18/14
to scalanlp...@googlegroups.com
On Sat, Aug 16, 2014 at 5:59 AM, Mark Farrell <m4fa...@csclub.uwaterloo.ca> wrote:


Hi David and Gabriel, 

Thanks for the information. 

David: any updates on the parser trained with biomedical texts?

Getting close. I have some preliminary models trained but I'm not quite happy with the results. One more week hopefully.
 

Gabriel:

I haven't had a close look at the resources you posted yet.

However, I'm starting to look at using the category/concept datasets published by Carnegie Mellon's Read the Web project (NELL) to categorize
literal-nouns when using my software to relate literal-nouns by the actions they can perform on each other.

I'm playing with the idea of using Neo4j to store the entries of our knowledge base on aging; it seems like an appropriate technology
for the format/specifications described by the people at CREA who were building parts of the knowledge base by hand before I started
working on my automation project:

Thoughts?

Seems reasonable
 
 
Also, I find it somewhat striking/interesting that a paper you mentioned is titled "Joint Coreference Resolution
and Named-Entity Linking with Multi-pass Sieves".

I've been at Defence R&D Canada this summer, working on software tools to estimate static attributes of tracked unidentified vessels,
under the supervision of someone who has been researching a new method for global (combinatorial) vessel identification and who published a paper titled "Joint Identification of Multiple Tracked Targets".
Even more coincidentally, he originally titled that paper "Identity Resolution in Wide-Area Surveillance" before the editors of the Information Fusion journal suggested he change the name.

My observation is probably not that significant. I suppose, from the perspective of an undergraduate student at my level of knowledge, I'm just marvelling that my "hobby" project seems to share
similarities with what members of my section have been doing at work all along.


There are probably some commonalities, but I suspect it's more that there are certain buzzwords that are popular these days...

-- David