Hi,
I've been working on software that implements what I believe is a marginally new method for constructing
knowledge bases automatically from text articles. I tokenize documents into sentences and use the Berkeley Parser to parse each sentence.
Then, I pattern match on the trees produced to extract predicates, objects and subjects, forming simple logical propositions
that I represent on a graph. Specifically, I am trying to help out with some work at the Center for Research and Education on Aging
down at Berkeley. Our goal is to build a software system that automates the construction of a knowledge base on aging
by searching for keywords on PubMed, downloading articles, filtering out spam (i.e. non-scientific) sentences in the articles, compiling
the text, storing the results in a graph database, and providing a web viewer for the knowledge base.
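
To make the pattern-matching step concrete, here is a minimal sketch in Python. It is not the project's actual code: the helper names (`parse_tree`, `extract_triple`), the single pattern it supports (S -> NP VP, with VP -> VB* NP), and the assumption that the parser emits Penn Treebank style bracketed trees are all illustrative choices. It returns nothing at all when the pattern does not fit, in the spirit of an all-or-nothing strategy.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree string into (label, children) tuples; leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                          # consume "("
        label = ""                        # some parsers print an empty root label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return read()

def words(node):
    """Collect the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in words(c)]

def child(node, prefix):
    """First direct child whose label starts with `prefix`, or None."""
    for c in node[1]:
        if not isinstance(c, str) and c[0].startswith(prefix):
            return c
    return None

def extract_triple(tree):
    """Return (subject, predicate, object) if the tree matches the pattern, else None."""
    node = tree
    while node[0] != "S":                 # descend through single-child wrappers such as ROOT
        inner = [c for c in node[1] if not isinstance(c, str)]
        if len(inner) != 1:
            return None
        node = inner[0]
    subject, vp = child(node, "NP"), child(node, "VP")
    if subject is None or vp is None:
        return None
    verb, obj = child(vp, "VB"), child(vp, "NP")
    if verb is None or obj is None:
        return None
    return (" ".join(words(subject)), " ".join(words(verb)), " ".join(words(obj)))

sentence = "(ROOT (S (NP (NN Insulin)) (VP (VBZ regulates) (NP (NN glucose) (NN metabolism))) (. .)))"
print(extract_triple(parse_tree(sentence)))   # ('Insulin', 'regulates', 'glucose metabolism')
```

Each triple would then become two nodes and a labeled edge in the graph. A real pattern set would need many more rules than this one (passives, prepositional objects, coordination and so on).
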
The project is hosted on GitHub, where you will also find a link to an online demo of the knowledge base:
There are current drawbacks: I've yet to support all of the phrase structures needed to parse many of the sentences, and
brute-force pattern matching over all possible phrase structures seems to be a little slow. Also, parenthetical phrases
have been giving me some trouble, because they can appear almost anywhere in a sentence and can introduce new clauses.
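
One possible workaround for the parenthetical problem, sketched below in Python, is to prune parenthetical subtrees before pattern matching. This is not something the software currently does. It assumes the parser labels parentheticals `PRN`, as in the Penn Treebank tag set, and it represents trees as hypothetical `(label, children)` tuples with plain strings as leaves.

```python
def strip_parentheticals(node, labels=("PRN",)):
    """Return a copy of the tree with every subtree whose label is in `labels` removed."""
    if isinstance(node, str):
        return node
    label, children = node
    kept = [strip_parentheticals(c, labels) for c in children
            if isinstance(c, str) or c[0] not in labels]
    return (label, kept)

def words(node):
    """Collect the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in words(c)]

# Hand-built tree for "Telomeres (see Figure 1) shorten"
tree = ("S", [
    ("NP", [
        ("NNS", ["Telomeres"]),
        ("PRN", [
            ("-LRB-", ["-LRB-"]),
            ("VP", [("VB", ["see"]), ("NP", [("NN", ["Figure"]), ("CD", ["1"])])]),
            ("-RRB-", ["-RRB-"]),
        ]),
    ]),
    ("VP", [("VBP", ["shorten"])]),
])

print(words(strip_parentheticals(tree)))   # ['Telomeres', 'shorten']
```

The trade-off is that whatever the parenthetical says is discarded. If those clauses matter, the pruned `PRN` subtrees could instead be collected and run through the pattern matcher on their own.
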
On the other hand, the software is parallelizable for performance and, in principle, the method seems to work well. I seem to be
able to extract knowledge from ~40% of all English sentences (I adopt an all-or-nothing strategy), and the graphs I produce
have high (90%+) modularity: I can classify groups of nodes quite effectively. This may not fully establish that my method is valid or of high quality,
but I thought it was an interesting result that, at least in my own mind, shows I'm on the right track.
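
For reference, if "modularity" is taken to mean Newman's Q (an assumption, since the exact measure is not stated above), it can be computed for a given grouping of nodes roughly as follows. The function name and the input format (an edge list plus a node-to-group mapping) are illustrative, and the graph is treated as undirected and unweighted.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping each node to a group id.
    Q = sum over groups of (internal_edges / m) - (group_degree / (2 * m)) ** 2
    """
    m = 0
    internal = defaultdict(int)   # edges with both endpoints in the same group
    degree = defaultdict(int)     # sum of node degrees per group
    for u, v in edges:
        m += 1
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    if m == 0:
        return 0.0
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridging edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, groups), 3))   # 0.357
```

Q approaches 1 only when the graph has many groups that are dense inside and sparsely connected to each other, so the value depends on how the groups were found as well as on the graph itself.
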
I seek support, contributions and advice from this community, and I would benefit from people who work in natural language processing
inspecting my software. Specifically, I wonder whether, and how, I might enhance my software with tools available
in the scalanlp ecosystem. Feel free to contact me if you are interested in this project.