SolrSherlock -> OpenSherlock Re: [qa-oss]: Re: [SolrSherlock] Project Update


Jack Park

Apr 17, 2015, 12:46:11 AM
to Leonid Boytsov, qa-...@googlegroups.com
Hi Leonid,

That's a very good question. Thanks for asking. I changed the topic to reflect the new subject: SolrSherlock was originally created using Apache Solr.

In the beginning, that was for two reasons I thought were very good. First, Solr came with many plug-in features I thought would be helpful. Second, the book Taming Text was just becoming available in "early access"; it is a great book in that it is a cookbook on using Solr and OpenNLP to build question answering systems.

There is, in fact, a GitHub repo for SolrSherlock.

But as I developed code and started loading text into the system to be read and parsed, I soon realized that the architecture could be improved by separating the processing agents from the indexing engine: the architecture switched from Solr with plug-in features to Solr as an index, with remote agents running in separate JVM instances.

I then realized that I could take advantage of some features in ElasticSearch, so the project name was changed from SolrSherlock to just OpenSherlock, leaving room for a variety of implementation choices.

In a recent talk delivered in Tokyo, I describe the system in a bit more detail and explain what is behind the project. Those slides are online here:


Over time, I will be adding more details. There is now a repo at GitHub for OpenSherlock. There is a shell for the HyperMembrane component, but it is still evolving (being refactored) so rapidly that putting the code online seems premature.

Cheers
Jack

On Thu, Apr 16, 2015 at 6:53 PM, Leonid Boytsov <l...@boytsov.info> wrote:
Hi, I checked your slides, but I still don't get why the project is called SolrSherlock. Is there a description of the Solr-specific part?

Thanks!

On Friday, April 18, 2014 at 2:24:11 PM UTC-4, jackpark wrote:
Tomorrow, April 19, I will be giving a talk at a BigData Sciences
meetup; my slides for that talk are now online at
http://www.slideshare.net/jackpark/big-datasci20140419

The slides introduce a new concept to the SolrSherlock conversation;
after many conversations with Patrick Durusau, Mark Szpakowski, and
Sherry Jones, we decided to give this new concept the name
HyperMembrane.

The term is inspired by a paper by Ted Nelson:
A COSMOLOGY FOR A DIFFERENT COMPUTER UNIVERSE:
Data Model, Mechanisms, Virtual Machine and Visualization Infrastructure
http://xanadu.com/zigzag/ZZdnld/zzRefDef/

where there is an intersection between the very notion of cosmology in
our context, and the topological nature of harvested information
resources.  Ted talks about his ZigZag architecture, which, in short,
is like a beads-on-a-string representation of information resources,
where, technically, each topic has just one bead (node).  That means
that each bead (node) will have many "strings" passing through it,
depending on context.
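
As a rough illustration of the beads-on-a-string idea described above, the following Python sketch keeps exactly one bead (node) per topic and lets any number of context "strings" pass through it. The class and method names (Fabric, add_string, strings_through, neighbors) and the sample topics are hypothetical, chosen only for illustration; this is not ZigZag's or SolrSherlock's actual data model.

```python
from collections import defaultdict


class Fabric:
    """Toy beads-on-a-string store: one bead (node) per topic, many strings."""

    def __init__(self):
        # context name -> ordered list of topics (the beads on that string)
        self.strings = {}
        # topic -> set of context names whose string passes through that bead
        self.beads = defaultdict(set)

    def add_string(self, context, topics):
        """Thread a new string through the beads for the given topics."""
        self.strings[context] = list(topics)
        for topic in topics:
            # A topic that is already known reuses its single existing bead.
            self.beads[topic].add(context)

    def strings_through(self, topic):
        """Return the contexts whose strings pass through a topic's bead."""
        return sorted(self.beads.get(topic, set()))

    def neighbors(self, topic, context):
        """Return the (previous, next) beads of a topic along one string."""
        beads = self.strings[context]
        i = beads.index(topic)
        prev_bead = beads[i - 1] if i > 0 else None
        next_bead = beads[i + 1] if i + 1 < len(beads) else None
        return prev_bead, next_bead


fabric = Fabric()
fabric.add_string("sentence-1", ["fish oil", "reduces", "blood viscosity"])
fabric.add_string("sentence-2", ["blood viscosity", "aggravates", "Raynaud's syndrome"])

# The shared topic has one bead with two strings passing through it.
print(fabric.strings_through("blood viscosity"))
print(fabric.neighbors("blood viscosity", "sentence-1"))
```

Each added context contributes one more "dimension" through a shared bead, which is one way to picture the many strings passing through a single node.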

In a sense, that is an "information fabric", but one of potentially
high dimensionality. A membrane is a 2-dimensional sheet; a hyper
membrane is also a sheet, but one of many dimensions.  In another
vernacular, one might think of the framework as that of intersecting
manifolds -- topology at work.

I will have much more to say about all of that soon; at this moment,
the code necessary to build and maintain that structure is partially
running. It uses a link-grammar parser, and relies on a topic map to
maintain identity of and relationships among those nodes (beads).

Primary nodes are those of nouns and noun phrases, and verbs and verb
phrases; the fabric ignores what would otherwise be called "stop
words", though they are kept around internally.
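
A minimal sketch of that split, assuming tokens arrive already tagged: nouns and verbs become primary nodes, while everything else is set aside with its position rather than discarded. The tag names, the PRIMARY_TAGS set, and the split_tokens function are hypothetical, and the sketch works on single words only; it does not reproduce the link-grammar parser's output or handle noun and verb phrases.

```python
# Hypothetical part-of-speech tags; a real link-grammar parse is much richer.
PRIMARY_TAGS = {"NOUN", "VERB"}


def split_tokens(tagged_tokens):
    """Separate primary nodes from words that are ignored but retained."""
    primary = []
    retained = []
    for position, (word, tag) in enumerate(tagged_tokens):
        if tag in PRIMARY_TAGS:
            primary.append(word)
        else:
            # Kept internally with its position so the sentence can be rebuilt.
            retained.append((position, word))
    return primary, retained


tagged = [
    ("the", "DET"),
    ("parser", "NOUN"),
    ("reads", "VERB"),
    ("a", "DET"),
    ("sentence", "NOUN"),
]
primary, retained = split_tokens(tagged)
print(primary)
print(retained)
```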

Another topic raised in my slides is that of Literature-based
Discovery. There's a huge literature on that topic; my slides present
a simplified version of Don R. Swanson's Fish Oil, Raynaud's Syndrome
discovery, the paper that started that field: undiscovered public
knowledge.
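
The pattern behind that kind of discovery is often summarized as "A relates to B, B relates to C, but nobody has linked A to C". A minimal Python sketch of it, under stated assumptions, follows. The function name abc_candidates is hypothetical, and the handful of links is a toy simplification of the fish oil example, not data taken from the slides or from the fabric.

```python
from collections import defaultdict


def abc_candidates(links, start):
    """Suggest C terms that share a B term with `start` but have no direct link to it."""
    neighbors = defaultdict(set)
    for a, b in links:
        # Treat each link as an undirected association between two terms.
        neighbors[a].add(b)
        neighbors[b].add(a)

    direct = neighbors[start]
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            # Skip the start term itself and anything already linked to it.
            if c != start and c not in direct:
                candidates[c].add(b)
    # Map each candidate C to the intermediate B terms that connect it.
    return {c: sorted(bs) for c, bs in candidates.items()}


# Toy associations, as if harvested from two separate bodies of literature.
links = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("fish oil", "triglycerides"),
    ("Raynaud's syndrome", "blood viscosity"),
    ("Raynaud's syndrome", "platelet aggregation"),
]

print(abc_candidates(links, "fish oil"))
```

Here the intermediate terms are what make the suggestion inspectable: each candidate comes with the B terms that justify it.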

Before I make the source code for this work available at GitHub, it
must be able to perform literature-based discoveries based on
resources captured in the fabric.

More soon!

--
You received this message because you are subscribed to the Google Groups "qa-oss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qa-oss+un...@googlegroups.com.
To post to this group, send email to qa-...@googlegroups.com.
Visit this group at http://groups.google.com/group/qa-oss.
To view this discussion on the web visit https://groups.google.com/d/msgid/qa-oss/e2ec8d54-30ab-449e-b0d6-4bc552bf291d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
