Integrating Dataverse 4.x with SolrCloud

Shannon Schlueter

Feb 2, 2016, 3:02:56 PM

to Dataverse Users Community

As part of an ongoing data science initiative at UNC-Charlotte, I would very much like to integrate Dataverse with our existing data and analytics platform.

Our platform is based on Cloudera Hadoop CDH 5.5 and integrates Cloudera Search (which is based on Solr 4.10.3). Indexing and query distribution in this environment are handled through SolrCloud.

I noticed the GitHub issues "Indexing: enable Dataverse to use Solr in a distributed environment #1083" and "Solr: load balancing, fault tolerance, and high availability #2322", which seem to be the beginnings of the feature enhancements that would be necessary to integrate Dataverse with SolrCloud. I also noticed the 'Triaged' status, the absence of an FRD, and the lack of an assignee on the latter issue. I'm guessing this means that there is no current development related to this?

I'm interested in doing a bit of digging and coding to flesh out the requirements and issues with such an integration. I noticed Philip's comments in the CONTRIBUTING.md doc:

"Before you start coding, please reach out to us either on our dataverse-community Google Group, IRC, or via sup...@dataverse.org to make sure the effort is well coordinated and we avoid merge conflicts."

So, here I am!

I've forked the IQSS/dataverse repository to bionary/dataverse and loaded it onto a CentOS 6.7 VM with NetBeans and all the prerequisites suggested in the reorganized installation guide. The basic install went fine and worked great. I then moved on to testing against SolrCloud by rolling out a Cloudera QuickStart VM. This VM has SolrCloud preconfigured as part of the Cloudera Search CDH component. It's running Solr 4.10.3 as its base.

From experience, SolrCloud does its best to abstract its API so that most indexing and query requests are indistinguishable from those sent to a single-host Solr server. The exception is that, unless you query ZooKeeper for cloud entry points, you still have a single point of failure. That said, you can still issue queries to any individual SolrCloud node and expect it to distribute the query and/or indexing task and collect and return the results.

The hope here was that the addition of Philip's work on the :SolrHostColonPort option would make this "just work". No joy.

I'm digging into this and was hoping to start a discussion on this thread to determine where the skeletons are in this effort.

Here are a few of the issues that need to be coded/addressed that I've come across so far:

installation/configuration: option to utilize ZooKeeper-based SolrCloud host discovery

SolrJ includes a 'smart' client for SolrCloud, which is ZooKeeper aware.
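To make this option a bit more concrete: a ZooKeeper-aware client is typically handed a ZooKeeper connect string rather than a single Solr host. The snippet below is only a sketch of how such a setting could be parsed and validated. It is written in Python for brevity rather than in Dataverse's Java, the function name is hypothetical, and the `host:port,host:port/chroot` layout is the usual ZooKeeper connect-string convention, not anything Dataverse defines today.

```python
def parse_zk_connect_string(value):
    """Split 'zk1:2181,zk2:2181/solr' into ([(host, port), ...], chroot).

    Hypothetical helper: Dataverse has no such setting at present.
    """
    hosts_part, slash, chroot = value.strip().partition("/")
    chroot = "/" + chroot if slash else ""
    hosts = []
    for entry in hosts_part.split(","):
        host, sep, port = entry.strip().partition(":")
        if not host:
            raise ValueError("empty ZooKeeper host in %r" % value)
        # 2181 is ZooKeeper's default client port.
        hosts.append((host, int(port) if sep else 2181))
    return hosts, chroot
```

Validating the string at configuration time would let an installer fail early on a malformed value instead of at the first indexing call.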

installation/configuration: option to indicate a remote Solr host in installation scripts

could load the custom Dataverse schema.xml using solrctl to automate this task
:SolrHostColonPort call in setup-all.sh to configure the host for Dataverse
altered solr/update call in setup-all.sh to use the remote Solr host

sidenote: with Solr 4, update/json is no longer necessary, as the update handler uses the request's Content-Type header to decide what to do with JSON input
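As an illustration of the sidenote, here is a minimal sketch of an update request that posts JSON to the plain update path and relies on the Content-Type header. It uses only Python's standard library and builds the request without sending it. The host name and the document are made up for the example, and the function name is hypothetical.

```python
import json
import urllib.request


def build_json_update_request(update_url, docs):
    """Build (but do not send) a POST that hands JSON documents to Solr."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=body,
        # The header, not an /update/json path, marks the payload as JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical remote host and document, for illustration only.
req = build_json_update_request(
    "http://solr.example.edu:8983/solr/update?commit=true",
    [{"id": "example_1"}],
)
```

Passing `req` to `urllib.request.urlopen` would actually send it; that step is left out so the sketch has no side effects.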

installation/configuration: option to indicate a Solr collection id in installation scripts

SolrCloud doesn't like URIs that don't explicitly declare a Solr collection. For instance, the existing call to "/solr/update/json?commit=true ..." fails when pointed at a SolrCloud host. However, a call to "/solr/collection1/update/json?commit=true ..." succeeds.
This might open up an avenue for associating individual dataverses with their own Solr collection to enable customized index authorization and more highly distributed/partitioned indexing.
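The difference between the two URIs can be sketched as a small helper that inserts the collection only when one is configured. This is an illustration in Python, not Dataverse code: the function name and the optional `collection` parameter are assumptions, and only the two paths quoted above come from my testing.

```python
def solr_update_url(host_colon_port, collection=None):
    """Build the JSON update URL, adding the collection segment if given."""
    parts = ["http://" + host_colon_port, "solr"]
    if collection:
        # SolrCloud wants the collection named explicitly in the path.
        parts.append(collection)
    parts.append("update/json")
    return "/".join(parts) + "?commit=true"
```

Calling it with only a host reproduces the existing single-host path, while passing "collection1" yields the form that a SolrCloud node accepts.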

installation/configuration: option to indicate TLS/SSL (https://) for the Solr host

This looks like it would affect the solrServer <HttpSolrServer> object instantiation in edu.harvard.iq.dataverse.search.IndexServiceBean.java and possibly elsewhere.

Most likely this would call for the addition of a 'getSolrUrlSchema' method to the systemConfig.
If ZooKeeper is ultimately used to determine SolrCloud URLs, this is taken care of by SolrJ directly.
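A rough sketch of what such a scheme setting might do, written in Python for brevity even though the real change would live in Dataverse's Java code. The function name, the default of "http", and the set of allowed values are all assumptions for illustration.

```python
ALLOWED_SCHEMES = ("http", "https")


def solr_base_url(host_colon_port, scheme=None):
    """Return the Solr base URL, defaulting to plain http when unset."""
    scheme = (scheme or "http").strip().lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported Solr URL scheme: %r" % scheme)
    return "%s://%s/solr" % (scheme, host_colon_port)
```

Defaulting to "http" would keep existing installations working unchanged, and rejecting unknown values keeps a typo in the setting from silently producing an unusable URL.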

installation/configuration: capabilities for Kerberos authentication using the SolrJ/JAAS interface

Authenticating the Dataverse server to SolrCloud via Kerberos, with the connection encrypted by TLS, would go a long way toward securing the indexes as well as paving the way for customizing index authorization per Dataverse user. I'm mainly interested in this as a user impersonation mechanism to match the HDFS/Hive/iRODS/Sentry authorization federation we currently have.

Well, that's what I have so far. What am I missing?

--

Shannon

Philip Durbin

Feb 2, 2016, 3:34:43 PM

to dataverse...@googlegroups.com

Hi Shannon!

Wow! Thanks for reaching out! I've taken your (fantastic) list of issues and pasted them into a Google Doc called "Integrating Dataverse 4.x with SolrCloud":

https://docs.google.com/document/d/1tfYtmoKPy-Ci7s6TOLa-PG-Ez9K_BQB5gGYKCAHbFO8/edit?usp=sharing

I've made it so anyone can comment without login but please request "write" access so we can keep the doc up to date.

You're right about https://github.com/IQSS/dataverse/issues/2322 (Solr: load balancing, fault tolerance, and high availability): no one is actively working in this area. However, if you're able to gain some traction with some assistance from myself and others, I'd be *very* interested in reviewing a pull request. The pull request would ideally include changes to the Vagrant environment so we can see it all working... but I'm getting ahead of myself. :)

Let's try to coordinate via the Google doc. You're also welcome to reach me at http://chat.dataverse.org

Out of curiosity, is it a requirement of yours to make these changes in order to integrate your data and analytics platform with Dataverse? Or are you saying it would somehow make it easier? I would think that any integration wouldn't be blocked by changing all this stuff about Solr but you tell me. :)

Thanks again!

Phil

p.s. I'm glad the Dataverse install went ok. :)

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/a67b9d07-52d1-4d76-940e-3959910f0168%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Philip Durbin
Software Developer for http://dataverse.org
http://www.iq.harvard.edu/people/philip-durbin

Shannon Schlueter

Feb 2, 2016, 4:07:47 PM

to Dataverse Users Community, philip...@harvard.edu

I've only recently begun to get familiar with Vagrant, but I'd be happy to incorporate changes to the Vagrantfile as I learn more.

We're somewhere in between with regard to these issues blocking the integration efforts. I've laid out a stop-gap plan to firewall off a standalone Solr server and coordinate index syncs so that the Dataverse index is visible in both the Dataverse/Solr and the Hadoop/Search environments. This seems to work for the researchers involved at present but won't scale well, hence the interest in contributing :-)

Thanks for setting up the google doc. I hope to have pull requests coming in for you in the near future.

--

Shannon

Mercè Crosas

Feb 2, 2016, 6:04:02 PM

to dataverse...@googlegroups.com

Shannon,

That's wonderful. You might also want to connect with the Odum team at UNC-Chapel Hill. They have a Dataverse installation and might be interested in combining efforts.

All the best,

Merce

Mercè Crosas, Ph.D.

Chief Data Science and Technology Officer, IQSS

Harvard University

http://scholar.harvard.edu/mercecrosas

Ben Companjen

Feb 3, 2016, 4:28:46 AM

to dataverse...@googlegroups.com

Hi Shannon,

Great work! I wish I could say I can help and work together, but I can only echo your call for more scalability, security and flexibility through modularity and configuration.

Have you by any chance worked with Ansible or Puppet and are you aware of https://github.com/IQSS/dataverse-ansible and https://github.com/IQSS/dataverse-puppet?

Although it's great to have installation scripts that help get you going, for running a Dataverse service I believe it's more sustainable to do proper configuration management. These initiatives will hopefully grow to support a choice between running a cluster of Dataverse installations with various Solr(Cloud) and PostgreSQL configurations and having it all on one machine.

Regards,

Ben

DataverseNL – https://dataverse.nl

Jonathan Crabtree

Feb 3, 2016, 12:50:11 PM

to dataverse...@googlegroups.com

Shannon

Odum would be interested in this as well. We are working toward integrating Dataverse into a virtual combination of tools we collectively call VISR:

http://renci.org/wp-content/uploads/2015/05/VISRWhite-Paper-No3_2015_highres.pdf

A connection to a Hadoop instance would be very beneficial.

We are working with our partners at RENCI and the iRODS Consortium folks to get this pilot off the ground.

I would be happy to talk about your ideas; they sound great.

Jon

Shannon Schlueter

Feb 3, 2016, 4:46:36 PM

to Dataverse Users Community

Hi Ben,

I'm new to the orchestration and configuration management scene but learning quickly. We're investigating options to orchestrate the platform I mentioned in the OP. I personally have no experience with these tools yet, but I'm working with a new hire on our team who has training and experience with Puppet.

I had not noticed the Ansible and Puppet Dataverse forks. Thanks for the links.

Shannon Schlueter

Feb 3, 2016, 5:22:24 PM

to Dataverse Users Community

Jon,

Thanks for the white paper link. I'd previously heard about your work with VISR at Odum through mutual acquaintances at RENCI. I had you on my list of people to get in touch with once we got closer to full production-level operations, so this thread is fortuitous. I'll send you an email shortly to see if we can touch base sometime soon to compare notes.

By the way, the platform I've referenced here is the UNCC contribution to the UNC-GA's ROI project for developing a federated data science environment among UNCC, RENCI, and NC State using iRODS zone federation. Each institution has taken on the challenge of providing a unique computational infrastructure to suit different analytics needs. We wanted data to be accessible, transferable, and replicable across the federation, so that data can be staged at the computational platform best suited to a workflow, and analytics workflows can be transferred to the appropriate platform to run data-local.