As part of an ongoing data science initiative at UNC-Charlotte, I would very much like to integrate Dataverse into our existing data and analytics platform.
Our platform is based on Cloudera Hadoop CDH 5.5 and integrates Cloudera Search (which is based on Solr 4.10.3). Indexing and query distribution in this environment are handled through SolrCloud.
I noticed the GitHub issues "Indexing: enable Dataverse to use Solr in a distributed environment #1083" and "Solr: load balancing, fault tolerance, and high availability #2322", which seem to be the beginnings of the feature enhancements that would be necessary to integrate Dataverse with SolrCloud. I also noticed the 'Triaged' status, the absence of an FRD, and the lack of an assignee on the latter issue. I'm guessing this means there is no current development related to this?
I'm interested in doing a bit of digging and coding to flesh out the requirements and issues with such an integration. I noticed Phillip's comments in the CONTRIBUTING.md doc:
"Before you start coding, please reach out to us either on our dataverse-community Google Group, IRC, or via sup...@dataverse.org to make sure the effort is well coordinated and we avoid merge conflicts."
So, here I am!
I've forked the IQSS/dataverse repo to bionary/dataverse and loaded it onto a CentOS 6.7 VM with NetBeans and all the prerequisites as suggested in the reorganized installation guide. The basic install went fine and worked great. I then moved on to testing against SolrCloud by rolling out a Cloudera QuickStart VM. This VM has SolrCloud preconfigured as part of the Cloudera Search CDH component. It's running Solr 4.10.3 as its base.
From experience, SolrCloud does its best to abstract its API such that most indexing and query requests are indistinguishable from those run against a single-host Solr server. The exception is that you still have a single point of failure if you don't query ZooKeeper for cloud entry points. That said, you can still issue queries to any individual SolrCloud node and expect that it will distribute the query and/or indexing task and collect/return the results.
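To make that concrete: since every SolrCloud node accepts the same single-host-style requests, a client that hasn't yet gone through ZooKeeper can still get crude failover by round-robining its own node list. This is a purely illustrative sketch — the class and method names (`NodeRotation`, `nextNode`, `queryUrl`) are hypothetical, not Dataverse or SolrJ code:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: without ZooKeeper discovery, rotate through a
// hand-maintained list of SolrCloud node addresses. Any node will
// distribute the query across the cloud and collect the results.
public class NodeRotation {
    private final List<String> nodes;           // e.g. "host1:8983", "host2:8983"
    private final AtomicInteger next = new AtomicInteger(0);

    public NodeRotation(List<String> nodes) {
        this.nodes = nodes;
    }

    /** Round-robin pick of the next node to send a request to. */
    public String nextNode() {
        int i = Math.floorMod(next.getAndIncrement(), nodes.size());
        return nodes.get(i);
    }

    /** The query URL is identical in shape to a single-host Solr query. */
    public String queryUrl(String collection, String q) {
        return "http://" + nextNode() + "/solr/" + collection + "/select?q=" + q;
    }
}
```

The point is that the URL shape never changes — only the host rotates — which is why a single-host client "almost" works against SolrCloud until a node dies.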
The hope here was that the addition of Phillip's work on the :SolrHostColonPort option would make this "just work". No joy.
I'm digging into this and was hoping to start a discussion on this thread to determine where the skeletons are in this effort.
Here are a few of the issues that need to be coded/addressed that I've come across so far:
- installation/configuration: option to utilize ZooKeeper-based SolrCloud host discovery
- SolrJ includes a 'smart' client for SolrCloud, which is ZooKeeper aware.
- installation/configuration: option to indicate remote solr host in installation scripts
- could implement custom dataverse schema.xml loading using solrctl to automate this task
- :SolrHostColonPort call in setup-all.sh to configure for dataverse
- altered solr/update call in setup-all.sh to use remote solr host
    - sidenote: with Solr 4, update/json is no longer necessary, since the plain update request handler inspects the Content-Type header and understands what to do with JSON input
- installation/configuration: option to indicate a solr collection id in installation scripts
    - SolrCloud doesn't like URIs that don't explicitly declare a collection. For instance, the existing call to "/solr/update/json?commit=true ..." fails when pointed at a SolrCloud host, while a call to "/solr/collection1/update/json?commit=true ..." succeeds
- this might open up an avenue for associating individual dataverses with their own solr collection to enable customized index authorization and more highly distributed/partitioned indexing
- installation/configuration: option to indicate TLS/SSL (https://) for solr host
    - This looks like it would affect the solrServer <HttpSolrServer> object instantiation in edu.harvard.iq.dataverse.search.IndexServiceBean.java and possibly elsewhere.
- most likely this would call for the addition of a 'getSolrUrlSchema' method to the systemConfig
    - if ZooKeeper is ultimately used to determine SolrCloud URLs, this is taken care of by SolrJ directly
- installation/configuration: capabilities for kerberos authentication using solrj/jaas interface
- authenticating the dataverse server to the solrCloud via kerberos and encrypted with TLS would go a long way toward securing the indexes as well as paving the way for customizing index authorization per dataverse user. I'm mainly interested in this as a user impersonation mechanism to match the HDFS/Hive/iRODS/Sentry authorization federation we currently have.
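Several of the installation/configuration items above (the :SolrHostColonPort setting, the explicit collection id, and the http vs. https scheme) all boil down to how the Solr base URL gets assembled. A minimal, dependency-free sketch of that assembly, assuming hypothetical names (`SolrUrlBuilder`, `updateUrl`) that are not actual Dataverse code:

```java
// Hypothetical sketch of the configuration knobs discussed above:
// scheme (http vs https), host:port (the :SolrHostColonPort setting),
// and an explicit collection in the path, which SolrCloud requires.
public class SolrUrlBuilder {

    /** SolrCloud rejects collection-less paths like /solr/update,
     *  so the collection segment is mandatory here. */
    public static String updateUrl(String scheme, String hostColonPort,
                                   String collection) {
        // Plain /update (no /json suffix) is enough on Solr 4+: the
        // handler dispatches on the request's Content-Type.
        return scheme + "://" + hostColonPort + "/solr/" + collection + "/update";
    }

    /** Same URL with an optional commit=true parameter appended. */
    public static String updateUrl(String scheme, String hostColonPort,
                                   String collection, boolean commit) {
        return updateUrl(scheme, hostColonPort, collection)
                + (commit ? "?commit=true" : "");
    }
}
```

Something like this would let the installer scripts and IndexServiceBean share one place where scheme, host, and collection are decided, rather than hard-coding "/solr/update/json".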
Well, that's what I have so far. What all am I missing?