I have Cloudera 5.4.10 with HBase 1.x and SolrCloud 4.10.x on it.
I have 5 HBase tables as below :
1) PARENT : PARENT_ID(RowKey)
2) CHILD_1 : PER1 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_1_STRING | NOTE : PARENT_ID -> ONE-TO-MANY -> SOME_CHILD_1_STRING (each PARENT_ID can have 1 to 50 records here, and we only want this single field indexed)
3) CHILD_2 : PER2 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_2_STRING | NOTE : PARENT_ID -> ONE-TO-MANY -> SOME_CHILD_2_STRING (each PARENT_ID can have 1 to 50 records here, and we only want this single field indexed)
4) CHILD_3 : PER3 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_3_STRING | NOTE : PARENT_ID -> ONE-TO-MANY -> SOME_CHILD_3_STRING (each PARENT_ID can have 1 to 50 records here, and we only want this single field indexed)
5) CHILD_4 : PER4 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_4_STRING | NOTE : PARENT_ID -> ONE-TO-MANY -> SOME_CHILD_4_STRING (each PARENT_ID can have 1 to 50 records here, and we only want this single field indexed)
and a single Solr core, "MyUnifiedCore" :
parent_id : uniqueKey, long
child_1 : string, indexed, stored, multivalued (Need array of SOME_CHILD_1_STRING fields corresponding to a single Parent ID in solr)
child_2 : string, indexed, stored, multivalued (Same as above for CHILD_2)
child_3 : string, indexed, stored, multivalued (Same as above for CHILD_3)
child_4 : string, indexed, stored, multivalued (Same as above for CHILD_4)
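To make the target shape concrete, here is a minimal plain-Java sketch (no HBase or Solr involved) of how rows from one child table are expected to collapse into a single entry per parent_id holding a list of strings, which is what the multivalued child_1 field should end up containing. The names ChildGrouping, ChildRow and groupByParent are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: models CHILD_1 rows in memory.
final class ChildGrouping {

    // One CHILD_1 row, reduced to PER1:PARENT_ID and PER1:SOME_CHILD_1_STRING.
    static final class ChildRow {
        final long parentId;
        final String value;

        ChildRow(long parentId, String value) {
            this.parentId = parentId;
            this.value = value;
        }
    }

    // Groups the child strings by parent id, keeping first-seen order.
    static Map<Long, List<String>> groupByParent(List<ChildRow> rows) {
        Map<Long, List<String>> byParent = new LinkedHashMap<>();
        for (ChildRow row : rows) {
            byParent.computeIfAbsent(row.parentId, k -> new ArrayList<>()).add(row.value);
        }
        return byParent;
    }
}
```

Each key of the returned map corresponds to one Solr document (parent_id), and each list corresponds to the values of child_1 for that document.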
The schema above should give a brief idea of what I am trying to achieve:
Working:
Indexing the PARENT table into MyUnifiedCore with the Lily HBase Indexer works in near real time (NRT).
Not Working:
1) I want to partially update my Solr documents in NRT, using the Lily indexer on the other 4 CHILD tables, in the same core and keyed by the corresponding parent_id in MyUnifiedCore.
2) When I try a partial update, only the last record appears, and it is added as a new document (dedupe and partial update should be working, but they are not).
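The "only the last record appears" symptom is what one would expect if each child row reaches Solr as a full document rather than as an atomic update, since a document with an existing uniqueKey replaces the old one. The toy in-memory model below contrasts the two behaviours for a single multivalued field. It is plain Java, not Solr code, and TinyIndex with its methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of one multivalued field keyed by document id.
final class TinyIndex {
    private final Map<Long, List<String>> child1ById = new HashMap<>();

    // Full-document indexing: the new document replaces the old one.
    void overwrite(long id, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        child1ById.put(id, values);
    }

    // "add"-style atomic update: the value is appended to the existing field.
    void atomicAdd(long id, String value) {
        child1ById.computeIfAbsent(id, k -> new ArrayList<>()).add(value);
    }

    // Returns the current values of the field, or an empty list if absent.
    List<String> child1(long id) {
        return child1ById.getOrDefault(id, new ArrayList<>());
    }
}
```

Calling overwrite twice for the same id leaves only the second value, while calling atomicAdd twice keeps both, which is the behaviour wanted for child_1 to child_4.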
What I need:
1) I need the Morphline config that partially updates the index in MyUnifiedCore from my child tables.
2) I need a Solr configuration, handler, or component that removes duplicate documents in Solr and partially updates my core for each PARENT_ID.
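For reference, a Solr atomic update that appends to a multivalued field is a document whose uniqueKey identifies the target and whose field value is a map with the "add" modifier. The sketch below only renders that JSON body as a string, using the field names parent_id and child_1 from the schema above. The class AtomicAddPayload is hypothetical, and whether the Lily indexer forwards such a map to Solr unchanged is an assumption this sketch does not verify.

```java
// Hypothetical helper: renders the JSON body of a Solr atomic "add" update.
final class AtomicAddPayload {

    // Escapes backslashes and double quotes only; enough for this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // parent_id is the uniqueKey, so the update targets the parent's document.
    static String build(long parentId, String childField, String value) {
        return "[{\"parent_id\":" + parentId
                + ",\"" + escape(childField) + "\":{\"add\":\"" + escape(value) + "\"}}]";
    }
}
```

For example, build(42L, "child_1", "foo") returns [{"parent_id":42,"child_1":{"add":"foo"}}]. The key point is that the document id must be the parent id, not the child row key, for the update to land on the existing parent document.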
What I am working on:
PARENT : This is a normal config, so I am not including it here.
CHILD : (the same for the other 3 tables)
Command :
hbase-indexer add-indexer --name ChildIndexer1 --indexer-conf /path/to/indexer_config.xml --connection-param solr.zk=$ZK_HOST/solr --connection-param solr.collection=myUnifiedCore --zookeeper $ZK_HOST:2181
Indexer config :
<?xml version="1.0"?>
<indexer table="CHILD_1" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" unique-key-field="parentId" mapping-type="row">
<param name="morphlineFile" value="/path/to/morphlines.conf" />
</indexer>
Morphline Conf :
SOLR_LOCATOR : {
# Name of solr collection
collection : myUnifiedCore
# ZooKeeper ensemble
zkHost : "$ZK_HOST"
}
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**", "com.ngdata.**"]
commands : [
{
extractHBaseCells {
mappings : [
{
inputColumn : "PER1:PARENT_ID"
outputField : parentId
type : long
source : value
}
{
inputColumn : "PER1:SOME_CHILD_1_STRING"
outputField : child_1
type : string
source : value
}
]
}
}
{
sanitizeUnknownSolrFields {
solrLocator : {
collection : myUnifiedCore
zkHost : "$ZK_HOST"
}
}
}
# java command that performs partial updates
{
java {
imports : "import java.util.*;"
code: """
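// parentId was extracted with type long above; go through its string form so parseLong compiles
Long parentId = Long.parseLong(String.valueOf(record.getFirstValue("parentId")));
String child_1 = (String) record.getFirstValue("child_1");
// replaceValues swaps the existing values; put would append a second value to the field
record.replaceValues("parentId", parentId);
record.replaceValues("child_1", Collections.singletonMap("add", child_1));
return child.process(record);"""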
}
}
#{ logDebug { format : "output record: {}", args : ["@{}"] } }
]
}
]