Partial Updates in Lily Indexer

Murtaza Kanchwala

May 24, 2016, 2:55:03 AM
to HBase Indexer Users
I am running Cloudera 5.4.10 with HBase 1.x and SolrCloud 4.10.x.

I have 5 HBase tables, as follows:

1) PARENT : PARENT_ID (RowKey)
2) CHILD_1 : column family PER1 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_1_STRING
3) CHILD_2 : column family PER2 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_2_STRING
4) CHILD_3 : column family PER3 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_3_STRING
5) CHILD_4 : column family PER4 : CHILD_ID_PARENT_ID (RowKey), PARENT_ID, SOME_CHILD_4_STRING

NOTE : In every CHILD table the relation is PARENT_ID -> ONE-TO-MANY -> SOME_CHILD_n_STRING. Each PARENT_ID can have 1 to 50 records there, and the string is the only field we want to index.

I also have a single Solr core, "MyUnifiedCore", with these fields:

parent_id : uniqueKey, long
child_1 : string, indexed, stored, multivalued (should hold the array of SOME_CHILD_1_STRING values that belong to a single parent ID)
child_2 : string, indexed, stored, multivalued (same as above, for CHILD_2)
child_3 : string, indexed, stored, multivalued (same as above, for CHILD_3)
child_4 : string, indexed, stored, multivalued (same as above, for CHILD_4)
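
As a rough sketch of the fields listed above, the matching schema.xml entries might look like the following. The type names `long` and `string` are assumed to be the stock Solr 4.x field types, and the `_version_` field is an assumption that is not part of the list above: as far as I understand, Solr's atomic (partial) updates need it together with an enabled update log, and they also need every field to be stored.

```xml
<!-- Sketch of schema.xml entries for MyUnifiedCore; type names are assumed -->
<field name="parent_id" type="long" indexed="true" stored="true" required="true"/>
<field name="child_1" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="child_2" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="child_3" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="child_4" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- Assumed: needed by the update log that atomic updates rely on -->
<field name="_version_" type="long" indexed="true" stored="true"/>

<uniqueKey>parent_id</uniqueKey>
```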

Now that the schema gives you a rough idea, here is what I am trying to achieve:

Working: 
Indexing the PARENT table into MyUnifiedCore with the Lily HBase Indexer works in near real time (NRT).

Not Working:

1) I want to use the Lily indexer on the other 4 CHILD tables to partially update, in NRT, the Solr document in MyUnifiedCore that matches each row's parent_id.

2) When I try a partial update, only the last record appears, and it is added as a new document. (Dedupe and partial update should be working, but they are not.)
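
To illustrate what a partial update means here, this is roughly the Solr XML update message that each CHILD_1 row should result in. The ID 1001 and the string value are made-up placeholders. With `update="add"`, Solr appends the value to the existing multivalued field instead of replacing the whole document, provided the uniqueKey matches a document that is already indexed.

```xml
<!-- Sketch of an atomic update; the ID and value are placeholders -->
<add>
  <doc>
    <field name="parent_id">1001</field>
    <field name="child_1" update="add">some child 1 string</field>
  </doc>
</add>
```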

What I need:

1) I need a Morphline config that indexes my child tables into MyUnifiedCore as partial updates.

2) I need a Solr configuration, handler, or component that removes duplicate documents in Solr and partially updates my core for each PARENT_ID.

What I am working on:

PARENT : This uses a normal config, so I am not including it here.

CHILD : (the same for the other 3 tables)

Command : 

hbase-indexer add-indexer --name ChildIndexer1 --indexer-conf /path/to/indexer_config.xml --connection-param solr.zk=$ZK_HOST/solr --connection-param solr.collection=myUnifiedCore --zookeeper $ZK_HOST:2181

Indexer config : 
<?xml version="1.0"?>
<indexer table="CHILD_1" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" unique-key-field="parentId" mapping-type="row">

<param name="morphlineFile" value="/path/to/morphlines.conf" />

</indexer>

Morphline Conf :

SOLR_LOCATOR : {
  # Name of solr collection
  collection : myUnifiedCore

  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}
morphlines : [
 {
   id : morphline1
   importCommands : ["org.kitesdk.**", "com.ngdata.**"]
   commands : [
     {
       extractHBaseCells {
         mappings : [
           {
             inputColumn : "PER1:PARENT_ID"
             outputField : parentId
             type : long
             source : value
           }
           {
             inputColumn : "PER1:SOME_CHILD_1_STRING"
             outputField : child_1 
             type : string
             source : value
           }
         ]
       }
     }
     {
       sanitizeUnknownSolrFields {
         solrLocator : {
           collection : myUnifiedCore
           zkHost : "$ZK_HOST"
         }
       }
     }
     
     # java command that performs partial updates
     {
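       java {
         imports : "import java.util.*;"
         code: """
           // Field values extracted by extractHBaseCells above
           Long parentId = Long.parseLong(record.getFirstValue("parentId").toString());
           String child_1 = (String) record.getFirstValue("child_1");
           // Keep a single parentId value on the record
           record.replaceValues("parentId", parentId);
           // Replace the plain string with an atomic-update "add" instruction
           record.replaceValues("child_1", Collections.singletonMap("add", child_1));
           return child.process(record);
         """
       }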
    }
     
     #{ logDebug { format : "output record: {}", args : ["@{}"] } }
   ]
}
]