Indexing JSON stored in an HBase cell as separate fields


andrew....@bigdatapartnership.com

Dec 8, 2015, 11:39:32 AM
to HBase Indexer Users
Hello,

I'm looking to set up an HBase indexer for Solr that will be able to take a cell from HBase which is a JSON string and split it into separate fields in Solr.

For example, a single HBase cell value {"firstname" : "John","lastname" : "Smith"} would map to the separate searchable fields firstname and lastname.

I've already tried creating a custom Lucene character filter to extract the individual fields when indexing, but this still leaves the entire JSON string as the stored value for each field, leading to duplication. Can anyone tell me whether this is currently doable with the HBase indexer mapping capability, or otherwise how best to approach it?

Many Thanks,

Andrew

Gabriel Reid

Dec 8, 2015, 11:42:58 AM
to andrew....@bigdatapartnership.com, HBase Indexer Users
Hi Andrew,

This sounds like something that would fit really well with morphlines,
which is part of the Kite SDK and is also supported in hbase-indexer.

Cloudera has quite a bit of documentation on setting up a
morphline-based indexer:
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/search_hbase_batch_indexer.html

- Gabriel
> NOTICE AND DISCLAIMER
>
> This email (including attachments) is confidential. If you are not the
> intended recipient, notify the sender immediately, delete this email from
> your system and do not disclose or use for any purpose.
>
> Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United
> Kingdom
> Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United
> Kingdom
> Big Data Partnership Limited is a company registered in England & Wales with
> Company No 7904824
>
> --
> You received this message because you are subscribed to the Google Groups
> "HBase Indexer Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to hbase-indexer-u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

andrew....@bigdatapartnership.com

Dec 8, 2015, 12:13:56 PM
to HBase Indexer Users, andrew....@bigdatapartnership.com
Thanks Gabriel,

The extractHBaseCells command doesn't quite seem able to do what I need: the mapping accepts the cell 'value' as an enum, so I don't see where I could do any pre-processing on it in a separate step. Would I have to create a custom class that essentially replicates extractHBaseCells, with some added JSON parsing on the cell value?

Many Thanks,

Andrew

Gabriel Reid

Dec 9, 2015, 7:05:02 AM
to andrew....@bigdatapartnership.com, HBase Indexer Users
I'm not actually knowledgeable enough on morphlines to be sure whether
or not something like this is possible with them, but it does seem
like something that would likely be possible.

Another option is to implement your own ResultToSolrMapper [1], and
configure it via the mapper parameter [2] in your indexer config. This
will give you full control over what needs to happen in the conversion
from an HBase Result object into a Solr document.

- Gabriel


1. https://github.com/NGDATA/hbase-indexer/blob/da92ee93e1e3a1919cd74da1531dc13723540e03/hbase-indexer-engine/src/main/java/com/ngdata/hbaseindexer/parse/ResultToSolrMapper.java
2. https://github.com/NGDATA/hbase-indexer/wiki/Indexer-configuration#mapper

Andrew Bumstead

Dec 10, 2015, 5:56:58 AM
to Gabriel Reid, HBase Indexer Users
Thanks Gabriel, I've got it working using a custom morphlines class.

Many thanks,

Andrew

Pankil

Dec 16, 2015, 12:54:20 PM
to HBase Indexer Users, gabrie...@gmail.com
Hi Andrew,

I am trying to achieve something similar. Could you give more detail on how you achieved this with a custom morphline class? Are you using the Cloudera-packaged hbase-indexer?

I am using hbase-indexer standalone and trying to achieve the same functionality.

Pankil



Andrew Bumstead

Dec 17, 2015, 4:21:31 AM
to Pankil, HBase Indexer Users, Gabriel Reid
Hi Pankil,

I downloaded the source code from https://github.com/NGDATA/hbase-indexer, created an additional class in the Morphlines sub project and recompiled it. If you're using a Cloudera distribution then you might find it better using their own fork of the indexer from https://github.com/cloudera/hbase-indexer.

My implementation largely replicates the existing ExtractHBaseCellsBuilder class, with a modification to do processing on the retrieved HBase value before it is indexed. In my case I'm just indexing a single field from the JSON, taking the field to extract from config.

For example:

for (Object value : byteArrayMapper.map(iter.next())) {
    // Modifying value retrieved from HBase
    value = extractFieldFromJson(value, fieldToExtract);
    record.put(outputField, value);
}

where

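    // Needs: import java.io.IOException; import java.util.Map;
    //        import com.fasterxml.jackson.databind.ObjectMapper;

    // ObjectMapper is thread-safe once configured, so one shared instance is enough.
    private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

    private static String extractFieldFromJson(Object value, String fieldToExtract) {
        try {
            // Assumes the mapped cell value is already a String.
            Map<?, ?> parsedJsonMap = JSON_MAPPER.readValue((String) value, Map.class);
            Object field = parsedJsonMap.get(fieldToExtract);
            // String.valueOf avoids a ClassCastException when the field is a number or boolean.
            return field == null ? null : String.valueOf(field);
        } catch (IOException e) {
            // Malformed JSON: report it and return null for this value.
            e.printStackTrace();
            return null;
        }
    }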

Hope this is of help to you.

Regards
Andrew
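
The snippet above extracts one configured field. For the case in the original question, where every top-level JSON key should become its own Solr field, the same loop could put one entry per key. What follows is only a minimal sketch of that idea: JsonFieldSplitter and splitFlatJson are hypothetical names, and the regex is a dependency-free stand-in that handles only flat objects with unescaped string values. A real command would more likely parse with Jackson's ObjectMapper, as in the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a flat JSON object into one entry per top-level field. */
final class JsonFieldSplitter {

    // Matches "key" : "value" pairs. Stand-in for a real JSON parser:
    // it handles only string values that contain no escaped quotes.
    private static final Pattern STRING_PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    private JsonFieldSplitter() {}

    /** Returns the fields in the order they appear in the JSON text. */
    static Map<String, String> splitFlatJson(String json) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = STRING_PAIR.matcher(json);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String cellValue = "{\"firstname\" : \"John\",\"lastname\" : \"Smith\"}";
        // In a morphline command, each entry would become its own record.put(key, value).
        splitFlatJson(cellValue).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Each returned entry corresponds to one output field, so the raw JSON string never has to be stored alongside them.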






Wolfgang Hoschek

Dec 17, 2015, 5:00:25 AM
to Andrew Bumstead, Pankil, HBase Indexer Users, Gabriel Reid
In order to separate and modularize these concerns, I'd recommend moving that JSON logic into a custom morphline command and inserting that into the command chain after the extractHBaseCells command, like so:

commands : [
  {
    extractHBaseCells {
      mappings : [
        {
          inputColumn : "data:item"
          outputField : "_attachment_body"
          type : "byte[]"
          source : value
        }
      ]
    }
  }

  {
    myCustomJSONCommand {}
  }
]

Wolfgang.
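
As a rough illustration of what the body of such a custom command might do, here is a dependency-free sketch. It is an assumption-laden stand-in, not Kite SDK code: the record is modelled as a plain Map of field name to value list rather than a morphline Record, JsonBodyExpander and expand are hypothetical names, the JSON parser is passed in as a function, and the cell bytes are assumed to be UTF-8. In the real Kite SDK this logic would live in a class extending AbstractCommand, registered through a matching CommandBuilder.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the body of a custom command such as myCustomJSONCommand. */
final class JsonBodyExpander {

    static final String BODY_FIELD = "_attachment_body";

    private JsonBodyExpander() {}

    /** Replaces the raw JSON body field with one field per parsed key. */
    static void expand(Map<String, List<Object>> record,
                       Function<String, Map<String, String>> jsonParser) {
        // Remove the raw body so the whole JSON string is not indexed as well.
        List<Object> bodies = record.remove(BODY_FIELD);
        if (bodies == null) {
            return; // the earlier extraction step produced nothing for this row
        }
        for (Object body : bodies) {
            // With type : "byte[]" the value arrives as raw bytes, not a String.
            // UTF-8 is an assumption about how the cell was written.
            String json = new String((byte[]) body, StandardCharsets.UTF_8);
            for (Map.Entry<String, String> e : jsonParser.apply(json).entrySet()) {
                record.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<Object>> record = new HashMap<>();
        List<Object> bodies = new ArrayList<>();
        bodies.add("{\"firstname\":\"John\"}".getBytes(StandardCharsets.UTF_8));
        record.put(BODY_FIELD, bodies);

        // Stub parser for the demo only; a real one would parse the JSON text.
        expand(record, json -> Collections.singletonMap("firstname", "John"));
        System.out.println(record.keySet());
    }
}
```

Keeping the JSON step separate from the HBase extraction step means the stock extractHBaseCells command stays untouched and the parsing logic can be tested on its own.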