im struggling with hive hql to produce results from a subdocument. wanna help me?

11 views
Skip to first unread message

Tjay Belt

unread,
Mar 17, 2015, 6:10:16 PM3/17/15
to uh...@googlegroups.com

I have a json document that has several sub documents. I can query this doc and pull out numbers from a repetitive subdocument. The result is a list within brackets. I want to sum these numbers, and cannot figure out how.


Here is a sample document. 

{
    "_id": "005f2c46-4a70-4fed-a8e9-a445011d2aba",
    "Revision": 7,
    "ActivityCountedCollection": [
        {
            "Lesson": "98d66ab9-1ef4-4b61-a05d-857b3e07e0f8",
            "DateTime": "2015-03-02T14:11:05.5407801+00:00",
            "ElapsedSeconds": 66.80226
        },
        {
            "Lesson": "30698aac-5a3d-4464-935c-16de4ba9db70",
            "DateTime": "2015-03-02T14:18:29.7132608+00:00",
            "ElapsedSeconds": 435.24593
        },
        {
            "Lesson": "30698aac-5a3d-4464-935c-16de4ba9db70",
            "DateTime": "2015-03-02T14:21:38.4381108+00:00",
            "ElapsedSeconds": 182.47075
        }
    ]
}



I can use the following hive code from within HDInsight to pull out the values. 

DROP TABLE IF EXISTS HDStudents;
CREATE EXTERNAL TABLE HDStudents
(
  json_response STRING
)
STORED AS TEXTFILE LOCATION 'wasb://bl...@tjaystorage.blob.core.windows.net/HDStudentData';

DROP TABLE IF EXISTS HDStudent_Usage;
CREATE TABLE HDStudent_Usage
(
  id STRING,
  sec STRING
);
FROM HDStudents
INSERT OVERWRITE TABLE HDStudent_Usage
SELECT
    CAST(get_json_object(json_response, '$.id') as STRING),
    get_json_object(json_response, '$.ActivityCountedCollection.ElapsedSeconds') as STRING;

and the result ends up being something like the following

    \N[66.80226,435.24593,182.47075]

I dont know whats happening to my ID, but that is not important now. But the numbers from the above example come out as a list, within brackets. Its not even formed as JSon anymore, so the previous tricks to pull them out of their key value pair do not apply. 
Since it is reading it as a string, with commas and numbers, i cannot perform aggregate functions on it.
Online, i get suggestions to create some maven compiled code function that is referenced, and i get lost attempting this too. 
I thought about pushing these results into yet another table, and doing some processing on it. But do not know how to extract the items from the list, or array, or whatever it is in now. 


Help?

Brian Burton

unread,
Mar 17, 2015, 6:29:59 PM3/17/15
to uh...@googlegroups.com
Hi Tjay,

This is outside my area of expertise, but I know we (Cloudera) have a blog post that discusses querying nested JSON documents that might help get you pointed in the right direction:


If that doesn't help, let me know and I can check with my colleagues who have more experience in that area.

Thanks,

Brian Burton
Technical Writer


--
--
Visit us on the net: http://www.uhug.org
 
You received this message because you are subscribed to the Google
Groups "Utah Hadoop Users Group" group.
To manage your subscription, visit this group at
http://groups.google.com/group/uhug?hl=en

---
You received this message because you are subscribed to the Google Groups "Utah Hadoop Users Group - Big Data Utah" group.
To unsubscribe from this group and stop receiving emails from it, send an email to uhug+uns...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
Message has been deleted
0 new messages