Hive Nested JSON Arrray UDF

This UDF takes in a 'JSON string' and a path to a JSON array and collects all elements specified by this path (handles nested JSON arrays as well).

Example:

Assume this JSON is in a row of some table:

{
    "request" : {
        "user" : "Mario",
        "location" : "His house.",
        "siblings" : ["Luigi"],
        "countries" : [
            {
                "name" : "USA",
                "regions": ["California", "New York", "Washington"]
            },
            {
                "name" : "Japan",
                "regions" : ["Tokyo", "Osaka", "Aichi"]
            },
            {
                "name" : "Italy",
                "regions" : ["Lazio", "Lombardy", "Veneto"]
            }
        ],
        "id" : 619
    },
    "flags" : {
        "activated" : false
    },
    "meta" : {
        "name" : "Alpha"
    }
}

Path 1

hive> select name from json_table LATERAL VIEW explode(jsonArray(data, 'request.countries')) t AS name;
.
.
.
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 1.3 sec   HDFS Read: 643 HDFS Write: 174 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 300 msec
OK
{"name":"USA","regions":["California","New York","Washington"]}
{"name":"Japan","regions":["Tokyo","Osaka","Aichi"]}
{"name":"Italy","regions":["Lazio","Lombardy","Veneto"]}
Time taken: 17.082 seconds, Fetched: 3 row(s)

Path 2

hive> select name from json_table LATERAL VIEW explode(jsonArray(data, 'request.countries.regions')) t AS name;
.
.
.
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 1.23 sec   HDFS Read: 643 HDFS Write: 71 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 230 msec
OK
California
New York
Washington
Tokyo
Osaka
Aichi
Lazio
Lombardy
Veneto
Time taken: 14.134 seconds, Fetched: 9 row(s)

Dependencies : Maven

Compiling and Building

To compile run:

# mvn compile

To get JAR (located in target/) run:

# mvn package

Usage

hive> ADD JAR target/hive-udfs-1.0-SNAPSHOT.jar; 
hive> create temporary function jsonArray as 'ArrayUDF';
hive> create table json_table( data string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/hive/test/';
hive> select name from json_table LATERAL VIEW explode(jsonArray(data, 'request.countries')) t AS name;

Downloads:

Sample Java Code and Dataset (Hive Nested JSON Array)

I hope this tutorial will surely help you. If you have any questions or problems please let me know.

Happy Hadooping with Patrick..

Pages

Monday, 20 June 2016

Hive UDF's : Hive Nested JSON Arrray UDF

Hive Nested JSON Arrray UDF

Compiling and Building

Usage

No comments:

Post a Comment