Sunday, 19 June 2016

Hive UDFs: Funnel Analysis


Hive UDFs for Funnel Analysis

Funnel analysis is a method for tracking user conversion rates across a sequence of actions. It helps detect the actions that cause high user fallout.
These Hive UDFs enable funnel analysis to be performed simply and easily on any Hive table.

Requirements

Maven is required to build the funnel UDFs.

How to build

A Makefile is provided with all the build targets.

Build JAR

make jar
This creates a funnel.jar in the target/ directory.

Register JAR with Hive

To use the funnel UDFs, you need to register them with Hive.
With temporary functions:
ADD JAR funnel.jar;
CREATE TEMPORARY FUNCTION funnel         AS 'com.yahoo.hive.udf.funnel.Funnel';
CREATE TEMPORARY FUNCTION funnel_merge   AS 'com.yahoo.hive.udf.funnel.Merge';
CREATE TEMPORARY FUNCTION funnel_percent AS 'com.yahoo.hive.udf.funnel.Percent';
With permanent functions, you need to put the JAR on HDFS, and the functions are registered with a database (replace DATABASE and PATH_TO_JAR with your own values):
CREATE FUNCTION DATABASE.funnel         AS 'com.yahoo.hive.udf.funnel.Funnel'  USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';
CREATE FUNCTION DATABASE.funnel_merge   AS 'com.yahoo.hive.udf.funnel.Merge'   USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';
CREATE FUNCTION DATABASE.funnel_percent AS 'com.yahoo.hive.udf.funnel.Percent' USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';

How to use

There are three funnel UDFs provided: funnel, funnel_merge, and funnel_percent.
The funnel UDF outputs an array of longs showing conversion rates across the provided funnels.
The funnel_merge UDF merges multiple arrays of longs by adding them together.
The funnel_percent UDF takes a raw count funnel result and converts it to a percent change count.
There is no need to sort the data on timestamp; the UDF takes care of it. If two rows have the same timestamp, the UDF breaks the tie by sorting on the action column.

funnel

funnel(action_column, timestamp_column, array(funnel_1_a, funnel_1_b), funnel_2, ...)
  • Builds a funnel report applied to the action_column, sorted by the timestamp_column.
  • The funnels are scalars or arrays of the same type as the action column. An array lets any one of several actions advance to the next funnel.
    • For example, funnel_1 could be array('register_button', 'facebook_invite_register'). The funnel will match the first occurrence of either of these actions and proceed to the next funnel.
    • Or, funnel_1 could just be 'register_button'.
  • You can have an arbitrary number of funnels.
  • The timestamp_column can be of any comparable type (Strings, Integers, Dates, etc).
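
To make the matching rules above concrete, here is a minimal plain-Python sketch of what funnel appears to compute for a single group (for example, one user_id). It is only a model written from the description in this post, not the UDF's actual Java code: the name funnel_sketch and the (action, timestamp) tuple layout are made up for illustration, and the rule that one row can advance at most one funnel step is an assumption.

```python
from typing import Iterable, List, Sequence, Tuple, Union

# A funnel step is either one action or a collection of alternative actions.
Step = Union[str, Sequence[str]]


def funnel_sketch(events: Iterable[Tuple[str, int]], *steps: Step) -> List[int]:
    """Hypothetical model of funnel() for ONE group of (action, timestamp) rows."""
    # Normalise every step to a set so scalars and arrays are handled alike.
    step_sets = [{s} if isinstance(s, str) else set(s) for s in steps]
    # Sort on timestamp; rows with equal timestamps are ordered by action.
    ordered = sorted(events, key=lambda e: (e[1], e[0]))
    result = [0] * len(step_sets)
    current = 0  # index of the funnel step we are waiting for
    for action, _timestamp in ordered:
        if current == len(step_sets):
            break  # every step has already been matched
        if action in step_sets[current]:
            result[current] = 1
            current += 1  # assumption: one row advances at most one step
    return result


if __name__ == "__main__":
    # Rows for user_id 3 from the user_data table in the Examples section.
    user_3 = [("decline", 200), ("confirm_button", 200), ("signup_page", 100)]
    print(funnel_sketch(user_3, ["signup_page", "email_signup"],
                        "confirm_button", "submit_button"))
```

Note that the input rows are deliberately out of order: the sketch sorts them itself, just as the UDF is described as doing.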

funnel_merge

funnel_merge(funnel_column)
  • Merges funnels by adding them together element by element. Use with the funnel UDF.
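
As a rough illustration of "adding them together", the sketch below models funnel_merge in plain Python as an element-wise sum. The function name funnel_merge_sketch is hypothetical, and raising an error on arrays of different lengths is an assumption; the post does not say how the real UDF handles that case.

```python
from typing import Iterable, List, Optional


def funnel_merge_sketch(funnels: Iterable[List[int]]) -> List[int]:
    """Hypothetical model of funnel_merge(): element-wise sum of funnel arrays."""
    merged: Optional[List[int]] = None
    for funnel in funnels:
        if merged is None:
            merged = list(funnel)  # first array seeds the running totals
            continue
        if len(funnel) != len(merged):
            # Assumption: mismatched lengths are treated as an error here.
            raise ValueError("all funnel arrays must have the same length")
        merged = [total + value for total, value in zip(merged, funnel)]
    return merged if merged is not None else []


if __name__ == "__main__":
    # Per-user funnel arrays for user_id 1, 2 and 3 of the Examples section.
    print(funnel_merge_sketch([[1, 1, 1], [1, 0, 0], [1, 1, 0]]))
```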

funnel_percent

funnel_percent(funnel_column)
  • Converts the result of a funnel_merge to percent change. Use with the funnel and funnel_merge UDFs.
  • For example, a result from funnel_merge could look like [245, 110, 54, 13]. This result is in raw counts. If we pass it through funnel_percent, it would look like [1.0, 0.44, 0.49, 0.24]: each value is the count for that step divided by the count for the previous step.
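
The arithmetic behind that example can be sketched in a few lines of Python. This is only a model inferred from the numbers above: funnel_percent_sketch is a made-up name, returning 0.0 when the previous step has a zero count is an assumption, and the sketch returns unrounded ratios, whereas the figures quoted in this post look as if they were cut to two decimals.

```python
from typing import List


def funnel_percent_sketch(counts: List[int]) -> List[float]:
    """Hypothetical model of funnel_percent(): each step relative to the previous one."""
    percents: List[float] = []
    for i, count in enumerate(counts):
        if i == 0:
            percents.append(1.0)  # the first step is the baseline
        elif counts[i - 1] == 0:
            percents.append(0.0)  # assumption: avoid dividing by zero
        else:
            percents.append(count / counts[i - 1])
    return percents


if __name__ == "__main__":
    for value in funnel_percent_sketch([245, 110, 54, 13]):
        print(value)
```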

Examples

Assume a table user_data:
action          timestamp  user_id  gender
signup_page     100        1        f
confirm_button  200        1        f
submit_button   300        1        f
signup_page     200        2        m
submit_button   400        2        m
signup_page     100        3        f
confirm_button  200        3        f
decline         200        3        f
...             ...        ...      ...

Simple funnel

SELECT funnel_merge(funnel)
FROM (SELECT funnel(action, timestamp, array('signup_page', 'email_signup'),
                                       'confirm_button',
                                       'submit_button') AS funnel
      FROM user_data
      GROUP BY user_id) t1;
Result: [3, 2, 1]

Simple funnel with percent

SELECT funnel_percent(funnel_merge(funnel))
FROM (SELECT funnel(action, timestamp, 'signup_page',
                                       'confirm_button',
                                       'submit_button') AS funnel
      FROM user_data
      GROUP BY user_id) t1;
Result: [1.0, 0.66, 0.5]

Funnel with multiple groups

SELECT gender, funnel_merge(funnel)
FROM (SELECT gender,
             funnel(action, timestamp, 'signup_page',
                                       'confirm_button',
                                       'submit_button') AS funnel
      FROM user_data
      GROUP BY user_id, gender) t1
GROUP BY gender;
Result: m: [1, 0, 0], f: [2, 2, 1]

Multiple parallel funnels

SELECT funnel_merge(funnel1), funnel_merge(funnel2)
FROM (SELECT funnel(action, timestamp, 'signup_page',
                                       'confirm_button',
                                       'submit_button') AS funnel1,
             funnel(action, timestamp, 'signup_page',
                                       'decline') AS funnel2
      FROM user_data
      GROUP BY user_id) t1;
Result: [3, 2, 1] [3, 1]
I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick.
