Friday, 17 June 2016

Apache Hadoop: Banking Transaction Pig Case Study

Banking Transaction example

We will analyze a banking domain dataset containing details of a bank's customers and their transactions.

The dataset:

We will refer to the Transactions dataset throughout this analysis. It is a collection of financial information from an unnamed bank, covering over 5,300 clients and approximately 68,614 transactions.

Q1. How many records are in the dataset?

-- Load data from Transactions.csv (no schema is needed for a simple count)
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv';

-- Group all records into a single group
tGroup = GROUP transactions ALL;

-- Count the records in that group
tCount = FOREACH tGroup GENERATE COUNT(transactions);


DUMP tCount;
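One caveat: COUNT skips any record whose first field is null, which can undercount a messy file. Pig's built-in COUNT_STAR counts every tuple regardless; a minimal alternative sketch (tCountAll is a name introduced here, not part of the original script):

-- COUNT_STAR counts every tuple, including those whose first field is null
tCountAll = FOREACH tGroup GENERATE COUNT_STAR(transactions);

DUMP tCountAll;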

Q2. Show the top 5 customers by total sales.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Group all entries by Customer_Number
customer = GROUP transactions BY Customer_Number;

-- Sum the sales for each Customer_Number
sales = FOREACH customer GENERATE group, SUM(transactions.Sales_Amount) AS totalSales;

-- Rank the sum of sales from largest to smallest
rankedSales = RANK sales BY totalSales DESC;

-- Keep the five highest-ranked customers ($0 is the rank column added by RANK)
top5 = FILTER rankedSales BY $0 <= 5;

DUMP top5;
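RANK followed by FILTER works, but the more common Pig idiom for a top-N query is ORDER followed by LIMIT, which avoids materializing a rank column. A sketch equivalent to the query above (sortedSales and top5alt are names introduced here):

-- Sort total sales from largest to smallest and keep the first five
sortedSales = ORDER sales BY totalSales DESC;
top5alt = LIMIT sortedSales 5;

DUMP top5alt;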

Q3. Count the customers who made sales in System_Periods 200401, 200402, and 200403.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Select transactions whose System_Period is 200401, 200402, or 200403
sysFilter = FILTER transactions BY System_Period == 200401 OR
        System_Period == 200402 OR System_Period == 200403;

-- Count the distinct customers among the filtered transactions
customers = FOREACH sysFilter GENERATE Customer_Number;
uniqueCustomers = DISTINCT customers;
custGroup = GROUP uniqueCustomers ALL;
custCount = FOREACH custGroup GENERATE COUNT(uniqueCustomers);

DUMP custCount;
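On Pig 0.12 and later, the chained OR comparisons in the sysFilter statement can be written more compactly with the IN operator; a sketch assuming that version:

-- Same filter using IN (available from Pig 0.12)
sysFilter = FILTER transactions BY System_Period IN (200401, 200402, 200403);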

Q4. Show the top 3 employees (by Employee_Number) with the highest average sales.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Group the records by Employee_Number
employeeGroup = GROUP transactions BY Employee_Number;

-- Average sales by employee number
employeeSales = FOREACH employeeGroup GENERATE group,
        AVG(transactions.Sales_Amount) AS avgSales;

-- Rank average sales
rankedSales = RANK employeeSales BY avgSales DESC;

-- Keep the three highest-ranked employees
top3 = FILTER rankedSales BY $0 <= 3;

DUMP top3;
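An average computed over just one or two invoices can rank an employee artificially high. If that matters for your data, emitting the transaction count next to the average makes such cases visible. A sketch that would replace the employeeSales statement above (the numSales column is an addition, not part of the original script):

-- Emit the transaction count alongside the average so small
-- samples are visible in the output
employeeSales = FOREACH employeeGroup GENERATE group AS employee,
        AVG(transactions.Sales_Amount) AS avgSales,
        COUNT(transactions) AS numSales;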

Q5. How many transactions were made during system periods beginning with 20040?

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:chararray);

-- Keep transactions whose System_Period starts with '20040'
-- (System_Period is loaded as chararray here so STARTSWITH can be applied)
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');

-- Group all the filtered records into a single group
sysGroup = GROUP sysFilter ALL;

-- Count the filtered records
sysCount = FOREACH sysGroup GENERATE COUNT(sysFilter);

DUMP sysCount;
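This script loads System_Period as chararray precisely so STARTSWITH can test the string prefix. With the int schema used in the other questions, a numeric range filter gives the same result; a sketch under that assumption:

-- Numeric equivalent when System_Period is an int:
-- periods 200400 through 200409 all share the prefix 20040
sysFilter = FILTER transactions BY System_Period >= 200400 AND System_Period <= 200409;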

Q6. Display each unique Sales_Amount for each Product_Number.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Project the Product_Number ($5) and Sales_Amount ($6) columns
prices = FOREACH transactions GENERATE $5 AS col0, $6 AS col1;

-- Group by the (Product_Number, Sales_Amount) pair
priceGroup = GROUP prices BY (col0, col1);

-- Keep one record per group, leaving each unique
-- (Product_Number, Sales_Amount) pair
priceFilter = FOREACH priceGroup {
        topRec = LIMIT prices 1;
        GENERATE FLATTEN(topRec);
        };

DUMP priceFilter;
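The GROUP plus nested-FOREACH pattern above keeps one row per (Product_Number, Sales_Amount) pair. Pig's DISTINCT operator expresses the same deduplication in a single statement; an equivalent sketch (uniquePrices is a name introduced here):

-- DISTINCT removes duplicate (Product_Number, Sales_Amount) rows
uniquePrices = DISTINCT prices;

DUMP uniquePrices;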

Downloads:

1. Sample Input file (Transaction.csv)

I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick!

