Friday, 17 June 2016

Apache Hadoop: Banking Transaction Pig Case Study

Banking Transaction example

We will analyze a banking domain dataset containing details of a bank's customers and their transactions.

The dataset:

We will refer to the Transactions dataset throughout this analysis. It is a collection of financial information from an unnamed bank, covering over 5,300 clients and approximately 68,614 transactions.

Q1. How many records are in the dataset?

-- Load data from Transactions.csv (no schema is needed for a simple count)
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv';

-- Group all records into a single group
tGroup = GROUP transactions ALL;

-- Count the records in that group
tCount = FOREACH tGroup GENERATE COUNT(transactions);


DUMP tCount;
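One caveat: COUNT skips any record whose first field is null, which can undercount a messy file. Pig's built-in COUNT_STAR counts every tuple regardless; a minimal alternative sketch (tCountAll is a name introduced here, not part of the original script):

-- COUNT_STAR counts every tuple, including those whose first field is null
tCountAll = FOREACH tGroup GENERATE COUNT_STAR(transactions);

DUMP tCountAll;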

Q2. Show the top 5 customers by total sales.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Group all entries by Customer_Number
customer = GROUP transactions BY Customer_Number;

-- Sum the sales for each Customer_Number
sales = FOREACH customer GENERATE group, SUM(transactions.Sales_Amount) AS totalSales;

-- Rank the sum of sales from largest to smallest
rankedSales = RANK sales BY totalSales DESC;

-- Keep the five highest-ranked customers ($0 is the rank column added by RANK)
top5 = FILTER rankedSales BY $0 <= 5;

DUMP top5;
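RANK followed by FILTER works, but the more common Pig idiom for a top-N query is ORDER followed by LIMIT, which avoids materializing a rank column. A sketch equivalent to the query above (sortedSales and top5alt are names introduced here):

-- Sort total sales from largest to smallest and keep the first five
sortedSales = ORDER sales BY totalSales DESC;
top5alt = LIMIT sortedSales 5;

DUMP top5alt;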

Q3. Count the customers who made sales in System_Periods 200401, 200402, and 200403.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Select transactions whose System_Period is 200401, 200402, or 200403
sysFilter = FILTER transactions BY System_Period == 200401 OR
        System_Period == 200402 OR System_Period == 200403;

-- Count the distinct customers among the filtered transactions
customers = FOREACH sysFilter GENERATE Customer_Number;
uniqueCustomers = DISTINCT customers;
custGroup = GROUP uniqueCustomers ALL;
custCount = FOREACH custGroup GENERATE COUNT(uniqueCustomers);

DUMP custCount;
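On Pig 0.12 and later, the chained OR comparisons in the sysFilter statement can be written more compactly with the IN operator; a sketch assuming that version:

-- Same filter using IN (available from Pig 0.12)
sysFilter = FILTER transactions BY System_Period IN (200401, 200402, 200403);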

Q4. Show the top 3 employees (by Employee_Number) with the highest average sales.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Group the records by Employee_Number
employeeGroup = GROUP transactions BY Employee_Number;

-- Average sales by employee number
employeeSales = FOREACH employeeGroup GENERATE group,
        AVG(transactions.Sales_Amount) AS avgSales;

-- Rank average sales
rankedSales = RANK employeeSales BY avgSales DESC;

-- Keep the three highest-ranked employees
top3 = FILTER rankedSales BY $0 <= 3;

DUMP top3;
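An average computed over just one or two invoices can rank an employee artificially high. If that matters for your data, emitting the transaction count next to the average makes such cases visible. A sketch that would replace the employeeSales statement above (the numSales column is an addition, not part of the original script):

-- Emit the transaction count alongside the average so small
-- samples are visible in the output
employeeSales = FOREACH employeeGroup GENERATE group AS employee,
        AVG(transactions.Sales_Amount) AS avgSales,
        COUNT(transactions) AS numSales;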

Q5. How many transactions were made during system periods beginning with 20040?

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:chararray);

-- Keep transactions whose System_Period starts with '20040'
-- (System_Period is loaded as chararray here so STARTSWITH can be applied)
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');

-- Group all the filtered records into a single group
sysGroup = GROUP sysFilter ALL;

-- Count the filtered records
sysCount = FOREACH sysGroup GENERATE COUNT(sysFilter);

DUMP sysCount;
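This script loads System_Period as chararray precisely so STARTSWITH can test the string prefix. With the int schema used in the other questions, a numeric range filter gives the same result; a sketch under that assumption:

-- Numeric equivalent when System_Period is an int:
-- periods 200400 through 200409 all share the prefix 20040
sysFilter = FILTER transactions BY System_Period >= 200400 AND System_Period <= 200409;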

Q6. Display each unique Sales_Amount for each Product_Number.

-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
        USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

-- Project the Product_Number ($5) and Sales_Amount ($6) columns
prices = FOREACH transactions GENERATE $5 AS col0, $6 AS col1;

-- Group by the (Product_Number, Sales_Amount) pair
priceGroup = GROUP prices BY (col0, col1);

-- Keep one record per group, leaving each unique
-- (Product_Number, Sales_Amount) pair
priceFilter = FOREACH priceGroup {
        topRec = LIMIT prices 1;
        GENERATE FLATTEN(topRec);
        };

DUMP priceFilter;
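The GROUP plus nested-FOREACH pattern above keeps one row per (Product_Number, Sales_Amount) pair. Pig's DISTINCT operator expresses the same deduplication in a single statement; an equivalent sketch (uniquePrices is a name introduced here):

-- DISTINCT removes duplicate (Product_Number, Sales_Amount) rows
uniquePrices = DISTINCT prices;

DUMP uniquePrices;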

Downloads:

1. Sample Input file (Transaction.csv)

I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick!

