Banking Transaction example
we will analyze a banking domain dataset, which contains several files with details of its customers.
The dataset :
We will refer to the Transaction dataset throughout this analysis. It;s a collection of financial information from a unknown bank. The dataset deals with over 5,300 bank clients with approximately 68,614 transactions.
Q1. Find how many records are in the data set?
-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv';
-- Group each record by itself
tGroup = GROUP transactions ALL;
--Count number of groups
tCount = FOREACH tGroup GENERATE COUNT(transactions);
DUMP tCount;
Q2. Show the top 5 customers with largest total sales?
-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
-- Group all entries by Customer_Number
customer = GROUP transactions BY Customer_Number;
-- Add each sale by Customer_Number
sales = FOREACH customer GENERATE group,SUM(transactions.Sales_Amount) AS totalSales;
-- Rank the sum of sales from largest to smallest
rankedSales = RANK sales BY totalSales DESC;
-- Show only top 5 largest numbers
top5 = FILTER rankedSales BY $0 <= 5;
DUMP top5;
Q3. Count customers who made sales in System_Period 200401,200402 and 200403?
-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
-- Selects transactions that contain 200401, 200402, 200403 from System_Period
sysFilter = FILTER transactions BY System_Period == 200401 OR
System_Period == 200402 OR System_Period == 200403;
DUMP sysFilter;
Q4. Show top 3 employees (using employee number)who have processed highest average sales?
-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
-- Group each record by Employee_number
employeeGroup = GROUP transactions BY Employee_Number;
-- Average sales by employee number
employeeSales = FOREACH employeeGroup GENERATE group,
AVG(transactions.Sales_Amount) AS avgSales;
-- Rank average sales
rankedSales = RANK employeeSales BY avgSales DESC;
-- Show top three from rankedSales
top3 = FILTER rankedSales BY $0<=3;
DUMP top3;
Q5. Show how many transactions were made during system periods 20040?
-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
-- Filter transactions by column System_Period starting with 20040
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');
-- Group all of sysFilter
sysGroup = GROUP sysFilter ALL;
-- Count entries after being filtered
sysCount = FOREACH sysGroup GENERATE COUNT(sysFilter);
DUMP sysCount;
Q6. Display each unique Sales_Amount by Product_Number?
-- Load data from Transactions.csv
transactions = LOAD '/home/cloudera/datasets/hadoopgyaan/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
-- Creates an alias for columns Product_Number and Sales_Amount
prices = FOREACH transactions GENERATE $5 as col0, $6 as col1;
-- Groups groups columns
priceGroup = GROUP prices BY (col0,col1);
-- Displays each unique Sales_Amount for each Product_Number
priceFilter = FOREACH priceGroup {
sort = ORDER prices BY col0 DESC;
topRec = LIMIT prices 1;
GENERATE FLATTEN(topRec);
};
DUMP priceFilter;
Downloads:
1.Sample Input file (Transaction.csv)
I hope this tutorial will surely help you. If you have any
questions or problems please let me know.
Happy Hadooping with Patrick..
which one is the use case where u inserted pig and hive in shell script for scheduling.I saw that case some days before but can't find out now.
ReplyDeleteActually I want to see how can jobs be scheduled in shell scripts.
Can u point out that use case or url to me please.