
Thursday, 4 August 2016

Apache Hadoop : London 2012 Olympics Twitter Analysis MapReduce Case Study





London 2012 Olympics Tweet Analysis MapReduce example:

The goal of this project is to develop several simple Map/Reduce programs to analyze one provided dataset. The dataset contains 18 million Twitter messages captured during the London 2012 Olympics. All messages are related in some way to the events happening in London (each contains a term such as London2012).
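To make the setup concrete, here is a minimal sketch (not the case study's actual code) of the kind of Map/Reduce job involved: a hashtag counter. It assumes one tweet's text per input line; the real dataset layout may differ.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HashtagCount {
    public static class HashtagMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text hashtag = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (hashtag, 1) for every token starting with '#'.
            for (String token : value.toString().split("\\s+")) {
                if (token.startsWith("#") && token.length() > 1) {
                    hashtag.set(token.toLowerCase());
                    context.write(hashtag, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each hashtag across all tweets.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}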

Check out the above case study at the link below:

London 2012 Olympics Twitter Analysis MapReduce Case Study

I hope this tutorial helps you. If you have any questions or problems, please let me know.


Happy Hadooping with Patrick..

Wednesday, 3 August 2016

Apache Hadoop : Earthquake Log Analysis MapReduce Case Study





Earthquake Log Analysis MapReduce example:

This example analyzes seismic data that measures earthquake magnitudes around the world. Thousands of sensors are deployed worldwide, recording earthquake readings in log files.
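As a rough illustration of what such a job can look like, here is a minimal sketch that finds the maximum recorded magnitude per region. It assumes a simple "region,timestamp,magnitude" line format, which is not necessarily the case study's log layout.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxMagnitude {
    public static class MagnitudeMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                // Emit (region, magnitude) for each well-formed log line.
                context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
            }
        }
    }

    public static class MaxReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keep only the largest magnitude seen for this region.
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new DoubleWritable(max));
        }
    }
}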
I hope this tutorial helps you. If you have any questions or problems, please let me know.

Check out the above case study at the link below:

Earthquake Log Analysis MapReduce Case Study

Happy Hadooping with Patrick..

Sunday, 31 July 2016

Apache Hadoop : Weather Analysis using HBase, MapReduce and HDFS



Weather Analysis using HBase, MapReduce and HDFS example:


The project downloads weather history data for most countries in the world and puts it into HDFS. Once the data is in HDFS, mapper and reducer jobs run against it and save the analysis results to HBase. The code is developed and executed on Hadoop using Java, with HBase as the NoSQL database.

Here are the steps to run through the application; a minimal HBase write sketch follows these steps.

1. Run the shell script and Python code to parse the webpage for all country codes, then use each country code to download the XML files for all countries.
All the XML files are saved as xml_files/weather_xxx.xml (xxx is the country code).
2. Copy the xml files to HDFS
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hadoop
hadoop fs -mkdir /user/hadoop/data
hadoop fs -ls /user/hadoop/data
hadoop fs -copyFromLocal /home/hadoop-weather-analysis/xml_files /user/hadoop/data/
3. Create the weather tables in HBase:
create 'weather', 'mp'
create 'weather_sum', 'mp'
4. Load the XML files from HDFS into the weather table in HBase:
hadoop jar loadXml2.jar com.hadoopgyaan.hbase.dataload.HBaseDriver /user/hadoop/data/xml_files /out1 weather
5. Check the data in the HBase table
count 'weather'
t = get_table 'weather'
t.scan
6. Process the data to compute monthly aggregates for the past 10 years and save them back to HBase:
hadoop jar processweather.jar com.hadoopgyaan.hbase.process.DBDriver
7. Check the results in the HBase table
scan 'weather_sum'
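For step 4, the loader writes each parsed reading into the 'weather' table. Below is a minimal sketch of such a write using the HBase client API; the row-key scheme (countryCode_stationId_date), the 'temperature' qualifier, and the API version are assumptions, not the project's exact code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WeatherRowWriter {
    public static void writeReading(String countryCode, String stationId,
                                    String date, double temperature) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("weather"))) {
            // Assumed row key: countryCode_stationId_date
            Put put = new Put(Bytes.toBytes(countryCode + "_" + stationId + "_" + date));
            // 'mp' is the column family created above with: create 'weather', 'mp'
            put.addColumn(Bytes.toBytes("mp"), Bytes.toBytes("temperature"),
                          Bytes.toBytes(String.valueOf(temperature)));
            table.put(put);
        }
    }
}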

Downloads:

Python, HBase and MapReduce Code

I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick..

Wednesday, 13 July 2016

Apache Hadoop : Wikipedia PageRank Implementation using MapReduce Case Study



Wikipedia PageRank Implementation using MapReduce Case Study


The case study contains a package called PageRank. The main class, PageRank.java, handles all the MapReduce jobs that calculate the page rank of all the pages in the dataset. The project is divided into the following parts:

1. (OutlinkMapperStage1.java, OutlinkReducerStage1.java)

The first task of the project consists of extracting the wiki links and removing the red links from the large dataset. To extract the valid data, i.e. the contents of <page> ... </page>, the provided XMLInputFormat class is used. In the first MapReduce task, the mapper extracts the title of the page, which is present in <title> ... </title>, and all the contents of <text> ... </text> in order to find all the out-links of the current title. A regular expression extracts all the valid links from the text tag. In our case there are two types of valid links, [[A]] and [[A|B]]. From both forms A is extracted and all spaces are replaced with underscores. Every title is emitted as (title, #) so that the combiner creates one bucket per title. Instead of emitting (title, link) to the reducer, the in-links of each page are emitted, i.e. (link, title).
The reducer then puts all the contents of a bucket into a Set to keep only the unique links. Since a # was emitted for every title, a bucket that does not contain # does not correspond to a valid page. When the set does contain #, the title is re-emitted as (title, #) so that it goes to its own bucket, and all the other links are emitted as individual out-links.
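As an illustration of the link extraction described above, here is a minimal standalone sketch (not the case study's exact code) of a regular expression that matches [[A]] and [[A|B]], keeps only A, and replaces spaces with underscores.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinkExtractor {
    // [[target]] or [[target|anchor text]]; the target is captured in group 1.
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    public static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            // Keep only the target page and normalize spaces to underscores.
            links.add(m.group(1).trim().replace(' ', '_'));
        }
        return links;
    }

    public static void main(String[] args) {
        // Prints: [London_2012, Main_Page]
        System.out.println(extractLinks("See [[London 2012]] and [[Main Page|the home page]]."));
    }
}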

2. (OutlinkMapperStage2.java, OutlinkReducerStage2.java)

This part contains the MapReduce job that generates the out-link adjacency graph. The mapper splits all the value entries of each bucket into (key, value) pairs and emits them to the reducer.
Since only unique pages remain, the # marker is no longer needed and is removed from all the links. All the values are combined in a StringBuilder and emitted as output to generate the adjacency graph. This output is stored in PageRank.outlink.out.
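A minimal sketch of the reducer behaviour described above (not the original OutlinkReducerStage2 source) could look like this:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AdjacencyListReducer extends Reducer<Text, Text, Text, Text> {
    private final Text outlinks = new Text();

    @Override
    protected void reduce(Text title, Iterable<Text> links, Context context)
            throws IOException, InterruptedException {
        // Join all out-links of this title into one adjacency-list line.
        StringBuilder sb = new StringBuilder();
        for (Text link : links) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(link.toString());
        }
        // One line per page: <title> <out-link out-link ...>
        outlinks.set(sb.toString());
        context.write(title, outlinks);
    }
}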

3. (LinkCountMapper.java, LinkCountReducer.java)

This part contains the MapReduce job that computes the total number of pages, denoted N in the equation. The output of OutlinkReducerStage2 is fed to LinkCountMapper. There are two ways to calculate N. The first is to look for all the page and title tags in the large dataset, which is not an optimized solution. The second is, instead of counting all the page tags in the large dataset, to count the number of lines in the out-link graph, which contains all the unique titles. LinkCountReducer then emits N = number of pages in the dataset and writes it to PageRank.n.out.
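A minimal sketch of this line-counting approach (not the case study's LinkCountMapper/LinkCountReducer) could look like the following: every line of the out-link graph contributes 1 under a single constant key, and the reducer sums them to obtain N.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageCount {
    public static class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text N_KEY = new Text("N");
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // One unique title per line of the out-link graph.
            context.write(N_KEY, ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long n = 0;
            for (LongWritable v : values) {
                n += v.get();
            }
            // N = number of pages in the dataset.
            context.write(key, new LongWritable(n));
        }
    }
}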

4. (RankCalculateMapperStage1.java, RankCalculateReducerStage1.java, RankCalculateMapper.java, RankCalculateReducer.java)

This MapReduce job calculates the PageRank over 8 iterations. To initialize the page ranks, every page is given an initial rank of 1/N. The output of the reducer has the format <title> <initialized rank> <out-links> and is stored for further calculation in tmp/PageRank.iter0.out.

For each of the 8 iterations, the value part is split into title, rank, and out-links. The out-links of the page are counted, and the rank vote of the current page for each of its out-links is calculated as rankVote = rank / outlinkCount. This vote is then emitted to all the out-links of that page.
The reducer adds up all the rank votes received from the in-linking pages and computes the new page rank of the page. The formula for the page rank is:

PR(p1) = (1 - d)/N + d * (PR(p2)/L(p2) + PR(p3)/L(p3) + ...) where
d = damping factor
PR(p1) = page rank of page p1
N = total number of pages
L(p2) = total number of out-links on page p2
Once we are done with the page rank calculation, emit the newly calculated page rank values in the format <title> <new rank> <out-links>.
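As a small illustration of applying this formula (not the case study's reducer code), the new rank can be computed from the summed votes as below; the damping factor and N are assumed to be configured elsewhere.

public final class PageRankFormula {
    public static double newRank(double sumOfVotes, double dampingFactor, long totalPages) {
        // PR(p) = (1 - d)/N + d * (sum of PR(q)/L(q) over all in-linking pages q)
        return (1.0 - dampingFactor) / totalPages + dampingFactor * sumOfVotes;
    }

    public static void main(String[] args) {
        // Example: two in-links voting 0.25/2 and 0.10/1, with d = 0.85 and N = 1000.
        double sumOfVotes = 0.25 / 2 + 0.10 / 1;
        System.out.println(newRank(sumOfVotes, 0.85, 1000)); // prints roughly 0.1914
    }
}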

5. (SortMapper.java, SortReducer.java)
After the 8 iterations, sorting is performed on iter1 and iter8 in this MapReduce job. The mapper emits the page ranks as (rank, page) pairs. To emit the page ranks in descending order, the compare() method of the key comparator class is overridden.
The reducer then receives all the page ranks in descending order and checks whether each page rank is greater than 5/N. If it is, the reducer emits the result as (page title, rank).
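A minimal sketch of such a descending key comparator, assuming the rank is emitted as a DoubleWritable key (the class name is illustrative, not the case study's), could look like this:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingRankComparator extends WritableComparator {
    protected DescendingRankComparator() {
        super(DoubleWritable.class, true);
    }

    @Override
    @SuppressWarnings("unchecked")
    public int compare(WritableComparable a, WritableComparable b) {
        // Reverse the natural (ascending) order so higher ranks come first.
        return -a.compareTo(b);
    }
}
// Registered on the job with: job.setSortComparatorClass(DescendingRankComparator.class);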

As MapReduce splits the output into part files, the corresponding outputs are merged into single files. A new file is created using FSDataOutputStream to hold all the data from the output parts. A handle is opened for every part file and chunks of bytes are transferred to the output file using the read() method of the FSDataInputStream class. All the output files, PageRank.outlink, PageRank.n.out, PageRank.iter1.out, and PageRank.iter8.out, are stored in the results directory.

Downloads:

Wikipedia PageRank Input files

Wikipedia PageRank Java Codes


I hope this tutorial helps you. If you have any questions or problems, please let me know.
Happy Hadooping with Patrick..

Sunday, 3 July 2016

Apache Hadoop : Movie Recommender MapReduce Case Study




Movie Recommender MapReduce example


This is a very simple Movie Recommender in Hadoop.

The whole job is broken into 4 MapReduce jobs, which are to be run sequentially as shown below.
  
    The steps are:

    <1> Normalization
    <2> Finding Distances
    <3> Contribution of Rating
    <4> Adding up the Ratings

Check out the above MapReduce case study at the link below:

Movie Recommender MapReduce Case Study


I hope this tutorial helps you. If you have any questions or problems, please let me know.


Happy Hadooping with Patrick..

Thursday, 30 June 2016

Apache Hadoop : Facebook Likes Analyzer MapReduce Case Study




Facebook Likes Analyzer example

It is a small Java program that shows how to use Hadoop to analyze Facebook data, i.e. likes. It finds the most frequent likers in a friend circle.
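As a rough sketch of the idea (not necessarily the linked code), assuming each input line is "post_id<TAB>liker_name", the job can count how many likes each person gave, so the most frequent likers end up with the highest counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LikesAnalyzer {
    public static class LikerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text liker = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed line format: post_id<TAB>liker_name
            String[] fields = value.toString().split("\t");
            if (fields.length == 2) {
                liker.set(fields[1]);
                context.write(liker, ONE); // one like by this person
            }
        }
    }

    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Total likes given by this person across all posts.
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }
}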

Check out the above case study at the link below:

Facebook Likes Analyzer MapReduce Case Study

I hope this tutorial helps you. If you have any questions or problems, please let me know.


         Happy Hadooping with Patrick..






Wednesday, 29 June 2016

Apache Hadoop : “People You May Know” Social Network Friendship Recommendation MapReduce Case Study



“People You May Know” Social Network Friendship Recommendation example

The best friendship recommendations often come from friends. The key idea is that if two people have a lot of mutual friends but are not yet friends themselves, then the system should recommend that they connect.

Let’s assume that friendships are undirected: if A is a friend of B, then B is also a friend of A. This is the most common friendship model, used on Facebook, Google+, LinkedIn, and several other social networks.


The relationships between users are easier to understand as a graph. Each input line lists a user followed by a comma-separated list of that user's friends:
0    1,2,3
1    0,2,3,4,5
2    0,1,4
3    0,1,4
4    1,2,3
5    1,6
6    5
In the graph, you can see that user 0 is not friends with users 4 and 5, but user 0 and user 4 have mutual friends 1, 2, and 3, and user 0 and user 5 have mutual friend 1. As a result, we would like to recommend users 4 and 5 as friends of user 0.
The recommended friends are output in the following format: <recommended friend to USER (# of mutual friends: [ids of the mutual friends, ...]), ...>. The output is sorted by the number of mutual friends and can be verified against the graph.
0    4 (3: [3, 1, 2]),5 (1: [1])
1    6 (1: [5])
2    3 (3: [1, 4, 0]),5 (1: [1])
3    2 (3: [4, 0, 1]),5 (1: [1])
4    0 (3: [2, 3, 1]),5 (1: [1])
5    0 (1: [1]),2 (1: [1]),3 (1: [1]),4 (1: [1])
6    1 (1: [5])
Now, let’s fit this problem into a single MapReduce job. User 0 has friends 1, 2, and 3; as a result, the pairs <1, 2>, <2, 1>, <2, 3>, <3, 2>, <1, 3>, and <3, 1> all have user 0 as a mutual friend. We can therefore emit <key, value> = <1, r=2; m=0>, <2, r=1; m=0>, <2, r=3; m=0>, ..., where r means recommended friend and m means mutual friend. We can then aggregate the result in the reduce phase and calculate how many mutual friends each user and recommended user have in common. However, this approach has a problem: what if user A and the recommended user B are already friends? To handle this, we can add another attribute, isFriend, to the emitted value and simply not recommend a friend in the reduce phase if we know the two are already friends. In the following implementation, m = -1 is used when they are already friends instead of an extra field.
The mapper, reducer, and driver classes are combined in the single source file below.

import java.util.*;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Recommender {
public static class FriendMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
IntWritable userID = new IntWritable();
Text friends = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Program now begins reading input dataset line by line
String split_1[] = value.toString().split("\t");
String split_2[]=null;
userID.set(Integer.parseInt(split_1[0]));
if(split_1.length==1){friends.set("null");context.write(userID,friends);}
else{
split_2 = split_1[1].split(",");
//Write the user id as key and each "friendID,-1" as value. This marks that the user and those ids are already friends and should not be recommended.
}
if(split_2!=null)
{
for(int i=0;i<split_2.length;i++)
{
if(split_2[i] != null)
{
friends.set(split_2[i] + ",-1");
context.write(userID,friends);
}
}
//Now emit each ordered pair of this user's friends with flag +1, since we do not yet know whether they are already friends.
for(int i=0;i<split_2.length;i++){
for(int j=0;j<split_2.length;j++){
// Compare string contents (not references) so a friend is never paired with itself.
if(!split_2[i].equals(split_2[j])){
userID.set(Integer.parseInt(split_2[i]));
friends.set(split_2[j]+",1");
context.write(userID, friends);
}
}
}
}
}
}
public static class FriendReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
HashMap<Integer,Integer> hash = new HashMap<Integer,Integer>();
StringBuilder recommendedList = new StringBuilder();
LinkedList<Integer> friendId = new LinkedList<Integer>();
LinkedList<Integer> comFriendCount = new LinkedList<Integer>();
Text result = new Text();
Text currentVal = new Text();
int count,flag;
int temp,temp1,temp2;
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for(Text value: values){
currentVal = value;
if(currentVal.toString().equals("null") == false){
flag = Integer.parseInt(currentVal.toString().split(",")[1]);
if(hash.containsKey(Integer.parseInt(currentVal.toString().split(",")[0]))){
if(flag==1){
if(hash.get(Integer.parseInt(currentVal.toString().split(",")[0])) != 0){
temp=hash.get(Integer.parseInt(currentVal.toString().split(",")[0])) + 1;
hash.put(Integer.parseInt(currentVal.toString().split(",")[0]),temp);
}
}
else{
hash.put((Integer.parseInt(currentVal.toString().split(",")[0])),0);
}
}
else{
if(flag==1){
hash.put(Integer.parseInt(currentVal.toString().split(",")[0]), 1);
}
else{
hash.put((Integer.parseInt(currentVal.toString().split(",")[0])),0);
}
}
}
else {
result.set("\t");
context.write(key,result);
}
}
for(Map.Entry<Integer,Integer> entry : hash.entrySet() ){
if(entry.getValue()!=0){
friendId.add(entry.getKey());
comFriendCount.add(entry.getValue());
}
}
for(int i=0;i<comFriendCount.size();i++){
for(int j=i+1;j<comFriendCount.size();j++){
if(comFriendCount.get(i)<comFriendCount.get(j)){
temp1=friendId.get(j);
friendId.set(j, friendId.get(i));
friendId.set(i,temp1);
temp2=comFriendCount.get(j);
comFriendCount.set(j, comFriendCount.get(i));
comFriendCount.set(i,temp2);
}
else{
if(comFriendCount.get(i)==comFriendCount.get(j) && friendId.get(i)>friendId.get(j)){
temp1=friendId.get(j);
friendId.set(j, friendId.get(i));
friendId.set(i,temp1);
}
}
}
}
if(friendId.size()>0)
{
recommendedList.append("\t").append(friendId.get(0).toString());
// Emit at most the top 10 recommendations, highest mutual-friend count first.
for(int k=1;k<Math.min(10,friendId.size());k++){
recommendedList.append(",").append(friendId.get(k).toString());
}
result.set(recommendedList.toString());
context.write(key, result);
}
// Reset the per-key state unconditionally, otherwise recommendations
// from this user would leak into the next user's reduce call.
hash.clear();
friendId.clear();
comFriendCount.clear();
recommendedList.setLength(0);
}
}
public static void main(String[] args) throws Exception {
// Configure and submit the single MapReduce job: input path in args[0], output path in args[1].
Configuration myconf = new Configuration();
Job job = Job.getInstance(myconf, "recommender");
job.setJarByClass(Recommender.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(FriendMapper.class);
job.setReducerClass(FriendReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I hope this tutorial helps you. If you have any questions or problems, please let me know.
      Happy Hadooping with Patrick..