Sunday, 31 July 2016

Apache Hadoop : Weather Analysis using HBASE, MapReduce and HDFS



Weather Analysis using HBASE, MapReduce and HDFS example:


This project downloads weather history data for most of the countries in the world and loads it into HDFS. Once the data is in HDFS, mapper and reducer jobs run against it and save the analysis results to HBase. The code is developed and executed on Hadoop using Java, with HBase as the NoSQL database.

Here are the steps to run through the application:

1. Run the shell script and Python code to parse the webpage, collect all country codes, and use each country code to download the XML file for that country.
All the XML files are saved as xml_files/weather_xxx.xml (xxx is the country code)
2. Copy the XML files to HDFS
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hadoop
hadoop fs -mkdir /user/hadoop/data
hadoop fs -ls /user/hadoop/data
hadoop fs -copyFromLocal /home/hadoop-weather-analysis/xml_files /user/hadoop/data/
3. Create the weather tables in the HBase shell
create 'weather', 'mp'
create 'weather_sum', 'mp'
4. Load the XML files from HDFS into the weather table in HBase (a sketch of this load pattern appears after these steps)
hadoop jar loadXml2.jar com.hadoopgyaan.hbase.dataload.HBaseDriver /user/hadoop/data/xml_files /out1 weather
5. Check the data in the HBase table
count 'weather'
t = get_table 'weather'
t.scan
6. Process the data to compute monthly values for the past 10 years and save them back to the HBase table
hadoop jar processweather.jar com.hadoopgyaan.hbase.process.DBDriver
7. Check the results in the HBase table
scan 'weather_sum'
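
For reference, step 4 above follows the standard MapReduce-to-HBase write pattern: a mapper parses each XML record and emits a Put against the weather table. The sketch below is not the code from the download; it is a minimal illustration of that pattern, and the <station>, <date> and <temperature> tag names as well as the row-key layout are assumptions.

// Minimal sketch of a mapper that writes parsed weather records into HBase.
// NOTE: the XML tag names and the row-key format are assumptions, not the
// actual schema used by loadXml2.jar.
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WeatherLoadMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    // Tiny helper: pull the text between <name> and </name>.
    private static String tag(String xml, String name) {
        int s = xml.indexOf("<" + name + ">");
        int e = xml.indexOf("</" + name + ">");
        return (s < 0 || e < 0) ? "" : xml.substring(s + name.length() + 2, e);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();          // one XML record per input value
        String station = tag(record, "station");   // assumed tag names
        String date = tag(record, "date");
        String temp = tag(record, "temperature");

        byte[] rowKey = Bytes.toBytes(station + "_" + date);
        Put put = new Put(rowKey);
        // 'mp' is the column family created with: create 'weather', 'mp'
        // (on HBase 0.9x use put.add(...) instead of addColumn)
        put.addColumn(Bytes.toBytes("mp"), Bytes.toBytes("temperature"), Bytes.toBytes(temp));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}

In the driver, TableMapReduceUtil.initTableReducerJob("weather", null, job) would wire the job to write the emitted Puts straight into the weather table.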

Downloads : 

Python, HBase and MapReduce Coding

I hope this tutorial will surely help you. If you have any questions or problems please let me know.

Happy Hadooping with Patrick..

Saturday, 30 July 2016

Python : Python Script that Parses "Jobs"


Python Script that Parses "Jobs"

This small Python program scrapes data from a 'Hiring Now' page on Hacker News, or any other jobs website, and saves only the jobs that match certain keywords, e.g. 'New York', 'San Francisco', etc. You can also use the keywords to find specific jobs, e.g. 'Machine Learning'.
You need to install Beautiful Soup 4 in order to use this program
$ pip install beautifulsoup4
Tested with Python 2.7 (32-bit).

Problems

  • It will grab jobs that mention any of the keywords in your list.
  • It will break if someone creates a link with 'More' as the text :(
Downloads


I hope this tutorial will surely help you. If you have any questions or problems please let me know.

Happy Hadooping with Patrick..

Friday, 29 July 2016

How to Set Up a Multi-Node Cluster Installation using CentOS v6.3 on Hadoop 2.x




How to Set Up a Multi-Node Cluster Installation using CentOS v6.3 on Hadoop 2.x

In this tutorial, we are using CentOS 6.3 to install a multi-node Hadoop cluster. We need at least three nodes: one of them will be the master node and the others will be slave nodes. I'm using three nodes in this tutorial to keep the guide as simple as possible. We will install the namenode and jobtracker on the master node, and the datanode, tasktracker, and secondary namenode on the slave nodes.


Check out the multi-node installation guide at the weblink below:

Happy Hadooping with Patrick..

Wednesday, 13 July 2016

Python : Python Script that Parses PDF files




Python Script that Parses PDF files
PDF documents are beautiful things, but that beauty is often only skin deep. Inside, they might have any number of structures that are difficult to understand and exasperating to get at. That means that, in the end, a beautiful PDF document is really meant to be read, and its internals are not to be messed with. Below is a Python script with which you can extract the contents of a PDF.


DOWNLOADS:

Parsing PDF Python Script

I hope this tutorial will surely help you. If you have any questions or problems please let me know.

Happy Hadooping with Patrick..

Apache Hadoop : Wikipedia PageRank Implementation using MapReduce Case Study



Wikipedia PageRank Implementation using MapReduce Case Study


The case study contains a package called PageRank. The main class, PageRank.java, handles all the MapReduce jobs that calculate the PageRank of every page in the dataset. The project is divided into the following parts:

1. (OutlinkMapperStage1.java, OutlinkReducerStage1.java)

The first task of the project is to extract the wiki links and remove the red links from the large dataset. To extract the valid data, i.e. the contents of <page> ... </page>, the XMLInputFormat class, which is already provided, is used. In this first MapReduce job, the mapper extracts the title of the page, which is present in <title> ... </title>, and all the contents of <text> ... </text> in order to find every out-link of the current page. A regular expression is written to extract all valid links from the text tag; in our case there are two types of valid links, [[A]] and [[A|B]]. From both forms A is extracted and all spaces are replaced with underscores. Every title is emitted as (title, #) so that a bucket is created for each title by the combiner. Instead of emitting (title, link), the in-links of each page are emitted to the reducer, i.e. (link, title).
In the reducer, all the contents of a bucket are put into a Set to keep only the unique links. Because # was emitted for every title, a bucket that does not contain # does not correspond to a valid page and is discarded. When the set does contain #, the title is re-emitted as (title, #), which goes to its own bucket, and all the other entries are emitted as individual out-links.
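
As a rough illustration of the link extraction, the regular expression below matches both [[A]] and [[A|B]] and keeps only the target A; the exact pattern used in OutlinkMapperStage1.java may differ.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinkExtractor {
    // Matches [[A]] and [[A|B]]; group(1) is the link target A.
    // This is an illustrative pattern, not necessarily the one used in the project.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|#:]+)(?:\\|[^\\]]*)?\\]\\]");

    public static Set<String> extractLinks(String text) {
        Set<String> links = new LinkedHashSet<String>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            // Replace spaces with underscores, as done for page titles.
            links.add(m.group(1).trim().replace(' ', '_'));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks("See [[Apache Hadoop]] and [[HBase|Apache HBase]]."));
        // -> [Apache_Hadoop, HBase]
    }
}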

2. (OutlinkMapperStage2.java, OutlinkReducerStage2.java)

This part contains the MapReduce job that generates the out-link adjacency graph. The mapper splits every value entry of a bucket into a (key, value) pair and emits it to the reducer.
Since only the unique, valid pages remain, the # marker is no longer needed and is removed from all links. In the reducer, all the values for a page are combined into a StringBuilder and emitted as output to build the adjacency graph. This output is stored in PageRank.outlink.out.

3. (LinkCountMapper.java, LinkCountReducer.java)

This part contains the MapReduce job that computes the total number of pages, denoted N in the PageRank equation. The output of OutlinkReducerStage2 is fed to LinkCountMapper. There are two ways to calculate N: the first is to look for all the page and title tags in the large dataset, which is not an efficient solution; the second is to count the number of lines in the out-link graph, since it already contains all the unique titles. LinkCountReducer then emits N, the number of pages in the dataset, and writes it to PageRank.n.out.
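
A minimal sketch of that second approach, assuming each line of the out-link graph is one unique page; the class names and the N= output format are illustrative, not necessarily those of LinkCountMapper/LinkCountReducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageCount {

    // Every line of the out-link graph is one unique page title.
    public static class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text N_KEY = new Text("N");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(N_KEY, ONE);   // one count per line, i.e. per page
        }
    }

    // Sums the counts and writes "N=<total>" (format assumed) to the output file.
    public static class CountReducer extends Reducer<Text, IntWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long n = 0;
            for (IntWritable v : values) {
                n += v.get();
            }
            context.write(new Text("N=" + n), new Text(""));
        }
    }
}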

4. (RankCalculateMapperStage1.java, RankCalculateReducerStage1.java, RankCalculateMapper.java, RankCalculateReducer.java)

This MapReduce job calculates the PageRank over 8 iterations. To initialize the page ranks, the initial rank of every page is set to 1/N. The output of the reducer is of the format <title> <initial rank> <out-links> and is stored for further calculation in tmp/PageRank.iter0.out.

For each of the 8 iterations, the mapper splits the value part into title, rank and out-links. It then counts the out-links of the page and calculates the rank vote that the current page contributes to each of its out-links, which is rankVote = rank / outlinkCount. This vote is emitted to every out-link of the page.
The reducer then adds up all the rank votes received from the linking pages and computes the PageRank of the page. The formula for the PageRank is:

PR(p1) = (1 - d)/N + d * (PR(p2)/L(p2) + PR(p3)/L(p3) + ...), where
d = damping factor
PR(p1) = PageRank of page p1
N = total number of pages
L(p2) = total number of out-links on page p2
Once we are done with the page rank calculation, emit the newly calculated page rank values in the format <title> <new rank> <out-links>.
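
The reducer-side rank update can be sketched as follows. The "vote:" / "links:" value encoding and the way N is passed in are assumptions for illustration; the project's actual classes may differ.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the per-iteration update: new rank = (1 - d)/N + d * sum(rank votes).
public class RankUpdateReducer extends Reducer<Text, Text, Text, Text> {
    private static final double DAMPING = 0.85;   // d in the formula
    private double n;                             // N, the total number of pages

    @Override
    protected void setup(Context context) {
        // Assume N was read from PageRank.n.out and placed in the job configuration.
        n = Double.parseDouble(context.getConfiguration().get("pagerank.num.pages", "1"));
    }

    @Override
    protected void reduce(Text title, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double voteSum = 0.0;
        String outlinks = "";
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("vote:")) {
                voteSum += Double.parseDouble(v.substring(5));   // PR(p_i) / L(p_i)
            } else if (v.startsWith("links:")) {
                outlinks = v.substring(6);                       // pass the graph through
            }
        }
        double newRank = (1.0 - DAMPING) / n + DAMPING * voteSum;
        // Emit <title> <new rank> <out-links> for the next iteration.
        context.write(title, new Text(newRank + "\t" + outlinks));
    }
}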

5. (SortMapper.java, SortReducer.java)
After the 8 iterations, sorting is performed on the iter1 and iter8 outputs in this MapReduce job. The mapper of this part emits (rank, page) pairs so that the shuffle delivers them in sorted order. To emit the page ranks in descending order, the compare() method is overridden in a key comparator class, as sketched below.
The reducer then receives all the page ranks in descending order and checks whether each rank is greater than 5/N. If the rank is greater than 5/N, the reducer emits the result as (page title, rank).
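
The descending order can be obtained with a comparator like the one below, assuming the rank is emitted as a DoubleWritable key; this is a generic sketch, not necessarily the project's comparator.

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts DoubleWritable keys (the page ranks) in descending order.
// Register in the driver with: job.setSortComparatorClass(DescendingRankComparator.class);
public class DescendingRankComparator extends WritableComparator {

    protected DescendingRankComparator() {
        super(DoubleWritable.class, true);   // true = instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        DoubleWritable x = (DoubleWritable) a;
        DoubleWritable y = (DoubleWritable) b;
        return y.compareTo(x);   // reversed arguments flip ascending into descending
    }
}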

As MapReduce splits all the output into part files, the corresponding outputs are merged into single files. A new file is created using FSDataOutputStream to hold all the data from the output parts. A handle is opened for every part file, and its bytes are transferred in chunks to the output file using the read() method of the FSDataInputStream class. All the output files, PageRank.outlink.out, PageRank.n.out, PageRank.iter1.out and PageRank.iter8.out, are stored in the results directory.
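
A small sketch of that merge using the HDFS FileSystem API (the directory and file names are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Merges the part-r-* files of a job output directory into a single HDFS file.
public class MergeParts {
    public static void merge(FileSystem fs, Path partsDir, Path target) throws IOException {
        FSDataOutputStream out = fs.create(target, true);
        byte[] buffer = new byte[64 * 1024];
        for (FileStatus status : fs.listStatus(partsDir)) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;   // skip _SUCCESS and other marker files
            }
            FSDataInputStream in = fs.open(status.getPath());
            int read;
            while ((read = in.read(buffer)) > 0) {
                out.write(buffer, 0, read);   // copy a chunk of bytes at a time
            }
            in.close();
        }
        out.close();
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        merge(fs, new Path("tmp/PageRank.iter8.out"), new Path("results/PageRank.iter8.out"));
    }
}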

Downloads :

Wikipedia PageRank Input files

Wikipedia Pagerank Java Codes


I hope this tutorial will surely help you. If you have any questions or problems please let me know.
Happy Hadooping with Patrick..

Monday, 11 July 2016

Apache Hadoop : Hive Partitioning and Bucketing Example on Twitter Data


Hive Partitioning and Bucketing Example on Twitter Data

Overview on Hive Partitioning :


Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.
Overview on Hive Bucketing :
A Hive partition can be further subdivided into clusters, or buckets. Hive bucketing is simply another technique for decomposing the data into more manageable, roughly equal parts.

Check out the above examples at the weblink below:


Hive Partitioning and Bucketing Example on Twitter Data

I hope this tutorial will surely help you. If you have any questions or problems please let me know.


Happy Hadooping with Patrick..

Friday, 8 July 2016

How to Create Dynamic Apache HADOOP Project using Maven in Eclipse


Steps for creating a dynamic Hadoop project using Maven in Eclipse.



To check out the steps for creating a Hadoop project using Maven, click on the weblink below:

How to Create Dynamic Apache HADOOP Project using Maven in Eclipse

I hope this tutorial will surely help you. If you have any questions or problems please let me know.


Happy Hadooping with Patrick..

Sunday, 3 July 2016

Apache Hadoop : Movie Recommender MapReduce Case Study




Movie Recommender MapReduce example


This is a very simple Movie Recommender in Hadoop.

The whole job is broken into 4 MapReduce jobs, which are to be run sequentially as shown below.

    The steps are:

    1. Normalization
    2. Finding Distances
    3. Contribution of Rating
    4. Adding up the Ratings

Check out the above MapReduce case study at the link below:

Movie Recommender MapReduce Case Study


I hope this tutorial will surely help you. If you have any questions or problems please let me know.


Happy Hadooping with Patrick..