Showing posts with label Hive. Show all posts
Showing posts with label Hive. Show all posts

Friday, 19 August 2016

Spatial Analytics with Hive on UBER Anonymized GPS Logs Case Study


The explosion of smartphones in the consumer space (and smart devices of all kinds more generally) has continued to accelerate the next generation of apps such as Uber which depend on the processing of and insight from huge volumes of incoming data.

Check out above case study in below link : 

Spatial Analytics with Hive on UBER Anonymized GPS Logs Case Study

Happy Hadooping with Patrick..

Monday, 11 July 2016

Apache Hadoop : Hive Partitioning and Bucketing Example on Twitter Data


Hive Partitioning and Bucketing Example on Twitter Data

Overview on Hive Partitioning :


Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.
Overview on Hive Bucketing :
The Hive Partition can be further subdivided into Clusters or Buckets.Hive Buckets is nothing but another technique of decomposing data or decreasing the data into more manageable parts or equal parts.

Check it out above Examples in below weblink:


Hive Partitioning and Bucketing Example on Twitter Data

I hope this tutorial will surely help you. If you have any questions or problems please let me know.


Happy Hadooping with Patrick..

Wednesday, 15 June 2016

Apache Hadoop : MovieLens HIVE and PIG Case Study



The MovieLens example

We will use MovieLens dataset for analysis with Pig. The data is available from here. This dataset has been collected by GroupLens Research Project. 

The data set:

The datasets contain movie ratings made by movie goers.It contains three text files: ratings.dat, users.dat and movies.dat.For the sake of completeness, data in the three files is briefly described here.


ratings.dat–>userid::movieid:rating::timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

users.dat–>userid::gender::age::occupation::zipcode

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

*  1:  "Under 18"
* 18:  "18-24"
* 25:  "25-34"
* 35:  "35-44"
* 45:  "45-49"
* 50:  "50-55"
* 56:  "56+"

- Occupation is chosen from the following choices:

*  0:  "other" or not specified
*  1:  "academic/educator"
*  2:  "artist"
*  3:  "clerical/admin"
*  4:  "college/grad student"
*  5:  "customer service"
*  6:  "doctor/health care"
*  7:  "executive/managerial"
*  8:  "farmer"
*  9:  "homemaker"
* 10:  "K-12 student"
* 11:  "lawyer"
* 12:  "programmer"
* 13:  "retired"
* 14:  "sales/marketing"
* 15:  "scientist"
* 16:  "self-employed"
* 17:  "technician/engineer"
* 18:  "tradesman/craftsman"
* 19:  "unemployed"
* 20:  "writer"

movies.dat–>movieID::title::genres

- Genres are pipe-separated and are selected from the following genres:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western

Tuesday, 14 June 2016

Apache Hadoop : Chicago Crime HIVE and PIG Case Study





The Chicago Crime example

Crime Data with HIVE and PIG Using the Chicago Crime data. Here I will answer a few simple questions to illustrate the use of some common big data tools.

The data set :

The data set contains a little over 90 plus records, perhaps not really on the scale of big data, however the tools and code used in this document (HIVE and PIG) will be unchanged if we were to handle this data set with tens of millions of records.



Questions to Answer: 

1. The most frequently occurring primary type (i.e. theft, narcotics etc..) 
2. Districts with the most reported incidents 
3. Blocks with the most reported incidents 
4. Blocks with the most reported incidents, grouped by primary type 
5. A look at the date and time when the highest number of incidents where reported 
6. Arrests by primary type 
7. Arrests by district 
8. A look at the date and time when the highest number of arrests took place.


In each instance we will restrict the reporting in this document to 10 lines of data, simply to preserve space.


The intention at a high level is to use historical data to assist law enforcement in answering, WHAT has been taking place (primary type i.e. narcotics, motor theft etc.), WHERE has it been taking place (district, block etc.), WHEN has it been taking place (month, day, hour). With this information law enforcement could operate in a more effective and efficient manner. In addition when combining this data with additional variables from other data sets/sources, law enforcement could possibly develop predictive models, further improving the effectiveness and efficiency of its operations.

1. The most frequently occurring primary type (i.e. theft, narcotics etc..)?

HIVE QUERY:


SELECT primarytype,
COUNT(*) AS cnt FROM crime GROUP BY primarytype
ORDER BY cnt DESC


PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv
(its path in which you have store your Chicago crime “csv” file. Path could be change as per your requirement)

crime_grp_type = GROUP crime BY primarytype;
crime_grp_type_cntd = FOREACH crime_grp_type GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_grp_type_cntd BY cnt;
DUMP srtd; 

RESULT:



2. Districts with the most reported incidents?


HIVE QUERY:

SELECT district,
COUNT(*) AS cntdistrict FROM crime GROUP BY district
ORDER BY cntdistrict DESC


PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_dist = GROUP crime BY district;
crime_grp_dist_cntd = FOREACH crime_grp_dist GENERATE COUNT(crime) AS cnt;
 srtd = ORDER crime_grp_dist_cntd BY cnt;
DUMP srtd;

RESULT:



3. Blocks with the most reported incidents?


HIVE QUERY:

SELECT block,
COUNT(*) AS cntblock FROM crime
GROUP BY block
ORDER BY cntblock DESC


PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_block = GROUP crime BY block;
 crime_grp_block_cntd = FOREACH crime_grp_block GENERATE COUNT(crime) AS cnt;
 srtd = ORDER crime_grp_block_cntd BY cnt;
 DUMP srtd; 

RESULT: 


4. Blocks with the most reported incidents, grouped by primary type?

 HIVE QUERY:


SELECT block,
primarytype, COUNT(*) AS cntblocktype FROM crime GROUP BY block,
primarytype ORDER BY cntblocktype DESC


PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_cogrp_block_type = COGROUP crime BY (block, primarytype);
crime_ cogrp_block_type _cntd = FOREACH crime_ cogrp_block_type GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_ cogrp_block_type _cntd BY cnt;
DUMP srtd;

RESULT:

5. A look at the date and time when the highest number of incidents where reported?

HIVE QUERY:

SELECT date,
COUNT(*) AS cnt FROM crime
GROUP BY date
ORDER BY cnt DESC


PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_date = GROUP crime BY date;
crime_grp_date_cntd = FOREACH crime_grp_date GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_grp_date_cntd BY cnt;
DUMP srtd;

RESULT:
6. Arrests by primary type?

HIVE QUERY:

SELECT primarytype,
COUNT(*) AS cnt FROM crime WHERE arrest = True
GROUP BY primarytype
ORDER BY cnt DESC



PIG SCRIPT:

crime = LOAD ''/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter = FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );
crime_grp_type = GROUP crime_filter BY primarytype;
crime_grp_type_cntd = FOREACH crime_grp_type GENERATE COUNT(crime_filter) AS cnt;
srtd = ORDER crime_grp_type_cntd BY cnt;
DUMP srtd; 

RESULT:

7. Arrests by district?

HIVE QUERY:

SELECT district,
COUNT(*) AS cntdistrictarrest FROM crime WHERE arrest = True
GROUP BY district
ORDER BY cntdistrictarrest DESC


PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter = FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );
crime_grp_dist = GROUP crime_filter BY district;
crime_grp_dist_cntd = FOREACH crime_grp_dist GENERATE COUNT(crime_filter) AS cnt;
srtd = ORDER crime_grp_dist_cntd BY cnt;
DUMP srtd; 

RESULT:

8. A look at the date and time when the highest number of arrests took place?

HIVE QUERY:

SELECT date,
COUNT(*) AS cnt_arrest FROM crime WHERE arrest = True
GROUP BY date
ORDER BY cnt_arrest DESC