Hi guys, this blog has moved to a new address, hadoopgyaan.tk. From now on, new case studies will be posted on that website, so keep visiting it for the latest case studies.
The explosion of smartphones in the consumer space (and of smart devices of all kinds more generally) has continued to accelerate the next generation of apps, such as Uber, which depend on processing and drawing insight from huge volumes of incoming data. Check out the case study at the link below: Spatial Analytics with Hive on UBER Anonymized GPS Logs Case Study
Happy Hadooping with Patrick..
Hive Partitioning and Bucketing Example on Twitter Data
Overview on Hive Partitioning :
Hive organizes tables into partitions. Partitioning is a way of dividing a
table into related parts based on the values of partition columns such as
date, city, and department. With partitions, it is easy to query just a portion of
the data.
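To make the idea concrete, here is a minimal HiveQL sketch of a partitioned table. It is only an illustration under stated assumptions: the table names `tweets_raw` and `tweets_partitioned`, the column names, and the sample date are all hypothetical and are not taken from the case study.

```sql
-- Hypothetical table: every table and column name here is an illustrative assumption.
-- Each distinct tweet_date value gets its own directory under the table location.
CREATE TABLE tweets_partitioned (
  tweet_id   BIGINT,
  user_name  STRING,
  tweet_text STRING
)
PARTITIONED BY (tweet_date STRING)
STORED AS TEXTFILE;

-- Let Hive derive the partition value from the data (dynamic partitioning).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- With dynamic partitioning, the partition column must come last in the SELECT list.
-- tweets_raw is an assumed, already loaded, unpartitioned source table.
INSERT OVERWRITE TABLE tweets_partitioned PARTITION (tweet_date)
SELECT tweet_id, user_name, tweet_text, tweet_date
FROM tweets_raw;

-- A filter on the partition column lets Hive read only the matching directory.
SELECT COUNT(*) FROM tweets_partitioned WHERE tweet_date = '2016-01-15';
```

The last query is where partitioning pays off: Hive can skip every directory whose `tweet_date` does not match the filter instead of scanning the whole table.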
Overview on Hive Bucketing :
A Hive partition can be further subdivided into clusters or buckets. Bucketing is simply another technique for decomposing the data into more manageable, roughly equal parts. Check out the full examples in the linked case study.
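As a rough illustration of bucketing itself, the sketch below declares a table whose partitions are each split into a fixed number of buckets. The table and column names, the bucket count of 8, and the sample date are assumptions chosen for this example only.

```sql
-- Hypothetical table: names and the bucket count of 8 are illustrative assumptions.
-- Within each partition, a row is assigned to a bucket by hashing user_name
-- and taking the result modulo the number of buckets.
CREATE TABLE tweets_bucketed (
  tweet_id   BIGINT,
  user_name  STRING,
  tweet_text STRING
)
PARTITIONED BY (tweet_date STRING)
CLUSTERED BY (user_name) INTO 8 BUCKETS
STORED AS TEXTFILE;

-- On Hive releases before 2.0, also run: SET hive.enforce.bucketing=true;
-- Later releases always enforce bucketing on insert.

-- Static partition insert; tweets_raw is an assumed unpartitioned source table.
INSERT OVERWRITE TABLE tweets_bucketed PARTITION (tweet_date = '2016-01-15')
SELECT tweet_id, user_name, tweet_text
FROM tweets_raw
WHERE tweet_date = '2016-01-15';

-- Read a single bucket, for example to sample roughly one eighth of the users.
SELECT * FROM tweets_bucketed TABLESAMPLE(BUCKET 1 OUT OF 8 ON user_name);
```

Bucketing on a column with many distinct values, such as a user name, tends to give buckets of similar size, which is what makes sampling and bucketed joins practical.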
The MovieLens example
We will use the MovieLens dataset for analysis with Pig. The data is available from here. This dataset was collected by the GroupLens Research Project.
The data set: The datasets contain movie ratings made by moviegoers. There are three text files: ratings.dat, users.dat and movies.dat. For the sake of completeness, the data in the three files is briefly described here.
ratings.dat –> userid::movieid::rating::timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
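The `::` separator is two characters wide, so a plain single-character field delimiter will not split these lines. The post analyses this dataset with Pig. Purely to illustrate the record layout, here is a minimal HiveQL sketch that loads each line as one string and splits it. The table names `ratings_raw` and `ratings`, the column name `rating_ts`, and the local path are assumptions, not part of the original example.

```sql
-- Hypothetical staging table: each line of ratings.dat lands in one STRING column,
-- because Hive's default field delimiter (Ctrl-A) does not occur in the file.
CREATE TABLE ratings_raw (line STRING);

-- The path is an assumption; point it at wherever ratings.dat was downloaded.
LOAD DATA LOCAL INPATH '/tmp/movielens/ratings.dat' INTO TABLE ratings_raw;

-- Split every line on the two-character '::' separator and cast each field.
-- rating_ts holds seconds since the epoch, as described in the field list.
CREATE TABLE ratings AS
SELECT
  CAST(parts[0] AS INT)    AS userid,
  CAST(parts[1] AS INT)    AS movieid,
  CAST(parts[2] AS INT)    AS rating,
  CAST(parts[3] AS BIGINT) AS rating_ts
FROM (
  SELECT split(line, '::') AS parts
  FROM ratings_raw
) t;

-- Quick sanity check: ratings should only take whole values from 1 to 5.
SELECT rating, COUNT(*) AS num_ratings
FROM ratings
GROUP BY rating;
```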
The Chicago Crime example
Crime Data with HIVE and PIG
Using the Chicago Crime data, I will answer a few simple questions to illustrate the use of some common big data tools.
The data set : The data set contains a little over 90 records, which is perhaps not really on the scale of big data. However, the tools and code used in this document (HIVE and PIG) would be unchanged if we were to handle a data set with tens of millions of records.
Questions to Answer:
1. The most frequently occurring primary type (i.e. theft, narcotics etc.)
2. Districts with the most reported incidents
3. Blocks with the most reported incidents
4. Blocks with the most reported incidents, grouped by primary type
5. A look at the date and time when the highest number of incidents were reported
6. Arrests by primary type
7. Arrests by district
8. A look at the date and time when the highest number of arrests took place
In each instance we will restrict the reporting in this document to 10 lines of data, simply to preserve space.
The intention at a high level is to use historical data to assist law enforcement in answering WHAT has been taking place (primary type, i.e. narcotics, motor theft etc.), WHERE it has been taking place (district, block etc.), and WHEN it has been taking place (month, day, hour). With this information, law enforcement could operate in a more effective and efficient manner. In addition, by combining this data with additional variables from other data sets/sources, law enforcement could possibly develop predictive models, further improving the effectiveness and efficiency of its operations.
1. The most frequently occurring primary type (i.e. theft, narcotics etc.)?
HIVE QUERY: