Thursday, 23 June 2016

Apache Hadoop : Social Media (Twitter) Data Analysis PIG Case Study



Social Media (Twitter) Data Analysis example

This post contains examples of social media analysis using Pig. Individual examples are described in detail below.

The dataset :


  • full_text.txt: Contains geo-tagged Twitter data with the following fields:
    • Twitter user ID
    • Timestamp of the tweet
    • Location of the tweet
    • Latitude of the tweet
    • Longitude of the tweet
    • Tweet content
  • cities15000.txt: Contains information on cities around the world with the following fields:
    • Record ID
    • City name
    • Country code
    • Latitude of the city
    • Longitude of the city
    • Timezone ID
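
As a minimal sketch, the two files could be loaded with schemas that mirror the field lists above. It assumes both files are tab-delimited (the PigStorage default) and that cities15000.txt holds exactly the six listed fields in that order. The relative paths and the field names chosen for the cities file are illustrative only.

```pig
-- Assumption: tab-delimited input; paths and the cities field names are hypothetical
tweets = LOAD 'full_text.txt' USING PigStorage('\t')
    AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
cities = LOAD 'cities15000.txt' USING PigStorage('\t')
    AS (record_id:chararray, name:chararray, country_code:chararray, lat:float, lon:float, timezone:chararray);
-- Print both schemas to confirm they line up with the field lists
DESCRIBE tweets;
DESCRIBE cities;
```

If a file turns out to have more columns than listed, the schema would need to be extended to match before any field is used.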

MostPopularHashtags.pig

This script finds the five most frequent hashtags in full_text.txt. For this example, a hashtag is defined as any string that starts with '#' followed by one or more letters, digits or underscores.


-- Load the tweet data
data = LOAD '/hadoopgyaan/user/popularhashtags/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Convert all tweets to lowercase so that differently cased hashtags group together
lowercase = FOREACH data GENERATE LOWER(tweet) AS tweet;
-- Split each tweet into individual words
tweetwords = FOREACH lowercase GENERATE FLATTEN(TOKENIZE(tweet)) AS token;
-- Extract hashtags: '#' followed by one or more letters, digits or underscores (NULL for any other word)
extracted = FOREACH tweetwords GENERATE REGEX_EXTRACT(token, '(#[a-z0-9_]+)', 1) AS hashtag;
-- Drop the words that are not hashtags
hashtags = FILTER extracted BY hashtag IS NOT NULL;
-- Group identical hashtags together and create an ordered list of aggregate counts
grouphashtags = GROUP hashtags BY hashtag;
counthashtags = FOREACH grouphashtags GENERATE group AS hashtag, COUNT(hashtags) AS cnt;
orderhashtags = ORDER counthashtags BY cnt DESC;
-- Keep the five most frequent hashtags
limithashtags = LIMIT orderhashtags 5;
DUMP limithashtags;
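
The hashtag definition can also be applied as a whole-token test. The sketch below is one possible variant, not the original author's script: it keeps only tokens that consist entirely of '#' plus letters, digits or underscores, using FILTER with MATCHES, and writes the result with STORE instead of DUMP. The output directory is a hypothetical path, and the input is assumed to be tab-delimited as in the script above.

```pig
-- Assumption: same input file and tab-delimited layout as the script above
data = LOAD '/hadoopgyaan/user/popularhashtags/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Lowercase each tweet and split it into one token per row
tokens = FOREACH data GENERATE FLATTEN(TOKENIZE(LOWER(tweet))) AS token;
-- MATCHES tests the whole token, so only complete hashtags pass the filter
tags = FILTER tokens BY token MATCHES '#[a-z0-9_]+';
-- Count the occurrences of each hashtag
grouped = GROUP tags BY token;
counts = FOREACH grouped GENERATE group AS hashtag, COUNT(tags) AS cnt;
ordered = ORDER counts BY cnt DESC;
top5 = LIMIT ordered 5;
-- Hypothetical output directory; the job fails if it already exists
STORE top5 INTO '/hadoopgyaan/user/popularhashtags/output';
```

Because MATCHES must match the entire token, a token with trailing punctuation such as '#pig!' is dropped here. Whether that is desirable depends on how punctuation in tweets should be treated.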

MostMobileTweeter.pig

This script file will find the user that tweeted from the greatest number of locations, i.e. the greatest number of distinct latitude and longitude pairs.


-- Load the tweet data
data = LOAD '/hadoopgyaan/user/mobiletweeter/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Combine latitude and longitude coordinates into a tuple
locations = FOREACH data GENERATE id, TOTUPLE(lat, lon) AS loc_tuple:tuple(lat:float, lon:float);
-- Keep only the unique locations for each user
distinct_locations = DISTINCT locations;
-- Count the unique locations for each user, order by count, and return the top result
group_locations = GROUP distinct_locations BY id;
count_locations = FOREACH group_locations GENERATE group AS id, COUNT(distinct_locations) AS cnt;
ordered_counts = ORDER count_locations BY cnt DESC;
limit_counts = LIMIT ordered_counts 1;
DUMP limit_counts;
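
The same count of distinct latitude and longitude pairs per user can also be written with a nested FOREACH block. This is an alternative formulation offered as a sketch, not the original author's method; the relation names are illustrative and the input is assumed to be the same tab-delimited file.

```pig
-- Assumption: same input file and tab-delimited layout as the script above
data = LOAD '/hadoopgyaan/user/mobiletweeter/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Collect every tweet of a user into one bag
by_user = GROUP data BY id;
counts = FOREACH by_user {
    -- Project the coordinate pair from the user's bag of tweets
    coords = data.(lat, lon);
    -- Remove repeated pairs within this user's bag
    uniq = DISTINCT coords;
    -- COUNT skips tuples whose first field is null, so rows without a latitude are not counted
    GENERATE group AS id, COUNT(uniq) AS cnt;
};
ordered = ORDER counts BY cnt DESC;
top_user = LIMIT ordered 1;
DUMP top_user;
```

Whether this performs better than a flat DISTINCT followed by GROUP depends on the data, so treat it as an equivalent way of expressing the query rather than an optimization.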


I hope this tutorial helps. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick..
