Social Media (Twitter) Data Analysis example
This post contains examples of social media analysis using Pig. Individual examples are described in detail below.
The datasets:
- full_text.txt: Contains geo-tagged Twitter data with the following fields:
  - Twitter user ID
  - Timestamp of the tweet
  - Location of the tweet
  - Latitude of the tweet
  - Longitude of the tweet
  - Tweet content
- cities15000.txt: Contains information on cities around the world with the following fields:
  - Record ID
  - City name
  - Country code
  - Latitude of the city
  - Longitude of the city
  - Timezone ID
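To make the record layout of full_text.txt concrete, here is a minimal Python sketch that parses one line into named fields. It assumes the file is tab-separated, which is the delimiter Pig's default loader uses. The `Tweet` class, the `parse_tweet` helper, and the sample line are made up for illustration and are not part of the dataset or of Pig.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    user_id: str
    timestamp: str
    location: str
    lat: float
    lon: float
    text: str


def parse_tweet(line: str) -> Tweet:
    """Split one tab-separated record into the six fields of full_text.txt."""
    # maxsplit=5 keeps any stray tabs inside the tweet text intact
    user_id, ts, location, lat, lon, text = line.rstrip("\n").split("\t", 5)
    return Tweet(user_id, ts, location, float(lat), float(lon), text)


# Hypothetical sample record, not taken from the real dataset
sample = "USER_1\t2010-03-03T04:15:26\tSomewhere\t47.5\t-122.25\tHello #world"
tweet = parse_tweet(sample)
print(tweet.user_id, tweet.lat, tweet.lon, tweet.text)
```

The field order mirrors the list above: user ID, timestamp, location, latitude, longitude, tweet content.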
MostPopularHashtags.pig
This script finds the five most frequent hashtags in full_text.txt. For this example, a hashtag is any token that starts with '#' followed by one or more letters, digits, or underscores.
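The hashtag definition can be tried out on a few tokens before running anything on a cluster. The Python sketch below is one way to express that definition as a regular expression. The pattern, the `is_hashtag` helper, and the sample tokens are illustrative assumptions, and requiring the whole token to match is a choice made for this sketch.

```python
import re

# One way to express "'#' followed by letters, digits or underscores"
HASHTAG = re.compile(r"#[a-z0-9_]+")


def is_hashtag(token: str) -> bool:
    """Return True when the whole lowercased token is a hashtag."""
    return HASHTAG.fullmatch(token.lower()) is not None


# Hypothetical tokens for illustration only
for token in ["#Hadoop", "#pig_latin", "#2010", "pig", "#", "#big-data"]:
    print(token, is_hashtag(token))
```

Under this pattern a bare '#' and a token containing a hyphen are rejected, while mixed-case hashtags are accepted because the token is lowercased first.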
-- Load the data (the default loader expects tab-separated fields)
data = LOAD '/hadoopgyaan/user/popularhashtags/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Convert all tweets to lowercase so that hashtags group case-insensitively
lowercase = FOREACH data GENERATE LOWER(tweet) AS tweet;
-- Split each tweet into individual words
tweetwords = FOREACH lowercase GENERATE FLATTEN(TOKENIZE(tweet)) AS token;
-- Extract hashtags from the words; tokens that do not match yield null
extracted = FOREACH tweetwords GENERATE REGEX_EXTRACT(token, '(#[a-z0-9_]+)', 1) AS hashtag;
-- Discard the words that were not hashtags
hashtags = FILTER extracted BY hashtag IS NOT NULL;
-- Group identical hashtags together and create an ordered list of aggregate counts
grouphashtags = GROUP hashtags BY hashtag;
counthashtags = FOREACH grouphashtags GENERATE group AS hashtag, COUNT(hashtags) AS cnt;
orderhashtags = ORDER counthashtags BY cnt DESC;
-- Keep the five most frequent hashtags
limithashtags = LIMIT orderhashtags 5;
DUMP limithashtags;
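To check the result of the script on a small sample without a cluster, the same steps (lowercase, split into words, keep hashtags, count, order, limit) can be mirrored in plain Python. This is only a sketch: `top_hashtags` and the sample tweets are hypothetical, and splitting on whitespace is a rough stand-in for Pig's TOKENIZE, which also splits on a few other characters.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by letters, digits or underscores
HASHTAG = re.compile(r"#[a-z0-9_]+")


def top_hashtags(tweets, n=5):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    counts = Counter()
    for text in tweets:
        # Lowercase first so '#Pig' and '#pig' are counted together
        for token in text.lower().split():
            if HASHTAG.fullmatch(token):
                counts[token] += 1
    # most_common sorts by count, highest first, and keeps n entries
    return counts.most_common(n)


# Hypothetical tweets for illustration only
sample_tweets = [
    "Learning #Pig on #Hadoop",
    "#hadoop jobs are running",
    "no tags here",
    "#HADOOP and #pig again",
]
print(top_hashtags(sample_tweets))
```

Comparing this output with the DUMP of the Pig script on the same handful of lines is a quick way to spot differences in tokenization or matching.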
MostMobileTweeter.pig
This script finds the user who tweeted from the greatest number of locations, that is, the greatest number of distinct latitude and longitude pairs.
-- Load the data (the default loader expects tab-separated fields)
data = LOAD '/hadoopgyaan/user/mobiletweeter/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Combine the latitude and longitude coordinates into a tuple
-- (lat and lon are loaded as float, so the tuple schema uses float as well)
locations = FOREACH data GENERATE id, TOTUPLE(lat, lon) AS loc_tuple:tuple(lat:float, lon:float);
-- Keep only the unique locations for each user
distinct_locations = DISTINCT locations;
-- Count the unique locations for each user, order by that count, and return the top result
group_locations = GROUP distinct_locations BY id;
count_locations = FOREACH group_locations GENERATE group AS id, COUNT(distinct_locations) AS cnt;
ordered_counts = ORDER count_locations BY cnt DESC;
limit_counts = LIMIT ordered_counts 1;
DUMP limit_counts;
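The idea behind this script, counting distinct (latitude, longitude) pairs per user and taking the maximum, can be sketched in Python for a quick local check. The `most_mobile_user` function and the sample records are hypothetical and only illustrate the logic.

```python
from collections import defaultdict


def most_mobile_user(records):
    """Return (user_id, count) for the user with the most distinct (lat, lon) pairs."""
    locations = defaultdict(set)
    for user_id, lat, lon in records:
        # A set stores each (lat, lon) pair once, like DISTINCT in the script
        locations[user_id].add((lat, lon))
    # Pick the user whose set of locations is largest
    user_id = max(locations, key=lambda u: len(locations[u]))
    return user_id, len(locations[user_id])


# Hypothetical (user, lat, lon) records for illustration only
sample = [
    ("u1", 40.5, -74.0),
    ("u1", 40.5, -74.0),  # duplicate location, counted once
    ("u1", 34.0, -118.25),
    ("u2", 51.5, -0.25),
    ("u2", 48.75, 2.25),
    ("u2", 52.5, 13.5),
]
print(most_mobile_user(sample))
```

If two users tie, this sketch returns whichever was seen first, so a tie may resolve differently from the Pig script.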
I hope this tutorial helps you. If you have any questions or problems, please let me know.
Happy Hadooping with Patrick.