HADOOP GYAAN: Apache Hadoop : Social Media (Twitter) Data Analysis PIG Case Study

Social Media (Twitter) Data Analysis example

This post contains examples of social media analysis using Pig. Individual examples are described in detail below.

The dataset :

full_text.txt: Contains geo-tagged Twitter data with the following fields:

- Twitter user ID
- Timestamp of the tweet
- Location of the tweet
- Latitude of the tweet
- Longitude of the tweet
- Tweet content

cities15000.txt: Contains information on cities around the world with the following fields:
- Record ID
- City name
- Country code
- Latitude of the city
- Longitude of the city
- Timezone ID
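
As a rough sketch, cities15000.txt could be loaded in the same way as the tweet file. The path below is only a placeholder, the alias and field names are made up for illustration, and the default tab delimiter is an assumption about how the file is laid out:

	-- Placeholder path; assumes tab-delimited fields in the order listed above
	cities = LOAD '/path/to/cities15000.txt' AS (record_id:chararray, city:chararray, country_code:chararray, lat:float, lon:float, timezone:chararray);

	-- Print the schema to confirm the field names and types
	DESCRIBE cities;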

MostPopularHashtags.pig

This script finds the five most frequent hashtags in full_text.txt. For this example, a hashtag is defined as any string that starts with '#' followed by numbers, letters or underscores.


	-- Load Data
	data = LOAD '/hadoopgyaan/user/popularhashtags/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);

	-- Convert all tweets to lowercase to allow for accurate grouping
	lowercase = FOREACH data GENERATE LOWER(tweet) as tweet;

	-- Separate all tweets into individual words
	tweetwords = FOREACH lowercase GENERATE FLATTEN(TOKENIZE(tweet)) as token;

	-- Extract only hashtags from the collection of words (tokens without a hashtag become null)
	hashtags = FOREACH tweetwords GENERATE REGEX_EXTRACT(token, '#[a-z0-9_]+', 0) as hashtag;

	-- Group identical hashtags together and create an ordered list of aggregate counts
	grouphashtags = GROUP hashtags BY hashtag;
	counthashtags = FOREACH grouphashtags GENERATE group as hashtag, COUNT(hashtags) as cnt;
	orderhashtags = ORDER counthashtags BY cnt desc;
	limithashtags = LIMIT orderhashtags 5;
	DUMP limithashtags;
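
One possible refinement, shown here only as a sketch: REGEX_EXTRACT returns null for tokens that do not match the pattern, so those rows can be dropped before grouping instead of being carried through as a null group. The aliases onlyhashtags, grouponly, countonly, orderonly and toponly are names invented for this example; hashtags is the relation built in the script above.

	-- Keep only the tokens that actually matched the hashtag pattern
	onlyhashtags = FILTER hashtags BY hashtag IS NOT NULL;

	-- Same grouping, counting and ordering as before, on the filtered relation
	grouponly = GROUP onlyhashtags BY hashtag;
	countonly = FOREACH grouponly GENERATE group as hashtag, COUNT(onlyhashtags) as cnt;
	orderonly = ORDER countonly BY cnt desc;
	toponly = LIMIT orderonly 5;
	DUMP toponly;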

MostMobileTweeter.pig

This script finds the user who tweeted from the greatest number of locations, i.e. the greatest number of distinct latitude and longitude pairs.


	-- Load Data
	data = LOAD '/hadoopgyaan/user/mobiletweeter/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);

	-- Join latitude and longitude coordinates into a tuple (both fields are loaded as float)
	locations = FOREACH data GENERATE id, TOTUPLE(lat, lon) as loc_tuple:tuple(lat:float, lon:float);

	-- Find only unique locations for each user
	distinct_locations = DISTINCT(FOREACH locations GENERATE id, loc_tuple);

	-- Create an ordered list of the counts of unique locations for each user, returning the top result
	group_locations = GROUP distinct_locations BY id;
	count_locations = FOREACH group_locations GENERATE group as id, COUNT(distinct_locations) as cnt;
	ordered_counts = ORDER count_locations BY cnt desc;
	limit_counts = LIMIT ordered_counts 1;
	DUMP limit_counts;
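
The same count can probably also be written with a nested FOREACH, which applies DISTINCT inside each user's group instead of building a separate relation of tuples. This is only a sketch of an alternative, not a replacement for the script above; the aliases by_user, loc_counts, ordered_locs and top_user are invented for this example, and data is the relation loaded at the top of the script.

	-- Group all tweets by user
	by_user = GROUP data BY id;

	-- Inside each group, keep the distinct (lat, lon) pairs and count them
	loc_counts = FOREACH by_user {
		locs = data.(lat, lon);
		uniq = DISTINCT locs;
		GENERATE group as id, COUNT(uniq) as cnt;
	};

	-- Order by the number of distinct locations and return the top user
	ordered_locs = ORDER loc_counts BY cnt desc;
	top_user = LIMIT ordered_locs 1;
	DUMP top_user;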

Downloads :

Social Media (Twitter) Analysis Dataset

I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick..

Thursday, 23 June 2016
