We’re going to create a simple word count example: given a text file, count all occurrences of each word in it. The program consists of three classes:
- WordCountMapper.java: the mapper.
- WordCountReducer.java: the reducer.
- WordCount.java: the driver, where the job configuration (input type, output type, etc.) is done.
WordCountMapper.java
The WordCountMapper.java contains the map() function: it receives the input file line by line (in the value variable) and, for each word found in a line, emits the key-value pair (word, 1). For example, the line “this quick brown fox” produces (this, 1), (quick, 1), (brown, 1), (fox, 1).
Here is the source code of WordCountMapper.java:
package hadoopgyaan.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line on whitespace to get the individual words
        String[] words = line.split("\\s+");
        for (String w : words) {
            word.set(w);
            // Emit (word, 1) for every word in the line
            context.write(word, one);
        }
    }
}
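If you want to see exactly which pairs the loop above emits, here is a minimal standalone sketch of the same tokenization (plain Java, no Hadoop involved; the class name SplitCheck is just for illustration):

public class SplitCheck {
    public static void main(String[] args) {
        String line = "this quick brown fox";
        // Same split as in the mapper: break the line on whitespace
        for (String w : line.split("\\s+")) {
            System.out.println("(" + w + ", 1)");
        }
        // Prints (this, 1), (quick, 1), (brown, 1), (fox, 1), one per line
    }
}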
WordCountReducer.java
The WordCountReducer.java contains the reduce() function: for each word, it receives all the 1s emitted by the mappers and sums them up. Here is the source code of WordCountReducer.java:

package hadoopgyaan.mapreduce.wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Add up all the counts emitted for this word
        Iterator<IntWritable> itr = values.iterator();
        while (itr.hasNext()) {
            sum += itr.next().get();
        }
        // Emit (word, total_count)
        context.write(key, new IntWritable(sum));
    }
}
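Between the map and reduce phases, the framework groups all emitted values by key, so reduce() is called once per distinct word with an iterable of its counts. Conceptually, one such call works like this plain-Java sketch (the class name ReduceSketch is only for illustration):

import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    public static void main(String[] args) {
        // The shuffle has grouped three (fox, 1) pairs under one key
        String key = "fox";
        List<Integer> values = Arrays.asList(1, 1, 1);
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        System.out.println(key + "\t" + sum); // prints: fox	3
    }
}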
WordCount.java
WordCount.java is the driver: it configures and submits the MapReduce job. It sets the input and output paths, the mapper and reducer classes, and the output key and value types:
package hadoopgyaan.mapreduce.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        String input = "wordcount.txt";
        String output = "wordcountoutput";

        // Create a new job
        Job job = new Job();

        // Set the jar by class, and give the job a name so it can be
        // located in the distributed environment
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        // Set input and output paths; note that we use the default input
        // format, which is TextInputFormat (each record is a line of input)
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // Set mapper and reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
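One common optimization, not part of the driver above, is to run the reducer as a combiner so each mapper pre-aggregates its own (word, 1) pairs before the shuffle. This is safe for word count because summing is associative and commutative. If you want to try it, add one line to main() before the job is submitted:

// Optional: pre-aggregate counts on the map side before the shuffle
job.setCombinerClass(WordCountReducer.class);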
Note that here we don’t explicitly set the input format for the mapper but use the default one. In fact, the default input format of the Hadoop framework is TextInputFormat, in which each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value, a Text, is the content of the line. So our input file, whose content is the following text:
this quick brown fox jumps over the lazy dog 1
this quick brown fox jumps over the lazy dog 2
......
this quick brown fox jumps over the lazy dog n
is divided into these records. Note that the keys are NOT line numbers:
0, this quick brown fox jumps over the lazy dog 1
offset_of_line_2, this quick brown fox jumps over the lazy dog 2
......
offset_of_line_n, this quick brown fox jumps over the lazy dog n
Now put a paragraph into “wordcount.txt” and run the program; you should find the result in the “wordcountoutput” folder.
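To run the job on a cluster, package the three classes into a jar and launch it with hadoop jar, e.g. hadoop jar wordcount.jar hadoopgyaan.mapreduce.wordcount.WordCount (the jar name here is just an assumption). For the sample input above with, say, n = 3 lines, the output file (part-r-00000 by default) would contain one tab-separated word and count per line, sorted by key:

1	1
2	1
3	1
brown	3
dog	3
fox	3
jumps	3
lazy	3
over	3
quick	3
the	3
this	3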
Downloads:
1. Input sample file
I hope this tutorial helps you. If you have any questions or problems, let me know.
Happy Hadooping with Patrick..