We’re going to create a simple word count example: given a text file, count all occurrences of each word in it. The program consists of three classes:
- WordCountMapper.java: the mapper.
- WordCountReducer.java: the reducer.
- WordCount.java: the driver, where the job configuration (input type, output type, etc.) is done.
WordCountMapper.java
The WordCountMapper.java contains the map() function: it receives the input file line by line (in the value variable) and, for each word found in a line, emits the key-value pair (word, 1). For example, the line “this quick brown fox” produces (this, 1), (quick, 1), (brown, 1), (fox, 1).
Here is the source code of WordCountMapper.java:
package hadoopgyaan.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line on whitespace to get the individual words
        String[] words = line.split("\\s+");
        for (String w : words) {
            word.set(w);
            // Emit (word, 1) for every word in the line
            context.write(word, one);
        }
    }
}
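If you want to see exactly which pairs the loop above emits, here is a minimal standalone sketch of the same tokenization (plain Java, no Hadoop involved; the class name SplitCheck is just for illustration):

public class SplitCheck {
    public static void main(String[] args) {
        String line = "this quick brown fox";
        // Same split as in the mapper: break the line on whitespace
        for (String w : line.split("\\s+")) {
            System.out.println("(" + w + ", 1)");
        }
        // Prints (this, 1), (quick, 1), (brown, 1), (fox, 1), one per line
    }
}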
WordCountReducer.java
The WordCountReducer.java contains the reduce() function: for each word, it receives all the 1s emitted by the mappers and sums them up. Here is the source code of WordCountReducer.java:

package hadoopgyaan.mapreduce.wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Add up all the counts emitted for this word
        Iterator<IntWritable> itr = values.iterator();
        while (itr.hasNext()) {
            sum += itr.next().get();
        }
        // Emit (word, total_count)
        context.write(key, new IntWritable(sum));
    }
}
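Between the map and reduce phases, the framework groups all emitted values by key, so reduce() is called once per distinct word with an iterable of its counts. Conceptually, one such call works like this plain-Java sketch (the class name ReduceSketch is only for illustration):

import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    public static void main(String[] args) {
        // The shuffle has grouped three (fox, 1) pairs under one key
        String key = "fox";
        List<Integer> values = Arrays.asList(1, 1, 1);
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        System.out.println(key + "\t" + sum); // prints: fox	3
    }
}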
WordCount.java
WordCount.java is the driver: it configures and submits the MapReduce job. It sets the input and output paths, the mapper and reducer classes, and the output key and value types:
package hadoopgyaan.mapreduce.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        String input = "wordcount.txt";
        String output = "wordcountoutput";

        // Create a new job
        Job job = new Job();

        // Set the jar by class, and give the job a name so it can be
        // located in the distributed environment
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        // Set input and output paths; note that we use the default input
        // format, which is TextInputFormat (each record is a line of input)
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // Set mapper and reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
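One common optimization, not part of the driver above, is to run the reducer as a combiner so each mapper pre-aggregates its own (word, 1) pairs before the shuffle. This is safe for word count because summing is associative and commutative. If you want to try it, add one line to main() before the job is submitted:

// Optional: pre-aggregate counts on the map side before the shuffle
job.setCombinerClass(WordCountReducer.class);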
Note that here we don’t explicitly set the input format for the mapper but use the default one. In fact, the default input format of the Hadoop framework is TextInputFormat, in which each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value, a Text, is the content of the line. So our input file, whose content is the following text:
this quick brown fox jumps over the lazy dog 1
this quick brown fox jumps over the lazy dog 2
......
this quick brown fox jumps over the lazy dog n
is divided into these records. Note that the keys are NOT line numbers:
0, this quick brown fox jumps over the lazy dog 1
offset_of_line_2, this quick brown fox jumps over the lazy dog 2
......
offset_of_line_n, this quick brown fox jumps over the lazy dog n
Now put a paragraph into “wordcount.txt” and run the program; you should find the result in the “wordcountoutput” folder.
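To run the job on a cluster, package the three classes into a jar and launch it with hadoop jar, e.g. hadoop jar wordcount.jar hadoopgyaan.mapreduce.wordcount.WordCount (the jar name here is just an assumption). For the sample input above with, say, n = 3 lines, the output file (part-r-00000 by default) would contain one tab-separated word and count per line, sorted by key:

1	1
2	1
3	1
brown	3
dog	3
fox	3
jumps	3
lazy	3
over	3
quick	3
the	3
this	3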
Downloads:
1. Input sample file
I hope this tutorial helps you. If you have any questions or problems, let me know.
Happy Hadooping with Patrick..