Revisiting WordCount


潘飞

Nov 6, 2010, 8:03:11 AM
to Hadoop中文用户组
A refresher:

Java code:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount
{
  // Mapper: split each input line on whitespace and emit a <word, 1> pair per token.
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
  {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens())
      {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum all the counts emitted for a given word.
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
  {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
      int sum = 0;
      while (values.hasNext())
      {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception
  {
    // Job configuration using the old org.apache.hadoop.mapred API.
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    // The reducer doubles as a combiner to pre-aggregate counts on the map side.
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
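Stripped of the MapReduce plumbing, the Map and Reduce classes together amount to whitespace tokenization plus per-word summation. A local, Hadoop-free sketch of the same logic (the class name LocalWordCount is made up here for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Tokenize on whitespace and count occurrences: this mirrors what the
    // Map class (emit <word, 1>) and Reduce class (sum the 1s) do together.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            Integer c = counts.get(word);
            counts.put(word, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("the quick fox jumps over the lazy dog the"));
    }
}
```

MapReduce distributes exactly this computation: the tokenizing loop runs on the mappers, and the summation runs on the reducers after the framework groups pairs by word.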


Save the file as WordCount.java, then compile it (the Hadoop core jar must be on the javac classpath):
javac WordCount.java

light@flight-T-6346c:~/SourceCode/Java/Hadoop/wordcount$ ll
total 24
drwxr-xr-x 2 flight flight 4096 2010-11-06 19:58 ./
drwxr-xr-x 3 flight flight 4096 2010-11-06 19:22 ../
-rw-r--r-- 1 flight flight 1516 2010-11-06 19:44 WordCount.class
-rw-r--r-- 1 flight flight 1872 2010-11-06 19:43 WordCount.java
-rw-r--r-- 1 flight flight 1918 2010-11-06 19:44 WordCount$Map.class
-rw-r--r-- 1 flight flight 1591 2010-11-06 19:44 WordCount$Reduce.class

Copy a text file into HDFS:

hadoop fs -copyFromLocal ~/Documents/openDNS /test/input


Then we can run our job (again with the Hadoop jars and configuration on the classpath):

java WordCount hdfs://localhost:9000/test/input hdfs://localhost:9000/test/output

One thing to note: before running, make sure hdfs://localhost:9000/test/output does not already exist; if it does, the job will fail. Also, you must give the full HDFS paths and cannot omit the hdfs://localhost:9000 prefix.
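If you want reruns to succeed without manually removing the output directory, a small pre-flight helper could delete it first. This is a sketch (the class and method names are made up here), and note that deleting the output defeats Hadoop's protection against overwriting a previous job's results, so use it deliberately:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Delete the job's output directory if it already exists, so that
    // JobClient.runJob does not abort because the output path is taken.
    // The recursive delete throws away any previous job's results.
    public static void deleteIfExists(Configuration conf, String dir) throws IOException {
        Path out = new Path(dir);
        FileSystem fs = out.getFileSystem(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
    }
}
```

You would call it in main before JobClient.runJob(conf), e.g. OutputCleaner.deleteIfExists(conf, args[1]).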

--
Stay Hungry. Stay Foolish.

潘飞

Nov 6, 2010, 8:08:19 AM
to Hadoop中文用户组
The second edition of Hadoop: The Definitive Guide can now be downloaded online; everyone can grab a copy and take a look o(∩∩)o... haha

潘飞

Nov 6, 2010, 12:15:28 PM
to Hadoop中文用户组
On Nov 6, 2010, at 8:03 PM, 潘飞 <cnw...@gmail.com> wrote:
The directory shouldn’t exist before running the job, as Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the output of a long job with another).

