
MapReduce in Action: A Hand-Written WordCount Example

By 刘信坚

Requirement: given a set of text files, count and output the total number of occurrences of each word.
The diagram below illustrates how MapReduce computes the WordCount:

https://img4.mukewang.com/5bde7868000102cd08710448.jpg

In the map phase, the input files are read line by line: the byte offset at which each line starts serves as the key (often loosely called the line number) and the line's content as the value. Every key/value pair the mappers emit is shuffled to the reduce phase, where values are grouped by key and the reduce method runs once per distinct key to merge the results.
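For example, a single input line "hadoop spark hadoop" flows through the two phases as sketched below (illustrative values; the key 0 is the line's byte offset):

map input:     (0, "hadoop spark hadoop")
map output:    ("hadoop", 1), ("spark", 1), ("hadoop", 1)
reduce input:  ("hadoop", [1, 1]), ("spark", [1])
reduce output: ("hadoop", 2), ("spark", 1)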

The code is as follows:

WordCountMapper.java

package com.lxj.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: the byte offset of each line is the input key, the line's content is the input value
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	private Text k = new Text();
	private IntWritable v = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws java.io.IOException, java.lang.InterruptedException {
		// 1. Convert the current line to a String
		String line = value.toString();
		// 2. Split the line into words on spaces
		String[] words = line.split(" ");
		// 3. Emit one key/value pair per word
		for (String word : words) {
			String trim = word.trim();
			if (!"".equals(trim)) {
				k.set(trim);
				// 4. The map phase only splits words and does not merge counts, so the value is a constant 1
				context.write(k, v);
			}
		}
	}
}
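Note that splitting on a single space leaves empty tokens when words are separated by consecutive spaces or tabs, which is why each token is trimmed and checked before being written. A more robust alternative (an adjustment not in the original code) is to split on any run of whitespace:

String[] words = line.split("\\s+");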


WordCountReducer.java

package com.lxj.wc;

import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: the output of the map phase becomes the input of the reduce phase
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	// The reduce method runs exactly once per distinct key
	@Override
	protected void reduce(Text text, Iterable<IntWritable> iterable, Context context)
			throws java.io.IOException, java.lang.InterruptedException {
		// 1. Sum all the values the map phase emitted for this key
		int sum = 0;
		Iterator<IntWritable> iterator = iterable.iterator();
		while (iterator.hasNext()) {
			sum += iterator.next().get();
		}
		// 2. Write out the word and its total count, skipping empty keys
		if (!text.toString().trim().equals("")) {
			context.write(text, new IntWritable(sum));
		}
	}
}
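Since the values arrive as an Iterable, the summation loop above can also be written with Java's enhanced for loop; the behavior is identical:

for (IntWritable value : iterable) {
	sum += value.get();
}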


WordCountDriver.java

package com.lxj.wc;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: wires the mapper and reducer together into a job
public class WordCountDriver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// 1. Load the configuration and create the job
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);
		
		// 2. Tell Hadoop which jar to ship by passing the current driver class
		job.setJarByClass(WordCountDriver.class);
		
		// 3. Set the mapper and reducer classes
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		
		// 4. Set the map output key/value types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		
		// 5. Set the final (job) output key/value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		// 6. Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// 7. Submit the job and wait for it to complete
		boolean result = job.waitForCompletion(true);
		System.exit( result ? 0 : 1);
		
	}
}
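Because summing counts is commutative and associative, the reducer class can also be reused as a combiner to pre-aggregate counts on the map side and cut shuffle traffic. This is an optional tweak, not part of the original driver; it would be a single extra line before submitting the job:

job.setCombinerClass(WordCountReducer.class);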


Prepare the following input files:

https://img1.mukewang.com/5bde7b17000108bf08960228.jpg

https://img4.mukewang.com/5bde7b250001872709560285.jpg

https://img2.mukewang.com/5bde7b3700016ad810300299.jpg

1. Running the job locally produces the following results:

https://img2.mukewang.com/5bde7b61000194c307260156.jpg
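When testing locally, the driver is simply run from the IDE with two program arguments: the input directory and the output directory (which must not already exist). The paths below are illustrative, not the ones in the screenshot:

Program arguments: e:/wordcount/input e:/wordcount/output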


Astonished	1
At	1
But	1
Fate	1
He	2
Immediately	1
Many	1
O	1
Phoenix	1
a	1
admired,	1
again	1
ages	1
al	1
amongst	1
an	1
and	5
animals,	1
appeared	1
around	1
at	1
away	1
beasts,	1
beauty,	1
been	2
began	1
being	1
birds	1
both	1
broke	1
compassion,	1
different	1
elasticserach	1
euraka	1
eye	1
flocked	1
friend	1
great	1
had	2
hadoop	1
hard	1
has	2
he	1
him	3
his	1
in	2
into	1
javaee	1
kinds	1
know	1
last	1
look	1
loved	1
loving	1
map	1
mate	1
most	1
mysql	1
neither	1
never	1
nor	1
now	1
of	4
or	1
out	1
passed	1
phoenix	1
pleasure	1
praise.	1
prudent	1
redis	2
reduce	1
seen	1
shiro	1
short	1
sighed	1
since	1
spark	1
ssh	1
ssm	1
stared	1
the	5
them	1
they	2
time,	1
to	2
unhappy	1
upon	1
will	1
wisest	1
with	1
world.	1
yarn	1
zookeeper	1

2. Running on a Hadoop cluster:

First package the project as a jar and upload it to the cluster; with the input files placed on HDFS for analysis, execute the following command:

https://img1.mukewang.com/5bde7f9b0001a43a18310055.jpg
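The command in the screenshot follows the standard hadoop jar form; with illustrative jar and path names (the real ones appear in the image), it looks like this:

hadoop jar wc.jar com.lxj.wc.WordCountDriver /wordcount/input /wordcount/output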

After the job finishes successfully, check the results:

https://img4.mukewang.com/5bde7fdd00012bcd12070055.jpg

https://img.mukewang.com/5bde7ffa0001f01905720716.jpg
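Equivalently, the output can be inspected with the standard HDFS shell commands (the output path here is illustrative):

hadoop fs -ls /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000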

Of course, you can also download and view the output directly from the web UI:

https://img.mukewang.com/5bde801d00014f1818100728.jpg


