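1. Introduction to Apache Pig
Pig is an Apache Software Foundation project: a platform for analyzing large
data sets. Programmers who have used MapReduce know that processing a complex
data set often requires writing several chained MapReduce jobs. Pig was created
to solve this problem. It consists of two parts:
* Pig Latin: a textual language for describing data flows;
* An execution environment for Pig Latin programs: a compiler that produces MapReduce programs.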
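Pig has three key properties:
(1) Ease of programming. A Pig Latin program is a sequence of "operations" or
"transformations". In effect, these operations express what would otherwise be
MapReduce programs as a data flow, which makes both simple and highly parallel
data analysis tasks easy to implement. On the Pig Latin console that Pig
provides, a few lines of Pig Latin are enough to process a terabyte-scale data set.
(2) Automatic optimization. The system optimizes Pig Latin code automatically,
so programmers can skip manual tuning and spend most of their time on the
semantics of the analysis rather than on efficiency.
(3) Extensibility. Programmers can write user-defined functions to suit their
own needs. Loading (load), storing (store), filtering (filter), and joining
(join) can all be customized.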
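To make the "data flow" idea concrete, here is a minimal Python sketch (plain Python, not Pig) in which every step takes a relation and returns a new one, and the filter step accepts a user-defined function. The names load, filter_rows, and group_count are hypothetical and chosen only for illustration; they are not part of Pig.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, ...]


def load(lines: Iterable[str], sep: str = " ") -> List[Row]:
    """Split each non-empty line into a tuple of fields (similar in spirit to LOAD)."""
    return [tuple(line.strip().split(sep)) for line in lines if line.strip()]


def filter_rows(rows: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows accepted by a user-defined predicate (similar in spirit to FILTER)."""
    return [row for row in rows if keep(row)]


def group_count(rows: List[Row], key_index: int = 0) -> List[Tuple[str, int]]:
    """Group rows by one field and count each group, returned sorted by key."""
    counts = Counter(row[key_index] for row in rows)
    return sorted(counts.items())
```

Chaining the three calls, for example group_count(filter_rows(load(lines), my_predicate)), reads like a short data flow: each intermediate result feeds the next step, and the predicate plays the role of a custom function.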
2. Installing Pig
(1) Extract the archive
[root@master hadoop]# tar -zxvf pig-0.13.0.tar.gz
(2) Configure environment variables
[root@master hadoop]# vim /etc/profile
export PIG_HOME=/home/hadoop/pig-0.13.0
export PATH=$PIG_HOME/bin:$PATH
export PIG_CLASSPATH=$HADOOP_HOME/conf
[root@master hadoop]# source /etc/profile
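A quick way to confirm that the three variables took effect is to inspect the environment. The sketch below is only an illustration: the helper name check_pig_env is hypothetical, and it checks nothing beyond the three settings listed above.

```python
import os
from typing import List, Mapping


def check_pig_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the given environment (empty if none)."""
    problems = []
    pig_home = env.get("PIG_HOME", "")
    if not pig_home:
        problems.append("PIG_HOME is not set")
    else:
        # PATH should contain $PIG_HOME/bin so that the pig launcher is found.
        pig_bin = os.path.join(pig_home, "bin")
        if pig_bin not in env.get("PATH", "").split(os.pathsep):
            problems.append("PATH does not contain " + pig_bin)
    if not env.get("PIG_CLASSPATH"):
        problems.append("PIG_CLASSPATH is not set")
    return problems
```

Calling check_pig_env(os.environ) in a fresh shell after source /etc/profile should return an empty list if the configuration was picked up.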
(3) Verify the installation
[root@master hadoop]# pig help
18/06/10 08:16:24 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
18/06/10 08:16:24 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
18/06/10 08:16:24 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2018-06-10 08:16:24,416 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2018-06-10 08:16:24,416 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1528632984415.log
2018-06-10 08:16:24,761 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-06-10 08:16:25,247 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException. File help does not exist
Details at logfile: /home/hadoop/pig_1528632984415.log
[root@master hadoop]#
Here Pig interprets help as the name of a script file, which is why the run ends with ERROR 2997. The version banner nonetheless confirms that Pig starts correctly. To print the usage text instead, run pig -help.
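The version banner in the startup log can also be checked programmatically. The following is a minimal sketch, assuming the banner keeps the "Apache Pig version X.Y.Z" wording seen in the log above; the function name pig_version is hypothetical.

```python
import re
from typing import Optional

# Matches the banner wording seen in the startup log, e.g. "Apache Pig version 0.13.0".
_VERSION_RE = re.compile(r"Apache Pig version (\d+(?:\.\d+)*)")


def pig_version(log_text: str) -> Optional[str]:
    """Return the version string found in Pig's startup log, or None if absent."""
    match = _VERSION_RE.search(log_text)
    return match.group(1) if match else None
```

Applied to the banner line from the log above, pig_version returns "0.13.0".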
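3. Running Pig
Pig has two execution modes: Local mode and MapReduce mode. Local mode can only
access the local file system; it is typically used for small data sets and does
not require a Hadoop cluster. MapReduce mode runs on a Hadoop cluster, where Pig
compiles Pig Latin programs into MapReduce jobs. There are three ways to run a
Pig program: script files, the Grunt shell, and embedding in a host program. All
three work in both Local mode and MapReduce mode, and execution is almost
identical in the two; you only need to specify which mode to use.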
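Since the two modes differ only in how Pig is launched, the choice can be captured in a small helper that builds the command line. This is a sketch under stated assumptions: the helper name pig_command is hypothetical, and it covers only the two modes described here, selected with Pig's -x option.

```python
from typing import List, Optional

# Only the two modes discussed in this article.
VALID_MODES = ("local", "mapreduce")


def pig_command(mode: str = "mapreduce", script: Optional[str] = None) -> List[str]:
    """Build the argument list for launching Pig in the given mode.

    Without a script the command opens the Grunt shell; with a script
    it runs that file as a batch job.
    """
    if mode not in VALID_MODES:
        raise ValueError("unsupported mode: " + mode)
    command = ["pig", "-x", mode]
    if script is not None:
        command.append(script)
    return command
```

Passing the result to subprocess.run(pig_command("local", "script.pig"), check=True) would launch the job, assuming pig is on the PATH.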
(1) Local mode
Data download: http://download.csdn.net/detail/xiangchengguan/7567759
Grunt Shell
[root@master hadoop]# pig -x local
18/06/10 08:24:42 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
18/06/10 08:24:42 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2018-06-10 08:24:42,632 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2018-06-10 08:24:42,632 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1528633482631.log
2018-06-10 08:24:42,683 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2018-06-10 08:24:42,929 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2018-06-10 08:24:42,929 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-06-10 08:24:42,931 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2018-06-10 08:24:43,565 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-06-10 08:24:43,693 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2018-06-10 08:24:43,696 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
grunt> ls
hdfs://master:9000/user/root/input <dir>
hdfs://master:9000/user/root/output <dir>
grunt> A = LOAD '/user/root/input/data.txt' USING PigStorage(' ') AS (ip:chararray);
grunt> B = FOREACH (GROUP A BY ip) GENERATE group AS ip, COUNT(A) AS clickes;
grunt> dump B; -- takes a while to run; the many warnings and progress messages can be ignored
(1.207.63.200,9)
(14.29.127.77,32)
(1.204.253.188,18)
(119.0.231.104,33)
(182.118.49.47,4)
(101.199.108.58,4)
(218.201.249.196,9)
grunt> C = ORDER B BY clickes DESC;
grunt> D = LIMIT C 3;
grunt> DUMP D; -- takes a while to run; as long as no errors appear, the messages can be ignored
(119.0.231.104,33)
(14.29.127.77,32)
(1.204.253.188,18)
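For readers new to Pig Latin, the same computation (group by IP, count, sort by count in descending order, keep the top three) can be sketched in plain Python. This is an illustrative equivalent, not how Pig executes the job; the function name top_ips is hypothetical, and the sketch assumes the IP address is the first space-separated field of each line.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_ips(lines: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Count lines per IP and return the n most frequent (ip, count) pairs."""
    # A: load the first space-separated field of every non-empty line as the IP.
    ips = [line.strip().split(" ")[0] for line in lines if line.strip()]
    # B: group by IP and count the members of each group.
    counts = Counter(ips)
    # C: order by count, largest first.
    ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    # D: keep only the first n rows.
    return ordered[:n]
```

Fed with input that has the per-IP counts shown in the DUMP B output above, top_ips returns the same three rows as the final DUMP: 119.0.231.104 with 33, 14.29.127.77 with 32, and 1.204.253.188 with 18.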
(2) Script files
A script file is essentially a batch file of Pig commands.
The script.pig file used here contains the following:
A = LOAD '/user/root/input/data.txt' USING PigStorage(' ') AS (ip:chararray);
B = FOREACH (GROUP A BY ip) GENERATE group AS ip, COUNT(A) AS clickes;
dump B;
store B into '/root/Desktop/result.txt'; -- writes B to the local file system; if /root/Desktop/result.txt already exists, delete it first, otherwise the run fails
Then run it with pig -x local script.pig.
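The failure caused by an existing output path can be avoided by clearing that path before each run. The sketch below is one way to do it in Local mode, where the output lives on the local file system; the helper name clear_output is hypothetical. It handles both a directory and a plain file, since STORE normally writes its result as a directory of part files even when the path ends in .txt.

```python
import os
import shutil


def clear_output(path: str) -> bool:
    """Delete an existing output file or directory; return True if something was removed."""
    if os.path.isdir(path) and not os.path.islink(path):
        # Typical case: a directory of part files left by an earlier run.
        shutil.rmtree(path)
        return True
    if os.path.lexists(path):
        # A plain file or a symbolic link with that name.
        os.remove(path)
        return True
    return False
```

Calling clear_output('/root/Desktop/result.txt') just before pig -x local script.pig makes repeated runs safe, at the cost of discarding the previous result.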
MapReduce mode
This is Pig's default mode: entering just pig in a terminal starts it in MapReduce mode.
[root@master hadoop]# pig
18/06/10 08:35:34 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
18/06/10 08:35:34 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
18/06/10 08:35:34 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2018-06-10 08:35:34,506 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2018-06-10 08:35:34,506 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1528634134505.log
2018-06-10 08:35:34,548 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2018-06-10 08:35:34,789 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-06-10 08:35:34,789 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2018-06-10 08:35:34,789 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
2018-06-10 08:35:35,175 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-06-10 08:35:36,034 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
All other operations are the same as in Local mode.