reduceByKey
Official documentation:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
Function signatures:
def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]
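This function merges the values V that share a key K by applying a reduce function.
The parameters are:
func: the reduce function, defined by the user; as the documentation above requires, it must be associative;
partitioner: the partitioner, which decides which partition each key is sent to;
numPartitions: the number of partitions; when only a count is given, the default partitioner, HashPartitioner, is used.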
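To make the partitioner parameter more concrete, here is a minimal plain-Java sketch of how a hash-based partitioner can map a key to a partition index. It does not use Spark; the class `PartitionIndexSketch` and the method `partitionFor` are hypothetical names chosen for illustration. It follows the non-negative modulo idea used by HashPartitioner, and shows why a bare `hashCode() % numPartitions` is risky in a custom partitioner: Java's `%` can return a negative number.

```java
final class PartitionIndexSketch {

    // Hypothetical helper: maps a key to a partition index in [0, numPartitions).
    // Null keys go to partition 0; other keys use a non-negative modulo of hashCode().
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        Integer key = -3; // Integer.hashCode() is the int value itself

        System.out.println(key.hashCode() % numPartitions);   // prints -1, not a valid index
        System.out.println(partitionFor(key, numPartitions)); // prints 1
    }
}
```

A partition index outside the range 0 to numPartitions - 1 is invalid, so `Math.floorMod` is the safer choice whenever a key's hash code may be negative.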
Source code analysis:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
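The source shows that reduceByKey() is implemented on top of combineByKey(): createCombiner is just the identity conversion (v => v), while mergeValue and mergeCombiners are both the user-supplied function. reduceByKey() is the counterpart of classic MapReduce, and its data flow is essentially the same as in Hadoop. combineByKey() enables combine() on the map side, so reduceByKey() does so by default as well: before the shuffle, a mapPartitions operation performs the combine and yields a MapPartitionsRDD; the shuffle then yields a ShuffledRDD; finally the reduce step (implemented with aggregate + mapPartitions()) yields another MapPartitionsRDD.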
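The data flow described above (combine on the map side, shuffle, then reduce) can be simulated with ordinary Java collections. The sketch below is not Spark code and makes several assumptions: input partitions are modeled as lists of entries, the shuffle is modeled as routing by a non-negative hash modulo, and the names `ReduceByKeyFlowSketch`, `combineLocally`, and `reduceByKey` are hypothetical. It needs Java 9 or later for `List.of` and `Map.entry`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

final class ReduceByKeyFlowSketch {

    // Step 1 (map side): merge values that share a key inside one input partition.
    static <K extends Comparable<K>, V> Map<K, V> combineLocally(
            List<Map.Entry<K, V>> partition, BinaryOperator<V> func) {
        Map<K, V> combined = new TreeMap<>();
        for (Map.Entry<K, V> pair : partition) {
            combined.merge(pair.getKey(), pair.getValue(), func);
        }
        return combined;
    }

    // Steps 2 and 3: route every pre-combined pair to an output partition by key
    // hash (the "shuffle"), then merge the partial results there (the "reduce").
    static <K extends Comparable<K>, V> List<Map<K, V>> reduceByKey(
            List<List<Map.Entry<K, V>>> inputPartitions,
            BinaryOperator<V> func,
            int numPartitions) {
        List<Map<K, V>> output = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            output.add(new TreeMap<>());
        }
        for (List<Map.Entry<K, V>> partition : inputPartitions) {
            for (Map.Entry<K, V> partial : combineLocally(partition, func).entrySet()) {
                int target = Math.floorMod(partial.getKey().hashCode(), numPartitions);
                output.get(target).merge(partial.getKey(), partial.getValue(), func);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Two input partitions holding word-count style (key, 1) pairs.
        List<List<Map.Entry<String, Integer>>> input = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1)),
                List.of(Map.entry("a", 1), Map.entry("b", 1)));

        System.out.println(reduceByKey(input, Integer::sum, 2)); // prints [{b=2}, {a=3}]
    }
}
```

In this simulation the first input partition sends only two partial results, (a, 2) and (b, 1), instead of three raw pairs, which is the point of combining before the shuffle.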
Example:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// javaSparkContext is assumed to be an already created JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);

// Convert to (K, V) format: each number n becomes the pair (n, 1)
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, 1);
    }
});

// Default partitioning
JavaPairRDD<Integer, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduceByKeyRDD.collect());

// Specify numPartitions
JavaPairRDD<Integer, Integer> reduceByKeyRDD2 = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
}, 2);
System.out.println(reduceByKeyRDD2.collect());

// Custom partitioner
JavaPairRDD<Integer, Integer> reduceByKeyRDD4 = javaPairRDD.reduceByKey(new Partitioner() {
    @Override
    public int numPartitions() {
        return 2;
    }

    @Override
    public int getPartition(Object o) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(o.toString().hashCode(), numPartitions());
    }
}, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduceByKeyRDD4.collect());
foldByKey
Official documentation:
Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
Function signatures:
def foldByKey(zeroValue: V, partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def foldByKey(zeroValue: V, numPartitions: Int, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def foldByKey(zeroValue: V, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
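This function folds and merges the values V that share a key K by applying a function; the zeroValue parameter supplies the initial value for V.
The parameters are:
zeroValue: the initial value;
numPartitions: the number of partitions; when only a count is given, the default partitioner, HashPartitioner, is used;
partitioner: the partitioner;
func: the merge function, defined by the user.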
Source code analysis:
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

  val cleanedFunc = self.context.clean(func)
  combineByKey[V]((v: V) => cleanedFunc(createZero(), v),
    cleanedFunc, cleanedFunc, partitioner)
}
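The source above serializes zeroValue once and deserializes it again for each key, so every key starts from its own copy. The following plain-Java sketch illustrates that idea only; it is not Spark's code. It uses Java's built-in serialization as a stand-in for the serializer Spark obtains from SparkEnv, and the names `ZeroValueCloneSketch` and `zeroFactory` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.function.Supplier;

final class ZeroValueCloneSketch {

    // Serialize zeroValue once, then hand out an independent copy on every get().
    static <V extends Serializable> Supplier<V> zeroFactory(V zeroValue) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(zeroValue);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        final byte[] zeroBytes = buffer.toByteArray();

        return () -> {
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(zeroBytes))) {
                @SuppressWarnings("unchecked")
                V copy = (V) in.readObject();
                return copy;
            } catch (IOException | ClassNotFoundException e) {
                throw new IllegalStateException("could not clone the zero value", e);
            }
        };
    }

    public static void main(String[] args) {
        Supplier<ArrayList<Integer>> createZero = zeroFactory(new ArrayList<Integer>());

        ArrayList<Integer> zeroForKeyA = createZero.get();
        ArrayList<Integer> zeroForKeyB = createZero.get();
        zeroForKeyA.add(42); // mutating the copy used for one key ...

        System.out.println(zeroForKeyB.isEmpty()); // ... leaves the other untouched: prints true
    }
}
```

The copy matters mainly for mutable zero values such as collections: if all keys shared one instance, a function that mutates its first argument would leak values from one key into another.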
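The implementation shows that foldByKey() is also built on combineByKey(): createCombiner merges a fresh copy of zeroValue with the first V seen for a key, that is func(zeroValue, v), while mergeValue and mergeCombiners are both the user-supplied function. When aggregating the values of a key, choose zeroValue with care so that it does not change the result: createCombiner runs once per key in every partition where that key occurs, so zeroValue may be merged in several times and must be a neutral element of func.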
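The warning about zeroValue can be checked with a small simulation in plain Java. This is a sketch under stated assumptions, not Spark's implementation: partitions are modeled as lists of entries, partial results are merged in partition order (a real shuffle gives no such ordering guarantee), the zero value is immutable so no cloning is needed, and the names `FoldByKeySketch` and `foldByKey` are hypothetical. It needs Java 9 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

final class FoldByKeySketch {

    static <K extends Comparable<K>, V> Map<K, V> foldByKey(
            List<List<Map.Entry<K, V>>> partitions, V zeroValue, BinaryOperator<V> func) {
        Map<K, V> result = new TreeMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, V> local = new TreeMap<>();
            for (Map.Entry<K, V> pair : partition) {
                K key = pair.getKey();
                if (local.containsKey(key)) {
                    // mergeValue: fold the next value into the existing combiner
                    local.put(key, func.apply(local.get(key), pair.getValue()));
                } else {
                    // createCombiner: first value of this key in this partition
                    local.put(key, func.apply(zeroValue, pair.getValue()));
                }
            }
            // mergeCombiners: merge this partition's partial results into the total
            local.forEach((key, partial) -> result.merge(key, partial, func));
        }
        return result;
    }

    public static void main(String[] args) {
        BinaryOperator<String> join = (v1, v2) -> v1 + ":" + v2;

        // The same two pairs for key 1, laid out in one partition or in two.
        List<List<Map.Entry<Integer, String>>> onePartition =
                List.of(List.of(Map.entry(1, "a"), Map.entry(1, "b")));
        List<List<Map.Entry<Integer, String>>> twoPartitions =
                List.of(List.of(Map.entry(1, "a")), List.of(Map.entry(1, "b")));

        System.out.println(foldByKey(onePartition, "X", join));  // prints {1=X:a:b}
        System.out.println(foldByKey(twoPartitions, "X", join)); // prints {1=X:a:X:b}
    }
}
```

Because "X" is not a neutral element of the joining function, the result depends on how the data happens to be partitioned. With a neutral pair such as 0 and addition, both layouts give the same answer.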
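Example:
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// javaSparkContext is assumed to be an already created JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7, 1, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
final Random rand = new Random(10);

// Convert to (K, V) format: each number n becomes (n, random digit as a String)
JavaPairRDD<Integer, String> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(Integer integer) throws Exception {
        return new Tuple2<Integer, String>(integer, Integer.toString(rand.nextInt(10)));
    }
});

// Default partitioning. "X" is not a neutral value for this function,
// so it appears in the output once per key per partition.
JavaPairRDD<Integer, String> foldByKeyRDD = javaPairRDD.foldByKey("X", new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD.collect());

// Specify numPartitions
JavaPairRDD<Integer, String> foldByKeyRDD1 = javaPairRDD.foldByKey("X", 2, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD1.collect());

// Custom partitioner
JavaPairRDD<Integer, String> foldByKeyRDD2 = javaPairRDD.foldByKey("X", new Partitioner() {
    @Override
    public int numPartitions() {
        return 3;
    }

    @Override
    public int getPartition(Object key) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
}, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD2.collect());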
Author: 小飞_侠_kobe
Link: https://www.jianshu.com/p/164c02b682ed