Spark SQL: using collect_set over array values?

Since you can have only a few rows at this point, you can simply collect the attributes as-is and flatten the result (Spark >= 2.4):

    import org.apache.spark.sql.functions.{collect_set, flatten, array_distinct}
    import spark.implicits._  // for toDF and $"..."; already in scope in spark-shell

    val byState = Seq(
      ("Canada", "America", Seq("A", "B")),
      ("Belgium", "Europe", Seq("Z")),
      ("USA", "America", Seq("A")),
      ("France", "Europe", Seq("Y", "X"))
    ).toDF("country", "continent", "attributes")

    byState
      .groupBy("continent")
      // collect_set gathers the distinct arrays, flatten concatenates them,
      // array_distinct removes elements repeated across arrays
      .agg(array_distinct(flatten(collect_set($"attributes"))) as "attributes")
      .show()

    +---------+----------+
    |continent|attributes|
    +---------+----------+
    |   Europe| [Y, X, Z]|
    |  America|    [A, B]|
    +---------+----------+

In the general case things are much harder to handle. In many cases, if you expect large lists with many duplicates and many values per group, the optimal solution* is to recompute the results from scratch, i.e.

    input.groupBy($"continent").agg(collect_set($"attributes") as "attributes")

One possible alternative is to use an Aggregator:

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}
    import scala.collection.mutable.{Set => MSet}

    class MergeSets[T, U](f: T => Seq[U])(implicit enc: Encoder[Seq[U]]) extends
          Aggregator[T, MSet[U], Seq[U]] with Serializable {

      // Start every group with an empty mutable set
      def zero = MSet.empty[U]

      // Add every element extracted from the input record to the buffer
      def reduce(acc: MSet[U], x: T) = {
        for { v <- f(x) } acc.add(v)
        acc
      }

      // Combine two partial buffers
      def merge(acc1: MSet[U], acc2: MSet[U]) = {
        acc1 ++= acc2
      }

      def finish(acc: MSet[U]) = acc.toSeq
      def bufferEncoder: Encoder[MSet[U]] = Encoders.kryo[MSet[U]]
      def outputEncoder: Encoder[Seq[U]] = enc
    }

and apply it as follows:

    case class CountryAggregate(
      country: String, continent: String, attributes: Seq[String])

    byState
      .as[CountryAggregate]
      .groupByKey(_.continent)
      .agg(new MergeSets[CountryAggregate, String](_.attributes).toColumn)
      .toDF("continent", "attributes")
      .show()

    +---------+----------+
    |continent|attributes|
    +---------+----------+
    |   Europe| [X, Y, Z]|
    |  America|    [B, A]|
    +---------+----------+

The order of elements within each array is not guaranteed. This is clearly not a Java-friendly option, though.

See also "How to aggregate values into collection after groupBy?" (similar, but without the uniqueness constraint).

* This is because explode can be quite expensive, especially in older Spark versions, as can access to the external representation of SQL collections.
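For comparison, the explode-based variant that the footnote warns about might look like the following. This is only a sketch, not part of the original answer: the object name `ExplodeVariant`, the helper `mergeByContinent`, the local master and the application name are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, explode}

object ExplodeVariant {
  // Hypothetical helper: turn each array into one row per element,
  // then let collect_set deduplicate the elements per continent.
  def mergeByContinent(byState: DataFrame): DataFrame =
    byState
      .select(col("continent"), explode(col("attributes")).as("attribute"))
      .groupBy("continent")
      .agg(collect_set(col("attribute")).as("attributes"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: local run for illustration
      .appName("explode-variant")  // hypothetical application name
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer
    val byState = Seq(
      ("Canada", "America", Seq("A", "B")),
      ("Belgium", "Europe", Seq("Z")),
      ("USA", "America", Seq("A")),
      ("France", "Europe", Seq("Y", "X"))
    ).toDF("country", "continent", "attributes")

    mergeByContinent(byState).show()
    spark.stop()
  }
}
```

This yields the same sets per continent as the flatten approach, but it materializes one row per array element before aggregating, which is where the extra cost comes from. Note also that explode drops rows whose array is null or empty, so a continent with no attributes at all would disappear from the result.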
