Spark-Core源码精读(14)、Shuffle--Write部分-原创手记-慕课网

前面我们分析了Spark中具体的Task的提交和运行过程，从本文开始我们开始进入Shuffle的世界，Shuffle对于分布式计算来说是至关重要的部分，它直接影响了分布式系统的性能，所以我将尽可能进行详细的分析。

我们首先来看Shuffle中的Write部分：

override def runTask(context: TaskContext): MapStatus = {  // Deserialize the RDD using the broadcast variable.
  val deserializeStartTime = System.currentTimeMillis()  val ser = SparkEnv.get.closureSerializer.newInstance()  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  metrics = Some(context.taskMetrics)  var writer: ShuffleWriter[Any, Any] = null
  try {    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {    case e: Exception =>      try {        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {        case e: Exception =>
          log.debug("Could not stop writer", e)
      }      throw e
  }
}

首先根据SparkEnv获得ShuffleManager，ShuffleManager是为Spark shuffle系统而抽象的可插拔的接口，它被创建在Driver和Executor上，具体是在SparkEnv实例化的时候进行配置的，源码如下：

val shortShuffleMgrNames = Map(  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

可以看到是由"spark.shuffle.manager"配置项来决定具体使用哪种实现方式，默认情况下使用的是sort的方式(本文参考的是Spark 1.6.3版本的源码)。Driver通过ShuffleManager来注册shuffles，Executors可以通过它来读写数据。

获得到ShuffleManager后，就根据它来获得ShuffleWriter(根据具体ShuffleManager的getWriter方法获得)，顾名思义就是用来写数据，而接下来的工作就是调用具体的ShuffleWriter的write方法来进行写数据的工作。

先来看一下getWriter方法，这里的第一个参数dep.shuffleHandle是ShuffleDependency的一个成员变量：

val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
  shuffleId, _rdd.partitions.size, this)

这里的registerShuffle方法用来向ShuffleManager注册一个shuffle并且获得一个用来传递任务的句柄，会根据不同ShuffleManager有不同的实现，HashShuffleManager返回的是BaseShuffleHandle，而SortShuffleManager又会根据不同的情况返回BypassMergeSortShuffleHandle、SerializedShuffleHandle或者BaseShuffleHandle。

HashShuffleManager：

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {  new BaseShuffleHandle(shuffleId, numMaps, dependency)
}

SortShuffleManager：

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {  if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) {    // 这里注释说的很清楚，根据spark.shuffle.sort.bypassMergeThreshold的值（默认是200）判断是否需要进行Map端的聚合操作
    // 如果partitions的个数小于200就不进行该操作
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])  // 这里是判断是否使用tungsten的方式
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {    // 如果不是上述两种方式，就使用默认的方式
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

使用一张图来总结一下上面的过程：

webp

然后我们来看ShuffleManager的getWriter方法：

HashShuffleManager：

/** Get a writer for a given partition. Called on executors by map tasks. */override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
    : ShuffleWriter[K, V] = {  new HashShuffleWriter(
    shuffleBlockResolver, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
}

SortShuffleManager：

/** Get a writer for a given partition. Called on executors by map tasks. */override def getWriter[K, V](
    handle: ShuffleHandle,
    mapId: Int,
    context: TaskContext): ShuffleWriter[K, V] = {
  numMapsForShuffle.putIfAbsent(
    handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)  val env = SparkEnv.get
  handle match {    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>      new UnsafeShuffleWriter(
        env.blockManager,
        shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
        context.taskMemoryManager(),
        unsafeShuffleHandle,
        mapId,
        context,
        env.conf)    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>      new BypassMergeSortShuffleWriter(
        env.blockManager,
        shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
        bypassMergeSortHandle,
        mapId,
        context,
        env.conf)    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>      new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
  }
}

webp

从上述源码可以看出，getWriter方法内部实际上是根据传进来的ShuffleHandle的具体类型来判断使用哪种ShuffleWriter的，然后最终执行ShuffleWriter的write方法，下面我们就分为HashShuffleManager和SortShuffleManager两种类型来进行分析。

1、HashShuffleManager

从上面的源码中可以看到HashShuffleManager最终实例化的是HashShuffleWriter，实例化的时候有一行比较重要的代码：

private val shuffle = shuffleBlockResolver.forMapTask(dep.shuffleId, mapId, numOutputSplits, ser,
  writeMetrics)

我们来看forMapTask这个方法：

def forMapTask(shuffleId: Int, mapId: Int, numReducers: Int, serializer: Serializer,
    writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {  // 为每个ShuffleMapTask实例化了一个ShuffleWriterGroup
  new ShuffleWriterGroup {    // 实例化ShuffleState并保存shuffleId和ShuffleState的对应关系
    shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numReducers))    // 根据shuffleId获得对应的ShuffleState
    private val shuffleState = shuffleStates(shuffleId)    val openStartTime = System.nanoTime    val serializerInstance = serializer.newInstance()    // 获得该ShuffleWriterGroup的writers
    val writers: Array[DiskBlockObjectWriter] = {      Array.tabulate[DiskBlockObjectWriter](numReducers) { bucketId =>        // 生成ShuffleBlockId，是一个case class，我们可以通过name方法看到其具体的组成：
        // override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)        // 通过DiskBlockManager的getFile方法获得File
        val blockFile = blockManager.diskBlockManager.getFile(blockId)        // 临时目录
        val tmp = Utils.tempFileWith(blockFile)        // 使用BlockManager的getDiskWriter方法获得DiskBlockObjectWriter
        // 注意这里的bufferSize默认情况下是32kb，可以通过spark.shuffle.file.buffer进行配置
        // private val bufferSize = conf.getSizeAsKb("spark.shuffle.file.buffer", "32k").toInt * 1024
        blockManager.getDiskWriter(blockId, tmp, serializerInstance, bufferSize, writeMetrics)
      }
    }    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, so should be included in the shuffle write time.
    writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)    override def releaseWriters(success: Boolean) {
      shuffleState.completedMapTasks.add(mapId)
    }
  }
}

为每个ShuffleMapTask实例化一个ShuffleWriterGroup，其中包含了一组writers，每个writer对应一个reducer。

然后我们进入到HashShuffleWriter的write方法：

/** Write a bunch of records to this task's output */override def write(records: Iterator[Product2[K, V]]): Unit = {  val iter = if (dep.aggregator.isDefined) {    // 判断是否进行map端的combine操作
    if (dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    records
  }  for (elem <- iter) {    val bucketId = dep.partitioner.getPartition(elem._1)
    shuffle.writers(bucketId).write(elem._1, elem._2)
  }
}

如果没有进行Map端的combine操作，根据key获得bucketId，实际上是进行取模运算：

def nonNegativeMod(x: Int, mod: Int): Int = {  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

下面就是根据bucketId获得ShuffleWriterGroup中对应的writer，然后执行其write方法，将key和value写入到对应的block file中：

/**
 * Writes a key-value pair.
 */def write(key: Any, value: Any) {  if (!initialized) {
    open()
  }
  objOut.writeKey(key)
  objOut.writeValue(value)
  recordWritten()
}

写入完成后，我们回到ShuffleMapTask的runTask方法中，接下来执行的是：

writer.stop(success = true).get

即HashShuffleWriter的stop方法，该方法返回的是Option[MapStatus]，最主要的一句代码为：

Some(commitWritesAndBuildStatus())

进入到commitWritesAndBuildStatus方法：

private def commitWritesAndBuildStatus(): MapStatus = {  // Commit the writes. Get the size of each bucket block (total block size).
  val sizes: Array[Long] = shuffle.writers.map { writer: DiskBlockObjectWriter =>    // 调用DiskBlockObjectWriter的commitAndClose方法
    writer.commitAndClose()    // 获得每个bucket block的大小
    writer.fileSegment().length
  }  // 重命名所有的shuffle文件，每个executor只有一个ShuffleBlockResolver，所以使用了synchronized关键字
  // rename all shuffle files to final paths
  // Note: there is only one ShuffleBlockResolver in executor
  shuffleBlockResolver.synchronized {
    shuffle.writers.zipWithIndex.foreach { case (writer, i) =>      val output = blockManager.diskBlockManager.getFile(writer.blockId)      if (sizes(i) > 0) {        if (output.exists()) {          // Use length of existing file and delete our own temporary one
          sizes(i) = output.length()
          writer.file.delete()
        } else {          // Commit by renaming our temporary file to something the fetcher expects
          if (!writer.file.renameTo(output)) {            throw new IOException(s"fail to rename ${writer.file} to $output")
          }
        }
      } else {        if (output.exists()) {
          output.delete()
        }
      }
    }
  }  MapStatus(blockManager.shuffleServerId, sizes)
}

这里我们需要注意最终返回的是封装的MapStatus，它记录了产生的磁盘文件的位置，然后Executor中的MapOutputTrackerWorker将MapStatus信息发送给Driver中的MapOutputTrackerMaster，后面Shuffle Read的之后就会从Driver的MapOutputTrackerMaster获取MapStatus的信息，也就是获取对应的上一个ShuffleMapTask的计算结果的输出的文件位置信息。

Map端combine的情况

再来补充一下Map端combine的情况：

if (dep.mapSideCombine) {
  dep.aggregator.get.combineValuesByKey(records, context)
} else {
  records
}

进入到Aggregator的combineValuesByKey方法：

def combineValuesByKey(
    iter: Iterator[_ <: Product2[K, V]],
    context: TaskContext): Iterator[(K, C)] = {  // 首先实例化ExternalAppendOnlyMap
  val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners)  // 执行ExternalAppendOnlyMap的insertAll方法
  combiners.insertAll(iter)
  updateMetrics(context, combiners)
  combiners.iterator
}

首先实例化了ExternalAppendOnlyMap，然后执行ExternalAppendOnlyMap的insertAll方法：

def insertAll(entries: Iterator[Product2[K, V]]): Unit = {  if (currentMap == null) {    throw new IllegalStateException(      "Cannot insert new elements into a map after calling iterator")
  }  // An update function for the map that we reuse across entries to avoid allocating
  // a new closure each time
  var curEntry: Product2[K, V] = null
  val update: (Boolean, C) => C = (hadVal, oldVal) => {    if (hadVal) mergeValue(oldVal, curEntry._2) else createCombiner(curEntry._2)
  }  while (entries.hasNext) {
    curEntry = entries.next()    val estimatedSize = currentMap.estimateSize()    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }    if (maybeSpill(currentMap, estimatedSize)) {
      currentMap = new SizeTrackingAppendOnlyMap[K, C]
    }
    currentMap.changeValue(curEntry._1, update)
    addElementsRead()
  }
}

具体的实现方式就不再解释了，简单的说就是将key相同的value进行合并，如果某个key有对应的值就执行merge(也可以理解为更新)操作，如果没有对应的值就新建一个combiner，需要注意的是如果内存不够的话就会将数据spill到磁盘。

HashShuffle方式的Shuffle Write部分至此结束，使用一张图概括一下：

webp

接下来看一下SortShuffle方式的具体流程。

2、SortShuffleManager

为了解决Hash Shuffle产生小文件过多的问题，产生了Sort Shuffle，解下来我们就一起看一下Sort Shuffle的Write部分。

上文中我们已经提到，SortShuffleManager中的getWriter会根据不同的ShuffleHandle产生相应的ShuffleWriter：

SerializedShuffleHandle 对应 UnsafeShuffleWriter
BypassMergeSortShuffleHandle 对应 BypassMergeSortShuffleWriter
BaseShuffleHandle 对应 SortShuffleWriter

下面我们分别进行分析：

2.1、BaseShuffleHandle & SortShuffleWriter

首先来看一下SortShuffleWriter，直接来看它的write方法：

/** Write a bunch of records to this task's output */override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)  val tmp = Utils.tempFileWith(output)  try {    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}

我们看到，内部有一个非常重要的部分，即ExternalSorter，而关于ExternalSorter的使用，源码中的注释说的很清楚了，这里就不做翻译了：

/**
* Users interact with this class in the following way:
*
* 1. Instantiate an ExternalSorter.
*
* 2. Call insertAll() with a set of records.
*
* 3. Request an iterator() back to traverse sorted/aggregated records.
*     - or -
*    Invoke writePartitionedFile() to create a file containing sorted/aggregated outputs
*    that can be used in Spark's sort shuffle.
*/

我们就根据这三个步骤进行说明：

2.1.1 第一步

首先就是实例化ExternalSorter，这里有一个判断，如果要进行map端的combine操作的话就需要指定Aggregator和Ordering，否则这两个参数为None。我们熟悉的reduceByKey就进行了Map端的combine操作，如下图所示：

webp

2.1.2 第二步（这一步非常重要）

通过判断是否进行Map端combine操作而实例化出不同的ExternalSorter后，就会调用insertAll方法，将输入的记录写入到内存中，如果内存不足就spill到磁盘中，具体的实现我们来看insertAll方法：

def insertAll(records: Iterator[Product2[K, V]]): Unit = {  // TODO: stop combining if we find that the reduction factor isn't high
  // 首先判断是否需要进行Map端的combine操作
  val shouldCombine = aggregator.isDefined  if (shouldCombine) {    // 如果需要进行map端的combine操作，使用PartitionedAppendOnlyMap作为缓存
    // 将record根据key对value按照获得的聚合函数进行聚合操作(combine)
    // Combine values in-memory first using our AppendOnlyMap
    // 获得聚合函数，例如我们使用reduceByKey时编写的函数
    val mergeValue = aggregator.get.mergeValue    // 获取createCombiner函数
    val createCombiner = aggregator.get.createCombiner    var kv: Product2[K, V] = null
    // 定义update函数，主要的逻辑是：如果某个key已经存在记录(record)就使用上面获取
    // 的聚合函数进行聚合操作，如果还不存在记录就使用createCombiner方法进行初始化操作
    val update = (hadValue: Boolean, oldValue: C) => {      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }    // 循环遍历所有的records(记录)
    while (records.hasNext) {      // 记录spill的频率，每当read一条record的时候都会记录一次
      addElementsRead()      // 使用kv储存当前读的record
      kv = records.next()      // 这里的map和下面else中的buffer都是用来缓存的数据结构
      // 如果进行Map端的聚合操作，使用的就是PartitionedAppendOnlyMap[K, C]
      // 如果不进行Map端的聚合操作，使用的是PartitionedPairBuffer[K, C]
      // 调用上面定义的update函数将记录插入到map中
      map.changeValue((getPartition(kv._1), kv._1), update)      // 判断是否要进行spill操作
      maybeSpillCollection(usingMap = true)
    }
  } else {    // 如果不需要进行Map端的聚合操作，就直接将记录放到buffer(PartitionedPairBuffer)中
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}

具体的流程用注释的方式写在了上面的源码中，这里我们先来看一下PartitionedAppendOnlyMap和PartitionedPairBuffer分别是如何工作的：

PartitionedAppendOnlyMap：

首先来看PartitionedAppendOnlyMap的changeValue实现，实际上，PartitionedAppendOnlyMap是继承自SizeTrackingAppendOnlyMap，而SizeTrackingAppendOnlyMap又继承自AppendOnlyMap，这里调用的changeValue方法实际上是SizeTrackingAppendOnlyMap的changeValue方法：

override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {  // 首先调用父类的changeValue方法
  val newValue = super.changeValue(key, updateFunc)  // 然后调用SizeTracker接口的afterUpdate方法
  super.afterUpdate()  // 返回newValue
  newValue
}

父类(AppendOnlyMap)的changeValue方法：

def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  assert(!destroyed, destructionMessage)  val k = key.asInstanceOf[AnyRef]  // key为空时候的处理，增加长度
  if (k.eq(null)) {    if (!haveNullValue) {
      incrementSize()
    }
    nullValue = updateFunc(haveNullValue, nullValue)
    haveNullValue = true
    return nullValue
  }  var pos = rehash(k.hashCode) & mask  var i = 1
  while (true) {    // 这里的data是一个数组，用来同时存储key和value：key0, value0, key1, value1, key2, value2, etc.
    // 即2 * pos上存储的是key的值，2 * pos + 1上存储的是value的值
    val curKey = data(2 * pos)    // 如果key已经存在，就调用updateFunc方法更新value
    if (k.eq(curKey) || k.equals(curKey)) {      val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
      data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]      return newValue
    } else if (curKey.eq(null)) {      // 如果key不存在就将该key和对应的value添加到data这个数组中
      val newValue = updateFunc(false, null.asInstanceOf[V])
      data(2 * pos) = k
      data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
      incrementSize()      return newValue
    } else {      // 否则继续计算位置(pos)
      val delta = i
      pos = (pos + delta) & mask
      i += 1
    }
  }  null.asInstanceOf[V] // Never reached but needed to keep compiler happy}

然后是SizeTracker接口的afterUpdate方法

protected def afterUpdate(): Unit = {
  numUpdates += 1
  if (nextSampleNum == numUpdates) {
    takeSample()
  }
}

更新数据的更新次数，如果更新的次数达到nextSampleNum，就执行采样操作，主要用来评估内存的使用情况。

PartitionedPairBuffer：

再来看PartitionedPairBuffer的insert方法，也就是不进行Map端combine操作的情况：

def insert(partition: Int, key: K, value: V): Unit = {  // 如果当前的大小达到了capacity的值就需要扩大该数组
  if (curSize == capacity) {
    growArray()
  }  // 存储key，这里存储的是(partition Id, key)的格式
  data(2 * curSize) = (partition, key.asInstanceOf[AnyRef])  // 存储value
  data(2 * curSize + 1) = value.asInstanceOf[AnyRef]
  curSize += 1
  // 参考上面PartitionedAppendOnlyMap的部分
  afterUpdate()
}

直接将数据存储到buffer中。

执行完上面的更新数据操作后，就要判断是否要将数据spill到磁盘，即maybeSpillCollection方法：

private def maybeSpillCollection(usingMap: Boolean): Unit = {  var estimatedSize = 0L  // 这里需要判断使用的是map(PartitionedAppendOnlyMap)还是buffer(PartitionedPairBuffer)
  // 如果true就是map，false就是buffer
  if (usingMap) {    // 估计当前map的内存占用大小
    estimatedSize = map.estimateSize()    // 如果超过内存的限制，就将缓存中的数据spill到磁盘
    if (maybeSpill(map, estimatedSize)) {      // spill到磁盘后，重置缓存
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {    // 不进行Map端聚合操作的情况
    estimatedSize = buffer.estimateSize()    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

上面代码的主要作用就是估计当前缓存(map或者buffer)使用内存的大小，如果超过了内存使用的限制，就要将缓存中的数据spill到磁盘中，同时重置当前的缓存。

下面就来看一下maybeSpill方法：

// 如果成功spill到磁盘就返回true，否则返回falseprotected def maybeSpill(collection: C, currentMemory: Long): Boolean = {  var shouldSpill = false
  // 在进行真正的spill操作之前向TaskMemoryManager申请再多分配一些内存
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {    // Claim up to double our current memory from the shuffle memory pool
    val amountToRequest = 2 * currentMemory - myMemoryThreshold    val granted =
      taskMemoryManager.acquireExecutionMemory(amountToRequest, MemoryMode.ON_HEAP, null)
    myMemoryThreshold += granted    // If we were granted too little memory to grow further (either tryToAcquire returned 0,
    // or we already had more memory than myMemoryThreshold), spill the current collection
    // 如果内存仍然不够用，就认定为需要spill到磁盘
    shouldSpill = currentMemory >= myMemoryThreshold
  }  // 如果内存中元素的个数超过了强制spill的上限也会认定为需要进行spill操作
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold  // Actually spill
  // 接下来就真正将数据spill到磁盘
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)    // spill操作
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}

在进行真正的spill操作之前会向TaskMemoryManager申请再多分配一些内存，如果还不能够满足，或者不能分配更多的内存，或者内存中元素的个数超过了强制spill的上限，最终就会执行spill操作，接下来进入spill方法：

// 这里的collection就是指的map或者bufferoverride protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {  // Because these files may be read during shuffle, their compression must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more context.
  // 获取临时的BlockId(TempShuffleBlockId)及对应的File
  val (blockId, file) = diskBlockManager.createTempShuffleBlock()  // These variables are reset after each flush
  var objectsWritten: Long = 0
  var spillMetrics: ShuffleWriteMetrics = null
  var writer: DiskBlockObjectWriter = null
  def openWriter(): Unit = {
    assert (writer == null && spillMetrics == null)
    spillMetrics = new ShuffleWriteMetrics
    writer = blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)
  }  // 获得DiskWriter(DiskBlockObjectWriter)
  openWriter()  // List of batch sizes (bytes) in the order they are written to disk
  // 用来储存每个batch对应的size
  val batchSizes = new ArrayBuffer[Long]  // How many elements we have in each partition
  // 用来储存每个partition有多少元素
  val elementsPerPartition = new Array[Long](numPartitions)  // Flush the disk writer's contents to disk, and update relevant variables.
  // The writer is closed at the end of this process, and cannot be reused.
  def flush(): Unit = {    val w = writer
    writer = null
    w.commitAndClose()
    _diskBytesSpilled += spillMetrics.shuffleBytesWritten
    batchSizes.append(spillMetrics.shuffleBytesWritten)
    spillMetrics = null
    objectsWritten = 0
  }  var success = false
  try {    // 排序部分的操作，返回迭代器
    val it = collection.destructiveSortedWritablePartitionedIterator(comparator)    // 循环的到的迭代器，执行write操作
    while (it.hasNext) {      val partitionId = it.nextPartition()
      require(partitionId >= 0 && partitionId < numPartitions,        s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
      it.writeNext(writer)
      elementsPerPartition(partitionId) += 1
      objectsWritten += 1
      // 如果写的对象达到serializerBatchSize的个数时就进行flush操作
      if (objectsWritten == serializerBatchSize) {
        flush()
        openWriter()
      }
    }    if (objectsWritten > 0) {
      flush()
    } else if (writer != null) {      val w = writer
      writer = null
      w.revertPartialWritesAndClose()
    }
    success = true
  } finally {    if (!success) {      // This code path only happens if an exception was thrown above before we set success;
      // close our stuff and let the exception be thrown further
      if (writer != null) {
        writer.revertPartialWritesAndClose()
      }      if (file.exists()) {        if (!file.delete()) {
          logWarning(s"Error deleting ${file}")
        }
      }
    }
  }  // 实例化SpilledFile，并保存在数据结构ArrayBuffer[SpilledFile]中
  spills.append(SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition))
}

先来简单的看一下排序部分的逻辑：

def destructiveSortedWritablePartitionedIterator(keyComparator: Option[Comparator[K]])
  : WritablePartitionedIterator = {  // 这里的partitionedDestructiveSortedIterator会根据是map或者buffer有不同的实现
  val it = partitionedDestructiveSortedIterator(keyComparator)  // 最后返回的是WritablePartitionedIterator，上面进行写操作的时候就是调用该迭代器中的writeNext方法
  new WritablePartitionedIterator {    private[this] var cur = if (it.hasNext) it.next() else null
    def writeNext(writer: DiskBlockObjectWriter): Unit = {
      writer.write(cur._1._2, cur._2)
      cur = if (it.hasNext) it.next() else null
    }    def hasNext(): Boolean = cur != null
    def nextPartition(): Int = cur._1._1
  }
}

如果collection是map，具体的实现为(PartitionedAppendOnlyMap)：

def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
  : Iterator[((Int, K), V)] = {  val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
  destructiveSortedIterator(comparator)
}

如果collection是buffer，具体的实现为(PartitionedPairBuffer)：

override def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
  : Iterator[((Int, K), V)] = {  val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)  new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, curSize, comparator)
  iterator
}

这里需要注意，首先都是获取比较器，比较的数据格式为((partition Id, key), value)，而两者的区别在于前者才是真正的Destructive级别的，具体的实现在destructiveSortedIterator方法中，而不管采用哪种方式，其底层都是通过timSort算法实现的，具体的排序逻辑就不详细说明了，有兴趣的朋友可以深入研究下去。

接下来就进入到第三步。

2.1.3 第三步

先贴出该步骤的代码：

...// 这里注释说的很清楚了，只打开了一个文件// Don't bother including the time to open the merged output file in the shuffle write time,// because it just opens a single file, so is typically too fast to measure accurately// (see SPARK-3570).val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)val tmp = Utils.tempFileWith(output)try {  // 构造blockId
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)  // 写数据
  val partitionLengths = sorter.writePartitionedFile(blockId, tmp)  // 写index文件
  shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)  // 进行Shuffle Read的时候需要参考该信息
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {  if (tmp.exists() && !tmp.delete()) {
    logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
  }
}

writePartitionedFile：

首先来看writePartitionedFile方法：

def writePartitionedFile(
    blockId: BlockId,
    outputFile: File): Array[Long] = {  // Track location of each range in the output file
  val lengths = new Array[Long](numPartitions)  // 首先判断spills中是否有数据，即判断是否有数据被spill到了磁盘中
  if (spills.isEmpty) {    // 数据只在内存中的情况
    // Case where we only have in-memory data
    val collection = if (aggregator.isDefined) map else buffer    // 获得迭代器
    val it = collection.destructiveSortedWritablePartitionedIterator(comparator)    // 进行迭代并将数据写到磁盘
    while (it.hasNext) {      val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
        context.taskMetrics.shuffleWriteMetrics.get)      val partitionId = it.nextPartition()      while (it.hasNext && it.nextPartition() == partitionId) {
        it.writeNext(writer)
      }
      writer.commitAndClose()      val segment = writer.fileSegment()      // 最后返回的是每个partition写入的数据的长度
      lengths(partitionId) = segment.length
    }
  } else {    // 如果有数据被spill到了磁盘中，我们就需要进行merge-sort操作
    // We must perform merge-sort; get an iterator by partition and write everything directly.
    for ((id, elements) <- this.partitionedIterator) {      if (elements.hasNext) {        val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
          context.taskMetrics.shuffleWriteMetrics.get)        for (elem <- elements) {
          writer.write(elem._1, elem._2)
        }
        writer.commitAndClose()        val segment = writer.fileSegment()
        lengths(id) = segment.length
      }
    }
  }
  context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
  context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
  context.internalMetricsToAccumulators(    InternalAccumulator.PEAK_EXECUTION_MEMORY).add(peakMemoryUsedBytes)
  lengths
}

下面我们就看一下this.partitionedIterator即内存和磁盘中的数据是如何合到一起的：

def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {  val usingMap = aggregator.isDefined  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer  if (spills.isEmpty) {    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    if (!ordering.isDefined) {      // The user hasn't requested sorted keys, so only sort by partition ID, not key
      groupByPartition(collection.partitionedDestructiveSortedIterator(None))
    } else {      // We do need to sort by both partition ID and key
      groupByPartition(collection.partitionedDestructiveSortedIterator(Some(keyComparator)))
    }
  } else {    // Merge spilled and in-memory data
    merge(spills, collection.partitionedDestructiveSortedIterator(comparator))
  }
}

我们只考虑spills不为空的情况，即执行merge方法：

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {  // 根据每个SpilledFile实例化一个SpillReader，这些SpillReader组成一个Seq
  val readers = spills.map(new SpillReader(_))  // 获得内存BufferedIterator
  val inMemBuffered = inMemory.buffered  // 根据partition的个数进行迭代
  (0 until numPartitions).iterator.map { p =>    // 实例化IteratorForPartition，即当前partition下的Iterator
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)    // 这里就是合并操作
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)    if (aggregator.isDefined) {      // Perform partial aggregation across partitions
      // 如果需要map端的combine操作则需要根据key进行聚合操作
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      // 排序合并，例如sortByKey
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}

具体的mergeWithAggregation和mergeSort就不一一说明了，下面再来看一下writeIndexFileAndCommit

writeIndexFileAndCommit：

再来看writeIndexFileAndCommit方法：

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Int,
    lengths: Array[Long],
    dataTmp: File): Unit = {  val indexFile = getIndexFile(shuffleId, mapId)  val indexTmp = Utils.tempFileWith(indexFile)  try {    val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))    Utils.tryWithSafeFinally {      // We take in lengths of each block, need to convert it to offsets.
      var offset = 0L
      out.writeLong(offset)      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    } {
      out.close()
    }    val dataFile = getDataFile(shuffleId, mapId)    // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
    // the following check and rename are atomic.
    synchronized {      val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)      if (existingLengths != null) {        // Another attempt for the same task has already written our map outputs successfully,
        // so just use the existing partition lengths and delete our temporary map outputs.
        System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)        if (dataTmp != null && dataTmp.exists()) {
          dataTmp.delete()
        }
        indexTmp.delete()
      } else {        // This is the first successful attempt in writing the map outputs for this task,
        // so override any existing index and data files with the ones we wrote.
        if (indexFile.exists()) {
          indexFile.delete()
        }        if (dataFile.exists()) {
          dataFile.delete()
        }        if (!indexTmp.renameTo(indexFile)) {          throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
        }        if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {          throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
        }
      }
    }
  } finally {    if (indexTmp.exists() && !indexTmp.delete()) {
      logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
    }
  }
}

具体的实现就不详细说明了，主要就是根据上一步的到的partition长度的数组将偏移量写入到index文件中。

最后就是实例化MapStatus，shuffle read的时候根据MapStatus获取数据。至此BaseShuffleHandle & SortShuffleWriter的部分就结束了。

BypassMergeSortShuffleHandle & BypassMergeSortShuffleWriter

再来看一下BypassMergeSortShuffleWriter的write方法：

public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);  if (!records.hasNext()) {
    partitionLengths = new long[numPartitions];
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);    return;
  }  final SerializerInstance serInstance = serializer.newInstance();  final long openStartTime = System.nanoTime();
  partitionWriters = new DiskBlockObjectWriter[numPartitions];  // 针对每一个reducer建立一个临时文件
  for (int i = 0; i < numPartitions; i++) {    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();    final File file = tempShuffleBlockIdPlusFile._2();    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics).open();
  }  // Creating the file to write to and creating a disk writer both involve interacting with
  // the disk, and can take a long time in aggregate when we open many files, so should be
  // included in the shuffle write time.
  writeMetrics.incShuffleWriteTime(System.nanoTime() - openStartTime);  // 根据partition将记录写入到不同的临时文件中
  while (records.hasNext()) {    final Product2<K, V> record = records.next();    final K key = record._1();
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }  for (DiskBlockObjectWriter writer : partitionWriters) {
    writer.commitAndClose();
  }  // 将所有的临时文件内容按照partition Id合并到一个文件
  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);  File tmp = Utils.tempFileWith(output);  try {    // 将记录和partition长度信息分别写入到data文件和index文件中
    partitionLengths = writePartitionedFile(tmp);
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

该writer在进行写记录之前会根据reducer的个数(例如R个)生成R个临时文件，然后将记录写入对应的临时文件中，最后将这些文件进行合并操作并写入到一个文件中，由于直接将记录写入了临时文件，并没有缓存在内存中，所以如果reducer的个数过多的话，就会为每个reducer打开一个临时文件，如果reducer的数量过多的话就会影响性能，所以使用该种方式需要满足一下条件(下面是源码中的注释)：

no Ordering is specified.

no Aggregator is specified.

the number of partitions is less than spark.shuffle.sort.bypassMergeThreshold.

其中spark.shuffle.sort.bypassMergeThreshold的个数默认为200个。

SerializedShuffleHandle & UnsafeShuffleWriter

最后再来看一下UnsafeShuffleWriter，也就是通常所说的Tungsten。

UnsafeShuffleWriter的write方法：

public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {  // Keep track of success so we know if we encountered an exception
  // We do this rather than a standard try/catch/re-throw to handle
  // generic throwables.
  boolean success = false;  try {    while (records.hasNext()) {      // 循环便利所有记录，对其作用insertRecordIntoSorter方法
      insertRecordIntoSorter(records.next());
    }    // 将数据输出到磁盘上
    closeAndWriteOutput();
    success = true;
  } finally {    if (sorter != null) {      try {
        sorter.cleanupResources();
      } catch (Exception e) {        // Only throw this error if we won't be masking another
        // error.
        if (success) {          throw e;
        } else {
          logger.error("In addition to a failure during writing, we failed during " +                       "cleanup.", e);
        }
      }
    }
  }
}

首先来看insertRecordIntoSorter：

void insertRecordIntoSorter(Product2<K, V> record) throws IOException {  assert(sorter != null);  final K key = record._1();  final int partitionId = partitioner.getPartition(key);
  serBuffer.reset();
  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
  serOutputStream.flush();  final int serializedRecordSize = serBuffer.size();  assert (serializedRecordSize > 0);
  sorter.insertRecord(
    serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}

可以看出实际上调用的是ShuffleExternalSorter的insertRecord方法，限于篇幅，具体的底层实现暂不说明，以后有时间会单独分析一下Tungsten的部分，接下来看一下closeAndWriteOutput方法：

void closeAndWriteOutput() throws IOException {  assert(sorter != null);
  updatePeakMemoryUsed();
  serBuffer = null;
  serOutputStream = null;  // 获得spilled的文件
  final SpillInfo[] spills = sorter.closeAndGetSpills();
  sorter = null;  final long[] partitionLengths;  // 最终的输出文件
  final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);  // 临时文件
  final File tmp = Utils.tempFileWith(output);  try {    try {      // 将spilled的文件合并并写入到临时文件
      partitionLengths = mergeSpills(spills, tmp);
    } finally {      for (SpillInfo spill : spills) {        if (spill.file.exists() && ! spill.file.delete()) {
          logger.error("Error while deleting spill file {}", spill.file.getPath());
        }
      }
    }    // 将partition的长度信息写入index文件
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }  // mapStatus
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

用图来总结一下上面描述的三种方式如下所示：

webp

BaseShuffleHandle & SortShuffleWriter

webp

SerializedShuffleHandle & UnsafeShuffleWriter

webp

BypassMergeSortShuffleHandle & BypassMergeSortShuffleWriter

最后需要说明的是如果采用的是SortShuffleManager，最后每个task产生的文件的个数为2 * M(M代表Mapper端ShuffleMapTask的个数)，相对于Hash的方式来说文件的个数明显减少。

至此Shuffle Write的部分就分析完了，下一遍文章会继续分析Shuffle Read的部分。

本文为原创，欢迎转载，转载请注明出处、作者，谢谢！

作者：sun4lower
链接：https://www.jianshu.com/p/766290f06329