In the previous article we focused on how Tasks are submitted; in this article we analyze in detail how a Task actually runs.
We start from the point where CoarseGrainedExecutorBackend receives the LaunchTask message sent by CoarseGrainedSchedulerBackend:
case LaunchTask(data) =>
  if (executor == null) {
    logError("Received LaunchTask command but executor was null")
    System.exit(1)
  } else {
    // Deserialize into a TaskDescription
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    // Call Executor.launchTask to run the Task
    executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
      taskDesc.name, taskDesc.serializedTask)
  }
Next we step into Executor's launchTask method:
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  // Instantiate a TaskRunner
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  // Register it in a ConcurrentHashMap[Long, TaskRunner]
  runningTasks.put(taskId, tr)
  // Run the TaskRunner on the thread pool, i.e. execute its run() method
  threadPool.execute(tr)
}
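The method above registers the task and hands it to a thread pool. To make that bookkeeping pattern concrete, here is a minimal standalone sketch (not Spark code; MiniExecutor and its members are hypothetical names) of the same idea: one Runnable per task, registered in a ConcurrentHashMap before being handed to a thread pool, and removed again when it finishes.

import java.util.concurrent.{ConcurrentHashMap, Executors}

object MiniExecutor {
  // Hypothetical stand-in for Spark's Executor: task ids map to their running task
  private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
  private val threadPool = Executors.newCachedThreadPool()

  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val runner: Runnable = new Runnable {
      override def run(): Unit =
        try body() finally runningTasks.remove(taskId) // deregister when the task is done
    }
    runningTasks.put(taskId, runner) // register before execution starts
    threadPool.execute(runner)       // run asynchronously on a pool thread
  }

  def main(args: Array[String]): Unit = {
    launchTask(1L, () => println("task 1 running"))
    threadPool.shutdown()
  }
}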
So Executor.launchTask first instantiates a TaskRunner (which implements the Runnable interface) and then executes its run() method on a thread from the thread pool. Let's step into TaskRunner's run() method; for readability we split it into several parts:
// Instantiate the TaskMemoryManager, i.e. memory management for this task
val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
// Record the deserialization start time
val deserializeStartTime = System.currentTimeMillis()
// Set the ClassLoader
Thread.currentThread.setContextClassLoader(replClassLoader)
// Serializer
val ser = env.closureSerializer.newInstance()
// Log
logInfo(s"Running $taskName (TID $taskId)")
// Report the Task's state as RUNNING to the Driver via ExecutorBackend.statusUpdate
execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
var taskStart: Long = 0
// GC time so far
startGCTime = computeTotalGcTime()
How the Driver (DriverEndpoint) handles that status message is not our focus here; we stay on how the Task itself runs, so let's continue with the source:
try {
  // Deserialize the Task's dependencies, together with taskBytes
  val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask)
  // Update the dependencies, i.e. download the files and jars; the download is synchronized
  // because these dependencies are shared by all Tasks on the same Executor
  updateDependencies(taskFiles, taskJars)
  // Deserialize taskBytes into a Task
  task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
  // Set the memory manager
  task.setTaskMemoryManager(taskMemoryManager)

  // If this task has been killed before we deserialized it, let's quit now. Otherwise,
  // continue executing the task.
  if (killed) {
    // Throw an exception rather than returning, because returning within a try{} block
    // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
    // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
    // for the task.
    throw new TaskKilledException
  }

  logDebug("Task " + taskId + "'s epoch is " + task.epoch)
  env.mapOutputTracker.updateEpoch(task.epoch)

  // Call task.run() to execute the task and obtain its result
  // Run the actual task and measure its runtime.
  taskStart = System.currentTimeMillis()
  var threwException = true
  val (value, accumUpdates) = try {
    val res = task.run(
      taskAttemptId = taskId,
      attemptNumber = attemptNumber,
      metricsSystem = env.metricsSystem)
    threwException = false
    res
  } finally {
    ...
  }
  ...
  // What follows handles the result returned after the Task finishes
The first step is deserializing the dependencies (serialization and deserialization are summarized at the end of this article). taskBytes is then deserialized into a Task, and finally Task.run() is called to execute the concrete Task and obtain its result. The handling of that result is covered once the Task has finished running; for now let's step into Task's run() method:
final def run(
    taskAttemptId: Long,
    attemptNumber: Int,
    metricsSystem: MetricsSystem)
  : (T, AccumulatorUpdates) = {
  context = new TaskContextImpl(
    stageId,
    partitionId,
    taskAttemptId,
    attemptNumber,
    taskMemoryManager,
    metricsSystem,
    internalAccumulators,
    runningLocally = false)
  TaskContext.setTaskContext(context)
  context.taskMetrics.setHostname(Utils.localHostName())
  context.taskMetrics.setAccumulatorsUpdater(context.collectInternalAccumulators)
  taskThread = Thread.currentThread()
  if (_killed) {
    kill(interruptThread = false)
  }
  try {
    (runTask(context), context.collectAccumulators())
  } catch {
    ...
  } finally {
    ...
  }
}
As we can see, internally this delegates to the Task's runTask method, and depending on the task type that is either ShuffleMapTask's or ResultTask's runTask. We look at each in turn.
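As a rough, hypothetical model of that dispatch (simplified signatures, not Spark's actual classes), run() acts as a template method and each subclass supplies its own runTask:

// A stripped-down sketch of the Task hierarchy; names mirror Spark's but the
// signatures are simplified for illustration.
abstract class MiniTask[T] {
  final def run(taskAttemptId: Long): T = {
    // ... set up the TaskContext, metrics, kill flag, etc. ...
    runTask(taskAttemptId)
  }
  protected def runTask(taskAttemptId: Long): T
}

class MiniShuffleMapTask extends MiniTask[String] {
  // In real Spark this returns a MapStatus describing the shuffle output
  protected def runTask(taskAttemptId: Long): String = s"map output of task $taskAttemptId"
}

class MiniResultTask[U](compute: Long => U) extends MiniTask[U] {
  // In real Spark this applies the user function to the partition's iterator
  protected def runTask(taskAttemptId: Long): U = compute(taskAttemptId)
}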
ShuffleMapTask
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  // Record the deserialization start time
  val deserializeStartTime = System.currentTimeMillis()
  // Get the serializer/deserializer
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // Deserialize the RDD and its ShuffleDependency
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  // Compute the time spent on deserialization
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    // Get the ShuffleManager; there are hash-based and sort-based implementations,
    // sort-based being the default. The ShuffleManager is created in SparkEnv
    // (on both the Driver and the Executors): the Driver registers shuffles with it,
    // and Executors read and write data through it.
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}
Because Shuffle is key to the performance of an entire Spark application, we will cover it in dedicated articles; what we care about here is the actual computation of the Task. In the end the RDD's iterator method is executed, and that is where the Partition corresponding to the current Task gets computed: internally it iterates over the Partition's elements and hands them to the function we defined.
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
On the first run nothing is cached, so compute is called directly. Each concrete RDD implements its own compute logic; here we take MapPartitionsRDD's compute method as an example:
override def compute(split: Partition, context: TaskContext): Iterator[U] =
  f(context, split.index, firstParent[T].iterator(split, context))
We can clearly see that it directly executes the function f we wrote. Note the last argument: it is again a call to the parent RDD's iterator method, which is how the functions within a single Stage get chained and expanded, roughly like:
// RDD1
x = 1 + y   // here y stands for the data read from HDFS
// RDD2
z = x + 3
// Expanded within the stage:
z = (1 + y) + 3
// This is only an analogy to aid understanding
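To see the same chaining with real iterators, here is a small self-contained sketch (not Spark code; MiniRDD is a made-up name) in which each step wraps its parent's iterator with a function, so pulling from the last step lazily drives the whole pipeline, just as chained iterator calls do within a stage:

// A hypothetical, minimal pipeline: each MiniRDD applies f to its parent's iterator,
// mirroring how MapPartitionsRDD.compute calls firstParent.iterator within one stage.
class MiniRDD[T, U](parent: () => Iterator[T], f: Iterator[T] => Iterator[U]) {
  def iterator(): Iterator[U] = f(parent())
}

object StageChaining {
  def main(args: Array[String]): Unit = {
    val source = () => Iterator(1, 2, 3)                                  // stands in for data read from HDFS
    val rdd1 = new MiniRDD[Int, Int](source, _.map(_ + 1))                // "x = 1 + y"
    val rdd2 = new MiniRDD[Int, Int](() => rdd1.iterator(), _.map(_ + 3)) // "z = x + 3"
    // Pulling from rdd2 expands to (_ + 3) applied to (_ + 1) applied to the source
    println(rdd2.iterator().toList) // List(5, 6, 7)
  }
}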
ResultTask
override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  // Record the deserialization start time
  val deserializeStartTime = System.currentTimeMillis()
  // Get the serializer/deserializer
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // Deserialize; unlike the shuffle case, this yields the RDD together with our user function
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

  metrics = Some(context.taskMetrics)
  // Execute the user-defined function
  func(context, rdd.iterator(partition, context))
}
Now for ResultTask: unlike the shuffle case, a ResultTask directly produces the final result of the computation.
Let's now go back to TaskRunner's run() method and see how the computation result is handled:
override def run(): Unit = {
  ...
  try {
    ...
    // Record when the task finished running
    val taskFinish = System.currentTimeMillis()

    // If the task has been killed, let's fail it.
    if (task.killed) {
      throw new TaskKilledException
    }

    // Serializer for the result
    val resultSer = env.serializer.newInstance()
    // Record when result serialization starts
    val beforeSerialization = System.currentTimeMillis()
    // Serialize the returned value
    val valueBytes = resultSer.serialize(value)
    // Record when result serialization ends
    val afterSerialization = System.currentTimeMillis()

    // Record a set of metrics
    for (m <- task.metrics) {
      // Deserialization happens in two parts: first, we deserialize a Task object, which
      // includes the Partition. Second, Task.run() deserializes the RDD and function to be run.
      m.setExecutorDeserializeTime(
        (taskStart - deserializeStartTime) + task.executorDeserializeTime)
      // We need to subtract Task.run()'s deserialization time to avoid double-counting
      m.setExecutorRunTime((taskFinish - taskStart) - task.executorDeserializeTime)
      m.setJvmGCTime(computeTotalGcTime() - startGCTime)
      m.setResultSerializationTime(afterSerialization - beforeSerialization)
      m.updateAccumulators()
    }

    // Wrap the result and related information in a DirectTaskResult
    val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
    // Serialize the DirectTaskResult
    val serializedDirectResult = ser.serialize(directResult)
    // Size of the serialized result
    val resultSize = serializedDirectResult.limit

    // directSend = sending directly back to the driver
    val serializedResult: ByteBuffer = {
      // Is the serialized result larger than the maxResultSize limit (1GB by default)?
      if (maxResultSize > 0 && resultSize > maxResultSize) {
        logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
          s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
          s"dropping it.")
        ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
      // Otherwise, is it at least akkaFrameSize - AkkaUtils.reservedSizeBytes
      // (by default 128MB - 200KB)?
      } else if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
        // Build the blockId
        val blockId = TaskResultBlockId(taskId)
        // Store the result through the BlockManager at storage level MEMORY_AND_DISK_SER
        env.blockManager.putBytes(
          blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
        logInfo(
          s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
        // Serialize an IndirectTaskResult pointing at that block
        ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
      } else {
        logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
        // Small enough: return the serialized result directly, bypassing the BlockManager
        serializedDirectResult
      }
    }

    execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
  } catch {
    ...
  } finally {
    runningTasks.remove(taskId)
  }
}
Which serializedResult gets returned depends on the serialized size resultSize (the branching is summarized in the sketch after this list):
If resultSize is larger than maxResultSize (configured via "spark.driver.maxResultSize") and maxResultSize is greater than 0, a warning is logged, the actual result is dropped, and the serialized IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize) is returned;
If resultSize is at most maxResultSize but at least akkaFrameSize - reservedSizeBytes (128MB - 200KB by default), the result is stored through the BlockManager at storage level MEMORY_AND_DISK_SER, and the serialized IndirectTaskResult wrapping the corresponding blockId is returned;
If resultSize is smaller than 128MB - 200KB, the serialized result is returned directly, bypassing the BlockManager.
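The three-way decision can be summarized with a small self-contained sketch; the threshold values mirror the defaults quoted above, and the type names are illustrative rather than Spark's:

// Hypothetical illustration of the three-way decision in TaskRunner.run().
sealed trait ResultRoute
case object DroppedTooLarge extends ResultRoute   // IndirectTaskResult with no stored block
case object ViaBlockManager extends ResultRoute   // IndirectTaskResult pointing at a block
case object DirectToDriver  extends ResultRoute   // DirectTaskResult sent inline

object ResultRouting {
  val maxResultSize: Long = 1L << 30                      // 1GB, spark.driver.maxResultSize default
  val akkaFrameLimit: Long = (128L << 20) - (200L << 10)  // 128MB - 200KB

  def route(resultSize: Long): ResultRoute =
    if (maxResultSize > 0 && resultSize > maxResultSize) DroppedTooLarge
    else if (resultSize >= akkaFrameLimit) ViaBlockManager
    else DirectToDriver

  def main(args: Array[String]): Unit = {
    println(route(10L << 10))   // DirectToDriver
    println(route(200L << 20))  // ViaBlockManager
    println(route(2L << 30))    // DroppedTooLarge
  }
}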
Finally, the ExecutorBackend's statusUpdate method (in Standalone mode the ExecutorBackend is CoarseGrainedExecutorBackend) is called to send the result back to the DriverEndpoint; concretely, CoarseGrainedExecutorBackend sends a StatusUpdate message carrying the execution result:
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  // Wrap the information in a StatusUpdate message
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
When the DriverEndpoint receives the StatusUpdate message, it does the following:
case StatusUpdate(executorId, taskId, state, data) =>
  // First call TaskSchedulerImpl's statusUpdate method
  scheduler.statusUpdate(taskId, state, data.value)
  // Then release the cores the Task was using and offer them out again
  if (TaskState.isFinished(state)) {
    executorDataMap.get(executorId) match {
      case Some(executorInfo) =>
        executorInfo.freeCores += scheduler.CPUS_PER_TASK
        makeOffers(executorId)
      case None =>
        // Ignoring the update since we don't know about the executor.
        logWarning(s"Ignored task status update ($taskId state $state) " +
          s"from unknown executor with ID $executorId")
    }
  }
This does two things: it first calls TaskSchedulerImpl's statusUpdate method, and then releases and re-offers the compute resources the Task was using. We go straight into TaskSchedulerImpl's statusUpdate method:
def statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) {
  var failedExecutor: Option[String] = None
  synchronized {
    try {
      if (state == TaskState.LOST && taskIdToExecutorId.contains(tid)) {
        // We lost this entire executor, so remember that it's gone
        val execId = taskIdToExecutorId(tid)
        if (executorIdToTaskCount.contains(execId)) {
          removeExecutor(execId,
            SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
          failedExecutor = Some(execId)
        }
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
            taskSet.removeRunningTask(tid)
            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
          } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
            taskSet.removeRunningTask(tid)
            taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
          }
        case None =>
          logError(
            ("Ignoring update with state %s for TID %s because its task set is gone (this is " +
              "likely the result of receiving duplicate task finished status updates)")
              .format(state, tid))
      }
    } catch {
      case e: Exception => logError("Exception in statusUpdate", e)
    }
  }
  // Guard against a potential deadlock:
  // Update the DAGScheduler without holding a lock on this, since that can deadlock
  if (failedExecutor.isDefined) {
    dagScheduler.executorLost(failedExecutor.get)
    backend.reviveOffers()
  }
}
The key part of the code above is that a TaskResultGetter is used to handle successful and failed tasks, via its enqueueSuccessfulTask and enqueueFailedTask methods respectively. We focus on the successful case (failures, roughly speaking, lead to retries), so we step into TaskResultGetter's enqueueSuccessfulTask method (only the main part is shown below):
// Deserialize the received result
val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
  // Match on the type of the received result and handle each case
  case directResult: DirectTaskResult[_] =>
    if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
      return
    }
    // deserialize "value" without holding any lock so that it won't block other threads.
    // We should call it here, so that when it's called again in
    // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
    directResult.value()
    (directResult, serializedData.limit())

  case IndirectTaskResult(blockId, size) =>
    if (!taskSetManager.canFetchMoreResults(size)) {
      // dropped by executor if size is larger than maxResultSize
      sparkEnv.blockManager.master.removeBlock(blockId)
      return
    }
    logDebug("Fetching indirect task result for TID %s".format(tid))
    scheduler.handleTaskGettingResult(taskSetManager, tid)
    val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
    if (!serializedTaskResult.isDefined) {
      /* We won't be able to get the task result if the machine that ran the task failed
       * between when the task ended and when we tried to fetch the result, or if the
       * block manager had to flush the result. */
      scheduler.handleFailedTask(
        taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
      return
    }
    val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
      serializedTaskResult.get)
    sparkEnv.blockManager.master.removeBlock(blockId)
    (deserializedResult, size)
}
// Record the result size in the metrics
result.metrics.setResultSize(size)
scheduler.handleSuccessfulTask(taskSetManager, tid, result)
In short, the code pattern-matches on the type of the received result and handles each case separately:
If a DirectTaskResult is received, i.e. the serialized size was below 128MB - 200KB, then (directResult, serializedData.limit()) is returned as (result, size);
If an IndirectTaskResult is received and canFetchMoreResults rejects it (the result would exceed the 1GB maxResultSize limit), the block is simply dropped; otherwise the bytes previously stored via the BlockManager are fetched and deserialized, and (deserializedResult, size) is returned as (result, size).
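As a simplified, hypothetical model of the two result shapes handled above (the field names follow this discussion, not Spark's exact definitions):

import java.nio.ByteBuffer

// Simplified sketch of the two task-result shapes discussed above.
sealed trait MiniTaskResult
// The serialized value travels inline with the status update.
case class MiniDirectTaskResult(valueBytes: ByteBuffer, resultSize: Long) extends MiniTaskResult
// Only a reference travels; the bytes must be fetched from the BlockManager by blockId.
case class MiniIndirectTaskResult(blockId: String, resultSize: Long) extends MiniTaskResult

object ResultHandling {
  // Mirrors the branching in enqueueSuccessfulTask: inline results are used directly,
  // indirect results require a remote fetch (stubbed out here as a function argument).
  def handle(r: MiniTaskResult, fetch: String => Option[ByteBuffer]): Option[(ByteBuffer, Long)] =
    r match {
      case MiniDirectTaskResult(bytes, size)     => Some((bytes, size))
      case MiniIndirectTaskResult(blockId, size) => fetch(blockId).map(bytes => (bytes, size))
    }
}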
Finally, TaskSchedulerImpl's handleSuccessfulTask method is called:
def handleSuccessfulTask(
    taskSetManager: TaskSetManager,
    tid: Long,
    taskResult: DirectTaskResult[_]): Unit = synchronized {
  taskSetManager.handleSuccessfulTask(tid, taskResult)
}
which in turn calls TaskSetManager's handleSuccessfulTask method:
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  ...
  sched.dagScheduler.taskEnded(
    tasks(index), Success, result.value(), result.accumUpdates, info, result.metrics)
  ...
}
The main step is the call to DAGScheduler's taskEnded method:
def taskEnded(
    task: Task[_],
    reason: TaskEndReason,
    result: Any,
    accumUpdates: Map[Long, Any],
    taskInfo: TaskInfo,
    taskMetrics: TaskMetrics): Unit = {
  eventProcessLoop.post(
    CompletionEvent(task, reason, result, accumUpdates, taskInfo, taskMetrics))
}
eventProcessLoop.post puts a CompletionEvent onto the event queue; let's look directly at how the DAGScheduler handles that event:
case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
  dagScheduler.handleTaskCompletion(completion)
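The eventProcessLoop is a single-threaded event loop: post() only enqueues, and a dedicated thread drains the queue and dispatches each event to a handler. Below is a hedged, minimal sketch of that pattern (not the actual DAGSchedulerEventProcessLoop implementation; all names are illustrative):

import java.util.concurrent.LinkedBlockingQueue

// Minimal illustration of the post-and-dispatch pattern used by the DAGScheduler's event loop.
class MiniEventLoop[E](name: String)(onReceive: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        while (true) onReceive(queue.take()) // blocks until an event arrives
      } catch {
        case _: InterruptedException => // stop() was called; exit quietly
      }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event) // callers never block on the handler
  def stop(): Unit = thread.interrupt()
}

object EventLoopDemo {
  def main(args: Array[String]): Unit = {
    val loop = new MiniEventLoop[String]("mini-dag-event-loop")(e => println(s"handling $e"))
    loop.start()
    loop.post("CompletionEvent(task-42)")
    Thread.sleep(100) // give the daemon thread a moment before the JVM exits
    loop.stop()
  }
}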
We stop the trace here; interested readers can keep digging. In the following articles we begin a detailed analysis of the Shuffle.
To briefly recap the flow above: CoarseGrainedExecutorBackend receives LaunchTask → Executor.launchTask → TaskRunner.run() → Task.run() → ShuffleMapTask/ResultTask.runTask() → the result is sent back via statusUpdate → DriverEndpoint → TaskSchedulerImpl → TaskResultGetter → DAGScheduler.handleTaskCompletion.
Addendum: a summary of Task serialization and deserialization.
Serialization:
1. Serializing the RDD together with its ShuffleDependency (for a ShuffleMapStage) or the user function (for a ResultStage):
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] = stage match {
    case stage: ShuffleMapStage =>
      closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
    case stage: ResultStage =>
      closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
  }
  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
2. In TaskSetManager: serializing the Task together with its dependencies:
val serializedTask: ByteBuffer = try {
  Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)
} catch {
  ...
}

Once serialized, the task is wrapped in a TaskDescription:

return Some(new TaskDescription(taskId = taskId, attemptNumber = attemptNum, execId,
  taskName, index, serializedTask))
3. In CoarseGrainedSchedulerBackend's DriverEndpoint: serializing the TaskDescription:
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
Deserialization:
1. When CoarseGrainedExecutorBackend receives the LaunchTask message: deserialize into a TaskDescription:
case LaunchTask(data) =>
  if (executor == null) {
    logError("Received LaunchTask command but executor was null")
    System.exit(1)
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
2. When the Executor runs TaskRunner's run() method on a pool thread: deserialize the dependencies:
try {
  val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask)
3. Also in TaskRunner's run() method: deserialize taskBytes into a Task:
task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
4. When ShuffleMapTask or ResultTask executes runTask(): deserialize the RDD together with its ShuffleDependency (ShuffleMapTask) or the user function (ResultTask):
ShuffleMapTask:
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
ResultTask:
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
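To make the serialize/deserialize pairing concrete, here is a minimal round trip using plain JDK serialization (intentionally not Spark's SerializerInstance API, whose signatures differ): serialize a payload into a ByteBuffer on one side, ship it, and deserialize it on the other, which is the pattern the scheduler and executor apply to TaskDescriptions, tasks, and results.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// A standalone sketch of the serialize -> ByteBuffer -> deserialize round trip.
object RoundTrip {
  def serialize[T](value: T): ByteBuffer = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    ByteBuffer.wrap(bytes.toByteArray)
  }

  def deserialize[T](buffer: ByteBuffer): T = {
    val bytes = new Array[Byte](buffer.remaining())
    buffer.get(bytes)
    new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    // A hypothetical task payload; real payloads carry the serialized task plus file/jar metadata.
    val payload = ("task-42", Vector("dep.jar"), "partition-7")
    val shipped = serialize(payload)
    println(deserialize[(String, Vector[String], String)](shipped)) // (task-42,Vector(dep.jar),partition-7)
  }
}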
This article follows the Spark 1.6.3 source code; a link to the corresponding Spark 2.1.0 sources is also given:
This is an original article; you are welcome to repost it, but please credit the source and the author. Thanks!
作者:sun4lower
链接:https://www.jianshu.com/p/b2d419b3ade6