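Preface
In the article "TaskScheduler: Task Submission and Scheduling Source Code Analysis", I described how tasks are logically assigned to executors: calling resourceOffers() on TaskSchedulerImpl yields a sequence of TaskDescription sequences, Seq[Seq[TaskDescription]], where each TaskDescription states that a given task should run on a given executor. That assignment is only logical; nothing has run on an executor yet. This article walks through the source code to show how a task is actually dispatched to an executor and executed.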
Driver side: sending the LaunchTask event
After resourceOffers has logically assigned the tasks, CoarseGrainedSchedulerBackend calls launchTasks with the resulting Seq[Seq[TaskDescription]]:
    private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        // Serialize the TaskDescription
        val serializedTask = ser.serialize(task)
        if (serializedTask.limit >= maxRpcMessageSize) {
          scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
                "spark.rpc.message.maxSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        } else {
          // Look up the executor's descriptor (executorData) by executorId
          val executorData = executorDataMap(task.executorId)
          // Deduct the cores this task will occupy from freeCores
          executorData.freeCores -= scheduler.CPUS_PER_TASK
          logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
            s"${executorData.executorHost}.")
          // Send a LaunchTask event through executorData's executorEndpoint;
          // the event carries the serialized task
          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }
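The TaskDescription is serialized first, and its size is checked against maxRpcMessageSize (the RPC message limit configured by spark.rpc.message.maxSize). If the serialized task reaches or exceeds the limit, the TaskSetManager that owns it is aborted with an error message. Otherwise the LaunchTask event is sent through the executorEndpoint held in executorData. executorEndpoint is the reference the driver uses to communicate with the executor; sending LaunchTask through it hands the task to the executor for execution.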
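The size check and core bookkeeping described above can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions: TaskDesc, ExecutorSlot, Outcome and the string-based serialize are hypothetical stand-ins, not Spark classes, and the real code aborts the TaskSetManager and sends an RPC message instead of returning values.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import scala.collection.mutable

object LaunchSketch {
  // Hypothetical stand-ins for TaskDescription and ExecutorData (not Spark classes).
  final case class TaskDesc(taskId: Long, executorId: String, payload: String)
  final class ExecutorSlot(var freeCores: Int)

  // What happened to each task: handed to the executor, or rejected as too large.
  sealed trait Outcome
  final case class Sent(taskId: Long, bytes: Int) extends Outcome
  final case class Aborted(taskId: Long, reason: String) extends Outcome

  // Assumption: a trivial serializer that just encodes the payload as UTF-8.
  def serialize(task: TaskDesc): ByteBuffer =
    ByteBuffer.wrap(task.payload.getBytes(StandardCharsets.UTF_8))

  def launch(
      tasks: Seq[Seq[TaskDesc]],
      executors: mutable.Map[String, ExecutorSlot],
      maxMessageSize: Int,
      cpusPerTask: Int): Seq[Outcome] =
    tasks.flatten.map { task =>
      val bytes = serialize(task)
      if (bytes.limit() >= maxMessageSize) {
        // Too large for one message: reject instead of sending.
        Aborted(task.taskId, s"serialized task was ${bytes.limit()} bytes, limit is $maxMessageSize")
      } else {
        // Reserve the cores on the target executor, then "send" the task.
        executors(task.executorId).freeCores -= cpusPerTask
        Sent(task.taskId, bytes.limit())
      }
    }
}
```

Note that an oversized task never touches freeCores; only tasks that are actually sent reserve cores on their executor.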
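Executor side: receiving the LaunchTask event
The driver sends tasks to executors through its backend, CoarseGrainedSchedulerBackend, so naturally the executor receives them through a matching backend, CoarseGrainedExecutorBackend. It is paired one-to-one with an executor and handles the communication between that executor and the driver. Here is how CoarseGrainedExecutorBackend handles the event when it arrives: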
    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        // Deserialize the TaskDescription
        val taskDesc = ser.deserialize[TaskDescription](data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        // Call the executor's launchTask to start the task
        executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
          taskDesc.name, taskDesc.serializedTask)
      }
After deserializing the data into a TaskDescription, the backend calls the executor's launchTask to start the task. Following the call:
    def launchTask(
        context: ExecutorBackend,
        taskId: Long,
        attemptNumber: Int,
        taskName: String,
        serializedTask: ByteBuffer): Unit = {
      // Create a TaskRunner
      val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
        serializedTask)
      runningTasks.put(taskId, tr)
      // Submit tr to the thread pool for execution
      threadPool.execute(tr)
    }
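The pattern in launchTask (wrap the work in a Runnable, register it in a map of running tasks, hand it to a thread pool) can be tried out in isolation. The sketch below assumes a hypothetical MiniExecutor class and a fixed pool of two threads; neither is taken from Spark. The cleanup in finally mirrors the runningTasks.remove(taskId) call in TaskRunner's run method.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object ExecutorSketch {
  // Hypothetical miniature of the executor's bookkeeping (not the Spark class).
  final class MiniExecutor {
    private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
    // Assumption: a small fixed pool; the pool type and size are illustrative only.
    private val threadPool = Executors.newFixedThreadPool(2)

    def launchTask(taskId: Long, body: () => Unit): Unit = {
      val runner = new Runnable {
        override def run(): Unit =
          try body()
          // Whether the body succeeds or throws, the task leaves the running set.
          finally runningTasks.remove(taskId)
      }
      // Register first, then execute, in the same order as launchTask above.
      runningTasks.put(taskId, runner)
      threadPool.execute(runner)
    }

    def runningCount: Int = runningTasks.size()

    // Stop accepting work and wait for submitted tasks to finish.
    def shutdownAndWait(): Boolean = {
      threadPool.shutdown()
      threadPool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }
}
```

Keeping the runner in a map keyed by task id is what lets an executor later look up a running task by id.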
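This creates a TaskRunner (which implements Runnable) and submits it to the thread pool. The important part is TaskRunner's run method. It is long, so only the main logic is kept here: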
    override def run(): Unit = {
      ...
      try {
        // Deserialize the task into the files (taskFiles), jars (taskJars),
        // properties (taskProps) and the task's binary data (taskBytes)
        val (taskFiles, taskJars, taskProps, taskBytes) =
          Task.deserializeWithDependencies(serializedTask)
        Executor.taskDeserializationProps.set(taskProps)
        // Download the files and jars the task depends on
        updateDependencies(taskFiles, taskJars)
        // Deserialize the task itself
        task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
        ...
        val value = try {
          // Call the task's run method; this is where the task really executes
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = attemptNumber,
            metricsSystem = env.metricsSystem)
          threwException = false
          // Return the result
          res
        } finally {
          val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId)
          // Free all memory allocated through the task memory manager
          val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory()
          if (freedMemory > 0 && !threwException) {
            val errMsg = s"Managed memory leak detected; size = $freedMemory bytes, TID = $taskId"
            if (conf.getBoolean("spark.unsafe.exceptionOnMemoryLeak", false)) {
              throw new SparkException(errMsg)
            } else {
              logWarning(errMsg)
            }
          }
          ...
        val resultSer = env.serializer.newInstance()
        val beforeSerialization = System.currentTimeMillis()
        // Serialize the task result, value
        val valueBytes = resultSer.serialize(value)
        val afterSerialization = System.currentTimeMillis()
        ...
        // Wrap the serialized result in a DirectTaskResult
        val directResult = new DirectTaskResult(valueBytes, accumUpdates)
        // Serialize directResult as well
        val serializedDirectResult = ser.serialize(directResult)
        val resultSize = serializedDirectResult.limit

        // directSend = sending directly back to the driver
        val serializedResult: ByteBuffer = {
          // If the result exceeds maxResultSize (spark.driver.maxResultSize, 1 GB by default),
          // drop it; the driver cannot fetch the result from the object it gets back
          if (maxResultSize > 0 && resultSize > maxResultSize) {
            ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
          // If the result exceeds maxDirectResultSize (too large to send back in a message),
          // write it to the BlockManager
          } else if (resultSize > maxDirectResultSize) {
            val blockId = TaskResultBlockId(taskId)
            env.blockManager.putBytes(
              blockId,
              new ChunkedByteBuffer(serializedDirectResult.duplicate()),
              StorageLevel.MEMORY_AND_DISK_SER)
            logInfo(
              s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
            ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
          // The result is small enough to travel in a message; return it directly
          } else {
            logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
            serializedDirectResult
          }
        }

        // Send a status update to the driver
        execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
      } catch {
        ...
        // Send a status update to the driver
        execBackend.statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason)
        ...
      } finally {
        // Whether it succeeded or not, remove the task from runningTasks
        runningTasks.remove(taskId)
      }
    }
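Deserialize with Task.deserializeWithDependencies to obtain the files (taskFiles), the jars (taskJars) and the task's binary data (taskBytes)
Download the files and jars the task depends on
Deserialize the task itself
Call the task's run method, which actually executes the task and returns the result
Free the allocated memory
Serialize the task result, wrap it in directResult, serialize that again, and return it to the driver in one of three ways depending on its size:
If the result exceeds maxResultSize (configurable, 1 GB by default), it is dropped, and the driver cannot fetch the result from the object it gets back
If the result exceeds maxDirectResultSize (too large to send back in a message), it is written to the BlockManager
If the result is small enough to travel in a message, it is returned directly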
Finally, the result is returned to the driver through CoarseGrainedExecutorBackend's statusUpdate method, which uses driverRpcEndpointRef to send the driver a StatusUpdate message containing serializedResult.
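The three-way decision on how a result travels back can be reduced to a small pure function. This is a minimal sketch: the Route type and its case names are hypothetical labels, and only the two comparisons are taken from the excerpt above.

```scala
object ResultRouting {
  // Hypothetical labels for the three branches in TaskRunner.run.
  sealed trait Route
  case object Dropped extends Route          // only a placeholder IndirectTaskResult is sent
  case object ViaBlockManager extends Route  // result stored as a block, block id is sent
  case object Direct extends Route           // serialized result is sent in the message itself

  def route(resultSize: Long, maxResultSize: Long, maxDirectResultSize: Long): Route =
    if (maxResultSize > 0 && resultSize > maxResultSize) Dropped
    else if (resultSize > maxDirectResultSize) ViaBlockManager
    else Direct
}
```

Because of the maxResultSize > 0 guard, a value of 0 disables the first check, so an oversized result then falls through to the BlockManager path instead of being dropped.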
Now let's look at what the task's run method does:
    final def run(
        taskAttemptId: Long,
        attemptNumber: Int,
        metricsSystem: MetricsSystem): T = {
      SparkEnv.get.blockManager.registerTask(taskAttemptId)
      // Create the context instance the task runs in
      context = new TaskContextImpl(
        stageId,
        partitionId,
        taskAttemptId,
        attemptNumber,
        taskMemoryManager,
        localProperties,
        metricsSystem,
        metrics)
      TaskContext.setTaskContext(context)
      taskThread = Thread.currentThread()
      if (_killed) {
        kill(interruptThread = false)
      }
      try {
        runTask(context)
      } catch {
        ...
      } finally {
        ... // Mark the task as completed and release memory
      }
    }
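The shape of run is a template method: a final method builds the context, publishes it for the current thread, and delegates the real work to an abstract runTask. A rough sketch of that shape follows. Ctx, CtxHolder and MiniTask are hypothetical names, and clearing the thread-local in finally is an assumption made for this sketch, since the finally body is elided in the excerpt above.

```scala
object TaskRunSketch {
  // Hypothetical, much reduced stand-in for a task context.
  final case class Ctx(stageId: Int, partitionId: Int, attemptNumber: Int)

  // Holds the context of whichever task is running on the current thread.
  object CtxHolder {
    private val current = new ThreadLocal[Ctx]
    def set(c: Ctx): Unit = current.set(c)
    def get: Option[Ctx] = Option(current.get())
    def unset(): Unit = current.remove()
  }

  abstract class MiniTask[T](stageId: Int, partitionId: Int) {
    // Subclasses supply the actual work, like ResultTask and ShuffleMapTask do.
    protected def runTask(ctx: Ctx): T

    // final: subclasses cannot bypass the setup and cleanup around runTask.
    final def run(attemptNumber: Int): T = {
      val ctx = Ctx(stageId, partitionId, attemptNumber)
      CtxHolder.set(ctx)
      try runTask(ctx)
      finally CtxHolder.unset() // assumption: cleanup happens here in this sketch
    }
  }
}
```

Code called from inside runTask can read the context through CtxHolder.get without having it passed down explicitly, which is the point of publishing it per thread.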
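Next comes the runTask method. Task has two implementations: ResultTask (the task of a ResultStage, one per partition of the final RDD) and ShuffleMapTask (the task of a ShuffleMapStage, one per partition of the last RDD in that stage). Each has its own runTask implementation. First, ResultTask: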
    override def runTask(context: TaskContext): U = {
      val deserializeStartTime = System.currentTimeMillis()
      val ser = SparkEnv.get.closureSerializer.newInstance()
      // Deserialize rdd and func
      val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
      _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
      // Apply func to the iterator over the rdd's given partition and return the result
      func(context, rdd.iterator(partition, context))
    }
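Deserialize rdd and func from the broadcast variable taskBinary
Apply func to the iterator over the rdd's given partition and return the result
The func here differs depending on the specific operation; the records of the partition are read one by one through the iterator.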
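To make "func differs depending on the operation" concrete, here is a small sketch in which the same runTask shape serves a count-like and a collect-like operation. Everything in it is a stand-in: partitions are modelled as a Vector of Vectors rather than an RDD, and Ctx, countFunc and collectFunc are hypothetical names.

```scala
object ResultTaskSketch {
  // Hypothetical minimal context: just the partition this task works on.
  final case class Ctx(partitionId: Int)

  // Stand-in for rdd.iterator(partition, context): iterate one partition's records.
  def partitionIterator[T](partitions: Vector[Vector[T]], ctx: Ctx): Iterator[T] =
    partitions(ctx.partitionId).iterator

  // Same shape as ResultTask.runTask: apply func to the partition's iterator.
  def runTask[T, U](partitions: Vector[Vector[T]], ctx: Ctx)(func: (Ctx, Iterator[T]) => U): U =
    func(ctx, partitionIterator(partitions, ctx))

  // A count-like operation: each task returns the size of its partition.
  def countFunc[T]: (Ctx, Iterator[T]) => Long = (_, it) => it.size.toLong

  // A collect-like operation: each task returns its partition's records.
  def collectFunc[T]: (Ctx, Iterator[T]) => Vector[T] = (_, it) => it.toVector
}
```

In this sketch, running countFunc once per partition and summing the per-task results gives the total count, which is how per-partition results combine into a final answer.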
Now the ShuffleMapTask implementation. A ShuffleMapTask writes its output to disk through shuffle write, preparing the data for the shuffle read of the downstream stage:
    override def runTask(context: TaskContext): MapStatus = {
      // Deserialize the RDD using the broadcast variable.
      val deserializeStartTime = System.currentTimeMillis()
      val ser = SparkEnv.get.closureSerializer.newInstance()
      // Deserialize rdd and the ShuffleDependency from the broadcast variable
      val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
      _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

      var writer: ShuffleWriter[Any, Any] = null
      try {
        // Get the shuffleManager
        val manager = SparkEnv.get.shuffleManager
        // Get the shuffle writer through shuffleManager's getWriter() method
        writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
        // Traverse every record of the given partition through the rdd's iterator method,
        // and pass that iterator to the writer's write method to write the data
        writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
        writer.stop(success = true).get
      } catch {
        case e: Exception =>
          try {
            if (writer != null) {
              writer.stop(success = false)
            }
          } catch {
            case e: Exception =>
              log.debug("Could not stop writer", e)
          }
          throw e
      }
    }
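Deserialize rdd and the ShuffleDependency from the broadcast variable taskBinary
Obtain a writer from the ShuffleManager and call its write method to write one partition of the rdd to disk
The rdd's iterator method traverses all the data of the corresponding partition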
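The write/stop contract used by ShuffleMapTask can be illustrated with a toy writer. This sketch makes several assumptions that differ from Spark: records are kept in in-memory buffers rather than written to disk, they are routed to reducers by key hash code, and MiniMapStatus (a hypothetical type) records how many records went to each reducer, not where the output lives or how many bytes it holds.

```scala
import scala.collection.mutable

object ShuffleWriteSketch {
  // Hypothetical stand-in for MapStatus: record count per reduce partition.
  final case class MiniMapStatus(recordsPerReducer: Vector[Int])

  final class MiniShuffleWriter[K, V](numReducers: Int) {
    // One in-memory bucket per reduce partition (assumption: real writers go to disk).
    private val buckets = Vector.fill(numReducers)(mutable.ArrayBuffer.empty[(K, V)])

    // Assumption: hash partitioning with a non-negative modulus.
    private def reducerFor(key: K): Int = {
      val mod = key.hashCode % numReducers
      if (mod < 0) mod + numReducers else mod
    }

    // Consume the partition's iterator and route each record to its bucket.
    def write(records: Iterator[(K, V)]): Unit =
      records.foreach { case (k, v) => buckets(reducerFor(k)) += ((k, v)) }

    // On success report what was written; on failure report nothing.
    def stop(success: Boolean): Option[MiniMapStatus] =
      if (success) Some(MiniMapStatus(buckets.map(_.size))) else None

    def bucket(i: Int): Seq[(K, V)] = buckets(i).toList
  }
}
```

The Option returned by stop is why the task code above calls writer.stop(success = true).get on the success path and ignores the return value on the failure path.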
How the driver handles the result once it receives it will be covered in a follow-up article.
Author: BIGUFO
Link: https://www.jianshu.com/p/959954008583