Spark Streaming程序的运行,不断的产生job,不断的生成RDD、不断的接收数据存储数据,不断的保存元数据等,如果不清理这些数据,内存和磁盘空间都会崩溃,看一下Spark Streaming是如何做清理工作的
Spark Streaming在Job运行完成时会触发数据清理动作,看JobHandler中run()方法的代码
def run() { try { val formattedTime = UIUtils.formatBatchTime(job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false) val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}" val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]" ssc.sc.setJobDescription( s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""") ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString) ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString) // We need to assign `eventLoop` to a temp variable. Otherwise, because // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then // it's possible that when `post` is called, `eventLoop` happens to null. var _eventLoop = eventLoop if (_eventLoop != null) { _eventLoop.post(JobStarted(job, clock.getTimeMillis())) // Disable checks for existing output directories in jobs launched by the streaming // scheduler, since we may need to write output to an existing directory during checkpoint // recovery; see SPARK-4835 for more details. PairRDDFunctions.disableOutputSpecValidation.withValue(true) { // run方法中包含了job的提交函数,触发sparkContext.runJob,真正的提交job job.run() } _eventLoop = eventLoop if (_eventLoop != null) { _eventLoop.post(JobCompleted(job, clock.getTimeMillis())) } } else { // JobScheduler has been stopped. } } finally { ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null) ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null) } }
job.run执行之后,job运行完成。发送一个JobCompleted消息给事件循环器,事件循环器调用handleJobCompletion()方法,代码如下
private def handleJobCompletion(job: Job, completedTime: Long) { val jobSet = jobSets.get(job.time) jobSet.handleJobCompletion(job) job.setEndTime(completedTime) listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo)) logInfo("Finished job " + job.id + " from job set of time " + jobSet.time) if (jobSet.hasCompleted) { jobSets.remove(jobSet.time) jobGenerator.onBatchCompletion(jobSet.time) logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format( jobSet.totalDelay / 1000.0, jobSet.time.toString, jobSet.processingDelay / 1000.0 )) listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo)) } job.result match { case Failure(e) => reportError("Error running job " + job, e) case _ => } }
这里判断了jobSet是否完成,如果完成调用jobGenerator的onBatchCompletion方法,代码如下
jobGenerator.onBatchCompletion(jobSet.time)
onBachCompletion的代码如下
def onBatchCompletion(time: Time) { eventLoop.post(ClearMetadata(time)) }
然后发送一个ClearMetadata消息,看他的ClearMetadata的处理方法,代码如下
private def clearMetadata(time: Time) { ssc.graph.clearMetadata(time) // If checkpointing is enabled, then checkpoint, // else mark batch to be fully processed if (shouldCheckpoint) { eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true)) } else { // If checkpointing is not enabled, then delete metadata information about // received blocks (block data not saved in any case). Otherwise, wait for // checkpointing of this batch to complete. val maxRememberDuration = graph.getMaxInputStreamRememberDuration() jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration) jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration) markBatchFullyProcessed(time) } }
这里调用了DStreamGreph的clearMetadata()方法,代码如下
def clearMetadata(time: Time) { logDebug("Clearing metadata for time " + time) this.synchronized { outputStreams.foreach(_.clearMetadata(time)) } logDebug("Cleared old metadata for time " + time) }
分别调用每一个outputStream的clearMetadata(time)方法,代码如下
private[streaming] def clearMetadata(time: Time) { val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true) val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration)) logDebug("Clearing references to old RDDs: [" + oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]") generatedRDDs --= oldRDDs.keys if (unpersistData) { logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", ")) oldRDDs.values.foreach { rdd => rdd.unpersist(false) // Explicitly remove blocks of BlockRDD rdd match { case b: BlockRDD[_] => logInfo("Removing blocks of RDD " + b + " of time " + time) b.removeBlocks() case _ => } } } logDebug("Cleared " + oldRDDs.size + " RDDs that were older than " + (time - rememberDuration) + ": " + oldRDDs.keys.mkString(", ")) dependencies.foreach(_.clearMetadata(time)) }
第一步从generatedRDDs中过滤出不用的oldRDDs ,过滤的依据是当前batch的时间-rememberDuration,rememberDuration很关键,一般是batch的倍数,如果有windows操作,他会加上windowsDuration,最终结果就是保证还需要被使用的RDD不被清理。
第二步从内存数据结构generatedRDDs中删除oldRDDs
第三步判断是否清理RDD的持久化数据,默认是清理,调用rdd的unpersist方法清理缓存数据。如果是BlockRDD,调用BlockRDD的removeBlocks()方法,从BlockManager中清除BlockRDD接收的数据
第四步清理依赖关系
作者:海纳百川_spark
链接:https://www.jianshu.com/p/c4b477d7c13d