本文内容以以Socket数据来源为例,通过WordCount计算来跟踪Job的生成
代码如下:
objectNetworkWordCount { defmain(args:Array[String]) { if (args.length< 2) { System.err.println("Usage:NetworkWordCount<hostname> <port>") System.exit(1) } val sparkConf= newSparkConf().setAppName("NetworkWordCount").setMaster("local[2]") val ssc = newStreamingContext(sparkConf,Seconds(1)) val lines= ssc.socketTextStream(args(0),args(1).toInt,StorageLevel.MEMORY_AND_DISK_SER) val words= lines.flatMap(_.split("")) val wordCounts= words.map(x => (x,1)).reduceByKey(_+ _) wordCounts.print() ssc.start() ssc.awaitTermination() } }
从ssc.start()开始看,在start方法中调用了scheduler的start()方法,这里的scheduler就是
JobScheduler,我们看start的代码
def start(): Unit = synchronized { if (eventLoop != null) return // scheduler has already been started logDebug("Starting JobScheduler") eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") { override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event) override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e) } // 启动JobScheduler的事件循环器 eventLoop.start() // attach rate controllers of input streams to receive batch completion updates for { inputDStream <- ssc.graph.getInputStreams rateController <- inputDStream.rateController } ssc.addStreamingListener(rateController) listenerBus.start(ssc.sparkContext) receiverTracker = new ReceiverTracker(ssc) inputInfoTracker = new InputInfoTracker(ssc) // 启动ReceiverTracker,数据的接收逻辑从这里开始 receiverTracker.start() // 启动JobGenerator,job的生成从这里开始 jobGenerator.start() logInfo("Started JobScheduler") }
Spark Streaming由JobScheduler、ReceiverTracker、JobGenerator三大组件组成,其中ReceiverTracker、
JobGenerator包含在JobScheduler中。这里分别执行三大组件的start方法。
我们先看Job的生成,jobGenerator.start()方法。在JobGenerator的start方法中都做了什么,继续往下看。
首先启动了一个EventLoop并来回调processEvent方法,那么什么时候会触发回调呢,来看一下EventLoop的内部结构
private[spark] abstract class EventLoop\[E](name: String) extends Logging { //线程安全的阻塞队列 private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque\[E]() private val stopped = new AtomicBoolean(false) private val eventThread = new Thread(name) { //后台线程 setDaemon(true) override def run(): Unit = { try { while (!stopped.get) { val event = eventQueue.take() try { //回调子类的onReceive方法,就是事件的逻辑代码 onReceive(event) } catch { case NonFatal(e) => { try { onError(e) } catch { case NonFatal(e) => logError("Unexpected error in " + name, e) } } } } } catch { case ie: InterruptedException => // exit even if eventQueue is not empty case NonFatal(e) => logError("Unexpected error in " + name, e) } } } def start(): Unit = { if (stopped.get) { throw new IllegalStateException(name + " has already been stopped") } // Call onStart before starting the event thread to make sure it happens before onReceive onStart() // 启动事件循环器 eventThread.start() } def stop(): Unit = { // stopped.compareAndSet(false, true) 判断是否为false,同时赋值为true if (stopped.compareAndSet(false, true)) { eventThread.interrupt() var onStopCalled = false try { eventThread.join() // Call onStop after the event thread exits to make sure onReceive happens before onStop onStopCalled = true onStop() } catch { case ie: InterruptedException => Thread.currentThread().interrupt() if (!onStopCalled) { // ie is thrown from `eventThread.join()`. Otherwise, we should not call `onStop` since // it's already called. onStop() } } } else { // Keep quiet to allow calling `stop` multiple times. } } def post(event: E): Unit = { eventQueue.put(event) } def isActive: Boolean = eventThread.isAlive protected def onStart(): Unit = {} protected def onStop(): Unit = {} protected def onReceive(event: E): Unit protected def onError(e: Throwable): Unit }
在EventLoop内部其实是维护了一个队列,开辟了一条后台线程来回调实现类的onReceive方法。
那么是什么时候把事件放入EventLoop的队列中呢,就要找EventLoop的post方法了。在JobGenerator实例化的时
候创建了一个RecurringTimer,代码如下:
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds, // 回调 eventLoop.post(GenerateJobs(new Time(longTime)))将GenerateJobs事件放入事件循环器 longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
RecurringTimer就是一个定时器,看一下他的构造参数和内部代码,
* @param clock 时钟
* @param period 间歇时间
* @param callback 回调方法
* @param name 定时器的名称
很清楚的知道根据用户传入的时间间隔,周期性的回调callback方法。Callback就是前面看到的
longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
将GenerateJobs事件提交到EventLoop的队列中,此时RecurringTimer还没有执行。
回到JobGenerator中的start方法向下看,因为是第一次运行,所以调用了startFirstTime方法。
在startFirstTime方法中,有一行关键代码timer.start(startTime.milliseconds),终于看到了定时器的启动
从定时器的start方法开始往回看,周期性的回调eventLoop.post方法将GenerateJobs事件发送到EvenLoop的队列,然后回调rocessEvent方法,看generateJobs(time)。
generateJobs代码如下
private def generateJobs(time: Time) { // Set the SparkEnv in this thread, so that job generation code can access the environment // Example: BlockRDDs are created in this thread, and it needs to access BlockManager // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed. SparkEnv.set(ssc.env) Try { jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch graph.generateJobs(time) // generate jobs using allocated block } match { case Success(jobs) => // 获取元数据信息 val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time) // 提交jobSet jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos)) case Failure(e) => jobScheduler.reportError("Error generating jobs for time " + time, e) } eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false)) } 进入graph.generateJobs(time) ,调用每一个outputStream的generateJob方法,generateJob代码如下private[streaming] def generateJob(time: Time): Option[Job] = { getOrCompute(time) match { case Some(rdd) => { // jobRunc中包装了runJob的方法 val jobFunc = () => { val emptyFunc = { (iterator: Iterator[T]) => {} } context.sparkContext.runJob(rdd, emptyFunc) } Some(new Job(time, jobFunc)) } case None => None } }
getOrCompute返回一个RDD,RDD的生成以后再说,定义了一个函数jobFunc,可以看到函数的作用是提交job,
把jobFunc封装到Job对象然后返回。
返回的是多个job,jobs生成成功后提交JobSet,代码如下
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
然后分别提交每一个job,把job包装到JobHandler(Runnable子类)交给线程池运行,执行JobHandler的run
方法,调用job.run(),在Job的run方法中就一行,执行Try(func()),这个func()函数就是上面代码中
的jobFunc,看到这里整个Job的生成与提交就连通了。下面附上一张Job动态生成流程图
以上内容如有错误,欢迎指正
作者:海纳百川_spark
链接:https://www.jianshu.com/p/717b3ed2be54