Debugging Spark Streaming Application

This guide walks you through the different debugging options available to peek at the internals of your Spark Streaming application. The three important places to look are:

  • Spark UI
  • Driver Logs
  • Executor Logs

Spark UI

Once you start the streaming job, there is a wealth of information available in the Spark UI and its Streaming tab to learn more about what's happening in your streaming application. To get to the Spark UI, you can click on the attached cluster as shown below:

../../../_images/Getting2SparkUI.png

Streaming Tab

Once you get to the Spark UI, you will see a Streaming tab if a streaming job is running in the cluster. (If no streaming job is running in the cluster, this tab will not be visible. You can skip to the "Driver Logs" section of this guide to learn how to check for exceptions that might have occurred while starting the streaming job.)

The first thing to look for on this page is whether your streaming application is receiving any input events from your source. In this case, you can see that the job receives 1000 events/second. (Note: for TextFileStream, since files are the input, the # of input events is always 0. In such cases, look at the "Completed Batches" section of this guide to find more information.)

If you have an application that receives multiple input streams, you can click on the "Input Rate" link, which will show the # of events received for each receiver.

../../../_images/StreamingTab.png

Processing Time

As you scroll down, find the graph for "Processing Time". This is one of the key graphs for understanding the performance of your streaming job. As a general rule of thumb, it is good if you can process each batch within 80% of your batch interval.

For this application, the batch interval was 2 seconds. The average processing time is 450 ms, which is well under the batch interval. If the average processing time is close to or greater than your batch interval, your streaming application will start queuing up batches, resulting in a backlog that can eventually bring down your streaming job.
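The batch interval is fixed when the streaming context is created. A minimal PySpark sketch, assuming sc is an existing SparkContext (as provided in a Databricks notebook):

    from pyspark.streaming import StreamingContext

    # Batch interval of 2 seconds, as in the application above. Each batch
    # should finish processing comfortably within this window (the 80% rule
    # of thumb) to avoid building up a backlog.
    ssc = StreamingContext(sc, 2)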

../../../_images/ProcessingTime.png

Completed Batches

Towards the end of the page, you will see a list of the completed batches; the page displays details about the last 1000 batches that completed. From the table, you can get the # of events processed in each batch and its processing time. If you want to know more about what happened in one of the batches, you can click on the batch link to get to the Batch Details Page.

../../../_images/CompletedBatches.png

Batch Details Page

This page has all the details you want to know about a batch. Two key things are:

  • Input: It has details about the input to the batch. In this case, it has details about the Kafka topic, partitions, and offsets read by Spark Streaming for this batch. In the case of TextFileStream, you will see a list of file names that were read for this batch. This is the best way to start debugging a streaming application that reads from text files (see the sketch after this list).
  • Processing: You can click on the link to the Job ID, which has all the details about the processing done during this batch.
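For reference, a TextFileStream like the one mentioned above is created from a directory, and the file names listed on this page are the files picked up from that directory for the batch. A minimal sketch, using a hypothetical input path:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 2)

    # Hypothetical directory; Spark Streaming picks up files that appear
    # here, and the Batch Details Page lists the files read in each batch.
    lines = ssc.textFileStream("/mnt/streaming-input")
    lines.pprint()
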
../../../_images/BatchDetailsPage.png

Job Details Page

The job details page shows the DStream DAG visualization for the batch. This is a very useful visualization for understanding the DAG of operations in every batch. In this case, you can see that the batch read input from a Kafka direct stream, followed by a flatMap operation and then a map operation. The resulting DStream was then used to update a global state using updateStateByKey. (The greyed boxes represent skipped stages. Spark is smart enough to skip some stages if they don't need to be recomputed. If the data is checkpointed or cached, Spark skips recomputing those stages. In this case, those stages correspond to the dependency on previous batches because of updateStateByKey. Since Spark Streaming internally checkpoints the DStream and reads from the checkpoint instead of depending on the previous batches, they are shown as greyed stages.)
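A sketch of what such a pipeline could look like in PySpark, assuming the direct Kafka API from pyspark.streaming.kafka and placeholder broker, topic, and checkpoint values; the checkpoint call is what allows Spark to skip recomputing stages from previous batches:

    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    ssc = StreamingContext(sc, 2)
    # Required by updateStateByKey; also the reason the stages from previous
    # batches show up greyed out (skipped) in the DAG visualization.
    ssc.checkpoint("/mnt/checkpoints")  # hypothetical checkpoint directory

    # Direct Kafka stream of (key, value) pairs; broker and topic names
    # here are placeholders.
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker:9092"})

    words = stream.flatMap(lambda kv: kv[1].split(" "))  # flatMap stage
    pairs = words.map(lambda w: (w, 1))                  # map stage

    def update(new_values, running_count):
        # Fold this batch's counts into the global state for each key.
        return sum(new_values) + (running_count or 0)

    counts = pairs.updateStateByKey(update)              # stateful stage
    counts.pprint()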

At the bottom of the page, you will also find the list of jobs that were executed for this batch. You can click on the links in the description to drill further into the task level execution.

../../../_images/JobDetailsPage1.png
../../../_images/JobDetailsPage2.png

Task Details Page

This is the most granular level of debugging you can get into from the Spark UI for a Spark Streaming application. This page has all the tasks that were executed for this batch. If you are investigating performance issues in your streaming application, this page provides information such as the # of tasks that were executed and where they were executed (on which executors), shuffle information, and so on.

Debugging TIP: Ensure that tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work, even though you have more than one executor in your cluster.
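Two common ways to spread that work out are to run several receivers and union their streams, or to repartition the received data before the heavy processing. A sketch with hypothetical host, port, and partition values:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 2)

    # Several receivers (one per input stream) so ingestion itself runs on
    # more than one executor; the host and ports are placeholders.
    streams = [ssc.socketTextStream("source-host", 9000 + i) for i in range(3)]
    unioned = ssc.union(*streams)

    # Redistribute received data across the cluster before processing, so
    # the work is not confined to the receivers' executors.
    repartitioned = unioned.repartition(8)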

../../../_images/TaskDetailsPage.png

Driver Logs

Driver logs are helpful for two purposes:

  • Exceptions: Sometimes, you may not see the Streaming tab in the Spark UI. This is because the streaming job was not started due to some exception. You can drill into the driver logs to look at the stack trace of the exception. In some cases, the streaming job may have started properly, but you will see that the batches never reach the Completed Batches section; they might all be in the processing or failed state. In such cases too, driver logs can be handy for understanding the nature of the underlying issues.
  • Prints: Any print statements that are part of the DStream DAG show up in the logs too. If you want to quickly check the contents of a DStream, you can do dstream.print() (Scala) or dstream.pprint() (Python), and the contents of the DStream will be in the logs. You can also do dstream.foreachRDD{ print statements here }, and those will show up in the logs as well. (Note: print statements in the streaming function but outside of the DStream DAG will not show up in the logs. Spark Streaming generates and executes only the DStream DAG, so the print statements have to be part of that DAG.)

The following table shows the DStream transformations and where the corresponding logs would appear if the transformation contained a print statement (a sketch after the table illustrates each case):

Transformation                 Log location
foreachRDD(), transform()      Driver stdout logs
foreachPartition()             Executor stdout logs
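A sketch of how the placement of a print determines where its output lands, assuming stream is an existing DStream; the comments mirror the table above:

    # Runs on the driver: the first few elements of each batch go to the
    # driver's stdout logs.
    stream.pprint()

    def per_rdd(rdd):
        # This print runs in driver-side code, so it lands in the driver
        # stdout logs.
        print("batch size: %d" % rdd.count())

        def per_partition(records):
            # This function runs on the executors, so its output lands in
            # each executor's stdout logs.
            for record in records:
                print(record)

        rdd.foreachPartition(per_partition)

    stream.foreachRDD(per_rdd)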

To get to the driver logs, you can click on the attached cluster.

../../../_images/Get2DriverLogs.png
../../../_images/DriverLogs.png

Note: For PySpark streaming, prints and exceptions do not automatically show up in the logs. The current limitation is that a notebook cell needs to be active for the logs to show up; since the streaming job runs in a background thread, the logs are otherwise lost. If you want to see the logs while running a PySpark streaming application, you can call ssc.awaitTerminationOrTimeout(x) in one of the cells in the notebook. This will put the cell on hold for x seconds, and the prints and exceptions produced during that time will then be present in the logs.
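For example, a cell along these lines keeps the notebook attached for 60 seconds so that output produced in that window is captured (the timeout value is arbitrary):

    ssc.start()
    # Block this cell for 60 seconds; prints and exceptions raised by the
    # streaming job during this window will show up in the logs.
    ssc.awaitTerminationOrTimeout(60)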

Executor Logs

Executor logs are sometimes helpful if you see that certain tasks are misbehaving and would like to see the logs for specific tasks. From the Task Details page shown above, you can get the executor where a task was run. Once you have that, go to the clusters UI page, click on the # of nodes, and then the Master. The Master page lists all the workers. You can choose the worker where the suspicious task was run and then get to its log4j output.

../../../_images/Clusters.png
../../../_images/SparkMaster.png
../../../_images/ExecutorPage.png