This guide walks you through the debugging options available for inspecting the internals of your Apache Spark application. The three important places to look are the Spark UI, driver logs, and executor logs.
Once you start the job, the Spark UI shows information about what’s happening in your application. To get to the Spark UI, click the attached cluster.
Once you get to the Spark UI, you will see a Streaming tab if a streaming job is running in this cluster. If there is no streaming job running in this cluster, this tab will not be visible. In that case, skip to Driver logs to learn how to check for exceptions that might have happened while starting the streaming job.
The first thing to check on this page is whether your streaming application is receiving any input events from your source. In this case, you can see the job receives 1000 events/second.
If you have an application that receives multiple input streams, you can click the Input Rate link, which shows the number of events received by each receiver.
As you scroll down, find the graph for Processing Time. This is one of the key graphs for understanding the performance of your streaming job. As a general rule of thumb, it is good if you can process each batch within 80% of your batch interval.
For this application, the batch interval was 2 seconds. The average processing time is 450ms, which is well under the batch interval. If the average processing time is close to or greater than your batch interval, batches will start queuing up, and the resulting backlog can eventually bring down your streaming job.
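The rule of thumb and the backlog behavior described above can be checked with simple arithmetic. The sketch below is plain Python, not a Spark API: the function names are illustrative, and the simulation assumes a constant processing time and one batch processed at a time, which real jobs will not match exactly.

```python
def utilization(processing_ms, batch_interval_ms):
    """Fraction of the batch interval spent processing one batch."""
    return processing_ms / batch_interval_ms


def scheduling_delays(processing_ms, batch_interval_ms, num_batches):
    """Simulate batches arriving once per interval, processed one at a time.

    Returns how long each batch waits before its processing starts.
    Assumes a constant processing time (a simplification).
    """
    delays = []
    free_at = 0  # time at which the previous batch finishes
    for i in range(num_batches):
        arrival = (i + 1) * batch_interval_ms
        start = max(arrival, free_at)
        delays.append(start - arrival)
        free_at = start + processing_ms
    return delays


# Values from the example in the text: 450 ms processing, 2 s batch interval.
print(utilization(450, 2000))            # 0.225, well under the 0.8 rule of thumb
print(scheduling_delays(450, 2000, 5))   # no batch ever waits

# Hypothetical slow job: 2.5 s processing with the same 2 s interval.
print(scheduling_delays(2500, 2000, 5))  # each batch waits 500 ms longer than the last
```

In the slow case the wait grows without bound, which is the backlog that eventually brings the job down.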
Towards the end of the page, you will see a list of all the completed batches. The page displays details about the last 1000 batches that completed. From the table, you can get the number of events processed for each batch and its processing time. If you want to know more about what happened in one of the batches, you can click the batch link to get to the Batch Details Page.
This page has all the details you want to know about a batch. Two key things are:
Input: Has details about the input to the batch. In this case, it has details about the Apache Kafka topic, partition and offsets read by Spark Structured Streaming for this batch. In the case of TextFileStream, you see a list of the file names that were read for this batch. This is the best way to start debugging a streaming application that reads from text files.
Processing: You can click the link to the Job ID which has all the details about the processing done during this batch.
The job details page shows a DAG visualization. This is very useful for understanding the order of operations and dependencies for every batch. In this case, you can see that the batch read input from a Kafka direct stream, followed by a flat map operation and then a map operation. The resulting stream was then used to update a global state using updateStateByKey. (The grayed boxes represent skipped stages. Spark is smart enough to skip some stages if they don’t need to be recomputed. If the data is checkpointed or cached, then Spark skips recomputing those stages. In this case, those stages correspond to the dependency on previous batches because of updateStateByKey. Since Spark Structured Streaming internally checkpoints the stream and reads from the checkpoint instead of depending on the previous batches, they are shown as grayed stages.)
At the bottom of the page, you will also find the list of jobs that were executed for this batch. You can click the links in the description to drill further into the task level execution.
This is the most granular level of debugging you can get into from the Spark UI for a Spark application. This page lists all the tasks that were executed for this batch. If you are investigating performance issues in your streaming application, this page provides information such as the number of tasks that were executed, the executors they ran on, and shuffle information.
Ensure that the tasks are executed on multiple executors (nodes) in your cluster so that there is enough parallelism while processing. If you have a single receiver, sometimes only one executor does all the work even though you have more than one executor in your cluster.
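One way to check the distribution is to tally the Executor ID column from the tasks table. The following is a minimal sketch in plain Python; the function names and the sample IDs are hypothetical, and you would copy the real values from the Spark UI.

```python
from collections import Counter


def tasks_per_executor(executor_ids):
    """Count how many tasks ran on each executor."""
    return Counter(executor_ids)


def busiest_share(counts):
    """Share of all tasks handled by the busiest executor (0.0 if no tasks)."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


# Hypothetical Executor ID column, one entry per task in the stage.
executor_ids = ["0", "0", "0", "0", "0", "0", "0", "1"]
counts = tasks_per_executor(executor_ids)
print(counts)
print(busiest_share(counts))  # 0.875: nearly all the work is on one executor
```

A share close to 1.0 on a multi-executor cluster suggests the single-receiver situation described above.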
A thread dump shows a snapshot of a JVM’s thread states.
Thread dumps are useful in debugging a specific hanging or slow-running task. To view a specific task’s thread dump in the Spark UI:
Click the Jobs tab.
In the Jobs table, find the target job that corresponds to the thread dump you want to see, and click the link in the Description column.
In the job’s Stages table, find the target stage that corresponds to the thread dump you want to see, and click the link in the Description column.
In the stage’s Tasks list, find the target task that corresponds to the thread dump you want to see, and note its Task ID and Executor ID values.
Click the Executors tab.
In the Executors table, find the row whose Executor ID matches the value that you noted earlier. In that row, click the link in the Thread Dump column.
In the Thread dump for executor table, click the row where the Thread Name column contains the text "(TID" followed by the Task ID value that you noted earlier. The task’s thread dump is shown. If the task has finished running, you will not find a matching thread.
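If you have copied the Thread Name column out of the thread dump table, the match in the last step can be sketched as a small filter. This is an illustration only: the sample thread names are hypothetical, the exact wording of thread names depends on the Spark version, and find_task_threads is not a Spark function.

```python
import re


def find_task_threads(thread_names, task_id):
    """Return the thread names that mention the given task ID as "(TID <id>"."""
    # The word boundary keeps task 12 from matching "(TID 123)".
    pattern = re.compile(rf"\(TID {int(task_id)}\b")
    return [name for name in thread_names if pattern.search(name)]


# Hypothetical thread names; the real format may differ.
thread_names = [
    "Executor task launch worker for task 1.0 in stage 3.0 (TID 12)",
    "Executor task launch worker for task 7.0 in stage 3.0 (TID 123)",
    "dispatcher-event-loop-0",
]
print(find_task_threads(thread_names, 12))
```

An empty result is expected once the task has finished running.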
Thread dumps are also useful for debugging issues where the driver appears to be hanging (for example, no Spark progress bars are showing) or making no progress on queries (for example, Spark progress bars are stuck at 100%). To view the driver’s thread dump in the Spark UI:
Click the Executors tab.
In the Executors table, in the driver row, click the link in the Thread Dump column. The driver’s thread dump is shown.
Driver logs are helpful for two purposes:
Exceptions: Sometimes, you may not see the Streaming tab in the Spark UI. This is because the streaming job failed to start because of an exception. You can drill into the driver logs to look at the stack trace of the exception. In other cases, the streaming job may have started properly, but none of the batches ever reach the Completed batches section; they might all be in a processing or failed state. In such cases too, driver logs can help you understand the nature of the underlying issue.
Prints: Any print statements that are part of the DAG show up in the logs too.
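When a driver log is long, it can help to pull out only the lines that name an exception before reading the full stack trace. The sketch below assumes you have the log as text and that exceptions appear as Java-style class names ending in Exception or Error. The sample log lines are made up for illustration and are not real Spark output.

```python
import re

# Matches Java-style class names ending in Exception or Error,
# such as IllegalStateException or OutOfMemoryError.
EXCEPTION_PATTERN = re.compile(r"\b\w+(?:Exception|Error)\b")


def exception_lines(log_text):
    """Return (line_number, line) pairs for lines that name an exception class."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if EXCEPTION_PATTERN.search(line)
    ]


# Made-up sample log, for illustration only.
sample_log = "\n".join([
    "INFO StreamingContext: StreamingContext started",
    "ERROR JobScheduler: Error running job",
    "java.lang.IllegalStateException: example failure",
    "    at com.example.MyJob.process(MyJob.scala:42)",
])
for number, line in exception_lines(sample_log):
    print(number, line)
```

The line numbers point you to where each stack trace begins in the original log.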
Executor logs are sometimes helpful if you see certain tasks misbehaving and would like to see the logs for those specific tasks. From the task details page shown above, you can find the executor where the task was run. Once you have that, you can go to the clusters UI page, click the number of nodes, and then the master. The master page lists all the workers. You can choose the worker where the suspicious task was run and then get to the log4j output.