Spark stage high I/O

Next, look at the I/O stats of the longest stage again:

[Image: I/O stats for the long stage]

What is high I/O?

How much data needs to be in an I/O column to be considered high? To figure this out, start with the highest number in any of the given columns. Then consider the total number of CPU cores you have across all your workers. Generally, each core can read and write about 3 MB per second.

Divide your biggest I/O column by the number of cluster worker cores, then divide that by the stage duration in seconds. If the result is around 3 MB per second per core, you're probably I/O bound. That would be high I/O. For example:
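Here's a minimal sketch of that arithmetic in Python; the stage numbers below are hypothetical, not from a real cluster:

```python
# Rule-of-thumb check for an I/O-bound stage. All numbers are
# hypothetical examples.
stage_io_bytes = 900 * 1024**3  # biggest I/O column, e.g. 900 GB of input
worker_cores = 96               # total CPU cores across all workers
duration_s = 3600               # stage duration in seconds

mb_per_core_per_s = stage_io_bytes / worker_cores / duration_s / 1024**2
print(f"{mb_per_core_per_s:.1f} MB/s per core")
# ~2.7 MB/s per core -- close to the ~3 MB/s ceiling, so likely I/O bound
```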

High input

If you see a lot of input into your stage, that means you're spending a lot of time reading data. First, identify what data this stage is reading. See Identifying an expensive read in Spark's DAG.

After you identify the specific data, look for ways to speed up the read. A common approach is to read less data in the first place: select only the columns you need and filter early so Spark can skip files and partitions, as in the sketch below.
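A minimal PySpark sketch of that idea, assuming a Delta table at a hypothetical path with hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read only what the stage actually needs.
df = (
    spark.read.format("delta")
    .load("/mnt/data/events")               # hypothetical source table
    .select("user_id", "event_type", "ts")  # column pruning: fewer bytes read
    .where(F.col("date") == "2024-01-01")   # partition filter: enables file skipping
)
```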

High output

If you see a lot of output from your stage, that means you're spending a lot of time writing data. First check whether everything you're writing actually needs to be written; then look at how it's written. One common culprit is producing many small files, which is slow on most storage systems. For example:
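A minimal sketch of one fix, repartitioning before the write so the stage produces fewer, larger files; the paths and the partition count are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/data/events")  # hypothetical source

(
    df.repartition(64)               # fewer, larger output files; tune to your data volume
      .write.format("delta")
      .mode("append")
      .save("/mnt/data/events_out")  # hypothetical destination
)
```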

High shuffle

Databricks recommends setting spark.sql.shuffle.partitions=auto to let Spark pick the optimal number of shuffle partitions automatically. If you're not familiar with shuffle, this is the time to learn.
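Here's what that setting looks like in a notebook or job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "auto" is Databricks-specific; open-source Spark expects an integer here.
spark.conf.set("spark.sql.shuffle.partitions", "auto")
```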

No high I/O

If you don't see high I/O in any of the columns, then you need to dig deeper. See Slow Spark stage with little I/O.