Spark stage high I/O

Next, look at the I/O stats of the longest stage again:

[Image: I/O stats for the long stage]

What is high I/O?

How much data needs to be in an I/O column to be considered high? To figure this out, start with the highest number in any of the given columns. Then consider the total number of CPU cores you have across all your workers. Generally, each core can read and write about 3 MB per second.

Divide your biggest I/O column by the number of cluster worker cores, then divide that by the stage duration in seconds. If the result is around 3 MB per second per core, you're probably I/O bound. That would be high I/O. For example:
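Here's a minimal sketch of that arithmetic in Python; the stage numbers below are hypothetical, not from a real cluster:

```python
# Rule-of-thumb check for an I/O-bound stage. All numbers are
# hypothetical examples.
stage_io_bytes = 900 * 1024**3  # biggest I/O column, e.g. 900 GB of input
worker_cores = 96               # total CPU cores across all workers
duration_s = 3600               # stage duration in seconds

mb_per_core_per_s = stage_io_bytes / worker_cores / duration_s / 1024**2
print(f"{mb_per_core_per_s:.1f} MB/s per core")
# ~2.7 MB/s per core -- close to the ~3 MB/s ceiling, so likely I/O bound
```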

High input

If you see a lot of input into your stage, that means you're spending a lot of time reading data. First, identify what data this stage is reading. See Identifying an expensive read in Spark's DAG.

After you identify the specific data, look for ways to speed up the read. A common approach is to read less data in the first place: select only the columns you need and filter early so Spark can skip files and partitions, as in the sketch below.
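A minimal PySpark sketch of that idea, assuming a Delta table at a hypothetical path with hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read only what the stage actually needs.
df = (
    spark.read.format("delta")
    .load("/mnt/data/events")               # hypothetical source table
    .select("user_id", "event_type", "ts")  # column pruning: fewer bytes read
    .where(F.col("date") == "2024-01-01")   # partition filter: enables file skipping
)
```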

High output

If you see a lot of output from your stage, that means you're spending a lot of time writing data. First check whether everything you're writing actually needs to be written; then look at how it's written. One common culprit is producing many small files, which is slow on most storage systems. For example:
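A minimal sketch of one fix, repartitioning before the write so the stage produces fewer, larger files; the paths and the partition count are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/data/events")  # hypothetical source

(
    df.repartition(64)               # fewer, larger output files; tune to your data volume
      .write.format("delta")
      .mode("append")
      .save("/mnt/data/events_out")  # hypothetical destination
)
```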

High shuffle

Databricks recommends setting spark.sql.shuffle.partitions=auto to let Spark pick the optimal number of shuffle partitions automatically. If you're not familiar with shuffle, this is the time to learn.
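Here's what that setting looks like in a notebook or job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "auto" is Databricks-specific; open-source Spark expects an integer here.
spark.conf.set("spark.sql.shuffle.partitions", "auto")
```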

No high I/O

If you don't see high I/O in any of the columns, then you need to dig deeper. See Slow Spark stage with little I/O.