Spark stage high I/O

Next, look at the I/O stats of the longest stage again:

Long Stage I/O

What is high I/O?

How much data needs to be in an I/O column to be considered high? To figure this out, first start with the highest number in any of the given columns. Then consider the total number of CPU cores you have across all our workers. Generally each core can read and write about 3 MBs per second.

Divide your biggest I/O column by the number of cluster worker cores, then divide that by duration seconds. If the result is around 3 MB, then you’re probably I/O bound. That would be high I/O.

High input

If you see a lot of input into your stage, that means you’re spending a lot of time reading data. First, identify what data this stage is reading. See Identifying an expensive read in Spark’s DAG.

After you identify the specific data, here are some approaches to speeding up your reads:

  • Use Delta.

  • Try Photon. It can help a lot with read speed, especially for wide tables.

  • Make your query more selective so it doesn’t need to read as much data.

  • Reconsider your data layout so that data skipping is more effective.

  • If you’re reading the same data multiple times, use the Delta cache.

  • If you’re doing a join, consider trying to get DFP working.

High output

If you see a lot of output from your stage, that means you’re spending a lot of time writing data. Here are some approaches to resolving this:

High shuffle

If you’re not familiar with shuffle, this is the time to learn.

No high I/O

If you don’t see high I/O in any of the columns, then you need to dig deeper. See Slow Spark stage with little I/O.