One Spark task

If you see a long-running stage with just one task, that’s likely a sign of a problem. While this one task is running, only one CPU core is in use and the rest of the cluster may sit idle. This happens most frequently in the following situations:

  • Expensive UDF on small data

  • Window function without a PARTITION BY clause, which moves all rows into a single partition

  • Reading from an unsplittable file type. This means the file cannot be read in multiple parts, so you end up with one big task. Gzip is an example of an unsplittable file type.

  • Setting the multiLine option when reading a JSON or CSV file. In multi-line mode each file is loaded as a whole and cannot be split.

  • Schema inference of a large file

  • Use of repartition(1) or coalesce(1)
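
To illustrate the unsplittable-file case from the list above, here is a minimal sketch in plain Python (standard library only, no Spark). It assumes a small in-memory dataset, and the helper names `read_plain_split` and `try_gzip_split` are hypothetical, introduced only for this illustration. It is not how Spark reads files internally; it only shows why a reader can start in the middle of an uncompressed text file but not in the middle of a gzip stream.

```python
import gzip
import zlib

# Hypothetical sample data: 10,000 newline-terminated records held in memory.
records = [f"row-{i},value-{i * 7}".encode() for i in range(10_000)]
plain = b"\n".join(records) + b"\n"
compressed = gzip.compress(plain)


def read_plain_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the records whose first byte lies in [start, end).

    Assumes `data` ends with a newline. A reader can jump to any offset,
    skip the partial record it lands in, and continue from the next one.
    """
    if start > 0:
        # Find the first record boundary at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    else:
        pos = 0
    out = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        out.append(data[pos:newline])
        pos = newline + 1
    return out


def try_gzip_split(data: bytes, start: int) -> bool:
    """Report whether decompression can begin at byte offset `start`."""
    try:
        gzip.decompress(data[start:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this offset, so the stream cannot be decoded here.
        return False


mid = len(plain) // 2
first_half = read_plain_split(plain, 0, mid)
second_half = read_plain_split(plain, mid, len(plain))

# Two independent readers together recover every record of the plain file.
print(first_half + second_half == records)

# A reader starting halfway into the gzip bytes cannot decode anything.
print(try_gzip_split(compressed, len(compressed) // 2))
```

The first check should print True and the second False: the plain file can be divided between two readers, while the gzip file has to be decompressed from its first byte by a single reader. In a Spark job that corresponds to the one big task described above.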