Spark Submit (legacy)

The Spark Submit task type is a legacy pattern for configuring JARs as tasks. Databricks recommends using the JAR task. See JAR task for jobs.

Requirements

  • You can run spark-submit tasks only on new clusters.

  • You must upload your JAR file to a location or Maven repository compatible with your compute configuration. See Java and Scala library support.

  • You cannot access JAR files stored in volumes.

  • Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.

  • Spark-submit tasks do not support Databricks Utilities (dbutils) references. To use Databricks Utilities, use JAR tasks instead.

  • If you use a Unity Catalog-enabled cluster, spark-submit is supported only if the cluster uses the single user access mode. Shared access mode is not supported. See Access modes.

  • Structured Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression "* * * * * ?" (every minute). Because a streaming task runs continuously, it should always be the final task in a job. A sketch of the corresponding job-level settings follows this list.
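
The following is a minimal sketch of the job-level settings this guidance implies, written as Jobs API-style JSON. The job name and time zone are placeholders, and the field names assume the Jobs API schema; in the UI they correspond to the job's maximum concurrent runs and schedule settings.

  {
    "name": "structured-streaming-spark-submit-job",
    "max_concurrent_runs": 1,
    "schedule": {
      "quartz_cron_expression": "* * * * * ?",
      "timezone_id": "UTC"
    }
  }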

Configure a Spark Submit task

Add a Spark Submit task from the Tasks tab in the Jobs UI by doing the following:

  1. In the Type drop-down menu, select Spark Submit.

  2. Use Compute to configure a cluster that supports the logic in your task.

  3. Use the Parameters text box to provide all arguments and configurations needed to run your task, formatted as a JSON array of strings.

    • The first three arguments are used to identify the main class to run in a JAR at a specified path, as in the following example:

      ["--class", "org.apache.spark.mainClassName", "dbfs:/Filestore/libraries/jar_path.jar"]
      
    • You cannot override the master, deploy-mode, and executor-cores settings configured by Databricks.

    • Use --jars and --py-files to add dependent Java, Scala, and Python libraries.

    • Use --conf to set Spark configurations.

    • The --jars, --py-files, and --files arguments support DBFS and S3 paths.

    • By default, the Spark submit job uses all available memory, excluding memory reserved for Databricks services. You can set --driver-memory and --executor-memory to smaller values to leave some room for off-heap usage. A combined example of these options appears after these steps.

  4. Click Save task.
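
As a reference, the following sketch shows a fuller Parameters array combining the options described in step 3. The class name, library and application JAR paths, Spark configuration, and memory values are placeholders, not values from this article. Note that spark-submit treats everything after the application JAR path as arguments to the main class, so option flags such as --jars and --conf must come before the JAR path.

  ["--class", "org.example.MainClass",
   "--jars", "dbfs:/FileStore/libraries/dependency.jar,s3://my-bucket/libs/extra.jar",
   "--conf", "spark.sql.shuffle.partitions=8",
   "--driver-memory", "8g",
   "--executor-memory", "24g",
   "dbfs:/FileStore/libraries/app.jar",
   "app-arg-1", "app-arg-2"]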