Spark Submit (legacy)
The Spark Submit task type is a legacy pattern for configuring JARs as tasks. Databricks recommends using the JAR task. See JAR task for jobs.
Requirements
You can run spark-submit tasks only on new clusters.
You must upload your JAR file to a location or Maven repository compatible with your compute configuration. See Java and Scala library support.
You cannot access JAR files stored in volumes.
Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.
Spark-submit does not support Databricks Utilities (dbutils) references. To use Databricks Utilities, use JAR tasks instead.
If you use a Unity Catalog-enabled cluster, spark-submit is supported only if the cluster uses the single user access mode. Shared access mode is not supported. See Access modes.
Structured Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression "* * * * * ?" (every minute). Because a streaming task runs continuously, it should always be the final task in a job.
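For example, a job-level schedule that runs a streaming task every minute without overlapping runs might use Jobs API fields like the following. This is a minimal sketch; the job name and timezone are placeholders.
{
  "name": "streaming-spark-submit-job",
  "max_concurrent_runs": 1,
  "schedule": {
    "quartz_cron_expression": "* * * * * ?",
    "timezone_id": "UTC"
  }
}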
Configure a Spark Submit task
Add a Spark Submit task from the Tasks tab in the Jobs UI by doing the following:
In the Type drop-down menu, select Spark Submit.
Use Compute to configure a cluster that supports the logic in your task.
Use the Parameters text box to provide all arguments and configurations necessary to run your task as a JSON array of strings.
The first three arguments are used to identify the main class to run in a JAR at a specified path, as in the following example:
["--class", "org.apache.spark.mainClassName", "dbfs:/Filestore/libraries/jar_path.jar"]
You cannot override the master, deploy-mode, and executor-cores settings configured by Databricks.
Use --jars and --py-files to add dependent Java, Scala, and Python libraries.
Use --conf to set Spark configurations.
The --jars, --py-files, and --files arguments support DBFS and S3 paths.
By default, the Spark submit job uses all available memory, excluding memory reserved for Databricks services. You can set --driver-memory and --executor-memory to a smaller value to leave some room for off-heap usage. A combined example follows this list.
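The following is a hypothetical Parameters value that combines these options; the dependency JAR path, Spark configuration, and memory values are placeholders. Options precede the application JAR path, which spark-submit treats as the application to run.
["--class", "org.apache.spark.mainClassName", "--jars", "dbfs:/Filestore/libraries/dependency.jar", "--conf", "spark.sql.shuffle.partitions=200", "--driver-memory", "8g", "--executor-memory", "8g", "dbfs:/Filestore/libraries/jar_path.jar"]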
Click Save task.
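You can also define the same task programmatically, for example with the Jobs API. The following is a minimal sketch of a task definition that pairs a spark_submit_task with a new cluster (spark-submit tasks run only on new clusters); the task key, Databricks Runtime version, node type, and worker count are placeholder values.
{
  "task_key": "spark_submit_example",
  "spark_submit_task": {
    "parameters": [
      "--class", "org.apache.spark.mainClassName",
      "dbfs:/Filestore/libraries/jar_path.jar"
    ]
  },
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2
  }
}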