Spark Submit Task Deprecation Notice & Migration Guide
The Spark Submit task is deprecated and pending removal. Use of this task type is disallowed for new use cases and strongly discouraged for existing customers. See Spark Submit (legacy) for the original documentation of this task type. Keep reading for migration instructions.
Why is Spark Submit being deprecated?
The Spark Submit task type is being deprecated due to technical limitations and feature gaps compared to the JAR, Notebook, and Python script tasks. Those task types offer better integration with Databricks features, improved performance, and greater reliability.
Deprecation measures
Databricks is implementing the following measures in connection with the deprecation:
- Restricted creation: Starting in November 2025, only users who have used Spark Submit tasks in the preceding month can create new Spark Submit tasks. If you need an exception, contact your account support.
- API deprecation notices: API requests that attempt to create or edit a Spark Submit task may be rejected at random in order to surface a deprecation notice. Retry the request with the same parameters until it succeeds (see the retry sketch after this list).
- DBR version restrictions: Spark Submit usage is restricted to existing DBR versions and maintenance releases. Existing DBR versions with Spark Submit will continue to receive security and bugfix maintenance releases until the feature is shut down completely. DBR 17.3+ and 18.x+ will not support this task type.
- UI warnings: Warnings appear throughout the Databricks UI where Spark Submit tasks are in use, and communications are sent to workspace administrators of accounts with existing usage.
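If your automation creates or edits Spark Submit tasks through the Jobs API, a simple retry loop is enough to get past these rejections. The following is a minimal sketch using the Databricks Python SDK; it assumes the rejection surfaces as a DatabricksError, and the commented-out jobs.reset call, job ID, and settings are illustrative placeholders rather than values from your workspace.

import sys
import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import DatabricksError

client = WorkspaceClient()  # reads DATABRICKS_HOST and DATABRICKS_TOKEN from the environment

def call_with_retry(api_call, max_attempts=5):
    """Retry a Jobs API call with identical parameters until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return api_call()
        except DatabricksError as err:
            # Rejections that only serve the deprecation notice are transient;
            # a production version should inspect the error before retrying.
            print(f"Attempt {attempt} rejected: {err}", file=sys.stderr)
            time.sleep(2 * attempt)
    raise RuntimeError("Request still rejected after retries")

# Example (placeholders): retry an edit of an existing job definition.
# result = call_with_retry(lambda: client.jobs.reset(job_id=123, new_settings=new_settings))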
Migrate JVM workloads to JAR tasks
For JVM workloads, migrate your Spark Submit tasks to JAR tasks. JAR tasks provide better feature support and integration with Databricks.
Follow these steps to migrate (a code sketch of the argument mapping follows this list):
- Create a new JAR task in your job.
- From your Spark Submit task parameters, identify the first three arguments. They generally follow this pattern: ["--class", "org.apache.spark.mainClassName", "dbfs:/path/to/jar_file.jar"]
- Remove the --class parameter.
- Set the main class name (for example, org.apache.spark.mainClassName) as the Main class for your JAR task.
- Provide the path to your JAR file (for example, dbfs:/path/to/jar_file.jar) in the JAR task configuration.
- Copy any remaining arguments from your Spark Submit task to the JAR task parameters.
- Run the JAR task and verify it works as expected.
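As an illustration of the mapping above, the following sketch splits a Spark Submit parameter list into the Main class, the JAR path, and the remaining arguments for a JAR task. The helper function and the trailing --input argument are hypothetical, included only to show where each piece of the original parameter list ends up.

def parse_spark_submit_parameters(parameters):
    """Split Spark Submit parameters following the ["--class", "<main class>", "<jar path>", ...] pattern."""
    if len(parameters) < 3 or parameters[0] != "--class":
        raise ValueError("Parameters do not follow the expected --class pattern")
    main_class = parameters[1]       # becomes the Main class of the JAR task
    jar_path = parameters[2]         # the JAR file referenced by the JAR task
    remaining_args = parameters[3:]  # copied into the JAR task parameters
    return main_class, jar_path, remaining_args

# Example based on the pattern shown in the steps above:
main_class, jar_path, remaining_args = parse_spark_submit_parameters(
    ["--class", "org.apache.spark.mainClassName", "dbfs:/path/to/jar_file.jar", "--input", "/data"]
)
# main_class     -> "org.apache.spark.mainClassName"
# jar_path       -> "dbfs:/path/to/jar_file.jar"
# remaining_args -> ["--input", "/data"]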
For detailed information on configuring JAR tasks, see JAR task.
Migrate R workloads
If you're launching an R script directly from a Spark Submit task, multiple migration paths are available.
Option A: Use Notebook tasks
Migrate your R script to a Databricks notebook. Notebook tasks support a full set of features, including cluster autoscaling, and provide better integration with the Databricks platform.
Option B: Bootstrap R scripts from a Notebook task
Use a Notebook task to bootstrap your R scripts. Create a notebook with the following code and pass the path to your R file as a job parameter, modifying the notebook to add any other parameters your R script needs. A sketch of the corresponding Notebook task configuration follows the snippet:
# Read the path to the R script from the script_path job parameter (widget)
dbutils.widgets.text("script_path", "", "Path to script")
script_path <- dbutils.widgets.get("script_path")

# Run the R script
source(script_path)
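The notebook above reads the path to the R file from the script_path widget, so the Notebook task only needs to pass that path as a base parameter. Here is a minimal sketch using the Databricks Python SDK; the job name, notebook path, cluster ID, and script path are placeholders for your own values.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

client = WorkspaceClient()  # reads DATABRICKS_HOST and DATABRICKS_TOKEN from the environment

# Placeholder paths and IDs; replace them with your own notebook, cluster, and R script.
created = client.jobs.create(
    name="bootstrap-r-script",
    tasks=[
        jobs.Task(
            task_key="run_r_script",
            existing_cluster_id="<cluster-id>",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/you@example.com/bootstrap_r",
                base_parameters={"script_path": "/Workspace/Users/you@example.com/my_script.R"},
            ),
        )
    ],
)
print(f"Created job {created.job_id}")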
Find jobs that use Spark Submit tasks
You can use the following Python script to identify all jobs in your workspace (among those viewable by you) that contain Spark Submit tasks. This helps you inventory affected jobs and plan your migration. You need a valid personal access token (or other supported token) and your workspace URL.
#!/usr/bin/env python3
"""
Requirements:
databricks-sdk>=0.20.0
Usage:
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-token"
python list_spark_submit_jobs.py
Output:
CSV format with columns: job_id, owner_email, job_name
Note:
DATABRICKS_HOST must be the bare workspace URL, without the ?o=<workspace-id> query parameter.
Incorrect: export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com/?o=12345678910"
"""
import csv
import os
import sys
from databricks.sdk import WorkspaceClient
def main():
    # Get credentials from environment
    workspace_url = os.environ.get("DATABRICKS_HOST")
    token = os.environ.get("DATABRICKS_TOKEN")
    if not workspace_url or not token:
        print("Error: Set DATABRICKS_HOST and DATABRICKS_TOKEN environment variables", file=sys.stderr)
        sys.exit(1)

    # Initialize client
    client = WorkspaceClient(host=workspace_url, token=token)

    # Scan workspace for jobs with Spark Submit tasks
    print("Scanning workspace for jobs with Spark Submit tasks... (this will take a while)", file=sys.stderr)
    jobs_with_spark_submit = []
    total_jobs = 0
    for job in client.jobs.list(expand_tasks=True):
        total_jobs += 1
        # Check if job has any Spark Submit tasks
        if job.settings and job.settings.tasks:
            has_spark_submit = any(
                task.spark_submit_task is not None
                for task in job.settings.tasks
            )
            if has_spark_submit:
                job_name = job.settings.name or f"Unnamed Job {job.job_id}"
                owner_email = job.creator_user_name or "Unknown"
                jobs_with_spark_submit.append({
                    'job_id': job.job_id,
                    'owner_email': owner_email,
                    'job_name': job_name
                })

    # Print summary to stderr
    print(f"Scanned {total_jobs} jobs total", file=sys.stderr)
    print(f"Found {len(jobs_with_spark_submit)} jobs with Spark Submit tasks", file=sys.stderr)
    print("", file=sys.stderr)

    # Output CSV to stdout
    if jobs_with_spark_submit:
        writer = csv.DictWriter(
            sys.stdout,
            fieldnames=['job_id', 'owner_email', 'job_name'],
            quoting=csv.QUOTE_MINIMAL
        )
        writer.writeheader()
        writer.writerows(jobs_with_spark_submit)
    else:
        print("No jobs with Spark Submit tasks found.", file=sys.stderr)

if __name__ == "__main__":
    main()
Need help?
If you need additional help, please contact your account support.