Migrate production workloads to Databricks

This guide explains how to move your production jobs from Apache Spark on other platforms to Apache Spark on Databricks.


Databricks job

A single unit of code that you can bundle and submit to Databricks. A Databricks job is equivalent to a Spark application with a single SparkContext. The entry point can be in a library (for example, JAR, egg, wheel) or a notebook. You can run Databricks jobs on a schedule with sophisticated retries and alerting mechanisms. The primary interfaces for running jobs are the Jobs API and UI.


Pool

A set of instances in your account that are managed by Databricks but incur no Databricks charges while they are idle. Submitting multiple jobs to a pool ensures that your jobs start quickly. You can set guardrails (instance types, instance limits, and so on) and autoscaling policies for the pool of instances. A pool is equivalent to an autoscaling cluster on other Spark platforms.

Migration steps

This section provides the steps for moving your production jobs to Databricks.

Step 1: Create a pool

Create an autoscaling pool. This is equivalent to creating an autoscaling cluster in other Spark platforms. On other platforms, if instances in the autoscaling cluster are idle for a few minutes or hours, you pay for them. Databricks manages the instance pool for you for free. That is, you don’t pay Databricks if these machines are not in use; you pay only the cloud provider. Databricks charges only when jobs are run on the instances.

Create pool

Key configurations:

  • Min Idle: Number of standby instances, not in use by jobs, that the pool maintains. You can set this to 0.

  • Max Capacity: This is an optional field. If you already have cloud provider instance limits set, you can leave this field empty. If you want to set additional max limits, set a high value so that a large number of jobs can share the pool.

  • Idle Instance Auto Termination: Instances above Min Idle are released back to the cloud provider if they are idle for the specified period. The higher the value, the longer instances are kept ready, and the faster your jobs start.
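As a sketch, the same pool can also be created programmatically through the Instance Pools API (POST /api/2.0/instance-pools/create). The pool name and node type below are placeholders; choose values appropriate for your cloud provider and workload:

```json
{
  "instance_pool_name": "jobs-pool",
  "node_type_id": "i3.xlarge",
  "min_idle_instances": 0,
  "max_capacity": 100,
  "idle_instance_autotermination_minutes": 60
}
```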

Step 2: Run a job on a pool

You can run a job on a pool using the Jobs API or the UI. You run each job by providing a cluster spec. When a job is about to start, Databricks automatically creates a new cluster from the pool, and the cluster is automatically terminated when the job finishes. You are charged only for the time your job runs. This is the most cost-effective way to run jobs on Databricks. Each new cluster has:

  • One associated SparkContext, which is equivalent to a Spark application on other Spark platforms.

  • A driver node and a specified number of workers. For a single job, you can specify a worker range. Databricks autoscales a single Spark job based on the resources needed for that job. Databricks benchmarks show that this can save you up to 30% on cloud costs, depending on the nature of your job.
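For example, a cluster spec that draws instances from the pool and lets Databricks autoscale a single job between 2 and 10 workers might look like the following (the pool ID is the placeholder used elsewhere in this guide):

```json
{
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 10
    }
  }
}
```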

There are three ways to run jobs on a pool: the API/CLI, Airflow, and the UI.

API/CLI


  1. Download and configure the Databricks CLI.

  2. Run the following command to submit your code one time. The API returns a URL that you can use to track the progress of the job run.

    databricks runs submit --json '{
      "run_name": "my spark job",
      "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
        "num_workers": 10
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        }
      ],
      "timeout_seconds": 3600,
      "spark_jar_task": {
        "main_class_name": "com.databricks.ComputeModels"
      }
    }'

  3. To schedule a job, use the following example. Jobs created through this mechanism are displayed in the jobs list page. The return value is a job_id that you can use to look at the status of all the runs.

    databricks jobs create --json '{
      "name": "Nightly model training",
      "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
        "num_workers": 10
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        }
      ],
      "email_notifications": {
        "on_start": ["john@foo.com"],
        "on_success": ["sally@foo.com"],
        "on_failure": ["bob@foo.com"]
      },
      "timeout_seconds": 3600,
      "max_retries": 2,
      "schedule": {
        "quartz_cron_expression": "0 15 22 ? * *",
        "timezone_id": "America/Los_Angeles"
      },
      "spark_jar_task": {
        "main_class_name": "com.databricks.ComputeModels"
      }
    }'

If you use spark-submit to submit Spark jobs, the following list shows how spark-submit parameters map to arguments in the Create a new job operation (POST /jobs/create) in the Jobs API.

  • --class: Use the spark_jar_task structure to provide the main class name and the parameters.

  • --jars: Use the libraries argument to provide the list of dependencies.

  • --py-files: For Python jobs, use the spark_python_task structure. You can use the libraries argument to provide egg or wheel dependencies.

  • --master: In the cloud, you don't need to manage a long-running master node. All the instances and jobs are managed by Databricks services. Ignore this parameter.

  • --deploy-mode: Ignore this parameter on Databricks.

  • --conf: In the new_cluster structure, use the spark_conf argument.

  • --num-executors: In the new_cluster structure, use the num_workers argument. You can also use the autoscale option to provide a range (recommended).

  • --driver-memory, --driver-cores: Based on the driver memory and cores you need, choose an appropriate instance type. You provide the instance type for the driver during pool creation. Ignore this parameter during job submission.

  • --executor-memory, --executor-cores: Based on the executor memory and cores you need, choose an appropriate instance type. You provide the instance type for the workers during pool creation. Ignore this parameter during job submission.

  • --driver-class-path: Set spark.driver.extraClassPath to the appropriate value in the spark_conf argument.

  • --driver-java-options: Set spark.driver.extraJavaOptions to the appropriate value in the spark_conf argument.

  • --files: Set spark.files to the appropriate value in the spark_conf argument.

  • --name: In the submit job run request (POST /jobs/runs/submit), use the run_name argument. In the create job request (POST /jobs/create), use the name argument.
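To make the mapping concrete, here is a hypothetical helper that converts a handful of spark-submit options into a Jobs API (POST /jobs/create) payload. This is an illustrative sketch, not an official migration tool; the function name and the keys of the input dictionary are made up, while the output fields follow the Jobs API:

```python
# Hypothetical helper: maps common spark-submit options onto a Jobs API
# payload, following the mapping list above. Only for illustration.
def spark_submit_to_jobs_payload(opts):
    return {
        "name": opts.get("name", "migrated-job"),             # --name
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "instance_pool_id": opts["instance_pool_id"],      # pool from Step 1
            "num_workers": opts.get("num_executors", 2),       # --num-executors
            "spark_conf": opts.get("conf", {}),                # --conf
        },
        "libraries": [{"jar": jar} for jar in opts.get("jars", [])],  # --jars
        "spark_jar_task": {
            "main_class_name": opts["main_class"],             # --class
            "parameters": opts.get("args", []),
        },
    }

payload = spark_submit_to_jobs_payload({
    "name": "Nightly model training",
    "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
    "num_executors": 10,
    "jars": ["dbfs:/my-jar.jar"],
    "main_class": "com.databricks.ComputeModels",
    "conf": {"spark.driver.extraJavaOptions": "-Duser.timezone=UTC"},
})
print(payload["new_cluster"]["num_workers"])  # → 10
```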


Airflow

Databricks offers an Airflow operator if you want to use Airflow to submit jobs to Databricks. The Databricks Airflow operator calls the Trigger a new job run operation (POST /jobs/run-now) of the Jobs API to submit jobs to Databricks. See Orchestrate Databricks jobs with Apache Airflow.
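As a minimal sketch, a DAG that triggers an existing Databricks job nightly might look like the following. It assumes the apache-airflow-providers-databricks package is installed and a Databricks connection is configured in Airflow; the job_id is a placeholder for the value returned by jobs create:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)

with DAG(
    dag_id="nightly_model_training",
    start_date=datetime(2021, 1, 1),
    schedule_interval="15 22 * * *",  # 10:15 PM daily, like the Quartz cron above
    catchup=False,
) as dag:
    # Calls POST /jobs/run-now on an existing job. The job_id below is a
    # placeholder for the id returned by `databricks jobs create`.
    run_job = DatabricksRunNowOperator(
        task_id="run_nightly_model_training",
        job_id=123,
    )
```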


UI

Databricks provides a simple, intuitive UI to submit and schedule jobs. To create and submit jobs from the UI, follow the step-by-step guide.

Step 3: Troubleshoot jobs

Databricks provides many tools to help you troubleshoot your jobs.

Access logs and Spark UI

Databricks maintains a fully managed Spark history server to allow you to access all the Spark logs and Spark UI for each job run. They can be accessed from the job runs page as well as the job run details page:

Job run

Forward logs

You can also forward cluster logs to your cloud storage location. To send logs to your location of choice, use the cluster_log_conf parameter in the new_cluster structure.
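For example, to deliver driver and executor logs to a DBFS path (the destination below is a placeholder), add cluster_log_conf to the cluster spec:

```json
{
  "new_cluster": {
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/cluster-logs"
      }
    }
  }
}
```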

View metrics

While the job is running, you can go to the cluster page and look at the live Ganglia metrics in the Metrics tab. Databricks also snapshots these metrics every 15 minutes and stores them, so you can look at these metrics even after your job is completed. To send metrics to your metrics server, you can install custom agents in the cluster. See Monitor performance.

Ganglia metrics

Set alerts

Use email_notifications in the Create a new job operation (POST /jobs/create) in the Jobs API to get alerts on job failures.

You can also forward these email alerts to PagerDuty, Slack, and other monitoring systems.

Frequently asked questions (FAQs)

Can I run jobs without a pool?

Yes. Pools are optional. You can run jobs directly on a new cluster; in that case, Databricks creates the cluster by asking the cloud provider for the required instances, which takes longer. With a pool, cluster startup takes around 30 seconds when idle instances are available in the pool.

What is a notebook job?

Databricks has different job types: JAR, Python, and notebook. A notebook job runs the code in the specified notebook.

When should I use a notebook job compared to a JAR job?

A JAR job is equivalent to a spark-submit job. It executes the JAR, and you can then look at the logs and Spark UI for troubleshooting. A notebook job executes the specified notebook. You can import libraries in a notebook and call your libraries from the notebook too. The advantage of using a notebook job as the main entry point is that you can easily debug your production jobs' intermediate results in the notebook output area. See JAR jobs.
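To run a notebook job through the API, use the notebook_task structure in place of spark_jar_task. The notebook path and parameters below are placeholders:

```json
{
  "run_name": "nightly notebook run",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/Users/someone@example.com/MyNotebook",
    "base_parameters": {
      "run_date": "2021-01-01"
    }
  }
}
```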

Can I connect to my own Hive metastore?

Yes. Databricks supports both external Hive metastores and the AWS Glue Data Catalog.