Delta Live Tables API guide

Important

This article’s content has been retired and might not be updated. See Delta Live Tables in the Databricks REST API Reference.

The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.

Important

To access Databricks REST APIs, you must authenticate.
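
For example, you can pass a Databricks personal access token in an Authorization header instead of using a .netrc file. A minimal sketch, assuming the token is available in an environment variable named DATABRICKS_TOKEN (a name used here only for illustration):

# Hypothetical example: authenticate with a bearer token instead of a .netrc file.
export DATABRICKS_TOKEN=<your-personal-access-token>

curl -X GET \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
https://<databricks-instance>/api/2.0/pipelines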

Create a pipeline

Endpoint: 2.0/pipelines
HTTP Method: POST

Creates a new Delta Live Tables pipeline.

Example

This example creates a new triggered pipeline.

Request

curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines \
--data @pipeline-settings.json

pipeline-settings.json:

{
  "name": "Wikipedia pipeline (SQL)",
  "storage": "/Users/username/data",
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    }
  ],
  "libraries": [
    {
      "notebook": {
        "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
      }
    }
  ],
  "continuous": false
}

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.
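
The --netrc option tells curl to read credentials from the .netrc file in your home directory. A minimal entry for a Databricks workspace uses the literal word token as the login and a personal access token as the password, for example:

machine <databricks-instance>
login token
password <your-personal-access-token>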

Response

{
  "pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}

Request structure

See PipelineSettings.

Response structure

Field Name

Type

Description

pipeline_id

STRING

The unique identifier for the newly created pipeline.
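
Because the response contains only the pipeline_id, it can be convenient to capture the value for use in later requests. A minimal sketch, assuming the jq command-line JSON processor is installed:

# Hypothetical sketch: create the pipeline and capture the returned pipeline_id
# with jq for use in later requests.
PIPELINE_ID=$(curl --netrc -s -X POST \
https://<databricks-instance>/api/2.0/pipelines \
--data @pipeline-settings.json | jq -r '.pipeline_id')

echo "Created pipeline ${PIPELINE_ID}"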

Edit a pipeline

Endpoint: 2.0/pipelines/{pipeline_id}
HTTP Method: PUT

Updates the settings for an existing pipeline.

Example

This example adds a target parameter to the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request

curl --netrc -X PUT \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 \
--data @pipeline-settings.json

pipeline-settings.json

{
  "id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
  "name": "Wikipedia pipeline (SQL)",
  "storage": "/Users/username/data",
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    }
  ],
  "libraries": [
    {
      "notebook": {
        "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
      }
    }
  ],
  "target": "wikipedia_quickstart_data",
  "continuous": false
}

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Request structure

See PipelineSettings.

Delete a pipeline

Endpoint: 2.0/pipelines/{pipeline_id}
HTTP Method: DELETE

Deletes a pipeline from the Delta Live Tables system.

Example

This example deletes the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request

curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Start a pipeline update

Endpoint: 2.0/pipelines/{pipeline_id}/updates
HTTP Method: POST

Starts an update for a pipeline. You can start an update for the entire pipeline graph, or a selective update of specific tables.

Examples

Start a full refresh

This example starts an update with full refresh for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates \
--data '{ "full_refresh": "true" }'

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response
{
  "update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8",
  "request_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}

Start an update of selected tables

This example starts an update that refreshes the sales_orders_cleaned and sales_order_in_chicago tables in the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates \
--data '{ "refresh_selection": ["sales_orders_cleaned", "sales_order_in_chicago"] }'

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response
{
  "update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8",
  "request_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}

Start a full update of selected tables

This example starts an update of the sales_orders_cleaned and sales_order_in_chicago tables, and an update with full refresh of the customers and sales_orders_raw tables in the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5.

Request
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates \
--data '{ "refresh_selection": ["sales_orders_cleaned", "sales_order_in_chicago"], "full_refresh_selection": ["customers", "sales_orders_raw"] }'

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response
{
  "update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8",
  "request_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}

Request structure

Field Name

Type

Description

full_refresh

BOOLEAN

Whether to reprocess all data. If true, the Delta Live Tables system resets all tables that are resettable before running the pipeline.

This field is optional.

The default value is false.

An error is returned if full_refresh is true and either refresh_selection or full_refresh_selection is set.

refresh_selection

An array of STRING

A list of tables to update. Use refresh_selection to start a refresh of a selected set of tables in the pipeline graph.

This field is optional. If both refresh_selection and full_refresh_selection are empty, the entire pipeline graph is refreshed.

An error is returned if:

  • full_refresh is true and refresh_selection is set.

  • One or more of the specified tables does not exist in the pipeline graph.

full_refresh_selection

An array of STRING

A list of tables to update with full refresh. Use full_refresh_selection to start an update of a selected set of tables. The states of the specified tables are reset before the Delta Live Tables system starts the update.

This field is optional. If both refresh_selection and full_refresh_selection are empty, the entire pipeline graph is refreshed.

An error is returned if:

  • full_refresh is true and full_refresh_selection is set.

  • One or more of the specified tables does not exist in the pipeline graph.

  • One or more of the specified tables is not resettable.

Response structure

Field Name

Type

Description

update_id

STRING

The unique identifier of the newly created update.

request_id

STRING

The unique identifier of the request that started the update.

Get the status of a pipeline update request

Endpoint: 2.0/pipelines/{pipeline_id}/requests/{request_id}
HTTP Method: GET

Gets the status and information for the pipeline update associated with request_id, where request_id is a unique identifier for the request initiating the pipeline update. If the update is retried or restarted, then the new update inherits the request_id.

Example

For the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5, this example returns status and information for the update associated with request ID a83d9f7c-d798-4fd5-aa39-301b6e6f4429:

Request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/requests/a83d9f7c-d798-4fd5-aa39-301b6e6f4429

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response

{
   "status": "TERMINATED",
   "latest_update":{
     "pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
     "update_id": "90da8183-89de-4715-b5a9-c243e67f0093",
     "config":{
       "id": "aae89b88-e97e-40c4-8e1a-1b7ac76657e8",
       "name": "Retail sales (SQL)",
       "storage": "/Users/username/data",
       "configuration":{
         "pipelines.numStreamRetryAttempts": "5"
       },
       "clusters":[
         {
           "label": "default",
           "autoscale":{
             "min_workers": 1,
             "max_workers": 5,
             "mode": "ENHANCED"
           }
         }
       ],
       "libraries":[
         {
           "notebook":{
             "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
           }
         }
       ],
       "continuous": false,
       "development": true,
       "photon": true,
       "edition": "advanced",
       "channel": "CURRENT"
     },
     "cause": "API_CALL",
     "state": "COMPLETED",
     "cluster_id": "1234-567891-abcde123",
     "creation_time": 1664304117145,
     "full_refresh": false,
     "request_id": "a83d9f7c-d798-4fd5-aa39-301b6e6f4429"
   }
}

Response structure

Field Name

Type

Description

status

STRING

The status of the pipeline update request. One of

  • ACTIVE: An update for this request is actively running or may be retried in a new update.

  • TERMINATED: The request is terminated and will not be retried or restarted.

pipeline_id

STRING

The unique identifier of the pipeline.

update_id

STRING

The unique identifier of the update.

config

PipelineSettings

The pipeline settings.

cause

STRING

The trigger for the update. One of API_CALL, RETRY_ON_FAILURE, SERVICE_UPGRADE, SCHEMA_CHANGE, JOB_TASK, or USER_ACTION.

state

STRING

The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.

cluster_id

STRING

The identifier of the cluster running the update.

creation_time

INT64

The timestamp when the update was created.

full_refresh

BOOLEAN

Whether this update resets all tables before running.

refresh_selection

An array of STRING

A list of tables to update without full refresh.

full_refresh_selection

An array of STRING

A list of tables to update with full refresh.

request_id

STRING

The unique identifier of the request that started the update. This is the value returned by the update request. If the update is retried or restarted, then the new update inherits the request_id. However, the update_id will be different.
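
Because the request_id is stable across retries and restarts, you can poll this endpoint until the request reaches the TERMINATED status. A minimal sketch, assuming the jq JSON processor is installed and reusing the pipeline ID from the examples above:

# Hypothetical sketch: start an update, then poll the request status every
# 30 seconds until it is no longer ACTIVE.
PIPELINE_ID="a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"

REQUEST_ID=$(curl --netrc -s -X POST \
https://<databricks-instance>/api/2.0/pipelines/${PIPELINE_ID}/updates \
--data '{ "full_refresh": false }' | jq -r '.request_id')

STATUS="ACTIVE"
while [ "${STATUS}" = "ACTIVE" ]; do
  sleep 30
  STATUS=$(curl --netrc -s -X GET \
    https://<databricks-instance>/api/2.0/pipelines/${PIPELINE_ID}/requests/${REQUEST_ID} | jq -r '.status')
  echo "Request ${REQUEST_ID}: ${STATUS}"
done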

Stop any active pipeline update

Endpoint: 2.0/pipelines/{pipeline_id}/stop
HTTP Method: POST

Stops any active pipeline update. If no update is running, this request is a no-op.

For a continuous pipeline, the pipeline execution is paused. Tables currently processing finish refreshing, but downstream tables are not refreshed. On the next pipeline update, Delta Live Tables performs a selected refresh of tables that did not complete processing, and resumes processing of the remaining pipeline DAG.

For a triggered pipeline, the pipeline execution is stopped. Tables currently processing finish refreshing, but downstream tables are not refreshed. On the next pipeline update, Delta Live Tables refreshes all tables.

Example

This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request

curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/stop

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

List pipeline events

Endpoint: 2.0/pipelines/{pipeline_id}/events
HTTP Method: GET

Retrieves events for a pipeline.

Example

This example retrieves a maximum of 5 events for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5.

Request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events?max_results=5

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Request structure

Field Name

Type

Description

page_token

STRING

Page token returned by previous call. This field is mutually exclusive with all fields in this request except max_results. An error is returned if any fields other than max_results are set when this field is set.

This field is optional.

max_results

INT32

The maximum number of entries to return in a single page. The system may return fewer than max_results events in a response, even if there are more events available.

This field is optional.

The default value is 25.

The maximum value is 100. An error is returned if the value of max_results is greater than 100.

order_by

STRING

A string indicating a sort order by timestamp for the results, for example, ["timestamp asc"].

The sort order can be ascending or descending. By default, events are returned in descending order by timestamp.

This field is optional.

filter

STRING

Criteria to select a subset of results, expressed using a SQL-like syntax. The supported filters are:

  • level='INFO' (or WARN or ERROR)

  • level in ('INFO', 'WARN')

  • id='[event-id]'

  • timestamp > 'TIMESTAMP' (or >=, <, <=, =)

Composite expressions are supported, for example: level in ('ERROR', 'WARN') AND timestamp > '2021-07-22T06:37:33.083Z'

This field is optional.
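
For example, to retrieve only ERROR-level events created after a given timestamp, you can pass the filter as a URL-encoded query parameter, in the same way as max_results in the example above. A sketch, assuming curl's -G and --data-urlencode options handle the encoding:

# Hypothetical sketch: retrieve only ERROR-level events newer than a timestamp.
# curl -G appends --data-urlencode values to the URL as query parameters.
curl --netrc -s -G \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data-urlencode "max_results=25" \
--data-urlencode "filter=level='ERROR' AND timestamp > '2021-07-22T06:37:33.083Z'"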

Response structure

Field Name

Type

Description

events

An array of pipeline events.

The list of events matching the request criteria.

next_page_token

STRING

If present, a token to fetch the next page of events.

prev_page_token

STRING

If present, a token to fetch the previous page of events.
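
When a response includes next_page_token, you can pass it back as page_token to retrieve the next page. A minimal sketch that walks the full event history, assuming the jq JSON processor is installed:

# Hypothetical sketch: page through the event log 25 entries at a time by
# passing next_page_token back as page_token.
PIPELINE_ID="a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
PAGE_TOKEN=""

while :; do
  if [ -z "${PAGE_TOKEN}" ]; then
    RESPONSE=$(curl --netrc -s -G \
      https://<databricks-instance>/api/2.0/pipelines/${PIPELINE_ID}/events \
      --data-urlencode "max_results=25")
  else
    RESPONSE=$(curl --netrc -s -G \
      https://<databricks-instance>/api/2.0/pipelines/${PIPELINE_ID}/events \
      --data-urlencode "max_results=25" \
      --data-urlencode "page_token=${PAGE_TOKEN}")
  fi
  echo "${RESPONSE}" | jq '.events'
  PAGE_TOKEN=$(echo "${RESPONSE}" | jq -r '.next_page_token // empty')
  [ -z "${PAGE_TOKEN}" ] && break
done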

Get pipeline details

Endpoint: 2.0/pipelines/{pipeline_id}
HTTP Method: GET

Gets details about a pipeline, including the pipeline settings and recent updates.

Example

This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response

{
  "pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
  "spec": {
    "id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
    "name": "Wikipedia pipeline (SQL)",
    "storage": "/Users/username/data",
    "clusters": [
      {
        "label": "default",
        "autoscale": {
          "min_workers": 1,
          "max_workers": 5,
          "mode": "ENHANCED"
        }
      }
    ],
    "libraries": [
      {
        "notebook": {
          "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
        }
      }
    ],
    "target": "wikipedia_quickstart_data",
    "continuous": false
  },
  "state": "IDLE",
  "cluster_id": "1234-567891-abcde123",
  "name": "Wikipedia pipeline (SQL)",
  "creator_user_name": "username",
  "latest_updates": [
    {
      "update_id": "8a0b6d02-fbd0-11eb-9a03-0242ac130003",
      "state": "COMPLETED",
      "creation_time": "2021-08-13T00:37:30.279Z"
    },
    {
      "update_id": "a72c08ba-fbd0-11eb-9a03-0242ac130003",
      "state": "CANCELED",
      "creation_time": "2021-08-13T00:35:51.902Z"
    },
    {
      "update_id": "ac37d924-fbd0-11eb-9a03-0242ac130003",
      "state": "FAILED",
      "creation_time": "2021-08-13T00:33:38.565Z"
    }
  ],
  "run_as_user_name": "username"
}

Response structure

Field Name

Type

Description

pipeline_id

STRING

The unique identifier of the pipeline.

spec

PipelineSettings

The pipeline settings.

state

STRING

The state of the pipeline. One of IDLE or RUNNING.

If state = RUNNING, then there is at least one active update.

cluster_id

STRING

The identifier of the cluster running the pipeline.

name

STRING

The user-friendly name for this pipeline.

creator_user_name

STRING

The username of the pipeline creator.

latest_updates

An array of UpdateStateInfo

Status of the most recent updates for the pipeline, ordered with the newest update first.

run_as_user_name

STRING

The username that the pipeline runs as.

Get update details

Endpoint: 2.0/pipelines/{pipeline_id}/updates/{update_id}
HTTP Method: GET

Gets details for a pipeline update.

Example

This example gets details for update 9a84f906-fc51-11eb-9a03-0242ac130003 for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5:

Request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-11eb-9a03-0242ac130003

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response

{
  "update": {
    "pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
    "update_id": "9a84f906-fc51-11eb-9a03-0242ac130003",
    "config": {
      "id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
      "name": "Wikipedia pipeline (SQL)",
      "storage": "/Users/username/data",
      "configuration": {
        "pipelines.numStreamRetryAttempts": "5"
      },
      "clusters": [
        {
          "label": "default",
          "autoscale": {
            "min_workers": 1,
            "max_workers": 5,
            "mode": "ENHANCED"
          }
        }
      ],
      "libraries": [
        {
          "notebook": {
            "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
          }
        }
      ],
      "target": "wikipedia_quickstart_data",
      "continuous": false,
      "development": false
    },
    "cause": "API_CALL",
    "state": "COMPLETED",
    "creation_time": 1628815050279,
    "full_refresh": true,
    "request_id": "a83d9f7c-d798-4fd5-aa39-301b6e6f4429"
  }
}

Response structure

Field Name

Type

Description

pipeline_id

STRING

The unique identifier of the pipeline.

update_id

STRING

The unique identifier of this update.

config

PipelineSettings

The pipeline settings.

cause

STRING

The trigger for the update. One of API_CALL, RETRY_ON_FAILURE, SERVICE_UPGRADE.

state

STRING

The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.

cluster_id

STRING

The identifier of the cluster running the pipeline.

creation_time

INT64

The timestamp when the update was created.

full_refresh

BOOLEAN

Whether this was a full refresh. If true, all pipeline tables were reset before running the update.

List pipelines

Endpoint: 2.0/pipelines/
HTTP Method: GET

Lists pipelines defined in the Delta Live Tables system.

Example

This example retrieves details for pipelines where the name contains quickstart:

Request

curl --netrc -X GET \
https://<databricks-instance>/api/2.0/pipelines?filter=name%20LIKE%20%27%25quickstart%25%27

Replace:

  • <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response

{
  "statuses": [
    {
      "pipeline_id": "e0f01758-fc61-11eb-9a03-0242ac130003",
      "state": "IDLE",
      "name": "DLT quickstart (Python)",
      "latest_updates": [
        {
          "update_id": "ee9ae73e-fc61-11eb-9a03-0242ac130003",
          "state": "COMPLETED",
          "creation_time": "2021-08-13T00:34:21.871Z"
        }
      ],
      "creator_user_name": "username"
    },
    {
      "pipeline_id": "f4c82f5e-fc61-11eb-9a03-0242ac130003",
      "state": "IDLE",
      "name": "My DLT quickstart example",
      "creator_user_name": "username"
    }
  ],
  "next_page_token": "eyJ...==",
  "prev_page_token": "eyJ..x9"
}

Request structure

Field Name

Type

Description

page_token

STRING

Page token returned by previous call.

This field is optional.

max_results

INT32

The maximum number of entries to return in a single page. The system may return fewer than max_results entries in a response, even if more entries are available.

This field is optional.

The default value is 25.

The maximum value is 100. An error is returned if the value of max_results is greater than 100.

order_by

An array of STRING

A list of strings specifying the order of results, for example, ["name asc"]. Supported order_by fields are id and name. The default is id asc.

This field is optional.

filter

STRING

Select a subset of results based on the specified criteria.

The supported filters are:

"notebook='<path>'" to select pipelines that reference the provided notebook path.

name LIKE '[pattern]' to select pipelines with a name that matches pattern. Wildcards are supported, for example: name LIKE '%shopping%'

Composite filters are not supported.

This field is optional.
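
For example, to list the pipelines that reference a specific notebook, you can URL-encode the notebook filter described above. A sketch, assuming the same query-parameter style as the name LIKE example:

# Hypothetical sketch: list pipelines that reference a specific notebook path.
curl --netrc -s -G \
https://<databricks-instance>/api/2.0/pipelines \
--data-urlencode "filter=notebook='/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)'"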

Response structure

Field Name

Type

Description

statuses

An array of PipelineStateInfo

The list of pipelines matching the request criteria.

next_page_token

STRING

If present, a token to fetch the next page of results.

prev_page_token

STRING

If present, a token to fetch the previous page of results.

Data structures

AwsAttributes

Attributes set during cluster creation related to Amazon Web Services.

Field Name

Type

Description

first_on_demand

INT32

The first first_on_demand nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, first_on_demand nodes will be placed on on-demand instances and the remainder will be placed on instances of the type specified by availability. This value does not affect cluster size and cannot be mutated over the lifetime of a cluster.

availability

AwsAvailability

Availability type used for all subsequent nodes past the first_on_demand ones. Note: If first_on_demand is zero, this availability type will be used for the entire cluster.

zone_id

STRING

Identifier for the availability zone (AZ) in which the cluster resides. By default, the setting has a value of auto, otherwise known as Auto-AZ. With Auto-AZ, Databricks selects the AZ based on available IPs in the workspace subnets and retries in other availability zones if AWS returns insufficient capacity errors.

If you want, you can also specify an availability zone to use. This benefits accounts that have reserved instances in a specific AZ. Specify the AZ as a string (for example, "us-west-2a"). The provided availability zone must be in the same region as the Databricks deployment. For example, “us-west-2a” is not a valid zone ID if the Databricks deployment resides in the “us-east-1” region.

The list of available zones as well as the default value can be found by using the GET /api/2.0/clusters/list-zones call.

instance_profile_arn

STRING

Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an instance profile. The instance profile must have previously been added to the Databricks environment by an account administrator.

This feature may only be available to certain customer plans.

spot_bid_price_percent

INT32

The max price for AWS spot instances, as a percentage of the corresponding instance type’s on-demand price. For example, if this field is set to 50, and the cluster needs a new i3.xlarge spot instance, then the max price is half of the price of on-demand i3.xlarge instances. Similarly, if this field is set to 200, the max price is twice the price of on-demand i3.xlarge instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered. For safety, we enforce this field to be no more than 10000.

ebs_volume_type

EbsVolumeType

The type of EBS volumes that will be launched with this cluster.

ebs_volume_count

INT32

The number of volumes launched for each instance. You can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail.

These EBS volumes will be mounted at /ebs0, /ebs1, and so on. Instance store volumes will be mounted at /local_disk0, /local_disk1, and so on.

If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes.

If EBS volumes are specified, then the Spark configuration spark.local.dir will be overridden.

ebs_volume_size

INT32

The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096. Custom EBS volumes cannot be specified for the legacy node types (memory-optimized and compute-optimized).

ebs_volume_iops

INT32

The number of IOPS per EBS gp3 volume.

This value must be between 3000 and 16000.

The value of IOPS and throughput is calculated based on AWS documentation to match the maximum performance of a gp2 volume with the same volume size.

For more information, see the EBS volume limit calculator.

ebs_volume_throughput

INT32

The throughput per EBS gp3 volume, in MiB per second.

This value must be between 125 and 1000.

If neither ebs_volume_iops nor ebs_volume_throughput is specified, the values are inferred from the disk size:

  • Disk size greater than 1000 GiB: IOPS is 3 times the disk size, up to 16000; throughput is 250 MiB per second.

  • Disk size between 170 and 1000 GiB: IOPS is 3000; throughput is 250 MiB per second.

  • Disk size below 170 GiB: IOPS is 3000; throughput is 125 MiB per second.

AwsAvailability

The set of AWS availability types supported when setting up nodes for a cluster.

Type

Description

SPOT

Use spot instances.

ON_DEMAND

Use on-demand instances.

SPOT_WITH_FALLBACK

Preferably use spot instances, but fall back to on-demand instances if spot instances cannot be acquired (for example, if AWS spot prices are too high).

ClusterLogConf

Path to cluster log.

Field Name

Type

Description

dbfs OR s3

DbfsStorageInfo OR S3StorageInfo

DBFS location of cluster log. Destination must be provided. For example, { "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }

S3 location of cluster log. destination and either region or warehouse must be provided. For example, { "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }

DbfsStorageInfo

DBFS storage information.

Field Name

Type

Description

destination

STRING

DBFS destination. Example: dbfs:/my/path

EbsVolumeType

Databricks supports gp2 and gp3 EBS volume types. Follow the instructions at Manage SSD storage to select gp2 or gp3 for your workspace.

Type

Description

GENERAL_PURPOSE_SSD

Provision extra storage using AWS EBS volumes.

THROUGHPUT_OPTIMIZED_HDD

Provision extra storage using AWS st1 volumes.

FileStorageInfo

File storage information.

Note

This location type is only available for clusters set up using Databricks Container Services.

Field Name

Type

Description

destination

STRING

File destination. Example: file:/my/file.sh

InitScriptInfo

Path to an init script.

For instructions on using init scripts with Databricks Container Services, see Use an init script.

Note

The file storage type (field name: file) is only available for clusters set up using Databricks Container Services. See FileStorageInfo.

Field Name

Type

Description

workspace OR dbfs (deprecated) OR s3

WorkspaceStorageInfo, DbfsStorageInfo (deprecated), or S3StorageInfo

Workspace location of init script. Destination must be provided. For example, { "workspace" : { "destination" : "/Users/someone@domain.com/init_script.sh" } }

(Deprecated) DBFS location of init script. Destination must be provided. For example, { "dbfs" : { "destination" : "dbfs:/home/init_script" } }

S3 location of init script. Destination and either region or warehouse must be provided. For example, { "s3": { "destination" : "s3://init_script_bucket/prefix", "region" : "us-west-2" } }

KeyValue

A key-value pair that specifies configuration parameters.

Field Name

Type

Description

key

STRING

The configuration property name.

value

STRING

The configuration property value.

NotebookLibrary

A specification for a notebook containing pipeline code.

Field Name

Type

Description

path

STRING

The absolute path to the notebook.

This field is required.

PipelinesAutoScale

Attributes defining an autoscaling cluster.

Field Name

Type

Description

min_workers

INT32

The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation.

max_workers

INT32

The maximum number of workers to which the cluster can scale up when overloaded. max_workers must be strictly greater than min_workers.

mode

STRING

The autoscaling mode for the cluster: ENHANCED to use enhanced autoscaling, or LEGACY to use the standard cluster autoscaling functionality.

PipelineLibrary

A specification for pipeline dependencies.

Field Name

Type

Description

notebook

NotebookLibrary

The path to a notebook defining Delta Live Tables datasets. The path must be in the Databricks workspace, for example: { "notebook" : { "path" : "/my-pipeline-notebook-path" } }.

PipelinesNewCluster

A pipeline cluster specification.

The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:

  • spark_version

Field Name

Type

Description

label

STRING

A label for the cluster specification, either default to configure the default cluster, or maintenance to configure the maintenance cluster.

This field is optional. The default value is default.

spark_conf

KeyValue

An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively.

Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}

aws_attributes

AwsAttributes

Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used.

node_type_id

STRING

This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the GET 2.0/clusters/list-node-types call.

driver_node_type_id

STRING

The node type of the Spark driver. This field is optional; if unset, the driver node type will be set as the same value as node_type_id defined above.

ssh_public_keys

An array of STRING

SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name ubuntu on port 2200. Up to 10 keys can be specified.

custom_tags

KeyValue

An object containing a set of tags for cluster resources. Databricks tags all cluster resources with these tags in addition to default_tags.

Note:

  • Tags are not supported on legacy node types such as compute-optimized and memory-optimized

  • Databricks allows at most 45 custom tags.

cluster_log_conf

ClusterLogConf

The configuration for delivering Spark logs to a long-term storage destination. Only one destination can be specified for one cluster. If this configuration is provided, the logs will be delivered to the destination every 5 minutes. The destination of driver logs is <destination>/<cluster-ID>/driver, while the destination of executor logs is <destination>/<cluster-ID>/executor.

spark_env_vars

KeyValue

An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (that is, export X='Y') while launching the driver and workers.

To specify an additional set of SPARK_DAEMON_JAVA_OPTS, Databricks recommends appending them to $SPARK_DAEMON_JAVA_OPTS as shown in the following example. This ensures that all default Databricks-managed environment variables are included as well.

Example Spark environment variables: {"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"} or {"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}

init_scripts

An array of InitScriptInfo

The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If cluster_log_conf is specified, init script logs are sent to <destination>/<cluster-ID>/init_scripts.

instance_pool_id

STRING

The optional ID of the instance pool to which the cluster belongs. See Pool configuration reference.

driver_instance_pool_id

STRING

The optional ID of the instance pool to use for the driver node. You must also specify instance_pool_id. See Instance Pools API.

policy_id

STRING

A cluster policy ID.

num_workers OR autoscale

INT32 OR PipelinesAutoScale

If num_workers, number of worker nodes that this cluster should have. A cluster has one Spark driver and num_workers executors for a total of num_workers + 1 Spark nodes.

When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field is updated to reflect the target size of 10 workers, whereas the workers listed in executors gradually increase from 5 to 10 as the new nodes are provisioned.

If autoscale, parameters needed to automatically scale clusters up and down based on load.

This field is optional.

apply_policy_default_values

BOOLEAN

Whether to use policy default values for missing cluster attributes.
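
The following sketch shows how several of these cluster fields fit together in a create-pipeline request. The node type, AWS attributes, and Spark configuration values are illustrative only:

# Hypothetical sketch: create a pipeline whose default cluster pins the node type,
# AWS attributes, Spark configuration, and autoscaling. Values are illustrative only.
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines \
--data '{
  "name": "Wikipedia pipeline (SQL)",
  "storage": "/Users/username/data",
  "clusters": [
    {
      "label": "default",
      "node_type_id": "i3.xlarge",
      "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto"
      },
      "spark_conf": {
        "spark.speculation": "true"
      },
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    }
  ],
  "libraries": [
    {
      "notebook": {
        "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
      }
    }
  ],
  "continuous": false
}'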

PipelineSettings

The settings for a pipeline deployment.

Field Name

Type

Description

id

STRING

The unique identifier for this pipeline.

The identifier is created by the Delta Live Tables system, and must not be provided when creating a pipeline.

name

STRING

A user-friendly name for this pipeline.

This field is optional.

By default, the pipeline name must be unique. To use a duplicate name, set allow_duplicate_names to true in the pipeline configuration.

storage

STRING

A path to a DBFS directory for storing checkpoints and tables created by the pipeline.

This field is optional.

The system uses a default location if this field is empty.

configuration

A map of STRING:STRING

A list of key-value pairs to add to the Spark configuration of the cluster that will run the pipeline.

This field is optional.

Elements must be formatted as key:value pairs.

clusters

An array of PipelinesNewCluster

An array of specifications for the clusters to run the pipeline.

This field is optional.

If this is not specified, the system will select a default cluster configuration for the pipeline.

libraries

An array of PipelineLibrary

The notebooks containing the pipeline code and any dependencies required to run the pipeline.

target

STRING

A database name for persisting pipeline output data.

See Publish data from Delta Live Tables pipelines to the Hive metastore for more information.

continuous

BOOLEAN

Whether this is a continuous pipeline.

This field is optional.

The default value is false.

development

BOOLEAN

Whether to run the pipeline in development mode.

This field is optional.

The default value is false.

photon

BOOLEAN

Whether Photon acceleration is enabled for this pipeline.

This field is optional.

The default value is false.

channel

STRING

The Delta Live Tables release channel specifying the runtime version to use for this pipeline. Supported values are:

  • preview to test the pipeline with upcoming changes to the Delta Live Tables runtime.

  • current to use the current Delta Live Tables runtime version.

This field is optional.

The default value is current.

edition

STRING

The Delta Live Tables product edition to run the pipeline:

  • CORE supports streaming ingest workloads.

  • PRO also supports streaming ingest workloads and adds support for change data capture (CDC) processing.

  • ADVANCED supports all the features of the PRO edition and adds support for workloads that require Delta Live Tables expectations to enforce data quality constraints.

This field is optional.

The default value is advanced.
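
The following sketch shows a create-pipeline request that sets several of the optional fields described above. The values are illustrative, and the channel and edition values assume the forms listed above are accepted as-is:

# Hypothetical sketch: create a development-mode pipeline on the preview channel
# with the core edition and Photon disabled. Values are illustrative only.
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines \
--data '{
  "name": "Wikipedia pipeline (SQL)",
  "storage": "/Users/username/data",
  "configuration": {
    "pipelines.numStreamRetryAttempts": "5"
  },
  "libraries": [
    {
      "notebook": {
        "path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
      }
    }
  ],
  "target": "wikipedia_quickstart_data",
  "continuous": false,
  "development": true,
  "photon": false,
  "channel": "preview",
  "edition": "CORE"
}'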

PipelineStateInfo

The state of a pipeline, the status of the most recent updates, and information about associated resources.

Field Name

Type

Description

state

STRING

The state of the pipeline. One of IDLE or RUNNING.

pipeline_id

STRING

The unique identifier of the pipeline.

cluster_id

STRING

The unique identifier of the cluster running the pipeline.

name

STRING

The user-friendly name of the pipeline.

latest_updates

An array of UpdateStateInfo

Status of the most recent updates for the pipeline, ordered with the newest update first.

creator_user_name

STRING

The username of the pipeline creator.

run_as_user_name

STRING

The username that the pipeline runs as. This is a read-only value derived from the pipeline owner.

S3StorageInfo

S3 storage information.

Field Name

Type

Description

destination

STRING

S3 destination. For example: s3://my-bucket/some-prefix. You must configure the cluster with an instance profile, and the instance profile must have write access to the destination. You cannot use AWS keys.

region

STRING

S3 region. For example: us-west-2. Either region or warehouse must be set. If both are set, warehouse is used.

warehouse

STRING

S3 warehouse. For example: https://s3-us-west-2.amazonaws.com. Either region or warehouse must be set. If both are set, warehouse is used.

enable_encryption

BOOL

(Optional) Whether to enable server-side encryption. The default value is false.

encryption_type

STRING

(Optional) The encryption type, either sse-s3 or sse-kms. Used only when encryption is enabled; the default type is sse-s3.

kms_key

STRING

(Optional) KMS key used if encryption is enabled and encryption type is set to sse-kms.

canned_acl

STRING

(Optional) Set a canned access control list, for example: bucket-owner-full-control. If canned_acl is set, the cluster instance profile must have s3:PutObjectAcl permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. By default, only the object owner gets full control. If you are using a cross-account role to write data, you may want to set bucket-owner-full-control so that the bucket owner can read the logs.

UpdateStateInfo

The current state of a pipeline update.

Field Name

Type

Description

update_id

STRING

The unique identifier for this update.

state

STRING

The state of the update. One of QUEUED, CREATED, WAITING_FOR_RESOURCES, INITIALIZING, RESETTING, SETTING_UP_TABLES, RUNNING, STOPPING, COMPLETED, FAILED, or CANCELED.

creation_time

STRING

Timestamp when this update was created.

WorkspaceStorageInfo

Workspace storage information.

Field Name

Type

Description

destination

STRING

File destination. Example: /Users/someone@domain.com/init_script.sh