Statement Execution API: Run SQL on warehouses

Important

To access Databricks REST APIs, you must authenticate.

This tutorial shows you how to use the Databricks SQL Statement Execution API 2.0 to run SQL statements from Databricks SQL warehouses.

To view the Databricks SQL Statement Execution API 2.0 reference, see Statement Execution.

Before you begin

Before you start this tutorial, be sure that you have:

  • Either Databricks CLI version 0.205 or above or curl, as follows:

    • The Databricks CLI is a command-line tool for sending and receiving Databricks REST API requests and responses. If you use Databricks CLI version 0.205 or above, it must be configured for authenticating with your Databricks workspace. See Install or update the Databricks CLI and Authentication for the Databricks CLI.

      For example, to authenticate with Databricks personal access token authentication, follow the steps at Databricks personal access tokens for workspace users.

      And then to use the Databricks CLI to create a Databricks configuration profile for your personal access token, do the following:

      Note

      The following procedure uses the Databricks CLI to create a Databricks configuration profile with the name DEFAULT. If you already have a DEFAULT configuration profile, this procedure overwrites your existing DEFAULT configuration profile.

      To check whether you already have a DEFAULT configuration profile, and to view this profile’s settings if it exists, use the Databricks CLI to run the command databricks auth env --profile DEFAULT.

      To create a configuration profile with a name other than DEFAULT, replace the DEFAULT part of --profile DEFAULT in the following databricks configure command with a different name for the configuration profile.

      1. Use the Databricks CLI to create a Databricks configuration profile named DEFAULT that uses Databricks personal access token authentication. To do this, run the following command:

        databricks configure --profile DEFAULT
        
      2. For the prompt Databricks Host, enter your Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

      3. For the prompt Personal Access Token, enter the Databricks personal access token for your workspace.

      In this tutorial’s Databricks CLI examples, note the following:

      • This tutorial assumes that you have an environment variable DATABRICKS_SQL_WAREHOUSE_ID on your local development machine. This environment variable represents the ID of your Databricks SQL warehouse. This ID is the string of letters and numbers following /sql/1.0/warehouses/ in the HTTP path field for your warehouse. To learn how to get your warehouse’s HTTP path value, see Get connection details for a Databricks compute resource.

      • If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, replace \ with ^, and replace ${...} with %...%.

      • If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, in JSON document declarations, replace the opening and closing ' with ", and replace inner " with \".

    • curl is a command-line tool for sending and receiving REST API requests and responses. See also Install curl. Or, adapt this tutorial’s curl examples for use with similar tools such as HTTPie.

      In this tutorial’s curl examples, note the following:

      • Instead of --header "Authorization: Bearer ${DATABRICKS_TOKEN}", you can use a .netrc file. If you use a .netrc file, replace --header "Authorization: Bearer ${DATABRICKS_TOKEN}" with --netrc.

      • If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, replace \ with ^, and replace ${...} with %...%.

      • If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, in JSON document declarations, replace the opening and closing ' with ", and replace inner " with \".

      Also, for this tutorial’s curl examples, this tutorial assumes you have the following environment variables on your local development machine:

      • DATABRICKS_HOST, representing the workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com, for your Databricks workspace.

      • DATABRICKS_TOKEN, representing a Databricks personal access token for your Databricks workspace user.

      • DATABRICKS_SQL_WAREHOUSE_ID, representing the ID of your Databricks SQL warehouse. This ID is the string of letters and numbers following /sql/1.0/warehouses/ in the HTTP path field for your warehouse. To learn how to get your warehouse’s HTTP path value, see Get connection details for a Databricks compute resource.

      Note

      As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens.

      If you use personal access token authentication, Databricks recommends using personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

      To create a Databricks personal access token, follow the staps in Databricks personal access tokens for workspace users.

      Warning

      Databricks strongly discourages hard-coding information into your scripts, as this sensitive information can be exposed in plain text through version control systems. Databricks recommends that you use approaches such as environment variables that you set on your development machine instead. Removing such hard-coded information from your scripts helps to make those scripts more portable as well.

  • This tutorial assumes that you also have jq, a command-line processor for querying JSON response payloads, which the Databricks SQL Statement Execution API returns to you after each call that you make to the Databricks SQL Statement Execution API. See Download jq.

  • You must have at least one table that you can execute SQL statements against. This tutorial is based on the lineitem table in the tpch schema (also known as a database) within the samples catalog. If you do not have access to this catalog, schema, or table from your workspace, substitute them throughout this tutorial with your own.

Step 1: Execute a SQL statement and save the data result as JSON

Run the following command, which does the following:

  1. Uses the specified SQL warehouse, along with the specified token if you are using curl, to query for three columns from the first two rows of the lineitem table in the tcph schema within the samples catalog.

  2. Saves the response payload in JSON format in a file named sql-execution-response.json within the current working directory.

  3. Prints the contents of the sql-execution-response.json file.

  4. Sets a local environment variable named SQL_STATEMENT_ID. This variable contains the ID of the corresponding SQL statement. You can use this SQL statement ID for getting information about that statement later as needed, which is demonstrated in Step 2. You can also view this SQL statement and get its statement ID from the query history section of the Databricks SQL console, or by calling the Query History API.

  5. Sets an additional local environment variable named NEXT_CHUNK_EXTERNAL_LINK that contains an API URL fragment for getting the next chunk of JSON data. If the response data is too large, the Databricks SQL Statement Execution API provides the response in chunks. You can use this API URL fragment for getting the next chunk of data, which is demonstrated in Step 2. If there is no next chunk, then this environment variable is set to null.

  6. Prints the values of the SQL_STATEMENT_ID and NEXT_CHUNK_INTERNAL_LINK environment variables.

databricks api post /api/2.0/sql/statements \
--profile <profile-name> \
--json '{
  "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
  "catalog": "samples",
  "schema": "tpch",
  "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem WHERE l_extendedprice > :extended_price AND l_shipdate > :ship_date LIMIT :row_limit",
  "parameters": [
    { "name": "extended_price", "value": "60000", "type": "DECIMAL(18,2)" },
    { "name": "ship_date", "value": "1995-01-01", "type": "DATE" },
    { "name": "row_limit", "value": "2", "type": "INT" }
  ]
}' \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

Replace <profile-name> with the name of your Databricks configuration profile for authentication.

curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
  "catalog": "samples",
  "schema": "tpch",
  "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem WHERE l_extendedprice > :extended_price AND l_shipdate > :ship_date LIMIT :row_limit",
  "parameters": [
    { "name": "extended_price", "value": "60000", "type": "DECIMAL(18,2)" },
    { "name": "ship_date", "value": "1995-01-01", "type": "DATE" },
    { "name": "row_limit", "value": "2", "type": "INT" }
  ]
}' \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

In the preceding request:

  • Parameterized queries consist of each query parameter’s name preceded by a colon (for example, :extended_price) with a matching name and value object in the parameters array. An optional type can also be specified, with the default value of STRING if not specified.

    Warning

    Databricks strongly recommends that you use parameters as a best practice for your SQL statements.

    If you use the Databricks SQL Statement Execution API with an application that generates SQL dynamically, this can result in SQL injection attacks. For example, if you generate SQL code based on a user’s selections in a user interface and do not take appropriate measures, an attacker could inject malicious SQL code to change the logic of your initial query, thereby reading, changing, or deleting sensitive data.

    Parameterized queries help protect against SQL injections attacks by handling input arguments separately from the rest of your SQL code and interpreting these arguments as literal values. Parameters also help with code reusability.

  • By default, any returned data is in JSON array format, and the default location for any of the SQL statement’s data results is within the response payload. To make this behavior explicit, add "format":"JSON_ARRAY","disposition":"INLINE" to the request payload. If you attempt to return data results larger than 25 MiB in the response payload, a failure status is returned and the SQL statement is canceled. For data results larger than 25 MiB, you can use external links instead of trying to return it in the response payload, which is demonstrated in Step 3.

  • The command stores the response payload’s contents to a local file. Local data storage is not supported by the Databricks SQL Statement Execution API directly.

  • By default, after 10 seconds, if the SQL statement has not yet finished executing through the warehouse, the Databricks SQL Statement Execution API returns only the SQL statement ID and its current status, instead of the statement’s result. To change this behavior, add "wait_timeout" to the request and set it to "<x>s", where <x> can be between 5 and 50 seconds inclusive, for example "50s". To return the SQL statement ID and its current status immediately, set wait_timeout to 0s.

  • By default, the SQL statement continues to run if the timeout period is reached. To cancel a SQL statement if the timeout period is reached instead, add "on_wait_timeout":"CANCEL" to the request payload.

  • To limit the number of bytes returned, add "byte_limit" to the request and set it to the number of bytes, for example 1000.

  • To limit the number of rows returned, instead of adding a LIMIT clause to statement, you can add "row_limit" to the request and set it to the number of rows, for example "statement":"SELECT * FROM lineitem","row_limit":2.

  • If the result is larger than the specified byte_limit or row_limit, the truncated field is set to true in the response payload.

If the statement’s result is available before the wait timeout ends, the response is as follows:

{
  "manifest": {
    "chunks": [
      {
        "chunk_index": 0,
        "row_count": 2,
        "row_offset": 0
      }
    ],
    "format": "JSON_ARRAY",
    "schema": {
      "column_count": 3,
      "columns": [
        {
          "name": "l_orderkey",
          "position": 0,
          "type_name": "LONG",
          "type_text": "BIGINT"
        },
        {
          "name": "l_extendedprice",
          "position": 1,
          "type_name": "DECIMAL",
          "type_precision": 18,
          "type_scale": 2,
          "type_text": "DECIMAL(18,2)"
        },
        {
          "name": "l_shipdate",
          "position": 2,
          "type_name": "DATE",
          "type_text": "DATE"
        }
      ]
    },
    "total_chunk_count": 1,
    "total_row_count": 2,
    "truncated": false
  },
  "result": {
    "chunk_index": 0,
    "data_array": [
      [
        "2",
        "71433.16",
        "1997-01-28"
      ],
      [
        "7",
        "86152.02",
        "1996-01-15"
      ]
    ],
    "row_count": 2,
    "row_offset": 0
  },
  "statement_id": "00000000-0000-0000-0000-000000000000",
  "status": {
    "state": "SUCCEEDED"
  }
}

If the wait timeout ends before the statement’s result is available, the response looks like this instead:

{
  "statement_id": "00000000-0000-0000-0000-000000000000",
  "status": {
    "state": "PENDING"
  }
}

If the statement’s result data is too large (for example in this case, by running SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 300000), the result data is chunked and looks like this instead. Note that "...": "..." indicates omitted results here for brevity:

{
  "manifest": {
    "chunks": [
      {
        "chunk_index": 0,
        "row_count": 188416,
        "row_offset": 0
      },
      {
        "chunk_index": 1,
        "row_count": 111584,
        "row_offset": 188416
      }
    ],
    "format":"JSON_ARRAY",
    "schema": {
      "column_count":3,
      "columns": [
        {
          "...": "..."
        }
      ]
    },
    "total_chunk_count": 2,
    "total_row_count": 300000,
    "truncated": false
  },
  "result": {
    "chunk_index": 0,
    "data_array": [
      [
        "2",
        "71433.16",
        "1997-01-28"
      ],
      [
        "..."
      ]
    ],
    "next_chunk_index": 1,
    "next_chunk_internal_link": "/api/2.0/sql/statements/00000000-0000-0000-0000-000000000000/result/chunks/1?row_offset=188416",
    "row_count": 188416,
    "row_offset": 0
  },
  "statement_id": "00000000-0000-0000-0000-000000000000",
  "status": {
    "state": "SUCCEEDED"
  }
}

Step 2: Get a statement’s current execution status and data result as JSON

You can use a SQL statement’s ID to get that statement’s current execution status and, if the execution succeeded, that statement’s result. If you forget the statement’s ID, you can get it from the query history section of the Databricks SQL console, or by calling the Query History API. For example, you could keep polling this command, checking each time to see if the execution has succeeded.

To get a SQL statement’s current execution status and, if the execution succeeded, that statement’s result and an API URL fragment for getting any next chunk of JSON data, run the following command. This command assumes that you have an environment variable on your local development machine named SQL_STATEMENT_ID, which is set to the value of the ID of the SQL statement from the previous step. Of course, you can substitute ${SQL_STATEMENT_ID} in the following command with the hard-coded ID of the SQL statement.

databricks api get /api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--profile <profile-name> \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

Replace <profile-name> with the name of your Databricks configuration profile for authentication.

curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

If the NEXT_CHUNK_INTERNAL_LINK is set to a non-null value, you can use it to get the next chunk of data, and so on, for example with the following command:

databricks api get /${NEXT_CHUNK_INTERNAL_LINK} \
--profile <profile-name> \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

Replace <profile-name> with the name of your Databricks configuration profile for authentication.

curl --request GET \
https://${DATABRICKS_HOST}${NEXT_CHUNK_INTERNAL_LINK} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

You can keep running the preceding command, over and over again, to get the next chunk, and so on. Note that as soon as the last chunk is fetched, the SQL statement is closed. After this closure, you cannot use that statement’s ID to get its current status or to fetch any more chunks.

Step 4: Cancel a SQL statement’s execution

If you need to cancel a SQL statement that has not yet succeeded, run the following command:

databricks api post /api/2.0/sql/statements/${SQL_STATEMENT_ID}/cancel \
--profile <profile-name> \
--json '{}'

Replace <profile-name> with the name of your Databricks configuration profile for authentication.

curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/cancel \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"

Security best practices

The Databricks SQL Statement Execution API increases the security of data transfers by using end-to-end transport layer security (TLS) encryption and short-lived credentials such as presigned URLs.

There are several layers in this security model. At the transport layer, it is only possible to call the Databricks SQL Statement Execution API by using TLS 1.2 or above. Also, callers of the Databricks SQL Statement Execution API must be authenticated with a valid Databricks personal access token that maps to a user who has the entitlement to use Databricks SQL. This user must have CAN USE access for the specific SQL warehouse that is being used, and access can be restricted with IP access lists. This applies to all requests to the Databricks SQL Statement Execution API. Furthermore, for executing statements, the authenticated user must have permission to the data objects (such as tables, views, and functions) that are used in each statement. This is enforced by existing access control mechanisms in Unity Catalog or by using table ACLs. (See Data governance with Unity Catalog for more details.) This also means that only the user who executes a statement can make fetch requests for the statement’s results.

Databricks recommends the following security best practices whenever you use the Databricks SQL Statement Execution API along with the EXTERNAL_LINKS disposition to retrieve large data sets:

  • Remove the Databricks authorization header for Amazon S3 requests

  • Protect presigned URLs

  • Configure network restrictions on storage accounts

  • Configure logging on storage accounts

The EXTERNAL_LINKS disposition can be disabled upon request by creating a support case. See Support.

Remove the Databricks authorization header for Amazon S3 requests

All calls to the Databricks SQL Statement Execution API that use curl must include an Authorization header that contains Databricks access credentials. Do not include this Authorization header whenever you download data from Amazon S3. This header is not required and might unintentionally expose your Databricks access credentials.

Protect presigned URLs

Whenever you use the EXTERNAL_LINKS disposition, a short-lived, presigned URL is generated, which the caller can use to download the results directly from Amazon S3 by using TLS. As a short-lived credential is embedded within this presigned URL, you should protect the URL.

Configure network restrictions on storage accounts

Whenever you use the EXTERNAL_LINKS disposition, the client will obtain presigned URLs to download query results directly from Amazon S3.

Databricks recommends that you use S3 bucket policies to restrict access to your S3 buckets to trusted IP addresses and VPCs. Check that you have the correct workspace root bucket storage configurations, and review the Databricks recommendations for customer-managed VPC.

Configure logging on storage accounts

In addition to enforcing network-level restrictions on the underlying storage account, you can monitor whether anyone tries to bypass these restrictions by setting up S3 server access logging or AWS CloudTrail data events as well as the appropriate monitoring and alerting around them.