Statement Execution API: Run SQL on warehouses
Important
To access Databricks REST APIs, you must authenticate.
This tutorial shows you how to use the Databricks SQL Statement Execution API 2.0 to run SQL statements from Databricks SQL warehouses.
To view the Databricks SQL Statement Execution API 2.0 reference, see Statement Execution.
Before you begin
Before you start this tutorial, be sure that you have:
Either Databricks CLI version 0.205 or above or
curl
, as follows:The Databricks CLI is a command-line tool for sending and receiving Databricks REST API requests and responses. If you use Databricks CLI version 0.205 or above, it must be configured for authenticating with your Databricks workspace. See Install or update the Databricks CLI and Authentication for the Databricks CLI.
For example, to authenticate with Databricks personal access token authentication, follow the steps at Databricks personal access tokens for workspace users.
And then to use the Databricks CLI to create a Databricks configuration profile for your personal access token, do the following:
Note
The following procedure uses the Databricks CLI to create a Databricks configuration profile with the name
DEFAULT
. If you already have aDEFAULT
configuration profile, this procedure overwrites your existingDEFAULT
configuration profile.To check whether you already have a
DEFAULT
configuration profile, and to view this profile’s settings if it exists, use the Databricks CLI to run the commanddatabricks auth env --profile DEFAULT
.To create a configuration profile with a name other than
DEFAULT
, replace theDEFAULT
part of--profile DEFAULT
in the followingdatabricks configure
command with a different name for the configuration profile.Use the Databricks CLI to create a Databricks configuration profile named
DEFAULT
that uses Databricks personal access token authentication. To do this, run the following command:databricks configure --profile DEFAULT
For the prompt Databricks Host, enter your Databricks workspace instance URL, for example
https://dbc-a1b2345c-d6e7.cloud.databricks.com
.For the prompt Personal Access Token, enter the Databricks personal access token for your workspace.
In this tutorial’s Databricks CLI examples, note the following:
This tutorial assumes that you have an environment variable
DATABRICKS_SQL_WAREHOUSE_ID
on your local development machine. This environment variable represents the ID of your Databricks SQL warehouse. This ID is the string of letters and numbers following/sql/1.0/warehouses/
in the HTTP path field for your warehouse. To learn how to get your warehouse’s HTTP path value, see Get connection details for a Databricks compute resource.If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, replace
\
with^
, and replace${...}
with%...%
.If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, in JSON document declarations, replace the opening and closing
'
with"
, and replace inner"
with\"
.
curl is a command-line tool for sending and receiving REST API requests and responses. See also Install curl. Or, adapt this tutorial’s
curl
examples for use with similar tools such as HTTPie.In this tutorial’s
curl
examples, note the following:Instead of
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
, you can use a .netrc file. If you use a.netrc
file, replace--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
with--netrc
.If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, replace
\
with^
, and replace${...}
with%...%
.If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, in JSON document declarations, replace the opening and closing
'
with"
, and replace inner"
with\"
.
Also, for this tutorial’s
curl
examples, this tutorial assumes you have the following environment variables on your local development machine:DATABRICKS_HOST
, representing the workspace instance name, for exampledbc-a1b2345c-d6e7.cloud.databricks.com
, for your Databricks workspace.DATABRICKS_TOKEN
, representing a Databricks personal access token for your Databricks workspace user.DATABRICKS_SQL_WAREHOUSE_ID
, representing the ID of your Databricks SQL warehouse. This ID is the string of letters and numbers following/sql/1.0/warehouses/
in the HTTP path field for your warehouse. To learn how to get your warehouse’s HTTP path value, see Get connection details for a Databricks compute resource.
Note
As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens.
If you use personal access token authentication, Databricks recommends using personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
To create a Databricks personal access token, follow the staps in Databricks personal access tokens for workspace users.
Warning
Databricks strongly discourages hard-coding information into your scripts, as this sensitive information can be exposed in plain text through version control systems. Databricks recommends that you use approaches such as environment variables that you set on your development machine instead. Removing such hard-coded information from your scripts helps to make those scripts more portable as well.
This tutorial assumes that you also have jq, a command-line processor for querying JSON response payloads, which the Databricks SQL Statement Execution API returns to you after each call that you make to the Databricks SQL Statement Execution API. See Download jq.
You must have at least one table that you can execute SQL statements against. This tutorial is based on the
lineitem
table in thetpch
schema (also known as a database) within thesamples
catalog. If you do not have access to this catalog, schema, or table from your workspace, substitute them throughout this tutorial with your own.
Step 1: Execute a SQL statement and save the data result as JSON
Run the following command, which does the following:
Uses the specified SQL warehouse, along with the specified token if you are using
curl
, to query for three columns from the first two rows of thelineitem
table in thetcph
schema within thesamples
catalog.Saves the response payload in JSON format in a file named
sql-execution-response.json
within the current working directory.Prints the contents of the
sql-execution-response.json
file.Sets a local environment variable named
SQL_STATEMENT_ID
. This variable contains the ID of the corresponding SQL statement. You can use this SQL statement ID for getting information about that statement later as needed, which is demonstrated in Step 2. You can also view this SQL statement and get its statement ID from the query history section of the Databricks SQL console, or by calling the Query History API.Sets an additional local environment variable named
NEXT_CHUNK_EXTERNAL_LINK
that contains an API URL fragment for getting the next chunk of JSON data. If the response data is too large, the Databricks SQL Statement Execution API provides the response in chunks. You can use this API URL fragment for getting the next chunk of data, which is demonstrated in Step 2. If there is no next chunk, then this environment variable is set tonull
.Prints the values of the
SQL_STATEMENT_ID
andNEXT_CHUNK_INTERNAL_LINK
environment variables.
databricks api post /api/2.0/sql/statements \
--profile <profile-name> \
--json '{
"warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
"catalog": "samples",
"schema": "tpch",
"statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem WHERE l_extendedprice > :extended_price AND l_shipdate > :ship_date LIMIT :row_limit",
"parameters": [
{ "name": "extended_price", "value": "60000", "type": "DECIMAL(18,2)" },
{ "name": "ship_date", "value": "1995-01-01", "type": "DATE" },
{ "name": "row_limit", "value": "2", "type": "INT" }
]
}' \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
"warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
"catalog": "samples",
"schema": "tpch",
"statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem WHERE l_extendedprice > :extended_price AND l_shipdate > :ship_date LIMIT :row_limit",
"parameters": [
{ "name": "extended_price", "value": "60000", "type": "DECIMAL(18,2)" },
{ "name": "ship_date", "value": "1995-01-01", "type": "DATE" },
{ "name": "row_limit", "value": "2", "type": "INT" }
]
}' \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
In the preceding request:
Parameterized queries consist of each query parameter’s name preceded by a colon (for example,
:extended_price
) with a matchingname
andvalue
object in theparameters
array. An optionaltype
can also be specified, with the default value ofSTRING
if not specified.Warning
Databricks strongly recommends that you use parameters as a best practice for your SQL statements.
If you use the Databricks SQL Statement Execution API with an application that generates SQL dynamically, this can result in SQL injection attacks. For example, if you generate SQL code based on a user’s selections in a user interface and do not take appropriate measures, an attacker could inject malicious SQL code to change the logic of your initial query, thereby reading, changing, or deleting sensitive data.
Parameterized queries help protect against SQL injections attacks by handling input arguments separately from the rest of your SQL code and interpreting these arguments as literal values. Parameters also help with code reusability.
By default, any returned data is in JSON array format, and the default location for any of the SQL statement’s data results is within the response payload. To make this behavior explicit, add
"format":"JSON_ARRAY","disposition":"INLINE"
to the request payload. If you attempt to return data results larger than 25 MiB in the response payload, a failure status is returned and the SQL statement is canceled. For data results larger than 25 MiB, you can use external links instead of trying to return it in the response payload, which is demonstrated in Step 3.The command stores the response payload’s contents to a local file. Local data storage is not supported by the Databricks SQL Statement Execution API directly.
By default, after 10 seconds, if the SQL statement has not yet finished executing through the warehouse, the Databricks SQL Statement Execution API returns only the SQL statement ID and its current status, instead of the statement’s result. To change this behavior, add
"wait_timeout"
to the request and set it to"<x>s"
, where<x>
can be between5
and50
seconds inclusive, for example"50s"
. To return the SQL statement ID and its current status immediately, setwait_timeout
to0s
.By default, the SQL statement continues to run if the timeout period is reached. To cancel a SQL statement if the timeout period is reached instead, add
"on_wait_timeout":"CANCEL"
to the request payload.To limit the number of bytes returned, add
"byte_limit"
to the request and set it to the number of bytes, for example1000
.To limit the number of rows returned, instead of adding a
LIMIT
clause tostatement
, you can add"row_limit"
to the request and set it to the number of rows, for example"statement":"SELECT * FROM lineitem","row_limit":2
.If the result is larger than the specified
byte_limit
orrow_limit
, thetruncated
field is set totrue
in the response payload.
If the statement’s result is available before the wait timeout ends, the response is as follows:
{
"manifest": {
"chunks": [
{
"chunk_index": 0,
"row_count": 2,
"row_offset": 0
}
],
"format": "JSON_ARRAY",
"schema": {
"column_count": 3,
"columns": [
{
"name": "l_orderkey",
"position": 0,
"type_name": "LONG",
"type_text": "BIGINT"
},
{
"name": "l_extendedprice",
"position": 1,
"type_name": "DECIMAL",
"type_precision": 18,
"type_scale": 2,
"type_text": "DECIMAL(18,2)"
},
{
"name": "l_shipdate",
"position": 2,
"type_name": "DATE",
"type_text": "DATE"
}
]
},
"total_chunk_count": 1,
"total_row_count": 2,
"truncated": false
},
"result": {
"chunk_index": 0,
"data_array": [
[
"2",
"71433.16",
"1997-01-28"
],
[
"7",
"86152.02",
"1996-01-15"
]
],
"row_count": 2,
"row_offset": 0
},
"statement_id": "00000000-0000-0000-0000-000000000000",
"status": {
"state": "SUCCEEDED"
}
}
If the wait timeout ends before the statement’s result is available, the response looks like this instead:
{
"statement_id": "00000000-0000-0000-0000-000000000000",
"status": {
"state": "PENDING"
}
}
If the statement’s result data is too large (for example in this case, by running SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 300000
), the result data is chunked and looks like this instead. Note that "...": "..."
indicates omitted results here for brevity:
{
"manifest": {
"chunks": [
{
"chunk_index": 0,
"row_count": 188416,
"row_offset": 0
},
{
"chunk_index": 1,
"row_count": 111584,
"row_offset": 188416
}
],
"format":"JSON_ARRAY",
"schema": {
"column_count":3,
"columns": [
{
"...": "..."
}
]
},
"total_chunk_count": 2,
"total_row_count": 300000,
"truncated": false
},
"result": {
"chunk_index": 0,
"data_array": [
[
"2",
"71433.16",
"1997-01-28"
],
[
"..."
]
],
"next_chunk_index": 1,
"next_chunk_internal_link": "/api/2.0/sql/statements/00000000-0000-0000-0000-000000000000/result/chunks/1?row_offset=188416",
"row_count": 188416,
"row_offset": 0
},
"statement_id": "00000000-0000-0000-0000-000000000000",
"status": {
"state": "SUCCEEDED"
}
}
Step 2: Get a statement’s current execution status and data result as JSON
You can use a SQL statement’s ID to get that statement’s current execution status and, if the execution succeeded, that statement’s result. If you forget the statement’s ID, you can get it from the query history section of the Databricks SQL console, or by calling the Query History API. For example, you could keep polling this command, checking each time to see if the execution has succeeded.
To get a SQL statement’s current execution status and, if the execution succeeded, that statement’s result and an API URL fragment for getting any next chunk of JSON data, run the following command. This command assumes that you have an environment variable on your local development machine named SQL_STATEMENT_ID
, which is set to the value of the ID of the SQL statement from the previous step. Of course, you can substitute ${SQL_STATEMENT_ID}
in the following command with the hard-coded ID of the SQL statement.
databricks api get /api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--profile <profile-name> \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
If the NEXT_CHUNK_INTERNAL_LINK
is set to a non-null
value, you can use it to get the next chunk of data, and so on, for example with the following command:
databricks api get /${NEXT_CHUNK_INTERNAL_LINK} \
--profile <profile-name> \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request GET \
https://${DATABRICKS_HOST}${NEXT_CHUNK_INTERNAL_LINK} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
You can keep running the preceding command, over and over again, to get the next chunk, and so on. Note that as soon as the last chunk is fetched, the SQL statement is closed. After this closure, you cannot use that statement’s ID to get its current status or to fetch any more chunks.
Step 3: Fetch large results using external links
This section demonstrates an optional configuration that uses the EXTERNAL_LINKS
disposition to retrieve large data sets. The default location (disposition) for the SQL statement result data is within the response payload, but these results are limited to 25 MiB. By setting the disposition
to EXTERNAL_LINKS
, the response contains URLs you can use to fetch the chunks of the results data with standard HTTP. The URLs point to your workspace’s internal DBFS, where the result chunks are temporarily stored.
Warning
Databricks strongly recommends that you protect the URLs that are returned by the EXTERNAL_LINKS
disposition.
When you use the EXTERNAL_LINKS
disposition, a short-lived, presigned URL is generated, which can be used to download the results directly from Amazon S3. As a short-lived access credential is embedded in this presigned URL, you should protect the URL.
Because presigned URLs are already generated with embedded temporary access credentials, you must not set an Authorization
header in the download requests.
The EXTERNAL_LINKS
disposition can be disabled upon request by creating a support case. See Support.
See also Security best practices.
Note
The response payload output format and behavior, once they are set for a particular SQL statement ID, cannot be changed.
In this mode, the API enables you to store result data in JSON format (JSON
), CSV format (CSV
), or Apache Arrow format (ARROW_STREAM
), that must be queried separately with HTTP. Also, when using this mode, it is not possible to inline the result data within the response payload.
The following command demonstrates using EXTERNAL_LINKS
and Apache Arrow format. Use this pattern instead of the similar query demonstrated in Step 1:
databricks api post /api/2.0/sql/statements/ \
--profile <profile-name> \
--json '{
"warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
"catalog": "samples",
"schema": "tpch",
"format": "ARROW_STREAM",
"disposition": "EXTERNAL_LINKS",
"statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem WHERE l_extendedprice > :extended_price AND l_shipdate > :ship_date LIMIT :row_limit",
"parameters": [
{ "name": "extended_price", "value": "60000", "type": "DECIMAL(18,2)" },
{ "name": "ship_date", "value": "1995-01-01", "type": "DATE" },
{ "name": "row_limit", "value": "100000", "type": "INT" }
]
}' \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
"warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
"catalog": "samples",
"schema": "tpch",
"format": "ARROW_STREAM",
"disposition": "EXTERNAL_LINKS",
"statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem WHERE l_extendedprice > :extended_price AND l_shipdate > :ship_date LIMIT :row_limit",
"parameters": [
{ "name": "extended_price", "value": "60000", "type": "DECIMAL(18,2)" },
{ "name": "ship_date", "value": "1995-01-01", "type": "DATE" },
{ "name": "row_limit", "value": "100000", "type": "INT" }
]
}' \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID
The response is as follows:
{
"manifest": {
"chunks": [
{
"byte_count": 2843848,
"chunk_index": 0,
"row_count": 100000,
"row_offset": 0
}
],
"format": "ARROW_STREAM",
"schema": {
"column_count": 3,
"columns": [
{
"name": "l_orderkey",
"position": 0,
"type_name": "LONG",
"type_text": "BIGINT"
},
{
"name": "l_extendedprice",
"position": 1,
"type_name": "DECIMAL",
"type_precision": 18,
"type_scale": 2,
"type_text": "DECIMAL(18,2)"
},
{
"name": "l_shipdate",
"position": 2,
"type_name": "DATE",
"type_text": "DATE"
}
]
},
"total_byte_count": 2843848,
"total_chunk_count": 1,
"total_row_count": 100000,
"truncated": false
},
"result": {
"external_links": [
{
"byte_count": 2843848,
"chunk_index": 0,
"expiration": "<url-expiration-timestamp>",
"external_link": "<url-to-data-stored-externally>",
"row_count": 100000,
"row_offset": 0
}
]
},
"statement_id": "00000000-0000-0000-0000-000000000000",
"status": {
"state": "SUCCEEDED"
}
}
If the request times out, the response looks like this instead:
{
"statement_id": "00000000-0000-0000-0000-000000000000",
"status": {
"state": "PENDING"
}
}
To get that statement’s current execution status and, if the execution succeeded, that statement’s result, run the following command:
databricks api get /api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--profile <profile-name> \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json'
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json'
If the response is large enough (for example in this case, by running SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem
with no row limit), the response will have multiple chunks, as in the following example below. Note that "...": "..."
indicates omitted results here for brevity:
{
"manifest": {
"chunks": [
{
"byte_count": 11469280,
"chunk_index": 0,
"row_count": 403354,
"row_offset": 0
},
{
"byte_count": 6282464,
"chunk_index": 1,
"row_count": 220939,
"row_offset": 403354
},
{
"...": "..."
},
{
"byte_count": 6322880,
"chunk_index": 10,
"row_count": 222355,
"row_offset": 3113156
}
],
"format":"ARROW_STREAM",
"schema": {
"column_count": 3,
"columns": [
{
"...": "..."
}
]
},
"total_byte_count": 94845304,
"total_chunk_count": 11,
"total_row_count": 3335511,
"truncated": false
},
"result": {
"external_links": [
{
"byte_count": 11469280,
"chunk_index": 0,
"expiration": "<url-expiration-timestamp>",
"external_link": "<url-to-data-stored-externally>",
"next_chunk_index": 1,
"next_chunk_internal_link": "/api/2.0/sql/statements/00000000-0000-0000-0000-000000000000/result/chunks/1?row_offset=403354",
"row_count": 403354,
"row_offset": 0
}
]
},
"statement_id": "00000000-0000-0000-0000-000000000000",
"status": {
"state": "SUCCEEDED"
}
}
To download the stored content’s results, you can run the following curl
command, using the URL in the external_link
object and specifying where you want to download the file. Do not include your Databricks token in this command:
curl "<url-to-result-stored-externally>" \
--output "<path/to/download/the/file/locally>"
To download a specific chunk of a streamed content’s results, you can use one of the following:
The
next_chunk_index
value from the response payload for the next chunk (if there is a next chunk).One of the chunk indexes from the response payload’s manifest for any available chunk if there are multiple chunks.
For example, to get the chunk with a chunk_index
of 10
from the previous response, run the following command:
databricks api get /api/2.0/sql/statements/${SQL_STATEMENT_ID}/result/chunks/10 \
--profile <profile-name> \
> 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json'
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/result/chunks/10 \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json'
Note
Running the preceding command returns a new presigned URL.
To download the stored chunk, use the URL in the external_link
object.
For more information about the Apache Arrow format, see:
Step 4: Cancel a SQL statement’s execution
If you need to cancel a SQL statement that has not yet succeeded, run the following command:
databricks api post /api/2.0/sql/statements/${SQL_STATEMENT_ID}/cancel \
--profile <profile-name> \
--json '{}'
Replace <profile-name>
with the name of your Databricks configuration profile for authentication.
curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/cancel \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
Security best practices
The Databricks SQL Statement Execution API increases the security of data transfers by using end-to-end transport layer security (TLS) encryption and short-lived credentials such as presigned URLs.
There are several layers in this security model. At the transport layer, it is only possible to call the Databricks SQL Statement Execution API by using TLS 1.2 or above. Also, callers of the Databricks SQL Statement Execution API must be authenticated with a valid Databricks personal access token that maps to a user who has the entitlement to use Databricks SQL. This user must have CAN USE access for the specific SQL warehouse that is being used, and access can be restricted with IP access lists. This applies to all requests to the Databricks SQL Statement Execution API. Furthermore, for executing statements, the authenticated user must have permission to the data objects (such as tables, views, and functions) that are used in each statement. This is enforced by existing access control mechanisms in Unity Catalog or by using table ACLs. (See Data governance with Unity Catalog for more details.) This also means that only the user who executes a statement can make fetch requests for the statement’s results.
Databricks recommends the following security best practices whenever you use the Databricks SQL Statement Execution API along with the EXTERNAL_LINKS
disposition to retrieve large data sets:
Remove the Databricks authorization header for Amazon S3 requests
Protect presigned URLs
Configure network restrictions on storage accounts
Configure logging on storage accounts
The EXTERNAL_LINKS
disposition can be disabled upon request by creating a support case. See Support.
Remove the Databricks authorization header for Amazon S3 requests
All calls to the Databricks SQL Statement Execution API that use curl
must include an Authorization
header that contains Databricks access credentials. Do not include this Authorization
header whenever you download data from Amazon S3. This header is not required and might unintentionally expose your Databricks access credentials.
Protect presigned URLs
Whenever you use the EXTERNAL_LINKS
disposition, a short-lived, presigned URL is generated, which the caller can use to download the results directly from Amazon S3 by using TLS. As a short-lived credential is embedded within this presigned URL, you should protect the URL.
Configure network restrictions on storage accounts
Whenever you use the EXTERNAL_LINKS
disposition, the client will obtain presigned URLs to download query results directly from Amazon S3.
Databricks recommends that you use S3 bucket policies to restrict access to your S3 buckets to trusted IP addresses and VPCs. Check that you have the correct workspace root bucket storage configurations, and review the Databricks recommendations for customer-managed VPC.
Configure logging on storage accounts
In addition to enforcing network-level restrictions on the underlying storage account, you can monitor whether anyone tries to bypass these restrictions by setting up S3 server access logging or AWS CloudTrail data events as well as the appropriate monitoring and alerting around them.