Databricks SQL Connector for Python

Preview

This library is in Public Preview.

The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Databricks clusters and Databricks SQL endpoints. The Databricks SQL Connector for Python is easier to set up and use than similar Python libraries such as pyodbc. This library follows PEP 249 – Python Database API Specification v2.0.

Requirements

Get started

  • Gather the following information for the cluster or SQL endpoint that you want to use:

    • The server hostname of the cluster. You can get this from the Server Hostname value in the Advanced Options > JDBC/ODBC tab for your cluster.
    • The HTTP path of the cluster. You can get this from the HTTP Path value in the Advanced Options > JDBC/ODBC tab for your cluster.
    • Your Databricks personal access token for the workspace for the cluster. Databricks personal access tokens typically start with dapi.
    • The server hostname of the SQL endpoint. You can get this from the Server Hostname value in the Connection Details tab for your SQL endpoint.
    • The HTTP path of the SQL endpoint. You can get this from the HTTP Path value in the Connection Details tab for your SQL endpoint.
    • Your Databricks personal access token. Databricks personal access tokens typically start with dapi.
  • Install the Databricks SQL Connector for Python library on your development machine by running pip install databricks-sql-connector.

Example

The following code example demonstrates how to call the Databricks SQL Connector for Python to run a basic SQL command on a cluster or SQL endpoint. This command returns the first two rows from the specified database table. The code uses the environment variables that were previously specified.

from databricks import sql

connection = sql.connect(
  server_hostname='<server-hostname>',
  http_path='<http-path>',
  access_token='<personal-access-token>')

cursor = connection.cursor()

cursor.execute('SELECT * FROM <database-name>.<table-name> LIMIT 2')

result = cursor.fetchall()

for row in result:
  print(row)

cursor.close()

Replace:

  • <server-hostname> with the Server Hostname value.
  • <http-path> with the HTTP Path value.
  • <personal-access-token> with your Databricks personal access token.
  • <database-name> with the name of the database to query.
  • <table-name> with the name of the table to query.

API reference

Package

databricks-sql-connector

Usage: pip install databricks-sql-connector

See also databricks-sql-connector in the Python Package Index (PyPI).

Module

databricks.sql

Usage: from databricks import sql

Methods

connect method

Creates a connection to a database.

Returns a Connection object.

Parameters

server_hostname

Type: str

The server hostname for the cluster or SQL endpoint. To get the server hostname, see the instructions earlier in this article.

This parameter is required.

Example: dbc-a1b2345c-d6e7.cloud.databricks.com

http_path

Type: str

The HTTP path of the cluster or SQL endpoint. To get the HTTP path, see the instructions earlier in this article.

This parameter is required.

Example: sql/protocolv1/o/1234567890123456/1234-567890-test123 for a cluster. /sql/1.0/endpoints/a1b234c567d8e9fa for a SQL endpoint.

access_token

Type: str

Your Databricks personal access token for the workspace for the cluster or SQL endpoint. To create a token, see the instructions earlier in this article.

This parameter is required.

Example: dapi...<the-remaining-portion-of-your-token>

Classes

Connection class

Represents a connection to a database.

Methods
close method

Closes the connection to the database and releases all associated resources on the server. Any additional calls to this connection will throw an Error.

No parameters.

No return value.

cursor method

Returns a mechanism that enables traversal over the records in a database.

No parameters.

Returns a Cursor object.

Cursor class

Attributes
arraysize attribute

Used with the fetchmany method, specifies the internal buffer size, which is also how many rows are actually fetched from the server at a time. The default value is 10000. For narrow results (results in which each row does not contain a lot of data), you should increase this value for better performance.

Read-write access.

description attribute

Contains a Python list of tuple objects. Each of these tuple objects contains 7 values, with the first 2 items of each tuple object containing information describing a single result column as follows:

  • name: The name of the column.
  • type_code: A string representing the type of the column. For example, an integer column will have a type code of int.

The remaining 5 items of each 7-item tuple object are not implemented, and their values are not defined. They will typically be returned as 4 None values followed by a single True value.

Read-only access.

Methods
cancel method

Interrupts the running of any database query or command that the cursor has started. To release the associated resources on the server, call the close method after calling the cancel method.

No parameters.

No return value.

close method

Closes the cursor and releases the associated resources on the server. Closing an already closed cursor might throw an error.

No parameters.

No return value.

execute method

Prepares and then runs a database query or command.

No return value.

Parameters

operation

Type: str

The query or command to prepare and then run.

This parameter is required.

Example without the parameters parameter:

cursor.execute(
 'SELECT * FROM default.diamonds WHERE cut="Ideal" LIMIT 2'
)

Example with the parameters parameter:

cursor.execute(
 'SELECT * FROM default.diamonds WHERE cut=%(cut_type)s LIMIT 2',
 { 'cut_type': 'Ideal' }
)

parameters

Type: dictionary

A sequence of parameters to use with the operation parameter.

This parameter is optional. The default is None.

executemany method

Prepares and then runs a database query or command using all parameter sequences in the seq_of_parameters argument. Only the final result set is retained.

No return value.

Parameters

operation

Type: str

The query or command to prepare and then run.

This parameter is required.

seq_of_parameters

Type: list of dict

A sequence of many sets of parameter values to use with the operation parameter.

This parameter is required.

fetchall method

Gets all (or all remaining) rows of a query.

No parameters.

Returns all (or all remaining) rows of the query as a Python list of tuple objects.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

fetchmany method

Gets the next rows of a query.

Returns up to the size argument (or the arraysize attribute if size is not specified) of the next rows of a query as a Python list of tuple objects.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

Parameters

size

Type: int

The number of next rows to get.

This parameter is optional. If not specified, the value of the arraysize attribute is used.

Example: cursor.fetchmany(10)

fetchone method

Gets the next row of the dataset.

No parameters.

Returns the next row of the dataset as a single sequence as a Python tuple object, or returns None if there is no more available data.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

Type conversions

The following table maps Apache Spark SQL data types to their Python data type equivalents.

Apache Spark SQL data type Python data type
array str
bigint int
binary bytearray
boolean bool
date datetime.date
decimal decimal.Decimal
double float
int int
map str
null NoneType
smallint int
string str
struct str
timestamp datetime.datetime
tinyint int

Troubleshooting

An EOFError exception is thrown

Issue: When you run your code, you receive an EOFError exception similar to the following:

Traceback (most recent call last):
  ...
  File ".../databricks/sql/TCLIService/TCLIService.py", line 187, in OpenSession
    return self.recv_OpenSession()
  File ".../databricks/sql/TCLIService/TCLIService.py", line 199, in recv_OpenSession
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File ".../thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin
    name = self.trans.readAll(sz)
  File ".../thrift/transport/TTransport.py", line 68, in readAll
    raise EOFError()
EOFError

Possible causes: This exception can occur due to an invalid personal access token or a lack of permissions to access the specified Databricks resource.

Recommended fix: Check your connection details and try again.

Additional resources

For more information, see: