Connect Python and pyodbc to Databricks
You can connect from your local Python code through ODBC to data in a Databricks cluster or SQL warehouse. To do this, you can use the open source Python code module pyodbc.
Follow these instructions to install, configure, and use pyodbc.
For more information about pyodbc, see the pyodbc Wiki.
Note
Databricks offers the Databricks SQL Connector for Python as an alternative to pyodbc. The Databricks SQL Connector for Python is easier to set up and use, and has a more robust set of coding constructs, than pyodbc. However, pyodbc may have better performance when fetching query results larger than 10 MB.
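For comparison, the following is a minimal sketch of a query made through the Databricks SQL Connector for Python instead of pyodbc. It assumes the databricks-sql-connector package is installed (pip install databricks-sql-connector), and the placeholder values must be replaced with your own connection details:

# Minimal sketch using the Databricks SQL Connector for Python.
# Replace the placeholder values with your workspace's details.
from databricks import sql

with sql.connect(
    server_hostname="<server-hostname>",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())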
Requirements
A local development machine running one of the following:
macOS
Windows
A Unix or Linux distribution that supports .rpm or .deb files
pip.
For Unix, Linux, or macOS, Homebrew.
A Databricks cluster, a Databricks SQL warehouse, or both. For more information, see Create a cluster and Configure SQL warehouses.
Follow the instructions for Unix, Linux, or macOS or for Windows.
Unix, Linux, or macOS
If your local Python code is running on a Unix, Linux, or macOS machine, follow these instructions.
Step 1: Install software
In this step, you download and install the Databricks ODBC driver, the unixodbc package, and the pyodbc module. (The pyodbc module requires the unixodbc package on Unix, Linux, and macOS.)
Download the Databricks ODBC driver.
To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded. Do one of the following:
macOS: Double-click the extracted Simba Spark.dmg file. Then double-click the SimbaSparkODBC.pkg file that displays, and follow any on-screen directions.
Linux: Use your distribution’s package manager utility to install the extracted simbaspark.rpm or simbaspark.deb file, and follow any on-screen directions.
Install the unixodbc package: from the terminal, run brew install unixodbc. For more information, see unixodbc on the Homebrew website.
Install the pyodbc module: from the terminal, run pip install pyodbc. For more information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.
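As a quick sanity check that the pyodbc module installed correctly, you can print its version string from Python:

import pyodbc

# Prints the installed pyodbc version string, for example "4.0.32".
print(pyodbc.version)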
Step 2: Configure software
Specify connection details for the Databricks cluster and SQL warehouse for pyodbc to use.
Note
As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens or personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
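If you specify connection details directly in Python code (see the Tip later in this step), one way to keep tokens out of source files is to read them from an environment variable. A minimal sketch, assuming a DATABRICKS_TOKEN environment variable (a name chosen here for illustration):

import os
import pyodbc

# Read the token from an environment variable instead of hardcoding it.
# DATABRICKS_TOKEN is an illustrative name; use any name you like.
token = os.environ["DATABRICKS_TOKEN"]

conn = pyodbc.connect(
    "Driver=<driver-path>;"
    "HOST=<server-hostname>;"
    "PORT=443;"
    "Schema=default;"
    "SparkServerType=3;"
    "AuthMech=3;"
    "UID=token;"
    f"PWD={token};"
    "ThriftTransport=2;"
    "SSL=1;"
    "HTTPPath=<http-path>",
    autocommit=True,
)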
Add the following content to the /etc/odbc.ini file on your machine:

Tip
If you do not want to or cannot use the /etc/odbc.ini file on your machine, you can specify connection details directly in Python code. To do this, skip the rest of this step and proceed to Step 3: Test your configuration.

[ODBC Data Sources]
Databricks_Cluster = Simba Spark ODBC Driver

[Databricks_Cluster]
Driver = <driver-path>
Description = Simba Spark ODBC Driver DSN
HOST = <server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <http-path>
In the preceding configuration file, replace the following placeholders, and then save the file:
Replace <driver-path> with one of the following:
macOS: /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit: /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit: /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Replace <server-hostname> with the Server Hostname value from the Advanced Options > JDBC/ODBC tab for your cluster.
Replace <personal-access-token> with the value of your personal access token for your Databricks workspace.
Replace <http-path> with the HTTP Path value from the Advanced Options > JDBC/ODBC tab for your cluster.
Tip
To allow pyodbc to switch connections to a different cluster, add an entry to the [ODBC Data Sources] section and a matching entry below [Databricks_Cluster] with the specific connection details. Each entry must have a unique name within this file.

To use a SQL warehouse, also add the following content to the /etc/odbc.ini file:

[ODBC Data Sources]
SQL_Warehouse = Simba Spark ODBC Driver

[SQL_Warehouse]
Driver = <driver-path>
HOST = <server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <http-path>
In the preceding configuration file, replace the following placeholders, and then save the file:
Replace <driver-path> with one of the following:
macOS: /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit: /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit: /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Replace <server-hostname> with the Server Hostname value from the Connection Details tab for your SQL warehouse.
Replace <personal-access-token> with the value of your personal access token for your SQL warehouse.
Replace <http-path> with the HTTP Path value from the Connection Details tab for your SQL warehouse.
Tip
To allow pyodbc to switch connections to a different SQL warehouse, add an entry to the [ODBC Data Sources] section and a matching entry below [SQL_Warehouse] with the specific connection details. Each entry must have a unique name within this file.

Add the information you just added to the /etc/odbc.ini file to the corresponding /usr/local/etc/odbc.ini file on your machine as well.

Add the following content to the /etc/odbcinst.ini file on your machine:

[ODBC Drivers]
Simba Spark ODBC Driver = Installed

[Simba Spark ODBC Driver]
Driver = <driver-path>
In the preceding content, replace <driver-path> with one of the following values, and then save the file:
macOS: /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit: /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit: /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Add the information you just added to the /etc/odbcinst.ini file to the corresponding /usr/local/etc/odbcinst.ini file on your machine as well.

Add the following information at the end of the simba.sparkodbc.ini file on your machine, and then save the file. For macOS, this file is in /Library/simba/spark/lib.

DriverManagerEncoding=UTF-16
ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.9/lib/libodbcinst.dylib
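Before moving on to Step 3, you can optionally confirm that the driver manager sees your entries. pyodbc exposes the installed drivers and configured DSNs directly:

import pyodbc

# "Simba Spark ODBC Driver" should appear if odbcinst.ini is correct.
print(pyodbc.drivers())

# "Databricks_Cluster" (and "SQL_Warehouse", if you added it) should
# appear if odbc.ini is correct.
print(pyodbc.dataSources())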
Step 3: Test your configuration
In this step, you write and run Python code to use your Databricks cluster or Databricks SQL warehouse to query a database table and display the first two rows of query results.
To query by using a cluster:
Create a file named pyodbc-test-cluster.py with the following content. Replace <table-name> with the name of the database table to query, save the file, and then run the file with your Python interpreter.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the Databricks cluster by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
    print(row)
Note
If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details in the call to pyodbc.connect, for example:

conn = pyodbc.connect("Driver=<driver-path>;" +
                      "HOST=<server-hostname>;" +
                      "PORT=443;" +
                      "Schema=default;" +
                      "SparkServerType=3;" +
                      "AuthMech=3;" +
                      "UID=token;" +
                      "PWD=<personal-access-token>;" +
                      "ThriftTransport=2;" +
                      "SSL=1;" +
                      "HTTPPath=<http-path>",
                      autocommit=True)
Replace the placeholders with the values as described in Step 2: Configure software.
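The test code splices the table name into the query with an f-string, which is necessary for identifiers such as table names. For data values, pyodbc also supports ? parameter markers, which avoid manual quoting, provided the driver supports parameter binding. A minimal sketch, reusing the Databricks_Cluster DSN with an illustrative id column:

import pyodbc

conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
cursor = conn.cursor()

# Bind the value 100 to the ? marker rather than formatting it into
# the SQL string. <table-name> and the id column are illustrative.
cursor.execute("SELECT * FROM <table-name> WHERE id > ? LIMIT 2", 100)
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()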
To speed up running the code, start the cluster that corresponds to the HTTPPath setting in your odbc.ini file.
Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table are displayed.
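If you prefer the results as a DataFrame rather than raw rows, the same connection can be handed to pandas. A minimal sketch, assuming pandas is installed (recent pandas versions warn about non-SQLAlchemy connections, but the call works):

import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)

# Load the first two rows into a DataFrame; replace <table-name>
# as in the test script.
df = pd.read_sql("SELECT * FROM <table-name> LIMIT 2", conn)
print(df)

conn.close()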
To query by using a SQL warehouse:
Create a file named pyodbc-test-warehouse.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the SQL warehouse by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
    print(row)
Note
If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details in the call to pyodbc.connect, for example:

conn = pyodbc.connect("Driver=<driver-path>;" +
                      "HOST=<server-hostname>;" +
                      "PORT=443;" +
                      "Schema=default;" +
                      "SparkServerType=3;" +
                      "AuthMech=3;" +
                      "UID=token;" +
                      "PWD=<personal-access-token>;" +
                      "ThriftTransport=2;" +
                      "SSL=1;" +
                      "HTTPPath=<http-path>",
                      autocommit=True)
Replace the placeholders with the values as described in Step 2: Configure software.
To speed up running the code, start the SQL warehouse that corresponds to the HTTPPath setting in your odbc.ini file.
Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database table are displayed.
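If you use pyodbc specifically because of its performance on large result sets (see the note at the top of this article), fetching in fixed-size batches with cursor.fetchmany keeps memory use bounded. A minimal sketch:

import pyodbc

conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)
cursor = conn.cursor()

# Replace <table-name> as in the test script.
cursor.execute("SELECT * FROM <table-name>")

# Fetch rows in batches of 10,000 instead of all at once.
while True:
    batch = cursor.fetchmany(10000)
    if not batch:
        break
    for row in batch:
        pass  # process each row here

cursor.close()
conn.close()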
Next steps
To run the Python test code against a different cluster or SQL warehouse, change the settings in the preceding two odbc.ini files. Or add a new entry to the [ODBC Data Sources] section, along with matching connection details, to the two odbc.ini files. Then change the DSN name in the test code to match the related name in [ODBC Data Sources]; one way to parameterize the DSN is shown in the sketch after this list.
To run the Python test code against a different database table, change the table_name value.
To run the Python test code with a different SQL query, change the execute command string.
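For the first item above, a minimal sketch of parameterizing the DSN so the same test script can target either entry; the command-line argument handling is illustrative:

import sys
import pyodbc

# Pass the DSN name as an optional command-line argument, for example:
#   python pyodbc-test-cluster.py SQL_Warehouse
dsn = sys.argv[1] if len(sys.argv) > 1 else "Databricks_Cluster"

conn = pyodbc.connect(f"DSN={dsn}", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())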
Windows
If your local Python code is running on a Windows machine, follow these instructions.
Step 1: Install software
Download the Databricks ODBC driver.
To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded.
Double-click the extracted Simba Spark.msi file, and follow any on-screen directions.
Install the pyodbc module: from an administrative command prompt, run pip install pyodbc. For more information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.
Step 2: Configure software
Specify connection details for the Databricks cluster or Databricks SQL warehouse for pyodbc to use.
To specify connection details for a cluster:
Add a data source name (DSN) that contains information about your cluster. Start the ODBC Data Sources application: on the Start menu, begin typing ODBC, and then click ODBC Data Sources.
On the User DSN tab, click Add. In the Create New Data Source dialog box, click Simba Spark ODBC Driver, and then click Finish.
In the Simba Spark ODBC Driver DSN Setup dialog box, change the following values:
Data Source Name: Databricks_Cluster
Description: My cluster
Spark Server Type: SparkThriftServer (Spark 1.1 and later)
Host(s): The Server Hostname value from the Advanced Options, JDBC/ODBC tab for your cluster.
Port: 443
Database: default
Mechanism: User Name and Password
User Name: token
Password: The value of your personal access token for your Databricks workspace.
Thrift Transport: HTTP
Click HTTP Options. In the HTTP Properties dialog box, for HTTP Path, enter the HTTP Path value from the Advanced Options, JDBC/ODBC tab for your cluster, and then click OK.
Click SSL Options. In the SSL Options dialog box, check the Enable SSL box, and then click OK.
Click Test. If the test succeeds, click OK.
Tip
To allow pyodbc to switch connections to a different cluster, repeat this procedure with the specific connection details. Each DSN must have a unique name.
To specify connection details for a SQL warehouse:
In the ODBC Data Sources application, on the User DSN tab, click Add. In the Create New Data Source dialog box, click Simba Spark ODBC Driver, and then click Finish.
In the Simba Spark ODBC Driver DSN Setup dialog box, enter the following values:
Data Source Name: SQL_Warehouse
Description: My warehouse
Spark Server Type: SparkThriftServer (Spark 1.1 and later)
Host(s): The Server Hostname value from the Connection Details tab for your SQL warehouse.
Port: 443
Database: default
Mechanism: User Name and Password
User Name: token
Password: The value of your personal access token for your SQL warehouse.
Thrift Transport: HTTP
Click HTTP Options. In the HTTP Properties dialog box, for HTTP Path, enter the HTTP Path value from the Connection Details tab for your SQL warehouse, and then click OK.
Click SSL Options. In the SSL Options dialog box, check the Enable SSL box, and then click OK.
Click Test. If the test succeeds, click OK.
Tip
To allow pyodbc to switch connections to a different SQL warehouse, repeat this procedure with the specific connection details. Each DSN must have a unique name.
Step 3: Test your configuration
In this step, you write and run Python code to use your Databricks cluster or Databricks SQL warehouse to query a database table and display the first two rows of query results.
To query by using a cluster:
Create a file named pyodbc-test-cluster.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the Databricks cluster by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
    print(row)
To speed up running the code, start the cluster that corresponds to the Host(s) value in the Simba Spark ODBC Driver DSN Setup dialog box for your Databricks cluster.
Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table are displayed.
To query by using a SQL warehouse:
Create a file named pyodbc-test-warehouse.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

import pyodbc

# Replace <table-name> with the name of the database table to query.
table_name = "<table-name>"

# Connect to the SQL warehouse by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)

# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

# Print the rows retrieved from the query.
print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
for row in cursor.fetchall():
    print(row)
To speed up running the code, start the SQL warehouse that corresponds to the Host(s) value in the Simba Spark ODBC Driver DSN Setup dialog box for your Databricks SQL warehouse.
Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database table are displayed.
Next steps
To run the Python test code against a different cluster or SQL warehouse, change the Host(s) value in the Simba Spark ODBC Driver DSN Setup dialog box for your Databricks cluster or Databricks SQL warehouse. Or create a new DSN. Then change the DSN name in the test code to match the related Data Source Name.
To run the Python test code against a different database table, change the table_name value.
To run the Python test code with a different SQL query, change the execute command string.
Troubleshooting
This section addresses common issues when using pyodbc with Databricks.
Unicode decode error
Issue: You receive an error message similar to the following:
<class 'pyodbc.Error'> returned a result with an error set
Traceback (most recent call last):
File "/Users/user/.pyenv/versions/3.7.5/lib/python3.7/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 2112-2113: illegal UTF-16 surrogate
Cause: An issue exists in pyodbc version 4.0.31 or below that could manifest with such symptoms when running queries that return columns with long names or a long error message. The issue has been fixed in a newer version of pyodbc.
Solution: Upgrade your installation of pyodbc to version 4.0.32 or above.
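A quick way to check whether your installed version is affected, assuming a purely numeric dotted version string (release builds such as "4.0.32" follow this form):

import pyodbc

# pyodbc.version is a string such as "4.0.32".
installed = tuple(int(part) for part in pyodbc.version.split("."))
if installed < (4, 0, 32):
    print("Affected pyodbc version; run: pip install --upgrade pyodbc")
else:
    print(f"pyodbc {pyodbc.version} is not affected by this issue.")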
General troubleshooting
See Issues in the mkleehammer/pyodbc repository on GitHub.