Connect Python and pyodbc to Databricks

You can connect from your local Python code through ODBC to data in a Databricks cluster or SQL warehouse. To do this, you can use the open source Python module pyodbc.

Follow these instructions to install, configure, and use pyodbc.

For more information about pyodbc, see the pyodbc Wiki.

Note

Databricks offers the Databricks SQL Connector for Python as an alternative to pyodbc. The Databricks SQL Connector for Python is easier to set up and use than pyodbc and has a more robust set of coding constructs. However, pyodbc may perform better when fetching query results larger than 10 MB.
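
For comparison, the following is a minimal sketch that uses the Databricks SQL Connector for Python instead of pyodbc. It assumes that the databricks-sql-connector package is installed and uses placeholder values for the connection details.

    from databricks import sql

    # A minimal sketch using the Databricks SQL Connector for Python instead of pyodbc.
    # Replace the placeholder values with your own connection details.
    with sql.connect(server_hostname="<server-hostname>",
                     http_path="<http-path>",
                     access_token="<personal-access-token>") as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchall())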

Requirements

  • A local development machine running one of the following:

    • macOS

    • Windows

    • A Unix or Linux distribution that supports .rpm or .deb files

  • pip.

  • For Unix, Linux, or macOS, Homebrew.

  • A Databricks cluster, a Databricks SQL warehouse, or both. For more information, see Create a cluster and Configure SQL warehouses.

Follow the instructions for Unix, Linux, or macOS or for Windows.

Unix, Linux, or macOS

If your local Python code is running on a Unix, Linux, or macOS machine, follow these instructions.

Step 1: Install software

In this step, you download and install the Databricks ODBC driver, the unixodbc package, and the pyodbc module. (The pyodbc module requires the unixodbc package on Unix, Linux, and macOS.)

  1. Download the Databricks ODBC driver.

  2. To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded.

  3. Do one of the following:

    • macOS: Double-click the extracted Simba Spark.dmg file. Then double-click the SimbaSparkODBC.pkg file that displays, and follow any on-screen directions.

    • Linux: Use your distribution’s package manager utility to install the extracted simbaspark.rpm or simbaspark.deb file, and follow any on-screen directions.

  4. Install the unixodbc package: from the terminal, run brew install unixodbc. For more information, see unixodbc on the Homebrew website.

  5. Install the pyodbc module: from the terminal, run pip install pyodbc. For more information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.

Step 2: Configure software

Specify connection details for the Databricks cluster and SQL warehouse for pyodbc to use.

Note

As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens or personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
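
If you specify connection details directly in Python code instead of in the odbc.ini file (an option described in the following steps), one way to avoid hardcoding a token in your script is to read it from an environment variable at run time. The following is a minimal sketch; the DATABRICKS_TOKEN variable name is an assumption, and the other placeholders are described in the steps below.

    import os
    import pyodbc

    # Read the personal access token from an environment variable instead of
    # hardcoding it in the script. DATABRICKS_TOKEN is an assumed variable name;
    # set it before running, for example: export DATABRICKS_TOKEN=<personal-access-token>
    token = os.environ["DATABRICKS_TOKEN"]

    conn = pyodbc.connect("Driver=<driver-path>;"
                          "HOST=<server-hostname>;"
                          "PORT=443;"
                          "Schema=default;"
                          "SparkServerType=3;"
                          "AuthMech=3;"
                          "UID=token;"
                          f"PWD={token};"
                          "ThriftTransport=2;"
                          "SSL=1;"
                          "HTTPPath=<http-path>",
                          autocommit=True)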

  1. Add the following content to the /etc/odbc.ini file on your machine:

    Tip

    If you do not want to or cannot use the /etc/odbc.ini file on your machine, you can specify connection details directly in Python code. To do this, skip the rest of this step and proceed to Step 3: Test your configuration.

    [ODBC Data Sources]
    Databricks_Cluster = Simba Spark ODBC Driver
    
    [Databricks_Cluster]
    Driver          = <driver-path>
    Description     = Simba Spark ODBC Driver DSN
    HOST            = <server-hostname>
    PORT            = 443
    Schema          = default
    SparkServerType = 3
    AuthMech        = 3
    UID             = token
    PWD             = <personal-access-token>
    ThriftTransport = 2
    SSL             = 1
    HTTPPath        = <http-path>
    

    In the preceding configuration file, replace the following placeholders, and then save the file:

    • Replace <driver-path> with one of the following:

      • macOS: /Library/simba/spark/lib/libsparkodbc_sbu.dylib

      • Linux 64-bit: /opt/simba/spark/lib/64/libsparkodbc_sb64.so

      • Linux 32-bit: /opt/simba/spark/lib/32/libsparkodbc_sb32.so

    • Replace <server-hostname> with the Server Hostname value from the Advanced Options > JDBC/ODBC tab for your cluster.

    • Replace <personal-access-token> with the value of your personal access token for your Databricks workspace.

    • Replace <http-path> with the HTTP Path value from the Advanced Options > JDBC/ODBC tab for your cluster.

    Tip

    To allow pyodbc to switch connections to a different cluster, add another entry to the [ODBC Data Sources] section and a matching DSN section with that cluster's connection details. Each entry must have a unique name within this file.

    [ODBC Data Sources]
    SQL_Warehouse = Simba Spark ODBC Driver
    
    [SQL_Warehouse]
    Driver          = <driver-path>
    HOST            = <server-hostname>
    PORT            = 443
    Schema          = default
    SparkServerType = 3
    AuthMech        = 3
    UID             = token
    PWD             = <personal-access-token>
    ThriftTransport = 2
    SSL             = 1
    HTTPPath        = <http-path>
    

    In the preceding configuration file, replace the following placeholders, and then save the file:

    • Replace <driver-path> with one of the following:

      • macOS: /Library/simba/spark/lib/libsparkodbc_sbu.dylib

      • Linux 64-bit: /opt/simba/spark/lib/64/libsparkodbc_sb64.so

      • Linux 32-bit: /opt/simba/spark/lib/32/libsparkodbc_sb32.so

    • Replace <server-hostname> with the Server Hostname value from the Connection Details tab for your SQL warehouse.

    • Replace <personal-access-token> with the value of your personal access token for your SQL warehouse.

    • Replace <http-path> with the HTTP Path value from the Connection Details tab for your SQL warehouse.

    Tip

    To allow pyodbc to switch connections to a different SQL warehouse, add another entry to the [ODBC Data Sources] section and a matching DSN section with that warehouse's connection details. Each entry must have a unique name within this file.

  2. Also add the information you just added to the /etc/odbc.ini file to the corresponding /usr/local/etc/odbc.ini file on your machine.

  3. Add the following content to the /etc/odbcinst.ini file on your machine:

    [ODBC Drivers]
    Simba Spark ODBC Driver = Installed
    
    [Simba Spark ODBC Driver]
    Driver = <driver-path>
    

    In the preceding content, replace <driver-path> with one of the following values, and then save the file:

    • macOS: /Library/simba/spark/lib/libsparkodbc_sbu.dylib

    • Linux 64-bit: /opt/simba/spark/lib/64/libsparkodbc_sb64.so

    • Linux 32-bit: /opt/simba/spark/lib/32/libsparkodbc_sb32.so

  4. Add the information you just added to the /etc/odbcinst.ini file to the corresponding /usr/local/etc/odbcinst.ini file on your machine as well.

  5. Add the following information at the end of the simba.sparkodbc.ini file on your machine, and then save the file. For macOS, this file is in /Library/simba/spark/lib. If your installed unixodbc version differs from the one shown in the following ODBCInstLib path, adjust the version number in the path to match.

    DriverManagerEncoding=UTF-16
    ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.9/lib/libodbcinst.dylib
    

Step 3: Test your configuration

In this step, you write and run Python code to use your Databricks cluster or Databricks SQL warehouse to query a database table and display the first two rows of query results.
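
Optionally, before you run a query, you can confirm that pyodbc and the ODBC driver manager can see the driver and DSNs that you configured in Step 2. The following is a minimal check, assuming the preceding configuration.

    import pyodbc

    # List the ODBC drivers and Data Source Names that the driver manager reports.
    # The Simba Spark ODBC Driver, Databricks_Cluster, and SQL_Warehouse entries
    # should appear if the preceding files were configured correctly.
    print(pyodbc.drivers())
    print(pyodbc.dataSources())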

To query by using a cluster:

  1. Create a file named pyodbc-test-cluster.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

    import pyodbc
    
    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"
    
    # Connect to the Databricks cluster by using the
    # Data Source Name (DSN) that you created earlier.
    conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
    
    # Run a SQL query by using the preceding connection.
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")
    
    # Print the rows retrieved from the query.
    print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
    for row in cursor.fetchall():
      print(row)
    

    Note

    If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details in the call to pyodbc.connect, for example:

    conn = pyodbc.connect("Driver=<driver-path>;" +
                          "HOST=<server-hostname>;" +
                          "PORT=443;" +
                          "Schema=default;" +
                          "SparkServerType=3;" +
                          "AuthMech=3;" +
                          "UID=token;" +
                          "PWD=<personal-access-token>;" +
                          "ThriftTransport=2;" +
                          "SSL=1;" +
                          "HTTPPath=<http-path>",
                          autocommit=True)
    

    Replace the placeholders with the values as described in Step 2: Configure software. For a variation that builds this connection string from a dictionary of settings, see the sketch after these steps.

  2. To speed up running the code, start the cluster that corresponds to the HTTPPath setting in your odbc.ini file.

  3. Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table are displayed.
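
If you specify connection details directly in the call to pyodbc.connect, as shown in the preceding note, you may find it easier to build the connection string from a dictionary of settings. The following is a sketch only; replace the placeholder values with your own details, or combine this with the environment-variable approach shown earlier.

    import pyodbc

    # Build the ODBC connection string from a dictionary of settings rather than
    # one long string literal. Replace the placeholder values with your own details.
    settings = {
        "Driver": "<driver-path>",
        "HOST": "<server-hostname>",
        "PORT": "443",
        "Schema": "default",
        "SparkServerType": "3",
        "AuthMech": "3",
        "UID": "token",
        "PWD": "<personal-access-token>",
        "ThriftTransport": "2",
        "SSL": "1",
        "HTTPPath": "<http-path>",
    }
    connection_string = ";".join(f"{key}={value}" for key, value in settings.items())
    conn = pyodbc.connect(connection_string, autocommit=True)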

To query by using a SQL warehouse:

  1. Create a file named pyodbc-test-warehouse.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

    import pyodbc
    
    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"
    
    # Connect to the SQL warehouse by using the
    # Data Source Name (DSN) that you created earlier.
    conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)
    
    # Run a SQL query by using the preceding connection.
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")
    
    # Print the rows retrieved from the query.
    print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
    for row in cursor.fetchall():
      print(row)
    

    Note

    If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details in the call to pyodbc.connect, for example:

    conn = pyodbc.connect("Driver=<driver-path>;" +
                          "HOST=<server-hostname>;" +
                          "PORT=443;" +
                          "Schema=default;" +
                          "SparkServerType=3;" +
                          "AuthMech=3;" +
                          "UID=token;" +
                          "PWD=<personal-access-token>;" +
                          "ThriftTransport=2;" +
                          "SSL=1;" +
                          "HTTPPath=<http-path>",
                          autocommit=True)
    

    Replace the placeholders with the values as described in Step 2: Configure software.

  2. To speed up running the code, start the SQL warehouse that corresponds to the HTTPPath setting in your odbc.ini file.

  3. Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database table are displayed.
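
As noted earlier, pyodbc may perform well on larger result sets. For queries that return more rows than you want to hold in memory at once, you can fetch results in batches instead of calling fetchall. The following is a sketch, assuming the SQL_Warehouse DSN configured earlier.

    import pyodbc

    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"

    # Fetch query results in batches of 1,000 rows rather than all at once.
    conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name}")

    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        for row in rows:
            print(row)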

Next steps

  • To run the Python test code against a different cluster or SQL warehouse, change the settings in the preceding two odbc.ini files. Or add a new entry to the [ODBC Data Sources] section, along with matching connection details, to the two odbc.ini files. Then change the DSN name in the test code to match the related name in [ODBC Data Sources].

  • To run the Python test code against a different database table, change the table_name value.

  • To run the Python test code with a different SQL query, change the execute command string.
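
If you change the query to filter on a value, you can also pass that value as a parameter by using a question-mark marker rather than formatting it into the SQL string. The following is a sketch only; parameter marker support depends on the driver, and the column name my_column is a placeholder for a column in your table.

    import pyodbc

    # Replace <table-name> with the name of the database table to query.
    # The column name my_column is a placeholder for illustration only.
    table_name = "<table-name>"

    conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
    cursor = conn.cursor()

    # Pass the filter value as a parameter instead of formatting it into the SQL string.
    cursor.execute(f"SELECT * FROM {table_name} WHERE my_column = ? LIMIT 2", "some-value")

    for row in cursor.fetchall():
        print(row)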

Windows

If your local Python code is running on a Windows machine, follow these instructions.

Step 1: Install software

  1. Download the Databricks ODBC driver.

  2. To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded.

  3. Double-click the extracted Simba Spark.msi file, and follow any on-screen directions.

  4. Install the pyodbc module: from an administrative command prompt, run pip install pyodbc. For more information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.

Step 2: Configure software

Specify connection details for the Databricks cluster or Databricks SQL warehouse for pyodbc to use.

To specify connection details for a cluster:

  1. Add a data source name (DSN) that contains information about your cluster: start the ODBC Data Sources application by opening the Start menu, typing ODBC, and then clicking ODBC Data Sources.

  2. On the User DSN tab, click Add. In the Create New Data Source dialog box, click Simba Spark ODBC Driver, and then click Finish.

  3. In the Simba Spark ODBC Driver DSN Setup dialog box, change the following values:

    • Data Source Name: Databricks_Cluster

    • Description: My cluster

    • Spark Server Type: SparkThriftServer (Spark 1.1 and later)

    • Host(s): The Server Hostname value from the Advanced Options, JDBC/ODBC tab for your cluster.

    • Port: 443

    • Database: default

    • Mechanism: User Name and Password

    • User Name: token

    • Password: The value of your personal access token for your Databricks workspace.

    • Thrift Transport: HTTP

  4. Click HTTP Options. In the HTTP Properties dialog box, for HTTP Path, enter the HTTP Path value from the Advanced Options, JDBC/ODBC tab for your cluster, and then click OK.

  5. Click SSL Options. In the SSL Options dialog box, check the Enable SSL box, and then click OK.

  6. Click Test. If the test succeeds, click OK.

Tip

To allow pyodbc to switch connections to a different cluster, repeat this procedure with the specific connection details. Each DSN must have a unique name.

To specify connection details for a SQL warehouse:

  1. In the ODBC Data Sources application, on the User DSN tab, click Add. In the Create New Data Source dialog box, click Simba Spark ODBC Driver, and then click Finish.

  2. In the Simba Spark ODBC Driver DSN Setup dialog box, enter the following values:

    • Data Source Name: SQL_Warehouse

    • Description: My warehouse

    • Spark Server Type: SparkThriftServer (Spark 1.1 and later)

    • Host(s): The Server Hostname value from the Connection Details tab for your SQL warehouse.

    • Port: 443

    • Database: default

    • Mechanism: User Name and Password

    • User Name: token

    • Password: The value of your personal access token for your SQL warehouse.

    • Thrift Transport: HTTP

  3. Click HTTP Options. In the HTTP Properties dialog box, for HTTP Path, enter the HTTP Path value from the Connection Details tab for your SQL warehouse, and then click OK.

  4. Click SSL Options. In the SSL Options dialog box, check the Enable SSL box, and then click OK.

  5. Click Test. If the test succeeds, click OK.

Tip

To allow pyodbc to switch connections to a different SQL warehouse, repeat this procedure with the specific connection details. Each DSN must have a unique name.

Step 3: Test your configuration

In this step, you write and run Python code to use your Databricks cluster or Databricks SQL warehouse to query a database table and display the first two rows of query results.

To query by using a cluster:

  1. Create a file named pyodbc-test-cluster.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

    import pyodbc
    
    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"
    
    # Connect to the Databricks cluster by using the
    # Data Source Name (DSN) that you created earlier.
    conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
    
    # Run a SQL query by using the preceding connection.
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")
    
    # Print the rows retrieved from the query.
    print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
    for row in cursor.fetchall():
      print(row)
    
  2. To speed up running the code, start the cluster that corresponds to the Host(s) value in the Simba Spark ODBC Driver DSN Setup dialog box for your Databricks cluster.

  3. Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table are displayed.
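
If you prefer to work with query results as a table, you can also load them into a pandas DataFrame over the same connection. The following is a sketch only; it assumes that pandas is installed and uses the Databricks_Cluster DSN.

    import pandas as pd
    import pyodbc

    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"

    # Load the first two rows of the table into a pandas DataFrame over the pyodbc connection.
    conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
    df = pd.read_sql(f"SELECT * FROM {table_name} LIMIT 2", conn)
    print(df)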

To query by using a SQL warehouse:

  1. Create a file named pyodbc-test-warehouse.py with the following content. Replace <table-name> with the name of the database table to query, and then save the file.

    import pyodbc
    
    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"
    
    # Connect to the SQL warehouse by using the
    # Data Source Name (DSN) that you created earlier.
    conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)
    
    # Run a SQL query by using the preceding connection.
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")
    
    # Print the rows retrieved from the query.
    print(f"Query output: SELECT * FROM {table_name} LIMIT 2\n")
    for row in cursor.fetchall():
      print(row)
    
  2. To speed up running the code, start the SQL warehouse that corresponds to the Host(s) value in the Simba Spark ODBC Driver DSN Setup dialog box for your Databricks SQL warehouse.

  3. Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database table are displayed.
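
To print the column names along with the rows, you can read them from cursor.description after the query runs. The following is a minimal sketch, assuming the SQL_Warehouse DSN.

    import pyodbc

    # Replace <table-name> with the name of the database table to query.
    table_name = "<table-name>"

    conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name} LIMIT 2")

    # Each entry in cursor.description is a tuple whose first element is the column name.
    column_names = [column[0] for column in cursor.description]
    print(column_names)

    for row in cursor.fetchall():
        print(row)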

Next steps

  • To run the Python test code against a different cluster or SQL warehouse, change the Host(s) value in the Simba Spark ODBC Driver DSN Setup dialog box for your Databricks cluster or Databricks SQL warehouse. Or create a new DSN. Then change the DSN name in the test code to match the related Data Source Name.

  • To run the Python test code against a different database table, change the table_name value.

  • To run the Python test code with a different SQL query, change the execute command string.
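
The test scripts in this article do not explicitly close the cursor or the connection when they finish. As general housekeeping, you can close both when you are done with them. The following is a sketch, assuming the Databricks_Cluster DSN.

    import pyodbc

    # Close the cursor and the connection explicitly when you are done with them.
    conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
    try:
        cursor = conn.cursor()
        try:
            cursor.execute("SELECT 1")
            print(cursor.fetchall())
        finally:
            cursor.close()
    finally:
        conn.close()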

Troubleshooting

This section addresses common issues when using pyodbc with Databricks.

Unicode decode error

Issue: You receive an error message similar to the following:

<class 'pyodbc.Error'> returned a result with an error set
Traceback (most recent call last):
File "/Users/user/.pyenv/versions/3.7.5/lib/python3.7/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 2112-2113: illegal UTF-16 surrogate

Cause: pyodbc versions 4.0.31 and below have an issue that can manifest with these symptoms when running queries that return columns with long names or a long error message. The issue is fixed in newer versions of pyodbc.

Solution: Upgrade your installation of pyodbc to version 4.0.32 or above.
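
To check which version of pyodbc is installed before upgrading, you can print the module's version string.

    import pyodbc

    # Print the installed pyodbc version. Upgrade if it reports 4.0.31 or below,
    # for example by running: pip install --upgrade pyodbc
    print(pyodbc.version)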

General troubleshooting

See Issues in the mkleehammer/pyodbc repository on GitHub.