Connecting BI Tools

You can connect business intelligence (BI) tools to Databricks clusters to query data in tables. Every Databricks cluster runs a JDBC/ODBC server on the driver node. This topic covers general installation and configuration instructions for most BI tools. For tool-specific connection instructions, see Business Intelligence Tools.

Requirements

To access a cluster via JDBC/ODBC, you must have the Can Attach To permission.

Note

If you connect to a terminated cluster using JDBC/ODBC and have the Can Restart permission, the cluster restarts automatically.

Step 1: Download and install a JDBC/ODBC driver

Most BI tools require a JDBC or ODBC driver, depending on the tool, to connect to Databricks clusters.

  1. Go to the ODBC or JDBC driver download page.
  2. Fill out the form.
  3. In the Driver version drop-down, select the driver type.
  4. Download the driver and add it to your program's dependencies. A quick way to verify the installation is sketched below.
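
Before configuring a BI tool, you can check that the driver loads from your program's classpath. The following is a minimal sketch in Java; the driver class name com.simba.spark.jdbc.Driver is an assumption (it varies across Simba driver versions, so confirm it against your driver's documentation):

import java.sql.Driver;

public class DriverCheck {
    public static void main(String[] args) throws Exception {
        // Load the driver class by name. A ClassNotFoundException here means
        // the driver JAR is not on the classpath. The class name is an
        // assumption for the Simba Spark JDBC driver; it varies by version.
        Driver driver = (Driver) Class.forName("com.simba.spark.jdbc.Driver")
                .getDeclaredConstructor()
                .newInstance();
        System.out.println("Loaded JDBC driver version "
                + driver.getMajorVersion() + "." + driver.getMinorVersion());
    }
}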

Step 2: Configure JDBC/ODBC connection

Note

JDBC/ODBC connectivity to Spark 2.x clusters is available in the Databricks Operational Security Package; connectivity to Spark 1.x clusters is available to all customers. Your cluster's Spark version determines how you configure the JDBC/ODBC connection. This topic focuses primarily on connecting to Spark 2.x clusters. For Spark 1.x clusters, see the instructions in Spark 1.x Cluster Connectivity.

Here are some of the parameters a JDBC/ODBC driver might require:

Parameter                             Spark 2.x Cluster                       Spark 1.x Cluster
Username/password                     See Username and password.              See Username and password.
Host                                  YOUR_WORKSPACE_URL                      DRIVER_PUBLIC_DNS
Port                                  443                                     10000
HTTP Path                             See Construct the JDBC URL.            Not applicable

The following parameters are usually specified in the httpPath for JDBC and in the DSN configuration for ODBC:

Spark Server Type                     Spark Thrift Server                     Spark Thrift Server
Schema/Database                       default                                 default
Authentication Mechanism (AuthMech)   Username and password authentication    HTTP authentication
Thrift Transport                      http                                    binary
SSL                                   true                                    false

The following parameter affects performance. Ask your BI tool vendor to change it if you cannot access it yourself:

(Batch) Fetch Size                    100000                                  100000

Note

  • To turn off SSL, set spark.hadoop.hive.server2.use.SSL false in the cluster's Spark configuration.
  • To use binary transport, set spark.hadoop.hive.server2.transport.mode binary in the cluster's Spark configuration.

Username and password

To establish the connection, you use a personal access token to authenticate to the cluster gateway:

  • Username: token
  • Password: <personal-access-token>

If SSO is disabled, you can also use your Databricks username and password to authenticate to the cluster gateway. If SSO is enabled, you must use a personal access token.

Construct the JDBC URL

  1. On the cluster detail page, click the JDBC/ODBC tab. It contains the hostname, port, protocol, and HTTP path.

  2. Construct a JDBC connection string (URL) that looks like:

    jdbc:spark://<server-hostname>:<port>/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>
    

    Replace <server-hostname>, <port>, and <http-path> with the values from the cluster detail page, set UID to the string token, and replace <personal-access-token> with your personal access token. A minimal Java sketch using this URL follows.
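
Once the URL is constructed, opening a connection from Java is plain JDBC. The sketch below assumes the driver JAR from Step 1 is on the classpath (JDBC 4.x drivers register themselves automatically) and uses the same placeholders as above; SELECT 1 is just a connectivity check:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DatabricksConnect {
    public static void main(String[] args) throws Exception {
        // Placeholders come from the cluster detail page. UID is the literal
        // string "token"; PWD is a Databricks personal access token.
        String url = "jdbc:spark://<server-hostname>:<port>/default"
                + ";transportMode=http;ssl=1;httpPath=<http-path>"
                + ";AuthMech=3;UID=token;PWD=<personal-access-token>";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));   // prints 1 on success
            }
        }
    }
}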

ODBC Data Source Name (DSN) configuration for the Simba ODBC driver

The Data Source Name (DSN) configuration contains the parameters for communicating with a specific database. BI tools like Tableau usually provide a friendly user interface for entering these parameters. If you install and manage the Simba ODBC driver yourself, you might need to create the configuration files and also allow your Driver Manager (odbc32.dll on Windows, unixODBC or iODBC on Unix) to access them.

After you download and install the Simba ODBC driver, create two files, /etc/odbc.ini and /etc/odbcinst.ini. The content in /etc/odbc.ini can be:

[Databricks-Spark-2-x]
Driver=Simba
Server=<server-hostname>
HOST=<server-hostname>
PORT=<port>
SparkServerType=3
Schema=default
ThriftTransport=2
SSL=1
AuthMech=3
UID=token
PWD=<personal-access-token>
HTTPPath=<http-path>

The content in /etc/odbcinst.ini can be:

[ODBC Drivers]
Simba = Installed
[Simba]
Driver = <driver-path>

Set <driver-path> according to the operating system you chose when you downloaded the driver in Step 1. For example:

  • macOS /Library/simba/spark/lib/libsparkodbc_sbu.dylib
  • Linux /opt/simba/sparkodbc/lib/universal/libsimbasparkodbc.so

Specify the paths of the two files in environment variables so that the Driver Manager can find them:

export ODBCINI=/etc/odbc.ini
export ODBCSYSINI=/etc/odbcinst.ini
export SIMBASPARKINI=<simba-ini-path>/simba.sparkodbc.ini # (Contains the configuration for debugging the Simba driver)

where <simba-ini-path> is

  • macOS /Library/simba/spark/lib
  • Linux /opt/simba/sparkodbc/lib/universal

Troubleshooting

Fetching result set is slow after statement execution
After a query executes, you can fetch result rows by calling the next() method on the returned ResultSet repeatedly. When the buffered rows are exhausted, this method triggers a request to the Thrift server on the driver node to fetch the next batch. We found that the batch size significantly affects performance. The default value in most JDBC/ODBC drivers is too conservative, and we recommend that you set it to at least 100,000. Contact the BI tool provider if you cannot access this configuration.
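
If your client issues JDBC calls directly rather than going through a BI tool, the batch size can usually be set per statement. A minimal sketch, assuming the driver honors the standard JDBC fetch-size hint (some drivers expose this only as a connection or DSN property; check the driver's documentation); the table name is a hypothetical placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:spark://<server-hostname>:<port>/default"
                + ";transportMode=http;ssl=1;httpPath=<http-path>"
                + ";AuthMech=3;UID=token;PWD=<personal-access-token>";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Hint the driver to fetch rows in batches of 100,000. Each batch
            // is one round trip to the Thrift server, so a small batch size
            // dominates the total fetch time for large result sets.
            stmt.setFetchSize(100000);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM <table-name>")) {
                while (rs.next()) {
                    // process each row here
                }
            }
        }
    }
}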
Timeout/Exception when creating the connection

Once you have the server hostname, you can run the following test from a terminal to check connectivity to the endpoint:

curl https://<server-hostname>:<port>/sql/protocolv1/o/0/<cluster-id> -H "Authorization: Basic $(echo -n 'token:<personal-access-token>' | base64)"

If the connection times out, check that your network settings are correct.

TTransportException

If the response contains a TTransportException like the following, the error is expected: it means that the gateway is functioning properly and you passed in valid credentials. If you cannot connect with the same credentials, check that the client you are using is properly configured and is using the latest Simba driver (version 1.2.0 or above):

<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /cliservice. Reason:
<pre> javax.servlet.ServletException: org.apache.thrift.transport.TTransportException</pre></p>
Referencing temporary views

If the response contains the message Table or view not found: SPARK..temp_view, a temporary view is not properly referenced in the client application. Simba has an internal configuration parameter, UseNativeQuery, that determines whether a query is translated before being submitted to the Thrift server. By default the parameter is set to 0, in which case Simba can modify the query. In particular, Simba creates a custom #temp schema for temporary views and expects the client application to reference a temporary view through this schema. You can avoid this special alias by setting UseNativeQuery=1, which prevents Simba from modifying the query and sends it to the Thrift server as-is. In that case, the client must make sure its queries are written in the dialect that Spark expects, that is, HiveQL.

To sum up, you have the following options for handling temporary views over Simba and Spark (a sketch of the second option follows the list):

  • UseNativeQuery=0 and reference the view through the #temp schema (for example, #temp.<view-name>).
  • UseNativeQuery=1 and make sure the query is written in the dialect that Spark expects.
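
With the second option, the client creates and queries the temporary view itself, in Spark's dialect, over a single connection (temporary views are session-scoped). The following is a minimal sketch; the orders table and the placeholders are hypothetical, and it assumes UseNativeQuery can be appended to the connection URL as shown (confirm against your Simba driver's documentation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempViewExample {
    public static void main(String[] args) throws Exception {
        // UseNativeQuery=1 makes Simba pass queries through unmodified, so a
        // temporary view is referenced by its plain name.
        String url = "jdbc:spark://<server-hostname>:<port>/default"
                + ";transportMode=http;ssl=1;httpPath=<http-path>"
                + ";AuthMech=3;UID=token;PWD=<personal-access-token>"
                + ";UseNativeQuery=1";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Both statements run on the same connection, and therefore in
            // the same session, which is required for temporary views.
            stmt.execute("CREATE OR REPLACE TEMPORARY VIEW recent_orders AS "
                    + "SELECT * FROM orders WHERE order_date > '2018-01-01'");
            try (ResultSet rs =
                     stmt.executeQuery("SELECT COUNT(*) FROM recent_orders")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }
}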
Other errors
  • If you get the error 401 Unauthorized, check the credentials you are using:

    <h2>HTTP ERROR: 401</h2>
    <p>Problem accessing /sql/protocolv1/o/0/test-cluster. Reason:
    <pre>    Unauthorized</pre></p>
    

    If you use a personal access token to authenticate, verify that the username is token (not your username) and the password is a personal access token (the token should start with dapi).

  • Responses such as 404 Not Found usually indicate a problem locating the specified cluster:

    <h2>HTTP ERROR: 404</h2>
    <p>Problem accessing /sql/protocolv1/o/0/missing-cluster. Reason:
    <pre>    RESOURCE_DOES_NOT_EXIST: No cluster found matching: missing-cluster</pre></p>