Connect Tableau and Databricks

This article shows you how to use Partner Connect to connect from Databricks to Tableau Desktop and from Tableau Desktop or Tableau Cloud to Databricks. This article also includes information about Tableau Server on Linux.

Note

To configure Databricks sign-on from Tableau Server, see Configure Databricks sign-on from Tableau Server.

When you use Databricks as a data source with Tableau, you can deliver powerful interactive analytics at scale, bringing the work of your data scientists and data engineers to your business analysts across massive datasets.

Requirements to connect Tableau and Databricks

Connection information

The connection details for a compute resource or SQL warehouse, specifically the Server Hostname and HTTP Path values.

Connect data managed by Databricks Unity Catalog to Tableau Desktop

Connect data managed by the legacy Databricks Hive metastore to Tableau Desktop

  • Tableau Desktop 2019.3 or above.

  • Databricks ODBC Driver 2.6.15 or above.

Authentication options

Use one of the following authentication options:

Connect Databricks to Tableau Desktop using Partner Connect

You can use Partner Connect to connect a compute resource or SQL warehouse with Tableau Desktop in just a few clicks.

  1. Make sure your Databricks account, workspace, and the signed-in user all meet the requirements for Partner Connect.

  2. In the sidebar, click Partner Connect button Partner Connect.

  3. Click the Tableau tile.

  4. In the Connect to partner dialog, for Compute, choose the name of the Databricks compute resource that you want to connect.

  5. Choose Download connection file.

  6. Open the downloaded connection file, which starts Tableau Desktop.

  7. In Tableau Desktop, enter your authentication credentials, and then click Sign In:

    • To use a Databricks personal access token, enter token for Username and your personal access token for Password.

    • Username / Password: Not applicable. See Authentication options.

Connect Tableau Desktop to Databricks

Follow these instructions to connect from Tableau Desktop to a compute resource or SQL warehouse.

Note

To connect faster with Tableau Desktop, use Partner Connect.

  1. Start Tableau Desktop.

  2. Click File > New.

  3. On the Data tab, click Connect to Data.

  4. In the list of connectors, click Databricks.

  5. Enter the Server Hostname and HTTP Path.

  6. For Authentication, choose your authentication method, enter your authentication credentials, and then click Sign in.

    • To use a Databricks personal access token, select Personal Access Token and enter your personal access token for Password.

    • OAuth / Microsoft Entra ID: For OAuth endpoint, enter https://<server-hostname>/oidc, where <server-hostname> is the Server Hostname value for your compute resource or SQL warehouse. A browser window opens and prompts you to sign in to your identity provider (IdP).

    • Username / Password: Not applicable. See Authentication options.

    If Unity Catalog is enabled for your workspace, additionally set the default catalog. In the Advanced tab, for Connection properties, add Catalog=<catalog-name>. To change the default catalog, in the Initial SQL tab, enter USE CATALOG <catalog-name>.
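For example, with a hypothetical catalog named main and schema named sales, the Initial SQL statement would be:

-- Set the session's default catalog (names are illustrative)
USE CATALOG main;
-- Optionally set the default schema as well
USE SCHEMA sales;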

Connect Tableau Cloud to Databricks

Follow these instructions to connect to a compute resource or SQL warehouse from Tableau Cloud.

  1. Start a new workbook.

  2. On the menu bar, click Data > New Data Source.

  3. On the Connect to Data page, click Connectors > Databricks.

  4. On the Databricks page, enter the Server Hostname and HTTP Path values.

  5. Select your authentication method and enter the requested information (if any).

  6. Click Sign In.

Tableau Server on Linux

Edit /etc/odbcinst.ini to include the following:

[Simba Spark ODBC Driver 64-bit]
Description=Simba Spark ODBC Driver (64-bit)
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so

Note

For Tableau Server on Linux, a 64-bit processing architecture is recommended.

Publish and refresh a workbook on Tableau Cloud from Tableau Desktop

This article shows how to publish a workbook from Tableau Desktop to Tableau Cloud and keep it updated when the data source changes. You need a workbook in Tableau Desktop and a Tableau Cloud account.

  1. Extract the workbook’s data from Tableau Desktop: in Tableau Desktop, with the workbook that you want to publish displayed, click Data > <data-source-name> > Extract Data.

  2. In the Extract Data dialog box, click Extract.

  3. Browse to a location on your local machine where you want to save the extracted data, and then click Save.

  4. Publish the workbook’s data source to Tableau Cloud: in Tableau Desktop, click Server > Publish Data Source > <data-source-name>.

  5. If the Tableau Server Sign In dialog box displays, click the Tableau Cloud link, and follow the on-screen directions to sign in to Tableau Cloud.

  6. In the Publish Data Source to Tableau Cloud dialog box, next to Refresh Not Enabled, click the Edit link.

  7. In the flyout box that displays, for Authentication, change Refresh not enabled to Allow refresh access.

  8. Click anywhere outside of this flyout to hide it.

  9. Select Update workbook to use the published data source.

  10. Click Publish. The data source displays in Tableau Cloud.

  11. In Tableau Cloud, in the Publishing Complete dialog box, click Schedule, and follow the on-screen directions.

  12. Publish the workbook to Tableau Cloud: in Tableau Desktop, with the workbook you want to publish displayed, click Server > Publish Workbook.

  13. In the Publish Workbook to Tableau Cloud dialog box, click Publish. The workbook displays in Tableau Cloud.

Tableau Cloud checks for changes to the data source according to the schedule you set, and updates the published workbook if changes are detected.

For more information, see the Tableau website.

Best practices and troubleshooting

The two fundamental actions to optimize Tableau queries are:

  • Reduce the number of records being queried and visualized in a single chart or dashboard.

  • Reduce the number of queries being sent by Tableau in a single chart or dashboard.

Deciding which to try first depends on your dashboard. If you have a number of different charts for individual users all in the same dashboard, it’s likely that Tableau is sending too many queries to Databricks. If you only have a couple of charts but they take a long time to load, there are probably too many records being returned by Databricks to load effectively.

Tableau performance recording, available on both Tableau Desktop and Tableau Server, can help you identify performance bottlenecks by identifying processes that cause latency when you run a particular workflow or dashboard.

Enable performance recording to debug any Tableau issue

For instance, if query execution is the problem, you know it has to do with the data engine process or the data source that you are querying. If the visual layout is performing slowly, you know that it is the VizQL.

If the performance recording says that the latency is in the executing query, it is likely that too much time is taken by Databricks to return the results or by the ODBC/Connector overlay processing the data into SQL for VizQL. When this occurs, you should analyze what you are returning and attempt to change the analytical pattern to have a dashboard per group, segment, or article instead of trying to cram everything into one dashboard and relying on Quick Filters.

If the poor performance is caused by sorting or visual layout, the problem may be the number of marks the dashboard is trying to return. Databricks can return one million records quickly, but Tableau may not be able to compute the layout and sort the results. If this is a problem, aggregate the query and drill into the lower levels. You can also try a bigger machine since Tableau is only constrained by physical resources on the machine on which it is running.

For an in-depth tutorial on the performance recorder, see Create a Performance Recording.

Performance on Tableau Server versus Tableau Desktop

In general, a workflow that runs on Tableau Desktop is no faster on Tableau Server. A dashboard that doesn’t execute on Tableau Desktop will not execute on Tableau Server.

Using Desktop is a much better troubleshooting technique because Tableau Server has more processes to consider when you troubleshoot. If things work in Tableau Desktop but not in Tableau Server, then you can safely narrow the issue down to the processes in Tableau Server that aren’t in Tableau Desktop.

Configuration

By default, the parameters from the connection URL override those in the Simba ODBC DSN. There are two ways you can customize the ODBC configurations from Tableau:

  • .tds file for a single data source:

    1. Follow the instructions in Save Data Sources to export the .tds file for the data source.

    2. Find the property line odbc-connect-string-extras='' in the .tds file and set the parameters. For example, to enable AutoReconnect and UseNativeQuery, you can change the line to odbc-connect-string-extras='AutoReconnect=1,UseNativeQuery=1'.

    3. Reload the .tds file by reconnecting the connection.

    The compute resource is optimized to use less heap memory for collecting large results, so it can serve more rows per fetch block than Simba ODBC’s default. Append RowsFetchedPerBlock=100000 to the value of the odbc-connect-string-extras property.

  • .tdc file for all data sources:

    1. If you have never created a .tdc file, you can add TableauTdcExample.tdc to the folder Documents/My Tableau Repository/Datasources.

    2. Add the file to all developers’ Tableau Desktop installations, so that it works when the dashboards are shared.
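As an illustration of the .tds edit described above, the odbc-connect-string-extras property appears as an attribute on the data source’s connection element; this is a sketch, and the exact element name and surrounding attributes in your exported file may differ:

<connection class='spark' server='<server-hostname>' odbc-connect-string-extras='AutoReconnect=1,UseNativeQuery=1,RowsFetchedPerBlock=100000'>

Edit only the odbc-connect-string-extras value and leave the other attributes as exported.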

Optimize charts (worksheets)

There are a number of tactical chart optimizations that can help you improve the performance of your Tableau worksheets.

For filters that don’t change often and are not meant to be interacted with, use context filters, which speed up execution time. Another good rule of thumb is to use if/else statements instead of case/when statements in your queries.

Tableau can push down filters into data sources, which can improve query speeds. For more information about data source push down filters, see Filtering Across Multiple Data Sources Using a Parameter and Filter Data Across Multiple Data Sources.

Try to avoid table calculations, as they scan the full dataset. For more information about table calculations, see Transform Values with Table Calculations.

Optimize dashboards

The following are some tips and troubleshooting exercises you can apply to improve your Tableau dashboard performance.

With Tableau dashboards connected to Databricks, quick filters on individual dashboards that serve a number of different users, functions, or segments can be a common source of issues. You can attach quick filters to all of the charts on the dashboard. One quick filter on a dashboard with five charts causes a minimum of 10 queries to be sent to Databricks. This can grow to greater numbers when more filters are added, and it can cause performance problems because Spark is not built to handle many concurrent queries starting at the same exact moment. This becomes more problematic when the Databricks cluster or SQL warehouse that you are using is not large enough to handle the high volume of queries.

As a first step, we recommend that you use Tableau performance recording to troubleshoot what might be causing the issue.

If the poor performance is caused by sorting or visual layout, the problem could be the number of marks the dashboard is trying to return. Databricks can return one million records quickly, but Tableau may not be able to compute the layout and sort the results. If this is a problem, aggregate the query and drill into the lower levels. You can also try a bigger machine, as Tableau is constrained only by the physical resources on the machine on which it is running.

For information about drilling down in Tableau, see Drill down into the details.

If you see many granular marks, this is often a poor analytical pattern because it doesn’t provide insight. Drilling down from higher levels of aggregation makes more sense and reduces the number of records that must be processed and visualized.
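As a sketch of this pattern, an aggregated query (for example, through a view or Custom SQL) returns monthly totals instead of individual order rows, so Tableau has far fewer marks to lay out; table and column names here are hypothetical:

-- Aggregate before visualizing (hypothetical names): users drill down from here
SELECT region,
       date_trunc('month', order_date) AS order_month,
       SUM(sales_amount) AS total_sales,
       COUNT(*) AS order_count
FROM sales.orders
GROUP BY region, date_trunc('month', order_date);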

Use actions to optimize dashboards

Use Tableau actions to click a mark (for example, a state on a map) and be sent to another dashboard that filters based on the state you clicked. Using actions reduces the need for multiple filters on one dashboard and reduces the number of records that must be generated. (An action does not generate records until it is given a predicate to filter on.)

For more information, see Actions and 6 Tips to Make Your Dashboards More Performant.

Caching

Caching data is a good way to improve the performance of worksheets or dashboards.

Caching in Tableau

Tableau has four layers of caching before it goes back to the data, whether that data is in a live connection or an extract:

  • Tiles: If someone loads the same dashboard and nothing changes, Tableau tries to reuse the same tiles for the charts. This is similar to Google Maps tiles.

  • Model: If the tiles cache can’t be used, the model cache of mathematical calculations is used to generate visualizations. Tableau Server attempts to use the same models.

  • Abstract: Aggregate query results are stored as well. This is the third “defense” level. If a previous query returned Sum(Sales), Count(Orders), and Sum(Cost), and a future query wants just Sum(Sales), Tableau grabs that result and uses it.

  • Native Cache: If the query is the same as another one, Tableau uses the same results. This is the last level of caching. If this fails, Tableau goes to the data.

Caching frequency in Tableau

Tableau has administrative settings for caching more or less frequently. If the server is set to Refresh Less Often, Tableau keeps data in the cache for up to 12 hours. If the server is set to Refresh More Often, Tableau returns to the data on every page refresh.

Customers who use the same dashboard repeatedly, for example, “Monday morning pipeline reports”, should be on a server set to Refresh Less Often so that the dashboards all use the same cache.

Cache warming in Tableau

In Tableau, you can warm the cache by setting up a subscription for the dashboard to send before you want the dashboard viewed. (The dashboard must be rendered to generate the subscription email image.) See Warming the Tableau Server Cache Using Subscriptions.

Tableau Desktop: Error The drivers... are not properly installed

Issue: When you try to connect Tableau Desktop to Databricks, Tableau displays an error message in the connection dialog with a link to the driver download page, where you can find driver links and installation instructions.

Cause: Your installation of Tableau Desktop is not running a supported driver.

Resolution: Download the Databricks ODBC driver version 2.6.15 or above.

See also: Error “The drivers… are not properly installed” on the Tableau website.

Primary / foreign key constraints

To propagate primary key (PK) and foreign key (FK) constraints from Databricks to Tableau, you must understand the capabilities and limitations of both platforms regarding constraints.

Understanding Databricks constraints

Databricks supports primary and foreign key constraints starting from Databricks Runtime 15.2. These constraints are informational and not enforced by default, meaning they do not prevent data integrity violations but can be used to optimize queries and provide metadata about data relationships. See Declare primary key and foreign key relationships.
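For example, informational primary and foreign key constraints can be declared in Databricks SQL as follows (table and column names are illustrative):

-- Hypothetical tables with informational (unenforced) constraints
CREATE TABLE customers (
  customer_id BIGINT NOT NULL,
  name STRING,
  CONSTRAINT customers_pk PRIMARY KEY (customer_id)
);

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  customer_id BIGINT,
  CONSTRAINT orders_pk PRIMARY KEY (order_id),
  CONSTRAINT orders_customers_fk FOREIGN KEY (customer_id)
    REFERENCES customers (customer_id)
);

Because the constraints are informational, inserts that violate them are not rejected; they exist to describe relationships for query optimization and for tools such as Tableau.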

Understanding how Tableau uses constraints to create table relationships

Tableau does not directly enforce primary and foreign key constraints. Instead, Tableau uses relationships to model data connections. To work with constraints in Tableau, you must understand that Tableau’s data model offers two levels of modeling: a logical layer and a physical layer. See Tableau Data Model. The following section discusses how this two-level data model affects whether Databricks constraints are recognized as relationships in Tableau.

Connecting Databricks to Tableau

When you connect Databricks to Tableau, Tableau attempts to create relationships at the physical layer between tables based on existing key constraints and matching fields. Tableau automatically attempts to detect and create relationships at the physical layer based on primary and foreign key constraints defined in Databricks. If no key constraints are defined, Tableau uses matching column names to auto-generate joins. At the logical layer, only single-column name matches are used to determine a relationship. At the physical layer, this column name matching detects both simple (single-column) and composite (multi-column) key relationships.

If Tableau cannot determine the matching fields, you must manually specify the join relationship between the two tables at the physical layer by supplying the columns, condition, and type of constraint. To shift from the logical layer in the UI to the physical layer, double-click the table at the logical layer.