Skip to main content

Use the TPC-DS sample dataset to evaluate system performance

Databricks provides access to the TPC-DS benchmark dataset, a widely used benchmark for testing the performance of systems built for data warehousing and analytics. The dataset is available in two sizes by default in every Unity Catalog-enabled workspace. These datasets are ideal for testing Databricks performance on a standardized benchmark that simulates realistic retail and e-commerce business scenarios. To learn more about this dataset, see the TPC-DS benchmark documentation.

What's included

The TPC-DS datasets are available in the samples catalog with the following schemas:

  • tpcds_sf1 — Small-scale dataset (approximately 1 GB)
  • tpcds_sf1000 — Large-scale dataset (approximately 1 TB)

Both datasets share the following qualities:

  • Are read-only and queryable by all users in the workspace
  • Are compatible with SQL warehouses and all-purpose clusters
  • Follow the TPC-DS specification for standardized benchmarking

Prerequisites

You must have access to a SQL warehouse or an all-purpose cluster.

Preview the data

To explore the data in the Catalog Explorer UI:

  1. Click Data icon. Catalog in the sidebar.
  2. Enter tpcds in the search bar. Both schemas are in the samples catalog. Click the name of the schema that you want to view.
  3. The Overview tab lists all tables in the schema. Click a table name to open an overview of the columns and data types in that table.
  4. Use the top navigation to view the table's Sample Data or Details.

Query the data

The following queries use the smaller scale dataset, tpcds_sf1. To use the larger scale dataset, replace the schema name with tpcds_sf1000. Click SQL Editor Icon SQL Editor in the sidebar to open the SQL editor. Then, use the following queries to start exploring the data.

Preview tables

SQL
SHOW TABLES IN samples.tpcds_sf1;

Explore a table

SQL
DESCRIBE TABLE samples.tpcds_sf1.customer;
SELECT * FROM samples.tpcds_sf1.customer LIMIT 10;

Sample join and aggregation

SQL
SELECT
i_category,
d_year,
SUM(ss_net_paid) AS total_revenue
FROM samples.tpcds_sf1.store_sales ss
JOIN samples.tpcds_sf1.item i ON ss.ss_item_sk = i.i_item_sk
JOIN samples.tpcds_sf1.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2001
GROUP BY i_category, d_year
ORDER BY total_revenue DESC
LIMIT 10;

Best practices

  • Use the query history and query profile to understand performance characteristics and identify optimization opportunities.
  • Start with the smaller tpcds_sf1 dataset for initial testing, then scale up to tpcds_sf1000 for comprehensive performance evaluation.
  • Compare query performance across different SQL warehouse sizes to determine optimal configurations for your workloads.
  • Use these standardized datasets to establish performance baselines and track improvements over time.