Use the TPC-DS sample dataset to evaluate system performance
Databricks provides access to the TPC-DS benchmark dataset, a widely used benchmark for testing the performance of systems built for data warehousing and analytics. The dataset is available in two sizes by default in every Unity Catalog-enabled workspace. These datasets are ideal for testing Databricks performance on a standardized benchmark that simulates realistic retail and e-commerce business scenarios. To learn more about this dataset, see the TPC-DS benchmark documentation.
What's included
The TPC-DS datasets are available in the samples
catalog with the following schemas:
tpcds_sf1
— Small-scale dataset (approximately 1 GB)tpcds_sf1000
— Large-scale dataset (approximately 1 TB)
Both datasets share the following qualities:
- Are read-only and queryable by all users in the workspace
- Are compatible with SQL warehouses and all-purpose clusters
- Follow the TPC-DS specification for standardized benchmarking
Prerequisites
You must have access to a SQL warehouse or an all-purpose cluster.
Preview the data
To explore the data in the Catalog Explorer UI:
- Click
Catalog in the sidebar.
- Enter tpcds in the search bar. Both schemas are in the
samples
catalog. Click the name of the schema that you want to view. - The Overview tab lists all tables in the schema. Click a table name to open an overview of the columns and data types in that table.
- Use the top navigation to view the table's Sample Data or Details.
Query the data
The following queries use the smaller scale dataset, tpcds_sf1
. To use the larger scale dataset, replace the schema name with tpcds_sf1000
. Click SQL Editor in the sidebar to open the SQL editor. Then, use the following queries to start exploring the data.
Preview tables
SHOW TABLES IN samples.tpcds_sf1;
Explore a table
DESCRIBE TABLE samples.tpcds_sf1.customer;
SELECT * FROM samples.tpcds_sf1.customer LIMIT 10;
Sample join and aggregation
SELECT
i_category,
d_year,
SUM(ss_net_paid) AS total_revenue
FROM samples.tpcds_sf1.store_sales ss
JOIN samples.tpcds_sf1.item i ON ss.ss_item_sk = i.i_item_sk
JOIN samples.tpcds_sf1.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2001
GROUP BY i_category, d_year
ORDER BY total_revenue DESC
LIMIT 10;
Best practices
- Use the query history and query profile to understand performance characteristics and identify optimization opportunities.
- Start with the smaller
tpcds_sf1
dataset for initial testing, then scale up totpcds_sf1000
for comprehensive performance evaluation. - Compare query performance across different SQL warehouse sizes to determine optimal configurations for your workloads.
- Use these standardized datasets to establish performance baselines and track improvements over time.