Sample datasets

Databricks and third parties provide a variety of sample datasets that you can use in your Databricks workspace.

Unity Catalog datasets

Unity Catalog provides access to a number of sample datasets in the samples catalog. You can review these datasets in the Catalog Explorer UI and reference them directly in a notebook or in the SQL editor by using the <catalog-name>.<schema-name>.<table-name> pattern.
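As a minimal sketch of the three-level naming pattern described above, the following Python snippet composes a fully qualified table name from its parts. The helper `qualified_table_name` is hypothetical (it is not part of any Databricks API), and the backtick quoting is an assumption added so that names containing special characters remain valid identifiers.

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema, and table into a three-level name.

    Hypothetical helper for illustration only. Each part is wrapped in
    backticks, and any embedded backtick is doubled.
    """
    parts = (catalog, schema, table)
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# Build the name of the trips table in the nyctaxi schema of the samples catalog.
name = qualified_table_name("samples", "nyctaxi", "trips")
print(name)
```

In a Databricks notebook, the resulting string could then be passed to `spark.read.table(name)`, assuming the notebook-provided `spark` session is available.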

The following table lists the available schemas in the samples catalog:

Dataset	Description
`databricks`	File-based sample datasets for building data pipelines, in the `datasets` volume.
`nyctaxi`	Taxi trip records for New York City.
`tpcds_sf1`	Small-scale dataset (approximately 1 GB) from the TPC-DS benchmark.
`tpch`	Large-scale dataset (approximately 1 TB) from the TPC-H Benchmark.
`wanderbricks`	A simulated travel booking platform with users, properties, bookings, reviews, and more.

databricks

The databricks schema contains the datasets volume, which hosts a collection of Databricks-provided file datasets that you can use to build and test data pipelines. To list the available datasets, run:

SQL
LIST '/Volumes/samples/databricks/datasets/'

Python
display(dbutils.fs.ls("/Volumes/samples/databricks/datasets/"))

nyctaxi

The nyctaxi schema contains the table trips, which has details about taxi rides in New York City. The following example returns the first 10 records in this table:

SQL
SELECT * FROM samples.nyctaxi.trips LIMIT 10

Python
display(spark.read.table("samples.nyctaxi.trips").limit(10))

tpcds_sf1

The tpcds_sf1 schema contains data from the TPC-DS benchmark. To list the tables in this schema, run:

SQL
SHOW TABLES IN samples.tpcds_sf1;

Python
display(spark.sql("SHOW TABLES IN samples.tpcds_sf1"))

For more guidance on how to use this dataset to evaluate system performance, see Use the TPC-DS sample dataset to evaluate system performance.

tpch

The tpch schema contains data from the TPC-H Benchmark. To list the tables in this schema, run:

SQL
SHOW TABLES IN samples.tpch

Python
display(spark.sql("SHOW TABLES IN samples.tpch"))

wanderbricks

The wanderbricks schema contains a simulated travel booking platform dataset. For details about the wanderbricks dataset tables, see Wanderbricks dataset.

Third-party sample datasets in CSV format

Databricks has built-in tools to quickly upload third-party sample datasets as comma-separated values (CSV) files into Databricks workspaces. The following table lists some popular third-party sample datasets available in CSV format:

Sample dataset	To download the sample dataset as a CSV file…
The Squirrel Census	On the Data webpage, click Park Data, Squirrel Data, or Stories.
OWID Dataset Collection	In the GitHub repository, click the datasets folder. Click the subfolder that contains the target dataset, and then click the dataset's CSV file.
Data.gov CSV datasets	On the search results webpage, click the target search result, and next to the CSV icon, click Download.
Diamonds (Requires a Kaggle account)	On the dataset's webpage, on the Data tab, next to diamonds.csv, click the Download icon.
NYC Taxi Trip Duration (Requires a Kaggle account)	On the dataset's webpage, on the Data tab, next to sample_submission.zip, click the Download icon. To find the dataset's CSV files, extract the contents of the downloaded ZIP file.

To use third-party sample datasets in your Databricks workspace, do the following:

Follow the third party's instructions to download the dataset as a CSV file to your local machine.
Upload the CSV file from your local machine into your Databricks workspace.
To work with the imported data, query it with Databricks SQL, or use a notebook to load it as a DataFrame.
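The notebook option in the last step might look like the following sketch. It assumes the CSV file has a header row and was uploaded to a Unity Catalog volume. The function name `read_uploaded_csv` and the path shown in the comment are hypothetical placeholders, and `spark` is the session that Databricks notebooks provide.

```python
def read_uploaded_csv(spark, path):
    """Load an uploaded CSV file as a Spark DataFrame.

    Assumes the first row holds column names and asks Spark to infer
    column types from the data instead of reading everything as strings.
    """
    return (
        spark.read.format("csv")
        .option("header", "true")       # first row contains column names
        .option("inferSchema", "true")  # infer column types from the data
        .load(path)
    )


# Example usage in a notebook (hypothetical volume path):
# df = read_uploaded_csv(spark, "/Volumes/<catalog>/<schema>/<volume>/diamonds.csv")
# display(df.limit(10))
```

Schema inference requires an extra pass over the file, so for large files it may be preferable to supply an explicit schema instead.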

Third-party sample datasets within libraries

Some third parties include sample datasets within libraries, such as Python Package Index (PyPI) packages or Comprehensive R Archive Network (CRAN) packages. For more information, see the library provider's documentation.

To install a library on a Databricks cluster by using the cluster user interface, see Compute-scoped libraries.
To install a Python library by using a Databricks notebook, see Notebook-scoped Python libraries.
To install an R library by using a Databricks notebook, see Notebook-scoped R libraries.

Databricks datasets (databricks-datasets) mounted to DBFS

Databricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Databricks workspaces. Some sample datasets mounted to DBFS are available in Databricks.

Note

The availability and location of Databricks datasets are subject to change without notice.

Browse DBFS mounted Databricks datasets

To browse these files from a Python, Scala, or R notebook, you can use Databricks Utilities (dbutils). The following code lists all of the available Databricks datasets.

Python
display(dbutils.fs.ls('/databricks-datasets'))

Scala
display(dbutils.fs.ls("/databricks-datasets"))

R
%fs ls "/databricks-datasets"
