Sample datasets

A variety of third-party datasets are available for you to upload to your Databricks workspace and use. Databricks also provides a variety of datasets that are already mounted to DBFS in your Databricks workspace.

Third-party sample datasets

Databricks has built-in tools to quickly upload third-party sample datasets as comma-separated values (CSV) files into Databricks workspaces. Some popular third-party sample datasets available in CSV format, with instructions for downloading each one:

  - The Squirrel Census: On the Data webpage, click Park Data, Squirrel Data, or Stories.

  - OWID Dataset Collection: In the GitHub repository, click the datasets folder. Click the subfolder that contains the target dataset, and then click the dataset’s CSV file.

  - Data.gov CSV datasets: On the search results webpage, click the target search result, and next to the CSV icon, click Download.

  - Diamonds (requires a Kaggle account): On the dataset’s webpage, on the Data tab, next to diamonds.csv, click the Download icon.

  - NYC Taxi Trip Duration (requires a Kaggle account): On the dataset’s webpage, on the Data tab, next to sample_submission.zip, click the Download icon. To find the dataset’s CSV files, extract the contents of the downloaded ZIP file.

  - UFO Sightings (requires a data.world account): On the dataset’s webpage, next to nuforc_reports.csv, click the Download icon.

To use third-party sample datasets in your Databricks workspace, do the following:

  1. Follow the third party’s instructions to download the dataset as a CSV file to your local machine.

  2. Use Databricks SQL to import the CSV file from your local machine into your Databricks workspace. The maximum file size that you can import is 100 MB.

  3. To work with the imported data, use Databricks SQL to query it, or use a notebook to load it as a DataFrame (a sketch follows this list).
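
For example, here is a minimal Python sketch of step 3. The table name default.diamonds is hypothetical; substitute the name that your CSV import actually created:

# Python: work with the imported data as a DataFrame
# "default.diamonds" is a hypothetical table name created by the CSV import
df = spark.table("default.diamonds")
display(df.limit(10))

# The same query issued as SQL from the notebook
display(spark.sql("SELECT * FROM default.diamonds LIMIT 10"))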

Databricks datasets (databricks-datasets)

Databricks includes a variety of datasets mounted to DBFS.

Note

The availability and location of Databricks datasets are subject to change without notice.

Browse Databricks datasets

To browse these files from a notebook in Data Science & Engineering or Databricks Machine Learning, use Databricks Utilities (dbutils) with Python, Scala, or R. The code in this example lists all of the available Databricks datasets; the final variant uses the %fs magic command, which works in any notebook cell.

# Python
display(dbutils.fs.ls('/databricks-datasets'))

// Scala
display(dbutils.fs.ls("/databricks-datasets"))

%fs ls "/databricks-datasets"
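
To inspect a single dataset, pass its directory to the same call. A short Python sketch (nyctaxi is used here as an example directory; substitute any path returned by the listing above):

# Python: list the files of one mounted dataset
display(dbutils.fs.ls('/databricks-datasets/nyctaxi'))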

Unity Catalog datasets

Unity Catalog provides access to a number of sample datasets in the samples catalog. You can review these datasets in the Data Explorer UI and reference them directly using the <catalog_name>.<database_name>.<table_name> pattern.

The nyctaxi database contains the table trips, which has details about taxi rides in New York City stored using Delta Lake. The following code example returns all records in this table:

SELECT * FROM samples.nyctaxi.trips

The tpch database contains data from the TPC-H Benchmark. To see tables in this database, run:

SHOW TABLES IN samples.tpch
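
The same three-level names work from a notebook. For example, a short Python sketch that reads the trips table as a DataFrame, assuming your workspace is enabled for Unity Catalog:

# Python: read a Unity Catalog sample table as a DataFrame
trips = spark.table("samples.nyctaxi.trips")
display(trips.limit(10))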

Get information about Databricks datasets

To get more information about a dataset, you can use a local file API to print the dataset README (if one is available) from a notebook in Data Science & Engineering or Databricks Machine Learning, using Python, Scala, or R, as shown in this code example.

# Python
with open('/dbfs/databricks-datasets/README.md', 'r') as f:
    print(f.read())

// Scala
scala.io.Source.fromFile("/dbfs/databricks-datasets/README.md").foreach {
  print
}

# R
library(readr)

f = read_lines("/dbfs/databricks-datasets/README.md", skip = 0, n_max = -1L)
print(f)
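
Individual datasets often include their own README next to the data. Here is a minimal Python sketch that prints a dataset’s README only if one exists; the helper name and the default README file name are assumptions, since README names vary by dataset:

import os

# Hypothetical helper: print a dataset's README if one is present.
def print_readme(dataset_dir, readme_name="README.md"):
    path = os.path.join("/dbfs/databricks-datasets", dataset_dir, readme_name)
    if os.path.exists(path):
        with open(path, "r") as f:
            print(f.read())
    else:
        print(f"No {readme_name} found under {dataset_dir}")

print_readme("learning-spark-v2")  # example directory from the listing above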

Create a table based on a Databricks dataset

This code example demonstrates how to create a table based on a Databricks dataset, using SQL in the Databricks SQL query editor, or using Python, Scala, or R in a notebook in Data Science & Engineering or Databricks Machine Learning:

-- SQL
CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')

# Python
spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

// Scala
spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

# R
library(SparkR)
sparkR.session()

sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")