Sample datasets (databricks-datasets)

Databricks includes a variety of datasets mounted to Databricks File System (DBFS). These datasets are used in examples throughout the documentation.

Browse Databricks datasets

To browse these files from a notebook in Data Science & Engineering or Databricks Machine Learning, use Databricks Utilities (dbutils) with Python, Scala, or R. The following examples list all of the available Databricks datasets.

Python:

display(dbutils.fs.ls('/databricks-datasets'))

Scala:

display(dbutils.fs.ls("/databricks-datasets"))

R (using the %fs magic command):

%fs ls "/databricks-datasets"
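
The same listing can also be approximated with plain Python file APIs. The sketch below assumes the cluster exposes DBFS through the local /dbfs mount, the same mount used for the README path later on this page. The function name list_datasets is hypothetical and not part of Databricks Utilities.

```python
import os


def list_datasets(root="/dbfs/databricks-datasets"):
    """Return sorted (name, is_directory) pairs for the entries directly under root.

    Returns an empty list when root does not exist, for example when the code
    runs outside a Databricks cluster. The default path assumes the /dbfs mount.
    """
    if not os.path.isdir(root):
        return []
    with os.scandir(root) as entries:
        return sorted((entry.name, entry.is_dir()) for entry in entries)


if __name__ == "__main__":
    # Print one entry per line, marking directories with a trailing slash.
    for name, is_dir in list_datasets():
        print(name + ("/" if is_dir else ""))
```

Unlike dbutils.fs.ls, this variant returns only names and a directory flag, and it yields an empty list rather than an error when the mount is missing.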

Unity Catalog datasets

Unity Catalog provides access to a number of sample datasets in the samples catalog. You can review these datasets in the data explorer UI and reference them directly using the <catalog_name>.<database_name>.<table_name> pattern.

The nyctaxi database contains the table trips, which has details about taxi rides in New York City stored using Delta Lake. The following code example returns all records in this table:

SELECT * FROM samples.nyctaxi.trips

The tpch database contains data from the TPC-H Benchmark. To see tables in this database, run:

SHOW TABLES IN samples.tpch
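
When queries are built in Python, the three-level name can be assembled from its parts. This is a minimal sketch: the helper qualified_table_name is hypothetical, the LIMIT 10 clause is an addition for illustration, and the helper rejects names that contain dots rather than quoting them.

```python
def qualified_table_name(catalog, database, table):
    """Join the three levels into <catalog_name>.<database_name>.<table_name>."""
    parts = (catalog, database, table)
    # Keep the sketch simple: reject empty parts and parts that contain dots.
    if any(not part or "." in part for part in parts):
        raise ValueError(f"each name part must be non-empty and contain no dots: {parts!r}")
    return ".".join(parts)


trips = qualified_table_name("samples", "nyctaxi", "trips")
query = f"SELECT * FROM {trips} LIMIT 10"
print(query)
# In a Databricks notebook, where `spark` is predefined: display(spark.sql(query))
```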

Get information about Databricks datasets

To get more information about a dataset, print its README, if one is available, by using a local file API. The following examples do this from a notebook in Data Science & Engineering or Databricks Machine Learning, using Python, Scala, or R.

Python:

# The with statement closes the file after reading.
with open('/dbfs/databricks-datasets/README.md', 'r') as f:
    print(f.read())

Scala:

scala.io.Source.fromFile("/dbfs/databricks-datasets/README.md").foreach {
  print
}

R:

library(readr)

f = read_lines("/dbfs/databricks-datasets/README.md", skip = 0, n_max = -1L)
print(f)
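
Because a README is not always available, it can help to check for the file before reading it. The following sketch assumes the /dbfs local mount and assumes that a dataset's README, when present, is named README.md inside the dataset directory. The function name read_dataset_readme is hypothetical.

```python
from pathlib import Path


def read_dataset_readme(dataset_dir="/dbfs/databricks-datasets"):
    """Return the text of README.md in dataset_dir, or None if there is no README."""
    readme = Path(dataset_dir) / "README.md"
    if not readme.is_file():
        return None
    return readme.read_text(encoding="utf-8")


if __name__ == "__main__":
    text = read_dataset_readme()
    print(text if text is not None else "No README.md found")
```

Pass a dataset subdirectory as dataset_dir to look for that dataset's own README instead of the top-level one.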

Create a table based on a Databricks dataset

The following examples create a table based on a Databricks dataset. Use SQL in the Databricks SQL query editor, or use Python, Scala, or R in a notebook in Data Science & Engineering or Databricks Machine Learning:

SQL:

CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')

Python:

spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

Scala:

spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

R:

library(SparkR)
sparkR.session()

sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")
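
To create tables for several datasets, the statement text can be generated rather than typed out each time. This is a sketch only: the helper create_table_statement is hypothetical, it produces the same statement form used in the examples on this page, and it does no validation beyond rejecting single quotes in the path.

```python
def create_table_statement(table_name, path):
    """Build a CREATE TABLE ... OPTIONS (PATH '...') statement for a dataset path."""
    # A single quote in the path would end the SQL string literal early.
    if "'" in path:
        raise ValueError("path must not contain single quotes")
    return f"CREATE TABLE {table_name} OPTIONS (PATH '{path}')"


statement = create_table_statement(
    "default.people10m",
    "dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta",
)
print(statement)
# In a Databricks notebook, where `spark` is predefined: spark.sql(statement)
```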