Download data from the internet

This article describes patterns for adding data from the internet to Databricks.

Databricks does not provide any native tools for downloading data from the internet, but you can use open source tools in supported languages to download files using notebooks.

Databricks recommends using Unity Catalog volumes for storing all non-tabular data. You can optionally specify a volume as your destination during download, or move data to a volume after download.

Note

If you do not specify an output path, most open source tools target a directory in your ephemeral storage. See Download a file to ephemeral storage.

Volumes do not support random writes. If you need to unzip downloaded files, Databricks recommends downloading them to ephemeral storage and unzipping them there before moving them to volumes. See Expand and read Zip compressed files. A sketch of this workflow follows this note.

If your data is already in cloud object storage, accessing it directly with Apache Spark provides better results. See Connect to data sources.

Some workspace configurations might prevent access to the public internet. Consult your workspace administrator if you need expanded network access.
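
The following Python sketch illustrates the unzip workflow from the note above: download an archive to ephemeral storage, expand it there, then move the results to a volume. The URL, file names, and volume path are hypothetical placeholders; substitute your own.

import urllib.request
import zipfile

# Download the archive to ephemeral storage attached to the driver.
# The URL and paths below are placeholders, not real endpoints.
urllib.request.urlretrieve("https://example.com/data.zip", "/tmp/data.zip")

# Expand the archive in ephemeral storage; volumes do not support
# the random writes that unzipping requires.
with zipfile.ZipFile("/tmp/data.zip", "r") as zf:
    zf.extractall("/tmp/data")

# Move the extracted files to a Unity Catalog volume.
# The file:/ prefix identifies the source as driver-local storage.
dbutils.fs.mv("file:/tmp/data", "/Volumes/my_catalog/my_schema/my_volume/data", recurse=True)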

Download a file to a volume

Databricks recommends storing all non-tabular data in Unity Catalog volumes.

The following examples show how to download a file to a Unity Catalog volume using Bash, Python, and Scala:

Bash

%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /Volumes/my_catalog/my_schema/my_volume/curl-subway.csv

Python

import urllib.request

urllib.request.urlretrieve("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv", "/Volumes/my_catalog/my_schema/my_volume/python-subway.csv")

Scala

import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

FileUtils.copyURLToFile(new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"), new File("/Volumes/my_catalog/my_schema/my_volume/scala-subway.csv"))

Download a file to ephemeral storage

The following examples show how to download a file to ephemeral storage attached to the driver using Bash, Python, and Scala:

Bash

%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /tmp/curl-subway.csv

Python

import urllib.request

urllib.request.urlretrieve("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv", "/tmp/python-subway.csv")

Scala

import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

FileUtils.copyURLToFile(new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"), new File("/tmp/scala-subway.csv"))

Because these files are downloaded to ephemeral storage attached to the driver, use %sh to list them, as in the following example:

%sh ls /tmp/

You can use Bash commands to preview the contents of files downloaded this way, as in the following example:

%sh head /tmp/curl-subway.csv
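
If you prefer Python, you can also list and preview driver-local files with dbutils by prefixing paths with the file:/ scheme, as in the following sketch. This assumes the file downloaded in the earlier example; dbutils is available by default in Databricks notebooks.

# List files on ephemeral storage attached to the driver.
display(dbutils.fs.ls("file:/tmp/"))

# Preview the first bytes of the downloaded file.
print(dbutils.fs.head("file:/tmp/curl-subway.csv"))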

Move data with dbutils

To access data with Apache Spark, you must move it from ephemeral storage to cloud object storage. Databricks recommends using volumes for managing all access to cloud object storage. See Connect to data sources.

The Databricks Utilities (dbutils) allow you to move files from ephemeral storage attached to the driver to other locations, including Unity Catalog volumes. The following example moves data to an example volume; the file:/ prefix identifies the source as local driver storage rather than cloud object storage:

dbutils.fs.mv("file:/tmp/curl-subway.csv", "/Volumes/my_catalog/my_schema/my_volume/subway.csv")

Read downloaded data

After you move the data to a volume, you can read the data as normal. The following code reads in the CSV data moved to a volume:

df = spark.read.format("csv").option("header", True).load("/Volumes/my_catalog/my_schema/my_volume/subway.csv")
display(df)