You can use Databricks notebooks to download data from public URLs to the ephemeral volume storage attached to the driver of your cluster. If your data already resides in cloud object storage, reading it directly with Apache Spark provides better results.
Databricks clusters provide general compute, allowing you to run arbitrary code in addition to Apache Spark commands. Because arbitrary commands execute against the root directory for the cluster rather than the DBFS root, you must move downloaded data to a new location before reading it with Apache Spark.
Some workspace configurations might prevent access to the public internet. Consult your workspace administrator if you need expanded network access.
Databricks does not provide any native tools for downloading data from the internet, but you can use open source tools in supported languages. The following examples use packages for Bash, Python, and Scala to download the same file.
```bash
%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /tmp/curl-subway.csv
```
```python
import urllib.request

urllib.request.urlretrieve(
    "https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv",
    "/tmp/python-subway.csv",
)
```
```scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

FileUtils.copyURLToFile(
  new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"),
  new File("/tmp/scala-subway.csv"))
```
Because these files are downloaded to the volume storage attached to the driver, use `%sh` to see them, as in the following example:
```bash
%sh ls /tmp/
```
You can use Bash commands to preview the contents of files downloaded this way, as in the following example:
```bash
%sh head /tmp/curl-subway.csv
```
To access this data with Apache Spark, you must move it out of its current location: ephemeral volume storage that is visible only to the driver. Databricks loads data from file sources in parallel, so files must be visible to all nodes in the compute environment. While Databricks supports a wide range of external data sources, file-based data access generally assumes access to cloud object storage.
The Databricks Utilities (`dbutils`) allow you to move files from volume storage attached to the driver to other locations accessible with DBFS, including external object storage locations you've configured access to. The following example moves data to a directory in the DBFS root, a cloud object storage location configured during initial workspace deployment.
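A minimal sketch of that move, run from a Python notebook cell. It assumes the file downloaded by the `curl` example above; the `dbfs:/tmp/` target directory and the final Spark read are illustrative choices, not requirements:

```python
# Move the file from driver-local storage (file:/) to the DBFS root (dbfs:/).
# Source path matches the curl example above; the target directory is an assumption.
dbutils.fs.mv("file:/tmp/curl-subway.csv", "dbfs:/tmp/subway.csv")

# Once the file is in DBFS-backed object storage, it is visible to all nodes
# and can be read in parallel with Apache Spark.
df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/subway.csv")
```

Note the `file:/` scheme on the source path: it tells `dbutils.fs` to resolve the path against the driver's local filesystem rather than the DBFS root.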